Why do QGIS annotation layers produce misaligned polygons after export?

The most common cause is on-the-fly reprojection in QGIS combined with an annotation layer stored in a different CRS than the project. Lock project CRS before annotation begins and store all layers in the same projected CRS (e.g. EPSG:3857 or a local UTM zone) to prevent coordinate transformation drift at export time.

Which QGIS export format is best for ML training pipelines?

FlatGeobuf is preferred for large exports because it supports streaming reads and binary geometry encoding with no size limit. GeoJSON is acceptable for smaller datasets or web-based pipelines. Avoid Shapefile for new pipelines due to the 2 GB file limit and truncated field names.

How do I safely run heavy processing in QGIS without freezing the UI?

Use QgsTask subclasses for all processing that touches large datasets. Never call QgsVectorLayer methods directly from a background thread. Emit signals on completion and update the UI only from the main thread.

QGIS Plugin Ecosystem for Annotation Teams

When a polygon layer annotated in QGIS arrives at a model training job only to collapse IoU scores by 30 percentage points, the culprit is rarely the annotators. It is almost always a silent coordinate reference system mismatch between the QGIS project, the background imagery, and the export format — compounded by missing schema constraints that let attribute drift accumulate undetected. QGIS has matured into a programmable orchestration layer for geospatial ML pipelines, but extracting that value requires deliberate plugin configuration, PyQGIS scripting discipline, and tight integration with downstream training infrastructure.

This page covers the full production workflow: plugin stack deployment, EPSG:3857-enforced CRS harmonization, attribute schema constraints, SAM-based pre-label ingestion, FlatGeobuf export, and CI/CD hooks — all grounded in the broader Labeling Workflows & Toolchain Integration pipeline.

Prerequisites & Toolchain Alignment

Lock down your dependency stack before the first annotator opens a project. Version drift between QGIS releases and PyQGIS API changes causes subtle geometry handling regressions that are expensive to trace after data is collected.

System dependencies (install via OS package manager):

QGIS 3.34 LTR or 3.36+ (PyQGIS API stable; Qt6; long-term security patches)
GDAL 3.8+ / PROJ 9.3+ (required for gdalwarp resampling options and accurate datum grids)
Python 3.11 (use a conda environment or system venv; never install into QGIS’s bundled Python directly)

Python packages with pinned versions:

code

geopandas==1.0.1
rasterio==1.3.10
shapely==2.0.5
pyproj==3.6.1
gdal==3.8.5
requests==2.31.0

Spatial knowledge prerequisites: Annotators should understand projected vs. geographic CRS distinctions before editing. The coordinate reference systems in annotation pipelines page covers datum transformations, EPSG code selection, and the consequences of mixing geographic and projected coordinates in the same processing chain.

Core Workflow

Step 1 — Environment Initialization & Plugin Stack Deployment

Provision a clean QGIS profile per project to isolate plugin versions from legacy desktop configurations. A dedicated profile prevents annotation plugins from breaking shared desktop GIS environments and allows reproducible rollout to all annotator workstations.

Install the following plugins via QGIS Plugin Manager, then pin them to the installed version (disable auto-updates in production):

QuickMapServices — basemap context (aerial, OpenStreetMap, Esri)
Digitizing Tools — topology-aware vertex editing and constraint snapping
GeoJSON Validator — inline schema compliance checks against RFC 7946
fmtPlugin (optional) — FlatGeobuf export with attribute pass-through

Automate profile provisioning for team rollouts:

bash

# Create a named profile and launch QGIS with it
qgis --profile annotation-prod

# Distribute custom processing scripts via the profile's plugin directory
# Store in version control; sync to annotator machines via rsync or a config-management tool
rsync -av ./python/plugins/ ~/.local/share/QGIS/QGIS3/profiles/annotation-prod/python/plugins/

Step 2 — CRS Harmonization Across Ingested Layers

Misaligned coordinate reference systems are the leading cause of silent topology errors in annotation exports. Even a small datum offset — a few metres between EPSG:3857 and a local UTM zone — collapses IoU scores for small objects such as vehicles or roof segments.

The following PyQGIS script enforces a project-wide CRS at load time. It detects layers that do not match the target CRS and reprojects them using the native processing algorithm, which preserves attribute schemas and handles curved geometries correctly:

python

from __future__ import annotations
from qgis.core import (
    QgsProject,
    QgsCoordinateReferenceSystem,
    QgsMapLayer,
    QgsTask,
    QgsApplication,
)
import processing

TARGET_EPSG = "EPSG:3857"
TARGET_CRS = QgsCoordinateReferenceSystem(TARGET_EPSG)
project = QgsProject.instance()

# Lock the project CRS before any layer is added
project.setCrs(TARGET_CRS)

layers_to_reproject: list[tuple[str, str]] = []

for layer_id, layer in project.mapLayers().items():
    if layer.type() == QgsMapLayer.VectorLayer and layer.crs() != TARGET_CRS:
        layers_to_reproject.append((layer_id, layer.name()))

for layer_id, name in layers_to_reproject:
    result = processing.run(
        "native:reprojectlayer",
        {
            "INPUT": layer_id,
            "TARGET_CRS": TARGET_CRS,
            "OUTPUT": "memory:",
        },
    )
    reprojected = result["OUTPUT"]
    reprojected.setName(f"{name}_reprojected")
    project.addMapLayer(reprojected)
    print(f"Reprojected {name!r} → {TARGET_EPSG}")

For raster layers, align before ingestion using gdalwarp. Use -r lanczos for continuous imagery and -r nearest for categorical masks or pre-existing annotation rasters to avoid introducing interpolated class values:

bash

gdalwarp \
  -t_srs EPSG:3857 \
  -r lanczos \
  -of GTiff \
  -co TILED=YES \
  -co COMPRESS=DEFLATE \
  input_ortho.tif output_ortho_3857.tif

Build Virtual Raster Tables (VRT) for multi-gigabyte orthomosaics to avoid loading entire files into RAM:

bash

gdalbuildvrt -resolution highest mosaic.vrt tile_*.tif

Step 3 — Schema Enforcement with Field Constraints

Annotation quality degrades quickly when attribute schemas drift across annotators or sprints. QGIS field constraints block invalid entries at the point of data entry, before errors propagate into export files and corrupt training labels.

Define mandatory attributes for every feature class, then apply constraints via PyQGIS:

python

from qgis.core import (
    QgsVectorLayer,
    QgsField,
    QgsFieldConstraints,
)
from qgis.PyQt.QtCore import QVariant

REQUIRED_FIELDS: dict[str, QVariant.Type] = {
    "class_id": QVariant.Int,
    "confidence": QVariant.Double,
    "annotator_id": QVariant.String,
    "review_status": QVariant.String,
}

def enforce_schema(layer: QgsVectorLayer) -> None:
    with layer.editCommandStarted("Enforce schema"):
        for field_name, field_type in REQUIRED_FIELDS.items():
            if layer.fields().indexFromName(field_name) == -1:
                layer.dataProvider().addAttributes([QgsField(field_name, field_type)])
            layer.updateFields()

            constraints = QgsFieldConstraints()
            constraints.setConstraint(
                QgsFieldConstraints.ConstraintNotNull,
                QgsFieldConstraints.ConstraintOriginLayer,
            )
            idx = layer.fields().indexFromName(field_name)
            layer.setFieldConstraint(idx, QgsFieldConstraints.ConstraintNotNull)
    print("Schema enforced on", layer.name())

Pair field constraints with QGIS form widgets: use Value Maps for class_id and review_status to prevent typos, Range Sliders (0.0–1.0) for confidence, and read-only expressions for annotator_id populated from the OS login. Assign per-annotation confidence scores to drive active learning queues in downstream training workflows.

Manual digitization bottlenecks high-throughput pipelines. The recommended pattern injects foundation-model masks as draft features that annotators correct rather than draw from scratch. The automating batch pre-labeling with SAM and QGIS workflow covers the full SAM inference chain; the integration points below show how to receive its output inside the plugin stack.

Vectorize SAM binary masks using GDAL’s polygonize algorithm, then load results as editable draft features:

python

import processing
from pathlib import Path

def ingest_sam_masks(mask_raster: str, output_dir: str) -> QgsVectorLayer:
    out_path = str(Path(output_dir) / "prelabels.gpkg")
    result = processing.run(
        "gdal:polygonize",
        {
            "INPUT": mask_raster,
            "BAND": 1,
            "FIELD": "class_id",
            "EIGHT_CONNECTEDNESS": False,
            "OUTPUT": out_path,
        },
    )
    layer = QgsVectorLayer(out_path, "pre_labels_draft", "ogr")
    # Mark all imported features as pending review
    layer.startEditing()
    for feature in layer.getFeatures():
        feature["review_status"] = "pending"
        layer.updateFeature(feature)
    layer.commitChanges()
    return layer

Configure QGIS rendering rules to visually separate pre-labels from manually digitized features. Use a rule-based renderer: show pre-labels with a dashed orange stroke and human-reviewed features with a solid green stroke. This lets annotators immediately identify regions that require attention.

For automating pre-labeling with foundation models beyond SAM — including segment-then-classify pipelines — enable QGIS’s Advanced Digitizing panel to constrain correction angles and snapping distances during the refinement pass.

Step 5 — FlatGeobuf Export & CI/CD Integration

Final exports must align with the attribute schema and geometry requirements of the ML framework. Strip null geometries, standardize column names to snake_case, validate geometry topology, and write to FlatGeobuf for streaming performance:

python

from __future__ import annotations
import geopandas as gpd
from pathlib import Path

REQUIRED_COLUMNS = {"class_id", "confidence", "annotator_id", "review_status"}

def export_ml_ready(layer_path: str, output_dir: str) -> Path:
    gdf = gpd.read_file(layer_path)

    # Drop null and invalid geometries
    gdf = gdf.dropna(subset=["geometry"])
    gdf = gdf[gdf.geometry.is_valid]

    # Enforce only reviewed annotations enter training
    if "review_status" in gdf.columns:
        gdf = gdf[gdf["review_status"] == "approved"]

    # Standardize column names
    gdf.columns = [c.lower().replace(" ", "_") for c in gdf.columns]

    # Validate required columns
    missing = REQUIRED_COLUMNS - set(gdf.columns)
    if missing:
        raise ValueError(f"Export aborted — missing required fields: {missing}")

    out_path = Path(output_dir) / "annotations.fgb"
    gdf.to_file(str(out_path), driver="FlatGeobuf")
    print(f"Exported {len(gdf)} approved features → {out_path}")
    return out_path

Wire this export into a GitHub Actions workflow that triggers on pull requests to the annotation branch. The gate blocks merge if validation fails:

yaml

name: Annotation Export Validation

on:
  pull_request:
    paths:
      - "annotations/**/*.gpkg"
      - "annotations/**/*.geojson"

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - name: Install dependencies
        run: pip install geopandas==1.0.1 shapely==2.0.5
      - name: Run export validation
        run: python scripts/export_ml_ready.py annotations/ exports/

Version control .qgz project files alongside annotation layers to maintain reproducible environments. Store large spatial datasets using DVC versioning for geospatial training data rather than Git LFS, which handles binary diff and remote storage more efficiently for raster datasets exceeding a few hundred megabytes.

Spatial Parameters & Configuration Reference

Parameter	Type	Valid Range / Values	Spatial Implication
Project CRS	EPSG code	`EPSG:3857`, UTM zones	Controls coordinate storage precision; mismatches shift geometry
Tile overlap margin	%	10–15%	Below 10% causes boundary artifacts in tiled inference
Minimum polygon area	m²	≥ 1.0 (object-class dependent)	Sliver polygons below threshold inflate false-positive counts
Confidence threshold	float	0.0–1.0	Features below 0.5 should be routed to human review
Snapping tolerance	map units	0.1–1.0 m (projected)	Too large merges distinct vertices; too small leaves topology gaps
GSD range (annotation quality)	cm/px	5–30 cm/px (optimal for objects > 2 m)	Coarser GSD reduces polygon precision and increases boundary ambiguity
IoU acceptance threshold	float	0.5 (detection), 0.75 (segmentation)	See IoU thresholds for geospatial object detection
Max feature count per layer	integer	≤ 500 000 for in-memory layers	Larger layers require PostGIS or GeoPackage-backed approach

Edge Cases & Spatial Gotchas

These are the failure modes that most often reach a training job undetected. Each expands to the symptom, the root cause, and the fix.

On-the-fly reprojection during annotation

If a basemap layer uses EPSG:4326 while the annotation layer uses EPSG:3857, QGIS reprojects the basemap at render time. Annotators see correctly aligned features in the canvas, but the underlying coordinates accumulate rounding error from the display transform. Lock the project CRS before opening annotation layers, and disable on-the-fly reprojection for the annotation layers specifically so vertices are digitized directly in stored coordinates.

Datum shift between NAD83 and WGS84

In North American projects, mixing EPSG:4269 (NAD83) and EPSG:4326 (WGS84) introduces a sub-metre offset that is invisible at country-scale zoom but material for 10 cm/px drone imagery. Apply the +nadgrids=@null PROJ flag only if datum grid files are genuinely absent; prefer installing proj-data so transformations use accurate grid-based shifts rather than a null approximation.

Self-intersecting polygons from SAM masks

Vectorized SAM output frequently produces self-intersecting rings at class boundaries where adjacent mask regions share pixels. Run shapely.make_valid() on every ingested polygon before storing it as an editable feature, and record the repair in the audit log so the geometry change is traceable back to its source mask.

Thread-safety violations in background processing

Calling QgsVectorLayer.getFeatures() directly from a worker thread without QgsTask causes random crashes that are impossible to reproduce deterministically. Always subclass QgsTask, perform all data access inside run(), and emit a custom signal that the main thread connects to for UI updates.

Plugin conflicts between annotation releases

QuickMapServices and custom digitizing plugins sometimes register overlapping keyboard shortcuts, causing silent action-hijacking. After installing a new plugin, run QgsApplication.actionManager() in the QGIS Python console to list every registered action and detect duplicate keybindings before distributing the profile to annotators.

Sliver polygons at tile boundaries

Tiled SAM inference with insufficient overlap (< 10%) generates narrow sliver polygons at tile edges where masks from adjacent tiles do not fully overlap. Apply a minimum-area filter and use shapely.buffer(0) to dissolve slivers before loading features into the annotation layer.

Integration & Automation Hooks

The QGIS plugin stack integrates with Label Studio for geospatial workflows via its REST API. Export GeoPackage layers from QGIS and push them to Label Studio as pre-annotations, then pull human-reviewed results back as GeoJSON for final export:

python

import requests
import json

LABEL_STUDIO_URL = "https://your-label-studio.example.com"
API_TOKEN = "your-api-token"

def push_prelabels_to_label_studio(
    task_id: int,
    geojson_path: str,
    project_id: int,
) -> dict:
    with open(geojson_path) as f:
        geojson = json.load(f)

    payload = {
        "task": task_id,
        "result": [
            {
                "type": "polygonlabels",
                "value": {
                    "points": feature["geometry"]["coordinates"][0],
                    "polygonlabels": [str(feature["properties"]["class_id"])],
                },
                "from_name": "label",
                "to_name": "image",
            }
            for feature in geojson.get("features", [])
        ],
        "was_cancelled": False,
    }

    resp = requests.post(
        f"{LABEL_STUDIO_URL}/api/predictions/",
        json=payload,
        headers={"Authorization": f"Token {API_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

For DVC integration, track .gpkg and .fgb annotation files as DVC-managed artifacts. This enables tracking annotation changes with SHA hashing at the file level and supports rollback if a corrupt export reaches the training pipeline.

Validation & Testing

Run the following checks on every exported annotation file before it progresses to training:

python

from __future__ import annotations
import geopandas as gpd
from shapely.validation import explain_validity

def validate_annotation_export(fgb_path: str) -> None:
    gdf = gpd.read_file(fgb_path)

    # 1. Geometry validity
    invalid_mask = ~gdf.geometry.is_valid
    if invalid_mask.any():
        for idx, geom in gdf[invalid_mask].geometry.items():
            print(f"Feature {idx}: {explain_validity(geom)}")
        raise ValueError(f"{invalid_mask.sum()} invalid geometries in export")

    # 2. CRS check
    expected_epsg = 3857
    if gdf.crs is None or gdf.crs.to_epsg() != expected_epsg:
        raise ValueError(f"Expected EPSG:{expected_epsg}, got {gdf.crs}")

    # 3. Required attribute presence
    required = {"class_id", "confidence", "annotator_id", "review_status"}
    missing = required - set(gdf.columns)
    if missing:
        raise ValueError(f"Missing required columns: {missing}")

    # 4. Confidence range
    out_of_range = gdf[(gdf["confidence"] < 0.0) | (gdf["confidence"] > 1.0)]
    if not out_of_range.empty:
        raise ValueError(f"{len(out_of_range)} features have out-of-range confidence values")

    # 5. No null class_id
    null_class = gdf[gdf["class_id"].isna()]
    if not null_class.empty:
        raise ValueError(f"{len(null_class)} features have null class_id")

    print(f"Validation passed: {len(gdf)} features, CRS EPSG:{expected_epsg}")

Add a ogrinfo sanity check as a lightweight CI step that does not require Python:

bash

ogrinfo -al -so annotations.fgb | grep -E "Geometry Type|Feature Count|Layer SRS"

Verify metadata preservation across dataset versions by comparing field checksums between the QGIS source layer and the exported FlatGeobuf to detect silent attribute truncation.

This workflow is one component of the broader Labeling Workflows & Toolchain Integration pipeline.

Related

Automating Batch Pre-Labeling with SAM and QGIS — step-by-step SAM tiling, inference, and mask vectorization inside QGIS
Human-in-the-Loop Validation Cycles — structure review queues and track annotator disagreement across sprints
Integrating Label Studio with Geospatial Workflows — webhook and REST API bridging between QGIS and Label Studio
Coordinate Reference Systems in Annotation Pipelines — CRS selection, datum transformation grids, and projection mismatch debugging

QGIS Plugin Ecosystem for Annotation Teams

# Prerequisites & Toolchain Alignment

# Core Workflow

# Step 1 — Environment Initialization & Plugin Stack Deployment

# Step 2 — CRS Harmonization Across Ingested Layers

# Step 3 — Schema Enforcement with Field Constraints

# Step 4 — Pre-Label Ingestion & Human-in-the-Loop Refinement

# Step 5 — FlatGeobuf Export & CI/CD Integration

# Spatial Parameters & Configuration Reference

# Edge Cases & Spatial Gotchas

# Integration & Automation Hooks

# Validation & Testing

Dive deeper

Related in Labeling Workflows & Toolchain Integration for Geospatial AI

Prerequisites & Toolchain Alignment

Core Workflow

Step 1 — Environment Initialization & Plugin Stack Deployment

Step 2 — CRS Harmonization Across Ingested Layers

Step 3 — Schema Enforcement with Field Constraints

Step 4 — Pre-Label Ingestion & Human-in-the-Loop Refinement

Step 5 — FlatGeobuf Export & CI/CD Integration

Spatial Parameters & Configuration Reference

Edge Cases & Spatial Gotchas

Integration & Automation Hooks

Validation & Testing