Preserving Metadata Across Dataset Versions
Geospatial machine learning pipelines fail silently when coordinate reference systems, acquisition timestamps, or annotation provenance drift between iterations. Unlike tabular datasets, spatial data carries implicit geometric and semantic context that must survive every transformation, augmentation, and version commit. Preserving Metadata Across Dataset Versions is not an optional hygiene step; it is a structural requirement for reproducible training, regulatory compliance, and model auditability.
When annotation teams iterate over satellite imagery, LiDAR point clouds, or vector boundaries, metadata loss typically occurs during format conversion, cropping, or batch processing. Without deterministic tracking, downstream training scripts inherit mismatched projections, stale bounding box scales, or orphaned sensor tags. This guide outlines a production-ready workflow for extracting, serializing, and versioning geospatial metadata alongside your training data.
Why Geospatial Metadata Drift Breaks ML Pipelines
Spatial formats like GeoTIFF, GeoJSON, and Cloud-Optimized GeoTIFF (COG) embed critical context in headers, XML blocks, or auxiliary files. Standard image processing libraries (e.g., OpenCV, PIL) strip this context by design, treating spatial arrays as generic pixel matrices. Even GDAL-based tools can silently drop metadata when drivers encounter unsupported tags or when files are rewritten without explicit metadata preservation flags. GDAL’s raster data model documentation outlines exactly which tags survive round-trip conversions and which require explicit driver configuration.
The downstream impact is severe:
- Projection Mismatch: A model trained on EPSG:4326 coordinates receives EPSG:32633 tiles during inference, causing spatial misalignment.
- Temporal Drift: Acquisition timestamps are lost, breaking time-series models that rely on seasonal or diurnal patterns.
- Annotation Misalignment: Bounding boxes or polygon masks shift when affine transforms are applied without updating coordinate offsets.
Addressing these failures requires treating metadata as a first-class artifact, not a byproduct.
Prerequisites & Environment Setup
Before implementing a metadata preservation pipeline, ensure your environment meets the following baseline requirements:
- Python 3.9+ with
piporconda - Core Libraries:
rasterio>=1.3,geopandas>=0.13,pyproj>=3.4,shapely>=2.0,pyyaml>=6.0 - Version Control: Git for code tracking, paired with a data versioning layer. See Implementing DVC for Geospatial Training Data for storage orchestration and remote caching strategies.
- Input Formats: GeoTIFF, COG, GeoJSON, or GeoPackage. For structured vector storage, the OGC GeoPackage specification provides a reliable, SQLite-backed alternative to shapefiles.
- Storage Layout: Flat or hierarchical directory structure with explicit
data/,metadata/, andannotations/separation
Install dependencies:
pip install rasterio geopandas pyproj shapely pyyaml
Step-by-Step Metadata Preservation Workflow
1. Extract & Normalize Baseline Metadata
Parse native headers (GDAL tags, PROJ strings, sensor metadata) and normalize into a schema-agnostic dictionary. Convert all CRS representations to EPSG codes or WKT2 strings. Standardize timestamps to ISO 8601 UTC. Store resolution, affine transform, and bounding box in a consistent numeric format.
2. Apply Spatial Transformations & Update Context
Execute cropping, tiling, augmentation, or label injection while explicitly updating spatial bounds and resolution fields. Use rasterio.windows for precise tile extraction and recalculate the affine transform matrix. Never assume the original CRS survives a crop operation without verification.
3. Serialize Deterministic Sidecar Files
Write a structured YAML or JSON file alongside each dataset version. Avoid embedding metadata inside binary formats where GDAL truncation or driver limitations may occur. Sidecar files guarantee human readability, machine parseability, and driver-agnostic portability.
4. Compute Hashes & Build Version Manifests
Generate SHA-256 signatures for both the data file and its metadata sidecar. Store these in a centralized manifest for drift detection. For teams managing large-scale annotation campaigns, Tracking Annotation Changes with SHA Hashing provides a proven methodology for detecting silent label corruption.
5. Commit, Validate & Automate Sync
Push data and metadata to your version control layer. Validate that CRS, extent, and annotation counts match across environments. Integrate validation checks into CI/CD pipelines to block merges that introduce spatial inconsistencies. This workflow aligns with broader Dataset Versioning & Spatial Data Sync practices, ensuring that every pipeline stage consumes identical spatial context regardless of compute node or cloud region.
Production-Ready Code Implementation
The following script demonstrates a robust, production-grade approach to metadata extraction, sidecar generation, and hash computation. It handles both raster and vector inputs, normalizes CRS, and outputs a deterministic YAML sidecar alongside a SHA-256 manifest.
import hashlib
import json
import os
from pathlib import Path
from typing import Dict, Any
import rasterio
import geopandas as gpd
import yaml
import pyproj
def compute_sha256(file_path: str) -> str:
"""Compute SHA-256 hash of a file."""
sha256 = hashlib.sha256()
with open(file_path, "rb") as f:
for chunk in iter(lambda: f.read(8192), b""):
sha256.update(chunk)
return sha256.hexdigest()
def normalize_crs(crs_obj) -> Dict[str, Any]:
"""Normalize CRS to EPSG code and WKT2 string."""
try:
crs = pyproj.CRS.from_user_input(crs_obj)
return {
"epsg": crs.to_epsg(),
"wkt2": crs.to_wkt(version="WKT2"),
"is_geographic": crs.is_geographic
}
except Exception as e:
return {"error": str(e), "raw": str(crs_obj)}
def extract_raster_metadata(path: str) -> Dict[str, Any]:
with rasterio.open(path) as src:
return {
"driver": src.driver,
"width": src.width,
"height": src.height,
"count": src.count,
"dtype": str(src.dtypes[0]),
"nodata": src.nodata,
"crs": normalize_crs(src.crs),
"transform": list(src.transform),
"bounds": src.bounds._asdict()
}
def extract_vector_metadata(path: str) -> Dict[str, Any]:
gdf = gpd.read_file(path)
return {
"driver": gdf.__class__.__name__,
"geometry_type": str(gdf.geometry.iloc[0].geom_type) if not gdf.empty else "empty",
"feature_count": len(gdf),
"crs": normalize_crs(gdf.crs),
"bounds": gdf.total_bounds.tolist()
}
def build_sidecar(data_path: str, output_dir: str) -> Dict[str, Any]:
path = Path(data_path)
suffix = path.suffix.lower()
if suffix in (".tif", ".tiff", ".gtiff", ".cog"):
meta = extract_raster_metadata(data_path)
elif suffix in (".geojson", ".gpkg", ".shp"):
meta = extract_vector_metadata(data_path)
else:
raise ValueError(f"Unsupported format: {suffix}")
# Add file-level metadata
meta["source_path"] = str(path)
meta["file_size_bytes"] = path.stat().st_size
# Write YAML sidecar
sidecar_name = f"{path.stem}_metadata.yaml"
sidecar_path = Path(output_dir) / sidecar_name
with open(sidecar_path, "w") as f:
yaml.dump(meta, f, default_flow_style=False, sort_keys=False)
# Compute hashes
data_hash = compute_sha256(data_path)
sidecar_hash = compute_sha256(str(sidecar_path))
return {
"data_file": str(path),
"metadata_file": str(sidecar_path),
"data_sha256": data_hash,
"metadata_sha256": sidecar_hash,
"metadata": meta
}
# Usage example
if __name__ == "__main__":
INPUT_DIR = "data/raw"
OUTPUT_DIR = "metadata/v1"
os.makedirs(OUTPUT_DIR, exist_ok=True)
for file in Path(INPUT_DIR).glob("*"):
if file.is_file():
manifest_entry = build_sidecar(str(file), OUTPUT_DIR)
print(f"✅ Processed: {file.name}")
print(f" Data SHA: {manifest_entry['data_sha256'][:12]}...")
print(f" Meta SHA: {manifest_entry['metadata_sha256'][:12]}...")
Code Reliability Notes
- CRS Normalization: Uses
pyprojto convert ambiguous PROJ strings into standardized EPSG/WKT2 pairs, preventing silent projection mismatches. - Affine Transform Preservation:
rasterio.transformis serialized as a flat list, ensuring exact reconstruction during data loading. - Hash Determinism: Reads files in binary chunks to handle multi-gigabyte COGs without memory overflow.
- Extensibility: Easily adaptable to batch processing frameworks (Dask, Ray) or integrated into Airflow/Prefect DAGs.
Common Pitfalls & Mitigation Strategies
| Pitfall | Root Cause | Mitigation |
|---|---|---|
| Silent CRS Override | GDAL defaults to WGS84 when tags are missing | Explicitly validate crs.is_valid before processing; reject files without defined projections |
| Timestamp Timezone Drift | EXIF/GDAL stores local time without offset | Convert all timestamps to UTC ISO 8601 during normalization |
| Bounding Box Shift After Augmentation | Random crops/rotations applied without updating affine matrix | Recompute transform and bounds post-augmentation; never hardcode original extents |
| Sidecar-Data Desync | Manual edits or partial commits | Enforce atomic writes; validate SHA-256 pairs before pipeline execution |
Next Steps & Ecosystem Integration
Preserving metadata across dataset versions requires treating spatial context as immutable infrastructure. Once your sidecar generation and hashing pipeline is stable, integrate it with your data registry, CI validation gates, and model training schedulers. Automate drift detection by comparing manifest hashes across environments, and enforce strict schema validation before data enters the training queue.
For teams scaling beyond single-node workloads, consider coupling this workflow with distributed storage orchestration and automated rollback triggers. The foundational patterns covered here directly support enterprise-grade spatial ML pipelines, where reproducibility, compliance, and auditability are non-negotiable.