One canonical model, written in Dolfin. N adapters that read source datasets and emit canonical entities. M writers that render the canonical entities to different open standards. Because every adapter and every writer talks only to the pivot, no source-to-format mapping is ever written: N×M direct mappings become N + M components.
Several departments or providers describe the same kind of thing with different field names, units, structures, or spelling variants. The pivot dedupes by construction.
You must publish to several open standards (Smart Data Models, INSPIRE, DATEX II, schema.org). N×M direct mappings drift; one pivot stays consistent.
Sources contain multilingual fields, embedded vCard or iCalendar, Python dict literals in CSV cells, or free-text values that hide structured information. The pivot is where the cleanup converges.
One source, one consumer, fixed schemas, no growth expected → just write a script. An existing standard already fits the source data without translation → use it directly. Dataset is tiny and one-off → not worth the structure.
  source A      source B      source C
      │             │             │
  adapter A     adapter B     adapter C
      │             │             │
      ▼             ▼             ▼
┌──────────────────────────────────────┐
│    Canonical model (Dolfin pivot)    │
│    + typed sub-entities              │
│    + external IRI references         │
│    + closed enums                    │
└──────────────────────────────────────┘
      │             │             │
  writer 1      writer 2      writer 3
      │             │             │
      ▼             ▼             ▼
 SDM JSON-LD    DATEX II       GeoJSON
  consumer      consumer      consumer
Each new source = one new adapter. Each new output = one new writer. The pivot stays small.
Count records and columns, list distinct values for candidate enums, spot format quirks (embedded vCard, Python dict literals, spelling variants). The spelling variants are gold: they reveal the shared entities your model should lift.
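A minimal profiling sketch for this step, assuming the source is a CSV. The function names `profile_csv` and `spelling_variants`, the `max_enum_size` threshold, and the sample rows are all made up for illustration; they are not part of any reference implementation.

```python
import csv
import io
from collections import Counter


def profile_csv(text: str, max_enum_size: int = 10) -> dict:
    """Summarize a CSV: record count, column names, and candidate enum columns."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    columns = list(reader.fieldnames or [])
    # Count distinct (whitespace-trimmed) values per column.
    distinct = {c: Counter((r[c] or "").strip() for r in rows) for c in columns}
    # Columns with few distinct values are candidates for closed enums.
    candidates = {c: dict(v) for c, v in distinct.items() if len(v) <= max_enum_size}
    return {"records": len(rows), "columns": columns, "enum_candidates": candidates}


def spelling_variants(values) -> dict:
    """Group values that differ only by case, spacing, or punctuation."""
    groups = {}
    for value in values:
        key = "".join(ch for ch in value.casefold() if ch.isalnum())
        groups.setdefault(key, set()).add(value)
    # Keep only keys that were spelled more than one way.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# Illustrative sample, not real data.
SAMPLE = """id,owner,status
1,Parks Dept,active
2,parks dept.,active
3,PARKS DEPT,retired
"""

if __name__ == "__main__":
    print(profile_csv(SAMPLE))
    print(spelling_variants(["Parks Dept", "parks dept.", "PARKS DEPT"]))
```

Any group returned by `spelling_variants` is a hint that the column names a shared entity worth lifting into its own concept.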
Search Smart Data Models, schema.org, and domain-specific standards and registries (DATEX II, INSPIRE, GBIF, Wikidata). Decide for each concept: align (the standard fits), partial (extend it with a local namespace), or gap (define the concept and propose it).
Write <domain>.dolfin. Lift every shared real-world entity (Authority, Species, Category) to a separate concept. Use enums for closed sets. Add optional refExt attributes for dereferenceable IRIs. Be generous with optional.
Create harmonize_<domain>/ with model.py, transforms.py, one writer per output format, __main__.py, and adapters/_template.py. transforms.py is portable: copy from any reference implementation.
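As a rough illustration of what model.py might hold, the sketch below mirrors lifted concepts, a closed enum, and optional external references as Python dataclasses. It is not Dolfin syntax, and every class and field name here (`Tree`, `ProtectionStatus`, `ref_ext`, `height_m`) is a hypothetical choice for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProtectionStatus(Enum):
    """A closed set of values (hypothetical members)."""
    CLASSIFIED = "classified"
    UNCLASSIFIED = "unclassified"


@dataclass(frozen=True)
class Species:
    """Shared real-world entity lifted out of the record."""
    id: str
    scientific_name: str
    ref_ext: Optional[str] = None  # optional dereferenceable IRI


@dataclass(frozen=True)
class Authority:
    """Shared real-world entity lifted out of the record."""
    id: str
    name: str
    ref_ext: Optional[str] = None  # optional dereferenceable IRI


@dataclass
class Tree:
    """Main record; everything a source might lack is optional."""
    id: str
    species: Species
    status: ProtectionStatus
    managed_by: Optional[Authority] = None
    height_m: Optional[float] = None
```

Making most attributes optional lets a second, sparser source populate the same classes without changes to the model.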
Every adapter exposes one function: read(path) -> Iterator[CanonicalEntity]. Source parsing, regex extraction, registry dedup live in the adapter. No writer logic, no API logic, no CLI logic.
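A sketch of one adapter under stated assumptions: the source is a UTF-8 CSV with `id`, `name`, and `owner` columns, and the canonical classes `PointOfInterest` and `Authority` are stand-ins defined inline so the example runs on its own. Only the `read(path)` contract comes from the text above.

```python
import csv
import re
from dataclasses import dataclass
from typing import Dict, Iterator


@dataclass(frozen=True)
class Authority:
    id: str
    name: str


@dataclass(frozen=True)
class PointOfInterest:
    id: str
    name: str
    authority: Authority


def _slug(text: str) -> str:
    """Collapse case, spacing, and punctuation so spelling variants share a key.

    ASCII-only: accent folding is left out of this sketch.
    """
    return re.sub(r"[^a-z0-9]+", "-", text.casefold()).strip("-")


def read(path: str) -> Iterator[PointOfInterest]:
    """Parse the source file and yield canonical entities, one per row."""
    registry: Dict[str, Authority] = {}  # dedup registry, local to this run
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            key = _slug(row["owner"])
            if key not in registry:
                # First spelling seen becomes the canonical display name.
                registry[key] = Authority(id=f"authority:{key}", name=row["owner"].strip())
            yield PointOfInterest(
                id=f"poi:{row['id']}",
                name=row["name"].strip(),
                authority=registry[key],
            )
```

Rows whose owner differs only in case or punctuation end up pointing at the same `Authority` object, which is the dedup the pivot relies on.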
One writer per output format. Each writer reads canonical entities, never source data. Two writers cannot drift, because they read the same input. Add a third format = add a third writer.
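To illustrate two writers sharing one input, here is a minimal sketch with a hypothetical canonical `Place` class. The JSON-LD writer uses the schema.org context for brevity, which is an assumption; a Smart Data Models writer would use that project's own context and types.

```python
import json
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Place:
    """Hypothetical canonical entity."""
    id: str
    name: str
    lon: float
    lat: float


def write_jsonld(entities: Iterable[Place]) -> str:
    """Render canonical entities as a schema.org JSON-LD graph."""
    graph = [
        {
            "@id": e.id,
            "@type": "Place",
            "name": e.name,
            "geo": {"@type": "GeoCoordinates", "latitude": e.lat, "longitude": e.lon},
        }
        for e in entities
    ]
    return json.dumps({"@context": "https://schema.org", "@graph": graph}, indent=2)


def write_geojson(entities: Iterable[Place]) -> str:
    """Render the same entities as a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            "id": e.id,
            # GeoJSON positions are ordered longitude, latitude.
            "geometry": {"type": "Point", "coordinates": [e.lon, e.lat]},
            "properties": {"name": e.name},
        }
        for e in entities
    ]
    return json.dumps({"type": "FeatureCollection", "features": features}, indent=2)
```

Neither function knows which adapter produced the entities, so a fix to the canonical data shows up in both outputs at once.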
Where a concept has global identity, attach a resolvable IRI. Cache lookups to disk (GBIF, schema.org). Prefer an explicit JSON lookup file (category_map.json) over an API call for small controlled vocabularies.
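One way to cache lookups to disk is sketched below. The `DiskCache` class and `load_category_map` helper are hypothetical, and the `fetch` callable is injected by the caller, so no particular remote API is assumed.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Optional


class DiskCache:
    """Memoize key -> IRI lookups in a JSON file so reruns stay offline."""

    def __init__(self, path: str, fetch: Callable[[str], Optional[str]]):
        self.path = Path(path)
        self.fetch = fetch  # caller-supplied lookup, e.g. a wrapper around a taxonomy service
        if self.path.exists():
            self.data: Dict[str, Optional[str]] = json.loads(self.path.read_text(encoding="utf-8"))
        else:
            self.data = {}

    def resolve(self, key: str) -> Optional[str]:
        """Return the cached IRI, calling fetch only on a cache miss."""
        if key not in self.data:
            self.data[key] = self.fetch(key)
            # Persist after every miss so an interrupted run keeps its progress.
            self.path.write_text(
                json.dumps(self.data, indent=2, sort_keys=True), encoding="utf-8"
            )
        return self.data[key]


def load_category_map(path: str) -> Dict[str, str]:
    """Load an explicit lookup file for a small controlled vocabulary."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Misses that resolve to nothing are cached as `null` too, so an unknown name is not looked up again on every run.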
A single CLI orchestrates: python -m harmonize_<domain> --adapter <src> --input ... --output ... --<extra-format> .... Optional output flags enable additional writers without touching the core.
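The argument handling for such a CLI could look like the sketch below. The program name `harmonize_example`, the extra-format flags `--geojson` and `--datex2`, and the choice of JSON-LD as the default writer are assumptions made for the example; only `--adapter`, `--input`, and `--output` come from the command line shown above.

```python
import argparse
from typing import Dict


def build_parser() -> argparse.ArgumentParser:
    """Core flags are required; each extra writer is an optional flag."""
    parser = argparse.ArgumentParser(prog="harmonize_example")
    parser.add_argument("--adapter", required=True, help="adapter module name under adapters/")
    parser.add_argument("--input", required=True, help="path to the source dataset")
    parser.add_argument("--output", required=True, help="path for the default writer's output")
    # Hypothetical extra-format flags: present means "also run this writer".
    parser.add_argument("--geojson", metavar="PATH", help="also write GeoJSON to PATH")
    parser.add_argument("--datex2", metavar="PATH", help="also write DATEX II XML to PATH")
    return parser


def planned_outputs(args: argparse.Namespace) -> Dict[str, str]:
    """Map writer name -> output path for every writer the flags enable."""
    outputs = {"jsonld": args.output}  # assumed default writer
    for name in ("geojson", "datex2"):
        path = getattr(args, name)
        if path:
            outputs[name] = path
    return outputs


if __name__ == "__main__":
    print(planned_outputs(build_parser().parse_args()))
```

Adding a writer then means adding one flag and one entry in the loop, with no change to the adapter or model code.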
Re-run on a second dataset without changing the core. If you cannot, the canonical model is too source-specific. Check sanity metrics: record counts, distinct entity counts before vs after, spot checks by ID.
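The sanity metrics can be computed with a few lines. This sketch assumes source rows and canonical entities are available as dictionaries; the function names, field names, and the two checks are illustrative choices, not a prescribed set.

```python
from typing import Dict, Iterable, List, Mapping


def sanity_report(
    source_rows: Iterable[Mapping[str, str]],
    canonical: Iterable[Mapping[str, str]],
    raw_field: str,
    canonical_field: str,
) -> Dict[str, int]:
    """Compare record counts and distinct entity counts before vs after."""
    source_rows = list(source_rows)
    canonical = list(canonical)
    return {
        "records_in": len(source_rows),
        "records_out": len(canonical),
        "distinct_before": len({r[raw_field] for r in source_rows}),
        "distinct_after": len({c[canonical_field] for c in canonical}),
    }


def problems(report: Mapping[str, int]) -> List[str]:
    """Flag outcomes that usually indicate a bug in an adapter."""
    found = []
    if report["records_in"] != report["records_out"]:
        found.append("record count changed")
    if report["distinct_after"] > report["distinct_before"]:
        found.append("dedup produced more entities than raw spellings")
    return found
```

A drop from `distinct_before` to `distinct_after` is expected when spelling variants merge; spot checks by ID then confirm that the merged records really describe the same entity.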
Publish a page per use case, the deposited Dolfin file, a tarball of the harmonizer, and slides in markdown. Make it reproducible: anyone should be able to clone, run, and get the same outputs.
The pattern was developed and validated during the MIMathon Porto 2026 against three Porto Open Data sources covering three different standards positions.
| Use case | Domain | Standards stance | External backbone | Writers |
|---|---|---|---|---|
| UC1 | Classified urban trees | Gap (no SDM Tree, proposed) | GBIF Backbone Taxonomy | JSON-LD, GeoJSON |
| UC2 | Points of interest | Align (SDM PointOfInterest) | schema.org + Wikidata | JSON-LD, GeoJSON |
| UC4 | City traffic indicators | Partial (SDM per-segment vs city-wide) | SDM Transportation + DATEX II | JSON-LD, DATEX II XML, GeoJSON |
The complete write-up, including all 10 steps with copy-paste code, file-tree templates, and external references, is shipped as a single markdown file. Use it as a working brief when you start a new harmonization project.
Copy transforms.py as-is across projects; it carries the portable text helpers.