Pivot harmonizer pattern, team LESL, MIMathon Porto 2026

When and why

What this pattern is good for.

Use it when

Multiple sources, one entity.

Several departments or providers describe the same kind of thing with different field names, units, structures, or spelling variants. The pivot dedupes by construction.

Use it when

Multiple consumers, no drift allowed.

You must publish to several open standards (Smart Data Models, INSPIRE, DATEX II, schema.org). N×M direct mappings drift; one pivot stays consistent.

Use it when

The data is messy.

Multilingual fields, embedded vCard or iCalendar, Python dict-literals in CSV cells, free-text values that hide structured information. The pivot is where the cleanup converges.
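Two of these quirks can be unpacked with the standard library alone. The sketch below is illustrative: the helper names `parse_dict_cell` and `vcard_field` are hypothetical, and the vCard handling covers only simple single-line `NAME:value` properties, not folded lines or the full specification.

```python
import ast
import re
from typing import Optional

def parse_dict_cell(cell: str) -> dict:
    """Parse a CSV cell holding a Python dict literal; return {} for anything else."""
    try:
        # literal_eval only accepts literals, so it is safe on untrusted cells.
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return {}
    return value if isinstance(value, dict) else {}

def vcard_field(vcard: str, name: str) -> Optional[str]:
    """Return the value of the first `NAME:value` line of an embedded vCard, if any."""
    # Optional parameters such as `;TYPE=work` may sit between the name and the colon.
    pattern = rf"^{re.escape(name)}(?:;[^:\n]*)?:(.*)$"
    match = re.search(pattern, vcard, flags=re.MULTILINE | re.IGNORECASE)
    return match.group(1).strip() if match else None
```

Helpers of this kind belong in the adapter, so the canonical model only ever sees the cleaned values.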

Don't use it when

One source, one consumer, fixed schemas, no growth expected → just write a script.
An existing standard already fits the source data without translation → use it directly.
Dataset is tiny and one-off → not worth the structure.

The pattern in one diagram

N in, M out, one pivot.

        source A       source B       source C
           │              │              │
        adapter A      adapter B      adapter C
           │              │              │
           ▼              ▼              ▼
        ┌──────────────────────────────────────┐
        │ Canonical model (Dolfin pivot)       │
        │ + typed sub-entities                 │
        │ + external IRI references            │
        │ + closed enums                       │
        └──────────────────────────────────────┘
           │              │              │
        writer 1       writer 2       writer 3
           │              │              │
           ▼              ▼              ▼
        SDM JSON-LD    DATEX II      GeoJSON
        consumer       consumer       consumer

Each new source = one new adapter. Each new output = one new writer. The pivot stays small.
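The diagram can be read as two small registries around one shared shape. A minimal sketch, assuming toy source field names (`Nome`, `label`) and toy writers; every name here is hypothetical and not taken from the reference implementations.

```python
# Hypothetical adapters: each maps one source's row shape onto the same canonical dict.
def adapter_a(row):
    return {"id": row["ID"], "name": row["Nome"].strip()}

def adapter_b(row):
    return {"id": row["ref"], "name": row["label"].strip()}

# Hypothetical writers: each consumes canonical dicts only, never source rows.
def writer_names(entities):
    return sorted(e["name"] for e in entities)

def writer_index(entities):
    return {e["id"]: e["name"] for e in entities}

ADAPTERS = {"a": adapter_a, "b": adapter_b}
WRITERS = {"names": writer_names, "index": writer_index}

def harmonize(batches, formats):
    """batches: {adapter name: [rows]}; returns {format name: output}."""
    canonical = [ADAPTERS[name](row) for name, rows in batches.items() for row in rows]
    return {fmt: WRITERS[fmt](canonical) for fmt in formats}
```

With N adapters and M writers this layout needs N + M functions, where direct source-to-format mappings would need N × M.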

The 10-step recipe

From a CSV to a multi-format Open Data deliverable.

Audit the data

Count records and columns, list distinct values for candidate enums, spot format quirks (embedded vCard, Python dict literals, spelling variants). The spelling variants are gold: they reveal the shared entities your model should lift.
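The audit can start as a throwaway script. The sketch below assumes the source is a CSV with a header row; the function names and the `enum_max` threshold are illustrative choices, not part of the method.

```python
import csv
from collections import Counter

def load_rows(handle):
    """Read a CSV file object with a header row into a list of dicts."""
    return list(csv.DictReader(handle))

def audit(rows, enum_max=20):
    """Report record count, columns, and value counts for low-cardinality columns."""
    columns = list(rows[0].keys()) if rows else []
    report = {"records": len(rows), "columns": columns, "candidate_enums": {}}
    for col in columns:
        counts = Counter((row[col] or "").strip() for row in rows)
        if len(counts) <= enum_max:  # few distinct values: likely a closed set
            report["candidate_enums"][col] = dict(counts)
    return report

def spelling_variants(values):
    """Group values that differ only by letter case or whitespace."""
    groups = {}
    for value in values:
        key = " ".join(value.casefold().split())
        groups.setdefault(key, set()).add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}
```

Each group returned by `spelling_variants` is a candidate shared entity: several spellings in the source, one record in the canonical model.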

Benchmark open standards

Search Smart Data Models, schema.org, and domain-specific sources (DATEX II, INSPIRE, GBIF, Wikidata). Decide for each concept: align (the standard fits), partial (extend with a local namespace), or gap (define and propose).

Define the canonical model in Dolfin

Write <domain>.dolfin. Lift every shared real-world entity (Authority, Species, Category) to a separate concept. Use enums for closed sets. Add optional refExt attributes for dereferenceable IRIs. Be generous with optional.
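The same modelling choices carry over to the Python side in model.py. The sketch below is not Dolfin syntax; it is a hypothetical Python mirror of a tree model, with invented class, field, and enum names, showing lifted entities, a closed enum, an optional `ref_ext` IRI, and generous use of optional fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Status(Enum):
    """Closed set: an unknown value fails loudly instead of leaking through."""
    CLASSIFIED = "classified"
    UNDER_REVIEW = "underReview"

@dataclass(frozen=True)
class Species:
    """Lifted entity: one record per real-world species, shared by many trees."""
    scientific_name: str
    ref_ext: Optional[str] = None  # dereferenceable external IRI, when known

@dataclass(frozen=True)
class Authority:
    name: str
    ref_ext: Optional[str] = None

@dataclass
class Tree:
    id: str
    species: Species
    authority: Optional[Authority] = None
    status: Optional[Status] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
```

Because `Species` and `Authority` are frozen, equal instances hash alike, so collecting them in a set gives the distinct shared entities directly.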

Scaffold the harmonizer package

Create harmonize_<domain>/ with model.py, transforms.py, one writer per output format, __main__.py, and adapters/_template.py. transforms.py is portable: copy from any reference implementation.

Adapter contract

Every adapter exposes one function: read(path) -> Iterator[CanonicalEntity]. Source parsing, regex extraction, and registry dedup live in the adapter. No writer logic, no API logic, no CLI logic.
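A minimal adapter honouring this contract might look like the sketch below. The `CanonicalEntity` fields and the source column names (`id`, `name`, `entity`) are assumptions made for illustration; a real adapter would import the entity class from model.py.

```python
import csv
from dataclasses import dataclass
from typing import Iterator, Optional

@dataclass
class CanonicalEntity:
    # Stand-in for the class defined in model.py; fields are hypothetical.
    id: str
    name: str
    authority: Optional[str] = None

def _key(text: str) -> str:
    """Normalise case and whitespace so spelling variants collide."""
    return " ".join(text.casefold().split())

def read(path: str) -> Iterator[CanonicalEntity]:
    """Parse one source file and yield canonical entities, nothing else."""
    registry = {}  # normalised key -> first spelling seen (registry dedup)
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            raw = (row.get("entity") or "").strip()
            authority = registry.setdefault(_key(raw), raw) if raw else None
            yield CanonicalEntity(
                id=row["id"].strip(),
                name=row["name"].strip(),
                authority=authority,
            )
```

The function is a generator, so the file stays open only while the caller is iterating, and large sources never need to fit in memory.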

Write the writers

One writer per output format. Each writer reads canonical entities, never source data. Two writers cannot drift, because they read the same input. Add a third format = add a third writer.
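As an illustration, two writers over the same canonical list could be sketched as follows. The `Poi` class and its fields are hypothetical, and the JSON-LD output is a simplified NGSI-LD-style shape: a real Smart Data Models payload would also need an `@context` and the attributes its schema requires.

```python
from dataclasses import dataclass

@dataclass
class Poi:
    # Hypothetical canonical entity; a real project would import it from model.py.
    id: str
    name: str
    lon: float
    lat: float

def _point(e):
    # GeoJSON positions are ordered [longitude, latitude].
    return {"type": "Point", "coordinates": [e.lon, e.lat]}

def to_geojson(entities):
    """Writer 1: a GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature", "id": e.id, "geometry": _point(e),
             "properties": {"name": e.name}}
            for e in entities
        ],
    }

def to_jsonld(entities, entity_type="PointOfInterest"):
    """Writer 2: simplified NGSI-LD-style entities (no @context, assumption)."""
    return [
        {
            "id": f"urn:ngsi-ld:{entity_type}:{e.id}",
            "type": entity_type,
            "name": {"type": "Property", "value": e.name},
            "location": {"type": "GeoProperty", "value": _point(e)},
        }
        for e in entities
    ]
```

Both functions take the same list and return plain dicts, so serialising with `json.dumps` and comparing IDs across outputs is a cheap consistency check.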

External references

Where a concept has global identity, attach a resolvable IRI. Cache lookups to disk (GBIF, schema.org). Prefer an explicit JSON lookup file (category_map.json) over an API call for small controlled vocabularies.
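One way to sketch both options: a small JSON-backed cache around any lookup function, and a plain loader for a lookup file. The `DiskCache` class and `load_lookup` are hypothetical helpers, and `fetch` is whatever callable performs the real lookup; no actual GBIF or schema.org API call is shown here.

```python
import json
from pathlib import Path

class DiskCache:
    """Memoise lookups in a JSON file so repeated runs do not re-query the service."""

    def __init__(self, path, fetch):
        self._path = Path(path)
        self._fetch = fetch  # callable: key -> JSON-serialisable value (e.g. an IRI)
        if self._path.exists():
            self._data = json.loads(self._path.read_text(encoding="utf-8"))
        else:
            self._data = {}

    def get(self, key):
        if key not in self._data:
            self._data[key] = self._fetch(key)
            # Persist after every miss so an interrupted run keeps its progress.
            self._path.write_text(
                json.dumps(self._data, ensure_ascii=False, indent=2),
                encoding="utf-8",
            )
        return self._data[key]

def load_lookup(path):
    """Load a small controlled vocabulary, e.g. category_map.json, as a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Keeping the cache file next to the outputs also helps reproducibility: a second run resolves the same keys to the same IRIs even if the remote service has changed.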

Wire the CLI

A single CLI orchestrates: python -m harmonize_<domain> --adapter <src> --input ... --output ... --<extra-format> .... Optional output flags enable additional writers without touching the core.
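A possible shape for the argument parsing in __main__.py, using `argparse`. The `--geojson` flag stands in for the unspecified extra-format flag, and treating `--output` as the JSON-LD path is an assumption; neither is confirmed by the reference implementations.

```python
import argparse

def build_parser(prog="harmonize_domain"):
    """Build the CLI parser; `prog` is a placeholder for the real package name."""
    parser = argparse.ArgumentParser(prog=prog, description="Pivot harmonizer")
    parser.add_argument("--adapter", required=True, help="name of the source adapter")
    parser.add_argument("--input", required=True, help="path to the source file")
    parser.add_argument("--output", required=True, help="path of the main output")
    # Hypothetical optional writer: absent flag means the writer is not run.
    parser.add_argument("--geojson", metavar="PATH", default=None,
                        help="optional: also write GeoJSON to PATH")
    return parser

def selected_writers(args):
    """Map writer names to output paths, based on which flags were given."""
    writers = {"jsonld": args.output}  # assumption: main output is JSON-LD
    if args.geojson:
        writers["geojson"] = args.geojson
    return writers
```

Adding a format then means one new optional argument and one new entry in `selected_writers`; the adapter and model code are untouched.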

Validate

Re-run on a second dataset without changing the core. If you cannot, the canonical model is too source-specific. Check sanity metrics: record counts, distinct entity counts before vs after, spot checks by ID.
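The sanity metrics are simple enough to compute in a few lines. In this sketch, source rows and canonical entities are both assumed to be dicts, and the function and key names are illustrative only.

```python
def sanity_metrics(source_rows, entities, source_key, entity_key):
    """Compare record counts and distinct values of one field before and after."""
    before = {row[source_key] for row in source_rows if row.get(source_key)}
    after = {e[entity_key] for e in entities if e.get(entity_key)}
    return {
        "records_in": len(source_rows),
        "records_out": len(entities),
        "distinct_before": len(before),
        "distinct_after": len(after),
    }

def spot_check(entities, entity_id):
    """Return the entity with the given ID, or None if it is missing."""
    return next((e for e in entities if e["id"] == entity_id), None)
```

A healthy run usually keeps `records_in` equal to `records_out` while `distinct_after` drops below `distinct_before`, since spelling variants collapse into shared entities.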

Ship

Page per use case, Dolfin file deposit, tarball of the harmonizer, slides in markdown. Make it reproducible: anyone should be able to clone, run, and get the same outputs.

Three concrete worked examples.

Use case	Domain	Standards stance	External backbone	Writers
UC1	Classified urban trees	Gap (no SDM Tree, proposed)	GBIF Backbone Taxonomy	JSON-LD, GeoJSON
UC2	Points of interest	Align (SDM PointOfInterest)	schema.org + Wikidata	JSON-LD, GeoJSON
UC4	City traffic indicators	Partial (SDM per-segment vs city-wide)	SDM Transportation + DATEX II	JSON-LD, DATEX II XML, GeoJSON

Download

The full SKILL document.

The complete write-up, including all 10 steps with copy-paste code, file-tree templates, and external references, is shipped as a single markdown file. Use it as a working brief when you start a new harmonization project.

↓ SKILL.md: the full method, 1782 words
↓ harmonizer-template.tar.gz: starter kit, ready to fork
↓ demo-porto-meeting.tar.gz: hands-on demo, 3 pipelines + compiler
→ team LESL: who we are
→ UC1, Trees
→ UC2, POIs
→ UC4, Traffic

Suggested usage

Read the SKILL once end-to-end to internalise the pattern.
For a new project, open SKILL alongside the closest reference UC.
Copy the harmonizer package of the closest UC as starting scaffolding, rename, and replace the adapter and writers.
Keep transforms.py as-is across projects: it carries the portable text helpers.
