Team GEX shipped an AI-assisted city-data harmonization pipeline, built around nine specialised "rangers" trained as MIM Champions in the spirit of the OASC Academy. It was tested end-to-end on the Porto energy track over five heterogeneous sources, then proven domain-agnostic on water-quality data with zero pipeline changes.
Cities are drowning in data, but the data does not fail because
it is missing. It fails because meaning and format fragment
across systems. In the Porto energy use case, the same quantity
appears as energy_kWh, consumption_kWh,
and energyConsumed across providers: one concept,
a dozen names, zero compatibility.
Different delimiters, encodings, and shapes. Every provider does it differently.
energy_kWh vs consumption_kWh vs energyConsumed. Same concept. Zero compatibility.
Without a shared canonical target, comparing or aggregating across providers is impossible at scale.
Can we make the harmonization process replicable? Can we develop a reusable, domain-agnostic system? The task is not a 1:1 field match. It is, first, a semantic problem: take several non-interoperable datasets within one domain, identify the right canonical target, fill the gaps, and produce one combined output that carries all content from every source without losing provenance.
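A minimal sketch of that idea, assuming a hand-written alias table: rows from two providers are renamed onto one canonical field, and every output record keeps a provenance entry plus any columns that could not be mapped. The canonical name `energyConsumption`, the `harmonize` helper, and the sample rows are hypothetical and are not taken from the GEX code; only the three source field names come from the text.

```python
# Hypothetical alias table. The three source names appear in the text;
# the canonical target name "energyConsumption" is an assumption.
ALIASES = {
    "energy_kWh": "energyConsumption",
    "consumption_kWh": "energyConsumption",
    "energyConsumed": "energyConsumption",
}


def harmonize(source_id, rows):
    """Rename known fields to the canonical name and keep provenance."""
    harmonized = []
    for index, row in enumerate(rows):
        record = {}
        unmapped = {}
        for name, value in row.items():
            canonical = ALIASES.get(name)
            if canonical is None:
                # Nothing is dropped: unknown columns travel with the record.
                unmapped[name] = value
            else:
                record[canonical] = value
        record["provenance"] = {
            "source": source_id,
            "row": index,
            "unmapped": unmapped,
        }
        harmonized.append(record)
    return harmonized


# Illustrative rows only, not real provider data.
combined = (
    harmonize("provider_a", [{"energy_kWh": 12.5}])
    + harmonize("provider_b", [{"consumption_kWh": 7.0, "zone": "north"}])
)
```

Keeping unmapped columns inside the provenance entry is one way to carry all source content forward while the canonical part of the record stays clean.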
The CityData Harmonizer is a nine-act pipeline of agents, the Rangers, trained in the spirit of the OASC Academy as MIM champions. Its founding principle: a strict separation between probabilistic proposal and deterministic execution. The AI rangers propose, the deterministic rangers verify and execute, and nothing reaches production data without passing a rule-based quality gate.
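The split between proposal and execution can be illustrated with a small rule-based gate. This is a sketch under stated assumptions, not the team's implementation: the check names `schema_fit`, `unit_check`, and `provenance_check` and the three verdicts appear in the source, but the logic of each check, the `Proposal` type, the allowed-unit table, and the 0.8 escalation threshold are all hypothetical. The `vocab_match` check is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    APPROVE = "APPROVE"
    REJECT = "REJECT"
    ESCALATE = "ESCALATE"


@dataclass(frozen=True)
class Proposal:
    """A mapping proposed by an AI ranger (hypothetical shape)."""
    source_field: str
    target_field: str
    unit: str
    confidence: float
    source_id: str


# Hypothetical canonical schema: target field -> accepted units.
CANONICAL_UNITS = {"energyConsumption": {"Wh", "kWh", "MWh"}}


def schema_fit(p):
    return p.target_field in CANONICAL_UNITS


def unit_check(p):
    return p.unit in CANONICAL_UNITS.get(p.target_field, set())


def provenance_check(p):
    return bool(p.source_id)


def gate(p, escalate_below=0.8):
    """Deterministic verdict: no model call, only fixed rules."""
    checks = (schema_fit, unit_check, provenance_check)
    if not all(check(p) for check in checks):
        return Verdict.REJECT
    if p.confidence < escalate_below:
        # Rules pass but the proposer was unsure: hand over to a human.
        return Verdict.ESCALATE
    return Verdict.APPROVE
```

Because the gate is plain code, the same proposal always receives the same verdict, which is what makes the execution side auditable.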
INFERRED, UNCERTAIN, UNMAPPED: surfaced for human review.
Four AI-powered rangers handle investigation and judgment. Five deterministic rangers handle parsing, verification, execution, validation, and lifecycle.
Proposed mappings are labelled INFERRED, UNMAPPED, or REVIEW.
The quality gate checks schema_fit, vocab_match, unit_check, and provenance_check, then returns APPROVE, REJECT or ESCALATE. No LLM involved.
Once a mapping is promoted to the Knowledge Base, the Planner skips the LLM call entirely on later runs for that field. Zero cost, same quality, faster every run. For stable, recurring sources, the LLM cost converges to zero.
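The Knowledge Base shortcut amounts to a lookup before the model call. The sketch below assumes an in-memory dictionary keyed by source and field name; the `Planner` class shape, the `promote` method, and the `propose_with_llm` callable are hypothetical stand-ins, since the text does not describe the real interfaces.

```python
class Planner:
    """Looks up promoted mappings first and calls the LLM only on a miss."""

    def __init__(self, propose_with_llm):
        # propose_with_llm: any callable taking a field name and returning
        # a proposed canonical target (a stand-in for the real model call).
        self._propose = propose_with_llm
        self.knowledge_base = {}
        self.llm_calls = 0

    def plan(self, source_id, field_name):
        key = (source_id, field_name)
        if key in self.knowledge_base:
            # Promoted mapping: no model call, no cost.
            return self.knowledge_base[key]
        self.llm_calls += 1
        return self._propose(field_name)

    def promote(self, source_id, field_name, target):
        """Store a mapping that passed the quality gate."""
        self.knowledge_base[(source_id, field_name)] = target


# Usage with a stub in place of the model.
planner = Planner(lambda field_name: "energyConsumption")
first = planner.plan("provider_a", "energy_kWh")      # miss: calls the stub
planner.promote("provider_a", "energy_kWh", first)
second = planner.plan("provider_a", "energy_kWh")     # hit: served from the KB
```

For a source whose columns never change, every field is a hit after the first approved run, which is the sense in which the LLM cost converges to zero.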
Today, no Smart Data Model captures observed energy consumption
across providers and fuel types. ACMeasurement is
electricity-only and built for phase-level electrical engineering.
ConsumptionCost supports only monthly granularity and
uses free-text energy types. EnergyConsumer is a
grid-topology object for power-flow simulation, not a metered
observation.
Team GEX reviewed seven city open-data portals, five international standards (ESPI / Green Button, IEC 61968-9, CityGML Energy ADE, ISO 52000, EPBD) and every existing SDM energy entity, and proposes EnergyConsumptionObserved: one entity for any fuel at any granularity, 8 mandatory fields, 41 in all, designed from day one for AI-driven harmonization.
| Capability | ACMeasurement | ConsumptionCost | EnergyConsumer | SmartMeteringObs | EnergyConsumptionObserved |
|---|---|---|---|---|---|
| Multi-fuel (gas, heat, H₂, ...) | elec only | free text | grid only | elec focus | ✓ 12-value enum |
| Interval timestamps (from / to) | single string | year + month | — | partial | ✓ from / to split |
| Building / meter / org relationships | refDevice | refPoint | — | partial | ✓ 6 relationships |
| Flow direction (import / export) | separate fields | — | — | — | ✓ enum |
| Supply source (grid / onSite / storage) | — | — | — | — | ✓ enum |
| Sector classification | — | — | — | — | ✓ Eurostat-aligned |
| CO₂ emission factor + emissions | — | — | — | — | ✓ native |
| Weather context (inline) | — | — | — | — | ✓ 9 fields |
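
Two rows of the comparison, the enumerated flow direction and supply source and the from / to interval split, can be sketched as typed values. The enum members mirror the values named in the table; the class names, attribute names, and validation rules are assumptions for illustration and do not reproduce the proposed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class FlowDirection(Enum):
    IMPORT = "import"
    EXPORT = "export"


class SupplySource(Enum):
    GRID = "grid"
    ON_SITE = "onSite"
    STORAGE = "storage"


@dataclass(frozen=True)
class Interval:
    """Observation window with separate start and end (hypothetical names)."""
    date_from: datetime
    date_to: datetime

    def __post_init__(self):
        if self.date_from.tzinfo is None or self.date_to.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        if self.date_to <= self.date_from:
            raise ValueError("date_to must be after date_from")

    @property
    def seconds(self):
        return (self.date_to - self.date_from).total_seconds()


# A 10-minute reading, the granularity of the Tetouan dataset.
window = Interval(
    datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 10, tzinfo=timezone.utc),
)
```

An explicit interval lets a 10-minute reading and a monthly total share one model, because the granularity is carried by the data instead of the schema.
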
To test the pipeline, GEX deliberately chose datasets that did not agree with each other. Within the energy domain, it used five independent sources with different owners, formats, field names, and granularities. The same exercise was then repeated in the water-quality domain with no changes to the pipeline.
| Dataset | Format | Source | What it tested |
|---|---|---|---|
| Fronius Energy Consumption, Maia | XLSX (private) | Pedro C.C. Pimenta, GitHub | Proprietary monitoring export, closest to a live city feed |
| Tetouan Smart Grid, 10-min Power | CSV | Tetouan Smart Grid Research | Three zones with 10-min timestamps, temporal harmonization + multi-zone |
| Electricity Consumption with Environment Variables | CSV | Aamir Ansari, Kaggle | Household-level active/reactive power + environmental variables |
| Climate and Energy Consumption, 2020–2024 | CSV | Emirhan Akku, Kaggle | Country-level aggregation, emissions, renewables, demographics |
| Smart Energy Consumption and Peak Load | CSV | Jay Joshi, Kaggle | Building characteristics + occupancy + peak-load indicators |
| Dataset | Format | Source | What it tested |
|---|---|---|---|
| Water Quality Monitoring | CSV | Pratiksha827, GitHub | Mapping to WaterQualityObserved with no pipeline change |
| Water Potability | CSV | A. Kadiwal, Kaggle | Merging two non-interoperable water datasets into one canonical model |
Across both domains, heterogeneous non-interoperable inputs converged to a single Smart Data Model output. The energy datasets prove multi-source convergence within one domain. The water datasets prove the same pipeline carries to a new domain and a new canonical model without code changes.
MIM compliance is not a checklist applied after the fact. Each mechanism is satisfied by a specific ranger at a specific stage. Below: the MIMs satisfied today, and the roadmap for the rest, following the same four-step method (classify, decompose, assign, promote).
| MIM | Requirement | How it is satisfied | Responsible ranger |
|---|---|---|---|
| MIM1 | Unique persistent identifier per entity | Auto-generated urn:ngsi-ld:EnergyConsumptionObserved:<uuid> | Executor |
| MIM1 | Cross-system entity linking | refMeter / refBuilding NGSI-LD URNs | Executor |
| MIM1 | Semantic typing of entities | Every record carries an explicit type aligned to NGSI-LD | Executor |
| MIM1 | Ontology-level mapping across sources | Heterogeneous field names mapped to canonical equivalents with confidence scores and persisted evidence | Profiler, Planner, Resolver |
| MIM2 | Models explicit, documented, unambiguous | Schema defines all attributes with types, units, enumerations, schema.org and OASC SDM references | SchemaSelector, Planner |
| MIM2 | Build on standardized community models | Composes OASC SDMs via allOf/$ref: GSMA-Commons, Location-Commons, EnergyConsumption | SchemaSelector |
| MIM2 | Cross-model transformation to common model | The 9-act pipeline transforms any provider schema into one canonical model automatically | Planner, Executor |
| MIM3 | Adequate machine-readable metadata | Each source manifest is a structured descriptor: owner, domain, format, columns, units, quality notes | Profiler |
| MIM3 | Traceable provenance and trust lineage | Provenance record per output: source row → mapping IDs → every transform → KE verdict → validation result | Executor, Validator |
| MIM7 | Geospatial data in open standards | combine_lat_lon and parse_geojson transforms produce OGC-compliant GeoJSON Point geometry | Executor |
| MIM7 | Contextual datasets comply with MIM1 + MIM2 | Location embedded in the same NGSI-LD record that carries the MIM1 URN and MIM2 canonical schema | Executor |
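
The MIM1 identifier and MIM7 geometry rows can be sketched together. The URN pattern and the transform name `combine_lat_lon` come from the table; the function bodies, the `build_entity` helper, and the sample coordinates are assumptions. The `location` attribute uses the standard NGSI-LD GeoProperty shape, and GeoJSON orders coordinates as longitude then latitude.

```python
import uuid


def make_urn(entity_type="EnergyConsumptionObserved"):
    """Build an identifier of the form urn:ngsi-ld:<type>:<uuid>."""
    return f"urn:ngsi-ld:{entity_type}:{uuid.uuid4()}"


def combine_lat_lon(lat, lon):
    """Return a GeoJSON Point; GeoJSON puts longitude first."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return {"type": "Point", "coordinates": [lon, lat]}


def build_entity(lat, lon):
    """Assemble a minimal record carrying both the URN and the location."""
    return {
        "id": make_urn(),
        "type": "EnergyConsumptionObserved",
        "location": {
            "type": "GeoProperty",
            "value": combine_lat_lon(lat, lon),
        },
    }


# Approximate coordinates for Porto, used only as sample input.
entity = build_entity(41.15, -8.61)
```

Putting the identifier and the geometry in one record is what the last MIM7 row describes: the location travels with the entity that already satisfies MIM1 and MIM2.
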
| MIM | Requirement | How the same method extends the system |
|---|---|---|
| MIM0 | Accessible, standardized APIs, no silos | Push harmonized NGSI-LD records to an Orion-LD or Scorpio context broker after execution |
| MIM2 | Transformation rules reusable and self-learning | Promote the Knowledge Base from per-deployment store to a shared community registry |
| MIM3 | Governance defined and comprehensible | Add a governance block to the manifest, expose manifest metadata as a DCAT-AP catalogue endpoint, integrate an IDS / GAIA-X data-sharing policy at the API gateway |
| MIM4 | Individuals can control their personal data | Add a consent-management step (MyData or Solid pattern) before harmonizing any personal record |
| MIM6 | Identity lifecycle and cryptographic data integrity | OAuth2 / JWT + TLS at the API layer with role-based access, sign harmonized output and provenance files |
| MIM7 | CRS compliance and INSPIRE metadata for the EU context | Declare crs EPSG:4326 on all GeoJSON output, add INSPIRE-compliant metadata fields to the manifest schema |
| MIM8 | Local Digital Twin layers | Declare the harmonizer as the LDT layer-3 pre-processing component, feed its UTC-normalized time series into demand forecasting and Climate Contract tracking |
Give every MIM a machine-readable companion. The MIM text stays as it is, written for people; alongside it sits a structured version, derived from the three Y.4505 parts each MIM already has, that an agent can consume directly.
Standards-compliant NGSI-LD output, live model selection through the Smart Data Models MCP server, expert-in-the-loop interface for the stewards who confirm mappings.
A versioned registry where a mapping confirmed in one city becomes reusable by every other. MIM compliance compounds across the network instead of restarting in each city.
A small group of member cities runs the harmonizer on real data, while the OASC Academy trains the first stewards, the AI Rangers, who operate the system and grow the shared knowledge library. MIM Champions to start with.
Submit the canonical target back to the Smart Data Models community. 8 mandatory fields, 41 in all, designed from day one for AI-driven harmonization across heterogeneous city data sources.
Three documents shipped by Team GEX. Together they form the groundwork: the proof, the architecture, and the pitch.