UC3, Energy, CityData Harmonizer, team GEX, MIMathon Porto 2026

Step 1, the challenge

Cities are drowning in data.

Cities are drowning in data, but the data does not fail because it is missing. It fails because meaning and format fragment across systems. In the Porto energy use case, the same quantity appears as energy_kWh, consumption_kWh, and energyConsumed across providers: one concept, a dozen names, zero compatibility.

Multiple formats

CSV, XLSX, JSON.

Different delimiters, encodings, and shapes. Every provider does it differently.

Conflicting schemas

One concept, twelve names.

energy_kWh vs consumption_kWh vs energyConsumed. Same concept. Zero compatibility.

No shared model

No canonical target.

Without a shared canonical target, comparing or aggregating across providers is impossible at scale.

Beyond the simple match

Can we make the harmonization process replicable? Can we develop a reusable, domain-agnostic system? The task is not a 1:1 field match. It is, first, a semantic problem: take several non-interoperable datasets within one domain, identify the right canonical target, fill the gaps, and produce one combined output that carries all content from every source without losing provenance.

Step 2, the approach

AI proposes, deterministic verifies.

The CityData Harmonizer is a nine-act pipeline of agents, the Rangers, trained in the spirit of the OASC Academy as MIM champions. Its founding principle: a strict separation between probabilistic proposal and deterministic execution. The AI rangers propose, the deterministic rangers verify and execute, and nothing reaches production data without passing a rule-based quality gate.

9 rangers / MIM champions

4 / 5 AI / deterministic split

7 architectural layers (L1–L7)

0.85 auto-approval confidence

Confidence routing, explicit and deterministic

≥ 0.85: auto-approved as INFERRED
0.60 to < 0.85: sent to Resolver as UNCERTAIN
< 0.60: excluded as UNMAPPED, surfaced for human review
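
The routing rule is small enough to write down directly. A minimal sketch, assuming the thresholds listed above; the names `Route` and `route` are illustrative and not taken from the project code.

```python
from enum import Enum


class Route(Enum):
    INFERRED = "INFERRED"    # auto-approved
    UNCERTAIN = "UNCERTAIN"  # sent to the Resolver
    UNMAPPED = "UNMAPPED"    # excluded, surfaced for human review


AUTO_APPROVE = 0.85  # thresholds from the routing list above
RESOLVER_MIN = 0.60


def route(confidence: float) -> Route:
    """Route a proposed mapping by its confidence score."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= AUTO_APPROVE:
        return Route.INFERRED
    if confidence >= RESOLVER_MIN:
        return Route.UNCERTAIN
    return Route.UNMAPPED
```

Comparing against the two thresholds in order leaves no gap between bands: a score of 0.849 goes to the Resolver rather than falling between 0.84 and 0.85.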

Control / data separation · Deterministic reproducibility · Self-learning convergence · Full provenance · MCP interoperability · Quality gate · Format-agnostic · Schema-agnostic

Step 3, the team

Meet the 9 rangers.

Four AI-powered rangers handle investigation and judgment. Five deterministic rangers handle parsing, verification, execution, validation, and lifecycle.

SchemaSelector

AI · pre-pipeline

"I pick the right model before the mission starts."

Discovers the best-fitting OASC Smart Data Model for the source via the Smart Data Models MCP server.

Adapter

DET · ACT 1, ingestion

"I don't ask questions. I just read files."

Reads CSV, XLSX, JSON. Turns any format into a clean list of records. Pluggable: new formats need no pipeline changes.
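
One way to get the "pluggable" property is a reader registry keyed by file extension. The sketch below is an assumption about the design, not the Adapter's actual code: `READERS`, `register`, and `ingest` are hypothetical names, and XLSX is left out because it needs a third-party library.

```python
import csv
import io
import json
from typing import Any, Callable

Record = dict[str, Any]
Reader = Callable[[str], list[Record]]

READERS: dict[str, Reader] = {}  # hypothetical registry: extension -> reader


def register(ext: str) -> Callable[[Reader], Reader]:
    """Decorator that adds a reader to the registry."""
    def wrap(fn: Reader) -> Reader:
        READERS[ext.lower()] = fn
        return fn
    return wrap


@register("csv")
def read_csv(text: str) -> list[Record]:
    lines = text.splitlines()
    if not lines:
        return []
    # Providers differ on delimiters: pick the most frequent one in the header.
    delimiter = max(",;\t", key=lines[0].count)
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


@register("json")
def read_json(text: str) -> list[Record]:
    data = json.loads(text)
    # Accept either a top-level array or a single object.
    return data if isinstance(data, list) else [data]


def ingest(text: str, ext: str) -> list[Record]:
    """Turn raw file content into a list of records."""
    try:
        return READERS[ext.lower()](text)
    except KeyError:
        raise ValueError(f"no reader registered for .{ext}") from None
```

Adding a format then means registering one more function; `ingest` and everything downstream stay untouched.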

Profiler

AI · ACT 2, profiling

"I observe. I never judge. Well, almost never."

Reads source + manifest metadata, produces a Source Intelligence File: field names, types, distributions, quality hints.
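
The Profiler itself is AI-powered, but the per-field statistics it reports can be illustrated deterministically. A rough sketch of what such a profile might contain, assuming records arrive as dictionaries of raw values; the output keys (`type`, `missing_ratio`, `distinct`) are hypothetical and not the actual Source Intelligence File format.

```python
from collections import Counter
from typing import Any


def infer_type(value: Any) -> str:
    """Classify a raw value as 'number' or 'string'."""
    try:
        float(value)
        return "number"
    except (TypeError, ValueError):
        return "string"


def profile(records: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Summarize each field: dominant type, share of missing values, distinct count."""
    if not records:
        return {}
    summary: dict[str, dict[str, Any]] = {}
    for name in records[0]:
        values = [r.get(name) for r in records]
        present = [v for v in values if v not in ("", None)]
        types = Counter(infer_type(v) for v in present)
        summary[name] = {
            "type": types.most_common(1)[0][0] if types else "unknown",
            "missing_ratio": 1 - len(present) / len(values),
            "distinct": len({str(v) for v in present}),
        }
    return summary
```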

Planner

AI · ACT 3, planning

"Give me a schema and I will give you a plan."

Maps source fields to canonical targets. Routes by confidence score. Reasons about required fields when absent from source.

Resolver

AI · ACT 4, resolution

"I handle the hard cases the Planner didn't want."

Deep analysis of uncertain mappings. Returns one decision per field: INFERRED, UNMAPPED, or REVIEW.

Knowledge Engine

DET · ACT 5, quality gate

"Nothing passes without my approval. Nothing."

Four rule-based checks: schema_fit, vocab_match, unit_check, provenance_check. APPROVE, REJECT or ESCALATE. No LLM involved.
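
How four boolean checks collapse into three verdicts is not spelled out here. The sketch below shows one plausible policy and is purely an assumption: a failure of a "blocking" check rejects, any other failure escalates. Which checks are blocking is invented for illustration.

```python
CHECKS = ("schema_fit", "vocab_match", "unit_check", "provenance_check")

# Assumption: these two are treated as hard failures. The real gate may differ.
BLOCKING = {"schema_fit", "provenance_check"}


def verdict(results: dict[str, bool]) -> str:
    """Combine the four check results into APPROVE, REJECT, or ESCALATE."""
    missing = [name for name in CHECKS if name not in results]
    if missing:
        raise ValueError(f"missing check results: {missing}")
    failed = {name for name in CHECKS if not results[name]}
    if not failed:
        return "APPROVE"
    if failed & BLOCKING:
        return "REJECT"
    return "ESCALATE"
```

Because the function takes only the check results and uses no model call, the same inputs always produce the same verdict.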

Executor

DET · ACT 6, execution

"No LLM. No drama. Just execution."

Applies the verified plan row by row. 9 pure transform functions: rename, convert_unit, aggregate, map_value, parse_date, normalize_text, restructure, parse_geojson, combine_lat_lon.
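
"Pure" here means each transform returns a new record and leaves its input untouched. A minimal sketch of three of the nine, with assumed signatures; the conversion table and the `location` output key are illustrative.

```python
from typing import Any

Record = dict[str, Any]

# Divisors to reach kWh; an illustrative subset, not the project's unit table.
PER_KWH = {"Wh": 1000.0, "kWh": 1.0, "MWh": 0.001}


def rename(record: Record, source: str, target: str) -> Record:
    """Move a value from the source field name to the canonical one."""
    out = {k: v for k, v in record.items() if k != source}
    out[target] = record[source]
    return out


def convert_unit(record: Record, field: str, from_unit: str) -> Record:
    """Convert a numeric field to kWh."""
    return {**record, field: float(record[field]) / PER_KWH[from_unit]}


def combine_lat_lon(record: Record, lat_field: str, lon_field: str) -> Record:
    """Replace separate latitude/longitude fields with a GeoJSON Point."""
    out = {k: v for k, v in record.items() if k not in (lat_field, lon_field)}
    # GeoJSON orders coordinates as [longitude, latitude].
    out["location"] = {
        "type": "Point",
        "coordinates": [float(record[lon_field]), float(record[lat_field])],
    }
    return out
```

Chaining them, for example `convert_unit(rename(row, "energy_Wh", "consumption"), "consumption", "Wh")`, never mutates `row`, which is what makes a run reproducible row by row.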

Validator

DET · ACT 7, validation

"I am the last line of defense. Don't disappoint me."

Checks every output record against the canonical JSON Schema. Returns pass / fail with error messages. Failures block publication.
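
The sketch below stands in for a full JSON Schema validator using only the standard library. It checks the eight mandatory fields named in Step 4 plus two basic constraints; the function names and error wording are assumptions, and a real deployment would validate against the complete schema.

```python
from typing import Any

REQUIRED = (
    "id", "type", "dateObservedFrom", "dateObservedTo",
    "temporalResolution", "energyType", "consumption", "unitCode",
)
ENTITY_TYPE = "EnergyConsumptionObserved"


def validate(record: dict[str, Any]) -> list[str]:
    """Return a list of error messages; an empty list means the record passes."""
    errors = [f"missing required field: {name}" for name in REQUIRED if name not in record]
    if "type" in record and record["type"] != ENTITY_TYPE:
        errors.append(f"type must be {ENTITY_TYPE}")
    if "consumption" in record:
        value = record["consumption"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append("consumption must be a number")
    return errors


def publishable(records: list[dict[str, Any]]) -> bool:
    """Any single failure blocks publication of the whole batch."""
    return all(not validate(r) for r in records)
```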

PromotionManager

DET · ACT 5+, lifecycle

"I decide who gets promoted. Yes, really."

Manages Knowledge Base entry lifecycle: candidate → reviewed → approved → promoted. Each confirmed mapping becomes reusable for the next run.
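
The lifecycle reads as a linear state machine. A minimal sketch, assuming stages cannot be skipped and that only promoted entries are reused; both rules are assumptions, as are the function names.

```python
STAGES = ("candidate", "reviewed", "approved", "promoted")


def advance(stage: str) -> str:
    """Return the next lifecycle stage; stages cannot be skipped."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    if stage == STAGES[-1]:
        raise ValueError("entry is already promoted")
    return STAGES[STAGES.index(stage) + 1]


def is_reusable(stage: str) -> bool:
    """Assumption: only promoted entries are reused on later runs."""
    return stage == "promoted"
```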

Knowledge Base short-circuit

Once a mapping is promoted to the Knowledge Base, the Planner skips the LLM call entirely on later runs for that field. Zero cost, same quality, faster every run. For stable, recurring sources, the LLM cost converges to zero.
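
The short-circuit amounts to a lookup before the model call. A sketch under stated assumptions: `promoted` is a plain dictionary of promoted mappings and `propose` stands in for the LLM-backed proposal step; neither name comes from the project.

```python
from typing import Callable


def plan_field(
    source_field: str,
    promoted: dict[str, str],
    propose: Callable[[str], str],
) -> tuple[str, str]:
    """Return (canonical_target, origin) for one source field."""
    if source_field in promoted:
        # Knowledge Base hit: no model call is made.
        return promoted[source_field], "knowledge_base"
    return propose(source_field), "llm"
```

For a recurring source whose fields are all promoted, `propose` is never invoked, which is the sense in which LLM cost converges to zero.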

Step 4, the canonical model

EnergyConsumptionObserved, proposed back to SDM.

Today, no Smart Data Model captures observed energy consumption across providers and fuel types. ACMeasurement is electricity-only and built for phase-level electrical engineering. ConsumptionCost supports only monthly granularity and uses free-text energy types. EnergyConsumer is a grid-topology object for power-flow simulation, not a metered observation.

Team GEX reviewed seven city open-data portals, five international standards (ESPI / Green Button, IEC 61968-9, CityGML Energy ADE, ISO 52000, EPBD) and every existing SDM energy entity, and proposes EnergyConsumptionObserved: one entity for any fuel at any granularity, 8 mandatory fields, 41 in all, designed from day one for AI-driven harmonization.

🔒 Core, measurement

all required

id, type, dateObservedFrom, dateObservedTo, temporalResolution, energyType, consumption, unitCode

📋 Context, who, what, how much

flowDirection, supplySource, sector, buildingType, floorArea, consumptionIntensity, energyService, consumptionPoint, cost, costCurrency, tariffPeriod, dataQuality, isNormalised, primaryEnergyFactor, co2EmissionFactor, co2Emissions, phase

☁️ Weather, energy drivers

temperature, heatingDegreeDays, coolingDegreeDays, degreeDayBaseTemp, relativeHumidity, windSpeed, globalSolarRadiation, precipitation, cloudCover, refWeatherObserved

🔗 Relationships, linked entities

refConsumptionPoint, refBuilding, refDevice, refOrganization, refOperatingArea, refWeatherObserved

vs existing SDM energy entities

Capability	ACMeasurement	ConsumptionCost	EnergyConsumer	SmartMeteringObs	EnergyConsumptionObserved
Multi-fuel (gas, heat, H₂, ...)	elec only	free text	grid only	elec focus	✓ 12-value enum
Interval timestamps (from / to)	single string	year + month	—	partial	✓ from / to split
Building / meter / org relationships	refDevice	refPoint	—	partial	✓ 6 relationships
Flow direction (import / export)	separate fields	—	—	—	✓ enum
Supply source (grid / onSite / storage)	—	—	—	—	✓ enum
Sector classification	—	—	—	—	✓ Eurostat-aligned
CO₂ emission factor + emissions	—	—	—	—	✓ native
Weather context (inline)	—	—	—	—	✓ 9 fields

Step 5, domain-agnostic proof

Five energy sources, then two water sources, same pipeline.

To test the pipeline, GEX deliberately chose datasets that did not agree with each other. Within the energy domain, it used five independent sources with different owners, formats, field names, and granularities. It then repeated the same exercise in the water-quality domain with no changes to the pipeline.

Energy domain · five heterogeneous sources, one canonical target

Dataset	Format	Source	What it tested
Fronius Energy Consumption, Maia	XLSX (private)	Pedro C.C. Pimenta, GitHub	Proprietary monitoring export, closest to a live city feed
Tetouan Smart Grid, 10-min Power	CSV	Tetouan Smart Grid Research	Three zones with 10-min timestamps, temporal harmonization + multi-zone
Electricity Consumption with Environment Variables	CSV	Aamir Ansari, Kaggle	Household-level active/reactive power + environmental variables
Climate and Energy Consumption, 2020–2024	CSV	Emirhan Akku, Kaggle	Country-level aggregation, emissions, renewables, demographics
Smart Energy Consumption and Peak Load	CSV	Jay Joshi, Kaggle	Building characteristics + occupancy + peak-load indicators

Water-quality domain · zero pipeline change

Dataset	Format	Source	What it tested
Water Quality Monitoring	CSV	Pratiksha827, GitHub	Mapping to `WaterQualityObserved` with no pipeline change
Water Potability	CSV	A. Kadiwal, Kaggle	Merging two non-interoperable water datasets into one canonical model

Result

Across both domains, heterogeneous non-interoperable inputs converged to a single Smart Data Model output. The energy datasets prove multi-source convergence within one domain. The water datasets prove the same pipeline carries to a new domain and a new canonical model without code changes.

Step 6, MIM coverage

One MIM per ranger.

MIM compliance is not a checklist applied after the fact. Each mechanism is satisfied by a specific ranger at a specific stage. Below: the MIMs satisfied today, and the roadmap for the rest, following the same four-step method (classify, decompose, assign, promote).

MIMs satisfied today

MIM	Requirement	How it is satisfied	Responsible ranger
MIM1	Unique persistent identifier per entity	Auto-generated `urn:ngsi-ld:EnergyConsumptionObserved:<uuid>`	Executor
MIM1	Cross-system entity linking	`refMeter` / `refBuilding` NGSI-LD URNs	Executor
MIM1	Semantic typing of entities	Every record carries an explicit `type` aligned to NGSI-LD	Executor
MIM1	Ontology-level mapping across sources	Heterogeneous field names mapped to canonical equivalents with confidence scores and persisted evidence	Profiler, Planner, Resolver
MIM2	Models explicit, documented, unambiguous	Schema defines all attributes with types, units, enumerations, schema.org and OASC SDM references	SchemaSelector, Planner
MIM2	Build on standardized community models	Composes OASC SDMs via `allOf/$ref`: GSMA-Commons, Location-Commons, EnergyConsumption	SchemaSelector
MIM2	Cross-model transformation to common model	The 9-act pipeline transforms any provider schema into one canonical model automatically	Planner, Executor
MIM3	Adequate machine-readable metadata	Each source manifest is a structured descriptor: owner, domain, format, columns, units, quality notes	Profiler
MIM3	Traceable provenance and trust lineage	Provenance record per output: source row → mapping IDs → every transform → KE verdict → validation result	Executor, Validator
MIM7	Geospatial data in open standards	`combine_lat_lon` and `parse_geojson` transforms produce OGC-compliant GeoJSON Point geometry	Executor
MIM7	Contextual datasets comply with MIM1 + MIM2	Location embedded in the same NGSI-LD record that carries the MIM1 URN and MIM2 canonical schema	Executor
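
Two rows of the table lend themselves to a short illustration: the MIM1 identifier and the MIM3 provenance record. The URN pattern is the one shown in the table; the `Provenance` field names are hypothetical and only mirror the lineage listed there (source row, mapping IDs, transforms, KE verdict, validation result).

```python
import uuid
from dataclasses import dataclass, field


def make_urn(entity_type: str = "EnergyConsumptionObserved") -> str:
    """Build an NGSI-LD URN with a random (version 4) UUID."""
    return f"urn:ngsi-ld:{entity_type}:{uuid.uuid4()}"


@dataclass
class Provenance:
    # Hypothetical field names following the lineage in the MIM3 row.
    entity_id: str
    source_row: int
    mapping_ids: list[str] = field(default_factory=list)
    transforms: list[str] = field(default_factory=list)
    ke_verdict: str = ""
    validation_passed: bool = False
```

A random UUID keeps identifiers unique without coordination between runs. A deployment that needs the same source row to always yield the same URN would derive the identifier from the row instead, which is a design choice this sketch does not make.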

Roadmap, four-step method applied once more

MIM	Requirement	How the same method extends the system
MIM0	Accessible, standardized APIs, no silos	Push harmonized NGSI-LD records to an Orion-LD or Scorpio context broker after execution
MIM2	Transformation rules reusable and self-learning	Promote the Knowledge Base from per-deployment store to a shared community registry
MIM3	Governance defined and comprehensible	Add a governance block to the manifest, expose manifest metadata as a DCAT-AP catalogue endpoint, integrate an IDS / GAIA-X data-sharing policy at the API gateway
MIM4	Individuals can control their personal data	Add a consent-management step (MyData or Solid pattern) before harmonizing any personal record
MIM6	Identity lifecycle and cryptographic data integrity	OAuth2 / JWT + TLS at the API layer with role-based access, sign harmonized output and provenance files
MIM7	CRS compliance and INSPIRE metadata for the EU context	Declare `crs EPSG:4326` on all GeoJSON output, add INSPIRE-compliant metadata fields to the manifest schema
MIM8	Local Digital Twin layers	Declare the harmonizer as the LDT layer-3 pre-processing component, feed its UTC-normalized time series into demand forecasting and Climate Contract tracking

Recommendation back to OASC

Give every MIM a machine-readable companion. The MIM text stays as it is, written for people; alongside it sits a structured version, derived from the three Y.4505 parts each MIM already has, that an agent can consume directly.

Step 7, what's next

From prototype to production service for member cities.

Harden the pipeline for production

Standards-compliant NGSI-LD output, live model selection through the Smart Data Models MCP server, expert-in-the-loop interface for the stewards who confirm mappings.

Build the shared skill, "MIM-AI"

A versioned registry where a mapping confirmed in one city becomes reusable by every other. MIM compliance compounds across the network instead of restarting in each city.

Run a pilot with founding cities

A small group of member cities runs the harmonizer on real data, while the OASC Academy trains the first stewards, the AI Rangers, who operate the system and grow the shared knowledge library, starting with the MIM Champions.

Propose EnergyConsumptionObserved to SDM

Submit the canonical target back to the Smart Data Models community. 8 mandatory fields, 41 in all, designed from day one for AI-driven harmonization across heterogeneous city data sources.

CityData Harmonizer:
any source, any domain, one pipeline.
