OASC
UC1 · team LESL
MIMathon Porto 2026 · Use case 01

Classified trees of Porto,
one canonical model.

Team LESL: working notes on harmonizing the representation of classified urban trees of Porto, using Dolfin as a backend-independent pivot, and benchmarking against Smart Data Models.

Team: LESL
Dataset: 238 classified trees, Porto
Pivot language: Dolfin
Reference standards: Smart Data Models, INSPIRE

What the data actually contains.

Two source files, identical in semantic content, different in geometry. The CSV uses a projected reference system (likely ETRS89 / PT-TM06), the GeoJSON is in WGS84. We work from the GeoJSON.

238 classified trees
11 distinct species
8 attributes per tree
9 age ranges

Inconsistency found

One authority, three spellings.

The classif_tutela field carries three orthographic variants for what is a single legal authority (ICNF). 159 records say "ICNF", 78 say "ICNF (Instituto da Conservação da Natureza e das Florestas)", 1 reverses the order. A canonical model resolves this by lifting Authority to a first-class concept with stable identity.
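A minimal sketch of how the three spellings can be profiled and then collapsed to one entity. It assumes the raw classif_tutela values are available as a list of strings; canonical_authority and the sample list are illustrative names, and the reversed-order spelling is written in an assumed form since its exact string is not quoted in these notes.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Authority:
    name: str
    acronym: str


ICNF = Authority(
    name="Instituto da Conservação da Natureza e das Florestas",
    acronym="ICNF",
)


def canonical_authority(raw: str) -> Optional[Authority]:
    """Map any spelling that mentions the acronym to the single ICNF instance."""
    return ICNF if "ICNF" in raw else None


# Illustrative sample: one value per spelling variant (the third is an assumed form).
raw_values = [
    "ICNF",
    "ICNF (Instituto da Conservação da Natureza e das Florestas)",
    "Instituto da Conservação da Natureza e das Florestas (ICNF)",
]

spellings = Counter(raw_values)                               # profile the variants
authorities = {canonical_authority(v) for v in raw_values}    # resolve them

print(len(spellings), "spellings ->", len(authorities), "authority")
```

Because the dataclass is frozen, the instance is hashable and every record can share the same Authority object.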

Inconsistency found

Mixed kind and count in one field.

The classif_tipo field mixes two facts in free text: the kind of classification (isolated specimen vs tree cluster) and, for clusters, the specimen count, embedded as "Conjunto arbóreo (12 exemplares)". We split it into kind (enum) and specimenCount (int). One typo found along the way: "Conjunto arbórep".

Observation

Species couples are stable.

11 distinct species, each consistently paired with one common name (e.g. Metrosideros excelsa → Árvore-do-Fogo). We bind Species to the GBIF Backbone Taxonomy via a full URL in taxonRef, e.g. https://www.gbif.org/species/3185393 for Metrosideros excelsa. Resolvable, machine-readable, versionable.

Observation

Legal references are free text.

The classif_dec_lei_ref field has 10 distinct values covering 7 underlying acts, with formatting variants like "D.R. nº 6 II Série de 10/01/2005" vs "D. R. n.º 6 II Série de 10/01/2005". Modelled as a LegalAct concept with a normalized reference and an issued date.

Is there a Smart Data Model for trees?

We searched the official Smart Data Models registry for any model covering trees, plants, species or forestry inventories. Result, in one line:

Gap finding

No Smart Data Model exists for an individual tree. The closest domain, dataModel.ParksAndGardens, defines only Garden, FlowerBed and GreenspaceRecord. None of them carries species, age, legal classification or heritage status. The dataModel.Forestry repository contains only FireForestStatus.

What this means for UC1

SDM gap, documented · ParksAndGardens-compatible PR ready

What about Plant Ontology?

We checked the OBO Foundry Plant Ontology (PO). It models plant anatomy, morphology and developmental stages for genomics annotation. Out of scope for an urban heritage register: it would be the right reference if we were tagging tissues or growth phases, not classified municipal assets.

The relevant external references for our Species concept are taxonomic backbones: GBIF, NCBI Taxonomy, IPNI, and the Darwin Core terms dwc:scientificName / dwc:vernacularName.

The Dolfin pivot.

Dolfin is a human-friendly, backend-independent ontology language: the same definition runs against graph, relational or document stores. We use it as the canonical source of truth and derive serializations from it (JSON-LD for SDM, GeoJSON for mapping tools, INSPIRE for European datasets).

package <http://mimathon.askem.eu/uc1/trees>:
  dolfin_version "1"
  version "0.2.0"
  author "LESL (Lea, Eliott, Sattisvar & Louis), Dolfin & Askem"
  description "Canonical model for urban trees, MIMathon Porto 2026 use case 01"

concept AgeRange:
  one of:
    Y21_30
    Y31_40
    Y41_50
    Y51_60
    Y61_70
    Y71_80
    Y81_100
    Over100
    Unknown

concept ClassificationKind:
  one of:
    IsolatedSpecimen
    TreeCluster

concept Authority:
  has name: one string
  has acronym: one string

concept Species:
  has scientificName: one string
  has commonName: optional string
  has taxonRef: optional string

concept LegalAct:
  has reference: one string
  has issuedOn: optional string

concept Classification:
  has kind: one ClassificationKind
  has specimenCount: optional int
  has authority: one Authority
  has legalAct: one LegalAct
  has classifiedOn: optional string

concept Location:
  has latitude: one float
  has longitude: one float

concept Tree:
  has localId: one string
  has species: one Species
  has ageRange: optional AgeRange
  has location: one Location
  has classification: optional Classification
  has refFlowerBed: optional string
  has refGarden: optional string

Design choices

From Porto Open Data to the pivot.

Source attribute | Source example | Canonical attribute | Transformation
objectid | 1726 | Tree.localId | cast to string
especie | Magnolia grandiflora | Species.scientificName | trim, normalize
esp_nomecomum | Magnólia | Species.commonName | trim
(derived from especie) | Magnolia grandiflora | Species.taxonRef | resolve via GBIF species/match API, store full URL, e.g. https://www.gbif.org/species/9605163
arv_intervalo_idade | 81-100 | Tree.ageRange | map to enum, "sup 100" → Over100, empty → Unknown
classif_tutela | ICNF (Instituto da Conservação da Natureza e das Florestas) | Classification.authority | parse, dedupe, build Authority{ name, acronym }
classif_tipo | Conjunto arbóreo (12 exemplares) | Classification.kind, .specimenCount | regex split, "Conjunto arbóreo" → TreeCluster, "Exemplar isolado" / "Isolada" → IsolatedSpecimen, capture (N)
classif_dec_lei_ref | D.R. nº 6 II Série de 10/01/2005 | LegalAct.reference | normalize whitespace, unify "D.R." / "D. R."
classif_data | 2005-01-10 | Classification.classifiedOn | ISO 8601, no transform
geometry.coordinates | [-8.6024, 41.1458] | Location.longitude, .latitude | WGS84, [lon, lat] order from GeoJSON

One core, many datasets.

The script is split in two: a dataset-agnostic core (canonical model, GBIF resolver, JSON-LD writer), and one adapter per source dataset. Harmonizing another city's tree inventory means writing a new adapter file, nothing else.

Core, reusable

harmonize/

  • model.py, dataclasses mirroring the Dolfin definition
  • gbif.py, GBIF Backbone resolver with on-disk cache
  • jsonld.py, JSON-LD writer with Darwin Core and WGS84 vocab
  • __main__.py, CLI orchestrating adapter, GBIF, output

Adapters, per dataset

harmonize/adapters/

  • porto.py, GeoJSON, Porto Open Data conventions
  • (your dataset here), expose a read(path) generator yielding canonical Tree objects

No core change required to onboard a new city or schema.

CLI

python -m harmonize \
    --adapter porto \
    --input ../uc1-trees-porto.geojson \
    --output ../out/trees.jsonld \
    --base-id "http://mimathon.askem.eu/uc1/trees/" \
    --gbif-cache ../out/.gbif_cache.json

GBIF resolution, run on Porto

11 distinct species in the dataset, 10 resolved on the first pass:

GBIF EXACT  Magnolia grandiflora      ->  https://www.gbif.org/species/9605163
GBIF EXACT  Platanus x acerifolia     ->  https://www.gbif.org/species/3152815
GBIF FUZZY  Araucaria bidwilii        ->  https://www.gbif.org/species/2684918
GBIF EXACT  Washingtonia robusta      ->  https://www.gbif.org/species/5294595
GBIF EXACT  Liriodendron tulipifera   ->  https://www.gbif.org/species/3152861
GBIF EXACT  Metrosideros excelsa      ->  https://www.gbif.org/species/3185393
GBIF EXACT  Metrosideros robusta      ->  https://www.gbif.org/species/3185294
GBIF EXACT  Ginkgo biloba             ->  https://www.gbif.org/species/2687885
GBIF EXACT  Phoenix canariensis       ->  https://www.gbif.org/species/7445284
GBIF  MISS  Phoenix sp.
GBIF EXACT  Araucaria heterophylla    ->  https://www.gbif.org/species/2684969

Result: 237 of 238 trees carry a resolvable taxonRef. The single miss is Phoenix sp., a genus-only label with no species rank, correctly rejected by GBIF. The FUZZY match on Araucaria bidwilii reveals a typo in the source: GBIF prefers Araucaria bidwillii, with a double l.
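The typo report can be automated. The sketch below assumes each resolved species comes back as a record carrying matchType and canonicalName, two fields the GBIF species/match response provides; spelling_suggestion is a hypothetical helper, not part of the harmonizer.

```python
from typing import Optional


def spelling_suggestion(source_name: str, entry: Optional[dict]) -> Optional[str]:
    """Return GBIF's canonical name when a FUZZY match disagrees with the source."""
    if not entry or entry.get("matchType") != "FUZZY":
        return None
    canonical = entry.get("canonicalName")
    if canonical and canonical != source_name.strip():
        return canonical
    return None


# Record shaped like a resolver result; values taken from the run above.
entry = {
    "url": "https://www.gbif.org/species/2684918",
    "usageKey": 2684918,
    "canonicalName": "Araucaria bidwillii",
    "matchType": "FUZZY",
}

suggestion = spelling_suggestion("Araucaria bidwilii", entry)
if suggestion:
    print(f"possible typo: 'Araucaria bidwilii' -> '{suggestion}'")
```

EXACT matches and misses return None, so only names that GBIF had to correct end up in the report.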

Sample output, JSON-LD

{
  "@id": "http://mimathon.askem.eu/uc1/trees/1726",
  "@type": "Tree",
  "localId": "1726",
  "species": {
    "@type": "Species",
    "scientificName": "Magnolia grandiflora",
    "commonName": "Magnólia",
    "taxonRef": "https://www.gbif.org/species/9605163"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14581684711751,
    "longitude": -8.602456850857784
  },
  "ageRange": "Y81_100",
  "classification": {
    "@type": "Classification",
    "kind": "TreeCluster",
    "specimenCount": 12,
    "authority": {
      "@type": "Authority",
      "name": "Instituto da Conservação da Natureza e das Florestas",
      "acronym": "ICNF"
    },
    "legalAct": {
      "@type": "LegalAct",
      "reference": "D.R. nº 6 II Série de 10/01/2005"
    },
    "classifiedOn": "2005-01-10"
  }
}

Before and after, on the same record.

Same tree (objectid 1726, a Magnolia in Jardim do Palácio de Cristal), shown first raw from the Porto Open Data GeoJSON, then harmonized to the canonical model with GBIF binding.

JSON-LD is for the data layer (semantic, dereferenceable, queryable). For map tools (geojson.io, QGIS, Mapbox, Leaflet) use the canonical GeoJSON: same data, flattened into a FeatureCollection with dotted-key properties (e.g. species.taxonRef).
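For a consumer that wants the nested shape back from the flattened GeoJSON properties, the dotted keys can be unfolded again. This is a sketch under the assumption that no key is both a scalar and a prefix of another key, which holds for the canonical model; unflatten is an illustrative name.

```python
def unflatten(props: dict) -> dict:
    """Rebuild nested dicts from dotted keys, e.g. 'species.taxonRef'."""
    nested: dict = {}
    for key, value in props.items():
        parts = key.split(".")
        node = nested
        for part in parts[:-1]:
            # Create the intermediate level on first use, then descend into it.
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested


# Flattened properties as a GIS tool would see them (values from record 1726).
props = {
    "localId": "1726",
    "species.scientificName": "Magnolia grandiflora",
    "species.taxonRef": "https://www.gbif.org/species/9605163",
    "classification.kind": "TreeCluster",
    "classification.authority.acronym": "ICNF",
}

tree = unflatten(props)
print(tree["species"]["taxonRef"])
print(tree["classification"]["authority"]["acronym"])
```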

Source · GeoJSON feature · Porto Open Data
{
  "type": "Feature",
  "id": 1726,
  "geometry": {
    "type": "Point",
    "coordinates": [
      -8.602456850857784,
      41.14581684711751
    ]
  },
  "properties": {
    "objectid": 1726,
    "especie": "Magnolia grandiflora",
    "esp_nomecomum": "Magnólia",
    "arv_intervalo_idade": "81-100",
    "classif_tutela": "ICNF (Instituto da Conservação da Natureza e das Florestas)",
    "classif_tipo": "Conjunto arbóreo (12 exemplares)\r\n",
    "classif_dec_lei_ref": "D.R. nº 6 II Série de 10/01/2005",
    "classif_data": "2005-01-10"
  }
}

Canonical · JSON-LD node · http://mimathon.askem.eu/uc1/trees/
{
  "localId": "1726",
  "species": {
    "@type": "Species",
    "scientificName": "Magnolia grandiflora",
    "commonName": "Magnólia",
    "taxonRef": "https://www.gbif.org/species/9605163"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14581684711751,
    "longitude": -8.602456850857784
  },
  "ageRange": "Y81_100",
  "classification": {
    "kind": "TreeCluster",
    "authority": {
      "@type": "Authority",
      "name": "Instituto da Conservação da Natureza e das Florestas",
      "acronym": "ICNF"
    },
    "legalAct": {
      "@type": "LegalAct",
      "reference": "D.R. nº 6 II Série de 10/01/2005"
    },
    "specimenCount": 12,
    "classifiedOn": "2005-01-10",
    "@type": "Classification"
  },
  "@id": "http://mimathon.askem.eu/uc1/trees/1726",
  "@type": "Tree"
}

JSON-LD as a graph

Same record visualized as a property graph. Each typed fragment becomes a node, each property a labelled edge. The dashed link to GBIF is a live IRI: taxonRef resolves to a stable, dereferenceable taxon page on gbif.org.

Figure: JSON-LD instance graph for tree 1726, with nodes Tree, Species, Location, Classification, Authority, LegalAct and an external link to GBIF.
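The node-and-edge reading of a record can be reproduced with a short walk over the JSON-LD node. The sketch below is illustrative only: to_edges is a hypothetical helper, nested fragments get a label derived from the parent id because they carry no @id of their own, and lists are ignored since the sample record has none.

```python
from typing import Iterator, Tuple


def to_edges(node: dict, node_id: str) -> Iterator[Tuple[str, str, object]]:
    """Yield (subject, property, object) edges for one JSON-LD node."""
    for key, value in node.items():
        if key == "@id":
            continue
        if isinstance(value, dict):
            # A nested fragment becomes its own node, linked by a labelled edge.
            child_id = value.get("@id", f"{node_id}/{key}")
            yield (node_id, key, child_id)
            yield from to_edges(value, child_id)
        else:
            yield (node_id, key, value)


tree = {
    "@id": "http://mimathon.askem.eu/uc1/trees/1726",
    "@type": "Tree",
    "species": {
        "@type": "Species",
        "scientificName": "Magnolia grandiflora",
        "taxonRef": "https://www.gbif.org/species/9605163",
    },
    "classification": {
        "@type": "Classification",
        "kind": "TreeCluster",
        "authority": {"@type": "Authority", "acronym": "ICNF"},
    },
}

edges = list(to_edges(tree, tree["@id"]))
for subject, prop, obj in edges:
    print(subject, prop, obj, sep="  |  ")
```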

Species in the dataset

11 distinct scientific names, 10 with a stable GBIF taxonRef (the 1 unmatched is a genus-only label, Phoenix sp.).

Scientific name | Common name (PT) | GBIF taxonRef | Trees
Metrosideros excelsa | Árvore-do-Fogo | 3185393 | 85
Phoenix canariensis | Palmeira-das-Canárias | 7445284 | 46
Platanus x acerifolia | Plataneiro | 3152815 | 37
Araucaria heterophylla | Araucária-de-Norfolk | 2684969 | 27
Magnolia grandiflora | Magnólia | 9605163 | 25
Washingtonia robusta | Palmeira-Washingtonia | 5294595 | 7
Liriodendron tulipifera | Tulipeiro-da-Virgínia | 3152861 | 7
Araucaria bidwilii | Araucária-da-Queenslândia | 2684918 | 1
Metrosideros robusta | Metrosidero | 3185294 | 1
Ginkgo biloba | Ginkgo | 2687885 | 1
Phoenix sp. | (none) | unmatched | 1
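As a quick consistency check, the coverage figures can be recomputed from the per-species counts. The dictionary below is copied by hand from the species table in this section; the variable names are ours.

```python
# (GBIF key or None, tree count) per species, copied from the species table.
species = {
    "Metrosideros excelsa": (3185393, 85),
    "Phoenix canariensis": (7445284, 46),
    "Platanus x acerifolia": (3152815, 37),
    "Araucaria heterophylla": (2684969, 27),
    "Magnolia grandiflora": (9605163, 25),
    "Washingtonia robusta": (5294595, 7),
    "Liriodendron tulipifera": (3152861, 7),
    "Araucaria bidwilii": (2684918, 1),
    "Metrosideros robusta": (3185294, 1),
    "Ginkgo biloba": (2687885, 1),
    "Phoenix sp.": (None, 1),  # genus-only label, no GBIF match
}

total_trees = sum(count for _, count in species.values())
resolved_species = sum(1 for key, _ in species.values() if key is not None)
resolved_trees = sum(count for key, count in species.values() if key is not None)

print(f"species: {resolved_species}/{len(species)}")
print(f"trees:   {resolved_trees}/{total_trees} ({resolved_trees / total_trees:.1%})")
```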

Legal acts in the dataset

10 distinct references after the per-record normalization (collapse of D. R. to D.R.).

Harmonization headroom

Four of these ten references point to the same act of 10 January 2005, with cosmetic variants ("D.R. nº 6", "D.R. n.º 6", "D.R. n.º6", "D.R. nº 6 ... 2005-01-10"), covering 214 of the 238 trees. A second pass, deduplicating LegalAct by issued date, would collapse the count from 10 to 7.

Reference | classifiedOn | Trees
D.R. nº 6 II Série de 10/01/2005 | 2005-01-10, 2005-10-01 | 210
D.R. nº 29 2ª Série Parte C de 11/02/2021 | 2021-02-11 | 12
D.R. n.º 60/ 2019, Série II de 2019-03-26 | 2019-03-26 | 7
D.R. n.º 99/2021, Série II de 2021/05/21 | 2021-01-07 | 2
D.R. n.º 6 II Série de 10/01/2005 | 2005-01-10, 2008-01-10 | 2
D.G. nº 204 II Série de 01/09/1950 | 1950-09-01 | 1
D.R. n.º 53 / 2019, Série II de 2019-03-15 | 2019-03-15 | 1
D.G. nº 280 II Série de 02/12/1939 | 1939-12-02 | 1
D.R. n.º6 II Série de 10/01/2005 | 2005-01-10 | 1
D.R. nº 6 II Série de 2005-01-10 | 2005-01-10 | 1
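One possible shape for the second pass is a dedupe key built from the gazette, the issue number and the issue date parsed out of each reference. This is a sketch, not the harmonizer's code: act_key and its regular expressions are assumptions, and slash-separated dates are read day-first (dd/mm/yyyy), which matches the references listed here.

```python
import re
from collections import defaultdict
from typing import Optional, Tuple

_GAZETTE = re.compile(r"^D\.\s*([RG])\.\s*n\.?\s*[º°]\s*(\d+)", re.IGNORECASE)
_DATE_DMY = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")
_DATE_YMD = re.compile(r"\b(\d{4})[-/](\d{2})[-/](\d{2})\b")


def act_key(reference: str) -> Optional[Tuple[str, int, str]]:
    """Return (gazette, issue number, ISO date), or None if unparseable."""
    gazette = _GAZETTE.search(reference.strip())
    if not gazette:
        return None
    dmy = _DATE_DMY.search(reference)
    ymd = _DATE_YMD.search(reference)
    if dmy:
        day, month, year = dmy.groups()
    elif ymd:
        year, month, day = ymd.groups()
    else:
        return None
    name = f"D.{gazette.group(1).upper()}."
    return (name, int(gazette.group(2)), f"{year}-{month}-{day}")


references = [
    "D.R. nº 6 II Série de 10/01/2005",
    "D.R. nº 29 2ª Série Parte C de 11/02/2021",
    "D.R. n.º 60/ 2019, Série II de 2019-03-26",
    "D.R. n.º 99/2021, Série II de 2021/05/21",
    "D.R. n.º 6 II Série de 10/01/2005",
    "D.G. nº 204 II Série de 01/09/1950",
    "D.R. n.º 53 / 2019, Série II de 2019-03-15",
    "D.G. nº 280 II Série de 02/12/1939",
    "D.R. n.º6 II Série de 10/01/2005",
    "D.R. nº 6 II Série de 2005-01-10",
]

# Group the spelling variants under their shared key.
groups = defaultdict(list)
for ref in references:
    groups[act_key(ref)].append(ref)

for key, variants in groups.items():
    print(key, "<-", len(variants), "variant(s)")
```

The key is deliberately conservative: a reference that does not parse yields None and can be routed to manual review rather than merged by accident.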

What changed

Read it, run it, fork it.

Full source of the harmonizer, hosted alongside this page. Each file is also a one-click download as raw .py. The whole package is bundled as a tarball below.

Add a new dataset

The harmonizer is split in three layers so a new dataset means writing only the bottom layer, the adapter.

Workflow: copy adapters/_template.py to adapters/<your_name>.py, fill in the read function, run python -m harmonize --adapter <your_name> .... Look at adapters/porto.py for a complete worked example.

model.py, canonical Tree model, dataclasses mirroring trees.dolfin
"""Canonical urban tree model, mirrors trees.dolfin v0.2.0."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


AGE_RANGES = {
    "21-30": "Y21_30",
    "31-40": "Y31_40",
    "41-50": "Y41_50",
    "51-60": "Y51_60",
    "61-70": "Y61_70",
    "71-80": "Y71_80",
    "81-100": "Y81_100",
    "sup 100": "Over100",
    "": "Unknown",
    None: "Unknown",
}


@dataclass(frozen=True)
class Authority:
    name: str
    acronym: str


@dataclass(frozen=True)
class Species:
    scientificName: str
    commonName: Optional[str] = None
    taxonRef: Optional[str] = None


@dataclass(frozen=True)
class LegalAct:
    reference: str
    issuedOn: Optional[str] = None


@dataclass
class Classification:
    kind: str
    authority: Authority
    legalAct: LegalAct
    specimenCount: Optional[int] = None
    classifiedOn: Optional[str] = None


@dataclass
class Location:
    latitude: float
    longitude: float


@dataclass
class Tree:
    localId: str
    species: Species
    location: Location
    ageRange: Optional[str] = None
    classification: Optional[Classification] = None
    refFlowerBed: Optional[str] = None
    refGarden: Optional[str] = None


def normalize_age(raw: Optional[str]) -> str:
    if raw is None:
        return "Unknown"
    return AGE_RANGES.get(raw.strip(), "Unknown")

gbif.py, GBIF Backbone resolver with on-disk cache, stdlib only
"""GBIF Backbone Taxonomy resolver, with on-disk cache.

Uses the public species/match endpoint, which fuzzy matches a scientific name
against the GBIF backbone and returns a stable usageKey. We turn that into
a canonical, resolvable URL: https://www.gbif.org/species/{usageKey}.

No external dependency, stdlib only.
"""
from __future__ import annotations
import json
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Optional


GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"
GBIF_SPECIES_PAGE = "https://www.gbif.org/species/{key}"


class GbifResolver:
    """Resolve scientific names to GBIF species URLs, cached to disk."""

    def __init__(self, cache_path: Path, timeout: float = 10.0):
        self.cache_path = Path(cache_path)
        self.timeout = timeout
        self._cache: dict[str, dict] = {}
        if self.cache_path.exists():
            self._cache = json.loads(self.cache_path.read_text(encoding="utf-8"))

    def resolve(self, scientific_name: str) -> Optional[dict]:
        """Return {url, usageKey, canonicalName, matchType, rank} or None on failure.

        Result is cached, including misses (stored as {"miss": true}) so a
        re-run does not pound the API for unmatched names.
        """
        key = scientific_name.strip()
        if not key:
            return None
        if key in self._cache:
            entry = self._cache[key]
            return None if entry.get("miss") else entry

        params = urllib.parse.urlencode({"name": key, "verbose": "false"})
        url = f"{GBIF_MATCH_URL}?{params}"
        try:
            with urllib.request.urlopen(url, timeout=self.timeout) as r:
                payload = json.loads(r.read().decode("utf-8"))
        except Exception as e:
            print(f"  ! GBIF lookup failed for {key!r}: {e}")
            return None

        usage_key = payload.get("usageKey")
        if not usage_key or payload.get("matchType") == "NONE":
            self._cache[key] = {"miss": True}
            self._flush()
            return None

        entry = {
            "url": GBIF_SPECIES_PAGE.format(key=usage_key),
            "usageKey": usage_key,
            "canonicalName": payload.get("canonicalName") or payload.get("scientificName"),
            "matchType": payload.get("matchType"),
            "rank": payload.get("rank"),
        }
        self._cache[key] = entry
        self._flush()
        return entry

    def _flush(self) -> None:
        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
        self.cache_path.write_text(
            json.dumps(self._cache, ensure_ascii=False, indent=2, sort_keys=True),
            encoding="utf-8",
        )

jsonld.py, JSON-LD writer with Darwin Core and WGS84 vocabulary
"""JSON-LD serializer for the canonical Tree model.

Produces a single document with a @graph of all trees. Inline objects
for Species, Authority, LegalAct, Classification and Location, with
type tags so each fragment is independently identifiable.
"""
from __future__ import annotations
from dataclasses import asdict
from typing import Iterable

from .model import Tree


NS = "http://mimathon.askem.eu/uc1/trees#"

CONTEXT = {
    "@vocab": NS,
    "Tree": NS + "Tree",
    "Species": NS + "Species",
    "Authority": NS + "Authority",
    "LegalAct": NS + "LegalAct",
    "Classification": NS + "Classification",
    "Location": NS + "Location",
    "taxonRef": {"@id": NS + "taxonRef", "@type": "@id"},
    "refFlowerBed": {"@id": NS + "refFlowerBed", "@type": "@id"},
    "refGarden": {"@id": NS + "refGarden", "@type": "@id"},
    # W3C Basic Geo namespace: the IRI uses the http scheme, not https.
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "latitude": "geo:lat",
    "longitude": "geo:long",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "scientificName": "dwc:scientificName",
    "commonName": "dwc:vernacularName",
}


def _strip_none(d):
    if isinstance(d, dict):
        return {k: _strip_none(v) for k, v in d.items() if v is not None}
    if isinstance(d, list):
        return [_strip_none(x) for x in d]
    return d


def tree_to_node(tree: Tree, base_id: str) -> dict:
    d = asdict(tree)
    d["@id"] = f"{base_id}{tree.localId}"
    d["@type"] = "Tree"
    d["species"] = {"@type": "Species", **asdict(tree.species)}
    d["location"] = {"@type": "Location", **asdict(tree.location)}
    if tree.classification is not None:
        c = asdict(tree.classification)
        c["@type"] = "Classification"
        c["authority"] = {"@type": "Authority", **asdict(tree.classification.authority)}
        c["legalAct"] = {"@type": "LegalAct", **asdict(tree.classification.legalAct)}
        d["classification"] = c
    return _strip_none(d)


def build_document(trees: Iterable[Tree], base_id: str) -> dict:
    return {
        "@context": CONTEXT,
        "@graph": [tree_to_node(t, base_id) for t in trees],
    }

geojson_out.py, GeoJSON FeatureCollection writer for GIS tools (geojson.io, QGIS, Leaflet)
"""GeoJSON serializer for the canonical Tree model.

Emits a FeatureCollection in WGS84, suitable for any GIS tool
(geojson.io, QGIS, Mapbox, Leaflet). Nested entities are flattened
to a dotted-key namespace inside `properties` so the file is also
useful for non-LD-aware consumers.
"""
from __future__ import annotations
from dataclasses import asdict
from typing import Iterable

from .model import Tree


def _flatten(prefix: str, value, target: dict) -> None:
    if value is None:
        return
    if isinstance(value, dict):
        for k, v in value.items():
            _flatten(f"{prefix}.{k}" if prefix else k, v, target)
    else:
        target[prefix] = value


def tree_to_feature(tree: Tree, base_id: str) -> dict:
    props: dict = {"@id": f"{base_id}{tree.localId}", "@type": "Tree"}
    _flatten("localId", tree.localId, props)
    _flatten("ageRange", tree.ageRange, props)
    _flatten("species", asdict(tree.species), props)
    if tree.classification is not None:
        _flatten("classification", asdict(tree.classification), props)
    if tree.refFlowerBed is not None:
        props["refFlowerBed"] = tree.refFlowerBed
    if tree.refGarden is not None:
        props["refGarden"] = tree.refGarden

    return {
        "type": "Feature",
        "id": tree.localId,
        "geometry": {
            "type": "Point",
            "coordinates": [tree.location.longitude, tree.location.latitude],
        },
        "properties": props,
    }


def build_collection(trees: Iterable[Tree], base_id: str) -> dict:
    return {
        "type": "FeatureCollection",
        "features": [tree_to_feature(t, base_id) for t in trees],
    }

__main__.py, CLI orchestrating adapter, GBIF resolution, JSON-LD output
r"""CLI entry point.

Example:
    python -m harmonize \
        --adapter porto \
        --input ../uc1-trees-porto.geojson \
        --output ../out/trees.jsonld \
        --base-id http://mimathon.askem.eu/uc1/trees/ \
        --gbif-cache ../out/.gbif_cache.json
"""
from __future__ import annotations
import argparse
import importlib
import json
import sys
from dataclasses import replace
from pathlib import Path

from .gbif import GbifResolver
from .jsonld import build_document
from .model import Species, Tree


def _load_adapter(name: str):
    mod = importlib.import_module(f"harmonize.adapters.{name}")
    if not hasattr(mod, "read"):
        raise SystemExit(f"adapter {name!r} has no read(path) function")
    return mod


def _enrich_with_gbif(trees, resolver: GbifResolver):
    """Replace each tree.species with a copy carrying taxonRef when resolved."""
    seen: dict[str, Species] = {}
    for t in trees:
        sci = t.species.scientificName
        if sci in seen:
            yield replace(t, species=seen[sci])
            continue
        match = resolver.resolve(sci)
        if match:
            enriched = replace(t.species, taxonRef=match["url"])
            print(f"  GBIF {match['matchType']:>5}  {sci}  ->  {match['url']}")
        else:
            enriched = t.species
            print(f"  GBIF  MISS  {sci}")
        seen[sci] = enriched
        yield replace(t, species=enriched)


def main(argv=None) -> int:
    p = argparse.ArgumentParser(prog="harmonize", description="Harmonize an urban tree dataset to the canonical Tree model and emit JSON-LD.")
    p.add_argument("--adapter", required=True, help="Adapter module name under harmonize.adapters, e.g. porto")
    p.add_argument("--input", required=True, type=Path, help="Source dataset path")
    p.add_argument("--output", required=True, type=Path, help="Destination JSON-LD file")
    p.add_argument("--base-id", default="http://example.org/trees/", help="IRI prefix for tree @id values")
    p.add_argument("--gbif-cache", type=Path, default=Path(".gbif_cache.json"), help="On-disk GBIF lookup cache")
    p.add_argument("--no-gbif", action="store_true", help="Skip GBIF resolution (faster, offline)")
    args = p.parse_args(argv)

    adapter = _load_adapter(args.adapter)
    print(f"Reading via adapter '{args.adapter}' from {args.input}...")
    trees = list(adapter.read(args.input))
    print(f"  {len(trees)} trees read")

    if not args.no_gbif:
        print(f"Resolving species against GBIF (cache: {args.gbif_cache})...")
        resolver = GbifResolver(args.gbif_cache)
        trees = list(_enrich_with_gbif(trees, resolver))

    print(f"Writing JSON-LD to {args.output}...")
    doc = build_document(trees, base_id=args.base_id)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"  done, {len(doc['@graph'])} entities in @graph")
    return 0


if __name__ == "__main__":
    sys.exit(main())

transforms.py, reusable text helpers shared across adapters
"""Reusable text transforms shared across adapters.

Adapters compose these helpers rather than reimplementing them. Helpers
are intentionally minimal: they only do generic text work (cleanup,
regex extraction, keyword routing). Anything dataset-specific belongs
in the adapter itself.
"""
from __future__ import annotations
import re
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Trim, collapse internal whitespace, return None for empty input."""
    if value is None:
        return None
    txt = re.sub(r"\s+", " ", str(value)).strip()
    return txt or None


def extract_count(value: Optional[str], pattern: str = r"\((\d+)") -> Optional[int]:
    """Pull an integer out of free text, e.g. '... (12 exemplares)' -> 12."""
    if value is None:
        return None
    m = re.search(pattern, value)
    return int(m.group(1)) if m else None


def match_keywords(value: Optional[str], keyword_map: dict[str, str]) -> Optional[str]:
    """Return the first enum value whose regex key matches the input.

    keyword_map: {regex_pattern: enum_value}, e.g.
        {r"conjunto\\s+arb[óo]re[op]": "TreeCluster",
         r"isolad": "IsolatedSpecimen"}
    Patterns are evaluated in insertion order, case-insensitive.
    """
    if not value:
        return None
    for pattern, enum_value in keyword_map.items():
        if re.search(pattern, value, re.IGNORECASE):
            return enum_value
    return None


class Registry:
    """Tiny dedupe registry for value-typed entities like Authority.

    Use when source data has many spelling variants of the same entity:
        reg = Registry({"ICNF": Authority(name="...", acronym="ICNF")})
        a = reg.resolve("ICNF (Instituto da Conservação ...)", needle="ICNF")
    The canonical instance is returned, ensuring downstream graphs share
    one node per real-world entity.
    """

    def __init__(self, known: dict | None = None):
        self._known = dict(known or {})

    def resolve(self, raw, needle: str | None = None, default=None):
        if raw is None:
            return default
        text = str(raw)
        if needle is not None and needle in text and needle in self._known:
            return self._known[needle]
        for key, val in self._known.items():
            if key in text:
                return val
        return default

    def get(self, key: str):
        return self._known.get(key)

adapters/_template.py, skeleton, copy and rename to add a new dataset
"""Skeleton adapter, copy and rename to add a new dataset.

Quick start:
    1. Copy this file to harmonize/adapters/<your_dataset>.py
    2. Replace the `read` function body with your own parsing
    3. Run:  python -m harmonize --adapter <your_dataset> --input ... --output ...

Contract:
    Expose a single function `read(path) -> Iterator[Tree]`.
    The CLI (harmonize/__main__.py) takes care of GBIF resolution and
    JSON-LD output, so the adapter only has to map source records to
    canonical Tree instances.

See harmonize/adapters/porto.py for a complete worked example covering
field renaming, enum normalization, kind/count splitting from free
text, and authority dedupe.
"""
from __future__ import annotations
from pathlib import Path
from typing import Iterator

from ..model import (
    AGE_RANGES,
    Authority, Classification, LegalAct, Location, Species, Tree,
    normalize_age,
)
from ..transforms import (
    Registry, clean_text, extract_count, match_keywords,
)


# Optional: pre-populate canonical entities you'll reuse, then dedupe via Registry.
_AUTHORITIES = Registry({
    # "ICNF": Authority(name="Instituto da Conservação ...", acronym="ICNF"),
})


# Optional: keyword-driven enum routing for messy free-text fields.
_KIND_KEYWORDS = {
    # r"conjunto\s+arb[óo]re[op]": "TreeCluster",
    # r"isolad|exemplar\s+isolado": "IsolatedSpecimen",
}


def read(path: str | Path) -> Iterator[Tree]:
    """Yield canonical Tree records from `path`.

    Replace the body below with your dataset's parsing logic. The
    important parts:

      - return Tree(localId=..., species=..., location=..., ...)
      - all string-valued attributes should already be cleaned
      - taxonRef is left for the CLI to fill via GBIF
      - if classification info is missing, set classification=None
    """
    # Example for a CSV: read it and iterate rows
    # import csv
    # with Path(path).open(encoding="utf-8") as f:
    #     for row in csv.DictReader(f):
    #         yield Tree(
    #             localId=row["id"],
    #             species=Species(
    #                 scientificName=clean_text(row["scientific_name"]) or "",
    #                 commonName=clean_text(row.get("common_name")),
    #             ),
    #             location=Location(
    #                 latitude=float(row["lat"]),
    #                 longitude=float(row["lon"]),
    #             ),
    #             ageRange=normalize_age(row.get("age_class")),
    #             classification=None,
    #         )
    raise NotImplementedError("Implement read() for your dataset, see porto.py")

adapters/porto.py, adapter for Porto Open Data classified-trees GeoJSON
"""Adapter for the Porto Open Data classified-trees GeoJSON.

Source schema (per feature.properties):
    objectid              -> Tree.localId
    especie               -> Species.scientificName
    esp_nomecomum         -> Species.commonName
    arv_intervalo_idade   -> Tree.ageRange
    classif_tutela        -> Classification.authority (parse)
    classif_tipo          -> Classification.kind + .specimenCount (parse)
    classif_dec_lei_ref   -> Classification.legalAct.reference (normalize)
    classif_data          -> Classification.classifiedOn

Geometry: Point in WGS84, [lon, lat] order.
"""
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Iterator

from ..model import (
    Authority, Classification, LegalAct, Location, Species, Tree, normalize_age,
)


SOURCE_ID = "porto-classified-trees"


_AUTHORITY_KNOWN = {
    "ICNF": Authority(
        name="Instituto da Conservação da Natureza e das Florestas",
        acronym="ICNF",
    ),
}


def _parse_authority(raw: str | None) -> Authority | None:
    if not raw:
        return None
    txt = raw.strip()
    if "ICNF" in txt:
        return _AUTHORITY_KNOWN["ICNF"]
    return Authority(name=txt, acronym=txt[:8])


_KIND_CLUSTER = re.compile(r"conjunto\s+arb[óo]re[op]", re.IGNORECASE)
_KIND_ISOLATED = re.compile(r"(exemplar\s+isolado|isolada|árvore\s+isolada)", re.IGNORECASE)
_COUNT = re.compile(r"\((\d+)\s*exemplares?\)", re.IGNORECASE)


def _parse_kind_and_count(raw: str | None) -> tuple[str | None, int | None]:
    if not raw:
        return None, None
    txt = raw.strip()
    count = None
    m = _COUNT.search(txt)
    if m:
        count = int(m.group(1))
    if _KIND_CLUSTER.search(txt):
        return "TreeCluster", count
    if _KIND_ISOLATED.search(txt):
        return "IsolatedSpecimen", count
    return None, count


def _normalize_legal_ref(raw: str | None) -> str | None:
    if not raw:
        return None
    txt = raw.strip()
    txt = re.sub(r"D\.\s*R\.", "D.R.", txt)
    txt = re.sub(r"D\.\s*G\.", "D.G.", txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt


def _build_classification(props: dict) -> Classification | None:
    authority = _parse_authority(props.get("classif_tutela"))
    kind, count = _parse_kind_and_count(props.get("classif_tipo"))
    legal_ref = _normalize_legal_ref(props.get("classif_dec_lei_ref"))
    classified_on = (props.get("classif_data") or "").strip() or None

    if not (authority or kind or legal_ref):
        return None

    return Classification(
        kind=kind or "IsolatedSpecimen",
        authority=authority or Authority(name="Unknown", acronym="UNK"),
        legalAct=LegalAct(reference=legal_ref or "(unspecified)"),
        specimenCount=count,
        classifiedOn=classified_on,
    )


def read(geojson_path: str | Path) -> Iterator[Tree]:
    """Yield canonical Tree records from a Porto GeoJSON file.

    GBIF resolution is left to the caller; species are emitted with
    scientificName / commonName only, taxonRef is set later.
    """
    data = json.loads(Path(geojson_path).read_text(encoding="utf-8"))
    for feat in data.get("features", []):
        props = feat.get("properties") or {}
        geom = feat.get("geometry") or {}
        coords = geom.get("coordinates") or [None, None]
        if len(coords) < 2 or coords[0] is None:
            continue

        species = Species(
            scientificName=(props.get("especie") or "").strip(),
            commonName=((props.get("esp_nomecomum") or "").strip() or None),
        )
        if not species.scientificName:
            continue

        yield Tree(
            localId=str(props.get("objectid") or feat.get("id") or ""),
            species=species,
            location=Location(latitude=float(coords[1]), longitude=float(coords[0])),
            ageRange=normalize_age(props.get("arv_intervalo_idade")),
            classification=_build_classification(props),
        )

Remaining work.

Lower the bar for new adapters

Today an adapter is a small Python file. Two complementary ideas to push further: a declarative YAML mapping covering the easy cases (just field renames and enum lookups), with a generic config adapter that consumes it, and a harmonize new-adapter --name <city> command that scaffolds a starter file from _template.py with a fixture and a quick smoke test. Goal: bring a new dataset online in under 30 minutes, no Python required for the simple cases.
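To make the declarative idea concrete, here is one way a generic config adapter could consume such a mapping. Everything in it is hypothetical: the mapping layout, the key names fields and enums, and the apply_mapping function are our assumptions. The mapping is shown as the dict a YAML file would load to, since the standard library has no YAML parser, and the output is a flat dict keyed by canonical attribute path rather than a Tree instance, to keep the sketch self-contained.

```python
from typing import Iterable, Iterator

# What a YAML mapping file might load to (hypothetical layout).
MAPPING = {
    "fields": {  # plain renames: source attribute -> canonical attribute
        "objectid": "localId",
        "especie": "species.scientificName",
        "esp_nomecomum": "species.commonName",
    },
    "enums": {  # lookups: source attribute -> target, value table, default
        "arv_intervalo_idade": {
            "target": "ageRange",
            "values": {"81-100": "Y81_100", "sup 100": "Over100"},
            "default": "Unknown",
        },
    },
}


def apply_mapping(records: Iterable[dict], mapping: dict) -> Iterator[dict]:
    """Yield one flat canonical dict per source record, driven only by `mapping`."""
    for rec in records:
        out = {}
        for source, target in mapping.get("fields", {}).items():
            value = rec.get(source)
            if value is not None:
                out[target] = str(value).strip()
        for source, rule in mapping.get("enums", {}).items():
            raw = str(rec.get(source) or "").strip()
            out[rule["target"]] = rule["values"].get(raw, rule["default"])
        yield out


row = {
    "objectid": 1726,
    "especie": "Magnolia grandiflora ",
    "esp_nomecomum": "Magnólia",
    "arv_intervalo_idade": "81-100",
}
result = next(apply_mapping([row], MAPPING))
print(result)
```

Free-text fields such as classif_tipo would still need a Python adapter, which is why the two ideas are complementary.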

Draft a Smart Data Model "Tree" proposal

Translate the Dolfin canonical model into the SDM template (schema.json, examples, README), targeting dataModel.ParksAndGardens. Bring the gap finding to the SDM community as a candidate PR.

Onboard a second dataset

Pick another city's tree inventory and write a second adapter. Confirms the core is truly dataset-agnostic and lets us refine the canonical model on a second worked example.

Anchor in the existing ecosystem

Wire refFlowerBed and refGarden to actual SDM instances when available, especially for Conjunto arbóreo records that semantically belong to a planted area.