UC1 — Classified Trees of Porto · team LESL

Step 1, dataset audit

What the data actually contains.

Two source files, identical in semantic content, different in geometry. The CSV uses a projected reference system (likely ETRS89 / PT-TM06), the GeoJSON is in WGS84. We work from the GeoJSON.
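A quick way to tell which file carries which reference system is to look at coordinate magnitudes. The sketch below is a heuristic, not part of the harmonizer: looks_like_wgs84 is a hypothetical helper, and the projected sample values are illustrative rather than read from the CSV.

```python
def looks_like_wgs84(x: float, y: float) -> bool:
    """Heuristic: WGS84 longitude/latitude are degrees within fixed bounds.

    Projected systems such as PT-TM06 use metres, so values usually fall
    far outside these bounds. A small projected offset could still pass,
    so treat a True result as "plausible", not as proof.
    """
    return -180.0 <= x <= 180.0 and -90.0 <= y <= 90.0


# GeoJSON point from the Porto dataset, [lon, lat] order.
print(looks_like_wgs84(-8.6024, 41.1458))      # True

# Illustrative projected coordinates in metres (hypothetical values).
print(looks_like_wgs84(-41250.0, 164800.0))    # False
```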

238 classified trees

distinct species

attributes per tree

age ranges

Inconsistency found

One authority, three spellings.

The classif_tutela field carries three orthographic variants for what is a single legal authority (ICNF): 159 records say "ICNF", 78 say "ICNF (Instituto da Conservação da Natureza e das Florestas)", and 1 reverses the order. A canonical model resolves this by lifting Authority to a first-class concept with a stable identity.
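This kind of audit can be reproduced with a simple frequency count. The sketch below is illustrative only: count_variants is a hypothetical helper, and the three inline features are a miniature stand-in for the real GeoJSON, not the actual records.

```python
from collections import Counter

# Hypothetical miniature of the GeoJSON: only the field under audit is kept.
features = [
    {"properties": {"classif_tutela": "ICNF"}},
    {"properties": {"classif_tutela": "ICNF "}},
    {"properties": {"classif_tutela": "ICNF (Instituto da Conservação da Natureza e das Florestas)"}},
]


def count_variants(features: list[dict], field: str) -> Counter:
    """Count each distinct raw value of `field`, after trimming whitespace."""
    values = (f["properties"].get(field) or "" for f in features)
    return Counter(v.strip() for v in values)


variants = count_variants(features, "classif_tutela")
for value, n in variants.most_common():
    print(f"{n:>4}  {value}")
```

Running the same count on the full file is what surfaces figures such as the 159 / 78 / 1 split reported above.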

Inconsistency found

Mixed kind and count in one field.

The classif_tipo field mixes two facts in free text: the kind of classification (isolated specimen vs tree cluster) and, for clusters, the specimen count, embedded as "Conjunto arbóreo (12 exemplares)". We split into kind (enum) and specimenCount (int). One typo found along the way: "Conjunto arbórep".

Observation

Species pairs are stable.

11 distinct species, each consistently paired with one common name (e.g. Metrosideros excelsa ↔ Árvore-do-Fogo). We bind Species to the GBIF Backbone Taxonomy via a full URL in taxonRef, e.g. https://www.gbif.org/species/3185393 for Metrosideros excelsa. Resolvable, machine-readable, versionable.

Observation

Legal references are free text.

The classif_dec_lei_ref field has 10 distinct values covering 7 underlying acts, with formatting variants like "D.R. nº 6 II Série de 10/01/2005" vs "D. R. n.º 6 II Série de 10/01/2005". Modelled as a LegalAct concept with a normalized reference and an issued date.

Step 2, standards benchmark

Is there a Smart Data Model for trees?

We searched the official Smart Data Models registry for any model covering trees, plants, species or forestry inventories. Result, in one line:

Gap finding

No Smart Data Model exists for an individual tree. The closest domain, dataModel.ParksAndGardens, defines only Garden, FlowerBed and GreenspaceRecord. None of them carries species, age, legal classification or heritage status. The dataModel.Forestry repository contains only FireForestStatus.

What this means for UC1

The brief explicitly invites teams to identify gaps in SDM and extend entities when necessary
UC1 is therefore a candidate for a new SDM contribution, e.g. a Tree entity in dataModel.ParksAndGardens
We adopt SDM conventions (type, id, refXxx for relationships) so the canonical model maps cleanly to a future SDM submission
Our Tree can reference an existing FlowerBed or Garden to anchor it in the existing ecosystem (useful for the Conjunto arbóreo case)

SDM gap documented · ParksAndGardens compatible · PR ready
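To make the convention concrete, here is a minimal sketch of what "type, id, refXxx" could look like for one tree. It is not an SDM schema: to_sdm_style is a hypothetical helper, the urn:ngsi-ld id pattern is an assumption borrowed from common NGSI-LD practice, and the Garden identifier is invented for illustration.

```python
def to_sdm_style(tree: dict) -> dict:
    """Map a canonical tree dict onto SDM-style keys: id, type, refXxx.

    Assumption: the "urn:ngsi-ld:<Type>:<localId>" id pattern; a real
    submission would follow whatever the reviewed schema prescribes.
    """
    entity = {
        "id": f"urn:ngsi-ld:Tree:{tree['localId']}",
        "type": "Tree",
    }
    # Relationships keep the refXxx prefix and are emitted only when set.
    for ref in ("refGarden", "refFlowerBed"):
        if tree.get(ref):
            entity[ref] = tree[ref]
    return entity


# The Garden id below is hypothetical.
example = to_sdm_style({"localId": "1726", "refGarden": "urn:ngsi-ld:Garden:example"})
print(example)
```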

Step 3, sanity check on adjacent ontologies

What about Plant Ontology?

We checked the OBO Foundry Plant Ontology (PO). It models plant anatomy, morphology and developmental stages for genomics annotation. Out of scope for an urban heritage register: it would be the right reference if we were tagging tissues or growth phases, not classified municipal assets.

The relevant external references for our Species concept are taxonomic backbones: GBIF, NCBI Taxonomy, IPNI, and the Darwin Core terms dwc:scientificName / dwc:vernacularName.

Step 4, canonical model

The Dolfin pivot.

Dolfin is a human-friendly, backend-independent ontology language: the same definition runs against graph, relational or document stores. We use it as the canonical source of truth, and derive serializations from it (JSON-LD for SDM, GeoJSON for mapping tools, INSPIRE for European datasets).

package <http://mimathon.askem.eu/uc1/trees>:
  dolfin_version "1"
  version "0.2.0"
  author "LESL (Lea, Eliott, Sattisvar & Louis), Dolfin & Askem"
  description "Canonical model for urban trees, MIMathon Porto 2026 use case 01"

concept AgeRange:
  one of:
    Y21_30
    Y31_40
    Y41_50
    Y51_60
    Y61_70
    Y71_80
    Y81_100
    Over100
    Unknown

concept ClassificationKind:
  one of:
    IsolatedSpecimen
    TreeCluster

concept Authority:
  has name: one string
  has acronym: one string

concept Species:
  has scientificName: one string
  has commonName: optional string
  has taxonRef: optional string

concept LegalAct:
  has reference: one string
  has issuedOn: optional string

concept Classification:
  has kind: one ClassificationKind
  has specimenCount: optional int
  has authority: one Authority
  has legalAct: one LegalAct
  has classifiedOn: optional string

concept Location:
  has latitude: one float
  has longitude: one float

concept Tree:
  has localId: one string
  has species: one Species
  has ageRange: optional AgeRange
  has location: one Location
  has classification: optional Classification
  has refFlowerBed: optional string
  has refGarden: optional string

Design choices

Tree is generic: classification is optional so the entity is reusable beyond the heritage register, e.g. for any urban tree inventory
Species.taxonRef carries a full GBIF URL, e.g. https://www.gbif.org/species/9605163 for Magnolia grandiflora, resolvable and stable
refFlowerBed and refGarden anchor the tree in the existing Smart Data Models ecosystem (dataModel.ParksAndGardens)
Classification kept separate from Tree, so the model can later allow several classifications per tree over time (v0.2.0 declares at most one)
Authority as a concept, not a string, to absorb the spelling variants observed in source data
kind and specimenCount split out of the original textual classif_tipo
Dates as ISO 8601 strings, since Dolfin has no native date type
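Because dates travel as plain strings, a light validation step helps keep them honest. The sketch below is an assumption about how that check could look, using only the standard library; as_iso_date is a hypothetical helper, not part of the published harmonizer.

```python
from datetime import date
from typing import Optional


def as_iso_date(raw: Optional[str]) -> Optional[str]:
    """Return `raw` normalized to YYYY-MM-DD, or None if it does not parse."""
    if not raw:
        return None
    try:
        return date.fromisoformat(raw.strip()).isoformat()
    except ValueError:
        return None


print(as_iso_date("2005-01-10"))   # 2005-01-10
print(as_iso_date("10/01/2005"))   # None
```

Returning None rather than raising keeps the decision (reject the record, or keep it without a date) in the adapter.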

Step 5, source to canonical mapping

From Porto Open Data to the pivot.

Source attribute	Source example	Canonical attribute	Transformation
objectid	1726	Tree.localId	cast to string
especie	Magnolia grandiflora	Species.scientificName	trim, normalize
esp_nomecomum	Magnólia	Species.commonName	trim
(derived from especie)	Magnolia grandiflora	Species.taxonRef	resolve via GBIF species/match API, store full URL e.g. https://www.gbif.org/species/9605163
arv_intervalo_idade	81-100	Tree.ageRange	map to enum, "sup 100" → Over100, empty → Unknown
classif_tutela	ICNF (Instituto da Conservação da Natureza e das Florestas)	Classification.authority	parse, dedupe, build Authority{ name, acronym }
classif_tipo	Conjunto arbóreo (12 exemplares)	Classification.kind, .specimenCount	regex split, "Conjunto arbóreo" → TreeCluster, "Exemplar isolado" / "Isolada" → IsolatedSpecimen, capture (N)
classif_dec_lei_ref	D.R. nº 6 II Série de 10/01/2005	LegalAct.reference	normalize whitespace, unify "D.R." / "D. R."
classif_data	2005-01-10	Classification.classifiedOn	ISO 8601, no transform
geometry.coordinates	[-8.6024, 41.1458]	Location.longitude, .latitude	WGS84, [lon, lat] order from GeoJSON
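The last row of the mapping is the easiest one to get wrong, since GeoJSON positions are [longitude, latitude]. A minimal sketch of that re-keying, assuming a plain dict geometry; location_from_point is a hypothetical name, not a function of the harmonizer:

```python
def location_from_point(geometry: dict) -> dict:
    """Turn a GeoJSON Point into explicit latitude / longitude keys.

    GeoJSON stores positions as [longitude, latitude]; swapping them is a
    classic bug, so the order is unpacked by name here.
    """
    if geometry.get("type") != "Point":
        raise ValueError(f"expected a Point, got {geometry.get('type')!r}")
    longitude, latitude = geometry["coordinates"][:2]
    return {"latitude": float(latitude), "longitude": float(longitude)}


loc = location_from_point({"type": "Point", "coordinates": [-8.6024, 41.1458]})
print(loc)  # {'latitude': 41.1458, 'longitude': -8.6024}
```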

Step 6, generic harmonizer

One core, many datasets.

The script is split into two parts: a dataset-agnostic core (canonical model, GBIF resolver, JSON-LD writer), and one adapter per source dataset. Harmonizing another city's tree inventory means writing a new adapter file, nothing else.

Core, reusable

harmonize/

model.py, dataclasses mirroring the Dolfin definition
gbif.py, GBIF Backbone resolver with on-disk cache
jsonld.py, JSON-LD writer with Darwin Core and WGS84 vocab
__main__.py, CLI orchestrating adapter, GBIF, output

Adapters, per dataset

harmonize/adapters/

porto.py, GeoJSON, Porto Open Data conventions
(your dataset here), expose a read(path) generator yielding canonical Tree objects

No core change required to onboard a new city or schema.

CLI

python -m harmonize \
    --adapter porto \
    --input ../uc1-trees-porto.geojson \
    --output ../out/trees.jsonld \
    --base-id "http://mimathon.askem.eu/uc1/trees/" \
    --gbif-cache ../out/.gbif_cache.json

GBIF resolution, run on Porto

11 distinct species in the dataset, 10 resolved on the first pass:

GBIF EXACT  Magnolia grandiflora      ->  https://www.gbif.org/species/9605163
GBIF EXACT  Platanus x acerifolia     ->  https://www.gbif.org/species/3152815
GBIF FUZZY  Araucaria bidwilii        ->  https://www.gbif.org/species/2684918
GBIF EXACT  Washingtonia robusta      ->  https://www.gbif.org/species/5294595
GBIF EXACT  Liriodendron tulipifera   ->  https://www.gbif.org/species/3152861
GBIF EXACT  Metrosideros excelsa      ->  https://www.gbif.org/species/3185393
GBIF EXACT  Metrosideros robusta      ->  https://www.gbif.org/species/3185294
GBIF EXACT  Ginkgo biloba             ->  https://www.gbif.org/species/2687885
GBIF EXACT  Phoenix canariensis       ->  https://www.gbif.org/species/7445284
GBIF  MISS  Phoenix sp.
GBIF EXACT  Araucaria heterophylla    ->  https://www.gbif.org/species/2684969

Result: 237 of 238 trees carry a resolvable taxonRef. The single miss is Phoenix sp., a genus-only label with no species rank, correctly rejected by GBIF. The FUZZY match on Araucaria bidwilii reveals a typo in the source: GBIF prefers Araucaria bidwillii, with a double l.
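A FUZZY match like the one above can be turned into an explicit data-quality hint. The sketch below assumes a match dict with matchType and canonicalName keys, as in the resolver's documented return value; spelling_hint itself is a hypothetical helper, not part of the harmonizer.

```python
from typing import Optional


def spelling_hint(source_name: str, match: Optional[dict]) -> Optional[str]:
    """Return a hint when GBIF's canonical name differs from the source name.

    `match` is assumed to be a dict with "matchType" and "canonicalName"
    keys, or None on a miss.
    """
    if not match:
        return None
    canonical = match.get("canonicalName")
    if match.get("matchType") == "FUZZY" and canonical and canonical != source_name:
        return f"{source_name!r} may be a misspelling of {canonical!r}"
    return None


hint = spelling_hint(
    "Araucaria bidwilii",
    {"matchType": "FUZZY", "canonicalName": "Araucaria bidwillii"},
)
print(hint)
```

The hint is only a suggestion: a fuzzy match can also point at a different taxon, so a human should confirm before the source is corrected.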

Sample output, JSON-LD

{
  "@id": "http://mimathon.askem.eu/uc1/trees/1726",
  "@type": "Tree",
  "localId": "1726",
  "species": {
    "@type": "Species",
    "scientificName": "Magnolia grandiflora",
    "commonName": "Magnólia",
    "taxonRef": "https://www.gbif.org/species/9605163"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14581684711751,
    "longitude": -8.602456850857784
  },
  "ageRange": "Y81_100",
  "classification": {
    "@type": "Classification",
    "kind": "TreeCluster",
    "specimenCount": 12,
    "authority": {
      "@type": "Authority",
      "name": "Instituto da Conservação da Natureza e das Florestas",
      "acronym": "ICNF"
    },
    "legalAct": {
      "@type": "LegalAct",
      "reference": "D.R. nº 6 II Série de 10/01/2005"
    },
    "classifiedOn": "2005-01-10"
  }
}

Step 7, the data

Before and after, on the same record.

Same tree (objectid 1726, a Magnolia in Jardim do Palácio de Cristal), shown raw from the Porto Open Data GeoJSON on the left, and harmonized to the canonical model with GBIF binding on the right.

↓ uc1-trees-porto.geojson, source, 238 features
↓ trees.jsonld, canonical JSON-LD, 238 trees
↓ trees-canonical.geojson, canonical, GIS-ready

JSON-LD is for the data layer (semantic, dereferenceable, queryable). For map tools (geojson.io, QGIS, Mapbox, Leaflet) use the canonical GeoJSON: same data, flattened into a FeatureCollection with dotted-key properties (e.g. species.taxonRef).

Source · GeoJSON feature · Porto Open Data

{
  "type": "Feature",
  "id": 1726,
  "geometry": {
    "type": "Point",
    "coordinates": [
      -8.602456850857784,
      41.14581684711751
    ]
  },
  "properties": {
    "objectid": 1726,
    "especie": "Magnolia grandiflora",
    "esp_nomecomum": "Magnólia",
    "arv_intervalo_idade": "81-100",
    "classif_tutela": "ICNF (Instituto da Conservação da Natureza e das Florestas)",
    "classif_tipo": "Conjunto arbóreo (12 exemplares)\r\n",
    "classif_dec_lei_ref": "D.R. nº 6 II Série de 10/01/2005",
    "classif_data": "2005-01-10"
  }
}

Canonical · JSON-LD node · http://mimathon.askem.eu/uc1/trees/

{
  "localId": "1726",
  "species": {
    "@type": "Species",
    "scientificName": "Magnolia grandiflora",
    "commonName": "Magnólia",
    "taxonRef": "https://www.gbif.org/species/9605163"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14581684711751,
    "longitude": -8.602456850857784
  },
  "ageRange": "Y81_100",
  "classification": {
    "kind": "TreeCluster",
    "authority": {
      "@type": "Authority",
      "name": "Instituto da Conservação da Natureza e das Florestas",
      "acronym": "ICNF"
    },
    "legalAct": {
      "@type": "LegalAct",
      "reference": "D.R. nº 6 II Série de 10/01/2005"
    },
    "specimenCount": 12,
    "classifiedOn": "2005-01-10",
    "@type": "Classification"
  },
  "@id": "http://mimathon.askem.eu/uc1/trees/1726",
  "@type": "Tree"
}

JSON-LD as a graph

Same record visualized as a property graph. Each typed fragment becomes a node, each property a labelled edge. The dashed link to GBIF is a live IRI: taxonRef resolves to a stable, dereferenceable taxon page on gbif.org.

JSON-LD instance graph for tree 1726, with nodes Tree, Species, Location, Classification, Authority, LegalAct and an external link to GBIF
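The node-and-edge reading of a record can be approximated in a few lines. This is a simplified sketch, not a JSON-LD processor: it ignores @context expansion, and the hypothetical edges helper labels a nested fragment by its @type when it has no @id, which is only unambiguous within a single record.

```python
from typing import Iterator


def edges(node: dict, subject: str = "") -> Iterator[tuple[str, str, str]]:
    """Yield (subject, property, object) triples for a nested JSON-LD node.

    Nested dicts become their own nodes, labelled by @id when present and
    by @type otherwise; plain values become literal objects.
    """
    subject = subject or node.get("@id") or node.get("@type", "_")
    for key, value in node.items():
        if key.startswith("@"):
            continue  # @id / @type label the node, they are not edges
        if isinstance(value, dict):
            target = value.get("@id") or value.get("@type", "_")
            yield (subject, key, target)
            yield from edges(value, target)
        else:
            yield (subject, key, str(value))


tree = {
    "@id": "http://mimathon.askem.eu/uc1/trees/1726",
    "@type": "Tree",
    "localId": "1726",
    "species": {"@type": "Species", "scientificName": "Magnolia grandiflora"},
}
for triple in edges(tree):
    print(triple)
```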

Species in the dataset

11 distinct scientific names, 10 with a stable GBIF taxonRef (the 1 unmatched is a genus-only label, Phoenix sp.).

Scientific name	Common name (PT)	GBIF taxonRef	Trees
Metrosideros excelsa	Árvore-do-Fogo	3185393	85
Phoenix canariensis	Palmeira-das-Canárias	7445284	46
Platanus x acerifolia	Plataneiro	3152815	37
Araucaria heterophylla	Araucária-de-Norfolk	2684969	27
Magnolia grandiflora	Magnólia	9605163	25
Washingtonia robusta	Palmeira-Washingtonia	5294595	7
Liriodendron tulipifera	Tulipeiro-da-Virgínia	3152861	7
Araucaria bidwilii	Araucária-da-Queenslândia	2684918	1
Metrosideros robusta	Metrosidero	3185294	1
Ginkgo biloba	Ginkgo	2687885	1
Phoenix sp.	(none)	unmatched	1
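The coverage figure can be re-derived from the table above. The snippet below simply copies the table into a dict (the variable names are ours) and sums the counts, so the arithmetic is easy to check by hand.

```python
# (GBIF key, tree count) per species, copied from the table above.
# A key of None marks the one name GBIF did not match.
species_table = {
    "Metrosideros excelsa": (3185393, 85),
    "Phoenix canariensis": (7445284, 46),
    "Platanus x acerifolia": (3152815, 37),
    "Araucaria heterophylla": (2684969, 27),
    "Magnolia grandiflora": (9605163, 25),
    "Washingtonia robusta": (5294595, 7),
    "Liriodendron tulipifera": (3152861, 7),
    "Araucaria bidwilii": (2684918, 1),
    "Metrosideros robusta": (3185294, 1),
    "Ginkgo biloba": (2687885, 1),
    "Phoenix sp.": (None, 1),
}

total = sum(count for _, count in species_table.values())
resolved = sum(count for key, count in species_table.values() if key is not None)
print(f"{resolved} of {total} trees carry a taxonRef")  # 237 of 238 trees carry a taxonRef
```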

Legal acts in the dataset

10 distinct references after the per-record normalization (collapse of D. R. to D.R.).

Harmonization headroom

Four of these ten references all point to the same act of 10 January 2005, with cosmetic variants ("D.R. nº 6", "D.R. n.º 6", "D.R. n.º6", "D.R. nº 6 ... 2005-01-10"), covering 214 of the 238 trees. A second pass, deduplicating LegalAct by issued date, would collapse the count from 10 to 7.

Reference	classifiedOn	Trees
D.R. nº 6 II Série de 10/01/2005	2005-01-10, 2005-10-01	210
D.R. nº 29 2ª Série Parte C de 11/02/2021	2021-02-11	12
D.R. n.º 60/ 2019, Série II de 2019-03-26	2019-03-26	7
D.R. n.º 99/2021, Série II de 2021/05/21	2021-01-07	2
D.R. n.º 6 II Série de 10/01/2005	2005-01-10, 2008-01-10	2
D.G. nº 204 II Série de 01/09/1950	1950-09-01	1
D.R. n.º 53 / 2019, Série II de 2019-03-15	2019-03-15	1
D.G. nº 280 II Série de 02/12/1939	1939-12-02	1
D.R. n.º6 II Série de 10/01/2005	2005-01-10	1
D.R. nº 6 II Série de 2005-01-10	2005-01-10	1
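The second pass mentioned under Harmonization headroom could be sketched as follows. This is an illustration under assumptions, not the harmonizer's code: issued_on is a hypothetical helper, slash dates with the year last are read as day/month/year (the Portuguese convention), and grouping by date alone presumes that no two distinct acts share an issue date.

```python
import re
from collections import defaultdict
from typing import Optional

# Trailing date in either dd/mm/yyyy or yyyy-mm-dd (also yyyy/mm/dd) form.
_DMY = re.compile(r"(\d{2})/(\d{2})/(\d{4})\s*$")
_YMD = re.compile(r"(\d{4})[-/](\d{2})[-/](\d{2})\s*$")


def issued_on(reference: str) -> Optional[str]:
    """Extract the trailing date of a reference as YYYY-MM-DD, if any."""
    m = _DMY.search(reference)
    if m:
        day, month, year = m.groups()
        return f"{year}-{month}-{day}"
    m = _YMD.search(reference)
    if m:
        year, month, day = m.groups()
        return f"{year}-{month}-{day}"
    return None


# The four cosmetic variants of the 10 January 2005 act, from the table above.
variants = [
    "D.R. nº 6 II Série de 10/01/2005",
    "D.R. n.º 6 II Série de 10/01/2005",
    "D.R. n.º6 II Série de 10/01/2005",
    "D.R. nº 6 II Série de 2005-01-10",
]

by_date = defaultdict(list)
for ref in variants:
    by_date[issued_on(ref)].append(ref)

print(dict(by_date).keys())  # dict_keys(['2005-01-10'])
```

The extracted date is also a natural value for LegalAct.issuedOn, which the mapping table leaves unfilled.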

What changed

Authority lifted from a free string to a typed Authority with name + acronym
"Conjunto arbóreo (12 exemplares)" split into kind: TreeCluster + specimenCount: 12
"81-100" mapped to enum Y81_100
especie resolved against GBIF, taxonRef set to https://www.gbif.org/species/9605163
Coordinates re-keyed as latitude / longitude in a typed Location fragment
Stable @id assigned, every fragment carries an @type

Step 8, source code

Read it, run it, fork it.

Full source of the harmonizer, hosted alongside this page. Each file is also a one-click download as raw .py. The whole package is bundled as a tarball below.

↓ harmonize.tar.gz, complete package
↓ trees.dolfin, canonical model, v0.2.0
▶ Open slides, mSM web player, F for fullscreen
↓ slides.md, raw deck source

Add a new dataset

The harmonizer is split into three layers, so a new dataset means writing only the bottom layer, the adapter:

Core (model.py, gbif.py, jsonld.py, __main__.py): never changes
Shared transforms (transforms.py): generic helpers (clean_text, extract_count, match_keywords, Registry) reused by every adapter
Adapter (adapters/<dataset>.py): one per dataset, exposes a single read(path) generator

Workflow: copy adapters/_template.py to adapters/<your_name>.py, fill in the read function, run python -m harmonize --adapter <your_name> .... Look at adapters/porto.py for a complete worked example.

model.py, canonical Tree model, dataclasses mirroring trees.dolfin

"""Canonical urban tree model, mirrors trees.dolfin v0.2.0."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional


AGE_RANGES = {
    "21-30": "Y21_30",
    "31-40": "Y31_40",
    "41-50": "Y41_50",
    "51-60": "Y51_60",
    "61-70": "Y61_70",
    "71-80": "Y71_80",
    "81-100": "Y81_100",
    "sup 100": "Over100",
    "": "Unknown",
    None: "Unknown",
}


@dataclass(frozen=True)
class Authority:
    name: str
    acronym: str


@dataclass(frozen=True)
class Species:
    scientificName: str
    commonName: Optional[str] = None
    taxonRef: Optional[str] = None


@dataclass(frozen=True)
class LegalAct:
    reference: str
    issuedOn: Optional[str] = None


@dataclass
class Classification:
    kind: str
    authority: Authority
    legalAct: LegalAct
    specimenCount: Optional[int] = None
    classifiedOn: Optional[str] = None


@dataclass
class Location:
    latitude: float
    longitude: float


@dataclass
class Tree:
    localId: str
    species: Species
    location: Location
    ageRange: Optional[str] = None
    classification: Optional[Classification] = None
    refFlowerBed: Optional[str] = None
    refGarden: Optional[str] = None


def normalize_age(raw: Optional[str]) -> str:
    if raw is None:
        return "Unknown"
    return AGE_RANGES.get(raw.strip(), "Unknown")

gbif.py, GBIF Backbone resolver with on-disk cache, stdlib only

"""GBIF Backbone Taxonomy resolver, with on-disk cache.

Uses the public species/match endpoint, which fuzzy matches a scientific name
against the GBIF backbone and returns a stable usageKey. We turn that into
a canonical, resolvable URL: https://www.gbif.org/species/{usageKey}.

No external dependency, stdlib only.
"""
from __future__ import annotations
import json
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Optional


GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"
GBIF_SPECIES_PAGE = "https://www.gbif.org/species/{key}"


class GbifResolver:
    """Resolve scientific names to GBIF species URLs, cached to disk."""

    def __init__(self, cache_path: Path, timeout: float = 10.0):
        self.cache_path = Path(cache_path)
        self.timeout = timeout
        self._cache: dict[str, dict] = {}
        if self.cache_path.exists():
            self._cache = json.loads(self.cache_path.read_text(encoding="utf-8"))

    def resolve(self, scientific_name: str) -> Optional[dict]:
        """Return {url, usageKey, canonicalName, matchType, rank} or None on failure.

        Result is cached, including misses (stored as {"miss": true}) so a
        re-run does not pound the API for unmatched names.
        """
        key = scientific_name.strip()
        if not key:
            return None
        if key in self._cache:
            entry = self._cache[key]
            return None if entry.get("miss") else entry

        params = urllib.parse.urlencode({"name": key, "verbose": "false"})
        url = f"{GBIF_MATCH_URL}?{params}"
        try:
            with urllib.request.urlopen(url, timeout=self.timeout) as r:
                payload = json.loads(r.read().decode("utf-8"))
        except Exception as e:
            print(f"  ! GBIF lookup failed for {key!r}: {e}")
            return None

        usage_key = payload.get("usageKey")
        if not usage_key or payload.get("matchType") == "NONE":
            self._cache[key] = {"miss": True}
            self._flush()
            return None

        entry = {
            "url": GBIF_SPECIES_PAGE.format(key=usage_key),
            "usageKey": usage_key,
            "canonicalName": payload.get("canonicalName") or payload.get("scientificName"),
            "matchType": payload.get("matchType"),
            "rank": payload.get("rank"),
        }
        self._cache[key] = entry
        self._flush()
        return entry

    def _flush(self) -> None:
        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
        self.cache_path.write_text(
            json.dumps(self._cache, ensure_ascii=False, indent=2, sort_keys=True),
            encoding="utf-8",
        )

jsonld.py, JSON-LD writer with Darwin Core and WGS84 vocabulary

"""JSON-LD serializer for the canonical Tree model.

Produces a single document with a @graph of all trees. Inline objects
for Species, Authority, LegalAct, Classification and Location, with
type tags so each fragment is independently identifiable.
"""
from __future__ import annotations
from dataclasses import asdict
from typing import Iterable

from .model import Tree


NS = "http://mimathon.askem.eu/uc1/trees#"

CONTEXT = {
    "@vocab": NS,
    "Tree": NS + "Tree",
    "Species": NS + "Species",
    "Authority": NS + "Authority",
    "LegalAct": NS + "LegalAct",
    "Classification": NS + "Classification",
    "Location": NS + "Location",
    # Reference-valued terms are typed as IRIs so they serialize as links.
    "taxonRef": {"@id": NS + "taxonRef", "@type": "@id"},
    "refFlowerBed": {"@id": NS + "refFlowerBed", "@type": "@id"},
    "refGarden": {"@id": NS + "refGarden", "@type": "@id"},
    # W3C Basic Geo (WGS84) namespace; its IRI uses the http scheme.
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "latitude": "geo:lat",
    "longitude": "geo:long",
    # Darwin Core terms for the species names.
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "scientificName": "dwc:scientificName",
    "commonName": "dwc:vernacularName",
}


def _strip_none(d):
    if isinstance(d, dict):
        return {k: _strip_none(v) for k, v in d.items() if v is not None}
    if isinstance(d, list):
        return [_strip_none(x) for x in d]
    return d


def tree_to_node(tree: Tree, base_id: str) -> dict:
    d = asdict(tree)
    d["@id"] = f"{base_id}{tree.localId}"
    d["@type"] = "Tree"
    d["species"] = {"@type": "Species", **asdict(tree.species)}
    d["location"] = {"@type": "Location", **asdict(tree.location)}
    if tree.classification is not None:
        c = asdict(tree.classification)
        c["@type"] = "Classification"
        c["authority"] = {"@type": "Authority", **asdict(tree.classification.authority)}
        c["legalAct"] = {"@type": "LegalAct", **asdict(tree.classification.legalAct)}
        d["classification"] = c
    return _strip_none(d)


def build_document(trees: Iterable[Tree], base_id: str) -> dict:
    return {
        "@context": CONTEXT,
        "@graph": [tree_to_node(t, base_id) for t in trees],
    }

geojson_out.py, GeoJSON FeatureCollection writer for GIS tools (geojson.io, QGIS, Leaflet)

"""GeoJSON serializer for the canonical Tree model.

Emits a FeatureCollection in WGS84, suitable for any GIS tool
(geojson.io, QGIS, Mapbox, Leaflet). Nested entities are flattened
to a dotted-key namespace inside `properties` so the file is also
useful for non-LD-aware consumers.
"""
from __future__ import annotations
from dataclasses import asdict
from typing import Iterable

from .model import Tree


def _flatten(prefix: str, value, target: dict) -> None:
    if value is None:
        return
    if isinstance(value, dict):
        for k, v in value.items():
            _flatten(f"{prefix}.{k}" if prefix else k, v, target)
    else:
        target[prefix] = value


def tree_to_feature(tree: Tree, base_id: str) -> dict:
    props: dict = {"@id": f"{base_id}{tree.localId}", "@type": "Tree"}
    _flatten("localId", tree.localId, props)
    _flatten("ageRange", tree.ageRange, props)
    _flatten("species", asdict(tree.species), props)
    if tree.classification is not None:
        _flatten("classification", asdict(tree.classification), props)
    if tree.refFlowerBed is not None:
        props["refFlowerBed"] = tree.refFlowerBed
    if tree.refGarden is not None:
        props["refGarden"] = tree.refGarden

    return {
        "type": "Feature",
        "id": tree.localId,
        "geometry": {
            "type": "Point",
            "coordinates": [tree.location.longitude, tree.location.latitude],
        },
        "properties": props,
    }


def build_collection(trees: Iterable[Tree], base_id: str) -> dict:
    return {
        "type": "FeatureCollection",
        "features": [tree_to_feature(t, base_id) for t in trees],
    }

__main__.py, CLI orchestrating adapter, GBIF resolution, JSON-LD output

"""CLI entry point.

Example:
    python -m harmonize \
        --adapter porto \
        --input ../uc1-trees-porto.geojson \
        --output ../out/trees.jsonld \
        --base-id http://mimathon.askem.eu/uc1/trees/ \
        --gbif-cache ../out/.gbif_cache.json
"""
from __future__ import annotations
import argparse
import importlib
import json
import sys
from dataclasses import replace
from pathlib import Path

from .gbif import GbifResolver
from .jsonld import build_document
from .model import Species, Tree


def _load_adapter(name: str):
    mod = importlib.import_module(f"harmonize.adapters.{name}")
    if not hasattr(mod, "read"):
        raise SystemExit(f"adapter {name!r} has no read(path) function")
    return mod


def _enrich_with_gbif(trees, resolver: GbifResolver):
    """Replace each tree.species with a copy carrying taxonRef when resolved."""
    seen: dict[str, Species] = {}
    for t in trees:
        sci = t.species.scientificName
        if sci in seen:
            yield replace(t, species=seen[sci])
            continue
        match = resolver.resolve(sci)
        if match:
            enriched = replace(t.species, taxonRef=match["url"])
            print(f"  GBIF {match['matchType']:>5}  {sci}  ->  {match['url']}")
        else:
            enriched = t.species
            print(f"  GBIF  MISS  {sci}")
        seen[sci] = enriched
        yield replace(t, species=enriched)


def main(argv=None) -> int:
    p = argparse.ArgumentParser(prog="harmonize", description="Harmonize an urban tree dataset to the canonical Tree model and emit JSON-LD.")
    p.add_argument("--adapter", required=True, help="Adapter module name under harmonize.adapters, e.g. porto")
    p.add_argument("--input", required=True, type=Path, help="Source dataset path")
    p.add_argument("--output", required=True, type=Path, help="Destination JSON-LD file")
    p.add_argument("--base-id", default="http://example.org/trees/", help="IRI prefix for tree @id values")
    p.add_argument("--gbif-cache", type=Path, default=Path(".gbif_cache.json"), help="On-disk GBIF lookup cache")
    p.add_argument("--no-gbif", action="store_true", help="Skip GBIF resolution (faster, offline)")
    args = p.parse_args(argv)

    adapter = _load_adapter(args.adapter)
    print(f"Reading via adapter '{args.adapter}' from {args.input}...")
    trees = list(adapter.read(args.input))
    print(f"  {len(trees)} trees read")

    if not args.no_gbif:
        print(f"Resolving species against GBIF (cache: {args.gbif_cache})...")
        resolver = GbifResolver(args.gbif_cache)
        trees = list(_enrich_with_gbif(trees, resolver))

    print(f"Writing JSON-LD to {args.output}...")
    doc = build_document(trees, base_id=args.base_id)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f"  done, {len(doc['@graph'])} entities in @graph")
    return 0


if __name__ == "__main__":
    sys.exit(main())

transforms.py, reusable text helpers shared across adapters

"""Reusable text transforms shared across adapters.

Adapters compose these helpers rather than reimplementing them. Helpers
are intentionally minimal: they only do generic text work (cleanup,
regex extraction, keyword routing). Anything dataset-specific belongs
in the adapter itself.
"""
from __future__ import annotations
import re
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Trim, collapse internal whitespace, return None for empty input."""
    if value is None:
        return None
    txt = re.sub(r"\s+", " ", str(value)).strip()
    return txt or None


def extract_count(value: Optional[str], pattern: str = r"\((\d+)") -> Optional[int]:
    """Pull an integer out of free text, e.g. '... (12 exemplares)' -> 12."""
    if value is None:
        return None
    m = re.search(pattern, value)
    return int(m.group(1)) if m else None


def match_keywords(value: Optional[str], keyword_map: dict[str, str]) -> Optional[str]:
    """Return the first enum value whose regex key matches the input.

    keyword_map: {regex_pattern: enum_value}, e.g.
        {r"conjunto\\s+arb[óo]re[op]": "TreeCluster",
         r"isolad": "IsolatedSpecimen"}
    Patterns are evaluated in insertion order, case-insensitive.
    """
    if not value:
        return None
    for pattern, enum_value in keyword_map.items():
        if re.search(pattern, value, re.IGNORECASE):
            return enum_value
    return None


class Registry:
    """Tiny dedupe registry for value-typed entities like Authority.

    Use when source data has many spelling variants of the same entity:
        reg = Registry({"ICNF": Authority(name="...", acronym="ICNF")})
        a = reg.resolve("ICNF (Instituto da Conservação ...)", needle="ICNF")
    The canonical instance is returned, ensuring downstream graphs share
    one node per real-world entity.
    """

    def __init__(self, known: dict | None = None):
        self._known = dict(known or {})

    def resolve(self, raw, needle: str | None = None, default=None):
        if raw is None:
            return default
        text = str(raw)
        if needle is not None and needle in text and needle in self._known:
            return self._known[needle]
        for key, val in self._known.items():
            if key in text:
                return val
        return default

    def get(self, key: str):
        return self._known.get(key)

adapters/_template.py, skeleton, copy and rename to add a new dataset

"""Skeleton adapter, copy and rename to add a new dataset.

Quick start:
    1. Copy this file to harmonize/adapters/<your_dataset>.py
    2. Replace the `read` function body with your own parsing
    3. Run:  python -m harmonize --adapter <your_dataset> --input ... --output ...

Contract:
    Expose a single function `read(path) -> Iterator[Tree]`.
    The CLI (harmonize/__main__.py) takes care of GBIF resolution and
    JSON-LD output, so the adapter only has to map source records to
    canonical Tree instances.

See harmonize/adapters/porto.py for a complete worked example covering
field renaming, enum normalization, kind/count splitting from free
text, and authority dedupe.
"""
from __future__ import annotations
from pathlib import Path
from typing import Iterator

from ..model import (
    AGE_RANGES,
    Authority, Classification, LegalAct, Location, Species, Tree,
    normalize_age,
)
from ..transforms import (
    Registry, clean_text, extract_count, match_keywords,
)


# Optional: pre-populate canonical entities you'll reuse, then dedupe via Registry.
_AUTHORITIES = Registry({
    # "ICNF": Authority(name="Instituto da Conservação ...", acronym="ICNF"),
})


# Optional: keyword-driven enum routing for messy free-text fields.
_KIND_KEYWORDS = {
    # r"conjunto\s+arb[óo]re[op]": "TreeCluster",
    # r"isolad|exemplar\s+isolado": "IsolatedSpecimen",
}


def read(path: str | Path) -> Iterator[Tree]:
    """Yield canonical Tree records from `path`.

    Replace the body below with your dataset's parsing logic. The
    important parts:

      - return Tree(localId=..., species=..., location=..., ...)
      - all string-valued attributes should already be cleaned
      - taxonRef is left for the CLI to fill via GBIF
      - if classification info is missing, set classification=None
    """
    # Example for a CSV: read it and iterate rows
    # import csv
    # with Path(path).open(encoding="utf-8") as f:
    #     for row in csv.DictReader(f):
    #         yield Tree(
    #             localId=row["id"],
    #             species=Species(
    #                 scientificName=clean_text(row["scientific_name"]) or "",
    #                 commonName=clean_text(row.get("common_name")),
    #             ),
    #             location=Location(
    #                 latitude=float(row["lat"]),
    #                 longitude=float(row["lon"]),
    #             ),
    #             ageRange=normalize_age(row.get("age_class")),
    #             classification=None,
    #         )
    raise NotImplementedError("Implement read() for your dataset, see porto.py")

adapters/porto.py, adapter for Porto Open Data classified-trees GeoJSON

"""Adapter for the Porto Open Data classified-trees GeoJSON.

Source schema (per feature.properties):
    objectid              -> Tree.localId
    especie               -> Species.scientificName
    esp_nomecomum         -> Species.commonName
    arv_intervalo_idade   -> Tree.ageRange
    classif_tutela        -> Classification.authority (parse)
    classif_tipo          -> Classification.kind + .specimenCount (parse)
    classif_dec_lei_ref   -> Classification.legalAct.reference (normalize)
    classif_data          -> Classification.classifiedOn

Geometry: Point in WGS84, [lon, lat] order.
"""
from __future__ import annotations
import json
import re
from pathlib import Path
from typing import Iterator

from ..model import (
    Authority, Classification, LegalAct, Location, Species, Tree, normalize_age,
)


SOURCE_ID = "porto-classified-trees"


_AUTHORITY_KNOWN = {
    "ICNF": Authority(
        name="Instituto da Conservação da Natureza e das Florestas",
        acronym="ICNF",
    ),
}


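def _parse_authority(raw: str | None) -> Authority | None:
    """Map any spelling that mentions ICNF to the single canonical Authority."""
    if not raw:
        return None
    txt = raw.strip()
    if "ICNF" in txt:
        return _AUTHORITY_KNOWN["ICNF"]
    # Unknown authority: keep the raw name, with its first 8 characters as a stand-in acronym.
    return Authority(name=txt, acronym=txt[:8])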


_KIND_CLUSTER = re.compile(r"conjunto\s+arb[óo]re[op]", re.IGNORECASE)
_KIND_ISOLATED = re.compile(r"(exemplar\s+isolado|isolada|árvore\s+isolada)", re.IGNORECASE)
_COUNT = re.compile(r"\((\d+)\s*exemplares?\)", re.IGNORECASE)


def _parse_kind_and_count(raw: str | None) -> tuple[str | None, int | None]:
    if not raw:
        return None, None
    txt = raw.strip()
    count = None
    m = _COUNT.search(txt)
    if m:
        count = int(m.group(1))
    if _KIND_CLUSTER.search(txt):
        return "TreeCluster", count
    if _KIND_ISOLATED.search(txt):
        return "IsolatedSpecimen", count
    return None, count


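def _normalize_legal_ref(raw: str | None) -> str | None:
    """Collapse spelling variants of a legal reference into one form."""
    if not raw:
        return None
    txt = raw.strip()
    txt = re.sub(r"D\.\s*R\.", "D.R.", txt)
    txt = re.sub(r"D\.\s*G\.", "D.G.", txt)
    # "nº" and "n.º" both occur in the source; keep the dotted form.
    txt = re.sub(r"\bn\.?\s*º", "n.º", txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt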


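def _build_classification(props: dict) -> Classification | None:
    """Build a Classification, or None when authority, kind and legal reference are all absent.

    Missing parts of a partially filled record get explicit placeholders.
    """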
    authority = _parse_authority(props.get("classif_tutela"))
    kind, count = _parse_kind_and_count(props.get("classif_tipo"))
    legal_ref = _normalize_legal_ref(props.get("classif_dec_lei_ref"))
    classified_on = (props.get("classif_data") or "").strip() or None

    if not (authority or kind or legal_ref):
        return None

    return Classification(
        kind=kind or "IsolatedSpecimen",
        authority=authority or Authority(name="Unknown", acronym="UNK"),
        legalAct=LegalAct(reference=legal_ref or "(unspecified)"),
        specimenCount=count,
        classifiedOn=classified_on,
    )


def read(geojson_path: str | Path) -> Iterator[Tree]:
    """Yield canonical Tree records from a Porto GeoJSON file.

    GBIF resolution is left to the caller; species are emitted with
    scientificName / commonName only, taxonRef is set later.
    """
    data = json.loads(Path(geojson_path).read_text(encoding="utf-8"))
    for feat in data.get("features", []):
        props = feat.get("properties") or {}
        geom = feat.get("geometry") or {}
        coords = geom.get("coordinates") or [None, None]
        # Skip features without a usable [lon, lat] pair.
        if len(coords) < 2 or coords[0] is None or coords[1] is None:
            continue

        species = Species(
            scientificName=(props.get("especie") or "").strip(),
            commonName=((props.get("esp_nomecomum") or "").strip() or None),
        )
        if not species.scientificName:
            continue

        yield Tree(
            localId=str(props.get("objectid") or feat.get("id") or ""),
            species=species,
            location=Location(latitude=float(coords[1]), longitude=float(coords[0])),
            ageRange=normalize_age(props.get("arv_intervalo_idade")),
            classification=_build_classification(props),
        )
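
As a quick check of the classif_tipo parsing rules in the adapter, the sketch below repeats the three patterns so it runs on its own and applies them to values quoted in the audit. The function parse_kind_and_count is a standalone copy of _parse_kind_and_count made for illustration, and the "Exemplar isolado" sample is an assumed spelling, not a value confirmed from the dataset.

```python
import re

# Same patterns as in adapters/porto.py, repeated so the snippet runs standalone.
KIND_CLUSTER = re.compile(r"conjunto\s+arb[óo]re[op]", re.IGNORECASE)
KIND_ISOLATED = re.compile(r"(exemplar\s+isolado|isolada|árvore\s+isolada)", re.IGNORECASE)
COUNT = re.compile(r"\((\d+)\s*exemplares?\)", re.IGNORECASE)


def parse_kind_and_count(raw):
    """Standalone copy of _parse_kind_and_count, for illustration only."""
    if not raw:
        return None, None
    txt = raw.strip()
    m = COUNT.search(txt)
    count = int(m.group(1)) if m else None
    if KIND_CLUSTER.search(txt):
        return "TreeCluster", count
    if KIND_ISOLATED.search(txt):
        return "IsolatedSpecimen", count
    return None, count


samples = [
    "Conjunto arbóreo (12 exemplares)",  # quoted in the audit
    "Conjunto arbórep",                  # the typo found in the audit
    "Exemplar isolado",                  # assumed spelling of the isolated case
    None,                                # missing value
]
results = [parse_kind_and_count(s) for s in samples]
for sample, result in zip(samples, results):
    print(repr(sample), "->", result)
```

The typo still routes to TreeCluster because the cluster pattern accepts both "arbóreo" and "arbórep" as the final letter.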

Step 9, what's next

Remaining work.

Lower the bar for new adapters

Today an adapter is a small Python file. Two complementary ideas would lower the bar further. First, a declarative YAML mapping for the easy cases (field renames and enum lookups), consumed by a generic config adapter. Second, a harmonize new-adapter --name <city> command that scaffolds a starter file from _template.py, together with a fixture and a quick smoke test. Goal: bring a new dataset online in under 30 minutes, with no Python required for the simple cases.
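
A minimal sketch of the declarative idea, assuming a mapping with only two kinds of rule: field renames and enum lookups. The mapping is written as a Python dict standing in for the parsed YAML file. The keys rename and enum, the function apply_mapping, the enum labels and the sample row are all hypothetical and not part of the current codebase; only the source column names come from the Porto schema.

```python
from typing import Any

# Stand-in for the parsed YAML mapping file (hypothetical structure).
MAPPING: dict[str, Any] = {
    "rename": {
        "objectid": "localId",
        "especie": "scientificName",
        "esp_nomecomum": "commonName",
    },
    "enum": {
        "classif_tipo": {
            "target": "kind",
            "values": {
                "exemplar isolado": "IsolatedSpecimen",
                "conjunto arbóreo": "TreeCluster",
            },
        },
    },
}


def apply_mapping(row: dict[str, Any], mapping: dict[str, Any]) -> dict[str, Any]:
    """Turn one source row into a flat canonical dict using only the mapping."""
    out: dict[str, Any] = {}
    for src, dst in mapping.get("rename", {}).items():
        value = row.get(src)
        # Strip strings and turn empty strings into None.
        if isinstance(value, str):
            value = value.strip() or None
        out[dst] = value
    for src, rule in mapping.get("enum", {}).items():
        key = str(row.get(src) or "").strip().lower()
        # Unknown labels map to None rather than raising.
        out[rule["target"]] = rule["values"].get(key)
    return out


# Made-up sample row using the Porto column names.
row = {
    "objectid": 17,
    "especie": " Metrosideros excelsa ",
    "esp_nomecomum": "Árvore-do-Fogo",
    "classif_tipo": "Exemplar isolado",
}
canonical = apply_mapping(row, MAPPING)
print(canonical)
```

Fields that need real parsing, such as the specimen count embedded in classif_tipo, would presumably stay in a Python adapter; the mapping only targets the cases that are pure lookups.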

Draft a Smart Data Model "Tree" proposal

Translate the Dolfin canonical model into the SDM template (schema.json, examples, README), targeting dataModel.ParksAndGardens. Bring the gap finding to the SDM community as a candidate PR.

Onboard a second dataset

Pick another city's tree inventory and write a second adapter. This would confirm that the core is truly dataset-agnostic and let us refine the canonical model against a second worked example.
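
One way to make "dataset-agnostic" checkable is a small contract check run against the output of every adapter. The sketch below is an assumption about what such a check could look like: TreeLike is a simplified stand-in for the canonical Tree, check_contract and its three rules are hypothetical, and the coordinates are made-up values.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class TreeLike:
    """Simplified stand-in for the canonical Tree (illustration only)."""
    localId: str
    scientificName: str
    latitude: float
    longitude: float
    ageRange: Optional[str] = None


def check_contract(trees: Iterable[TreeLike]) -> list[str]:
    """Return human-readable violations; an empty list means the output passes."""
    problems: list[str] = []
    seen: set[str] = set()
    for i, tree in enumerate(trees):
        if not tree.localId:
            problems.append(f"record {i}: empty localId")
        elif tree.localId in seen:
            problems.append(f"record {i}: duplicate localId {tree.localId!r}")
        seen.add(tree.localId)
        if not tree.scientificName.strip():
            problems.append(f"record {i}: empty scientificName")
        in_range = -90.0 <= tree.latitude <= 90.0 and -180.0 <= tree.longitude <= 180.0
        if not in_range:
            problems.append(f"record {i}: coordinates outside WGS84 range")
    return problems


good = [TreeLike("1", "Metrosideros excelsa", 41.15, -8.61)]
bad = [
    TreeLike("1", "Metrosideros excelsa", 41.15, -8.61),
    TreeLike("1", "", 95.0, -8.61),  # duplicate id, empty name, latitude out of range
]
print(check_contract(good))
for problem in check_contract(bad):
    print(problem)
```

Because the check only looks at canonical fields, the same function could be pointed at the Porto adapter and at any second adapter without changes. A range check like this cannot detect swapped latitude and longitude when both values happen to be valid.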

Anchor in the existing ecosystem

Wire refFlowerBed and refGarden to actual SDM instances when available, especially for Conjunto arbóreo records that semantically belong to a planted area.