Team LESL: working notes on harmonizing the representation of classified urban trees of Porto, using Dolfin as a backend-independent pivot, and benchmarking against Smart Data Models.
Two source files, identical in semantic content, different in geometry. The CSV uses a projected reference system (likely ETRS89 / PT-TM06), the GeoJSON is in WGS84. We work from the GeoJSON.
The classif_tutela field carries three orthographic
variants for what is a single legal authority (ICNF). 159 records
say "ICNF", 78 say "ICNF (Instituto da Conservação
da Natureza e das Florestas)", 1 reverses the order. A
canonical model resolves this by lifting Authority to a
first-class concept with stable identity.
The classif_tipo field mixes two facts in free text:
the kind of classification (isolated specimen vs tree cluster)
and, for clusters, the specimen count, embedded as
"Conjunto arbóreo (12 exemplares)". We split into
kind (enum) and specimenCount (int).
One typo found along the way: "Conjunto arbórep".
11 distinct species, each consistently paired with one common
name (e.g. Metrosideros excelsa ↔ Árvore-do-Fogo).
We bind Species to the
GBIF Backbone Taxonomy
via a full URL in taxonRef, e.g.
https://www.gbif.org/species/3185393 for
Metrosideros excelsa. Resolvable, machine-readable,
versionable.
The classif_dec_lei_ref field has 10 distinct values
covering 7 underlying acts, with formatting variants like
"D.R. nº 6 II Série de 10/01/2005" vs
"D. R. n.º 6 II Série de 10/01/2005". Modelled as a
LegalAct concept with a normalized reference and an
issued date.
We searched the official Smart Data Models registry for any model covering trees, plants, species or forestry inventories. Result, in one line:
No Smart Data Model exists for an individual tree.
The closest domain, dataModel.ParksAndGardens,
defines only Garden, FlowerBed and
GreenspaceRecord. None of them carries species, age,
legal classification or heritage status. The
dataModel.Forestry repository contains only
FireForestStatus.
- Tree entity in dataModel.ParksAndGardens
- SDM conventions (type, id, refXxx for relationships) so the canonical model maps cleanly to a future SDM submission
- Tree can reference an existing FlowerBed or Garden to anchor it in the existing ecosystem (useful for the Conjunto arbóreo case)

We checked the OBO Foundry Plant Ontology (PO). It models plant anatomy, morphology and developmental stages for genomics annotation. Out of scope for an urban heritage register: it would be the right reference if we were tagging tissues or growth phases, not classified municipal assets.
The relevant external references for our Species concept
are taxonomic backbones: GBIF, NCBI Taxonomy,
IPNI, and the Darwin Core terms
dwc:scientificName / dwc:vernacularName.
Dolfin is a human-friendly, backend-independent ontology language: the same definition runs against graph, relational or document stores. We use it as the canonical source of truth and derive serializations from it (JSON-LD for SDM, GeoJSON for mapping tools, INSPIRE for European datasets).
package <http://mimathon.askem.eu/uc1/trees>:
  dolfin_version "1"
  version "0.2.0"
  author "LESL (Lea, Eliott, Sattisvar & Louis), Dolfin & Askem"
  description "Canonical model for urban trees, MIMathon Porto 2026 use case 01"

concept AgeRange:
  one of:
    Y21_30
    Y31_40
    Y41_50
    Y51_60
    Y61_70
    Y71_80
    Y81_100
    Over100
    Unknown

concept ClassificationKind:
  one of:
    IsolatedSpecimen
    TreeCluster

concept Authority:
  has name: one string
  has acronym: one string

concept Species:
  has scientificName: one string
  has commonName: optional string
  has taxonRef: optional string

concept LegalAct:
  has reference: one string
  has issuedOn: optional string

concept Classification:
  has kind: one ClassificationKind
  has specimenCount: optional int
  has authority: one Authority
  has legalAct: one LegalAct
  has classifiedOn: optional string

concept Location:
  has latitude: one float
  has longitude: one float

concept Tree:
  has localId: one string
  has species: one Species
  has ageRange: optional AgeRange
  has location: one Location
  has classification: optional Classification
  has refFlowerBed: optional string
  has refGarden: optional string
- Tree is generic: classification is optional so the entity is reusable beyond the heritage register, e.g. for any urban tree inventory
- Species.taxonRef carries a full GBIF URL, e.g. https://www.gbif.org/species/3152398, resolvable and stable
- refFlowerBed and refGarden anchor the tree in the existing Smart Data Models ecosystem (dataModel.ParksAndGardens)
- Classification isolated from Tree, so a tree can hold several classifications over time
- Authority as a concept, not a string, to absorb the spelling variants observed in source data
- kind and specimenCount split out of the original textual classif_tipo

| Source attribute | Source example | Canonical attribute | Transformation |
|---|---|---|---|
| objectid | 1726 | Tree.localId | cast to string |
| especie | Magnolia grandiflora | Species.scientificName | trim, normalize |
| esp_nomecomum | Magnólia | Species.commonName | trim |
| (derived from especie) | Magnolia grandiflora | Species.taxonRef | resolve via GBIF species/match API, store full URL e.g. https://www.gbif.org/species/9605163 |
| arv_intervalo_idade | 81-100 | Tree.ageRange | map to enum, "sup 100" → Over100, empty → Unknown |
| classif_tutela | ICNF (Instituto da Conservação da Natureza e das Florestas) | Classification.authority | parse, dedupe, build Authority{ name, acronym } |
| classif_tipo | Conjunto arbóreo (12 exemplares) | Classification.kind, .specimenCount | regex split, "Conjunto arbóreo" → TreeCluster, "Exemplar isolado" / "Isolada" → IsolatedSpecimen, capture (N) |
| classif_dec_lei_ref | D.R. nº 6 II Série de 10/01/2005 | LegalAct.reference | normalize whitespace, unify "D.R." / "D. R." |
| classif_data | 2005-01-10 | Classification.classifiedOn | ISO 8601, no transform |
| geometry.coordinates | [-8.6024, 41.1458] | Location.longitude, .latitude | WGS84, [lon, lat] order from GeoJSON |
The script is split in two: a dataset-agnostic core (canonical model, GBIF resolver, JSON-LD writer), and one adapter per source dataset. Harmonizing another city's tree inventory means writing a new adapter file, nothing else.
- model.py, dataclasses mirroring the Dolfin definition
- gbif.py, GBIF Backbone resolver with on-disk cache
- jsonld.py, JSON-LD writer with Darwin Core and WGS84 vocab
- __main__.py, CLI orchestrating adapter, GBIF, output
- porto.py, GeoJSON, Porto Open Data conventions
- read(path) generator yielding canonical Tree objects

No core change required to onboard a new city or schema.
python -m harmonize \
--adapter porto \
--input ../uc1-trees-porto.geojson \
--output ../out/trees.jsonld \
--base-id "http://mimathon.askem.eu/uc1/trees/" \
--gbif-cache ../out/.gbif_cache.json
11 distinct species in the dataset, 10 resolved on the first pass:
GBIF EXACT Magnolia grandiflora -> https://www.gbif.org/species/9605163
GBIF EXACT Platanus x acerifolia -> https://www.gbif.org/species/3152815
GBIF FUZZY Araucaria bidwilii -> https://www.gbif.org/species/2684918
GBIF EXACT Washingtonia robusta -> https://www.gbif.org/species/5294595
GBIF EXACT Liriodendron tulipifera -> https://www.gbif.org/species/3152861
GBIF EXACT Metrosideros excelsa -> https://www.gbif.org/species/3185393
GBIF EXACT Metrosideros robusta -> https://www.gbif.org/species/3185294
GBIF EXACT Ginkgo biloba -> https://www.gbif.org/species/2687885
GBIF EXACT Phoenix canariensis -> https://www.gbif.org/species/7445284
GBIF MISS Phoenix sp.
GBIF EXACT Araucaria heterophylla -> https://www.gbif.org/species/2684969
Result: 237 of 238 trees carry a resolvable
taxonRef. The single miss is Phoenix sp., a
genus-only label with no species rank, correctly rejected by GBIF.
The FUZZY match on Araucaria bidwilii reveals a typo in the
source: GBIF prefers Araucaria bidwillii, with two l.
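The FUZZY finding can be turned into an automatic check. The sketch below is a minimal illustration, assuming entries shaped like the ones `GbifResolver.resolve` returns (`canonicalName`, `matchType`). The helper name `spelling_suggestion` is hypothetical, and the two sample entries are written by hand from the run log above rather than fetched from the API.

```python
from typing import Optional


def spelling_suggestion(source_name: str, entry: Optional[dict]) -> Optional[str]:
    """Return GBIF's canonical spelling when a fuzzy match differs from the source name."""
    if not entry or entry.get("matchType") != "FUZZY":
        return None
    canonical = entry.get("canonicalName")
    if canonical and canonical != source_name.strip():
        return canonical
    return None


# Hand-written entries mirroring the run log (assumption: no network call is made here).
fuzzy = {"canonicalName": "Araucaria bidwillii", "matchType": "FUZZY"}
exact = {"canonicalName": "Ginkgo biloba", "matchType": "EXACT"}

print(spelling_suggestion("Araucaria bidwilii", fuzzy))  # Araucaria bidwillii
print(spelling_suggestion("Ginkgo biloba", exact))       # None
```

A report of such suggestions could be sent back to the data publisher instead of silently rewriting the source name.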
{
  "@id": "http://mimathon.askem.eu/uc1/trees/1726",
  "@type": "Tree",
  "localId": "1726",
  "species": {
    "@type": "Species",
    "scientificName": "Magnolia grandiflora",
    "commonName": "Magnólia",
    "taxonRef": "https://www.gbif.org/species/9605163"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14581684711751,
    "longitude": -8.602456850857784
  },
  "ageRange": "Y81_100",
  "classification": {
    "@type": "Classification",
    "kind": "TreeCluster",
    "specimenCount": 12,
    "authority": {
      "@type": "Authority",
      "name": "Instituto da Conservação da Natureza e das Florestas",
      "acronym": "ICNF"
    },
    "legalAct": {
      "@type": "LegalAct",
      "reference": "D.R. nº 6 II Série de 10/01/2005"
    },
    "classifiedOn": "2005-01-10"
  }
}
Same tree (objectid 1726, a Magnolia in Jardim do Palácio de Cristal), shown raw from the Porto Open Data GeoJSON on the left, and harmonized to the canonical model with GBIF binding on the right.
JSON-LD is for the data layer (semantic, dereferenceable, queryable).
For map tools (geojson.io, QGIS, Mapbox, Leaflet) use the canonical
GeoJSON: same data, flattened into a FeatureCollection
with dotted-key properties (e.g. species.taxonRef).
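To make the dotted-key convention concrete, here is a small self-contained sketch of the flattening step. It follows the same idea as the serializer's `_flatten` helper shown later, but the function name `flatten` and the trimmed sample record are illustrative assumptions.

```python
def flatten(value, prefix: str = "") -> dict:
    """Flatten nested dicts into one level with dotted keys, skipping None values."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            dotted = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, dotted))
    elif value is not None:
        flat[prefix] = value
    return flat


# Excerpt of the harmonized record for tree 1726 (trimmed for the example).
record = {
    "localId": "1726",
    "species": {
        "scientificName": "Magnolia grandiflora",
        "taxonRef": "https://www.gbif.org/species/9605163",
    },
    "classification": {"kind": "TreeCluster", "specimenCount": 12, "classifiedOn": None},
}

properties = flatten(record)
for key in sorted(properties):
    print(key, "=", properties[key])
```

Keys holding `None` are dropped, so GIS tools only see attributes that actually carry a value.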
{
  "type": "Feature",
  "id": 1726,
  "geometry": {
    "type": "Point",
    "coordinates": [
      -8.602456850857784,
      41.14581684711751
    ]
  },
  "properties": {
    "objectid": 1726,
    "especie": "Magnolia grandiflora",
    "esp_nomecomum": "Magnólia",
    "arv_intervalo_idade": "81-100",
    "classif_tutela": "ICNF (Instituto da Conservação da Natureza e das Florestas)",
    "classif_tipo": "Conjunto arbóreo (12 exemplares)\r\n",
    "classif_dec_lei_ref": "D.R. nº 6 II Série de 10/01/2005",
    "classif_data": "2005-01-10"
  }
}
{
  "localId": "1726",
  "species": {
    "@type": "Species",
    "scientificName": "Magnolia grandiflora",
    "commonName": "Magnólia",
    "taxonRef": "https://www.gbif.org/species/9605163"
  },
  "location": {
    "@type": "Location",
    "latitude": 41.14581684711751,
    "longitude": -8.602456850857784
  },
  "ageRange": "Y81_100",
  "classification": {
    "kind": "TreeCluster",
    "authority": {
      "@type": "Authority",
      "name": "Instituto da Conservação da Natureza e das Florestas",
      "acronym": "ICNF"
    },
    "legalAct": {
      "@type": "LegalAct",
      "reference": "D.R. nº 6 II Série de 10/01/2005"
    },
    "specimenCount": 12,
    "classifiedOn": "2005-01-10",
    "@type": "Classification"
  },
  "@id": "http://mimathon.askem.eu/uc1/trees/1726",
  "@type": "Tree"
}
Same record visualized as a property graph. Each typed fragment
becomes a node, each property a labelled edge. The dashed link to
GBIF is a live IRI: taxonRef resolves to a stable,
dereferenceable taxon page on gbif.org.
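The node-and-edge reading described above can be sketched in a few lines. This is one possible interpretation, not the code behind the visualization: nested typed fragments become nodes, links between fragments and IRI-valued properties become labelled edges, and plain literals are kept as node attributes for brevity. The function name `to_property_graph` and the `_:bN` labels for fragments without an `@id` are assumptions.

```python
from itertools import count


def to_property_graph(record: dict, iri_keys=("taxonRef",)):
    """Split a nested JSON-LD style record into (nodes, edges).

    nodes: {node_id: {"type": ..., literal properties...}}
    edges: [(source_id, label, target_id)]
    """
    nodes, edges = {}, []
    blank_ids = count()

    def visit(fragment: dict) -> str:
        node_id = fragment.get("@id") or f"_:b{next(blank_ids)}"
        props = {"type": fragment.get("@type")}
        nodes[node_id] = props
        for key, value in fragment.items():
            if key.startswith("@"):
                continue
            if isinstance(value, dict):
                edges.append((node_id, key, visit(value)))  # edge to a nested fragment
            elif key in iri_keys:
                edges.append((node_id, key, value))  # edge to an external IRI
            else:
                props[key] = value  # literal kept on the node
        return node_id

    visit(record)
    return nodes, edges


tree = {
    "@id": "http://mimathon.askem.eu/uc1/trees/1726",
    "@type": "Tree",
    "localId": "1726",
    "species": {
        "@type": "Species",
        "scientificName": "Magnolia grandiflora",
        "taxonRef": "https://www.gbif.org/species/9605163",
    },
    "location": {
        "@type": "Location",
        "latitude": 41.14581684711751,
        "longitude": -8.602456850857784,
    },
}

nodes, edges = to_property_graph(tree)
for edge in edges:
    print(edge)
```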
11 distinct scientific names, 10 with a stable GBIF taxonRef (the 1 unmatched is a genus-only label, Phoenix sp.).
| Scientific name | Common name (PT) | GBIF taxonRef | Trees |
|---|---|---|---|
| Metrosideros excelsa | Árvore-do-Fogo | 3185393 | 85 |
| Phoenix canariensis | Palmeira-das-Canárias | 7445284 | 46 |
| Platanus x acerifolia | Plataneiro | 3152815 | 37 |
| Araucaria heterophylla | Araucária-de-Norfolk | 2684969 | 27 |
| Magnolia grandiflora | Magnólia | 9605163 | 25 |
| Washingtonia robusta | Palmeira-Washingtonia | 5294595 | 7 |
| Liriodendron tulipifera | Tulipeiro-da-Virgínia | 3152861 | 7 |
| Araucaria bidwilii | Araucária-da-Queenslândia | 2684918 | 1 |
| Metrosideros robusta | Metrosidero | 3185294 | 1 |
| Ginkgo biloba | Ginkgo | 2687885 | 1 |
| Phoenix sp. | (none) | unmatched | 1 |
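The coverage figures quoted earlier (10 of 11 names, 237 of 238 trees) follow directly from this table. A short check, with the rows copied by hand from the table and `None` standing for the unmatched name:

```python
# (scientific name, GBIF key or None, number of trees), copied from the table above.
species_table = [
    ("Metrosideros excelsa", 3185393, 85),
    ("Phoenix canariensis", 7445284, 46),
    ("Platanus x acerifolia", 3152815, 37),
    ("Araucaria heterophylla", 2684969, 27),
    ("Magnolia grandiflora", 9605163, 25),
    ("Washingtonia robusta", 5294595, 7),
    ("Liriodendron tulipifera", 3152861, 7),
    ("Araucaria bidwilii", 2684918, 1),
    ("Metrosideros robusta", 3185294, 1),
    ("Ginkgo biloba", 2687885, 1),
    ("Phoenix sp.", None, 1),
]

total_trees = sum(trees for _, _, trees in species_table)
resolved_trees = sum(trees for _, key, trees in species_table if key is not None)
resolved_names = sum(1 for _, key, _ in species_table if key is not None)

print(f"{resolved_names} of {len(species_table)} names resolved")    # 10 of 11
print(f"{resolved_trees} of {total_trees} trees carry a taxonRef")   # 237 of 238
```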
10 distinct references after the per-record normalization (collapse of D. R. to D.R.).
Four of these ten references point to the same act
of 10 January 2005, with cosmetic variants
("D.R. nº 6", "D.R. n.º 6",
"D.R. n.º6", "D.R. nº 6 ... 2005-01-10"),
covering 214 of the 238 trees. A second pass, deduplicating
LegalAct by issued date, would collapse the count
from 10 to 7.
| Reference | classifiedOn | Trees |
|---|---|---|
| D.R. nº 6 II Série de 10/01/2005 | 2005-01-10, 2005-10-01 | 210 |
| D.R. nº 29 2ª Série Parte C de 11/02/2021 | 2021-02-11 | 12 |
| D.R. n.º 60/ 2019, Série II de 2019-03-26 | 2019-03-26 | 7 |
| D.R. n.º 99/2021, Série II de 2021/05/21 | 2021-01-07 | 2 |
| D.R. n.º 6 II Série de 10/01/2005 | 2005-01-10, 2008-01-10 | 2 |
| D.G. nº 204 II Série de 01/09/1950 | 1950-09-01 | 1 |
| D.R. n.º 53 / 2019, Série II de 2019-03-15 | 2019-03-15 | 1 |
| D.G. nº 280 II Série de 02/12/1939 | 1939-12-02 | 1 |
| D.R. n.º6 II Série de 10/01/2005 | 2005-01-10 | 1 |
| D.R. nº 6 II Série de 2005-01-10 | 2005-01-10 | 1 |
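The second pass mentioned above is not part of the harmonizer yet. The following is a rough sketch of what it could look like: it extracts the trailing "de <date>" from each reference, converts it to ISO 8601 and groups the references by that date. The helper name `issued_on` and the two regular expressions are assumptions tuned to the ten strings in the table. Grouping by date alone follows the idea stated in the text; a production version would probably also key on the journal and issue number.

```python
import re
from collections import defaultdict
from typing import Optional

_DMY = re.compile(r"\bde\s+(\d{2})/(\d{2})/(\d{4})\s*$")
_YMD = re.compile(r"\bde\s+(\d{4})[-/](\d{2})[-/](\d{2})\s*$")


def issued_on(reference: str) -> Optional[str]:
    """Extract the trailing 'de <date>' of a reference as an ISO 8601 date."""
    m = _DMY.search(reference)
    if m:
        day, month, year = m.groups()
        return f"{year}-{month}-{day}"
    m = _YMD.search(reference)
    if m:
        year, month, day = m.groups()
        return f"{year}-{month}-{day}"
    return None


# (reference, number of trees), copied from the table above.
references = [
    ("D.R. nº 6 II Série de 10/01/2005", 210),
    ("D.R. nº 29 2ª Série Parte C de 11/02/2021", 12),
    ("D.R. n.º 60/ 2019, Série II de 2019-03-26", 7),
    ("D.R. n.º 99/2021, Série II de 2021/05/21", 2),
    ("D.R. n.º 6 II Série de 10/01/2005", 2),
    ("D.G. nº 204 II Série de 01/09/1950", 1),
    ("D.R. n.º 53 / 2019, Série II de 2019-03-15", 1),
    ("D.G. nº 280 II Série de 02/12/1939", 1),
    ("D.R. n.º6 II Série de 10/01/2005", 1),
    ("D.R. nº 6 II Série de 2005-01-10", 1),
]

acts = defaultdict(lambda: {"variants": [], "trees": 0})
for reference, trees in references:
    act = acts[issued_on(reference)]
    act["variants"].append(reference)
    act["trees"] += trees

for date, act in sorted(acts.items(), key=lambda item: item[0] or ""):
    print(date, "-", len(act["variants"]), "variant(s),", act["trees"], "trees")
```

The extracted date is also a natural value for `LegalAct.issuedOn`, which the current adapter leaves empty.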
- Authority with name + acronym
- "Conjunto arbóreo (12 exemplares)" split into kind: TreeCluster + specimenCount: 12
- "81-100" mapped to enum Y81_100
- especie resolved against GBIF, taxonRef set to https://www.gbif.org/species/9605163
- latitude / longitude in a typed Location fragment
- @id assigned, every fragment carries an @type
Full source of the harmonizer, hosted alongside this page. Each file
is also a one-click download as raw .py. The whole package
is bundled as a tarball below.
The harmonizer is split in three layers so a new dataset means writing only the bottom layer, the adapter:
- Core (model.py, gbif.py, jsonld.py, __main__.py): never changes
- Transforms (transforms.py): generic helpers (clean_text, extract_count, match_keywords, Registry) reused by every adapter
- Adapter (adapters/<dataset>.py): one per dataset, exposes a single read(path) generator
Workflow: copy adapters/_template.py to
adapters/<your_name>.py, fill in the
read function, run
python -m harmonize --adapter <your_name> ....
Look at adapters/porto.py for a complete worked example.
"""Canonical urban tree model, mirrors trees.dolfin v0.2.0."""
from __future__ import annotations

from dataclasses import dataclass, field, asdict
from typing import Optional

# Source age-class labels mapped to the AgeRange enum of the Dolfin model.
AGE_RANGES = {
    "21-30": "Y21_30",
    "31-40": "Y31_40",
    "41-50": "Y41_50",
    "51-60": "Y51_60",
    "61-70": "Y61_70",
    "71-80": "Y71_80",
    "81-100": "Y81_100",
    "sup 100": "Over100",
    "": "Unknown",
    None: "Unknown",
}


@dataclass(frozen=True)
class Authority:
    name: str
    acronym: str


@dataclass(frozen=True)
class Species:
    scientificName: str
    commonName: Optional[str] = None
    taxonRef: Optional[str] = None


@dataclass(frozen=True)
class LegalAct:
    reference: str
    issuedOn: Optional[str] = None


@dataclass
class Classification:
    kind: str
    authority: Authority
    legalAct: LegalAct
    specimenCount: Optional[int] = None
    classifiedOn: Optional[str] = None


@dataclass
class Location:
    latitude: float
    longitude: float


@dataclass
class Tree:
    localId: str
    species: Species
    location: Location
    ageRange: Optional[str] = None
    classification: Optional[Classification] = None
    refFlowerBed: Optional[str] = None
    refGarden: Optional[str] = None


def normalize_age(raw: Optional[str]) -> str:
    if raw is None:
        return "Unknown"
    return AGE_RANGES.get(raw.strip(), "Unknown")
"""GBIF Backbone Taxonomy resolver, with on-disk cache.
Uses the public species/match endpoint, which fuzzy matches a scientific name
against the GBIF backbone and returns a stable usageKey. We turn that into
a canonical, resolvable URL: https://www.gbif.org/species/{usageKey}.
No external dependency, stdlib only.
"""
from __future__ import annotations

import json
import urllib.parse
import urllib.request
from pathlib import Path
from typing import Optional

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"
GBIF_SPECIES_PAGE = "https://www.gbif.org/species/{key}"


class GbifResolver:
    """Resolve scientific names to GBIF species URLs, cached to disk."""

    def __init__(self, cache_path: Path, timeout: float = 10.0):
        self.cache_path = Path(cache_path)
        self.timeout = timeout
        self._cache: dict[str, dict] = {}
        if self.cache_path.exists():
            self._cache = json.loads(self.cache_path.read_text(encoding="utf-8"))

    def resolve(self, scientific_name: str) -> Optional[dict]:
        """Return {url, usageKey, canonicalName, matchType} or None on failure.

        Result is cached, including misses (stored as {"miss": true}) so a
        re-run does not pound the API for unmatched names.
        """
        key = scientific_name.strip()
        if not key:
            return None
        if key in self._cache:
            entry = self._cache[key]
            return None if entry.get("miss") else entry

        params = urllib.parse.urlencode({"name": key, "verbose": "false"})
        url = f"{GBIF_MATCH_URL}?{params}"
        try:
            with urllib.request.urlopen(url, timeout=self.timeout) as r:
                payload = json.loads(r.read().decode("utf-8"))
        except Exception as e:
            # Network or decoding failure: report it, do not cache it.
            print(f" ! GBIF lookup failed for {key!r}: {e}")
            return None

        usage_key = payload.get("usageKey")
        if not usage_key or payload.get("matchType") == "NONE":
            self._cache[key] = {"miss": True}
            self._flush()
            return None

        entry = {
            "url": GBIF_SPECIES_PAGE.format(key=usage_key),
            "usageKey": usage_key,
            "canonicalName": payload.get("canonicalName") or payload.get("scientificName"),
            "matchType": payload.get("matchType"),
            "rank": payload.get("rank"),
        }
        self._cache[key] = entry
        self._flush()
        return entry

    def _flush(self) -> None:
        self.cache_path.parent.mkdir(parents=True, exist_ok=True)
        self.cache_path.write_text(
            json.dumps(self._cache, ensure_ascii=False, indent=2, sort_keys=True),
            encoding="utf-8",
        )
"""JSON-LD serializer for the canonical Tree model.
Produces a single document with a @graph of all trees. Inline objects
for Species, Authority, LegalAct, Classification and Location, with
type tags so each fragment is independently identifiable.
"""
from __future__ import annotations

from dataclasses import asdict
from typing import Iterable

from .model import Tree

NS = "http://mimathon.askem.eu/uc1/trees#"

CONTEXT = {
    "@vocab": NS,
    "Tree": NS + "Tree",
    "Species": NS + "Species",
    "Authority": NS + "Authority",
    "LegalAct": NS + "LegalAct",
    "Classification": NS + "Classification",
    "Location": NS + "Location",
    "taxonRef": {"@id": NS + "taxonRef", "@type": "@id"},
    "refFlowerBed": {"@id": NS + "refFlowerBed", "@type": "@id"},
    "refGarden": {"@id": NS + "refGarden", "@type": "@id"},
    # The W3C Basic Geo namespace IRI uses http, not https; IRIs are compared literally.
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
    "latitude": "geo:lat",
    "longitude": "geo:long",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "scientificName": "dwc:scientificName",
    "commonName": "dwc:vernacularName",
}


def _strip_none(d):
    """Recursively drop keys whose value is None."""
    if isinstance(d, dict):
        return {k: _strip_none(v) for k, v in d.items() if v is not None}
    if isinstance(d, list):
        return [_strip_none(x) for x in d]
    return d


def tree_to_node(tree: Tree, base_id: str) -> dict:
    d = asdict(tree)
    d["@id"] = f"{base_id}{tree.localId}"
    d["@type"] = "Tree"
    d["species"] = {"@type": "Species", **asdict(tree.species)}
    d["location"] = {"@type": "Location", **asdict(tree.location)}
    if tree.classification is not None:
        c = asdict(tree.classification)
        c["@type"] = "Classification"
        c["authority"] = {"@type": "Authority", **asdict(tree.classification.authority)}
        c["legalAct"] = {"@type": "LegalAct", **asdict(tree.classification.legalAct)}
        d["classification"] = c
    return _strip_none(d)


def build_document(trees: Iterable[Tree], base_id: str) -> dict:
    return {
        "@context": CONTEXT,
        "@graph": [tree_to_node(t, base_id) for t in trees],
    }
"""GeoJSON serializer for the canonical Tree model.
Emits a FeatureCollection in WGS84, suitable for any GIS tool
(geojson.io, QGIS, Mapbox, Leaflet). Nested entities are flattened
to a dotted-key namespace inside `properties` so the file is also
useful for non-LD-aware consumers.
"""
from __future__ import annotations

from dataclasses import asdict
from typing import Iterable

from .model import Tree


def _flatten(prefix: str, value, target: dict) -> None:
    """Write `value` into `target` under dotted keys, skipping None."""
    if value is None:
        return
    if isinstance(value, dict):
        for k, v in value.items():
            _flatten(f"{prefix}.{k}" if prefix else k, v, target)
    else:
        target[prefix] = value


def tree_to_feature(tree: Tree, base_id: str) -> dict:
    props: dict = {"@id": f"{base_id}{tree.localId}", "@type": "Tree"}
    _flatten("localId", tree.localId, props)
    _flatten("ageRange", tree.ageRange, props)
    _flatten("species", asdict(tree.species), props)
    if tree.classification is not None:
        _flatten("classification", asdict(tree.classification), props)
    if tree.refFlowerBed is not None:
        props["refFlowerBed"] = tree.refFlowerBed
    if tree.refGarden is not None:
        props["refGarden"] = tree.refGarden
    return {
        "type": "Feature",
        "id": tree.localId,
        "geometry": {
            "type": "Point",
            # GeoJSON order is [longitude, latitude].
            "coordinates": [tree.location.longitude, tree.location.latitude],
        },
        "properties": props,
    }


def build_collection(trees: Iterable[Tree], base_id: str) -> dict:
    return {
        "type": "FeatureCollection",
        "features": [tree_to_feature(t, base_id) for t in trees],
    }
"""CLI entry point.
Example:
python -m harmonize \
--adapter porto \
--input ../uc1-trees-porto.geojson \
--output ../out/trees.jsonld \
--base-id http://mimathon.askem.eu/uc1/trees/ \
--gbif-cache ../out/.gbif_cache.json
"""
from __future__ import annotations

import argparse
import importlib
import json
import sys
from dataclasses import replace
from pathlib import Path

from .gbif import GbifResolver
from .jsonld import build_document
from .model import Species, Tree


def _load_adapter(name: str):
    mod = importlib.import_module(f"harmonize.adapters.{name}")
    if not hasattr(mod, "read"):
        raise SystemExit(f"adapter {name!r} has no read(path) function")
    return mod


def _enrich_with_gbif(trees, resolver: GbifResolver):
    """Replace each tree.species with a copy carrying taxonRef when resolved."""
    seen: dict[str, Species] = {}
    for t in trees:
        sci = t.species.scientificName
        if sci in seen:
            # Each distinct name is resolved (and logged) only once per run.
            yield replace(t, species=seen[sci])
            continue
        match = resolver.resolve(sci)
        if match:
            enriched = replace(t.species, taxonRef=match["url"])
            print(f" GBIF {match['matchType']:>5} {sci} -> {match['url']}")
        else:
            enriched = t.species
            print(f" GBIF MISS {sci}")
        seen[sci] = enriched
        yield replace(t, species=enriched)


def main(argv=None) -> int:
    p = argparse.ArgumentParser(prog="harmonize", description="Harmonize an urban tree dataset to the canonical Tree model and emit JSON-LD.")
    p.add_argument("--adapter", required=True, help="Adapter module name under harmonize.adapters, e.g. porto")
    p.add_argument("--input", required=True, type=Path, help="Source dataset path")
    p.add_argument("--output", required=True, type=Path, help="Destination JSON-LD file")
    p.add_argument("--base-id", default="http://example.org/trees/", help="IRI prefix for tree @id values")
    p.add_argument("--gbif-cache", type=Path, default=Path(".gbif_cache.json"), help="On-disk GBIF lookup cache")
    p.add_argument("--no-gbif", action="store_true", help="Skip GBIF resolution (faster, offline)")
    args = p.parse_args(argv)

    adapter = _load_adapter(args.adapter)
    print(f"Reading via adapter '{args.adapter}' from {args.input}...")
    trees = list(adapter.read(args.input))
    print(f" {len(trees)} trees read")

    if not args.no_gbif:
        print(f"Resolving species against GBIF (cache: {args.gbif_cache})...")
        resolver = GbifResolver(args.gbif_cache)
        trees = list(_enrich_with_gbif(trees, resolver))

    print(f"Writing JSON-LD to {args.output}...")
    doc = build_document(trees, base_id=args.base_id)
    args.output.parent.mkdir(parents=True, exist_ok=True)
    args.output.write_text(json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8")
    print(f" done, {len(doc['@graph'])} entities in @graph")
    return 0


if __name__ == "__main__":
    sys.exit(main())
"""Reusable text transforms shared across adapters.
Adapters compose these helpers rather than reimplementing them. Helpers
are intentionally minimal: they only do generic text work (cleanup,
regex extraction, keyword routing). Anything dataset-specific belongs
in the adapter itself.
"""
from __future__ import annotations

import re
from typing import Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """Trim, collapse internal whitespace, return None for empty input."""
    if value is None:
        return None
    txt = re.sub(r"\s+", " ", str(value)).strip()
    return txt or None


def extract_count(value: Optional[str], pattern: str = r"\((\d+)") -> Optional[int]:
    """Pull an integer out of free text, e.g. '... (12 exemplares)' -> 12."""
    if value is None:
        return None
    m = re.search(pattern, value)
    return int(m.group(1)) if m else None


def match_keywords(value: Optional[str], keyword_map: dict[str, str]) -> Optional[str]:
    """Return the first enum value whose regex key matches the input.

    keyword_map: {regex_pattern: enum_value}, e.g.
        {r"conjunto\\s+arb[óo]re[op]": "TreeCluster",
         r"isolad": "IsolatedSpecimen"}

    Patterns are evaluated in insertion order, case-insensitive.
    """
    if not value:
        return None
    for pattern, enum_value in keyword_map.items():
        if re.search(pattern, value, re.IGNORECASE):
            return enum_value
    return None


class Registry:
    """Tiny dedupe registry for value-typed entities like Authority.

    Use when source data has many spelling variants of the same entity:
        reg = Registry({"ICNF": Authority(name="...", acronym="ICNF")})
        a = reg.resolve("ICNF (Instituto da Conservação ...)", needle="ICNF")

    The canonical instance is returned, ensuring downstream graphs share
    one node per real-world entity.
    """

    def __init__(self, known: dict | None = None):
        self._known = dict(known or {})

    def resolve(self, raw, needle: str | None = None, default=None):
        if raw is None:
            return default
        text = str(raw)
        if needle is not None and needle in text and needle in self._known:
            return self._known[needle]
        # Fall back to the first known key found anywhere in the raw text.
        for key, val in self._known.items():
            if key in text:
                return val
        return default

    def get(self, key: str):
        return self._known.get(key)
"""Skeleton adapter, copy and rename to add a new dataset.
Quick start:
1. Copy this file to harmonize/adapters/<your_dataset>.py
2. Replace the `read` function body with your own parsing
3. Run: python -m harmonize --adapter <your_dataset> --input ... --output ...
Contract:
Expose a single function `read(path) -> Iterator[Tree]`.
The CLI (harmonize/__main__.py) takes care of GBIF resolution and
JSON-LD output, so the adapter only has to map source records to
canonical Tree instances.
See harmonize/adapters/porto.py for a complete worked example covering
field renaming, enum normalization, kind/count splitting from free
text, and authority dedupe.
"""
from __future__ import annotations

from pathlib import Path
from typing import Iterator

from ..model import (
    AGE_RANGES,
    Authority, Classification, LegalAct, Location, Species, Tree,
    normalize_age,
)
from ..transforms import (
    Registry, clean_text, extract_count, match_keywords,
)

# Optional: pre-populate canonical entities you'll reuse, then dedupe via Registry.
_AUTHORITIES = Registry({
    # "ICNF": Authority(name="Instituto da Conservação ...", acronym="ICNF"),
})

# Optional: keyword-driven enum routing for messy free-text fields.
_KIND_KEYWORDS = {
    # r"conjunto\s+arb[óo]re[op]": "TreeCluster",
    # r"isolad|exemplar\s+isolado": "IsolatedSpecimen",
}


def read(path: str | Path) -> Iterator[Tree]:
    """Yield canonical Tree records from `path`.

    Replace the body below with your dataset's parsing logic. The
    important parts:
      - return Tree(localId=..., species=..., location=..., ...)
      - all string-valued attributes should already be cleaned
      - taxonRef is left for the CLI to fill via GBIF
      - if classification info is missing, set classification=None
    """
    # Example for a CSV: read it and iterate rows
    # import csv
    # with Path(path).open(encoding="utf-8") as f:
    #     for row in csv.DictReader(f):
    #         yield Tree(
    #             localId=row["id"],
    #             species=Species(
    #                 scientificName=clean_text(row["scientific_name"]) or "",
    #                 commonName=clean_text(row.get("common_name")),
    #             ),
    #             location=Location(
    #                 latitude=float(row["lat"]),
    #                 longitude=float(row["lon"]),
    #             ),
    #             ageRange=normalize_age(row.get("age_class")),
    #             classification=None,
    #         )
    raise NotImplementedError("Implement read() for your dataset, see porto.py")
"""Adapter for the Porto Open Data classified-trees GeoJSON.
Source schema (per feature.properties):
objectid -> Tree.localId
especie -> Species.scientificName
esp_nomecomum -> Species.commonName
arv_intervalo_idade -> Tree.ageRange
classif_tutela -> Classification.authority (parse)
classif_tipo -> Classification.kind + .specimenCount (parse)
classif_dec_lei_ref -> Classification.legalAct.reference (normalize)
classif_data -> Classification.classifiedOn
Geometry: Point in WGS84, [lon, lat] order.
"""
from __future__ import annotations

import json
import re
from pathlib import Path
from typing import Iterator

from ..model import (
    Authority, Classification, LegalAct, Location, Species, Tree, normalize_age,
)

SOURCE_ID = "porto-classified-trees"

_AUTHORITY_KNOWN = {
    "ICNF": Authority(
        name="Instituto da Conservação da Natureza e das Florestas",
        acronym="ICNF",
    ),
}


def _parse_authority(raw: str | None) -> Authority | None:
    if not raw:
        return None
    txt = raw.strip()
    if "ICNF" in txt:
        # All spelling variants collapse onto the single canonical instance.
        return _AUTHORITY_KNOWN["ICNF"]
    return Authority(name=txt, acronym=txt[:8])


# [op] also accepts the "arbórep" typo found in the source.
_KIND_CLUSTER = re.compile(r"conjunto\s+arb[óo]re[op]", re.IGNORECASE)
_KIND_ISOLATED = re.compile(r"(exemplar\s+isolado|isolada|árvore\s+isolada)", re.IGNORECASE)
_COUNT = re.compile(r"\((\d+)\s*exemplares?\)", re.IGNORECASE)


def _parse_kind_and_count(raw: str | None) -> tuple[str | None, int | None]:
    if not raw:
        return None, None
    txt = raw.strip()
    count = None
    m = _COUNT.search(txt)
    if m:
        count = int(m.group(1))
    if _KIND_CLUSTER.search(txt):
        return "TreeCluster", count
    if _KIND_ISOLATED.search(txt):
        return "IsolatedSpecimen", count
    return None, count


def _normalize_legal_ref(raw: str | None) -> str | None:
    if not raw:
        return None
    txt = raw.strip()
    txt = re.sub(r"D\.\s*R\.", "D.R.", txt)
    txt = re.sub(r"D\.\s*G\.", "D.G.", txt)
    txt = re.sub(r"\s+", " ", txt)
    return txt


def _build_classification(props: dict) -> Classification | None:
    authority = _parse_authority(props.get("classif_tutela"))
    kind, count = _parse_kind_and_count(props.get("classif_tipo"))
    legal_ref = _normalize_legal_ref(props.get("classif_dec_lei_ref"))
    classified_on = (props.get("classif_data") or "").strip() or None
    if not (authority or kind or legal_ref):
        return None
    return Classification(
        kind=kind or "IsolatedSpecimen",
        authority=authority or Authority(name="Unknown", acronym="UNK"),
        legalAct=LegalAct(reference=legal_ref or "(unspecified)"),
        specimenCount=count,
        classifiedOn=classified_on,
    )


def read(geojson_path: str | Path) -> Iterator[Tree]:
    """Yield canonical Tree records from a Porto GeoJSON file.

    GBIF resolution is left to the caller; species are emitted with
    scientificName / commonName only, taxonRef is set later.
    """
    data = json.loads(Path(geojson_path).read_text(encoding="utf-8"))
    for feat in data.get("features", []):
        props = feat.get("properties") or {}
        geom = feat.get("geometry") or {}
        coords = geom.get("coordinates") or [None, None]
        if len(coords) < 2 or coords[0] is None:
            continue  # skip features without a usable point geometry
        species = Species(
            scientificName=(props.get("especie") or "").strip(),
            commonName=((props.get("esp_nomecomum") or "").strip() or None),
        )
        if not species.scientificName:
            continue
        yield Tree(
            localId=str(props.get("objectid") or feat.get("id") or ""),
            species=species,
            # GeoJSON coordinates are [lon, lat].
            location=Location(latitude=float(coords[1]), longitude=float(coords[0])),
            ageRange=normalize_age(props.get("arv_intervalo_idade")),
            classification=_build_classification(props),
        )
Today an adapter is a small Python file. Two complementary ideas to push further: a declarative YAML mapping covering the easy cases (just field renames and enum lookups), with a generic config adapter that consumes it, and a harmonize new-adapter --name <city> command that scaffolds a starter file from _template.py with a fixture and a quick smoke test. Goal: bring a new dataset online in under 30 minutes, no Python required for the simple cases.
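As a first feel for the declarative idea, here is a hedged sketch of a mapping-driven adapter step. Nothing in it exists in the harmonizer today: the mapping layout (`fields`, `enums`, `casts`), the `CASTS` table and `apply_mapping` are all hypothetical. The mapping is written as a Python dict because the standard library has no YAML parser; a YAML file with the same shape would load into an equivalent structure.

```python
# Hypothetical declarative mapping for the Porto source (stand-in for a YAML file).
PORTO_MAPPING = {
    "fields": {
        "objectid": "localId",
        "especie": "species.scientificName",
        "esp_nomecomum": "species.commonName",
        "arv_intervalo_idade": "ageRange",
    },
    "enums": {
        "ageRange": {"81-100": "Y81_100", "sup 100": "Over100", "": "Unknown"},
    },
    "casts": {"localId": "string"},
}

# Cast names allowed in a mapping file, resolved to Python callables.
CASTS = {"string": str, "int": int, "float": float}


def apply_mapping(row: dict, mapping: dict) -> dict:
    """Rename source fields to dotted canonical paths, then apply enum lookups and casts."""
    out = {}
    enums = mapping.get("enums", {})
    casts = mapping.get("casts", {})
    for source, target in mapping["fields"].items():
        value = row.get(source)
        if isinstance(value, str):
            value = value.strip()
        if target in enums:
            # Missing or unlisted values fall back to "Unknown".
            value = enums[target].get(value or "", "Unknown")
        if value is None:
            continue
        if target in casts:
            value = CASTS[casts[target]](value)
        out[target] = value
    return out


row = {
    "objectid": 1726,
    "especie": "Magnolia grandiflora",
    "esp_nomecomum": "Magnólia",
    "arv_intervalo_idade": "81-100",
}
print(apply_mapping(row, PORTO_MAPPING))
```

Free-text splitting such as classif_tipo would still need Python, which matches the stated goal of covering only the simple cases declaratively.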
Translate the Dolfin canonical model into the SDM template (schema.json, examples, README), targeting dataModel.ParksAndGardens. Bring the gap finding to the SDM community as a candidate PR.
Pick another city's tree inventory and write a second adapter. Confirms the core is truly dataset-agnostic and lets us refine the canonical model on a second worked example.
Wire refFlowerBed and refGarden to actual SDM instances when available, especially for Conjunto arbóreo records that semantically belong to a planted area.