Add the implementation plan for building project taxa lists from a region #1366
mihow wants to merge 4 commits into
…) [skip ci] Staged, TDD-oriented implementation plan for building a project taxa list from a geographic region so class masking works out of the box. Covers the reusable core service, the union-with-provenance source merge (sources union, never intersect), Site/Project data-model fields, region derivation for backfill, the class-masking auto-resolution order, the five surfaces, a test plan, and a phased rollout. Refs #1364, #999, #1289 Co-Authored-By: Claude <noreply@anthropic.com>
…ip ci] Fold the model/DB-awareness requirement into the Proposal A plan as a first-class section. The regional-list generator now, by default, subsets the union of source species to those a classifier can actually predict (name in some AlgorithmCategoryMap label set), with an opt-in flag to also create uncovered regional species flagged as not classifiable. Confirmed by code reading that no persisted Taxon-to-Algorithm/CategoryMap link exists today (with_taxa() resolves names live, unpersisted), so the plan adds a persisted relationship (category-map-anchored M2M plus a denormalized Taxon.is_classifiable boolean) and a refresh path keyed on labels_hash. Updates the Result dataclass with explicit buckets, the data-model and test-plan sections, and the open-questions list. Refs #1364, #999, #1289 Co-Authored-By: Claude <noreply@anthropic.com>
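The default subsetting and the opt-in flag described in this commit can be sketched as below. This is a minimal illustration, not the plan's code: the bucket names on `Result`, the function name `subset_to_classifiable`, and the `include_uncovered` flag are all hypothetical, and exact string matching stands in for whatever name resolution the real service uses.

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    """Hypothetical buckets; the plan's real Result fields may differ."""
    included: list = field(default_factory=list)           # classifiable, goes on the list
    uncovered_created: list = field(default_factory=list)  # kept, flagged not classifiable
    uncovered_skipped: list = field(default_factory=list)  # dropped (default behaviour)


def subset_to_classifiable(regional_names, classifier_labels, include_uncovered=False):
    """Split the regional union by whether any classifier label matches the name.

    Assumption: a plain exact-string match against the label set.
    """
    labels = set(classifier_labels)
    result = Result()
    for name in sorted(set(regional_names)):  # de-duplicate, stable order
        if name in labels:
            result.included.append(name)
        elif include_uncovered:
            result.uncovered_created.append(name)
        else:
            result.uncovered_skipped.append(name)
    return result


regional = ["Actias luna", "Danaus plexippus", "Actias luna"]
labels = ["Actias luna", "Hyalophora cecropia"]
default_run = subset_to_classifiable(regional, labels)
opt_in_run = subset_to_classifiable(regional, labels, include_uncovered=True)
```

Keeping the skipped names in their own bucket, rather than discarding them silently, is what would let a caller report how many regional species no model can predict.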
#1364) [skip ci] Rename the persisted model-coverage relationship to the through-model the requester asked for: Taxon.covered_by_algorithms (M2M to Algorithm) so the list and UI can show which model is aware of a taxon, with Taxon.has_model_coverage as the denormalized boolean MVP. The category-map-anchored variant is retained as a noted deduplication alternative and open question, since many algorithms share one category map. Updates the data-model section, the options table and recommendation, the refresh helpers, the Result-consuming step, the test plan, and the open questions accordingly. Refs #1364, #999, #1289 Co-Authored-By: Claude <noreply@anthropic.com>
Phase 0 spike verified GBIF/iNat regional endpoints and the A3 reverse-geocode path against a live run, and measured 70% coverage of the 2497-label Quebec & Vermont classifier from a Vermont region list. Verdict: GO for Proposal A. Co-Authored-By: Claude <noreply@anthropic.com>
Phase 0 (de-risk spike) ran. Verdict: GO. Measured against the real 2497-label Quebec & Vermont classifier, a Vermont region list (GBIF ∪ iNat) covers 70.0% (1749/2497) of the classifier's labels, so the default masking list would keep 1749 classes and mask ~748 (30%) that neither source records in Vermont. GBIF alone reaches 69.8%; iNat adds little to the intersection but feeds the 550-species

Top risk to carry into Phase 1: name-join fragility. 30% of labels are absent from the region union, a mix of true regional absences and likely name/synonym mismatches that needs a sample audit. Full numbers, caveats, and the reproducible script are in
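For readers who want to check the arithmetic, the coverage figure is the share of classifier labels that also appear in the regional union. The sketch below is not the spike's script: the `label_coverage` helper is hypothetical and assumes an exact string match between label names and regional species names, which is the fragile join flagged as the top risk. Only the two counts at the bottom come from the comment above.

```python
def label_coverage(classifier_labels, regional_names):
    """Return (kept, masked, percent_kept) for a classifier against a region list.

    Assumption: labels and regional names are compared as exact strings.
    """
    labels = set(classifier_labels)
    kept = labels & set(regional_names)
    masked = labels - kept
    percent = 100.0 * len(kept) / len(labels) if labels else 0.0
    return len(kept), len(masked), percent


# Reproduce the reported figures from the reported counts.
kept, total = 1749, 2497
summary = f"{100 * kept / total:.1f}% kept, {total - kept} masked"
print(summary)  # 70.0% kept, 748 masked
```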
Summary
Class masking (#999) can cut a global classifier down to the species that actually occur at a site, but only if someone first curates a taxa list for the project — and today that means adding taxa one at a time, which is too tedious to expect of project owners. This PR does not add any code; it adds the implementation plan for the approach proposed in #1364 ("Proposal A"): let a user generate a regional species list automatically from an external biodiversity database (GBIF and/or iNaturalist) by giving a region code, so building a masking list becomes a single action.
The point of opening this as a draft is to get agreement on the design before implementation starts — in particular the two decisions that shape everything downstream: (1) when multiple sources are used they are combined as a wide union with per-source provenance, never intersected against each other; and (2) the list is, by default, restricted to species the classifiers can actually predict, with a stored "model coverage" relationship so the UI can be honest about regional species that no model will ever predict (many valid species lack training data). The same core service is designed to be reused by a management command, a Django admin action, an API endpoint, the main UI, and unit tests — which also lets us backfill regional lists for every existing project.
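Decision (1) can be illustrated with a small sketch. It assumes each source has already been reduced to a set of species names; the function name `merge_sources` and the source keys `"gbif"` and `"inat"` are illustrative, not taken from the plan.

```python
from collections import defaultdict


def merge_sources(species_by_source):
    """Wide union: keep every species reported by any source.

    Returns a mapping of species name -> set of sources that reported it,
    so provenance survives the merge. Nothing is dropped for appearing in
    only one source.
    """
    provenance = defaultdict(set)
    for source, names in species_by_source.items():
        for name in names:
            provenance[name].add(source)
    return dict(provenance)


merged = merge_sources({
    "gbif": {"Actias luna", "Danaus plexippus"},
    "inat": {"Danaus plexippus", "Hyalophora cecropia"},
})
```

An intersection would have kept only `Danaus plexippus` here. The union keeps all three species and still records that two of them rest on a single source.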
No migrations, models, or endpoints are included yet. This is the plan; the code lands in the phased slices it describes, starting with a Phase 0 spike to verify the external APIs (none of the GBIF/iNaturalist endpoints have been exercised against the live services — they are flagged CANDIDATE/UNVERIFIED throughout).
Planning for #1364. Design writeup: `docs/claude/planning/2026-07-02-regional-taxa-lists-class-masking.md`.

List of Changes
- `docs/claude/planning/2026-07-02-proposal-a-regional-taxa-lists-impl-plan.md`: 14 sections, docs-only, no executable change
- `generate_regional_taxa_list(...)` with a `Result` breakdown, surfaced through five thin wrappers (command, admin action, API, UI, tests)
- `RegionalSpeciesSource` protocol + a wide union merge that keeps per-source provenance; sources are never intersected with each other
- `Taxon.covered_by_algorithms` relationship (+ a denormalized `has_model_coverage` flag) plus a refresh path, so the UI can flag regional species no model can predict
- `region_source` / `region_code` / taxa-list fields on `Site` and `Project`, plus the coverage relationship, each with its migration, called out but not implemented here
- `taxa_list_mode="auto"` resolution ladder (occurrence → site → project) that no-ops until a region/list is configured; lands on the #999 branch since `class_masking.py` is not on `main` yet

The branch is docs-only. CI is skipped on these commits (`[skip ci]`). Nothing outside `docs/claude/planning/` is touched.
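The `taxa_list_mode="auto"` ladder in the list above can be sketched as a first-match lookup. The stub classes and the `taxa_list` attribute on each rung are assumptions made for illustration. In particular, the PR does not say what the occurrence rung reads, and `class_masking.py` on the #999 branch may structure this differently.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-ins for the real models; only the nesting matters here.
@dataclass
class ProjectStub:
    taxa_list: Optional[list] = None


@dataclass
class SiteStub:
    project: ProjectStub
    taxa_list: Optional[list] = None


@dataclass
class OccurrenceStub:
    site: SiteStub
    taxa_list: Optional[list] = None


def resolve_taxa_list(occurrence):
    """Walk occurrence -> site -> project and return the first configured list.

    Returns None when nothing is configured, so the caller can skip masking
    (the no-op case described above).
    """
    ladder = (
        occurrence.taxa_list,
        occurrence.site.taxa_list,
        occurrence.site.project.taxa_list,
    )
    for candidate in ladder:
        if candidate is not None:
            return candidate
    return None


project = ProjectStub(taxa_list=["Actias luna"])
site = SiteStub(project=project)
occurrence = OccurrenceStub(site=site)
resolved = resolve_taxa_list(occurrence)  # falls through to the project list
```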