Generate a project taxa list from a region, and track which species the models can predict #1367
Draft
mihow wants to merge 10 commits into
…service Move Command.create_taxon() and get_or_create_root_taxon() out of the import_taxa management command and into ami.main.services.taxonomy, so the regional taxa-list service can reuse the same rank-hierarchy builder instead of re-deriving it. Behaviour is unchanged; import_taxa now calls the extracted functions instead of defining them locally. Co-Authored-By: Claude <noreply@anthropic.com>
…el-coverage relationship Part of #1364 (regional taxa lists for class masking), Phase 1. Adds the data-model plumbing for region-derived taxa lists: - Site/Project gain region_source, region_code, and a taxa_list / default_taxa_list FK, so a project or one of its research sites can be tied to a geographic region and a designated TaxaList. - Taxon gains covered_by_algorithms (M2M to ml.Algorithm) and the denormalized has_model_coverage boolean, answering "which classifier(s), if any, can predict this taxon" without a live label-set join at read time. Coverage is derived data, computed by ami.main.services.taxon_coverage from each algorithm's category map labels (the same Taxon.name == label join AlgorithmCategoryMap.with_taxa() uses for masking). Algorithm.save() refreshes coverage automatically whenever its category_map link changes; the refresh_taxon_model_coverage management command does a full rebuild for the initial backfill or to repair drift from a write path that bypasses the hook (e.g. a bulk_update). Co-Authored-By: Claude <noreply@anthropic.com>
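A minimal sketch of the coverage derivation this commit describes, using plain dicts and sets instead of Django models. The `compute_coverage` helper, the algorithm names, and the sample labels are hypothetical and only illustrate the `Taxon.name == label` join; they are not the code in `ami.main.services.taxon_coverage`.

```python
from collections import defaultdict


def compute_coverage(taxon_names, labels_by_algorithm):
    """Map each taxon name to the algorithms whose labels include it.

    Hypothetical stand-in for the name == label join; the real service
    operates on Taxon and Algorithm rows, not on plain Python containers.
    """
    covered_by = defaultdict(set)
    wanted = set(taxon_names)
    for algorithm, labels in labels_by_algorithm.items():
        for name in wanted.intersection(labels):
            covered_by[name].add(algorithm)
    # Denormalized flag: True when at least one algorithm can predict the taxon.
    return {
        name: {
            "covered_by": set(covered_by.get(name, set())),
            "has_model_coverage": name in covered_by,
        }
        for name in taxon_names
    }


# Illustrative data only: algorithm names and label sets are made up.
coverage = compute_coverage(
    ["Actias luna", "Danaus plexippus", "Apis mellifera"],
    {
        "moths_classifier": ["Actias luna", "Hyalophora cecropia"],
        "butterflies_classifier": ["Danaus plexippus", "Actias luna"],
    },
)
```

Storing the boolean next to the relation is what lets a reader answer "can any model predict this?" without repeating the join.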
Adds generate_regional_taxa_list(), the core service that turns a geographic region into a project-scoped TaxaList: fetch species recorded in the region from GBIF, merge multiple sources with a wide union (never an intersection - a species in any source is a candidate), map merged species onto Taxon rows (matching by GBIF/iNat key or name, creating missing ones via the shared taxonomy hierarchy builder), then restrict to species some classifier can actually predict using the persisted model-coverage relationship. By default the saved list keeps only model-covered species, since class masking can't do anything with a species no classifier knows. include_uncovered=True opts into keeping the rest too, honestly flagged has_model_coverage=False so the UI/reporting can distinguish "in the region" from "a model can predict it." A single classifier can also be passed for a report-only coverage count that never changes what's saved. Idempotent: re-running for the same (name, project) updates the existing list rather than creating a duplicate. GBIFRegionalSource is the first concrete source (occurrence search faceted by GBIF's speciesKey, endpoints exercised in the #1364 Phase 0 spike); iNaturalist can be added later behind the same RegionalSpeciesSource protocol without changing the merge or mapping logic. Every test uses a stubbed source or a monkeypatched HTTP session - no network calls in the suite. Co-Authored-By: Claude <noreply@anthropic.com>
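The merge and filter steps described in this commit can be sketched roughly as follows. `SpeciesRecord`, `merge_wide_union`, and `select_for_list` are hypothetical names, the keys and species are made-up sample values, and the real `merge_source_species()` and `apply_model_coverage()` may differ in shape.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class SpeciesRecord:
    """One species reported by a regional source (hypothetical shape)."""
    name: str
    gbif_key: Optional[int] = None
    sources: Set[str] = field(default_factory=set)


def merge_wide_union(species_by_source):
    """Union species across sources: present in any source means candidate."""
    merged = {}
    for source_name, records in species_by_source.items():
        for record in records:
            entry = merged.setdefault(record.name, SpeciesRecord(record.name))
            entry.sources.add(source_name)
            # Keep the first GBIF key seen for this name.
            if entry.gbif_key is None:
                entry.gbif_key = record.gbif_key
    return merged


def select_for_list(merged, covered_names, include_uncovered=False):
    """Return (name, has_model_coverage) pairs that would be saved."""
    rows = [(name, name in covered_names) for name in sorted(merged)]
    if include_uncovered:
        return rows
    return [row for row in rows if row[1]]


# Illustrative input: keys 1 and 2 are placeholders, not real GBIF keys.
merged = merge_wide_union({
    "gbif": [SpeciesRecord("Actias luna", 1), SpeciesRecord("Danaus plexippus", 2)],
    "inat": [SpeciesRecord("Actias luna"), SpeciesRecord("Apis mellifera")],
})
default_rows = select_for_list(merged, covered_names={"Actias luna"})
all_rows = select_for_list(merged, {"Actias luna"}, include_uncovered=True)
```

With the default, only the covered species survives. With `include_uncovered=True` all three are kept and the uncovered ones carry a `False` flag.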
logger.warn has been deprecated since Python 3.3. These two calls were moved verbatim from import_taxa.py into the extracted taxonomy service; switch them to logger.warning while they are being relocated. Co-Authored-By: Claude <noreply@anthropic.com>
… backfill Operator/backfill entry point over the regional-taxa service (#1364, Phase 2). Runs for one project with an explicit GADM region, or --all-projects derives each project's region from a representative deployment's coordinates via GBIF reverse-geocode (path A3). Adds reverse_geocode_gadm() to the GBIF client and derive_region_for_project() to the service, both with a test seam so nothing hits the network in CI. 11 new tests (reverse-geocode level selection, region derivation, command arg wiring + the two guards). Co-Authored-By: Claude <noreply@anthropic.com>
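As a sketch of the reverse-geocode level selection mentioned in this commit, the snippet below picks a GADM entry from a stubbed response. The payload shape (a list of dicts with `id`, `type`, `name`), the `GADM<n>` type strings, the placeholder ids, and the `pick_gadm_region` helper are all assumptions for illustration. Nothing here calls GBIF or mirrors `reverse_geocode_gadm()` exactly.

```python
import re

# Stubbed payload in an ASSUMED response shape; ids and names are placeholders.
SAMPLE_RESPONSE = [
    {"id": "XX", "type": "GADM0", "name": "Example country"},
    {"id": "XX.1_1", "type": "GADM1", "name": "Example state"},
    {"id": "XX.1.2_1", "type": "GADM2", "name": "Example county"},
    {"id": "XX", "type": "Political", "name": "Example country"},
]


def pick_gadm_region(locations, max_level=1):
    """Return the most specific GADM entry whose level is <= max_level.

    Entries that are not GADM regions are ignored; returns None when no
    GADM entry qualifies.
    """
    best = None
    for location in locations:
        match = re.fullmatch(r"GADM(\d)", location.get("type", ""))
        if not match:
            continue
        level = int(match.group(1))
        if level <= max_level and (best is None or level > best[0]):
            best = (level, location)
    return best[1] if best else None


state = pick_gadm_region(SAMPLE_RESPONSE)                 # level 1 by default
country = pick_gadm_region(SAMPLE_RESPONSE, max_level=0)  # country only
nothing = pick_gadm_region([{"id": "XX", "type": "Political"}])
```

Capping the level keeps a derived region from shrinking to a single county when a state-level list is what the classifier coverage was measured against.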
…axa list Adds a background task (generate_regional_taxa_list_task) and admin actions on the Project and Site changelists that enqueue it for rows with a region configured. The generated list is linked to project.default_taxa_list or site.taxa_list, which the masking auto-resolution reads. Runs off the request path because the external fetch is slow. Exposes the region_source/region_code/taxa-list fields in both admins. 4 new tests (task links list to project vs. site; actions enqueue only configured rows with the right scope). Part of #1364, Phase 2. Co-Authored-By: Claude <noreply@anthropic.com>
… the region Adds a taxa_list_mode to class masking. In 'auto' mode the taxa list is resolved from the scope's configured region instead of an operator picking one each run: an occurrence prefers its site's list, then its project's default; a collection resolves at the project level. When nothing is configured the run is a safe no-op, so a pipeline can enable masking before a project has set up a region. The explicit path (taxa_list_id) is unchanged and still the default. The admin form gains a source toggle. 12 new tests (config validation, the resolution ladder, the no-op path). Part of #1364, Phase 3. Co-Authored-By: Claude <noreply@anthropic.com>
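The resolution ladder described in this commit might look roughly like the sketch below. The `Scope` class, the `resolve_taxa_list` function, and the `"explicit"` mode name are hypothetical; only the `'auto'` mode, the site-then-project order, and the no-op outcome come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    """Hypothetical stand-in for a Site or Project with an optional list id."""
    taxa_list_id: Optional[int] = None


def resolve_taxa_list(mode, explicit_id=None, site=None, project=None):
    """Return the taxa list id to mask with, or None to skip masking."""
    if mode == "explicit":  # assumed name for the unchanged taxa_list_id path
        if explicit_id is None:
            raise ValueError("explicit mode requires taxa_list_id")
        return explicit_id
    if mode != "auto":
        raise ValueError(f"unknown taxa_list_mode: {mode!r}")
    # Ladder: the occurrence's site list first, then the project default.
    if site is not None and site.taxa_list_id is not None:
        return site.taxa_list_id
    if project is not None and project.taxa_list_id is not None:
        return project.taxa_list_id
    return None  # nothing configured: the caller treats this as a no-op


from_site = resolve_taxa_list("auto", site=Scope(7), project=Scope(3))
from_project = resolve_taxa_list("auto", site=Scope(), project=Scope(3))
unconfigured = resolve_taxa_list("auto", site=Scope(), project=Scope())
```

Returning `None` rather than raising is what makes it safe to enable masking before a project has configured a region.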
POST /projects/{id}/generate-regional-taxa-list/ enqueues the background
generation task and returns 202; the generated list becomes the project's
default_taxa_list. region_code may be omitted to derive it from the project's
deployments. Requires update permission on the project. 6 tests cover the
permission matrix (editor 202, non-editor and anonymous 403), body validation
(invalid source / underivable region -> 400), and region derivation. Part of
#1364, Phase 4.
Co-Authored-By: Claude <noreply@anthropic.com>
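The status codes covered by those tests can be summarized with a small decision sketch. This is not the view code: `handle_generate_request`, the `region_source` body key, the `ALLOWED_SOURCES` set, and the `derive_region` callable are assumptions used to illustrate the 202/400/403 outcomes.

```python
ALLOWED_SOURCES = {"gbif"}  # assumption: GBIF is the only source so far


def handle_generate_request(can_update_project, body, derive_region):
    """Return (status_code, payload) for a generate request (hypothetical)."""
    if not can_update_project:
        return 403, {"detail": "update permission required"}
    source = body.get("region_source", "gbif")
    if source not in ALLOWED_SOURCES:
        return 400, {"region_source": "invalid source"}
    # region_code may be omitted; fall back to deriving it from deployments.
    region_code = body.get("region_code") or derive_region()
    if not region_code:
        return 400, {"region_code": "could not derive a region"}
    # The real endpoint would enqueue the background task at this point.
    return 202, {"region_source": source, "region_code": region_code}


accepted = handle_generate_request(True, {"region_code": "XX.1_1"}, lambda: None)
derived = handle_generate_request(True, {}, lambda: "XX.2_1")
forbidden = handle_generate_request(False, {"region_code": "XX.1_1"}, lambda: None)
bad_source = handle_generate_request(True, {"region_source": "nope"}, lambda: "XX.1_1")
underivable = handle_generate_request(True, {}, lambda: None)
```

Checking permission before validating the body means an unauthorized caller learns nothing about which inputs are valid.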
…orithms apply_model_coverage previously called refresh_all_algorithm_coverage() whenever a run created any taxon, rewriting the covered_taxa relation for every algorithm. In the --all-projects backfill that is O(projects x algorithms). Replace it with a targeted refresh_coverage_for_taxa() that links only the just-created taxa to the category maps whose labels overlap their names (one overlap query, then adds), so per-run cost scales with new taxa, not the total algorithm/label count. Adds a test pinning that the targeted refresh covers only the named taxa. Co-Authored-By: Claude <noreply@anthropic.com>
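A rough in-memory illustration of the targeted refresh: only the newly created names are matched against category-map labels, and existing links are left alone. `refresh_coverage_for_names` and its arguments are hypothetical; the real `refresh_coverage_for_taxa()` is described as using a single overlap query against the database rather than a Python loop.

```python
def refresh_coverage_for_names(new_taxon_names, labels_by_category_map, coverage):
    """Link only the given taxa to category maps whose labels overlap them.

    coverage maps taxon name -> set of category-map ids and is updated in
    place. Returns the number of links added.
    """
    new_names = set(new_taxon_names)
    added = 0
    for map_id, labels in labels_by_category_map.items():
        for name in new_names.intersection(labels):
            links = coverage.setdefault(name, set())
            if map_id not in links:
                links.add(map_id)
                added += 1
    return added


# Illustrative data: one pre-existing link that must remain untouched.
coverage = {"Actias luna": {"map_a"}}
added = refresh_coverage_for_names(
    ["Danaus plexippus", "Unknown sp"],
    {
        "map_a": ["Actias luna", "Danaus plexippus"],
        "map_b": ["Danaus plexippus"],
    },
    coverage,
)
```

The work done is bounded by the names passed in, which is the property the commit's new test pins down.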
Summary
Class masking (#999) can cut a global classifier down to the species that actually occur where a project monitors, but only if someone first curates a taxa list — and doing that by hand, one taxon at a time, is too tedious to expect of project owners. This PR lets a project build that list automatically from a geographic region: give a region (or let it be derived from the project's deployments) and it pulls the species recorded there from an external biodiversity database, keeps the ones a classifier can actually predict, and saves them as the project's taxa list. Class masking can then resolve that list on its own.
The same core capability is reachable four ways: a management command (including an --all-projects backfill), Django admin actions, a REST endpoint, and unit tests all call one service, so we can also generate regional lists for every existing project in one pass. This is the backend and API; the in-app UI button is the remaining piece (see the checklist below).
Two design decisions are worth a reviewer's attention:
… Taxon ↔ Algorithm relationship. An opt-in mode also keeps regional species no model can predict, flagging each so the UI can say so.
Phase 0 measured this against the real 2497-label Quebec & Vermont classifier: a Vermont region list covers 70% of its labels, so the default masking list keeps ~1749 classes and drops ~748 that neither GBIF nor iNaturalist records in Vermont (full findings in docs/claude/analysis/, verdict: proceed).
Verification: the feature ships with its own tests (44 in tests_regional_taxa.py plus 12 for masking auto-mode), and the existing taxonomy/taxa-list/class-masking suites still pass. makemigrations --check is clean; black/isort/flake8 pass. No existing behavior changes until a project configures a region.
Planning: #1364. Plan/design PR: #1366.
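For readers checking the Phase 0 numbers, the quoted counts are mutually consistent. The snippet below only redoes the arithmetic from the figures in the summary (which are themselves approximate, marked with ~) and measures nothing new.

```python
# Figures quoted in the summary above.
total_labels = 2497
kept = 1749
dropped = 748

# Kept and dropped classes should account for every classifier label.
assert kept + dropped == total_labels

# Share of classifier labels covered by the regional list, in percent.
coverage_pct = round(100 * kept / total_labels, 1)
print(f"{coverage_pct}% of labels kept")
```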
List of Changes
- ami/main/services/regional_taxa.py: generate_regional_taxa_list(), the wide-union merge_source_species(), map_to_taxa(), apply_model_coverage(), Result with per-bucket counts
- services/gbif.py: occurrence facet by GADM region + species-name resolution + reverse-geocode; iNaturalist later behind the same protocol
- Taxon.covered_by_algorithms (M2M → Algorithm) + has_model_coverage flag; services/taxon_coverage.py refreshes it from category-map labels, hook keyed on labels_hash, targeted refresh for newly created taxa
- generate_regional_taxa_list management command with --all-projects (derives each region from deployments), --dry-run, --include-uncovered; refresh_taxon_model_coverage command
- ClassMaskingConfig.taxa_list_mode="auto" resolves the list from the occurrence's site, then the project's default; a no-op when nothing is configured, so masking is safe to enable by default
- POST /projects/{id}/generate-regional-taxa-list/ enqueues the task and returns 202; requires project update permission; region derivable from deployments
- region_source/region_code fields on Site and Project, plus taxa_list (Site) / default_taxa_list (Project) FKs; migration 0095
- create_taxon hierarchy builder moved from import_taxa into services/taxonomy.py (no behavior change)
Still to do