How the data is made.
Latent aggregates cold-case records from official law-enforcement sources, normalizes them into a research-grade schema, and runs local LLM extraction over the narrative text to surface structured modus-operandi fields. Every algorithm is documented here, with explicit transparency about confidence, provenance, and limits.
Nothing here is a suspect identification. Every correlation and extracted field is a lead for human review, never a conclusion.
Who this is for, and what we promise.
Latent is a public-facing analytics platform for serious non-professional investigators: citizen detectives, journalists, students, data scientists, and genetic genealogists working cases that law-enforcement tools wouldn't give them access to. The promise is simple: every transformation we apply to a case is documented, every extracted field carries a verifiable link back to its source phrase, and every limitation we know about is surfaced rather than buried.
The framing matters. Latent surfaces structure from a corpus too large to read by hand, but structure is not a verdict. A shared weapon and a 30-mile radius make two cases worth comparing, not the same case. The methodology below exists so a careful reader can audit the substrate of any claim the interface makes, and decide for themselves what the signal is worth.
Coverage and per-source field completeness.
Latent ingests cold-case data from official law-enforcement sources only. Coverage today spans the states colored below; expand any state for the underlying sources, refresh cadence, geocoding strategy, and the caveats that travel with each one.
Coverage at a glance
10 states · 15 sources · 6,760 casesOhioOHbrowser-assisted index · web scrape · polite2,7961943–2024100%
Ohio Attorney General — Statewide Unsolved Homicides
Cases submitted by Ohio's county sheriffs and municipal police departments. The AG hosts and publishes; the named local agency owns each investigation. ASP.NET WebForms index is harvested via a browser userscript; detail pages are fetched programmatically.
www.ohioattorneygeneral.gov/ColdCase →ColoradoCOweb scrape · structured fields · polite1,4001911–2025100%
Colorado Bureau of Investigation — Cold Case Database
State compendium from 86 county and municipal agencies. Detail pages publish structured first-party demographics plus federal cross-references to NamUs + NCMEC. Re-runs reflect current state (cases solved between runs drop out automatically).
apps.colorado.gov/apps/coldcase/search.html →District of ColumbiaDCweb scrape · PDF parsing · polite7031979–202099%
Washington D.C. Metropolitan Police — Unsolved Homicides
MPDC bulletins consistently use Mr./Mrs./Ms./Miss; gender is extracted only from these explicit narrative signals, never from photos or first names.
Latent treats DC alongside states for filtering and display; the badge reads 'DC' and FIPS 11001 is the federal-district code.
FLFL—6811955–202076%
Florida Department of Law Enforcement - Unsolved Cases
VirginiaVACSV · manual refresh5251961–2023100%
Virginia VSPCCD Cold Case Database
UtahUTweb scrape · structured fields · polite2071949–202199%
Utah Bureau of Criminal Identification — Cold Case Database
Statutory compendium — Utah SB 160 (2018) mandates agency submission, making this a stronger model than voluntary programs.
bci.utah.gov/coldcases →MarylandMD(6 of 24 jurisdictions)6 sources2051970–202198%
Anne Arundel County, MD Police Department — Cold Cases
Co-killed victims modeled as separate cases with a victim-name suffix to keep IDs unique.
www.aacounty.org/police-department/crime-information/cold-cases →Baltimore County, MD Police Department — Unsolved Homicides
Baltimore County narratives never state race. Cases show race as 'not provided' rather than imputing from photographs or names.
Howard County, MD Police Department — Cold Cases
Includes IDENTIFIED-prefix cases — unsolved homicides where the previously-unknown victim has been DNA-named.
www.howardcountymd.gov/police/cold-cases →Harford County, MD Sheriff's Office — Cold Cases
Wicomico County, MD Sheriff's Office — Cold Cases (Homicides)
Paginated card-tile index. Cases titled 'Missing Person -' are filtered; homicides only.
www.wicomicosheriff.com/coldCases →Charles County, MD Sheriff's Office — Cold Cases (Homicides)
Index page groups cases by category; we ingest only the Homicides section. ARREST MADE entries are filtered out.
www.ccso.us/crime/cold-cases →Maryland — sub-state coverage
6 of 24 jurisdictionsLatent ingests only first-party county or municipal sources. Where a Maryland jurisdiction has no public cold-case roster, the cases aren't here — by source-integrity choice, not omission.
Queen Anne's County hosts a 2-case page (1 homicide, 1 missing person); below the threshold for a dedicated adapter. Revisit when the page grows.
New HampshireNHJSON API + detail scrape · hybrid1151960–2015100%
New Hampshire DOJ — Cold Case Unit Victim List
NH DOJ does not publish race and Latent does not infer it. The correlation engine treats race as unknown for these cases (it doesn't contribute to scoring).
NH publishes both unsolved homicides and suspicious deaths. We preserve the distinction (`case_type=suspicious_death` for deaths where a homicide determination hasn't been made).
MNMNbrowser-assisted harvest · PDF parse · polite961964–2021100%
Minnesota BCA — Unsolved Cases Database
The BCA case list sits behind an Imperva web application firewall. Latent harvests the list out-of-band via a one-time browser console request against the SharePoint API; per-case Summary-of-Case PDFs are static assets fetched programmatically.
Weapon and location type are not published as structured fields. They are extracted from the SOC PDF narrative by the Tier-1/Tier-2 LLM pipeline and surfaced as a reviewable overlay, not source-of-truth.
WyomingWYweb scrape · structured fields · polite321974–2018100%
Wyoming Division of Criminal Investigation — Cold Case Homicides
Submit-from-counties compendium. The roster mixes homicide and standalone sexual-assault cases; Latent ingests homicides only. Weapon, scene type, and demographics are structured at source, so no LLM extraction is required.
wyomingdci.wyo.gov/operations/cold-cases →Per-source field completeness
what each source provides| Source | Name | Date | Addr. | City | State | Gender | Race | Age | Weapon | Loc. | Narr. | Photo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OHBCI | ● | ● | ● | ● | ● | ● | ● | ● | ◐ | ◐ | ● | ● |
| COCBI | ● | ● | ● | ● | ● | ● | ● | ● | ◐ | ◐ | ● | ● |
| DCMPDC | ● | ● | ● | ● | ● | ◐ | × | × | ◐ | ◐ | ● | ● |
| FLFDLE_FL | — | — | — | — | — | — | — | — | — | — | — | — |
| VAVSPCCD | ● | ● | ● | ● | ● | ● | ● | ● | ◐ | ◐ | ● | — |
| UTBCI | ● | ● | ● | ● | ● | ● | ● | ● | ◐ | ◐ | ● | ● |
| NHNH DOJ | ● | ◐ | ◐ | ● | ● | ◐ | × | ● | ◐ | ◐ | ● | — |
| MNMN BCA | ● | ● | ◐ | ● | ● | ◐ | ◐ | ● | ◐ | ◐ | ● | ● |
| MDAnne Arundel | ● | ◐ | ◐ | ◑ | ● | ◐ | × | ◐ | ◐ | ◐ | ● | ● |
| MDBaltimore County | ● | ◐ | ◐ | ◑ | ● | ◐ | × | ◐ | ◐ | ◐ | ● | ● |
| WYDCI | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● | ● |
| MDHoward | ● | ◐ | ◐ | ◑ | ● | ◐ | × | ◐ | ◐ | ◐ | ● | ● |
| MDHarford | ● | ◐ | ◐ | ◑ | ● | ◐ | × | ◐ | ◐ | ◐ | ● | ● |
| MDWicomico | ● | ◐ | ◐ | ◑ | ● | ◐ | × | ◐ | ◐ | ◐ | ● | ● |
| MDCharles | ● | ◐ | ◐ | ◑ | ● | ◐ | × | ◐ | ◐ | ◐ | ● | ● |
Cells where a field is structural (source-provided) carry the strongest provenance. Cells marked LLM-extracted point at a verbatim source phrase via the case detail page's Source ▸ reveal. Cells marked not in source are honest absences. Latent does not infer demographics from photographs or names.
Victim race is recorded exactly as the source agency reported it and shown that way on each case. For filtering only, the many source spellings of one category (Black, BLACK or AFRICAN AMERICAN, African American) are grouped into standard FBI/NIBRS categories so the filter does not fragment; the original value is never overwritten. Hispanic or Latino is an ethnicity rather than a race, but some agencies record it in the race field, so it is carried as a category here. This is a fixed grouping of source-provided values, not inference.
What gets pulled from each narrative.
Local LLM extraction surfaces structured modus-operandi fields from case narratives. How much runs depends on the source: weapon and location type are extracted wherever a narrative exists; dates, addresses, and demographics are extracted only where the source publishes prose instead of structured fields, and sources that already publish structured records need no extraction at all. The §02 completeness matrix shows exactly which fields are extracted per source.
The model that produces these classifications is Qwen3:14b, an open-weights model run locally on Latent infrastructure. No case data is sent to third-party AI services. Privacy and local execution are the trust signals that matter for investigative work; running everything on local hardware means a case you research here doesn't leak to OpenAI, Anthropic, or any other vendor.
Where a source already publishes a field as structured data ( Colorado, Utah, and Ohio ship victim demographics directly; Wyoming ships weapon and scene type too), Latent trusts the source value and does not re-derive it with the model. Extraction fills gaps; it never overrides first-party data.
Each classification carries one of two confidence states, plus a separate unknownfor cases where the narrative simply doesn't say.
- ·Model: Qwen3:14b (open weights)
- ·Inference: local hardware only
- ·Egress: zero, no third-party API calls
- ·Determinism: temperature pinned to 0.0
Confidence states
Source-span anchoring · live example
Validation
For data-quality audits, the table cold_case.case_extractions_current is the authoritative latest extraction per case. A LEFT JOIN against cold_case.cases reveals any cases missing an extraction (typically transient between ingestion and worker drain).
How "possibly related cases" gets ranked.
On each cold-case detail page, a small panel surfaces other cases that share enough structural similarity to be worth an investigator's attention. Two hard filters (50 mi / 20 yr) precede a five-dimension weighted score. The ranking is intentionally conservative: high-signal candidates only, not every plausible match.
Worked example
Why the weights are what they are
The two modus-operandi dimensions (weapon and location type) carry a weight of 1.0 each. The three victim demographics (gender, race, age band) carry 0.5. MO describes what the offender did: it is the discriminating signal for linking cases. Demographics describe who was harmed, which clusters for reasons that have nothing to do with a shared offender. A neighborhood's population is not a pattern. Weighting MO at twice demographics keeps a coincidental demographic overlap from outscoring a genuine behavioral one.
The score floor of 1.5 is the lowest total worth a reviewer's attention: it takes either one MO match plus one demographic, or all three demographics together. A pair that agrees on demographics alone just barely clears, and is shown as the weakest tier. The maximum of 3.5 is all five dimensions in agreement. The correlation engine reads these five weights from a single module, correlation/config.py, so the scoring you see described here is the scoring the engineactually runs.
What doesn't contribute
A classification of unknown on either case excludes that dimension from scoring. Two cases both classified weapon = unknown aren't matching on weapon; they're both cases where the narrative didn't specify a weapon. Same logic applies to missing demographics: if either case has no recorded victim gender, gender doesn't contribute.
A low-confidence MO classification (paraphrased anchor) still counts as a match for scoring purposes. The panel displays the “uncertain” qualifier alongside the matched value so users see the weakened evidence without clicking through.
Scope. Correlation compares cold cases to other cold cases within the same dataset. Cross-corpus correlation against other record types is out of scope for this version: the data shapes differ, the join is more expensive, and the framing of “potential leads” needs separate methodological care before it's sound.
Four precision tiers, one map.
Every case on the map carries one of four precision tiers, signaling how exactly the published source data lets us place the incident. Pins are styled differently per tier; click a county-precision pin to see the county boundary the case falls within.
| Tier | Source | What it means |
|---|---|---|
| street | Census or Google ROOFTOP | Exact address |
| landmark | Google GEOMETRIC_CENTER / APPROXIMATE | Road, neighborhood, or city centroid |
| zip | ZCTA centroid | Within this ZIP code area |
| county | County centroid | Somewhere in this county (click pin to see boundary) |
The tier assigned to any case is determined by the best match returned from a three-step geocoding ladder. First, we try the Census Geocoding Service (free, maintained by EODC) against the published address; if that fails, we step down to Google's Geocoding API; if that also fails, we assign a ZIP centroid (ZCTA from the published postal code if available), then fall back to the county centroid where the case occurred.
County-precision pinsare clickable on the map detail. Clicking zooms to show the full county boundary, a transparent wash that clarifies the uncertainty radius. Many cases from smaller counties or jurisdictions with incomplete narratives end up in tier 4; that's an honest representation of what the source told us, not a flaw in the geocoding.
What the data is not.
Every dataset assembled from public records carries selection effects. Latent's are worth naming up front: an analyst who knows the shape of the bias can correct for it; one who doesn't will read artifacts as findings.
Latent is a corpus of cold cases that an official publisher chose to make public and that Latent has so far ingested. It is not a census of cold cases, and aggregate statistics drawn from it inherit every gap below. None of this makes the data unusable; it makes the data something you read with the following in hand.
What we know we don't know.
Documented uncertainty is more useful than hidden uncertainty. The following are known v1 limitations, each with a route to resolution.
Take what you need; cite the substrate.
Underlying source data is published as public records by the originating agencies. Latent aggregates, structures, and annotates. Researchers and journalists are welcome to use whatever they find; a citation back to this dataset is appreciated and lets readers trace provenance themselves.
If you're considering re-using extracted fields in a downstream system: please retain the source-span provenance so your readers can audit the inference chain back to the original narrative. The whole point of anchoring extractions to verbatim phrases is that the trust signal travels with the data. Tip-line and corrections: courtney.e.hammond@gmail.com.
What changed, and when.
This page is versioned. Substantive changes to how the data is built or documented are logged here, so a claim you cite today can be checked against the methodology as it stood when you cited it.
Methodology page rebuilt: coverage map, per-source field-completeness matrix, dedicated bias & selection-effects section, and grounded correlation weighting.
New Hampshire DOJ ingested; suspicious_death added as a distinct case type for deaths not yet ruled homicides.
Google geocoding tier added between Census and ZIP-centroid fallback; four-tier precision visualization introduced.
Initial methodology published alongside the first security-audited dataset.