LIVE · DATASET v2026.05
11 states6d ago
METHODOLOGY · DATASET v2026.05 · UPDATED May 2026

How the data is made.

Latent aggregates cold-case records from official law-enforcement sources, normalizes them into a research-grade schema, and runs local LLM extraction over the narrative text to surface structured modus-operandi fields. Every algorithm is documented here, with explicit transparency about confidence, provenance, and limits.

Nothing here is a suspect identification. Every correlation and extracted field is a lead for human review, never a conclusion.

Cases indexed
6,760

1911–2025, across 15 active sources.

Geocoded coverage
97.3%

6,580 cases plot to coordinates; the remainder are pending geographic resolution.

Extraction precision
100 / 96.0%

Weapon and location-type classification, re-measured on qwen3:14b against the hand-labeled validation set.

Inference locality
100% local

Qwen3:14b runs on local infrastructure. No case data leaves Latent hardware.

§ 01Preamble
standing

Who this is for, and what we promise.

Latent is a public-facing analytics platform for serious non-professional investigators: citizen detectives, journalists, students, data scientists, and genetic genealogists working cases that law-enforcement tools wouldn't give them access to. The promise is simple: every transformation we apply to a case is documented, every extracted field carries a verifiable link back to its source phrase, and every limitation we know about is surfaced rather than buried.

The framing matters. Latent surfaces structure from a corpus too large to read by hand, but structure is not a verdict. A shared weapon and a 30-mile radius make two cases worth comparing, not the same case. The methodology below exists so a careful reader can audit the substrate of any claim the interface makes, and decide for themselves what the signal is worth.

§ 02Data sources
15 active

Coverage and per-source field completeness.

Latent ingests cold-case data from official law-enforcement sources only. Coverage today spans the states colored below; expand any state for the underlying sources, refresh cadence, geocoding strategy, and the caveats that travel with each one.

Coverage at a glance

10 states · 15 sources · 6,760 cases
Latent coverage mapAlabamaAlaskaArizonaColorado — coveredFlorida — coveredGeorgiaIndianaKansasMaineMassachusettsMinnesota — coveredNew JerseyNorth CarolinaNorth DakotaOklahomaPennsylvaniaSouth DakotaTexasWyoming — coveredConnecticutMissouriWest VirginiaIllinoisNew MexicoArkansasCaliforniaDelawareDistrict of Columbia — coveredHawaiiIowaKentuckyMaryland — coveredMichiganMississippiMontanaNew Hampshire — coveredNew YorkOhio — coveredOregonTennesseeUtah — coveredVirginia — coveredWashingtonWisconsinNebraskaSouth CarolinaIdahoNevadaVermontLouisianaRhode Island
CoveredPartial (county-level)Not yet
OhioOH
browser-assisted index · web scrape · polite
2,796
1943–2024
100%
BCIAttorney General's BCI · Statewide
browser-assisted index · web scrape · polite

Ohio Attorney General — Statewide Unsolved Homicides

Cases
2,796 · 100% geocoded
Year range
1943–2024
Geocoding
Census street → ZIP centroid → county centroid.

Cases submitted by Ohio's county sheriffs and municipal police departments. The AG hosts and publishes; the named local agency owns each investigation. ASP.NET WebForms index is harvested via a browser userscript; detail pages are fetched programmatically.

www.ohioattorneygeneral.gov/ColdCase
ColoradoCO
web scrape · structured fields · polite
1,400
1911–2025
100%
CBIBureau of Investigation · Statewide
web scrape · structured fields · polite

Colorado Bureau of Investigation — Cold Case Database

Cases
1,400 · 100% geocoded
Year range
1911–2025
Geocoding
Census street → ZIP centroid → county centroid.

State compendium from 86 county and municipal agencies. Detail pages publish structured first-party demographics plus federal cross-references to NamUs + NCMEC. Re-runs reflect current state (cases solved between runs drop out automatically).

apps.colorado.gov/apps/coldcase/search.html
District of ColumbiaDC
web scrape · PDF parsing · polite
703
1979–2020
99%
MPDCMetropolitan police · Washington, D.C.
web scrape · PDF parsing · polite

Washington D.C. Metropolitan Police — Unsolved Homicides

Cases
703 · 99% geocoded
Year range
1979–2020
Geocoding
Census street → ZIP centroid → DC-district centroid (FIPS 11001).
Gender from honorifics only

MPDC bulletins consistently use Mr./Mrs./Ms./Miss; gender is extracted only from these explicit narrative signals, never from photos or first names.

DC is a federal district, not a state

Latent treats DC alongside states for filtering and display; the badge reads 'DC' and FIPS 11001 is the federal-district code.

mpdc.dc.gov/page/unsolved-homicides
FLFL
681
1955–2020
76%
FDLE_FL ·

Florida Department of Law Enforcement - Unsolved Cases

Cases
681 · 76% geocoded
Year range
1955–2020
Geocoding
VirginiaVA
CSV · manual refresh
525
1961–2023
100%
VSPCCDState police · Statewide
CSV · manual refresh

Virginia VSPCCD Cold Case Database

Cases
525 · 100% geocoded
Year range
1961–2023
Geocoding
Source-provided coordinates (geocoding ladder skipped).
vspccd.dcjs.virginia.gov
UtahUT
web scrape · structured fields · polite
207
1949–2021
99%
BCIBureau of Criminal Identification · Statewide
web scrape · structured fields · polite

Utah Bureau of Criminal Identification — Cold Case Database

Cases
207 · 99% geocoded
Year range
1949–2021
Geocoding
Census street → ZIP centroid → county centroid.

Statutory compendium — Utah SB 160 (2018) mandates agency submission, making this a stronger model than voluntary programs.

bci.utah.gov/coldcases
MarylandMD
(6 of 24 jurisdictions)
6 sources
205
1970–2021
98%
Anne ArundelCounty police · Anne Arundel County
web scrape · polite

Anne Arundel County, MD Police Department — Cold Cases

Cases
75 · 99% geocoded
Year range
1970–2020
Geocoding
Census street → ZIP centroid → county centroid (FIPS 24003).

Co-killed victims modeled as separate cases with a victim-name suffix to keep IDs unique.

www.aacounty.org/police-department/crime-information/cold-cases
Baltimore CountyCounty police · Baltimore County
web scrape · polite

Baltimore County, MD Police Department — Unsolved Homicides

Cases
75 · 96% geocoded
Year range
1970–2015
Geocoding
Census street → ZIP centroid → county centroid.
Race not in narratives

Baltimore County narratives never state race. Cases show race as 'not provided' rather than imputing from photographs or names.

www.baltimorecountymd.gov/departments/police/unsolved/homicides
HowardCounty police · Howard County
web scrape · polite

Howard County, MD Police Department — Cold Cases

Cases
23 · 100% geocoded
Year range
1971–2017
Geocoding
Census street → ZIP centroid → county centroid (FIPS 24027).

Includes IDENTIFIED-prefix cases — unsolved homicides where the previously-unknown victim has been DNA-named.

www.howardcountymd.gov/police/cold-cases
HarfordSheriff's office · Harford County
web scrape · polite

Harford County, MD Sheriff's Office — Cold Cases

Cases
21 · 100% geocoded
Year range
1981–2021
Geocoding
Census street → ZIP centroid → county centroid (FIPS 24025).
harfordsheriff.org/wanted/cold-cases
WicomicoSheriff's office · Wicomico County
web scrape · polite

Wicomico County, MD Sheriff's Office — Cold Cases (Homicides)

Cases
6 · 100% geocoded
Year range
1990–2019
Geocoding
Census street → ZIP centroid → county centroid (FIPS 24045).

Paginated card-tile index. Cases titled 'Missing Person -' are filtered; homicides only.

www.wicomicosheriff.com/coldCases
CharlesSheriff's office · Charles County
web scrape · polite

Charles County, MD Sheriff's Office — Cold Cases (Homicides)

Cases
5 · 100% geocoded
Year range
1980–2008
Geocoding
Census street → ZIP centroid → county centroid (FIPS 24017).

Index page groups cases by category; we ingest only the Homicides section. ARREST MADE entries are filtered out.

www.ccso.us/crime/cold-cases

Maryland — sub-state coverage

6 of 24 jurisdictions

Latent ingests only first-party county or municipal sources. Where a Maryland jurisdiction has no public cold-case roster, the cases aren't here — by source-integrity choice, not omission.

Covered (6)
Baltimore County · Anne Arundel · Howard · Harford · Charles · Wicomico
No public roster (17)
Allegany · Baltimore City · Calvert · Caroline · Carroll · Cecil · Dorchester · Frederick · Garrett · Kent · Montgomery · Prince George's · Queen Anne's · St. Mary's · Talbot · Washington · Worcester
State-handled (1)
Somerset (Maryland State Police)

Queen Anne's County hosts a 2-case page (1 homicide, 1 missing person); below the threshold for a dedicated adapter. Revisit when the page grows.

New HampshireNH
JSON API + detail scrape · hybrid
115
1960–2015
100%
NH DOJAttorney General's Cold Case Unit · Statewide
JSON API + detail scrape · hybrid

New Hampshire DOJ — Cold Case Unit Victim List

Cases
115 · 100% geocoded
Year range
1960–2015
Geocoding
Census street → ZIP centroid → county centroid.
Race not published

NH DOJ does not publish race and Latent does not infer it. The correlation engine treats race as unknown for these cases (it doesn't contribute to scoring).

Two case_type values

NH publishes both unsolved homicides and suspicious deaths. We preserve the distinction (`case_type=suspicious_death` for deaths where a homicide determination hasn't been made).

www.doj.nh.gov/bureaus/new-hampshire-cold-case-unit/victim-list
MNMN
browser-assisted harvest · PDF parse · polite
96
1964–2021
100%
MN BCABureau of Criminal Apprehension · Statewide
browser-assisted harvest · PDF parse · polite

Minnesota BCA — Unsolved Cases Database

Cases
96 · 100% geocoded
Year range
1964–2021
Geocoding
Census street → ZIP centroid → county centroid.
WAF-gated case list

The BCA case list sits behind an Imperva web application firewall. Latent harvests the list out-of-band via a one-time browser console request against the SharePoint API; per-case Summary-of-Case PDFs are static assets fetched programmatically.

Narrative-derived modus operandi

Weapon and location type are not published as structured fields. They are extracted from the SOC PDF narrative by the Tier-1/Tier-2 LLM pipeline and surfaced as a reviewable overlay, not source-of-truth.

portal.dps.mn.gov/bca/unsolved-cases/Pages/database.aspx
WyomingWY
web scrape · structured fields · polite
32
1974–2018
100%
DCIDivision of Criminal Investigation · Statewide
web scrape · structured fields · polite

Wyoming Division of Criminal Investigation — Cold Case Homicides

Cases
32 · 100% geocoded
Year range
1974–2018
Geocoding
Census street → ZIP centroid → county centroid.

Submit-from-counties compendium. The roster mixes homicide and standalone sexual-assault cases; Latent ingests homicides only. Weapon, scene type, and demographics are structured at source, so no LLM extraction is required.

wyomingdci.wyo.gov/operations/cold-cases

Per-source field completeness

what each source provides
SourceNameDateAddr.CityStateGenderRaceAgeWeaponLoc.Narr.Photo
OHBCI
COCBI
DCMPDC××
FLFDLE_FL
VAVSPCCD
UTBCI
NHNH DOJ×
MNMN BCA
MDAnne Arundel×
MDBaltimore County×
WYDCI
MDHoward×
MDHarford×
MDWicomico×
MDCharles×
Source-providedLLM-extractedCounty-level only×Not in sourceNot collected

Cells where a field is structural (source-provided) carry the strongest provenance. Cells marked LLM-extracted point at a verbatim source phrase via the case detail page's Source ▸ reveal. Cells marked not in source are honest absences. Latent does not infer demographics from photographs or names.

Victim race is recorded exactly as the source agency reported it and shown that way on each case. For filtering only, the many source spellings of one category (Black, BLACK or AFRICAN AMERICAN, African American) are grouped into standard FBI/NIBRS categories so the filter does not fragment; the original value is never overwritten. Hispanic or Latino is an ethnicity rather than a race, but some agencies record it in the race field, so it is carried as a category here. This is a fixed grouping of source-provided values, not inference.

§ 03Modus operandi extraction
Tier 1 + Tier 2

What gets pulled from each narrative.

Local LLM extraction surfaces structured modus-operandi fields from case narratives. How much runs depends on the source: weapon and location type are extracted wherever a narrative exists; dates, addresses, and demographics are extracted only where the source publishes prose instead of structured fields, and sources that already publish structured records need no extraction at all. The §02 completeness matrix shows exactly which fields are extracted per source.

The model that produces these classifications is Qwen3:14b, an open-weights model run locally on Latent infrastructure. No case data is sent to third-party AI services. Privacy and local execution are the trust signals that matter for investigative work; running everything on local hardware means a case you research here doesn't leak to OpenAI, Anthropic, or any other vendor.

Where a source already publishes a field as structured data ( Colorado, Utah, and Ohio ship victim demographics directly; Wyoming ships weapon and scene type too), Latent trusts the source value and does not re-derive it with the model. Extraction fills gaps; it never overrides first-party data.

Each classification carries one of two confidence states, plus a separate unknownfor cases where the narrative simply doesn't say.

Locality
  • ·Model: Qwen3:14b (open weights)
  • ·Inference: local hardware only
  • ·Egress: zero, no third-party API calls
  • ·Determinism: temperature pinned to 0.0

Confidence states

Confident

Classification produced and anchored to a verbatim phrase in the narrative. The detail page's Source ▸ button reveals the exact span.

Uncertain

Classification stands but the model paraphrased (“at her place” vs “at her apartment”). The audit trail does not.

Unknown

The narrative doesn't specify the attribute. Excluded from scoring; surfaced as “not provided” on the detail page.

Source-span anchoring · live example

On October 10, 2010 at approximately 9:30 p.m., officers found the victim suffering from a gunshot wound on the ground. He was taken to the hospital where he later died of his injuries.

extractedweapon = firearmraw phrase“gunshot wound”spanchars 92–105confidenceconfident · anchored

Validation

Weapon · precision / recall
100 / 100%

Hand-labeled validation set, re-scored on every prompt or model change (current: qwen3:14b).

Location type · P/R
96.0 / 96.0%

Same validation set. Most failures are paraphrase, not misclassification.

Uncertain rate · location
17.9%

Defensible classification, paraphrased anchor. Known v1 limitation; will narrow with prompt iteration.

Audit

For data-quality audits, the table cold_case.case_extractions_current is the authoritative latest extraction per case. A LEFT JOIN against cold_case.cases reveals any cases missing an extraction (typically transient between ingestion and worker drain).

§ 04Correlation engine
5 dimensions · floor 1.5 / 3.5

How "possibly related cases" gets ranked.

On each cold-case detail page, a small panel surfaces other cases that share enough structural similarity to be worth an investigator's attention. Two hard filters (50 mi / 20 yr) precede a five-dimension weighted score. The ranking is intentionally conservative: high-signal candidates only, not every plausible match.

Geographic filter
≤ 50miles

Cross-metro pairs in serial-pattern data are usually unrelated. Geography stays bounded.

Temporal filter
≤ 20years

Time runs wide because cold-case patterns frequently span dormant periods.

Worked example

Dimension
Case A
Case B
Match
Score
Weapon
firearm
firearm
+1.0
Location type
public-outdoor
public-outdoor
+1.0
Victim gender
male
male
+0.5
Victim race
black
not recorded
0.0
Victim age band
30–39
30–39
+0.5
Score floor 1.5 · maximum 3.5 · ties broken by closer distance3.0 / 3.5

Why the weights are what they are

The two modus-operandi dimensions (weapon and location type) carry a weight of 1.0 each. The three victim demographics (gender, race, age band) carry 0.5. MO describes what the offender did: it is the discriminating signal for linking cases. Demographics describe who was harmed, which clusters for reasons that have nothing to do with a shared offender. A neighborhood's population is not a pattern. Weighting MO at twice demographics keeps a coincidental demographic overlap from outscoring a genuine behavioral one.

The score floor of 1.5 is the lowest total worth a reviewer's attention: it takes either one MO match plus one demographic, or all three demographics together. A pair that agrees on demographics alone just barely clears, and is shown as the weakest tier. The maximum of 3.5 is all five dimensions in agreement. The correlation engine reads these five weights from a single module, correlation/config.py, so the scoring you see described here is the scoring the engineactually runs.

What doesn't contribute

A classification of unknown on either case excludes that dimension from scoring. Two cases both classified weapon = unknown aren't matching on weapon; they're both cases where the narrative didn't specify a weapon. Same logic applies to missing demographics: if either case has no recorded victim gender, gender doesn't contribute.

A low-confidence MO classification (paraphrased anchor) still counts as a match for scoring purposes. The panel displays the “uncertain” qualifier alongside the matched value so users see the weakened evidence without clicking through.

Scope. Correlation compares cold cases to other cold cases within the same dataset. Cross-corpus correlation against other record types is out of scope for this version: the data shapes differ, the join is more expensive, and the framing of “potential leads” needs separate methodological care before it's sound.

§ 05Geocoding precision
tiers 1–4

Four precision tiers, one map.

Every case on the map carries one of four precision tiers, signaling how exactly the published source data lets us place the incident. Pins are styled differently per tier; click a county-precision pin to see the county boundary the case falls within.

TierSourceWhat it means
streetCensus or Google ROOFTOPExact address
landmarkGoogle GEOMETRIC_CENTER / APPROXIMATERoad, neighborhood, or city centroid
zipZCTA centroidWithin this ZIP code area
countyCounty centroidSomewhere in this county (click pin to see boundary)

The tier assigned to any case is determined by the best match returned from a three-step geocoding ladder. First, we try the Census Geocoding Service (free, maintained by EODC) against the published address; if that fails, we step down to Google's Geocoding API; if that also fails, we assign a ZIP centroid (ZCTA from the published postal code if available), then fall back to the county centroid where the case occurred.

County-precision pinsare clickable on the map detail. Clicking zooms to show the full county boundary, a transparent wash that clarifies the uncertainty radius. Many cases from smaller counties or jurisdictions with incomplete narratives end up in tier 4; that's an honest representation of what the source told us, not a flaw in the geocoding.

§ 06Bias & selection effects
read before analyzing

What the data is not.

Every dataset assembled from public records carries selection effects. Latent's are worth naming up front: an analyst who knows the shape of the bias can correct for it; one who doesn't will read artifacts as findings.

Latent is a corpus of cold cases that an official publisher chose to make public and that Latent has so far ingested. It is not a census of cold cases, and aggregate statistics drawn from it inherit every gap below. None of this makes the data unusable; it makes the data something you read with the following in hand.

Selection effect

Publication bias

Only cases an agency chose to post appear here. Agencies publish selectively: high-profile cases, cases where public tips help, cases that show the unit is active. A jurisdiction with no public roster contributes zero cases regardless of how many cold cases it carries.

Selection effect

Geographic coverage follows publishers, not need

Coverage tracks wherever an official source posts a usable roster, not population, not case density, not where help is most needed. Maryland appears county-by-county; most states aren't in yet. Cross-state comparison reflects ingestion order as much as anything real. See the §02 map for current shape.

Selection effect

The corpus skews to still-unsolved

Several sources are re-ingested live; cases solved between runs drop out. That keeps the data current but means Latent trends toward cases that remain open; it is not a historical record of every case that was ever cold.

Selection effect

Narrative framing carries through

Extracted MO fields come from narratives written by law enforcement, in the language and emphasis the agency chose. 'Location type' reflects what a narrative foregrounded; a detail the writer omitted is a detail Latent cannot recover.

Selection effect

Demographic completeness varies by source

Race is a structured first-party field in Colorado and Utah, absent from Maryland and DC narratives entirely. Any aggregate demographic figure is shaped by which states happen to be ingested, not a population-level measurement.

Selection effect

Sources span different record-keeping eras

Year ranges differ widely: Colorado reaches back to 1911, other sources are recent. Time-based analysis mixes decades of changing investigative practice, evidence standards, and what agencies recorded at all.

§ 07Limits & open questions
v1 known issues

What we know we don't know.

Documented uncertainty is more useful than hidden uncertainty. The following are known v1 limitations, each with a route to resolution.

Known limitation

Paraphrased location extractions

On the hand-labeled validation set, roughly one location classification in six resolves to a defensible paraphrase rather than a verbatim source span: the classification stands, the audit trail weakens. This is a validation-set rate, not a measured corpus-wide figure; it narrows with prompt iteration.

Known limitation

Manual refresh cadence on most sources

Most sources (Virginia VSPCCD among them) are re-ingested by hand when the publisher posts an update. There is no scheduled refresh today, so drift between a live site and our snapshot is bounded but real. The hero eyebrow shows the dataset's last ingestion date.

Known limitation

Cross-source correlation is deferred

Only cold-case ↔ cold-case in this version. Joining other record types needs methodological care before pairs are trustworthy.

Known limitation

Multi-victim semantics use primary victim

Demographic match against multi-victim cases pins to one victim per case. Rule is locked even at low incidence so behavior stays consistent as more multi-victim cases land.

Known limitation

Race is never inferred from photographs

An ethical guardrail, not a technical limitation. Cases without a source-stated race show as 'not provided' rather than imputing a value from imagery or names; the correlation engine treats race as unknown for those cases.

Known limitation

Single primary photo per case

Schema supports multiple photos per case but only the primary is rendered today. SHA256 captured for future dedup work.

§ 08Citation & reuse

Take what you need; cite the substrate.

Underlying source data is published as public records by the originating agencies. Latent aggregates, structures, and annotates. Researchers and journalists are welcome to use whatever they find; a citation back to this dataset is appreciated and lets readers trace provenance themselves.

Suggested citation
Latent Cold-Case Dataset. Dataset v2026.05, revised 2026-05-29.
Retrieved from https://latenttrace.org/methodology.

If you're considering re-using extracted fields in a downstream system: please retain the source-span provenance so your readers can audit the inference chain back to the original narrative. The whole point of anchoring extractions to verbatim phrases is that the trust signal travels with the data. Tip-line and corrections: courtney.e.hammond@gmail.com.

§ 09Revision history
4 entries

What changed, and when.

This page is versioned. Substantive changes to how the data is built or documented are logged here, so a claim you cite today can be checked against the methodology as it stood when you cited it.

  1. Methodology page rebuilt: coverage map, per-source field-completeness matrix, dedicated bias & selection-effects section, and grounded correlation weighting.

  2. New Hampshire DOJ ingested; suspicious_death added as a distinct case type for deaths not yet ruled homicides.

  3. Google geocoding tier added between Census and ZIP-centroid fallback; four-tier precision visualization introduced.

  4. Initial methodology published alongside the first security-audited dataset.