Summary: Information Access & Scientific Data Preservation Advocates
404 Media investigates the disappearance of more than 2,000 government datasets from data.gov following Trump's 2025 inauguration, documenting the apparent targeting of climate and diversity research. Working with archivists and researchers, the outlet establishes that while some deletions may be technical or routine, substantial evidence suggests intentional government erasure, raising critical concerns about access to scientific research and the fragility of digital preservation of government information.
> The outlet reports that deleted datasets "disproportionately" come from environmental science agencies like the Department of Energy, National Oceanic and Atmospheric Administration (NOAA), and the Environmental Protection Agency (EPA).
I’ve been archiving data.gov for over a year now and it’s not unusual to see large fluctuations on the order of hundreds or thousands of datasets. I’ve never bothered trying to figure out what exactly is changing, maybe I should build a tool for that…
I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.
There are lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?
One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.
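On the verification side, a BagIt bag's payload integrity can be checked against nothing but its checksum manifest; here is a minimal stdlib sketch (bag-nabit's certificate/signature layer over the tag files is a separate step and not shown):

```python
import hashlib
from pathlib import Path

def verify_bagit_payload(bag_dir: str) -> bool:
    """Check every entry in manifest-sha256.txt against the payload files.

    Covers only BagIt's checksum manifest; signature verification
    (as bag-nabit adds) would layer on top of this.
    """
    bag = Path(bag_dir)
    manifest = bag / "manifest-sha256.txt"
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        # manifest lines are "<hex digest>  <relative path>"
        expected, relpath = line.split(maxsplit=1)
        digest = hashlib.sha256((bag / relpath).read_bytes()).hexdigest()
        if digest != expected:
            return False
    return True
```

Any bit flip in a payload file changes its SHA-256 digest, so the function returns False for a tampered copy.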
Some open questions we'd love help on --
* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there are things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.
* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
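The rename-vs-delete problem above can be sketched as a slug-normalization pass; the trailing-hex-suffix pattern here is an assumption based on the glass-buttes example, and a real pass would need more munging (years like "-2019" also match it):

```python
import re

# Slugs like "2011-glass-buttes-exploration-and-drilling-535cf" appear to
# end in a short hex suffix; this regex is an assumption from that example.
SUFFIX = re.compile(r"-[0-9a-f]{3,8}$")

def diff_snapshots(old: set[str], new: set[str]):
    """Split slugs removed between snapshots into likely renames
    (same stem reappears with a new suffix) vs. true deletions."""
    removed, added = old - new, new - old
    added_stems = {SUFFIX.sub("", s) for s in added}
    renamed = {s for s in removed if SUFFIX.sub("", s) in added_stems}
    return renamed, removed - renamed
```

On the example above, the glass-buttes entry is classified as a rename, not a deletion.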
Still, even with best efforts this is such a shame. There is always going to be a question around governance over the data, integrity, and potentially chain of custody as well. If the goal is to muddy the waters and create a narrative that whatever might be in this data isn't reliable or accurate then mission accomplished. I don't see how anything can stop that.
Not to say the data isn't incredibly valuable and should be preserved for many other reasons of course. All the best to anyone archiving this, this is important work.
One of the USA's greatest strengths is the almost unprecedented degree of transparency of government records going back decades. We can actually see the true facts, including when our government has lied to us or covered things up. Many other nations do not have this luxury, and it has provided the evidentiary basis for both legal cases and "progress" in general. Not surprising that authoritarians would target and destroy data, as it makes their objective of a post-truth society that much easier.
What's a good way to be an "Archivist" on a low budget these days?
Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).
Is this a use case for Torrents? What's the most suitable architecture available today for this?
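Whatever the transport (torrents, rsync, plain HTTP), one low-tech starting point is a content-addressed layout on disk. A minimal sketch, keyed by SHA-256 rather than the MD5/SHA1 mentioned above, since both of those have known collision attacks:

```python
import hashlib
import shutil
from pathlib import Path

def store(path: str, root: str = "objects") -> str:
    """Copy a file into a content-addressed store and return its key.

    Peers can fetch an object by hash and re-verify it locally, so no
    trust in the storing party is needed for integrity.
    """
    data = Path(path).read_bytes()
    key = hashlib.sha256(data).hexdigest()
    dest = Path(root) / key[:2] / key  # fan out like git's object store
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, dest)
    return key
```

The two-character fan-out directory is just a filesystem nicety to avoid one huge flat directory.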
It's impressive that volunteers are stepping up to archive this. I understand the desire to keep this open data available.
How much of this sort of effort results in that data being used? Are there success stories for these datasets being discoverable enough and useful to others?
Beyond federal websites (.gov, .mil), there are a lot of gov contractor websites being taken down (presumably at the demand of agencies) that contain a wealth of information and years of project research.
Does anyone know if the St. Louis Federal Reserve (and I guess the Federal Reserve banks generally) is subject to presidential executive orders, or is it entirely responsible to the Federal Reserve Board and the St. Louis Bank president? FRED is the only dataset I access regularly.
Do we know what datasets these are? Do we actually have a diff here so we know what's been removed? There's a lot of assumptions being thrown around here, but we don't even know if this is some kind of malicious compliance. An actual list of what's been removed would probably clear the air a lot.
As one of the reddit comments (in the thread linked by the article) pointed out,
> During the start of Biden’s term, On 6th feb data.gov had “218,384 DATASETS” but on 7th feb it only had “192,180 DATASETS”
If the intention is to restore these data sets at some future date, when sanity has possibly been restored, then there needs to be a way to demonstrate that the archived data hasn't itself been modified. Without that, malign actors (e.g. oil/gas lobby) could very easily poison the future.
I think people are interested in archiving and the political image associated with that but I don’t think anybody cares about the content. Who is going to go back and read Biden era agency publications?
> Changes in presidential administrations have led to datasets being deleted in the past, either on purpose or by accident. When Biden took office, 1,000 datasets were deleted according to the Wayback Machine, via 404 Media's reporting.
Don’t worry, it is a matter of great doctrinal import that all scientific datasets be replaced with datasets that have been properly refined in accordance with scripture. /s
Maybe this administration will get better over time?
Politico reports that USDA landing pages regarding climate change were ordered to be deleted by a directive from the USDA's office of communications.
I think it is likely that orders to these other agencies follow this model. Many other datasets are being targeted via EO 14168, which has quite wide impacts but doesn't appear at first glance to apply to what I would expect to be part of NOAA and EPA reports.
Looks like the EPA is being targeted (even though ninety-five percent of the funding going to the EPA has not only been appropriated but is locked in as legally obligated grant funding; the Constitution does not give the president a line-item veto over Congress's spending decisions):
I'd love to learn more about what is in scope of the Library Innovation Lab projects. Is it targeting data.gov specifically or all government agency websites?
Given the rapid takedowns of websites (CDC, USAID), do you have a prioritization framework for which pages to archive first, or do you have "comprehensive" coverage of pages (in scope of the project)?
As you allude to, I've been having a hard time learning about what sort of duplicate work might be happening, given that there isn't a great "archived coverage" source of truth for government websites (between projects such as the End of Term archive, the Internet Archive, research labs, and independent archivists).
Your open questions are interesting. Content hashes for each page/resource would be a way to do quick comparisons, but I assume you might want to set some threshold to determine how much it changed, not just whether it changed?
Is the second question about figuring out how to prioritize valuable stuff behind a two-level traversal? (e.g. data.gov links to another website and that website has a csv download)
A common metric for how much actual content has changed is the Jaccard Index. Even for large numbers of datasets that are too large to fit in memory it can be approximated with various forms of MinHash algorithms. Some write up here: https://blog.nelhage.com/post/fuzzy-dedup/
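A toy version of that idea, with a single salted hash standing in for a proper permutation family (a production system would use a library such as datasketch):

```python
import hashlib

def minhash_signature(tokens, num_perm=128):
    """MinHash signature of a token set.

    blake2b with num_perm different salts stands in for a family of
    independent hash functions -- fine for a sketch, not for production.
    """
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(2, "big").ljust(8, b"\0")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Signatures are fixed-size regardless of dataset size, so pairwise comparison stays cheap even when the raw token sets don't fit in memory.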
I’m not an expert in such things, but this seems like a good use case for IPFS. Kinda similar to a torrent except that it is natively content-addressed (essentially the key to access is a hash of the data).
The pedestrian "right", which I encounter on a day-to-day basis the months I visit client sites a couple hundred miles inland of the Gulf of America, will look at climatelinks.org and say something like: "all I see are foreign countries, why are we spending money on this instead of citizens of the United States?".
In my experience, to archive effectively you need a physical datacenter footprint, or to rent capacity from someone who does. Over a longer timespan (even just 6 months), having your own footprint is a lower total cost of ownership, provided you have the skills, or access to someone with the skills, to run Kubernetes + Ceph (or something similar).
> Is this a use case for Torrents?
Yes, provided you have a good way to dynamically append to a distributed index of torrents, and users willing to run that software in addition to the torrent software. Should be easy enough to define in container-compose.
Set up a scrape using ArchiveTeam's fork of wget. It can save all the requests and responses into a single WARC file. Then you can use https://replayweb.page/ or some other tool to browse the contents.
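For a sense of what ends up inside that WARC file, here is a minimal sketch of one WARC/1.0 response record; real archives also carry WARC-Record-ID headers, payload digests, and paired request records, so treat this as illustration of the structure only:

```python
from datetime import datetime, timezone

def warc_response_record(url: str, http_payload: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record: a block of CRLF-separated
    headers, a blank line, the captured HTTP payload, then two CRLFs."""
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ])
    return headers.encode() + b"\r\n\r\n" + http_payload + b"\r\n\r\n"
```

Because the payload is the raw HTTP response (status line, headers, body), tools like replayweb.page can reconstruct the original exchange, not just the page content.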
> sign archives with email/domain/document certificates
I do a bit of web archival for fun, and have been thinking about something.
Currently I save both response body and response headers and request headers for the data I save from the net.
But I was thinking that maybe if instead of just saving that, I could go a level deeper and preserve actual TCP packets and TLS key exchange stuff.
And then I might be able to get a lot of data provenance "for free". Because in some decades, when we look back at the saved TCP packets and TLS material, we would see that these packets were exchanged under a certificate chain that matches what that website was serving at the time. Assuming of course that they haven't accidentally leaked their private keys in the meantime, that the CA hasn't gone rogue since, etc.
To me it would make sense to build out web archival infra that preserves the CA chain and enough material to show later that it was valid. And if many people across the world save the right parts, we don't have to trust each other in order to verify that data the other saved was really sent by the website our archives say it was from.
For example maybe I only archived a single page from some domain, and you saved a whole bunch of other pages from that domain around the same time so the same certificate chain was used in the responses to both of us. Then I can know that the data you are saying you archived from them really was served by their server because I have the certificate chain I saved to verify that.
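The easy half of that scheme, recording fingerprints of the DER certificate chain alongside the capture, is a few lines of stdlib Python; actually proving the chain was valid at capture time (CT logs, contemporaneous trust stores) is the hard part and is out of scope of this sketch:

```python
import hashlib

def chain_fingerprints(der_certs: list[bytes]) -> list[str]:
    """SHA-256 fingerprint of each DER-encoded certificate in a saved
    chain, in the usual colon-separated uppercase-hex form. Two archives
    that captured the same chain will record identical fingerprints."""
    out = []
    for der in der_certs:
        digest = hashlib.sha256(der).hexdigest().upper()
        out.append(":".join(digest[i:i + 2] for i in range(0, len(digest), 2)))
    return out
```

Comparing fingerprints lets two independent archivists confirm they saw the same certificate chain without exchanging the full captures.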
Historians, political scientists, and social scientists come to mind.
Editorial Channel
What the content says
+0.90
Article 27: Cultural Participation
High Advocacy Framing Practice
Editorial
+0.90
SETL
+0.76
Core article focus. The article documents loss of access to scientific research datasets and advocates for their preservation. Quotes Harvard and Stanford researchers working to preserve scientific data. Explicitly frames loss of climate, environmental, and research data as problematic.
FW Ratio: 57%
Observable Facts
The article documents specific scientific datasets now inaccessible: 'National Coral Reef Monitoring Program: Water Temperature Data from Subsurface Temperature Recorders (STRs) deployed at coral reef sites in the Hawaiian Archipelago from 2005 to 2019,' 'Stetson Flower Garden Banks Benthic_Covage Monitoring 1993-2018,' 'Three Dimensional Thermal Model of Newberry Volcano, Oregon.'
The article states: 'Disproportionately, the datasets that are no longer accessible through the portal come from the Department of Energy, the National Oceanic and Atmospheric Administration, the Department of the Interior, NASA, and the Environmental Protection Agency.'
Harvard researcher Jack Cushman is documented creating 'a full archive of the data.'
Stanford researcher James Jacobs is quoted discussing data preservation challenges.
Inferences
The article frames scientific data as something that requires deliberate protection and preservation infrastructure.
The documentation of specific scientific datasets emphasizes material loss to scientific progress and research continuity.
The focus on archivists and researchers' preservation work implies belief that maintaining access to scientific data is a human right.
+0.85
Article 19: Freedom of Expression
High Advocacy Framing Practice
Editorial
+0.85
SETL
+0.68
Core article focus. The content explicitly investigates and advocates for preservation of access to public information. Documents government removal of publicly-funded research and emphasizes importance of information access to democratic governance. Quotes multiple researchers committed to preserving information access.
FW Ratio: 63%
Observable Facts
The article states: 'More than 2,000 datasets have disappeared from data.gov since Trump was inaugurated.'
The article documents: 'Datasets aggregated on data.gov, the largest repository of U.S. government open data on the internet, are being deleted.'
Named researcher Jack Cushman is documented working to create 'a full archive of the data.'
The article states: 'Many of the deletions happened immediately after Trump was inaugurated, according to snapshots of the website saved on the Internet Archive's Wayback Machine.'
Multiple researchers are quoted discussing the importance and difficulty of preserving information access.
Inferences
The article frames loss of public data access as a significant human rights violation warranting extensive investigation and preservation efforts.
The article positions government data access and preservation as essential to informed democratic participation and right to information.
The extensive reporting on archival work implies editorial judgment that information access is a fundamental right.
+0.35
Article 2: Non-Discrimination
High Advocacy Framing
Editorial
+0.35
SETL
+0.32
The article documents and critiques targeted deletion of datasets related to diversity, equity, and inclusion research, and research about marginalized communities. Explicitly notes discriminatory targeting through executive order.
FW Ratio: 60%
Observable Facts
The article quotes: 'Trump issued an executive order asking all federal agencies to delete anything related to diversity, equity and inclusion.'
The article states: 'research about marginalized communities and minorities are among the datasets that have been purged.'
Named researcher Jack Cushman is documented working to archive data 'both before and after the inauguration.'
Inferences
The article frames DEI-related deletions as evidence of discriminatory government action targeting vulnerable populations.
The selection of this story for investigation suggests editorial judgment that discrimination in information access warrants scrutiny.
+0.25
Article 21: Political Participation
Low Framing
Editorial
+0.25
SETL
+0.22
Weak engagement. The article frames government data access as necessary for informed public participation, implying that right to information supports right to participate in governance.
FW Ratio: 67%
Observable Facts
The article describes data.gov as 'the largest repository of U.S. government open data on the internet.'
Researcher James Jacobs is quoted: 'There is a difference between the government changing a policy and the government erasing information.'
Inferences
The article implies that access to government information is necessary for citizens to participate meaningfully in democratic governance.
+0.20
Article 7: Equality Before Law
Medium Framing
Editorial
+0.20
SETL
+0.17
The article connects loss of access to information about marginalized communities (climate, equity research) with equal protection concerns, though not as primary focus.
FW Ratio: 67%
Observable Facts
The article lists Department of Interior, EPA, and NOAA as sources of disproportionately deleted datasets related to environment and equity.
Named researcher Mark Phillips discusses archiving as protection against loss of information access.
Inferences
The article implies that equal protection includes access to information about one's community and environment.
+0.20
Article 30: No Destruction of Rights
Low Framing
Editorial
+0.20
SETL
+0.17
The article distinguishes between legitimate policy changes and 'erasing information,' documenting targeted deletion of specific data categories as potentially abusive government action.
FW Ratio: 67%
Observable Facts
The article quotes James Jacobs: 'There is a difference between the government changing a policy and the government erasing information, but the line between those two has blurred in the digital age.'
The article states: 'some of the deletions are surely malicious information scrubbing, some are likely routine artifacts of an administration change.'
Inferences
The article frames intentional erasure of information as distinguishable from legitimate administrative change, implying it constitutes abuse.
+0.15
Preamble
Medium Framing
Editorial
+0.15
SETL
0.00
The article implicitly engages with human dignity by framing access to public information and scientific research as matters of fundamental importance to informed governance.
FW Ratio: 50%
Observable Facts
The article discusses government data repositories and research as matters of public concern and importance.
The article emphasizes that digital information requires deliberate preservation systems and distribution.
Inferences
The article frames information access and preservation as foundational to democratic dignity.
The extensive reporting on archival efforts implies belief that information preservation relates to fundamental rights.
+0.15
Article 25: Standard of Living
Low Framing
Editorial
+0.15
SETL
+0.12
Weak engagement. Disproportionate loss of environmental and health monitoring datasets (NOAA coral reef data, environmental agency data) implies connection to standard of living and health information access.
FW Ratio: 67%
Observable Facts
The article documents loss of 'National Coral Reef Monitoring Program: Water Temperature Data from Subsurface Temperature Recorders.'
The article lists 'Department of Energy...National Oceanic and Atmospheric Administration...Department of Interior, NASA, and Environmental Protection Agency' as sources of disproportionate deletions.
Inferences
Loss of environmental monitoring data could reduce public access to information necessary for understanding threats to health and welfare.
+0.10
Article 8: Right to Remedy
Low Advocacy
Editorial
+0.10
SETL
+0.07
Weak engagement: the article documents archivists and researchers working to preserve and recover deleted data, which constitutes a remedy-seeking response.
FW Ratio: 67%
Observable Facts
The article states: 'archivists and academics like Cushman are working on triaging the situation.'
The article notes: 'The End of Term Web Archive...archives as much as possible from government websites before a new administration takes over.'
Inferences
The article frames archival and preservation work as remedial responses to information loss.
+0.10
Article 28: Social & International Order
Low Framing
Editorial
+0.10
SETL
+0.07
Weak engagement. The article notes international collaboration in preservation efforts (Internet Archive, Common Crawl, University of North Texas partnerships).
FW Ratio: 50%
Observable Facts
The article quotes Mark Phillips: 'We've worked to collect 100s of terabytes of web content, which includes datasets from domains like data.gov' through 'help from our partners at the Internet Archive, Common Crawl, and the University of North Texas.'
Inferences
The article frames international collaboration in data preservation as necessary to protect global access to information.
0.00
Article 1: Freedom, Equality, Brotherhood
Low
Editorial
0.00
SETL
-0.10
No explicit engagement with equal dignity and rights.
FW Ratio: 100%
Observable Facts
The article is published without paywall or access restrictions.
ND
Article 3: Life, Liberty, Security
Not engaged.
ND
Article 4: No Slavery
Not engaged.
ND
Article 5: No Torture
Not engaged.
ND
Article 6: Legal Personhood
Not engaged.
ND
Article 9: No Arbitrary Detention
Not engaged.
ND
Article 10: Fair Hearing
Not engaged.
ND
Article 11: Presumption of Innocence
Not engaged.
ND
Article 12: Privacy
Not engaged.
ND
Article 13: Freedom of Movement
Not engaged.
ND
Article 14: Asylum
Not engaged.
ND
Article 15: Nationality
Not engaged.
ND
Article 16: Marriage & Family
Not engaged.
ND
Article 17: Property
Not engaged.
ND
Article 18: Freedom of Thought
Not engaged.
ND
Article 20: Assembly & Association
Not engaged.
ND
Article 22: Social Security
Not engaged.
ND
Article 23: Work & Equal Pay
Not engaged.
ND
Article 24: Rest & Leisure
Not engaged.
ND
Article 26: Education
Not engaged.
ND
Article 29: Duties to Community
Not engaged.
Structural Channel
What the site does
+0.30
Article 19: Freedom of Expression
High Advocacy Framing Practice
Structural
+0.30
Context Modifier
ND
SETL
+0.68
The site structurally supports freedom of expression through free public access to this journalism; publication itself constitutes support for information freedom.
+0.25
Article 27: Cultural Participation
High Advocacy Framing Practice
Structural
+0.25
Context Modifier
ND
SETL
+0.76
The site publishes journalism that reports on and advocates for preservation of scientific research access.
+0.15
Preamble
Medium Framing
Structural
+0.15
Context Modifier
ND
SETL
0.00
The site publishes freely accessible journalism that supports information preservation and transparency.
+0.10
Article 1: Freedom, Equality, Brotherhood
Low
Structural
+0.10
Context Modifier
ND
SETL
-0.10
The site's open access model provides equal access to information regardless of economic status.
+0.05
Article 2: Non-Discrimination
High Advocacy Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.32
The publication of this investigative report demonstrates structural commitment to documenting discrimination.
+0.05
Article 7: Equality Before Law
Medium Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.17
The publication documents threats to equal information access.
+0.05
Article 8: Right to Remedy
Low Advocacy
Structural
+0.05
Context Modifier
ND
SETL
+0.07
The site documents remedy efforts by archivists and preservation organizations.
+0.05
Article 21: Political Participation
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.22
The publication contributes to informed public discussion of governance issues.
+0.05
Article 25: Standard of Living
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.12
The publication of this concern demonstrates awareness of relationship between data access and welfare.
+0.05
Article 28: Social & International Order
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.07
The site documents collaborative international efforts in data preservation.
+0.05
Article 30: No Destruction of Rights
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.17
The publication documents and critiques abuse of information control.
ND
Article 3: Life, Liberty, Security
Not engaged.
ND
Article 4: No Slavery
Not engaged.
ND
Article 5: No Torture
Not engaged.
ND
Article 6: Legal Personhood
Not engaged.
ND
Article 9: No Arbitrary Detention
Not engaged.
ND
Article 10: Fair Hearing
Not engaged.
ND
Article 11: Presumption of Innocence
Not engaged.
ND
Article 12: Privacy
Not engaged.
ND
Article 13: Freedom of Movement
Not engaged.
ND
Article 14: Asylum
Not engaged.
ND
Article 15: Nationality
Not engaged.
ND
Article 16: Marriage & Family
Not engaged.
ND
Article 17: Property
Not engaged.
ND
Article 18: Freedom of Thought
Not engaged.
ND
Article 20: Assembly & Association
Not engaged.
ND
Article 22: Social Security
Not engaged.
ND
Article 23: Work & Equal Pay
Not engaged.
ND
Article 24: Rest & Leisure
Not engaged.
ND
Article 26: Education
Not engaged.
ND
Article 29: Duties to Community
Not engaged.
Supplementary Signals
How this content communicates, beyond directional lean.
build 08564a6+21y2 · deployed 2026-02-28 15:24 UTC · evaluated 2026-02-28 15:14:40 UTC