Summary: Information Access & Scientific Data Preservation Advocates
404 Media investigates the disappearance of more than 2,000 government datasets from data.gov following Trump's 2025 inauguration, documenting the apparent targeting of climate and diversity research. Working with archivists and researchers, the outlet establishes that while some deletions may be technical or routine, substantial evidence suggests intentional government erasure, raising critical concerns about access to scientific research and the fragility of digital preservation of government information.
> The outlet reports that deleted datasets "disproportionately" come from environmental science agencies like the Department of Energy, National Oceanic and Atmospheric Administration (NOAA), and the Environmental Protection Agency (EPA).
I’ve been archiving data.gov for over a year now and it’s not unusual to see large fluctuations on the order of hundreds or thousands of datasets. I’ve never bothered trying to figure out what exactly is changing, maybe I should build a tool for that…
I'm quoted in this article. Happy to discuss what we're working on at the Library Innovation Lab if anyone has questions.
There are lots of people making copies of things right now, which is great -- Lots Of Copies Keeps Stuff Safe. It's your data, why not have a copy?
One thing I think we can contribute here as an institution is timestamping and provenance. Our copy of data.gov is made with https://github.com/harvard-lil/bag-nabit , which extends BagIt format to sign archives with email/domain/document certificates. That way (once we have a public endpoint) you can make your own copy with rclone, pass it around, but still verify it hasn't been modified since we made it.
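On the verification side, a BagIt bag's payload integrity can be checked against nothing but its checksum manifest; here is a minimal stdlib sketch (bag-nabit's certificate/signature layer over the tag files is a separate step and not shown):

```python
import hashlib
from pathlib import Path

def verify_bagit_payload(bag_dir: str) -> bool:
    """Check every entry in manifest-sha256.txt against the payload files.

    Covers only BagIt's checksum manifest; signature verification
    (as bag-nabit adds) would layer on top of this.
    """
    bag = Path(bag_dir)
    manifest = bag / "manifest-sha256.txt"
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        # manifest lines are "<hex digest>  <relative path>"
        expected, relpath = line.split(maxsplit=1)
        digest = hashlib.sha256((bag / relpath).read_bytes()).hexdigest()
        if digest != expected:
            return False
    return True
```

Any bit flip in a payload file changes its SHA-256 digest, so the function returns False for a tampered copy.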
Some open questions we'd love help on --
* One is that it's hard to tell what's disappearing and what's just moving. If you do a raw comparison of snapshots, there are things like 2011-glass-buttes-exploration-and-drilling-535cf being replaced by 2011-glass-buttes-exploration-and-drilling-236cf, but it's still exactly the same data; it's a rename rather than a delete and add. We need some data munging to work out what's actually changing.
* Another is how to find the most valuable things to preserve that aren't directly linked from the catalog. If a data.gov entry links to a csv, we have it. If it links to an html landing page, we have the landing page. It would be great to do some analysis to figure out the most valuable stuff behind the landing pages.
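The rename-vs-delete problem above can be sketched as a slug-normalization pass; the trailing-hex-suffix pattern here is an assumption based on the glass-buttes example, and a real pass would need more munging (years like "-2019" also match it):

```python
import re

# Slugs like "2011-glass-buttes-exploration-and-drilling-535cf" appear to
# end in a short hex suffix; this regex is an assumption from that example.
SUFFIX = re.compile(r"-[0-9a-f]{3,8}$")

def diff_snapshots(old: set[str], new: set[str]):
    """Split slugs removed between snapshots into likely renames
    (same stem reappears with a new suffix) vs. true deletions."""
    removed, added = old - new, new - old
    added_stems = {SUFFIX.sub("", s) for s in added}
    renamed = {s for s in removed if SUFFIX.sub("", s) in added_stems}
    return renamed, removed - renamed
```

On the example above, the glass-buttes entry is classified as a rename, not a deletion.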
Still, even with best efforts this is such a shame. There is always going to be a question around governance over the data, integrity, and potentially chain of custody as well. If the goal is to muddy the waters and create a narrative that whatever might be in this data isn't reliable or accurate then mission accomplished. I don't see how anything can stop that.
Not to say the data isn't incredibly valuable and should be preserved for many other reasons of course. All the best to anyone archiving this, this is important work.
One of the USA's greatest strengths is the almost unprecedented degree of transparency of government records going back decades. We can actually see the true facts, including when our government has lied to us or covered things up. Many other nations do not have this luxury, and it has provided the evidentiary basis for both legal cases and "progress" in general. Not surprising that authoritarians would target and destroy data, as it makes their objective of a post-truth society that much easier.
What's a good way to be an "Archivist" on a low budget these days?
Say you have a few TBs of disk space, and you're willing to capture some public datasets (or parts of them) that interest you, and publish them in a friendly jurisdiction - keyed by their MD5/SHA1 - or make them available upon request. I.e. be part of a large open-source storage network, but only for objects/datasets you're willing to store (so there are no illegal shenanigans).
Is this a use case for Torrents? What's the most suitable architecture available today for this?
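Whatever the transport (torrents, rsync, plain HTTP), one low-tech starting point is a content-addressed layout on disk. A minimal sketch, keyed by SHA-256 rather than the MD5/SHA1 mentioned above, since both of those have known collision attacks:

```python
import hashlib
import shutil
from pathlib import Path

def store(path: str, root: str = "objects") -> str:
    """Copy a file into a content-addressed store and return its key.

    Peers can fetch an object by hash and re-verify it locally, so no
    trust in the storing party is needed for integrity.
    """
    data = Path(path).read_bytes()
    key = hashlib.sha256(data).hexdigest()
    dest = Path(root) / key[:2] / key  # fan out like git's object store
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(path, dest)
    return key
```

The two-character fan-out directory is just a filesystem nicety to avoid one huge flat directory.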
It's impressive that volunteers are stepping up to archive this. I understand the desire to keep this open data available.
How much of this sort of effort results in that data being used? Are there success stories for these datasets being discoverable enough and useful to others?
Beyond federal websites (.gov, .mil), there are a lot of gov contractor websites being taken down (presumably at the demand of agencies) that contain a wealth of information and years of project research.
Does anyone know if the St. Louis Federal Reserve (and I guess the Federal Reserve banks generally) is subject to presidential executive orders, or is it entirely responsible to the Federal Reserve Board and the St. Louis Bank president? FRED is the only dataset I access regularly.
Do we know what datasets these are? Do we actually have a diff here so we know what's been removed? There's a lot of assumptions being thrown around here, but we don't even know if this is some kind of malicious compliance. An actual list of what's been removed would probably clear the air a lot.
As one of the reddit comments (in the thread linked by the article) pointed out,
> During the start of Biden’s term, On 6th feb data.gov had “218,384 DATASETS” but on 7th feb it only had “192,180 DATASETS”
If the intention is to restore these data sets at some future date, when sanity has possibly been restored, then there needs to be a way to demonstrate that the archived data hasn't itself been modified. Without that, malign actors (e.g. oil/gas lobby) could very easily poison the future.
I think people are interested in archiving and the political image associated with that but I don’t think anybody cares about the content. Who is going to go back and read Biden era agency publications?
> Changes in presidential administrations have led to datasets being deleted in the past, either on purpose or by accident. When Biden took office, 1,000 datasets were deleted according to the Wayback Machine, via 404 Media's reporting.
Don’t worry, it is a matter of great doctrinal import that all scientific datasets be replaced with datasets that have been properly refined in accordance with scripture. /s
Maybe this administration will get better over time?
Politico reports that USDA landing pages regarding climate change were ordered to be deleted by a directive from the USDA's office of communications.
I think it is likely that orders to these other agencies follow this model. Many other datasets are being targeted via EO 14168, which has quite wide impacts but doesn't appear at first glance to apply to what I would expect to be part of NOAA and EPA reports.
Looks like the EPA is being targeted (even though ninety-five percent of the funding going to the EPA has not only been appropriated but is locked in as legally obligated grant funding; the Constitution does not give the president a line-item veto over Congress's spending decisions):
I'd love to learn more about what is in scope of the Library Innovation Lab projects. Is it targeting data.gov specifically or all government agency websites?
Given the rapid takedowns of websites (CDC, USAID), do you have a prioritization framework for which pages to archive first, or do you have "comprehensive" coverage of pages (in scope of the project)?
As you allude to, I've been having a hard time learning about what sort of duplicate work might be happening, given that there isn't a great "archived coverage" source of truth for government websites (between projects such as the End of Term archive, the Internet Archive, research labs, and independent archivists).
Your open questions are interesting. Content hashes for each page/resource would be a way to do quick comparisons, but I assume you might want to set some threshold to determine how much it changed, not just whether it changed?
Is the second question about figuring out how to prioritize valuable stuff behind a two-level traversal? (e.g. data.gov links to another website and that website has a csv download)
A common metric for how much actual content has changed is the Jaccard Index. Even for large numbers of datasets that are too large to fit in memory it can be approximated with various forms of MinHash algorithms. Some write up here: https://blog.nelhage.com/post/fuzzy-dedup/
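A toy version of that idea, with a single salted hash standing in for a proper permutation family (a production system would use a library such as datasketch):

```python
import hashlib

def minhash_signature(tokens, num_perm=128):
    """MinHash signature of a token set.

    blake2b with num_perm different salts stands in for a family of
    independent hash functions -- fine for a sketch, not for production.
    """
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(2, "big").ljust(8, b"\0")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode(), digest_size=8, salt=salt).digest(),
                "big",
            )
            for t in tokens
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates the Jaccard index."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Signatures are fixed-size regardless of dataset size, so pairwise comparison stays cheap even when the raw token sets don't fit in memory.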
I’m not an expert in such things, but this seems like a good use case for IPFS. Kinda similar to a torrent except that it is natively content-addressed (essentially the key to access is a hash of the data).
The pedestrian "right", which I encounter on a day-to-day basis the months I visit client sites a couple hundred miles inland of the Gulf of America, will look at climatelinks.org and say something like: "all I see are foreign countries, why are we spending money on this instead of citizens of the United States?".
In my experience, to archive effectively you need a physical datacenter footprint, or to rent capacity from someone who does. Over a longer timespan (even just 6 months), having your own footprint is a lower total cost of ownership, provided you have the skills, or access to someone with the skills, to run Kubernetes + Ceph (or something similar).
> Is this a use case for Torrents?
Yes, provided you have a good way to dynamically append to a distributed index of torrents, and users willing to run that software in addition to the torrent software. Should be easy enough to define in container-compose.
Set up a scrape using ArchiveTeam's fork of wget. It can save all the requests and responses into a single WARC file. Then you can use https://replayweb.page/ or some other tool to browse the contents.
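For a sense of what ends up inside that WARC file, here is a minimal sketch of one WARC/1.0 response record; real archives also carry WARC-Record-ID headers, payload digests, and paired request records, so treat this as illustration of the structure only:

```python
from datetime import datetime, timezone

def warc_response_record(url: str, http_payload: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record: a block of CRLF-separated
    headers, a blank line, the captured HTTP payload, then two CRLFs."""
    headers = "\r\n".join([
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Target-URI: {url}",
        "WARC-Date: " + datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ])
    return headers.encode() + b"\r\n\r\n" + http_payload + b"\r\n\r\n"
```

Because the payload is the raw HTTP response (status line, headers, body), tools like replayweb.page can reconstruct the original exchange, not just the page content.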
> sign archives with email/domain/document certificates
I do a bit of web archival for fun, and have been thinking about something.
Currently I save both response body and response headers and request headers for the data I save from the net.
But I was thinking that maybe if instead of just saving that, I could go a level deeper and preserve actual TCP packets and TLS key exchange stuff.
And then I might be able to get a lot of data provenance "for free". Because in some decades, when we look back at the saved TCP packets and TLS material, we would see that these packets were exchanged under a certificate chain that matches what that website was serving at the time. Assuming of course that they haven't accidentally leaked their private keys in the meantime, that the CA hasn't gone rogue since, etc.
To me it would make sense to build out web archival infra that preserves the CA chain and enough material to show later that it was valid. And if many people across the world save the right parts, we don't have to trust each other in order to verify that data the other saved was really sent by the website our archives say it was from.
For example maybe I only archived a single page from some domain, and you saved a whole bunch of other pages from that domain around the same time so the same certificate chain was used in the responses to both of us. Then I can know that the data you are saying you archived from them really was served by their server because I have the certificate chain I saved to verify that.
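The easy half of that scheme, recording fingerprints of the DER certificate chain alongside the capture, is a few lines of stdlib Python; actually proving the chain was valid at capture time (CT logs, contemporaneous trust stores) is the hard part and is out of scope of this sketch:

```python
import hashlib

def chain_fingerprints(der_certs: list[bytes]) -> list[str]:
    """SHA-256 fingerprint of each DER-encoded certificate in a saved
    chain, in the usual colon-separated uppercase-hex form. Two archives
    that captured the same chain will record identical fingerprints."""
    out = []
    for der in der_certs:
        digest = hashlib.sha256(der).hexdigest().upper()
        out.append(":".join(digest[i:i + 2] for i in range(0, len(digest), 2)))
    return out
```

Comparing fingerprints lets two independent archivists confirm they saw the same certificate chain without exchanging the full captures.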
Historians, political scientists, and social scientists come to mind.
Editorial Channel
What the content says
+0.90
Article 27: Cultural Participation
High Advocacy Framing Practice
Editorial
+0.90
SETL
+0.76
Core article focus. The article documents loss of access to scientific research datasets and advocates for their preservation. Quotes Harvard and Stanford researchers working to preserve scientific data. Explicitly frames loss of climate, environmental, and research data as problematic.
FW Ratio: 57%
Observable Facts
The article documents specific scientific datasets now inaccessible: 'National Coral Reef Monitoring Program: Water Temperature Data from Subsurface Temperature Recorders (STRs) deployed at coral reef sites in the Hawaiian Archipelago from 2005 to 2019,' 'Stetson Flower Garden Banks Benthic_Covage Monitoring 1993-2018,' 'Three Dimensional Thermal Model of Newberry Volcano, Oregon.'
The article states: 'Disproportionately, the datasets that are no longer accessible through the portal come from the Department of Energy, the National Oceanic and Atmospheric Administration, the Department of the Interior, NASA, and the Environmental Protection Agency.'
Harvard researcher Jack Cushman is documented creating 'a full archive of the data.'
Stanford researcher James Jacobs is quoted discussing data preservation challenges.
Inferences
The article frames scientific data as something that requires deliberate protection and preservation infrastructure.
The documentation of specific scientific datasets emphasizes material loss to scientific progress and research continuity.
The focus on archivists and researchers' preservation work implies belief that maintaining access to scientific data is a human right.
+0.85
Article 19: Freedom of Expression
High Advocacy Framing Practice
Editorial
+0.85
SETL
+0.68
Core article focus. The content explicitly investigates and advocates for preservation of access to public information. Documents government removal of publicly-funded research and emphasizes importance of information access to democratic governance. Quotes multiple researchers committed to preserving information access.
FW Ratio: 63%
Observable Facts
The article states: 'More than 2,000 datasets have disappeared from data.gov since Trump was inaugurated.'
The article documents: 'Datasets aggregated on data.gov, the largest repository of U.S. government open data on the internet, are being deleted.'
Named researcher Jack Cushman is documented working to create 'a full archive of the data.'
The article states: 'Many of the deletions happened immediately after Trump was inaugurated, according to snapshots of the website saved on the Internet Archive's Wayback Machine.'
Multiple researchers are quoted discussing the importance and difficulty of preserving information access.
Inferences
The article frames loss of public data access as a significant human rights violation warranting extensive investigation and preservation efforts.
The article positions government data access and preservation as essential to informed democratic participation and right to information.
The extensive reporting on archival work implies editorial judgment that information access is a fundamental right.
+0.35
Article 2: Non-Discrimination
High Advocacy Framing
Editorial
+0.35
SETL
+0.32
The article documents and critiques targeted deletion of datasets related to diversity, equity, and inclusion research, and research about marginalized communities. Explicitly notes discriminatory targeting through executive order.
FW Ratio: 60%
Observable Facts
The article quotes: 'Trump issued an executive order asking all federal agencies to delete anything related to diversity, equity and inclusion.'
The article states: 'research about marginalized communities and minorities are among the datasets that have been purged.'
Named researcher Jack Cushman is documented working to archive data 'both before and after the inauguration.'
Inferences
The article frames DEI-related deletions as evidence of discriminatory government action targeting vulnerable populations.
The selection of this story for investigation suggests editorial judgment that discrimination in information access warrants scrutiny.
+0.25
Article 21: Political Participation
Low Framing
Editorial
+0.25
SETL
+0.22
Weak engagement. The article frames government data access as necessary for informed public participation, implying that right to information supports right to participate in governance.
FW Ratio: 67%
Observable Facts
The article describes data.gov as 'the largest repository of U.S. government open data on the internet.'
Researcher James Jacobs is quoted: 'There is a difference between the government changing a policy and the government erasing information.'
Inferences
The article implies that access to government information is necessary for citizens to participate meaningfully in democratic governance.
+0.20
Article 7: Equality Before Law
Medium Framing
Editorial
+0.20
SETL
+0.17
The article connects loss of access to information about marginalized communities (climate, equity research) with equal protection concerns, though not as primary focus.
FW Ratio: 67%
Observable Facts
The article lists Department of Interior, EPA, and NOAA as sources of disproportionately deleted datasets related to environment and equity.
Named researcher Mark Phillips discusses archiving as protection against loss of information access.
Inferences
The article implies that equal protection includes access to information about one's community and environment.
+0.20
Article 30: No Destruction of Rights
Low Framing
Editorial
+0.20
SETL
+0.17
The article distinguishes between legitimate policy changes and 'erasing information,' documenting targeted deletion of specific data categories as potentially abusive government action.
FW Ratio: 67%
Observable Facts
The article quotes James Jacobs: 'There is a difference between the government changing a policy and the government erasing information, but the line between those two has blurred in the digital age.'
The article states: 'some of the deletions are surely malicious information scrubbing, some are likely routine artifacts of an administration change.'
Inferences
The article frames intentional erasure of information as distinguishable from legitimate administrative change, implying it constitutes abuse.
+0.15
Preamble
Medium Framing
Editorial
+0.15
SETL
0.00
The article implicitly engages with human dignity by framing access to public information and scientific research as matters of fundamental importance to informed governance.
FW Ratio: 50%
Observable Facts
The article discusses government data repositories and research as matters of public concern and importance.
The article emphasizes that digital information requires deliberate preservation systems and distribution.
Inferences
The article frames information access and preservation as foundational to democratic dignity.
The extensive reporting on archival efforts implies belief that information preservation relates to fundamental rights.
+0.15
Article 25: Standard of Living
Low Framing
Editorial
+0.15
SETL
+0.12
Weak engagement. Disproportionate loss of environmental and health monitoring datasets (NOAA coral reef data, environmental agency data) implies connection to standard of living and health information access.
FW Ratio: 67%
Observable Facts
The article documents loss of 'National Coral Reef Monitoring Program: Water Temperature Data from Subsurface Temperature Recorders.'
The article lists 'Department of Energy...National Oceanic and Atmospheric Administration...Department of Interior, NASA, and Environmental Protection Agency' as sources of disproportionate deletions.
Inferences
Loss of environmental monitoring data could reduce public access to information necessary for understanding threats to health and welfare.
+0.10
Article 8: Right to Remedy
Low Advocacy
Editorial
+0.10
SETL
+0.07
Weak engagement: the article documents archivists and researchers working to preserve and recover deleted data, which constitutes a remedy-seeking response.
FW Ratio: 67%
Observable Facts
The article states: 'archivists and academics like Cushman are working on triaging the situation.'
The article notes: 'The End of Term Web Archive...archives as much as possible from government websites before a new administration takes over.'
Inferences
The article frames archival and preservation work as remedial responses to information loss.
+0.10
Article 28: Social & International Order
Low Framing
Editorial
+0.10
SETL
+0.07
Weak engagement. The article notes international collaboration in preservation efforts (Internet Archive, Common Crawl, University of North Texas partnerships).
FW Ratio: 50%
Observable Facts
The article quotes Mark Phillips: 'We've worked to collect 100s of terabytes of web content, which includes datasets from domains like data.gov' through 'help from our partners at the Internet Archive, Common Crawl, and the University of North Texas.'
Inferences
The article frames international collaboration in data preservation as necessary to protect global access to information.
0.00
Article 1: Freedom, Equality, Brotherhood
Low
Editorial
0.00
SETL
-0.10
No explicit engagement with equal dignity and rights.
FW Ratio: 100%
Observable Facts
The article is published without paywall or access restrictions.
ND
Article 3: Life, Liberty, Security
Not engaged.
ND
Article 4: No Slavery
Not engaged.
ND
Article 5: No Torture
Not engaged.
ND
Article 6: Legal Personhood
Not engaged.
ND
Article 9: No Arbitrary Detention
Not engaged.
ND
Article 10: Fair Hearing
Not engaged.
ND
Article 11: Presumption of Innocence
Not engaged.
ND
Article 12: Privacy
Not engaged.
ND
Article 13: Freedom of Movement
Not engaged.
ND
Article 14: Asylum
Not engaged.
ND
Article 15: Nationality
Not engaged.
ND
Article 16: Marriage & Family
Not engaged.
ND
Article 17: Property
Not engaged.
ND
Article 18: Freedom of Thought
Not engaged.
ND
Article 20: Assembly & Association
Not engaged.
ND
Article 22: Social Security
Not engaged.
ND
Article 23: Work & Equal Pay
Not engaged.
ND
Article 24: Rest & Leisure
Not engaged.
ND
Article 26: Education
Not engaged.
ND
Article 29: Duties to Community
Not engaged.
Structural Channel
What the site does
+0.30
Article 19: Freedom of Expression
High Advocacy Framing Practice
Structural
+0.30
Context Modifier
ND
SETL
+0.68
The site structurally supports freedom of expression through free public access to this journalism; publication itself constitutes support for information freedom.
+0.25
Article 27: Cultural Participation
High Advocacy Framing Practice
Structural
+0.25
Context Modifier
ND
SETL
+0.76
The site publishes journalism that reports on and advocates for preservation of scientific research access.
+0.15
Preamble
Medium Framing
Structural
+0.15
Context Modifier
ND
SETL
0.00
The site publishes freely accessible journalism that supports information preservation and transparency.
+0.10
Article 1: Freedom, Equality, Brotherhood
Low
Structural
+0.10
Context Modifier
ND
SETL
-0.10
The site's open access model provides equal access to information regardless of economic status.
+0.05
Article 2: Non-Discrimination
High Advocacy Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.32
The publication of this investigative report demonstrates structural commitment to documenting discrimination.
+0.05
Article 7: Equality Before Law
Medium Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.17
The publication documents threats to equal information access.
+0.05
Article 8: Right to Remedy
Low Advocacy
Structural
+0.05
Context Modifier
ND
SETL
+0.07
The site documents remedy efforts by archivists and preservation organizations.
+0.05
Article 21: Political Participation
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.22
The publication contributes to informed public discussion of governance issues.
+0.05
Article 25: Standard of Living
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.12
The publication of this concern demonstrates awareness of relationship between data access and welfare.
+0.05
Article 28: Social & International Order
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.07
The site documents collaborative international efforts in data preservation.
+0.05
Article 30: No Destruction of Rights
Low Framing
Structural
+0.05
Context Modifier
ND
SETL
+0.17
The publication documents and critiques abuse of information control.
ND
Article 3: Life, Liberty, Security
Not engaged.
ND
Article 4: No Slavery
Not engaged.
ND
Article 5: No Torture
Not engaged.
ND
Article 6: Legal Personhood
Not engaged.
ND
Article 9: No Arbitrary Detention
Not engaged.
ND
Article 10: Fair Hearing
Not engaged.
ND
Article 11: Presumption of Innocence
Not engaged.
ND
Article 12: Privacy
Not engaged.
ND
Article 13: Freedom of Movement
Not engaged.
ND
Article 14: Asylum
Not engaged.
ND
Article 15: Nationality
Not engaged.
ND
Article 16: Marriage & Family
Not engaged.
ND
Article 17: Property
Not engaged.
ND
Article 18: Freedom of Thought
Not engaged.
ND
Article 20: Assembly & Association
Not engaged.
ND
Article 22: Social Security
Not engaged.
ND
Article 23: Work & Equal Pay
Not engaged.
ND
Article 24: Rest & Leisure
Not engaged.
ND
Article 26: Education
Not engaged.
ND
Article 29: Duties to Community
Not engaged.
Supplementary Signals
How this content communicates, beyond directional lean.
build 08564a6+21y2 · deployed 2026-02-28 15:24 UTC · evaluated 2026-02-28 15:14:40 UTC