Wiz Research published a detailed security case study documenting a 38TB data exposure incident involving Microsoft's AI research team, caused by a misconfigured Azure SAS token; the exposed data included over 30,000 internal Teams messages and employees' personal data. The article actively advocates for stronger privacy protections, responsible disclosure practices, organizational security governance, and scientific research freedoms, demonstrating strong alignment with UDHR provisions protecting digital rights, privacy, information access, and institutional safeguards.
On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not https://nohello.net/en/.
The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.
A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...
...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models. The Python serialization format was never intended for untrusted data distribution, so a model file is effectively code, but stored in a way where both what that code does and the fact that it is there at all are thoroughly obfuscated from the people who download it.
> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
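The attack the quote describes is easy to demonstrate: pickle happily serializes an object whose `__reduce__` method returns an arbitrary callable, and that callable runs the moment anyone loads the file. A minimal, deliberately benign sketch (the class name and payload are made up for illustration):

```python
import os
import pickle

class NotReallyWeights:
    # pickle serializes whatever __reduce__ returns; on load, the
    # returned callable is invoked with the given arguments.
    # A real attacker would run os.system or similar here.
    def __reduce__(self):
        return (exec, ("import os; os.environ['PICKLE_RAN'] = '1'",))

blob = pickle.dumps(NotReallyWeights())  # looks like any opaque model file
pickle.loads(blob)                       # "loading the model" runs the payload
print(os.environ.get("PICKLE_RAN"))      # prints "1"
```

Nothing about `blob` hints that it carries code rather than tensors, which is exactly why formats like safetensors store plain arrays instead of executable object graphs.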
> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.
Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random Azure storage account that they're using for work files.
I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.
Even more so, you only get two keys for the entire storage account. It would have made much more sense if you could have unlimited, named keys per container.
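For context on why those two keys are so dangerous: a SAS token is just a signed grant. The service HMACs a string describing the scope, permissions, and expiry with one of the account keys, and anyone holding the resulting token gets access with no further identity check. A rough stdlib sketch of the idea (this is not Azure's actual string-to-sign format, and the key and container names are invented):

```python
import base64
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

ACCOUNT_KEY = b"hypothetical-account-key"  # real Azure account keys are 512-bit secrets

def make_sas_like_token(container: str, permissions: str, ttl_hours: int) -> str:
    """Sign (scope, permissions, expiry) with the account key, SAS-style."""
    expiry = (datetime.now(timezone.utc)
              + timedelta(hours=ttl_hours)).strftime("%Y-%m-%dT%H:%MZ")
    string_to_sign = "\n".join([container, permissions, expiry])
    sig = base64.b64encode(
        hmac.new(ACCOUNT_KEY, string_to_sign.encode(), hashlib.sha256).digest()
    ).decode()
    return f"sr={container}&sp={permissions}&se={expiry}&sig={sig}"

token = make_sas_like_token("public-models", "r", ttl_hours=24)
```

Because verification only needs the key, the server keeps no record of outstanding tokens; revoking one means rotating the key, which invalidates every token signed with it. That is the blast-radius problem the parent is complaining about.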
Two of the things that make me cringe are both mentioned here: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys, IMO.
SOC 2-type auditing should have been done here, so I am surprised at the reach: a SAS token with no expiry, and then the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.
My view is: burn all secrets. Burn all environment variables. I think most systems can work based on roles, with humans accessing via username, password, and other factors.
If you are working in one cloud, you don't in theory need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adapters to convert them into RBAC too. But I am not a security expert, just paranoid lol.
It's always funny that Wiz's big security revelations are almost always about Microsoft, when Wiz's founder was, in his previous job, the highest-ranking person in charge of cybersecurity at Microsoft.
Would be cool if someone analysed it. I am fairly certain there is proprietary code and data lying around in there. Would be useful for future lawsuits against Microsoft and others that steal people's IP for “training” purposes.
This is so unfortunate, but a clear illustration of something I've been thinking about a lot when it comes to LLMs and AI. It seems like we're forgetting that we are just handing our data over to these companies on a silver platter in the form of our prompts.
Disclosure: I work for Tonic.ai, and we are working on a way to automatically redact any information you send to an LLM - https://www.tonic.ai/solar
It's not reasonable to expect human security token generation to be perfectly secure all the time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally blanket access tokens should be opt-in, not opt-out.
Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't grant access to Google data/code. This is enforced at the highest level by OrgPolicy. There are a bunch more restrictions, too.
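In GCP terms, the blanket ban described above maps to a boolean organization-policy constraint that can be enforced org-wide; something like this (the organization ID is a placeholder):

```shell
# Enforce "no service account key creation" across the whole organization.
# 123456789 is a placeholder organization ID.
gcloud resource-manager org-policies enable-enforce \
    iam.disableServiceAccountKeyCreation \
    --organization=123456789
```

Once set, key-creation attempts in any project under the org fail at the API level, which is the "safe overall" property the parent comment is asking for: individual mistakes can't mint long-lived credentials.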
I wouldn't trust MSFT with my glass of chocolate milk at this point. I would come back to lipstick all over the rim and somehow multiple leaks in the glass
My wife and I just rewatched WarGames for the millionth time a few nights ago.
The level of cybersecurity incompetency in the early 80's makes sense; computers (and in particular networked computers) were still relatively new, and there weren't that many external users to begin with, so while the potential impact of a mistake was huge (which of course was the plot of the movie), the likelihood of a horrible thing happening was fairly low just because computers were an expensive, somewhat niche thing.
Fast forward to 2023, and now everyone owns bunches of computers, all of which are connected to a network, and all of which are oodles more powerful than anything in the 80s. Cybersecurity protocols are of course much more mature now, but there's also several orders of magnitude more potential attackers than there were in the 80s.
Though I have had the equivalent in tech support: "App doesn't work", which is basically just "hello": obviously you're having an issue, otherwise you wouldn't have contacted support.
Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"
Which I guess is better than nothing.
If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive pen test (maybe call it something other than a "pen test" to convince internal stakeholders who might be afraid that a 2nd in-depth pen test would weaken their compliance posture, since the report is typically shared with sales prospects).
Ideally grey box or white box testing (provide access to codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.
Pentests where people actually get out of bed to do stuff (read code, read API docs, etc.) and then try to really hack your system are rare. Pentests where people go through the motions, send you a report with a few unimportant bits highlighted while patting you on the back for your exemplary security, so you can check the box on whatever audit you're going through, are common.
How would a pentest find that? OK, in this case it's splattered onto GitHub; but the main point here is that you might have some unknown number of SAS tokens issued against unknown storage, with no easy way to revoke them.
Pickle files are cringe, but they're also basically unavoidable when working with Python machine learning infrastructure. None of the major ML packages provide a proper model serialization/deserialization mechanism.
In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
This is quite funny for me, because at first I didn't understand what the problem was.
In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, but it can happen!).
Whereas in English you assume this is just a hello and nothing more.
Also imagine all such exposed data sources, including those that are not yet discovered, being crawled and trained on by GPT-5.
Meanwhile, a big enterprise provider like MS suffers a bigger leak and exposes MS Teams / OneDrive / SharePoint data of, say, all its North America customers.
Boom: we have a GPT model that can autonomously run whole businesses.
Many people are also unaware that json is way, way, way faster than Python pickles, and human-editing-friendly. Not that you'd use it for neural net weights, but I see people use Python pickles all the time for things that json would have worked perfectly well.
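Whether JSON actually beats pickle on speed depends heavily on the payload and the Python version, so it's worth measuring on your own data rather than taking either claim on faith; the human-readability point stands regardless. A quick sketch (sample data invented):

```python
import json
import pickle
import timeit

data = {"run_id": 7, "tags": ["demo", "baseline"], "scores": [0.91, 0.87, 0.93]}

# Both round-trip simple data; only one result is human-readable and diff-able.
assert json.loads(json.dumps(data)) == data
assert pickle.loads(pickle.dumps(data)) == data

# Timings are machine- and version-dependent; run this yourself before
# believing anyone's "X is way faster" claim.
t_json = timeit.timeit(lambda: json.dumps(data), number=20_000)
t_pickle = timeit.timeit(lambda: pickle.dumps(data), number=20_000)
print(f"json: {t_json:.3f}s  pickle: {t_pickle:.3f}s")
```

The safety argument is independent of speed: `json.loads` can only ever produce dicts, lists, strings, and numbers, while `pickle.loads` can execute arbitrary code.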
Absolutely, RBAC should be the default. I would also advocate separate storage accounts for public-facing data, so that any misconfiguration doesn't affect your sensitive data. Just typical "security in layers" thinking that apparently this department in MSFT didn't have.
> I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.
Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.
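With the azure-identity and azure-storage-blob SDKs, the wiring looks roughly like this; note this is a sketch that needs a real Azure environment (and an existing role assignment) to run, and the account URL and container name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential resolves the VM's (or App Service's) managed
# identity at runtime, so no key or SAS token ever appears in code or config.
service = BlobServiceClient(
    account_url="https://examplestorageacct.blob.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)

# Access is then governed by RBAC role assignments on the storage account,
# e.g. "Storage Blob Data Reader" granted to the VM's managed identity.
container = service.get_container_client("models")
```

The practical win is revocation: deleting the role assignment cuts off the VM immediately, with no key rotation and no orphaned tokens in the wild.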
It didn’t seem to be focused on AI except for the very reasonable concerns that AI research involves lots of data and often also people without much security experience. Seeing things like personal computer backups in the dump immediately suggests that this was a quasi-academic division with a lot less attention to traditional IT standards: I’d be shocked if a Windows engineer could commit a ton of personal data, passwords, API keys, etc. and first hear about it from an outside researcher.
Many SOC 2 audits are a joke. We were audited this year and were asked to provide screenshots of various categories (most of them, in the end, of our own choosing). The only requirement was that the screenshots show the date on the computer where they were taken, as if that couldn't be forged just as easily as the file/EXIF data.
For me it's also interesting as a potential pathway for data poisoning attacks: if you have control over the data used to train a production model, can you modify the dataset such that it inserts a backdoor into any model subsequently trained on it? E.g. what if GPT were biased to insert certain security vulnerabilities as part of its codegen capabilities?
So are SAS tokens worse than some admin setting up a "FileDownloaderAccount" and then sharing its password with multiple users, or reusing it across different applications?
I'll take SAS tokens with expiration over people setting up a shared RBAC account and passing its password around.
Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks doesn't solve the issue, SAS tokens, while still not perfect, help quite a lot.
Disclosure: I work for the company that released this: https://github.com/protectai/modelscan - a tool that supports scanning many models for this kind of problem.
That said, you should be using something like safetensors.
So, a little bit like how a lot of people think that (non-checksummed/non-encrypted) PDFs cannot be modified, even though they are easily editable with Libre freaking Office?
“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”
“just give it admin privileges and we’ll fix it later”
sometimes they’ll put a short TTL on it, aware of the risk. Then something major breaks a few months later, the token gets a 15-year expiry, and it never is remediated.
It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.
Editorial Channel: what the content says
Article 12 (Privacy): High Advocacy Coverage Practice
Editorial: +0.70 | SETL: +0.49 | FW Ratio: 60%
Central focus of article: extensive coverage of privacy violation (38TB data exposure, personal backups, Teams messages, passwords, keys); strong advocacy for privacy-protective practices and monitoring.
Observable Facts:
- Article headline and primary content: '38TB of data accidentally exposed' including 'over 30,000 internal Microsoft Teams messages,' 'secrets, private keys, passwords,' and employee backups.
- Domain context shows GDPR checks and consent-gated analytics tracking.
- Article explicitly recommends: 'enable Storage Analytics logs,' 'secret scanning tools,' monitoring of SAS token usage for privacy protection.
Inferences:
- The comprehensive documentation of privacy violations and prescriptive privacy-protective measures demonstrates strong advocacy for the right to privacy.
- Domain-level implementation of privacy controls reinforces editorial stance on privacy as a protected right.
Article 3 (Life, Liberty, Security): High Advocacy Practice
Editorial: +0.65 | SETL: +0.40 | FW Ratio: 60%
Core focus on security rights: article directly addresses threats to personal security, data integrity, and prevention of unauthorized access that could endanger individuals.
Observable Facts:
- Article documents security threats: '38 terabytes of additional private data — including a disk backup of two employees' workstations.'
- Article explains potential harm: 'An attacker could have injected malicious code... every user who trusts Microsoft's GitHub repository would've been infected.'
- Emphasizes protection: 'security teams should review and sanitize AI models from external sources, since they can be used as a remote code execution vector.'
Inferences:
- The detailed analysis of security risks and protective measures demonstrates strong alignment with the right to life, liberty, and security of person.
- Domain's security-focused mission directly supports implementation of protective security rights.
Article 19 (Freedom of Expression): High Advocacy Coverage
Editorial: +0.60 | SETL: +0.39 | FW Ratio: 60%
Strong advocacy for right to information through public disclosure of security vulnerability following responsible disclosure timeline; supports security researchers' right to communicate findings.
Observable Facts:
- Article publicly discloses vulnerability after responsible timeline: 'Jun. 22, 2023 – Wiz Research finds and reports issue to MSRC... Sep. 18, 2023 – Public disclosure.'
- Provides detailed technical information, recommendations, and mitigation strategies for broad cloud community.
- Page explicitly invites engagement: 'We would love to hear from you! Feel free to contact us on Twitter or via email.'
Inferences:
- Public disclosure of security information following responsible practices supports the right to receive and impart information.
- Free, accessible publication and author identification demonstrate commitment to information freedom.
Preamble: High Advocacy
Editorial: +0.55 | SETL: +0.37 | FW Ratio: 67%
Content advocates for protective frameworks and organizational responsibility to safeguard individuals from security violations and protect their dignity through security governance.
Observable Facts:
- Page describes Wiz Research Team's mission: 'to make the cloud a safer place for everyone.'
- Article emphasizes protecting people through organizational security frameworks and governance recommendations.
Inferences:
- The framing of security research as protection of fundamental rights aligns with the Preamble's foundational commitment to inherent human dignity.
Article 28 (Social & International Order): High Advocacy Practice
Editorial: +0.50 | SETL: +0.32 | FW Ratio: 60%
Extensive advocacy for security governance, organizational frameworks, and institutional safeguards; provides detailed recommendations for implementing protective social order.
Observable Facts:
- Article includes detailed 'SAS security recommendations' section with 'Management' and 'Monitoring' subsections and specific governance controls.
- Recommends specific practices: 'creating dedicated storage accounts for external sharing,' 'using CSPM to track and enforce this as a policy,' 'enabling Storage Analytics logs.'
- Emphasizes institutional responsibility: 'organizations will have to disable SAS access for each of their storage accounts' — establishing mandatory governance.
Inferences:
- The comprehensive governance framework and institutional recommendations demonstrate strong advocacy for security-based social order.
- The provision of specific controls and policies supports establishment of protective organizational frameworks.
Article 1 (Freedom, Equality, Brotherhood): Medium Advocacy
Editorial: +0.40 | SETL: +0.24 | FW Ratio: 67%
Research findings and recommendations apply universally across all organizations; advocates for equal protection of all users regardless of size or type.
Observable Facts:
- Recommendations apply equally: recommendations for SAS token security apply universally across all organization types.
- Article states findings benefit 'all users' and protect all 'engineers now work with massive amounts of training data.'
Inferences:
- Universal framing of security best practices suggests advocacy for equal protection of all rights-holders regardless of organizational status.
Article 29 (Duties to Community): Medium Advocacy Practice
Editorial: +0.40 | SETL: +0.28 | FW Ratio: 75%
Advocates for researcher responsibility and responsible disclosure practices; discusses duties of security professionals and organizations.
Observable Facts:
- Article documents responsible disclosure process: 'Jun. 22, 2023 – Wiz Research finds and reports issue to MSRC... Jul. 7, 2023 – SAS token replaced on GitHub.'
- Team articulates responsibility: 'Our goal is to make the cloud a safer place for everyone' — defining duty to community.
- Emphasizes institutional duties: 'security teams should work closely with the data science and research teams' — defining collective responsibility.
Inferences:
- The adherence to responsible disclosure timeline and emphasis on community responsibility demonstrates alignment with duties toward others.
Article 26 (Education): Medium Advocacy Coverage
Editorial: +0.35 | SETL: +0.23 | FW Ratio: 75%
Provides free security education and awareness about cloud risks; advocates for education about digital threats and security literacy.
Observable Facts:
- Article provides detailed technical education about SAS tokens, security risks, and mitigation strategies.
- Footer mentions 'Cloud Security Courses' and educational resources.
- Complex technical concepts explained accessibly to broad audience without paywalls.
Inferences:
- The provision of free security education supports the right to education about digital and security topics.
Article 27 (Cultural Participation): Medium Advocacy
Editorial: +0.35 | SETL: +0.23 | FW Ratio: 67%
Advocates for scientific security research and knowledge sharing; supports researchers' right to conduct and publish research.
Observable Facts:
- Article describes Wiz Research Team's 'ongoing work on accidental exposure of cloud-hosted data.'
- Team identifies as 'white-hat hackers with a single goal: to make the cloud a safer place for everyone.'
Inferences:
- Support for scientific security research and public knowledge-sharing aligns with right to participate in scientific advancement.
Article 2 (Non-Discrimination): Medium Advocacy
Editorial: +0.30 | SETL: +0.17 | FW Ratio: 67%
Security protections are recommended without discrimination; all cloud users and organizations receive equal guidance.
Observable Facts:
- Recommendations apply uniformly to all organizations without exception based on size, type, or characteristics.
- Research findings apply to all 'data scientists and engineers' without distinction of role or background.
Inferences:
- Non-discriminatory dissemination of security protections aligns with equality principles.
Article 22 (Social Security): Medium Advocacy Practice
Editorial: +0.25 | SETL: +0.16 | FW Ratio: 67%
Advocates for security governance and organizational frameworks; recommends institutional safeguards and policy implementation for social order.
Observable Facts:
- Article explicitly recommends: 'security teams should work closely with the data science and research teams to ensure proper guardrails are defined.'
- Discusses institutional security governance: 'Due to the lack of security and governance over Account SAS tokens, they should be considered as sensitive as the account key itself.'
Inferences:
- The emphasis on institutional governance and organizational frameworks aligns with establishing social order for security.
Article 23 (Work & Equal Pay): Medium Advocacy Practice
Editorial: +0.25 | SETL: +0.16 | FW Ratio: 67%
Addresses workplace security, discussing exposure of employee communications and backups; advocates for workplace security safeguards.
Observable Facts:
- Article documents workplace data exposure: '30,000 internal Microsoft Teams messages from 359 Microsoft employees' were exposed.
- Discusses security guardrails needed 'as more of their engineers now work with massive amounts of training data.'
Inferences:
- The focus on workplace data security and employee protection indicates concern for workers' right to secure working conditions.
Article 17 (Property): Medium Framing Advocacy
Editorial: +0.20 | SETL: +0.14 | FW Ratio: 67%
Data is framed as protected property requiring safeguards against unauthorized access; advocacy for protection of data as a sensitive asset.
Observable Facts:
- Article discusses Microsoft's storage account containing 'private data' that was improperly exposed.
- Recommendations aim to prevent unauthorized access and modification: token was 'misconfigured to allow full control permissions' enabling deletion and overwriting.
Inferences:
- The emphasis on preventing unauthorized modification and access to data treats data as protected property worthy of security.
Article 25 (Standard of Living): Low Advocacy
Editorial: +0.15 | SETL: +0.09 | FW Ratio: 67%
Implicitly advocates for security standards as foundational to adequate digital infrastructure and living standards.
Observable Facts:
- Article discusses importance of security standards for 'all organizations' using cloud services.
- Recommends security practices as essential baseline for safe technology adoption.
Inferences:
- The positioning of security infrastructure as foundational suggests alignment with adequate standards for digital participation.
Article 30 (No Destruction of Rights): Medium Framing
Editorial: +0.15 | SETL: ND | FW Ratio: 67%
Acknowledges legitimate limitations and trade-offs: security measures have costs, token features provide agility while requiring governance, monitoring incurs expenses.
Observable Facts:
- Article notes: 'SAS tokens pose a security risk, as they allow sharing information with external unidentified identities. The risk can be examined from several angles... [but] granularity provides great agility for users.'
- Acknowledges cost trade-offs: 'enabling logging comes with extra charges — which might be costly for accounts with extensive activity.'
Inferences:
- The acknowledgment of legitimate tensions between security and other values (agility, cost) reflects awareness of proportional rights limitations.
build cbdda6a+ezsz · deployed 2026-02-28 16:02 UTC · evaluated 2026-02-28 16:03:48 UTC