Wiz Research published a detailed security case study documenting a 38TB data exposure incident involving Microsoft's AI research team, caused by a misconfigured Azure SAS token; the exposed data included over 30,000 internal Teams messages and employees' personal data. The article actively advocates for stronger privacy protections, responsible disclosure practices, organizational security governance, and scientific research freedoms, demonstrating strong alignment with UDHR provisions protecting digital rights, privacy, information access, and institutional safeguards.
On a lighter note - I saw a chat message that started with "Hey dude! How is it going". I'm disappointed that the response was not https://nohello.net/en/.
The article tries to play up the AI angle, but this was a pretty standard misconfiguration of a storage token. This kind of thing happens shockingly often, and it’s why frequent pentests are important.
A number of replies here are noting (correctly) how this doesn't have much to do with AI (despite some sentences in this article kind of implicating it; the title doesn't really, fwiw) and is more of an issue with cloud providers, confusing ways in which security tokens apply to data being shared publicly, and dealing with big data downloads (which isn't terribly new)...
...but one notable way in which it does implicate an AI-specific risk is how prevalent it is to use serialized Python objects to store these large opaque AI models. The Python serialization format was never intended for untrusted data distribution, so a model file is effectively code, but stored in a way where both what that code does and the fact that it is there at all are thoroughly obfuscated from the people who download it.
> This is particularly interesting considering the repository’s original purpose: providing AI models for use in training code. The repository instructs users to download a model data file from the SAS link and feed it into a script. The file’s format is ckpt, a format produced by the TensorFlow library. It’s formatted using Python’s pickle formatter, which is prone to arbitrary code execution by design. Meaning, an attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
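The attack the quote describes is easy to demonstrate: pickle happily serializes an object whose `__reduce__` method returns an arbitrary callable, and that callable runs the moment anyone loads the file. A minimal, deliberately benign sketch (the class name and payload are made up for illustration):

```python
import os
import pickle

class NotReallyWeights:
    # pickle serializes whatever __reduce__ returns; on load, the
    # returned callable is invoked with the given arguments.
    # A real attacker would run os.system or similar here.
    def __reduce__(self):
        return (exec, ("import os; os.environ['PICKLE_RAN'] = '1'",))

blob = pickle.dumps(NotReallyWeights())  # looks like any opaque model file
pickle.loads(blob)                       # "loading the model" runs the payload
print(os.environ.get("PICKLE_RAN"))      # prints "1"
```

Nothing about `blob` hints that it carries code rather than tensors, which is exactly why formats like safetensors store plain arrays instead of executable object graphs.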
> Our scan shows that this account contained 38TB of additional data — including Microsoft employees’ personal computer backups.
Not even Microsoft has functioning corporate IT any more, with employees not just being able to make their own image-based backups, but also having to store them in some random Azure storage account that they're using for work files.
I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.
Even more so, you only get two keys for the entire storage account. It would have made much more sense if you could have unlimited, named keys per container.
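For context on why those two keys are so dangerous: a SAS token is just a signed grant. The service HMACs a string describing the scope, permissions, and expiry with one of the account keys, and anyone holding the resulting token gets access with no further identity check. A rough stdlib sketch of the idea (this is not Azure's actual string-to-sign format, and the key and container names are invented):

```python
import base64
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

ACCOUNT_KEY = b"hypothetical-account-key"  # real Azure account keys are 512-bit secrets

def make_sas_like_token(container: str, permissions: str, ttl_hours: int) -> str:
    """Sign (scope, permissions, expiry) with the account key, SAS-style."""
    expiry = (datetime.now(timezone.utc)
              + timedelta(hours=ttl_hours)).strftime("%Y-%m-%dT%H:%MZ")
    string_to_sign = "\n".join([container, permissions, expiry])
    sig = base64.b64encode(
        hmac.new(ACCOUNT_KEY, string_to_sign.encode(), hashlib.sha256).digest()
    ).decode()
    return f"sr={container}&sp={permissions}&se={expiry}&sig={sig}"

token = make_sas_like_token("public-models", "r", ttl_hours=24)
```

Because verification only needs the key, the server keeps no record of outstanding tokens; revoking one means rotating the key, which invalidates every token signed with it. That is the blast-radius problem the parent is complaining about.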
Two of the things that make me cringe are both mentioned here: pickle files and SAS tokens. I get nervous dealing with Azure storage. Use RBAC. They should deprecate SAS and account keys, IMO.
SOC 2-type auditing should have been done here, so I am surprised at the reach: a SAS token with no expiry, and then the deep level of access it gave, including machine backups with their own tokens. A lot of lack of defence in depth going on there.
My view is: burn all secrets. Burn all environment variables. I think most systems can work based on roles, with humans accessing via username, password, and other factors.
If you are working in one cloud, you don't in theory need secrets. If not, I had the idea the other day that proxies tightly coupled to vaults could be used as API adapters to convert them into RBAC too. But I am not a security expert, just paranoid lol.
It's always funny that Wiz's big security revelations are almost always about Microsoft, when Wiz's founder was, in his previous job, the highest-ranking person in charge of cybersecurity at Microsoft.
Would be cool if someone analysed it. I am fairly certain there is proprietary code and data lying around in there. Would be useful for future lawsuits against Microsoft and others that steal people's IP for “training” purposes.
This is so unfortunate, but a clear illustration of something I've been thinking about a lot when it comes to LLMs and AI. It seems like we're forgetting that we are just handing our data over to these companies on a silver platter in the form of our prompts.
Disclosure: I work for Tonic.ai, and we are working on a way to automatically redact any information you send to an LLM - https://www.tonic.ai/solar
It's not reasonable to expect human security token generation to be perfectly secure all the time. The system needs to be safe overall. The organization should have set an OrgPolicy on this entire project to prevent blanket sharing of auth tokens/credentials like this. Ideally blanket access tokens should be opt-in, not opt-out.
Google banned generation of service account keys for internally-used projects, so a stray JSON key file doesn't grant access to Google data/code. This is enforced at the highest level by OrgPolicy. There are a bunch more restrictions, too.
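In GCP terms, the blanket ban described above maps to a boolean organization-policy constraint that can be enforced org-wide; something like this (the organization ID is a placeholder):

```shell
# Enforce "no service account key creation" across the whole organization.
# 123456789 is a placeholder organization ID.
gcloud resource-manager org-policies enable-enforce \
    iam.disableServiceAccountKeyCreation \
    --organization=123456789
```

Once set, key-creation attempts in any project under the org fail at the API level, which is the "safe overall" property the parent comment is asking for: individual mistakes can't mint long-lived credentials.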
I wouldn't trust MSFT with my glass of chocolate milk at this point. I would come back to lipstick all over the rim and somehow multiple leaks in the glass
My wife and I just rewatched WarGames for the millionth time a few nights ago.
The level of cybersecurity incompetency in the early 80's makes sense; computers (and in particular networked computers) were still relatively new, and there weren't that many external users to begin with, so while the potential impact of a mistake was huge (which of course was the plot of the movie), the likelihood of a horrible thing happening was fairly low just because computers were an expensive, somewhat niche thing.
Fast forward to 2023, and now everyone owns bunches of computers, all of which are connected to a network, and all of which are oodles more powerful than anything in the 80s. Cybersecurity protocols are of course much more mature now, but there's also several orders of magnitude more potential attackers than there were in the 80s.
Though I have had the equivalent in tech support: "App doesn't work", which is basically just "hello": obviously you're having an issue, otherwise you wouldn't have contacted support.
Unfortunately a lot of pen testing services have devolved into "We know you need a report for SOC 2, but don't worry, we can do some light security testing and generate a report for you in a few days and you'll be able to check the box for compliance"
Which I guess is better than nothing.
If anyone works at a company that does pen tests for compliance purposes, I'd recommend advocating internally for doing a "quick, easy, and cheap" pen test to "check the box" for compliance, _alongside_ a more comprehensive pen test (maybe call it something other than a "pen test" to convince internal stakeholders who might be afraid that a 2nd in-depth pen test would weaken their compliance posture, since the report is typically shared with sales prospects).
Ideally grey box or white box testing (provide access to codebase / infrastructure to make finding bugs easier). Most pen tests done for compliance purposes are black-box and limit their findings as a result.
Pentests where people actually get out of bed to do stuff (read code, read API docs, etc.) and then try to really hack your system are rare. Pentests where people go through the motions, send you a report with a few unimportant bits highlighted while patting you on the back for your exemplary security, so you can check the box on whatever audit you're going through, are common.
How would a pentest find that? OK, in this case it's splattered onto GitHub; but the main point here is that you might have some unknown number of SAS tokens issued against unknown storage, with no easy way to revoke them.
Pickle files are cringe, but they're also basically unavoidable when working with Python machine learning infrastructure. None of the major ML packages provide a proper model serialization/deserialization mechanism.
In the case of scikit-learn, the code implementing some components does so much crazy dynamic shit that it might not even be feasible to provide a well-engineered serde mechanism without a major rewrite. Or at least, that's roughly what the project's maintainers say whenever they close tickets requesting such a thing.
This is quite funny for me, because at first I didn't understand what the problem was.
In German, if you ask this question, it is expected that your question is genuine and you can expect an answer (although usually people don't use this opportunity to unload their emotional baggage, but it can happen!).
Whereas in English you assume this is just a hello and nothing more.
Also imagine all such exposed data sources, including those that are not yet discovered, being crawled and trained on by GPT-5.
Meanwhile, a big enterprise provider like MS suffers a bigger leak and exposes MS Teams / OneDrive / SharePoint data of, say, all its North America customers.
Boom: we have a GPT model that can autonomously run whole businesses.
Many people are also unaware that json is way, way, way faster than Python pickles, and human-editing-friendly. Not that you'd use it for neural net weights, but I see people use Python pickles all the time for things that json would have worked perfectly well.
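Whether JSON actually beats pickle on speed depends heavily on the payload and the Python version, so it's worth measuring on your own data rather than taking either claim on faith; the human-readability point stands regardless. A quick sketch (sample data invented):

```python
import json
import pickle
import timeit

data = {"run_id": 7, "tags": ["demo", "baseline"], "scores": [0.91, 0.87, 0.93]}

# Both round-trip simple data; only one result is human-readable and diff-able.
assert json.loads(json.dumps(data)) == data
assert pickle.loads(pickle.dumps(data)) == data

# Timings are machine- and version-dependent; run this yourself before
# believing anyone's "X is way faster" claim.
t_json = timeit.timeit(lambda: json.dumps(data), number=20_000)
t_pickle = timeit.timeit(lambda: pickle.dumps(data), number=20_000)
print(f"json: {t_json:.3f}s  pickle: {t_pickle:.3f}s")
```

The safety argument is independent of speed: `json.loads` can only ever produce dicts, lists, strings, and numbers, while `pickle.loads` can execute arbitrary code.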
Absolutely, RBAC should be the default. I would also advocate separate storage accounts for public-facing data, so that any misconfiguration doesn't affect your sensitive data. Just typical "security in layers" thinking that apparently this department in MSFT didn't have.
> I really dislike how Azure makes you juggle keys in order to make any two Azure things talk together.
Actually there is a better way. Look into “Managed Identity”. This allows you to grant access from one service to another, for example grant access to allow a specific VM to work with your storage account.
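With the azure-identity and azure-storage-blob SDKs, the wiring looks roughly like this; note this is a sketch that needs a real Azure environment (and an existing role assignment) to run, and the account URL and container name are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential resolves the VM's (or App Service's) managed
# identity at runtime, so no key or SAS token ever appears in code or config.
service = BlobServiceClient(
    account_url="https://examplestorageacct.blob.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)

# Access is then governed by RBAC role assignments on the storage account,
# e.g. "Storage Blob Data Reader" granted to the VM's managed identity.
container = service.get_container_client("models")
```

The practical win is revocation: deleting the role assignment cuts off the VM immediately, with no key rotation and no orphaned tokens in the wild.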
It didn’t seem to be focused on AI except for the very reasonable concerns that AI research involves lots of data and often also people without much security experience. Seeing things like personal computer backups in the dump immediately suggests that this was a quasi-academic division with a lot less attention to traditional IT standards: I’d be shocked if a Windows engineer could commit a ton of personal data, passwords, API keys, etc. and first hear about it from an outside researcher.
Many SOC 2 audits are a joke. We were audited this year and were asked to provide screenshots of various categories (most of them, in the end, of our own choosing). The only requirement was that the screenshots show the date on the computer where they were taken, as if that couldn't be forged just as easily as the file/EXIF data.
For me it's also interesting as a potential pathway for data poisoning attacks: if you have control over the data used to train a production model, can you modify the dataset such that it inserts a backdoor into any model subsequently trained on it? E.g. what if GPT were biased to insert certain security vulnerabilities as part of its codegen capabilities?
So are SAS tokens worse than some admin setting up a "FileDownloaderAccount" and then sharing its password with multiple users, or reusing it across different applications?
I'll take SAS tokens with expiration over people setting up a shared RBAC account and passing its password around.
Yes, people should do proper RBAC, but point me at a company and I will find dozens of "shared" accounts. People don't care and don't mind. When beating them up with sticks doesn't solve the issue, SAS tokens, while still not perfect, help quite a lot.
Disclosure: I work for the company that released this: https://github.com/protectai/modelscan - a tool that supports scanning many models for this kind of problem.
That said, you should be using something like safetensors.
So, a little bit like how a lot of people think that (non-checksummed/non-encrypted) PDFs cannot be modified, even though they are easily editable with Libre freaking Office?
“ugh, this thing needs to get out by end of week and I can’t scope this key properly, nothing’s working with it.”
“just give it admin privileges and we’ll fix it later”
sometimes they’ll put a short TTL on it, aware of the risk. Then something major breaks a few months later, the token gets a 15-year expiry, and it never is remediated.
It’s common because it’s tempting and easy to tell yourself you’ll fix it later, refactor, etc. But then people leave, stuff gets dropped, and security is very rarely a priority in most orgs - let alone remediation of old security issues.
Editorial Channel: what the content says
Article 12 (Privacy): High Advocacy Coverage Practice
Editorial: +0.70 | SETL: +0.49 | FW Ratio: 60%
Central focus of article: extensive coverage of privacy violation (38TB data exposure, personal backups, Teams messages, passwords, keys); strong advocacy for privacy-protective practices and monitoring.
Observable Facts:
- Article headline and primary content: '38TB of data accidentally exposed' including 'over 30,000 internal Microsoft Teams messages,' 'secrets, private keys, passwords,' and employee backups.
- Domain context shows GDPR checks and consent-gated analytics tracking.
- Article explicitly recommends: 'enable Storage Analytics logs,' 'secret scanning tools,' monitoring of SAS token usage for privacy protection.
Inferences:
- The comprehensive documentation of privacy violations and prescriptive privacy-protective measures demonstrates strong advocacy for the right to privacy.
- Domain-level implementation of privacy controls reinforces editorial stance on privacy as a protected right.
Article 3 (Life, Liberty, Security): High Advocacy Practice
Editorial: +0.65 | SETL: +0.40 | FW Ratio: 60%
Core focus on security rights: article directly addresses threats to personal security, data integrity, and prevention of unauthorized access that could endanger individuals.
Observable Facts:
- Article documents security threats: '38 terabytes of additional private data — including a disk backup of two employees' workstations.'
- Article explains potential harm: 'An attacker could have injected malicious code... every user who trusts Microsoft's GitHub repository would've been infected.'
- Emphasizes protection: 'security teams should review and sanitize AI models from external sources, since they can be used as a remote code execution vector.'
Inferences:
- The detailed analysis of security risks and protective measures demonstrates strong alignment with the right to life, liberty, and security of person.
- Domain's security-focused mission directly supports implementation of protective security rights.
Article 19 (Freedom of Expression): High Advocacy Coverage
Editorial: +0.60 | SETL: +0.39 | FW Ratio: 60%
Strong advocacy for right to information through public disclosure of security vulnerability following responsible disclosure timeline; supports security researchers' right to communicate findings.
Observable Facts:
- Article publicly discloses vulnerability after responsible timeline: 'Jun. 22, 2023 – Wiz Research finds and reports issue to MSRC... Sep. 18, 2023 – Public disclosure.'
- Provides detailed technical information, recommendations, and mitigation strategies for broad cloud community.
- Page explicitly invites engagement: 'We would love to hear from you! Feel free to contact us on Twitter or via email.'
Inferences:
- Public disclosure of security information following responsible practices supports the right to receive and impart information.
- Free, accessible publication and author identification demonstrate commitment to information freedom.
Preamble: High Advocacy
Editorial: +0.55 | SETL: +0.37 | FW Ratio: 67%
Content advocates for protective frameworks and organizational responsibility to safeguard individuals from security violations and protect their dignity through security governance.
Observable Facts:
- Page describes Wiz Research Team's mission: 'to make the cloud a safer place for everyone.'
- Article emphasizes protecting people through organizational security frameworks and governance recommendations.
Inferences:
- The framing of security research as protection of fundamental rights aligns with the Preamble's foundational commitment to inherent human dignity.
Article 28 (Social & International Order): High Advocacy Practice
Editorial: +0.50 | SETL: +0.32 | FW Ratio: 60%
Extensive advocacy for security governance, organizational frameworks, and institutional safeguards; provides detailed recommendations for implementing protective social order.
Observable Facts:
- Article includes detailed 'SAS security recommendations' section with 'Management' and 'Monitoring' subsections and specific governance controls.
- Recommends specific practices: 'creating dedicated storage accounts for external sharing,' 'using CSPM to track and enforce this as a policy,' 'enabling Storage Analytics logs.'
- Emphasizes institutional responsibility: 'organizations will have to disable SAS access for each of their storage accounts' — establishing mandatory governance.
Inferences:
- The comprehensive governance framework and institutional recommendations demonstrate strong advocacy for security-based social order.
- The provision of specific controls and policies supports establishment of protective organizational frameworks.
Article 1 (Freedom, Equality, Brotherhood): Medium Advocacy
Editorial: +0.40 | SETL: +0.24 | FW Ratio: 67%
Research findings and recommendations apply universally across all organizations; advocates for equal protection of all users regardless of size or type.
Observable Facts:
- Recommendations apply equally: recommendations for SAS token security apply universally across all organization types.
- Article states findings benefit 'all users' and protect all 'engineers now work with massive amounts of training data.'
Inferences:
- Universal framing of security best practices suggests advocacy for equal protection of all rights-holders regardless of organizational status.
Article 29 (Duties to Community): Medium Advocacy Practice
Editorial: +0.40 | SETL: +0.28 | FW Ratio: 75%
Advocates for researcher responsibility and responsible disclosure practices; discusses duties of security professionals and organizations.
Observable Facts:
- Article documents responsible disclosure process: 'Jun. 22, 2023 – Wiz Research finds and reports issue to MSRC... Jul. 7, 2023 – SAS token replaced on GitHub.'
- Team articulates responsibility: 'Our goal is to make the cloud a safer place for everyone' — defining duty to community.
- Emphasizes institutional duties: 'security teams should work closely with the data science and research teams' — defining collective responsibility.
Inferences:
- The adherence to responsible disclosure timeline and emphasis on community responsibility demonstrates alignment with duties toward others.
Article 26 (Education): Medium Advocacy Coverage
Editorial: +0.35 | SETL: +0.23 | FW Ratio: 75%
Provides free security education and awareness about cloud risks; advocates for education about digital threats and security literacy.
Observable Facts:
- Article provides detailed technical education about SAS tokens, security risks, and mitigation strategies.
- Footer mentions 'Cloud Security Courses' and educational resources.
- Complex technical concepts explained accessibly to broad audience without paywalls.
Inferences:
- The provision of free security education supports the right to education about digital and security topics.
Article 27 (Cultural Participation): Medium Advocacy
Editorial: +0.35 | SETL: +0.23 | FW Ratio: 67%
Advocates for scientific security research and knowledge sharing; supports researchers' right to conduct and publish research.
Observable Facts:
- Article describes Wiz Research Team's 'ongoing work on accidental exposure of cloud-hosted data.'
- Team identifies as 'white-hat hackers with a single goal: to make the cloud a safer place for everyone.'
Inferences:
- Support for scientific security research and public knowledge-sharing aligns with right to participate in scientific advancement.
Article 2 (Non-Discrimination): Medium Advocacy
Editorial: +0.30 | SETL: +0.17 | FW Ratio: 67%
Security protections are recommended without discrimination; all cloud users and organizations receive equal guidance.
Observable Facts:
- Recommendations apply uniformly to all organizations without exception based on size, type, or characteristics.
- Research findings apply to all 'data scientists and engineers' without distinction of role or background.
Inferences:
- Non-discriminatory dissemination of security protections aligns with equality principles.
Article 22 (Social Security): Medium Advocacy Practice
Editorial: +0.25 | SETL: +0.16 | FW Ratio: 67%
Advocates for security governance and organizational frameworks; recommends institutional safeguards and policy implementation for social order.
Observable Facts:
- Article explicitly recommends: 'security teams should work closely with the data science and research teams to ensure proper guardrails are defined.'
- Discusses institutional security governance: 'Due to the lack of security and governance over Account SAS tokens, they should be considered as sensitive as the account key itself.'
Inferences:
- The emphasis on institutional governance and organizational frameworks aligns with establishing social order for security.
Article 23 (Work & Equal Pay): Medium Advocacy Practice
Editorial: +0.25 | SETL: +0.16 | FW Ratio: 67%
Addresses workplace security, discussing exposure of employee communications and backups; advocates for workplace security safeguards.
Observable Facts:
- Article documents workplace data exposure: '30,000 internal Microsoft Teams messages from 359 Microsoft employees' were exposed.
- Discusses security guardrails needed 'as more of their engineers now work with massive amounts of training data.'
Inferences:
- The focus on workplace data security and employee protection indicates concern for workers' right to secure working conditions.
Article 17 (Property): Medium Framing Advocacy
Editorial: +0.20 | SETL: +0.14 | FW Ratio: 67%
Data is framed as protected property requiring safeguards against unauthorized access; advocacy for protection of data as a sensitive asset.
Observable Facts:
- Article discusses Microsoft's storage account containing 'private data' that was improperly exposed.
- Recommendations aim to prevent unauthorized access and modification: token was 'misconfigured to allow full control permissions' enabling deletion and overwriting.
Inferences:
- The emphasis on preventing unauthorized modification and access to data treats data as protected property worthy of security.
Article 25 (Standard of Living): Low Advocacy
Editorial: +0.15 | SETL: +0.09 | FW Ratio: 67%
Implicitly advocates for security standards as foundational to adequate digital infrastructure and living standards.
Observable Facts:
- Article discusses importance of security standards for 'all organizations' using cloud services.
- Recommends security practices as essential baseline for safe technology adoption.
Inferences:
- The positioning of security infrastructure as foundational suggests alignment with adequate standards for digital participation.
Article 30 (No Destruction of Rights): Medium Framing
Editorial: +0.15 | SETL: ND | FW Ratio: 67%
Acknowledges legitimate limitations and trade-offs: security measures have costs, token features provide agility while requiring governance, monitoring incurs expenses.
Observable Facts:
- Article notes: 'SAS tokens pose a security risk, as they allow sharing information with external unidentified identities. The risk can be examined from several angles... [but] granularity provides great agility for users.'
- Acknowledges cost trade-offs: 'enabling logging comes with extra charges — which might be costly for accounts with extensive activity.'
Inferences:
- The acknowledgment of legitimate tensions between security and other values (agility, cost) reflects awareness of proportional rights limitations.
build cbdda6a+ezsz · deployed 2026-02-28 16:02 UTC · evaluated 2026-02-28 16:03:48 UTC