Microsoft AI inadvertently exposed a secret giving access to 38TB of confidential data for 3 years

The Wiz Research team recently discovered that an overprovisioned SAS token had been lying exposed on GitHub for nearly three years. This token granted access to a massive 38-terabyte trove of private data. The Azure storage account also contained additional secrets, such as private SSH keys, hidden within the disk backups of two Microsoft employees. The revelation underscores the importance of robust data security measures.

What happened?

Wiz Research recently disclosed a data exposure incident found on Microsoft’s AI GitHub repository on June 23, 2023.

The researchers managing the GitHub repository used an Azure Storage sharing feature, a SAS token, to give access to a bucket of open-source AI training data.

This token was misconfigured, giving access to the account's entire cloud storage rather than the intended bucket.

This storage comprised 38TB of data, including a disk backup of two employees’ workstations with secrets, private keys, passwords, and more than 30,000 internal Microsoft Teams messages.

SAS (Shared Access Signatures) are signed URLs for sharing Azure Storage resources. They are configured with fine-grained controls over how a client can access the data: what resources are exposed (full account, container, or selection of files), with what permissions, and for how long. See Azure Storage documentation.
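To make these controls concrete, the sketch below reads the query parameters of a SAS URL and reports what it exposes. It is a minimal illustration rather than an official tool: `describe_sas` is a hypothetical helper, the account name, expiry, and signature in the example URL are made-up placeholders, and the code relies only on the documented SAS query parameters (`ss` and `srt` for an Account SAS, `sp` for permissions, `se` for expiry, `si` for a stored access policy).

```python
from urllib.parse import parse_qs, urlsplit


def describe_sas(url: str) -> dict:
    """Summarize the scope, permissions, and expiry encoded in a SAS URL."""
    query = parse_qs(urlsplit(url).query)

    def first(key):
        values = query.get(key)
        return values[0] if values else None

    # An Account SAS carries signed services (ss) and signed resource types (srt);
    # a Service SAS carries a signed resource (sr) instead.
    kind = "account" if first("ss") and first("srt") else "service"
    return {
        "kind": kind,
        "permissions": first("sp"),
        "expiry": first("se"),
        "stored_access_policy": first("si"),
    }


# Placeholder URL: the account name, expiry, and signature are illustrative only.
example = (
    "https://exampleaccount.blob.core.windows.net/"
    "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlac&se=2051-10-06T00:00:00Z&sig=REDACTED"
)
summary = describe_sas(example)
print(summary)
```

A result whose `kind` is `account`, with write and delete letters in `permissions` and an expiry decades away, is the combination this incident involved.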

After Wiz disclosed the incident to Microsoft, the SAS token was invalidated. Nearly three years elapsed between its first commit to GitHub (July 20, 2020) and its revocation. See the timeline presented by the Wiz Research team.

Why did the token have such an extended lifespan? The timeline shows that, after the token first expired, its expiration date was pushed back by a further 30 years. This longevity isn't surprising when you consider that the token was deliberately created to be shared and to grant access to training data.

Yet, as emphasized by the Wiz Research team, the Shared Access Signature (SAS) was misconfigured.

Data Exposure

The token allowed anyone to access an additional 38TB of data, including sensitive data such as secret keys, personal passwords, and over 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees.

Here is an excerpt from some of the most sensitive data recovered by the Wiz team:

Not only was the access scope excessively permissive, but the token was also misconfigured to grant "full control" permissions instead of read-only. This means that an attacker not only had the ability to view all the files in the storage account but could also delete and overwrite existing files.
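Both problems, write-level permissions and a distant expiry, can be checked mechanically before a token is shared. The following is a minimal sketch under stated assumptions: `audit_sas` is a hypothetical helper, the seven-day lifetime limit is an arbitrary example threshold, and the permission letters are the standard blob SAS letters (`r` read, `l` list, `w` write, `d` delete, `a` add, `c` create).

```python
from datetime import datetime, timedelta, timezone

WRITE_LIKE = set("wdac")           # write, delete, add, create
MAX_LIFETIME = timedelta(days=7)   # assumed policy limit, adjust to your own rules


def audit_sas(permissions: str, expiry: str, now: datetime) -> list[str]:
    """Return human-readable findings for a SAS permission string and expiry."""
    findings = []
    risky = sorted(WRITE_LIKE & set(permissions))
    if risky:
        findings.append("grants more than read/list: " + "".join(risky))
    # SAS expiry values are ISO 8601 UTC timestamps; normalize the trailing "Z".
    expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
    if expires_at - now > MAX_LIFETIME:
        findings.append(f"expires in {(expires_at - now).days} days")
    return findings


# `now` must be timezone-aware so it can be compared with the parsed expiry.
now = datetime(2021, 10, 6, tzinfo=timezone.utc)
for finding in audit_sas("rwdlac", "2051-10-06T00:00:00Z", now):
    print(finding)
```

A read-only token that expires within the limit produces no findings, which is the shape a link for sharing public training data would be expected to have.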

As highlighted by the researchers, this could have allowed an attacker to inject malicious code into the storage blob that could then automatically execute with every download by a user (presumably an AI researcher) trusting in Microsoft's reputation, which could have led to a supply chain attack.

Security Risks

According to the researchers, Account SAS tokens such as the one described in their research pose a high security risk, because they are highly permissive, long-lived, and escape the monitoring perimeter of administrators.

When a user generates a new token, it is signed on the client side (for example, in the browser) and doesn't trigger any Azure event. To revoke a token, an administrator needs to rotate the account key that signed it, which revokes every other token signed with that key at once.
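Why revocation requires key rotation can be sketched with a simplified model. A SAS signature is an HMAC-SHA256 computed with the storage account key, so the service keeps no per-token record that could be deleted individually. In the sketch below, the `string_to_sign` values are simplified placeholders rather than Azure's actual string-to-sign layout, and the keys are random bytes generated for the demo.

```python
import base64
import hashlib
import hmac
import os


def sign(account_key: bytes, string_to_sign: str) -> str:
    """Compute a base64 HMAC-SHA256 signature, as a client would when creating a SAS."""
    digest = hmac.new(account_key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid(account_key: bytes, string_to_sign: str, signature: str) -> bool:
    """Verify a signature the way a server holding only the account key could."""
    return hmac.compare_digest(sign(account_key, string_to_sign), signature)


old_key = os.urandom(32)

# Two unrelated tokens signed offline with the same key; no server call is involved.
token_a = "rl\n2023-01-01\n/models"
token_b = "rwdl\n2051-10-06\n/"
sig_a = sign(old_key, token_a)
sig_b = sign(old_key, token_b)
valid_before = (is_valid(old_key, token_a, sig_a), is_valid(old_key, token_b, sig_b))

# Rotating the key is the only lever: it invalidates every token at once.
new_key = os.urandom(32)
valid_after = (is_valid(new_key, token_a, sig_a), is_valid(new_key, token_b, sig_b))
print(valid_before, valid_after)
```

The model shows the trade-off described above: the well-scoped token and the overly permissive one are revoked together, because the key is the only thing the service can change.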

Ironically, the security risk of a Microsoft product feature (Azure SAS tokens) caused an incident for a Microsoft research team. This risk was recently referenced in the second version of the Microsoft threat matrix for storage services.

Secrets Sprawl

This example underscores the pervasive issue of secrets sprawl within organizations, even those with advanced security measures. It highlights how an AI research team, or any data team, can independently create tokens that could jeopardize the organization. These tokens can sidestep the security safeguards designed to shield the environment.

Mitigation strategies

For Azure Storage users:

1 - Avoid Account SAS tokens

The lack of monitoring makes this feature a security hole in your perimeter. A better way to share data externally is using a Service SAS with a Stored Access Policy. This feature binds a SAS token to a policy, providing the ability to centrally manage token policies.
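The benefit of binding a token to a policy can be illustrated with a toy model. This is a sketch of the concept only, not the Azure SDK: the `policies` table, the `training-data-read` identifier, and `effective_permissions` are hypothetical names standing in for the server-side lookup that happens when a SAS references a stored access policy instead of carrying its own permissions and expiry.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AccessPolicy:
    permissions: str
    expiry: datetime


# Hypothetical server-side policy table for one container, keyed by identifier.
policies = {
    "training-data-read": AccessPolicy("rl", datetime(2030, 1, 1, tzinfo=timezone.utc)),
}


def effective_permissions(policy_id: str, now: datetime) -> str:
    """Return the permissions a SAS bound to policy_id has right now ("" if none)."""
    policy = policies.get(policy_id)
    if policy is None or now >= policy.expiry:
        return ""
    return policy.permissions


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
before = effective_permissions("training-data-read", now)

# Central revocation: deleting the policy disables every token bound to it,
# without rotating the account key or touching tokens bound to other policies.
del policies["training-data-read"]
after = effective_permissions("training-data-read", now)
print(repr(before), repr(after))
```

Because the permissions and expiry live in the policy rather than in the URL, an administrator can also tighten them after the link has been shared.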

Better still, if you don't need this Azure Storage sharing feature, simply disable SAS access for each account you own.

2 - Enable Azure Storage analytics

Active SAS token usage can be monitored through the Storage Analytics logs for each of your storage accounts. Azure Metrics allows the monitoring of SAS-authenticated requests and identifies storage accounts that have been accessed through SAS tokens, for up to 93 days.

For all:

1 - Audit your GitHub perimeter for sensitive secrets

With around 90 million developer accounts, 300 million hosted repositories, and 4 million active organizations, including 90% of Fortune 100 companies, GitHub holds a much larger attack surface than meets the eye.

Last year, GitGuardian uncovered 10 million leaked secrets on public repositories, up 67% from the previous year.

GitHub must be actively monitored as part of any organization's security perimeter. Incidents involving leaked credentials on the platform continue to cause massive breaches for large companies, and this security hole in Microsoft's protective shell is reminiscent of the Toyota data breach from a year ago.

On October 7, 2022, Toyota, the Japan-based automotive manufacturer, revealed it had accidentally exposed a credential allowing access to customer data in a public GitHub repo for nearly 5 years. The code was public from December 2017 through September 2022. While Toyota says it has invalidated the key, an exposure this long could mean multiple malicious actors had already acquired access.

Detecting exposed sensitive tokens on GitHub is a unique feature of GitGuardian's Public Monitoring system. It allows security analysts to quickly inspect an organization's footprint on the platform, identify valid secrets, and assess the severity of incidents. What is more, the engine can include developers’ personal public repositories (where 80% of corporate credentials are leaked) in an organization's perimeter.

If your company has development teams, it is very likely that some of your company's secrets (API keys, tokens, passwords) end up on public GitHub. You should therefore evaluate your GitHub attack surface by requesting a complimentary audit.
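For a rough first pass on your own repositories, a naive pattern match can already surface Azure SAS URLs committed in source files. The sketch below is an assumption-laden illustration, not GitGuardian's detection engine: `find_sas_urls` is a hypothetical helper, the regular expression only covers the common `*.core.windows.net` endpoints, and it does not check whether a token is still valid.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Storage endpoint followed by a query string that contains a signature.
SAS_URL = re.compile(
    r"https://[a-z0-9]+\.(?:blob|file|queue|table|dfs)\.core\.windows\.net/\S*\?\S*sig=\S+",
    re.IGNORECASE,
)


def find_sas_urls(text: str) -> list[str]:
    """Return URLs in text that look like Azure SAS links (signature plus expiry)."""
    hits = []
    for match in SAS_URL.finditer(text):
        # Drop trailing quotes or punctuation picked up from the surrounding code.
        url = match.group(0).rstrip("\"').,")
        params = parse_qs(urlsplit(url).query)
        if "sig" in params and "se" in params:
            hits.append(url)
    return hits


# Made-up file content: the account name and signature are placeholders.
sample = (
    'DATA_URL = "https://exampleaccount.blob.core.windows.net/models'
    '?sv=2020-08-04&sp=rl&se=2051-10-06T00:00:00Z&sig=REDACTED"\n'
    'print("downloading models")\n'
)
print(find_sas_urls(sample))
```

A scan like this only covers what is in the working tree; secrets also linger in commit history and in developers' personal repositories, which is where dedicated monitoring is needed.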

2 - Lay out traps in the form of honeytokens

Do you need time to restructure governance around cloud storage access, yet need to be alerted if highly sensitive parts get scanned by a malicious actor?

Your best allies are honeytokens. These tokens are decoy AWS secrets you can deploy strategically across your software assets to regain observability in the grey areas of your IT infrastructure. Capturing the attackers' IP addresses, user agents, attempted actions, and the timestamp of each attempt will help you thwart attacks before they can inflict damage on your software supply chain.
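The alerting idea behind a honeytoken can be sketched in a few lines. Everything here is hypothetical and only illustrates the principle: the event dictionaries mimic a generic access log rather than any real log format, `honeytoken_alerts` is an invented helper, and the decoy key ID is the example access key ID used in AWS documentation. Since a decoy key has no legitimate use, any event that presents it is worth an alert.

```python
# Key IDs that were planted as decoys and should never appear in real traffic.
DECOY_KEY_IDS = {"AKIAIOSFODNN7EXAMPLE"}

ALERT_FIELDS = ("source_ip", "user_agent", "action", "timestamp")


def honeytoken_alerts(events: list[dict]) -> list[dict]:
    """Return the details of every event that used a decoy key."""
    alerts = []
    for event in events:
        if event["access_key_id"] in DECOY_KEY_IDS:
            alerts.append({field: event[field] for field in ALERT_FIELDS})
    return alerts


# Made-up events; the IP addresses come from the documentation-only ranges.
events = [
    {"access_key_id": "AKIAREALKEYPLACEHOLD", "source_ip": "198.51.100.7",
     "user_agent": "ci-runner", "action": "GetObject", "timestamp": "2023-09-18T09:00:00Z"},
    {"access_key_id": "AKIAIOSFODNN7EXAMPLE", "source_ip": "203.0.113.42",
     "user_agent": "curl/8.0", "action": "ListBuckets", "timestamp": "2023-09-18T09:05:12Z"},
]
for alert in honeytoken_alerts(events):
    print(alert)
```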

Final words

Every organization, regardless of size, needs to be prepared to tackle a wide range of emerging risks. These risks often stem from insufficient monitoring of the extensive software operations within modern enterprises. In this case, an AI research team inadvertently created and exposed a misconfigured cloud storage sharing link, bypassing security guardrails. But how many other departments - support, sales, operations, or marketing - could find themselves in a similar situation? The increasing dependence on software, data, and digital services amplifies cyber risks on a global scale.

Combatting the spread of confidential information and its associated risks necessitates reevaluating security teams' oversight and governance capabilities. It also requires the provision of appropriate tools to identify and counteract emerging threat categories. While human errors are an inevitable part of the process, GitGuardian is here to guide you along your security journey.