Governing the Data Lake: The Precursor to Safe AI
Everyone wants to talk about “AI Safety.” They want to talk about model weights, context windows, and hallucination rates.
But as a GRC Engineer, I am looking at the plumbing. And frankly, the plumbing is leaking.
In October 2025, the most common architecture I see is RAG (Retrieval-Augmented Generation). We point an LLM at our corporate “Data Lake” and tell it to answer questions.
The problem? For the last five years, we treated our Data Lakes like digital dumpsters. We dumped everything—financials, HR records, public marketing data—into a single storage account, often with “Contributor” access granted to entire teams.
If you point a hyper-intelligent AI at a flat, ungoverned storage bucket, you are automating data exfiltration.
The “Flat” Storage Problem
In standard Azure Blob Storage, the permission model is relatively flat. You usually grant access at the Container level. If a user (or an AI Agent) needs access to one file in the container, they often get access to all files.
That works for hosting images for a website. It is catastrophic for an AI that answers questions about employee salaries.
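To make that concrete, here is a minimal sketch using the `azure-storage-blob` SDK. The account and container names are hypothetical; the point is that a single container-scoped grant (say, Storage Blob Data Reader) lets the caller enumerate and read every blob in the container, with no finer-grained boundary inside it:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.blob import ContainerClient

# Hypothetical account and container. With flat Blob Storage, one
# container-scoped role assignment covers every blob inside it.
container = ContainerClient(
    account_url="https://contosolake.blob.core.windows.net",
    container_name="corporate-data",
    credential=DefaultAzureCredential(),
)

# Anything the identity can list, it can read: marketing assets and
# HR exports sit behind the same grant.
for blob in container.list_blobs():
    print(blob.name)
```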
The Engineering Fix: ADLS Gen2 & ACLs
To survive the AI era, we have to migrate from standard Blob Storage to Azure Data Lake Storage Gen2 (ADLS Gen2).
Why? Because Gen2 gives us Hierarchical Namespaces.
This sounds boring, but it is critical. It turns flat storage into a real file system with folders and sub-folders. More importantly, it allows us to apply POSIX-style ACLs (Access Control Lists) at the directory and file level.
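Hierarchical namespace is a creation-time property of the storage account. A minimal sketch with the `azure-mgmt-storage` SDK (the subscription ID, resource group, account name, and region are placeholders) looks like this:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# is_hns_enabled=True is the switch that turns a plain StorageV2
# account into ADLS Gen2 with a hierarchical namespace.
poller = client.storage_accounts.begin_create(
    resource_group_name="rg-data-platform",  # hypothetical
    account_name="contosolake",              # hypothetical
    parameters=StorageAccountCreateParameters(
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        location="eastus",
        is_hns_enabled=True,
    ),
)
account = poller.result()
```

Once the namespace is hierarchical, ACLs become available at every directory and file.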
This means we can engineer a structure where:
- The Finance AI Agent has Read and Execute permissions on /Finance/2025/.
- But it is explicitly denied access to /HR/Salaries/.
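In code, that looks roughly like the following, using the `azure-storage-file-datalake` SDK. The account, file system, and the agent's Entra object ID are hypothetical. Note that POSIX ACLs are grant-based: the "deny" here is an entry with zero permission bits, plus the absence of any other grant.

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

FINANCE_AGENT_OID = "11111111-1111-1111-1111-111111111111"  # hypothetical Entra object ID

service = DataLakeServiceClient(
    account_url="https://contosolake.dfs.core.windows.net",  # dfs endpoint, not blob
    credential=DefaultAzureCredential(),
)
fs = service.get_file_system_client("corporate-data")  # hypothetical file system

# Read + execute on /Finance/2025, applied recursively, plus a default
# entry so newly created children inherit the same grant.
fs.get_directory_client("Finance/2025").update_access_control_recursive(
    acl=f"user:{FINANCE_AGENT_OID}:r-x,default:user:{FINANCE_AGENT_OID}:r-x"
)

# An entry with no bits set on /HR/Salaries: the agent is named,
# and named with zero permissions.
fs.get_directory_client("HR/Salaries").update_access_control_recursive(
    acl=f"user:{FINANCE_AGENT_OID}:---"
)
```

One traversal detail: to reach /Finance/2025 the agent also needs execute (`--x`) on the root and on /Finance, exactly as in any POSIX file system.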
Networking is the Second Wall
Identity isn’t enough. We also need to talk about Private Endpoints.
If your Data Lake is accessible via the public internet (even with auth), you are one misconfigured token away from a breach. In a GRC Engineering model, we enforce—via Azure Policy—that all Data Lakes must disable public access and use Azure Private Link. The traffic between your AI compute (e.g., Azure OpenAI) and your data should never leave the Microsoft backbone network.
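Here is a sketch of the enforcement half, using `azure-mgmt-resource` to register a deny policy. The subscription ID and definition name are placeholders; the rule itself checks the real `publicNetworkAccess` property on storage accounts:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import PolicyClient

SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"  # placeholder

# Deny any storage account whose publicNetworkAccess is not Disabled.
policy_rule = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
            {
                "field": "Microsoft.Storage/storageAccounts/publicNetworkAccess",
                "notEquals": "Disabled",
            },
        ]
    },
    "then": {"effect": "deny"},
}

client = PolicyClient(DefaultAzureCredential(), SUBSCRIPTION_ID)
client.policy_definitions.create_or_update(
    "deny-public-storage-access",  # hypothetical definition name
    {
        "policy_rule": policy_rule,
        "display_name": "Storage accounts must disable public network access",
        "mode": "All",
    },
)
```

Pair the deny with a private endpoint on the lake's `dfs` sub-resource so your AI compute reaches it over Private Link only.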
The Architect’s Takeaway
You cannot fix “AI Risk” if your data governance is broken.
Before you buy more GPUs, look at your storage accounts. If you are still running a “flat” permission model, your AI isn’t a tool; it’s a liability.
Fix the lake first. Then launch the boat.