The number one reason AI projects stall before they even start: the leadership team convinces themselves their data needs to be cleaned up first.
“We know our CRM is a mess. Give us a few months to get it sorted, then we’ll bring you in.” Six months later, nothing has happened. The data is still messy. The AI landscape has shifted. The ideas they had are different now. And their competitor already has a system running.
I hear this constantly. It’s the most common misconception in AI implementation, and it kills projects before a single line of code is written.
Here’s what’s actually true: LLMs are remarkably good at navigating unstructured, messy data. The data hygiene obsession is a holdover from the RPA and traditional automation era, where everything had to be perfectly structured because the systems were completely deterministic. Spelling had to be exact. Formats had to match. One edge case could break the entire flow.
That world is over. What matters now isn’t whether your data is clean. It’s how your data flows through the business.
The Data Flow, Not the Data
When we come into an engagement, the first thing we do is an audit. But we’re not auditing data quality. We’re auditing data movement. Where does data live? Who creates it? Who touches it? How does it get from one system to another? Where does it get stuck? Where is the same information being entered into multiple systems by different people?
This map reveals the actual structure of your operation — not the org chart, not the process document nobody’s updated, but the real flow of information through the business.
We track data through four categories. This will sound familiar if you’re an engineer — it’s basically CRUD operations reframed for business context.
Created. Where does new data come from? An email. A form submission. A call transcript. A system event. What’s the trigger, and how do we capture it?
Manipulated. How does data get transformed once it exists? How does it relate to other data in the organization? That incoming proposal is connected to a client account, which is connected to a deal in the CRM, which is connected to a project, which is connected to team members. Pieces of the data need to flow to different places — pricing goes to the CRM, scope goes to the project management tool, contact details go to the communication system.
Shared. Where does data move? Which tools does it get copied to? Which people does it get passed to? Where does it live once the task is done? Is it needed for reporting later?
Deleted. When does data stop being useful? When are new versions created? What happens to old versions? Who decides what gets kept?
The reframe I’m trying to establish: most companies think they have a data problem. What they actually have is a sprawl problem. Data lives in five or six different platforms because that’s how work gets done — someone creates a contract in PandaDoc, sends it to legal via email, gets a redlined version back, saves it to Google Drive, pushes it to the CRM, sends it to the client for signature, and the executed version ends up in three different places. One piece of data, six platforms. That’s not a data quality issue. That’s a data flow issue.
What This Looks Like in Practice
Let me walk through a simple example. An email hits Gil’s inbox at 8 AM — a signed proposal, time to onboard a new client. Gil needs to set up a Slack channel, structure Google Drive folders, log the deal in the CRM, add people to the project, and send the client a welcome email.
But Gil has a question. He messages Nick on Slack asking who should be on the project. Nick is in back-to-back meetings. Two hours go by. Nick finally sees the message, spends 45 minutes digging through Google Drive and email to find the answer, then responds. Gil has moved on to other things. Another hour passes before Gil sees Nick’s reply. Then 10 minutes to finish onboarding.
Four and a half hours from signed proposal to onboarded client. One hour of combined labor capacity consumed between Gil and Nick. Plus the hidden costs — Nick context-switching away from his actual work, both of them carrying mental overhead from the open task, everything else on their to-do lists getting pushed back.
Now imagine the AI-native version. The email arrives. The system picks it up immediately. It has access to all the data it needs — no waiting on Nick, no searching through Google Drive, no Slack messages. It sets up the channel, structures the folders, logs the CRM data, adds the team, sends the welcome email. Ten minutes. Zero human capacity consumed.
Extrapolate that across every function in the business — every point where data comes in and actions need to happen — and you start to see the fundamental inefficiency in how we run organizations today. It was always the cost of doing business. It doesn’t have to be anymore.
The Pattern: Deterministic Where We Can, LLM Where We Must, Human Where It Matters
When we design systems, we follow a simple pattern.
Start with the data flow. Map the events — emails, Slack messages, form submissions, system triggers. Most of them don’t require action. Some need data routed to another system. Some trigger a workflow.
For each workflow, we determine: can this step be handled deterministically? If the logic is clear and the inputs are structured, we don’t need an LLM. We use code. It’s cheaper, faster, and more reliable.
Where we can’t be deterministic — because the input is unstructured, the reasoning is ambiguous, or edge cases are too varied — we use an LLM. It handles the reasoning, classifies the intent, generates the output. The LLM is placed precisely where it’s needed, not running the entire show.
And where it really matters — where judgment, taste, relationship context, or accountability is required — a human steps in. The key is constraining what goes to the human to the highest-leverage touchpoint. Not human as fallback for reasoning. Human as the asset that adds value no system can replicate.
This is the architecture shift. Old world: humans run the workflow, occasionally using tools. New world: the system runs the workflow, escalating to humans at the points where human involvement is highest-leverage.
Why the “Clean Your Data First” Instinct Is Backward
Here’s the trade-off most people don’t understand. The cleaner and more structured your data, the more you can do deterministically — minimal LLM usage, lower cost, higher speed. The messier the data, the more you need LLMs to reason through it — higher cost, but still functional.
LLMs can handle messy data. That’s one of their core strengths. It’s the business logic that needs to be structured, not the data itself.
If you know the logic of your workflows — what should happen when this event occurs, what data is needed, where it goes, who needs to be involved — an LLM can figure out the messy data part. It can match “Johnson & Johnson” to “J&J” to “Johnson and Johnson Inc.” It can parse a PDF that’s formatted differently every time. It can extract the right fields from an email that doesn’t follow any template.
What it can’t do is guess what your business rules are. That’s the actual requirement for AI readiness — documented logic, not clean data.
Where to Start
If you want to start mapping this yourself, pick one workflow that’s burning real hours. Follow one unit of work from start to finish — one client being onboarded, one invoice being processed, one report being generated, one ticket being handled.
At every step, write down three things: What data did the person need? Where did they get it? What did they do with it?
Count the hops. Every time someone opens a different tool, copies information from one system to another, or re-enters data that already exists somewhere else — that’s a hop. Don’t worry about API connections or system architecture. Just count the hops.
At the end, you’ll have a workflow map that shows you how data actually flows through that process. You’ll see where time is wasted, where information is being re-entered unnecessarily, and where the bottlenecks are. That map is worth more than six months of data cleanup, because it shows you exactly where AI can step in and collapse the workflow.
Your data hygiene is not the problem. Don’t let it stop you from getting started.