Is Data Quality Moving Upstream for AI?
Before we can talk about the new AI corpus, we need to look backward.
For years, data + AI teams have been trained to look downstream toward their analysts or business users for requirements.
This is partly because data quality is use-case specific. For example, a machine learning application may require fresh but only directionally accurate data, while a finance report might need to be accurate down to the penny but updated only once per day.
But it wasn’t all pragmatic. It was also reactive.
The truth is, even if you wanted to look upstream, most upstream data sources wouldn’t talk to you. They were either third-party sources pumping data into the void, or internal software engineers creating a web of microservices… that were also pumping data into the void.
New number, who dis?
In response, we’d even begun to play middleman, bringing requirements from downstream users to our upstream data producers.
And this approach (flawed as it was) actually worked for a time. The challenge we’re facing in the wake of the AI race is that, while it’s not obsolete, it’s no longer sufficient.
So, what’s new?
The Data + AI Team’s New Best Friend: Knowledge Managers?
With unstructured RAG pipelines, the data source is no longer a messy database… it’s a messy knowledge base, document repo, wiki, SharePoint site, etc.
And guess what?
These data sources are just as opaque as their structured foils, but with the added complication of also being less predictable.
BUT there’s a silver lining.
Unlike the structured stalwarts that ruled before the AI enlightenment, unstructured data sources are (almost always) owned by a subject matter expert – or “knowledge manager” – with a clear understanding of what good looks like.
This AI corpus was created and curated for a reason, likely to answer the same kinds of questions and solve the same problems that your AI chatbot or agent is looking to solve.
And where those third parties and software engineers might be unwilling to discuss the minutiae of their data, these knowledge managers are more than happy to guide you through their painstakingly curated and managed repository.

“And they said, what do you mean version control?”
And that means these knowledge managers are the perfect partners to define what quality looks like.
Managing Unstructured Data Quality Upstream
When it comes to the unpredictability of unstructured data + AI pipelines, the best defense is a good offense. That means shifting left to build requirements alongside the knowledge managers who understand their data best.
If you want to get to the beating heart of your AI corpus, start with questions like:
- What canonical documents should always be there? (completeness)
- What is the process for updating documents, and how often does it happen? (freshness)
- How stable are the file structures? Are there headings, sections, etc.? (chunking strategy, validity)
- What are the most critical metadata filters? How often do they change? (schema)
- Is it all in one language? Does it contain code or HTML? (validity)
- Are there file naming conventions? Any jargon, shorthand, or contradictory terms? (validity)
- Who are the most common users? What are the most common questions? (eval strategy)
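Several of these questions can even be turned into automated checks once the knowledge manager has answered them. Here is a minimal sketch in Python, assuming a hypothetical corpus snapshot where each document has a name, a last-modified timestamp, and raw text; the canonical document list, the 30-day freshness window, and the HTML heuristic are illustrative assumptions, not fixed rules:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical canonical documents and freshness SLA -- stand-ins for
# whatever your knowledge manager actually tells you.
CANONICAL_DOCS = {"onboarding.md", "pricing-policy.md", "support-faq.md"}
MAX_STALENESS = timedelta(days=30)

def audit_corpus(docs, now):
    """Check a corpus snapshot for completeness, freshness, and validity.

    `docs` maps file name -> {"modified": datetime, "text": str}.
    Returns issue type -> sorted list of offending file names.
    """
    issues = {
        # Completeness: every canonical document should be present.
        "missing": sorted(CANONICAL_DOCS - docs.keys()),
        "stale": [],
        "invalid": [],
    }
    for name, meta in sorted(docs.items()):
        # Freshness: flag documents not updated within the agreed window.
        if now - meta["modified"] > MAX_STALENESS:
            issues["stale"].append(name)
        # Validity: flag raw HTML that would pollute chunking and embedding.
        if "<html" in meta["text"].lower():
            issues["invalid"].append(name)
    return issues

# Example run against a tiny fabricated snapshot.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
report = audit_corpus(
    {
        "onboarding.md": {"modified": now - timedelta(days=2), "text": "Welcome!"},
        "support-faq.md": {"modified": now - timedelta(days=90), "text": "<html>…"},
    },
    now,
)
```

A report like this won’t replace the conversation with the knowledge manager, but it keeps their answers enforceable after the conversation ends.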
Once you understand who maintains that data source and what questions you need them to answer, you’re just a conversation away from gathering the requirements you need to create reliable data + AI systems.
Don’t Let Your AI Corpus Become a Crisis
An AI response can be relevant, grounded, and completely wrong. And if you aren’t as intimately familiar with your AI corpus (and its administrators) as you are with your pipelines and your models, you will fail.
The most practical way to get ahead of this silent failure is to ensure your AI is always receiving the most accurate and up-to-date content.
And the good news is, you probably have a resource on your team who is ready and willing to help.
One of the best ways to do that is to ensure you always have corpus-embedding alignment – which means data + AI team and knowledge manager alignment.
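One concrete (and assumed, not prescribed) way to keep the corpus and the embeddings aligned is to store a content hash alongside each embedding and re-embed any document whose source text no longer matches. A sketch, with hypothetical document names:

```python
import hashlib

def content_hash(text):
    # Digest the indexing job would record at embedding time.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reembed(corpus, index_hashes):
    """Return documents whose current text no longer matches the hash
    recorded when they were last embedded (or that were never embedded).

    `corpus` maps document name -> current text; `index_hashes` maps
    document name -> sha256 hex digest stored in the vector index.
    """
    drifted = set()
    for name, text in corpus.items():
        if index_hashes.get(name) != content_hash(text):
            drifted.add(name)  # new doc, or content changed since embedding
    return drifted

# Example: the policy doc was edited after it was embedded.
corpus = {"policy.md": "v2 of the policy", "faq.md": "unchanged answers"}
index_hashes = {
    "policy.md": content_hash("v1 of the policy"),
    "faq.md": content_hash("unchanged answers"),
}
stale = docs_to_reembed(corpus, index_hashes)
```

Run on a schedule (or on knowledge-base update webhooks), a check like this keeps the retrieval index from silently drifting away from the documents your knowledge managers are maintaining.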
Once upon a time, downstream alignment was enough to create effective requirements. But no longer. If you’re building data + AI systems, you HAVE to cast an eye both downstream and upstream.
Outputs are only HALF the story. If your AI is wrong, the problem is just as likely to be upstream with your inputs (or lack of inputs) as it is in the model itself.
Remember that lesson – and operationalize a data + AI observability solution – and you’ll be one step ahead of the AI reliability game.
The submit Is Data Quality Moving Upstream for AI? appeared first on Datafloq.
