AI’s Achilles’ Heel: The Data Quality Dilemma
As AI has gained prominence, all the data quality issues we’ve faced historically are still relevant. However, there are additional complexities when dealing with the nontraditional data that AI often uses.
AI Data Has Different Quality Needs
When AI uses traditional structured data, all the same data cleansing processes and protocols that have been developed over the years can be used as-is. To the extent an organization already has confidence in its traditional data sources, the use of AI shouldn’t require any special data quality work.
The catch, however, is that AI often uses nontraditional data that can’t be cleansed in the same way as traditional structured data. Think of images, text, video, and audio. When using AI models with this type of data, quality is as important as ever. But unfortunately, the traditional methods used for cleansing structured data simply don’t apply. New approaches are required.
AI’s Different Needs: Input And Training
First, let’s use an example of image data quality from the input and model training perspective. Typically, each image has been given tags summarizing what it contains. For example, “hot dog” or “sports car” or “cat.” This tagging, typically done by humans, can contain true errors as well as cases where different people interpret the image differently. How can we identify and handle such situations?
It isn’t easy! With numerical data, it’s possible to identify bad data via mathematical formulas or business rules. For example, if the price of a candy bar is $125, we can be confident it can’t be right because it is so far above expectation. Similarly, a person shown as age 200 clearly doesn’t make any sense. There really isn’t an effective way today to mathematically check if tags are accurate for an image. The best way to validate the tag is to have a second person assess the image.
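To make the contrast concrete, here is a minimal sketch of the kind of business-rule check that works for numerical data. The field names and the plausibility bounds are assumptions chosen for illustration, not recommended thresholds.

```python
# Hypothetical business rules: each maps a field name to a plausibility check.
# The bounds are illustrative assumptions only.
RULES = {
    "candy_bar_price": lambda v: 0 < v <= 10,  # dollars
    "age": lambda v: 0 <= v <= 120,            # years
}


def find_violations(records, rules=RULES):
    """Return (row_index, field, value) for every value that fails its rule."""
    violations = []
    for i, record in enumerate(records):
        for field, is_plausible in rules.items():
            if field in record and not is_plausible(record[field]):
                violations.append((i, field, record[field]))
    return violations


records = [
    {"candy_bar_price": 1.25, "age": 34},
    {"candy_bar_price": 125.0, "age": 41},
    {"candy_bar_price": 0.99, "age": 200},
]
print(find_violations(records))
```

The $125 candy bar and the 200-year-old person are both flagged. No comparably simple rule exists for deciding whether “cat” is the right tag for a given image.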
An alternative is to develop a process that uses other AI models to scan the image and see if the tags applied appear to be correct. In other words, we can use existing image models to help validate the data being fed into future models. While there is potential for some circular logic in doing this, models are becoming robust enough that it shouldn’t be a problem pragmatically.
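One way such a process might look, as a rough sketch: compare each human tag with the prediction of an existing model and send confident disagreements to a person for a second look. The `predict` callable, the confidence cutoff, and the stand-in model below are all assumptions made so the example runs; a real pipeline would plug in an actual image classifier.

```python
def flag_suspect_tags(items, predict, min_confidence=0.8):
    """Return records whose human tag disagrees with a confident model.

    items: iterable of (image_id, image, human_tag)
    predict: callable(image) -> (predicted_tag, confidence in [0, 1])
    """
    suspects = []
    for image_id, image, human_tag in items:
        predicted_tag, confidence = predict(image)
        if predicted_tag != human_tag and confidence >= min_confidence:
            suspects.append((image_id, human_tag, predicted_tag, confidence))
    return suspects


# Stand-in for a real image model, used only so the sketch runs end to end.
def fake_predict(image):
    return image["model_says"], image["confidence"]


items = [
    ("img1", {"model_says": "cat", "confidence": 0.97}, "cat"),
    ("img2", {"model_says": "hot dog", "confidence": 0.93}, "sports car"),
    ("img3", {"model_says": "cat", "confidence": 0.55}, "hot dog"),
]
suspects = flag_suspect_tags(items, fake_predict)
for suspect in suspects:
    print(suspect)
```

Only the second image is flagged. The third also disagrees with its tag, but the model is not confident enough for the disagreement to count as evidence. Flagged items go to human review rather than being relabeled automatically, which limits the circular-logic risk.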
AI’s Different Needs: Output And Scoring
Next, let’s use an example of image data quality from the model output and scoring perspective. Once we have an image model that we have confidence in, we feed the model new images so that it can assess them. For instance, does the image contain a hot dog, or a sports car, or a cat? How can we assess if an image provided for assessment is “clean enough” for the model? What if the image is blurry or pixelated or otherwise not clear? Is there a way to “clean” the image?

The confidence we can have in what an AI model tells us is in the image depends directly on how clear the image is. In a case such as the image above, how do we know if the image is a blurred view of trees or something else entirely? Even as humans, there is subjectivity in this assessment and no clear path to an automated, algorithmic approach for declaring the image “clean enough” or not. Here, manual review may be best. In the absence of that, we can again have an algorithm that scores the clarity of the input image, along with processes to rate the confidence in the descriptions generated by the model’s assessment. Many AI applications do this today, but there is certainly improvement possible.
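As one illustration of a clarity score, the sketch below uses the variance of a Laplacian filter, a common blur heuristic. It is swapped in here as an example and is not a method the article names. The threshold is an arbitrary placeholder, and the two tiny synthetic images exist only to show the score moving in the expected direction.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of a 4-neighbor Laplacian over a 2D grayscale image.

    gray: list of equal-length rows of pixel intensities (at least 3x3).
    Low values suggest few sharp edges, which often indicates blur.
    """
    responses = []
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            responses.append(
                gray[y - 1][x] + gray[y + 1][x]
                + gray[y][x - 1] + gray[y][x + 1]
                - 4 * gray[y][x]
            )
    return pvariance(responses)


def is_clear_enough(gray, threshold=100.0):
    # The threshold is a placeholder; tune it on images you have reviewed.
    return laplacian_variance(gray) >= threshold


# Synthetic examples: a hard vertical edge vs. the same edge as a gradual ramp.
sharp = [[0] * 4 + [255] * 4 for _ in range(8)]
blurry = [[int(255 * x / 7) for x in range(8)] for _ in range(8)]

print(laplacian_variance(sharp), laplacian_variance(blurry))
print(is_clear_enough(sharp), is_clear_enough(blurry))
```

A score like this only measures edge sharpness. It cannot say whether a blurry image shows trees or something else, which is why manual review still matters for borderline cases.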
Rising To The Challenge
The examples provided illustrate that classic data quality approaches like missing value imputation and outlier detection can’t be applied directly to data such as images or audio. These new data types, which AI is heavily dependent on, will require novel methodologies for assessing quality on both the input and the output end of the models. Given that it took us many years to develop our approaches for traditional data, it should come as no surprise that we have not yet achieved similar standards for the unstructured data that AI uses.
Until these standards arise, it’s necessary to:
- Constantly scan industry blogs, papers, and code repositories to keep tabs on newly developed approaches
- Make your data quality processes modular so that it’s easy to alter or add procedures to make use of the latest advances
- Be diligent in studying known errors in order to identify whether patterns exist in where your cleansing processes and models are performing better and worse
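The last point above, studying known errors for patterns, can start as simply as breaking error rates down by some attribute of the data. The sketch below assumes a review log in which each record carries a boolean `correct` flag plus metadata; the field names and values are invented for illustration.

```python
from collections import defaultdict


def error_rate_by_group(results, key):
    """Map each value of `key` to the share of records marked incorrect.

    results: list of dicts with a boolean "correct" field plus metadata.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for row in results:
        group = row[key]
        totals[group] += 1
        if not row["correct"]:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical review log; the field names are assumptions.
results = [
    {"source": "phone", "correct": True},
    {"source": "phone", "correct": False},
    {"source": "phone", "correct": False},
    {"source": "scanner", "correct": True},
    {"source": "scanner", "correct": True},
]
rates = error_rate_by_group(results, "source")
for group, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group}: {rate:.0%} errors")
```

A group with a much higher error rate than the rest is a natural place to look first for a cleansing step or model that is underperforming.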
Data quality has always been a thorn in the side of data and analytics practitioners. Not only do the traditional issues remain as AI is deployed, but the different data that AI uses introduces all sorts of novel and difficult data quality challenges to address. Those working in the data quality realm should have job security for some time to come!
Originally posted in the Analytics Matters newsletter on LinkedIn
The post AI’s Achilles’ Heel: The Data Quality Dilemma appeared first on Datafloq.

