High-quality AI Needs High-quality Data, but What Is That Exactly?

Last updated on 06/03/2024

In today’s data-driven world, businesses increasingly rely on artificial intelligence (AI) to make decisions, automate tasks, and improve customer experiences. However, the quality of the data an AI system uses determines its efficacy. “Garbage in, garbage out,” as the old saying goes.

What is AI Data Quality?

Data quality refers to the accuracy, completeness, and consistency of data. Good data quality is essential for:

training high-quality machine learning models
building high-fidelity simulations for optimization
achieving high accuracy in computer vision and natural language processing

Understanding Good Data vs. Bad Data

What makes data ‘good’? Good data is:

Accurate

That is, good data does not contain substantial errors. AI techniques require an accurate representation of the problems they solve to produce high-quality solutions. If we start with inaccurate data, we are effectively putting a ceiling on how well the AI can do before we begin.

Consistent

Consistency comes in many forms: cadence, shape, specificity, quality, and so on. There are techniques for dealing with all sorts of problematic data, but it’s easiest to handle one consistent type of data when solving a problem. When the shape of the data is known and doesn’t fluctuate over time, we can invest more effort in both data preparation and AI engineering. As long as the data are relatively consistent, we can amortize those investments over the lifetime of a project.
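One way to check the “shape” kind of consistency is to compare every incoming record against an expected schema. The sketch below is illustrative only: the field names and the EXPECTED_SCHEMA mapping are hypothetical, and a real project would likely rely on a dedicated validation library rather than hand-rolled checks.

```python
from typing import Any

# Hypothetical schema for illustration: field name -> expected Python type.
EXPECTED_SCHEMA: dict[str, type] = {
    "sensor_id": str,
    "timestamp": float,
    "temperature_c": float,
}


def schema_problems(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    # Every expected field must be present and have the expected type.
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    # Fields the schema does not know about are also a sign of drift.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


good = {"sensor_id": "a1", "timestamp": 1717400000.0, "temperature_c": 21.5}
drifted = {"sensor_id": "a1", "timestamp": "2024-06-03", "temp_f": 70.7}

print(schema_problems(good, EXPECTED_SCHEMA))     # no problems reported
print(schema_problems(drifted, EXPECTED_SCHEMA))  # type change, missing and renamed field
```

Running a check like this at ingestion time makes shape changes visible as soon as they happen, instead of surfacing later as degraded model performance.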

Timely

Timeliness is in the eye of the beholder. The decision that AI makes from the data must still be relevant once the data have been collected and the computations completed. In manufacturing and self-driving scenarios, timely may mean a fraction of a second. For municipal planning, it can mean days or weeks.
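Because “timely” depends on the use case, a freshness check usually takes the budget as a parameter rather than hard-coding it. Here is a minimal sketch; the use-case names and the specific budgets in MAX_AGE are assumptions chosen to echo the examples above, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness budgets, loosely following the examples in the text.
MAX_AGE = {
    "manufacturing": timedelta(milliseconds=200),
    "municipal_planning": timedelta(days=14),
}


def is_timely(collected_at: datetime, use_case: str, now: datetime) -> bool:
    """True if the data is no older than the freshness budget for this use case."""
    return now - collected_at <= MAX_AGE[use_case]


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 6, 3, 12, 0, 0, tzinfo=timezone.utc)
reading_time = now - timedelta(seconds=5)

print(is_timely(reading_time, "manufacturing", now))       # too old for the factory floor
print(is_timely(reading_time, "municipal_planning", now))  # fresh enough for planning
```

The same five-second-old reading is stale for one consumer and perfectly fresh for another, which is the point: timeliness is a property of the data and the decision together.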

Complete & Relevant

AI systems operate on the age-old “garbage in, garbage out” principle. For any intelligence, artificial or otherwise, to make a good decision, it must be informed. Without a complete view of the situation, we cannot hope to build a system that makes good decisions. Too much data can also be a problem. Any data we consider has to be processed and stored. If the data aren’t useful, that’s overhead we don’t have to pay. Extra information may also confuse some AI techniques, causing them to over-focus on irrelevant data points.
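A simple first measurement of completeness is the fraction of records in which each field is missing. The sketch below assumes records arrive as dictionaries and treats an absent key and a None value the same way; the customer fields are made up for illustration. It addresses only the completeness half of this section, since relevance usually has to be judged against the problem being solved.

```python
from typing import Any


def missing_rates(records: list[dict[str, Any]], fields: list[str]) -> dict[str, float]:
    """Fraction of records in which each field is absent or None."""
    if not records:
        raise ValueError("need at least one record")
    return {
        field: sum(1 for r in records if r.get(field) is None) / len(records)
        for field in fields
    }


# Hypothetical customer records with some gaps.
records = [
    {"customer_id": 1, "age": 34, "postcode": "90210"},
    {"customer_id": 2, "age": None, "postcode": "10001"},
    {"customer_id": 3, "postcode": None},
    {"customer_id": 4, "age": 51, "postcode": "60614"},
]

rates = missing_rates(records, ["customer_id", "age", "postcode"])
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Fields with a high missing rate are candidates for supplemental collection, while fields that turn out not to matter can be dropped to save processing and storage.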

Unbiased

Bias can occur in the input to an AI system, as well as in the AI model itself. For input, we need to know that whatever data we use to build and evaluate a system is representative of the data it will see and use in production. If not, we may field a system that we believe to be robust but that fails to perform in practice. From a modeling perspective, we prefer not to build AI systems that perpetuate human prejudices. Datasets produced by recording human actions can carry bad human behaviors into automated systems if we do not take care to remove those biases up front.
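To make “representative” measurable, we can compare how often each category appears in the training data versus a sample of production data. The sketch below uses total variation distance, which is one of several reasonable measures and is our choice here rather than anything prescribed above; the urban/rural labels and counts are invented for illustration.

```python
from collections import Counter


def proportions(labels: list[str]) -> dict[str, float]:
    """Share of each category in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(sample_a: list[str], sample_b: list[str]) -> float:
    """Half the summed absolute difference in category shares (0 = identical, 1 = disjoint)."""
    p, q = proportions(sample_a), proportions(sample_b)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical data: training over-represents urban customers.
training = ["urban"] * 90 + ["rural"] * 10
production = ["urban"] * 60 + ["rural"] * 40

gap = total_variation_distance(training, production)
print(f"distribution gap: {gap:.2f}")
```

A gap near zero suggests the training data resembles production on that attribute. What counts as too large a gap is a judgment call that depends on the application.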

Bad Data

Bad data isn’t necessarily bad; it’s just recorded that way. In most cases, we needn’t discard it. Bad data can often be rehabilitated, depending on why it’s bad:

Untimely data is still useful for training historical models, batch training, and validating models
Redundant data can be identified and discarded
Human intervention can rehabilitate incomplete data (with supplemental research and annotation) and inconsistent data (with manual curation)
Biased data is still useful in some learning regimes: bias sampling allows some learning methods to avoid model bias with explicit training bias, and biased data can be underrepresented in test and training sets when bias sampling isn’t possible
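Two of these repairs are easy to sketch in code: discarding redundant records, and reweighting so that an over-represented group does not dominate training. The second function uses inverse-frequency weighting, which is our assumption about one way to counter bias with an explicit sampling bias; the sales records and the "region" field are hypothetical.

```python
from collections import Counter


def drop_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        # Sorting the items makes the key independent of insertion order.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def inverse_frequency_weights(records: list[dict], group_field: str) -> list[float]:
    """Weight each record so every group has the same total weight; weights sum to 1."""
    counts = Counter(r[group_field] for r in records)
    n_groups = len(counts)
    return [1.0 / (n_groups * counts[r[group_field]]) for r in records]


# Hypothetical sales records: one exact duplicate, and "north" is over-represented.
records = [
    {"region": "north", "sale": 10},
    {"region": "north", "sale": 10},
    {"region": "north", "sale": 12},
    {"region": "north", "sale": 9},
    {"region": "south", "sale": 11},
]

unique = drop_duplicates(records)
weights = inverse_frequency_weights(unique, "region")
for record, weight in zip(unique, weights):
    print(record, round(weight, 3))
```

The resulting weights can be passed to a weighted sampler such as random.choices(unique, weights=weights, k=...) so that both regions are drawn equally often on average.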

How Bad Data Can Confuse AI

Bad data can have a negative impact on AI models in a number of ways. For example, it can:

Lead to inaccurate predictions
Generate unreliable results
Make it difficult to identify patterns and trends
Increase the risk of bias
Waste time and resources

Why Keeping Data Clean Is a Challenge

Data hygiene doesn’t come for free, and it doesn’t come naturally. Thus, many organizations struggle with it. Data comes from a variety of sources, including:

Customer transactions (e.g., point-of-sale systems, online shopping)
Social media interactions
Sensors
Third-party brokers

Each source has its own set of issues. Sensor data contains noise. Point-of-sale systems may misread barcodes. Data from sensors and third-party brokers may be delayed depending on how it’s collected. The more systems producing data, the more complete your view of the world. However, the more systems you have, the more complications you’ll encounter.

How Good Data and Good AI Help Businesses

Good data can help businesses in many ways, including improving:

Decision making in many arenas, including product pricing, marketing campaigns, customer service, and resource allocation
Customer outcomes, through personalized customer experiences and timely support and interventions
Margins, through reduced labor costs from automating menial tasks, better quality work and higher throughput with AI assistants, and better use of resources in manufacturing contexts

In conclusion, clean data is essential for the success of AI models. Businesses that invest in data quality today will be better positioned to take advantage of AI tomorrow.

Here are some additional tips for businesses to improve their data quality:

Establish clear data quality standards and procedures.
Implement data validation and cleansing processes.
Monitor data quality on an ongoing basis.
Invest in data quality tools and solutions.

By following these tips, businesses can ensure that they have the clean data they need to make the most of AI.