Last updated on 06/03/2024
In today’s data-driven world, businesses increasingly rely on artificial intelligence (AI) to make decisions, automate tasks, and improve customer experiences. However, the quality of the data an AI system uses determines its efficacy. “Garbage in, garbage out,” as the old saying goes.
What is AI Data Quality?
Data quality refers to the accuracy, completeness, and consistency of data. Good data quality is essential for:
- training high-quality machine learning models
- building high-fidelity simulations for optimization
- achieving high accuracy in computer vision and natural language processing
Understanding Good Data vs. Bad Data
What makes data ‘good’? Good data is:
Accurate
That is, good data does not contain substantial errors. AI techniques require an accurate representation of the problems they solve to produce high-quality solutions. If we start with inaccurate data, we are effectively putting a ceiling on how well the AI can do before we begin.
Consistent
Consistency comes in many forms: cadence, shape, specificity, quality, etc. There are techniques for dealing with all sorts of problematic data, but it’s easiest to handle one consistent type of data when solving a problem. When the shape of the data is known and doesn’t fluctuate over time, we can invest more in both data preparation and AI engineering. As long as the data are relatively consistent, we can amortize those investments over the lifetime of a project.
Timely
Timeliness is in the eye of the beholder. The decision that AI makes from the data must still be relevant by the time the data are collected and the computations are complete. In manufacturing and self-driving scenarios, timely may mean a fraction of a second. For municipal planning, it can mean days or weeks.
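One way to make this concrete is to give each use case a freshness budget and reject records that exceed it. The sketch below is illustrative only: the `MAX_AGE` table, its keys, and its values are hypothetical and would need to be set for a real application.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical freshness budgets; real values depend on the application.
MAX_AGE = {
    "manufacturing": timedelta(milliseconds=100),
    "municipal_planning": timedelta(days=7),
}


def is_timely(recorded_at: datetime, use_case: str,
              now: Optional[datetime] = None) -> bool:
    """Return True if the record is still fresh enough for the use case."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - recorded_at <= MAX_AGE[use_case]


# The same two-second-old reading is fine for one use case and stale for another.
now = datetime(2024, 6, 3, 12, 0, 0, tzinfo=timezone.utc)
reading_time = now - timedelta(seconds=2)
fresh_for_planning = is_timely(reading_time, "municipal_planning", now)
fresh_for_factory = is_timely(reading_time, "manufacturing", now)
```

Here the reading passes the planning budget but fails the manufacturing one, which is the sense in which timeliness depends on who is asking.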
Complete & Relevant
AI systems operate on the age-old “garbage in, garbage out” principle. For any intelligence, artificial or otherwise, to make a good decision, it must be informed. Without a complete view of the situation, we cannot hope to build a system that can make good decisions. Too much data can also be a problem. Any data we consider has to be processed and stored. If the data aren’t useful, that’s overhead we don’t have to pay. Extra information may also confuse some AI techniques, causing them to over-focus on irrelevant data points.
Unbiased
Bias can occur in the input to an AI system, as well as in the AI model itself. For input, we need to know that whatever data we use to build and evaluate a system is representative of the data that it will see and use in production. If not, we may field a system we believe to be robust, but that fails to perform in practice. From a modeling perspective, we prefer not to build AI systems that perpetuate human prejudices. Datasets produced by recording human actions can perpetuate bad human behaviors in automated systems if we do not take care to remove human biases up front.
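A rough check on representativeness is to compare how often each category appears in the training data versus a sample of production data. The sketch below uses total variation distance as one possible measure; the function names and the example data are assumptions for illustration, not a prescribed method.

```python
from collections import Counter


def category_shares(values):
    """Map each category to its share of the values (shares sum to 1)."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def total_variation_distance(train_values, production_values):
    """Return a number in [0, 1]; 0 means the category mixes are identical."""
    p = category_shares(train_values)
    q = category_shares(production_values)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical example: training data is 80% "a", production is 50% "a".
train = ["a"] * 8 + ["b"] * 2
production = ["a"] * 5 + ["b"] * 5
drift = total_variation_distance(train, production)
```

A value near zero suggests the training mix resembles production for that field; what counts as too large is a judgment call for each project.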
Bad Data
Bad data isn’t necessarily bad; it’s just recorded that way. In most cases, we needn’t discard it. Bad data can be rehabilitated depending on why it’s bad:
- Untimely data is still useful for:
  - training historic models
  - batch training
  - validating models
- Redundant data can be identified and discarded
- Human intervention can rehabilitate:
  - incomplete data, with supplemental research and annotation
  - inconsistent data, with manual curation
- Biased data is still useful in some learning regimes:
  - bias sampling allows some learning methods to avoid model bias with explicit training bias
  - biased data can be underrepresented in test and training sets when bias sampling isn’t possible
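Two of these repairs are easy to sketch in code: discarding redundant records, and reducing the influence of an over-represented group. The example below is a minimal illustration that assumes records are flat dictionaries with hashable values. It uses inverse-frequency weights as one possible way to underrepresent a dominant group; the field names are hypothetical.

```python
from collections import Counter


def drop_duplicates(records):
    """Remove exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))  # order-independent fingerprint
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def balancing_weights(records, group_field):
    """Weight each record so that every group carries equal total weight."""
    counts = Counter(record[group_field] for record in records)
    n_groups = len(counts)
    total = len(records)
    return [total / (n_groups * counts[record[group_field]])
            for record in records]


# Hypothetical records: one exact duplicate, and "north" is over-represented.
records = [
    {"id": 1, "region": "north"},
    {"id": 1, "region": "north"},
    {"id": 2, "region": "north"},
    {"id": 3, "region": "south"},
]
unique_records = drop_duplicates(records)
weights = balancing_weights(unique_records, "region")
```

After deduplication, the two "north" records each receive a smaller weight than the single "south" record, so neither region dominates training.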
How Bad Data Can Confuse AI
Bad data can have a negative impact on AI models in a number of ways. For example, it can:
- Lead to inaccurate predictions
- Generate unreliable results
- Make it difficult to identify patterns and trends
- Increase the risk of bias
- Waste time and resources
Why Keeping Data Clean Is a Challenge
Data hygiene doesn’t come for free, and it doesn’t come naturally. Thus, many organizations struggle with it. Data comes from a variety of sources, including:
- Customer transactions (e.g. point-of-sale systems, online shopping)
- Social media interactions
- Sensors
- Third-party brokers
Each source has its own set of issues. Sensor data contains noise. Point-of-sale systems may misread barcodes. Data from sensors and third-party brokers may be delayed depending on how it’s collected. The more systems producing data, the more complete your view of the world. However, the more systems you have, the more complications you’ll encounter.
How Good Data and Good AI Help Businesses
Good data can help businesses in many ways, including improving:
- decision making in many arenas
  - product pricing
  - marketing campaigns
  - customer service
  - resource allocation
- customer outcomes
  - personalized customer experiences
  - timely support and interventions
- margins
  - reduced labor costs by automating menial tasks
  - better quality work / higher throughput with AI assistants
  - better use of resources in manufacturing contexts
In conclusion, clean data is essential for the success of AI models. Businesses that invest in data quality today will be better positioned to take advantage of AI tomorrow.
Here are some additional tips for businesses to improve their data quality:
- Establish clear data quality standards and procedures.
- Implement data validation and cleansing processes.
- Monitor data quality on an ongoing basis.
- Invest in data quality tools and solutions.
By following these tips, businesses can ensure that they have the clean data they need to make the most of AI.
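As a starting point for the validation and monitoring tips above, a small report can count missing values and out-of-range values per field. This is a sketch under stated assumptions: records are dictionaries, and the `required_fields` and `valid_ranges` rules shown are hypothetical examples rather than recommended standards.

```python
def quality_report(records, required_fields, valid_ranges):
    """Count missing and out-of-range values per field across all records."""
    missing = {field: 0 for field in required_fields}
    out_of_range = {field: 0 for field in valid_ranges}
    for record in records:
        for field in required_fields:
            # Treat absent keys, None, and empty strings as missing.
            if record.get(field) in (None, ""):
                missing[field] += 1
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            # Only numeric values are range-checked; missing ones are counted above.
            if isinstance(value, (int, float)) and not (low <= value <= high):
                out_of_range[field] += 1
    return {"rows": len(records), "missing": missing,
            "out_of_range": out_of_range}


# Hypothetical sales records with one blank SKU, one negative price,
# and one missing price.
records = [
    {"sku": "A1", "price": 9.99},
    {"sku": "", "price": -5.0},
    {"sku": "B2", "price": None},
]
report = quality_report(
    records,
    required_fields=["sku", "price"],
    valid_ranges={"price": (0.0, 10000.0)},
)
```

Running a report like this on every new batch and tracking the counts over time is one simple form of ongoing monitoring.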