According to the Western Digital Blog article “3 Key Data Challenges of Machine Learning,” there are three critical data challenges in machine learning: quality, sparsity, and integrity.
Quality concerns data from external sources, where there is “no quality control or guarantee on how the original data is captured” and “you need to understand the quality of the data and how to prepare it.” Data from experiments and examples must be checked for errors and cleaned before proper analysis is conducted.
Sparsity involves incomplete metadata, especially when data comes from diverse sources that share no standard definition of metadata. When data sources are combined, fields often do not correspond. “How do you correlate and filter data” when the same type of data arrives with different metadata fields populated? The answer is “through the metadata disclosing when it was captured. When scientists are doing historical analysis they need metadata in order to be able to adjust their models accordingly.”
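One way to picture correlating and filtering on capture time is the minimal Python sketch below. It is an illustration under stated assumptions, not something taken from the Western Digital article: the source names (sensor_a, sensor_b), the field names, and the helper functions are all hypothetical. Each source's metadata fields are mapped onto one shared schema, fields a source never populated are left empty, and records are then filtered by when they were captured.

```python
from datetime import datetime

# Hypothetical mapping from each source's own field names to a shared schema.
FIELD_MAP = {
    "sensor_a": {"ts": "captured_at", "temp_c": "temperature_c"},
    "sensor_b": {
        "capture_time": "captured_at",
        "temperature": "temperature_c",
        "site": "location",
    },
}
COMMON_FIELDS = ("captured_at", "temperature_c", "location")


def normalize(record, source):
    """Rename a record's fields to the shared schema; missing fields stay None."""
    mapping = FIELD_MAP[source]
    out = {field: None for field in COMMON_FIELDS}
    for key, value in record.items():
        if key in mapping:
            out[mapping[key]] = value
    out["source"] = source
    if out["captured_at"] is not None:
        # Assumes ISO 8601 timestamps without a timezone suffix.
        out["captured_at"] = datetime.fromisoformat(out["captured_at"])
    return out


def captured_between(records, start, end):
    """Keep records whose capture time falls in the half-open range [start, end)."""
    return [
        r for r in records
        if r["captured_at"] is not None and start <= r["captured_at"] < end
    ]


records = [
    normalize({"ts": "2018-06-01T00:00:00", "temp_c": 20.5}, "sensor_a"),
    normalize(
        {"capture_time": "2019-06-01T00:00:00", "temperature": 21.0, "site": "lab"},
        "sensor_b",
    ),
]
recent = captured_between(records, datetime(2019, 1, 1), datetime(2020, 1, 1))
```

Leaving unpopulated fields as None, rather than dropping them, keeps the sparsity visible, so an analyst can see which sources lack which metadata before adjusting a model.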
Integrity is the assurance of data accuracy and consistency:
“The chain of data custody is critical to prove that data is not compromised as it moves through pipelines and locations.”
When the capture and ingestion of data are controlled, data veracity is not an issue. Problems arise when one cannot establish that the data was originally recorded as intended, or that the data obtained is the same as when it was first recorded. Data integrity therefore depends on a combination of cybersecurity technologies and policies, such as using HTTPS and encryption. Policy-driven access control reduces the room for human error.
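Checking that data obtained later matches what was first recorded is commonly done with a cryptographic digest. The Python sketch below uses only the standard library modules hashlib and hmac. The function names and sample data are hypothetical, and it shows one possible technique, not the method described in the article.

```python
import hashlib
import hmac


def sha256_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare the data's digest with the one stored at capture time."""
    return hmac.compare_digest(sha256_digest(data), recorded_digest)


def sign(data: bytes, key: bytes) -> str:
    """Keyed digest (HMAC-SHA256): only holders of the key can produce it."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def verify(data: bytes, key: bytes, tag: str) -> bool:
    """Check a keyed digest produced by sign()."""
    return hmac.compare_digest(sign(data, key), tag)


original = b"temperature,21.0\n"
digest_at_capture = sha256_digest(original)  # stored alongside the data
tag_at_capture = sign(original, b"example-key")  # key is a placeholder

same = is_unchanged(original, digest_at_capture)
altered = is_unchanged(b"temperature,99.0\n", digest_at_capture)
```

A plain digest only reveals changes if the stored digest itself can be trusted. The keyed variant covers the case where someone who can alter the data could also alter the digest, provided the key is kept out of their reach.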
In summary, organizations and businesses should begin improving their machine learning environments by defining a data collection policy, standardizing a metadata format, and applying standard security techniques.