Dirty data

Data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve

Clean data

Data that is complete, correct, and relevant to the problem you're trying to solve

Data engineers

Transform data into a useful format for analysis and give it a reliable infrastructure

Types of dirty data

Duplicate data

Description: Any data record that shows up more than once
Possible causes: Manual data entry, batch data imports, or data migration
Potential harm to businesses: Skewed metrics or analyses, inflated or inaccurate counts or predictions, or confusion during data retrieval
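
A minimal sketch of spotting and removing duplicate records, using only the Python standard library. The `records` list, its field names, and the choice of `name` plus `email` as the identifying key are all made up for illustration; real data needs its own rule for what counts as "the same record".

```python
from collections import Counter

# Hypothetical customer records; rows 2 and 4 describe the same person.
records = [
    {"id": 1, "name": "Ana Ruiz", "email": "ana@example.com"},
    {"id": 2, "name": "Ben Cho", "email": "ben@example.com"},
    {"id": 3, "name": "Dee Park", "email": "dee@example.com"},
    {"id": 4, "name": "Ben Cho", "email": "ben@example.com"},
]


def find_duplicates(rows, key_fields):
    """Return the key tuples that appear more than once."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def drop_duplicates(rows, key_fields):
    """Keep the first row seen for each key, preserving input order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


duplicates = find_duplicates(records, ["name", "email"])
deduped = drop_duplicates(records, ["name", "email"])
print(duplicates)
print(len(records), "->", len(deduped))
```

Counting before and after deduplication is one simple way to see how much a duplicate inflates a metric such as "number of customers".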

Outdated data

Description: Any data that is old and should be replaced with newer and more accurate information
Possible causes: People changing roles or companies, or software and systems becoming obsolete
Potential harm to businesses: Inaccurate insights, decision-making, and analytics
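
One way to flag outdated records, sketched under assumptions: each record carries a `last_verified` date (a hypothetical field), and anything older than one year counts as stale. Both the field and the threshold are illustrative, not a standard.

```python
from datetime import date, timedelta

MAX_AGE = timedelta(days=365)  # assumed freshness threshold
AS_OF = date(2024, 1, 1)       # fixed reference date so the example is reproducible

# Hypothetical contact records with the date each was last confirmed.
contacts = [
    {"name": "Ana Ruiz", "last_verified": date(2023, 11, 2)},
    {"name": "Ben Cho", "last_verified": date(2021, 6, 15)},
]


def is_outdated(row, as_of=AS_OF, max_age=MAX_AGE):
    """True when the record was last verified longer ago than max_age."""
    return as_of - row["last_verified"] > max_age


stale = [c["name"] for c in contacts if is_outdated(c)]
print(stale)
```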

Incomplete data

Description: Any data that is missing important fields
Possible causes: Improper data collection or incorrect data entry
Potential harm to businesses: Decreased productivity, inaccurate insights, or inability to complete essential services
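
A small sketch of checking records for missing important fields. The `REQUIRED_FIELDS` list and the sample rows are assumptions for illustration; which fields count as "important" depends on the problem being solved.

```python
# Hypothetical list of fields every record is expected to have.
REQUIRED_FIELDS = ["name", "email", "phone"]

rows = [
    {"name": "Ana Ruiz", "email": "ana@example.com", "phone": "555-0100"},
    {"name": "Ben Cho", "email": "", "phone": "555-0101"},  # blank email
    {"name": "Dee Park", "email": "dee@example.com"},        # no phone key
]


def missing_fields(row, required):
    """List required fields that are absent, None, or blank."""
    missing = []
    for field in required:
        value = row.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing


report = {row["name"]: missing_fields(row, REQUIRED_FIELDS) for row in rows}
incomplete = {name: fields for name, fields in report.items() if fields}
print(incomplete)
```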

Incorrect/inaccurate data

Description: Any data that is complete but inaccurate
Possible causes: Human error inserted during data input, fake information, or mock data
Potential harm to businesses: Inaccurate insights or decision-making based on bad information resulting in revenue loss
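
Simple plausibility rules can catch some incorrect values, as in the sketch below. The age range, the rough email pattern, and the sample rows are all assumptions chosen for illustration. Checks like these only catch implausible values; a value that is plausible but wrong (for example, a mistyped but realistic age) passes through.

```python
import re

# Deliberately loose pattern: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def find_problems(row):
    """Return a list of rule names that this row violates."""
    problems = []
    if not 0 <= row["age"] <= 120:  # assumed plausible age range
        problems.append("age out of range")
    if not EMAIL_PATTERN.match(row["email"]):
        problems.append("email format")
    return problems


# Hypothetical records: complete, but two contain inaccurate values.
people = [
    {"name": "Ana Ruiz", "age": 34, "email": "ana@example.com"},
    {"name": "Ben Cho", "age": 340, "email": "ben@example.com"},
    {"name": "Test User", "age": 30, "email": "asdf"},
]

flagged = {}
for person in people:
    problems = find_problems(person)
    if problems:
        flagged[person["name"]] = problems
print(flagged)
```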

Inconsistent data

Description: Any data that uses different formats to represent the same thing
Possible causes: Data stored incorrectly or errors inserted during data transfer
Potential harm to businesses: Contradictory data points leading to confusion or inability to classify or segment customers
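
A common case of inconsistent data is the same date written in several formats. The sketch below converts each value to one format (ISO 8601, `YYYY-MM-DD`). The list of accepted formats is an assumption, and so is reading `03/05/2023` as month first; that ambiguity has to be settled from knowledge of the data source, not by the code.

```python
from datetime import datetime

# Assumed input formats, tried in order: ISO, US month/day/year, day.month.year.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d.%m.%Y"]


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical values that all mean March 5, 2023, plus one unparseable entry.
raw = ["2023-03-05", "03/05/2023", "05.03.2023", "March fifth"]
normalized = [normalize_date(value) for value in raw]
print(normalized)
```

Values that come back as `None` are worth reviewing by hand instead of guessing a format for them.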

Validity

Definition