16. Data Analytics - Process Data from Dirty to Clean - Week 2
The number-one cause of poor-quality data is human error.
Definitions:
Dirty data // data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve.
Clean data // data that is complete, correct, and relevant to the problem you're trying to solve
Data engineers // transform data into a useful format for analysis and give it a reliable infrastructure
They develop, maintain, and test databases, data processors, and related systems.
Data warehousing specialists // develop processes and procedures to effectively store and organize data
They make sure data is available, secure, and backed up to prevent loss.
Null // an indication that a value does not exist in a dataset
Field // a single piece of information from a row or column of a spreadsheet
Field length // a tool for determining how many characters can be keyed into a field
Data validation // a tool for checking the accuracy and quality of data before adding or importing it
Merger // an agreement that unites two organizations into a single new one
Data merging // process of combining two or more datasets into a single dataset
Compatibility // how well two or more datasets are able to work together
Conditional formatting // a spreadsheet tool that changes how cells appear when values meet specific conditions
Remove duplicates // a tool that automatically searches for and eliminates duplicate entries from a spreadsheet
Text string // a group of characters within a cell, most often composed of letters, numbers, or both
Split // a tool that divides text around a specified character and puts each fragment into a new, separate cell
Delimiter // specified text separator, a character that separates a group of characters
Concatenate // a function that joins multiple text strings into a single string
Function // a set of instructions that performs a specific calculation using the data in a spreadsheet
Syntax // predetermined structure that includes all required information and its proper placement
Sorting // arranging data into a meaningful order to make it easier to understand, analyze, and visualize
Filtering // showing only the data that meets specific criteria while hiding the rest
Pivot table // data summarization tool that is used in data processing
Data mapping // process of matching fields from one data source to another
Schema // a way of describing how something is organized
Primary key // references a column in which each value is unique
Foreign key // a field within a table that is a primary key in another table
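The primary-key/foreign-key relationship above can be sketched with plain Python dictionaries. This is only an analogy (the table and field names here are hypothetical, not from the course):

```python
# Hypothetical tables: each id in `customers` is a primary key (unique),
# and `customer_id` in `orders` is a foreign key referencing it.
customers = {
    101: {"name": "Ada"},
    102: {"name": "Grace"},
}
orders = [
    {"order_id": 1, "customer_id": 101, "total": 20.0},
    {"order_id": 2, "customer_id": 102, "total": 35.5},
]

def orders_with_names(orders, customers):
    """Resolve each order's foreign key against the customers table."""
    return [
        {**o, "customer_name": customers[o["customer_id"]]["name"]}
        for o in orders
    ]
```

A database enforces this relationship for you; the sketch just shows why a foreign key must match an existing primary key, or the lookup fails.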
Some organizations have rules for data cleaning, like formatting guidelines to follow. EX: lbs instead of kg
Backup data before cleaning.
Plotting/graphing data can help find outliers and fix mistakes.
Data Mapping Process:
- Determine the content of each field to make sure the data ends up in the right place.
- Transform the data into a consistent format; concatenation is often used here.
- Transfer the data to its destination.
- Check the data to make sure everything is correctly placed, and do spot checks (e.g., for nulls).
- Use it for analysis if everything looks good.
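The mapping steps above can be sketched in Python. The source field names (`first`, `last`, `zip`) and destination schema here are hypothetical, just to show the rename-transform-check pattern:

```python
# Hypothetical mapping from a source schema to a destination schema.
FIELD_MAP = {"first": "first_name", "last": "last_name", "zip": "postal_code"}

def map_record(source_record):
    """Rename source fields to the destination schema, transform, and spot-check."""
    # Determine where each field's content ends up.
    dest = {FIELD_MAP[k]: v for k, v in source_record.items() if k in FIELD_MAP}
    # Transform: concatenate first and last name into one consistent field.
    dest["full_name"] = f'{dest.get("first_name", "")} {dest.get("last_name", "")}'.strip()
    # Spot-check for nulls before using the data for analysis.
    dest["has_nulls"] = any(v is None for v in dest.values())
    return dest
```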
Data Cleaning Tools:
- Data validation
- Conditional formatting
- COUNTIF
- Sorting
- Filtering
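Rough Python analogues of three of the tools above (COUNTIF, sorting, filtering), using a made-up list of readings for intuition only:

```python
# Hypothetical readings; 1068 looks like a data-entry error.
records = [72, 68, 68, 95, 1068]

# COUNTIF: count cells that match a condition.
count_over_100 = sum(1 for r in records if r > 100)

# Sorting: arrange values in a meaningful order (outliers land at the ends).
sorted_records = sorted(records)

# Filtering: show only values that meet the criteria, hide the rest.
plausible = [r for r in records if r <= 100]
```

Sorting is a quick way to push suspect outliers to the top or bottom of a column, which is why it pairs well with plotting for spotting mistakes.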
Workflow Automation:
- the process of automating parts of your work, which can save time.
What can be automated?
- Modeling the data (YES)
- Preparing and cleaning data (PARTIALLY)
- Data exploration (PARTIALLY)
- Presenting your findings (NOPE)
- Communicating with your team and stakeholders (NOPE)
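As a sketch of what "partially" automating data cleaning can look like, hypothetical cleaning steps can be chained into one reusable pipeline so repetitive fixes run without manual effort:

```python
# Hypothetical cleaning steps, each a small function on a list of strings.
def strip_whitespace(values):
    return [v.strip() for v in values]

def drop_blanks(values):
    return [v for v in values if v != ""]

def clean(values, steps=(strip_whitespace, drop_blanks)):
    """Run each cleaning step in order and return the cleaned values."""
    for step in steps:
        values = step(values)
    return values
```

Judgment calls (which steps to include, how to handle edge cases) still need a human, which is why cleaning is only partially automatable.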
Common functions:
- COUNTIF // returns the number of cells that match a specified value
- LEN // returns the length of a text string by counting the number of characters it contains
- LEFT // function that gives you a set number of characters from left side of a text string
- RIGHT // function that gives you a set number of characters from right side of a text string
- MID // gives you a segment from the middle of a text string
- CONCATENATE // joins together two or more text strings
- TRIM // removes leading, trailing, and repeating spaces in data
- VLOOKUP // vertical lookup; searches for a certain value in a column to return a corresponding piece of information
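For intuition, the spreadsheet functions above map roughly onto Python string operations and a dictionary lookup (the sample text and lookup table are made up):

```python
text = "  Data Analytics  "

trimmed = text.strip()            # TRIM: remove leading/trailing spaces
length = len(trimmed)             # LEN: count characters
left4 = trimmed[:4]               # LEFT(text, 4)
right9 = trimmed[-9:]             # RIGHT(text, 9)
mid3 = trimmed[5:8]               # MID: 3 characters starting at position 6
joined = "Data" + " " + "Clean"   # CONCATENATE

# VLOOKUP analogue: find a key, return the corresponding value.
table = {"ID-1": "Pending", "ID-2": "Shipped"}
status = table.get("ID-2")
```

Note one gotcha when translating: spreadsheet functions are 1-indexed (MID starts counting at 1), while Python slices are 0-indexed.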
Basic Data Cleaning Process:
- Remove blank cells
- Fix types
- Fix missing data
- Transpose if needed
- Clear or fix formatting
- Clear whitespaces
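The steps above, sketched on a small list of hypothetical rows (the column names and values are invented for illustration):

```python
# Raw rows with a blank entry, stray whitespace, and ages stored as text.
raw_rows = [
    {"name": " Ada ", "age": "36"},
    {"name": "", "age": None},        # blank row to remove
    {"name": "Grace", "age": "45 "},
]

def clean_rows(rows):
    """Remove blanks, clear whitespace, and fix types."""
    cleaned = []
    for row in rows:
        name = (row["name"] or "").strip()    # clear whitespace
        if not name:                          # remove blank cells/rows
            continue
        age = int(str(row["age"]).strip())    # fix types (text -> number)
        cleaned.append({"name": name, "age": age})
    return cleaned
```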
Questions to ask before merging data from different sources:
- Do I have all the data I need?
- Does the data I need exist within these datasets?
- Does the data need to be cleaned, or is it ready for me to use?
- Are the datasets cleaned to the same standard?
Types of dirty data:
- Duplicate data
- Outdated data
- Incomplete data
- Incorrect/inaccurate data
- Inconsistent data (wrong formatting)
Common data-cleaning pitfalls to avoid when cleaning data:
- Not checking for spelling errors
- Forgetting to document errors
- Not checking for misfielded values (values entered into the wrong field)
- Overlooking missing values
- Only looking at a subset of the data (not looking at all the relevant data to get the whole picture)
- Losing track of business objectives
- Not fixing the source of the error
- Not analyzing the system prior to data cleaning
- Not backing up your data prior to data cleaning
- Not accounting for data cleaning in your deadlines/process
Additional Resources:
https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19
https://support.google.com/a/users/answer/9604139?hl=en#zippy=
https://towardsdatascience.com/automating-scientific-data-analysis-part-1-c9979cd0817e
https://technologyadvice.com/blog/information-technology/top-10-workflow-automation-software/