16. Data Analytics - Process Data from Dirty to Clean - Week 2

Number one factor of poor-quality data is human error.


Definitions:

Dirty data // data that is incomplete, incorrect, or irrelevant to the problem you're trying to solve.

Clean data // data this is complete, correct, and relevant to the problem you're trying to solve

Data engineers // transform data into a useful format for analysis and give it a reliable infrastructure

        They develop, maintain, and test databases, data processors, and related systems.

Data warehousing specialists // develop processes and procedures to effectively store and organize data

        They make sure data is available, secure, and backed up to prevent loss.

Null // an indication that a value does not exist in a dataset

Field // a single piece of information from a row or column of a spreadsheet

Field length // a tool for determining how many characters can be keyed into a field

Data validation // a tool for checking the accuracy and quality of data before adding or importing it

Merger // an agreement that unites two organizations into a single new one

Data merging // process of combining two or more datasets into a single dataset

Compatibility // how well two or more datasets are able to work together

Conditional formatting // a spreadsheet tool that changes how cells appear when values meet specific conditions

Remove duplicates // a tool that automatically searches for and eliminates duplicate entries from a spreadsheet

Text string // a group of characters within a cell, most often composed of letters, numbers, or both

Split // a tool that divides text around a specified character and puts each fragment into a new, separate cell

Delimiter // specified text separator, a character that separates a group of characters

Concatenate // a function that joins multiple text strings into a single string

Function // a set of instructions that performs a specific calculation using the data in a spreadsheet

Syntax // predetermined structure that includes all required information and its proper placement

Sorting // arranging data into a meaningful order to make it easier to understand, analyze, and visualize

Filtering // showing only the data that meets a specific criteria while hiding the rest

Pivot table // data summarization tool that is used in data processing

Data mapping // process of matching fields from one data source to another

Schema // a way of describing how something is organized

Primary key // references a column in which each value is unique

Foreign key // a field within a table that is a primary key in another table


Some organization have rules for data cleaning, like formatting guidelines to follow. EX: lbs instead of kg

Backup data before cleaning.

Plotting/graphing data can help find outliers and fix mistakes.


Data Mapping Process:

        - Determine the content of each section to make sure the data ends up in the right place.

        - Transforming the data into a consistent format. Concatenation is used.

        - Transfer the data to its destination. 

        - Check data to make sure everything is correctly placed and do spot checks (for nulls)

        - Use it for analysis if everything looks good.


Data Cleaning Tools:

        - Data validation

        - Conditional formatting

        - COUNTIF

        - Sorting

        - Filtering


Workflow Automation:

        - process of automating parts of your work which could potentially save time.

        

        What can be automated?

                - Modeling the data (YES)

                - Preparing and cleaning data (PARTIALLY)

                - Data exploration (PARTIALLY)

                - Presenting your findings (NOPE)

                - Communicating with your team and stakeholders (NOPE)

                

Common functions:

        - COUNTIF // returns the number of cells that match a specified value

        - LEN // tells length of a text string by counting number of characters it contain

        - LEFT // function that gives you a set number of characters from left side of a text string

        - RIGHT // function that gives you a set number of characters from right side of a text string

        - MID // gives you a segment from the middle of a text string

        - CONCATENATE // joins together two or more text strings

        - TRIM // removes leading, trailing, and repeating spaces in data

        - VLOOKUP // vertical lookup. searches for certain value in a column to return a corresponding

                                 piece of information.


Basic Data Cleaning Process:

        - Remove blank cells

        - Fix types

        - Fix missing data

        - Transpose if needed

        - Clear or fix formatting

        - Clear whitespaces


Questions to ask before merging data from different sources:

        - Do I have all the data I need?

        - Does the data I need exist within these datasets?

        - Does the data need to be cleaned, or are they ready for me to use?

        - Are the datasets cleaned to the same standard?


Types of dirty data:

        - Duplicate data

        - Outdated data

        - Incomplete data

        - Incorrect/inaccurate data

        - Inconsistent data (wrong formatting)


Common data-cleaning pitfalls to avoid when cleaning data:

        - Not checking for spelling errors

        - Forgetting to document errors

        - Not checking for misfielded values (values entered into the wrong field)

        - Overlooking missing values

        - Only looking at a subset of the data (not looking at relevant data to get whole picture)

        - Losing track of business objectives

        - Not fixing the source of the error

        - Not analyzing the system prior to data cleaning

        - Not backing up your data prior to data cleaning

        - Not accounting for data cleaning in your deadlines/process


Additional Resources:

https://support.microsoft.com/en-us/office/top-ten-ways-to-clean-your-data-2844b620-677c-47a7-ac3e-c2e157d1db19

https://support.google.com/a/users/answer/9604139?hl=en#zippy=

https://towardsdatascience.com/automating-scientific-data-analysis-part-1-c9979cd0817e

https://technologyadvice.com/blog/information-technology/top-10-workflow-automation-software/


Comments

Popular posts from this blog

2. FreeCodeCamp - Dynamic Programming - Learn to Solve Algorithmic Problems & Coding Challenges

20. Data Analytics - Analyze Data to Answer Questions - Week 1

3. Algorithms - Selection Sort