18. Data Analytics - Process Data from Dirty to Clean - Week 4

Verifying data integrity and reporting results of the data cleaning process is important.

Verification is a critical part of any analysis project, like a stamp of approval.

Reporting is an effective way to show transparency about your data cleaning efforts and to show you're accountable for the stakeholders.

Documenting all changes is important.


Definition:

Verification // process to confirm that a data-cleaning effort was well-executed, and the resulting data is accurate and reliable

Changelog // a file containing a chronologically ordered list of modifications made to a project. Usually organized by version and show data, features added or removed or updated.

TRIM (SPREADSHEET) // a function that removes leading, trailing, and repeated spaces in data

Remove duplicates (SPREADSHEET) // a tool that automatically searches and eliminates duplicate entries from a spreadsheet

Pivot table (SPREADSHEET) // a data summarization tool that is used in data processing

Find and replace (SPREADSHEET) // tool that looks for specified search term in spreadsheet and allows you to replace it with something else.

COUNTA (SPREADSHEET) // function that counts the total number of values within a specified range

Documentation // process of tracking changes, additions, deletions, and errors involved in your data-cleaning efforts

CASE statement (SQL) // the statement goes through one or more conditions and returns a value as soon as a condition is met

        SELECT

                column1,

                CASE

                        WHEN column2 = 'test' THEN 'test2'

                        ELSE column2

                        END AS cleaned_name

        FROM

                datatable


Advanced functions for speedy data cleaning:


Common data errors:

        - Human error in data entry

        - Flawed processes

        - System issues


Getting feedback from our verification and data reporting can:

        - Help transform data collection processes and business development.

        - Help prevent common errors from being repeated

        - Feedback can sent us back to the drawing board to do the process in another perspective.


Version control system for SQL queries:

        1. Company contain important versions of queries in their version control system.

        2. Syncing: when analyst use the up-to-date version to make changes.

        3. Analyst make change.

        4. Code review: when asking senior analyst to review change

        5. Code commit: when reviewer approves the change and change is submitted to update query.

                    - Make sure to record all changes and why in comments area.

        6. Submitted: when change is submitted and now becomes updated version for others to sync to.

        7. Undo: if there is a problem later, the company can undo the change and revert to previous query

        8. Revert: query is reverted back to previous version.


Changelogs info:

        - Data,file,formula,query, anything that changed.

        - Description, date, person who edited, person who approved edit, version number, reason for edit

        *You can manually create a changelog and list detailed steps and reasons in a doc file.


Recording how data evolve through changes(cleaning) helps with:

        - Recalling data-cleaning errors (know how to solve errors in future)

        - Inform other users of changes (let people around know how to fix those errors)

                - If not fixable, inform others, else, document how it was fixed.

        - Determine quality of data


Verification process:

        1. Go back to original uncleaned dataset and compare it to cleaned dataset. Take note 

            of common dirty data types in the original uncleaned dataset and see if it is still present in the

            cleaned data.

        2. Taking in the big-picture view of your project. To check that you were focusing on the

            business problem needed to solve. Make sure original problem is addressed. Do the numbers

            make sense?

                    Seeing big picture when verifying data-cleaning requires:

                            - Consider the business problem

                            - Consider the goal of the project

                            - Consider the data. Is it good enough to solve the problem and meeting project

                                objective?


Data-cleaning verification checklist:

        Correct the most common problems:

        - sources of errors

        - null data

        - misspelled words

        - extra spaces and characters

        - duplicates

        - mismatched data types

        - messy (inconsistent) strings

        - messy (inconsistent) date formats

        - misleading variable labels (columns)

        - truncated data // any truncated or missing data need correction?

        - business logic // does data make sense?

        Review goal of the project:   

        - confirm business problem

        - confirm goal of project

        - verify that the data can solve the problem and aligns with the goal

Comments

Popular posts from this blog

2. FreeCodeCamp - Dynamic Programming - Learn to Solve Algorithmic Problems & Coding Challenges

20. Data Analytics - Analyze Data to Answer Questions - Week 1

3. Algorithms - Selection Sort