17. Data Analytics - Process Data from Dirty to Clean - Week 3

Cleaning data with SQL

Definition:

SQL // structured query language

Relational database // a database that contains a series of tables that can be connected to form relationships

Text strings // group of characters within a cell, commonly composed of letters, numbers, or both

Float // a number that contains a decimal

Typecasting // converting data from one type to another


SQL Queries:

        - INSERT INTO // inserts data into a database

                INSERT INTO databaseurl

                    (Column,column,column)

                VALUES

                    ('data1',50,20)

        - UPDATE // update a data in a database

                UPDATE databaseurl

                SET columnname = 'data'

                WHERE rowdatacondition = 2555 // this tells database only change data with this condition

        - CREATE TABLE IF NOT EXISTS // create a table

        - DROP TABLE IF EXISTS // delete a table

        - ORDER BY // instead of WHERE, this let us specify what sort to use for the pulled data

                DESC // is descending


SQL Logics:

        - date

        - BETWEEN

        - AND

        - AS


Cleaning Data with SQL:

        - Removing duplicates // include DISTINCT in SELECT statement

                SELECT

                        DISTINCT test_column

                FROM

                        test_dataset

        - Finding length of strings // using LENGTH(value)

                SELECT

                        LENGTH(column) AS length_of_value

                FROM

                        test_dataset

                // this prints out all the lengths and create a new column name length_of_value

                SELECT

                        column

                FROM

                        test_dataset

                WHERE

                        LENGTH(column) = 3

                // this prints out all the data in the column with lengths of 3

        - Finding substrings // using SUBSTR(column,starting index,ending index)

        - Trimming in SQL // remove any whitespaces using TRIM(column name)

        - Max and Min // MIN(column) and MAX(column)

        - Counting rows // COUNT(*) to count all and use WHERE to specify what to count

        - Formatting data // CAST(column AS DATATYPE) used to convert anything from one data 

                type to another. Type Example: FLOAT64

        - Concatenate data // CONCAT(column1,column2,etc) adds strings together to create new text

                 strings that can be used as unique keys

        - Handle non-nulls(missing values) // COALESCE(columnvaluewant, substitute) returns non-null

                values in a list. Gives us something else as a backup if first choice isn't available.


When to use SQL:

        - Depends where the data lives. If in database, then SQL. If in spreadsheet, then spreadsheet.

Features of Spreadsheets

Features of SQL Databases 

Smaller data sets

Larger datasets

Enter data manually

Access tables across a database

Create graphs and visualizations in the same program

Prepare data for further analysis in another software

Built-in spell check and other useful functions

Fast and powerful functionality

Best when working solo on a project.


Store locally

Great for collaborative work and tracking queries run by all users


Store across a database


Additional Resources:

https://learnsql.com/blog/what-sql-dialect-to-learn/

https://www.sqltutorial.org/what-is-sql/

https://www.datacamp.com/community/blog/sql-differences

Comments

Popular posts from this blog

2. FreeCodeCamp - Dynamic Programming - Learn to Solve Algorithmic Problems & Coding Challenges

20. Data Analytics - Analyze Data to Answer Questions - Week 1

3. Algorithms - Selection Sort