17. Data Analytics - Process Data from Dirty to Clean - Week 3
Cleaning data with SQL
Definition:
SQL // structured query language
Relational database // a database that contains a series of tables that can be connected to form relationships
Text strings // group of characters within a cell, commonly composed of letters, numbers, or both
Float // a number that contains a decimal
Typecasting // converting data from one type to another
SQL Queries:
- INSERT INTO // inserts data into a database
INSERT INTO databaseurl
(Column,column,column)
VALUES
('data1',50,20)
- UPDATE // update a data in a database
UPDATE databaseurl
SET columnname = 'data'
WHERE rowdatacondition = 2555 // this tells database only change data with this condition
- CREATE TABLE IF NOT EXISTS // create a table
- DROP TABLE IF EXISTS // delete a table
- ORDER BY // instead of WHERE, this let us specify what sort to use for the pulled data
DESC // is descending
SQL Logics:
- date
- BETWEEN
- AND
- AS
Cleaning Data with SQL:
- Removing duplicates // include DISTINCT in SELECT statement
SELECT
DISTINCT test_column
FROM
test_dataset
- Finding length of strings // using LENGTH(value)
SELECT
LENGTH(column) AS length_of_value
FROM
test_dataset
// this prints out all the lengths and create a new column name length_of_value
SELECT
column
FROM
test_dataset
WHERE
LENGTH(column) = 3
// this prints out all the data in the column with lengths of 3
- Finding substrings // using SUBSTR(column,starting index,ending index)
- Trimming in SQL // remove any whitespaces using TRIM(column name)
- Max and Min // MIN(column) and MAX(column)
- Counting rows // COUNT(*) to count all and use WHERE to specify what to count
- Formatting data // CAST(column AS DATATYPE) used to convert anything from one data
type to another. Type Example: FLOAT64
- Concatenate data // CONCAT(column1,column2,etc) adds strings together to create new text
strings that can be used as unique keys
- Handle non-nulls(missing values) // COALESCE(columnvaluewant, substitute) returns non-null
values in a list. Gives us something else as a backup if first choice isn't available.
When to use SQL:
- Depends where the data lives. If in database, then SQL. If in spreadsheet, then spreadsheet.
Features of Spreadsheets | Features of SQL Databases |
---|---|
Smaller data sets | Larger datasets |
Enter data manually | Access tables across a database |
Create graphs and visualizations in the same program | Prepare data for further analysis in another software |
Built-in spell check and other useful functions | Fast and powerful functionality |
Best when working solo on a project. Store locally | Great for collaborative work and tracking queries run by all users Store across a database |
Additional Resources:
https://learnsql.com/blog/what-sql-dialect-to-learn/
Comments
Post a Comment