30. Data Analytics - Data Analysis with R Programming - Week 3
Learning how to clean data in R.
Data frame // a collection of columns. Like a spreadsheet with column names, rows, and cells.
Tibbles (Tidyverse) // are like streamlined data frames
Tidy data (R) // way of standardizing the organization of data within R
Anscombe's quartet // four datasets that have nearly identical summary statistics but look very different when plotted
- install.packages('Tmisc')
- library(Tmisc)
- install.packages("datasauRus")
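A minimal sketch of why the quartet matters. It assumes the `anscombe` data set that ships with base R rather than the Tmisc version loaded above, so the column names (x1..x4, y1..y4) are those of the built-in copy.

```r
# Assumption: uses base R's built-in `anscombe` data set, not Tmisc.
data(anscombe)

x_cols <- c("x1", "x2", "x3", "x4")
y_cols <- c("y1", "y2", "y3", "y4")

# Summary statistics for each of the four sets
x_means <- sapply(anscombe[, x_cols], mean)
y_means <- sapply(anscombe[, y_cols], mean)
y_sds   <- sapply(anscombe[, y_cols], sd)

print(round(x_means, 2))
print(round(y_means, 2))
print(round(y_sds, 2))

# The numbers nearly match; plotting each pair is what shows the difference:
# plot(anscombe$x1, anscombe$y1)
# plot(anscombe$x2, anscombe$y2)
```

The takeaway: summary statistics alone can hide the shape of the data, so plot it as well.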
Analyzing Bias:
- install.packages("SimDesign")
- library(SimDesign)
- bias(firstdataset, seconddataset) // 1st must be the actual values, 2nd must be the predictions
- a result closer to 0 means less bias
- the function returns how much, on average, the actual outcome is greater than the predicted outcome
- cor(data1,data2) // shows correlation
- sd(x) // shows standard deviation
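A rough base-R sketch of the idea, assuming bias works as described in these notes (the average of actual minus predicted). The two vectors are made-up values for illustration, and `manual_bias` is a hypothetical name, not part of SimDesign.

```r
# Hypothetical values, for illustration only
actual    <- c(10, 12, 14, 16)
predicted <- c(9, 12, 13, 14)

# Average of (actual - predicted); positive means actual tends to be higher
manual_bias <- mean(actual - predicted)
print(manual_bias)

# Correlation between the two vectors and spread of the actual values
print(cor(actual, predicted))
print(sd(actual))
```

With SimDesign loaded, `bias(actual, predicted)` would be the call the notes refer to; the manual version above avoids the extra package.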
Manually Create a Data Frame:
- data.frame(vector/list name a, vector/list name b, vector/list name c)
- dataframetest <- data.frame(x = c(1:10)) // this creates a data frame with one column, x, holding 1 to 10 (c(1:10) on its own is only a vector)
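A short sketch of building a data frame from vectors. The vector and column names are made up for illustration; the only requirement is that every vector has the same length.

```r
# Hypothetical vectors; all must have the same length
id     <- 1:3
person <- c("Ann", "Bob", "Cy")
score  <- c(9.5, 7.0, 8.2)

# Each vector becomes one named column
df <- data.frame(id = id, person = person, score = score)
print(df)

# A plain vector is not a data frame until it is wrapped in data.frame()
print(is.data.frame(df))
print(is.data.frame(c(1:10)))
```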
Additional Arithmetic Operators
- %% // modulus. returns the remainder after a division
- %/% // integer division. returns the integer part of a division
- "<-" , "<<-", "=" // leftward assignment
- "->", "->>" // rightward assignment
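A quick check of these operators. In R the modulus operator is written `%%` and integer division is written `%/%`; the variable names are arbitrary.

```r
# Remainder and integer quotient of 7 divided by 3
remainder <- 7 %% 3
quotient  <- 7 %/% 3
print(remainder)
print(quotient)

# Leftward and rightward assignment reach the same result
x <- 5
10 -> y
z = x + y
print(z)
```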
Cleaning packages:
- install.packages("here") // makes referencing files easier
- install.packages("skimr") // makes summarizing data easier
- install.packages("janitor") // helps clean data
Data frames process:
- Columns should be named
- Data stored can be many different types, like numeric, factor, or character
- Each column should contain the same number of data items
Tibbles: // helps with printing
- Never change the data types of the inputs
- Never change the names of your variables
- Never create row names
- Make printing easier
- as_tibble(dataset)
* streamlined data frames that automatically show only the first 10 rows of a dataset and only as many columns as can fit on the screen.
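A small sketch of the printing difference, assuming the tibble package is installed. The conversion function in that package is `as_tibble()`; the data frame here is made up.

```r
library(tibble)

# Hypothetical data frame with 15 rows
df <- data.frame(id = 1:15, value = (1:15) * 2)

# Convert to a tibble; the data stays the same, only printing changes
tb <- as_tibble(df)

# Printing a tibble shows the first 10 rows plus a note about the rest
print(tb)
```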
Tidy Data Standards:
*Refers to the principles that make data structures meaningful and easy to understand.
- Variables are organized into columns
- Observations are organized into rows
- Each value must have its own cell
- head() // previews a dataset by showing first six rows
- str() // shows the structure of a data frame (column names, types, and sample values)
- colnames() // shows the column names of a data frame
- mutate(dataframe, newcolumnname = expression) // adds new columns to a data frame
- glimpse() // shows a compact summary of the data set: each column with its type and first values
- skim_without_charts() // need skimr package. summarizes the data set
- select(column) // only display a column
- select(-column) // display everything else except specified column
- rename(dataset, newcolumnname = oldcolumnname)
- rename_with(dataset, toupper) // renames all column names of the dataset to uppercase
- clean_names(dataset) // cleans names and removes special characters in names
- arrange(dataframe, column) // sorts ascending. put a minus sign in front of the column (or wrap it in desc()) to sort descending
- group_by()
- mean(datacolumn), max(datacolumn), min(datacolumn)
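A sketch tying several of these functions together, assuming dplyr is installed. The `scores` data frame and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical data with an awkward column name
scores <- data.frame(student = c("Ann", "Bob", "Cy"),
                     Test.Score = c(82, 95, 70))

head(scores)       # preview the first rows
colnames(scores)   # list the column names

# rename(): new name on the left, old name on the right
renamed <- rename(scores, score = Test.Score)

# rename_with(): apply a function to every column name
upper <- rename_with(renamed, toupper)

# arrange(): a minus sign sorts the column in descending order
sorted <- arrange(renamed, -score)

# select(): keep one column, or drop it with a minus sign
only_score <- select(renamed, score)
no_score   <- select(renamed, -score)

print(sorted)
print(mean(renamed$score))
```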
Organizing Functions:
- arrange()
- group_by(column) // groups rows that share the same value in the column, so summaries are calculated once per group
- can take multiple columns separated by commas; if so, rows are grouped by each combination of values
- drop_na() // excludes rows with missing values from the data set
- filter(condition) // only shows the rows that meet the condition
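A sketch of the organizing functions on a made-up `sales` data frame, assuming dplyr and tidyr are installed. `summarise()` is not listed in these notes; it is added here as an assumption, since it is the usual dplyr way to compute one value per group.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one missing amount
sales <- data.frame(region = c("N", "N", "S", "S"),
                    amount = c(10, NA, 30, 50))

# drop_na(): remove rows that contain missing values
complete <- drop_na(sales)

# group_by() + summarise(): one average per region
by_region <- summarise(group_by(complete, region), avg = mean(amount))

# filter(): keep only rows that meet the condition
big <- filter(complete, amount > 20)

print(by_region)
print(big)
```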
Transforming Functions:
- separate(dataframename, columnname, into = c("TESTA", "TESTB"), sep = " ")
- separates a column's value into 2 new columns while defining the separator with sep
- unite(dataframename, 'newcolumn', columna, columnb, sep=' ')
- combines two column values into one new column, and include an optional separator
- mutate(dataframe, newcolumn = columnA*1000)
- creates a new column using columnA's values
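A sketch of the three transforming functions, assuming tidyr and dplyr are installed. The `people` data frame and its column names are hypothetical.

```r
library(tidyr)
library(dplyr)

# Hypothetical data: full names and salaries in thousands
people <- data.frame(fullname = c("Ada Lovelace", "Alan Turing"),
                     salary_k = c(5, 6))

# separate(): split one column into two at the space
parts <- separate(people, fullname, into = c("first", "last"), sep = " ")

# unite(): join the two columns back into one
joined <- unite(parts, "fullname", first, last, sep = " ")

# mutate(): add a column calculated from an existing one
paid <- mutate(people, salary = salary_k * 1000)

print(parts)
print(joined)
print(paid)
```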
Readr Functions:
read_csv(): comma-separated values (.csv) files
read_tsv(): tab-separated values files
read_delim(): general delimited files
read_fwf(): fixed-width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files
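A sketch of read_csv(), assuming readr (version 2.0 or later, for the `show_col_types` argument) is installed. The file is a temporary one written just for this example, so the path and contents are made up.

```r
library(readr)

# Write a small hypothetical .csv file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,82", "Bob,95"), path)

# read_csv() returns a tibble; show_col_types = FALSE hides the type message
scores <- read_csv(path, show_col_types = FALSE)
print(scores)
```

The other read_*() functions follow the same pattern with a different file layout.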
Readxl Function: // reads spreadsheet data
- library(readxl)
- read_excel()