30. Data Analytics - Data Analysis with R Programming

Learning how to clean data in R.

Definitions:

Data frame // a collection of columns, like a spreadsheet with named columns, rows, and cells.

Tibbles (tidyverse) // streamlined data frames

Tidy data (R) // a way of standardizing the organization of data within R

Anscombe's quartet // four datasets that have nearly identical summary statistics but look very different when plotted

- install.packages('Tmisc')

- library(Tmisc)

- install.packages("datasauRus")
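Quick check of the Anscombe's quartet idea. This is a minimal sketch that uses the `anscombe` data frame built into base R, so it does not depend on the packages above. The column names x1..x4 and y1..y4 come from that built-in dataset; `quartet_stats` is just a name picked for this example.

```r
# anscombe ships with base R (datasets package): columns x1..x4 and y1..y4
quartet_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_x   = sapply(1:4, function(i) sd(anscombe[[paste0("x", i)]])),
  sd_y   = sapply(1:4, function(i) sd(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) {
    cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
  })
)

# One row per dataset; the values barely differ from row to row
print(round(quartet_stats, 2))

# Plotting is what reveals the differences, for example:
# plot(anscombe$x1, anscombe$y1)
# plot(anscombe$x2, anscombe$y2)
```

The point of the exercise: summary statistics alone can hide the shape of the data, so plot it as well.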

Analyzing Bias:

- install.packages("SimDesign")

- library(SimDesign)

- bias(firstdataset, seconddataset) // 1st must be the actual values, 2nd must be the predictions

- the function returns the average amount by which the actual outcome is greater than the predicted outcome

- the closer the result is to 0, the less biased the predictions

- cor(data1,data2) // shows correlation

- sd(x) // shows standard deviation
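A rough sketch of what the bias number means, in base R only. `mean_bias` is a hypothetical helper written for this note, not the SimDesign function; it assumes that bias(actual, predicted) with default settings boils down to the average of (actual - predicted). The values are made up for illustration.

```r
# Illustrative values only
actual    <- c(10, 12, 14, 16)
predicted <- c(9, 12, 13, 14)

# Hypothetical helper: average of (actual - predicted)
mean_bias <- function(actual, predicted) {
  mean(actual - predicted)
}

mean_bias(actual, predicted)  # positive: actual values run higher than the predictions
cor(actual, predicted)        # correlation between the two vectors
sd(actual)                    # standard deviation of the actual values
```

A result of 0 would mean the over- and under-predictions cancel out on average. It does not mean every single prediction is correct.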

Manually Create a Data Frame:

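- data.frame(columna = vectora, columnb = vectorb, columnc = vectorc) // each vector or list becomes a column. all of them must have the same length

- dataframetest <- data.frame(value = 1:10) // creates a data frame with one column holding 1 to 10. c(1:10) on its own only creates a vector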
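Small base R sketch of building a data frame by hand. The vector names and values are made up for this example.

```r
# Vectors of equal length become the columns
names_col <- c("Ana", "Ben", "Cy")
ages_col  <- c(31, 25, 40)

people <- data.frame(name = names_col, age = ages_col)

str(people)    # structure: 3 observations of 2 variables
nrow(people)
ncol(people)

# A plain vector is not a data frame until it is wrapped in data.frame()
one_to_ten <- data.frame(value = 1:10)
is.data.frame(one_to_ten)   # TRUE
is.data.frame(c(1:10))      # FALSE
```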

Additional Arithmetic and Assignment Operators:

- %% // modulus. returns the remainder after a division

- %/% // integer division. returns the whole-number part of a division

- "<-" , "<<-", "=" // leftward assignment

- "->", "->>" // rightward assignment
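The operators in action, base R only. `bump` is a throwaway function name used to show what "<<-" does.

```r
7 %% 3     # modulus: remainder of 7 / 3, gives 1
7 %/% 3    # integer division: whole-number part of 7 / 3, gives 2

x <- 5     # leftward assignment
10 -> y    # rightward assignment
z = x + y  # "=" also assigns at the top level

# "<<-" assigns to a variable in an enclosing environment
# instead of creating a local one inside the function
bump <- function() {
  x <<- x + 1
}
bump()
x          # now 6
```

In everyday scripts "<-" is the usual choice. "<<-" and the rightward forms are good to recognize but rarely needed.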

Cleaning packages:

- install.packages("here") // makes referencing files easier

- install.packages("skimr") // makes summarizing data easier

- install.packages("janitor") // helps clean data

Data frames process:

- Columns should be named

- Data stored can be many different types, like numeric, factor, or character

- Each column should contain the same number of data items

Tibbles: // helps with printing

- Never change the data types of the inputs

- Never change the names of your variables

- Never create row names

- Make printing easier

Function:

- as_tibble(dataset) // converts a data frame to a tibble

* Tibbles are streamlined data frames that automatically print only the first 10 rows of a dataset and only as many columns as fit on the screen.
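Sketch of converting a built-in data frame to a tibble. It assumes the tibble package is installed (it comes with the tidyverse). `iris` is a dataset that ships with R; `iris_tbl` is just a name chosen here.

```r
library(tibble)  # assumed installed, e.g. via install.packages("tidyverse")

# iris is a built-in data frame with 150 rows
iris_tbl <- as_tibble(iris)

iris_tbl          # prints the dimensions, the column types, and only the first rows
class(iris_tbl)   # includes "tbl_df", the tibble class
```

Printing `iris` directly would dump all 150 rows, which is the difference the tibble is meant to fix.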

Tidy Data Standards:

*Refers to the principles that make data structures meaningful and easy to understand.

- Variables are organized into columns

- Observations are organized into rows

- Each value must have its own cell

Functions:

- head() // previews a dataset by showing the first six rows

- str() // shows the structure of a data frame (column types and sample values)

- colnames() // shows the column names of a data frame

- mutate(dataframe, newcolumnname = expression) // adds a new column to a data frame

- glimpse() // shows a compact summary: every column, its type, and its first values

- skim_without_charts() // needs the skimr package. summarizes the data set

- select(dataframe, column) // keeps only the specified column

- select(dataframe, -column) // keeps everything except the specified column

- rename(dataset, newcolumnname = oldcolumnname) // renames a column

- rename_with(dataset, toupper) // renames all column names of dataset to uppercase

- clean_names(dataset) // needs the janitor package. cleans names so they only contain letters, numbers, and underscores

- arrange(dataframe, column) // sorts ascending. add a minus sign in front of the column, or wrap it in desc(), to sort descending

- group_by(dataframe, column) // groups rows by the values in a column

- mean(datacolumn), max(datacolumn), min(datacolumn) // average, largest, and smallest value of a column
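Several of the functions above chained together on the built-in ToothGrowth data frame (columns len, supp, dose). A sketch only: it assumes dplyr is installed, and the new column names `tooth_len` and `len_x2` are made up for the example.

```r
library(dplyr)  # assumed installed

head(ToothGrowth)       # first six rows
str(ToothGrowth)        # structure: column types and sample values
colnames(ToothGrowth)   # "len" "supp" "dose"

teeth <- ToothGrowth %>%
  rename(tooth_len = len) %>%        # new name = old name
  mutate(len_x2 = tooth_len * 2) %>% # hypothetical derived column
  select(-dose) %>%                  # drop one column, keep the rest
  arrange(desc(tooth_len))           # sort descending

glimpse(teeth)

# Uppercase every column name
teeth_upper <- rename_with(teeth, toupper)
colnames(teeth_upper)

mean(teeth$tooth_len)
max(teeth$tooth_len)
min(teeth$tooth_len)
```

The pipe (%>%) passes the result of each step in as the first argument of the next one, which is why the data frame is only named once.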

Organizing Functions:

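- arrange() // sorts rows by a column

- group_by(dataframe, column) // groups rows that share the same value in a column, so that later functions such as summarize() work on each group

- can use multiple columns separated by commas. if so, each unique combination of values becomes its own group

- drop_na() // removes rows with missing values from the data set

- filter(dataframe, condition) // only keeps the rows that meet the condition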
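Sketch of the organizing functions on a tiny made-up data frame. It assumes dplyr and tidyr are installed. summarize() is not in the list above; it is added here because group_by() on its own does not change what is displayed.

```r
library(dplyr)  # assumed installed
library(tidyr)  # assumed installed; provides drop_na()

# Hypothetical data with one missing score
scores <- data.frame(
  team  = c("A", "A", "B", "B", "B"),
  year  = c(2020, 2021, 2020, 2020, 2021),
  score = c(10, 20, 30, NA, 50)
)

# Mean score per team, ignoring the row with the missing value
team_means <- scores %>%
  drop_na() %>%
  group_by(team) %>%
  summarize(mean_score = mean(score))
team_means

# Grouping by two columns: one group per team/year combination
combo_counts <- scores %>%
  drop_na() %>%
  group_by(team, year) %>%
  summarize(n = n(), .groups = "drop")
combo_counts

# filter() keeps only the rows that meet every condition
high_b <- filter(scores, team == "B", score > 30)
high_b
```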

Transforming Functions:

- separate(dataframename, columnname, into=c('TESTA','TESTB'), sep=' ')

- separates a column's values into 2 new columns, using sep to define the separator

- unite(dataframename, 'newcolumn', columna, columnb, sep=' ')

- combines two columns' values into one new column, with an optional separator

- mutate(dataframe, newcolumn = columnA*1000)

- creates a new column using columnA's values
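separate() and unite() as a round trip, assuming tidyr is installed. The `staff` data frame and its column names are hypothetical.

```r
library(tidyr)  # assumed installed

# Hypothetical data frame with full names in one column
staff <- data.frame(
  full_name = c("Ada Lovelace", "Alan Turing"),
  stringsAsFactors = FALSE
)

# Split on the space into two new columns
split_names <- separate(staff, full_name, into = c("first", "last"), sep = " ")
split_names

# Join the two columns back into one
rejoined <- unite(split_names, "full_name", first, last, sep = " ")
rejoined
```

By default both functions drop the columns they consume, so `split_names` no longer has a full_name column.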

Readr Functions:

read_csv(): comma-separated values (.csv) files
read_tsv(): tab-separated values files
read_delim(): general delimited files
read_fwf(): fixed-width files
read_table(): tabular files where columns are separated by white-space
read_log(): web log files
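
Minimal readr sketch, assuming readr is installed. To stay self-contained it first writes a small made-up CSV to a temporary file; in real use the path would point at an existing file.

```r
library(readr)  # assumed installed

# Write a tiny CSV to a temporary file so the example needs no outside data
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,Ana,9.5", "2,Ben,7.0"), csv_path)

# read_csv() returns a tibble and guesses the column types
results <- read_csv(csv_path)
results

# read_delim() reads the same file when told the delimiter explicitly
same_results <- read_delim(csv_path, delim = ",")
```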

Readxl Function: // reads spreadsheet data

- library(readxl)

- read_excel()

WIDE VS LONG DATA FORMAT:

*https://tidyr.tidyverse.org/articles/pivot.html

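WIDE: // every subject has a single row, with its variables spread across multiple columns

LONG: // every subject has multiple rows, one per observation, with the variable names stored in a column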
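
Sketch of switching between the two formats with tidyr's pivot functions (the ones covered in the article linked above). It assumes tidyr is installed; the data frame and its values are made up.

```r
library(tidyr)  # assumed installed

# Wide: one row per country, one column per year (hypothetical values)
wide <- data.frame(
  country = c("X", "Y"),
  y2020   = c(1, 3),
  y2021   = c(2, 4)
)

# Wide -> long: one row per country/year pair
long <- pivot_longer(wide, cols = c(y2020, y2021),
                     names_to = "year", values_to = "value")
long

# Long -> wide again
wide_again <- pivot_wider(long, names_from = year, values_from = value)
wide_again
```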

Additional Resources:

https://tibble.tidyverse.org/

https://rstudio-education.github.io/tidyverse-cookbook/tidy.html#

https://readxl.tidyverse.org/

https://r-coder.com/operators-r/#Assignment_operators_in_R

https://www.rdocumentation.org/packages/SimDesign/versions/2.2/topics/bias

https://datasciencebox.org/ethics.html

Learning Notes

30. Data Analytics - Data Analysis with R Programming - Week 3
