30. Data Analytics - Data Analysis with R Programming - Week 3

Learning how to clean data in R.


Definition:

Data frame // a collection of columns, like a spreadsheet with column names, rows, and cells.

Tibbles (Tidyverse) // are like streamlined data frames

Tidy data (R) // way of standardizing the organization of data within R

Anscombe's quartet // four datasets that have nearly identical summary statistics but look very different when plotted

        - install.packages('Tmisc')

        - library(Tmisc)

        - install.packages("datasauRus")
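To see why the quartet matters, here is a minimal sketch. It assumes the `anscombe` data set that ships with base R (columns x1..x4 and y1..y4) rather than the copy loaded through Tmisc, so no extra package is needed.

```r
# Base R ships Anscombe's quartet as `anscombe` (columns x1..x4, y1..y4).
data(anscombe)

y_cols <- c("y1", "y2", "y3", "y4")

# Mean and standard deviation of each y column: nearly identical.
y_means <- sapply(anscombe[, y_cols], mean)
y_sds   <- sapply(anscombe[, y_cols], sd)

# Correlation of each x/y pair: also nearly identical.
xy_cor <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

print(round(y_means, 2))
print(round(y_sds, 2))
print(round(xy_cor, 2))

# Plotting is what reveals how different the four sets are, for example:
# plot(anscombe$x1, anscombe$y1)
```

The numbers match closely across all four sets, so the summary statistics alone cannot tell them apart.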


Analyzing Bias:

        - install.packages("SimDesign")

        - library(SimDesign)

        - bias(firstdataset, seconddataset) // 1st must be actual, 2nd must be prediction

                - the closer the result is to 0, the less biased the predictions are.

                - the function returns the average amount by which the actual outcome exceeds the predicted outcome

        - cor(data1,data2) // shows correlation

        - sd(x) // shows standard deviation 
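A rough sketch of these three calculations in base R. The values are made up, and the bias line is an assumption: it takes the average of (actual - predicted), which matches the description above, instead of calling SimDesign's bias() directly.

```r
# Hypothetical values, made up for illustration only.
actual    <- c(10, 12, 14)
predicted <- c(9, 12, 13)

# Assumed equivalent of bias(actual, predicted):
# the average of (actual - predicted). A result near 0 means little bias.
manual_bias <- mean(actual - predicted)
print(manual_bias)

# Correlation between the two vectors (ranges from -1 to 1).
print(cor(actual, predicted))

# Standard deviation of one vector.
print(sd(actual))
```

A positive result here means the actual values are, on average, higher than the predictions.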


Manually Create a Data Frame:

        - data.frame(vector/list name a, vector/list name b, vector/list name c)

        - dataframetest <- data.frame(x = 1:10) // this creates a data frame with one column holding 1 to 10 (c(1:10) alone only creates a vector)
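A minimal sketch of building a data frame from vectors. The vector names and values are hypothetical.

```r
# Hypothetical vectors of equal length.
id    <- 1:3
name  <- c("Ann", "Bob", "Cy")
score <- c(90.5, 82, 77.25)

# Each vector becomes one column; column names come from the vector names.
people <- data.frame(id, name, score)
print(people)

# A one-column data frame holding the values 1 to 10.
dataframetest <- data.frame(x = 1:10)
print(nrow(dataframetest))
```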


Additional Arithmetic Operators:

        - %% // modulus. returns the remainder after a division

        - %/% // integer division. returns the whole-number part of a division

        - "<-" , "<<-", "=" // leftward assignment

        - "->", "->>" // rightward assignment
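A short sketch of the modulus and integer-division operators, written `%%` and `%/%` in R, together with the assignment forms. The variable names are placeholders.

```r
remainder <- 7 %% 3    # modulus: remainder of 7 / 3
quotient  <- 7 %/% 3   # integer division: whole part of 7 / 3

a <- 5      # leftward assignment
10 -> b     # rightward assignment
c_val = 15  # `=` also assigns at the top level

print(c(remainder, quotient, a, b, c_val))
```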

Cleaning packages:

        - install.packages("here") // makes referencing files easier

        - install.packages("skimr") // makes summarizing data easier

        - install.packages("janitor") // helps clean data


Data frames process:

        - Columns should be named

        - Data stored can be many different types, like numeric, factor, or character

        - Each column should contain the same number of data items

        

Tibbles: // helps with printing

        - Never change the data types of the inputs

        - Never change the names of your variables

        - Never create row names

        - Make printing easier

        Function:

                - as_tibble(dataset)


        * streamlined data frames that automatically print only the first 10 rows of a dataset and only as

            many columns as can fit on the screen.
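A minimal sketch of converting a data frame to a tibble, assuming the tibble package is installed and using the built-in mtcars data set as the example input.

```r
library(tibble)

# Convert a built-in data frame to a tibble.
cars_tbl <- as_tibble(mtcars)

# Printing shows the first 10 rows and only the columns that fit on screen.
print(cars_tbl)

# The result is still a data frame, with the extra tibble classes on top.
print(class(cars_tbl))
```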


Tidy Data Standards:

        *Refers to the principles that make data structures meaningful and easy to understand.

        - Variables are organized into columns

        - Observations are organized into rows

        - Each value must have its own cell


Functions:

        - head() // previews a dataset by showing first six rows

        - str() // shows the structure of an object such as a data frame

        - colnames() // shows the column names of a data set

        - mutate(dataframe, newcolumnname = expression) // adds new columns to a data frame

        - glimpse() // shows a compact summary of the data set: each column, its type, and its first values

        - skim_without_charts() // requires the skimr package. summarizes the data set

        - select(dataset, column) // keeps only the specified column

        - select(dataset, -column) // keeps every column except the specified one

        - rename(dataset, newcolumnname = oldcolumnname) // renames a column

        - rename_with(dataset, toupper) // renames all column names of dataset to uppercase

        - clean_names(dataset) // requires the janitor package. makes column names consistent, using only letters, numbers, and underscores

        - arrange(dataset, datacolumn) // sorts ascending. add a minus sign in front of the column, or wrap it in desc(), to sort descending

        - group_by()

        - mean(datacolumn), max(datacolumn), min(datacolumn)
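A short sketch that runs several of the functions above on a made-up data frame. It assumes the dplyr package is installed; the `trips` data and its column names are hypothetical.

```r
library(dplyr)

# Hypothetical data frame, made up for illustration.
trips <- data.frame(
  city  = c("Lima", "Oslo", "Cairo", "Oslo"),
  miles = c(12, 5, 30, 8)
)

print(head(trips))      # first six rows (here, all four)
str(trips)              # structure: column types and sample values
print(colnames(trips))  # column names

# Keep or drop columns.
only_city <- select(trips, city)
no_city   <- select(trips, -city)

# rename() takes new_name = old_name.
renamed <- rename(trips, distance = miles)

# Apply toupper() to every column name.
upper <- rename_with(trips, toupper)

# Sort ascending, then descending.
asc         <- arrange(trips, miles)
desc_sorted <- arrange(trips, desc(miles))

print(desc_sorted)
print(mean(trips$miles))
```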


Organizing Functions:

        - arrange()

        - group_by(dataset, column) // groups rows that share the same value in a column, so that summary functions return one value per group

                    - can group by multiple columns separated by commas; each combination of values then forms its own group

        - drop_na() // excludes rows with missing values from the data set

        - filter(dataset, condition) // keeps only the rows that meet the condition
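A sketch of the organizing functions working together, assuming dplyr and tidyr are installed. The `sales` data is made up, and summarize() is added here only to show what grouping does to a summary.

```r
library(dplyr)
library(tidyr)

# Hypothetical data with one missing value.
sales <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  year   = c(2020, 2021, 2020, 2020, 2021),
  amount = c(100, NA, 50, 70, 30)
)

# drop_na() removes rows that contain NA.
complete <- drop_na(sales)

# filter() keeps rows that satisfy a condition.
west_only <- filter(complete, region == "west")

# group_by() with two columns: each region/year combination is one group,
# and summarize() returns one row per group.
totals <- complete %>%
  group_by(region, year) %>%
  summarize(total = sum(amount), .groups = "drop")

print(totals)
```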


Transforming Functions:

        - separate(dataframename, columnname, into = c('TESTA', 'TESTB'), sep = ' ')

                - separates a column's values into 2 new columns, splitting at the separator defined with sep

        - unite(dataframename, 'newcolumn', columna, columnb, sep=' ')

                - combines two columns' values into one new column, with an optional separator

        - mutate(dataframe, newcolumn = columnA*1000)

                - creates a new column using columnA's values
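A minimal sketch of the three transforming functions, assuming dplyr and tidyr are installed. The `staff` data frame and its column names are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical data frame.
staff <- data.frame(
  full_name = c("Ada Lovelace", "Alan Turing"),
  salary_k  = c(120, 95)
)

# Split full_name at the space into two new columns.
split_names <- separate(staff, full_name, into = c("first", "last"), sep = " ")

# Join the two columns back into one.
joined <- unite(split_names, "full_name", first, last, sep = " ")

# Add a column computed from an existing one.
with_salary <- mutate(staff, salary = salary_k * 1000)

print(split_names)
print(joined)
print(with_salary)
```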

Readr Functions:

  • read_csv(): comma-separated values (.csv) files

  • read_tsv(): tab-separated values files

  • read_delim(): general delimited files

  • read_fwf(): fixed-width files

  • read_table(): tabular files where columns are separated by white-space

  • read_log(): web log files
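A small sketch of read_csv() and read_delim(), assuming the readr package is installed. It writes a made-up CSV to a temporary file first so it runs without any external data.

```r
library(readr)

# Write a small hypothetical CSV to a temporary file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "Ann,90", "Bob,82"), csv_path)

# read_csv() returns a tibble and guesses each column's type.
scores <- read_csv(csv_path)
print(scores)

# read_delim() reads the same file when given the delimiter explicitly.
scores2 <- read_delim(csv_path, delim = ",")
print(scores2)
```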


Readxl Function: // reads spreadsheet data

        - library(readxl)

        - read_excel()


WIDE VS LONG DATA FORMAT:

        *https://tidyr.tidyverse.org/articles/pivot.html

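        WIDE: each subject has one row, with repeated measurements spread across several columns

        LONG: each row holds one measurement, so a subject can span several rows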
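A minimal sketch of moving between the two formats with tidyr's pivot_longer() and pivot_wider(), assuming tidyr is installed. The table and its column names are made up.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
wide <- data.frame(
  country = c("A", "B"),
  y2020   = c(10, 20),
  y2021   = c(11, 21)
)

# Wide to long: the year columns collapse into a name column and a value column.
long <- pivot_longer(wide, cols = c(y2020, y2021),
                     names_to = "year", values_to = "value")
print(long)

# Long back to wide: one column per distinct value of `year`.
wide_again <- pivot_wider(long, names_from = year, values_from = value)
print(wide_again)
```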

        


Additional Resources:

https://tibble.tidyverse.org/

https://rstudio-education.github.io/tidyverse-cookbook/tidy.html#

https://readxl.tidyverse.org/

https://r-coder.com/operators-r/#Assignment_operators_in_R

https://www.rdocumentation.org/packages/SimDesign/versions/2.2/topics/bias

https://datasciencebox.org/ethics.html

