Posts

Showing posts from March, 2022

18. Python - Data Analytics - Missing Values

3 Ways to Deal with Missing Values:
1. Drop columns with missing values.
2. Impute the missing values, replacing them with something that makes sense, like the column mean.
3. Impute the missing values, but add a new column labeling which entries were imputed, to separate them from non-imputed data.
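A minimal sketch of the three approaches in pandas; the column names and values here are illustrative:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Option 1: drop columns that contain missing values
dropped = df.dropna(axis=1)

# Option 2: impute with the column mean
imputed = df.fillna(df.mean())

# Option 3: impute, but keep a flag column showing what was imputed
flagged = df.copy()
flagged["a_was_missing"] = flagged["a"].isna()
flagged["a"] = flagged["a"].fillna(flagged["a"].mean())
```

Option 3 preserves the information that a value was originally missing, which a model can sometimes use.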

17. Python - Pandas - Renaming/Combining

Changing names:
  Columns:
    data_frame.rename(columns = {'OLDNAME': 'NEWNAME'})
    // can rename several columns at once by adding more pairs, separated by commas
  Index:
    data_frame.rename(index = {0: 'first index', 1: '2nd index'})
    // set_index() is usually more convenient
  Axis names:
    data_frame.rename_axis("AXISNAME_A", axis = 'rows').rename_axis("AXISNAME_B", axis = 'columns')
    // renames both axes, the row index and the column index
Combining:
  concat():
    pd.concat([data_frame_1, data_frame_2])
    // combines two data frames that have the same columns (fields) into one data frame,
    // placing both sets of rows under the same columns
  join():
    left = data_frame_1.set_index(['FIRST INDEX', 'SECOND INDEX'])
    right = data_frame_2.set_index(['FIRST INDEX', 'SECOND INDEX'])
    left.join(right, lsuffix = '_FIRSTDATA', rsuffix = '_2NDDATA')
    // creates a new data frame combining the two data frames where they share the same indexes
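A runnable sketch of rename, concat, and join; the frames and column names are illustrative stand-ins for the placeholders above:

```python
import pandas as pd

# rename a column, then stack two frames that share the same columns
df1 = pd.DataFrame({"old": [1, 2]}).rename(columns={"old": "new"})
df2 = pd.DataFrame({"new": [3]})
stacked = pd.concat([df1, df2], ignore_index=True)

# join two frames on a shared index; suffixes distinguish overlapping column names
left = pd.DataFrame({"k": ["x", "y"], "val": [1, 2]}).set_index("k")
right = pd.DataFrame({"k": ["x", "y"], "val": [10, 20]}).set_index("k")
joined = left.join(right, lsuffix="_left", rsuffix="_right")
```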

16. Python - Pandas - Data Types & Missing Values

Data Types:
  Common types: float64, int64, object (strings or mixed values), bool
  Find a data type:
    data_frame.COLUMNA.dtype // returns the column's type
    data_frame.index.dtype // returns the type of the index
  Converting types:
    data_frame.COLUMNA.astype('float64') // converts COLUMNA to the float64 type
Missing Data:
  Filter with:
    pd.isnull(data_frame.COLUMN)
    pd.notnull(data_frame.COLUMN)
  Filling data:
    data_frame.COLUMN.fillna("SOMETHING")
    // creates a series where every NaN in COLUMN is filled with "SOMETHING"
Replacing Data:
  data_frame.COLUMN.replace("ORIGINAL", "REPLACER") // replaces one value with another
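The methods above in a small runnable example; the column names and values are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"points": ["1", "2", "3"], "name": ["a", None, "c"]})

points = df["points"].astype("float64")       # convert a string column to floats
missing_mask = pd.isnull(df["name"])          # True where name is missing
filled = df["name"].fillna("unknown")         # fill the NaN values
replaced = df["points"].replace("1", "one")   # replace one value with another
```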

15. Python - Pandas - Sorting/Grouping

Group By:
  Basic methods:
    data_frame.groupby('COLUMNA').COLUMNAA.count()
    // produces a series indexed by the unique values of COLUMNA, with values showing
    // the count of COLUMNAA entries in each group
    // min(), max(), etc. can be used instead of count()
    // COLUMNAA stands for values that relate to the COLUMNA groups
    // can group by more than one column along with COLUMNA
  Lambda functions:
    data_frame.groupby('COLUMNA').apply(lambda p: p.COLUMNAA.iloc[0])
    // gets the first item of the COLUMNAA items that correspond to each group made from COLUMNA
  agg() function:
    data_frame.groupby(['COLUMNA']).COLUMNB.agg([len, min, max])
    // creates a data frame keyed by the unique values of COLUMNA, with the len, min,
    // and max of COLUMNB for each group
    data_frame.groupby(['COLUMNA']).COLUMNB.min()
    // using a single aggregation like this makes a series instead
    // options: max(), mean(), min()
Multi-Indexes …
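A small runnable version of the grouping patterns above; the data is illustrative, and the lambda is applied to the grouped column directly to keep the result a simple series:

```python
import pandas as pd

df = pd.DataFrame({
    "variety": ["red", "red", "white"],
    "points": [85, 90, 88],
})

counts = df.groupby("variety").points.count()                      # count per group
firsts = df.groupby("variety").points.apply(lambda p: p.iloc[0])   # first value per group
stats = df.groupby("variety").points.agg(["min", "max"])           # DataFrame of aggregates
```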

14. Python - Pandas - Summary Functions & Maps

Summary Methods:
  data_frame.describe()
  data_frame.COLUMNA.mean() // returns the mean
  data_frame.COLUMNA.median() // returns the median
  data_frame.COLUMNA.unique() // returns only the unique values
  data_frame.COLUMNA.value_counts() // returns each value and its count
  data_frame.COLUMNA.idxmax() // returns the index of the maximum value
  data_frame.COLUMNA.sum()
Maps:
  data_frame_COLUMNA_mean = data_frame.COLUMNA.mean() // finds the mean of the column
  data_frame.COLUMNA.map(lambda p: p - data_frame_COLUMNA_mean)
  // creates a new series from COLUMNA; each value is determined by the lambda function,
  // which returns COLUMNA's value minus the mean
  // a lambda is an anonymous function
  data_frame_mean = data_frame.COLUMNA.mean()
  data_frame.COLUMNA - data_frame_mean
  // returns the same result as the two map steps above
  data_frame.COLUMNA + "SPACE" + data_frame.COLUMNB
  // returns a new series whose values combine COLUMNA and COLUMNB with the string "SPACE" in between
Apply Function …
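The map-versus-vectorized comparison above, as a runnable sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({"points": [80.0, 90.0, 100.0], "country": ["US", "FR", "IT"]})

mean_points = df.points.mean()                        # 90.0
centered = df.points.map(lambda p: p - mean_points)   # subtract the mean via map()
centered_fast = df.points - mean_points               # same result, vectorized
combined = df.country + " - " + df.points.astype(str) # combine two columns with a string
```

The vectorized form is both shorter and faster than the equivalent `map()` call.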

13. Python - Pandas - Indexing/Selecting/Assigning

Selecting Columns:
  data_frame.column_name // attribute (dot) way
  data_frame['column_name'] // indexing way
Selecting Data In Columns:
  data_frame['column_name'][indexnumber] // indexing way
Selecting with 'iloc' and 'loc':
  Selecting a column/row with iloc:
    data_frame.iloc[rowindex]
    data_frame.iloc[rowstart_index:rowend_index, column_index]
    // indexes can be negative, which means count from the end
    data_frame.iloc[[1, 2, 3], 0]
    // selects a list of specified row positions in the first column
    // a colon specifies all: using ':' instead of '0' selects all columns
  Selecting a column/row with loc:
    data_frame.loc[rowstart_index:rowend_index, ['COLUMNA', 'COLUMNB']]
    // loc takes a list of column names to select
  Selecting using conditions:
    data_frame.loc[data_frame.COLUMN == 'Value']
    // selects only the rows where COLUMN has that 'Value'
    data_frame.loc[(data_frame.COLUMNA == …
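A runnable sketch of the selection styles above, with illustrative data; note that `iloc` slices exclude the endpoint while `loc` slices include it:

```python
import pandas as pd

df = pd.DataFrame({"country": ["US", "FR", "US"], "points": [90, 85, 95]})

by_attr = df.country                        # attribute access
by_key = df["country"]                      # indexing access
cell = df["country"][0]                     # a single value

first_row = df.iloc[0]                      # first row by position
slice_rows = df.iloc[0:2, 0]                # rows 0-1 of the first column (end exclusive)
picked = df.iloc[[0, 2], 0]                 # a list of specific row positions

cols = df.loc[0:1, ["country", "points"]]   # loc slices are end inclusive
us_only = df.loc[df.country == "US"]        # conditional selection
both = df.loc[(df.country == "US") & (df.points >= 95)]  # combine conditions with &
```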

12. Python - Pandas - Basics

Importing Pandas:
  import pandas as pd
Types of core objects in Pandas:
  DataFrame // a table like an Excel worksheet; has an index for rows, column headers, and values
  Series // a list of values
Creating a Data Frame:
  pd.DataFrame({'COLUMNNAME_1': [VALUE, VALUE], 'COLUMNNAME_2': [VALUE, VALUE]})
  // just like a map: keys are column names and values are lists of items
  // (all columns must have the same number of values)
  // indexes are automatically created from 0 to nROWS
  pd.DataFrame({'COLUMNNAME_1': [VALUE, VALUE], 'COLUMNNAME_2': [VALUE, VALUE]}, index = ['FIRST', 'SECOND'])
  // including an index argument allows naming the indexes ourselves
Creating a Series:
  pd.Series([VALUE, VALUE, VALUE], index = ['A', 'B', 'C'], name = 'MYSERIES')
  // creates a series from a list of values, naming our index and giving the series an overall name
  // a DataFrame is like a bunch of series glued together
Basic Methods:
  data_frame.head() // shows the first few rows
  data_frame.de…
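The constructors above as a runnable sketch, with illustrative values standing in for the VALUE placeholders:

```python
import pandas as pd

# DataFrame from a dict: keys become column names, lists become the column values
df = pd.DataFrame(
    {"Yes": [50, 21], "No": [131, 2]},
    index=["Product A", "Product B"],
)

# Series: a single list of values with its own index and an overall name
s = pd.Series([30, 35, 40], index=["2015", "2016", "2017"], name="Sales")

head = df.head()  # first rows of the frame
```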

11. Python - Data Analytics - Random Forest Model

Import libraries:
  from sklearn.ensemble import RandomForestRegressor
  from sklearn.metrics import mean_absolute_error
Create a model and fit it, then use it to predict:
  forest_model = RandomForestRegressor(random_state=1)
  forest_model.fit(train_X, train_y)
  melb_preds = forest_model.predict(val_X)
  print(mean_absolute_error(val_y, melb_preds))
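The same steps, made self-contained with tiny synthetic data standing in for the train/validation splits (the real data would come from an earlier split):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# tiny synthetic data, just to make the snippet runnable
train_X = [[1], [2], [3], [4]]
train_y = [2, 4, 6, 8]
val_X = [[5]]
val_y = [10]

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
preds = forest_model.predict(val_X)
mae = mean_absolute_error(val_y, preds)
```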

10. Python - Data Analytics - Scikit-Learn

MODEL VALIDATION:
Splitting data with scikit-learn:
  from sklearn.model_selection import train_test_split
  train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
  // splits the data into training and validation sets; X (the features) and y (the
  // prediction target) must be defined beforehand
  melbourne_model.fit(train_X, train_y)
  # get predicted prices on validation data
  val_predictions = melbourne_model.predict(val_X)
  print(mean_absolute_error(val_y, val_predictions))
Calculating MAE (Mean Absolute Error):
  from sklearn.metrics import mean_absolute_error
  predicted_home_prices = melbourne_model.predict(X)
  mean_absolute_error(y, predicted_home_prices)
  // MAE is the average of the absolute differences between the actual and predicted values
Overfitting and Underfitting:
  // Overfitting: too many leaves in the decision tree. The model may seem accurate on the
  // training data, but its predictions become far off on new data.
  // Underfitting: too few leaves in the decision tree. Th…
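A self-contained sketch of the split-fit-score workflow; the data is synthetic, and a `DecisionTreeRegressor` stands in for the `melbourne_model` fitted in an earlier post:

```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# synthetic data: X is the features, y the prediction target
X = [[i] for i in range(20)]
y = [2 * i for i in range(20)]

# hold out a validation set so MAE is measured on data the model never saw
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

model = DecisionTreeRegressor(random_state=0)
model.fit(train_X, train_y)
val_predictions = model.predict(val_X)
mae = mean_absolute_error(val_y, val_predictions)
```

Scoring on the validation set, rather than the training set, is what catches overfitting.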

9. Python - Data Analytics - Creating a Model

import pandas as pd
data_file_path = 'filepath/file.csv' // indicate the path (use forward slashes or a raw string, since '\f' is an escape character)
data_data = pd.read_csv(data_file_path) // reads the data and creates a data frame
data_data.describe() // shows summary statistics of the data
data_data.columns // lists out the column headers
data_data.head() // summarizes the first few rows of the data frame
Read Methods:
  read_csv(datapath)
  read_excel(datapath)
Prediction Target:
  // the data we want to predict, usually represented as y
  // use dot notation to select the prediction target from the data frame
  y = data_data.columnname // this creates a series for the prediction target
Features:
  // the data used to make the predictions, represented as X
  // create a list of features to use by listing column names
  data_features = ['Feature1', 'Feature2', 'Feature3']
  X = data_data[data_features] // this creates a data frame of the features
Scikit-learn Library to Create Models:
  from sklearn.tree…
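The load-then-split-into-X-and-y steps as a runnable sketch; an in-memory CSV with illustrative columns stands in for the file path:

```python
import io
import pandas as pd

# stand-in for pd.read_csv('path/to/file.csv'): read from an in-memory CSV instead
csv_text = "Rooms,Bathroom,Price\n2,1,300000\n3,2,450000\n"
data = pd.read_csv(io.StringIO(csv_text))

summary = data.describe()            # summary statistics
headers = list(data.columns)         # column headers

y = data.Price                       # prediction target (a series)
data_features = ["Rooms", "Bathroom"]
X = data[data_features]              # feature DataFrame
```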

8. Python - Importing Libraries

import library_name // imports the library
import library_name as short // imports and creates a short reference name
from library_name import * // imports all of the module's names so they can be used directly, without the module (dot) prefix
from library_name import specific_functionname // imports only specific things from the module
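The import forms above, demonstrated with the standard math module:

```python
import math             # plain import: names are used as math.sqrt
import math as m        # aliased import: m.sqrt
from math import sqrt   # direct import: sqrt, with no prefix needed
# 'from math import *' would pull in every public name, but it can easily
# shadow existing names, so the explicit forms above are usually preferred

a = math.sqrt(9.0)
b = m.sqrt(9.0)
c = sqrt(9.0)
```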