15. Data Analytics - Process Data from Dirty to Clean - Week 1

Definition:

Data integrity // accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.

Data replication // process of storing data in multiple locations

Data transfer // process of copying data from a storage device to memory, or from one computer to another

Data manipulation // process of changing data to make it more organized and easier to read.

Data duplication // when data is duplicated in multiple places in the same file.

Population // all possible data values in a certain dataset

Sample size // a part of a population that is representative of the population. Cons: using a sample introduces uncertainty; you can't be 100% sure it accurately represents the population.

Sampling bias // a sample isn't representative of the population as a whole. Population can be over-represented or under-represented

Random sampling // a way of selecting a sample from a population so that every possible sample has an equal chance of being chosen

Margin of error // the maximum amount that the sample results are expected to differ from those of the actual population. In other words, the range within which the sample's results might differ from the results of analyzing the whole population. A smaller margin of error means more reliable results.

        EX: With a margin of error of 10%, a survey where 60% of respondents in the sample vote yes

        would mean that in the population, between 50% and 70% of people would vote yes

        (60% minus 10%, and 60% plus 10%).

Confidence level // how confident you are in the survey results. The probability that your sample accurately reflects the greater population. 99% is ideal, 95% is good, and 90% is acceptable.

Confidence interval // the range of possible values within which the population's result would fall, at the confidence level of the study. This range is the sample result +/- the margin of error.

Statistical significance // a determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance. Usually, a statistical power of at least 0.8 (80%) is needed to consider results statistically significant.

        If a test is statistically significant, the results are deemed to be real and not an error caused by random chance.

Statistical power // the probability of getting meaningful results from a test. A value from 0 to 1 that can be read as a percentage; the higher the power, the more likely the results are to be statistically significant rather than due to chance. (A quick power-analysis sketch in Python follows these definitions.)

Hypothesis testing // a way to see if a survey or experiment has meaningful results

Proxy data // substitute data that is similar to the data you want. EX: If there is no data for a state, you can get proxy data from another state of similar size and demographics.
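
As a quick illustration of statistical power (mentioned above), here is a minimal Python sketch, not from the course, that uses the statsmodels power-analysis API to estimate the sample size needed to reach 0.8 power; the effect size and significance level are assumed values.

        # Minimal power-analysis sketch (assumed effect size and alpha).
        from statsmodels.stats.power import TTestIndPower

        analysis = TTestIndPower()

        # Solve for the sample size per group needed for 0.8 (80%) statistical power,
        # assuming a medium effect size (0.5) and a 0.05 significance level.
        n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
        print(round(n_per_group))  # roughly 64 observations per group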


Clean data is key to ensuring data integrity before analysis.

A strong analysis depends on the integrity of the data.

It is important to check that the data used aligns with the business objective.

Clean data + alignment to business objective = accurate conclusions.

Alignment to business objective + additional data cleaning = accurate conclusions.

Alignment to business objective + newly discovered variables + constraints = accurate conclusions.


To calculate margin of error you need:

        - Population size

        - Sample size

        - Confidence level

        *Use margin of error calculator

        https://goodcalculators.com/margin-of-error-calculator/
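
        A hedged sketch of what such a calculator computes, assuming the common proportion-based formula with a finite population correction (the linked calculator's exact formula may differ):

        import math
        from scipy.stats import norm

        def margin_of_error(population_size, sample_size, confidence_level, p=0.5):
            """Margin of error for a sample proportion, with finite population correction."""
            z = norm.ppf(1 - (1 - confidence_level) / 2)    # z-score for the confidence level
            moe = z * math.sqrt(p * (1 - p) / sample_size)  # margin of error for an infinite population
            fpc = math.sqrt((population_size - sample_size) / (population_size - 1))  # finite population correction
            return moe * fpc

        # Example: population of 10,000, sample of 500, 95% confidence -> about 4.3%
        print(round(margin_of_error(10_000, 500, 0.95) * 100, 1))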


How to choose sample size:

        - Don't choose a sample size of less than 30; sample sizes vary by business problem.

        - A confidence level of 95% or higher is good; 90% is acceptable. Increase it by using a larger sample.

        - Decrease the margin of error by increasing the sample size.

        - Increase statistical significance by using a larger sample.

        *Large samples have higher cost.

        *Use Sample Size Calculator to get minimum sample size needed.

        http://www.raosoft.com/samplesize.html
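
        A similar hedged sketch of a minimum sample size estimate, using Cochran's formula with a finite population correction (the linked calculator may round or adjust slightly differently):

        import math
        from scipy.stats import norm

        def min_sample_size(population_size, confidence_level=0.95, margin_of_error=0.05, p=0.5):
            z = norm.ppf(1 - (1 - confidence_level) / 2)        # z-score for the confidence level
            n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2  # estimate for an infinite population
            n = n0 / (1 + (n0 - 1) / population_size)           # finite population correction
            return math.ceil(n)

        # Example: population of 5,000, 95% confidence, 5% margin of error -> about 357
        print(min_sample_size(5_000))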


Processes that can compromise data:

        - Replication

        - Transfers

        - Manipulation

Other threats to data integrity:

        - Human error

        - Viruses

        - Malware

        - Hacking

        - System Failures


Data constraints:

Data type // values must be of a certain type: date, number, percentage, Boolean, etc. EX: If the data type is a date, a single number like 30 would fail the constraint and be invalid.

Data range // values must fall between predefined maximum and minimum values. EX: If the data range is 10-20, a value of 30 would fail the constraint and be invalid.

Mandatory // values can't be left blank or empty. EX: If age is mandatory, that value must be filled in.

Unique // values can't have a duplicate. EX: Two people can't have the same mobile phone number within the same service area.

Regular expression (regex) patterns // values must match a prescribed pattern. EX: A phone number must match ###-###-#### (no other characters allowed).

Cross-field validation // certain conditions for multiple fields must be satisfied. EX: Values are percentages, and values from multiple fields must add up to 100%.

Primary-key // (databases only) value must be unique per column. EX: A database table can't have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.

Set-membership // (databases only) values for a column must come from a set of discrete values. EX: Value for a column must be set to Yes, No, or Not Applicable.

Foreign-key // (databases only) values for a column must be unique values coming from a column in another table. EX: In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table.

Accuracy // the degree to which the data conforms to the actual entity being measured or described. EX: If values for zip codes are validated by street location, the accuracy of the data goes up.

Completeness // the degree to which the data contains all desired components or measures. EX: If data for personal profiles required hair and eye color, and both are collected, the data is complete.

Consistency // the degree to which the data is repeatable from different points of entry or collection. EX: If a customer has the same address in the sales and repair databases, the data is consistent.
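
A minimal Python sketch, not from the course, of how a few of these constraints could be checked on a record; the record and its field names are invented for illustration:

        import re
        from datetime import datetime

        # Hypothetical record to validate; field names are invented for illustration.
        record = {
            "date": "2021-07-15",
            "age": 34,
            "discount_pct": 40,
            "tax_pct": 60,
            "phone": "555-867-5309",
        }

        errors = []

        # Data type: the date value must parse as an actual calendar date.
        try:
            datetime.strptime(record["date"], "%Y-%m-%d")
        except ValueError:
            errors.append("date is not a valid date")

        # Mandatory: age can't be left blank or empty.
        if record.get("age") is None:
            errors.append("age is missing")
        # Data range: age must fall between predefined minimum and maximum values (arbitrary bounds here).
        elif not 0 <= record["age"] <= 120:
            errors.append("age is out of range")

        # Regular expression (regex) pattern: phone must match ###-###-####.
        if not re.fullmatch(r"\d{3}-\d{3}-\d{4}", record["phone"]):
            errors.append("phone does not match ###-###-####")

        # Cross-field validation: the two percentage fields must add up to 100%.
        if record["discount_pct"] + record["tax_pct"] != 100:
            errors.append("percentage fields do not add up to 100")

        print(errors or "record passes all constraints")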

 

What to do when you have insufficient data?

        Types of insufficient data:

                - Data from only one source

                - Data that keeps updating

                - Outdated data

                - Geographically-limited data


        Ways to address insufficient data:

                - Identify trends with the available data

                - Wait for more data if time allows

                - Talk with stakeholders to adjust your objective

                - Look for new dataset


        Data issues and solutions:

                No data:

                        - Gather data on small scale, and request additional time to collect more data

                        - If no time, perform analysis with proxy data (substituted but similar data)

                Too little data:

                        - Do analysis with proxy data along with actual data

                        - Adjust analysis to align with data you already have

                Wrong data, including data with errors:

                        - If the data is wrong because the requirements were misunderstood, communicate the requirements again

                        - Identify errors and, if possible, correct them at the source by looking for patterns

                        - If you can't correct the data, ignore the wrong data and continue the analysis, provided the sample

                            is large enough and ignoring the data won't cause systematic bias.


Additional Resources:

https://www.investopedia.com/terms/c/central_limit_theorem.asp

https://www.statisticssolutions.com/dissertation-resources/sample-size-calculation-and-sample-size-justification/sample-size-formula/

https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/

https://towardsdatascience.com/is-there-a-difference-between-open-data-and-public-data-6261cd7b5389
