15. Data Analytics - Process Data from Dirty to Clean - Week 1
Definition:
Data integrity // accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.
Data replication // process of storing data in multiple locations
Data transfer // process of copying data from a storage device to memory, or from one computer to another
Data manipulation // process of changing data to make it more organized and easier to read.
Data duplication // when the same data appears in multiple places in the same file.
Population // all possible data values in a certain dataset
Sample size // the number of data points in a sample, a part of a population that is representative of the whole. Cons: sampling introduces uncertainty, so you can never be 100% sure the sample accurately represents the population.
Sampling bias // a sample isn't representative of the population as a whole. Population can be over-represented or under-represented
Random sampling // a way of selecting a sample from a population so that every member of the population has an equal chance of being chosen
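Random sampling can be sketched in a few lines of Python. A minimal illustration using the standard library (the population of 100 IDs and the seed are assumed for the example):

```python
import random

# Simple random sampling: every member of the population has an
# equal chance of being selected.
population = list(range(1, 101))          # hypothetical population of 100 IDs
random.seed(42)                           # fixed seed so the run is reproducible
sample = random.sample(population, k=30)  # sampling without replacement

print(len(sample))       # 30
print(len(set(sample)))  # 30 (no duplicates: sampling is without replacement)
```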
Margin of error // the maximum amount that the sample results are expected to differ from those of the actual population. A smaller margin of error means the sample results are closer to what a study of the entire population would show.
EX: a margin of error of 10% and a survey where 60% of respondents in the sample vote yes
would mean that in the population, between 50% and 70% (60% minus 10%, and 60% plus
10%) of people would vote yes.
Confidence level // how confident you are in survey results. The probability that your sample accurately reflects the greater population. 99% is ideal, 95% is good, and 90% is acceptable.
Confidence interval // the range of possible values within which the population's result is expected to fall at the study's confidence level. This range is the sample result +/- the margin of error.
Statistical significance // determination of whether your result could be due to random chance or not. The greater the significance, the less likely the result is due to chance.
If a test is statistically significant, then the results are deemed to be real and not an error caused
by random chance.
Statistical power // probability of getting meaningful results from a test. A value from 0 to 1, often read as a percentage. By convention, a power of at least 0.8 (80%) is needed for a test's results to be considered statistically significant.
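Statistical power can be estimated by simulation: repeatedly draw samples under an assumed true effect and count how often a significance test detects it. A minimal stdlib-only sketch for comparing two proportions (the proportions, the per-group sample size, and the 5% significance threshold are all assumed for illustration):

```python
import math
import random

def two_prop_z(successes_a, n_a, successes_b, n_b):
    """Two-proportion z-statistic using the pooled proportion."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

def simulated_power(p_a, p_b, n, trials=2000, z_crit=1.96):
    """Estimate power: the share of simulated experiments whose z-test
    detects the true difference at the 5% significance level."""
    hits = 0
    for _ in range(trials):
        sa = sum(random.random() < p_a for _ in range(n))  # draws from group A
        sb = sum(random.random() < p_b for _ in range(n))  # draws from group B
        if abs(two_prop_z(sa, n, sb, n)) > z_crit:
            hits += 1
    return hits / trials

random.seed(0)
# A true 50% vs 60% difference with 400 people per group should be
# detected roughly 80% of the time, i.e. power near the 0.8 convention.
print(simulated_power(0.50, 0.60, n=400))
```

Increasing `n` pushes the estimated power toward 1, which is the simulation counterpart of "increase statistical significance by using a larger sample" below.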
Hypothesis testing // a way to see if a survey or experiment has meaningful results
Proxy data // substitute data that is similar to the data you want. EX: if there is no data for one state, you can use proxy data from another state of similar size and demographics.
Clean data is key for data to have integrity before analysis.
A strong analysis depends on the integrity of the data.
Important to check data used is aligning with business objective.
Clean data + alignment to business objective = accurate conclusions.
Alignment to business objective + additional data cleaning = accurate conclusions.
Alignment to business objective + newly discovered variables + constraints = accurate conclusions.
To calculate margin of error you need:
- Population size
- Sample size
- Confidence level
*Use margin of error calculator
https://goodcalculators.com/margin-of-error-calculator/
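A sketch of what such a calculator computes for a sample proportion, using the standard formula MOE = z * sqrt(p(1 - p) / n). It assumes a large population (no finite-population correction) and hard-codes z-scores for the common confidence levels:

```python
import math

# z-scores for the confidence levels mentioned in these notes
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def margin_of_error(p_hat, n, confidence=0.95):
    """Margin of error for a sample proportion: z * sqrt(p(1 - p) / n)."""
    return Z[confidence] * math.sqrt(p_hat * (1 - p_hat) / n)

# 60% "yes" in a sample of 100 at 95% confidence:
print(round(margin_of_error(0.60, 100), 3))  # 0.096, close to the 10% example above
```

Note how a larger sample shrinks the result, matching the advice below on decreasing the margin of error.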
How to choose sample size:
- Don't choose a sample size of less than 30; 30 is the minimum needed for the central limit theorem to apply. Beyond that, sample sizes vary by business problem.
- A confidence level of 95%+ is good, 90% is acceptable. Increase the confidence level by using a larger sample.
- Decrease margin of error by increasing sample size.
- Increase statistical significance by using larger sample.
*Large samples have higher cost.
*Use Sample Size Calculator to get minimum sample size needed.
http://www.raosoft.com/samplesize.html
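The minimum sample size such a calculator returns can be sketched with the proportion formula plus a finite-population correction. The worst-case p = 0.5 and the default 5% margin of error are assumptions matching the calculator's defaults:

```python
import math

# z-scores for the confidence levels mentioned in these notes
Z = {0.90: 1.645, 0.95: 1.96, 0.99: 2.576}

def min_sample_size(population, moe=0.05, confidence=0.95, p=0.5):
    """Minimum sample size for a target margin of error."""
    z = Z[confidence]
    n0 = (z ** 2) * p * (1 - p) / moe ** 2   # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / population)     # finite-population correction
    return math.ceil(n)

# Population of 100,000, 5% margin of error, 95% confidence:
print(min_sample_size(100_000))  # 383
```

Tightening the margin of error (e.g. `moe=0.03`) raises the required sample size, which is the trade-off behind "large samples have higher cost" above.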
Processes that can compromise data:
- Replication
- Transfers
- Manipulation
Other threats to data integrity:
- Human error
- Viruses
- Malware
- Hacking
- System Failures
Data constraints:
Constraint | Definition | Example
Data type | Values must be of a certain type: date, number, percentage, Boolean, etc. | If the data type is a date, a single number like 30 would fail the constraint and be invalid
Data range | Values must fall between predefined maximum and minimum values | If the data range is 10-20, a value of 30 would fail the constraint and be invalid
Mandatory | Values can’t be left blank or empty | If age is mandatory, that value must be filled in
Unique | Values can’t have a duplicate | Two people can’t have the same mobile phone number within the same service area
Regular expression (regex) patterns | Values must match a prescribed pattern | A phone number must match ###-###-#### (no other characters allowed)
Cross-field validation | Certain conditions for multiple fields must be satisfied | Values are percentages and values from multiple fields must add up to 100%
Primary-key | (Databases only) value must be unique per column | A database table can’t have two rows with the same primary key value. A primary key is an identifier in a database that references a column in which each value is unique. More information about primary and foreign keys is provided later in the program.
Set-membership | (Databases only) values for a column must come from a set of discrete values | Value for a column must be set to Yes, No, or Not Applicable
Foreign-key | (Databases only) values for a column must be unique values coming from a column in another table | In a U.S. taxpayer database, the State column must be a valid state or territory with the set of acceptable values defined in a separate States table
Accuracy | The degree to which the data conforms to the actual entity being measured or described | If values for zip codes are validated by street location, the accuracy of the data goes up.
Completeness | The degree to which the data contains all desired components or measures | If data for personal profiles required hair and eye color, and both are collected, the data is complete.
Consistency | The degree to which the data is repeatable from different points of entry or collection | If a customer has the same address in the sales and repair databases, the data is consistent.
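Several of these constraints can be checked programmatically. A minimal sketch validating one record against the data type, data range, mandatory, regex, and set-membership constraints (the field names and the specific rules, such as the 10-20 age range, are hypothetical examples drawn from the table):

```python
import re

def validate(record):
    """Return a list of constraint violations for one record (dict)."""
    errors = []
    # Mandatory + data type: age must be present and an integer
    if record.get("age") is None:
        errors.append("age is mandatory")
    elif not isinstance(record["age"], int):
        errors.append("age must be an integer")
    # Data range: age must fall between 10 and 20
    elif not 10 <= record["age"] <= 20:
        errors.append("age out of range 10-20")
    # Regex pattern: phone must match ###-###-####
    if not re.fullmatch(r"\d{3}-\d{3}-\d{4}", record.get("phone", "")):
        errors.append("phone must match ###-###-####")
    # Set-membership: status must come from a discrete set of values
    if record.get("status") not in {"Yes", "No", "Not Applicable"}:
        errors.append("status must be Yes, No, or Not Applicable")
    return errors

print(validate({"age": 30, "phone": "555-867-5309", "status": "Maybe"}))
# ['age out of range 10-20', 'status must be Yes, No, or Not Applicable']
```

In practice, uniqueness, cross-field, primary-key, and foreign-key checks need the whole dataset or database, not a single record, so they are usually enforced at the table level.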
Additional Resources:
https://www.investopedia.com/terms/c/central_limit_theorem.asp
https://www.statisticssolutions.com/dissertation-resources/sample-size-calculation-and-sample-size-justification/sample-size-formula/
https://machinelearningmastery.com/statistical-power-and-power-analysis-in-python/
https://towardsdatascience.com/is-there-a-difference-between-open-data-and-public-data-6261cd7b5389