11. Data Analytics - Prepare Data for Exploration - Week 2
Definitions:
Bias // a preference in favor of or against a person, group of people, or thing
Data bias // a type of error that systematically skews results in a certain direction
Sampling bias // when a sample isn't representative of the population as a whole
Unbiased sampling // when a sample is representative of the population being measured
Ethics // well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues
Data ethics // well-founded standards of right and wrong that dictate how data is collected, shared, and used
GDPR // General Data Protection Regulation of the European Union
Data anonymization // process of protecting people's private or sensitive data by eliminating personally identifiable information (PII). Common techniques: blanking, hashing, or masking the data, often by using codes or altered text.
De-identification // process used to wipe data clean of all personally identifying information
Data interoperability // ability of data systems and services to openly connect and share data. It is widely used in the healthcare industry.
What data should be anonymized?
- healthcare and financial data
- phone numbers, names, license plates and license numbers, Social Security numbers (SSN), IP addresses, medical records, email addresses, photos, etc.
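The three techniques named in the definition (blanking, hashing, masking) can be sketched in a few lines of Python. This is only an illustration: the helper names `blank`, `mask`, and `hash_value`, the sample record, and the salt string are all made up for this example. Note that a hash of a guessable value (such as an email address) can sometimes be reversed by brute force, so hashing alone may not be enough for real data.

```python
import hashlib


def blank(value: str) -> str:
    """Blanking: remove the value entirely."""
    return ""


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Masking: hide everything except the last `visible` characters."""
    if visible <= 0:
        return fill * len(value)
    hidden = max(len(value) - visible, 0)
    return fill * hidden + value[hidden:]


def hash_value(value: str, salt: str) -> str:
    """Hashing: replace the value with a salted SHA-256 hex digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


# Hypothetical record containing PII (not real data).
record = {"name": "Jane Doe", "ssn": "123-45-6789", "email": "jane@example.com"}

anonymized = {
    "name": blank(record["name"]),
    "ssn": mask(record["ssn"]),
    "email": hash_value(record["email"], salt="hypothetical-salt"),
}

print(anonymized["ssn"])  # *******6789
```

The same email always hashes to the same digest, so rows can still be grouped or joined on the hashed column without exposing the original address.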
Aspects of data ethics
- ownership
Individuals own the raw data they provide, and they have primary control over its usage,
how it's processed, and how it's shared.
- transaction transparency
All data-processing activities and algorithms should be completely explainable to and understood
by the individual who provides their data.
- consent
An individual's right to know explicit details about how and why their data will be used
before agreeing to provide it.
- currency
Individuals should be aware of financial transactions resulting from the use of their personal
data and the scale of these transactions.
- privacy
Preserving a data subject's information and activity any time a data transaction occurs.
  - Protection from unauthorized access to our private data.
  - Freedom from inappropriate use of our data.
  - The right to inspect, update, or correct our data.
  - Ability to give consent to use our data.
  - Legal right to access the data.
- openness (or open data)
Free access, usage, and sharing of data.
  - Availability and access // the data is available and accessible
  - Reuse and redistribution // the data can be reused and redistributed
  - Universal participation // no restrictions on who can use the data
Identifying good data:
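R = reliable. Accurate and not biased.
O = original. First-party data, or traceable to the original source.
C = comprehensive. Contains all the information needed to answer the question or find the solution.
C = current. The usefulness of data decreases as time passes.
C = cited. Citing the source makes the information credible.
* Good data ROCCCs!
Identifying bad data:
The opposite of ROCCC: unreliable, not original, incomplete, out of date, or not cited.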
Every good solution is found by avoiding bad data.
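One way to make the ROCCC check concrete is a small checklist object. This is just a sketch: the class name `RocccCheck`, its methods, and the example answers are invented for illustration, and deciding each yes/no answer is still a human judgment.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class RocccCheck:
    """One yes/no answer per ROCCC criterion for a data source."""
    reliable: bool
    original: bool
    comprehensive: bool
    current: bool
    cited: bool

    def failed(self) -> List[str]:
        """Return the names of the criteria the source does not meet."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_good(self) -> bool:
        """Good data meets all five criteria."""
        return not self.failed()


# Hypothetical assessment of a survey dataset.
survey = RocccCheck(
    reliable=True, original=True, comprehensive=False, current=True, cited=False
)

print(survey.failed())   # ['comprehensive', 'cited']
print(survey.is_good())  # False
```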
Types of data bias:
- sampling bias
When a sample isn't representative of the population as a whole
- observer bias (experimenter bias/research bias)
The tendency for different people to observe things differently
- interpretation bias
The tendency to always interpret ambiguous situations in a positive or negative way
- confirmation bias
The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Bias occurs:
- during data collection
- during planning, when the plan is not inclusive
- subconsciously or consciously
Solution:
- choose the sample randomly from the population, so that every member has an equal chance of being selected
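The difference between a biased sample and a random one can be shown with a toy population. This is a minimal sketch with made-up numbers: the population of 1,000 people, the 30% share of group "A", and the seed are all assumptions for illustration.

```python
import random

# Fixed seed so the example is repeatable (the seed value is arbitrary).
rng = random.Random(42)

# Hypothetical population: 1,000 people, 30% of them in group "A".
# The list happens to be sorted by group, as data often is.
population = ["A"] * 300 + ["B"] * 700


def share_of_a(sample):
    """Fraction of the sample that belongs to group "A"."""
    return sample.count("A") / len(sample)


# Sampling bias: taking the first 100 rows picks only group "A".
biased_sample = population[:100]

# Unbiased sampling: every member has the same chance of being selected.
random_sample = rng.sample(population, k=100)

print(share_of_a(population))     # 0.3
print(share_of_a(biased_sample))  # 1.0
print(share_of_a(random_sample))  # close to the population share
```

The biased sample suggests everyone is in group "A", while the random sample lands near the true share of 30%.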
Process for ensuring data integrity:
- Analyze data for bias and credibility
- Distinguish good data from bad data
- Consider data ethics, privacy, and access
Additional Resources:
https://www.data.gov/
https://www.census.gov/data.html
https://www.opendatanetwork.com/
https://cloud.google.com/public-datasets
https://datasetsearch.research.google.com/