[2023 kcdc] data leakage – why your ML model knows too much

Posted on June 23, 2023 by Jeanne Boyarsky

Speaker: Leah Berg

For more, see the table of contents.

Notes

Night job – https://www.datasciencerebalanced.com

Data Leakage

Also known as leakage or target leakage
Different meaning for information security (data leaking to outside organization)
Can be difficult to spot
Training data includes info about test.
Model trained on info not available in production

How models learn

Split data into training data and test data.
Test data – data the model has never seen before; used to make sure the model gets it right
Can also have an optional validation set
Randomly pick whether data points are training or test data. – Called random train/test split
More training data than test data

Don’t include data from the future

Using a random split of time series data doesn’t work because model has learned about future data.
Better to use a sliding window. Use first few months to predict next month. Then add that next value and predict one after. And keep going. Adding up error gives you accuracy of model.
This works because model only knows about data before one asked to predict.
Create a timeline of when events happen. That way you make sure you aren’t using data from after the prediction
Don’t always know where/when data was created. Important to understand business process
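A minimal sketch of the "predict the next month from the months before it" idea, assuming scikit-learn's TimeSeriesSplit is an acceptable stand-in for what the speaker described. The twelve monthly values are made up for illustration and were not part of the talk.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve hypothetical monthly observations, already sorted oldest to newest.
months = np.arange(12)

# Each split trains on all earlier months and tests on the single month after them.
splitter = TimeSeriesSplit(n_splits=5, test_size=1)

splits = []
for train_idx, test_idx in splitter.split(months):
    # Every training index is strictly earlier than the test index.
    assert train_idx.max() < test_idx.min()
    splits.append((train_idx.tolist(), test_idx.tolist()))
    print(f"train on months {train_idx.tolist()} -> predict month {test_idx.tolist()}")
```

By default the training window grows with each split. Passing max_train_size would keep it at a fixed width instead.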

Don’t randomly split groups

With a random split, training has some data from the same group (ex: the same student) you are then predicting
Problem when a new student shows up: the prediction will be bad
scikit-learn has GroupShuffleSplit() to get full group in same set – testing or training
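A small sketch of GroupShuffleSplit keeping each student entirely in one set. The student IDs and feature values are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 10 rows belonging to 5 students, two rows per student.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
student_ids = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])

# test_size applies to groups, so 0.2 of 5 students puts one student in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=student_ids))

train_students = set(student_ids[train_idx])
test_students = set(student_ids[test_idx])
print("train students:", sorted(train_students))
print("test students:", sorted(test_students))
```

No student appears on both sides, so the test score reflects how the model does on a student it has never seen.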

Don’t forget your data is a snapshot

In school, have pristine data set.
In real world, data is always changing.
Could tell model about data that occurred after prediction. Again think about data on timeline

Don’t randomly split data when retraining

Want to use the same training/test data on production and challenger models to see which is better.
If you split again at random, one model has already seen during training some of the data points you are testing on, so you don’t know if it is really better.
Challenger model can get more data that wasn’t available originally. OK to split the new data into test/train as long as the original data is split the same way.
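One possible way to keep the original rows in the same set when retraining is to derive the assignment from a stable record ID instead of a fresh random draw. This is an assumption on my part, not something shown in the talk; assign_split and the customer IDs are hypothetical.

```python
import hashlib

def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Map a record ID to 'train' or 'test' the same way on every run."""
    digest = hashlib.sha256(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "test" if bucket < test_fraction * 100 else "train"

# Hypothetical IDs: the original data plus rows that arrived later.
original_ids = [f"customer-{i}" for i in range(1000)]
new_ids = [f"customer-{i}" for i in range(1000, 1200)]

before = {rid: assign_split(rid) for rid in original_ids}
after = {rid: assign_split(rid) for rid in original_ids + new_ids}

# Original rows keep their assignment even though new rows were added.
unchanged = all(before[rid] == after[rid] for rid in original_ids)
print("original assignments unchanged:", unchanged)
```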

Split data immediately

Risky to rescale before the split because the test data then influences the scaling. Min/max can differ from what you get if you split first
Run normalization on different sets of data
Before the split, do analysis with the business and exploratory data analysis. Split the data before you start modeling
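A sketch of splitting first and scaling second, assuming min/max scaling as in the bullet above. The random data is only there to make the example runnable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix and labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Split before any rescaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the training min/max
```

Test values can land slightly outside 0 to 1 here. That is expected, since the scaler never saw them.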

Use Cross Validation

KFold Validation – split training data into K parts
ex: 3 fold validation – two parts stay as training and one is validation. The test data remains as test data and is kept separate for final evaluation.
The validation set is for an initial test.
Gives more options to train model
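A sketch of the 3 fold example, assuming scikit-learn's KFold. The test rows are held out first and never enter the folds. The 15 rows are made up.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Fifteen hypothetical rows.
X = np.arange(30).reshape(15, 2)
y = np.arange(15)

# Keep 3 rows aside as the final test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3, random_state=0
)

# Split the remaining 12 training rows into 3 parts.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []
for fit_idx, val_idx in kfold.split(X_train):
    # Two parts are used for fitting, one part for validation.
    fold_sizes.append((len(fit_idx), len(val_idx)))

print(fold_sizes)
```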

Be Skeptical of High Performance

If validation much higher than train/test, suspicious.
If train/test/validation sets are all high/the same, suspicious.

Use scikit-learn pipeline

Helps avoid leaking test data into training data
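A minimal sketch of why a pipeline helps, assuming a scaler followed by logistic regression (my choice of steps, not the speaker's). Inside cross validation the scaler is refit on the training part of each fold only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data where the label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scaling and the model travel together as one estimator.
model = make_pipeline(StandardScaler(), LogisticRegression())

# For each fold, fit() runs on the training part and score() on the held-out part.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())
```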

Check for features correlated with target

If another attribute has a high match with what you are looking for, make sure you are not mixing up correlation and causation.
Also, avoid timeline errors from reverse causation. Ex: the thing you are looking for causes something else
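A rough sketch of a correlation check, using NumPy. The feature names, the data, and the 0.9 cutoff are all assumptions for illustration. A flagged feature still needs a human to check where it sits on the timeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Hypothetical features: one unrelated, one recorded after the outcome happened.
features = {
    "age": rng.normal(40, 10, size=n),
    "refund_issued": target + rng.normal(0, 0.05, size=n),
}

# Absolute Pearson correlation of each feature with the target.
correlations = {
    name: abs(np.corrcoef(values, target)[0, 1])
    for name, values in features.items()
}

# 0.9 is an arbitrary threshold chosen for this example.
suspicious = [name for name, r in correlations.items() if r > 0.9]
print(suspicious)
```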

My take

Great talk. Almost all of this was new to me. It was understandable and I learned a lot.

[2023 kcdc] the elephant in your data set – avoid bias in machine learning

Posted on June 22, 2023 by Jeanne Boyarsky

Speaker: Michelle Frost

For more, see the table of contents.

Notes

Intersectionality wheel of privilege. Many spokes, ranging from power to erased to marginalized. Used the version posted here
Bias – inclination or prejudice for or against one person or group
ML Bias – systematic error in the model itself due to assumptions
Sometimes bias is necessary – inductive bias – assumptions combined with training examples to classify
Models with high bias oversimplify the model
Each stage has potential harmful bias
Bias feeds back into model
In ML, when something looks too good to be true, it probably is

Points of bias

Historical – prejudice in world as it exists today. Gave example from ChatGPT where assumed a nurse was female even when replaced pronouns. Full example here
Representation bias – Sample under-represents part of population. Can’t make effective predictions for that group. Article describing. “Solved” by dropping gorillas as a label
Measurement bias – using a proxy to represent a construct. Problem if oversimplifying or accuracy varies across groups. Compas (Correctional Offender Management Profiling for Alternative Sanctions) example. Data measures policing not just the offender.
Aggregation bias – one size fits all model assumes mapping inputs to labels is consistent. For example, could mean something different across cultures. Such as LSD being Lake Shore Drive in Chicago and not a drug. Or racial differences for HbA1c
Learning bias – modeling choice may prioritize one objective which damages another. Such as Amazon’s recruiting tool discriminating against women
Evaluation bias – benchmark data does not represent the population. Might make sense in some scenarios. Project Gender Shades analyzed differences in different tools.
Deployment bias – model intended to solve one problem, but used a different way. Make a hook for tuna and use it on a shark. Child abuse protection tool fails poor families.

Simpson’s paradox

Other attributes are a proxy for the thing you are leaving out
Association disappears, reappears or reverses when you divide the population

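A small numeric illustration of the reversal. The counts are illustrative and not from the talk: option A wins inside each group, yet option B wins once the groups are pooled.

```python
# (successes, total) per group and option; illustrative counts only.
data = {
    "small": {"A": (81, 87), "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Which option has the higher success rate within each group?
per_group_winner = {
    group: max(arms, key=lambda arm: rate(*arms[arm]))
    for group, arms in data.items()
}

# Pool the groups and compare again.
totals = {
    arm: (
        sum(data[group][arm][0] for group in data),
        sum(data[group][arm][1] for group in data),
    )
    for arm in ("A", "B")
}
overall_winner = max(totals, key=lambda arm: rate(*totals[arm]))

print(per_group_winner, overall_winner)
```

The flip happens because the two options were applied to very different mixes of the two groups.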
Terms

Protected class – category where bias is relevant
Sensitive characteristics – algorithmic decisions where bias could be factor
Disparate treatment
Disparate outcome/impact
Fairness – area of research to ensure biases and model inaccuracies do not lead to models that treat individuals unfavorably due to sensitive characteristics.

Metrics

Demographic parity – decisions/outcomes independent of protected attribute. Does not protect against all unfairness
Equal odds – decision independent of protected attributes given the actual outcome. True and false positive rates must be equal across groups
Equal opportunity – like equal odds but only measures fairness for true positive rates
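A toy sketch of the three metrics, computed by hand with NumPy. The labels, predictions, and the two groups "a" and "b" are invented. They are chosen so that demographic parity holds while the true and false positive rates differ.

```python
import numpy as np

# Invented labels and predictions for two groups of four people each.
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])
group = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])

def rates(mask):
    yt, yp = y_true[mask], y_pred[mask]
    return {
        "selection_rate": float(yp.mean()),  # compared for demographic parity
        "tpr": float(yp[yt == 1].mean()),    # compared for equal opportunity
        "fpr": float(yp[yt == 0].mean()),    # with tpr, compared for equal odds
    }

by_group = {g: rates(group == g) for g in ("a", "b")}
print(by_group)
```

Both groups get a positive decision half the time, yet group "a" has a lower true positive rate and a higher false positive rate, which is why demographic parity alone is not enough.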

Demo

A popular (bad) data set is “adult data set”. I think it is this one.
Not balanced by gender, race, country

Book recommendations

Weapons of Math Destruction
Biased
The Alignment Problem
Invisible Women
The Big Nine
Automating Inequality

My take

The types of bias and examples were interesting. Good end to the day. The demo graphs proved the point about biased data nicely.

[2023 kcdc] data science: zero to hero

Posted on June 21, 2023 by Jeanne Boyarsky

Speaker: Gary Short

Twitter: @garyshort

Repo for presentation/samples: https://github.com/garyshort/kcdc2023

For more, see the table of contents.

Data science overview/rules

Applied data science – solving business problems
Curiosity is most important
The universe does random stuff so you haven’t discovered anything until you prove you’ve discovered something
Only qualitative and quantitative data – people lie, Can’t trust what you ask
Can only do math with numbers. Some things will pretend to be numbers when they are not. Also, can’t add different things (dollars vs kilograms)
If you can’t explain it to a six year old, you don’t really understand it
Only have to be more than 51% accurate to do better than guessing
True random data has some clusters. The cluster will not last forever. Gambler’s fallacy: 27 blacks in a row doesn’t mean a red is due.
If it’s not in production, it doesn’t exist. Can’t just be on your laptop. Most data scientists need to give to someone else to get it to prod. Cultural difference between data scientist and person who is building/deploying.
% chance of hypothesis being right or wrong doesn’t have to sum up to 100%. ex: grass is wet. Could be rain or a dog peeing or something else

Types of data

Structured

Relational data
Get connection, create cursor, fill cursor, close connection
Schema is important on data write.

Semi structured

ex: JSON/MongoDB.
Get connection, name collection, fill cursor, close connection
Schema important when read data

Unstructured

Blob (binary large object)
Stored in pages/blocks
Access via URL

Graph

Degrees of separation – can you deliver a message directly
People in room now more closely connected because in this session (and would stay so if shared contact info)
Wide network effect
Nodes tend to be nouns
Edges tend to be verbs. Can be unidirectional or bidirectional
Get connection, state query, fill cursor, close connection

AI/ML works on data types

Categorical – segregate data by category where category is not important (ex: blue eyes)
Ordinal – order is important but distance between is not important (ex: position in a race)
Numeric – order is important and distance between values is the same (ex: counting)
Ratio – numeric but with positive numbers

Can only do math with numeric and ratio types. A survey on a scale of 1-5 (Likert scale) is ordinal, not numeric/ratio. Can’t do an average. This is categorical data (ex: very happy, pissed off). Can do math with counts of categorical data but not single items.

Exploratory Data analysis

Need to understand the variables. Ex: is it really a number
Handle missing values – depends on scenario. Ex: use mean or median (if not looking for that particular thing), delete row with incomplete data
Outlier detection – sometimes genuinely an outlier (ex: someone who is 8 feet tall), sometimes it is the important piece of data (ex: which exits people use in a fire; one person went the other way and you want to know why). Need to determine why it is an outlier and whether you care, so you don’t delete data you need
Univariate analysis – ex: histogram for categorical data
Bivariate analysis – correlated data; could be a hidden variable. Don’t need both of them since one predicts the other. Want minimal variables in the model so choose the one that brings in the most info.

Feature Selection

Preprocess the data
Normalize data – units have to be the same. Using variance doesn’t help because the unit is now the original unit squared. Can use Z-score (subtract the mean, divide by the standard deviation) so everything is on a common scale with mean 0 and standard deviation 1
Encode the categories – make so can do math
Booleans are numbers (0 and 1)
Word vector – can use math to represent a word. Complicated. Ok to have to look up every time.
Bi/multivariate analysis – high correlation means redundant info
Feature importance – check coefficients from regressions and scores from gradient boosting
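A short sketch of two of the preprocessing steps above: a Z-score and a simple one-hot encoding of a category. The heights and eye colors are made-up values, and one-hot is just one common way to encode categories (the talk did not name a specific one).

```python
import numpy as np

# Z-score: subtract the mean, divide by the standard deviation.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
z = (heights_cm - heights_cm.mean()) / heights_cm.std()

# One-hot encoding: one 0/1 column per category so math can be done on it.
eye_colors = ["blue", "brown", "green", "brown"]
categories = sorted(set(eye_colors))
one_hot = np.array(
    [[int(color == category) for category in categories] for color in eye_colors]
)

print(z)
print(categories)
print(one_hot)
```

After the Z-score the values no longer carry centimeters as a unit, which is what lets them sit next to features measured in other units.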

Model Selection

People have a favorite model
Use one or more models. See which gives best result before making any changes to the model.
Good to use a linear and a non-linear one. Normally the linear model is enough because you are normally dealing with people (directly or indirectly). Linear equations work for a normal distribution.
Make sure to find global minimum, not local/current one
Compete with yourself. Try to have your second best model beat your current best model. Once something in prod, start again

Train/test split

80/20 split
80% data for training
20% data for testing
Model never sees the test data during training because it can’t grade its own homework
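A sketch of an 80/20 split, assuming scikit-learn's train_test_split. The 100 rows are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# One hundred hypothetical rows with two features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# 80% for training, 20% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# No row lands in both sets.
train_rows = {tuple(row) for row in X_train}
test_rows = {tuple(row) for row in X_test}
print(len(X_train), len(X_test), train_rows.isdisjoint(test_rows))
```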

Model evaluation

Outcome = model + error
Error is difference between predicted and observed values.
Sample of population can be model. Get error because of sampling bias

Hyper Parameter Tuning

Every model has parameters that govern how it works.
Hyper param tuning is fiddling with these
There will be an optimal value for each of these parameters for your particular use case
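A sketch of hyperparameter tuning as a grid search, assuming a decision tree and its max_depth setting as the thing being fiddled with. The model, the candidate values, and the synthetic data are my choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try each candidate depth with 5 fold cross validation and keep the best one.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5]},
    cv=5,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
print(best_depth, search.best_score_)
```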

Model Validation

Need to make sure model doesn’t work by chance
K-Fold Cross Validation – after doing the 80/20 split, can split the data a different way, feed it back in and evaluate again
Stratified Cross Validation – same as K-Fold but for unbalanced classes; each fold keeps the class proportions
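A sketch of stratified folds on an unbalanced label, assuming scikit-learn's StratifiedKFold. The 16 to 4 class split is made up.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Unbalanced labels: 16 negatives and 4 positives.
y = np.array([0] * 16 + [1] * 4)
X = np.zeros((20, 1))  # the features do not matter for this illustration

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)

# Count how many positives land in each validation fold.
positives_per_fold = [int(y[val_idx].sum()) for _, val_idx in skf.split(X, y)]
print(positives_per_fold)
```

Every fold gets one positive, so no fold is left without the rare class.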

Bayesian inference in Real Life

P(h|e) = P(e|h) * P(h) / P(e)
In English: current belief = prior belief updated by new evidence
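A worked example of the formula, reusing the wet grass idea from earlier in these notes. All three input probabilities are made up.

```python
def posterior(prior, likelihood, evidence):
    """P(h|e) = P(e|h) * P(h) / P(e)"""
    return likelihood * prior / evidence

# Hypothetical numbers: h = it rained, e = the grass is wet.
p_rain = 0.2               # P(h): belief before looking at the grass
p_wet_given_rain = 0.9     # P(e|h)
p_wet_given_no_rain = 0.1  # wet for another reason (sprinkler, dog, ...)

# P(e): total probability of wet grass across both cases.
p_wet = p_wet_given_rain * p_rain + p_wet_given_no_rain * (1 - p_rain)

p_rain_given_wet = posterior(p_rain, p_wet_given_rain, p_wet)
print(round(p_rain_given_wet, 3))
```

Seeing wet grass moves the belief in rain from 0.2 to roughly 0.69.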

Estimation

Important to be able to estimate values when have no data
Dumb questions like “how many piano tuners are there in Chicago” was testing this. So few people could do it that pulled question. [I suspect the ridicule and people memorizing the answer was a factor too]
Easier to estimate a range than an actual value
Pick a minimum that it couldn’t possibly be below. You’d be surprised and skeptical if less than that.
Pick a maximum that it couldn’t possibly be above.
Pick a value that splits the range in two so that being above or below has equal probability. Call this the median. Resist temptation to pick the mean.
Repeat for the range from minimum to median. Call this Q1
Then repeat finding the median to maximum to get Q3.
This gets you a five point description of a distribution
Use sampling to get mean of distribution
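A rough sketch of turning a five point estimate into a mean by sampling. It assumes each quarter of the range is equally likely and that values are spread evenly inside a quarter. Both are simplifications of mine, and the five numbers are hypothetical.

```python
import random

# Hypothetical five point estimate: min, Q1, median, Q3, max.
five_points = [50, 100, 150, 250, 600]

def sample_once(rng):
    # Pick one of the four quarters with equal probability,
    # then pick a value uniformly inside it (an assumption).
    i = rng.randrange(4)
    return rng.uniform(five_points[i], five_points[i + 1])

rng = random.Random(0)
samples = [sample_once(rng) for _ in range(100_000)]
estimated_mean = sum(samples) / len(samples)
print(round(estimated_mean, 1))
```

With a long upper tail like this one, the sampled mean comes out well above the median, which is why the notes say not to pick the mean when asked for the middle value.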

Lab part

The lab was to predict something you want to predict and make a model and/or predict a probability. Can do individually or in groups. He also gave the option to leave. I chose to leave because there was a little over an hour left when he finished explaining the lab. I need to go over the material for my own workshop so I am doing that instead of the lab.

My take

This was a good intro and Gary is a good, engaging speaker. I learned (and re-learned) a bunch of stuff. Both concepts and terms. Having a bunch of rules and getting into them made it fun. (ex: math needs numbers). I like that the concept part was longer (except for the lack of a break), but it would have been better if it was advertised that way in the intro.

I disagree with Gary’s philosophy on not having a bathroom break. He started by saying there would be 60-90 minutes of lecture and then a lab. [wound up being just over 2.5 hours] And that we are all adults and can go to the bathroom whenever. Someone asked at the 90 minute mark if there would be a bathroom break and he repeated the all adults thing, expanding that you’ll catch up and the slides will be online later. He also said people feel compelled to hold it until break or go when told it is break. However, the tradeoff is that you don’t want to go to the bathroom lest you miss something that will wind up being important during the session. It’s super frustrating to miss stuff and then struggle to understand later. It may be that this workshop isn’t cumulative but there’s no way to know. Also, by not having a break, you aren’t giving people’s brains a break. It’s not just about the bathroom.

Gary stated he puts the materials online after so people don’t read it during the session. That I agree with!
