[2023 kcdc] data leakage – why your ML model knows too much

Speaker: Leah Berg

For more, see the table of contents.


Notes

Data Leakage

  • Also known as leakage or target leakage
  • Different meaning than in information security (where it means data leaking outside the organization)
  • Can be difficult to spot
  • Occurs when the training data includes information about the test data.
  • Or when the model is trained on information that isn’t available in production

How models learn

  • Split data into training data and test data.
  • Test data – data the model has never seen before; used to make sure the model gets it right
  • Can also have an optional validation set
  • Randomly pick whether each data point goes into the training or test set – called a random train/test split (see the sketch after this list)
  • Typically use more training data than test data
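A minimal sketch of that random split using scikit-learn’s train_test_split; the iris data, 80/20 ratio, and random_state are my illustrative choices, not from the talk:

    # Hold back 20% of the rows as test data the model never sees during training.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(len(X_train), len(X_test))  # more training rows than test rows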

Don’t include data from the future

  • Using a random split of time series data doesn’t work because the model has already learned about future data.
  • Better to use a sliding window. Use the first few months to predict the next month. Then add that value and predict the one after, and keep going. Adding up the errors gives you the accuracy of the model (see the sketch after this list).
  • This works because the model only knows about data from before the point it is asked to predict.
  • Create a timeline for when events happen. That way you make sure you aren’t using data from after the prediction point.
  • You don’t always know where/when data was created, so it is important to understand the business process
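One way to sketch the sliding-window idea is scikit-learn’s TimeSeriesSplit; the 24 “months” and the number of splits below are made up for illustration:

    # Each fold trains only on data that comes before the data it predicts.
    import numpy as np
    from sklearn.model_selection import TimeSeriesSplit

    X = np.arange(24).reshape(-1, 1)  # pretend: 24 months of observations
    tscv = TimeSeriesSplit(n_splits=5)

    for train_idx, test_idx in tscv.split(X):
        # The training indices always end before the test indices begin.
        print(f"train through month {train_idx[-1]}, predict months {list(test_idx)}")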

Don’t randomly split groups

  • With a random split, some data from a group (ex: a student) ends up in training while the rest lands in test, so the model has already seen the group it is predicting for
  • The problem shows up when a new student arrives: the prediction for them will be worse than the test results suggested
  • scikit-learn has GroupShuffleSplit() to keep a full group in the same set – testing or training (see the sketch after this list)
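A small sketch of GroupShuffleSplit keeping each group’s rows together; the students and tiny arrays are hypothetical:

    # All rows for a group (e.g., one student) land in training or test, never both.
    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    X = np.arange(12).reshape(-1, 1)
    y = np.zeros(12)
    groups = np.repeat(["amy", "ben", "cal", "dee"], 3)  # hypothetical students

    gss = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
    train_idx, test_idx = next(gss.split(X, y, groups=groups))
    print("test groups:", set(groups[test_idx]))  # a whole student, not a slice of one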

Don’t forget your data is a snapshot

  • In school, you have a pristine data set.
  • In the real world, data is always changing.
  • You could accidentally tell the model about data that occurred after the prediction. Again, think about the data on a timeline

Don’t randomly split data when retraining

  • Want to use the same training/test data on the production and challenger models to see which is better.
  • If one model has already seen data points during training that you are now testing on, you don’t know if it is really better.
  • The challenger model can get more data that wasn’t available originally. It is ok to split the new data into test/train as long as the original data stays split the same way (see the sketch after this list).
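One way to honor that, sketched with placeholder arrays (the shapes and random values are made up): freeze the original split and only split the newly collected data before appending it.

    # Keep the original train/test assignment fixed; split only the new data,
    # then append each piece to the matching original set.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Original data, split once and then frozen (placeholder values).
    X_old = rng.random((200, 3))
    y_old = rng.integers(0, 2, size=200)
    X_train_old, X_test_old, y_train_old, y_test_old = train_test_split(
        X_old, y_old, test_size=0.2, random_state=0
    )

    # Data gathered since the last training run gets its own split.
    X_new = rng.random((100, 3))
    y_new = rng.integers(0, 2, size=100)
    X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(
        X_new, y_new, test_size=0.2, random_state=0
    )

    X_train = np.vstack([X_train_old, X_train_new])
    X_test = np.vstack([X_test_old, X_test_new])
    y_train = np.concatenate([y_train_old, y_train_new])
    y_test = np.concatenate([y_test_old, y_test_new])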

Split data immediately

  • Risky to rescale before the split because the statistics (ex: min/max) come from the full data set, including the test data; the training set’s min/max can be different if computed after the split
  • Run the normalization on the split data sets, not on the combined data (see the sketch after this list)
  • Before the split, do the analysis with the business side and the exploratory data analysis. Split the data before you start modeling
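A sketch of the safe ordering with random placeholder data: split first, then fit the scaler on the training set only and reuse its statistics on the test set.

    # Fitting MinMaxScaler on the full data set would let the test set's
    # min/max leak into training; fit on the training portion instead.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    rng = np.random.default_rng(1)
    X = rng.random((100, 4))          # placeholder features
    y = rng.integers(0, 2, size=100)  # placeholder labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=1
    )

    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training data
    X_test_scaled = scaler.transform(X_test)        # reuse those stats; no refitting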

Use Cross Validation

  • KFold Validation – split training data into K parts
  • ex: 3-fold validation – two parts stay as training and one is used for validation. The test data remains as test data and is kept separate for the final evaluation.
  • The validation set is for an initial test.
  • Gives you more options for training the model (see the sketch after this list)
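A minimal 3-fold example on the training portion only; the iris data and logistic regression model are illustrative stand-ins:

    # The held-out test set is never touched until the final evaluation.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score, train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    model = LogisticRegression(max_iter=1000)
    folds = KFold(n_splits=3, shuffle=True, random_state=0)
    print(cross_val_score(model, X_train, y_train, cv=folds))  # one score per fold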

Be Skeptical of High Performance

  • If the validation score is much higher than the train/test scores, be suspicious.
  • If the train/test/validation scores are all high or all the same, be suspicious.

Use scikit-learn pipeline

  • Helps avoid leaking test data into training data (see the sketch below)
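A sketch of why the pipeline helps (the iris data and model choice are illustrative): the scaler is refit inside each cross-validation fold, so validation rows never influence the scaling.

    # Bundling preprocessing and the model means cross_val_score refits the
    # scaler on each fold's training portion only.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_iris(return_X_y=True)

    pipe = Pipeline([
        ("scale", MinMaxScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, cv=5))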

Check for features correlated with target

  • If another attribute matches what you are looking for suspiciously well, make sure you aren’t mixing up correlation and causation (see the sketch after this list)
  • Also watch the timeline for reverse causation. Ex: the thing you are trying to predict causes the other attribute rather than the other way around
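A quick sketch of the correlation check with pandas; the column names and values here are made up to show the idea:

    # Features whose correlation with the target is suspiciously high may be
    # leaking the answer or may be an effect of the target (reverse causation).
    import pandas as pd

    df = pd.DataFrame({
        "age":              [25, 32, 47, 51, 62, 23, 38, 44],
        "days_in_hospital": [1, 2, 5, 6, 9, 1, 3, 4],  # hypothetical leaky feature
        "readmitted":       [0, 0, 1, 1, 1, 0, 0, 1],  # target
    })

    correlations = df.corr()["readmitted"].drop("readmitted")
    print(correlations.sort_values(ascending=False))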

My take

Great talk. Almost all of this was new to me. It was understandable and I learned a lot.

[2023 kcdc] 10 things about postman everyone should know

Speaker: Pooja Mistry

Twitter: @poojamakes

Public workspace- https://www.postman.com/devrel/workspace/2023-10-postman-features-everyone-should-know/overview

For more, see the table of contents.


Notes

  • Moving towards an API first world
  • Postman started in 2012 with a Chrome extension. Evolved into full API platform
  • More than just sending requests – ex: collections, documentation, servers
  • Web and app versions
  • Newman – CLI for postman
  • Collections, env vars, queries, etc. have their own id
  • Different life cycle for the two personas: producer of APIs (define, design, develop, test, secure, deploy, observe, distribute) and consumer of APIs (discover, evaluate, integrate, test, deploy, observe)
  • Test tab to test the API. Example – pm.test(“assert text”, function () { pm.response.to.have.status(200); }) – the callback holds the assertions
  • Protocols – GraphQL, WebSocket, gRPC, Socket.IO, etc
  • Scripts – can run before and after requests (including GraphQL ones)
  • Pre-request script – ex: debugging
  • Can pass in $randomXXX dynamic variables of various types in your Postman call

Postman API

  • Sign in and fork the workspace if you want to play with the public workspace for this talk
  • Postman has its own API. ex: CRUD for collections, envs, etc
  • Some clients use a collection as the deliverable and then get metrics on it.

Postman echo

  • Sends back whatever you send in (see the sketch after this list).
  • When you pass in GET params, it sends back JSON with an args map containing your params.
  • A POST sends the text back as the data key in the JSON.
  • Always echoes the headers as well
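The echo behavior is easy to see outside Postman too; here is a rough sketch with Python’s requests library, assuming the public postman-echo.com endpoints behave as described above:

    import requests

    # GET params come back under the "args" key.
    r = requests.get("https://postman-echo.com/get", params={"city": "KC"})
    print(r.json()["args"])       # {'city': 'KC'}

    # A POST body comes back under the "data" key; headers are echoed too.
    r = requests.post(
        "https://postman-echo.com/post",
        data="hello kcdc",
        headers={"Content-Type": "text/plain"},
    )
    print(r.json()["data"])       # 'hello kcdc'
    print(r.json()["headers"])    # the request headers, echoed back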

Postman visualizer

  • Can build UI in postman
  • Visualize tab on the result. Put pm.visualizer.set(template, { response: pm.response.json() }) in the test tab.
  • Can use to make charts, maps, csv, etc
  • The template is HTML (which can contain JavaScript)
  • Postman provides a library of templates that you can copy/paste
  • Also see https://learning.postman.com/docs/sending-requests/visualizer/ and https://www.postman.com/postman/workspace/more-visualizer-examples/overview

Built in Libraries

  • Can automatically use faker.js, lodash, moment.js, chai.js and crypto-js
  • Ex: lodash.functionName()

Workflow Control

  • Scripting allows loops and conditionals
  • postman.setNextRequest() lets you change the order of requests in a collection
  • pm.sendRequest() lets you call multiple APIs from a single request’s scripts
  • Collection and environment variables let you communicate between APIs

Mock Servers

  • Create a mock server in UI
  • This gives you a URL
  • Can deactivate mock server
  • Set data to return

Code Generation

  • Includes Java, curl, Node.js, etc. for requests
  • For providers, fewer choices, but still a number

Test Automation

  • Bread and butter of postman
  • Can run manually
  • Can schedule API runs
  • Can report on results of API over time – ex: monitoring
  • Can use Newman and generate instructions for running the CLI in other CI/CD systems: ex: Jenkins, CircleCI, GitHub Actions, GitLab, etc
  • New as of June 15 – can do performance testing using the desktop client. Gives a response time graph

Flows

  • Visual diagram showing order/connection/variables.
  • Can include dashboards in flow

Docs

  • Markdown syntax: https://daringfireball.net/projects/markdown/syntax
  • Can embed images
  • If documented well, can share with others
  • Explore tab shows all public APIs across Postman. Best ones are well documented.
  • Can include a link to show which person/company created it.
  • Can have creator workspace and aggregate your collections
  • Get help at – community.postman.com

Can try most of it for free. The CLI is not free

My take

I like that she used Postman (a public collection) and demos for most of the presentation. A lot of the features described were new to me. Excellent start to the morning.

[2023 kcdc] the elephant in your data set – avoid bias in machine learning

Speaker: Michelle Frost

For more, see the table of contents.


Notes

  • Intersectionality wheel of privilege. Many spokes that range from power to erased to marginalized. Used the version posted here
  • Bias – inclination or prejudice for or against one person or group
  • ML Bias – systematic error in the model itself due to assumptions
  • Sometimes bias is necessary – inductive bias is the set of assumptions combined with the training examples to classify new data
  • Models with high bias oversimplify the problem
  • Each stage has potential harmful bias
  • Bias feeds back into model
  • In ML, when something looks too good to be true, it probably is

Points of bias

  • Historical bias – prejudice in the world as it exists today. Gave an example from ChatGPT where it assumed a nurse was female even when the pronouns were swapped. Full example here
  • Representation bias – the sample under-represents part of the population, so the model can’t make effective predictions for that group. Article describing the incident. “Solved” by dropping gorillas as a label
  • Measurement bias – using a proxy to represent a construct. A problem if it oversimplifies or its accuracy varies across groups. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) example: the data measures policing, not just the offender.
  • Aggregation bias – a one-size-fits-all model assumes the mapping from inputs to labels is consistent. For example, a term could mean something different across cultures, such as LSD being Lake Shore Drive in Chicago and not a drug. Or racial differences in HbA1c
  • Learning bias – a modeling choice may prioritize one objective in a way that damages another. Such as Amazon’s recruiting tool discriminating against women
  • Evaluation bias – benchmark data does not represent the population. Might make sense in some scenarios. Project Gender Shades analyzed differences across different tools.
  • Deployment bias – a model intended to solve one problem gets used a different way. Like making a hook for tuna and using it on a shark. A child abuse protection tool fails poor families.

Simpson’s paradox

  • Other attributes can be a proxy for the thing you are leaving out
  • An association disappears, reappears, or reverses when you divide the population into groups

Terms

  • Protected class – category where bias is relevant
  • Sensitive characteristics – algorithmic decisions where bias could be a factor
  • Disparate treatment
  • Disparate outcome/impact
  • Fairness – area of research to ensure biases and model inaccuracies do not lead to models that treat individuals unfavorably due to sensitive characteristics.

Metrics

  • Demographic parity – decisions/outcomes are independent of the protected attribute. Does not protect against all unfairness
  • Equalized odds – decisions are independent of the protected attributes. True and false positive rates must be equal across groups
  • Equal opportunity – like equalized odds, but only measures fairness for the true positive rates (see the sketch after this list)
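A rough sketch of how those metrics could be computed by hand; the labels, predictions, and groups below are made-up placeholders:

    # Demographic parity compares the rate of positive decisions per group;
    # equal opportunity compares the true positive rate (recall) per group.
    import numpy as np

    y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # placeholder labels
    y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 0, 0])  # placeholder predictions
    group = np.array(["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"])

    for g in ["a", "b"]:
        mask = group == g
        positive_rate = y_pred[mask].mean()
        tpr = y_pred[mask & (y_true == 1)].mean()
        print(f"group {g}: positive rate {positive_rate:.2f}, TPR {tpr:.2f}")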

Demo

  • A popular (bad) data set is the “adult data set”. I think it is this one.
  • Not balanced by gender, race, or country (see the sketch after this list)
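A quick way to see that imbalance, assuming she meant the OpenML copy of the census “adult” data set (the name, version, and column names here are my guess):

    # Load the data set and look at how skewed the sensitive columns are.
    from sklearn.datasets import fetch_openml

    adult = fetch_openml("adult", version=2, as_frame=True)
    df = adult.frame

    print(df["sex"].value_counts(normalize=True))            # heavily skewed
    print(df["race"].value_counts(normalize=True).head())
    print(df["native-country"].value_counts(normalize=True).head())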

Book recommendations

  • Weapons of Math Destruction
  • Biased
  • The Alignment Problem
  • Invisible Women
  • The Big Nine
  • Automating Inequality

My take

The types of bias and examples were interesting. Good end to the day. The demo graphs proved the point about biased data nicely.