machine-learning | Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Speaker: Leah Berg

For more, see the table of contents.

Notes

Night job – https://www.datasciencerebalanced.com

Data Leakage

Also known as leakage or target leakage
Different from the information security meaning (data leaking to an outside organization)
Can be difficult to spot
Training data includes information about the test data.
Or the model is trained on information that is not available in production

How models learn

Split data into training data and test data.
Test data – data the model has never seen before; used to make sure the model gets it right
Can also have an optional validation set
Randomly pick whether each data point is training or test data. This is called a random train/test split
More training data than test data
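As a rough illustration of the random train/test split described above, here is a minimal sketch using only the standard library. The function name `random_train_test_split`, the 80/20 ratio, and the seed are assumptions for the example, not something from the talk.

```python
import random


def random_train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and cut it into (train, test)."""
    rng = random.Random(seed)       # fixed seed so the split is repeatable
    shuffled = list(rows)           # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))             # stand-in for 100 data points
train, test = random_train_test_split(rows)
print(len(train), len(test))        # more training data than test data
```

Every data point lands in exactly one of the two sets, and the model should only ever be fitted on `train`.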

Don’t include data from the future

Using a random split of time series data doesn’t work because the model has then learned from future data.
Better to use a sliding window. Use the first few months to predict the next month. Then add that actual value and predict the one after. And keep going. Adding up the errors gives a measure of the model’s accuracy.
This works because the model only knows about data from before the point it is asked to predict.
Create a timeline for when events happen. That way you make sure you aren’t using data from after the prediction.
You don’t always know where/when data was created, so it is important to understand the business process.
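The window approach above can be sketched roughly as follows. The "model" here is just the mean of the history so far, which is a placeholder assumption, and `walk_forward_errors`, `min_history`, and the sample numbers are made up for illustration.

```python
def walk_forward_errors(series, min_history=3):
    """Predict each point using only the points that came before it."""
    errors = []
    for t in range(min_history, len(series)):
        history = series[:t]                       # nothing from time t or later
        prediction = sum(history) / len(history)   # placeholder model: mean of history
        errors.append(abs(series[t] - prediction))
    return errors


monthly_values = [10, 12, 11, 13, 15, 14]          # hypothetical monthly data
errors = walk_forward_errors(monthly_values)
total_error = sum(errors)                          # added-up error across all steps
print(errors, total_error)
```

Each step grows the history by one actual value before predicting the next one, so the model never sees data from after the point it is asked to predict.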

Don’t randomly split groups

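With a random split, the model has already seen some data from the group you are then predicting (ex: other rows for the same student).
This is a problem when a new student shows up, because the prediction will be bad.
scikit-learn has GroupShuffleSplit() to keep a full group in the same set – testing or training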
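A minimal sketch of how GroupShuffleSplit might be used, assuming scikit-learn and NumPy are installed. The tiny arrays and the `student_ids` grouping are invented for the example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Six rows belonging to three hypothetical students, two rows each.
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
student_ids = np.array(["a", "a", "b", "b", "c", "c"])

# test_size=1 (an int) asks for exactly one whole group in the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=1, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=student_ids))

train_students = set(student_ids[train_idx])
test_students = set(student_ids[test_idx])
print(train_students, test_students)
```

No student appears on both sides of the split, so the test score reflects how the model does on a student it has never seen.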

Don’t forget your data is a snapshot

In school, have pristine data set.
In real world, data is always changing.
A later snapshot could tell the model about data that occurred after the prediction. Again, think about data on a timeline
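One way to think about the timeline in code is to filter records by when they were recorded. This is only a sketch: the `recorded_on` field, the event names, and `features_known_at` are hypothetical, and real data may not carry such a timestamp at all.

```python
from datetime import date


def features_known_at(events, prediction_date):
    """Keep only events recorded strictly before the prediction date."""
    return [event for event in events if event["recorded_on"] < prediction_date]


events = [
    {"name": "signup", "recorded_on": date(2023, 1, 5)},
    {"name": "support_ticket", "recorded_on": date(2023, 2, 10)},
    {"name": "cancellation", "recorded_on": date(2023, 3, 1)},  # after the prediction
]
usable = features_known_at(events, prediction_date=date(2023, 2, 15))
print([event["name"] for event in usable])
```

The cancellation is in today’s snapshot, but it would not have been known when the prediction was made, so it is left out.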

Don’t randomly split data when retraining

Want to use the same training/test data on the production and challenger models to see which is better.
If you re-split randomly, one model has already seen during training some of the data points you are testing on, so you don’t know if it is actually better.
The challenger model can get more data that wasn’t available originally. It is ok to split the new data into test/train as long as the original data is split the same way.
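One possible way to keep the original split stable when retraining is to assign each record by hashing a stable ID. This technique is an assumption of this sketch and was not necessarily what the talk recommended; `assigned_set` and the `order-N` IDs are hypothetical.

```python
import hashlib


def assigned_set(record_id, test_percent=20):
    """Deterministically map a record ID to 'train' or 'test'."""
    digest = hashlib.sha256(str(record_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket from 0 to 99
    return "test" if bucket < test_percent else "train"


original_ids = [f"order-{i}" for i in range(1000)]
new_ids = [f"order-{i}" for i in range(1000, 1200)]   # data that arrived later

before = {rid: assigned_set(rid) for rid in original_ids}
after = {rid: assigned_set(rid) for rid in original_ids + new_ids}

# Original records keep their assignment after the new data arrives.
unchanged = all(after[rid] == before[rid] for rid in original_ids)
print(unchanged)
```

Because the assignment depends only on the ID, the production and challenger models can be compared on the same test records even after new data is added.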

Split data immediately

Risky to rescale before the split because the scaling then uses values from the test data. Min/max can differ depending on whether you scale before or after the split.
Run normalization separately on the different sets of data
Before the split, do analysis with the business and exploratory data analysis. Split the data before you start modeling
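To make the min/max point concrete, here is a small sketch of min-max scaling where the range is learned from the training set only and then applied to the test set. Fitting on the training data only is common practice rather than something stated in these notes, and the helper names and numbers are invented.

```python
def fit_min_max(values):
    """Learn the scaling range from one set of values."""
    return min(values), max(values)


def apply_min_max(values, lo, hi):
    """Scale values using a previously learned range."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]        # avoid dividing by zero
    return [(v - lo) / span for v in values]


train = [10.0, 20.0, 30.0]
test = [40.0]

lo, hi = fit_min_max(train)                 # range comes from training data only
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)

leaky_lo, leaky_hi = fit_min_max(train + test)   # what scaling before the split would use
print(train_scaled, test_scaled, (leaky_lo, leaky_hi))
```

Scaling before the split would have used a max of 40.0, a value the model should not know about while training.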

Use Cross Validation

KFold Validation – split training data into K parts
ex: 3 fold validation – two parts stay as training and one is validation. The test data remains as test data and is kept separate for final evaluation.
The validation set is for an initial test.
Gives more options to train model
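The 3 fold example above could be sketched like this in plain Python. The function `k_fold_indices` is a hypothetical helper written for this note; scikit-learn offers its own KFold class for real use.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder across the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# These indices refer to the training data only; the test data stays separate.
splits = list(k_fold_indices(n_samples=9, k=3))
for training, validation in splits:
    print(training, validation)
```

Each part takes one turn as the validation set while the other two parts are used for training.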

Be Skeptical of High Performance

If the validation score is much higher than train/test, be suspicious.
If the train/test/validation scores are all high or all the same, be suspicious.

Use scikit-learn pipeline

Helps avoid leaking test data into training data
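A minimal sketch of a scikit-learn pipeline, assuming scikit-learn is installed. The synthetic data from `make_classification`, the choice of StandardScaler plus LogisticRegression, and the step names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; split first, before any preprocessing.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is a pipeline step, so it is re-fitted on the training part of each fold.
model = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])

cv_scores = cross_val_score(model, X_train, y_train, cv=3)
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)    # final check on held-out test data
print(cv_scores, test_score)
```

Because scaling happens inside the pipeline, the validation fold and the test set are never used to compute the scaling statistics.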

Check for features correlated with target

If another attribute has a high match with what you are looking for, make sure you are not mixing up correlation and causation.
Also, avoid timeline errors from reverse causation. Ex: the thing you are looking for causes something else.
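A rough way to flag features that match the target suspiciously well is to compute a correlation for each one. This sketch uses a hand-written Pearson correlation; the 0.95 cutoff, the feature names, and the data are all made up, and a flagged feature still needs a human to check where it sits on the timeline.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


target = [0, 1, 0, 1, 1, 0]                      # hypothetical label
features = {
    "refund_issued": [0, 1, 0, 1, 1, 0],         # mirrors the target exactly
    "account_age_years": [3, 5, 2, 8, 1, 7],
}

suspicious = [
    name for name, values in features.items()
    if abs(pearson(values, target)) > 0.95       # arbitrary cutoff for this sketch
]
print(suspicious)
```

A feature that mirrors the target this closely may be something the target causes, which would make it unavailable at prediction time.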

My take

Great talk. Almost all of this was new to me. It was understandable and I learned a lot.

Machine Learning for Java Developers in 45 Minutes

Speakers: Zoran Sevarac & Frank Greco – @zsevarac & @frankgreco

For more blog posts, see The Oracle Code One table of contents

General

“AI is the new electricity” – Andrew Ng (societies with AI were above those without)
For many tasks, algorithms are well known
Other algorithms are harder, ex: image recognition. A rule-based approach means constantly adding rules, so you end up with a large, complex set of rules.
When complexity goes up, bells should go off. Avoid complexity.
When the complexity index is too big, it isn’t scalable. Breeding ground for bugs.
Not all use cases are good for ML
Core of ML – recognizing patterns in data and making predictions against the data
Learn language by understanding all the rules (algorithm) or observing patterns (ML)

Terms

AI – type of algorithm where machine emulates aspects of human behavior
ML – subset of AI. Allows machine to learn from experience/data
Deep learning – subset of ML. Uses powerful computing and advanced neural networks

Deep learning

Accuracy grows with more data.
Older learning algorithms get outperformed after a certain amount of data.
Think of deep learning as a graph. Each node performs computation. Computation can be reconfigured by tweaking coefficients on edges
Layer – groups of nodes
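The graph picture above can be sketched in a few lines. This is a toy forward pass only, with arbitrary weights and a sigmoid activation chosen for the example; the talk itself used Java APIs, which are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One layer: each node sums its weighted incoming edges, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


inputs = [1.0, 0.0]
# Coefficients on the edges; changing them changes what the network computes.
hidden = layer_forward(inputs, weights=[[0.5, -0.5], [1.0, 1.0]], biases=[0.0, 0.0])
output = layer_forward(hidden, weights=[[1.0, -1.0]], biases=[0.0])
print(hidden, output)
```

Training amounts to adjusting the numbers in `weights` and `biases` until the output is close enough to the expected answer.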

Examples

Image recognition
Spam classification
Data classification
Identifying handwritten characters/image transformation

Data

Training data
Try to minimize the differences (errors) as training goes through the data
Once the error goes below a certain threshold, training stops
Determine whether false positives or false negatives are worse for your use case
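The idea of training until the error drops below a threshold can be sketched with a one-weight model and plain gradient descent. The learning rate, threshold, data, and the function name `train_until_threshold` are assumptions for this example.

```python
def train_until_threshold(xs, ys, learning_rate=0.01, threshold=1e-4, max_epochs=10000):
    """Fit y = weight * x, stopping once mean squared error is below threshold."""
    weight = 0.0
    for epoch in range(1, max_epochs + 1):
        errors = [weight * x - y for x, y in zip(xs, ys)]
        mse = sum(e * e for e in errors) / len(xs)
        if mse < threshold:
            break                                  # error is small enough: stop
        gradient = 2 * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        weight -= learning_rate * gradient         # nudge the weight to reduce error
    return weight, mse, epoch


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                          # true relationship: y = 2x
weight, mse, epochs = train_until_threshold(xs, ys)
print(weight, mse, epochs)
```

The `max_epochs` cap is a safety net so the loop still ends if the error never reaches the threshold.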

JSR381 – Visual Recognition API

Standard API for computer vision tasks using machine learning
Provides generic ML API design to support other libraries
Next phase is to figure out who/what will drive wider support/adoption
Brings ML closer to general Java dev audience
App programmers need to know this. Don’t need to become a data scientist to use.

Why it matters

Patterns
Can change data structures
The case for Learned Index Structures – https://arxiv.org/abs/1712.01208
New hardware for API
What happens to countries that host call centers and their economy?

Issues

Need clean data
Privacy and ethics
Correlation vs causality
Data hacking/poisoning
DeepFakes – can create people that don’t exist
Interpretability
AI/ML talent is scarce

My take

This was a great way to get started. There were a bunch of code samples as well using Java APIs.

[2023 kcdc] data leakage – why your ML model knows too much

[2019 oracle code one] Machine Learning
