Speaker: Leah Berg
For more, see the table of contents.
- Night job – https://www.datasciencerebalanced.com
Data Leakage
- Also known as leakage or target leakage
- Has a different meaning in information security (data leaking to an outside organization)
- Can be difficult to spot
- Training data includes information about the test data.
- Model trained on info not available in production
How models learn
- Split data into training data and test data.
- Test data – data the model has never seen before; used to make sure the model gets it right
- Can also have an optional validation set
- Randomly pick whether each data point is training or test data. This is called a random train/test split.
- More training data than test data
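As a minimal sketch of the random train/test split described above, assuming scikit-learn's `train_test_split` and a made-up toy dataset (the sizes, the 80/20 ratio, and the variable names are illustrative, not from the talk):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows as test data (an assumed ratio).
# random_state makes the random assignment repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

The model would be fit on `X_train`/`y_train` only and scored on `X_test`/`y_test`.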
Don’t include data from the future
- Using a random split of time series data doesn’t work because model has learned about future data.
- Better to use a sliding window. Use the first few months to predict the next month. Then add that month's actual value and predict the one after. Keep going. Adding up the errors gives you a measure of the model's accuracy.
- This works because the model only knows about data from before the point it is asked to predict.
- Create a timeline of when events happen. That way you can make sure you aren't using data from after the prediction.
- Don’t always know where/when data was created. Important to understand business process
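The window approach above might look roughly like this. It is only a sketch: the monthly values are invented, the starting window of three months is an assumption, and the "model" is a placeholder that predicts the mean of the history so the loop stays self-contained.

```python
import numpy as np

# Hypothetical monthly values, oldest first.
series = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 15.0, 14.0, 16.0])
initial_window = 3  # assumed number of months used for the first prediction

abs_errors = []
for t in range(initial_window, len(series)):
    history = series[:t]         # only data from before the month being predicted
    prediction = history.mean()  # placeholder model: mean of the history so far
    abs_errors.append(abs(series[t] - prediction))

# Average the per-month errors to summarize how the model did.
mean_abs_error = float(np.mean(abs_errors))
```

At every step the prediction for month `t` uses `series[:t]` and nothing later. scikit-learn's `TimeSeriesSplit` produces comparable train/test index pairs if you would rather not write the loop by hand.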
Don’t randomly split groups
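- With a random split, the training data contains some data from the same group (ex: a student) you are then predicting, so test results look better than they should.
- The problem appears when a new student shows up: the model has never seen that group, so the prediction will be bad.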
- scikit-learn has GroupShuffleSplit() to get full group in same set – testing or training
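A small sketch of `GroupShuffleSplit` keeping each group on one side of the split. The student IDs, row counts, and test size are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: 12 rows belonging to 4 students, 3 rows each.
X = np.arange(24).reshape(12, 2)
y = np.array([0, 1] * 6)
groups = np.repeat(["s1", "s2", "s3", "s4"], 3)

# test_size is a fraction of the groups, not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_students = set(groups[train_idx])
test_students = set(groups[test_idx])
```

No student appears in both `train_students` and `test_students`, so the test score reflects how the model handles a student it has never seen.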
Don’t forget your data is a snapshot
- In school, have pristine data set.
- In real world, data is always changing.
- Could tell model about data that occurred after prediction. Again think about data on timeline
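One way to think about data on a timeline in code is to filter by timestamp before building features. This is a hedged sketch with pandas: the `events` table, its column names, and the prediction date are all hypothetical.

```python
import pandas as pd

# Hypothetical event log; in a real system the timestamps would come from the source data.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-03-20", "2023-01-15", "2023-04-01"]
    ),
    "amount": [20.0, 35.0, 50.0, 10.0, 80.0],
})

# Assumed moment at which the prediction would have been made.
prediction_time = pd.Timestamp("2023-03-01")

# Keep only rows that already existed when the prediction was made.
known_at_prediction = events[events["event_time"] < prediction_time]
features = known_at_prediction.groupby("customer_id")["amount"].sum()
```

Rows dated after `prediction_time` never reach the feature calculation, even though they are present in today's snapshot.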
Don’t randomly split data when retraining
- Want to use the same training/test data for the production and challenger models to see which is better.
- Otherwise one model has already seen, during training, some of the data points you are testing on, so you can't tell whether it is really better.
- Challenger model can get more data that wasn’t available originally. Ok to split new data into test/train as long as original data part is split same way.
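One possible way to keep the original split stable when retraining (not necessarily the approach from the talk) is to assign each record to train or test from a hash of its ID. The ID format, the `assign_split` helper, and the 20% test fraction below are assumptions.

```python
import hashlib

def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a record to 'train' or 'test' from its ID."""
    digest = hashlib.md5(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "test" if bucket < test_fraction * 100 else "train"

# Hypothetical IDs: the original data plus rows that arrived later.
original_ids = [f"row-{i}" for i in range(1000)]
new_ids = [f"row-{i}" for i in range(1000, 1200)]

original_split = {rid: assign_split(rid) for rid in original_ids}
retrain_split = {rid: assign_split(rid) for rid in original_ids + new_ids}
```

Because the assignment depends only on the ID, every original record lands on the same side in `retrain_split` as in `original_split`, while new records are still divided between train and test.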
Split data immediately
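- Risky to rescale before the split: the min/max is computed over all the data, so information about the test set leaks into the training set. The min/max can differ once the data is split.
- Run normalization after the split, separately on the different sets of data.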
- Before split, do analysis with business, exploratory data analysis. Split data before start modeling
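A sketch of splitting first and rescaling afterwards, assuming scikit-learn's `MinMaxScaler` and random toy data. The scaler learns its min/max from the training rows only and then reuses them on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# Split before any rescaling.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # min/max come from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
```

Scaled test values can fall slightly outside the 0 to 1 range. That is expected, since the test set played no part in choosing the min/max.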
Use Cross Validation
- K-fold cross validation – split the training data into K parts
- Ex: 3-fold validation – in each round, two parts stay as training data and one is used for validation. The test data remains as test data and is kept separate for final evaluation.
- The validation set is for an initial test.
- Gives more options to train model
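The 3-fold setup above could be sketched with scikit-learn's `KFold` as follows. The data is a toy array and the 80/20 hold-out ratio is an assumption; the point is that the test set is set aside first and the folds are cut from the remaining training data.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical toy data: 30 rows, 2 features.
X = np.arange(60).reshape(30, 2)
y = np.arange(30) % 2

# Set the test data aside first; it is not touched during cross validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

kfold = KFold(n_splits=3, shuffle=True, random_state=0)
fold_sizes = []
for train_idx, val_idx in kfold.split(X_trainval):
    # Each round: two parts for training, one part for validation.
    fold_sizes.append((len(train_idx), len(val_idx)))
```

Each of the three rounds trains on two thirds of the remaining rows and validates on the other third, while `X_test` waits for the final evaluation.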
Be Skeptical of High Performance
- If validation performance is much higher than train/test performance, be suspicious.
- If performance on the train/test/validation sets is all very high or identical, be suspicious.
Use scikit-learn pipeline
- Helps avoid leaking test data into training data
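A minimal sketch of a scikit-learn `Pipeline`, assuming a scaler followed by a logistic regression on invented data (the step names and the model choice are illustrative). When the pipeline is passed to `cross_val_score`, the scaler is refit inside each fold on that fold's training part only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data where the first feature drives the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

pipeline = Pipeline([
    ("scale", StandardScaler()),      # fitted only on each fold's training rows
    ("model", LogisticRegression()),
])

# One accuracy score per fold.
scores = cross_val_score(pipeline, X, y, cv=3)
```

Bundling preprocessing with the model means there is no separate step where the scaler could be fit on validation or test rows by accident.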
Check for features correlated with target
- If another attribute matches the target you are predicting very closely, make sure you are not mixing up correlation and causation.
- Also, avoid timeline errors from reverse causation. Ex: the thing you are predicting causes something else, which then shows up as a feature.
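A rough sketch of a correlation check with pandas. The column names, the deliberately leaky feature, and the 0.9 threshold are all assumptions chosen for illustration. A flagged feature is a prompt to look at where it sits on the timeline, not proof of leakage.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two ordinary features and one that is nearly a copy of the target.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "leaky_feature": target + 0.01 * rng.normal(size=200),
    "target": target,
})

# Absolute correlation of each feature with the target.
correlations = df.drop(columns="target").corrwith(df["target"]).abs()

# Flag features above an assumed threshold for manual review.
suspicious = correlations[correlations > 0.9].index.tolist()
```

Anything in `suspicious` deserves a question to the business: was this value really known before the prediction, or is it a consequence of the outcome?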
My take
Great talk. Almost all of this was new to me. It was understandable and I learned a lot.