[2023 kcdc] the elephant in your data set – avoid bias in machine learning | Down Home Country Coding With Scott Selikoff and Jeanne Boyarsky

Speaker: Michelle Frost

For more, see the table of contents.

Notes

Intersectionality wheel of privilege. Many spokes, each ranging from power to erased to marginalized. Used the version posted here
Bias – inclination or prejudice for or against one person or group
ML Bias – systematic error in the model itself due to assumptions
Sometimes bias is necessary – inductive bias – assumptions combined with training examples to classify
Models with high bias oversimplify the problem (they underfit)
Each stage has potential harmful bias
Bias feeds back into model
In ML, when something looks too good to be true, it probably is

Historical bias – prejudice in the world as it exists today. Gave an example from ChatGPT where it assumed a nurse was female even when the pronouns were replaced. Full example here
Representation bias – Sample under-represents part of population. Can’t make effective predictions for that group. Article describing. “Solved” by dropping gorillas as a label
Measurement bias – using a proxy to represent a construct. A problem if the proxy oversimplifies or its accuracy varies across groups. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) example. The data measures policing, not just the offender.
Aggregation bias – a one-size-fits-all model assumes the mapping from inputs to labels is consistent across groups. For example, a term could mean something different across cultures, such as LSD being Lake Shore Drive in Chicago and not a drug. Or racial differences for HbA1c
Learning bias – modeling choice may prioritize one objective which damages another. Such as Amazon’s recruiting tool discriminating against women
Evaluation bias – benchmark data does not represent the population. Might make sense in some scenarios. Project Gender Shades analyzed differences in different tools.
Deployment bias – model intended to solve one problem, but used in a different way. Make a hook for tuna and use it on a shark. Child abuse protection tool fails poor families.
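
One way to see representation and evaluation bias in practice is to break a metric down by group instead of trusting the aggregate. The sketch below is not from the talk. The function name, the group labels and the data are all made up for illustration, and it uses plain Python rather than any particular ML library.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return overall accuracy and a dict of accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    overall = sum(correct.values()) / sum(total.values())
    per_group = {g: correct[g] / total[g] for g in total}
    return overall, per_group

# Hypothetical data: group "a" has 8 samples, group "b" only 2.
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
groups = ["a"] * 8 + ["b"] * 2

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)    # 0.8
print(per_group)  # {'a': 1.0, 'b': 0.0}
```

The overall number looks acceptable, but every prediction for the under-represented group is wrong. A benchmark that contains few or no samples from that group would hide this.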

Protected class – category where bias is relevant
Sensitive characteristics – algorithmic decisions where bias could be a factor
Disparate treatment
Disparate outcome/impact
Fairness – area of research to ensure biases and model inaccuracies do not lead to models that treat individuals unfavorably due to sensitive characteristics.

Demographic parity – decisions/outcomes independent of protected attribute. Does not protect against all unfairness
Equal odds (also called equalized odds) – decision independent of protected attributes given the true outcome. True and false positive rates must be equal across groups
Equal opportunity – like equal odds but only measures fairness for true positive rates
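
The three definitions above can be compared on a toy example. This is a minimal sketch, not the speaker's demo. The helper names (`rates_by_group`, `gap`) and the data are hypothetical, and it assumes binary labels and that every group has at least one positive and one negative example.

```python
def rates_by_group(y_true, y_pred, groups):
    """Per group: selection rate, true positive rate, false positive rate."""
    stats = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, grp in zip(y_true, y_pred, groups) if grp == g]
        # Predictions for rows whose true label is 1 or 0 respectively.
        # Assumes both lists are non-empty, otherwise the division fails.
        positives = [p for t, p in rows if t == 1]
        negatives = [p for t, p in rows if t == 0]
        stats[g] = {
            "selection_rate": sum(p for _, p in rows) / len(rows),
            "tpr": sum(positives) / len(positives),
            "fpr": sum(negatives) / len(negatives),
        }
    return stats

def gap(stats, key):
    """Largest difference between groups for one rate (0.0 means equal)."""
    values = [s[key] for s in stats.values()]
    return max(values) - min(values)

# Hypothetical data: both groups get positive decisions half the time.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

stats = rates_by_group(y_true, y_pred, groups)
print(gap(stats, "selection_rate"))  # 0.0 -> demographic parity holds
print(gap(stats, "tpr"))             # 0.5 -> equal opportunity violated
print(gap(stats, "fpr"))             # 0.5 -> equal odds also violated
```

Both groups are selected at the same rate, so demographic parity is satisfied, yet the model is much less accurate for group "b". This is one way demographic parity alone fails to protect against all unfairness.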

The types of bias and examples were interesting. Good end to the day. The demo graphs proved the point about biased data nicely.