[2023 kcdc] the elephant in your data set – avoid bias in machine learning

Speaker: Michelle Frost

For more, see the table of contents.


  • Intersectionality wheel of privilege. It has many spokes, each ranging from power to erased to marginalized. Used the version posted here
  • Bias – inclination or prejudice for or against one person or group
  • ML Bias – systematic error in the model itself due to assumptions
  • Sometimes bias is necessary – inductive bias – assumptions combined with training examples to classify
  • Models with high bias oversimplify the problem (they underfit)
  • Each stage of building a model can introduce harmful bias
  • Bias feeds back into the model
  • In ML, when something looks too good to be true, it probably is

Points of bias

  • Historical bias – prejudice in the world as it exists today. Gave an example from ChatGPT where it assumed a nurse was female even when the pronouns were replaced. Full example here
  • Representation bias – the sample under-represents part of the population, so the model can’t make effective predictions for that group. Article describing an example, which was “solved” by dropping gorillas as a label
  • Measurement bias – using a proxy to represent a construct. A problem if the proxy oversimplifies or its accuracy varies across groups. COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) example: the data measures policing, not just the offender.
  • Aggregation bias – a one-size-fits-all model assumes the mapping from inputs to labels is consistent across groups. The same input could mean something different across cultures, such as LSD being Lake Shore Drive in Chicago and not a drug. Or racial differences in HbA1c
  • Learning bias – modeling choice may prioritize one objective which damages another. Such as Amazon’s recruiting tool discriminating against women
  • Evaluation bias – benchmark data does not represent the population. Might make sense in some scenarios. The Gender Shades project analyzed the differences between tools.
  • Deployment bias – model intended to solve one problem, but used a different way. Like making a hook for tuna and using it on a shark. Child abuse protection tool fails poor families.

Simpson’s paradox

  • Other attributes can act as a proxy for the attribute you leave out
  • An association disappears, reappears or reverses when you divide the population into subgroups
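
A small sketch of the reversal, not from the talk. The counts are illustrative and the names (`counts`, `overall_rate`) are made up for this example. Option A has the better success rate inside each subgroup, yet option B looks better once the subgroups are pooled, because the two options were applied to very different mixes of cases.

```python
# Illustrative (successes, trials) counts per option and subgroup; not from the talk.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, trials):
    """Success rate for one cell."""
    return successes / trials

def overall_rate(option):
    """Success rate for an option after pooling all of its subgroups."""
    successes = sum(s for s, _ in counts[option].values())
    trials = sum(t for _, t in counts[option].values())
    return successes / trials

# Within each subgroup, A beats B.
for subgroup in ("small", "large"):
    a = rate(*counts["A"][subgroup])
    b = rate(*counts["B"][subgroup])
    print(f"{subgroup}: A={a:.3f} B={b:.3f}")

# Pooled, the ordering reverses and B beats A.
print(f"overall: A={overall_rate('A'):.3f} B={overall_rate('B'):.3f}")
```

The takeaway for bias work: a metric that looks fine on the whole population can hide a different story inside each group, so it is worth checking both views.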


  • Protected class – category where bias is relevant
  • Sensitive characteristics – attributes in algorithmic decisions where bias could be a factor
  • Disparate treatment
  • Disparate outcome/impact
  • Fairness – area of research to ensure biases and model inaccuracies do not lead to models that treat individuals unfavorably due to sensitive characteristics.


  • Demographic parity – decisions/outcomes are independent of the protected attribute. Does not protect against all unfairness
  • Equalized odds – decision is independent of protected attributes given the true outcome. True and false positive rates must be equal across groups
  • Equal opportunity – like equalized odds but only measures fairness for true positive rates
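
A minimal sketch of how these three measures could be computed, assuming binary labels and predictions split by group. The toy data and the function names (`selection_rate`, `rate_given_truth`, `fairness_gaps`) are invented for illustration and are not from the talk. Each measure is reported as a gap between groups, where 0 means the criterion is met.

```python
# Toy labels and predictions, invented for illustration (1 = favorable outcome).
data = {
    "group_a": {"y_true": [1, 1, 0, 0], "y_pred": [1, 1, 0, 0]},
    "group_b": {"y_true": [1, 1, 0, 0], "y_pred": [1, 0, 1, 0]},
}

def selection_rate(y_pred):
    """Share of rows that received the favorable prediction."""
    return sum(y_pred) / len(y_pred)

def rate_given_truth(y_true, y_pred, truth):
    """Share of favorable predictions among rows whose true label equals `truth`.

    truth=1 gives the true positive rate, truth=0 the false positive rate.
    """
    preds = [p for t, p in zip(y_true, y_pred) if t == truth]
    return sum(preds) / len(preds)

def gap(rates):
    """Largest difference between any two groups."""
    return max(rates) - min(rates)

def fairness_gaps(data):
    groups = list(data.values())
    sel = [selection_rate(g["y_pred"]) for g in groups]
    tpr = [rate_given_truth(g["y_true"], g["y_pred"], 1) for g in groups]
    fpr = [rate_given_truth(g["y_true"], g["y_pred"], 0) for g in groups]
    return {
        "demographic_parity": gap(sel),            # equal selection rates
        "equal_opportunity": gap(tpr),             # equal true positive rates
        "equalized_odds": max(gap(tpr), gap(fpr)), # equal TPR and FPR
    }

gaps = fairness_gaps(data)
print(gaps)
```

In this toy data both groups are selected at the same rate, so the demographic parity gap is 0, yet the true and false positive rates differ by 0.5. That is one way to see why demographic parity alone does not protect against all unfairness.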


  • A popular (bad) data set is the “adult data set”. I think it is this one.
  • Not balanced by gender, race, country
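
A quick way to check this kind of imbalance is to count how much of the data falls in each group. This is only a sketch: the rows and column names below are hypothetical stand-ins, not the actual data set.

```python
from collections import Counter

# Hypothetical rows standing in for a real data set; values are made up.
rows = [
    {"sex": "Male", "country": "United-States"},
    {"sex": "Male", "country": "United-States"},
    {"sex": "Male", "country": "Mexico"},
    {"sex": "Female", "country": "United-States"},
]

def group_shares(rows, column):
    """Return the fraction of rows that fall in each value of `column`."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

for column in ("sex", "country"):
    print(column, group_shares(rows, column))
```

Shares far from what you expect in the real population are a hint of representation bias before any model is trained.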

Book recommendations

  • Weapons of Math Destruction
  • Biased
  • The Alignment Problem
  • Invisible Women
  • The Big Nine
  • Automating Inequality

My take

The types of bias and examples were interesting. Good end to the day. The demo graphs made the point about biased data nicely.
