[2023 kcdc] data science: zero to hero

Speaker: Gary Short

Twitter; @garyshort

Repo for presentation/samples: https://github.com/garyshort/kcdc2023

For more, see the table of contents.


Data science overview/rules

  • Applied data science – solving business problems
  • Curiosity is most important
  • The universe does random stuff so you haven’t discovered anything until you prove you’ve discovered something
  • Only qualitative and quantitative data – people lie, Can’t trust what you ask
  • Can only do math with numbers. Some things will pretend to be numbers when they are not. Also, can’t add different things (dollars vs killograms)
  • If you can’t explain it to a six year old, you don’t really understand it
  • Only have to be more than 51% accurate to do better than guessing
  • True random data has some clusters. The cluster will not last forever. Gambler’s paradox. 27 blacks doesn’t mean due a red.
  • If it’s not in production, it doesn’t exist. Can’t just be on your laptop. Most data scientists need to give to someone else to get it to prod. Cultural difference between data scientist and person who is building/deploying.
  • % chance of hypothesis being right or wrong doesn’t have to sum up to 100%. ex: grass is wet. Could be rain or a dog peeing or something else

Types of data

Structured

  • Relational data
  • Get connection, create cursor, fill cursor, close connection
  • Schema is important on data write.

Semi structured

  • ex: JSON/MongoDB.
  • Get connection, name collection, fill cursor, close connection
  • Schema important when read data

Unstructured

  • Blob (binary large object)
  • Stored in pages/blocks
  • Access via URL

Graph

  • Degrees of separation – can you deliver a message directly
  • People in room now more closely connected because in this session (and would stay so if shared contact info)
  • Wide network effect
  • Nodes tend to be nouns
  • Edges tend to be verbs. Can be unidirectional or bidirectional
  • Get connection, state query, fill cursor, close connection

AI/ML works on data types

  • Categorical – segregate data by category where category is not important (ex: blue eyes)
  • Ordinal – order is important but distance between is not important (ex: position in a race)
  • Numeric – order is important but distance is the same (ex: counting)
  • Ratio – numeric but with positive numbers

Can only do math with ordinal and ratio types. A survey on a scale of 1-5 (likert scale) is ordinal, not numeric/ratio. Can’t do average. This is categorical data (ex: very happy, pissed off). Can do math with counts of categorical data but not single items.

Exploratory Data analysis

  • Need to understand the variables. Ex: is it really a number
  • Handle missing values – depends on scenario. Ex: use mean or median (if not looking for that particular thing), delete row with incomplete data
  • Outlier detection – sometimes genuinely an outlier (ex: someone who is 8 feet tall), sometimes it is the important piece of data (ex: which exits people use in a fire; one person went the other way and want to know why). Need to determine why outlier and if care so don’t delete data need
  • Univariate analysis – ex: histogram for categorical data
  • Bivariate analysis – correlated data; could be hidden variable. Don’t need both of them since one predicts the other. Want minimal variables in model so chose the one that brings in the most info.

Feature Selection

  • Preprocess the data
  • Normalize data – units have to be the same. Using variance doesn’t help because unit is now original unit squared. Can use Z-score so everything on scale 0-1 using mean and divisor
  • Encode the categories – make so can do math
  • Booleans are numbers (0 and 1)
  • Word vector – can use math to represent a word. Complicated. Ok to have to look up every time.
  • Bi/multivariate analysis – high correlation means redundant info
  • Feature importance – check coefficients from regressions and scores from gradient boosting

Model Selection

  • People have a favorite model
  • Use one or more models. See which gives best result before making any changes to the model.
  • Good to use a linear and non linear one. Normal the linear model is enough because normally dealing with people (directly or indirectly). Linear equations work for a normal distrobution.
  • Make sure to find global minimum, not local/current one
  • Compete with yourself. Try to have your second best model beat your current best model. Once something in prod, start again

Train/test split

  • 80/20 split
  • 80% data for training
  • 20% data real
  • Model never sees training data because can’t grade own homework

Model evaluation

  • Outcome – model + error
  • Error is difference between predicted and observed values.
  • Sample of population can be model. Get error because of sampling bias

Hyper Parameter Tuning

  • Every models have parameters to govern how works.
  • Hyper param tuning is fiddling with these
  • Will be an optional value for each of these parameters for your particular use case

Model Validation

  • Need to make sure model doesn’t work by chance
  • K-Fold Cross Validation – after do 80/20 split, can feed data back in and do again
  • Stratified Cross Validation – same as K-Fold but unbalanced classes

Bayesian inference in Real Life

  • P(h|e) = P(e|h) * P(h) / P(e)
  • In English: current belief = new evidence

Estimation

  • Important to be able to estimate values when have no data
  • Dumb questions like “how many piano tuners are there in Chicago” was testing this. So few people could do it that pulled question. [I suspect the ridicule and people memorizing the answer was a factor too]
  • Easier to estimate a range than an actual value
  • Pick a minimum that it couldn’t possibly be below. You’d be surprised and skeptical if less than that.
  • Pick a maximum that it couldn’t possible be above.
  • Pick value spits range in two so that the possibility of being above/below has equal probability. Call this the medium. Resist temptation to pick the mean.
  • Repeat finding the minimum to median. Call this Q1
  • Then repeat finding the median to maximum to get Q3.
  • This gets you a five point description of a distribution
  • Use sampling to get mean of distribution

Lab part

The lab was to predict something you want to predict and make a model and/or predict a probability. Can do individually or in groups. He also gave the option to leave. I chose leave because there was a little over an hour left when he finished explaining the lab. I need to go over the material for my own workshop so doing that instead of the lab.

My take

This was a good intro and Gary is a good, engaging speaker. I learned (and re-learned) a bunch of stuff. Both concepts and terms. Having a bunch of rules and getting into them made it fun. (ex: math needs numbers). I like that the concept part was longer (except for the lack of a break), but it would hav been better if it was advertised that way in the intro.

I disagree with Gary’s philosophy on not having a bathroom break. He started by saying there would be 60-90 minutes of lecture and then a lab. [wound up to being just over 2.5 hours] And that we are all adults and can go to the bathroom whenever. Someone asked at the 90 minute mark if there would be a bathroom break and he repeated the all adults thing expanding that you’ll catch up and the slides will be online later. He also said people feel compelled to hold it until break or go when told it is break. However, the tradeoff is that you don’t want to go to the bathroom lest you miss something that will wind up being important during the session. It’s super frustrating to miss stuff and then struggle to understand later. It may be that this workshop isn’t cumulative but there’s no way to know. Also, by not having a break, you aren’t giving people’s brain a break. It’s not just about the bathroom.

Gary stated he puts the materials online after so people don’t read it during the session. That I agree with!

AT&T Grandfathers Unlimited Data Plans with iPhone 5 [CONFIRMED]

Following the much anticipated announcement of Apple’s iPhone 5 today, a single unanswered question has been weighing on my mind. Will AT&T customers with existing unlimited data plans still be grandfathered in if they choose to purchase a new iPhone 5 at the subsidized prices ($199 for 16GB, $299 for 32GB, $399 for 64GB)? The answer is here, and it is a resounding “yes”!

CONFIRMED [9/12/2012]: According to today’s AT&T press release, users with unlimited plans will definitely be able to keep them:

“We are thrilled to offer iPhone 5 on the nation’s largest 4G network. We’ve worked with Apple since the original in 2007 and this is hands-down their best phone yet,” said Ralph de la Vega, president & CEO, AT&T Mobility. “We offer customers the flexibility to keep the iPhone data plans they already have or choose any of our individual or new Mobile Share plans. We’re proud that more customers choose AT&T for iPhone than any other U.S. carrier and look forward to making iPhone 5 the newest addition to our lineup.”

Verizon’s Poor Customer Service Decision

Avid Apple enthusiasts may recall that Verizon nixed unlimited plans for subsidized upgrades back in May 2012, a poor customer service move indeed. While Verizon still allows users to keep the plans by purchasing the unsubsidized iPhone, it does not make a lot of sense to spend hundreds of extra dollars on what may amount to $5-$20 per month (or nothing at all) in savings on the tiered plans.

Two AT&T Representatives Confirm Grandfather Rule

I called AT&T (1-888-333-6651) and spoke with a woman named Patricia, who assured me that as long as I upgraded my phone to an iPhone 5 through an AT&T representative, I would be able to keep my plan. She did caution me against upgrading at the Apple Store, not because it was not allowed, but that because last year some less-educated customers accidentally signed up for a new data plan at the Apple Store, and as AT&T will tell you, “Once the unlimited data plan has been removed, it cannot be added again”.

I also used Live Chat support on AT&T’s website and spoke to a person name Andre who confirmed that information; the transcript is posted below:

You are now chatting with Andre S., an AT&T sales representative.

Andre S.: Welcome to AT&T online Sales support. How may I assist you with placing your order today?

Scott: Hi Andre S. I currently have an unlimited Data plan for my iPhone 4. Will I lose this unlimited data plan if I choose to upgrade to an iPhone 5 at the subsidized price?

Andre S.: Hi Scott!! I'll be happy to assist you today with your inquiry.
Andre S.: You will not lose your data plan if you upgrade to a new phone.

Scott: even the iPhone 5 with LTE?

Andre S.: No matter which phone you get.

Scott: does it matter if I order via the Apple Store or the AT&T store?

Andre S.: It would not matter you will still be keeping your old plan.

Scott: Thank you

Disclaimer: I do not work for AT&T and cannot guarantee that, even if AT&T has a company-wide policy of grandfathering plans, your particular representative will be able to upgrade your phone without accidentally losing your data plan. As always, buyer beware.

sql query optimization – why temp tables can help

Last month, I migrated the jforum.net forum data to a coderanch jforum forum.  I had a requirement/goal to update the links in our forum so they work rather than point to the jforum broken links.

First I created the mapping.  (lesson: store this data as you migrate so you don’t have to do it later.)  I wound up mapping on subject lines and dates.  Luckily the threads were migrated in numeric order so I could fill in the gaps.  But that wasn’t interesting enough to blog about.  What was interesting enough to blog about was the SQL queries to update the database.

My goal

I wanted to make the changes entirely through the database – no Java code.  I also wanted to avoid postgres stored procedures because I encountered some time sinks last time I wrote a postgres stored procedure.  I am happy to say I achieved my goal.

Step 1 – Analysis

Noted that I need to update 375 rows.  Too many to do by hand.  There are just under 5000 posts in the jforum forum.  Which means therea rea t most 5000 search/replace strings to check.  This doesn’t seem bad for a computer.  Once I know the SQL, I can write code to generate 5000 of them using my local mappings and then run the SQL script on the server.

select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%'

Step 2 – Horrible performing but functional query

It’s got to work before you can tune it.  My first attempt was:

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%';

The query plan was:

Seq Scan on jforum_posts_text (cost=0.00..132971.82 rows=1 width=459) 
Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

5000 of those is going to take over an hour of hitting the database hard. Not good. I could run it over the weekend when volumes are need be, but I can do better than that.

Step 3 – Trying to use an index

I know that most (95% maybe) of the updates are in the JForum forum.  We had a few “legacy” links to jforum.net in other forums, but not a lot.  I then tried adding a condition on forum id

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%' and post_id in (select post_id from jforum_posts where forum_id = 95);

The query plan was:

Nested Loop (cost=22133.48..97281.40 rows=1 width=459)
HashAggregate (cost=22133.48..22263.46 rows=12998 width=4)
Bitmap Heap Scan on jforum_posts (cost=245.18..22100.99 rows=12998 width=4)
Recheck Cond: (forum_id = 95)
Bitmap Index Scan on idx_posts_forum (cost=0.00..241.93 rows=12998 width=0)
Index Cond: (forum_id = 95)
Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..5.76 rows=1 width=459)
Index Cond: (jforum_posts_text.post_id = jforum_posts.post_id)
Filter: (jforum_posts_text.post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

Well, it is using the indexes now.  But it is only about a 30% drop in cost for the worst case. On an untuned complex query, I usually see at least an order of magnitude performance jump on my initial tuning.

Step 4 – time for a temp table

Most of the work is finding the 375 rows that need updating in a table with 1,793,111 rows.  And it has to happen for each of the 5000 times I run the query.

I decided to use a temporary table so I could run the expensive part once.

create table jeanne_test as select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_id in (select post_id from jeanne_test where post_text like '%http://www.jforum.net/posts/list/5.page%');

Now I’m doing the expensive part once.  It still takes a couple seconds to do the first part.  But the second part – the update I’m running 5000 times – drops the query plan to

Nested Loop (cost=13.50..21.99 rows=1 width=459)
HashAggregate (cost=13.50..13.51 rows=1 width=4)
Seq Scan on jeanne_test (cost=0.00..13.50 rows=1 width=4)
Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
Index Scan using jforum_posts_text_pkey on jforum_posts_text (cost=0.00..8.46 rows=1 width=459)
Index Cond: (jforum_posts_text.post_id = jeanne_test.post_id

Nice.  Running the script with the 5000 update statements only took a few seconds.

Conclusion

Database tuning is fun.  Explain is your friend.  As are different approaches.  And for those who aren’t doing match, the performance jump was 3-4 orders of magnitude.