sql query optimization – why temp tables can help

Last month, I migrated the jforum.net forum data to a coderanch jforum forum.  I had a requirement/goal to update the links in our forum so they work rather than point to broken links at jforum.net.

First I created the mapping.  (lesson: store this data as you migrate so you don’t have to recreate it later.)  I wound up mapping on subject lines and dates.  Luckily the threads were migrated in numeric order, so I could fill in the gaps.  But that wasn’t interesting enough to blog about.  What was interesting enough to blog about was the SQL queries to update the database.
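For illustration, that mapping could be captured in a table shaped something like this.  The table and column names here are hypothetical, not what I actually used:

create table jeanne_mapping (
    old_topic_id integer primary key,  -- thread id in the old jforum.net URL
    new_topic_id integer not null      -- matching coderanch thread id
);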

My goal

I wanted to make the changes entirely through the database – no Java code.  I also wanted to avoid postgres stored procedures because I hit some time sinks the last time I wrote one.  I am happy to say I achieved my goal.

Step 1 – Analysis

I noted that I needed to update 375 rows.  Too many to do by hand.  There are just under 5000 posts in the jforum forum, which means there are at most 5000 search/replace strings to check.  This doesn’t seem bad for a computer.  Once I know the SQL, I can write code to generate 5000 of them using my local mappings and then run the SQL script on the server.

select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';
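A quick sanity check one could run on that 375 number is a count with the same filter:

select count(*) from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';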

Step 2 – Horribly performing but functional query

It’s got to work before you can tune it.  My first attempt was:

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%';

The query plan was:

Seq Scan on jforum_posts_text  (cost=0.00..132971.82 rows=1 width=459)
  Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

5000 of those is going to take over an hour of hitting the database hard. Not good. I could run it over the weekend when volumes are lower, but I can do better than that.

Step 3 – Trying to use an index

I know that most (95% maybe) of the updates are in the JForum forum.  We had a few “legacy” links to jforum.net in other forums, but not a lot.  I then tried adding a condition on the forum id:

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_text like '%http://www.jforum.net/posts/list/5.page%' and post_id in (select post_id from jforum_posts where forum_id = 95);

The query plan was:

Nested Loop  (cost=22133.48..97281.40 rows=1 width=459)
  ->  HashAggregate  (cost=22133.48..22263.46 rows=12998 width=4)
        ->  Bitmap Heap Scan on jforum_posts  (cost=245.18..22100.99 rows=12998 width=4)
              Recheck Cond: (forum_id = 95)
              ->  Bitmap Index Scan on idx_posts_forum  (cost=0.00..241.93 rows=12998 width=0)
                    Index Cond: (forum_id = 95)
  ->  Index Scan using jforum_posts_text_pkey on jforum_posts_text  (cost=0.00..5.76 rows=1 width=459)
        Index Cond: (jforum_posts_text.post_id = jforum_posts.post_id)
        Filter: (jforum_posts_text.post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)

Well, it is using the indexes now.  But it is only about a 30% drop in cost for the worst case. On an untuned complex query, I usually see at least an order of magnitude performance jump from initial tuning.

Step 4 – time for a temp table

Most of the work is finding the 375 rows that need updating in a table with 1,793,111 rows.  And it has to happen for each of the 5000 times I run the query.

I decided to use a temporary table so I could run the expensive part once.

create table jeanne_test as select * from jforum_posts_text where post_text like '%http://www.jforum.net/posts/list/%';

explain update jforum_posts_text set post_text=replace(post_text, 'http://www.jforum.net/posts/list/5.page','http://www.coderanch.com/t/574339 ')  where post_id in (select post_id from jeanne_test where post_text like '%http://www.jforum.net/posts/list/5.page%');

Now I’m doing the expensive part once.  It still takes a couple seconds to do the first part.  But the second part – the update I’m running 5000 times – drops the query plan to

Nested Loop  (cost=13.50..21.99 rows=1 width=459)
  ->  HashAggregate  (cost=13.50..13.51 rows=1 width=4)
        ->  Seq Scan on jeanne_test  (cost=0.00..13.50 rows=1 width=4)
              Filter: (post_text ~~ '%http://www.jforum.net/posts/list/5.page%'::text)
  ->  Index Scan using jforum_posts_text_pkey on jforum_posts_text  (cost=0.00..8.46 rows=1 width=459)
        Index Cond: (jforum_posts_text.post_id = jeanne_test.post_id)

Nice.  Running the script with the 5000 update statements only took a few seconds.
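As an aside, the 5000 statements themselves could be generated right in the database.  Here is a sketch, assuming the hypothetical jeanne_mapping table from the start of this post; each mapping row becomes one update statement:

-- emits one update statement per mapped thread ('' is an escaped quote inside a SQL string literal)
select 'update jforum_posts_text set post_text=replace(post_text, '
    || '''http://www.jforum.net/posts/list/' || old_topic_id || '.page'','
    || '''http://www.coderanch.com/t/' || new_topic_id || ' '') '
    || 'where post_id in (select post_id from jeanne_test '
    || 'where post_text like ''%http://www.jforum.net/posts/list/' || old_topic_id || '.page%'');'
from jeanne_mapping;

Also, despite the role it plays, jeanne_test is an ordinary table, so it needs to be dropped once the script is done:

drop table jeanne_test;

If everything ran in a single session, create temporary table ... as select would work as well and would clean up after itself automatically.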

Conclusion

Database tuning is fun.  Explain is your friend.  As are different approaches.  And for those who aren’t doing the math, the performance jump was 3-4 orders of magnitude – from an estimated cost of 132971.82 down to 21.99.

installing postgres and tomcat on the mac

Back when I was setting up my Mac, I installed Eclipse 3.7.  I also installed Postgres and Tomcat so I could run locally.  Postgres was either unnecessarily frustrating or it required knowledge I didn’t know I was missing.  The whole shared memory thing should have been a clue.  The good news is that I think I made almost every mistake one could, so I have a nice list of problems.  I wrote this up two months ago, but waited to post until I understood better what was going on.

I learned I didn’t know where the applications folder is on the hard drive.   First I learned you can view the path of a file in the finder. Which is all well and good.  It tells me the path is Macintosh HD > Applications > Eclipse.  Awesome.  On a UNIX file system, a directory has a slash up front.  Ok.  So I’m now at a UNIX prompt looking for it.  No Mac HD at the root level.  Volumes looks promising so I cd there and see Macintosh HD as a choice.  Great.  And there’s even an Applications folder under it with my apps!  I found it without having to use “find.”  That wasn’t bad, but it was more surprising than I expected.  I also noticed there is /Applications which will be more convenient next time.

Also, the “one click installer that doesn’t tell you what it is doing” installs some things by unzipping to the directory you are in, some files to /Library and presumably some files to who knows where.  Now I see the scripts I was expecting.

initdb doesn’t have enough privileges running as me, so I tried root and got “initdb cannot be run as root.”  Ok, so it’s a real UNIX install where I need a dedicated user to run as.  I wasn’t expecting this because all my previous installs on the Mac have shielded me from a real UNIX install.

I thought I needed to create a postgres user to run the database.  Apple’s System Preferences told me an id with that name already exists, but /etc/passwd says nothing about it.  Running

dscl . -list /Users UniqueID

does turn up a postgres id. I need to figure out how the Mac Directory Service thing works.  Anyway, now that I knew the postgres user exists, I did “sudo passwd postgres” to reset the password to something I know.  Turns out the “sorry” message when trying to sudo to postgres wasn’t from an incorrect password.  Eureka!  I needed to type “sudo su postgres” and then type *my* admin password to switch.  Which kind of makes sense, because my regular id isn’t allowed to switch user but my sudo admin user is.

I then changed owner of /Library/PostgreSQL recursively to postgres user.  A nice easy step.  (sudo chown -R postgres /Library/Post*)

Finally, I could run initdb and get prompted for the new superuser password!

I then tried to start the database and got a complaint the lock file already exists.

FATAL:  lock file "/tmp/.s.PGSQL.5432.lock" already exists

HINT:  Is another postmaster (PID 94) using socket file "/tmp/.s.PGSQL.5432"?

Tried rebooting.  The error was still there.  Turns out postgres starts up automatically.  I went on to create a user and database using the usual commands.
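For reference, “the usual commands” in psql look something like this.  The user name, database name, and password are placeholders, not the ones I actually created:

create user forum_user with password 'changeme';  -- placeholder name and password
create database forum_db owner forum_user;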

Then I opened pgadmin to see if I could access the database.

Connected – the “maintenance db” field is the database name.

Almost there.  I used the wrong password for the id I created in postgres.  (This matters because the coderanch integration tests assume a certain password.  And because I will never remember the one I picked.)

alter user postgres with password 'better_password';

Importing data from file went smoothly.

Ongoing little problems

  1. Copy/paste from Eclipse doesn’t work.  You have to go through TextEdit.
  2. Postgres uses shift + arrow key to select a line.  I wish Apple apps would all use the same convention.

Resources

  • http://developer.apple.com/internet/opensource/postgres.html
  • http://zanshin.net/2009/09/07/installing-postgresql-on-mac-10-6-snow-leopard/ – except for the building from source parts this was useful

Tomcat

Given how many problems I had with Postgres, I decided to follow a tutorial for Tomcat.  This one was easy to understand.  I know enough about the Mac to understand what the sudo is for (to act as an admin).  And the rest is pure UNIX – permissions, making a symbolic link, etc.

Live from TSSJS – PostgreSQL with Tom

Live blogging from TheServerSide Java Symposium with Tom Kincaid at his vendor presentation “Introduction to PostgreSQL for Development and Deployment”.  Tom spends a lot of time contrasting Postgres with MySQL, commenting that since Oracle’s acquisition of MySQL, the licensing of MySQL is now much more restrictive. Tom says the licensing of Postgres is basically “Do what you want with the code but don’t bother us”.

1. Why is it not the most adopted open source DB?
Tom talks about some of the limitations of Postgres that prevented it from becoming one of the most adopted DBMS products, including:

  • Early versions were Linux only with Windows support coming later
  • Installation was difficult
  • Default configuration was not necessarily the best, and was designed for widest platform adoption
  • Lacked bundled distribution tools

Tom points out that originally a developer’s “first 20 mins” with the product were often frustrating and would turn people off to the software.

2. Today’s Postgres
Postgres has made a number of changes in the last few years to grow into a more developer-friendly and easier-to-use product.  It has also been extended to work with all major IDEs, with JDBC and ODBC drivers, with object-relational mapping tools, and with stored procedures written in Java.  They have also spent time improving the GUI tools to compete with other major DBMS providers; the GUI now includes performance tools and query plan views.

Conclusion
The vendor sessions were abridged compared to the normal sessions, so there was a lot Tom did not get to.  He did pique my interest in learning more about Postgres now that it has grown into a more substantial platform.  Tom’s belief is that everyone should use Postgres given its power and hands-off licensing, especially compared to the direction MySQL may be going.