After a new feature was implemented (not by me), coderanch started crashing almost every day in the middle of the night. In a few days, I’ll be blogging about the troubleshooting process and how timezones helped. This post focuses on the end game – once we knew the problem was being caused by a large query – which is when I started being involved.
The tuning process
Like all good SQL tuning, I ran explain and iterated. As background the new feature was a join table with half a million rows.
|Explain cost||What changed||Observations|
|210,184||n/a||No wonder the site is crashing. For a web page (the forum list), this is forever! While the query plan is using an index, it is using the index to join a table with half a million rows to a table with millions of rows.|
|40,590||Removed an unnecessary subquery. (It was unnecessary because the column in populates isn’t used.)||The problem is that the query isn’t using the index for a where clause. Which is causing joins on very large tables to get a small amount of data. Another problem is that the query limits the # rows returned to one page worth but does it at the end prohibiting the database from saving work.|
|1,807||Hack – we really want to query the post time from the join table. Since it wasn’t on there and it was too much work to add during the week, I introduced a hack. I sorted by post creation (post id) and limited the query to sorting the most recent 100 records for the rest of the query.||While this is much faster, it is functionally incorrect. If an older post is still under discussion, it didn’t appear in the post list. So broken, but fast enough to get us to the weekend.|
|288||Added the latest post time as a stored field on the join table.||Ah. Done|
Learnings about postgres – locks
I ran a really simple statement to add a column to a table:
alter table jforum_topics_forums add column last_post_time TIMESTAMP without time zone not null default now();
Luckily I was on the test server because I had to kill it after 5 minutes. At first, I thought the problem was setting the field to a value since it had to go through all the records. That wasn’t the problem though. The problem was that postgres was waiting on a lock.
SELECT * FROM pg_stat_activity; select * from pg_locks where pid= 4503 and granted='f';
Running the above SQL, showed me postgres was waiting on an exclusive lock. After I shut down the forum, the alter statement ran almost instantaneously. The actual stored procedure to populate the new field (based on another table) took a few minutes. But that makes sense as it was a stored procedure doing half a million queries.
Testing with lots of data
Everything went fine on my machine. On the test server (which does have lots of data), I realized that I forgot to add the index that uses the new last post time column. That was the entire point of this exercise! And it goes to show how important it is to have production volumes of test data.