production problems across time zones

A couple of days ago, I blogged about the technical details of a production problem (not caused by me) at coderanch. Now that the problem is resolved, it's an interesting time to reflect on how time zones helped us.

Peak volume at the ranch

While we have users from 219 countries, roughly half our volume comes from the US and India combined (source: Google Analytics). I also learned that our "peak time" is midnight to 6am Mountain Standard Time, followed by 6am to 3pm. That corresponds to business hours in Asia and Europe, followed by Europe and North America. "Peak time" is a bit misleading, though, because bots count as users for hits.

As an added bonus, peak time for search engines/bots is 5am to 7am Mountain Standard Time.  Yes, these overlap.

When the problem occurred

Lucky for us, we have a moderator in India (Jaikiran Pai) who was able to investigate the problem in real time. Which meant those of us in the United States woke up to an almost daily email saying the site had gone down, along with an attempted fix.

Fixes for other problems

It turned out there were a couple of resource leaks in the code that Jaikiran found and fixed. One had been in the code for over a year. One was new (due to an API being converted to JPA and the caller not adapting the open session filter). One was a less-than-desirable transaction setting. All of these manifested because of the new, bigger problem – but they were not the cause. This is a common problem in software – finding the RIGHT problem.

Converging on the right problem

Another advantage of having someone who could look at the problem in real time was that he was able to capture the database logs as things happened. Right before going to sleep, Jaikiran found two queries taking a long time to run. And by a long time, I mean one was taking OVER A MINUTE under load. He found them by running:

select current_query, now() - pg_stat_activity.query_start as duration from pg_stat_activity order by duration desc;
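(For anyone trying this on PostgreSQL 9.2 or later: current_query was renamed to query and procpid to pid in that release, so the equivalent check looks something like this sketch:)

-- find the longest-running statements on PostgreSQL 9.2+
select pid, now() - query_start as duration, query
from pg_stat_activity
where state <> 'idle'
order by duration desc;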

He posted the two queries. One took 200K explain plan units. At this point, we had something that could be fixed without witnessing the problem firsthand, and the SQL tuning work moved back to the United States. One thing the *right* solution had that the others didn't was that it explained everything. All the other fixes made sense, but relied on a "magic" step to get from the problem to the solution.

Tuning the hack

I created a hack that would limit the number of threads shown in a forum, to get us through another day or two until the weekend. It required tuning during the production problem window. Back to India.

Conclusion

Communication across time zones only worked because of email. (Normally, we'd have used the forums. But the forums weren't a very reliable place, given that the problem was the forums going down.) I've never been on a team at work spread more than 3 time zones apart. It was a great experience working with a strong developer half the world away. And while we've been developing features together, it is what you do in times of difficulty that shows your process. It was wonderful to see ours working.

And finally: GREAT JOB JAIKIRAN!!!

postgres tuning – fixing a production problem

After a new feature was implemented (not by me), coderanch started crashing almost every day in the middle of the night. In a few days, I'll be blogging about the troubleshooting process and how time zones helped. This post focuses on the end game – once we knew the problem was being caused by a large query – which is when I got involved.

The tuning process

Like all good SQL tuning, I ran explain and iterated. As background, the new feature involved a join table with half a million rows.
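For anyone who hasn't used it, explain prints the planner's estimated cost without actually running the query. The query below is a made-up illustration (the table and column names are guesses at the schema, and the plan numbers are invented), but the top-level cost is the kind of number the table that follows tracks:

explain
select t.topic_id
from jforum_topics t
join jforum_topics_forums tf on tf.topic_id = t.topic_id
where tf.forum_id = 42;

--                           QUERY PLAN
-- ----------------------------------------------------------------
--  Nested Loop  (cost=0.00..210184.00 rows=1250 width=4)
--    ...
-- the second cost figure (210184 here) is the "explain cost" below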

Explain cost: 210,184
What changed: n/a
Observations: No wonder the site is crashing. For a web page (the forum list), this is forever! While the query plan is using an index, it is using the index to join a table with half a million rows to a table with millions of rows.

Explain cost: 40,590
What changed: Removed an unnecessary subquery. (It was unnecessary because the column it populates isn't used.)
Observations: The problem is that the query isn't using the index for the where clause, which causes joins across very large tables just to get a small amount of data. Another problem is that the query limits the number of rows returned to one page's worth, but does so at the end, preventing the database from saving work.

Explain cost: 1,807
What changed: Hack – what we really want is to query the post time from the join table. Since it wasn't on there, and it was too much work to add during the week, I introduced a hack: I sorted by post creation (post id) and limited the query to the 100 most recent records before the rest of the query ran. (A sketch of this hack follows the table.)
Observations: While this is much faster, it is functionally incorrect. If an older post is still under discussion, it doesn't appear in the post list. So broken, but fast enough to get us to the weekend.

Explain cost: 288
What changed: Added the latest post time as a stored field on the join table.
Observations: Ah. Done.
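Here is roughly the shape of that hack – a minimal sketch only, with guessed table and column names (jforum_topics and topic_first_post_id are assumptions), not the production query:

select t.*
from (
    select topic_id
    from jforum_topics
    order by topic_first_post_id desc  -- post id tracks creation order
    limit 100
) recent
join jforum_topics t on t.topic_id = recent.topic_id;

The expensive joins and sorting in the real query then run against only those 100 rows. An older topic that is still active never makes it into the inner 100, which is exactly the brokenness described above.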

Learnings about postgres – locks

I ran a really simple statement to add a column to a table:

alter table jforum_topics_forums add column last_post_time TIMESTAMP without time zone not null default now();

Luckily I was on the test server because I had to kill it after 5 minutes.  At first, I thought the problem was setting the field to a value since it had to go through all the records.  That wasn’t the problem though.  The problem was that postgres was waiting on a lock.
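Newer Postgres versions (9.3 and later) support a lock_timeout that makes such a statement fail fast instead of hanging; a sketch:

set lock_timeout = '10s';
alter table jforum_topics_forums
    add column last_post_time TIMESTAMP without time zone not null default now();
-- errors out with "canceling statement due to lock timeout"
-- if the lock can't be acquired within 10 seconds

To see what the alter statement was actually waiting on, two queries did the trick: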

SELECT * FROM pg_stat_activity;

select * from pg_locks where pid = 4503 and granted = 'f';

Running the above SQL showed me postgres was waiting on an exclusive lock. After I shut down the forum, the alter statement ran almost instantaneously. The actual stored procedure to populate the new field (based on another table) took a few minutes. But that makes sense, as it was a stored procedure running half a million queries.
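For relation-level locks like this one, a self-join on pg_locks pairs the stuck backend with whoever holds the conflicting lock. A simplified sketch (it ignores non-relation lock types):

-- which granted lock is a blocked backend waiting behind?
select blocked.pid   as blocked_pid,
       blocking.pid  as blocking_pid,
       blocking.mode as blocking_mode
from pg_locks blocked
join pg_locks blocking
    on blocking.relation = blocked.relation
   and blocking.pid <> blocked.pid
where not blocked.granted
  and blocking.granted;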

Testing with lots of data

Everything went fine on my machine. On the test server (which does have lots of data), I realized that I had forgotten to add the index that uses the new last post time column. That was the entire point of this exercise! And it goes to show how important it is to have production volumes of test data.
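A minimal version of that index, assuming the join table has a forum_id column (the index name here is illustrative, not the real definition):

create index idx_topics_forums_last_post
    on jforum_topics_forums (forum_id, last_post_time desc);

With the forum id leading, the forum list query can walk the index in last-post order and stop after one page of rows.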

why i usually like books over video

A teammate was discussing the “wonders of learning from video” yesterday.  Which got me thinking.  I generally like learning from books/articles best.  This would be text with illustrations/diagrams, not raw text.  I like reading better because:

  1. It is easier to go at my own pace. (While you can speed up video, it takes more energy to listen at a faster speed. And I don't want it uniformly fast; I want to be able to stop and re-read, which is a pain on video.)
  2. I find it easier to find information in text.
  3. I can later search text if electronic.  Or have “physical presence” cues if hard copy.

That said, I’m enjoying some of the MOOC courses online. Some being the operative word. A video has to be done right to be good. (As does a book; it's just that books tend to go through more editing.) I've noticed that the videos I like tend to be less than 5-10 minutes in length, with quizzes or exercises in between or in the middle. I think the interaction helps. It is easy to see if I understand what is going on so far. And to revisit select parts.

Live/in-person video doesn't have the negative side effects for me that recorded videos do. I think that is because the presenter can adjust in real time – by seeing reactions, looking at visual cues, or answering questions. It still feels interactive even if a high percentage is lecture.

When creating documentation

When looking for general information, there are many formats available, and it is relatively easy to pick the one you want. (Although books are more common than videos on specialized topics.) In a company, the cost to produce internal documentation often precludes doing both. It's also harder on the creators to do video because:

  1. Content needs to be searchable. (I suppose a video transcription could allow this.) This is the same reason text in an image should be available in plain text as well.
  2. Producing content for video consumption is very different from merely recording an in-person training session. The focus is different. The "real time clutter" needs to be removed. The screen needs to be shown with a different emphasis. It's not something to just do on a whim.
  3. Video can't be watched while on hold, on a conference call, etc. Granted, these aren't the ideal times to be learning, but it does happen. Again, subtitles could help with this.

What do you think?  How do you balance text vs video for technical content?