production problems across time zones

A couple of days ago, I blogged about the technical details of a production problem (not caused by me) at coderanch. Now that the problem is resolved, it is an interesting time to reflect on how time zones helped us.

Peak volume at the ranch

While we have users from 219 countries, roughly half our volume is from the US and India combined (source: Google Analytics). I also learned that our “peak time” is midnight to 6am Mountain Standard Time, followed by 6am to 3pm. This would be business hours in Asia and Europe, followed by business hours in Europe and North America. “Peak time” is somewhat misleading, though, because bots count as users for hits.

As an added bonus, peak time for search engines/bots is 5am to 7am Mountain Standard Time.  Yes, these overlap.
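To see why those windows line up with business hours elsewhere, here is a rough sketch that converts the Mountain Standard Time hours above into a few other zones. The fixed offsets are my simplifying assumptions (no daylight saving, and CET and EST standing in for "Europe" and "North America"), and the helper name `convert_hour` and the calendar date are arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset zones; assumption: daylight saving time is ignored.
MST = timezone(timedelta(hours=-7), "MST")
ZONES = {
    "IST": timezone(timedelta(hours=5, minutes=30), "IST"),  # India
    "CET": timezone(timedelta(hours=1), "CET"),              # Central Europe
    "EST": timezone(timedelta(hours=-5), "EST"),             # US East Coast
}

def convert_hour(hour_mst, target):
    """Return the wall-clock time in `target` for a whole hour in MST."""
    # The calendar date is arbitrary; only the time of day matters here.
    moment = datetime(2024, 1, 15, hour_mst, 0, tzinfo=MST)
    return moment.astimezone(target).strftime("%H:%M")

# The windows mentioned above, as (label, start hour, end hour) in MST.
WINDOWS = [
    ("first peak", 0, 6),
    ("second peak", 6, 15),
    ("bot peak", 5, 7),
]

if __name__ == "__main__":
    for label, start, end in WINDOWS:
        for name, zone in ZONES.items():
            print(f"{label}: {convert_hour(start, zone)}-"
                  f"{convert_hour(end, zone)} {name}")
```

Under these assumptions, the first peak works out to 12:30-18:30 in India and 08:00-14:00 in Central Europe, and the second peak to 08:00-17:00 on the US East Coast.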

When the problem occurred

Luckily for us, we have a moderator in India (Jaikiran Pai) who was able to investigate the problem in real time. This meant those of us in the United States woke up to an almost daily email saying that the site had gone down and describing an attempted fix.

Fixes for other problems

It turned out there were a couple of resource leaks in the code that Jaikiran found and fixed. One had been in the code for over a year. One was new (due to an API being converted to JPA and the caller not adapting the open session filter). One was a less-than-desirable transaction setting. All of these manifested because of the new, bigger problem, but none of them was the cause. This is a common challenge in software: finding the RIGHT problem.

Converging on the right problem

Another advantage of having someone who could look at the problem in real time was that he was able to capture the database logs while it was happening. Right before going to sleep, Jaikiran found two queries taking a long time to run. And by a long time, I mean one was taking OVER A MINUTE under load. He found them by running:

select current_query,now() - pg_stat_activity.query_start as duration from pg_stat_activity order by duration desc

He posted the two queries. One had an estimated cost of 200K units in its explain plan. At this point, we had something that could be fixed without witnessing the problem firsthand, and the SQL tuning work moved back to the United States. One thing the *right* solution had that the others didn’t was that it explained everything. All the other fixes made sense, but relied on a “magic” step to get from the problem to the solution.

Tuning the hack

I created a hack that would limit the number of threads shown in a forum to get us through another day or two until the weekend. It required tuning while the production problem was actually occurring. Back to India.

Conclusion

Communication across time zones only worked because of email. (Normally, we’d have used the forums. But the forums weren’t a very reliable place given that the problem was the forums going down.) At work, I’ve never been on a team with anyone more than 3 time zones away. It was a great experience working with a strong developer half the world away. And while we’ve been developing features together, it is what you do in times of difficulty that shows your process. It was wonderful to see ours working.

And finally: GREAT JOB JAIKIRAN!!!
