Last night, I needed to do a full Time Machine restore on my Mac. I had messed up installing things and made the problem far worse in my attempt to “fix” it. Then I tried to roll back just a fraction of the disk, but the laptop predicted 11 hours for that; it was easier to roll back the whole machine to its state a few hours earlier. The only other time I had done a full Time Machine restore was when I got a new computer. That was going forward in calendar time, though. This was the first time I went backwards.
For the most part, it was easy. The Mac prompted me to confirm it was ok to erase the disk and restore. That took about two hours. My computer was pretty much set up on restore; it remembered lots, down to my internet connection and GitHub key. I got re-prompted for my Dropbox credentials (and for two-factor). No big deal.
Then there was Postgres. Every few seconds, I got the message “Do you want the application “postgres” to accept incoming network connections?” The message only appeared for about a second. Even if I was fast enough to click on it, it was back a few seconds later.
I tried signing the app based on some internet posts. No luck.
I got temporary relief while I looked into the problem by blocking all connections:
- System Preferences
- Security & Privacy
- Click the lock to make changes
- My password
- Click unlock
- Firewall Options
- Click “Block all incoming connections”
This is not helpful in the long run, but it let me look at things without going crazy.
I checked the database logs in /Library/PostgreSQL/8.4/data/pg_log and found a new file every few seconds with:
2016-04-25 20:39:15 EDT LOG: database system was interrupted; last known up at 2016-04-24 09:18:51 EDT
2016-04-25 20:39:15 EDT LOG: record with zero length at 0/5C42472C
2016-04-25 20:39:15 EDT LOG: invalid primary checkpoint record
2016-04-25 20:39:15 EDT LOG: record with zero length at 0/5C4246E8
2016-04-25 20:39:15 EDT LOG: invalid secondary checkpoint record
2016-04-25 20:39:15 EDT PANIC: could not locate a valid checkpoint record
2016-04-25 20:39:15 EDT LOG: startup process (PID 2326) was terminated by signal 6: Abort trap
2016-04-25 20:39:15 EDT LOG: aborting startup due to startup process failure
Ah ha! That makes perfect sense. I had jumped around in time, which would confuse a transaction log. Now, that’s something I know how to fix. I did a forced reset of the transaction log:
- sudo su postgres
- cd /Library/PostgreSQL/8.4/bin
- ./pg_resetxlog -f /Library/PostgreSQL/8.4/data
And all better! I turned the firewall settings back to the way they were before I started, and everything was good as new.
I upgraded my Postgres today. (Yes, I know it is an old version – that is what the CodeRanch forums use.) First I reviewed what I did to install 8.3 originally and how frustrating that was. The upgrade went much more smoothly. Here’s what I did:
- Download the Mac Postgres 8.4 graphical installer.
- Install it.
- Create my database and load data
- Change context.xml in Tomcat to point to port 5433. (Since I had two Postgres installs at the same time, the installer incremented the port from the default of 5432.) (I changed this back later.)
- sudo su – root and changed /Library/PostgreSQL/8.4/data/postgresql.conf to use port 5432. (If I were doing this again, I’d uninstall the old version as the first step to avoid port conflicts.)
- In finder: Go > Go to folder > Libraries > Postgres > 8.3 and select the uninstaller app
That’s it. Nice and easy. Which is good, because the initial install was anything but easy!
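The port change in the list above is a one-line edit to postgresql.conf. Here is a minimal sketch of that edit, run against a scratch copy so it is safe to try anywhere; on the real machine the file would be /Library/PostgreSQL/8.4/data/postgresql.conf, and the surrounding contents are invented for illustration:

```shell
# Work on a scratch copy of postgresql.conf rather than the real file.
conf=$(mktemp)
printf 'port = 5433\nmax_connections = 100\n' > "$conf"   # what the installer wrote

# Restore the default port the old install had been occupying.
sed 's/^port = 5433$/port = 5432/' "$conf"
```

Remember to restart Postgres after the edit so the new port takes effect.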
Now I can rewrite this without a stored procedure!
After a new feature was implemented (not by me), CodeRanch started crashing almost every day in the middle of the night. In a few days, I’ll be blogging about the troubleshooting process and how timezones helped. This post focuses on the end game – once we knew the problem was being caused by a large query – which is when I started being involved.
The tuning process
Like all good SQL tuning, I ran explain and iterated. As background, the new feature involved a join table with half a million rows.
- No wonder the site was crashing. For a web page (the forum list), this is forever! While the query plan was using an index, it was using the index to join a table with half a million rows to a table with millions of rows.
- Removed an unnecessary subquery. (It was unnecessary because the column it populates isn’t used.)
- The problem was that the query wasn’t using the index for the where clause, which caused joins on very large tables just to get a small amount of data. Another problem was that the query limited the number of rows returned to one page’s worth, but only at the very end, preventing the database from saving itself any work.
- Hack – we really wanted to query the post time from the join table. Since it wasn’t on there, and it was too much work to add during the week, I introduced a hack: I sorted by post creation (post id) and limited the query to the 100 most recent records before the rest of the query ran.
- While this was much faster, it was functionally incorrect. If an older post was still under discussion, it didn’t appear in the post list. So: broken, but fast enough to get us to the weekend.
- Added the latest post time as a stored field on the join table.
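The limit-early hack can be sketched with a toy table. This is a hypothetical SQLite model, not the real jforum schema: post ids stand in for creation order, and the inner query restricts all further work to the 100 most recent posts before the grouping and sorting happen.

```python
import sqlite3

# Toy model (not the real jforum schema): 1,000 posts spread over 50
# topics, where a higher post id means a more recent post.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE post (id INTEGER PRIMARY KEY, topic_id INTEGER)")
conn.executemany("INSERT INTO post (topic_id) VALUES (?)",
                 [(i % 50,) for i in range(1000)])

# Limit-late: group and sort every row, then keep one page of results.
late = conn.execute("""
    SELECT topic_id, MAX(id) AS last_post FROM post
    GROUP BY topic_id ORDER BY last_post DESC LIMIT 20""").fetchall()

# Limit-early hack: only the 100 most recent posts feed the rest of the
# query. Far less work, but an old topic whose posts all fall outside
# the window silently drops off the list -- the bug described above.
early = conn.execute("""
    SELECT topic_id, MAX(id) AS last_post FROM (
        SELECT id, topic_id FROM post ORDER BY id DESC LIMIT 100)
    GROUP BY topic_id ORDER BY last_post DESC LIMIT 20""").fetchall()

print(early == late)  # agrees on this evenly spread data; not guaranteed
```

On this evenly distributed toy data the two queries happen to agree, which is exactly why the hack looked fine until an old topic got a late reply.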
Learnings about postgres – locks
I ran a really simple statement to add a column to a table:
alter table jforum_topics_forums add column last_post_time TIMESTAMP without time zone not null default now();
Luckily I was on the test server because I had to kill it after 5 minutes. At first, I thought the problem was setting the field to a value since it had to go through all the records. That wasn’t the problem though. The problem was that postgres was waiting on a lock.
SELECT * FROM pg_stat_activity;
select * from pg_locks where pid= 4503 and granted='f';
Running the above SQL showed me that Postgres was waiting on an exclusive lock. After I shut down the forum, the alter statement ran almost instantaneously. The actual stored procedure to populate the new field (based on another table) took a few minutes. But that makes sense, as it was a stored procedure doing half a million queries.
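The blocked-ALTER situation is easy to reproduce in miniature. This is a hypothetical SQLite sketch, not Postgres: one connection plays the running forum holding a write lock, and a second connection’s ALTER cannot proceed until the first lets go (SQLite gives up with an error where Postgres would queue and wait).

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# "The forum": holds an exclusive write lock mid-transaction.
app = sqlite3.connect(path, timeout=0.1, isolation_level=None)
app.execute("CREATE TABLE jforum_topics_forums (topic_id INTEGER)")
app.execute("BEGIN EXCLUSIVE")
app.execute("INSERT INTO jforum_topics_forums VALUES (1)")

# "The DBA session": the ALTER cannot get its lock while the app holds one.
dba = sqlite3.connect(path, timeout=0.1)
try:
    dba.execute("ALTER TABLE jforum_topics_forums "
                "ADD COLUMN last_post_time TEXT")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

app.execute("COMMIT")  # the equivalent of shutting down the forum
dba.execute("ALTER TABLE jforum_topics_forums ADD COLUMN last_post_time TEXT")

print("blocked while the app held its lock:", blocked)
```

Same lesson as above: the DDL itself was cheap; the wait for the lock was the whole five minutes.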
Testing with lots of data
Everything went fine on my machine. On the test server (which does have lots of data), I realized that I forgot to add the index that uses the new last post time column. That was the entire point of this exercise! And it goes to show how important it is to have production volumes of test data.
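The missing piece was roughly this kind of index (the table and column names are taken from the ALTER statement earlier; the index name and exact production DDL are assumptions). A small SQLite sketch showing that, with the index in place, the forum-list style page query is answered via the index rather than a scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jforum_topics_forums "
             "(topic_id INTEGER, forum_id INTEGER, last_post_time TEXT)")

# The index the whole exercise was about: lets the forum page find the
# most recently active topics without scanning the half-million-row table.
conn.execute("CREATE INDEX idx_forum_last_post "
             "ON jforum_topics_forums (forum_id, last_post_time)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT topic_id FROM jforum_topics_forums "
    "WHERE forum_id = ? ORDER BY last_post_time DESC LIMIT 20", (1,)
).fetchall()
print(plan)  # the plan detail should mention idx_forum_last_post
```

Running explain (here, SQLite’s EXPLAIN QUERY PLAN) is also the cheapest way to catch a forgotten index before the test server does it for you.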