Conor McDermottroe

I don't like kludgey solutions to problems. They catch up with you eventually, and they're usually more expensive in the long run. Unfortunately, as with buying a house, sometimes you have to take on some debt now rather than spend decades trying to save enough to buy outright.

Recently I've been trying to squeeze some more performance out of a LAMP application - forum software to be precise - and I've been forced to compromise a little. The latest challenge was a query like this:

SELECT postid
FROM post
WHERE threadid = 1234
  AND visible = 1
ORDER BY timestamp
LIMIT 30, 15;

It pulls out the IDs of the posts that should be displayed on a given page of a particular thread. In this case, it's the third page of the thread with the ID of 1234. Pretty simple, and fast enough. The problem happens when you have a thread with over 5,000 pages at 15 posts per page. Then a query for page 2,820 looks like this:

SELECT postid
FROM post
WHERE threadid = 1234
  AND visible = 1
ORDER BY timestamp
LIMIT 42300, 15;
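The LIMIT offsets in these queries are just arithmetic on the page number and page size. A minimal sketch, assuming the zero-based page indexes that the offsets above imply (page index 2 gives LIMIT 30, 15):

```python
# Map a page of a thread onto a MySQL "LIMIT offset, count" pair.
# Assumes zero-based page indexes, which is what the offsets in the
# queries above imply (page index 2 -> LIMIT 30, 15).
PAGE_SIZE = 15

def limit_clause(page_index, page_size=PAGE_SIZE):
    """Return the (offset, count) pair for the given page."""
    return (page_index * page_size, page_size)

print(limit_clause(2))     # (30, 15)    -> LIMIT 30, 15
print(limit_clause(2820))  # (42300, 15) -> LIMIT 42300, 15
```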

Now the query is slow because it has to sort the results by timestamp and then seek all the way through that sorted list until it finds the 15 IDs it wants. Worse, the query plan looks something like this (some columns removed for formatting purposes):

+-------------+------+---------------+----------+-------+-------+-----------------------------+
| select_type | type | possible_keys | key      | ref   | rows  | Extra                       |
+-------------+------+---------------+----------+-------+-------+-----------------------------+
| SIMPLE      | ref  | threadid      | threadid | const | 76161 | Using where; Using filesort |
+-------------+------+---------------+----------+-------+-------+-----------------------------+

One very obvious problem here is the "Using filesort" part. No one wants to sort that many rows on every page view. The simplest fix is to add an index covering the timestamp so that the entries can be read from the index in sorted order.

ALTER TABLE post ADD INDEX tvt (threadid, visible, timestamp);

A little over an hour later, the query plan is now a bit better:

+-------------+------+---------------+-----+-------+-------+-------------+
| select_type | type | possible_keys | key | ref   | rows  | Extra       |
+-------------+------+---------------+-----+-------+-------+-------------+
| SIMPLE      | ref  | threadid,tvt  | tvt | const | 76161 | Using where |
+-------------+------+---------------+-----+-------+-------+-------------+

Testing this change shows approximately a 3x speedup. That sounds great until you realise you're going from a 6-second query to a 2-second query. It's still far too slow, and the reason is that we're still scanning a huge amount of data.

We have a good pool of memcached servers, so perhaps the sensible option is to cache the results of the query. Unfortunately, there are 5 different page sizes and other user-configurable bits and bobs that make the query hard to cache as-is. The solution I came round to is to cut down on the number of rows being paged through. The easiest way to do that is to calculate "hints" for the query so that it can skip most of the data in one go.

The end result is something like this:

-- limit = page_number * page_size
-- limit_rounded = floor(limit / 1000) * 1000
-- limit_new = limit - limit_rounded
--
-- Check memcached for the hint for {1234, limit_rounded}
-- If memcached returns a miss, then calculate the hint like so:
SELECT timestamp
FROM post
WHERE threadid = 1234
  AND visible = 1
ORDER BY timestamp
LIMIT @limit_rounded, 1;
-- Stash that timestamp in memcached.
--
-- Now actually run the query:
SELECT postid
FROM post
WHERE threadid = 1234
  AND visible = 1
  AND timestamp >= @hint
ORDER BY timestamp
LIMIT @limit_new, 15;
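Pulled together, the flow above looks something like this. It's only a sketch (the real application is PHP): a plain dict stands in for memcached, the two query functions stand in for the SQL above, and all the names are illustrative.

```python
HINT_BLOCK = 1000  # hints are calculated per 1,000-post block
hint_cache = {}    # stand-in for memcached: {(threadid, limit_rounded): timestamp}

def fetch_page(threadid, page_index, page_size, query_hint, query_page):
    limit = page_index * page_size
    limit_rounded = (limit // HINT_BLOCK) * HINT_BLOCK
    limit_new = limit - limit_rounded

    key = (threadid, limit_rounded)
    hint = hint_cache.get(key)
    if hint is None:
        # Cache miss: run the slow hint query once and stash the result.
        hint = hint_cache[key] = query_hint(threadid, limit_rounded)

    # The hinted query now pages through at most HINT_BLOCK rows
    # instead of the whole thread.
    return query_page(threadid, hint, limit_new, page_size)

# Toy stand-ins for the two SQL queries, over a fake thread of posts
# whose timestamps are unique and one second apart.
posts = [(postid, 1000000 + postid) for postid in range(76161)]

def query_hint(threadid, limit_rounded):
    # SELECT timestamp ... ORDER BY timestamp LIMIT limit_rounded, 1
    return posts[limit_rounded][1]

def query_page(threadid, hint, limit_new, page_size):
    # SELECT postid ... AND timestamp >= hint ORDER BY timestamp
    #   LIMIT limit_new, page_size
    matching = [postid for postid, ts in posts if ts >= hint]
    return matching[limit_new:limit_new + page_size]

page = fetch_page(1234, 2820, 15, query_hint, query_page)
print(page[0], page[-1])  # 42300 42314
```

On any later request for a page in the same 1,000-post block, the hint comes straight from the cache and only the fast query runs.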

Since the first query is only run on cache misses, we're really only interested in the performance of the second one. Here's an example query plan for a page near the end of the thread:

+-------------+-------+---------------+-----+------+------+-------------+
| select_type | type  | possible_keys | key | ref  | rows | Extra       |
+-------------+-------+---------------+-----+------+------+-------------+
| SIMPLE      | range | threadid,tvt  | tvt | NULL | 301  | Using where |
+-------------+-------+---------------+-----+------+------+-------------+

Far fewer rows were examined, so the query executed much faster (0.01 seconds). All should be well, but I'm still left with a bunch of bugs (and these are just the ones I can think of immediately):

  • The timestamp is only gated in one direction, so pages at the start of the thread are much slower than pages at the end. This is (barely) acceptable for two reasons: 1) people tend to read the newer posts, not the older ones, and 2) the cost of two cache misses would be in the 4-second range.
  • If two posts are made in the same second and span a 1,000-post boundary, the paging will be off.
  • If a post is deleted or hidden in the middle of a thread, the paging will be off by one until the cache expires.
  • If memcached disappears and the cache call always misses, then the delay will be roughly twice the length of the unhinted query (4-5 seconds).

I'm not happy introducing all those bugs, but the performance requirements dictated that some compromises be made. The original query was being run somewhere in the region of 400,000 times per day, and all that time adds up. Overall I think it was a necessary bodge, but I'm already dreading the day when I have to find a less buggy solution to the problem.

What do you think? Was it worth it? Is there a bug-free way of doing that query without taking much too long?

I was participating in a thread on Boards.ie recently where the original poster asked the question:

"[W]hat do I now need to improve on and learn about in order to get any kind of programming/software development job?"

As the thread progressed I mentioned what I'd expect a novice programmer to know and how I used to go about interviewing, but the part that really got me thinking was when I mentioned some books that I have sitting on my bookshelf. Which ones were good for helping me improve as a programmer, and why? I answered on the thread, but I figured I should expand on my choices here:

The Pragmatic Programmer: From Journeyman to Master - It's pretty much a collection of good advice for programmers who are trying to improve themselves professionally. Most of the advice is stuff you'll learn sooner or later, but it's worth seeing it written down in an easily readable form. The authors have since branched out into publishing a whole load of other books (none of which I've read yet) and producing a podcast.

Programming Pearls - A collection of small but tricky problems with really nice worked solutions. The chapters were originally articles in the Communications of the ACM, and each is arranged around a theme with sample problems at the end (solutions are at the back of the book). You could think of this book as a sort of "Knuth-lite". I find reading it makes me want to open an editor and start cranking out code.

Design Patterns: Elements of Reusable Object-Oriented Software (a.k.a. "the Gang of Four book") - A lot of object-oriented problems seem to recur over and over, so it's worth being able to spot them and know some high-level solutions. You get more out of this book if you read it, work for a year, and then come back and read it again. On the second and subsequent readings, I've found myself saying "Ah, that's what I was doing when I implemented X". It's also useful as a Rosetta Stone for communicating with senior developers who are full of themselves.

The Mythical Man-Month - This is a pretty old collection of essays, but if you look past the antiquated bits (like punch cards and paper manuals) you'll find a lot of wisdom. It's more a book on project management than software development, but it's well worth a read; after all, every software developer is at least partly a project manager. File this book under "learn from the mistakes of others".

Joel on Software - I don't actually own a copy of this; I borrowed it from a friend. All the essays that make up the book are available on http://www.joelonsoftware.com, so there's no need to buy it unless you feel like supporting the author. It's a mixed bag, but it's the collected thoughts of a successful programmer, so it's worth picking through. Joel helpfully includes a "Top 10" list on his blog, which is a good place to start. Along with Jeff Atwood he created Stack Overflow, an excellent programming Q&A site; the accompanying podcast is also worth a listen.

Apart from those books, I'd also recommend consuming as many technology blogs and podcasts as you have time for.

I'm always looking for new stuff to read and listen to, so if you have any suggestions, please leave a comment.

Full marks to whoever decided to use sheep as a tool for advertising televisions. Just goes to show, ewe shouldn't feel sheepish about bringing up left field ideas. :)

Want to add Google Analytics tracking to all the non-HTML resources on your site? How about to the outbound links to other websites? If you're like me, you've considered it and then rejected it because adding the tracking code manually is too much hassle.

Time for jQuery to come to the rescue. By adding the tracking code automatically to the links that need it you can avoid the hassle of editing all of your existing pages.

$(document).ready(
    function () {
        $("a").click(
            function () {
                if (this.hostname && this.hostname != location.hostname) {
                    // Outbound link: record it as a virtual page under
                    // /exit/. (encodeURIComponent handles characters that
                    // the older escape() function mangles.)
                    pageTracker._trackPageview('/exit/' + encodeURIComponent(this.href));
                } else {
                    // Internal link: pages ending in "/" or ".php" already
                    // carry the normal tracking code, so skip those here.
                    var path = this.pathname;
                    if  (
                            (path.substring(path.length - 1) != "/") &&
                            (path.substring(path.length - 4) != ".php")
                        )
                    {
                        pageTracker._trackPageview(path);
                    }
                }
            }
        );
    }
);

The following caveats apply:

  1. You need to use the newer version of the Google Analytics tracking code (ga.js, not urchin.js).
  2. It assumes that all links pointing to a URL ending in / or .php have tracking code installed. If you have other readily identifiable URLs that you want to exclude then exclude them in the obvious place above.

Spot anything that looks wrong? Does this not work on your browser of choice? Let me know.

According to Larry Wall, there are three great virtues of a programmer: laziness, impatience and hubris. I seem to have a surplus of the first and so have resisted blogging for a long time. Maybe my impatience and hubris have overtaken the laziness.

This may be the first of many blog posts or I may give up after a few. Either way, enjoy it while it lasts.