
Database Optimization for Rails Apps

by Gwyn Morfey

Summary

  • Use STRAIGHT_JOIN if MySQL is doing something silly
  • MySQL Datetimes are not slow
  • MyISAM is much, much faster than InnoDB
  • Consider MySQL partitioning; it's not hard
  • Watch out for background optimisation and query caching
  • Log everything you do, and why

The reporting application I'm working on does a date-range select against a table with ten million rows. It took over two minutes to execute. I cut it down to around two seconds without significantly changing the application. Here's how.

Straight Joins

The original SQL statement was 12 lines, featuring three joins, a group-by and a computed column.

To start with, I used EXPLAIN SELECT to look at the execution plan. MySQL had made a bad guess and was table-scanning a four-million-row related table that could have been joined by ID instead.

I added STRAIGHT_JOIN to the SQL statement and dropped the execution time to 20 seconds.
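The original twelve-line query isn't reproduced here, but the change amounts to something like this sketch (the table and column names are hypothetical):

-- Hypothetical tables; the real query had three joins and a GROUP BY.
SELECT STRAIGHT_JOIN events.id, related.name
FROM events
JOIN related ON related.id = events.related_id;

STRAIGHT_JOIN makes MySQL join the tables in the order they appear in the query, instead of letting the optimiser pick, which is exactly what you want when the optimiser's guess is wrong.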

Use STRAIGHT_JOIN if MySQL is doing something silly.

Finding the Weak Point

Progress: MySQL was now table-scanning the ten-million-row 'events' table to satisfy the WHERE clause, then joining the other three tables by ID, grouping and summing. I cut the query down progressively to a simple "SELECT count(*) FROM events WHERE starttime < '2006-12-30 08:30:00' AND endtime > '2006-12-30 08:15:00'" and found that performance didn't change significantly.

This isolated the slow part of the query. My original plan had been to denormalize the table structure, but there was no point.

At this point I built a test database of two million rows on my local machine with only the one table in question, and only the three columns that were needed.
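A minimal version of that test table might look something like this (the column names come from the queries above; the rest is assumed):

-- Cut-down local test table: just the columns the slow query touches.
CREATE TABLE events (
  id INT NOT NULL AUTO_INCREMENT,
  starttime DATETIME,
  endtime DATETIME,
  KEY (id)
);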

Indexes

I added indexes on starttime, endtime, and (starttime,endtime), but the MySQL optimiser chose not to use them. When I forced their use with FORCE INDEX, performance deteriorated (!). OPTIMIZE TABLE made no difference.
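For reference, those experiments looked roughly like this (the index names are mine):

-- Add the three candidate indexes.
ALTER TABLE events
  ADD INDEX idx_start (starttime),
  ADD INDEX idx_end (endtime),
  ADD INDEX idx_start_end (starttime, endtime);

-- Force the compound index the optimiser refused to pick.
SELECT count(*) FROM events FORCE INDEX (idx_start_end)
WHERE starttime < '2006-12-30 08:30:00'
AND endtime > '2006-12-30 08:15:00';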

Dead End: Int Conversion

The application needs only 15-minute resolution on the date times, so I created start_int and finish_int columns and filled them:

alter table events add column start_int int, add column finish_int int;
update events set start_int = floor(unix_timestamp(starttime)/900);
update events set finish_int = floor(unix_timestamp(endtime)/900);

Then I created indexes on the columns and tested again. The plan here was that, by reducing the resolution, there'd be fewer unique values and the indexes would be more effective.
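The indexes and the int form of the test query would have looked roughly like this (index names are illustrative, and the boundary handling is only approximate at 15-minute resolution, which was all the application needed):

CREATE INDEX idx_start_int ON events (start_int);
CREATE INDEX idx_finish_int ON events (finish_int);

-- The same range test, restated against 15-minute buckets.
SELECT count(*) FROM events
WHERE start_int < floor(unix_timestamp('2006-12-30 08:30:00')/900)
AND finish_int >= floor(unix_timestamp('2006-12-30 08:15:00')/900);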

It didn't work. Tablescanning against datetimes was faster than indexed lookup against ints.

MySQL DateTimes are not slow.

Switching To MyISAM

I'd noticed that my test database was performing much faster than the production database, even taking into account the smaller size. Running SHOW CREATE TABLE against the two revealed that I was using MyISAM, but the production database had InnoDB. I converted my local database to InnoDB and took a speed hit on the order of 6x.

MyISAM is much faster than InnoDB. (but see notes)

Note that the conversion process is hideously slow; it's faster to create a new table and load it up with your test data yourself. If you do this, use extended inserts; they're about 25x faster than single-row inserts.
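For reference, the engine switch is a one-line ALTER, and an extended insert simply batches many rows into one statement (the values here are made up):

-- The slow way: convert the table in place.
ALTER TABLE events ENGINE = MyISAM;

-- Extended insert: one statement, many rows.
INSERT INTO events (starttime, endtime) VALUES
('2006-12-30 08:00:00', '2006-12-30 08:14:59'),
('2006-12-30 08:15:00', '2006-12-30 08:29:59'),
('2006-12-30 08:30:00', '2006-12-30 08:44:59');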

You give up some goodies, including transactions, by making the switch.

Partitioning

The application never queries over a time range greater than 15 minutes, and always aligns its requests on 15-minute boundaries. If we had to table-scan, I could at least use this to make sure we didn't scan all ten million rows.

I installed MySQL 5.1, which is the first release that supports partitioning, and created a new table using MyISAM, with a partition per 3 months, copying data from the old table in one hit:

CREATE TABLE `events2` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `starttime` datetime DEFAULT NULL,
 `endtime` datetime DEFAULT NULL,
 KEY `id` (`id`),
 KEY `starttime` (`starttime`),
 KEY `endtime` (`endtime`),
 KEY `starttime_and_endtime` (`starttime`,`endtime`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
PARTITION BY RANGE (TO_DAYS(starttime)) (
    PARTITION p0 VALUES LESS THAN (732312),
    PARTITION p1 VALUES LESS THAN (732677),
    PARTITION p2 VALUES LESS THAN (732767),
...
    PARTITION p17 VALUES LESS THAN MAXVALUE 
) SELECT * FROM events;

I'm using TO_DAYS because RANGE partitions can only be defined in terms of a single integer expression, not a datetime.
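The boundary values come straight from TO_DAYS; the p1 boundary above, for example, is the first day of 2006:

SELECT TO_DAYS('2006-01-01');  -- returns 732677, the p1 boundary above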

Now I could adjust my query to address a single partition only:

SELECT count(*) FROM events WHERE starttime < '2006-12-30 08:30:00'
AND endtime > '2006-12-30 08:15:00' AND starttime > '2006-12-01 00:00:00';

I used EXPLAIN PARTITIONS SELECT ... to confirm that MySQL was accessing only a single partition.
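In full, that check looks like this; the partitions column of the output should name just one partition:

EXPLAIN PARTITIONS
SELECT count(*) FROM events WHERE starttime < '2006-12-30 08:30:00'
AND endtime > '2006-12-30 08:15:00' AND starttime > '2006-12-01 00:00:00';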

The benefit of this particular partitioning strategy is that it's transparent to the application. More advanced approaches involve splitting records across database _servers_, rather than simply changing the 'create table' statement.

Consider MySQL partitioning; it's not hard.

Results

The full query originally took around 152 seconds to run. It's now down to around two seconds, or a 75x speed improvement. This is running essentially the same query, and returning data in the same format; the only change necessary at the application level was to reword the SQL slightly.

General Tactics

Those were the specific victories. Here are the overall tactics, which are more generally applicable:

I found it helpful to think not about what my application needed to know, but what it didn't need to know. For example, rules like "we never run a query that does not align on a 15-minute boundary" can simplify the problem.

Nothing substitutes for benchmarks. Things that ought to be fast are sometimes slow. Things that ought to be slow are sometimes fast.

But don't trust the benchmarks. Sometimes I'd return to my machine to find the *same query* running in less than a second instead of over a minute. I was using SQL_NO_CACHE, so I think background optimisation is responsible here. Since the database was very large, it's possible that OS X's virtual memory management was also interfering.
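The query-cache half, at least, is easy to rule out: SQL_NO_CACHE tells MySQL not to serve the result from its query cache, so repeated runs do real work:

-- Bypass the query cache when benchmarking.
SELECT SQL_NO_CACHE count(*) FROM events
WHERE starttime < '2006-12-30 08:30:00'
AND endtime > '2006-12-30 08:15:00';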

Watch out for background optimisation and query caching.

The only reason I caught this was that I kept notes of every single query I ran, which machine I ran it on, the time it took to execute, and anything interesting about the results. It's a simple routine to create or edit the query in TextMate, then paste it into MySQL, instead of working in the MySQL window directly.

Log everything you do, and why.

I also kept notes about theories and ideas, inline with the SQL, and later marked them with PROVED / DISPROVED and the evidence, so that I could be sure I'd covered all the bases.

Conclusion and Notes

Tweaking the database lets you make huge gains in performance without modifying your complex application. The assertions above are only guidelines based on the evidence I collected on one project; ultimately it's your own benchmarks that will matter. Simplify the problem, record your investigations, and benchmark the results.