Wednesday, 30 May 2012

Data compression

Data compression not only significantly reduces the storage required for the database, but also improves throughput by reducing the I/O workload, at a modest cost in processing overhead. The storage cost savings can be important, but the reduction in I/O costs can be even more valuable. There are three possible ways to achieve this:

1. Use the COMPRESS()/UNCOMPRESS() functions in the database

2. Use InnoDB table compression (offered by the InnoDB plugin)

3. Compress/uncompress data in the client rather than in the database. This offloads both CPU and network load from the server: compression and uncompression use CPU cycles on the client machine instead of the database server, which can be appropriate for a distributed application with many clients, or where the client machines have spare CPU cycles.
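Approach 3 can be sketched with Python's zlib module; the helper names, and the idea of storing the compressed bytes in a BLOB column, are assumptions for illustration:

```python
import zlib

def compress_content(text: str) -> bytes:
    """Compress on the client; the returned bytes would be stored in a BLOB column."""
    return zlib.compress(text.encode("utf-8"))

def uncompress_content(blob: bytes) -> str:
    """Uncompress on the client after reading the BLOB back."""
    return zlib.decompress(blob).decode("utf-8")

# Repetitive text compresses well, so both storage and network traffic shrink:
# the bytes travel over the wire already compressed.
payload = "some long, repetitive row content " * 100
blob = compress_content(payload)
assert uncompress_content(blob) == payload
assert len(blob) < len(payload)
```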

We performed benchmarks (using the table structure mentioned in my previous post, binary-log-synchronization) to evaluate the impact of the above-mentioned data compression techniques (1 and 2 only). Test data was loaded using 8 parallel database sessions into the original table, into a compressed table, and into the original table with the COMPRESS() function applied to the "content" column.

Table                             Time to load 80k records    Data file size
Original table                    0m57.515s                   1.6GB
Compressed table                  1m9.613s                    648MB
Original table using compress()   0m35.883s                   156MB

The results show that inserting the data in compressed form (i.e. using the COMPRESS() function) is the fastest approach, and it also gives better compression than the InnoDB compressed table. With COMPRESS() we have the control to compress only selected columns, whereas an InnoDB compressed table compresses the entire table.
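MySQL documents the COMPRESS() output as a 4-byte little-endian uncompressed-length header followed by a zlib stream, so a client can produce or consume the same per-column format itself. The helper names below are illustrative, and this is a sketch rather than a drop-in replacement (for instance, it ignores the trailing "." MySQL may append when a result ends in a space):

```python
import struct
import zlib

def mysql_compress(data: bytes) -> bytes:
    # COMPRESS(): empty input yields an empty string; otherwise a 4-byte
    # little-endian uncompressed length followed by the zlib stream.
    if not data:
        return b""
    return struct.pack("<I", len(data)) + zlib.compress(data)

def mysql_uncompress(blob: bytes) -> bytes:
    # UNCOMPRESS(): read the length header, then inflate the zlib stream.
    if not blob:
        return b""
    (length,) = struct.unpack("<I", blob[:4])
    out = zlib.decompress(blob[4:])
    assert len(out) == length
    return out

sample = b"row content " * 1000
assert mysql_uncompress(mysql_compress(sample)) == sample
```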

We also performed benchmarks to read data from the original table, the compressed table, and the original table using the UNCOMPRESS() function (to uncompress the compressed data). A total of 800k lookups were performed using id=<>.

Table                                   Time to perform 800k lookups
Compressed table (using uncompress())
Original table                          6m29.504s

The test results show that lookups against the compressed table are at least 6 times faster than against the original table. The reason is that the compressed table is much smaller than the original table, so most of the data fits in memory, which reduces disk I/O.

1 comment:

  1. If your clients are on separate server(s) from the database, do the compression/decompression in the client. It is easier to scale clients than the database master.

    If you are compressing what was a TEXT field, use a BLOB field for the compressed equivalent. (Ditto for MEDIUMTEXT/MEDIUMBLOB, etc.)

    Rule of Thumb: English text, JSON, XML, code, etc., compress about 3:1. JPEG, PDF, etc., are already compressed; you should not bother compressing them again.

    If you are mixing compressible and incompressible stuff, or long and short values, do not bother with an "is_compressed" flag; simply compress everything.

    Alternatively, you could mix compressed and uncompressed values in the same BLOB -- check "HEX(LEFT(col, 2)) = '789C'". (At least, that is the magic for some client compression functions.)
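    The '789C' magic mentioned above is the two-byte zlib header produced at the default compression level; a quick client-side version of that heuristic, with the caveat that other compression levels produce different headers (such as 7801 or 78DA), so this check is an assumption tied to the default level:

```python
import zlib

def looks_zlib_compressed(blob: bytes) -> bool:
    # 0x78 0x9C is the zlib stream header at the default compression level,
    # i.e. the same bytes the SQL check HEX(LEFT(col, 2)) = '789C' matches.
    return blob[:2].hex().upper() == "789C"

assert looks_zlib_compressed(zlib.compress(b"hello world, hello world"))
assert not looks_zlib_compressed(b"hello world, hello world")
```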