Optimizing PostgreSQL Performance & Compression: pglz vs. LZ4
It is not uncommon for Timescale users to see compression rates of 90% or more when using our hybrid row-columnar storage engine, which we built using custom compression methods and specific compression algorithms.
We’ve long believed that TOAST (The Oversized-Attribute Storage Technique) is not good enough on its own for data compression in PostgreSQL. That is not to say we don’t use it; we just build on it.
While the main compression option built into PostgreSQL’s TOAST mechanism is pglz, a variant of the Lempel-Ziv (LZ) family of algorithms, we use array-based LZ4 compression for data types such as text or JSON columns. This allows us to gather data values from multiple rows into a single array and then compress the entire array, which is beneficial for many use cases.
In this blog post, we’ll compare pglz and LZ4 compression (supported in PostgreSQL since version 14) before showing you how enabling LZ4 in your database can boost your PostgreSQL performance.
Pglz vs. LZ4 for PostgreSQL Performance
As mentioned, among the LZ compression algorithms used to compress large values stored in TOAST tables, PostgreSQL’s TOAST mechanism uses pglz by default. This method is a variant of the Lempel-Ziv family of algorithms and aims to strike a balance between compression ratio and speed.
With the release of PostgreSQL 14, PostgreSQL began supporting LZ4, a lossless data compression algorithm that allows for the fast compression of table and index data in PostgreSQL. LZ4 is more efficient, less CPU-intensive, and provides better compression and decompression throughput.
This means that when using LZ4 compression, the storage requirements of tables and indexes can be significantly reduced while minimizing the performance impact of compressing and decompressing data during access.
LZ4 can be particularly useful for improving disk space usage and speeding up query performance due to reduced I/O load, especially for workloads where compressed data is frequently accessed.
For PostgreSQL and TimescaleDB users, these benefits are extremely relevant when scanning a compressed text or JSON column. Less time is spent decompressing the data, which boosts performance for almost every use case and reduces query times across the board.
Let's look at how you can set lz4 compression on your own box.
Setting Up LZ4 on Your PostgreSQL Instance
First, ensure that your PostgreSQL instance supports lz4. If you’re using the official PostgreSQL binaries or Docker images, you’re good to go. If you've built PostgreSQL from source code, make sure to pass the --with-lz4 flag to the configure utility before building PostgreSQL.
Once you've done that, connect to your database instance using psql. Then you can run:
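The statement itself isn’t shown above, but the standard PostgreSQL 14+ setting for this is default_toast_compression:

```sql
-- Switch the default TOAST compression method for the current session
SET default_toast_compression = 'lz4';
```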
This sets lz4 compression for your current session. To make this the default for all connections, add this to your postgresql.conf file:
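The relevant line uses the same parameter as the session-level command:

```
# postgresql.conf: make lz4 the default TOAST compression for all connections
default_toast_compression = 'lz4'
```

After editing the file, reload the configuration (for example with `SELECT pg_reload_conf();`) so the new default takes effect.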
TOAST will now default to lz4 compression in all cases. Changing the default TOAST compression is safe: previously compressed data is not re-compressed using the new algorithm, and only data compressed in the future will pick up the new default. In TimescaleDB, new chunks will be compressed using lz4.
You can also specify the compression method for a specific column by adding the COMPRESSION option in CREATE TABLE:
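For example (the table and column names here are illustrative; COMPRESSION is standard PostgreSQL 14+ syntax for TOAST-able columns):

```sql
-- The body column uses lz4; other columns keep the server default.
CREATE TABLE messages (
    id         bigint PRIMARY KEY,
    created_at timestamptz NOT NULL,
    body       text COMPRESSION lz4
);
```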
This can be useful if you want to use different kinds of compression algorithms on different columns or override default behavior.
So how does lz4 fare against pglz? Let’s compare them when used with Timescale compression.
Benchmarking Pglz vs. LZ4
As mentioned, we use custom compression methods and particular algorithms to help users compress their data:
- Gorilla compression for floats
- Delta-of-delta and simple-8b with run-length encoding compression for timestamps and other integer-like data types
- Whole-row dictionary compression for columns with a few repeating values plus LZ compression
- And LZ-based array compression for all other types
Timescale uses TOAST compression for some non-numeric data types. To see the impact of this compression, we did a benchmark that runs queries on a dataset with wide text columns: hackers.
The hackers benchmark runs aggregate queries over the pgsql-hackers e-mail archives, filtering over text columns. This is what the table looks like:
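The exact schema isn’t reproduced here; a simplified sketch with the kind of columns the benchmark relies on (all names except body_plain are assumptions) would be:

```sql
CREATE TABLE hackers (
    ts          timestamptz NOT NULL, -- message timestamp (assumed name)
    sender      text,                 -- assumed column
    subject     text,                 -- assumed column
    body_plain  text                  -- wide text column used in the queries
);
```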
The table has about 20 years of data with approximately one million rows. The chunk interval on the hypertable is two years, and compression is enabled:
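Assuming the hypertable is partitioned on its timestamp column (called ts in this sketch), the setup uses the standard TimescaleDB calls:

```sql
-- Convert the table to a hypertable with two-year chunks,
-- then enable TimescaleDB's native compression on it.
SELECT create_hypertable('hackers', 'ts',
                         chunk_time_interval => INTERVAL '2 years');
ALTER TABLE hackers SET (timescaledb.compress);
```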
The body_plain column is a very wide column with an average length of about 1,354 characters.
We then set the LZ algorithm to use:
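The exact statement isn’t shown here; assuming the session-level setting described earlier, it would be:

```sql
SET default_toast_compression = 'lz4';
```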
Then, after compressing all the chunks, we run the queries in the benchmark.
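Compressing every chunk of a hypertable can be done with TimescaleDB’s compress_chunk and show_chunks functions, for example:

```sql
-- Compress each chunk of the hackers hypertable in turn
SELECT compress_chunk(c) FROM show_chunks('hackers') AS c;
```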
These queries involve calculating the minimum, maximum, and number of rows that match specific predicates on text columns. These predicates involve WHERE ... LIKE '%...%' clauses, which require the database to decompress the data in order to read it. An example of one of the queries in the benchmark:
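The original query text isn’t reproduced here; a representative aggregate with a LIKE predicate (the column names other than body_plain are assumptions) would look like:

```sql
-- Min/max timestamp and row count for messages mentioning a keyword;
-- the '%...%' pattern forces decompression of every body_plain value.
SELECT min(ts), max(ts), count(*)
FROM hackers
WHERE body_plain LIKE '%vacuum%';
```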
We can see that query execution time for Q1 goes from 1.7 s to under a second (0.98 s), a whopping 72% improvement. On the entire test suite, which consists of similar aggregation queries, we see an average improvement of 13%.
The longest-running queries, i.e., the ones that involve reading more data, show the largest improvements. When run sequentially, the entire benchmark is 37.32% faster with lz4.
We also see improvements in the clickbench benchmark: for four queries involving text scans using the LIKE operator, we see an average improvement of 24.20%.
Conclusion
As you can see, lz4 provides a significant improvement to your PostgreSQL or TimescaleDB queries. And with very little effort, too: you just need to change one setting! We definitely recommend enabling it, especially if you work with non-numeric data and run queries that involve reading text columns.
Any new data that you ingest will automatically pick up the new compression setting and use lz4, providing performance benefits. And if you’re a Timescale Cloud customer, lz4 is enabled for you by default. You can try it for free for 30 days.