Optimizing PostgreSQL Performance & Compression: pglz vs. LZ4

  • Pglz vs. LZ4 for PostgreSQL Performance
  • Setting Up LZ4 on Your PostgreSQL Instance
  • Benchmarking Pglz vs. LZ4
  • Conclusion

It is not uncommon for Timescale users to see compression rates of 90% or more when using our hybrid row-columnar storage engine, which we built using custom compression methods and specific compression algorithms.

We’ve long believed that TOAST (The Oversized-Attribute Storage Technique) is not good enough on its own for data compression in PostgreSQL. That is not to say we don’t use it; we just build on it. 

While the main compression option built into PostgreSQL’s TOAST mechanism is pglz, a variant of the Lempel-Ziv (LZ) family of algorithms, we use array-based LZ4 compression for data types such as text or JSON columns. LZ4 compression allows us to gather data values across multiple rows into a single array and then compress the entire array, which is beneficial for many use cases.

In this blog post, we’ll compare pglz and the LZ4 compression algorithm (introduced in PostgreSQL 14) before showing you how enabling LZ4 in your database can boost your PostgreSQL performance.

Pglz vs. LZ4 for PostgreSQL Performance

As mentioned, PostgreSQL’s TOAST mechanism uses pglz by default to compress large values stored in TOAST tables. This method is a variant of the Lempel-Ziv family of algorithms and aims to strike a balance between compression ratio and speed.

With the release of PostgreSQL 14, PostgreSQL began supporting LZ4, a lossless data compression algorithm that allows for the fast compression of table and index data in PostgreSQL. LZ4 is more efficient, less CPU-intensive, and provides better compression and decompression throughput. 

This means that when using LZ4 compression, the storage requirements of tables and indexes can be significantly reduced while minimizing the performance impact of compressing and decompressing data during access.

LZ4 can be particularly useful for improving disk space usage and speeding up query performance due to reduced I/O load, especially for workloads where compressed data is frequently accessed.
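
If you want a rough feel for that trade-off on your own data, you can build the same table twice, once per compression method, and compare sizes. A minimal sketch (the table names and synthetic payload here are ours, and highly repetitive text compresses unrealistically well, so treat the output as illustrative; the per-column COMPRESSION option is covered below):

CREATE TABLE t_pglz (doc text COMPRESSION pglz);
CREATE TABLE t_lz4  (doc text COMPRESSION lz4);

-- Load the same synthetic payload into both tables
INSERT INTO t_pglz SELECT repeat('some payload ', 5000) FROM generate_series(1, 1000);
INSERT INTO t_lz4  SELECT repeat('some payload ', 5000) FROM generate_series(1, 1000);

-- Compare the on-disk footprint, TOAST included
SELECT pg_size_pretty(pg_total_relation_size('t_pglz')) AS pglz_size,
       pg_size_pretty(pg_total_relation_size('t_lz4'))  AS lz4_size;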

For PostgreSQL and TimescaleDB users, these benefits are extremely relevant when scanning a compressed text or JSON column. Less time is spent decompressing the data, which translates into a performance boost and reduced query times for almost every use case.

Let's look at how you can enable lz4 compression on your own box.

Setting Up LZ4 on Your PostgreSQL Instance

First, ensure that your PostgreSQL instance supports lz4. If you’re using the official PostgreSQL binaries or Docker images, you’re good to go. If you've built PostgreSQL from source code, then make sure to pass the --with-lz4 flag to the configure utility before building PostgreSQL.
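
If you're not sure how your server was built, you can also ask PostgreSQL directly which compression methods it accepts; lz4 only appears when support was compiled in:

-- "lz4" shows up in enumvals only on builds configured --with-lz4
SELECT enumvals FROM pg_settings
WHERE name = 'default_toast_compression';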

Once you've done that, connect to your database instance using psql. Then you can run:

SET default_toast_compression=lz4;

This sets lz4 compression for your current session. To make this the default for all connections, add this to your postgresql.conf file:

default_toast_compression = 'lz4'
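
If you'd rather not edit the file by hand, ALTER SYSTEM achieves the same thing from a superuser session:

ALTER SYSTEM SET default_toast_compression = 'lz4';
SELECT pg_reload_conf(); -- apply the change without restarting the server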

TOAST will now default to using lz4 compression in all cases. Changing the default is safe: previously compressed data is not re-compressed with the new algorithm, and only data compressed from this point on picks up the new default. In TimescaleDB, new chunks will be compressed using lz4.
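
You can see this for yourself with pg_column_compression(), which reports the method used to compress any stored value. A quick sketch (the table and column names here are placeholders):

-- Rows written before the change still report 'pglz'; newer rows
-- report 'lz4'. NULL means the value was never compressed at all.
SELECT id, pg_column_compression(payload) AS method
FROM my_table
LIMIT 10;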

You can also specify the compression method for a specific column by adding the COMPRESSION option in CREATE TABLE:

CREATE TABLE lz4_example(id int, lz4_column text COMPRESSION lz4);

This can be useful if you want to use different kinds of compression algorithms on different columns or override default behavior.
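
Existing tables don't need to be rebuilt to switch methods, either. A small sketch on a hypothetical events table (as with the default setting, only values stored after the change use the new method):

ALTER TABLE events ALTER COLUMN payload SET COMPRESSION lz4;
-- \d+ events in psql now shows lz4 in the Compression column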

So how does lz4 fare against pglz? Let’s compare them when used with Timescale compression.

Benchmarking Pglz vs. LZ4

As mentioned, we use custom compression methods and particular algorithms to help users compress their data:

  • Gorilla compression for floats
  • Delta-of-delta and simple-8b with run-length encoding compression for timestamps and other integer-like data types
  • Whole-row dictionary compression (plus LZ compression on top) for columns with a few repeating values
  • And LZ-based array compression for all other types

Timescale uses TOAST compression for some non-numeric data types. To see the impact of this compression, we ran a benchmark that executes queries on a dataset with wide text columns: hackers.

The hackers benchmark runs aggregate queries over the pgsql-hackers e-mail archives, filtering over text columns. This is what the table looks like:

Column      Type
----------  ---------
id          int
sent        timestamp
subject     text
author      text
body_plain  text

The table has about 20 years of data with approximately one million rows. The chunk interval on the hypertable is two years, and compression is enabled:

create table hackers(id int, sent timestamp, subject text, author text, body_plain text);
select create_hypertable('hackers', 'sent', chunk_time_interval => interval '2 year');
alter table hackers set (timescaledb.compress);

The body_plain column is a very wide column with an average length of about 1,354 characters.
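
A figure like that is easy to verify with a simple aggregate over the schema above:

SELECT avg(length(body_plain)) FROM hackers;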

We then set the LZ algorithm to use:

SET default_toast_compression=lz4; -- 'pglz' in the second run

Then, after compressing all the chunks, we run the queries in the benchmark.
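
Compressing every chunk takes a single statement (a sketch using TimescaleDB's show_chunks and compress_chunk; in production you would typically let a compression policy handle this):

-- Compress each chunk of the hypertable in turn
SELECT compress_chunk(c) FROM show_chunks('hackers') c;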

These queries involve calculating the minimum, maximum, and number of rows that match specific predicates on text columns. These predicates involve WHERE ... LIKE '%...%' clauses, which require the database to decompress the data in order to read it. An example of one of the queries in the benchmark:

select min(sent), max(sent), count(*) from hackers where body_plain like '%vacuum%' group by author;

[Bar chart: execution times for the benchmark's aggregate queries with text-column predicates; pglz is consistently slower than lz4.]

We can see that query execution time for Q1 goes from 1.7 s to under a second (0.98 s), a whopping 72% improvement. On the entire test suite, which consists of similar aggregation queries, we see an average improvement of 13%.

The longest-running queries, i.e., the ones that involve reading more data, show the largest improvements. When run sequentially, the entire benchmark is 37.32% faster with lz4.

We also see improvements in the clickbench benchmark: for four queries involving text scans using the LIKE operator, we see an average improvement of 24.20%:

[Four bar charts: clickbench queries with LIKE text scans; lz4 outperforms pglz by 24.20% on average.]

Conclusion

As you can see, lz4 provides a significant improvement to your PostgreSQL or TimescaleDB queries. And with very little effort, too—you just need to change one setting! We definitely recommend enabling it, especially if you work with non-numeric data and run queries that involve reading text columns. 

Any new data that you ingest will automatically pick up the new compression setting and use lz4, providing performance benefits. And if you’re a Timescale Cloud customer, lz4 is enabled for you by default. You can try it for free for 30 days.
