Under the Hood: How TimescaleDB Achieves 90%+ Time-Series Data Compression

Picture this: You’ve just launched a new IoT monitoring service or a real-time financial dashboard. Everything is running smoothly. But as the weeks roll on, your database size ballooned from gigabytes to terabytes. Your AWS RDS bill is skyrocketing, disk I/O is bottlenecking, and queries that used to take milliseconds are now sluggishly dragging along.

If you're dealing with time-series data—metrics, logs, events, or sensor readings—you are fighting a constant battle against data gravity. Time-series data is relentless; it only grows, and it grows fast. Traditionally, developers had to choose between paying exorbitant storage fees or setting up aggressive retention policies that delete valuable historical data.

That is, until PostgreSQL met TimescaleDB.

One of TimescaleDB's most revered superpowers is its ability to compress data by 90% or more without sacrificing your ability to query it using standard SQL. But how does it actually do this under the hood? It’s not just running a generic gzip over your database files. Instead, it uses a brilliant combination of database architecture and specialized, type-specific compression algorithms. Let’s dive deep into how TimescaleDB compresses time-series data, and how you can leverage it in your own infrastructure.

The Core Challenge of Time-Series Compression

To understand why TimescaleDB’s approach is so clever, we first need to understand why traditional relational databases (like vanilla PostgreSQL) struggle with time-series storage.

PostgreSQL writes data in pages (typically 8KB). It writes data row-by-row (tuple-by-tuple). For a time-series table, a single row might look like this:

| timestamp (8 bytes) | device_id (4 bytes) | temperature (8 bytes) | status_code (2 bytes) |

Because Postgres stores data row-by-row, adjacent bytes on disk belong to different columns. Applying a standard compression algorithm like LZ4 or ZSTD across these rows yields mediocre results because the data types and values change rapidly from byte to byte. The timestamp is a highly predictable incremental number, the device_id is a repeating integer, and the temperature is a fluctuating float. Compressing them together is highly inefficient.

TimescaleDB solves this by fundamentally changing how data is laid out on disk before compression is applied.

Step 1: Transforming Rows into Columns

Before compressing anything, TimescaleDB performs a structural transformation. It takes a chunk of rows (TimescaleDB partitions data into physical "chunks" based on time intervals) and converts the row-oriented format into a columnar format.

Imagine we have the following simplified dataset in a chunk:

Timestamp  | Device_ID | Temp
-----------+-----------+-----
10:00:00   | A101      | 21.5
10:00:01   | A101      | 21.6
10:00:02   | A101      | 21.5

TimescaleDB group-compresses these rows by converting them into arrays of values for each column:

  • Timestamps: [10:00:00, 10:00:01, 10:00:02]
  • Device_IDs: [A101, A101, A101]
  • Temps: [21.5, 21.6, 21.5]

By grouping similar data types together, TimescaleDB can apply type-specific compression algorithms to each individual column. This is where the magic happens.

Step 2: Type-Specific Compression Algorithms

Once the data is arranged column-by-column, TimescaleDB applies the absolute best compression algorithm suited for that specific data type.

1. Delta-of-Delta Compression (for Timestamps)

Timestamps in time-series datasets are highly predictable. They usually arrive at regular intervals (e.g., every 1 second, every 10 seconds).

Instead of storing raw 8-byte integers or timestamps, TimescaleDB uses Delta-of-Delta encoding (inspired by Facebook's Gorilla paper). Here is how it works:

  • Store the first timestamp as a reference.
  • For subsequent timestamps, calculate the difference (the delta) from the previous timestamp.
  • If the data is sampled regularly, these deltas will be almost identical (e.g., always 1000 milliseconds).
  • Calculate the "delta-of-the-delta" (the difference between the current delta and the previous delta). If the interval is perfectly regular, this value is 0.
  • TimescaleDB uses a variable-length bit-packing scheme to store these zeros and small numbers using only a few bits instead of 64 bits (8 bytes) per timestamp.

2. XOR-Based Compression (for Floats)

Floating-point data (like temperatures, CPU usage, or stock prices) is notoriously hard to compress using traditional algorithms because the mantissa (fractional part) changes rapidly and looks like random noise to standard algorithms.

TimescaleDB uses XOR-based compression (also adapted from Gorilla). It compares the current floating-point value with the previous one using a bitwise XOR operation.

Because sequential measurements in time-series data usually don't change drastically, many of the leading and trailing bits in the XOR result are zeros. TimescaleDB discards these redundant zeros and packs only the meaningful bits. This often reduces 64-bit floats to just a fraction of their original size.

3. Run-Length Encoding / Dictionary Compression (for Low-Cardinality Data)

If you have columns like device_id, status_code, or location, you likely have highly repetitive data.

TimescaleDB uses Run-Length Encoding (RLE) for these. If you have 10,000 rows where the status_code is 'OK', instead of storing 'OK' 10,000 times, RLE stores it once along with a counter: ('OK', 10000).

For columns with a few distinct strings, it uses Dictionary Compression, mapping long strings to small, easily compressible integer IDs.

4. LZW / LZ4 / ZSTD (for Everything Else)

For arbitrary text columns or complex JSON payloads where specialized algorithms don't apply, TimescaleDB falls back to robust industry-standard compression algorithms like LZ4 or ZSTD, compressed over the column array.

Hands-On: Implementing Compression in TimescaleDB

Now that we understand the theory, let's look at how incredibly easy this is to set up in practice. You don't have to write custom microservices or manage complex serialization libraries; it's all handled declaratively inside PostgreSQL.

Creating a Hypertable

First, let’s create a standard PostgreSQL table to store server metrics and convert it into a TimescaleDB hypertable (which automatically partitions our data by time):

-- Create a standard table
CREATE TABLE sys_metrics (
    time TIMESTAMPTZ NOT NULL,
    device_id INT NOT NULL,
    cpu_utilization DOUBLE PRECISION,
    status TEXT
);

-- Convert it to a hypertable partitioned by the 'time' column
SELECT create_hypertable('sys_metrics', 'time');

Enabling Compression

Next, we enable compression on this hypertable. We tell TimescaleDB how we want to group our data using the segmentby option, and how we want it ordered using orderby.

ALTER TABLE sys_metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
);

Pro-Tip on Segmentby: The compress_segmentby column is crucial. TimescaleDB will group data by this column within each chunk before compressing. This means all data for device_id = 1 is grouped together, making the compression algorithms (like XOR and Delta-of-Delta) incredibly efficient because the sequence of data is highly correlated.

Setting up a Compression Policy

You don't want to compress data that is currently being written to, as compressed chunks are read-only (though TimescaleDB supports seamless upserts and deletes on compressed data in newer versions, it is still highly optimized for append-only operations).

Let's set up a policy to automatically compress data that is older than 7 days:

SELECT add_compression_policy('sys_metrics', INTERVAL '7 days');

Just like that, behind the scenes, a background worker will wake up periodically, take chunks older than 7 days, transform them from row-store to columnar-store, apply the specialized compression algorithms, and free up your disk space.

Verifying the Results

To see how much space you're actually saving, TimescaleDB provides a handy informational function:

SELECT * FROM hypertable_detailed_size('sys_metrics');

In real-world production setups, developers frequently see metrics look something like this:

  • Uncompressed Size: 1.2 TB
  • Compressed Size: 98 GB
  • Compression Ratio: 91.8%

The Impact on Query Performance

A common concern is: "Won't decompression slow down my queries?"

Counter-intuitively, compression often speeds up many common time-series queries. Here is why:

  1. Reduced Disk I/O: The slowest part of most database queries is pulling data off the physical SSD/HDD. Because compressed data is 10x smaller, the database has to read 90% less data from the disk.
  2. Smart Chunk Exclusion: TimescaleDB knows the time ranges of each compressed chunk. If your query only asks for the last 24 hours of data, it won't even touch the compressed historical chunks.
  3. Segmentby Pushdowns: If you query a specific device_id, TimescaleDB only decompresses the specific vector of data matching that device_id within the chunk, thanks to the compress_segmentby configuration.

Conclusion: Keep Your Data, Lose the Bill

As developers, we shouldn't have to throw away valuable historical insights just because storage costs are high. TimescaleDB's approach to time-series compression shows the power of combining traditional relational database reliability with specialized, modern column-store techniques.

By transforming rows to columns and applying Delta-of-Delta, XOR, and Run-Length Encoding, TimescaleDB lets you keep your database footprint small, your SQL queries fast, and your infrastructure bills highly optimized.

Are you running into time-series scaling issues in your current stack? Have you tried TimescaleDB, or are you running a different setup like InfluxDB or ClickHouse? Let me know in the comments below, and don't forget to subscribe to the "Coding with Alex" newsletter for more deep dives into database internals and cloud architecture!

Post a Comment

Previous Post Next Post