Design of a Tiger-style Time Series Database

I’m interested to hear how y’all would design a Tiger-style time-series database.

Background

Time-series databases are a type of database optimized for storing many series, where a single series can be represented like the following (in ZON):

.{
    .id = "f0e4c2f76c58916ec258f246851bea091d14d4247a2fc3e18694461b1816e13b",
    .data = .{
        .{
            .time = 12345134213143,
            .value = 3,
        },
        .{
            .time = 12312513431212,
            .value = 1,
        },
        .{
            .time = 12312513431225,
            .value = 5,
        },
    },
}

The ID is just a hash value that identifies the series. Many databases maintain indexes for looking up series with queries like:

find all series where location=headquarters, sensor_type=hvac 

This query might return 32 series, each being the temperature over time measured by one of the 32 HVAC units in the headquarters building. The result might be displayed as a 2D graph with 32 lines on it, the x-axis being time and the y-axis temperature.

Examples of time-series databases:

  • InfluxDB
  • TimescaleDB
  • TDengine

Problems

I’ve personally experienced some issues with existing time-series databases (mainly InfluxDB):

  1. Out-of-memory failures under high query load
  2. Performance issues, query load affects ingestion performance and vice versa
  3. High, unbounded disk usage: the disk fills up and the DB just dies.
  4. Bad interface design leads to unoptimizable ingestion (line protocol is a horrible data ingestion format)

Bad Interface Design

Let’s take a look at how InfluxDB ingests data (line protocol).

Our above dataset could be constructed by ingesting the following line-protocol:

hvac,location=headquarters compressor_temperature=3i 12345134213143
hvac,location=headquarters compressor_temperature=1i 12312513431212
hvac,location=headquarters compressor_temperature=5i 12312513431225

Glossing over things like “measurement”, “tags”, and “fields” (bad vernacular invented by InfluxDB), the first part identifies the series: “hvac,location=headquarters” is sorted and hashed, the hash looks up a series ID, and the data is inserted into that series on disk.
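To make that lookup concrete, here is a minimal Zig sketch of deriving a series ID from the canonical tag string. SHA-256 is an assumption on my part (the 64-hex-character ID in the example above looks like one), not a claim about InfluxDB’s actual implementation:

const std = @import("std");

// Sketch only: hash the canonical (already sorted) tag string to get a series ID.
// SHA-256 is assumed here because the example ID above is 64 hex characters.
pub fn main() void {
    // Tags must be in a canonical order before hashing, otherwise
    // "hvac,location=hq" and "location=hq,hvac" would map to different series.
    const series_key = "hvac,location=headquarters compressor_temperature";

    var digest: [std.crypto.hash.sha2.Sha256.digest_length]u8 = undefined;
    std.crypto.hash.sha2.Sha256.hash(series_key, &digest, .{});

    // This hex string is what gets stored as the series ID and used for lookups.
    const hex = std.fmt.bytesToHex(digest, .lower);
    std.debug.print("series id: {s}\n", .{hex[0..]});
}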

This ingestion format has the following disadvantages:

  1. Users must repeat themselves ad nauseam: every line contains redundant strings, and roughly 70% of our data above is just the repeated “hvac,location=headquarters compressor_temperature=”. InfluxDB can ingest gzip-compressed line protocol, but it still has to be expanded, sorted, and scanned before insertion.
  2. It’s a text format, not binary, so every ingest now has to convert this text as well.
  3. It’s comma, space, and newline delimited, with annoying escaping rules.
  4. You get recommendations in the docs like “sort your tags before you send them to the database, according to the Go sort() function…”

It has the following advantages:

  1. It’s pretty easy to construct a compliant implementation; most implementations are < 1000 LOC but are typically low-performance. The Python implementation, for example, can be made 3x more performant with some LRU caching of string replacements, which just illustrates how inefficient the format is.

The database accepts HTTP requests of gzipped line protocol. While simple in theory, wouldn’t it be cool to ingest, idk, maybe UDP datagrams, etc? Maybe even leave the whole problem of authentication, security, etc to a reverse proxy in front of the DB?

How can we provide a more reliable time-series database?

Out of memory failures

Follow a reserve-first principle. Allocate all the memory used by the database up front. Set concurrent query limits, a maximum amount of data returned per query, etc. Provide an easy-to-define configuration for the database that enables bounded memory usage.
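A minimal sketch of reserve-first in Zig, assuming one fixed allocation at startup that everything else draws from; the limits below are made-up configuration values, not recommendations:

const std = @import("std");

// Reserve-first sketch: grab one fixed region at startup and serve every later
// allocation from it. All limits are illustration values.
pub fn main() !void {
    const config = .{
        .memory_limit_bytes = 256 * 1024 * 1024, // reserved once, never grown
        .max_concurrent_queries = 8,
        .max_rows_per_query = 1_000_000,
    };

    // Reserve the whole budget up front; if this fails, we fail at startup,
    // not in the middle of a query storm.
    const backing = try std.heap.page_allocator.alloc(u8, config.memory_limit_bytes);
    defer std.heap.page_allocator.free(backing);

    var fba = std.heap.FixedBufferAllocator.init(backing);
    const allocator = fba.allocator();

    // Every query/ingest path would allocate from `allocator`; exhausting the
    // budget returns error.OutOfMemory instead of killing the process.
    _ = allocator;
}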

Performance issues

Provide a better interface for ingestion, one that can be serialized to disk more quickly.
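To make this concrete, here is one hypothetical shape such an interface could take (this is not an existing wire format): the client resolves the series ID once, and every point afterwards is a small fixed-size binary record.

const std = @import("std");

// Hypothetical binary ingest record, fixed-size so batches can be written to a
// WAL or ring buffer with a straight memcpy and no text parsing on the server.
const Point = extern struct {
    series_id: [32]u8, // e.g. the hash of the canonical tag set, resolved once
    time_ns: u64,
    value: f64,
};

pub fn main() void {
    // A batch is just a contiguous array of points; compare this with repeating
    // "hvac,location=headquarters compressor_temperature=" on every line.
    const batch = [_]Point{
        .{ .series_id = [_]u8{0} ** 32, .time_ns = 12345134213143, .value = 3 },
        .{ .series_id = [_]u8{0} ** 32, .time_ns = 12312513431212, .value = 1 },
        .{ .series_id = [_]u8{0} ** 32, .time_ns = 12312513431225, .value = 5 },
    };
    std.debug.print("{d} points, {d} bytes\n", .{ batch.len, @sizeOf(Point) * batch.len });
}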

Relevant tiger beetle talk:

Provide a better interface for querying (TBD).

High, unbounded disk usage

This one is harder. If we follow “reserve first”, perhaps the database could reserve its disk space at startup:

start-my-db --disk-size 100GB --time-max-years 5 --series-max-count 100k

The database could calculate the maximum possible disk usage for a 5-year ring buffer of 100k individual series and refuse to start up if the --disk-size parameter is insufficient. It could also create the fixed-size file at startup.
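A rough sketch of that startup calculation; the sample interval and bytes-per-point figures are pure assumptions and would really come from the configuration and the storage format:

const std = @import("std");

// Back-of-the-envelope sizing: 100k series, 5 years of retention, one point
// every 10 seconds, 16 bytes per uncompressed point. All assumptions.
pub fn main() !void {
    const series_count: u64 = 100_000;
    const retention_s: u64 = 5 * 365 * 24 * 60 * 60;
    const sample_interval_s: u64 = 10;
    const bytes_per_point: u64 = 16; // u64 timestamp + f64 value, uncompressed

    const points_per_series = retention_s / sample_interval_s;
    const worst_case = series_count * points_per_series * bytes_per_point;
    // Prints roughly 23,496 GiB -- far more than --disk-size 100GB, so either
    // the sample rate has to be part of the startup configuration, or the
    // sizing has to assume a compression ratio (the caveat discussed next).
    std.debug.print("worst case: {d} GiB\n", .{worst_case / (1024 * 1024 * 1024)});

    // Reserving the space could then be as simple as creating a fixed-size file.
    // Note that setEndPos alone may produce a sparse file; forcing real block
    // allocation (e.g. fallocate) would be needed for a hard guarantee.
    const file = try std.fs.cwd().createFile("tsdb.dat", .{});
    defer file.close();
    try file.setEndPos(100 * 1024 * 1024 * 1024);
}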

This is non-trivial, though. Disk usage depends on the compression strategy used (for example, differential encoding), and performance depends on how well we compress, the speed of the disk, etc.
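For illustration, this is what differential (delta) encoding does to the sorted timestamps from the example above; how small the deltas are, and therefore how much disk is really needed, depends entirely on how regular the data is:

const std = @import("std");

// Delta-encoding sketch: store the first timestamp in full, then only the gaps.
// Regular sample intervals make the deltas tiny and highly compressible.
pub fn main() void {
    const timestamps = [_]u64{ 12312513431212, 12312513431225, 12345134213143 };

    var deltas: [timestamps.len]u64 = undefined;
    deltas[0] = timestamps[0]; // first value kept verbatim
    for (timestamps[1..], 1..) |t, i| {
        deltas[i] = t - timestamps[i - 1];
    }
    // deltas = { 12312513431212, 13, 32620781918 }
    std.debug.print("deltas: {any}\n", .{deltas});
}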

The point is we want the user to be assured that if their disk works and the space is reserved, their database can never fall over due to disk usage.

An initial design

Given the above, I propose a design with the following characteristics:

  1. single node, single disk (a fully in-memory “disk” is easily done as well, if desired)
  2. static configuration provided at startup (a sketch follows this list)
  3. extremely limited query API, no query engine, no crap query programming language
  4. maximum disk usage is known at startup, disk is reserved at startup
  5. maximum memory usage is known at startup, memory is reserved at startup
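Since the series example above is already in ZON, the static configuration from point 2 could live in a ZON file too. A sketch, with every field name and number invented for illustration:

// tsdb-config.zon (hypothetical): everything that bounds memory and disk usage
// is stated up front, so the maximums can be computed and reserved at startup.
.{
    .disk_size_bytes = 107_374_182_400, // 100 GiB, reserved as one file at startup
    .memory_limit_bytes = 268_435_456, // 256 MiB, allocated once at startup
    .series_max_count = 100_000,
    .retention_years = 5,
    .sample_interval_seconds = 10,
    .max_concurrent_queries = 8,
    .max_points_per_query = 1_000_000,
}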

Which enables the following properties:

  1. DB is unlikely to fall over after startup.
  2. Monitoring the DB, setting up alerts for disk usage, etc., becomes less necessary.
  3. Users are forced to set expectations up front, and are less likely to be surprised by the consequences of their database filling up.

Open Questions

  1. Can “reserve-first” be extended to performance? What if the database could test its own performance at startup to provide similar guarantees? Perhaps it could spin-lock reserved CPUs (that seems wasteful)? Or perhaps it could just benchmark itself at startup and at regular intervals?
  2. What data ingestion format would you want?
  3. What query API would you want?

Disclaimer: I’m heavily using a classic RDBMS at work, have tinkered with DuckDB, and use HomeAssistant, but have no actual experience with time-series DBs.

To work around issues with the additional load caused by queries, you could use a second DB optimized for querying, like DuckDB, and feed it in batches from the TSDB. Of course, the data isn’t fully up-to-date then.

I guess the inefficient ingestion format is made this way for different reasons:

  • AFAIK many “modern” DBs are schema-less.
    This makes them more dynamic, but comes at a price.

  • The nature of time series is that usually there are no batch inputs. It’s always one record at a time.

  • I guess the DBs work without prepared statements.

Prepared statements are significant for performance in classical RDBMS. The idea is to transfer only the actual data to reduce network load, and even more to avoid overhead on the DB side.
In your example, that would mean:
Preparation: Tell the DB what the data you are about to send means and what the DB should do with it. The DB creates a cursor resource for it and gives you back an ID/handle for it.
Then you send this handle together with the record values (TS/measure) for each measuring timestamp (3 times in your example).
You can send one record at a time, or buffer multiple records at the client side and send them in batches.
When you’re finished, tell the DB to close the cursor resource.
That’s how one would work with a classic RDBMS.
But with a TSDB, you are never finished with sending data as long as you are connected, so the DB needs to keep the cursor open, or at least be able to save and restore it very quickly.
Should be possible, though.
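To make that flow concrete, here is a Zig sketch with an entirely hypothetical client API (no real driver looks like this); it only shows the prepare / send-many / close shape applied to time-series ingestion:

const std = @import("std");

// Entirely hypothetical client API; it only illustrates the
// prepare / send-many / close shape described above.
const Handle = u32;

const Client = struct {
    next_handle: Handle = 0,

    // "Preparation": describe the series once, get back a cheap handle.
    fn prepare(self: *Client, series_key: []const u8) Handle {
        _ = series_key; // a real client would send this to the server once
        self.next_handle += 1;
        return self.next_handle;
    }

    // Afterwards only (handle, timestamp, value) crosses the wire per point,
    // no repeated tag strings.
    fn insert(self: *Client, h: Handle, time_ns: u64, value: f64) void {
        _ = self; // stub: a real client would append to a network buffer here
        std.debug.print("handle={d} t={d} v={d}\n", .{ h, time_ns, value });
    }
};

pub fn main() void {
    var client = Client{};
    const h = client.prepare("hvac,location=headquarters compressor_temperature");
    client.insert(h, 12345134213143, 3);
    client.insert(h, 12312513431212, 1);
    client.insert(h, 12312513431225, 5);
    // For a TSDB the handle would stay open (or be cheaply restorable),
    // since the stream of points never really ends.
}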

Regarding the transfer format, a pure text format is simply easy to create in any language. However, binary formats are more precise and more compact, particularly for transferring numerical data. I’d be surprised if InfluxDB doesn’t support binary input data.


I have a feeling you are dealing with non-problems. Yes, some databases have trouble with out-of-space situations, but these are much less common than other problems. Even if you reserve disk/memory, you still need to deal with the disk being too slow, the network lagging, etc.

You will hardly beat a well-designed LSM tree with a WAL in front of it for ingest performance. InfluxDB is really not the state of the art in time-series DBs; even they are now rewriting it on top of Parquet files and DataFusion (an SQL engine). Look at ClickHouse for how efficient time series can be. Or even DuckDB, although that’s more for data already ingested; its CSV parsing and conversion to Parquet is insanely fast.

I think text format on input is still a good idea, you are rarely limited by that.

While it’s a very different project, I recently migrated one of my previous projects to Zig; you can see how I handle things there. It’s an inverted index represented in a segmented LSM tree, using two levels of segments and a WAL for in-memory segments. It’s not really optimized; I prefer readability over raw speed, if I can afford it. And it’s still plenty fast. GitHub - acoustid/acoustid-index: Minimalistic search engine searching in audio fingerprints from Chromaprint

The biggest obvious bottleneck is that you need to buffer input data somehow. If you send one row at a time and want it to be reliable, performance will be horrible. If you can accept losing e.g. the last few seconds of data, you can improve it significantly.
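A sketch of that trade-off with made-up constants: points accumulate in memory and get flushed either when the buffer fills or when the oldest buffered point exceeds the loss you are willing to accept.

const std = @import("std");

// Buffering sketch: a crash loses at most max_buffer_age_ns worth of data,
// in exchange for large sequential writes instead of one write per row.
// Buffer size and age limit are arbitrary illustration values.
const Point = struct { time_ns: u64, value: f64 };

const WriteBuffer = struct {
    points: [1024]Point = undefined,
    len: usize = 0,
    oldest_ns: u64 = 0,

    const max_buffer_age_ns: u64 = 2 * std.time.ns_per_s; // accept ~2s of loss

    // Returns true when the caller should flush the buffer to disk
    // as one big sequential write.
    fn push(self: *WriteBuffer, now_ns: u64, p: Point) bool {
        if (self.len == 0) self.oldest_ns = now_ns;
        self.points[self.len] = p;
        self.len += 1;
        return self.len == self.points.len or
            (now_ns - self.oldest_ns) >= max_buffer_age_ns;
    }
};

pub fn main() void {
    var buf = WriteBuffer{};
    const flush = buf.push(0, .{ .time_ns = 12345134213143, .value = 3 });
    std.debug.print("flush after first point? {any}\n", .{flush});
}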


Hello, regarding disks, I don’t entirely agree with you on security. You’re forgetting checksums, parity, and even RAID.

I’ve been working with databases for over 40 years.

As for speed, apart from the choice of professional database, you also need to consider the CPU, RAM, disk speed, RAID, number of processors, whether you use commit/rollback, backups, and finally the programming model.