Wait, what is my fleet doing?

Mika Boström
Published in Smarkets HQ
6 min read · Mar 20, 2018


A modern online system can have anywhere from a few dozen to hundreds, even thousands, of independent services. It is impossible for humans to be constantly aware of what any one of them is doing at any given time, let alone what is going on between them.

That information is collected and made available through constant monitoring. An engineering organisation lives and breathes through its monitoring setup.

Our frontend team monitors expensive pages’ rendering time and changes to their output

Visibility is like air — unseen but critical for survival

As a betting exchange, we have pretty tight time windows for reaction and remediation. We transmit and rely on near real-time data.[1] Major sports events can have their peak activity happen inside a couple of minutes. Sometimes even less. When things go wrong, we need to see it.

From my discussions with engineers at other companies, a number of online businesses can accept “things being slow” for a couple of minutes. Transient errors happen, and as long as the system recovers cleanly, it’s not an urgent matter. (But it still needs to be fixed.)

At Smarkets, a minute of sluggish response times is already too long: the peak trading window for an event may have been missed.

This time sensitivity means that our monitoring setup has to track and be able to react to events within time windows most other online businesses would consider outrageously narrow.

Telemetry vs. Monitoring

The term “monitoring” is somewhat ambiguous. The ambiguity comes from the fact that monitoring as a domain is large and complex. In any sufficiently large problem space, different people weight different aspects differently. Even common terminology becomes overloaded.

For the purpose of this article I will separate monitoring from telemetry. I have purposely avoided the word metric due to its semantic overload.

Telemetry is the data that is sent to, aggregated by and transmitted via a monitoring system. Monitoring is what that data is used for. And a series of uniquely identifiable data points is a dimension.

(Alerting is the act of raising a signal for humans, based on monitoring status.)
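To make these distinctions concrete, here is a minimal sketch in Python, using an in-memory dictionary and hypothetical names in place of a real time-series database: a dimension is a uniquely identifiable series, telemetry is the datapoints flowing into it, and monitoring is any decision made by reading it back.

```python
import time
from collections import defaultdict

# In-memory stand-in for a real time-series database.
store = defaultdict(list)

def emit(name, labels, value):
    """Telemetry: send one datapoint into the series identified by (name, labels)."""
    dimension = (name, tuple(sorted(labels.items())))  # a uniquely identifiable series
    store[dimension].append((time.time(), value))

def check(name, labels, threshold):
    """Monitoring: read the series back and decide whether it looks healthy."""
    dimension = (name, tuple(sorted(labels.items())))
    series = store.get(dimension, [])
    return bool(series) and series[-1][1] <= threshold

emit("request_latency_ms", {"service": "exchange"}, 7.5)
print(check("request_latency_ms", {"service": "exchange"}, threshold=10))  # True
```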

Events team are always on the lookout for tail latency changes and request error rates

Conflicting requirements

At the heart of any modern monitoring system, we find a complex beast: a time-series database.

Traditional databases are often tuned to support one of two different, conceptually opposite workloads. These workloads are known as read-heavy and write-heavy, and optimisations for one type of workload routinely make the database less suitable for the other.

A database tuned for a write-heavy workload needs to support a high volume of incoming writes. Under these conditions as much as 98% of operations may be writes. Data is read infrequently, and only in small amounts at a time. A good example of a write-heavy workload is accounting data.

Write-heavy systems must ensure that writes are persisted correctly, but they can tolerate small delays when reading.

Read-heavy databases, on the other hand, see lots of reads but only a few writes. In this type of workload the data is written infrequently and modified rarely, but accessed all the time. A good example of a read-heavy workload would be a user’s personal data or their authentication information.

A time-series database has to support both workload patterns with exceptionally good performance. These conflicting requirements explain why such databases are so different, and custom-built for just one purpose.

To better explain the conflicting requirements, I’ll use numbers from Smarkets’s fleet.

Our betting exchange consists of a number of individual services, which together generate more than 200,000 unique telemetry datapoints. Our monitoring system stores these measurements at 10-second intervals, so we average slightly over 20,000 datapoints written per second. However, the majority of the writes occur around these regular 10-second marks. We also have a couple of services that require measurements at one-second intervals, but luckily these are a minority and generate only a modest number of telemetry dimensions.
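As a back-of-envelope check on those figures (the dimension count and interval come from the paragraph above; the one-second burst window is an assumption, not a measured number):

```python
dimensions = 200_000   # unique telemetry datapoints per collection round
interval_s = 10        # seconds between measurements

print(dimensions / interval_s)   # 20,000 datapoints per second, averaged out

# If most writes cluster within roughly a second of each 10-second mark,
# the instantaneous write rate the database has to absorb is much higher.
burst_window_s = 1     # assumed width of the burst, for illustration only
print(dimensions / burst_window_s)   # ~200,000 writes per second at the peak
```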

Because monitoring is highly time sensitive, the written data must also be immediately available for reading. As we increase the scope and coverage of monitoring, we also increase the read activity. Each new visualisation or monitoring setup adds to the read load. Lots of small reads make up a significant load, so the combination exercises both read-heavy and write-heavy workload patterns. A regular database would be hard-pressed to perform well.

In a regulated (let alone time-sensitive) industry, accurate clocks are essential for reliable timestamps

Time-series based monitoring follows a simple logic. We request all the data points for a given set of dimensions for the N most recent time units. These requests must be served rapidly, without delay. We then check if any aggregate values (median, 99th percentile, maximum, etc.) are outside the boundaries we consider healthy. And if we detect a problematic state, we trigger an action.

This action can be an automatic response and remediation, or — as is often the case — to notify a human who figures out what the response should be.
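Sketched out in Python, with a hypothetical query function standing in for the time-series database client and made-up threshold values, the check loop looks roughly like this:

```python
import statistics

def quantile(values, q):
    """Nearest-rank quantile; good enough for an illustration."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def evaluate(dimension, window_s, limits, query, alert):
    """Fetch the last window_s seconds of a dimension and compare
    aggregate values against the boundaries we consider healthy."""
    points = query(dimension, window_s)   # list of (timestamp, value) pairs
    values = [value for _, value in points]
    if not values:
        alert(dimension, "no data")       # silence is also a problematic state
        return
    aggregates = {
        "median": statistics.median(values),
        "p99": quantile(values, 0.99),
        "max": max(values),
    }
    for name, value in aggregates.items():
        if name in limits and value > limits[name]:
            alert(dimension, f"{name}={value:.1f} over limit {limits[name]}")

# Hypothetical usage: request latency for one service over the last 60 seconds.
# evaluate("request_latency_ms{service=exchange}", 60,
#          {"median": 50, "p99": 200, "max": 500}, query=..., alert=...)
```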

A time-series database, when used for monitoring, is being written to very frequently, and read from almost constantly.

Unintended (ab)use cases

The demands for high reliability and availability, combined with the fact that any generated data point is available in a matter of seconds, allow for some interesting uses.

By design, monitoring data has to be readable from anywhere in the fleet. The clients must also be very simple to use, because if monitoring data were difficult to read, it would not be used voluntarily.

If you provide engineers with a highly available, always-on, easy-to-use service for transmitting and exposing arbitrary data, they will find uses that were not considered during the original design.

Like signaling.

One of our teams actually experimented with transmitting cross-service status change information through telemetry, and in the process ended up generating several thousand additional dimensions for the services in question. (Usually a service exposes a couple of thousand, at most.)

However, using a best-effort delivery channel for cross-service communication is not exactly a good idea. Monitoring systems are, by design, fire-and-forget delivery channels. Individual messages may get lost, or telemetry collection may suffer transient failures.

But most importantly, encoding signaling as monitoring datapoints is practically guaranteed to generate a lot of telemetry dimensions. Such bloat can make the monitoring system less reliable overall, which is why those maintaining the machinery are constantly alert to scope creep.
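To see how quickly the dimension count balloons, consider a hypothetical scheme in which every state transition of every peer becomes its own series (the service and state names below are made up; the point is the cross product):

```python
from itertools import permutations

services = [f"service-{i}" for i in range(10)]   # peers being signaled about
states = ["starting", "ready", "degraded", "draining", "stopped"]

# One series per (service, from-state, to-state) combination.
transition_dimensions = [
    f"status_change{{service={s},from={a},to={b}}}"
    for s in services
    for a, b in permutations(states, 2)
]
print(len(transition_dimensions))   # 10 services * 5 * 4 transitions = 200 extra series

# Multiply that by every consumer that mirrors these series, and several
# thousand additional dimensions is an easy figure to reach.
```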

The latency windows for the exchange team range from hundreds of milliseconds, for database I/O and the accounting data service, to less than 10 milliseconds for message delivery

Final Words

The heart of any reliably running online service is its monitoring system — things that cannot be observed are accidents waiting to happen. When dealing with real-time activities and trading, accidents can happen pretty fast.

At Smarkets we have a monitoring setup that can react to changes in a matter of seconds, and which the teams rightfully expect to work like our site — at all times.

Footnotes

1: The term “near real-time” is notoriously imprecise, but for sports betting purposes, we can consider any delay between 1 and 10 seconds to be near real-time.
