SQR-068: Sasquatch: beyond the EFD

  • Angelo Fausti

Latest Revision: 2022-08-26

1 Abstract

Sasquatch is a service for recording, displaying, and alerting on Rubin’s engineering data.

It unifies SQuaSH [1], used for tracking science performance metrics, and the Engineering and Facilities Database (EFD) [2], used to record observatory telemetry data.

New features include two-way data replication between the Summit and USDF Sasquatch instances, making science performance metrics computed at the USDF and observatory telemetry produced at the Summit available locally at both sites.

Sasquatch can be easily extended to record other time-series data, such as camera diagnostic metrics, rapid analysis metrics, and scheduler events.

With this third generation, we took the opportunity to rebrand the service as Sasquatch: the service that manages the EFD and other time-series databases.

Sasquatch is currently deployed at the test stands, the Summit, and the USDF through Phalanx, integrated with the Rubin Science Platform.

2 Overview

The Sasquatch architecture is based on InfluxDB, an open-source database optimized for efficient storage and analysis of time-series data, and Apache Kafka, which is used as a write-ahead log to InfluxDB and for data replication between sites.

InfluxDB OSS 2.x introduces Flux as its native language, combining querying and data analysis functionality. This version also has a new task engine to process time-series data with Flux, and a new Python client.

Apache Kafka is now deployed with Strimzi, a Kubernetes operator that manages the Kafka resources. Strimzi also provides the Kafka Bridge, a component used in Sasquatch for connecting HTTP-based clients with Kafka.

Figure 1 shows a diagram of the Sasquatch architecture highlighting the new functionalities: two-way replication between the Summit and USDF; multiple InfluxDB databases; Flux Tasks; and a REST API based on the Strimzi Kafka Bridge.

[Figure 1: Overview of the Sasquatch architecture (sasquatch_overview.svg)]

3 Sending data to Sasquatch

There are two main mechanisms for sending data to Sasquatch. One is based on the SAL Kafka Producers (ts_salkafka) and the other is based on the Strimzi Kafka Bridge REST API.

ts_salkafka is currently used with Sasquatch at the Summit and test stands to forward DDS messages to Kafka. Once DDS is replaced by Kafka, ts_salkafka will no longer be necessary [4].

3.1 Strimzi Kafka bridge

Strimzi Kafka bridge provides a REST interface for connecting HTTP-based clients with Kafka.

With the Kafka Bridge, a client can send messages to or receive messages from Kafka topics using HTTP requests. In particular, a client can produce messages to topics in JSON format by using the topics endpoint.

Once the data lands in Kafka, an InfluxDB Sink connector is responsible for consuming the Kafka topic and writing the data to a bucket in InfluxDB.

In SQuaSH we use the SQuaSH REST API to connect the lsst.verify client with InfluxDB. In Sasquatch, the SQuaSH REST API is replaced by the Strimzi Kafka bridge and a new InfluxDB connector.

A new client that needs to send data to Sasquatch would use the same pattern.

In addition, because we send data to Kafka, we can easily replicate data between sites and persist it in other formats, such as Parquet, using off-the-shelf Kafka connectors.

4 Two-way replication between Summit and USDF

In the current EFD implementation, data replication between the Summit and USDF is done through the Kafka Mirror Maker 2 connector (MM2) [3].

The EFD replication service allows for one-way replication (or active/standby replication) from the Summit to the USDF. We have measured sub-second latency for a high-throughput topic from the MTM1M3 subsystem in that setup.

In Sasquatch, two-way replication (or active/active replication) is now required. With two-way replication, metrics computed at the USDF (e.g., from Prompt Processing) are sent to the USDF instance of Sasquatch and replicated to the Summit.

In addition to the MM2 instance configured at the USDF to replicate observatory telemetry, events, and metrics from the Summit, Sasquatch adds a second MM2 instance at the Summit.

The Kafka Topics to be replicated are listed in the MM2 configuration on each Kafka cluster.

Two-way replication requires Kafka topic renaming. Usually, in this scenario, the topic at the destination cluster is prefixed with the name of the source cluster, which identifies its origin and avoids replicating it back to the source cluster. Consequently, any topic schemas at the destination cluster need to be translated, adding complexity compared to the one-way replication scenario.

5 Storing telemetry, metrics and events into multiple databases

In InfluxDB OSS 2.x, a bucket is a named location where time-series data is stored; it combines the database and retention policy concepts of InfluxDB 1.x.

By using multiple buckets we can specify different retention policies, time precision, access control, and backup strategies. InfluxDB OSS 2.x provides a buckets API to programmatically interact with buckets.

In the original EFD implementation, telemetry and events from the observatory are recorded into a single InfluxDB database. In Sasquatch, when migrating to InfluxDB OSS 2.x, we plan to store telemetry and events in separate buckets. In particular, because the time difference between events is not regular, events need to be stored with higher time precision than telemetry and metrics to avoid overlapping data.

5.1 Mapping Kafka topics to connector instances and buckets

When using the Strimzi Kafka Bridge, it makes sense to map Kafka topics to connector instances and buckets.

For example, the analysis_tools topic in Kafka holds the lsst.verify measurements. The analysis_tools connector instance is configured to extract the measurements and metadata from Kafka and write them to the analysis_tools bucket in InfluxDB.

6 Flux Tasks

InfluxDB OSS 2.x provides a new task engine that replaces Continuous Queries and Kapacitor used in InfluxDB OSS 1.x.

An InfluxDB task is a scheduled Flux script that takes an input data stream, transforms or analyzes it, and performs some action.

In most cases, the transformed data is stored in a new InfluxDB bucket, or sent to other destinations using Flux output functions. Examples include sending a notification to Slack, or triggering some computation using the Flux http.post() function.

InfluxDB OSS 2.x also provides a tasks API to programmatically interact with tasks.

7 Implementation phases

This section describes the Sasquatch implementation phases. As of August 2022, we are completing phase 1 and starting phase 2.

7.1 Phase 1 - Replace EFD deployments

  1. Add Sasquatch to Phalanx.

  2. Enable Chronograf authentication through Gafaelfawr.

  3. Replace Confluent Kafka with Strimzi Kafka.

  4. Automate Strimzi Kafka image builds adding the InfluxDB Sink, Mirror Maker 2, and S3 connectors.

  5. Deploy Sasquatch at IDF Dev.

  6. Deploy Sasquatch at TTS (Pillan cluster).

  7. Add csc and kafka-producer subcharts to Sasquatch for end-to-end testing.

  8. Add SASL configuration to ts_salkafka.

  9. Test connectors and integration with CSCs.

  10. Integrate news feeds with rsp_broadcast.

  11. Implement external listeners in Strimzi Kafka.

  12. Migrate Sasquatch monitoring to monitoring.lsst.codes.

  13. Deploy Sasquatch at USDF (SLAC).

  14. Migrate EFD data from the Summit to the Sasquatch instance at USDF.

  15. Deploy Sasquatch at the Summit (Yagan cluster).

  16. Migrate EFD data from the efd-temp-k3s.cp.lsst.org server to Sasquatch at the Summit.

  17. Implement data replication between Sasquatch at the Summit and USDF with Strimzi Kafka.

  18. Deploy Sasquatch at the BTS (Antu cluster).

7.2 Phase 2 - Replace the SQuaSH deployment

  1. Implement Strimzi Kafka bridge as a replacement for the SQuaSH API in Sasquatch.

  2. Configure InfluxDB Sink connector to parse lsst.verify job messages.

  3. Implement two-way replication in Sasquatch.

  4. Deploy Sasquatch on IDF int.

  5. Migrate SQuaSH data to Sasquatch at USDF.

7.2.1 Related goals

  1. Remove the squash and influxdb-demo clusters on Google.

7.3 Phase 3 - Migration to InfluxDB OSS 2.x

  1. Add InfluxDB OSS 2.x to Sasquatch deployment.

  2. Test InfluxDB Sink connector with InfluxDB OSS 2.x.

  3. Migrate EFD database to 2.x format (TTS, BTS, Summit, USDF).

  4. Exercise InfluxDB OSS 2.x backup/restore tools.

  5. Connect Chronograf with InfluxDB OSS 2.x (requires DBRP mapping).

  6. Migrate Kapacitor alerts to Flux tasks.

  7. Migrate Chronograf 1.x annotations (_chronograf database) to InfluxDB 2.x.

  8. Upgrade EFD client to use the InfluxDB OSS 2.x Python client.

References

[1] Angelo Fausti. The SQuaSH metrics dashboard. Vera C. Rubin Observatory SQuaRE Technical Note SQR-009, 2020. URL: https://sqr-009.lsst.io/

[2] Angelo Fausti. EFD Operations. Vera C. Rubin Observatory SQuaRE Technical Note SQR-034, 2021. URL: https://sqr-034.lsst.io/

[3] Angelo Fausti. The EFD replication service. Vera C. Rubin Observatory SQuaRE Technical Note SQR-050, 2021. URL: https://sqr-050.lsst.io/

[4] Russell Owen. Exploring Kafka for Telescope Control. Vera C. Rubin Observatory Technical Note TSTN-033, 2022. URL: https://tstn-033.lsst.io/