SQR-068: Sasquatch: beyond the EFD

  • Angelo Fausti

Latest Revision: 2022-05-20

Note

This technote is a work-in-progress.

1 Abstract

DMTN-082 [4] presents a high-level architecture to enable real-time analysis of the Engineering and Facilities Database (EFD) through the Rubin Science Platform (RSP).

The current EFD implementation is described in SQR-034 [2] and deployed at the Summit, test stands, and USDF.

The EFD architecture is based on InfluxDB, an open-source time-series platform optimized for efficient storage and analysis of time series data, and Apache Kafka, which is used as a write-ahead log to InfluxDB and for data replication between the Summit and the USDF.

The EFD name, however, is misleading because the system is more than a database. In fact, it can be extended to store and analyze other scalar metrics such as science performance metrics computed by lsst.faro, scheduler events, camera diagnostic metrics, etc.

In this technote, we introduce Sasquatch (Scientific Analysis of Scalar QUantities and Telemetry Curation Harness), a unified service to store and analyze telemetry, events and metrics for the Rubin Observatory.

2 Overview

The Sasquatch architecture is based on InfluxDB OSS 2.x and Strimzi.

InfluxDB OSS 2.x uses Flux as its native language. Flux combines query and data analysis functionalities. Other features of interest for Sasquatch that are included in the new InfluxDB version are buckets, a task engine to process time-series data with Flux, and a new Python client.

Strimzi simplifies the process of running Apache Kafka in a Kubernetes cluster, providing a set of operators to manage the Kafka resources. It also provides the Kafka bridge, a component used in Sasquatch for connecting HTTP-based clients with Kafka.

Figure 1 shows a diagram of the Sasquatch architecture highlighting the new functionalities: a REST API based on Strimzi Kafka bridge; two-way replication between the Summit and USDF; InfluxDB buckets; and Flux Tasks.

Figure 1: Sasquatch architecture overview (_images/sasquatch_overview.svg).

3 Sending data to Sasquatch

There are two main mechanisms for sending data to Sasquatch. One is based on the ts_salkafka producer and the other is based on Strimzi Kafka Bridge.

ts_salkafka was implemented to send DDS messages to the EFD Kafka cluster. It can still be used with Sasquatch at the Summit and test stands, but won’t be necessary if DDS is replaced by Kafka (see TSTN-033 [5]).

3.1 Strimzi Kafka bridge

Strimzi Kafka bridge provides a REST interface for connecting HTTP-based clients with Kafka.

With the Kafka bridge, a client can send messages to, or receive messages from, Kafka topics using HTTP requests. In particular, a client can produce messages to a topic in JSON format by using the topics endpoint.
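
As a minimal sketch of producing a message through the bridge, the snippet below builds (but does not send) a POST request to the topics endpoint using the JSON embedded format of the Kafka bridge. The bridge URL, the topic name, and the record fields are hypothetical placeholders, not values from an actual Sasquatch deployment.

```python
import json
import urllib.request

# Hypothetical bridge URL and topic name, used only for illustration.
BRIDGE_URL = "http://sasquatch-kafka-bridge.example.org"
TOPIC = "example-topic"


def build_produce_request(bridge_url, topic, values):
    """Build (but do not send) a POST request to the bridge topics endpoint."""
    # The bridge expects a "records" list; each record carries a JSON value.
    body = {"records": [{"value": value} for value in values]}
    return urllib.request.Request(
        url=f"{bridge_url}/topics/{topic}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


# The record content below is made up for the example.
request = build_produce_request(
    BRIDGE_URL, TOPIC, [{"timestamp": 1653004800, "temperature": 12.5}]
)

# Sending is left commented out because the URL above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

Keeping request construction separate from sending makes it easy to inspect the payload before pointing the client at a real bridge instance.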

Once the data lands in Kafka, an InfluxDB Sink connector is responsible for consuming the Kafka topic and writing the data to a bucket in InfluxDB.

Any client that needs to send data to Sasquatch would follow that pattern. In general, the following steps are required to add a new client: (1) create a Kafka Topic in Strimzi; (2) configure a new InfluxDB Sink connector; and (3) create a new bucket in InfluxDB.

In SQuaSH (see SQR-009 [1]) we use the SQuaSH API to connect dispatch_verify.py, the HTTP client in the lsst.verify package, with InfluxDB.

In Sasquatch, the Strimzi Kafka bridge combined with an InfluxDB connector is a replacement for the SQuaSH API. Because the data is sent to Kafka first, it can also be replicated or persisted into another format using Kafka connectors.

4 Two-way replication between Summit and USDF

In the current EFD implementation, data replication between the Summit and USDF is done through the Kafka Mirror Maker 2 connector (MM2) (see SQR-050 [3]).

The EFD replication service allows for one-way replication (or active/standby replication) from the Summit to the USDF. In that setup, we have measured sub-second latency for a high-throughput topic from the MTM1M3 subsystem.

In Sasquatch, two-way replication (or active/active replication) is now required. With two-way replication, metrics computed at USDF (e.g. from Prompt Processing) are sent to the USDF instance of Sasquatch and replicated to the Summit.

In addition to the instance of MM2 configured at USDF to replicate Observatory telemetry, events and metrics from the Summit, Sasquatch adds a second instance of MM2 at the Summit.

The Kafka Topics to be replicated are listed in the MM2 configuration on each Kafka cluster.

Two-way replication requires Kafka Topic renaming. Usually, in this scenario, the Kafka Topic at the destination cluster is prefixed with the name of the source cluster. That helps to identify its origin and avoid replicating it back to the source cluster. Consequently, any topic schemas at the destination cluster need to be translated, which adds more complexity compared to the one-way replication scenario.
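
The renaming convention can be sketched in a few lines. This is only an illustration: the cluster aliases summit and usdf, the topic names, and both helper functions are hypothetical, while the "." separator is the one used by the default MM2 replication policy.

```python
def remote_topic_name(source_cluster, topic, separator="."):
    """Name a replicated topic takes at the destination cluster."""
    return f"{source_cluster}{separator}{topic}"


def should_replicate(topic, destination_cluster, separator="."):
    """Skip topics that originated at the destination cluster.

    A topic already prefixed with the destination alias was replicated
    from there, so sending it back would create a replication loop.
    """
    return not topic.startswith(f"{destination_cluster}{separator}")


# Hypothetical topic created at the Summit and replicated to USDF.
at_usdf = remote_topic_name("summit", "example.telemetry")
print(at_usdf)

# Candidate topics on the USDF cluster for replication to the Summit.
for topic in (at_usdf, "example.metrics"):
    print(topic, should_replicate(topic, destination_cluster="summit"))
```

Consumers at the destination have to query the prefixed name, which is why schemas and downstream configuration need to account for the renaming.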

5 Storing telemetry, metrics and events into multiple buckets

In InfluxDB OSS 2.x, a bucket is a named location where time series data is stored.

By using multiple buckets we can specify different retention policies, time precision, access control and backup strategies. InfluxDB OSS 2.x provides a buckets API to programmatically interact with buckets.

In the current EFD implementation, telemetry and events from the Observatory are being recorded into a single EFD database, the InfluxDB OSS 1.x equivalent of a bucket.
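
As a sketch of how per-bucket retention could be expressed, the helper below builds the JSON body for a bucket creation request to the InfluxDB 2.x buckets endpoint (POST /api/v2/buckets). The bucket names, the organization ID, and the 30-day retention period are hypothetical; the helper only builds the payload and does not contact a server.

```python
import json

SECONDS_PER_DAY = 86400


def bucket_payload(name, org_id, retention_days=None):
    """Build the request body for creating a bucket.

    When retention_days is None the retention rules are omitted, which
    leaves the bucket with the server default (data is kept indefinitely).
    """
    payload = {"orgID": org_id, "name": name}
    if retention_days is not None:
        payload["retentionRules"] = [
            {"type": "expire", "everySeconds": retention_days * SECONDS_PER_DAY}
        ]
    return payload


# Hypothetical buckets: one with a 30-day retention, one without expiry.
telemetry = bucket_payload("example-telemetry", "example-org-id", retention_days=30)
events = bucket_payload("example-events", "example-org-id")

print(json.dumps(telemetry, indent=2))
print(json.dumps(events, indent=2))
```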

In Sasquatch, we are considering storing telemetry and events into separate buckets. In particular, because the time difference between events is not regular, events need to be stored with higher time precision than telemetry and metrics to avoid losing data.
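
The data-loss concern can be illustrated with a toy model. InfluxDB identifies a point by its measurement, tag set and timestamp, so two events in the same series whose timestamps become equal after truncation end up as a single point. The sketch below ignores tags for simplicity; the measurement name, field names and timestamps are made up.

```python
# Nanoseconds per unit for each supported write precision.
NANOSECONDS = {"s": 1_000_000_000, "ms": 1_000_000, "us": 1_000, "ns": 1}


def truncate(timestamp_ns, precision):
    """Truncate a nanosecond timestamp to the given precision."""
    unit = NANOSECONDS[precision]
    return timestamp_ns - timestamp_ns % unit


def stored_points(events, precision):
    """Toy model of a store keyed by (measurement, timestamp)."""
    store = {}
    for measurement, timestamp_ns, fields in events:
        key = (measurement, truncate(timestamp_ns, precision))
        # Fields written to an existing key overwrite the previous values.
        store.setdefault(key, {}).update(fields)
    return store


# Two hypothetical events 300 ms apart within the same second.
events = [
    ("example_event", 1_653_004_800_100_000_000, {"state": 1}),
    ("example_event", 1_653_004_800_400_000_000, {"state": 2}),
]

print(len(stored_points(events, "s")))   # both events share one key
print(len(stored_points(events, "ms")))  # each event keeps its own key
```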

5.1 Mapping Kafka topics to connector instances and buckets

When using the Strimzi Kafka bridge, it makes sense to have a 1:1 mapping between Kafka topics, connector instances and buckets.

For example, a faro topic in Kafka would hold the lsst.verify job messages produced by lsst.faro. A faro InfluxDB connector instance would have the configuration to extract the metric values and metadata from those messages, and would write them to a faro bucket in InfluxDB.
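
One way to keep such a mapping explicit is a small table that can be checked for accidental sharing of resources. The sketch below is hypothetical: the ClientMapping class and the is_one_to_one helper are illustrative names, and only the faro entry comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientMapping:
    """Kafka topic, connector instance and InfluxDB bucket for one client."""

    topic: str
    connector: str
    bucket: str


def is_one_to_one(mappings):
    """Return True if no topic, connector or bucket is shared between mappings."""
    for attribute in ("topic", "connector", "bucket"):
        values = [getattr(mapping, attribute) for mapping in mappings]
        if len(values) != len(set(values)):
            return False
    return True


mappings = [
    ClientMapping(topic="faro", connector="faro", bucket="faro"),
    # Hypothetical second client, added only to exercise the check.
    ClientMapping(topic="example", connector="example", bucket="example"),
]

print(is_one_to_one(mappings))
```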

6 Flux Tasks

InfluxDB OSS 2.x provides a new task engine that replaces Continuous Queries and Kapacitor used in InfluxDB OSS 1.x.

An InfluxDB task is a scheduled Flux script that takes an input data stream, transforms or analyzes it, and performs some action.

In most cases, the transformed data can be stored into a new InfluxDB bucket, or sent to other destinations using Flux output functions. An example is sending a notification to Slack, or triggering some computation using the Flux http.post() function.

InfluxDB OSS 2.x also provides a tasks API to programmatically interact with tasks.
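
As an illustration, the sketch below generates the Flux script for a simple downsampling task and wraps it in the JSON body expected by the tasks endpoint (POST /api/v2/tasks). The task name, bucket names, intervals and organization ID are hypothetical, and the script assumes numeric field values so that the mean can be computed.

```python
import json


def downsampling_task(name, source, destination, every="1h", window="5m"):
    """Return a Flux script that averages recent data into another bucket."""
    return (
        f'option task = {{name: "{name}", every: {every}}}\n'
        "\n"
        f'from(bucket: "{source}")\n'
        "    |> range(start: -task.every)\n"
        f"    |> aggregateWindow(every: {window}, fn: mean)\n"
        f'    |> to(bucket: "{destination}")\n'
    )


def task_payload(org_id, flux_script):
    """Build the request body for creating a task."""
    return {"orgID": org_id, "flux": flux_script}


# Hypothetical names, used only for illustration.
script = downsampling_task(
    name="downsample-example",
    source="example-telemetry",
    destination="example-telemetry-5m",
)
payload = task_payload("example-org-id", script)

print(script)
print(json.dumps(payload, indent=2))
```

The task schedule lives in the script itself through the task option, so the request body only needs the organization and the Flux source.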

7 Implementation phases

7.1 Phase 1 - Replace EFD deployments

  1. Add Sasquatch to Phalanx.

  2. Enable Chronograf authentication through Gafaelfawr.

  3. Replace Confluent Kafka with Strimzi Kafka.

  4. Automate Strimzi Kafka image builds adding the InfluxDB Sink, Mirror Maker 2, and S3 connectors.

  5. Deploy Sasquatch at IDF Dev (side application: monitor JupyterHub metrics).

  6. Deploy Sasquatch at NCSA Int (test Mirror Maker 2 and S3 connectors).

  7. Deploy Sasquatch at NCSA Stable.

  8. Migrate EFD data to Sasquatch at NCSA Stable.

  9. Add csc and kafka-producer subcharts to Sasquatch for end-to-end testing.

  10. Deploy Sasquatch at TTS (Pillan cluster).

  11. Add SASL configuration to ts_salkafka.

  12. Test connectors and integration with CSCs.

  13. Deploy Sasquatch at the Base (Antus cluster).

  14. Deploy Sasquatch at the Summit (Yagan cluster).

  15. Migrate EFD data from the efd-temp-k3s.cp.lsst.org server to Sasquatch at the Summit.

7.2 Phase 2 - Replace the SQuaSH deployment

  1. Implement Strimzi Kafka bridge as a replacement for the SQuaSH API in Sasquatch.

  2. Configure InfluxDB Sink connector to parse lsst.verify job messages.

  3. Implement two-way replication in Sasquatch.

  4. Deploy Sasquatch on IDF int.

  5. Deploy Sasquatch on IDF prod.

  6. Migrate SQuaSH data to Sasquatch at IDF (or USDF).

7.2.1 Related goals

  1. Remove squash and influxdb-demo clusters on Google.

7.3 Phase 3 - Migration to InfluxDB OSS 2.x

  1. Add InfluxDB OSS 2.x to Sasquatch deployment.

  2. Test InfluxDB Sink connector with InfluxDB OSS 2.x.

  3. Migrate EFD database to 2.x format (TTS, Base, Summit, NCSA Int, NCSA Stable).

  4. Exercise InfluxDB OSS 2.x backup/restore tools.

  5. Connect Chronograf with InfluxDB OSS 2.x (requires DBRP mapping).

  6. Migrate Kapacitor alerts to Flux tasks.

  7. Migrate Chronograf 1.x annotations (_chronograf database) to InfluxDB 2.x.

  8. Upgrade EFD client to use the InfluxDB OSS 2.x Python client.

References

[1] SQR-009. Angelo Fausti. The SQuaSH metrics dashboard. 2020. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-009.lsst.io/

[2] SQR-034. Angelo Fausti. EFD Operations. 2021. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-034.lsst.io/

[3] SQR-050. Angelo Fausti. The EFD replication service. 2021. Vera C. Rubin Observatory SQuaRE Technical Note. URL: https://sqr-050.lsst.io/

[4] DMTN-082. Simon Krughoff and Frossie Economou. On accessing EFD data in the Science Platform. 2018. Vera C. Rubin Observatory Data Management Technical Note. URL: https://dmtn-082.lsst.io/

[5] TSTN-033. Russell Owen. Exploring Kafka for Telescope Control. 2022. Vera C. Rubin Observatory. URL: https://tstn-033.lsst.io/