Track 8 Topic 3: Monitoring Data Lake

Two things you need to constantly watch:

  • the lag in pipelines and writers
  • the error topics

The lag tells you how far the pipelines and writers have fallen behind the readers. And the error topics tell you what errors the readers, pipelines, and writers have encountered.

1. Checking the Lag

You need to deploy a KCE to monitor lag. In the KCE, issue the following command to check the offsets and lag of a consumer group.

% ./desc_consumer_group.sh jdoe payment-p1

GROUP           TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
payment-p1      dirty-pond      1          205             205             0               -               -               -
payment-p1      dirty-pond      0          207             209             2               -               -               -

In partition 1, the current offset is the same as the high watermark “LOG-END-OFFSET.” Therefore, there is no lag in this partition. However, in partition 0, there is a lag of two records.
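Per partition, lag is just the high watermark minus the consumer's current offset. A minimal sketch with the numbers from the output above (the function name is illustrative, not part of any Kafka or Calabash API):

```python
# Lag per partition = log-end-offset (high watermark) - current offset.
def lag(current_offset, log_end_offset):
    return log_end_offset - current_offset

print(lag(205, 205))  # partition 1: 0, fully caught up
print(lag(207, 209))  # partition 0: 2 records behind
```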

In a smooth data flow, some lag is acceptable as long as it does not increase over time. So keep watching the consumer group and look for lag that trends upward.
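A simple way to turn “watch for increasing lag” into a check is to sample the group's total lag on a timer and flag a sustained climb. The sampling itself (running the describe command periodically and summing the LAG column) is assumed here; the sketch below shows only the trend test, with an illustrative function name:

```python
# Sketch: decide whether a series of total-lag samples is trending upward.
# samples: list of total-lag readings for a consumer group, oldest first.
def lag_is_growing(samples, window=5):
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough data to call it a trend
    # Strictly increasing over the whole window = sustained growth.
    return all(a < b for a, b in zip(recent, recent[1:]))

# Transient lag that drains back down is normal:
print(lag_is_growing([0, 2, 5, 3, 0, 1]))      # False
# Steadily climbing lag is the emergency signal:
print(lag_is_growing([10, 40, 90, 160, 250]))  # True
```

Requiring a full window of strictly increasing samples avoids alerting on the momentary lag spikes that any healthy pipeline produces.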

What is this “consumer group,” and how does it relate to the pipelines or writers in my system?

For each pipeline or writer, Kafka creates a consumer group. The pipeline/writer name is used as the group name. For example, “payment-p1” is a pipeline corresponding to the consumer group “payment-p1.”

You may issue the “list_consumer_groups” command to find out all the consumer groups:

% ./list_consumer_groups.sh jdoe
payment-p1
payment-p2
payment-p3
text-file-writer-1

If you see steadily increasing lag, you have an emergency, and it is time to take immediate action. You can undeploy or pause the reader, redeploy the pipeline with a higher degree of parallelism, or beef up the machines it runs on (if it is a resource or performance issue). See the next topic, Quick Action, for details.

2. Checking the Error Topics

An error topic is a Kafka topic. You can use any tool to consume from it.

The Calabash KCE provides a “consume” command, which you can use to check the content of an error topic.

% ./consume.sh april error-1-0 g0
"0010"|{"val":"05/10/2021 11:15:02,account0000003,\"fivethousand\"","err":"Failed processor preprocessing in step 0: Field amount requires int, but fivethousand found","ts":1626726147459}
"0125"|{"val":"05/10/2021 11:15:20,account0000001,\"5000\"","err":"Failed processor preprocessing in step 0: Field amount requires int, but 5000 found","ts":1626726174501}
...

In this example, “april” is a consumer user with permission to read from topic “error-1-0.” This topic is one of the error topics of the pipeline “payment-p1.”

The “consume” command displays the records in the topic, one record per line, in real time. The vertical bar separates the record key from the value. The key is the key of the data record that hit the error.

The value part is a JSON record. The “err” attribute carries the error message, and “val” is the raw data value (as a string) that caused the error. There is also a timestamp for the event under “ts.” Together, these give you all the information needed to diagnose and fix the issue.
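Since each line is just a key, a vertical bar, and a JSON value, it is easy to pick apart programmatically. A sketch, assuming the line format shown above (the parser name is illustrative; the field names “val,” “err,” and “ts” are the ones in the example output):

```python
import json

# Split one line of "consume" output into key and JSON value,
# then pull out the three attributes of the error record.
def parse_error_record(line):
    key, _, value = line.partition("|")
    rec = json.loads(value)
    return {
        "key": key.strip('"'),
        "raw_value": rec["val"],  # raw data that caused the error
        "error": rec["err"],      # error message from the processor
        "ts_ms": rec["ts"],       # event timestamp (epoch milliseconds)
    }

line = ('"0010"|{"val":"05/10/2021 11:15:02,account0000003,'
        '\\"fivethousand\\"","err":"Failed processor preprocessing in '
        'step 0: Field amount requires int, but fivethousand found",'
        '"ts":1626726147459}')
rec = parse_error_record(line)
print(rec["error"])
```

From here you could, for example, group records by error message to see which data problem occurs most often.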