Monitoring and metrics

A look at all stats/metrics generated by the backend and how to monitor the applications using them.

The backend uses a statsd client to emit stats. Any statsd-compatible collector can receive them and forward them to a store such as Graphite or Amazon CloudWatch. For example, the CloudWatch Agent can collect the stats and publish them to Amazon CloudWatch.
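
To make the collection path concrete, here is a minimal sketch of the plain statsd line format (`name:value|type`) that a collector receives. The helper names are hypothetical and not part of the backend. How the backend's client encodes the instanceName dimension on the wire is not described here, so dimensions are left out of the sketch.

```python
from typing import Tuple

# Type codes used by the plain statsd line protocol.
STATSD_TYPES = {"counter": "c", "gauge": "g", "timer": "ms"}


def format_statsd_line(name: str, value: float, metric_type: str) -> str:
    """Render one metric as a plain statsd payload (hypothetical helper)."""
    code = STATSD_TYPES[metric_type]
    # Print whole numbers without a trailing ".0".
    rendered = str(int(value)) if float(value).is_integer() else str(value)
    return f"{name}:{rendered}|{code}"


def parse_statsd_line(line: str) -> Tuple[str, float, str]:
    """Split a plain statsd payload back into (name, value, type code)."""
    name, rest = line.split(":", 1)
    value, code = rest.split("|")[:2]
    return name, float(value), code
```

For example, `format_statsd_line("recovery.mode_normal", 1, "gauge")` renders the gauge described in the Recovery Mode section as a single line that any statsd server can ingest.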

Note that the collection of stats can be disabled using enableStats in config.yaml.

Every metric has a dimension called instanceName that can be used to filter metrics. This can be helpful in case of multi-node deployments.

Available metrics

Recovery Mode

The backend usually runs in normal mode. If the backend crashes and restarts multiple times within a short span, it starts in either degraded or maintenance mode. In degraded mode, events are collected and stored by the backend gateway but are not sent to destinations. In maintenance mode, the existing database is set aside for further inspection and a new database is used. It is therefore important to monitor the recovery mode and take appropriate action when the backend enters either degraded or maintenance mode.

This is the most important metric to monitor as it directly indicates the health of the application.

Name	Type	Description
`recovery.mode_normal`	`Gauge`	1 when running in normal mode; 0 when running in degraded or maintenance mode
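
A monitoring check on this gauge could look like the sketch below. It assumes the gauge samples have already been fetched from the metrics store as a list ordered oldest to newest. The function name and the choice to treat missing data as an alert are assumptions, not backend behavior.

```python
from typing import Optional, Sequence


def recovery_alert(samples: Sequence[float]) -> Optional[str]:
    """Return an alert message for recovery.mode_normal, or None if healthy."""
    if not samples:
        # Assumption: a silent gauge is treated as a problem worth alerting on.
        return "no data for recovery.mode_normal"
    if samples[-1] != 1:
        # 0 means the backend is in degraded or maintenance mode.
        return "backend is in degraded or maintenance mode"
    return None
```

Only the latest sample is inspected, so the check clears as soon as the backend reports normal mode again.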

Gateway

Name	Type	Description	Dimensions
`gateway.response_time`	`Timer`	Response time of each request	-
`gateway.batch_size`	`Counter`	Requests are grouped together internally for processing. This metric captures the size of each such batch.	-
`gateway.batch_time`	`Timer`	Time taken to process each batch of requests	-
`gateway.write_key_requests`	`Counter`	Number of requests received with each write key	`writekey`
`gateway.write_key_successful_requests`	`Counter`	Number of successful requests with each write key	`writekey`
`gateway.write_key_failed_requests`	`Counter`	Number of failed requests with each write key. *	`writekey`

* Requests fail in cases such as large request size, invalid write key, bad format of events, etc.
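
The per-write-key counters can be combined into a failure ratio, which is often easier to alert on than raw counts. The sketch below assumes the counter totals for one time window have already been read into dictionaries keyed by the `writekey` dimension. The function name is hypothetical.

```python
from typing import Dict


def failure_ratios(total: Dict[str, int], failed: Dict[str, int]) -> Dict[str, float]:
    """Compute failed/total per write key, skipping keys with no traffic."""
    ratios = {}
    for write_key, count in total.items():
        if count > 0:
            # A key missing from `failed` had no failed requests in the window.
            ratios[write_key] = failed.get(write_key, 0) / count
    return ratios
```

Here `total` would come from `gateway.write_key_requests` and `failed` from `gateway.write_key_failed_requests`. A write key whose ratio stays close to 1 usually points to a misconfigured source rather than a backend problem.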

Processor

Name	Type	Description
`processor.active_users`	`Gauge`	Number of active users. This is based on the most recent events received. Useful for monitoring real time traffic.
`processor.gateway_db_read`	`Counter`	Number of events read from database for processing.
`processor.gateway_db_write`	`Counter`	Number of events whose status is updated in gateway database after processing.
`processor.router_db_write`	`Counter`	Number of events written to router db.
`processor.batch_router_db_write`	`Counter`	Number of events written to batch router db. Note that batch router db is used for handling batch dumping destinations like S3, MinIO, etc.
`processor.transformer_sent`	`Counter`	Number of events sent to transformer.
`processor.transformer_received`	`Counter`	Number of events received from transformer. Note that this may not always be the same as transformer_sent even if there are no failures.
`processor.transformer_failed`	`Counter`	Number of events from transformer with error responses.

Router

Name	Type	Description
`router.[destination_code]_delivery_time`	`Timer`	Time taken to send each event to a specific destination. *
`router.[destination_code]_batch_time`	`Timer`	Time taken by the routing worker for each iteration. Multiple events are sent in each iteration. Equivalent to the interval at which a worker picks a new batch of events to send. * **
`router.[destination_code]_failed_attempts`	`Counter`	Number of retries made for a specific destination. *
`router.events_delivered`	`Counter`	Total number of events delivered to all destinations.

* These metrics are emitted for each destination type, such as GA, AMP, etc. All the different Google Analytics destinations are grouped under a single metric (e.g. router.GA_worker_network). Useful for monitoring failures or delays in delivering to a particular destination.
** The number of events picked in each iteration can be configured using noOfJobsPerChannel in config.yaml.
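
When building dashboards, the per-destination metric names can be generated from the pattern in the table. This is a sketch that assumes the destination code is substituted literally for `[destination_code]`. The helper is hypothetical, so verify the generated names against what your statsd server actually receives.

```python
from typing import List

# Suffixes taken from the Router table above.
ROUTER_SUFFIXES = ("delivery_time", "batch_time", "failed_attempts")


def router_metric_names(destination_code: str) -> List[str]:
    """Expand the router.[destination_code]_* pattern for one destination type."""
    return [f"router.{destination_code}_{suffix}" for suffix in ROUTER_SUFFIXES]
```

Calling `router_metric_names("GA")` yields the three Google Analytics metric names that follow this pattern.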

BatchRouter

Destinations such as S3, MinIO, where raw events are dumped, are handled by Batch Router.

Name	Type	Description	Dimensions
`batch_router.dest_successful_events`	`Counter`	Number of successful events sent to a specific destination	`destID`
`batch_router.dest_failed_attempts`	`Counter`	Number of failed attempts per destination. A rising value means the destination cannot be reached, usually because of invalid authorization or an invalid endpoint.	`destID`
`batch_router.[destination_code]_dest_upload_time`	`Timer`	Time taken to upload events to a specific destination (S3, MinIO, etc.)	-
`batch_router.errors`	`Counter`	Total number of errors when sending events to destinations	-

JobsDB

These are the backend’s implementation-specific metrics that can be used to analyze the performance based on traffic. JobsDB maintains active events and their statuses. For optimizing db operations, we periodically add new tables in the db and migrate rows from older tables.

Name	Type	Description
`jobsdb.gw_tables_count`	`Gauge`	Number of gateway tables in JobsDB
`jobsdb.rt_tables_count`	`Gauge`	Number of router tables in JobsDB
`jobsdb.brt_tables_count`	`Gauge`	Number of batch router tables in JobsDB

Ideally, the table counts above should not keep growing. Ever-growing table counts indicate one of the following:
- Events are not being processed and delivered in time.
- The load exceeds what the current setup can handle, and it is time to scale.
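
One way to detect an ever-growing table count is to check whether recent gauge readings never decrease and end higher than they started. The sketch below assumes the samples are ordered oldest to newest. The function name and the default window of 5 samples are arbitrary choices, not backend settings.

```python
from typing import Sequence


def is_ever_growing(samples: Sequence[float], min_samples: int = 5) -> bool:
    """True if the last `min_samples` readings never drop and show net growth."""
    if len(samples) < min_samples:
        # Not enough history to call it a trend.
        return False
    window = samples[-min_samples:]
    non_decreasing = all(b >= a for a, b in zip(window, window[1:]))
    return non_decreasing and window[-1] > window[0]
```

A count that rises and then falls back as rows are migrated does not trigger the check. Only sustained growth across the whole window does.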

JobsDB - Table Dump specific

All the events from gateway tables are periodically dumped to S3/MinIO as a backup and also to facilitate event replay. These stats monitor delays or errors in dumping.

Name	Type	Description
`jobsdb.table_file_dump_time`	`Timer`	Time taken to dump gateway tables to a JSON file
`jobsdb.file_upload_time`	`Timer`	Time taken to compress and upload the generated JSON files.
`jobsdb.total_table_dump_time`	`Timer`	Total time taken for the whole process of dumping tables to S3.

Config Backend Polling

The configuration of sources and their corresponding destinations is polled from the config backend. Any errors in fetching this config can be monitored using config_backend.errors.

Name	Type	Description
`config_backend.errors`	`Counter`	Number of errors in fetching or processing config from control-plane’s backend.