The backend uses a statsd client to emit stats. These stats can be collected by any statsd-compatible service, such as Graphite or CloudWatch. For example, the CloudWatch Agent can be used to collect stats and send them to Amazon CloudWatch.

Note that stats collection can be disabled using the enableStats setting in config.yaml.

Every metric has a dimension called instanceName that can be used to filter metrics. This can be helpful in case of multi-node deployments.
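To make the instanceName filter concrete, here is a minimal Python sketch that parses statsd lines and keeps only those from one node. It assumes dimensions arrive as DogStatsD-style tags (`|#key:value,...`); the text above does not say how the backend encodes dimensions on the wire, so the tag format, the sample lines, and the helper names `parse_statsd_line` and `filter_by_instance` are illustrative assumptions only.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    name: str
    value: float
    kind: str   # "c" = counter, "g" = gauge, "ms" = timer
    tags: dict  # dimensions, e.g. {"instanceName": "node-1"}


def parse_statsd_line(line: str) -> Metric:
    """Parse 'name:value|type[|@rate][|#k:v,k:v]' into a Metric.

    The '#'-prefixed tag section is an ASSUMED DogStatsD-style extension.
    """
    head, *sections = line.strip().split("|")
    name, _, raw_value = head.partition(":")
    if not name or not raw_value or not sections:
        raise ValueError(f"not a statsd line: {line!r}")
    kind = sections[0]
    tags = {}
    for section in sections[1:]:
        if section.startswith("#"):
            for pair in section[1:].split(","):
                key, _, val = pair.partition(":")
                tags[key] = val
    return Metric(name, float(raw_value), kind, tags)


def filter_by_instance(metrics, instance_name):
    """Keep only metrics whose instanceName dimension matches."""
    return [m for m in metrics if m.tags.get("instanceName") == instance_name]


# Hypothetical sample lines from a two-node deployment.
lines = [
    "gateway.response_time:12.5|ms|#instanceName:node-1",
    "recovery.mode_normal:1|g|#instanceName:node-2",
]
parsed = [parse_statsd_line(line) for line in lines]
node_1_metrics = filter_by_instance(parsed, "node-1")
```

In practice the statsd server or dashboard usually does this filtering; the sketch only shows what filtering on the dimension amounts to.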

Available metrics

Recovery Mode

The backend usually runs in normal mode. If the backend crashes and restarts multiple times in a short span, it starts in either degraded or maintenance mode. In degraded mode, events are collected and stored by the backend gateway but are not sent to destinations. In maintenance mode, the existing database is set aside for further inspection and a new database is used. It is therefore important to monitor the recovery mode and take appropriate action when the backend enters either degraded or maintenance mode.

This is the most important metric to monitor as it directly indicates the health of the application.

| Name | Type | Description |
|---|---|---|
| recovery.mode_normal | Gauge | Has a value of 1 when running in normal mode, and 0 when running in degraded or maintenance mode |
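As a minimal sketch of how this gauge could drive an alert, the Python function below takes the latest recovery.mode_normal value per instanceName and returns the nodes that are not in normal mode. The input dictionary, the node names, and the function name `instances_needing_attention` are hypothetical; how the latest values are fetched depends on the statsd server in use and is not shown.

```python
def instances_needing_attention(latest_gauges):
    """Return instance names whose recovery.mode_normal gauge is not 1.

    latest_gauges maps instanceName -> latest gauge value. A value of 0
    means degraded or maintenance mode; a missing reading (None) is also
    flagged, on the assumption that no data deserves a look as well.
    """
    return sorted(name for name, value in latest_gauges.items() if value != 1)


# Hypothetical readings from a three-node deployment.
readings = {"node-1": 1, "node-2": 0, "node-3": None}
flagged = instances_needing_attention(readings)
```

Because the gauge does not distinguish degraded from maintenance mode, an alert built this way only signals that a node needs inspection, not which mode it is in.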

Gateway

| Name | Type | Description | Dimensions |
|---|---|---|---|
| gateway.response_time | Timer | Response time of each request | - |
| gateway.batch_size | Counter | Requests are grouped together internally for processing. This metric captures the size of each such batch | - |
| gateway.batch_time | Timer | Time taken to process each batch of requests | - |
| gateway.write_key_requests | Counter | Number of requests received with each write key | writekey |
| gateway.write_key_successful_requests | Counter | Number of successful requests with each write key | writekey |
| gateway.write_key_failed_requests | Counter | Number of failed requests with each write key. * | writekey |

* Requests fail in cases such as large request size, invalid write key, bad format of events, etc.
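The per-write-key counters can be combined into a failure rate, which is often easier to alert on than raw counts. The Python sketch below assumes both counters were aggregated over the same time window and exported as plain dictionaries keyed by write key; the function name `failure_rates` and the sample keys are hypothetical.

```python
def failure_rates(total_by_key, failed_by_key):
    """Return failed/total per write key, skipping keys with no traffic.

    total_by_key  : gateway.write_key_requests summed per writekey
    failed_by_key : gateway.write_key_failed_requests summed per writekey
    """
    rates = {}
    for key, total in total_by_key.items():
        if total <= 0:
            continue  # no requests in this window, so the rate is undefined
        rates[key] = failed_by_key.get(key, 0) / total
    return rates


# Hypothetical counts for one aggregation window.
rates = failure_rates(
    {"wk1": 200, "wk2": 50, "wk3": 0},
    {"wk1": 10},
)
```

A sustained high rate for a single write key usually points at one misbehaving source (for example an invalid key or malformed events) rather than at the gateway itself.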

Processor

| Name | Type | Description |
|---|---|---|
| processor.active_users | Gauge | Number of active users. This is based on the most recent events received. Useful for monitoring real-time traffic. |
| processor.gateway_db_read | Counter | Number of events read from the database for processing. |
| processor.gateway_db_write | Counter | Number of events whose status is updated in the gateway database after processing. |
| processor.router_db_write | Counter | Number of events written to the router db. |
| processor.batch_router_db_write | Counter | Number of events written to the batch router db. Note that the batch router db is used for handling batch dumping destinations like S3, MinIO, etc. |
| processor.transformer_sent | Counter | Number of events sent to the transformer. |
| processor.transformer_received | Counter | Number of events received from the transformer. Note that this may not always be the same as transformer_sent even if there are no failures. |
| processor.transformer_failed | Counter | Number of events from the transformer with error responses. |

Router

| Name | Type | Description |
|---|---|---|
| router.[destination_code]_delivery_time * | Timer | Time taken to send each event to a specific destination. |
| router.[destination_code]_batch_time * | Timer | Time taken by a routing worker for each iteration. Multiple events are sent in each iteration. Equivalent to the interval at which a worker picks a new batch of events to send. ** |
| router.[destination_code]_failed_attempts * | Counter | Number of retries made for a specific destination. |
| router.events_delivered | Counter | Total number of events delivered to all destinations. |

* These metrics are emitted per destination type, such as GA, AMP, etc. All the different Google Analytics destinations are grouped under a single metric (e.g., router.GA_worker_network). They are useful for monitoring failures or delays in delivering to a particular destination.

** The number of events picked in each iteration can be configured using noOfJobsPerChannel in config.yaml.

BatchRouter

Destinations such as S3, MinIO, where raw events are dumped, are handled by Batch Router.

| Name | Type | Description | Dimension |
|---|---|---|---|
| batch_router.dest_successful_events | Counter | Number of successful events sent to a specific destination | destID |
| batch_router.dest_failed_attempts | Counter | Number of failed attempts for a specific destination. An increase in this metric means that destination cannot be reached (usually due to invalid authorization or endpoint). | destID |
| batch_router.[destination_code]_dest_upload_time | Timer | Time taken to upload events to a specific destination (S3, MinIO, etc.) | - |
| batch_router.errors | Counter | Total number of errors when sending events to destinations | - |

JobsDB

These implementation-specific metrics can be used to analyze the backend's performance under a given traffic load. JobsDB maintains active events and their statuses. To optimize database operations, new tables are periodically added to the db and rows are migrated from older tables.

| Name | Type | Description |
|---|---|---|
| jobsdb.gw_tables_count | Gauge | Number of gateway tables in JobsDB |
| jobsdb.rt_tables_count | Gauge | Number of router tables in JobsDB |
| jobsdb.brt_tables_count | Gauge | Number of batch router tables in JobsDB |

Ideally, the table counts above should not keep growing. An ever-growing table count indicates one of the following:

  • Events are not getting processed and delivered in time.
  • The load has exceeded what the current setup can handle, and it is time to scale.
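One simple way to flag an ever-growing table count is to check whether a window of gauge samples never decreases and ends higher than it started. The Python sketch below does exactly that; the function name `is_steadily_growing`, the `min_points` default, and the sample series are hypothetical, and a real alert would need a window length tuned to how often tables are added and migrated.

```python
def is_steadily_growing(samples, min_points=5):
    """True if the gauge never decreased and ended higher than it started.

    samples is a chronologically ordered list of readings of one of the
    *_tables_count gauges. Fewer than min_points readings is treated as
    not enough evidence.
    """
    if len(samples) < min_points:
        return False
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and samples[-1] > samples[0]


# Hypothetical readings of jobsdb.gw_tables_count over time.
backlog_building = is_steadily_growing([3, 4, 4, 6, 7])
healthy = is_steadily_growing([3, 5, 4, 5, 4])
```

A count that rises and falls, as in the second series, is the expected pattern when older tables are being migrated and dropped on schedule.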

JobsDB - Table Dump specific

All the events from gateway tables are periodically dumped to S3/MinIO as a backup and also to facilitate event replay. These stats monitor delays or errors in dumping.

| Name | Type | Description |
|---|---|---|
| jobsdb.table_file_dump_time | Timer | Time taken to dump gateway tables to a JSON file. |
| jobsdb.file_upload_time | Timer | Time taken to compress and upload the generated JSON files. |
| jobsdb.total_table_dump_time | Timer | Total time taken for the whole process of dumping tables to S3. |

Config Backend Polling

The configuration of sources and their corresponding destinations is polled from the config backend. Any errors in fetching this config can be monitored using config_backend.errors.

| Name | Type | Description |
|---|---|---|
| config_backend.errors | Counter | Number of errors in fetching or processing config from the control plane's backend. |
