Performance Data Pipeline
Performance Collection and Storage
Collector daemons gather metrics from monitored devices via their protocol-specific methods (SNMP, SSH, WinRM, etc.)
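For illustration only, here is a minimal sketch of one such protocol-specific poll, assuming the pysnmp library and SNMP v2c; it is not the actual collector daemon code, and the OID shown is just an example.

    # Minimal SNMP GET sketch (assumes pysnmp); collector daemons do something
    # conceptually similar for their respective protocols.
    from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                              ContextData, ObjectType, ObjectIdentity, getCmd)

    def snmp_get(host, community, oid):
        """Fetch a single OID value from a device via SNMP GET."""
        error_indication, error_status, _, var_binds = next(getCmd(
            SnmpEngine(),
            CommunityData(community),
            UdpTransportTarget((host, 161)),
            ContextData(),
            ObjectType(ObjectIdentity(oid)),
        ))
        if error_indication or error_status:
            raise RuntimeError(error_indication or error_status.prettyPrint())
        return var_binds[0][1]

    # Hypothetical usage: poll sysUpTime (1.3.6.1.2.1.1.3.0) from a device
    # print(snmp_get('10.90.36.137', 'public', '1.3.6.1.2.1.1.3.0'))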
- Metrics are written as a collection of key-value pairs and tags in the following format:

    2017-02-15 19:45:39,225 DEBUG zen.MetricWriter: \
        publishing metric 10.90.36.137/cpu_ssCpuSystemPerCpu \
        162450 1487187939 {'device': '10.90.36.137', \
        'contextUUID': '12d97430-f7bc-4073-91f8-6743d3ae94a1', \
        'key': 'Devices/10.90.36.137'}
    # Line breaks added for legibility
- Included in the message are the metric name (device/datapoint), the value, the timestamp, and a set of tags (device, contextUUID, key)
- Collected metrics are sent to CollectorRedis, which acts as a queue (see the sketch after this list)
- All collector daemons on a collector share a CollectorRedis instance.
- MetricShipper consumes from CollectorRedis and publishes to MetricConsumer
- Traffic is proxied through zproxy
- Authentication for zproxy is managed by zauth
- MetricConsumer acts as the aggregator for metrics from ALL collectors
- It then forwards metrics to the OpenTSDBWriter
- OpenTSDBWriter is responsible for actually writing the metrics to HBase
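As a rough illustration of the first hop in this pipeline, the sketch below builds a metric payload with the same fields as the log message above (metric name, value, timestamp, tags) and pushes it onto a Redis list. The queue key name and the JSON layout are assumptions made for illustration; they are not the actual MetricShipper/CollectorRedis wire format.

    import json
    import time

    import redis  # assumes the redis-py client library

    # 'metrics-queue' is a hypothetical key name, not the real CollectorRedis key.
    QUEUE_KEY = 'metrics-queue'

    def publish_metric(r, device, datapoint, value, tags):
        """Queue one metric sample as JSON on a Redis list (illustrative only)."""
        sample = {
            'metric': '%s/%s' % (device, datapoint),  # e.g. 10.90.36.137/cpu_ssCpuSystemPerCpu
            'value': value,
            'timestamp': int(time.time()),
            'tags': tags,                             # device, contextUUID, key, ...
        }
        r.lpush(QUEUE_KEY, json.dumps(sample))

    r = redis.Redis(host='localhost', port=6379)      # CollectorRedis address is an assumption
    publish_metric(r, '10.90.36.137', 'cpu_ssCpuSystemPerCpu', 162450,
                   {'device': '10.90.36.137', 'key': 'Devices/10.90.36.137'})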
Performance Retrieval
- User interacts with RM UI through a Zope instance
- API request for performance data is made to CentralQuery
- Request is proxied through zproxy
- zproxy makes a request to zauth if the request doesn't have an authentication cookie
- Request includes device(s), component(s), datapoint(s), a range of timestamps, and (sometimes) an authentication cookie (see the query sketch after this list)
- CentralQuery forwards the request to OpenTSDBReader
- OpenTSDBReader retrieves the data from HBase
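To make the shape of such a request concrete, here is a sketch of a performance query carrying the parameters listed above (a metric derived from device and datapoint, a time range, tags, and an auth cookie). The endpoint path, port, cookie name, and field names are assumptions for illustration only; consult the CentralQuery API documentation for the real interface.

    import requests  # assumes the requests library

    RM_HOST = 'rm.example.com'  # hypothetical Resource Manager host

    query = {
        'start': '1h-ago',
        'end': 'now',
        'metrics': [{
            'metric': '10.90.36.137/cpu_ssCpuSystemPerCpu',  # device/datapoint
            'tags': {'key': 'Devices/10.90.36.137'},
        }],
    }

    # The request goes through zproxy; the path and cookie name are hypothetical.
    resp = requests.post('https://%s/api/performance/query' % RM_HOST,
                         json=query,
                         cookies={'ZAuthToken': '<token>'},
                         verify=False)
    print(resp.status_code, resp.json())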
Performance Collection/Storage Troubleshooting
Sometimes performance collection can fail
- Check for events; they may point you to a failure
- Verify credentials, connectivity, and any required configuration on the target device
- Run the collector daemon in debug mode
- See if values are being collected
- If they are, check for failures to send data to redis (see the queue-depth sketch after this list)
- If they're not, check for error messages that may indicate where it's failing
- If you're getting 500 errors, check zauth and see if it's overloaded; if it is, you may need to add more instances
- Ensure performance data pipeline services are running and passing all health checks
- Failing that, look at the failing services' logs to identify the source of the problem
- Data corruption or a service that didn't start correctly are common culprits; the logs should help you identify which
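If you suspect metrics are being collected but not shipped (the redis check mentioned above), one quick way to look is to watch the queue depth in CollectorRedis. The key name below is an assumption; list the actual keys first and substitute the real one.

    import time

    import redis  # assumes the redis-py client library

    r = redis.Redis(host='localhost', port=6379)  # CollectorRedis address is an assumption

    # Find the real queue key first; 'metrics-queue' below is hypothetical.
    print(r.keys('*'))

    # A steadily growing length suggests MetricShipper is not draining the queue.
    for _ in range(5):
        print('queue depth:', r.llen('metrics-queue'))
        time.sleep(10)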
Performance Retrieval Troubleshooting
Sometimes performance retrieval (graph rendering) can fail
- If you have a graphing failure localized to a single device, check its associated collector daemon
- If you have a graphing failure localized to an entire remote collector, check its CollectorRedis and MetricShipper
- Make sure that your graph configuration is valid
- Has it worked before?
- Is the problem with all graphs or one graph?
- Make sure values are being collected and stored
- You can check the OpenTSDB public endpoint directly for graph data (see the query sketch after this list)
- Check health checks
- Failing that, look at the failing services' logs to identify the source of the problem
- Make sure that zauth isn't overloaded
- Instances of zauth can be added to help distribute load
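As noted in the bullet about the OpenTSDB public endpoint, you can POST to OpenTSDB's /api/query endpoint to confirm that data points exist for a given metric and time range. The host, port, and metric name below are assumptions based on the log example earlier; adjust them for your deployment.

    import requests  # assumes the requests library

    OPENTSDB = 'http://localhost:4242'  # host/port are deployment-specific assumptions

    query = {
        'start': '1h-ago',
        'queries': [{
            'aggregator': 'avg',
            'metric': '10.90.36.137/cpu_ssCpuSystemPerCpu',  # metric name from the log example
            'tags': {'key': 'Devices/10.90.36.137'},
        }],
    }

    resp = requests.post(OPENTSDB + '/api/query', json=query)
    # An empty list here means nothing was stored for this metric/time range.
    print(resp.status_code, resp.json())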