Performance Data Pipeline

Performance Collection and Storage

Collector daemons gather metrics via their respective protocols

  • Metrics are written as a collection of key-value pairs and tags in the following format:
    2017-02-15 19:45:39,225 DEBUG zen.MetricWriter: \
    publishing metric 10.90.36.137/cpu_ssCpuSystemPerCpu \
    162450 1487187939 {'device': '10.90.36.137', \
    'contextUUID': '12d97430-f7bc-4073-91f8-6743d3ae94a1', \
    'key': 'Devices/10.90.36.137'}
    # Line breaks added for legibility
    
  • Included in the message are the device and datapoint, the value, the timestamp, and tags (device, context UUID, and key)
  • Collected metrics are sent to CollectorRedis, which acts as a queue (see the sketch after this list).
  • All collector daemons on a collector share a CollectorRedis instance.
  • MetricShipper consumes from CollectorRedis and publishes to MetricConsumer
  • Traffic is proxied through zproxy
  • Authentication for zproxy is managed by zauth
  • MetricConsumer acts as the aggregator for metrics from ALL collectors
  • It then forwards metrics to the OpenTSDBWriter
  • OpenTSDBWriter performs the actual writes to HBase
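
For illustration only, the sketch below builds a metric in the shape shown in the log excerpt above and pushes it onto a Redis list, mimicking the collector-to-CollectorRedis hand-off. The Redis address, the queue key ("metrics"), and the JSON encoding are assumptions; the real CollectorRedis queue names and wire format are internal to the product.

    # Sketch: enqueue one metric in CollectorRedis (assumed key and encoding)
    import json
    import time

    import redis

    metric = {
        "metric": "10.90.36.137/cpu_ssCpuSystemPerCpu",  # device/datapoint
        "value": 162450,
        "timestamp": int(time.time()),
        "tags": {
            "device": "10.90.36.137",
            "contextUUID": "12d97430-f7bc-4073-91f8-6743d3ae94a1",
            "key": "Devices/10.90.36.137",
        },
    }

    # All collector daemons on a collector share one CollectorRedis instance,
    # which buffers metrics until MetricShipper forwards them upstream.
    r = redis.Redis(host="localhost", port=6379)
    r.lpush("metrics", json.dumps(metric))  # "metrics" is a hypothetical queue key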

Performance Retrieval

  • User interacts with the Resource Manager (RM) UI through a Zope instance
  • API request for performance data is made to CentralQuery
  • Request is proxied through zproxy
    • zproxy makes a request to zauth if the request doesn't have an authentication cookie
  • Request includes device(s), component(s), datapoint(s), a range of timestamps, and (sometimes) an authentication cookie
  • CentralQuery forwards the request to the OpenTSDB Reader
  • OpenTSDB Reader retrieves the data from HBase (see the sketch after this list)
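
To illustrate the read path, the sketch below queries OpenTSDB's HTTP /api/query endpoint directly for one datapoint over a time range. The endpoint address, metric name, and tag (reused from the log excerpt above) are assumptions; in a live deployment the request would normally travel through zproxy and CentralQuery rather than straight to the reader.

    # Sketch: a time-range query against OpenTSDB (assumed host, metric, and tag)
    import json
    import urllib.request

    query = {
        "start": "1h-ago",  # the requested range of timestamps
        "queries": [{
            "aggregator": "avg",
            "metric": "10.90.36.137/cpu_ssCpuSystemPerCpu",
            "tags": {"key": "Devices/10.90.36.137"},
        }],
    }

    req = urllib.request.Request(
        "http://localhost:4242/api/query",  # assumed OpenTSDB reader address
        data=json.dumps(query).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read()))  # a list of {metric, tags, dps} series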

Performance Collection/Storage Troubleshooting

Sometimes performance collection can fail

  • Check for events; they may point you to a failure
  • Verify credentials, connectivity, and required configuration on the target device

  • Run collector daemon in debug mode

  • See if values are being collected

    • If they are, check for failures to send data to CollectorRedis (see the sketch after this list)
    • If they're not, check for error messages that may indicate where collection is failing
    • If you're getting 500 errors, check whether zauth is overloaded; if it is, you may need to add more zauth instances
  • Ensure performance data pipeline services are running and passing all health checks

  • Failing that, look at failing services' logs to identify the source of the problem
    • It's possible that you have some sort of data corruption or that the service didn't start correctly; the logs should help you narrow down the cause
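
One way to check the collection side from a Python shell is sketched below: confirm CollectorRedis is reachable and watch whether the metric queue is draining. The Redis address and queue key are assumptions; adjust them to match your collector's configuration.

    # Sketch: is CollectorRedis reachable, and is the queue draining?
    import time

    import redis

    r = redis.Redis(host="localhost", port=6379)
    print("CollectorRedis reachable:", r.ping())

    # If MetricShipper is failing to forward metrics, the queue length grows
    # between samples instead of holding steady or shrinking.
    for _ in range(3):
        print("queued metrics:", r.llen("metrics"))  # hypothetical queue key
        time.sleep(5)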

Performance Retrieval Troubleshooting

Sometimes performance retrieval (graph rendering) can fail

  • If you have a graphing failure localized to a single device, check its associated collector daemon
  • If you have a graphing failure localized to an entire remote collector, check its CollectorRedis and MetricShipper
  • Make sure that your graph configuration is valid
  • Has it worked before?
  • Is the problem with all graphs or one graph?
  • Make sure values are being collected and stored
  • You can check the OpenTSDB public endpoint directly for graph data (see the sketch after this list)
  • Review service health checks
  • Failing that, look at failing services' logs to identify the source of the problem
  • Make sure that zauth isn't overloaded
    • Instances of zauth can be added to help distribute load
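
To separate "data is not being stored" from "data is not being rendered", you can count the points OpenTSDB returns for a metric over a recent window, as in the sketch below. The endpoint, metric name, and tag are assumptions carried over from the earlier examples.

    # Sketch: count recently stored points for one metric (assumed names)
    import json
    import urllib.request

    def stored_point_count(metric, tags, window="15m-ago",
                           url="http://localhost:4242/api/query"):
        query = {"start": window,
                 "queries": [{"aggregator": "avg", "metric": metric, "tags": tags}]}
        req = urllib.request.Request(url, data=json.dumps(query).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            results = json.loads(resp.read())
        return sum(len(series.get("dps", {})) for series in results)

    # Zero points over the window suggests a collection/storage problem rather
    # than a graph-rendering problem.
    print(stored_point_count("10.90.36.137/cpu_ssCpuSystemPerCpu",
                             {"key": "Devices/10.90.36.137"}))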