Event pipeline
Event Processing Explained
- Events are generated by collector daemons
  - Passive collection is performed by daemons like zentrap, zensyslog, etc.
  - Active collection is performed by daemons like zenpython, zencommand, etc.
- Queued events are sent to ZenHub
  - If zenhub is not available to receive queued events, they will be held internally
    - Collector daemons keep an internal event queue
      - Not persistent
      - Defaults to 5000
      - Rotating buffer; holds only the 5000 most recent events
      - Adjust via the maxqueuelen setting in the collector daemon config
- Zenhub acts as an aggregator for events from collectors
  - zenhub's parent thread receives events from collectors
  - Events are validated; they must contain at least a device (which doesn't necessarily have to be added into RM for monitoring), a severity, and a message or else they're dropped
  - Event processing tasks are queued internally in zenhub's worklist
  - zenhub workers consume tasks from the worklist
    - SendEvents tasks are worked by publishing events to the zenoss.queues.zep.rawevents queue in RabbitMQ
- Zeneventd consumes events from zenoss.queues.zep.rawevents
  - Maps events from /Unknown to appropriate class based on mappings
  - Event enrichment
    - Provides 'device' and 'component' contexts in transforms
    - Populates device data in events
      - ProductionState
      - Groups/Systems/Locations
      - Device Priority
      - etc
  - Applies transform code based on event class membership
  - Executes zeneventd post-plugins
  - Publishes contextualized events to zenoss.queues.zep.zenevents
- Zeneventserver consumes events from zenoss.queues.zep.zenevents
  - Serves as the heart of the event processing engine and has multiple roles
    - Stores events in the zenoss_zep database
      - MariaDB-Events container
    - Indexes stored events in Lucene indexes
      - $ZENHOME/var/zeneventserver
    - Handles event aging and archiving
    - Provides the back-end for the event API
    - Serves the event console and archive UI elements
    - Serves the event 'rainbow' functionality in the UI
    - Maintains a list of ping-down devices which zenhub uses when it builds configs
    - Handles heartbeat monitoring
    - Evaluates triggers and publishes to zenoss.queues.zep.signal when a match is found
- Zenactiond consumes events from zenoss.queues.zep.signal queue
  - Processes notifications, which can:
    - Email a user
    - Run a command to perform corrective action
    - Send a syslog message
    - Send an SNMP trap
    - Generate or update a support ticket
    - Integrate with other custom solutions
      - ServiceNow
      - RemedyITSM
      - etc
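The first two stages of the pipeline can be modeled in a few lines. The following is a minimal sketch, not the actual Zenoss code: it assumes the collector queue behaves like a bounded buffer that silently discards its oldest entry when full, and that validation only checks for the three required fields. The names `CollectorEventQueue`, `is_valid`, and the dictionary keys are hypothetical and chosen for illustration.

```python
from collections import deque

# Hypothetical names throughout; this models the behaviour described above,
# not the real collector daemon or zenhub implementation.
MAX_QUEUE_LEN = 5000  # mirrors the documented maxqueuelen default
REQUIRED_FIELDS = ("device", "severity", "message")


class CollectorEventQueue:
    """Non-persistent rotating buffer that keeps only the newest events."""

    def __init__(self, max_len=MAX_QUEUE_LEN):
        # deque(maxlen=...) discards the oldest entry when a new one
        # is appended to a full buffer
        self._events = deque(maxlen=max_len)
        self.dropped = 0

    def add(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the oldest event is about to be overwritten
        self._events.append(event)

    def flush(self):
        """Return all held events (oldest first) and empty the queue."""
        events = list(self._events)
        self._events.clear()
        return events


def is_valid(event):
    """An event needs a device, a severity, and a message."""
    return all(event.get(field) not in (None, "") for field in REQUIRED_FIELDS)


# Usage: a tiny queue makes the rotation visible.
queue = CollectorEventQueue(max_len=3)
for i in range(5):
    queue.add({"device": "router-%d" % i, "severity": 4, "message": "link down"})

held = queue.flush()
print([e["device"] for e in held], "dropped:", queue.dropped)
print(is_valid({"device": "router-9", "severity": 0, "message": "link up"}))
print(is_valid({"device": "router-9", "severity": 4}))
```

The sketch makes one consequence of the rotating buffer explicit: if zenhub stays unavailable long enough for more than maxqueuelen events to accumulate, the oldest events are lost, and because the queue is not persistent, a daemon restart loses all of them.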
Troubleshooting event flow
Identifying Bottlenecks
If you suspect you're having a problem with the event pipeline, check rabbitmq first:
- rabbitmq public endpoint
- rabbitmqctl in the rabbitmq container
    rabbitmqctl list_queues -p /zenoss
    rabbitmqctl list_queues -p /zenoss messages consumers name
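When the queue listing is long, it can help to filter it programmatically. The sketch below parses the text produced by `rabbitmqctl list_queues -p /zenoss messages consumers name` and flags queues that look backed up. It assumes the output has one queue per line with whitespace-separated columns in the requested order; the function names, the sample output, and the backlog threshold of 1000 are illustrative assumptions, not Zenoss or RabbitMQ defaults.

```python
def parse_list_queues(output):
    """Parse 'messages consumers name' rows into {name: (messages, consumers)}.

    Lines that are not three columns starting with two integers
    (banners, column headers, blank lines) are skipped.
    """
    queues = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        messages, consumers, name = parts
        if not (messages.isdigit() and consumers.isdigit()):
            continue
        queues[name] = (int(messages), int(consumers))
    return queues


def find_suspect_queues(queues, backlog_threshold=1000):
    """Flag queues with pending messages and no consumer, or a large backlog."""
    suspects = []
    for name, (messages, consumers) in sorted(queues.items()):
        if consumers == 0 and messages > 0:
            suspects.append((name, "no consumers"))
        elif messages > backlog_threshold:
            suspects.append((name, "backlog"))
    return suspects


# Illustrative output only; the numbers are made up.
SAMPLE_OUTPUT = """Listing queues for vhost /zenoss ...
messages consumers name
0 1 zenoss.queues.zep.zenevents
25000 1 zenoss.queues.zep.rawevents
12 0 zenoss.queues.zep.signal
"""

for name, reason in find_suspect_queues(parse_list_queues(SAMPLE_OUTPUT)):
    print(name, reason)
```

A queue with messages but zero consumers usually means the consuming daemon is down, while a large backlog with consumers attached points at a flood or a slow consumer; the sections below cover each queue in turn.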
Rawevents Troubleshooting
Backups in the rawevents queue are typically caused by:
- An event flood
  - Look at collector performance graphs (event queues graph) to identify the source of the flood
  - Look in the event console for new syslog or SNMP trap messages that are rapidly incrementing in count
  - As a last-ditch effort, you can turn off zensyslog/zentrap temporarily to stop a flood
- Slowness in zeneventd
  - Look in the zeneventd logs for long-running transform messages
  - You may need to optimize your event transforms
  - You may need more zeneventd workers or instances
- Throttling in RabbitMQ (really only an issue in 4.x)
  - Make sure the RabbitMQ container has at least 1GB of memory available
  - Make sure that /var in the RabbitMQ container has at least 1GB of disk space available
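Spotting messages that are "rapidly incrementing in count" amounts to comparing two snapshots of event counts. The following is a rough sketch of that comparison, assuming you have already copied or exported the counts by some means; the `(device, summary)` key, the function name, and the sample numbers are hypothetical.

```python
def fastest_growing(previous, current, top_n=3):
    """Rank event keys by how much their count grew between two snapshots.

    `previous` and `current` map a hypothetical (device, summary) key to the
    event count seen in the event console at two points in time.
    """
    growth = []
    for key, count in current.items():
        delta = count - previous.get(key, 0)
        if delta > 0:
            growth.append((delta, key))
    # Largest growth first; ties are broken by key for a stable order
    growth.sort(key=lambda item: (-item[0], item[1]))
    return [(key, delta) for delta, key in growth[:top_n]]


# Made-up snapshots taken a minute apart.
before = {("fw-1", "login failed"): 10, ("sw-2", "link flap"): 5}
after = {("fw-1", "login failed"): 510, ("sw-2", "link flap"): 7, ("sw-3", "fan fault"): 1}

for key, delta in fastest_growing(before, after, top_n=2):
    print(key, "+%d" % delta)
```

The device at the top of the ranking is the first candidate for the source of the flood.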
Zenevents Troubleshooting
Backups in the zenevents queue are typically caused by:
- An event flood or RabbitMQ throttling
  - Use the steps in the Rawevents Troubleshooting section above
- A problem in zeneventserver
  - Check the zeneventserver log
  - If there's an error coming from the database, it will normally show up as a JDBC exception here
  - MariaDB/MySQL error codes are often easy to diagnose with google-fu
  - If there's an error indexing events, it may be necessary to rebuild the Lucene indexes
    - Non-graceful shutdowns and other causes of data corruption can corrupt the zep Lucene indexes
- Slowness in MariaDB-Events
  - Check database tuning
    - Make sure your innodb_buffer_pool_size is adequate for the size of your zep db
    - Make sure there's not a CPU/memory/disk IO constraint
    - Use zencheckdbstats to identify tuning issues
- Slowness in zeneventserver
  - Do you have more than 200 triggers enabled?
    - You may need to increase your trigger cache size in zeneventserver.conf
  - Does zeneventserver need more memory?
    - You can increase the heapsize by increasing the RAM commitment
Signal Troubleshooting
Backups in the signal queue are typically caused by:
- An event flood
  - Use the previously defined steps for troubleshooting floods
- An external failure
  - Put zenactiond in debug mode
  - Check the zenactiond log
    - Slowness or failures to send notifications will show up here
- Too many notifications for your zenactiond workers to keep up
  - Add more workers to zenactiond and restart it, or add more instances of zenactiond
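The three troubleshooting sections can be condensed into a lookup keyed by the backed-up queue. This is only a summary sketch of the guidance in this document; the `TRIAGE` table and `triage` function are hypothetical helpers, not part of Zenoss.

```python
# Queue name -> consuming daemon and the likely causes listed in this guide.
TRIAGE = {
    "zenoss.queues.zep.rawevents": {
        "consumer": "zeneventd",
        "likely_causes": [
            "event flood",
            "slowness in zeneventd",
            "RabbitMQ throttling",
        ],
    },
    "zenoss.queues.zep.zenevents": {
        "consumer": "zeneventserver",
        "likely_causes": [
            "event flood or RabbitMQ throttling",
            "problem in zeneventserver",
            "slowness in MariaDB-Events",
            "slowness in zeneventserver",
        ],
    },
    "zenoss.queues.zep.signal": {
        "consumer": "zenactiond",
        "likely_causes": [
            "event flood",
            "external failure",
            "too many notifications for the zenactiond workers",
        ],
    },
}


def triage(queue_name):
    """Return a one-line hint for a backed-up queue."""
    entry = TRIAGE.get(queue_name)
    if entry is None:
        return "%s: not covered by this guide" % queue_name
    return "%s (consumed by %s): check for %s" % (
        queue_name,
        entry["consumer"],
        "; ".join(entry["likely_causes"]),
    )


print(triage("zenoss.queues.zep.signal"))
```

Because each queue has exactly one consuming daemon, the name of the backed-up queue is usually enough to decide which daemon's logs to read first.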