Anatomy of a chat bot

Posted by ProgrammingAce on Mon 04 May 2015

I’ve been a big fan of ‘Chat Ops’ since long before it had a cool name. A couple of years ago, I authored a high-speed outage alert system called ‘Pony Express’, running on XMPP, for the high-frequency trading firm I was working for at the time. The mantra in that environment was ‘Seconds can cost millions’, so having a quick way to alert the operations team about potential problems was a huge deal. Last year I started working on a similar system for my current employer, and I was able to open source the project.

After several false starts over the years, our operations team finally settled on a ‘chat platform’ so we could have open discussions across our globally distributed team. In our case we’re using Slack, but we’re stuck with the free version, which means we don’t get any of the cool automatic integrations that Slack is known for. I strongly believe in using the carrot rather than the stick, so we needed something to lure our team into the channel. Enter a chat bot as an information radiator.

Luckily, even the free version of Slack has an IRC integration. Unfortunately, their IRC server doesn’t actually follow the IRC RFC, and this causes most chat bots to crash. I’m not sure whether the discrepancy is intentional, but it’s a great way to push people onto the paid version of the Slack platform. I happen to be a longtime user/admin/operator on IRC, and I’m fairly experienced in designing and reverse-engineering communications protocols. So I wrote my own chat bot (with blackjack and hookers, naturally), one that simply ignores the quirks in Slack’s broken implementation of the IRC spec.
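The post doesn’t say exactly how Slack’s server deviates from the RFC, so what follows is only a minimal sketch of the defensive idea, assuming the basic RFC 1459 line shape: an optional :prefix, a command, space-separated parameters, and an optional trailing parameter after " :". The IrcMessage type and parse_line function are hypothetical names, not Univac’s code. Anything the parser can’t make sense of comes back as None instead of raising, so one odd line can’t crash the bot.

```python
from typing import NamedTuple, Optional


class IrcMessage(NamedTuple):
    prefix: Optional[str]   # e.g. "nick!user@host", or None if the line has no prefix
    command: str            # e.g. "PRIVMSG", "PING", "001"
    params: list[str]       # middle params, plus the trailing param if present


def parse_line(line: str) -> Optional[IrcMessage]:
    """Parse one IRC line leniently; return None instead of raising on malformed input."""
    line = line.rstrip("\r\n")
    if not line.strip():
        return None
    prefix = None
    if line.startswith(":"):
        # The prefix runs up to the first space; a line that is only a prefix is malformed.
        prefix, sep, line = line[1:].partition(" ")
        if not sep:
            return None
    # The trailing parameter starts at the first " :" and may itself contain spaces.
    trailing = None
    if " :" in line:
        line, trailing = line.split(" :", 1)
    # split() with no arguments tolerates repeated spaces, which a strict parser might reject.
    parts = line.split()
    if not parts:
        return None
    command, params = parts[0].upper(), parts[1:]
    if trailing is not None:
        params.append(trailing)
    return IrcMessage(prefix, command, params)
```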

Meet Univac, a multi-threaded IRC bot written entirely in Python and designed to be easy to modify. Changing out the transport layer (i.e., switching from IRC to XMPP) is trivial, and adding functionality just means adding a new processing thread. It currently comes as an open source component (linked on GitHub above) for casual use, plus an enterprise version that will be open sourced later. The framework is identical between the two, but the enterprise version of the bot is built to tie into the SaaS tools our operations team uses on a daily basis.
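One way to keep the transport swappable, sketched under the assumption that the bot core only ever needs to receive a (channel, sender, text) message and send text to a channel. Transport, InMemoryTransport, Bot and the !ping command are all hypothetical; a real IRC or XMPP class would implement the same two methods over a socket.

```python
import queue
from abc import ABC, abstractmethod
from typing import Optional


class Transport(ABC):
    """What the bot core needs from a chat protocol; IRC or XMPP would each subclass this."""

    @abstractmethod
    def receive(self, timeout: Optional[float] = None) -> Optional[tuple[str, str, str]]:
        """Block for the next (channel, sender, text) message, or return None on timeout."""

    @abstractmethod
    def send(self, channel: str, text: str) -> None:
        """Post text to a channel."""


class InMemoryTransport(Transport):
    """Hypothetical test double: messages are queued in memory instead of going to a server."""

    def __init__(self) -> None:
        self.inbox: "queue.Queue[tuple[str, str, str]]" = queue.Queue()
        self.outbox: list[tuple[str, str]] = []

    def receive(self, timeout=None):
        try:
            return self.inbox.get(timeout=timeout)
        except queue.Empty:
            return None

    def send(self, channel, text):
        self.outbox.append((channel, text))


class Bot:
    """The core only talks to the Transport interface, so swapping protocols is a constructor argument."""

    def __init__(self, transport: Transport) -> None:
        self.transport = transport

    def handle_one(self, timeout: float = 1.0) -> bool:
        """Process a single incoming message; return False if none arrived in time."""
        msg = self.transport.receive(timeout=timeout)
        if msg is None:
            return False
        channel, sender, text = msg
        if text == "!ping":
            self.transport.send(channel, f"{sender}: pong")
        return True
```

The in-memory transport also makes handlers testable without connecting to a server at all.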

Univac’s framework is simple. It’s multi-threaded so that it doesn’t block while managing data in multiple channels. The threads are broken down into two types, proactive and reactive, with a main loop tying them together.

Proactive threads gather data from our environment and post to the appropriate channels when there’s something to report. They are spun off when Univac starts up and run independently of the main loop; the main loop only monitors them to see if they’ve crashed, restarting them if necessary. Proactive threads usually query APIs in our environment, looking for things like new tickets in our ticketing system or merge requests in our Git repos. They’re basically loops with a sleep timer, usually set to run every 60 seconds. They are also usually stateful: they build an index when they start up, and only alert when new information comes in.
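A minimal sketch of one proactive thread plus the main loop’s restart check, assuming a fetch callable that returns item IDs (say, open ticket numbers) and a post callable that writes a line to a channel. ProactivePoller, restart_dead and the "New item" message format are hypothetical, not Univac’s actual implementation.

```python
import threading
from typing import Callable, Dict, Iterable, List, Optional


class ProactivePoller(threading.Thread):
    """Polls a source every `interval` seconds and posts only items it hasn't seen before."""

    def __init__(self, fetch: Callable[[], Iterable[str]], post: Callable[[str], None],
                 interval: float = 60.0, stop_event: Optional[threading.Event] = None) -> None:
        super().__init__(daemon=True)
        self.fetch = fetch              # hypothetical: e.g. returns open ticket IDs
        self.post = post                # hypothetical: e.g. sends a line to the ops channel
        self.interval = interval
        self.stop_event = stop_event if stop_event is not None else threading.Event()
        self.ready = threading.Event()  # set once the startup index is built

    def run(self) -> None:
        # Stateful: index whatever already exists so startup doesn't flood the channel.
        seen = set(self.fetch())
        self.ready.set()
        # Event.wait doubles as the sleep timer and returns True as soon as a stop is requested.
        while not self.stop_event.wait(self.interval):
            # An exception here ends the thread; the main loop's health check restarts it.
            for item in self.fetch():
                if item not in seen:
                    seen.add(item)
                    self.post(f"New item: {item}")


def restart_dead(threads: Dict[str, threading.Thread],
                 factories: Dict[str, Callable[[], threading.Thread]]) -> List[str]:
    """One pass of the main loop's health check: replace and start any thread that died."""
    restarted = []
    for name, thread in list(threads.items()):
        if not thread.is_alive():
            # A Thread object can only be started once, so build a fresh one.
            threads[name] = factories[name]()
            threads[name].start()
            restarted.append(name)
    return restarted
```

Using Event.wait as the sleep timer means a shutdown request doesn’t have to wait out the full 60 seconds.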

Reactive threads respond to user requests in the chat channels: they look up the requested information, then report back in the channel. The main loop blocks waiting for messages from the IRC server, then spawns a reactive thread to process each one. When a reactive thread starts up, it searches the message from the server for keywords that trigger it to perform an action. Reactive threads can do things like DNS lookups, searches in other API systems, or reporting status information about the Univac bot itself. They are usually triggered by a command in the chat window prefixed with a global command character (‘!’ by default); for example, “!whereareyou” tells the bot to post its IP address in the channel. Univac can also use reactive threads to parse general chat for triggers: posting a link to an xkcd comic will make the bot post a summary of the comic, including the title, mouseover text, and year of publication. Reactive threads are also used to post things like the bot’s ‘help’ screen, listing all the available commands.
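The reactive side might look roughly like this: a registry of commands keyed by name behind the command character, a regex for passive triggers like xkcd links, and one short-lived thread per incoming message. The names here (COMMAND_CHAR, COMMANDS, react, on_message and so on) are hypothetical, and the xkcd handler is a stub; its comment points at xkcd’s public JSON endpoint, which a real version would have to fetch.

```python
import re
import socket
import threading
from typing import Callable, Dict, Optional

COMMAND_CHAR = "!"  # the global command character ('!' by default)
XKCD_LINK = re.compile(r"https?://(?:www\.)?xkcd\.com/(\d+)")

COMMANDS: Dict[str, Callable[[str], str]] = {}


def command(name: str):
    """Decorator that registers a handler for COMMAND_CHAR + name."""
    def register(func):
        COMMANDS[name] = func
        return func
    return register


@command("help")
def cmd_help(_args: str) -> str:
    return "Commands: " + ", ".join(COMMAND_CHAR + n for n in sorted(COMMANDS))


@command("whereareyou")
def cmd_whereareyou(_args: str) -> str:
    try:
        # Resolves the local hostname; on some hosts this yields a loopback address.
        return "I'm at " + socket.gethostbyname(socket.gethostname())
    except OSError:
        return "I couldn't work out my address."


def react(text: str) -> Optional[str]:
    """Return a reply for one chat message, or None if nothing in it is a trigger."""
    if text.startswith(COMMAND_CHAR):
        name, _, args = text[len(COMMAND_CHAR):].partition(" ")
        handler = COMMANDS.get(name.lower())
        return handler(args) if handler else None
    match = XKCD_LINK.search(text)
    if match:
        # A real handler would fetch https://xkcd.com/<n>/info.0.json and report its
        # "title", "alt" (mouseover) and "year" fields; this stub only echoes the number.
        return f"That's xkcd #{match.group(1)}"
    return None


def on_message(text: str, send: Callable[[str], None]) -> threading.Thread:
    """What the main loop does per message: hand it to a short-lived reactive thread."""
    def work() -> None:
        reply = react(text)
        if reply is not None:
            send(reply)
    thread = threading.Thread(target=work, daemon=True)
    thread.start()
    return thread
```

Because the main loop only hands off each message, a slow lookup in one thread doesn’t stop the bot from reading the next line from the server.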

Error handling is one of the main weaknesses in Univac right now, and is probably the area that could use the most work. When integrating with a server that doesn’t obey the spec, you have to be prepared for random ‘strangeness’. Univac can detect a few specific error conditions, but generally it just makes sure all of its threads are running, and tries to do a safe shutdown if any of them die. By default, it comes with a wrapper that watches the service and restarts it if it shuts down, but this can be replaced with something like systemd or init.d. Because it shuts down when it detects errors, Univac should be safer to run without worrying about it forking out of control or being banned from the chat service for misbehaving. It also makes Univac safer to develop against, since bugs in the code will just cause a shutdown rather than a fork bomb. Univac is detached from standard input on startup, and its standard output goes to a local console, so it can’t accidentally expose ‘secrets’ by dumping a crash report into the IRC channel.
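A bare-bones version of the restart wrapper, assuming the bot exits with a non-zero status after an error-triggered shutdown, so that a deliberate exit 0 stops the wrapper (roughly what systemd’s Restart=on-failure does). The run_forever function and its parameters are hypothetical.

```python
import subprocess
import sys
import time
from typing import List


def run_forever(cmd: List[str], max_restarts: int = 5, backoff: float = 5.0) -> int:
    """Rerun the bot whenever it fails, giving up after max_restarts failed runs."""
    restarts = 0
    while True:
        # stdin=DEVNULL detaches the bot from terminal input; stdout and stderr are
        # inherited, so crash output lands on the local console rather than in a channel.
        result = subprocess.run(cmd, stdin=subprocess.DEVNULL)
        if result.returncode == 0:
            return 0  # a clean, intentional shutdown: don't restart
        restarts += 1
        if restarts > max_restarts:
            print(f"giving up after {max_restarts} restarts", file=sys.stderr)
            return result.returncode
        # A growing delay keeps a crash loop from hammering (and getting banned by) the server.
        time.sleep(backoff * restarts)
```

It would be launched with something like run_forever([sys.executable, "univac.py"]), where univac.py stands in for however the bot is actually started; under systemd, the unit’s Restart= setting plays the same role.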

Univac is very much a toy project, and is not designed to compete with or replace any of the big names in chat ops. It has, however, been useful in helping teach people to program in Python. Since each thread is designed to be self-contained, it’s quick and easy to show someone how to take incoming messages, process them, and respond in a meaningful way. It doesn’t take very long to go from learning what a variable is to having a chat bot respond to your messages. Because it connects to public IRC networks, it can run from anywhere you have a network connection, without needing to set up your own infrastructure.