Slack 101: Controlling App Infrastructure with Slack Bots
Hello, friends. It’s Dustin Keib. I’m here to discuss one of my favorite topics: ChatOps. That’s right, we’re going to talk about using bots in Slack to control application infrastructure in this first Slack Bots 101 blog.
First, let me describe a typical application infrastructure centered around a model/view/controller (MVC) architecture. Many applications follow this architecture because it separates the application's concerns, keeping the user interface (view) apart from the database (model) and the business logic (controller).
This provides many benefits, such as improving security, while allowing different stakeholders to focus on the various parts of the system. For example, a team of data experts would focus on the model, a team of user experience experts would focus on the view, and a backend team would direct its attention to the controller.
In a modern cloud-native application, it’s common to host the frontend (view) in a managed Platform as a Service (PaaS), such as AWS Elastic Beanstalk or Google App Engine. You’d host the model in a managed database solution, such as Google Cloud SQL or Amazon Aurora, and run the controller in a managed container solution, such as Google Kubernetes Engine (GKE) or Amazon Elastic Kubernetes Service (EKS). One advantage of this architecture is that it decouples the frontend from the backend API, so we can test modifications to each independently.
Consider, if you will, the following simplified application architecture:
Traffic flows in from internet users, through the Web Application Firewall (WAF), to the currently live production Web UI server (Prod A), which requests the needed information from the REST API (controller). The controller, in turn, requests the appropriate record from the database. The astute observer will notice that we’ve omitted a few steps for simplicity.
Typically, in a cloud-native environment, we’ll have a Web Application Firewall and a load balancer that checks that the web server is alive and can handle a request before routing traffic to that server. In the example architecture above, Prod A is currently serving traffic (Green), while Prod B (Blue) is sitting there ready to take requests as soon as Prod A (Green) is either down or unable to fulfill the request.
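To make the routing rule concrete, here is a minimal Python sketch of what the load balancer is doing. The names `Backend`, `is_healthy`, and `pick_backend` are made up for illustration, and the health probes are stand-ins; a real load balancer would poll a health-check endpoint on each server.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Backend:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical health probe


def pick_backend(active: Backend, standby: Backend) -> Optional[Backend]:
    """Return the server that should receive the next request."""
    if active.is_healthy():
        return active
    if standby.is_healthy():
        return standby
    return None  # neither server can take traffic


# Simulate Prod A being down while Prod B is ready.
prod_a = Backend("prod-a", is_healthy=lambda: False)
prod_b = Backend("prod-b", is_healthy=lambda: True)

chosen = pick_backend(prod_a, prod_b)
print(chosen.name if chosen else "no healthy backend")  # prints "prod-b"
```

Keep in mind that a check like this only catches a server that fails its probe. A server that answers the probe but returns the wrong data would still be treated as healthy.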
Let’s suppose there’s a critical issue with the Prod A web server, and you’re out at dinner (a little wishful thinking while we’re all staying safely isolated!). You happen to be on call for the environment, and a recent deployment of the Web UI code calls the controller improperly. Critical information isn’t being returned when a user navigates the application. Unfortunately, this change slipped through QA, and the development team is gone for the day. Since you’re at dinner, you frantically run out to the car, realizing in an instant that your work laptop is still sitting on the counter, right where you nearly left your keys while running out the door.
Now, you have a few options. First, you can try to use your phone’s web browser or a mobile cloud management application to fail over from Prod A to Prod B. I think we’ve all been there, and it’s far from ideal. You may do more harm than good! Second, you could dash home and grab your laptop, but your partner may take issue with being left behind at the restaurant or with cutting dinner short!
This is where Slack Infrastructure Automation Bots come in very handy! Imagine that, instead, you go to a Slack channel you’ve set up with a custom bot. You type in a simple Slack command, e.g., “/infrabot fail prod”. The bot then intelligently responds to the command, saying “Hello, I see you want to fail over the Prod environment. Prod A is currently active. Would you like to fail over to Prod B?”
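Here is a rough sketch of how a bot might interpret that command. Slack delivers a slash command to your app as a form-encoded POST that includes `command` and `text` fields; everything else here (the `ENVIRONMENTS` table, the `handle_slash_command` function, and the usage message) is a hypothetical stand-in. A real bot would also verify the request signature and look up the active server from the load balancer instead of a hard-coded dictionary.

```python
from urllib.parse import parse_qs

# Hypothetical in-memory state; a real bot would query the load balancer.
ENVIRONMENTS = {"prod": {"active": "Prod A", "standby": "Prod B"}}


def handle_slash_command(body: str) -> str:
    """Build the bot's reply for a form-encoded slash command payload."""
    fields = parse_qs(body)
    command = fields.get("command", [""])[0]
    words = fields.get("text", [""])[0].split()

    # Only "/infrabot fail <environment>" is understood in this sketch.
    if command != "/infrabot" or len(words) != 2 or words[0] != "fail":
        return "Usage: /infrabot fail <environment>"

    env = ENVIRONMENTS.get(words[1].lower())
    if env is None:
        return f"Unknown environment: {words[1]}"

    return (
        f"Hello, I see you want to fail over the {words[1].capitalize()} "
        f"environment. {env['active']} is currently active. "
        f"Would you like to fail over to {env['standby']}?"
    )


# Simulated payload for the command "/infrabot fail prod".
print(handle_slash_command("command=%2Finfrabot&text=fail+prod"))
```

Note that the handler only asks the question. Nothing changes in production until the user answers.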
You answer “Yes.” The bot, after taking any additional confirmation steps you require (e.g., peer review), then issues the appropriate commands to the WAF / load balancer, routing all incoming traffic to Prod B and thereby eliminating the issue.
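The confirm-then-act flow can be sketched as a small state object. This is only an illustration under assumptions: the `FailoverRequest` class, the one-approval default, and the user names are all invented here, and the actual traffic switch is reduced to swapping two strings where a real bot would call the WAF / load balancer API.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class FailoverRequest:
    requester: str
    active: str = "Prod A"
    standby: str = "Prod B"
    confirmed: bool = False
    approvers: Set[str] = field(default_factory=set)

    def confirm(self, user: str) -> None:
        # Only the person who asked for the failover can answer "Yes".
        if user == self.requester:
            self.confirmed = True

    def approve(self, user: str) -> None:
        # Peer review: an approval must come from someone else.
        if user != self.requester:
            self.approvers.add(user)

    def execute(self, required_approvals: int = 1) -> Optional[str]:
        """Swap active and standby once every check passes.

        Returns the newly active server, or None if a check is missing.
        """
        if not self.confirmed or len(self.approvers) < required_approvals:
            return None
        # Placeholder for the real WAF / load balancer call.
        self.active, self.standby = self.standby, self.active
        return self.active


request = FailoverRequest(requester="oncall_engineer")
request.confirm("oncall_engineer")
print(request.execute())  # prints "None": no peer approval yet
request.approve("teammate")
print(request.execute())  # prints "Prod B"
```

Putting the checks in `execute` means the bot cannot touch production by accident, no matter what order the messages arrive in.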
Now that the issue has been resolved, you can go back to enjoying your dinner. Once the root cause has been discovered and resolved, you can push the fix to Prod A, swing traffic from Prod B to Prod A, then implement the fix on Prod B.
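The order of those recovery steps matters, so here is a small sketch that spells it out. The `Environment` class, the `recover` function, and the version strings are hypothetical, and the sketch assumes exactly two servers; in practice each step would be a deployment or a load balancer change rather than a dictionary update.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Environment:
    active: str                # server currently receiving traffic
    versions: Dict[str, str]   # server name -> deployed Web UI version


def recover(env: Environment, fixed_version: str) -> List[str]:
    """Fix the idle server, swing traffic back to it, then fix the other."""
    steps: List[str] = []
    idle = next(name for name in env.versions if name != env.active)

    # 1. Push the fix to the server that is not taking traffic.
    env.versions[idle] = fixed_version
    steps.append(f"deployed {fixed_version} to {idle}")

    # 2. Swing traffic over to the freshly fixed server.
    previous_active, env.active = env.active, idle
    steps.append(f"traffic moved from {previous_active} to {env.active}")

    # 3. Bring the other server up to the same version.
    env.versions[previous_active] = fixed_version
    steps.append(f"deployed {fixed_version} to {previous_active}")
    return steps


# After the failover, Prod B is live and Prod A still has the bad deployment.
env = Environment(
    active="Prod B",
    versions={"Prod A": "1.4.0-broken", "Prod B": "1.3.9"},
)
for step in recover(env, "1.4.1"):
    print(step)
```

At no point does live traffic hit a server that is mid-deployment, which is the whole point of keeping two production servers around.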
As you can see, having a conversational interface that lets you easily fail over your production environment during an outage can be a huge benefit. There are a number of other interesting things you can do with Slack bots, which we will cover in future sessions. Stay tuned!