Operating and maintaining systems at scale with automation: A guide
For a large or midsize managed service provider (MSP), managing numerous customers, each with unique characteristics, across tens of thousands of systems can be challenging. In this article I want to pull back the curtain, so to speak, on some of the automation and tools that I have used to solve these problems. The approach has three main components: collect, model and react.
The first problem facing us is an overwhelming flood of data. The sources I prefer to work with are CloudWatch metrics, CloudTrail events, custom monitoring information, service requests, incidents, tags, users, accounts, subscriptions, and alerts. Each source is structured differently, tells a different story, and arrives at an unrelenting pace. We need to identify all the sources, collect the data, and store it in a central place so we can begin to consume it and make correlations between events.
Most of the data I described above can be gathered from AWS and Azure APIs directly, while the rest may need to be ingested with an agent or by custom scripts. We also need to make sure we bring in a consistent core set of data for each of our customers, while expanding that to include specialized data that perhaps only certain customers have. All of it can be sent to Splunk indexers, for example, with a separate index for every customer so that data stays segregated and secure.
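To make the per-customer separation concrete, here is a minimal sketch of how a collected record might be wrapped for Splunk's HTTP Event Collector before it is sent. The record layout (customer_id, source, payload), the "cust_" index prefix and the function name are assumptions made for illustration, not the setup described above; only the envelope keys (time, source, sourcetype, index, event) follow the HEC event format.

```python
import time

# Hypothetical core fields every collected record must carry,
# regardless of which customer or source it came from.
REQUIRED_FIELDS = ("customer_id", "source", "payload")


def to_hec_event(record, index_prefix="cust_"):
    """Wrap one collected record in a Splunk HEC event envelope.

    The index name is derived from the customer ID so that each
    customer's data lands in its own index. The naming scheme is an
    assumption for this sketch.
    """
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")
    return {
        "time": record.get("timestamp", time.time()),
        "source": record["source"],
        "sourcetype": "_json",
        # Splunk index names are lowercase, so normalize the customer ID.
        "index": f"{index_prefix}{record['customer_id'].lower()}",
        "event": record["payload"],
    }
```

The returned dictionary would then be serialized as JSON and posted to the collector endpoint with the customer's token. Rejecting records that lack the core fields is one way to enforce the consistent core set of data across customers.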
Next we need to present the data in a useful way. How the data is modeled can vary depending on who is using it and how it will be consumed. A dashboard showing several important metrics at a glance gives an engineer the big picture. Seeing this data daily, or throughout the day, makes anomalies very apparent. This is especially helpful because gathering and organizing data at scale by hand is time consuming, and could otherwise reasonably be done only during periodic audits.
Modeling data in a tool like Splunk provides a low-overhead view of up-to-date data, which frees the engineer for more important work. A great example is provisioned resources by region. An engineer who looks at the data on a regular basis would quickly notice that the number of provisioned resources has changed drastically. A 20% increase in the number of EC2 resources could mean several things: perhaps a customer is doing a large deployment, or perhaps someone accidentally put an AWS access key and secret key on GitHub.
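The kind of check an engineer does by eye on that dashboard can be sketched in a few lines. The example below assumes resource counts per region are already available as plain dictionaries; the function name and the handling of regions with no baseline are choices made for this sketch, and the 20% threshold simply mirrors the figure used above.

```python
def flag_resource_swings(baseline, current, threshold=0.20):
    """Return regions whose resource count moved by at least `threshold`.

    `baseline` and `current` map region name -> resource count.
    The value for each flagged region is the relative change, or None
    when the region had no resources in the baseline (a new region has
    no meaningful percentage, but is still worth a look).
    """
    flagged = {}
    for region in sorted(set(baseline) | set(current)):
        before = baseline.get(region, 0)
        after = current.get(region, 0)
        if before == 0:
            if after > 0:
                flagged[region] = None
            continue
        change = (after - before) / before
        if abs(change) >= threshold:
            flagged[region] = change
    return flagged


yesterday = {"us-east-1": 100, "eu-west-1": 50}
today = {"us-east-1": 125, "eu-west-1": 52, "ap-south-1": 8}
print(flag_resource_swings(yesterday, today))
```

Here us-east-1 is flagged for a 25% increase and ap-south-1 is flagged as a region that had nothing provisioned before, which is often the more interesting signal when credentials have leaked.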
I like to provide customers with regular reports and reviews of their cloud environments, and I use the data collected and modeled in Splunk to build them. Historical data trended over a month, quarter and year can prompt questions or tell a story. It can help with business forecasting or with estimating the number of engineers needed to support a given project.
I recently used historical trending data to show progress of a large project that included waste removal and a resource tagging overhaul for a customer. Not only was I able to show progress throughout the project, but I used the same view to ensure that waste did not creep up and that the new tagging standards were being applied going forward.
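As a rough illustration of the tagging side of such a trend view, the sketch below computes the share of resources that carry a required set of tags for each dated snapshot. The required tag names, the snapshot layout and the function names are all hypothetical; they are not the customer's actual tagging standard.

```python
# Hypothetical tagging standard used only for this example.
REQUIRED_TAGS = frozenset({"owner", "environment", "cost-center"})


def tag_compliance(resources, required=REQUIRED_TAGS):
    """Fraction of resources that carry every required tag key."""
    if not resources:
        return 1.0  # nothing provisioned, so nothing is out of compliance
    compliant = sum(
        1 for r in resources if required <= set(r.get("tags", {}))
    )
    return compliant / len(resources)


def compliance_trend(snapshots):
    """Turn {ISO date: [resources]} into a date-ordered list of
    (date, compliance fraction) pairs suitable for charting."""
    return [(day, tag_compliance(snapshots[day])) for day in sorted(snapshots)]
```

Plotting the output over the life of a project shows progress, and continuing to plot it afterwards shows whether the new standard is still being applied.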
Finally, it’s time to act on the data we collected and modeled. Using Splunk alerts, I can apply conditional logic to patterns in the data and act on them. From Splunk I can call our ticketing system’s API to create a new incident for an engineer to investigate a concerning trend, or to notify the customer of a potential security risk. I can also call our own APIs that trigger remediation workflows. A few common scenarios are encrypting S3 buckets, deleting old snapshots, restarting failed backups and requesting cloud provider limit increases.
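The decision between opening an incident and triggering a remediation can be sketched as a simple routing table. The alert type names, workflow names and alert fields below are invented for illustration; the real ticketing and remediation APIs are not shown, and the function only returns a description of what would be called.

```python
# Hypothetical mapping from alert type to remediation workflow,
# covering the common scenarios listed above.
REMEDIATIONS = {
    "s3_bucket_unencrypted": "encrypt_s3_bucket",
    "snapshot_expired": "delete_old_snapshot",
    "backup_failed": "restart_backup",
    "service_limit_near": "request_limit_increase",
}


def route_alert(alert):
    """Decide what to do with one fired alert.

    Alerts with a known remediation go to that workflow; everything
    else becomes an incident for an engineer to investigate.
    """
    workflow = REMEDIATIONS.get(alert["type"])
    if workflow is not None:
        return {
            "action": "remediate",
            "workflow": workflow,
            "customer": alert["customer"],
            "target": alert["resource"],
        }
    return {
        "action": "create_incident",
        "customer": alert["customer"],
        "summary": f"{alert['type']} on {alert['resource']}",
    }
```

Keeping the mapping as data rather than as branching code makes it easy to add a new automated scenario without touching the routing logic, and anything unmapped safely falls back to a human.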
Because we have several independent data sources providing information, we can also correlate events and build more advanced conditional logic. If we see that a server is failing status checks, we can also check whether it recently changed instance families and whether it has the appropriate drivers. This data can be included in the incident, where the engineer can review it without having to look it up separately.
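A minimal sketch of that correlation, assuming instance type changes have already been extracted from the collected events into simple dictionaries: the field names, the 24-hour window and the function name are assumptions for this example, and the driver check is left out.

```python
from datetime import datetime, timedelta


def enrich_status_check_incident(incident, change_events,
                                 window=timedelta(hours=24)):
    """Attach recent instance type changes to a status check incident.

    Only changes to the same instance that happened within `window`
    before the failure are included, so the engineer sees the likely
    cause without searching for it.
    """
    failed_at = incident["failed_at"]
    related = [
        e for e in change_events
        if e["instance_id"] == incident["instance_id"]
        and timedelta(0) <= failed_at - e["changed_at"] <= window
    ]
    enriched = dict(incident)  # leave the caller's incident untouched
    enriched["recent_instance_type_changes"] = related
    return enriched


incident = {"instance_id": "i-1", "failed_at": datetime(2024, 1, 2, 12, 0)}
changes = [
    {"instance_id": "i-1", "changed_at": datetime(2024, 1, 2, 9, 0),
     "from": "m4.large", "to": "m5.large"},
    {"instance_id": "i-2", "changed_at": datetime(2024, 1, 2, 11, 0),
     "from": "t2.micro", "to": "t3.micro"},
]
print(enrich_status_check_incident(incident, changes))
```

In this example only the change to i-1 is attached, since the other event belongs to a different instance.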
The premise of this approach is efficiency: using data and automation to make quicker and smarter decisions. Operating and maintaining systems at scale brings numerous challenges, and if you cannot efficiently handle the vast amount of information coming at you, you will spend a lot of energy just trying to keep your head above water.