As you scale your business, your teams, and your application infrastructure, one challenge you might face is the growing divide between your development teams and your operations team. As more developers ship features, you might find them at odds with the operations team. After all, the operations team is trying to keep your applications stable, while developers are trying to introduce inherently unstable changes to those same applications. DevOps is supposed to fix this! But, as you might well know, it’s not magic. Doing DevOps correctly takes some work and some practice.
Here, I’m going to highlight a few lessons from my experience with DevOps at some of our clients.
1) Don’t get stuck on definitions; concentrate on evolving the process
If you read about DevOps online, you might notice that there is still no real agreement on what DevOps actually is. Some say it’s a methodology, some say it’s a culture, others say it’s a job title! It’s very easy to get stuck thinking that you need to know the precise definition of DevOps before you start to implement it in your teams. Instead, I would suggest that you only need to understand why you need DevOps, most likely to address the friction between development and operations described in the introduction. Then you can think about solutions that will work for your team.
You might find that embedding ops people into each of your delivery teams is a good solution. Or maybe upskilling existing members of the team with DevOps skills is more suitable. Applying development philosophies to cloud operations works pretty well when you give your team the right tools, so listen to your team and evolve the process based on the feedback they are giving you. Just as there is little agreement on a single definition of DevOps, there is no one-size-fits-all way to implement it. Let your team define DevOps for itself!
2) Use automation to improve the wellbeing of your team
One common misconception about DevOps is that the goal is to automate everything. But, while that might sound nice, it’s probably not realistic for most teams. So, if we can’t automate everything, how do we choose what to automate? A few years ago, Google coined the term “toil” to define tasks that provide little long-term value, require little human decision-making, are automatable, and are required to maintain a healthy system. Google’s own writing on “toil” is worth reading if you want to understand it more specifically, how you might measure it, and why it’s bad to have too much of it.
Identifying “toil” in our work can help us uncover the best opportunities for automation. These are the tasks that don’t change much from execution to execution, are executed frequently, can create confusion for new team members, and can cause career stagnation when your team is required to spend too much time on them. Perhaps onboarding new customers requires that a member of the dev team logs in and creates a new user manually each time. That’s probably “toil”. If your team is SSHing into a production instance every week to fix some bogus data or to touch a configuration file: again, “toil”. If someone needs to deploy code or infrastructure manually once or twice a week because of a flaky build box, that’s probably “toil” too. Focus your tactical automation efforts here to improve your team’s morale and happiness and to make the longer-term strategic goals easier to achieve.
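One rough way to put this into practice is to keep a simple log of recurring manual tasks and rank them by how much time they consume. The sketch below is only an illustration under assumptions of my own: the task names, the fields, and the “minutes per month” measure are all hypothetical, and your team may weigh things like risk or frustration quite differently.

```python
from dataclasses import dataclass


@dataclass
class ManualTask:
    """A recurring manual task, as a team might record it (hypothetical fields)."""
    name: str
    runs_per_month: int
    minutes_per_run: int
    needs_human_judgment: bool


def monthly_toil_minutes(task: ManualTask) -> int:
    """Estimate minutes per month spent on a task that could be automated.

    Tasks that genuinely need human judgment are not counted as toil here.
    """
    if task.needs_human_judgment:
        return 0
    return task.runs_per_month * task.minutes_per_run


def rank_automation_candidates(tasks: list[ManualTask]) -> list[ManualTask]:
    """Return the toil tasks, most time-consuming first."""
    candidates = [t for t in tasks if monthly_toil_minutes(t) > 0]
    return sorted(candidates, key=monthly_toil_minutes, reverse=True)


# Made-up numbers, loosely based on the examples in the text.
tasks = [
    ManualTask("create customer user by hand", 20, 15, False),
    ManualTask("SSH in to fix bad data", 4, 45, False),
    ManualTask("manual deploy after flaky build", 8, 30, False),
    ManualTask("review architecture proposal", 2, 120, True),
]

for task in rank_automation_candidates(tasks):
    print(f"{task.name}: {monthly_toil_minutes(task)} min/month")
```

Time spent is a crude proxy, but even a list like this can start the conversation about which task to automate first.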
3) Create meaningful dashboards and alerts
Build alerts that actually create meaning for everyone, not just technical members of the team. It’s pretty easy to set up a few graphs of CPU/Memory/Disk usage and request throughput and call it a day, but what does that really tell us? What happens when CPU usage hits 65%? 95%? In today’s world of cloud-based infrastructure and out-of-the-box autoscaling, these stats might be meaningless, but we still seem to love filling our dashboards with them.
Instead, what if we created a dashboard that tells us how many customers are active on the site, how many items are being added to carts, and how many purchases are being completed? These metrics are much more useful because they describe critical customer-focused actions taking place. And even better, if something looks strange with them, non-technical team members can identify it!
The alerts that come out of this are much better too. Instead of being woken up at 2 am because disk usage is at 80%, we could be paged only when customers actually have a problem adding items to their cart. There’s a good chance we would have noticed the disk space problem before it was going to hurt us anyway. But if it’s actually causing customers problems, then we want to know immediately.
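As a minimal sketch of what a customer-focused alert might look like, the function below pages someone only when cart additions drop well below their usual level. Everything here is an assumption for illustration: the function name, the per-interval counts, and the 50% threshold are hypothetical, and how you collect these numbers depends entirely on your own monitoring setup.

```python
from statistics import mean


def should_page(recent_counts, baseline_counts, drop_ratio=0.5):
    """Return True when recent cart additions fall well below the baseline.

    recent_counts:   cart additions per interval over the last few intervals
    baseline_counts: cart additions per interval during a comparable normal period
    drop_ratio:      page when the recent average is below this fraction of baseline
    """
    if not recent_counts or not baseline_counts:
        return False  # not enough data to judge, so stay quiet
    baseline = mean(baseline_counts)
    if baseline == 0:
        return False  # no normal traffic to compare against
    return mean(recent_counts) < baseline * drop_ratio


# Made-up numbers: a normal evening versus a sudden drop.
normal_evening = [120, 130, 110, 125]
print(should_page([115, 122, 118], normal_evening))
print(should_page([40, 35, 30], normal_evening))
```

The first call stays quiet because customers are behaving normally; the second would page, because something is stopping customers from adding items to their carts, whatever the underlying cause turns out to be.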
In conclusion, I hope these tips will help you think of ways to evolve your DevOps processes, automate with wellbeing in mind, and create dashboards and alerts that align with the most meaningful functions of your applications and systems. If you are ever in need of some guidance in navigating the sometimes confusing world of DevOps, please let us know!