The rollercoaster of 2020 highlighted the competitive advantage that well-oiled software delivery teams provide. The minute COVID-19 hit and everyone had to become not just remote-first, but remote-only, many engineering teams were forced to reckon with the number of manual processes they had in place. Suddenly, they could no longer rely on the fact that there was a build machine under someone’s desk, and if that machine had a problem, they could just reboot it. Suddenly, they needed to automate everything.
This idea of automation, of being able to move quickly and reliably, is no longer ‘nice to have,’ but core to the responsibilities of today’s software delivery teams. I have the good fortune of not only leading a developer-first company focused on delivering value to engineers, but also seeing how thousands of the world’s best teams move their code from idea to delivery. While I have a lot of strong opinions on how to best bring automation to engineering teams, it’s clear that the most advanced, powerful automation tools need one crucial element to succeed: people.
For businesses, the first phase of the pandemic was about systems: updating and building your tech stack so you could operate as fully remote. I believe the next phase is going to be about teams and people.
Lessons Learned: Building Resilient Software Teams
Having a team big enough to handle the day-to-day, while also being able to innovate, is crucial. This is especially true in a world where user expectations and demands have dramatically increased. You need enough individual contributors to handle the continually increasing overhead of maintenance and escalation. If your team is too small, the pace of innovation goes down.
Keeping pressure in the hose — the feeling of forward momentum — is critical, for both your team and your customers. You need to maintain a steady heartbeat of innovation. By continuously addressing asks of the community, you’re providing a consistent rhythm and confidence that progress is being made.
The breakdown of a resilient software delivery team should look like this:
- 50% of what you do is targeted at user-focused features — you’re working on something that will improve the lives of your users.
- 25% is focused on technical investment — what do you need to do to maintain your system? Where is it going to tip over?
- 25% is focused on escalations and defects — is the system not behaving how it’s supposed to? Did the system break, preventing users from building?
As your system gets bigger, the amount of time dedicated to maintaining it naturally goes up. You have to increase the size of your team to maintain the ratio of user-focused work. Eventually, the team gets too large, and that’s when you split it into smaller groups. Think of Amazon’s Two Pizza Rule.
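To make the arithmetic concrete, here is a minimal sketch of how the ratio drives team size. It assumes the non-feature work (technical investment plus escalations and defects) can be estimated in engineer-equivalents; the function name `engineers_needed` and the sample numbers are hypothetical and only illustrate the 50/25/25 split described above.

```python
import math

# From the breakdown above: half of the team's capacity goes to user-focused features.
FEATURE_SHARE = 0.50


def engineers_needed(overhead_engineers: float, feature_share: float = FEATURE_SHARE) -> int:
    """Smallest whole team size that keeps `feature_share` of capacity on features.

    `overhead_engineers` is a hypothetical estimate of the engineer-equivalents
    consumed by technical investment plus escalations and defects.
    """
    if overhead_engineers < 0:
        raise ValueError("overhead_engineers must be non-negative")
    if not 0 <= feature_share < 1:
        raise ValueError("feature_share must be in [0, 1)")
    # Overhead may use at most (1 - feature_share) of the team's total capacity.
    return math.ceil(overhead_engineers / (1 - feature_share))


# As maintenance and escalation load grows, the required team size grows with it.
for overhead in (3, 4.5, 5.5):
    print(overhead, "->", engineers_needed(overhead))
```

Under these assumptions, every additional engineer-equivalent of maintenance or escalation work calls for two engineers of total capacity if feature work is to stay at half of what the team does.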
While the ideal team size depends on experience, the total scope of responsibilities, on-call burden, and more, somewhere between 5 and 20 code contributors is the right place to aim. Of course, there’s an upper limit where the cost of communication and coordination gets too high, but you always want enough developers to both maintain the service and continue to innovate on it.
Another reason I think bigger teams are better, especially right now, is that 2020 has shown us how life gets in the way of our best-laid plans. If your teams are too small and somebody goes on extended leave, has an emergency, or simply needs a break, you won’t have enough kindling to keep the fire going.
You want to get your team to a place where you have enough hands on deck that you can absorb the shocks of life.
What about distributed teams?
If building resilient teams is hard overall, then building resilient distributed teams is even harder. Many of the challenges that all teams face are exacerbated when we’re spread across locations and time zones. The most effective ways of building and leading teams just can’t be easily applied. That means as leaders, we often need to get much more creative.
In my experience, successful leaders focus on building structures and supporting connection, communication, and collaboration.
As humans, we strive to be connected to a larger purpose but we also want to feel connected with the people around us. When we mostly interact with people in the shape of pixels in video calls or icons in chat apps, it’s easy to forget there are humans at the other end of the screen. First and foremost, we need to be curious about who our teammates are as people and know what drives them.
As leaders, it’s also our responsibility to make sure that expectations are always clear to everyone on our teams. This is especially true when your organization is in high-growth mode. A few years ago, the engineering team at CircleCI had doubled year over year and become more globally distributed. The management team, by contrast, was incredibly small. After all this growth, we were running into challenges around evolving our engineering culture. Our knowledge was very siloed and we needed to find a path forward that would allow us to scale the business.
So we created an engineering competency matrix, which is woven into everything we do. From hiring, to structured feedback, to performance reviews, it helps us hold everyone to the same standards and clarify expectations as we scale.
Collaboration is crucial as well. With distributed teams, it’s easy to accidentally develop “subteams” split along time zones. One way to avoid this is to encourage work across time zones through so-called “ping-pong” pairing. Ping-pong pairing is similar in principle to traditional pair programming, but asynchronous.
To facilitate ping-ponging we found it useful to:
- Limit our work in progress to 3 cards. This encourages us to look for tickets already in progress before starting anything new.
- Post work-in-progress Pull Requests (PRs) as soon as possible. It should be natural that, unless someone is actively working on a given task at the moment, anyone should be able to read through the PR (and other relevant communication) and continue working on it.
- Hand off work in progress, either asynchronously on Slack or in video calls when people’s working hours happen to overlap, especially at the day’s edges as one person’s day is ending and another’s is just getting started.
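The first rule in the list above can be sketched as a small selection function. This is only an illustration, not a description of any real tracker: the `Card` type, its fields, and `next_card` are hypothetical names, and the only value taken from the text is the limit of 3 cards.

```python
from dataclasses import dataclass
from typing import List, Optional

WIP_LIMIT = 3  # from the text: at most 3 cards in progress


@dataclass
class Card:
    title: str
    in_progress: bool = False
    # Who is working on the card right now, if anyone (hypothetical field).
    active_owner: Optional[str] = None


def next_card(board: List[Card]) -> Optional[Card]:
    """Pick the next card, preferring handed-off work over new work."""
    in_progress = [card for card in board if card.in_progress]
    # First look for a started card that nobody is actively working on.
    for card in in_progress:
        if card.active_owner is None:
            return card
    # Only start something new while the team is under the WIP limit.
    if len(in_progress) < WIP_LIMIT:
        for card in board:
            if not card.in_progress:
                return card
    # At the limit with every card actively owned: pair up instead of starting more.
    return None
```

With a board where one started card has no active owner, `next_card` returns that card even if untouched cards are available, which mirrors the habit of picking up a teammate's handoff first.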
At the end of the day, resiliency in your platform or service, and in your technology, certainly relies on good CI/CD practices, cloud-based monitoring, testing, and other DevOps processes. But the human component is essential.
When you start to see performance lagging, that’s fatigue. It’s up to us as business leaders to create resilient systems and teams, right now, and for the long haul. One of the most important components of creating an efficient software delivery team is prioritizing the people who run it.