TL;DR: Scaling teams is hard. A platform team done right can ease the hardship.
At Conde Nast International we grew from a team of 20 engineers to just under 100 in less than a year. We found that building a system that will be used in many markets has a lot of moving parts and a lot of repetition: rebuilding the infrastructure and application configuration, adding third-party add-on software, building the application using CDN redirects, and handling DNS registration and configuration.
Many AWS accounts were used by many teams. Tracking usage and monitoring was a mess. Every developer needed to think about where and how to run their system, and they usually did so independently of each other. They had to think about monitoring and logging, as well as miscellaneous subsystems for queuing, deploying, and routing traffic.
There were a lot of things we could have done to make the process much easier and smoother. This is the primary reason we decided to build an infrastructure team and, later on, a platform team.
The Platform team
The platform team is not part of the product teams but instead acts as an engineering efficiency team. This means that the platform team’s main clientele is the product teams. That said, product teams also need to learn about the platform in general, then raise issues and give feedback for Continuous Improvement (the new CI). This should not mean that the platform team is isolated from the rest of the organization. Rather, it is a vital player in the success of the organization.
Infrastructure management is one of the responsibilities of the platform team. That includes ensuring best practices and a deep understanding of infrastructure, in the cloud or on-site, for example by making sure that the infrastructure is auditable. This can be implemented in many ways, but the most common is infrastructure as code (IAC).
IAC is enabled by infrastructure as a service (IaaS). The platform team builds IAC using open-source tools. This means that the platform being built is an abstraction over those tools. The tools are loosely connected, and the integration of these tools is the platform. Think of it as a platform as a service (PaaS), but closer to the business use cases.
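To make the idea of infrastructure as code concrete, here is a minimal sketch in Python. It is not the tooling we used; the `Resource` model, the `plan` function, and the resource names are all hypothetical. It only illustrates the core idea: the desired infrastructure is declared as data, and a reviewable list of changes is computed against what currently exists, which is what makes the infrastructure auditable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """One declared piece of infrastructure (hypothetical model)."""
    name: str
    kind: str        # e.g. "dns_record" or "cdn_redirect"
    settings: tuple  # sorted (key, value) pairs, so resources compare by value


def resource(name, kind, **settings):
    return Resource(name, kind, tuple(sorted(settings.items())))


def plan(desired, actual):
    """Return the (action, name) changes needed to move `actual` to `desired`."""
    desired_by_name = {r.name: r for r in desired}
    actual_by_name = {r.name: r for r in actual}
    changes = []
    for name, res in sorted(desired_by_name.items()):
        if name not in actual_by_name:
            changes.append(("create", name))
        elif actual_by_name[name] != res:
            changes.append(("update", name))
    for name in sorted(actual_by_name):
        if name not in desired_by_name:
            changes.append(("delete", name))
    return changes


# Desired state lives in version control; actual state is what is running.
desired = [
    resource("market-uk-dns", "dns_record", target="cdn.example.com"),
    resource("market-uk-redirect", "cdn_redirect", path="/old", to="/new"),
]
actual = [
    resource("market-uk-dns", "dns_record", target="legacy.example.com"),
    resource("market-fr-dns", "dns_record", target="cdn.example.com"),
]

changes = plan(desired, actual)
for action, name in changes:
    print(action, name)
```

Because the plan is computed from declared data, the same declaration can be reviewed, versioned, and reapplied for every new market instead of being rebuilt by hand.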
We knew why we were building the platform team. Now we had to lay the foundation on which the platform team would be built. Unlike product teams, which usually have visible goals and mandates, platform teams have more non-functional requirements, and we had to define these in depth.
Here is my personal take on how to build a successful platform team.
People are prone to errors. Automation within the platform allows us to be more confident when executing a piece of code. It allows us to isolate bugs and errors within the code and then do continuous deployment.
Automated tests are important: whatever is not tested is not yet fully implemented. Many kinds of tests are needed, depending on the kind of software, for example unit, integration, end-to-end, fuzz, and penetration testing.
Security is paramount, so fuzzing and automated security testing should be a priority. This helps prevent SQL injection, attacks that exploit CORS misconfiguration, and other attacks. Having this in place lessens the attack surface.
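As one small illustration of automated security testing, the sketch below uses Python's standard `sqlite3` module. The `articles` table, the `find_articles` helper, and the sample data are assumptions made up for the example. It shows a parameterized query resisting a classic injection payload, followed by a very small fuzz loop that throws random strings at the same function.

```python
import random
import sqlite3
import string


def find_articles(conn, market):
    # The ? placeholder makes sqlite3 treat `market` as data, never as SQL.
    rows = conn.execute(
        "SELECT title FROM articles WHERE market = ? ORDER BY title", (market,)
    )
    return [title for (title,) in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, market TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [("Autumn trends", "uk"), ("Paris shows", "fr")],
)

# A classic injection payload: with string concatenation this would match
# every row; with a bound parameter it matches none.
malicious = "uk' OR '1'='1"
legit_result = find_articles(conn, "uk")
attack_result = find_articles(conn, malicious)

# Tiny fuzz loop: random 12-character inputs must never leak any rows.
rng = random.Random(0)
fuzz_leaks = 0
for _ in range(200):
    payload = "".join(rng.choice(string.printable) for _ in range(12))
    if find_articles(conn, payload):
        fuzz_leaks += 1

print(legit_result, attack_result, fuzz_leaks)
```

A real setup would more likely rely on a dedicated fuzzing or scanning tool running in the pipeline; the point here is only that such checks can run automatically on every change.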
Use the principle of least privilege whenever giving access. At the same time, balance this with ease of entry. A platform that makes a developer request access every 5 seconds is bad for interpersonal relations. A platform team should be enablers, not barriers. This means going to great lengths to build relationships and enable efficiency in the team.
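One way to balance least privilege with ease of entry is to deny by default but allow time-boxed grants, so a single approval covers a whole working window. The sketch below is only an illustration under that assumption; the `AccessPolicy` class, the role names, and the action strings are hypothetical, and time is passed in as plain numbers to keep the example deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class AccessPolicy:
    """Deny-by-default access check with time-boxed extra grants (hypothetical)."""
    role_permissions: dict                      # role -> set of allowed actions
    grants: dict = field(default_factory=dict)  # (user, action) -> expiry time

    def grant_temporarily(self, user, action, now, ttl_seconds):
        # One approval covers a whole window, so the developer is not
        # asked to request access again a few seconds later.
        self.grants[(user, action)] = now + ttl_seconds

    def is_allowed(self, user, role, action, now):
        if action in self.role_permissions.get(role, set()):
            return True
        expiry = self.grants.get((user, action))
        return expiry is not None and now < expiry


policy = AccessPolicy({"developer": {"deploy:staging", "logs:read"}})

baseline = policy.is_allowed("ana", "developer", "deploy:staging", now=0)
denied = policy.is_allowed("ana", "developer", "deploy:production", now=0)
policy.grant_temporarily("ana", "deploy:production", now=0, ttl_seconds=3600)
during = policy.is_allowed("ana", "developer", "deploy:production", now=1800)
after = policy.is_allowed("ana", "developer", "deploy:production", now=3600)
print(baseline, denied, during, after)
```

The role grants stay minimal, and anything beyond them expires on its own, so nobody has to remember to revoke it.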
Everything that has to be done twice should be automated. Keep to the DRY principle as much as possible.
The platform should be automated to remove cognitive overhead. Automation also helps us to be more stable as a platform. It is not an alternative to documentation and post-mortems but rather a result of them.
A big part of automation is the deployment strategy and measuring deployments using metrics, and finally plotting those metrics against customer adoption.
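As a rough sketch of what measuring deployments could look like, the Python below computes two commonly used figures, deployment frequency and change failure rate, from a list of deployment records. The choice of these two metrics, the `Deployment` record, and the sample history are assumptions for illustration, not the metrics we actually tracked.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Deployment:
    day: date
    succeeded: bool


def deployment_metrics(deployments, period_days):
    """Summarise a list of deployments over a period of `period_days` days."""
    total = len(deployments)
    failures = sum(1 for d in deployments if not d.succeeded)
    return {
        "deployments_per_day": total / period_days,
        "change_failure_rate": failures / total if total else 0.0,
    }


# Hypothetical history: four deployments over an eight-day period.
history = [
    Deployment(date(2024, 1, 1), True),
    Deployment(date(2024, 1, 3), True),
    Deployment(date(2024, 1, 4), False),
    Deployment(date(2024, 1, 8), True),
]

metrics = deployment_metrics(history, period_days=8)
print(metrics)
```

The resulting numbers per period are what would then be plotted against customer adoption figures.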
Use smart deployments and understand when they apply. Examples include blue-green deployments, A/B testing, automatic rollbacks, and zero-knowledge rollbacks.
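To show how a blue-green deployment and an automatic rollback fit together, here is a minimal Python sketch. The `Router` class and the `deploy` and `health_check` callables are hypothetical stand-ins for a real load balancer, release step, and health probe. The new version goes to the idle environment, traffic is switched only if it is healthy, and traffic is switched back automatically if it turns unhealthy after the switch.

```python
class Router:
    """Hypothetical traffic router that points at one environment at a time."""

    def __init__(self, live):
        self.live = live


def blue_green_deploy(router, idle, deploy, health_check):
    """Deploy to `idle`, switch traffic if healthy, roll back if it degrades."""
    previous = router.live
    deploy(idle)                  # release to the environment with no traffic
    if not health_check(idle):
        return router.live        # never switched, live traffic is untouched
    router.live = idle            # switch traffic to the new version
    if not health_check(idle):    # verify again under live traffic
        router.live = previous    # automatic rollback
    return router.live


deployed = []

# Case 1: the new version stays healthy, so green becomes live.
router = Router(live="blue")
result_ok = blue_green_deploy(router, "green", deployed.append, lambda env: True)

# Case 2: healthy before the switch, unhealthy after it, so traffic
# returns to blue without anyone having to intervene.
checks = iter([True, False])
router2 = Router(live="blue")
result_rollback = blue_green_deploy(
    router2, "green", deployed.append, lambda env: next(checks)
)

print(result_ok, result_rollback)
```

The old environment is left running until the new one has proven itself, which is what makes the rollback a simple pointer change rather than a redeploy.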
Building a highly efficient platform is important because it allows us to move faster. Efficiency comes from fixing bugs fast and building features on a necessity basis. Reusing code and creating reference implementations is key. This will help the wider business get a shorter lead time to market as well as a competitive advantage. Make sure to document any known unknowns and edge cases, as well as common problems and escalation paths.
Efficiency in the platform also means failing fast and fixing the failure. The platform should be as transparent as possible when showing errors. Clear errors lead to faster debugging and deployments. Efficiency lies in iterating on small features rather than shipping one big deployment.
Having an escalation system for the knowledge base is not a hindrance. Instead, it is a place to start whenever you feel lost. Combined with good relationships, it yields productive results and more efficient cooperation. It helps teams share knowledge. They gain experience with each other, and it is a good way to build a highly efficient team.
Sufficient and continuous documentation is important, and developers need training. The overhead of training new developers should be taken into account. Each new technology we adopt has an overhead, so we need to consider carefully whether the value of adoption is worth it. Interactive training labs and a developer portal are useful: a place where we can discover MVPs and reference implementations. All this will help us achieve self-sufficiency.
All new engineers should build something using the platform in their first month. This can be part of the initial orientation of new hires. It will also let us uncover issues with the self-service nature of the platform. Retraining is also needed for each new part of the platform. DIY discovery within the platform is encouraged. Reinventing the wheel and using shadow IT are actively discouraged. Maintaining many implementations of the same thing is wasteful and unneeded.
Monitoring, metrics, alerting, and tracing are powerful tools. SRE can initially be part of the platform function, embedded within the core platform team. This will help SRE understand the underlying implementation of the platform.
The most important part of the platform is that it is built for developers. Strive to balance building out best practices with fostering interpersonal communication. A self-service platform means developers will need the know-how and an understanding of the value of having a platform. This means that developers will sometimes have frustrations. Feedback should be taken into account while iterating on platform development. There should be a way to give feedback to the platform developers on how the platform is doing in general. Without this, the platform lives in isolation from the rest of the company, and adoption will be strenuous at best. People want to use and adopt something they feel good about using. After all, software development is a people-centric type of project. Communication, interaction, and motivation are important parts of development. We have to perfect these together with the business requirements and deadlines. A nonexistent perfect platform is of no use to anybody. A semi-functional and unsecured platform is a curse to any company.
Finally, there will always be things that are outside the platform's scope. These should always be decided on a case-by-case basis. People still need them at the end of the day, so you will need to redirect the request to another team, or possibly escalate it.