~/blog/building-platforms-that-developers-actually-use

The 7 Books That Taught Me About Building Platforms That Developers Actually Use

23 min read

Introduction


Building an internal platform isn't just a technology problem; it's a trust and adoption problem. Over the last year I've been building a multi-tenant Kubernetes platform at Sky, spanning observability, secrets management, CI/CD, security, and Developer Experience (DevEx). Before that, I built a cloud-native platform for Generative AI video and image processing at Syntonym, a startup I helped co-found. Across both, the books I've read have quietly shaped my decisions — from how I design interfaces and guardrails to how I measure success: onboarding time, time-to-first-deploy, MTTR (mean time to restore), and golden-path adoption.

Most recently, Platform Engineering by Camille Fournier and Ian Nowland gave language and structure to practices I'd been applying intuitively.

This article distils seven foundational software books into practical lessons for platform engineering, treating the platform not just as infrastructure but as a product and an experience with measurable ROI (return on investment), and as a long-term investment in how teams build and ship with confidence.

1. The Pragmatic Programmer by Andrew Hunt and David Thomas


→ Lesson: Prioritize Pragmatism Over Perfection; simplify the golden path by automating the boring stuff

It's easy to chase shiny tools and patterns. This book helps keep you anchored: do what works, ship value fast, and automate anything repetitive so developers don't have to think about it. Dave and Andy emphasize the importance of practical solutions over theoretical perfection, reminding us to focus on what works best for our teams and customers.

What makes a pragmatic programmer? An early adopter, inquisitive, a critical thinker, realistic, a jack of all trades (don't let legacy systems fool you; you will always be able to move on to new areas and new challenges).

Their advice on writing maintainable code and avoiding over-engineering has been invaluable. It's not about using the latest tools; it's about using the right tools effectively. This mindset has guided many approaches to building a robust, user-friendly platform that developers actually want to use.

Embracing accountability for the decisions and errors made during programming is a practical strategy.

Provide Options, Don't Make Lame Excuses.

  • Dave and Andy.

When faced with a challenge, it's easy to fall back on excuses or blame external factors. Instead, focus on providing solutions and options. Be inquisitive about decisions made by the team, previous team members, or vendors; understanding the reasoning behind them will help you make better decisions. This mindset shift not only empowers you but also fosters a culture of ownership among those using the platform.

DRY - Don't Repeat Yourself? DRY is about more than code; it is about the duplication of knowledge and intent, about avoiding expressing the same thing in two different places.

Automate the boring stuff, and make it easy for developers to use the platform without getting bogged down in repetitive tasks. Keep the 'how-to' guides up to date as the platform evolves, and ensure they convey the same meaning across teams. Terragrunt is a great orchestration tool for managing infrastructure as code (IaC) at scale, and it helps keep things DRY. We migrated our Terraform configuration to Terragrunt, which has simplified our workload and environment management significantly.

Free the data, let it flow as a river. If your background is object-oriented programming, then your reflexes demand that you hide the data, encapsulating it inside objects.

When building your own operators, ensure they expose the appropriate events and metrics—from development through to production—to empower users to make informed decisions.
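To make that concrete, here is a rough Go sketch (not our actual operator code) of how a controller-runtime based reconciler might surface both a Kubernetes Event and a custom Prometheus metric. The metric name, the "Reconciled" reason string, and the idea of reconciling a tenant namespace are illustrative assumptions, and the manager wiring is omitted.

```go
package controllers

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// reconcileTotal is a custom metric exposed on the operator's /metrics
// endpoint, scraped like any other target.
var reconcileTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "tenant_reconcile_total", // hypothetical metric name
		Help: "Reconciliations per tenant namespace and result.",
	},
	[]string{"namespace", "result"},
)

func init() {
	// Register with controller-runtime's registry so the metric is served
	// alongside the built-in workqueue and client metrics.
	metrics.Registry.MustRegister(reconcileTotal)
}

type TenantReconciler struct {
	client.Client
	Recorder record.EventRecorder // injected via mgr.GetEventRecorderFor("tenant-operator")
}

func (r *TenantReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var ns corev1.Namespace
	if err := r.Get(ctx, req.NamespacedName, &ns); err != nil {
		reconcileTotal.WithLabelValues(req.Name, "error").Inc()
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... reconcile tenant resources here ...

	// Emit a Kubernetes Event so users can `kubectl describe` their way to answers.
	r.Recorder.Event(&ns, corev1.EventTypeNormal, "Reconciled", "tenant namespace is up to date")
	reconcileTotal.WithLabelValues(req.Name, "success").Inc()
	return ctrl.Result{}, nil
}
```

Events give users a self-service trail in the cluster itself, while the metric feeds dashboards and alerts from development through to production.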

Work with a User to Think Like a User.

  • Dave and Andy.

Each time a team migrates their workload to our platform, I've found that being with them during the process helps uncover their needs and pain points. Not just relying on surveys or interviews, but actively participating in their workflows, has provided invaluable insights. Joining their planning sessions, or even their standups once a week, has shown me the platform we provide from their perspective, by being in their shoes.

Dave and Andy put the moral compass clearly: "First, do no harm". Ask yourself: would I be happy to be a user of this platform?

It's your Life. Share it. Celebrate it. Build it. AND HAVE FUN!

  • Dave and Andy.

For example, automating the "boring stuff", which I'll explain in a bit, has been a game-changer. I keep the "fun" when researching and implementing new features, not just for the developers who use the platform, but also for myself and my team.

Are you onboarding a new team to a secrets manager like Vault, or does a team need multi-cloud access, such as workloads on GKE accessing AWS S3? Automate service account creation with identity federation, or automate secrets management with a simple CLI tool that abstracts away the complexity of Vault service account creation and secret handling.

I find real joy in building terminal applications with Bubble Tea; it allows me to create functional, stateful terminal apps with ease.
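As an illustration of why this is fun, here is a minimal Bubble Tea program in the spirit of such an onboarding CLI. The step names are hypothetical; a real tool would call Vault, GCP, and AWS behind the scenes rather than just toggling checkboxes.

```go
package main

import (
	"fmt"
	"os"

	tea "github.com/charmbracelet/bubbletea"
)

// model holds the onboarding steps a tenant can trigger; the steps are illustrative.
type model struct {
	steps    []string
	cursor   int
	selected map[int]bool
}

func initialModel() model {
	return model{
		steps:    []string{"Create Vault role", "Federate GKE -> AWS IAM", "Bootstrap CI/CD runway"},
		selected: map[int]bool{},
	}
}

func (m model) Init() tea.Cmd { return nil }

func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
	if key, ok := msg.(tea.KeyMsg); ok {
		switch key.String() {
		case "q", "ctrl+c":
			return m, tea.Quit
		case "up", "k":
			if m.cursor > 0 {
				m.cursor--
			}
		case "down", "j":
			if m.cursor < len(m.steps)-1 {
				m.cursor++
			}
		case "enter", " ":
			m.selected[m.cursor] = !m.selected[m.cursor] // toggle the step under the cursor
		}
	}
	return m, nil
}

func (m model) View() string {
	s := "Select onboarding steps (q to quit):\n\n"
	for i, step := range m.steps {
		cursor, checked := " ", " "
		if m.cursor == i {
			cursor = ">"
		}
		if m.selected[i] {
			checked = "x"
		}
		s += fmt.Sprintf("%s [%s] %s\n", cursor, checked, step)
	}
	return s
}

func main() {
	if _, err := tea.NewProgram(initialModel()).Run(); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
}
```

The whole point is that a stateful TUI like this turns a multi-page runbook into a guided, low-friction flow.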

2. Patterns of Enterprise Application Architecture by Martin Fowler


→ Lesson: Build Clear Interfaces & Consistent Contracts

Real challenges in platform engineering often stem from complex interactions between systems. Over the past years, Fowler's book has been one of my go-to resources for understanding architectural patterns that can help simplify these interactions. Even when building start-ups from the ground up, where enterprise patterns sound like overkill, they actually help lay a solid foundation for our systems.

We all want agility, speed, and fast go-to-market, sure, but we also want to build systems that are maintainable, scalable, easy to understand, and extensible as customer demands grow.

Platforms are often kept as a black box for valid reasons, but it's important to build clear interfaces and consistent contracts between components. This is where Fowler's patterns come into play. Most recently, we migrated away from a legacy, Git-managed Terraform configuration that teams used to expose services; it siloed work behind queued team requests and turned remote state into a nightmare. To move away from that nightmare, we abstracted the complexity behind a Kong Gateway Operator, enabling tenants to declaratively manage API exposure and Kong plugins through simple YAML manifests. The more we move these YAML manifests out of team spaces and into configuration managed by Argo CD, the more the platform shines.

Each time we applied these concepts, it was an opportunity to build trust between the platform and its customers: more transparency and less friction, allowing teams to focus on building features rather than wrestling with infrastructure or Terraform state. Simple, backward-compatible interfaces and documentation increased the platform's consistency and cut onboarding complexity by more than half.

Yes, I must admit the DDD patterns are not always applicable, but the principles of building clear interfaces and consistent contracts are universal.

3. Software Architecture: The Hard Parts by Neal Ford & Mark Richards


→ Lesson: Make Trade-Offs Visible and Intentional

Ford and Richards remind us that software architecture is about decisions that are hard to reverse. One tongue-in-cheek definition they cite is:

Software architecture is the stuff that is hard to change later.

This definition is not entirely accurate, but it does point to the fact that software architecture is about making decisions that are difficult to reverse. The decisions made during the architecture phase of a project can have a significant impact on the project's success or failure.

3.1 Why this matters for platform engineering

In platform work, architectural decisions ripple across dozens of teams and projects. Trade-offs affect reliability, scalability, and developer experience. The authors' advice is simple but powerful: make those trade-offs explicit, and involve stakeholders in the reasoning. Transparency builds trust, even when the decision isn't perfect.

In practice, I've found this works best by:

  • Writing RFCs to capture the “why” behind decisions.

  • Involving tenant teams early when addressing key workflow changes. Their participation not only fosters a culture of collaboration but also steadily increases the platform's adoption rate.

  • Shielding them from the "boring but essential" tasks: for example, applying security patches, performing cluster upgrades, enforcing network policies, and managing continuous deployment.

This balance prevents the “not invented here” syndrome while keeping decision making grounded in real user impact.

3.1.2 Real Life Scenario: Implementing Comprehensive Network Policies with Cilium

When enforcing L4 and L7 network policies across multi-tenant GKE clusters, the trade-offs are significant:

  • Security gain: Tighter workload isolation and reduced blast radius in case of compromise, plus the ability to enforce default-deny policies.

  • Operational cost: Complexity in debugging connectivity issues, training teams on new policies, and potential performance overhead.

To mitigate these, we took a phased approach:

  • Piloting Cilium in a single non-critical environment to measure impact and gather scenario-based results.

  • Using Cilium Hubble for observability, so teams could see traffic flows and policy effects, packaged in a way that is easy to understand.

  • Gradually introducing default-deny policies, paired with clear YAML templates for common allow-list scenarios.

  • Involving early-adopter tenant teams and collecting feedback on friction points before the cluster-wide rollout.

The key trade-off, balancing security against complexity, was made transparent from the start, ensuring there were no surprises when stricter policies went live.

3.2 Data as a first-class citizen

Data is a precious thing and will last longer than the systems themselves.

  • Tim Berners-Lee

Data management in platform engineering isn't about being another DBA. Instead, it's about designing for disaster recovery, ensuring data consistency and integrity, and helping teams select the right technologies to build and maintain ETL pipelines.

Another thing I've found is that platform engineers should decentralize data ownership, allowing teams and stakeholders to access the right data at the right time. I recommend reading the Data Mesh principles.

3.3 Deployment boundaries in distributed systems

Deployment boundaries in distributed systems should be visible and well understood, whether you're running a monolith, microservices, or a hybrid. Architects benefit from a firm understanding of exactly where those boundaries lie. In the modern era, most services are deployed in a way that is invisible to end-users, and often even to developers, which can lead to confusion and frustration. To make deployment boundaries visible and clear to the developers using the platform, ensure GitOps isn't just a buzzword but a practice.

When it comes to microservices, the advice is clear: if you can deploy them independently, do so. But if your microservices must deploy in a fixed sequence, Matt Stine's advice applies:

Please put them back in a monolith and save yourself some pain.

  • Matt Stine

Microservices should evolve independently, and dependents should degrade gracefully when a service changes or disappears. Use discovery, timeouts and circuit breakers, retries with backoff, bulkheads, and fallbacks, plus backward-compatible APIs and contract tests, to avoid cascading failures.
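A minimal, dependency-free Go sketch of a few of these patterns working together: per-attempt timeouts, retries with exponential backoff and jitter, and a graceful fallback. The URL, retry budget, and fallback value are placeholders; a real client would layer a circuit breaker and metrics on top.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// callWithResilience wraps a dependency call with a per-attempt timeout,
// retries with exponential backoff and jitter, and a graceful fallback.
func callWithResilience(ctx context.Context, url string) (string, error) {
	const maxAttempts = 4
	backoff := 200 * time.Millisecond

	for attempt := 1; attempt <= maxAttempts; attempt++ {
		attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second) // bound each attempt
		req, err := http.NewRequestWithContext(attemptCtx, http.MethodGet, url, nil)
		if err != nil {
			cancel()
			return "", err
		}
		resp, err := http.DefaultClient.Do(req)
		cancel()

		if err == nil && resp.StatusCode < 500 {
			resp.Body.Close()
			return resp.Status, nil
		}
		if resp != nil {
			resp.Body.Close()
		}

		// Exponential backoff with jitter before the next attempt.
		sleep := backoff + time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
			backoff *= 2
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}

	// Fallback: degrade gracefully instead of cascading the failure upstream.
	// A real client would also record a metric and log the degradation.
	return "cached-or-default-response", nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	out, err := callWithResilience(ctx, "https://service.internal/healthz")
	fmt.Println(out, err)
}
```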

3.4 Maintainability in practice

Maintainability is the ease of adding, changing, or removing features, and of performing essential operational work such as patches, upgrades, and scaling.

Alexander von Zitzewitz wrote an article about a new metric to objectively track the maintainability level of a software system. The metric directly relates to the coupling and cyclic dependencies between components. In plain terms: the more components depend on each other, the harder the system is to change safely.

c_i = \frac{\text{size}(i)\left(1 - \frac{\text{inf}(i)}{\text{numberOfComponentsInHigherLevels}(i)}\right)}{n}

where:

  • n: total number of components

  • size(i): number of components in the logical node

  • inf(i): number of components influenced by component i

The system's overall Maintainability Level (0–100%) is then calculated as:

ML = 100 \cdot \sum_{i=1}^{k} c_i

Where:

  • ML = Maintainability Level of overall system (0-100%)
  • k = Total number of logical components in the system.
  • c_i = Coupling level for any given component, with a special focus on incoming coupling levels.

This equation basically states that the higher the incoming coupling level between components, the lower the overall maintainability level of the system/codebase.

Component coupling is the degree and manner to which components know about one another. Cohesion is the degree and manner to which the operations of a component interrelate. Cyclomatic complexity is the overall level of indirection and nesting within a component.

I've also found the Technical Roadmap chart to be useful especially when working with legacy systems. It is another way of keeping the maintainability score high and prioritizing work that matters.

Technical Roadmap Chart

3.5 Competitive Advantage and Platform Capabilities

How do we convince the business to invest more time and resources in refactoring architecture? It's tempting to rely on buzzwords like agility, speed, and resilience, but the real challenge is demonstrating how architectural improvements directly support business-critical capabilities. The key is to clearly articulate the trade-offs and show how platform decisions align with organizational goals. By focusing on outcomes such as faster delivery, reduced operational risk, and improved developer productivity, you can make a compelling case for ongoing investment in platform evolution.

A simple graph to illustrate the relationship between competitive advantage and platform capabilities:

Competitive Advantage and Platform Capabilities

As Ford and Richards explain in their book, there are several modularity drivers and clear relationships among them.

I've found customer satisfaction, where both developers and end-users are happy, to be the outcome of speed-to-market built on platform capabilities.

Speed-to-market, in turn, is achieved through architectural agility: the ability to respond quickly to change.

I also think resilience is a component of architectural agility: a platform's ability to withstand and recover from failures, disruptions, errors, and attacks. I can't think of architectural agility without resilience being part of it. From CRDs to operators, from service mesh to network policies, from observability to security, all these components contribute to the platform's resilience.

The key difference between availability (proactive) and resilience (reactive) is that availability is about preventing failures, while resilience is about handling the failures that do occur.

This graph shows how platform capabilities evolve over time, with competitive advantage peaking when the platform is well-aligned with business needs. As the platform matures, it becomes a foundation for future innovation and growth, enabling teams to deliver value more effectively.

4. Designing Data-Intensive Applications by Martin Kleppmann


→ Lesson: Master Data Management and Scalability in Distributed Systems

Kleppmann's book is a deep dive into the principles of data management in distributed systems. It covers everything from data modeling to consistency and fault tolerance. Building a platform, especially in a complex, legacy corporate environment where data is often siloed and tangled in monolithic architectures with complex SOAP integrations, presents unique challenges. Supporting modern solutions alongside legacy systems without disruption requires a solid grasp of these principles, making Kleppmann's insights invaluable.

Over the months, several patterns and practices from this book have shaped my approach:

  • Assume the network is hostile, and design for failure. Reconcile the desired state with the actual state rather than assuming the network is always reliable. We use a combination of Kubernetes operators and controllers to ensure the desired state of platform components is always maintained (see the sketch after this list).
  • Plan for schema evolution. Data models will change over time; techniques like versioning and backward compatibility are essential for smooth transitions and minimal disruption.
  • Separate compute modes. Use streaming for freshness and batch for completeness; this enables efficient handling of large data volumes and real-time event processing. Leveraging GKE's newer compute classes or dedicated node pools further optimizes resource utilization for different workloads.
  • Design for near-real-time platform metrics, reliable audit trails, and tenant-isolated telemetry. While challenging, collecting and analyzing metrics across platform components is highly rewarding and critical for operational excellence.
  • Adopt event-driven autoscaling. KEDA (Kubernetes Event-driven Autoscaling) is a great tool for scaling workloads based on events, not just CPU or memory. It makes it possible to handle spikes in cronjob-heavy workloads, as well as other event-driven workloads, efficiently.
  • Architect for scale and resilience. Storing logs for tens of thousands of applications in a single cluster requires fan-out architectures, partitioning, sharding, and replication strategies. Kleppmann explores these topics in depth, providing practical guidance on how to design systems that can handle large-scale data processing.
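The first bullet is easier to see in code. Below is a deliberately simplified, dependency-free sketch of a level-triggered reconcile loop; in reality this is a Kubernetes operator built with controller-runtime, and the tenant names, replica counts, and one-second tick here are made up for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// State is whatever "shape" of the world the controller cares about; here it
// is just a count of replicas per tenant, purely for illustration.
type State map[string]int

func desiredState() State { return State{"tenant-a": 3, "tenant-b": 2} }

// observeActual would normally query the API server or a cloud API; both can
// fail, time out, or return stale data, so the loop must tolerate errors.
func observeActual(ctx context.Context) (State, error) {
	return State{"tenant-a": 2, "tenant-b": 2}, nil
}

func apply(ctx context.Context, tenant string, want int) error {
	fmt.Printf("scaling %s to %d replicas\n", tenant, want)
	return nil
}

// reconcile is level-triggered: it compares desired vs actual on every pass
// and converges, rather than reacting to (possibly lost) individual events.
func reconcile(ctx context.Context) {
	desired := desiredState()
	actual, err := observeActual(ctx)
	if err != nil {
		fmt.Println("observe failed, will retry next tick:", err)
		return // assume the network is hostile: never trust a single read
	}
	for tenant, want := range desired {
		if actual[tenant] != want {
			if err := apply(ctx, tenant, want); err != nil {
				fmt.Println("apply failed, will retry next tick:", err)
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			reconcile(ctx)
		}
	}
}
```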

5. Refactoring by Martin Fowler


→ Lesson: Start Small, Improve Continuously

Refactoring: a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

  • Martin Fowler.

Maintaining good regression tests is the key to refactoring safely.

The critical part of the definition is that external behavior must remain unchanged; refactoring is not the time to add features. It is a day-to-day activity of taking small, low-risk steps, not a one-time event.

When should you refactor? Refactor when you've learned something; when you understand something better than you did before, whether last year, yesterday, or even just ten minutes ago. Duplication, outdated knowledge, changing usage, performance, and design flaws are all valid reasons to consider refactoring.

Test your software or your users will.

  • Andrew Hunt

Don't let tenants find out the hard way, and don't let the trust you've built fade away with every feature delivered by the platform team.

Every program and every privileged user of the system should operate using the least amount of privilege necessary to complete the job.

  • Jerome Saltzer, Communications of the ACM, 1974.

Design the RBAC system and monitor the audit logs, ensuring least-privilege access is enforced. Implement network policies by default so tenants are isolated from each other and can only access the resources they need. Regularly apply security patches; Snyk is an excellent companion tool. Take a look at the snyk-controller; a warning, though: it can be noisy at the beginning.
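As a hedged example of what "least privilege by default" can look like, here is a small Go sketch that stamps out a namespace-scoped Role for a tenant's deployer identity. The role name and rule set are illustrative assumptions, not our production policy; the point is simply that tenants get narrowly scoped verbs and no secrets access unless they ask for it.

```go
package main

import (
	"fmt"

	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

// tenantDeployerRole builds a namespace-scoped Role that only allows what a
// tenant's CI/CD identity actually needs; the resource list would be tuned
// per golden path.
func tenantDeployerRole(namespace string) *rbacv1.Role {
	return &rbacv1.Role{
		TypeMeta:   metav1.TypeMeta{APIVersion: "rbac.authorization.k8s.io/v1", Kind: "Role"},
		ObjectMeta: metav1.ObjectMeta{Name: "tenant-deployer", Namespace: namespace},
		Rules: []rbacv1.PolicyRule{
			{
				APIGroups: []string{"apps"},
				Resources: []string{"deployments"},
				Verbs:     []string{"get", "list", "watch", "create", "update", "patch"},
			},
			{
				APIGroups: []string{""},
				Resources: []string{"configmaps", "services"},
				Verbs:     []string{"get", "list", "watch", "create", "update", "patch"},
			},
			// Deliberately no secrets access and no cluster-scoped permissions.
		},
	}
}

func main() {
	out, err := yaml.Marshal(tenantDeployerRole("tenant-a"))
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out))
}
```

Generating these from tooling, rather than hand-editing them, keeps the audit logs boring and the blast radius small.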

Fowler's book on refactoring is a must-read for any software engineer, especially those early in their careers. It emphasizes the importance of making small, incremental changes to codebases rather than attempting large-scale overhauls. In my experience mentoring others, I often see the common pitfall of trying to make sweeping changes all at once to create impact. Fowler's principle of incremental improvement is particularly relevant in platform engineering, where system complexity makes large changes risky and difficult to manage.

A hard truth is that as a platform grows and serves more customers, managing infrastructure as code (IaC) with Terraform can become increasingly complex, often devolving into a mess of copy-paste configurations and hard-to-maintain modules. We addressed this by prioritizing small, manageable refactors over sweeping changes:

  • For example, onboarding services to our observability platform used to require complex Terraform modules; we replaced these with simple, declarative YAML configurations that abstract away infrastructure complexity. This not only simplified onboarding but also made ongoing maintenance and extension much easier.

  • Similarly, our ingress configuration was previously managed by Terraform modules consolidated into a single team configuration YAML file; this became difficult to maintain and scaled poorly as teams grew. We transitioned to straightforward YAML descriptions for the Kong Gateway Operator, enabling teams to declaratively manage ingress resources with greater clarity and autonomy.

  • We used to manage Kubernetes service accounts with a complex Terraform setup; we refactored that configuration and replaced it with a simple CLI tool that automates the identity federation process from GKE to AWS IAM, making onboarding much smoother for teams.

These incremental improvements have significantly reduced CI/CD pipeline times from 60 minutes to 10 minutes and greatly increased developer satisfaction.

This book has taught me the value of continuous improvement and the importance of keeping codebases clean and maintainable. It's about fostering a culture of quality, where developers feel empowered to make changes that enhance the platform over time.

Managing technical debt wisely is crucial: just as in finance, borrowing (taking on debt) can be strategic if it's managed, invested and paid down thoughtfully, rather than avoided altogether.

6. The Lean Startup by Eric Ries


→ Lesson: Build Iteratively, Validate with Users

While not strictly a software book, "The Lean Startup" has had a profound impact on how I build products. Ries advocates for building products iteratively and validating assumptions through experimentation. This mindset is essential when building a platform that needs to meet the diverse needs of different teams and projects.

By applying Build-Measure-Learn principles, I've been able to transform ideas into products and features, measure how customers/developers respond, and then learn whether to pivot or persevere. Crucially, this approach helps keep interest in using the platform alive without burning out the platform team and customer teams.

  1. Build (smallest slice):
  • Example: a new gateway controller operator for two pilot teams.
  • Scope: implement the Kong ingress controller, then create a simple operator that allows teams to declaratively manage Kong CRDs: HTTPRoutes, Kong consumers, and plugins.
  2. Measure (user feedback):
  • Activation: percentage of teams that complete onboarding in less than an hour.
  • Time to first deploy, or first integration with a third-party service.
  • Support burden: percentage of support tickets related to the gateway controller.
  • Qualitative: DevEx/NPS (developer experience / net promoter score) after the first month.
  3. Learn (iterate):
  • Persevere: double down on the features that are working well, like simplifying the configuration of Kong plugins, JWT authentication, and rate limiting.
  • Pivot: if teams are struggling with the complexity of the configuration, we might need to simplify the operator or provide better documentation and examples.
  • Kill: archive if the measurable impact is flat or negative.

A few practices that pair well with Build-Measure-Learn:

  • Experiment briefs (one-pagers, in the spirit of a Request For Comments) before building.
  • Feature flags per tenant, with canary and A/B rollouts (see the sketch below).
  • Cohort dashboards comparing pilot teams against later adopters.
  • Kill-switch and rollback by flag, not by PR revert.
  • Sunset criteria baked into the brief (how we know when to retire), captured in ADRs (Architecture Decision Records).
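Here is a tiny Go sketch of the per-tenant flag idea with a kill switch and a deterministic percentage rollout. The flag name, pilot tenants, and 10% canary are invented for the example; a real platform would back this with a configuration store rather than an in-memory map.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// featureFlag describes a tenant-scoped flag with a kill switch and a
// percentage-based canary rollout. All names here are hypothetical.
type featureFlag struct {
	Name       string
	Killed     bool            // global kill switch: overrides everything
	Tenants    map[string]bool // explicit per-tenant opt-in/opt-out
	RolloutPct uint32          // 0-100: canary percentage for everyone else
}

// enabled decides whether a tenant sees the feature.
func (f featureFlag) enabled(tenant string) bool {
	if f.Killed {
		return false
	}
	if v, ok := f.Tenants[tenant]; ok {
		return v
	}
	// Deterministic bucketing so a tenant's experience is stable across calls.
	h := fnv.New32a()
	h.Write([]byte(f.Name + "/" + tenant))
	return h.Sum32()%100 < f.RolloutPct
}

func main() {
	gatewayOperator := featureFlag{
		Name:       "kong-gateway-operator",
		Tenants:    map[string]bool{"pilot-team-a": true, "pilot-team-b": true},
		RolloutPct: 10,
	}
	for _, t := range []string{"pilot-team-a", "team-c", "team-d"} {
		fmt.Printf("%-12s enabled=%v\n", t, gatewayOperator.enabled(t))
	}
}
```

Flipping `Killed` to true is the rollback path: no PR revert, no redeploy of tenant workloads.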

Innovation accounting is a key concept here, focusing on the "boring stuff": measuring progress, setting milestones, right-sizing OKRs, and measuring outcomes rather than outputs. Applying these principles helped me prioritize features based on user feedback and data-driven decisions. This is all about creating a platform that evolves based on real-world usage rather than assumptions. The iterative approach has helped us build a platform that is not only functional but also aligned with the needs of our users.

And remember, customer empathy is crucial in this process; you must always keep the end-user in mind.

7. Platform Engineering by Camille Fournier and Ian Nowland

→ Lesson: Think Like a Product Team, Not Just an Infra Team

This book is a recent addition to my reading list, but it has quickly become one of my favorites. Fournier and Nowland provide a comprehensive overview of platform engineering, covering everything from team structure to technical practices. They emphasize the importance of building platforms that are not just technically sound but also deliver real value to users.

Their insights into multi-tenancy, developer experience, and platform ROI resonate deeply with my own experiences. They provide a framework for thinking about platform engineering as a product, not merely a technical challenge. This perspective has been instrumental in shaping my approach to building a platform that developers actually use and trust.

It emphasized treating platform capabilities as user-facing products, with metrics, roadmaps, and user feedback loops.

Example: Our CI/CD Runway system is a direct application of this principle — it abstracts complexity from tenant teams and makes deployment a guided, measurable experience.

It also gave me language to communicate Platform ROI: onboarding time, support burden, adoption rates, golden path usage.

Example: We're now tracking how long it takes teams to onboard to our platform, which features they use most, and where they get stuck. This data helps us prioritize improvements and demonstrate the value of our platform investments.
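A minimal sketch of the kind of calculation behind those dashboards, assuming we record a start timestamp, a first-deploy timestamp, and a golden-path marker per team; the team names, dates, and outputs are made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// onboarding captures the timestamps recorded for each tenant team; the
// field names are illustrative, not an actual schema.
type onboarding struct {
	Team           string
	Started        time.Time
	FirstDeploy    time.Time
	UsesGoldenPath bool
}

func main() {
	day := 24 * time.Hour
	now := time.Now()
	records := []onboarding{
		{"team-a", now.Add(-10 * day), now.Add(-9 * day), true},
		{"team-b", now.Add(-8 * day), now.Add(-5 * day), true},
		{"team-c", now.Add(-6 * day), now.Add(-2 * day), false},
	}

	durations := make([]time.Duration, 0, len(records))
	golden := 0
	for _, r := range records {
		durations = append(durations, r.FirstDeploy.Sub(r.Started))
		if r.UsesGoldenPath {
			golden++
		}
	}
	sort.Slice(durations, func(i, j int) bool { return durations[i] < durations[j] })

	// Median time-to-first-deploy and golden-path adoption rate.
	median := durations[len(durations)/2]
	fmt.Printf("median time-to-first-deploy: %s\n", median)
	fmt.Printf("golden-path adoption: %.0f%%\n", 100*float64(golden)/float64(len(records)))
}
```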

Pioneers, Settlers, Town Planners

A Unified Framework: How These Books Complement Each Other, Platform Return on Investment

If I had to distill the insights from these books into a unified framework, it would look like this:

  • Pillar: Foundational Architecture & Iterative Refinement. Books: Patterns of Enterprise Application Architecture + Refactoring. What it adds: set platform design foundations, emphasizing modularity, clear interfaces, and iterative enhancements. ROI: reduced complexity, improved maintainability, and faster onboarding.

  • Pillar: Explicit Complexity Management & Decision Transparency. Books: Software Architecture: The Hard Parts + Designing Data-Intensive Applications. What it adds: provided tools for navigating trade-offs explicitly, building trust, and creating flexible, composable systems. ROI: enhanced platform reliability, reduced support burden, and improved developer satisfaction.

  • Pillar: Reliability, Observability & DevEx as Core Principles. Books: The Pragmatic Programmer + Designing Data-Intensive Applications. What it adds: reinforced platform resilience and usability, driving intuitive developer experiences. ROI: increased platform adoption, reduced friction in developer workflows, and improved system observability.

  • Pillar: Platform as a Measurable Product. Books: Platform Engineering. What it adds: integrated all previous concepts into a product-focused approach, measuring success through user satisfaction, adoption, and clear ROI. ROI: established a framework for continuous improvement, aligning platform goals with user needs and business objectives.

  • Pillar: Iterative, User-Centric Development. Books: The Lean Startup. What it adds: applied lean principles to continuously improve the platform based on user feedback, ensuring it evolves to meet real-world needs. ROI: increased platform relevance, improved user satisfaction, and enhanced adaptability to changing requirements.

Platform Return on Investment (ROI), based on these books

The ROI of a platform could be measured in several ways:

  • Onboarding Time: How quickly new teams can start using the platform.
  • Support Burden: Reduction in support tickets and issues related to platform usage.
  • Adoption Rates: Percentage of teams actively using the platform and its features.
  • Golden Path Usage: How many teams follow the recommended practices and patterns, indicating trust in the platform.
  • Developer Satisfaction: Surveys and feedback loops to measure how developers feel about the platform.

Final Thoughts

Books alone don't build great platforms and products, but they profoundly shape our decisions, mindset, and methods. By connecting these foundational books into one cohesive framework, I've consistently improved platform adoption, reliability, security, and developer happiness over the past years. I hope this article inspires you to reflect on the books that have influenced your own journey.

Think like a product team, not just a platform provider: your developers will reward you with trust, adoption, and continuous improvement. Platforms succeed when they're run like products: trusted by developers, widely adopted, and constantly improving.

Additional Resources

~/whoami

I'm a software engineer with a passion for crafting things that people love to use. Other times you'll find me on the water, kite surfing, paddling, or surfing, or chasing Formula 1 cars around the world. Thoughts are my own. Find me on (still) Twitter, Bsky, or LinkedIn.