The past couple of years have been tough for IT organizations. Facing headwinds from the COVID-19 pandemic and other macroeconomic factors, teams have been tasked with optimizing their cloud infrastructure footprint while keeping business-critical services up and running. Today, Google is excited to publish the inaugural State of Kubernetes Cost Optimization report, which gives the Kubernetes community insights and best practices for running cost-efficient clusters in the public cloud without compromising the performance or reliability of their workloads.

Why we authored this report

The report addresses the intersection of two trends: IT organizations looking to reduce costs, and the continued rise of Kubernetes adoption across industries. We performed a large-scale analysis of Kubernetes clusters to understand what sets high performers apart in cost optimization. Now, we’re excited to share our key findings.

How we conducted this research

The report centers on four “golden signals” for Kubernetes cost optimization (not to be confused with the four golden signals of monitoring). These signals, derived from years of collaboration with Kubernetes users, can be used to measure how well you balance workload reliability against cost optimization in your clusters.

The “golden signals” of Kubernetes cost optimization

Using these golden signals as a baseline measurement, we looked at large-scale, anonymized data from Google Kubernetes Engine (GKE) clusters, sorting clusters into At Risk, Low, Medium, High, and Elite segments using a classification tree weighted with quasi-equal intervals. With these five segments in place, we compared and analyzed how high-performing clusters perform against these golden signals. 

The most important takeaway: set your requests!

Do you know how many workloads in your production clusters are not setting requests? One of our key observations is that many developers do not set requests for their workloads, and that’s worrisome. Because Kubernetes reclaims resources when node pressure occurs, it is critical to set requests for any workload that requires even a minimum level of reliability.

Not setting requests (or limits) on any container in a Pod implicitly assigns the BestEffort Quality of Service (QoS) class to that Pod. When resources run short on a given Node, BestEffort Pods are often the first to be killed, without any warning or graceful termination. This can cause intermittent performance or reliability issues for your workloads, and whether it happens depends on Pod resource utilization and on where the scheduler places Pods. When these issues arise, they can be difficult to identify and debug.
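To make the QoS rules concrete, here is a minimal Python sketch of how a QoS class follows from a Pod spec. It is a simplification under stated assumptions, not the Kubernetes implementation: the function name `qos_class` is our own, only `cpu` and `memory` are considered, and quantities are compared as literal strings (so `"500m"` and `"0.5"` would not be treated as equal).

```python
# Simplified sketch of Kubernetes QoS classification (not the real implementation).
# Assumption: quantities are compared as literal strings, e.g. "1Gi" == "1Gi".

QOS_RESOURCES = ("cpu", "memory")


def qos_class(pod_spec: dict) -> str:
    """Return "BestEffort", "Burstable" or "Guaranteed" for a Pod spec dict."""
    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    any_set = False      # does any container set a cpu/memory request or limit?
    guaranteed = True    # do all containers have limits with requests equal to them?

    for container in containers:
        resources = container.get("resources") or {}
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        for name in QOS_RESOURCES:
            request = requests.get(name)
            limit = limits.get(name)
            if request is not None or limit is not None:
                any_set = True
            # When a request is omitted but a limit is set, the request
            # defaults to the limit.
            effective_request = request if request is not None else limit
            if limit is None or effective_request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a Pod whose only container has no `resources` block comes back as `BestEffort`, while adding just a CPU request moves it to `Burstable`.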

To identify workloads that do not set requests, you can use one of the following tools:

  • If you are running Kubernetes clusters in GKE, use the GKE Workloads at Risk dashboard. It identifies workloads that have not set requests across your fleet of GKE clusters, along with other workloads whose request settings put their performance or reliability at risk.
  • If you want a very simple script to list containers that are not setting requests in any Kubernetes cluster, check out kube-requests-checker.
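If you would rather see the idea in code, the check itself is small. The sketch below is our own illustration, not the implementation of either tool above: the function name `containers_missing_requests` is hypothetical, and it assumes its input is the parsed JSON produced by `kubectl get pods --all-namespaces -o json`.

```python
# Illustrative sketch: list containers that have no CPU or memory request.
# Assumption: pod_list is the parsed output of
#   kubectl get pods --all-namespaces -o json


def containers_missing_requests(pod_list: dict, resources=("cpu", "memory")) -> list:
    """Return (namespace, pod, container, missing_resources) tuples."""
    missing = []
    for pod in pod_list.get("items", []):
        metadata = pod.get("metadata", {})
        for container in pod.get("spec", {}).get("containers", []):
            requests = (container.get("resources") or {}).get("requests") or {}
            absent = [name for name in resources if name not in requests]
            if absent:
                missing.append(
                    (
                        metadata.get("namespace", ""),
                        metadata.get("name", ""),
                        container.get("name", ""),
                        absent,
                    )
                )
    return missing
```

To try it, save the `kubectl` output to a file, load it with `json.load`, and pass the result to the function. Each returned tuple names a container that is worth a closer look.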

Once you’ve set requests for your workloads, you can then proceed with workload rightsizing. This golden signal is at the heart of the cost optimization journey; if requests more closely reflect reality, then the decisions Kubernetes makes using requests will be more effective. 
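As a rough illustration of what “reflecting reality” can mean, the sketch below derives a suggested request from observed usage samples by taking a high percentile and adding some headroom. This is one possible heuristic under our own assumptions, not the method used in the report or in GKE: the function name `suggest_request`, the 95th-percentile default, and the 15% headroom are all hypothetical choices.

```python
import math

# Illustrative heuristic only: pick a high percentile of observed usage
# (nearest-rank method) and add headroom. The defaults are assumptions.


def suggest_request(usage_samples, percentile=95, headroom=0.15):
    """Suggest a request, in the same unit as the samples (e.g. CPU cores)."""
    if not usage_samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(usage_samples)
    # Nearest-rank percentile: the smallest value that covers
    # `percentile` percent of the samples.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1] * (1 + headroom)
```

Comparing a suggestion like this with the request a workload currently declares gives a first indication of whether it is over- or under-provisioned.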

We are seeing this focus on properly setting resource requests across the community. Ajay Tripathy, the author of the OpenCost project, noted that “setting appropriate resource requests and prioritizing workload rightsizing…is the biggest area of opportunity for OpenCost users.”

In conclusion

No one team alone is responsible for Kubernetes cost optimization — rather, it’s a joint effort that spans developers, platform admins, and even billing and budget owners. The report contains insights and recommendations for each of these personas in its key findings. 

We also know that the lessons from these findings are not one-time fixes. Rather, they are continuous practices that you should build into your team culture over time.

To learn more about how we take lessons from the State of Kubernetes Cost Optimization report and build them into GKE, check out the resources below:

  • A solution guide on best practices for running cost-optimized Kubernetes applications on GKE
  • A solution guide on rightsizing your workloads at scale in GKE
  • A demo video on rightsizing your workloads at scale in GKE
  • A demo video on using the cloud console for GKE Optimization
  • An interactive tutorial to try out GKE with sample workloads