Securing apps for Googlers using Anthos Service Mesh

Hi there! In this blog post, we’d like to discuss how Corp Eng is adopting Anthos Service Mesh internally at Google.

Corp Eng is Google’s take on “Enterprise IT”. A big part of the Corp Eng mission is running the first- and third-party software that powers internal business processes – from legal and finance to floor planning and even the app hosting our cafe menus – all with the same security and production standards as any of Google’s first-party applications.

Googlers need to access these applications, which in turn sometimes need to access other applications or other Google Cloud services. This traffic can cross different trust boundaries, each of which can trigger different policies.

Access SRE runs the systems that mediate this access, and we implemented Anthos Service Mesh as part of our solution to secure the way Googlers access these applications.

But why?

As you can probably tell, the applications Corp Eng is responsible for have disparate requirements. This often means that certain applications are tied to particular infrastructure for legal, business, or technical reasons – which can be challenging when those infrastructures work and operate differently.

Enter Anthos. Google Cloud built Anthos to provide a consistent platform interface unifying the experience of working with apps on these varying underlying infrastructures, with the Kubernetes API at its foundation.

So when searching for the right tool to build a common authorization framework to mediate access to Corp Eng services, we turned to Anthos – specifically Anthos Service Mesh, powered by the open-source Istio project. Whether these services were deployed in Google Cloud, in Corp Eng data centers, or at the edge onsite at actual Google campuses, Anthos Service Mesh delivered a consistent means for us to program secure connectivity.

To frame the impact ASM had on our organization, it’s helpful to introduce the roles of the folks who manage and use it:

Figure 1 – Anthos Service Mesh empowers multiple people across different roles to connect services securely

For security stakeholders, ASM provides an extensible policy enforcement point running next to each application, capable of provisioning a certificate based on the identity of the workload and enforcing mandatory, fine-grained, application-aware access controls.

For platform operators, ASM is delivered as a managed product, which reduces operational overhead by providing out-of-the-box release channels, maintenance windows, and a published Service Level Objective (SLO).

For service owners, ASM enables the decoupling of their applications from networking concerns, while also providing features like rate limiting, load shedding, request tracing, monitoring, and more. Features like these were typically only available for applications that ran on Borg, Google’s first-party cluster manager that ultimately inspired the creation of Kubernetes.

In sum, we were able to secure access to a plethora of different services with minimal operational overhead, all while giving service owners granular traffic control.

Let’s see what this looks like in practice!

The architecture

Figure 2 – High-level architecture for Corp Eng services and Anthos

In this flow, user access first reaches the Google Cloud Global Load Balancer [1], configured with Identity-aware Proxy (IAP) and Cloud Armor. IAP is the publicly available implementation of Google’s internal BeyondCorp philosophy, providing an authentication layer that works from untrusted networks without the need for a VPN.

Once a user is authenticated, their request then flows to the Ingress Gateway provided by Anthos Service Mesh [2]. The gateway provides additional checks to ensure that traffic reaches services only when the request has come through IAP, while also enforcing mutual TLS (mTLS) between the Anthos Service Mesh gateway and the Corp services owned by various teams.

Finally, additional policies are enforced by the sidecar running in every single service Pod [3]. Policies are pulled from source control using Anthos Config Management [4], and are propagated to all sidecars by the managed control plane provided by Anthos Service Mesh [5].

Managing the mesh

If you’re not familiar with how Istio works, it follows the pattern of a control plane and a data plane. We talked a little bit about the data plane already – it is made up of the sidecar containers running alongside all of our service Pods. The control plane, however, is what’s responsible for updating these sidecars with the policies we want them to enforce:

Figure 3 – High-level architecture for Istio

Thus, it is critical for us to ensure that the control plane is healthy. This is where Anthos Service Mesh gives our platform owners a huge advantage with its support for a fully managed control plane. Like many other companies, we use Terraform, the popular open-source infrastructure-as-code project, to provision cloud resources. This gave us a declarative and familiar means of provisioning the Anthos Service Mesh control plane.

First, you enable the managed control plane feature for GKE by creating the google_gke_hub_feature resource below using Terraform.

  resource "google_gke_hub_feature" "feature_asm" {
  name     = "servicemesh"
  location = "global"
  provider = google-beta
}

Keep in mind that at publication time, this is only available via the google-beta provider in Terraform.

Once created, we then provision a ControlPlaneRevision custom resource in a GKE cluster to spin up a managed control plane for ASM in that cluster:

apiVersion: mesh.cloud.google.com/v1alpha1
kind: ControlPlaneRevision
metadata:
  name: asm-managed
  namespace: istio-system
spec:
  type: managed_service
  channel: regular

Using this custom resource, we are able to set the release channel for the ASM managed control plane. This allows our platform team to define the pace of upgrades in accordance with our needs.

In addition to managing the control plane, ASM also provides management functionality around the data plane to ensure each sidecar Envoy is kept up to date with the latest security updates and is compatible with the control plane – one less thing for service operators to worry about. It does this using Kubernetes Mutating Admission Webhooks and Namespace labels to modify the Pod workload definitions to inject the appropriate sidecar proxy version.
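To make this concrete, here is a minimal sketch of what enabling injection looks like from a service team’s perspective: labeling a namespace with the control plane revision. The namespace name below is hypothetical; the revision label matches the asm-managed ControlPlaneRevision shown above.

apiVersion: v1
kind: Namespace
metadata:
  # Hypothetical namespace for a Corp Eng service.
  name: foo-service
  labels:
    # Tells the ASM mutating admission webhook to inject the sidecar
    # proxy that matches the asm-managed control plane revision.
    istio.io/rev: asm-managed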

Syncing mandatory access policies

With the core Anthos Service Mesh components in place, our security practitioners can define consistent, mandatory security policies for every GKE cluster using Istio APIs.

For example, one policy enforces strict mTLS between Pods using automatically provisioned workload identity certificates. Earlier, we talked about how this is enforced between the Istio gateway and our services; that same policy enforces mTLS between all Pods in the cluster.

Figure 4 – A high-level diagram of mutual TLS
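The cluster-wide strict mTLS requirement can be expressed with Istio’s PeerAuthentication API. Here is a minimal sketch – the resource name is illustrative, and placing it in the root namespace (istio-system) makes it apply mesh-wide:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  # A PeerAuthentication in the root namespace applies to the whole mesh.
  name: mesh-wide-mtls
  namespace: istio-system
spec:
  mtls:
    # Reject any plaintext traffic between sidecars.
    mode: STRICT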

Another policy we implement is denying all egress traffic by default, requiring service teams to explicitly declare their outbound dependencies. The following is an example of using an Istio ServiceEntry to allow granular access to a specific external service – in this case, www.google.com. This helps prevent unintended access to external services.

apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google
spec:
  hosts:
  - www.google.com
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  resolution: DNS
  location: MESH_EXTERNAL
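The default-deny posture itself can be expressed in a few ways; one sketch, using a mesh-wide Istio Sidecar resource, is shown below. With the outbound traffic policy set to REGISTRY_ONLY, sidecars only route egress to hosts known to the mesh – in-mesh services and explicitly declared ServiceEntry resources like the one above.

apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  # A Sidecar resource named "default" in the root namespace applies to
  # all workloads that lack a more specific Sidecar configuration.
  name: default
  namespace: istio-system
spec:
  outboundTrafficPolicy:
    # Only allow egress to destinations registered in the mesh.
    mode: REGISTRY_ONLY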

These policies are automatically synced to all service mesh namespaces in each cluster using Anthos Config Management. By using our internal source control system as the source of truth, Anthos Config Management can sync and reconcile policies across all of the GKE clusters, ensuring that these policies are in place for every single one of the services. You can find more details in the Anthos Config Management documentation.
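As an illustration of how that sync is wired up, the sketch below shows a Config Sync RootSync resource pointing at a policy repository. The repository URL and directory are placeholders – our actual setup syncs from Google’s internal source control rather than a public Git host.

apiVersion: configsync.gke.io/v1beta1
kind: RootSync
metadata:
  name: root-sync
  namespace: config-management-system
spec:
  sourceFormat: unstructured
  git:
    # Placeholder repository and directory holding the mandatory mesh policies.
    repo: https://example.com/corp-eng/mesh-policies
    branch: main
    dir: policies
    # The auth method depends on how the repository is hosted.
    auth: none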

With this in place, we plan to eventually migrate away from security automation that operates solely on explicit IP, port, and protocol policies.

Integration with Identity-aware Proxy

The publicly available version of the BeyondCorp proxy used by Corp Eng is called Identity-aware Proxy (IAP), which offers an integration with Anthos Service Mesh. IAP allows you to authenticate users trying to access your services and to apply Context-Aware Access policies based on attributes like user identity and device state. This integration comes with two main benefits: the mesh can verify that every request actually came through IAP, and it can enforce fine-grained, context-aware access policies on those requests.

Identity-aware Proxy allows us to capture this information in a Request Context Token (RCToken), a JSON Web Token (JWT) created by Identity-aware Proxy that can be verified by ASM. IAP inserts this JWT into the Ingress-Authorization header. Using Istio authorization policies similar to the one below, any request without this JWT is denied:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: iap-gateway-require-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istio-iap-ingressgateway
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]
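For the gateway to evaluate request principals at all, the incoming JWT has to be validated first, which Istio does with a RequestAuthentication resource. The sketch below assumes the issuer and public key endpoint documented for IAP-signed tokens and reads the token from the Ingress-Authorization header described above – verify these values against the current IAP documentation before relying on them.

apiVersion: security.istio.io/v1beta1
kind: RequestAuthentication
metadata:
  name: iap-gateway-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istio-iap-ingressgateway
  jwtRules:
  # Assumed issuer and key endpoint for IAP-signed tokens.
  - issuer: "https://cloud.google.com/iap"
    jwksUri: "https://www.gstatic.com/iap/verify/public_key-jwk"
    fromHeaders:
    # The RCToken arrives in the Ingress-Authorization header.
    - name: Ingress-Authorization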

Here is an example policy that requires a fullyTrustedDevice access level – this might be a device in your organization that is known to be corporate-owned, fully updated, and running an IT-approved configuration:

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: require-fully-trusted-device
  namespace: foo-service
spec:
  selector:
    matchLabels:
      app: foo-service
  action: ALLOW
  rules:
  - from:
    - source:
        requestPrincipals: ["*"]
    when:
    - key: request.auth.claims[google.access_levels]
      values: ["accessPolicies/$orgId/accessLevels/fullyTrustedDevice"]

This allows the security team not only to secure service-to-service communications and outbound calls from services, but also to require that incoming requests come from authenticated users on trusted devices.

Enabling service teams

As SREs, one of our priorities is ensuring that service-level indicators (SLIs), SLOs, and service-level agreements (SLAs) exist for services. Anthos Service Mesh helps us empower service owners to do this for their services, as it exposes horizontal request metrics like latency and availability for all services in the mesh.

Before Anthos Service Mesh, each application had to export these metrics separately (if at all). With ASM, service owners can easily define their service’s SLOs in the Cloud Console or via Terraform using these horizontally exported metrics. This then allows us to integrate SLOs into higher-level service definitions so SLO monitoring and alerting are enabled by default. See the SRE book for more details on SLOs and error budgets.

The takeaway

ASM is a powerful tool that enterprises can use to modernize their IT infrastructure. It provides a consistent way to secure, connect, and observe services across disparate infrastructures, with minimal operational overhead for platform teams and granular traffic control for service owners.

This also enables operational capabilities such as distributed tracing and incremental canary rollouts, which have historically been difficult to find in the typical enterprise application landscape.
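To illustrate the canary capability, here is a sketch of an incremental rollout using Istio’s traffic-management APIs. The service name, namespace, and version labels are hypothetical; a DestinationRule defines the stable and canary subsets, and a VirtualService shifts a small share of traffic to the canary.

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: foo-service-versions
  namespace: foo-service
spec:
  host: foo-service
  subsets:
  # Subsets map to the Pod labels of each deployed version.
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: foo-service-canary
  namespace: foo-service
spec:
  hosts:
  - foo-service
  http:
  - route:
    # Keep most traffic on the stable version while the canary bakes.
    - destination:
        host: foo-service
        subset: stable
      weight: 95
    - destination:
        host: foo-service
        subset: canary
      weight: 5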

Because it can be incrementally adopted and composed with existing authorization systems to close gaps, the barriers to adoption are low – and we recommend you start evaluating it today!
