What we learned deploying Vault and Consul on Kubernetes
Secret management, the ability to consistently store, manage, and access user, application, and infrastructure secrets (credentials, tokens, API keys, and so on) in dynamic environments, is critical to the success of any platform. It also makes microservices-based architectures easier to handle. With CI/CD cycles becoming shorter, maintaining the ability to develop, test, and deploy microservices quickly is critical. Better secret management allows the entire cloud infrastructure to remain flexible and scalable without sacrificing security in the process.
But secret management in a multi-cloud platform like Archera can be a challenging task. We must also navigate the added complexity of managing secrets across dev, staging, and production environments. For example, our Tilt integration, which we use to spawn local development environments, requires development secrets, whereas our CI/CD pipelines require staging secrets and our main app requires production secrets.
Another important consideration for us when shortlisting a secret management solution was the ability to deploy or use an existing on-prem secrets manager, which makes implementation smoother for many customer use cases. We also wanted a platform that can be deployed not just on different cloud platforms but also across multiple tenants and regions, which is imperative to the success of enterprise-grade platforms. Another important factor for many organizations is support for Kubernetes deployments, so they can take advantage of existing infrastructure and scale with it.
Secret management is one of the central components for building any scalable platform. The following are the key areas across which we evaluated different approaches to secret management when we were looking for a solution to integrate into our platform:
- Security: This is the most important consideration for obvious reasons. Transparency around the secret store architecture helps build confidence and hence adoption. Having an open source architecture is a big plus.
- Robustness and Scalability: A good failover recovery strategy and a configurable autoscaling mechanism are very important for keeping downtime to a minimum.
- Access via Standard APIs and Libraries: Ability to perform CRUD operations on the secrets via a standard set of APIs or libraries will help reduce friction in adoption.
- User and Group Management: Since multiple users will be involved in most instances, the ability to manage users and groups with well-defined RBAC policies is crucial.
- Platform Agnostic: The secret management tool should not be tied to a particular cloud platform and should be easy to migrate from one platform to another. Native library support for most, if not all, major programming languages is also a big plus.
Tools and Research
When it comes to secret management, there is a plethora of tools to choose from. We will restrict our discussion to comparing AWS Secrets Manager, GCP Secret Manager, and Azure Key Vault (which provide more or less similar functionality despite coming from different cloud providers) with HashiCorp's Vault and Consul.
All tools in the first category are great solutions and are well integrated with other services in their respective ecosystems. The ability to manage user permissions, automatic key rotation, and API/library access are common across all of them, and all are built on highly scalable platforms. The biggest downside is that each of these tools is specific to one cloud provider. Developers also have limited visibility into the storage backend architecture of these tools and therefore limited ability to configure it for their use case. That said, this level of customizability may not be a requirement for many organizations.
The second category consists of open source tools that are highly configurable and can work with on-prem deployments. Vault is a secret management solution, and Consul is a popular storage backend that is fault-tolerant and highly scalable. Both Vault and Consul are offered by HashiCorp. Together, they are a promising solution for teams running and managing other services on Kubernetes, because they can be deployed on a Kubernetes cluster as Helm charts. Though there are many more features, here are some we wanted to highlight:
- Dynamic Credentials
- Identity based access control
- Partitioning of secrets
- High Availability deployment option
- Configurable storage backends
- Standardized API and library access
Since most of our services run on Kubernetes, it was the obvious choice of deployment medium. We used the Vault and Consul Helm charts for this purpose. Here is a reference architecture for Archera’s Vault/Consul deployment describing the high-level setup and its interaction with our internal and external services via an ingress controller:
Our initial deployment consisted of the out-of-the-box configuration with a file storage backend. Although that is fine for experimentation, it is not well suited to production-grade deployments, so we eventually moved to Consul. Here is how it works: the client talks to the Vault server over HTTPS, and the Vault server processes the requests and then forwards them to the Consul agent on a loopback address. The Consul client agents serve as an interface to the Consul servers. They are very lightweight and maintain very little state of their own. The Consul servers store the secrets encrypted at rest.
This setup laid the groundwork for adding an ingress to load balance the L7 traffic, which in turn is used by our various environments (development, staging, and production), CI/CD systems, and Argo workflow pipelines. Secrets can now be segmented across different environments and systems, which lets us set specific permissions for each. It also enables central management with minimal disruption to the application.
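As a rough illustration of the client side of this flow, the sketch below builds an HTTPS read request against Vault's HTTP API and unpacks the response body. It assumes the KV version 2 secrets engine mounted at `secret`; the address, token, and secret path are hypothetical placeholders, not values from our deployment.

```python
import json
import urllib.request


def build_kv2_read_request(vault_addr: str, token: str, mount: str,
                           secret_path: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for a KV v2 secret.

    KV v2 exposes secret contents under /v1/<mount>/data/<path>, and the
    client token travels in the X-Vault-Token header.
    """
    url = "{}/v1/{}/data/{}".format(
        vault_addr.rstrip("/"), mount.strip("/"), secret_path.strip("/")
    )
    return urllib.request.Request(
        url, headers={"X-Vault-Token": token}, method="GET"
    )


def extract_kv2_data(response_body: bytes) -> dict:
    """Return the key/value pairs of a KV v2 read response.

    KV v2 nests the secret itself under data.data, next to data.metadata.
    """
    payload = json.loads(response_body)
    return payload["data"]["data"]
```

A caller would pass the request to `urllib.request.urlopen` and feed the response body to `extract_kv2_data`. Keeping request construction separate from sending makes the path and header logic easy to check without a running Vault server.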
We followed the above reference architecture (image from HashiCorp): a three-node Vault cluster with one active node and two standby nodes, each with a Consul agent sidecar that talks to the five-node Consul server cluster on behalf of the Vault node. The architecture can also be extended across multiple availability zones, making the cluster highly fault-tolerant.
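The choice of five Consul servers can be motivated with a little arithmetic. Consul servers replicate state with the Raft consensus protocol, which needs a majority (quorum) of servers to be available. The sketch below, with hypothetical function names, computes the quorum size and the number of server failures a cluster of a given size can absorb.

```python
def quorum_size(servers: int) -> int:
    """Smallest majority of a cluster with `servers` voting members."""
    if servers < 1:
        raise ValueError("a cluster needs at least one server")
    return servers // 2 + 1


def failure_tolerance(servers: int) -> int:
    """How many servers can fail while a quorum is still reachable."""
    return servers - quorum_size(servers)
```

Under this model a five-server cluster has a quorum of three and tolerates two failures, while a three-server cluster tolerates one. A four-server cluster also tolerates only one, which is why odd cluster sizes are the usual choice.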
We were able to meet most, if not all, of our requirements by using Vault and Consul. Here is a list of the key objectives that we achieved:
- A consistent and reliable secret store with standard access mechanisms in the form of APIs and libraries
- Ability to manage user policies and to add auth providers such as GitHub making it easier to manage permissions for various groups
- Using paths to segregate secrets for dev, staging and production environments
- HA mode ensures robustness and scalability for critical systems such as applications, CI/CD, Argo workflows etc
- Platform agnostic deployment via Kubernetes and Helm, which makes migration to other cloud providers easier
- Multi-tenancy and on-prem support for facilitating customer use-cases
- Ability to configure underlying storage backend
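To make the path-based segregation above more concrete, here is a minimal sketch of one possible layout. The path scheme (`<mount>/data/<env>/<service>/<name>`), the KV version 2 mount named `secret`, and the helper names are all assumptions made for illustration, not our actual conventions. The second helper renders a read-only Vault policy in HCL that covers a single environment.

```python
ENVIRONMENTS = ("dev", "staging", "production")


def _check_env(env: str) -> None:
    """Reject environment names outside the known set."""
    if env not in ENVIRONMENTS:
        raise ValueError("unknown environment: {!r}".format(env))


def secret_path(env: str, service: str, name: str, mount: str = "secret") -> str:
    """API path of a secret under the hypothetical per-environment layout."""
    _check_env(env)
    return "{}/data/{}/{}/{}".format(mount, env, service, name)


def read_only_policy(env: str, mount: str = "secret") -> str:
    """Render an HCL policy granting read access to one environment only."""
    _check_env(env)
    return (
        'path "{}/data/{}/*" {{\n'.format(mount, env)
        + '  capabilities = ["read"]\n'
        + "}\n"
    )
```

With a layout like this, a CI/CD token bound to the staging policy can read `secret/data/staging/...` but has no access to paths under `production`.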
Implementing a secret management solution in a platform can be a hard problem since the changes have to be propagated at many levels ranging from application to infrastructure. However, having the right set of tools can lay the groundwork and make the transition easier. Here are some of the lessons we learned while coming up with our existing architecture:
- Scalability is one of the key aspects of implementing a secret store. It might not be an important consideration in the beginning, but it can become a bottleneck as the application and systems grow. We realized this quickly when requests to our initial test deployments took unusually long to return, because Vault was deployed alongside other resource-hungry workloads such as Argo. So always have a scaling strategy suitable for your platform in place when designing such systems.
- There are various ways to configure your Vault/Consul deployment depending on your use case, for example High Availability or multi-cluster deployment. Listing your requirements will help you shortlist a deployment strategy. The tutorials from HashiCorp are a good place to start.
- We noticed the need for resource reservation for the Vault/Consul deployment, especially when it runs alongside resource-hungry applications like Argo. We eventually moved our setup to a separate cluster altogether to maintain near 100% uptime.
- Having an auto-unseal mechanism is critical for failover recovery. Without it, your deployment can stay down indefinitely until someone manually unseals Vault with the unseal keys.
- Understanding the various storage backends supported by Vault, along with their pros and cons, is an important step that should be taken into account in the early design phases. We explored storage backends like S3 and RDS before settling on Consul due to its smooth integration with Vault and its robustness.
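The auto-unseal lesson above is easier to act on if sealed nodes are detected quickly. Vault's `/v1/sys/health` endpoint reports node state through its HTTP status code, and the sketch below interprets the documented default codes. The pod names in the usage note and the idea of collecting codes into a dictionary are assumptions for illustration; how the codes are gathered (a probe, a cron job, a monitoring agent) is left open.

```python
from typing import Dict, List

# Default status codes returned by Vault's /v1/sys/health endpoint.
HEALTH_STATES = {
    200: "active",               # initialized, unsealed, and active
    429: "standby",              # unsealed and standby
    472: "dr-secondary",         # disaster recovery secondary
    473: "performance-standby",  # performance standby
    501: "uninitialized",        # not initialized
    503: "sealed",               # sealed
}


def classify_health(status_code: int) -> str:
    """Map a /v1/sys/health status code to a short state label."""
    return HEALTH_STATES.get(status_code, "unknown")


def sealed_nodes(codes_by_node: Dict[str, int]) -> List[str]:
    """Return the sorted names of nodes whose health check reports sealed."""
    return sorted(
        node for node, code in codes_by_node.items()
        if classify_health(code) == "sealed"
    )
```

For example, given codes of 200, 503, and 429 for three hypothetical pods `vault-0`, `vault-1`, and `vault-2`, `sealed_nodes` returns only `vault-1`, which is the node an operator or an auto-unseal mechanism would need to bring back.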
There are many great tools out there for secret management. Every project has a different set of requirements and objectives, so at the end of the day the choice of tool will depend on your use case. If you are already tied to a particular cloud provider and don't need much flexibility in configuring your secret store, then the secret manager from that cloud provider might be a good option. However, if you are looking for a more generic and configurable secret store with a robust and well-proven design, then Vault might be the way to go.