26 Oct 2024

AWS CDK - you can't polish a turd

Mythbusters tries to polish a turd (and succeeds).

First there was clickops, then came CloudFormation, and now we have CDK. But can anything really polish the turd that is CloudFormation?

AWS CloudFormation

CloudFormation is an AWS service that lets you declaratively describe AWS resources using templates. CloudFormation templates are regular JSON or YAML files, with a handful of basic templating functions implemented as syntactic sugar.

AWS Cloud Development Kit

Cloud Development Kit (CDK) is AWS' new infrastructure as code (IaC) tool. It was not built as a replacement for CloudFormation, but on top of it. CDK tries to solve the verbose and repetitive nature of declarative templates by letting you write your infrastructure imperatively in your favorite general-purpose language, such as TypeScript, Python, or Java. CDK can be thought of as a souped-up templating engine for CloudFormation. This, however, means that CDK suffers from all the limitations of CloudFormation, and also introduces new ones.
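To make the "souped-up templating engine" idea concrete, here is a minimal sketch in plain TypeScript, with no CDK library involved. The buildTemplate helper and the bucket names are hypothetical; the point is only that a loop in a general-purpose language can emit the repetitive resource blocks you would otherwise write by hand. Real CDK does considerably more, but what comes out the other end is still a CloudFormation template.

```typescript
// Hypothetical illustration, NOT the CDK API: generate a CloudFormation
// template object from a loop instead of hand-writing each resource.
type Resource = { Type: string; Properties: Record<string, unknown> };
type Template = { Resources: Record<string, Resource> };

function buildTemplate(bucketPrefixes: string[]): Template {
  const resources: Record<string, Resource> = {};
  for (const prefix of bucketPrefixes) {
    // Derive one logical ID per bucket, e.g. "logs" becomes "LogsBucket".
    const logicalId = prefix.charAt(0).toUpperCase() + prefix.slice(1) + "Bucket";
    resources[logicalId] = {
      Type: "AWS::S3::Bucket",
      Properties: {
        VersioningConfiguration: { Status: "Enabled" },
      },
    };
  }
  return { Resources: resources };
}

// Three near-identical resources generated from three strings.
const template = buildTemplate(["logs", "assets", "backups"]);
const templateJson = JSON.stringify(template, null, 2);
```

The resulting templateJson is what would be handed to CloudFormation, which is why every CloudFormation limitation below still applies.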

Reinventing the square wheel?

With IaC tools like Terraform already ubiquitous in the industry, do we really need another one? Let's look at some of CDK and CloudFormation's limitations and drawbacks.

The bad

CloudFormation only supports AWS services. While it is possible to manage third-party resources using custom resources, you will need to implement custom provisioning logic to handle the CRUD updates to those resources. This makes common use cases like provisioning an EKS cluster and deploying a pod inside it non-trivial.
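The custom provisioning logic looks roughly like the sketch below. CloudFormation sends a custom resource handler an event whose RequestType is Create, Update, or Delete, and expects a response carrying a Status and a PhysicalResourceId. The ThirdPartyClient interface and its methods are hypothetical stand-ins for whatever system you are managing, the client is synchronous for brevity, and the step of sending the response back to CloudFormation is left out.

```typescript
// Shape of the event CloudFormation sends to a custom resource handler
// (only the fields used in this sketch).
interface CustomResourceEvent {
  RequestType: "Create" | "Update" | "Delete";
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
  PhysicalResourceId?: string; // present on Update and Delete
  ResourceProperties: Record<string, string>;
}

interface CustomResourceResponse {
  Status: "SUCCESS" | "FAILED";
  Reason?: string;
  PhysicalResourceId: string;
  StackId: string;
  RequestId: string;
  LogicalResourceId: string;
}

// Hypothetical client for the third-party system being managed.
interface ThirdPartyClient {
  create(props: Record<string, string>): string; // returns the new resource ID
  update(id: string, props: Record<string, string>): void;
  remove(id: string): void;
}

function handle(event: CustomResourceEvent, client: ThirdPartyClient): CustomResourceResponse {
  const base = {
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId,
  };
  try {
    let id = event.PhysicalResourceId ?? "";
    switch (event.RequestType) {
      case "Create":
        id = client.create(event.ResourceProperties);
        break;
      case "Update":
        client.update(id, event.ResourceProperties);
        break;
      case "Delete":
        client.remove(id);
        break;
    }
    return { ...base, Status: "SUCCESS", PhysicalResourceId: id };
  } catch (err) {
    // Describe the failure; CloudFormation acts on it once it receives FAILED.
    return {
      ...base,
      Status: "FAILED",
      Reason: err instanceof Error ? err.message : String(err),
      PhysicalResourceId: event.PhysicalResourceId ?? "unknown",
    };
  }
}
```

Even in this stripped-down form, you own every branch and every failure mode, which is work a multi-provider tool would have done for you.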

A CloudFormation stack cannot deploy resources across accounts or regions. To get around this, you need to deploy them as separate stacks and manually pass resource identifiers around.

CloudFormation is slooow. It is slow at creating, slow at updating, and slow at deleting. There's not much you can do about it because it's all under the hood, so go browse Reddit while it's updating. Unless you get into a really nasty situation with nested stacks that requires you to open a case with AWS support to unblock you. Yes, it's happened to me before.

CloudFormation has a per-stack resource limit of 500. Once you hit this limit, you will need to break down your stack and use nested stacks or multiple stacks. Nested stacks can get stuck (mentioned above), and multiple stacks introduce cross-stack dependencies in CDK so complex that they have earned the name "deadly embrace".
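One way to avoid discovering the limit at deploy time is to count the resources in a generated template first. The sketch below assumes the template has already been parsed into an object; checkResourceLimit and the 80% warning threshold are hypothetical choices of mine, while the limit of 500 is the one stated above.

```typescript
const MAX_RESOURCES_PER_STACK = 500; // per-stack limit discussed above

type StackTemplate = { Resources?: Record<string, unknown> };
type LimitCheck = { count: number; status: "ok" | "warn" | "over" };

// Hypothetical helper: classify a template by how close it is to the limit.
function checkResourceLimit(template: StackTemplate, warnRatio = 0.8): LimitCheck {
  const count = Object.keys(template.Resources ?? {}).length;
  if (count > MAX_RESOURCES_PER_STACK) return { count, status: "over" };
  if (count >= MAX_RESOURCES_PER_STACK * warnRatio) return { count, status: "warn" };
  return { count, status: "ok" };
}

// Usage: a fake template with 450 resources is past the 80% mark.
const resources: Record<string, unknown> = {};
for (let i = 0; i < 450; i++) {
  resources[`Queue${i}`] = { Type: "AWS::SQS::Queue" };
}
const result = checkResourceLimit({ Resources: resources });
// result.status is "warn" because 450 >= 400 and 450 <= 500
```

A check like this only tells you when to start splitting; it does nothing about the cross-stack dependencies that splitting creates.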

CloudFormation has limited drift detection and no drift correction ability. CloudFormation can detect manual changes to certain resources but it is unable to change them back to the desired state.

...the ugly

New AWS services take longer to support CloudFormation than third-party tools. Amazon Managed Grafana (AMG) took 13 months before CloudFormation resources became available, compared to six months for Terraform, and seven months for Pulumi.

Inconsistent interfaces. Amazon teams are structured to be bottom-up, and the company avoids giving top-down edicts. This is great for team autonomy, but terrible when you need to present a unified experience. This can clearly be seen in the inconsistent user experience between different services in CloudFormation (and the AWS console). The VPC parameter is spelt VpcId in AWS::EC2::Subnet, while in AWS::RDS::DBInstance it is spelt VPCSecurityGroups. It might seem like a nitpick, but when you are reading and writing templates all day, small things like this just add to the cognitive load. Not to mention, it's OCD-inducing.

CloudFormation has limited support for importing existing resources, such as a resource created in the console that later got promoted to production. For that resource to be managed by CloudFormation, it will need to be deleted and recreated through the template. This also makes refactoring and moving resources around anywhere from extremely difficult to impossible.

...and the fugly

CDK is full of leaky abstractions. CDK constructs are abstractions of CloudFormation resources, with level 1 constructs (L1) being a one-to-one mapping to a CloudFormation resource, and level 3 constructs (L3) being a fully baked, batteries-included blueprint of commonly used patterns. Problems arise when L2 and L3 constructs abstract (important) details away, forcing you to fall back to L1 constructs, at which point you are just writing CloudFormation with extra steps. For example, the L2 Vpc construct assumes you want to stripe your subnets equally across availability zones, and if you don't, you must fall back to L1 constructs.

Leaky abstractions make it difficult to debug your CDK. When you write CDK, the TypeScript code is transpiled to JavaScript, and then executed to generate CloudFormation templates. During the generation process, certain runtime (or I guess CloudFormation compile time?) checks are done. These checks might raise exceptions to inform you of invalid parameters, but the stack trace is of the transpiled JavaScript, so line numbers will not match what you wrote in TypeScript, making the error hard to trace back to your source.

Error: A subscription with id "Topic" already exists under the scope FooStack/BazLambda
    at Topic.addSubscription (/home/ubuntu/workspace/my-cdk/node_modules/aws-cdk-lib/aws-sns/lib/topic-base.js:1:1481)
    at FunctionHook.bind (/home/ubuntu/workspace/bar-cdk/node_modules/aws-cdk-lib/aws-autoscaling-hooktargets/lib/lambda-hook.js:1:1386)
    at new LifecycleHook (/home/ubuntu/workspace/bar-cdk/node_modules/aws-cdk-lib/aws-autoscaling/lib/lifecycle-hook.js:1:1102)
    at AutoScalingGroup.addLifecycleHook (/home/ubuntu/workspace/bar-cdk/node_modules/aws-cdk-lib/aws-autoscaling/lib/auto-scaling-group.js:1:7368)
    at new FooStack (/home/ubuntu/workspace/bar-cdk/lib/bar-stack.ts:81:31)
    at Object.<anonymous> (/home/ubuntu/workspace/bar-cdk/bin/bar.ts:7:1)
    at Module._compile (node:internal/modules/cjs/loader:1356:14)
    at Module.m._compile (/home/ubuntu/workspace/bar-cdk/node_modules/ts-node/src/index.ts:1618:23)
    at Module._extensions..js (node:internal/modules/cjs/loader:1414:10)
    at Object.require.extensions.<computed> [as .ts] (/home/ubuntu/workspace/bar-cdk/node_modules/ts-node/src/index.ts:1621:12)

Stack trace generated from trying to add a lifecycle hook on an auto-scaling group; the original source file doesn't even contain the string Topic.

Parameter validation is deferred to deploy time. Certain parameters with constraints (like Cpu) could easily be validated ahead of time. Instead, these invalid parameters go all the way to CloudFormation and only fail when it tries to create or update that resource, triggering a rollback.
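As an illustration of the kind of check that could run before anything is deployed, the sketch below validates a Cpu value up front. It assumes the parameter is the task-level Cpu of an ECS task definition running on Fargate, which is my guess at the case in question; validateFargateCpu is a hypothetical helper, and the list of allowed values should be checked against the current ECS documentation.

```typescript
// Assumption: Fargate task-level CPU units. Verify against the ECS docs.
const VALID_FARGATE_CPU = [256, 512, 1024, 2048, 4096, 8192, 16384];

// Hypothetical helper: reject a bad Cpu value before it reaches CloudFormation.
function validateFargateCpu(cpu: string): number {
  const units = Number(cpu);
  if (!Number.isInteger(units) || !VALID_FARGATE_CPU.includes(units)) {
    throw new Error(
      `Invalid Cpu "${cpu}": expected one of ${VALID_FARGATE_CPU.join(", ")}`
    );
  }
  return units;
}

// Usage: a valid value passes through as a number.
const cpuUnits = validateFargateCpu("512");
// validateFargateCpu("300") would throw here instead of failing mid-deploy.
```

A failure here costs seconds; the same mistake caught by CloudFormation costs a failed update plus a rollback.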

So is the turd shiny now?

No. Having written CloudFormation and Terraform at different companies, and used CDK internally at Amazon for the past four years, I believe CDK solves the repetitiveness of CloudFormation, but also introduces a heap of quirks. Given the industry already has a de facto tool of choice, I question AWS' decision to reinvent the wheel rather than embrace Terraform.

My views are my own.