AWS Step Functions
I briefly worked on a workflow system at $past_job, implemented using AWS Step Functions. My experience was pretty terrible. I wasn’t sure which technical requirements led the team to this system. Some people said we needed a “system that is configuration driven and not code driven” and some people said “we needed something that scales.” Whatever the reason was, making improvements to this system was a pain in the ass, with AWS Step Functions itself being somewhat responsible.
What is a configuration-driven system? It is a system where changes can be described concisely and the effects of a change can be easily understood. Ideally, code changes, which may have wide-reaching consequences, are rarely necessary. In our case, however, whether because of poorly understood requirements or rushed delivery, the code underlying the system needed to be changed frequently. There was actually very little “configuration” in this system. Most of it was API endpoints and credentials, and those almost never changed.
The AWS State Machine Definition (written in Amazon States Language) is a domain-specific language. When a workflow changed, the DSL needed to be changed. That is a code change, not a configuration change. For this project a change-set (for a user story) often required changes to both the DSL AND the underlying programming languages. Um… we used both JavaScript and Python to implement the workflow. But that’s a different problem and we won’t talk about that. Anyway, most changes required someone to be familiar with both the DSL and the underlying programming languages. That’s actually a pretty big problem. Most engineers have a hard time being proficient at a single programming language. Combine that with an AWS-specific DSL and you end up with a lot of risk. Unsurprisingly, this system ended up quite fragile and bug-ridden.
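To make the “it’s code, not configuration” point concrete, here is a minimal sketch of what a state machine definition looks like, written as a Python dict that serializes to the JSON the service expects. The workflow, the state names, and the Lambda ARNs are all made up for illustration; this is not the system I worked on.

```python
import json

# Hypothetical two-step workflow. State names and ARNs are invented.
definition = {
    "Comment": "Illustrative order workflow",
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:validate-order",
            # Control flow lives here, in the DSL, not in the Lambda code.
            "Next": "ChargeCustomer",
        },
        "ChargeCustomer": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:charge-customer",
            "End": True,
        },
    },
}

# The JSON document that would be deployed as the state machine definition.
definition_json = json.dumps(definition, indent=2)
```

Inserting a step between these two states means adding a new state, rewiring a `Next` pointer, and writing (and deploying) another function for it to call. That is a control-flow change spread over two or three languages, which is hard to call “configuration.”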
Creating a deployment was painful. Recall that our changes generally affected both the state machine and the underlying code. This meant a change generally required updating both the underlying Lambda functions AND the state machine definition, which in turn required learning a specialized deployment framework. Sometimes we also needed to coordinate changes to the underlying infrastructure (e.g., databases, S3, queues). Suddenly, you need to understand a pretty complex system and AWS-specific tooling to work on a rather straightforward problem.
Testing changes was difficult. We tested our AWS Step Functions by deploying the changes into a “QA” AWS environment. At least with other frameworks like Cadence, you can test things locally. Unlike with Cadence, your pile of spaghetti AWS Lambda definitions and DSL can’t be validated statically before you deploy any code. Not to mention it’s not easy to run automated integration tests. I guess the advantage is that AWS manages the infrastructure and scale. But unless you really need that scale, is it really worth committing to their framework? I just don’t buy it.
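For a sense of what checking things before deployment involves, here is a rough sketch of the kind of check you end up hand-rolling: a function that looks for transitions pointing at states that don’t exist. The function name and the example definition are hypothetical, and it only looks at top-level `Next`, `Default`, `Choices`, and `Catch` fields (it ignores states nested inside Parallel or Map states).

```python
def find_dangling_transitions(definition):
    """Return (source, target) pairs whose target is not a defined state."""
    states = definition.get("States", {})
    problems = []

    start = definition.get("StartAt")
    if start not in states:
        problems.append(("StartAt", start))

    for name, state in states.items():
        targets = []
        if "Next" in state:
            targets.append(state["Next"])
        if "Default" in state:  # fallback branch of a Choice state
            targets.append(state["Default"])
        for rule in state.get("Choices", []):
            if "Next" in rule:
                targets.append(rule["Next"])
        for catcher in state.get("Catch", []):
            if "Next" in catcher:
                targets.append(catcher["Next"])
        for target in targets:
            if target not in states:
                problems.append((name, target))
    return problems


# Hypothetical definition with a typo in a state name.
broken = {
    "StartAt": "ValidateOrder",
    "States": {
        "ValidateOrder": {"Type": "Task", "Next": "ChargeCustomer"},
        "ChargeCustmer": {"Type": "Task", "End": True},
    },
}

print(find_dangling_transitions(broken))
# prints [('ValidateOrder', 'ChargeCustomer')]
```

Even with something like this in place, nothing verifies that the output of one Lambda matches what the next state expects as input. We only found that kind of bug after deploying.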
AWS Step Functions felt like an unwieldy solution to the problem we were trying to solve and almost certainly slowed down our delivery speed. It might have worked out well if its use had been planned better. But in this particular project it was a liability.
Other workflow systems