Tuesday 17 March 2026, 08:46 PM
How Apache Flink 2.0's ForSt backend cuts checkpoint latency by 94%
Discover how Apache Flink 2.0's ForSt disaggregated state backend and asynchronous execution model reduce checkpoint latency by 94% for stream processing.
If you’ve spent any time scaling stream processing pipelines over the last decade, you know the dirty secret of real-time data: it’s not the streaming that kills your budget, it’s the state.
For years, we’ve been treating local disks like a crutch. In my experience building and advising data-intensive startups across the Bay Area, the conversation always hits the same wall. You want low latency, so you keep your state on local TaskManager disks. But as your data volume explodes, you end up over-provisioning expensive compute instances just to get more attached storage. It creates a rigid, tightly coupled architecture that leads to frequent Out-Of-Memory crashes, agonizingly slow checkpoints, and a massive cloud bill.
With the General Availability release of Apache Flink 2.0.0 on March 24, 2025, that era is effectively over. After two years of development, 25 Flink Improvement Proposals (FLIPs), and 369 resolved issues from 165 contributors, Flink has fundamentally shifted its architecture.
Let's look at why this matters, who wins, and whether this new paradigm actually has product-market fit for modern data teams.
The shift to disaggregated state
The core of Flink 2.0 is the transition to a Disaggregated State Architecture, governed by FLIP-423. Instead of hoarding state on local disks, Flink 2.0 introduces the ForSt ("For Streaming") state backend.
ForSt decouples compute from storage, pushing the primary state store to a remote Distributed File System (DFS) like Amazon S3 or HDFS. Local disks are relegated to an optional, secondary cache.
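To make this concrete, here is a sketch of what enabling ForSt with a remote primary store might look like in a Flink configuration file. The option names follow the patterns in the 2.0 release material, but treat them as assumptions and verify them against your distribution's documentation; the bucket paths are placeholders.

```yaml
# Sketch: ForSt with S3 as the primary state store (verify keys for your version)
state.backend.type: forst
# Primary state store lives on the remote DFS, not local disk
state.backend.forst.primary-dir: s3://my-flink-state/forst
# Checkpoints also go to the DFS
execution.checkpointing.dir: s3://my-flink-state/checkpoints
```

With this layout, local disk becomes a cache you can size (or drop) independently of how much state the job holds.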
From an infrastructure perspective, this is the holy grail. It means you can finally scale your compute and your storage independently. If you have a massive state but low CPU requirements, you no longer have to pay for top-tier compute instances just to hold the data. According to a recent peer-reviewed paper in the Proceedings of the VLDB Endowment, this architecture delivers up to 50% cost savings compared to the Flink 1.20 baseline.
Beating the physics of network latency
Whenever you tell an engineer you are moving state to S3, their immediate reaction is to point out network latency. Naively querying a remote DFS for every state access would introduce crippling overhead, completely defeating the purpose of a real-time stream processor.
To solve this, Flink 2.0 implements an Asynchronous Execution Controller (AEC). The AEC decouples record processing from state access. It issues non-blocking requests and allows for out-of-order CPU processing, all while strictly preserving exactly-once fault tolerance and per-key FIFO ordering.
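The AEC itself lives inside Flink's Java runtime, but its ordering contract is easy to model. The toy Python sketch below is purely illustrative, not Flink code: records for a key with an in-flight state request are buffered to preserve per-key FIFO, while records for other keys proceed immediately, so the CPU never idles waiting on one slow remote read.

```python
from collections import defaultdict, deque

class ToyAsyncExecutionController:
    """Toy model (not Flink's implementation) of the AEC ordering contract:
    per-key FIFO is preserved, but different keys may complete out of order."""

    def __init__(self):
        self.in_flight = set()              # keys with a pending state request
        self.blocked = defaultdict(deque)   # buffered records per busy key
        self.completed = []                 # observed processing order

    def submit(self, key, value):
        if key in self.in_flight:
            # Same key already has a pending request: buffer to keep FIFO.
            self.blocked[key].append(value)
        else:
            self.in_flight.add(key)
            self._process(key, value)

    def _process(self, key, value):
        # Stand-in for issuing a non-blocking remote state read/update.
        self.completed.append((key, value))

    def on_state_response(self, key):
        # The remote request for `key` finished: drain one buffered record.
        if self.blocked[key]:
            self._process(key, self.blocked[key].popleft())
        else:
            self.in_flight.discard(key)

aec = ToyAsyncExecutionController()
aec.submit("a", 1)          # "a" goes in-flight
aec.submit("a", 2)          # buffered behind the pending "a" request
aec.submit("b", 1)          # different key: proceeds immediately
aec.on_state_response("a")  # first "a" completes, second "a" drains
print(aec.completed)        # [('a', 1), ('b', 1), ('a', 2)]
```

Note how `("b", 1)` is processed before `("a", 2)`: cross-key reordering is allowed, but the two `"a"` records still complete in arrival order, which is what keeps exactly-once semantics intact.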
Standardized Nexmark streaming benchmarks confirm that stateless operators bypass the AEC entirely, so it adds zero overhead where it isn't needed, while state-heavy, I/O-bound queries vastly outperform Flink 1.x.
What I love about this release is the practical focus on user experience and migration. The maintainers didn't just build a new backend and tell everyone to rewrite their code. Under FLIP-473, they re-implemented seven critical SQL operators—including Window and Group Aggregations—to utilize the new asynchronous State APIs under the hood. SQL users get the massive latency reductions out of the box. That is how you build product-market fit into an open-source tool: make the adoption curve flat.
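In practice, opting a SQL job into the async state path is a one-line switch. The option name below matches what the Flink 2.0 materials describe, but verify it against the documentation for your exact version before relying on it; the table and query are placeholders.

```sql
-- Assumed option name per the Flink 2.0 docs; verify on your version.
SET 'table.exec.async-state.enabled' = 'true';

-- An ordinary group aggregation; with the flag set, the re-implemented
-- operator uses the asynchronous State APIs under the hood.
SELECT user_id, COUNT(*) AS clicks
FROM click_events
GROUP BY user_id;
```

The query itself is unchanged, which is exactly the flat adoption curve the maintainers were aiming for.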
The market reality: Who actually wins here?
When we look at the raw numbers, the operational improvements are staggering. By decoupling the state and utilizing zero-copy checkpointing, Flink 2.0 achieves up to a 94% reduction in checkpoint duration and up to 49x faster recovery after failures or rescaling.
But the real winner here is the FinOps movement.
Cloud API costs are the silent killer of data startups. S3 GET and PUT requests add up fast when you are processing millions of events per second. By utilizing local disks as a secondary cache in front of the remote DFS, testing shows Flink 2.0 achieves up to 94% fewer remote reads and a 75% reduction in S3 GET requests. You get the resilience of cloud storage without the punishing API tax.
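The mechanism behind those savings is a plain read-through cache. The Python sketch below is a toy illustration, not ForSt's actual caching logic: repeated reads of hot keys are served from the local cache, so only cold misses turn into billable remote GET requests.

```python
class CachedStateStore:
    """Toy read-through cache (not ForSt's implementation): a local-disk
    cache in front of a remote DFS absorbs repeated reads, so only cache
    misses become billable remote GET requests."""

    def __init__(self, remote):
        self.remote = remote   # dict standing in for S3/HDFS
        self.cache = {}        # stand-in for the local-disk cache
        self.remote_gets = 0   # what the cloud provider bills you for

    def get(self, key):
        if key not in self.cache:      # cold miss: pay for one remote GET
            self.remote_gets += 1
            self.cache[key] = self.remote[key]
        return self.cache[key]         # warm hit: served locally, no GET

store = CachedStateStore({"user:1": 42})
for _ in range(100):
    store.get("user:1")
print(store.remote_gets)  # 1 remote GET instead of 100
```

For the skewed key distributions typical of real workloads, a modest local cache absorbs the vast majority of reads, which is why the remote-read reduction can be so dramatic.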
Spearheaded by Alibaba Cloud and Ververica, this architecture is already production-proven. It cements Flink's position as the dominant stream processing engine for the enterprise market. The losers? Legacy streaming systems that still force a coupled compute-and-storage model, and perhaps cloud providers who have been happily cashing checks for over-provisioned EC2 instances.
By eliminating the "digital hoarding" problem of massive state accumulation, Flink 2.0 does more than just fix local disk constraints. It paves the way for true stream-batch unification and makes it financially viable to integrate large-scale AI models directly into real-time pipelines. For data teams looking to optimize their infrastructure without sacrificing performance, this is exactly the kind of practical innovation the ecosystem needed.
References
- https://flink.apache.org/2025/03/24/apache-flink-2.0.0-a-new-era-of-real-time-data-processing/
- https://flink-learning.org.cn/article/detail/3470dfbad3c63419a066a5493f689b5e?name=article&tab=suoyou&page=2
- https://current.confluent.io/2024-sessions/enabling-flinks-cloud-native-future-introducing-disaggregated-state-in-flink-2-0
- https://ossip.dev/flink.html
- https://par.nsf.gov/servlets/purl/10659062
- https://www.vldb.org/pvldb/vol18/p4846-mei.pdf
- https://www.researchgate.net/publication/395552772_Disaggregated_State_Management_in_Apache_FlinkR_20
- https://www.alibabacloud.com/blog/602503
- https://cwiki.apache.org/confluence/display/FLINK/FLIP-455:+Declare+async+state+processing+and+checkpoint+the+in-flight+requests
- https://medium.com/fresha-data-engineering/what-the-fuss-with-fluss-flink-delta-force-1ab3d6be5c98
- https://www.alibabacloud.com/blog/flink-state-management-a-journey-from-core-primitives-to-next-generation-incremental-computation_602503
- https://www.ververica.com/blog/embracing-the-future-apache-flink-2.0
- https://research.euranova.eu/2024/12/02/insights-from-flink-forward-2024/
- https://jack-vanlightly.com/blog/2025/9/2/understanding-apache-fluss