CRDT Survey, Part 4: Further Topics

Matthew Weidner | Feb 13th, 2024
Home | RSS Feed
Keywords: CRDTs, bibliography

This blog post is Part 4 of a series.

Part 1: Introduction

Part 2: Semantic Techniques

Part 3: Algorithmic Techniques

Part 4: Further Topics

Further Topics

This post concludes my CRDT survey by describing further topics that are outside my focus area. It is not a proper survey of those topics, but instead a lightweight bibliography, with links to a relevant resource or two. To find more papers about a given topic, see crdt.tech/papers.html.

My previous posts made several implicit assumptions, which I consider part of the “traditional” CRDT model:

All operations only target eventual consistency. There is no mechanism for coordination/strong consistency (e.g. a central server), even for just a subset of operations.
Users always wish to synchronize all data and all operations, as quickly as possible. There is no concept of partial access to a document, or of sync strategies that deliberately omit operations.
All devices are trustworthy and follow the given protocol.

The further topics below concern collaborative apps that violate at least one of these assumptions. I’ll end with some alternatives to CRDT design and an afterword.

Incorporating Strong Consistency

Strong consistency is (roughly) the guarantee that all replicas see all operations in the same order. This contrasts with CRDTs’ eventual consistency guarantee, which allows different replicas to see concurrent operations in different orders. Existing work explores systems that mix CRDTs and strong consistency.

“Consistency in Non-Transactional Distributed Storage Systems”, Viotti and Vukolić (2016). Provides an overview of various consistency guarantees.

Mixed Consistency

Some apps need strong consistency for certain operations but allow CRDT-style optimistic updates for others. For example, a calendar app might use strong consistency when reserving a room, to prevent double-booking. Several papers describe mixed consistency systems or languages that are designed for such apps.

“Making Geo-Replicated Systems Fast as Possible, Consistent when Necessary”, Li et al. (2012). Describes a system that offers “RedBlue consistency”, in which operations can be marked as needing strong consistency (red) or only eventual consistency (blue).

Server-Assisted CRDTs

Besides providing strong consistency for certain operations, a central server can provide strong consistency for parts of the protocol while clients use CRDTs for optimistic local updates. For example, the server can decide which operations are valid or invalid, assign a final total order to operations, or coordinate “garbage collection” actions that are only possible in the absence of concurrent operations (e.g., deleting tombstones).

“Replicache”, Rocicorp (2024). Client-side sync framework that uses a server to assign a final total order to operations, but allow clients to perform optimistic local updates using a “rebase” technique.
“Making Operation-Based CRDTs Operation-Based”, Baquero, Almeida, and Shoker (2014). Defines causal stability: the condition that there can be no further operations concurrent to a given operation. Causal stability is usually necessary for garbage collection actions, but tricky to ensure in a collaborative app where replicas come and go.

Incomplete Synchronization

Undo

Part 2 described exact undo, in which you apply your normal semantics to the operation history less undone operations. The same principle applies to other situations where you deliberately omit operations: rejected/reverted changes, partial replication, “cherry-picking” across branches, etc.

However, applying your normal semantics to “the operation history less undone operations” is more difficult that it sounds, because many semantics and algorithms assume causal-order delivery. For example:

There are two common ways to describe an Add-Wins Set’s semantics, which are equivalent assuming causal-order delivery but inequivalent without it.
To compare list CRDT positions, you often need some metadata (e.g., an underlying Fugue tree), which is not guaranteed to exist if you are missing prior operations.

It is an open problem to formulate undo-tolerant variants of various CRDTs.

I am interested in this problem and happy to chat about it: mweidner037@gmail.com

“Supporting Undo and Redo for Replicated Registers in Collaborative Applications”, Brattli and Yu (2021). Describes an undo-tolerant variant of the multi-value register.

Partial Replication

Previous posts assumed that all collaborators always load an entire document and sychronize all operations on that document. In partial replication, replicas only store and synchronize parts of a document. Partial replication can be used as part of access control, or as an optimization for large documents - clients can leave most of the document on disk or on a storage server.

Existing CRDT libraries (Yjs, Automerge, etc.) do not have built-in partial replication, but instead encourage you to split collaborative state into multiple documents if needed, so that you can share or load each document separately. However, you then lose causal consistency guarantees between operations in different documents. It is also challenging to design systems that perform the actual replication, e.g., intelligently deciding which documents to fetch from a storage server.

“Conflict-Free Partially Replicated Data Types”, Briquemont et al. 2015. Describes a system with explicit support for partial replication of CRDTs.
“Automerge Repo”, Automerge contributors (2024). “Automerge Repo is a wrapper for the Automerge CRDT library which provides facilities to support working with many documents at once”.

Versions and Branches

In a traditional collaborative app, all users want to see the same state, although they may temporarily diverge due to network latency. Version control systems instead support multiple branches that evolve independently, except during explicit merges. CRDTs are a natural fit for version control because their concurrency-aware semantics may lead to better merge results (operations in parallel branches are considered concurrent).

The original local-first essay already discussed versions and branches on top of CRDTs. Recent work has started exploring these ideas in practice.

“Upwelling: Combining real-time collaboration with version control for writers”, McKelvey et al. (2023). A Google Docs-style editor with added version control features that uses CRDTs.
“Proposal: Versioned Collaborative Documents”, Weidner (2023). A workshop paper where I propose an architecture that combines features of git and Google Docs.

Security

Byzantine Fault Tolerance

Traditional CRDTs assume that all devices are trustworthy and follow the given protocol. A malicious or buggy device can easily deviate from the protocol in a way that confuses other users.

In particular, a malicious device could assign the same UID to two versions of an operation, then broadcast each version to a different subset of collaborators. Collaborators will likely process only the first version they receive, then reject the second as redundant. Thus the group will end up in a permanently inconsistent state.

A few papers explore how to make CRDTs Byzantine fault tolerant so that these inconsistent states do not occur, even in the face of arbitrary behavior by malicious collaborators.

“Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases”, Kleppmann and Howard (2020).
“Making CRDTs Byzantine fault tolerant”, Kleppmann (2022).

Access Control

In a purely local-first setting, access control - controlling who can read and write a collaborative document - is tricky to implement or even define. For example:

If a user loses access to a collaborative document, what happens to their local copy?
How should we handle tricky concurrent situations, such as a user who performs an operation concurrent to losing access, or two admins who demote each other concurrently?

A few systems implement protocols that handle these situations.

“Matrix Decomposition: Analysis of an Access Control Approach on Transaction-based DAGs without Finality”, Jacob et al. (2020). Describes the access control protocol used by the Matrix chat network.
“@localfirst/auth”, Caudill et al. (2024). “@localfirst/auth is a TypeScript library providing decentralized authentication and authorization for team collaboration, using a secure chain of cryptographic signatures.”

Alternatives to CRDT Design

Operational Transformation

In a collaborative text document, when one users inserts or deletes text, they shift later characters’ indices. This would cause trouble for other users’ concurrent editing operations if you interpreted their original indices literally:

The *gray* cat jumped on **the** table.

Alice typed " the" at index 17, but concurrently, Bob typed " gray" in front of her. From Bob's perspective, Alice's insert should happen at index 22.

The CRDT way to fix this problem is to use immutable list CRDT positions instead of indices. Another class of algorithms, called Operational Transformation (OT), instead “transform” indices that you receive over the network to account for concurrent operations. So in the above figure, Bob would receive “Index 17” from Alice, but add 5 to it before processing the insertion, to account for the 5 characters that he concurrently inserted in front of it.

I personally consider list CRDT positions to be the simpler mental model, because they stay the same over time. They also work in arbitrary networks. (Most deployed OT algorithms require a central server; decentralized OT exists but is notoriously complicated.) Nonetheless, OT algorithms predate CRDTs and are more widely deployed - in particular, by Google Docs.

“An Integrating, Transformation-Oriented Approach to Concurrency Control and Undo in Group Editors”, Ressel, Nitsche-Ruhland, and Gunzenhäuser (1996). A classic OT paper.
“Enhancing rich content wikis with real-time collaboration”, Ignat, André, and Oster (2017). Describes a central-server OT algorithm for block-based rich text.

Program Synthesis

Traditional CRDT papers are algorithm papers that design a CRDT by hand and prove eventual consistency with pen-and-paper. Some more recent papers instead try to synthesize CRDTs automatically from some specification of their behavior - e.g., a sequential (single-threaded) data structure plus some invariants that it should preserve even in the face of concurrency.

Note that while synthesis may be easier than designing a CRDT from scratch, there is little guarantee that the synthesized semantics are reasonable in the eyes of your users. Indeed, some existing papers choose rather unusual semantics.

“Katara: Synthesizing CRDTs with Verified Lifting”, Laddad et al. 2022. Synthesizes a CRDT from a specification by searching a space of composed state-based CRDTs until it finds one that works.

Afterword: Why Learn CRDT Theory?

I hope you’ve enjoyed reading this blog series. Besides satisfying your curiosity, though, you might wonder: Why learn this in the first place?

Indeed, one approach to CRDTs is to let experts implement them in a library, then use those implementations without thinking too hard about how they work. That is the approach we take for ordinary local data structures, like Java Collections. It works well when you use a CRDT library for the task it was built for - e.g., Yjs makes it easy to add central-server live collaboration to various rich-text editors.

However, in my experience so far - both using and implementing CRDT libraries - it is hard to use a CRDT library in any way that the library creator didn’t anticipate. So if your app ends up needing some of the Further Topics above, or if you need to tune the collaborative semantics, you may be forced to develop a custom system. For that, CRDT theory will come in handy. (Though still consider using established tools when possible - especially for tricky parts like list CRDT positions.)

Even when you use an existing CRDT library, you might have practical questions about it, like:

What invariants can I expect to hold?
What happens if <…> fails?
How will it perform under different workloads?

Understanding a bit of CRDT theory will help you answer these questions.