Change data capture and microservices
What is Change Data Capture?
Change data capture (CDC) is a transaction-safe way of performing real-time data replication between databases. During the extract stage, CDC captures changes in real time (or near real time) and provides a continuous stream of change data.
Traditionally, this process is performed in batches, where a single database query extracts a large amount of data in bulk. While this certainly gets the job done, it quickly becomes inefficient as source databases are continuously updated.
Because the replica of the source tables is only refreshed on each batch run, the target table may not accurately reflect the current state of the source application. CDC sidesteps this problem by maintaining a real-time stream of changes.
This gives us:
Real-time operations (i.e., no more bulk loading)
Reduced impact on system resources
Faster database migrations with no downtime
Synchronization across multiple data systems
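To make the streaming model concrete, here is a minimal sketch of what consuming such a change stream can look like, assuming a log-based CDC tool such as Debezium is already publishing row-level change events as JSON to a Kafka topic (the topic name and connection details below are placeholders):

```python
# Minimal sketch: consume row-level change events from a CDC topic.
# Assumes a CDC tool (e.g. Debezium) already writes JSON change events to Kafka;
# the topic name and bootstrap server are placeholders.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "inventory.public.orders",            # hypothetical topic: typically one per captured table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")) if v else None,
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    if event is None:                      # tombstone record (emitted after a delete)
        continue
    payload = event.get("payload", event)  # Debezium may wrap the change in a "payload" envelope
    op = payload.get("op")                 # "c" = insert, "u" = update, "d" = delete, "r" = snapshot read
    before, after = payload.get("before"), payload.get("after")
    print(f"{op}: {before} -> {after}")    # replace with your own apply/merge logic
```

Each event describes a single row change, which is exactly the property the next section takes issue with.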
But… in a microservices-driven world, doesn't this break some of the core tenets of domain logic? Isn't this the famous reach-in antipattern O'Reilly warned us about?
This can lead to some real problems…
Fine-grained events: CDC event streams typically expose one event per affected table row, whereas it can be desirable to publish higher-level events to consumers. An example of this would be wanting one event for one purchase order with all its order lines, even if they are stored within two separate tables in an RDBMS. The loss of transaction semantics in CDC event streams can aggravate that concern, as consumers cannot easily correlate the events originating from one and the same transaction in the source database.
Your table model becomes your API: by default, your table’s column names and types correspond to fields in the change events emitted by the CDC tool. This can yield less-than-ideal event schemas, particularly for legacy applications.
Schema changes might break things: Downstream consumers of the change events will expect the data to adhere to the schema known to them. As there is no abstraction between your internal data model and consumers, any changes to the database schema, such as renaming a column or changing its type, could cause downstream event consumers to break, unless they are updated in lockstep.
You may accidentally leak sensitive data: a change event stream will, by default, contain all the rows of a table with all its columns. This means that sensitive data which shouldn’t leave the security perimeter of your application could be exposed to external consumers.
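Some of these concerns can at least be mitigated at the connector level. The sketch below registers a hypothetical Debezium PostgreSQL connector through the Kafka Connect REST API and restricts which tables and columns are captured, so sensitive columns never enter the event stream (hostnames, credentials, and table/column names are placeholders, and exact property names can vary between connector versions):

```python
# Sketch: register a CDC connector that only captures selected tables and
# excludes sensitive columns, via the Kafka Connect REST API.
# All names, hosts, and credentials below are placeholders.
import requests  # pip install requests

connector = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "orders",
        "topic.prefix": "inventory",
        # Only capture the tables you intend to expose...
        "table.include.list": "public.purchase_orders,public.order_lines",
        # ...and keep sensitive columns out of the change events entirely.
        "column.exclude.list": "public.purchase_orders.customer_email",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
print(resp.json())
```

Filtering at the source helps with leakage, but it does nothing about fine-grained events or schema changes breaking downstream consumers.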
Data contracts could potentially solve this. But do we really need the overhead of all these additional tools? Protobuf? Avro? Apicurio?
Schema changes as API changes
All of these tools provide solutions, but with added complexity. The right approach is to make the responsible owner aware of breaking schema changes; the specific tool matters much less. It can be as simple as a dbt feature such as model contracts.
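As an illustration of that last point, a dbt model contract is just a few lines of YAML on the model that downstream teams depend on: if a change to the model's columns or types violates the contract, the build fails and the owner finds out before consumers do (the model and column names here are made up):

```yaml
# models/marts/orders.yml — illustrative model contract (dbt >= 1.5)
models:
  - name: orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: integer
        constraints:
          - type: not_null
      - name: customer_id
        data_type: integer
      - name: order_total
        data_type: numeric
```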
Data contracts require cross-team collaboration and tooling, but also a cultural change. New technology won't make up for the fact that dependencies between teams need to be properly understood.
Developers have been tracking dependent API consumers for years; we need to include data engineering teams as schema consumers who also need to be made aware of changes.