Migrating Between Fivetran and Airbyte - A Two-Way Journey
Why switch?
A question that frequently arises in forums is how to transition between Fivetran and Airbyte. Migration typically stems from cost considerations or feature limitations.
Airbyte is often viewed as a more affordable alternative to Fivetran, particularly with its open-source self-hosting option, which can be a must-have for certain organisations. However, users who initially chose Airbyte may find that Community Connectors lack some of the features they need.
This scenario is quite common, as each product caters to a specific niche of customers, and ongoing user migration is a regular occurrence across many industries. Both products perform the essential task of replicating data between pre-configured sources and destinations.
Fivetran is a more mature and feature-rich tool, albeit at a premium price, while Airbyte is a rapidly evolving underdog with a lower price point and an additional open-source 'do-it-yourself' option.
Before the move
For an end user, the primary interaction with either tool occurs at the connector level. Ultimately, both will perform only as well as the connectors you use. It's essential to ensure that the newly chosen tool supports all the data field types you wish to extract and can store them in the destination of your choice in a format that closely matches your requirements.
When considering swapping one tool for another, the logic is straightforward:
Does the tool have connectors for the sources and destinations I need?
If not, how difficult would it be to create them? Airbyte allows you to create your own connector, while with Fivetran, you can use cloud functions.
If connectors exist, ensure they support your APIs and data types. Therefore, take your time when evaluating these products and don't rush to conclusions.
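The checklist above can be turned into a quick gap report. The following is a minimal sketch, assuming you have collected each tool's supported sources and destinations by hand from its documentation; `ToolCatalog`, `coverage_gaps` and the connector names are hypothetical and not part of either product.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCatalog:
    """Hypothetical, hand-maintained summary of what a tool supports."""
    name: str
    sources: set = field(default_factory=set)
    destinations: set = field(default_factory=set)


def coverage_gaps(catalog, needed_sources, needed_destinations):
    """Return the connectors we need that the tool does not list."""
    return {
        "missing_sources": sorted(set(needed_sources) - catalog.sources),
        "missing_destinations": sorted(set(needed_destinations) - catalog.destinations),
    }


# Example values are made up for illustration.
candidate = ToolCatalog(
    name="candidate-tool",
    sources={"postgres", "salesforce"},
    destinations={"snowflake"},
)
gaps = coverage_gaps(
    candidate,
    needed_sources=["postgres", "salesforce", "internal_crm"],
    needed_destinations=["snowflake"],
)
print(gaps)
```

Anything listed under a "missing" key is a candidate for a custom connector or a cloud function, which is where the real evaluation effort starts.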
Tracking incremental sync
Both tools maintain their internal state of each connection by tracking the latest replicated block/records using some form of cursor for incremental copying.
Airbyte offers more control over this aspect. Many connectors have the concept of the initial sync point, allowing you to skip some historical records and start syncing from any specified time.
Fivetran, on the other hand, is more opinionated in this regard. It manages the state and often extracts all available data. Fivetran does not charge for the initial historical sync, so even a large amount of data won't incur additional costs.
At the destination, for each dataset (referred to as a 'table' in Fivetran and a 'stream' in Airbyte), both tools add metadata for every replicated record. In a generic example, it looks like the following:
Fivetran:
Airbyte:
Depending on the connector, a few extra fields may be present.
Additionally, Airbyte offers a glimpse into the internals of incremental sync tracking by exposing the “Connection state” JSON configuration. It keeps track of every incremental stream managed by the connector.
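As a rough illustration of these metadata fields, the sketch below lists column names commonly seen at the destination and separates them from business columns by prefix. The column lists are an assumption based on typical setups; they vary by connector, destination and tool version, so check your own tables. `split_columns` is a hypothetical helper.

```python
# Metadata columns commonly seen at the destination.
# Assumption: exact names vary by connector, destination and version.
FIVETRAN_META = {
    "_fivetran_synced": "timestamp of the sync that wrote the row",
    "_fivetran_deleted": "soft-delete flag",
}
AIRBYTE_META = {
    "_airbyte_raw_id": "unique id of the raw record",
    "_airbyte_extracted_at": "timestamp when the record was extracted",
    "_airbyte_meta": "JSON with per-record sync details",
}


def split_columns(row: dict) -> tuple:
    """Separate business columns from tool metadata, based on column prefix."""
    meta = {k: v for k, v in row.items()
            if k.startswith(("_fivetran_", "_airbyte_"))}
    data = {k: v for k, v in row.items() if k not in meta}
    return data, meta


# Made-up example row as it might land in a Fivetran-managed table.
row = {
    "id": 1,
    "email": "a@example.com",
    "_fivetran_synced": "2024-09-01T00:00:00Z",
    "_fivetran_deleted": False,
}
data, meta = split_columns(row)
```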
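To give a feel for what that state holds, here is a minimal sketch that reads a per-stream state document and lists the cursor stored for each stream. The payload shape (`streamDescriptor` / `streamState`), the stream names and the `updated_at` cursor field are assumptions for illustration; the actual structure depends on the connector and Airbyte version.

```python
import json

# Illustrative state payload; stream names and cursor fields are made up.
STATE_JSON = """
[
  {"streamDescriptor": {"name": "orders"},
   "streamState": {"updated_at": "2024-08-01T00:00:00Z"}},
  {"streamDescriptor": {"name": "customers"},
   "streamState": {"updated_at": "2024-07-15T12:30:00Z"}}
]
"""


def cursors_by_stream(state_json: str) -> dict:
    """Map each stream name to the cursor values stored for it."""
    return {
        entry["streamDescriptor"]["name"]: entry.get("streamState", {})
        for entry in json.loads(state_json)
    }


cursors = cursors_by_stream(STATE_JSON)
for stream, cursor in cursors.items():
    print(stream, cursor)
```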
Avoiding Full Re-sync
In trying to avoid a full re-sync, we want one or both of the following:
Saving time/money on the initial sync
Not disturbing all dependent models, reports and data products
Unfortunately, it's not something we can do easily. Each tool expects full ownership of the destination, so the replication will simply fail if you try to use a different tool in-place, or a new table will be created and everything will start from scratch.
But you can minimise the impact of the transition.
If you want to avoid that initial sync, which takes ages and stresses the source system, you are out of luck when moving from Airbyte to Fivetran. It’s almost guaranteed you will need a full re-sync. Very few Fivetran connectors have the concept of a starting point for synchronization. Most of the time, Fivetran wants to ingest all the data. Being a cloud-native solution and offering a free initial sync, it doesn’t seem like a big deal (unless it does!).
The reverse migration looks more promising. Moving from Fivetran to Airbyte, you will often be able to save time and bandwidth by setting a start time for the data you want to pull. “Start datetime” is one of the fundamental configurations in Airbyte and is present in most connectors.
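One way to pick that start time is to look at the latest sync timestamp already present in the Fivetran-managed table and step back a little, so the two tools overlap rather than leave a gap. The sketch below assumes a timezone-aware timestamp, a one-day overlap and a config key called `start_date`; the key name and expected format differ between connectors, so treat all three as assumptions.

```python
from datetime import datetime, timedelta, timezone


def airbyte_start_date(last_synced: datetime,
                       overlap: timedelta = timedelta(days=1)) -> str:
    """Return an ISO-8601 UTC start date slightly before the last known sync.

    `last_synced` must be timezone-aware. The overlap is a safety margin;
    the duplicates it produces need to be removed downstream.
    """
    start = last_synced.astimezone(timezone.utc) - overlap
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")


# Made-up value, e.g. the maximum sync timestamp found in the old table.
last = datetime(2024, 9, 1, 6, 0, tzinfo=timezone.utc)

# Hypothetical fragment of a source configuration.
source_config = {"start_date": airbyte_start_date(last)}
print(source_config)
```

A sync timestamp records when the row was written, not when it changed at the source, which is why the overlap is deliberately generous here.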
Avoiding changes to the downstream system is harder. We won't be able to simply switch these tools if we already have dependencies.
However, we can use dbt to abstract the differences. By following dbt best practices when bringing the data into a warehouse, we can use the staging layer to re-wire staging models to point to a new data source.
This works the same way in both directions of the migration.
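In dbt this mapping would live in the SQL of a staging model. The Python below is only a sketch of the idea: rows from either tool are mapped to the same tool-agnostic shape, so downstream models never see tool-specific columns. The output names `synced_at` and `is_deleted` are hypothetical, and the Airbyte delete marker `_ab_cdc_deleted_at` is assumed to be present only for CDC-based sources.

```python
TOOL_PREFIXES = ("_fivetran_", "_airbyte_", "_ab_cdc_")


def to_agnostic(row: dict, tool: str) -> dict:
    """Map a raw row from either tool to a tool-agnostic staging record."""
    if tool == "fivetran":
        synced_at = row["_fivetran_synced"]
        is_deleted = bool(row.get("_fivetran_deleted", False))
    elif tool == "airbyte":
        synced_at = row["_airbyte_extracted_at"]
        # Assumption: only CDC sources populate this column.
        is_deleted = row.get("_ab_cdc_deleted_at") is not None
    else:
        raise ValueError(f"unknown tool: {tool}")

    # Keep business columns, drop anything tool-specific.
    business = {k: v for k, v in row.items()
                if not k.startswith(TOOL_PREFIXES)}
    return {**business, "synced_at": synced_at, "is_deleted": is_deleted}


# Made-up rows describing the same source record.
fivetran_row = {"id": 7, "name": "Ada",
                "_fivetran_synced": "2024-09-01T00:00:00Z",
                "_fivetran_deleted": False}
airbyte_row = {"id": 7, "name": "Ada",
               "_airbyte_raw_id": "abc-123",
               "_airbyte_extracted_at": "2024-09-01T00:00:00Z"}

print(to_agnostic(fivetran_row, "fivetran"))
print(to_agnostic(airbyte_row, "airbyte"))
```

Both calls produce the same record, which is the property that lets you swap the raw source underneath without touching dependent models.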
A simplified workflow looks like this:
Airbyte/Fivetran runs a sync
Data is replicated into raw tables
An incremental staging model runs and transforms the data into a new tool-agnostic materialised view
Raw tables are truncated
Special care should be taken with the propagation of deleted records and updates, as the two tools handle them differently.
Incremental sync and truncating the data in raw tables can improve the performance of your syncs. However, you need to take care of de-duplication in your staging models.
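The de-duplication step can be sketched as "keep the latest version of each primary key". This is a minimal in-memory illustration, assuming each staged row carries a primary key `id`, a sortable `synced_at` value and an `is_deleted` flag; these names are hypothetical, and in practice the same logic would be written as SQL in an incremental staging model.

```python
def deduplicate(rows, key="id", order_by="synced_at"):
    """Keep only the most recently synced version of each record."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # On equal timestamps the later row in the input wins.
        if current is None or row[order_by] >= current[order_by]:
            latest[row[key]] = row
    return list(latest.values())


# Made-up rows; ISO-8601 strings in one format compare correctly as text.
rows = [
    {"id": 1, "status": "new", "synced_at": "2024-09-01T00:00:00Z", "is_deleted": False},
    {"id": 2, "status": "new", "synced_at": "2024-09-01T00:00:00Z", "is_deleted": False},
    {"id": 1, "status": "paid", "synced_at": "2024-09-02T00:00:00Z", "is_deleted": False},
    {"id": 2, "status": "new", "synced_at": "2024-09-03T00:00:00Z", "is_deleted": True},
]

current_rows = deduplicate(rows)
# Drop records whose latest version is marked as deleted.
active_rows = [r for r in current_rows if not r["is_deleted"]]
print(active_rows)
```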