Migrating Between Fivetran and Airbyte - A Two-Way Journey

Why switch?

A question that comes up frequently in forums is how to move between Fivetran and Airbyte. Users typically migrate because of cost, or because of limitations in the tool they started with.

Airbyte is often viewed as a more affordable alternative to Fivetran, particularly with its open-source self-hosting option, which can be a must-have for certain organisations. However, users who initially chose Airbyte may find that its Community Connectors lack some of the features they need.

This scenario is quite common, as each product caters to a specific niche of customers, and ongoing user migration is a regular occurrence across many industries. Both products perform the essential task of replicating data between pre-configured sources and destinations.

Fivetran is a more mature and feature-rich tool, albeit at a premium price, while Airbyte is a rapidly evolving underdog with a lower price point and an additional open-source 'do-it-yourself' option.


Before the move

For an end user, the primary interaction with either tool occurs at the connector level. Ultimately, both tools will only perform as well as the connectors you use. It's essential to ensure that the newly chosen tool supports all the types of data fields you wish to extract and can store them in the destination of your choice in a format that closely matches your requirements.

When considering swapping one tool for another, the logic is straightforward:

  • Does the tool have connectors for the sources and destinations I need?

  • If not, how difficult would it be to create them? Airbyte allows you to create your own connector, while with Fivetran, you can use cloud functions.

  • If connectors exist, do they support the APIs and data types I need? Take your time when evaluating these products and don't rush to conclusions.


Tracking incremental sync

Both tools maintain an internal state for each connection, tracking the latest replicated blocks or records with some form of cursor so that incremental syncs copy only new data.
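
To make the cursor idea concrete, here is a minimal sketch of cursor-based incremental copying. It is not how either tool is implemented internally; the function name, the `updated_at` cursor field and the shape of the state dictionary are all assumptions made for illustration.

```python
from datetime import datetime, timezone


def incremental_sync(source_rows, state, cursor_field="updated_at"):
    """Return the rows newer than the saved cursor, plus the advanced state.

    `state` is a hypothetical dict of the form {"cursor": <datetime>};
    an empty dict means nothing has been replicated yet.
    """
    cursor = state.get("cursor")
    new_rows = [
        row for row in source_rows
        if cursor is None or row[cursor_field] > cursor
    ]
    if new_rows:
        # Advance the cursor to the newest record we have just copied.
        state = {"cursor": max(row[cursor_field] for row in new_rows)}
    return new_rows, state


rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]

# The first run has no state, so it behaves like a full historical sync.
first_batch, state = incremental_sync(rows, {})

# A new record appears at the source; the second run copies only that one.
rows.append({"id": 3, "updated_at": datetime(2024, 1, 3, tzinfo=timezone.utc)})
second_batch, state = incremental_sync(rows, state)
```

The important point for a migration is that this state lives inside the tool, not in your warehouse, which is why it cannot simply be handed over from one tool to the other.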

Airbyte offers more control over this aspect. Many connectors have the concept of an initial sync point, allowing you to skip some historical records and start syncing from a specified time.

Fivetran, on the other hand, is more opinionated in this regard. It manages the state itself and often extracts all available data. Fivetran does not charge for the initial historical sync, so even a large amount of data won't incur additional costs.

At the destination, for each dataset (referred to as a 'table' in Fivetran and a 'stream' in Airbyte), both tools add metadata for every replicated record. In a generic example, it looks like the following:

Fivetran:

            Column             |            Type             | Collation | Nullable | Default 
-------------------------------+-----------------------------+-----------+----------+---------
 _fivetran_deleted             | boolean                     |           |          | 
 _fivetran_synced              | timestamp with time zone    |           |          | 

Airbyte:

                  Column                   |            Type             | Collation | Nullable | Default 
-------------------------------------------+-----------------------------+-----------+----------+---------
 _airbyte_unique_key                       | text                        |           |          | 
 _airbyte_ab_id                            | character varying           |           |          | 
 _airbyte_emitted_at                       | timestamp with time zone    |           |          | 
 _airbyte_normalized_at                    | timestamp with time zone    |           |          | 
 _airbyte_<your_stream_name>_hashid        | text                        |           |          | 

Depending on the connector, a few extra fields may be present.

Additionally, Airbyte offers a glimpse into the internals of incremental sync tracking by exposing the “Connection state” JSON configuration. It keeps track of every incremental stream managed by the connector.
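
As a rough illustration of what you can do with that state, the sketch below reads a per-stream state document and lists the cursor saved for each stream. The JSON here is a made-up sample: the stream names, the `updated_at` cursor and the exact layout are assumptions, and the real shape varies by connector and Airbyte version, so check what your own connection shows before relying on it.

```python
import json

# Hypothetical sample modelled on a per-stream state layout; not copied
# from a real connection.
state_json = """
[
  {"type": "STREAM",
   "stream": {"stream_descriptor": {"name": "orders"},
              "stream_state": {"updated_at": "2024-03-01T00:00:00Z"}}},
  {"type": "STREAM",
   "stream": {"stream_descriptor": {"name": "customers"},
              "stream_state": {"updated_at": "2024-02-15T12:30:00Z"}}}
]
"""


def stream_cursors(raw_state):
    """Map each stream name to the cursor values stored for it."""
    cursors = {}
    for entry in json.loads(raw_state):
        stream = entry["stream"]
        name = stream["stream_descriptor"]["name"]
        cursors[name] = stream["stream_state"]
    return cursors


cursors = stream_cursors(state_json)
for name, cursor in cursors.items():
    print(name, cursor)
```

Recording these cursors before a migration gives you a reference point for how far each stream had been replicated.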


Avoiding Full Re-sync

By avoiding a full re-sync, we want one or both of the following:

  • Saving time/money on the initial sync

  • Not disturbing all dependent models, reports and data products

Unfortunately, it's not something we can do easily. Each tool expects full ownership of the destination, so if you try to use a different tool in place, either the replication will simply fail or a new table will be created and everything will start from scratch.

But you can minimise the impact of the transition.

If you want to avoid the initial sync, which takes ages and stresses the source system, you are out of luck when moving from Airbyte to Fivetran. It’s almost guaranteed you will need a full re-sync. Very few Fivetran connectors have the concept of a starting point for synchronization. Most of the time, Fivetran wants to ingest all the data. Since it is a cloud-native solution with a free initial sync, this doesn’t seem like a big deal (unless it is!).

The reverse migration looks more promising. Moving from Fivetran to Airbyte, you will often be able to save time and bandwidth by setting a start time for the data you want to pull. “Start datetime” is one of the fundamental configuration options in Airbyte and is present in most connectors.
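
One way to pick that start time is to look at the newest `_fivetran_synced` value already in your warehouse and step back by a safety margin. The sketch below does exactly that; the function name, the one-day overlap and the `start_date` key in the sample config are assumptions, so use whatever field name and timestamp format your Airbyte connector actually expects.

```python
from datetime import datetime, timedelta, timezone


def airbyte_start_date(fivetran_synced_values, overlap=timedelta(days=1)):
    """Suggest a start datetime slightly before the last Fivetran sync.

    The overlap means some records are pulled twice, which is safer than
    leaving a gap; the duplicates have to be removed downstream.
    """
    latest = max(fivetran_synced_values).astimezone(timezone.utc)
    return (latest - overlap).strftime("%Y-%m-%dT%H:%M:%SZ")


# In practice these values would come from a query such as
# SELECT max(_fivetran_synced) FROM <your_table>.
synced = [
    datetime(2024, 3, 1, 8, 0, tzinfo=timezone.utc),
    datetime(2024, 3, 5, 10, 30, tzinfo=timezone.utc),
]

# Hypothetical connector config; "start_date" is an assumed key name.
source_config = {"start_date": airbyte_start_date(synced)}
print(source_config)
```

With the sample values above, the suggested start is one day before the latest Fivetran sync, so the two datasets overlap rather than leaving a hole.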

Avoiding changes to the downstream system is harder. We won't be able to simply switch these tools if we already have dependencies.

However, we can use dbt to abstract the differences. By following dbt best practices when bringing the data into a warehouse, we can use the staging layer to re-wire staging models to point to a new data source.

This works the same way in both directions of the migration.
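
In a dbt project this mapping would live in SQL staging models, but the logic is easy to show in a few lines of Python. The sketch below turns a raw row from either tool into the same tool-agnostic shape, using the metadata columns listed earlier. The output names `synced_at` and `is_deleted` are hypothetical, and treating every Airbyte row as not deleted is an assumption, because how deletes surface depends on the connector.

```python
def to_agnostic(row, tool):
    """Strip tool-specific metadata and expose a common set of columns."""
    if tool == "fivetran":
        prefix = "_fivetran_"
        synced_at = row["_fivetran_synced"]
        is_deleted = bool(row.get("_fivetran_deleted", False))
    elif tool == "airbyte":
        prefix = "_airbyte_"
        synced_at = row["_airbyte_emitted_at"]
        # Assumption: no delete flag among the Airbyte columns shown above.
        is_deleted = False
    else:
        raise ValueError(f"unknown tool: {tool}")

    # Keep business columns only, then add the common metadata.
    clean = {k: v for k, v in row.items() if not k.startswith(prefix)}
    clean["synced_at"] = synced_at
    clean["is_deleted"] = is_deleted
    return clean


fivetran_row = {
    "id": 7,
    "email": "a@example.com",
    "_fivetran_synced": "2024-03-05T10:30:00Z",
    "_fivetran_deleted": False,
}
airbyte_row = {
    "id": 7,
    "email": "a@example.com",
    "_airbyte_ab_id": "hypothetical-id",
    "_airbyte_emitted_at": "2024-03-06T09:00:00Z",
}

from_fivetran = to_agnostic(fivetran_row, "fivetran")
from_airbyte = to_agnostic(airbyte_row, "airbyte")
```

Downstream models then depend only on the common columns, so swapping the tool means changing the staging layer and nothing else.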

A simplified workflow looks as follows:

  • Airbyte/Fivetran runs a sync

  • Data is replicated into raw tables

  • An incremental staging model runs and transforms the data into a new, tool-agnostic materialised view

  • Raw tables are truncated

Special care should be taken with the propagation of deleted records and updates, as the two tools handle them differently.

Incremental sync and truncating the data in raw tables can improve the performance of your syncs. However, you need to take care of de-duplication in your staging models.
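
The de-duplication itself usually boils down to keeping the most recently synced version of each record. Here is a minimal sketch of that rule in Python; in dbt you would express the same thing in SQL. The `id` key and `synced_at` column are assumed names, and the timestamps are ISO strings in one consistent format so that they compare correctly as text.

```python
def deduplicate(rows, key="id", synced_col="synced_at"):
    """Keep only the most recently synced row for each key."""
    latest = {}
    for row in rows:
        current = latest.get(row[key])
        # On a tie, the row seen later wins.
        if current is None or row[synced_col] >= current[synced_col]:
            latest[row[key]] = row
    return list(latest.values())


staged = [
    {"id": 1, "status": "new", "synced_at": "2024-03-01T00:00:00Z"},
    {"id": 2, "status": "new", "synced_at": "2024-03-01T00:00:00Z"},
    {"id": 1, "status": "paid", "synced_at": "2024-03-02T00:00:00Z"},
]

unique_rows = deduplicate(staged)
```

This matters most during the overlap period of a migration, when the same record may arrive once from each tool.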

Andrey Kozichev

© MetaOps 2024