Airbyte Data Pipelines as code

Utilising Low-Code Data Replication Tools such as Airbyte provides immediate advantages to Data Analytics Teams.

Setting up new data pipelines takes minutes, and the low-code connector builder makes it easy to extract data even from APIs that have no official connector.

This makes Data Analytics Teams self-sufficient, removing dependency on DevOps Engineers, allowing for quick prototyping, and speeding up the overall development and data integration process.

However, the GUI nature of Airbyte brings some trade-offs, such as the inability to peer-review changes, lack of configuration versioning, limited testing of new parameters, and inevitable environment divergence.

This is where Terraform comes in handy. Terraform has become the de facto standard for cloud infrastructure development, and Airbyte offers its own Terraform provider. Managing Airbyte through it brings peer review, configuration versioning and reproducible environments to your data pipelines.

  • The central Data Engineering Team develops and manages connectors with Terraform, allowing users to consume pre-configured connectors in their data pipelines.

  • End users no longer need access to database passwords or need to configure connectors themselves.

  • The Data Engineering Team can easily roll out new connectors to all environments and make configuration or credential changes with a simple pull request.

  • Airbyte servers can be easily rebuilt, or jobs can be moved into the cloud, and all connections can be reconfigured from Terraform in a matter of minutes.

Before we start with Terraform, we need to set up our instance to enable access to the API.

On Docker-based instances, Airbyte serves the API on port 8006 (image docker.io/airbyte/airbyte-api-server). In a Kubernetes deployment it is exposed via the Service airbyte-api-server-svc.

You can connect to it through that Service, for example by port-forwarding it to your local machine.

You will need to make sure that this endpoint is reachable from wherever Terraform runs. If Airbyte runs in the cloud and you plan to connect remotely, it is a good idea to protect the endpoint with a password and SSL/TLS certificates. In a Kubernetes environment this can be done via an Ingress controller.

Configuring the Terraform provider:

terraform {
  required_providers {
    airbyte = {
      source  = "airbytehq/airbyte"
      version = "0.4.1"
    }
  }
}

provider "airbyte" {
  username   = "airbyte"                   # your Airbyte username
  password   = "password"                  # your Airbyte password
  server_url = "http://localhost:8006/v1/" # replace with the URL of your Airbyte API
}
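
The provider block above hardcodes its credentials for brevity. A minimal sketch of one alternative is to pass them in as input variables and mark the password as sensitive. The variable names (airbyte_server_url, airbyte_username, airbyte_password) are hypothetical and not part of the original setup:

```hcl
# Hypothetical variable names; adjust them to your own conventions.
variable "airbyte_server_url" {
  type        = string
  description = "Base URL of the Airbyte API"
  default     = "http://localhost:8006/v1/"
}

variable "airbyte_username" {
  type    = string
  default = "airbyte"
}

variable "airbyte_password" {
  type      = string
  sensitive = true # value is redacted in plan and apply output
}

# This block replaces the hardcoded provider block; a configuration
# can only contain one default configuration per provider.
provider "airbyte" {
  username   = var.airbyte_username
  password   = var.airbyte_password
  server_url = var.airbyte_server_url
}
```

The password can then be supplied outside of version control, for example through the TF_VAR_airbyte_password environment variable. Note that sensitive values are still written to the Terraform state, so the state file itself needs to be stored securely.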

Creating sources and destinations is quite easy.

Example: a replication job that pulls CSV files from an Azure storage account and pushes them into Postgres.
# Azure Blob Storage Source
resource "airbyte_source_file" "my_source_file" {
  configuration = {
    dataset_name = "myHL7"
    format       = "csv"
    provider = {
      az_blob_azure_blob_storage = {
        sas_token       = "UoPbYlneI30qhW7+Mtnwu...tR2YDHw==" # long secret; do not commit it to code
        storage_account = "hl7examples"
      }
    }
    reader_options = "{\"sep\": \"None\", \"names\": [\"column1\"] }"
    url            = "extract/sample.hl7"
  }
  name          = "HL7-dataset-terraform"
  workspace_id  = "8f29d5ed-9d3b-4f5f-b1bc-7e1486aa1ca9" # use default workspace for OSS
}

# Postgres destination
resource "airbyte_destination_postgres" "my_destination_postgres" {
  configuration = {
    database            = "warehouse"
    disable_type_dedupe = false
    host                = "host.docker.internal"
    password            = "password"
    port                = 5432
    raw_data_schema     = "public"
    schema              = "public"
    ssl_mode = {
      allow = {}
    }
    tunnel_method = {
      no_tunnel = {}
    }
    username = "airbyte"
  }
  name          = "Postgres-destination-terraform"
  workspace_id  = "8f29d5ed-9d3b-4f5f-b1bc-7e1486aa1ca9"
}

# Connection
resource "airbyte_connection" "my_connection" {
  destination_id                       = airbyte_destination_postgres.my_destination_postgres.destination_id
  name                                 = "connection-terraform"
  non_breaking_schema_updates_behavior = "propagate_columns"
  source_id                            = airbyte_source_file.my_source_file.source_id
  status                               = "active"
}
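
One possible way to let other teams consume these pre-configured connectors is to expose their IDs as Terraform outputs. The following is only a sketch; the output names are hypothetical, while the resource addresses are the ones defined in the example above:

```hcl
# Hypothetical output names; the referenced resources come from the example.
output "hl7_source_id" {
  description = "ID of the pre-configured HL7 file source"
  value       = airbyte_source_file.my_source_file.source_id
}

output "postgres_destination_id" {
  description = "ID of the pre-configured Postgres destination"
  value       = airbyte_destination_postgres.my_destination_postgres.destination_id
}
```

After an apply, `terraform output hl7_source_id` prints the ID, which a consuming team can use as the source_id of its own airbyte_connection without ever seeing the underlying credentials.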

What if you have already created a bunch of connections? You don't have to re-create them. Just import each one into the state file and reverse-populate the configuration from it:

bash-3.2$ terraform import airbyte_source_microsoft_sharepoint.main 314ddda8-0160-46f5-98ee-4f04c0dac1f3
airbyte_source_microsoft_sharepoint.main: Importing from ID "314ddda8-0160-46f5-98ee-4f04c0dac1f3"...
airbyte_source_microsoft_sharepoint.main: Import prepared!
  Prepared airbyte_source_microsoft_sharepoint for import
airbyte_source_microsoft_sharepoint.main: Refreshing state...

Import successful!

The resources that were imported are shown above. These resources are now in
your Terraform state and will henceforth be managed by Terraform

You can find the ID of the resource to import in the browser URL when you open it in the Airbyte UI.
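
If you are on Terraform 1.5 or newer (an assumption, the original does not state a version), the same import can also be declared in configuration with an import block, which lets it go through the usual pull request review. A sketch using the resource address and ID from the command above:

```hcl
# Declarative equivalent of the terraform import command shown above.
# Requires Terraform 1.5 or newer.
import {
  to = airbyte_source_microsoft_sharepoint.main
  id = "314ddda8-0160-46f5-98ee-4f04c0dac1f3"
}
```

With this block in place and no matching resource block written yet, running `terraform plan -generate-config-out=generated.tf` asks Terraform to draft the resource configuration into generated.tf (a file name chosen here for illustration; the file must not already exist). Treat the generated configuration as a starting point and review it before applying.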

Developing and managing pre-configured Airbyte connectors with Terraform removes the dependency on DevOps.

Andrey Kozichev

© MetaOps 2024
