Data Engineering in NHS: processing HL7 with Airbyte and dbt
HL7 in Healthcare and Data Analytics
Health Level Seven (HL7) messages are widely used in the healthcare industry for exchanging information between different systems. They facilitate interoperability between various healthcare information systems, such as Electronic Health Records (EHRs), laboratory systems(LIS), radiology information systems(RIS), medical devices and more.
HL7 has become a de facto standard for system-to-system communications within the NHS.
The use of HL7 messages in data analytics offers a wealth of possibilities in the healthcare domain. It enables in-depth analysis of clinical information, patient demographics, diagnoses, medications, and treatment plans. By aggregating data extracted from HL7 messages, healthcare professionals can conduct trend analysis, identify patterns, and derive valuable insights.
Moreover, HL7 often contains essential information for regulatory reporting and compliance, contributing to key performance indicators (KPIs) in healthcare analytics and ensuring data quality assurance. The structured nature of these messages enhances the efficiency and accuracy of data analytics processes, making it a cornerstone for informed decision-making in the healthcare industry.
Ingesting HL7 into a Data Warehouse
In order to process HL7 message first we need to load it into a Data Warehouse. We will use Airbyte to ingest HL7 message from an Azure Storage Account into Postgres.
HL7 v2.x messages use a non-XML encoding syntax based on segments(lines) and one-character delimiters. Each segment contains fields separated by delimiters. Each field can be split into multiple sub-fields separated by another delimiter, and so on creating complex nested structures.
Example:
There are few connectors in Airbyte we can use for ingestion of such messages:
"File" source connector for a single file ingestion
One of the platform-specific storage source connectors, i.e. S3, Azure Storage Account or Google Cloud Storage(GCS) for bulk operations
Configure a Source Connector to pull data from Azure Storage Account using "File Source" type.
Important! “Reader Option” configuration. By Setting separator to “None” we telling Airbyte not to parse the line into fields and ingest it as a single string. Providing one-element array “names”, we configure a single column in destination dataset for our strings called “col0”.
Configure the Destination as a Postgres Database and setting up a connection. We can then sync the messages into Postgres.
After the initial sync completed, we can see the results in the Postgres Database:
With data taking 1 line per row
Processing HL7 message with dbt
We use dbt to parse HL7 message in the Data Warehouse.
Every HL7 Message has a header in a segment code “MSH”. The Data in this first line will define how to interpret the rest of the message.
Lets have a closer look at this segment and see how can we process it.
‘|’ - recommended as a field delimiter in a segment but technically it can be any single character.
We can parse this header with the following dbt code:
Example: hl7_parsed.sql
This model will form a basis for parsing of all HL7 segments.
The second dbt model to parse MSH segments:
Example: hl7_segment_msh.sql
and finally, the dbt model for MSH Message type
Example: hl7_msh_mesg_type.sql
After running all 3 models, we get the following results in the Database:
Gradually iterating over this approach, we can easily extend this dbt model to support a greater number of segment types and incorporate bulk ingestion with Airbyte by using vendor-specific source connectors.
The same approach with minor differences can be applied for ingesting into Snowflake, Databricks or BigQuery.
References
https://github.com/metaops-solutions/hl7-airbyte-dbt
Related Posts
finance
Sep 4, 2024
For any financial organisation, being able to access all relevant client data quickly is not just a competitive advantage in the current market - it’s an absolute necessity for the company’s survival.
fivetran
Jun 11, 2024
Incremental sync and truncating the data in raw tables can improve the performance of your syncs.