Analytics Engineer



Get reliable, centralized, accessible, and protected data.

Scope

Needs
Cost optimization
Data model

Extract

Structured data
Semi-structured data
Unstructured data

Load

Transform

dbt (data build tool)
Cleaning
Validation
Restructuring
Enrichment
Aggregation

Data structure

Environments

Documentation

data build tool (dbt)

Documentation of your ELT / ETL pipeline is essential. It makes it easy to understand how data is extracted, how it is linked together, how it is transformed, and how it is loaded into the Data Warehouse. It serves as a user manual for anyone discovering the process, such as a new Data Analyst who has just joined your team.
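Documentation can also live right next to the data itself. As a minimal sketch (the analytics dataset and its table and columns are hypothetical), BigQuery lets you attach a description to a table and to each of its columns:

```sql
-- A minimal sketch (hypothetical dataset, table, and columns):
-- descriptions attached directly to the table and its columns.
CREATE TABLE IF NOT EXISTS analytics.orders (
  order_id    INT64     OPTIONS (description = 'Unique identifier of the order'),
  customer_id INT64     OPTIONS (description = 'Key of the customer who placed the order'),
  ordered_at  TIMESTAMP OPTIONS (description = 'UTC timestamp of order creation')
)
OPTIONS (description = 'One row per order, loaded daily by the ELT pipeline');
```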

Data security

IAM

My answers to your questions


A traditional database (PostgreSQL, MySQL, etc.) is designed to optimize the storage and retrieval of data in real time. In contrast, a Data Warehouse is designed to store historical data and enable complex analyses of that data.

A Data Warehouse allows you to store structured data (raw and processed). In contrast, a Data Lake allows you to store any type of data (structured, semi-structured, unstructured).

A Data Warehouse is used by the entire organization as the reference for data. A Data Mart, on the other hand, caters to the specific needs of a department or team (such as Marketing, Finance, etc.).

ETL stands for Extract, Transform, and Load. Through this process, data is extracted from various sources, transformed according to the company’s needs, and loaded into the Data Warehouse.

Depending on the input data, its loading into the Data Warehouse occurs either after the transformation (ETL pipeline) or before the transformation (ELT pipeline).

For structured data, I recommend using the ELT process. The data is extracted (E), loaded into the Data Warehouse (L), and then transformed (T). By storing the data as-is, you make the Data Warehouse the source of all the company’s data (both raw and processed data).
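As an illustration, here is a minimal ELT sketch in BigQuery SQL (the raw_data and analytics datasets and their columns are hypothetical): the raw table was loaded as-is, and the transformation runs inside the Data Warehouse, after the load:

```sql
-- A minimal ELT sketch (hypothetical names): the transformation (T)
-- happens inside the Data Warehouse, after the load (L).
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT
  order_id,
  customer_id,
  UPPER(TRIM(country_code)) AS country_code,  -- cleaning: normalize country codes
  amount
FROM raw_data.orders
WHERE amount >= 0;                            -- validation: drop impossible amounts
```

The raw_data.orders table stays untouched, so future needs can always be served from the original data.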

For semi-structured or unstructured data, the ETL process must be used instead: a Data Warehouse is built around structured, tabular data. Once the data is extracted (E), it must first be transformed (T) into a format compatible with the Data Warehouse, and only then loaded into the Data Warehouse (L).

If you have a mix of structured, semi-structured, and unstructured data, the pipeline will thus be a combination of the ELT and ETL processes.

dbt stands for "data build tool". This tool greatly facilitates the construction of the pipeline: environment management (test / production) is built in, and documentation is generated automatically. More details are available on the dbt website.
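To give a concrete feel for it, here is a minimal sketch of a dbt model (the file name, model names, and columns are hypothetical). A model is simply a SELECT statement; dbt materializes it in the Data Warehouse and tracks its dependencies through ref():

```sql
-- models/staging/stg_orders.sql (hypothetical dbt model)
-- dbt materializes this SELECT in the Data Warehouse; ref() declares
-- the dependency on the upstream raw_orders model.

{{ config(materialized='table') }}

SELECT
  order_id,
  customer_id,
  CAST(ordered_at AS TIMESTAMP) AS ordered_at
FROM {{ ref('raw_orders') }}
```

Running dbt run against your test or production target builds the model in the corresponding environment, and dbt docs generate produces the documentation mentioned above.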

I recommend not modifying the input data before importing it into the Data Warehouse for two main reasons:

– You know the current business needs, but you don’t know the future business needs. Allow flexibility by retrieving the data as-is and transforming it later based on the evolving business requirements.

– Your Data Warehouse will serve as the source for all data, whether raw or processed. It will become THE reference for all collaborators.

Both. BigQuery is the solution developed by Google to create a high-quality Data Warehouse. It was also originally the name of the language used to query the Data Warehouse (BigQuery is an enhanced version of SQL). Recently, Google renamed the BigQuery language to GoogleSQL, but the term BigQuery is still commonly used to refer to both the language and the Data Warehouse. For more details, here is the official Google documentation on BigQuery.

If you’re unsure about which transformations to perform, an effective approach is to start using data visualization tools directly on your raw data in the Data Warehouse. This way, you’ll quickly identify performance issues or queries that involve joins between many tables.

You can then:

– Prepare views that correspond to frequent use cases in dashboards (see the sketch after this list)

– Aggregate this data directly in the Data Warehouse using the ELT / ETL pipeline
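As a sketch of the first option (the datasets, tables, and columns are hypothetical), a view can pre-assemble a join that dashboards would otherwise repeat:

```sql
-- A minimal sketch (hypothetical names): a reusable view that
-- pre-assembles a join frequently needed by dashboards.
CREATE OR REPLACE VIEW analytics.orders_with_customers AS
SELECT
  o.order_id,
  o.ordered_at,
  o.amount,
  c.customer_name,
  c.country_code
FROM analytics.orders_clean AS o
JOIN analytics.customers AS c
  ON o.customer_id = c.customer_id;
```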

Several approaches can help reduce BigQuery costs:

– Reduce the number of rows in each table as much as possible. You can aggregate data when a detailed view is not necessary.

– Partition your tables in BigQuery and use this partitioning in your queries, both in manual queries from your collaborators and in dashboards (see the sketch after this list). Here is the BigQuery documentation on partitioned tables.

– Carefully choose the update frequency for each table in the Data pipeline. Reduce this frequency as much as possible to lower ETL pipeline costs. For example, data that is reviewed on a monthly basis doesn’t need to be recalculated daily.
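To make the partitioning point concrete, here is a minimal sketch (the table and columns are hypothetical) of a partitioned table and a query that benefits from it:

```sql
-- A minimal sketch (hypothetical names): partition the table by day
-- so that queries filtering on the partition column scan less data.
CREATE TABLE IF NOT EXISTS analytics.events (
  event_id STRING,
  user_id  STRING,
  event_ts TIMESTAMP
)
PARTITION BY DATE(event_ts);

-- The filter on the partition column lets BigQuery prune partitions:
-- only the last 7 days are scanned (and billed), not the full history.
SELECT user_id, COUNT(*) AS event_count
FROM analytics.events
WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY user_id;
```

Without the partition filter, every dashboard refresh would scan the entire table.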