Analytics Engineer



Get reliable, centralized, accessible, and protected data.

Scope

Needs
Cost optimization
Data model

Extract

Structured data
Semi-structured data
Unstructured data

Load

Transform

dbt (data build tool)
Cleaning
Validation
Restructuring
Enrichment
Aggregation

Data structure

Environments

Documentation

data build tool (dbt)

Documentation of your ELT / ETL pipeline is essential. It makes it easy to understand how data is extracted, how it is linked together, how it is transformed, and how it is loaded into the Data Warehouse. It serves as a user manual for anyone discovering the process, such as a new Data Analyst who has just joined your team.
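Documentation can also live right next to the data itself. As a minimal sketch (the analytics dataset and its table and columns are hypothetical), BigQuery lets you attach a description to a table and to each of its columns:

```sql
-- A minimal sketch (hypothetical dataset, table, and columns):
-- descriptions attached directly to the table and its columns.
CREATE TABLE IF NOT EXISTS analytics.orders (
  order_id    INT64     OPTIONS (description = 'Unique identifier of the order'),
  customer_id INT64     OPTIONS (description = 'Key of the customer who placed the order'),
  ordered_at  TIMESTAMP OPTIONS (description = 'UTC timestamp of order creation')
)
OPTIONS (description = 'One row per order, loaded daily by the ELT pipeline');
```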

Data security

IAM

My answers to your questions


A traditional database (PostgreSQL, MySQL, etc.) is designed to optimize the storage and retrieval of data in real time. In contrast, a Data Warehouse is designed to store historical data and enable complex analyses of that data.

A Data Warehouse allows you to store structured data (raw and processed). In contrast, a Data Lake allows you to store any type of data (structured, semi-structured, unstructured).

A Data Warehouse is used by the entire organization as the reference for data. A Data Mart, on the other hand, caters to the specific needs of a department or team (such as Marketing, Finance, etc.).

ETL stands for Extract, Transform, and Load. Through this process, data is extracted from various sources, transformed according to the company’s needs, and loaded into the Data Warehouse.

Depending on the input data, its loading into the Data Warehouse occurs either after the transformation (ETL pipeline) or before the transformation (ELT pipeline).

For structured data, I recommend using the ELT process. The data is extracted (E), loaded into the Data Warehouse (L), and then transformed (T). By storing the data as-is, you make the Data Warehouse the source of all the company’s data (both raw and processed data).
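As an illustration, here is a minimal ELT sketch in BigQuery SQL (the raw_data and analytics datasets and their columns are hypothetical): the raw table was loaded as-is, and the transformation runs inside the Data Warehouse, after the load:

```sql
-- A minimal ELT sketch (hypothetical names): the transformation (T)
-- happens inside the Data Warehouse, after the load (L).
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT
  order_id,
  customer_id,
  UPPER(TRIM(country_code)) AS country_code,  -- cleaning: normalize country codes
  amount
FROM raw_data.orders
WHERE amount >= 0;                            -- validation: drop impossible amounts
```

The raw_data.orders table stays untouched, so future needs can always be served from the original data.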

For semi-structured or unstructured data, the ETL process must be used instead: a Data Warehouse is built around structured, tabular data. Once the data is extracted (E), it must first be transformed (T) into a format compatible with the Data Warehouse, and only then loaded into the Data Warehouse (L).

If you have a mix of structured, semi-structured, and unstructured data, the pipeline will thus be a combination of the ELT and ETL processes.

dbt stands for "data build tool". This tool greatly facilitates the construction of the pipeline: environment management (test / production) is built in, and documentation is generated automatically. More details are available on the dbt website.
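To give a concrete feel for it, here is a minimal sketch of a dbt model (the file name, model names, and columns are hypothetical). A model is simply a SELECT statement; dbt materializes it in the Data Warehouse and tracks its dependencies through ref():

```sql
-- models/staging/stg_orders.sql (hypothetical dbt model)
-- dbt materializes this SELECT in the Data Warehouse; ref() declares
-- the dependency on the upstream raw_orders model.

{{ config(materialized='table') }}

SELECT
  order_id,
  customer_id,
  CAST(ordered_at AS TIMESTAMP) AS ordered_at
FROM {{ ref('raw_orders') }}
```

Running dbt run against your test or production target builds the model in the corresponding environment, and dbt docs generate produces the documentation mentioned above.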

I recommend not modifying the input data before importing it into the Data Warehouse for two main reasons:

– You know the current business needs, but you don’t know the future business needs. Allow flexibility by retrieving the data as-is and transforming it later based on the evolving business requirements.

– Your Data Warehouse will serve as the source for all data, whether raw or processed. It will become THE reference for all collaborators.

Both. BigQuery is the solution developed by Google to create a high-quality Data Warehouse. It was also originally the name of the language used to query the Data Warehouse (BigQuery is an enhanced version of SQL). Recently, Google renamed the BigQuery language to GoogleSQL, but the term BigQuery is still commonly used to refer to both the language and the Data Warehouse. For more details, here is the official Google documentation on BigQuery.

If you’re unsure about which transformations to perform, an effective approach is to start using data visualization tools directly on your raw data in the Data Warehouse. This way, you’ll quickly identify performance issues or queries that involve joins between many tables.

You can then:

– Prepare views that correspond to frequent use cases in dashboards (see the sketch after this list)

– Aggregate this data directly in the Data Warehouse using the ELT / ETL pipeline
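As a sketch of the first option (the datasets, tables, and columns are hypothetical), a view can pre-assemble a join that dashboards would otherwise repeat:

```sql
-- A minimal sketch (hypothetical names): a reusable view that
-- pre-assembles a join frequently needed by dashboards.
CREATE OR REPLACE VIEW analytics.orders_with_customers AS
SELECT
  o.order_id,
  o.ordered_at,
  o.amount,
  c.customer_name,
  c.country_code
FROM analytics.orders_clean AS o
JOIN analytics.customers AS c
  ON o.customer_id = c.customer_id;
```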

Several approaches can help reduce BigQuery costs:

– Reduce the number of rows in each table as much as possible. You can aggregate data when a detailed view is not necessary.

– Partition your tables in BigQuery and use this partitioning in your queries, both in manual queries from your collaborators and in dashboards (see the sketch after this list). Here is the BigQuery documentation on partitioned tables.

– Carefully choose the update frequency for each table in the Data pipeline. Reduce this frequency as much as possible to lower ETL pipeline costs. For example, data that is reviewed on a monthly basis doesn’t need to be recalculated daily.
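To make the partitioning point concrete, here is a minimal sketch (the table and columns are hypothetical) of a partitioned table and a query that benefits from it:

```sql
-- A minimal sketch (hypothetical names): partition the table by day
-- so that queries filtering on the partition column scan less data.
CREATE TABLE IF NOT EXISTS analytics.events (
  event_id STRING,
  user_id  STRING,
  event_ts TIMESTAMP
)
PARTITION BY DATE(event_ts);

-- The filter on the partition column lets BigQuery prune partitions:
-- only the last 7 days are scanned (and billed), not the full history.
SELECT user_id, COUNT(*) AS event_count
FROM analytics.events
WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
GROUP BY user_id;
```

Without the partition filter, every dashboard refresh would scan the entire table.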