ETL pipeline and Cloud Data Warehouse
Analytics Engineer
Get reliable, centralized, accessible, and protected data.
Scope
Your data volume, business and budget constraints, and analysis and visualization needs are specific to your organization, and they determine how your data should be managed. The goal of this step is to define that framework so that you ultimately end up with centralized, high-quality, available, and secure data.
- Gathering and defining your analysis and visualization needs, so that the pipeline meets your business requirements
- Framing expectations for data quality, availability, and security
- Designing a Data Warehouse and an ETL process that fit your budget
- Selecting a data model that meets your business needs
Extract
Your data is scattered across your tools and services. The goal here is to connect to each of these sources and extract your data according to your needs, either partially or fully, at the right frequency; a minimal extraction sketch follows the list of typical sources below.
Structured and semi-structured data:
- Relational databases
- API / Web Service
- Data from software / applications
- XML
- JSON
- CSV, Excel, Google Sheets
- Logs
- …
Unstructured data:
- Text files
- Surveys
- Presentations (PowerPoint, Google Slides)
- …
Load
Load the extracted data into your Data Warehouse for transformation.
For structured data, this step occurs before transformation (the ELT pattern). For semi-structured and unstructured data, it occurs after transformation, since the data must first be put into a table-like format before it can be stored in your Data Warehouse.
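A minimal loading sketch, assuming a SQLAlchemy-compatible warehouse (a Postgres-style connection string is shown, but BigQuery, Snowflake, and Redshift expose similar dialects). The connection details, schema, and table names are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- only this line changes per warehouse.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

def load(df: pd.DataFrame, table: str, schema: str = "raw") -> None:
    """Append a batch of extracted records into the warehouse."""
    df.to_sql(table, engine, schema=schema, if_exists="append", index=False)

contacts = pd.DataFrame({"id": [1, 2], "email": ["a@example.com", "b@example.com"]})
load(contacts, "contacts")  # lands in raw.contacts, ready to be transformed
```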
Transform
Here lies the core of the ETL process. Raw, unprocessed data is a source of errors and complexity. By transforming your data according to your needs, you get high-quality data enriched with information that is crucial for the business; a sketch of typical transformations follows the list below.
- Deduplication
- Removal of unnecessary data
- Correction of incorrect or inconsistent data:
  - the right format
  - the right type
  - the right range
  - the right structure
- Format changes
- Renaming columns
- Currency or other unit conversions
- Text string modifications
- Identification of missing data and auto-completion
- Addition of calculated fields (segmentation, scores, etc.)
- Finding the ID of an unknown prospect / user / customer
The result: data that is ready for dashboards and usable by anyone.
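A sketch of several of these transformations with pandas; the column names, exchange rate, and segmentation thresholds are invented for illustration.

```python
import pandas as pd

# Raw input with typical defects: a badly named column, duplicates,
# inconsistent casing, a missing value, and numbers stored as text.
raw = pd.DataFrame({
    "Email ": ["a@x.com", "a@x.com", "B@Y.COM", None],
    "amount_usd": ["10.5", "10.5", "99", "40"],
})

clean = (
    raw
    .rename(columns={"Email ": "email"})                     # renaming columns
    .assign(
        email=lambda d: d["email"].str.strip().str.lower(),  # text string modifications
        amount_usd=lambda d: d["amount_usd"].astype(float),  # the right type
    )
    .drop_duplicates()                                       # deduplication
    .dropna(subset=["email"])                                # removal of unusable rows
    .assign(
        amount_eur=lambda d: d["amount_usd"] * 0.92,         # currency conversion (example rate)
        segment=lambda d: pd.cut(                            # calculated field: segmentation
            d["amount_eur"], bins=[0, 50, float("inf")], labels=["small", "large"]
        ),
    )
)
```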
Data structure
The Data Warehouse should not be an obscure area reserved for Data Engineers, Scientists, or Analysts. It must align with the needs of your business: its structure and the names used should be understandable by everyone, whether in Management, Marketing, or any other team.
It is important to keep the raw data intact in your Data Warehouse. The warehouse then becomes the source not only for high-quality data but also for the original data, ensuring that everyone works from a common source.
Structure your Data Warehouse into 3 sets of data:
- Raw data – raw
- Intermediate data (technical data before reaching the cleaned data) – temp
- Clean data – clean
Then use naming conventions that are understandable to everyone for each processed element. For example, the cleaned table of your contacts could be named ‘marketing_contacts’. This naming step is crucial for everyone to take ownership of the data and feel empowered to use it.
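A minimal sketch of this three-layer layout, assuming a Postgres-flavored warehouse accessed through SQLAlchemy; the connection string is a placeholder.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

# One schema per layer, following the raw / temp / clean convention above.
with engine.begin() as conn:
    for schema in ("raw", "temp", "clean"):
        conn.execute(text(f"CREATE SCHEMA IF NOT EXISTS {schema}"))
    # Business-friendly names live in the clean layer,
    # e.g. clean.marketing_contacts for the cleaned contacts data.
```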
Documentation
Documentation of your ELT / ETL pipeline is essential. It clarifies how data is extracted, how it is linked together, how it is transformed, and how it is loaded into the Data Warehouse. It serves as a user manual for anyone discovering the process, such as a new Data Analyst who has just joined your team.
Data security
Everyone needs different data, and at different levels of detail.
Secure your data so that you can, for example, offer the following access levels (a sketch of matching warehouse grants follows this list):
- Access to all data (raw, intermediate, and cleaned) for your Data team
- Access to Marketing data for the Marketing team
- Access to logs for the IT team
- Access to macro KPIs for the entire company, to provide a shared view across the organization
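As an illustration, here is how these access levels could map to warehouse-level grants, assuming a Postgres-flavored warehouse and hypothetical roles (data_team, marketing_team) that already exist.

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

# Hypothetical roles; each GRANT mirrors one of the access levels above.
grants = [
    # The Data team sees every layer: raw, intermediate, and cleaned.
    "GRANT USAGE ON SCHEMA raw, temp, clean TO data_team",
    "GRANT SELECT ON ALL TABLES IN SCHEMA raw, temp, clean TO data_team",
    # The Marketing team only sees its own cleaned data.
    "GRANT USAGE ON SCHEMA clean TO marketing_team",
    "GRANT SELECT ON clean.marketing_contacts TO marketing_team",
]

with engine.begin() as conn:
    for statement in grants:
        conn.execute(text(statement))
```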