Context & problem
The client, a large multi-brand organisation with a turnover of € 200–300 million, operates seven consumer brands (Brand A to G). Within the organisation, A/B testing is performed continuously — over 120+ experiments per year on a daily processed data volume of ± 2.5 TiB. Historically, analysis of these tests relied on aggregated GA4 interface metrics and manual spreadsheets, which was error-prone, non-reproducible, and methodologically fragile.
The raw GA4 export to BigQuery solved the data availability problem but introduced new engineering challenges:
- Nested structures: Event parameters, user properties, and items are stored in repeated record fields that must be unpacked for every single analytical query.
- Session granularity: A standard GA4 session is a coarse time window, not a clean unit of analysis. A single session can contain multiple experiment exposures.
- Lack of causal logic: The GA4 interface does not slice data at the exact moment of experiment exposure. Consequently, behaviour occurring before exposure is incorrectly included in test KPIs.
To systematically resolve these issues, a dbt transformation platform was built on Google BigQuery, coupled with an advanced Power BI consumption layer.
Architecture
Data flows from user interactions on the websites of the seven brands to the reporting dashboard through a single integrated chain:
Website & app (Brands A-G) → GTM/dataLayer → GA4 → BigQuery raw export
→ dbt Staging → Core → CRO-mart (subsession-grain)
→ Power BI dataflow → semantic model → dashboard
Within this pipeline, dbt acts as the exclusive carrier of all business transformation logic and data quality rules, while Power BI is strictly responsible for visualization, filtering, and the semantic model utilizing inactive relationships.
Waarom standaard GA4-sessies niet genoeg waren
The methodological heart of this platform is the elimination of causal pollution. Suppose a visitor opens a product page, browses images, and only after three minutes gets exposed to an experiment (e.g., an optimized checkout CTA). If the entire session is used as the unit of analysis, the behaviour from those first three minutes is included in the test metrics.
This introduces severe methodological errors:
- Distorted causality: Pre-exposure conversions or interactions cannot possibly be caused by the test variant. They do not belong in the test metrics.
- Dampened test effect: Noise from pre-exposure behaviour dilutes the true impact (uplift) of a variant, causing significant effects to go unnoticed (Type II errors).
- Spurious uplift: Accidental differences in pre-exposure activity between groups introduce systematic bias into the results.
The solution is subsession construction. The session is sliced at the exact timestamp of the first experiment exposure (the trigger). Everything before the trigger is discarded; everything from the trigger onwards forms the subsession on which the test is evaluated. A single GA4 session can thus yield multiple independent subsessions, one per experiment.
Vier dbt-pipelines voor zeven merken
The platform is designed for maximum scalability. Instead of building a separate pipeline for each of the seven brands, the organisation maintains four dbt pipelines that serve the seven brands. This repository contains the pipeline for Brand A, but leverages a fully brand-agnostic design pattern:
- The source configuration accepts a flexible array of GA4 property IDs (
property_ids). - The dbt macro
combine_property_datacan dynamically merge data from multiple properties into a single combined dataset. - The transformation and experimentation logic depends entirely on standardized event parameters and triggers, allowing new brands to be integrated seamlessly via configuration rather than code duplication.
This repeatable engineering pattern ensures that platform updates or bugfixes are instantly rolled out to all connected brands, minimizing the maintenance overhead on a volume of 2.5 TiB/day.
Van GA4-events naar analyseerbare subsessies
The dbt architecture is structured in sequential layers to transform raw event shards into a clean data mart:
- Staging (
models/_1_staging): Unpacks nested GA4 records and generates stable, deterministic keys (client_key,session_key,event_key) using MD5 hashing. There is one staging model per critical event type (such as pageviews, clicks, and e-commerce interactions) to ensure readability. - Core (
models/_2_core): Models the fundamental dimension and fact tables for sessions, users, and pages. This forms the generic web analytics foundation. - CRO Mart (
models/_3_mart_cro): The dedicated experimentation pipeline. This is where experiment exposures are isolated, the first exposure per(session, experiment)is detected, trigger timestamps are captured, and sessions are sliced.
Each constructed subsession receives a deterministic sub_session_id (hash of user, session, experiment, and variant). Every event within that subsession is assigned a unique cro_event_key and is filtered with the strict condition: event_timestamp >= trigger_timestamp.
Experiment-isolatie met het impressie-event
Because a visitor can be exposed to multiple A/B tests during a single browser session, strict experiment isolation is essential. The dbt CRO mart pipeline ensures that experiments are analyzed completely independently of one another.
Additionally, the pipeline proactively identifies and flags complex edge cases in the experiment library:
- ABBA exposures: Scenarios where a user, due to technical errors or cookie resets, is exposed to both the control and variant versions within the same experiment.
- Multi-test overlaps: Situenarios where a session falls into multiple potentially interacting tests simultaneously, enabling downstream filtering or interaction analysis.
These detected anomalies are passed as flags to the final model, allowing them to be explicitly excluded from primary analyses in Power BI to protect statistical validity.
Dynamische KPI-selectie op eventniveau
Traditional A/B testing pipelines aggregate data directly to a pre-defined goal (such as transactions). If the business later decides to analyze a different KPI (such as add to cart or begin checkout), the pipeline must be redesigned and rebuilt.
This platform resolves this by materializing the final model at the event-grain: one row per event within a subsession, enriched with the full experiment context. Conversion rates and uplift are calculated on the fly in the consumption layer (Power BI) using flexible DAX measures.
The Power BI dashboard features a "Main KPI" slicer that allows users to dynamically select any event from the dataset as the primary KPI, along with a "Custom KPI" block for secondary metrics. The dashboard instantly computes the corresponding user counts, conversion rates, and relative uplift without reprocessing a single byte of data in BigQuery.
Modererende eventanalyse
The event-grain structure also enables advanced moderating analyses. In addition to the primary KPI, users can select a second event as a moderating factor (e.g., "visited the FAQ page" or "opened the size guide").
The Power BI model then splits the results into two groups within the subsession: visitors who performed the moderating event versus visitors who did not. In the DAX metrics, this is technically realized via disconnected parameter tables (Segments and KPIs) and an INTERSECT over user IDs:
conversion = users with (moderator event) AND (KPI event)
────────────────────────────────────────────
users with (moderator event)
This enables the business to conduct deep behavioural analyses, such as determining whether a new checkout variant is particularly effective for users who hesitated and consulted the help pages.
Power BI-dashboard voor experimentation analytics
The dashboard is designed according to the principle of a clean separation of roles. dbt carries all business and transformation logic; Power BI reads the resulting event-grain table via a Dataflow Gen 2 and presents the metrics through a fixed page structure:
- Overall Results: Displays primary metrics (users, conversions, conversion rate, uplift) for the selected Main and Custom KPIs per variant, including a cumulative trend line over time.
- New & Returning: The same test results and funnels, directly segmented by user type.
- Traffic Source & Browsers: Full breakdown of conversions, uplift, and statistical reliability by acquisition channel and browser version.
- SRM Check: A dedicated page for the statistical validation of the visitor distribution.
- Data: Direct insights into raw transaction counts, revenue per variant, and transaction-level outlier detection to prevent revenue distortion.
Statistische validiteit: meer dan alleen uplift
To prevent the business from making decisions based on chance or skewed test setups, the dashboard incorporates rigorous statistical guardrails:
- Sample Ratio Mismatch (SRM) Guardrail: An embedded chi-square goodness-of-fit test (
CHISQ.DIST.RT) checks whether the actual distribution of visitors across variants deviates significantly from the expected ratio (e.g., 50/50). This runs at the aggregate level and per segment (channel, browser). At a p-value of< 0.001, the dashboard immediately flags the test as a "Mismatch" (invalid). - Two-Proportion Z-Test: For proportional metrics (conversion rates), the DAX layer computes a z-score and yields the "Probability of Being Better" (PoBB). This indicates the statistical confidence that the variant actually outperforms the control.
- Welch's T-Test: For continuous metrics (such as average order value or revenue per user), Welch's t-test is applied with dynamic degrees of freedom to display reliable p-values, independent of unequal variances between the groups.
dbt-modellering en datakwaliteit
The reliability of the pipeline is continuously guaranteed by a comprehensive test suite of approximately 68 automated tests in dbt and CI:
| Custom dbt test | Severity | Guaranteed functionality |
|---|---|---|
event_ | Error | Ensures absolutely no events land in a subsession with a timestamp prior to the experiment exposure. |
purchase_ | Error | Prevents purchases that occurred before exposure from being incorrectly attributed to a test variant. |
one_ | Error | Guarantees the uniqueness and integrity of the subsession model. |
purchase_ | Error | Ensures every e-commerce purchase event is correctly classified to product or subscription types. |
stage2_ | Warn | Immediately signals any potential data loss during the subsession-slicing step. |
Additionally, pytest unit tests run in the GitHub Actions CI/CD pipeline to validate crucial SQL macros (such as channel grouping logic and URL parsing) against mock data before code can be merged.
Engineeringbeslissingen die ertoe doen
- Subsessions over raw sessions: Slicing sessions at the exact exposure timestamp restores causality and completely eliminates pre-exposure noise.
- Event-grain modelling: Avoiding premature aggregation preserves maximum data flexibility for dynamic KPI and moderator selection in Power BI.
- Strict separation of roles (dbt vs. BI): All business logic is defined and tested in dbt. Power BI solely computes measures, ensuring a single reliable "source of truth".
- Cost-conscious BigQuery strategy: CRO models utilize incremental
insert_overwritematerialization on daily partitions, combined with clustering and an enforced partition filter (require_partition_filter) to prevent unnecessary full-table scans on the 2.5 TiB/day volume. - Disconnected tables for DAX flexibility: The semantic model leverages parameter tables and
USERELATIONSHIPto resolve filter conflicts between multiple event slicers. - Industrialized multi-property design: The pipeline is brand-agnostic and replicates the same logic across four pipelines and seven brands via central configuration.
Resultaat & waarde
The implementation of this dbt-driven experimentation analytics pipeline delivered direct strategic and operational value:
- Causally clean A/B test analyses by completely excluding pre-exposure behaviour.
- A single, trusted "source of truth" in dbt rather than fragmented logic scattered across dashboards.
- Fully dynamic KPI selection in Power BI without reprocessing or re-engineering in the data warehouse.
- Automated statistical validity monitoring featuring chi-square SRM detection and z-tests.
- High query efficiency on BigQuery through advanced partitioning, clustering, and incremental builds.
- Scalable multi-brand architecture serving seven consumer brands reliably with just four pipelines.
The platform transforms raw, complex GA4 event streams into a causally clean, tested, and cost-aware analytical foundation. This empowers the organisation to evaluate the results of its 120+ annual experiments with complete reliability and end-to-end transparency.