
Competitive-Intelligence Platform

type: data platform · status: production · year: 2025
TL;DR
Two scrapers with deliberately opposite philosophies, a throwaway snapshot crawler and an append-only PDP archive, share one library and one deterministic ID scheme. Weekly runs deduplicate and archive travel price and departure data in BigQuery, and dbt transforms it into tested dashboard models. No console clicks in production: every GCP resource is defined in Terraform.
Infrastructure
Terraform, Cloud Run Jobs, Cloud Workflows, Secret Manager, GitHub Actions
Data
BigQuery, dbt
Language
Python

Weekly, idempotent data collection in the travel industry — fully on GCP, everything Infrastructure-as-Code.

Context & problem

Competitive data in the travel industry is volatile: prices and departure dates shift daily, pages appear and disappear. A single scrape gives you a snapshot; the value is in the time series — price movements, new departures, trips that vanished. At the same time you don't want every run to be a full crawl of heavy product-detail pages (PDPs): that is slow, expensive on proxy traffic, and fragile.

The core of the design is making that tension explicit in two separate components instead of overloading one scraper with conditional logic.

Architecture: two scrapers, two philosophies

The key design decision: not one configurable scraper, but two components that each do one thing well and deliberately have different levels of robustness.

Aspect         Snapshot crawler                                         Archive scraper (PDP)
Role           Cheap sitemap crawl, determines what there is to scrape  Deep PDP extraction, builds the historical archive
Data lifetime  Throwaway: every run overwrites                          Permanent: append-only time series
HTTP           session.get() directly, no retry                         fetch_with_retry() with backoff + proxy
Robustness     Minimal, allowed to fail                                 Full: dedup, fingerprinting, anomaly detection
BQ write mode  overwrite (snapshot)                                     WRITE_APPEND
Feeds          the archive scraper                                      the dbt layer

Why the asymmetry is deliberate. The snapshot crawler is a feeder: all it has to deliver is the current set of URLs and their paths. If a run fails, you simply run it again — nothing is lost because the output is throwaway anyway. Stuffing it with retry logic and deduplication would add complexity to something that should stay simple and replaceable.

The archive scraper is the opposite: every successful fetch is an irreplaceable point in a time series. That is where all the robustness lives — retries, idempotent IDs, content fingerprinting, anomaly detection on row counts. The separation keeps both components readable: no if is_snapshot: branches trying to reconcile two opposing requirements in one code path.
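The text does not say how the anomaly detection on row counts works. As a rough sketch only, one common approach is to compare the current run's row count against the median of recent runs and flag a sharp drop. The function name, the 50% threshold, and the minimum history length below are all assumptions for illustration, not the platform's actual values.

```python
from statistics import median
from typing import Sequence


def row_count_is_anomalous(
    current: int,
    previous: Sequence[int],
    max_drop: float = 0.5,   # assumed threshold: flag a drop of more than 50%
    min_history: int = 3,    # assumed: need a few runs before judging
) -> bool:
    """Return True when the current row count drops far below recent runs."""
    if len(previous) < min_history:
        # Not enough history to form a baseline; do not block the run.
        return False
    baseline = median(previous)
    if baseline == 0:
        return False
    return current < baseline * (1 - max_drop)


# A run that returns 400 rows against a baseline of about 1000 is flagged.
print(row_count_is_anomalous(400, [1000, 980, 1020]))
print(row_count_is_anomalous(950, [1000, 980, 1020]))
```

This sketch only checks for drops, since a broken parser or a blocked proxy typically shows up as too few rows rather than too many.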

Shared library

Both scrapers — and every future scraper — build on one shared library. No class hierarchy, just direct functions per responsibility.

Module     Responsibility
ids.py     Deterministic ID generation (make_scrape_id, make_scrape_url_id, make_scrape_url_departure_id)
http.py    fetch_with_retry(session, url, timeout=15, is_api=False) → Response | None
bq.py      push_rows, check_exists, get_scraped_paths, get_sitemap_urls
config.py  Central config: project/dataset, proxy toggle, delays, max retries
log.py     Structured logging: log(severity, message, **extra)

The library is the contract: a new scraper never has to think about what an ID looks like or how to write to BigQuery. That is solved well once and reused everywhere.
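To illustrate the "direct functions per responsibility" style, here is a minimal sketch of what a log(severity, message, **extra) helper could look like. It assumes the helper writes one JSON object per line to stdout, a format Cloud Logging can parse into structured entries when it sees the severity and message keys. The format_log_entry helper is hypothetical and exists only to keep the sketch easy to test.

```python
import json
import sys


def format_log_entry(severity: str, message: str, **extra) -> str:
    """Build one JSON log line; extra keyword arguments become top-level fields."""
    entry = {"severity": severity.upper(), "message": message, **extra}
    # default=str keeps non-JSON types (dates, paths) from raising.
    return json.dumps(entry, default=str)


def log(severity: str, message: str, **extra) -> None:
    """Write a structured log line to stdout (assumed output target)."""
    print(format_log_entry(severity, message, **extra), file=sys.stdout, flush=True)


log("info", "archive run finished", rows_written=128, source="example-source")
```

Keeping the formatting separate from the printing means a scraper can be tested without capturing stdout.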

Determinism & idempotency

The backbone of the whole platform. IDs are deterministic, never random. Every row gets an ID via a base64-encoded MD5 hash over a fixed key composition:

make_scrape_url_id()  →  b64_md5("<source>|{departure_date}|{page_path}")
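A minimal sketch of the scheme above, using only the standard library. The key composition follows the formula. The function signature, the UTF-8 encoding, and the choice of standard (not URL-safe) base64 are assumptions, and "example-source" is a placeholder value.

```python
import base64
import hashlib


def b64_md5(key: str) -> str:
    """Base64-encode the raw 16-byte MD5 digest of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


def make_scrape_url_id(source: str, departure_date: str, page_path: str) -> str:
    """Hypothetical signature: join the key parts with '|' and hash them."""
    return b64_md5(f"{source}|{departure_date}|{page_path}")


first = make_scrape_url_id("example-source", "2025-06-01", "/trips/alps-hike")
second = make_scrape_url_id("example-source", "2025-06-01", "/trips/alps-hike")
print(first == second)  # same inputs always give the same ID
print(len(first))       # a 16-byte digest encodes to 24 base64 characters
```

MD5 serves here as a compact fingerprint, not as a security measure, so its known cryptographic weaknesses do not matter for this use.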

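This yields several properties at once: runs are idempotent, appends are safe, and joins stay stable, because the same input always maps to the same ID.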

On top of the IDs, the archive scraper adds content fingerprinting and anomaly detection on row counts.

HTTP layer & robustness

All PDP traffic goes through fetch_with_retry(). The sitemap crawl is the one place that deliberately uses a bare session.get(): its output is throwaway, so it is allowed to fail.
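The retry behaviour can be sketched as follows. Only the first four parameters and the Response | None return type come from the library's documented signature. The max_retries, base_delay, and sleep parameters, the set of retryable status codes, the exponential backoff formula, and the meaning given to is_api (requesting JSON) are assumptions for illustration, and proxy handling is omitted. The session is anything with a requests-style get() method.

```python
import time
from typing import Any, Callable, Optional

# Assumed set of status codes worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def fetch_with_retry(
    session: Any,
    url: str,
    timeout: int = 15,
    is_api: bool = False,
    max_retries: int = 3,                      # assumption, would live in config
    base_delay: float = 1.0,                   # assumption, seconds
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Any]:
    """Return a response, or None once all attempts are exhausted."""
    headers = {"Accept": "application/json"} if is_api else None
    for attempt in range(max_retries + 1):
        try:
            response = session.get(url, timeout=timeout, headers=headers)
        except OSError:
            # requests' RequestException derives from IOError (OSError).
            response = None
        if response is not None and response.status_code not in RETRYABLE_STATUS:
            return response
        if attempt < max_retries:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    return None
```

Returning None instead of raising lets the caller skip a single failed page and continue with the rest of the run. Injecting sleep makes the backoff testable without real waiting.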

BigQuery data model

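Two write regimes, matching the two scrapers: the snapshot crawler overwrites its table on every run, and the archive scraper writes with WRITE_APPEND.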

Idempotency (see above) makes append safe: duplicate runs do not lead to duplicate rows because the deterministic IDs catch collisions.
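BigQuery's WRITE_APPEND disposition does not reject duplicate rows on its own, so the deterministic IDs have to be checked somewhere. One possible approach, sketched here in plain Python, is to filter out rows whose ID is already in the archive before pushing. The function name, the scrape_url_id field name, and the idea that existing IDs are fetched up front (for example by something like check_exists) are assumptions.

```python
from typing import Dict, Iterable, List, Set


def select_new_rows(
    rows: Iterable[Dict[str, object]],
    existing_ids: Set[str],
    id_field: str = "scrape_url_id",  # assumed column name
) -> List[Dict[str, object]]:
    """Keep only rows whose ID is neither archived nor repeated in this batch."""
    seen = set(existing_ids)
    fresh = []
    for row in rows:
        row_id = row[id_field]
        if row_id in seen:
            continue
        seen.add(row_id)
        fresh.append(row)
    return fresh


archive_ids = {"id-a"}
batch = [
    {"scrape_url_id": "id-a", "price": 899},
    {"scrape_url_id": "id-b", "price": 1049},
    {"scrape_url_id": "id-b", "price": 1049},
]
to_append = select_new_rows(batch, archive_ids)
print(len(to_append))  # only id-b, once
```

Running the same batch again after the append yields an empty list, which is what makes a rerun harmless.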

dbt transformation layer

Raw append-only data is not dashboard data. The dbt layer transforms in clear stages, every model tested and documented:

staging  →  stg_*         (1:1 source cleaning)
marts    →  fct_* / dim_* (business logic, facts & dimensions)
dashboard→  dash_*        (presentation views for the frontend)

BigQuery craft that matters in this domain:

Infrastructure & operations

The platform is serverless and fully declarative.

Engineering decisions that matter

What lifts this project above "a script that scrapes":

  1. Two philosophies instead of one configurable behemoth. Robustness where it counts (archive), simplicity where it's allowed (snapshot). Complexity is placed, not avoided.
  2. Determinism as the foundation. Deterministic IDs enable idempotency, safe appends, and stable joins in one stroke. One design choice, three problems solved.
  3. Append-only with guardrails. History is sacred; truncate does not exist. Anomaly detection keeps a broken run from silently polluting the archive.
  4. The shared library as contract. New scrapers inherit robustness instead of rewriting it.
  5. Everything IaC, least-privilege, secrets out of the code. Production is reproducible and auditable; the infrastructure itself is the proof of discipline.

Status

The platform runs in production on the weekly cadence. The dbt models feed tested dashboard views; the time series keeps growing append-only. Next steps are in the presentation layer — live dashboards and exposures on top of fct_competitor__departures.