Weekly, idempotent data collection in the travel industry — fully on GCP, everything Infrastructure-as-Code.
Context & problem
Competitive data in the travel industry is volatile: prices and departure dates shift daily, pages appear and disappear. A single scrape gives you a snapshot; the value is in the time series — price movements, new departures, trips that vanished. At the same time you don't want every run to be a full crawl of heavy product-detail pages (PDPs): that is slow, expensive on proxy traffic, and fragile.
The core of the design is making that tension explicit in two separate components instead of overloading one scraper with conditional logic.
Architecture: two scrapers, two philosophies
The key design decision: not one configurable scraper, but two components that each do one thing well and deliberately have different levels of robustness.
| Snapshot crawler | Archive scraper (PDP) | |
|---|---|---|
| Role | Cheap sitemap crawl, determines what there is to scrape | Deep PDP extraction, builds the historical archive |
| Data lifetime | Throwaway — every run overwrites | Permanent — append-only time series |
| HTTP | session.get() directly, no retry | fetch_with_retry() with backoff + proxy |
| Robustness | Minimal, allowed to fail | Full: dedup, fingerprinting, anomaly detection |
| BQ write mode | overwrite (snapshot) | WRITE_APPEND |
| Feeds | the archive scraper | the dbt layer |
Why the asymmetry is deliberate. The snapshot crawler is a feeder: all it has to deliver is the current set of URLs and their paths. If a run fails, you simply run it again — nothing is lost because the output is throwaway anyway. Stuffing it with retry logic and deduplication would add complexity to something that should stay simple and replaceable.
The archive scraper is the opposite: every successful fetch is an irreplaceable point in a time series. That is where all the robustness lives — retries, idempotent IDs, content fingerprinting, anomaly detection on row counts. The separation keeps both components readable: no if is_snapshot: branches trying to reconcile two opposing requirements in one code path.
Shared library
Both scrapers — and every future scraper — build on one shared library. No class hierarchy, just direct functions per responsibility.
| Module | Responsibility |
|---|---|
ids.py | Deterministic ID generation (make_scrape_id, make_scrape_url_id, make_scrape_url_departure_id) |
http.py | fetch_with_retry(session, url, timeout=15, is_api=False) → Response | None |
bq.py | push_rows, check_exists, get_scraped_paths, get_sitemap_urls |
config.py | Central config: project/dataset, proxy toggle, delays, max retries |
log.py | Structured logging: log(severity, message, **extra) |
The library is the contract: a new scraper never has to think about what an ID looks like or how to write to BigQuery. That is solved well once and reused everywhere.
Determinism & idempotency
The backbone of the whole platform. IDs are deterministic, never random. Every row gets an ID via a base64-encoded MD5 hash over a fixed key composition:
make_scrape_url_id() → b64_md5("<source>|{departure_date}|{page_path}")
This immediately yields a number of properties:
- Idempotency. The same trip on the same date always produces the same ID. A repeated run therefore duplicates nothing —
check_exists()knows whether a record already exists without fuzzy matching. - Stable join key. The same ID scheme connects the snapshot, archive and dbt layers without fragile multi-column joins. The key is reproducible from the source data itself.
- No UUID/random. A random ID would be new on every run — you could never follow an entity over time, and dedup would be impossible without expensive comparisons.
On top of the IDs:
- Content fingerprinting detects whether a page actually changed, independent of whether the ID is the same — distinguishing a real price change from an unchanged re-fetch.
- Anomaly detection on row counts per run: a crawl that suddenly yields 90% fewer rows is probably a broken selector or a blocked proxy, not legitimate shrinkage. That run gets flagged instead of silently polluting the archive.
HTTP layer & robustness
All PDP traffic goes through fetch_with_retry(). The only exception is the sitemap crawl, which deliberately uses bare session.get() (throwaway, allowed to fail).
- Retry with backoff around transient errors and proxy hiccups, with a configured maximum.
- Rotating proxy. Traffic can run through a commercial proxy provider with IP rotation, toggled via config. SSL verification is off for the proxy endpoint (a deliberate, controlled choice within the proxy context).
- Separate delays for API versus page requests, centralised in config — no magic
sleep()numbers scattered through the codebase. Noneas explicit failure.fetch_with_retry()returns aResponseorNone. The caller must handle the None case; no exceptions leaking through the call stack.
BigQuery data model
Two write regimes, matching the two scrapers:
- Snapshot — overwriting. Always the current state, no history needed.
- Archive — strictly
WRITE_APPEND, never truncate. Every run adds to the time series. Truncate would mean an operational mistake irreversibly wiping historical data. That door is nailed shut — append-only is a hard rule, not a preference.
Idempotency (see above) makes append safe: duplicate runs do not lead to duplicate rows because the deterministic IDs catch collisions.
dbt transformation layer
Raw append-only data is not dashboard data. The dbt layer transforms in clear stages, every model tested and documented:
staging → stg_* (1:1 source cleaning)
marts → fct_* / dim_* (business logic, facts & dimensions)
dashboard→ dash_* (presentation views for the frontend)
- Main fact:
fct_competitor__departures— the time series of departures/prices per anonymised source. - CTEs, no nested subqueries. Every transformation readable top to bottom.
- Every model documented in
.ymlwith test definitions. A model without tests/docs doesn't ship.
BigQuery craft that matters in this domain:
DATE_TRUNC(date, WEEK(MONDAY))— week buckets on Monday, consistent with the weekly cadence.APPROX_QUANTILES(col, 2)[OFFSET(1)]— median without a full sort, scales on large volumes.SAFE_DIVIDE()everywhere — price ratios and percentages don't break on zero denominators.
Infrastructure & operations
The platform is serverless and fully declarative.
- Cloud Run Jobs for the scrapers — no idle compute, pay per run.
- Cloud Workflows orchestrates the weekly chain: snapshot crawl → archive scrape → dbt run, with the dependencies explicit.
- Terraform in production. Every GCP resource — buckets, datasets, service accounts, IAM bindings, scheduler triggers — lives in code. No console click that isn't in the state.
- Secret Manager. Proxy keys and credentials via environment variables, fed from Secret Manager. Nothing hardcoded, nothing in the repo.
- Least-privilege IAM per workload. Each scraper runs under its own service account with exactly the permissions it needs, no more.
- GitHub Actions CI. Tests run (
pytest) before every deploy; Docker images build from a shared context so the shared library stays reachable.
Engineering decisions that matter
What lifts this project above "a script that scrapes":
- Two philosophies instead of one configurable behemoth. Robustness where it counts (archive), simplicity where it's allowed (snapshot). Complexity is placed, not avoided.
- Determinism as the foundation. Deterministic IDs enable idempotency, safe appends, and stable joins in one stroke. One design choice, three problems solved.
- Append-only with guardrails. History is sacred; truncate does not exist. Anomaly detection keeps a broken run from silently polluting the archive.
- The shared library as contract. New scrapers inherit robustness instead of rewriting it.
- Everything IaC, least-privilege, secrets out of the code. Production is reproducible and auditable; the infrastructure itself is the proof of discipline.
Status
The platform runs in production on the weekly cadence. The dbt models feed tested dashboard views; the time series keeps growing append-only. Next steps are in the presentation layer — live dashboards and exposures on top of fct_competitor__departures.