Marcel Pons — Senior Data Scientist

About

My default has always been to get something real into production as fast as possible, then improve from there. That rhythm shaped how I think about both data engineering and data science: define what's enough to be useful, ship it, watch how it behaves, sharpen it.

I've built most of my career in small data teams of two or three people. That context shaped how I work: I own the full stack across data engineering and modeling, which means a faster path from question to answer, and answers that hold up in production.

I work at cloud and big-data scale, across the full data stack. In the last year I've added Claude Code to my workflow, and it has changed how I work. I spend more time on the parts that require judgment, and I'm constantly testing in practice how far AI can go in data work: what it can replace, what it can augment, and where human judgment still matters more than people think.

I care about clean abstractions, reproducible environments, and systems that explain themselves. The right system multiplies everything built on top of it. Getting that foundation right is always worth the time.

When I'm not wrangling pipelines, I'm reading, writing about data strategy, or exploring how technology shapes the way we make sense of things.

Experience

2024 — PRESENT

Senior Data Scientist · Madbox

Primary owner of end-to-end data pipelines, data infrastructure, and data governance across the full GCP stack. Developed LTV models using chain-ratio decomposition and maximum likelihood estimation, directly improving user acquisition (UA) bid strategies. Built churn models to improve user retention and performed uplift analysis to measure their causal effect.

Python · BigQuery · dbt · LTV Modeling · GCP

2022 — 2024

BI Engineer · Madbox

Migrated all data models from Airflow-orchestrated BigQuery pipelines to dbt. Optimized the Pub/Sub event streaming pipeline, halving infrastructure costs. Designed the iOS attribution pipeline post-ATT and defined the Conversion Value schema using unsupervised machine learning.

BigQuery · dbt · Airflow · Pub/Sub · Python

2021 — 2022

Programmatic Data Analyst · Smadex

Automated ad processing, reporting, and API integrations in Python, significantly reducing manual operational time.
Performed KPI analysis and visualization using AWS Athena and Tableau, driving data-informed strategy improvements.

Python · AWS Athena · Tableau · BI

2021 — 2022

Research Data Scientist · TALP, UPC

Developed two configurable Python algorithms for mining temporal association rules from clinical data, identifying sequential relationships between drugs, conditions, and diseases across patient visits. Coordinated clinical validation with 3 physicians, measuring inter-rater agreement via Fleiss' Kappa.

Python · Data Mining · Slurm · LaTeX

2018 — 2019

Data Scientist Intern · Predictheon

Collected and synchronized high-resolution EEG and anesthesia data for signal analysis. Built a logistic regression model to detect burst-suppression proneness. Created a text parser to extract data from medical documents.

Python · ETL · Logistic Regression · NLP

Projects

01/

Banking Fraud Detection Pipeline

End-to-end fraud detection on 13M credit card transactions with a LightGBM + focal-loss classifier served via FastAPI on Cloud Run. BigQuery/dbt feature layer, Terraform-managed GCP, and a LangChain agent that generates PDF reports.

LightGBM · BigQuery · FastAPI · GCP

02/

BigQuery Air Quality Forecasting

Multi-pollutant hourly forecasting across 25 Seoul stations using a LightGBM ensemble with conformalized quantile regression for calibrated prediction intervals. Supervised anomaly detection, dbt on BigQuery, FastAPI on Cloud Run.

BigQuery · LightGBM · Forecasting · Conformal Prediction

03/

Seoul Air Quality Dashboard

Interactive Next.js 16 frontend for the BigQuery Air Quality Forecasting project above. Six views covering timeseries, geospatial, forecasts, and anomalies. Built on ECharts and MapLibre GL with URL-shareable state over a BigQuery backend.

Next.js · ECharts · MapLibre · BigQuery

04/

Music Streaming Churn Prediction

Churn prediction on the KKBox WSDM dataset (~970K users) with temporal holdout validation reaching 0.924 ROC-AUC. LightGBM tuned via Optuna, 36 dbt-engineered features over DuckDB, SHAP explainability.

LightGBM · DuckDB · dbt · SHAP

05/

Session Recommender with LambdaRank

Two-stage session recommender for an Inditex hackathon: multi-signal candidate generation (co-visitation, Item2Vec, CV embeddings) plus a LightGBM LambdaRank reranker. NDCG@5 of 0.377 with 93% of sessions being cold-start.

LambdaRank · Item2Vec · LightGBM · dbt

06/

Temporal Association Rules for Multimorbidity

Two configurable algorithms for mining temporal association rules from electronic health records, uncovering sequential patterns between drugs, conditions, and diseases across patient visits. Published in BMC Medical Informatics and Decision Making.

Python · Data Mining · NLP · Clinical Data

Products

sessum_ai

Transforms Claude Code session transcripts into a structured knowledge base in Obsidian. Bronze/Silver/Gold processing pipeline with cross-project concept deduplication and zero additional API cost.

Python · SQLite · Claude Code · Obsidian

Kindle Highlights

Web app for organizing and searching Kindle highlights. Parses clippings files, deduplicates at the database layer, and renders with an amber highlighter effect.

Next.js · PostgreSQL · TypeScript

ccloquells

Portfolio website for Catalan writer and translator Carme Cloquells Tudurí. Responsive, SEO-optimized, with dynamic routing for books and publications.

Next.js · TypeScript · Tailwind

beacon-dqa (WIP)

Testing framework for data pipelines with integrated AI-assisted resolution.

Python · dbt · AI