About
My default has always been to get something real into production as fast as possible, then improve from there. That rhythm shaped how I think about both data engineering and data science: define what's enough to be useful, ship it, watch how it behaves, sharpen it.
I've built most of my career in small data teams of two or three people. That context shaped how I work: I own the full stack across data engineering and modeling, which means a faster path from question to answer, and answers that hold up in production.
I work at cloud and big-data scale, across the full data stack. In the last year I've added Claude Code to my workflow, and it has genuinely changed how I work. I spend more time on the parts that require judgment, and I'm constantly testing in practice how far AI can go in data work: what it can replace, what it can augment, and where human judgment still matters more than people think.
I care about clean abstractions, reproducible environments, and systems that explain themselves. The right system multiplies everything built on top of it. Getting that foundation right is always worth the time.
When I'm not wrangling pipelines, I'm reading, writing about data strategy, or exploring how technology shapes the way we make sense of things.
Experience
2024 – PRESENT
Senior Data Scientist · Madbox↗
Principal owner of end-to-end data pipelines, data infrastructure, and data governance across the full GCP stack. Developed LTV modeling using chain-ratio decomposition and maximum likelihood estimation, directly improving UA bid strategies. Built churn models to improve user retention and performed uplift analysis to measure their causal effect.
BI Engineer · Madbox↗
Migrated all data models from Airflow-orchestrated BigQuery pipelines to dbt. Optimized the Pub/Sub event streaming pipeline, halving infrastructure costs. Designed the iOS attribution pipeline post-ATT and defined the Conversion Value schema using unsupervised machine learning.
Programmatic Data Analyst · Smadex↗
Automated ad processing, reporting, and API integrations in Python, significantly reducing manual operational time. Performed KPI analysis and visualization using AWS Athena and Tableau, driving data-informed strategy improvements.
Research Data Scientist · TALP, UPC↗
Developed two configurable Python algorithms for mining temporal association rules from clinical data, identifying sequential relationships between drugs, conditions, and diseases across patient visits. Coordinated clinical validation with 3 physicians, measuring inter-rater agreement via Fleiss' Kappa.
Data Scientist Intern · Predictheon↗
Collected and synchronized high-resolution EEG and anesthesia data for signal analysis. Built a logistic regression model to detect burst-suppression proneness. Created a text parser to extract data from medical documents.
Projects
Banking Fraud Detection Pipeline ↗
End-to-end fraud detection on 13M credit card transactions with a LightGBM + focal-loss classifier served via FastAPI on Cloud Run. BigQuery/dbt feature layer, Terraform-managed GCP, and a LangChain agent that generates PDF reports.
BigQuery Air Quality Forecasting ↗
Multi-pollutant hourly forecasting across 25 Seoul stations using a LightGBM ensemble with conformalized quantile regression for calibrated prediction intervals. Supervised anomaly detection, dbt on BigQuery, FastAPI on Cloud Run.
Seoul Air Quality Dashboard ↗
Interactive Next.js 16 frontend for the BigQuery Air Quality Forecasting project above. Six views covering timeseries, geospatial, forecasts, and anomalies. Built on ECharts and MapLibre GL with URL-shareable state over a BigQuery backend.
Music Streaming Churn Prediction ↗
Churn prediction on the KKBox WSDM dataset (~970K users) with temporal holdout validation reaching 0.924 ROC-AUC. LightGBM tuned via Optuna, 36 dbt-engineered features over DuckDB, SHAP explainability.
Session Recommender with LambdaRank ↗
Two-stage session recommender for an Inditex hackathon: multi-signal candidate generation (co-visitation, Item2Vec, CV embeddings) plus a LightGBM LambdaRank reranker. NDCG@5 of 0.377 under 93% cold-start sessions.
Temporal Association Rules for Multimorbidity ↗
Two configurable algorithms for mining temporal association rules from electronic health records, uncovering sequential patterns between drugs, conditions, and diseases across patient visits. Published in BMC Medical Informatics and Decision Making.
Products
sessum_ai↗
Transforms Claude Code session transcripts into a structured knowledge base in Obsidian. Bronze/Silver/Gold processing pipeline with cross-project concept deduplication and zero additional API cost.
Kindle Highlights↗
Web app for organizing and searching Kindle highlights. Parses clippings files, deduplicates at the database layer, and renders with an amber highlighter effect.
ccloquells↗
Portfolio website for Catalan writer and translator Carme Cloquells Tudurí. Responsive, SEO-optimized, with dynamic routing for books and publications.
beacon-dqa (WIP)
Test framework for data pipelines with integrated AI resolution.
