Publications

Automating Data Lineage and Pipeline Extraction • Proceedings of the VLDB Endowment • August 2024 • PDF

Abstract: Jupyter Notebooks are widely used in modern data science environments. They allow data professionals to create models, analyze data, and build data pipelines. With an increasing focus on research areas such as explainability and fairness in machine learning, there is a need to understand the relationship between the data and the model in ad-hoc project setups. This doctoral research aims to automate the extraction of pipelines from Jupyter Notebooks and the derivation of data lineage from those pipelines without executing the notebook. The goal is to develop a set of tools that identify all datasets, transformations, models, and columns that contribute to model training inside a notebook, without human intervention or execution of the pipelines.
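
To make the idea of deriving lineage without execution concrete, here is a minimal sketch using Python's standard `ast` module. It is not the method from the paper. It only illustrates the general approach of reading notebook code statically. The `READERS` set, the `find_dataset_reads` function, and the sample cells are hypothetical names chosen for this illustration, and the sketch assumes cell sources are already available as plain strings.

```python
import ast

# Hypothetical set of reader function names treated as dataset loads.
READERS = {"read_csv", "read_parquet", "read_json"}


def find_dataset_reads(cell_sources):
    """Return (variable, reader, path) triples for assignments such as
    df = pd.read_csv("x.csv"), found by parsing the code without running it."""
    found = []
    for source in cell_sources:
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # Cells with IPython magics (e.g. %matplotlib) are not valid Python.
            continue
        for node in ast.walk(tree):
            if not isinstance(node, ast.Assign):
                continue
            call = node.value
            if not (isinstance(call, ast.Call) and isinstance(call.func, ast.Attribute)):
                continue
            if call.func.attr not in READERS:
                continue
            # Record the path only when it is a literal string argument.
            path = None
            if call.args:
                first = call.args[0]
                if isinstance(first, ast.Constant) and isinstance(first.value, str):
                    path = first.value
            for target in node.targets:
                if isinstance(target, ast.Name):
                    found.append((target.id, call.func.attr, path))
    return found


# Hypothetical notebook cells, given as source strings.
cells = [
    "import pandas as pd\ntrain = pd.read_csv('train.csv')",
    "%matplotlib inline",
    "X = train.drop(columns=['label'])",
]
reads = find_dataset_reads(cells)
print(reads)
```

A sketch like this only finds literal file paths in direct assignments. Tracking transformations, models, and column usage across cells would need dataflow analysis well beyond this example.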

Certificates

Professional Cloud Architect

Google Cloud • March 2025 • Credly

Professional Machine Learning Engineer

Google Cloud • December 2024 • Credly

Projects

APEX-DAG: Automating Pipeline EXtraction with Dataflow, Static Code Analysis, and Graph Attention Networks

January 2025 • GitHub

CLEAN-SSM: CLEANing data lakes in a Self-Supervised Manner

September 2023 • GitHub

EP Calendar: React Native Calendar Library

February 2021 • GitHub

Better Habits: React Native Habit Tracker

March 2020

Social Clouds: Self-Hosted Social Network Software

March 2017

Contact