Data scientists have complained for many years about data preparation (data collection → data understanding → data cleaning → data enrichment → data integration → feature engineering). Although some effort has gone into solving this problem, a 2020 survey released by Anaconda shows that it persists: “Data preparation and cleansing takes valuable time away from real data science work and has a negative impact on overall job satisfaction.” More recently, Andrew Ng urged the AI community to shift from model-centric toward data-centric AI development.

In this talk, I will start by answering two fundamental questions: i) what makes data preparation hard? ii) why has this problem not been solved? Then, I will present DataPrep (http://dataprep.ai), a fast and easy-to-use Python library that addresses these challenges. The DataPrep library currently contains three components: a data connector component to simplify and accelerate data collection, an exploratory data analysis (EDA) component to enable fast data understanding, and a data cleaning component to clean and standardize data. I will describe their novel design and demonstrate how they can save data scientists significant time. Finally, I will share some lessons I learned from open-source software development.

Speaker bio:

Jiannan Wang (https://www.cs.sfu.ca/~jnwang) is an Associate Professor and the Director of the Professional Master’s Program in the School of Computing Science at Simon Fraser University. Prior to that, he was a postdoc in the AMPLab at UC Berkeley. He obtained his Ph.D. from Tsinghua University. He has over ten years of research experience in data preparation. His research contributions have won him the VLDB Best Experiments, Analysis & Benchmark Paper Award (2021), a CS-Can/Info-Can Outstanding Early Career Researcher Award (2020), an IEEE TCDE Rising Star Award (2018), an ACM SIGMOD Best Demonstration Award (2016), a Distinguished Dissertation Award from the China Computer Federation (2013), and a Google Ph.D. Fellowship (2011). He is a General Co-chair for VLDB 2023, a Ph.D. Symposium Track Chair for ICDE 2022, an Associate Editor for VLDB 2021, and a core PC member for SIGMOD 2019.

Public video of talk: https://www.youtube.com/watch?v=M3xO0vVIKV0