I am writing a tabular data textbook for O'Reilly on building ML systems with a feature store.
I try to be opinionated about modelling - XGBoost is all you really need, but the challenges are more like you say - how to prevent data leakage (ASOF LEFT JOIN (or use a feature store)), separating model-independent data transformations from model-specific data transformations, APIs for things like time-series splits, logging, monitoring, building and operating the pipelines. All pretty standard software engineering in Python nowadays.
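To make the leakage point concrete, here is a minimal sketch of a point-in-time join. It uses pandas `merge_asof` as a stand-in for a SQL ASOF LEFT JOIN or a feature store lookup, so it is not how any particular feature store does it. The table and column names (`customer_id`, `event_time`, `feature_time`, `spend_30d`) and the helper `point_in_time_join` are made up for illustration.

```python
import pandas as pd

# Hypothetical label events: one row per (entity, time) we want a prediction for.
labels = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-10", "2024-02-10", "2024-01-20"]),
    "label": [0, 1, 0],
})

# Hypothetical feature values, each stamped with the time it became known.
features = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "feature_time": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-02-01"]
    ),
    "spend_30d": [10.0, 20.0, 30.0, 99.0],
})


def point_in_time_join(labels: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """For each label row, attach the latest feature row at or before event_time."""
    # merge_asof requires both frames to be sorted on their time keys.
    left = labels.sort_values("event_time")
    right = features.sort_values("feature_time")
    return pd.merge_asof(
        left,
        right,
        left_on="event_time",
        right_on="feature_time",
        by="customer_id",        # match within the same entity only
        direction="backward",    # never look at feature values from the future
    )


training_df = point_in_time_join(labels, features)
print(training_df)
```

Label rows with no earlier feature value keep NaN in the feature columns (left-join behaviour) instead of picking up a value from the future, which is the property that prevents leakage.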
Personally, I am very interested in building big data pipelines for machine learning, initially with batch and then with real-time ECG and seismic data, for CVD screening/early detection and earthquake early detection/prediction, respectively. Any idea when the completed book will be available?
Just wondering what the main difference is between your book and this book, Architecting Data and Machine Learning Platforms, also from O'Reilly?
Free chapters:
https://www.hopsworks.ai/lp/oreilly-book-building-ml-systems...