One of the ideas I wanted to explore in this project was simple: how much does data preprocessing really affect the performance of Machine Learning models? In clinical prediction problems, this question becomes especially relevant. A model may achieve good overall accuracy, but still fail to detect the most important cases: patients at risk. For that reason, I wanted to focus not only on accuracy, but also on metrics such as recall, F1-score and the behaviour of the model on minority classes. The datasets For this project, I worked with three public clinical datasets: Diabetes Dataset : used to predict diabetes from variables such as glucose, blood pressure, insulin, BMI and age. Healthcare Stroke Dataset : focused on predicting stroke risk using demographic, clinical and lifestyle-related variables. Thyroid Disease Dataset : related to thyroid disease detection using clinical, hormonal and categorical features. Each dataset presented different challenges.…