Diabetes Prediction Over Telephonic Health Survey
Abstract
This project addresses the task of predicting whether a person has diabetes. The study uses the Behavioral Risk Factor Surveillance System (BRFSS)1 dataset, which has three class labels: 0 (no diabetes or only during pregnancy), 1 (prediabetes), and 2 (diabetes). The distribution of instances across these classes is imbalanced. The training dataset contains no missing values, so no imputation is needed. However, the high class imbalance and the presence of categorical features pose challenges. Experiments show that the Random Forest model outperforms the other ML models evaluated. Nevertheless, its performance is only fair rather than optimal, a limitation that can be attributed to the substantial class imbalance within the dataset.
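
As a rough illustration of what class imbalance means for the three labels described above, the following minimal sketch computes the share of each class and the ratio between the largest and smallest class. The helper names (`class_distribution`, `imbalance_ratio`) and the toy label list are hypothetical and chosen only for illustration; they are not the BRFSS data or the study's actual code.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: fraction of instances} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


def imbalance_ratio(labels):
    """Return the largest class count divided by the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical toy labels, NOT the BRFSS data:
# 0 = no diabetes (or only during pregnancy), 1 = prediabetes, 2 = diabetes
toy_labels = [0] * 8 + [1] * 1 + [2] * 1

distribution = class_distribution(toy_labels)
ratio = imbalance_ratio(toy_labels)

print(distribution)
print(ratio)
```

A ratio well above 1 signals that a classifier can reach high accuracy by favoring the majority class, which is one reason accuracy alone can be a misleading measure on data like this.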