A Systematic Comparison of Horizontal Federated Learning Algorithm Based on Random Forests in a Medical Setting
-
Abstract
The medical industry generates vast amounts of data suitable for machine learning during patient-clinician interactions in hospitals. However, data protection regulations such as the General Data Protection Regulation (GDPR) prevent patient data from being shared freely across institutions. In such cases, federated learning (FL) is a viable option, in which a global model learns from multiple data sites without moving the data. In this paper, we focused on random forests (RFs) for their effectiveness in classification tasks and their widespread use throughout the medical industry, and compared two popular federated random forest aggregation algorithms on horizontally partitioned data. We first provided the necessary background on federated learning, the advantages of random forests in a medical context, and the two aggregation algorithms. We then performed an extensive series of experiments on four public binary classification medical datasets (an excerpt of MIMIC-III, the Pima Indians Diabetes dataset from Kaggle, and the diabetic retinopathy and heart failure datasets from the UCI Machine Learning Repository) to systematically compare the two algorithms on equal-sized, unequal-sized, and class-imbalanced clients. A follow-up investigation into the effect of increasing the number of clients was also conducted. Finally, we empirically analyzed the advantages of federated learning and concluded that the weighted merge algorithm produces models with, on average, a 1.903% higher F1 score and a 1.406% higher AUC-ROC value.
-