Ischemic heart disease (IHD) has been identified as a leading cause of death globally (1). Compelling evidence showed that lifestyle changes could be effective strategies for the secondary prevention of IHD (2). Therefore, to reduce the burden of IHD mortality, an efficient tool for IHD screening and early diagnosis is warranted. A machine learning algorithm that is developed with serum metabolites, cardiometabolic biomarkers, and self-reported phenotypic data is promising in simplifying the process and reducing the cost of IHD screening/diagnosis. IHD status could be accurately detected with a simple blood draw and metabolomic profiling. In this project, I aim to develop such an algorithm using data from a European population.
I will use data from the MetaCardis consortium that recruited participants aged 18-75 years from Denmark, France, and Germany (3). The data was published early this year as the supplementary material of an article on Nature Medicine (3). The original study included 372 individuals with IHD. These IHD cases were further classified into acute coronary syndrome (n = 112), chronic ischemic heart disease (n = 158), and heart failure (n = 102). With a case-control design, the study also included 3 groups of controls matched on various factors. The raw data includes 1,882 observations, including repeated records with the same participant ID but different case-control statuses.
For this project, I will use records from the 372 IHD cases and 372 controls matched on type 2 diabetes (T2D) status and body mass index (BMI). I will also extract data for age, gender, fasting plasma triglycerides, adiponectin, and CRP, systolic and diastolic blood pressure, left ventricular ejection fraction, physical activity level, and 1,513 log-transformed values of serum metabolites.
After reading in the data, I first filtered the observations to keep
the IHD cases and their controls matched by T2D status and BMI. I then
merged metabolites data with cardiometabolic biomarkers and
self-reported phenotypic data to create a main dataset with
744 rows and 1522 columns. I noticed that several participants do not
have any metabolites data and, therefore, need to be removed.
Additionally, around 30% of participants had missing values for left
ventricular ejection fraction and physical activity level. Many machine
learning techniques could not be implemented with that many missing
values, and it would also not be appropriate to replace the missing
values with any arbitrarily selected value. So, I removed these two
potential predictors from my analyses. Finally, for variables with less
than 10% missing data, I replaced the missing values with the median of
the non-missing data. The cleaned main dataset had 603 rows
and 1522 columns. The first 6 rows of the cleaned main
dataset were printed in the below.
## # A tibble: 6 × 1,522
## ID case age tag adipon…¹ crp sbp dbp Gender acetate acetone
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 x14MCx1158 0 48 1.00 5.01 0.897 104 60.5 0 -3.91 -3.91
## 2 x14MCx2932 0 49 1.00 4.03 1.11 111 70 0 -3.91 -3.22
## 3 x14MCx2498 0 54 1.48 6.26 2.05 106. 68.5 1 -3.51 -3.51
## 4 x14MCx2237 0 47 0.787 3.44 0.67 138 78 1 -3.91 -4.95
## 5 x30MCx1828 0 66 0.59 11.0 0.427 110. 65.5 0 -3.91 -3.91
## 6 x30MCx1314 0 54 1.41 2.6 1.4 128. 75.5 1 -2.81 -3.91
## # … with 1,511 more variables: artemisin <dbl>, `beta-sitosterol` <dbl>,
## # betaine <dbl>, `betaine-aldehyde` <dbl>, butyrylcarnitine <dbl>,
## # catechol <dbl>, cellotetraose <dbl>, choline <dbl>, `D-trehalose` <dbl>,
## # `D-lyxose` <dbl>, `D-malate` <dbl>, `D-sorbitol` <dbl>, `D-threitol` <dbl>,
## # decanoylcarnitine <dbl>, glyceraldehyde <dbl>, ethanol <dbl>,
## # ethanolamine <dbl>, formate <dbl>, glucoheptonate <dbl>, glycolate <dbl>,
## # halostachine <dbl>, hydroquinone <dbl>, isovalerylcarnitine <dbl>, …
I then preprocessed the data to remove non-informative predictors with near-zero variances. Given that I planned to train at least one of my algorithms with regression, it would be better to have more predictors normally distributed so that model efficiencies could be improved. I tested the normality of each predictor with the Shapiro-Wilks Test and summarized the p-values. I found that only 101 predictors are normally distributed. It is also noteworthy that the metabolite values from the raw data were all log-transformed. Obviously, log transformation did not normalize the distributions successfully. So, I transformed all metabolite values back to the original scale and used rank-based inverse normal transformation (INT) to normalize the distributions instead. As examples, histograms showing the distributions of oleoylcarnitine (C18:1) and S-methylcysteine sulfoxide before and after the transformation were shown (Figure 1-2). I ended up having 840 predictors normalized successfully.




The outcome that my algorithm aimed to predict is the binary IHD status (non-case = 0, case = 1). Considering that I had 1,422 predictors, I would use principal component analysis (PCA) to reduce dimensions. I would keep principal components that account for at least 70% of variability as new predictors and train an algorithm with logistic regression and an algorithm with K-nearest neighbor (KNN). Given that the principal components are hard to interpret, and algorithms developed based on PCA could be difficult to implement, I would train another KNN-based algorithm with all 1,422 predictors instead. Random forest would be the 4th training method I would use. Finally, I would use an ensemble to combine the results of all four algorithms. For all algorithms, I would evaluate the overall accuracy, sensitivity, specificity, \(F_1\) score, and ROC curve. I would use a \(\beta\) of 2 to calculate the \(F_1\) score because higher sensitivity is more important than high specificity when predicting disease. In other words, a false positive will be less costly than a false negative in this scenario. I would also use cross-validation and bootstrapping to tune the model parameters.
For all the model training and fitting, I partitioned the
main dataset, which includes IHD case status and all
predictors, into a training (train_set) and a testing
(test_set) dataset. Matrices for predictors and cases were
also created. I then trained and assessed the models with the following
4 approaches: 1) Logistic regression with principal components as
predictors (PCA + Logistic); 2) K-nearest neighbors with principal
components as predictors (PCA + KNN); 3) K-nearest neighbors with serum
metabolites, cardiometabolic biomarkers, and self-reported phenotypic
data as predictors (KNN); and 4) Random forest with serum metabolites,
cardiometabolic biomarkers, and self-reported phenotypic data as
predictors (RF).
The PCA in the training set generated 483 principal components (PCs) from 1,422 predictors, including age, gender, fasting plasma triglycerides, adiponectin, and CRP, systolic and diastolic blood pressure, and 1,415 inverse normal transformed serum metabolites. After evaluating the proportion of variance explained by each PC, I selected the first 69 PCs that accounted for 70% of the total variance as new predictors. The proportion of variance explained by each of the first 69 PCs was printed below. I fitted a logistic regression with IHD cases as the dependent variable and the 69 PCs as the independent variables. For the logistic regression, there was no model parameter to tune. To make predictions in the testing set, I used the PC rotations to transform all 1,422 predictors in the testing set into 483 PCs and kept the first 69 PCs. The logistic regression estimates were then used to predict the probability of having IHD cases in the testing set. Participants with a predicted probability of having an IHD over 0.5 were defined as predicted IHD cases.
The overall accuracy of my predicted IHD cases from the logistic regression was 0.875 with a 95% confidence interval of (0.802, 0.928). This algorithm had a sensitivity of 0.892, a specificity of 0.854, and an \(F_1\) score of 0.890. I also plotted the ROC and observed an area under the curve (AUC) of 0.946, which was very high (Figure 3).
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 0.09528 0.14187 0.17858 0.21450 0.24215 0.26855 0.29403 0.31393 0.33268 0.35060
## PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20
## 0.36595 0.37976 0.39264 0.40522 0.41732 0.42849 0.43910 0.44940 0.45886 0.46815
## PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30
## 0.47727 0.48587 0.49381 0.50149 0.50892 0.51619 0.52337 0.53013 0.53645 0.54261
## PC31 PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40
## 0.54863 0.55446 0.56017 0.56585 0.57135 0.57666 0.58171 0.58672 0.59164 0.59644
## PC41 PC42 PC43 PC44 PC45 PC46 PC47 PC48 PC49 PC50
## 0.60116 0.60573 0.61013 0.61444 0.61872 0.62298 0.62701 0.63098 0.63494 0.63877
## PC51 PC52 PC53 PC54 PC55 PC56 PC57 PC58 PC59 PC60
## 0.64255 0.64624 0.64986 0.65345 0.65694 0.66041 0.66380 0.66717 0.67048 0.67372
## PC61 PC62 PC63 PC64 PC65 PC66 PC67 PC68 PC69
## 0.67691 0.68005 0.68315 0.68620 0.68922 0.69224 0.69519 0.69808 0.70096
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 58 8
## 1 7 47
##
## Accuracy : 0.875
## 95% CI : (0.8022, 0.9283)
## No Information Rate : 0.5417
## P-Value [Acc > NIR] : 5.161e-15
##
## Kappa : 0.7479
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8923
## Specificity : 0.8545
## Pos Pred Value : 0.8788
## Neg Pred Value : 0.8704
## Prevalence : 0.5417
## Detection Rate : 0.4833
## Detection Prevalence : 0.5500
## Balanced Accuracy : 0.8734
##
## 'Positive' Class : 0
##
## [1] 0.8895706

I then used KNN to train the algorithm with the 69 PCs as predictors. To select the parameter K that maximizes the accuracy, I used 10-fold cross-validation with bootstrapping as the resampling scheme. Given that I have already reduced the dimension to 69 and we only have 603 observations, I did not worry much about the computation time of using 10-fold cross-validation. I fitted the model with K values from 2 to 100 with 20 as the increment. After plotting the accuracy under different K values, I was not able to identify a clearly optimized K, given that the curve of accuracy did not go down within the specified K range (Figure 4a). Therefore, I fitted the model with K values from 5 to 150 with 10 as the increment instead. I identified 75 as the K for the maximum accuracy and fitted the model again with this value (Figure 4b). The fitted KNN model was then used to predict the IHD cases in the testing set.
Using the combination of PCA and KNN, the overall accuracy of my predicted IHD cases was 0.842 with a 95% confidence interval of (0.764, 0.902). Compared to the algorithm developed with PCA and logistic regression, this algorithm had a higher sensitivity of 0.923, a lower specificity of 0.746, and a higher \(F_1\) score of 0.898. I plotted the ROC and observed an AUC of 0.889 (Figure 5).


## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 15
## 1 5 40
##
## Accuracy : 0.8333
## 95% CI : (0.7544, 0.8951)
## No Information Rate : 0.5417
## P-Value [Acc > NIR] : 1.512e-11
##
## Kappa : 0.6596
##
## Mcnemar's Test P-Value : 0.04417
##
## Sensitivity : 0.9231
## Specificity : 0.7273
## Pos Pred Value : 0.8000
## Neg Pred Value : 0.8889
## Prevalence : 0.5417
## Detection Rate : 0.5000
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.8252
##
## 'Positive' Class : 0
##
## [1] 0.8955224

The previous two algorithms developed based on selected PCs already performed well in predicting IHD cases. However, people who want to implement these two algorithms have to use the PCA rotations to transform their data first. That could increase the burden of using these algorithms, particularly in clinical settings. Also, the PCs no longer have biological meaning and, therefore, could be difficult to interpret. With these concerns, I developed another KNN-based algorithm with 1,422 predictors, including 1,415 serum metabolites.
Given that the sample size of my study is not large, I used 10-fold cross-validation with bootstrapping as the resampling scheme to select the parameter K again. I found 65 as the K that maximized the model accuracy after fitting the model with K values from 5 to 150 with 10 as the increment (Figure 6). I then fitted the model in the training set and predicted the IHD cases in the testing set. The overall accuracy of my predicted IHD cases was 0.800 with a 95% confidence interval of (0.717, 0.868). Compared to the algorithm developed with PCA and KNN, this KNN algorithm had a slightly higher sensitivity of 0.939. But the specificity dropped to 0.636. The \(F_1\) score was 0.894. I plotted the ROC and observed an AUC of 0.897 (Figure 7).

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 20
## 1 4 35
##
## Accuracy : 0.8
## 95% CI : (0.7172, 0.8675)
## No Information Rate : 0.5417
## P-Value [Acc > NIR] : 3.087e-09
##
## Kappa : 0.588
##
## Mcnemar's Test P-Value : 0.0022
##
## Sensitivity : 0.9385
## Specificity : 0.6364
## Pos Pred Value : 0.7531
## Neg Pred Value : 0.8974
## Prevalence : 0.5417
## Detection Rate : 0.5083
## Detection Prevalence : 0.6750
## Balanced Accuracy : 0.7874
##
## 'Positive' Class : 0
##
## [1] 0.8944282

The last approach I used to train my model was random forest. It was more computationally intensive because predictors had to be randomly selected using bootstrapping to predict a single tree. To stabilize accuracy, hundreds of trees might need to be predicted. Also, I had to change the number of predictors being sampled at each bootstrap iteration to find the one that maximized the accuracy. Therefore, I started training the model with 15 trees and tuning the number of predictors to be sampled between 10 and 1000 with 100 as the increment. I implemented a 5-fold cross-validation. The plot of error against the number of trees showed that the accuracy improved as I added more trees and stabilized at around 100 trees (Figure 8a & 9a). In my second attempt, I changed the number of trees to be predicted to be 100. The plot of accuracy by the number of randomly sampled predictors did show a maximum point. However, it seems that the range of 10 to 1000 predictors was too large (Figure 8b & 9b). So, I further tuned the number of predictors to be sampled from 10 to 500 with 20 as the increment. It turned out that randomly sampling 150 predictors and predicting 100 trees maximized and stabilized the accuracy of model prediction (Figure 8c & 9c).
The overall accuracy of my predicted IHD cases from the random forest model was 0.900 with a 95% confidence interval of (0.832, 0.947). This algorithm had a high sensitivity of 0.939, a high specificity of 0.855, a high \(F_1\) score of 0.927, and a high AUC of 0.958 (Figure 10).






## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 6
## 1 5 49
##
## Accuracy : 0.9083
## 95% CI : (0.8419, 0.9533)
## No Information Rate : 0.5417
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8151
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9231
## Specificity : 0.8909
## Pos Pred Value : 0.9091
## Neg Pred Value : 0.9074
## Prevalence : 0.5417
## Detection Rate : 0.5000
## Detection Prevalence : 0.5500
## Balanced Accuracy : 0.9070
##
## 'Positive' Class : 0
##
## [1] 0.9202454

In this project, I aimed to develop an algorithm that uses serum metabolites, cardiometabolic biomarkers, and self-reported phenotypic data to predict ischemic heart disease (IHD) status in a European population. I obtained my data from a paper published early this year on Nature Medicine (3). For data preprocessing, I removed observations with missing metabolite measures and predictors with at least around 30% of missing data. For predictors with a small amount of missing data, I replaced the missing values with median values. Additionally, predictors with near-zero variance were also excluded. I used 4 approaches to train my model. The first two approaches used PCA to reduce dimension from 1,422 predictors. A logistic regression and a KNN-based algorithm were trained and fitted with the selected 69 PCs. The 3rd approach was also based on KNN but fitted the model with the 1,422 predictors. The last approach used random forest to develop the algorithm. I summarized the sensitivity, specificity, overall accuracy, \(F_1\) score, and AUC of all models in a table (Table 1). I also conducted an ensemble to combine results from the KNN model and random forest model and showed the performance at the end of the table. ROC curves were plotted on the same figure for comparison (Figure 11 & 12).
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 60 9
## 1 5 46
##
## Accuracy : 0.8833
## 95% CI : (0.812, 0.9347)
## No Information Rate : 0.5417
## P-Value [Acc > NIR] : 8.505e-16
##
## Kappa : 0.7637
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.9231
## Specificity : 0.8364
## Pos Pred Value : 0.8696
## Neg Pred Value : 0.9020
## Prevalence : 0.5417
## Detection Rate : 0.5000
## Detection Prevalence : 0.5750
## Balanced Accuracy : 0.8797
##
## 'Positive' Class : 0
##
## [1] 0.9118541


According to the table, the two algorithms developed with PCs had lower sensitivity than those trained with the original predictors. The KNN and random forest algorithms both had a very high sensitivity of 0.938. The two algorithms developed with KNN had lower specificity than the others. The random forest model also had a relatively high specificity of 0.855. The KNN-based algorithm with biologically meaningful predictors had the lowest overall accuracy, while the random forest model had the highest overall accuracy. The algorithms with PCs as predictors and used logistic regression for fitting had an overall accuracy of 0.875, while the one using PCs and KNN had an accuracy of 0.842. When evaluating with \(F_1\) score, the random forest model performed the best while the rest three models performed similarly. Finally, the random forest model had the highest AUC, followed by the PCA + KNN model. It is interesting that the ensemble of KNN and random forest did not further improve the model performance. In conclusion, the algorithm developed with random forest performed the best in all measures (Table 1).
| Sensitivity | Specificity | Overall accuracy | F_1 score | AUC | |
|---|---|---|---|---|---|
| PCA + Logistic | 0.892 | 0.855 | 0.875 | 0.890 | 0.946 |
| PCA + KNN | 0.923 | 0.727 | 0.833 | 0.896 | 0.889 |
| KNN | 0.938 | 0.636 | 0.800 | 0.894 | 0.897 |
| Random forest | 0.923 | 0.891 | 0.908 | 0.920 | 0.953 |
| Ensemble of KNN & RF | 0.923 | 0.836 | 0.883 | 0.912 | 0.942 |
My analysis that used 4 different approaches to train the algorithm is successful. I identified the random forest algorithm as the best among the 4 according to all 5 measures. The sensitivity of predicting IHD is particularly high. It is particularly important because we don’t want to miss any IHD cases if the patient really has IHD. Early detection could improve the prognosis and lower mortality. Moreover, the high sensitivity of my algorithm is not at the cost of low specificity. The specificity is also reasonably high. That could lower the probability of identifying healthy people as IHD cases and avoid overtreatment. It is also great that a random forest-based algorithm does not require extensive data transformation. Future implementation in clinical settings could be more efficient. If I had more time to spend on this project, I would look for metabolomics data in other populations and develop an algorithm that has higher generalizability.