Postdoctoral researcher, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil
Abstract: The development of new techniques for cancer diagnosis is of great importance. In this work we present a new approach to diagnosing mammary carcinoma in canine mammary gland samples using spectroscopy combined with machine learning. Mammary gland samples from female dogs were measured by variable angle spectroscopic ellipsometry (VASE). Measurements were carried out at 6 different incidence angles over a spectral range from 245 nm to 1700 nm. Samples were obtained from 12 different dogs and measured in duplicate. In this way, a large data set of approximately 0.6 million measured data points was obtained for each sample. The traditional approach to ellipsometry data analysis, which involves constructing and fitting an optical model of the sample, was not pursued here because of the great complexity of the samples. Instead, a machine learning approach was used for binary classification of the samples. Five machine learning models were considered: K-Nearest Neighbors (KNN) [1], Logistic Regression (LR) [2], Support Vector Machine (SVM) [3], eXtreme Gradient Boosting (XGBoost) [4], and Multilayer Perceptron (MLP) [5]. Parameters and hyperparameters of each binary classifier were optimized with the Optuna library. Within the cross-validation procedure, several steps were applied to improve the reliability of the results: outlier identification using the KNN algorithm, class rebalancing with SMOTE, and feature selection with the Select From Model technique. After optimization, each model was evaluated with 30 repetitions of 5-fold cross-validation, yielding scores for several validation metrics. The best classification model achieved an area under the ROC curve (AUC) of 0.89.
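The evaluation scheme described above (feature selection inside cross-validation, followed by 30 repetitions of 5-fold cross-validation scored by AUC) can be sketched with scikit-learn as follows. This is only an illustrative sketch under stated assumptions: the data are synthetic stand-ins for the ellipsometry features, Logistic Regression is used as the sole example classifier, and the outlier removal, SMOTE rebalancing, and Optuna search mentioned in the abstract are omitted. The variable names are hypothetical and do not come from the original work.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the ellipsometry feature matrix and binary labels
# (assumption: the real features are not given in the abstract).
X, y = make_classification(
    n_samples=120, n_features=40, n_informative=8, random_state=0
)

# Scaling and feature selection live inside the pipeline, so they are refit
# on each training fold and never see the held-out fold.
selector = SelectFromModel(LogisticRegression(max_iter=1000))
model = make_pipeline(
    StandardScaler(), selector, LogisticRegression(max_iter=1000)
)

# 30 repetitions of stratified 5-fold cross-validation = 150 AUC scores.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=30, random_state=0)
auc_scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)

print(f"folds evaluated: {auc_scores.size}")
print(f"mean AUC: {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")
```

Keeping every data-dependent step inside the pipeline is the design choice that matters here: a resampling step such as SMOTE would likewise have to be fitted on the training folds only, otherwise information from the held-out fold leaks into the model and inflates the reported AUC.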
We would like to acknowledge the Brazilian agencies CNPq (grants 409327/2022-0 and 302632/2022-0) and FAPEMIG (grant RED-00135-22) for their financial support of this work.
References
1. Fix, E. and J. L. Hodges, “Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties,” California Univ., Berkeley. Electronics Research Laboratory, 1951.
2. Cox, D. R., “The regression analysis of binary sequences,” Journal of the Royal Statistical Society: Series B (Methodological), Vol. 20, No. 2, 215–232, 1958.
3. Cortes, C. and V. Vapnik, “Support-vector networks,” Mach. Learn., Vol. 20, No. 3, 273–297, 1995.
4. Chen, T. and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794, 2016.
5. Haykin, S., “Neural networks: A comprehensive foundation,” Prentice Hall PTR, 1998.