Publication:
Métodos bayesianos para comparar el funcionamiento de algoritmos sobre un conjunto de datos médicos (English: Bayesian methods for comparing the performance of algorithms on a medical data set)

Publication Date
2020
Abstract
One of the greatest challenges we face when adapting clustering algorithms to the diagnosis of immune disorders is selecting hyperparameters for unsupervised clustering algorithms in a way that is optimal for the problem under study. In this work we address this challenge by proposing a statistical assessment model that allows the empirical comparison of algorithms, an essential step in heuristic optimization. The statistical assessment adapts the Bayesian procedure proposed in [7] to compare the performance of algorithms on several test problems. Until now, researchers in the field of statistical assessment have relied on null hypothesis statistical tests. Lately, however, concerns about their use [5, 6] have emerged and, in many fields, other (Bayesian) alternatives are being considered. In this project, we propose a Bayesian analysis based on the Plackett-Luce model over rankings, which allows several algorithms to be considered at the same time. The main advantage of the proposed method is that it allows questions such as "what is the marginal probability that a given clustering algorithm is the best one?" to be answered directly. Furthermore, thanks to the nature of Bayesian analysis, it naturally provides information about the uncertainty that remains after the data have been observed. To test the proposed approach, we carry out two experiments. In the first, we use controlled scenarios to show, as a sanity check, that the model indeed provides the information we are looking for. To do so, instead of using actual rankings of algorithms, we simulate them by sampling from a probabilistic model defined over permutations. In particular, we consider a Mallows model with Kendall's distance.
In this way, we can set the number of algorithms and instances simulated and, more importantly, we can obtain the true marginal probabilities associated with the first position. The second experiment shows how the procedure can be applied to an actual comparison of algorithms in a real-life setting. We adapt clustering algorithms to immune disorder diagnoses and compare the performance of unsupervised clustering algorithms, under different hyperparameter choices, at detecting flares and remission periods in lupus patients' records. Specifically, the clustering algorithms we apply are K-Means, Hierarchical Clustering and DBSCAN. To answer the question "what is the marginal probability that a given clustering algorithm is the best one?" we resort to a Bayesian analysis based on the Plackett-Luce model applied to rankings, which allows us to determine the best combination of hyperparameters and clustering technique for detecting outbreaks in immune disorder diagnoses. The document is organized as follows. In Chapter 2, we motivate the Bayesian analysis approach. In Chapter 3, we present the Bayesian model and define and detail the mathematical concepts needed to understand the project. At the end of Chapter 3, we run a synthetic test to show that the model works as expected. In Chapter 4, we apply the Plackett-Luce model to a real-life problem, specifically the detection of flares and remission periods in lupus patients' records. Finally, in Chapter 5, we draw the main conclusions of the project.
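The synthetic experiment described in the abstract can be illustrated with a minimal Python sketch. It is not the code used in the work (the reference list cites the PerMallows R package [8]). It assumes a Mallows model with Kendall's distance centred on the identity ordering, and samples from it with the standard repeated insertion construction. All function names and parameter values are hypothetical. Under these assumptions the true probability that item k (counted from 0 in the central ordering) is ranked first is proportional to exp(-theta * k), so simulated first-place frequencies can be checked against known values.

```python
import math
import random


def sample_mallows(n_items, theta, rng):
    """Draw one ranking from a Mallows model (Kendall's distance) centred on
    the identity ordering 0, 1, ..., n_items - 1, via repeated insertion."""
    phi = math.exp(-theta)
    ranking = []
    for i in range(n_items):
        # Inserting item i at index j places it ahead of i - j earlier items,
        # which adds i - j inversions, so the weight is phi ** (i - j).
        weights = [phi ** (i - j) for j in range(i + 1)]
        j = rng.choices(range(i + 1), weights=weights)[0]
        ranking.insert(j, i)
    return ranking


def exact_first_place_probs(n_items, theta):
    """True marginal probability that each item occupies the first position."""
    phi = math.exp(-theta)
    weights = [phi ** k for k in range(n_items)]
    total = sum(weights)
    return [w / total for w in weights]


def empirical_first_place_probs(n_items, theta, n_samples, seed=0):
    """Estimate the first-position probabilities from simulated rankings."""
    rng = random.Random(seed)
    counts = [0] * n_items
    for _ in range(n_samples):
        winner = sample_mallows(n_items, theta, rng)[0]
        counts[winner] += 1
    return [c / n_samples for c in counts]


if __name__ == "__main__":
    # Hypothetical setting: 4 "algorithms", 20000 simulated "instances".
    print("exact:    ", exact_first_place_probs(4, 1.0))
    print("simulated:", empirical_first_place_probs(4, 1.0, 20000))
```

In a sanity check of this kind, a Bayesian model fitted to the simulated rankings would be expected to recover probabilities close to the exact ones. The fitting step itself is not shown here.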
Citation
[1] J.H. Ward, "Hierarchical Grouping to Optimize an Objective Function", Journal of the American Statistical Association, 1963.
[2] L. Kaufman and P.J. Rousseeuw, "Finding Groups in Data: An Introduction to Cluster Analysis", Hoboken, NJ: Wiley-Interscience, pp. 87, 1990. doi:10.1002/9780470316801. ISBN 9780471878766.
[3] J. MacQueen, "Some methods for classification and analysis of multivariate observations", Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics, pp. 281-297, University of California Press, Berkeley, Calif., 1967. https://projecteuclid.org/euclid.bsmsp/1200512992
[4] B. Hopkins and J.G. Skellam, "A new method for determining the type of distribution of plant individuals", Annals of Botany 18 (2), pp. 213-227, 1954.
[5] M. Baker, "Is there a reproducibility crisis?", Nature 533, pp. 452-454, 2016.
[6] R.L. Wasserstein and N.A. Lazar, "The ASA's statement on p-values: context, process, and purpose", The American Statistician 70 (2), pp. 129-133, 2016.
[7] B. Calvo, J. Ceberio and J.A. Lozano, "Bayesian Inference for Algorithm Ranking Analysis", in GECCO Companion: Genetic and Evolutionary Computation Conference Companion, July 15-19, 2018, Kyoto, Japan. ACM, New York, NY, USA.
[8] E. Irurozki, B. Calvo and J.A. Lozano, "PerMallows: An R Package for Mallows and Generalized Mallows Models", Journal of Statistical Software, Volume 71, Issue 12, August 2016.
[9] https://github.com/b0rxa/scmamp
[10] https://theclevermachine.wordpress.com/2012/09/24/a-brief-introduction-to-markovchains/
[11] https://theclevermachine.wordpress.com/2012/11/19/a-gentle-introduction-to-markovchain-monte-carlo-mcmc/
[12] https://theclevermachine.wordpress.com/2012/11/05/mcmc-the-gibbs-sampler/
[13] https://theclevermachine.wordpress.com/2012/11/18/mcmc-hamiltonian-monte-carlo-a-k-a-hybrid-monte-carlo/
[14] C.M. Bishop, "Pattern Recognition and Machine Learning", New York: Springer-Verlag, pp. 548-554, 2006.
[15] https://www.datanovia.com/en/lessons/agglomerative-hierarchical-clustering/
[16] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 226-231, Portland, Oregon, 1996. https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf
[17] A. Carpio, A. Simón and L.F. Villa, "Clustering methods and Bayesian inference for the analysis of the time evolution of immune disorders", Preprint, 2020.