
Automatic analysis of high dimensional categorical variables in medical databases for the prediction of hospital bacteremia
Análisis automático de variables categóricas de alta dimensionalidad en bases de datos médicas para la predicción de bacteriemias hospitalarias




Rey García, Jaime del (2021) Automatic analysis of high dimensional categorical variables in medical databases for the prediction of hospital bacteremia. [Bachelor's thesis (Trabajo Fin de Grado)]

Creative Commons Attribution Non-commercial.



This project continues and consolidates the study of bacteremia detection and diagnosis carried out last year by fellow students of the faculty. That first study, based on the analysis of numerical variables, gave a deeper understanding of the problem and outlined an approach for a fast detection model. Categorical variables now become relevant as well, with the aim of improving the results of the classifier models.
Categorical variables have been added to classifier models for at least five years, thanks to the increase in computational capacity, and the resulting benefits for the classifiers are clear. Yet it has been shown that, because language is complex and abstract, classifiers struggle when the data to be predicted contain slang or abbreviations, even when the linguistic register is tightly bounded, as it is for data dealing strictly with medical issues.
Throughout the study we apply text cleaning and text processing methods to prepare the variables for use, since their original format is heterogeneous and unsuitable for processing with Machine Learning tools.
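As an illustration of the kind of text cleaning described here, the following is a minimal sketch using only the Python standard library. The helper name `clean_category` and the sample values are hypothetical; the abstract does not specify which cleaning steps the thesis actually uses.

```python
import re
import unicodedata


def clean_category(raw: str) -> str:
    """Normalize one free-text category value (hypothetical helper)."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase, turn every run of non-alphanumerics into one space, trim.
    text = text.lower()
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


# Assumed sample values, not taken from the thesis data set.
samples = ["  Neumonía ", "NEUMONIA.", "Diabetes  Mellitus"]
print([clean_category(s) for s in samples])
```

Rules like these only remove surface variation such as case, accents, and punctuation. Abbreviations and misspellings still produce distinct classes.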
We also apply a string similarity method to identify the classes that can help the algorithm's classification process, and we assess the most suitable types of encoding for working with these variables.
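The abstract does not say which similarity measure or encoder is used, so the sketch below is only a stand-in. It uses the ratio from Python's `difflib.SequenceMatcher` to merge near-duplicate classes and then one-hot encodes the merged classes. The function names, the 0.85 threshold, and the sample values are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


def merge_similar(values, threshold=0.85):
    """Map each value to the first earlier representative it closely matches."""
    representatives = []
    mapping = {}
    for value in values:
        match = next(
            (r for r in representatives if similarity(value, r) >= threshold),
            None,
        )
        if match is None:
            # No close match yet: this value starts a new class.
            representatives.append(value)
            match = value
        mapping[value] = match
    return mapping


def one_hot(value, classes):
    """One-hot vector for `value` over an ordered list of classes."""
    return [1 if value == c else 0 for c in classes]


values = ["pneumonia", "pneumonnia", "diabetes"]  # assumed sample values
mapping = merge_similar(values)
classes = sorted(set(mapping.values()))
print(mapping)
print([one_hot(mapping[v], classes) for v in values])
```

Merging before encoding keeps the one-hot vectors short, which matters when the raw variable has many spelling variants of the same class.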
Finally, we apply the Random Forest Machine Learning algorithm to the data set, using techniques that help avoid learning bias, and we assess the results in terms of success rates and the relevance of each variable in the algorithm's decision-making process.
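The abstract does not name the techniques used against learning bias. The record's keywords list K-Fold Cross Validation, so the sketch below assumes that is one of them and shows only how the folds could be split, in plain Python. The function name `k_fold_indices` and its parameters are hypothetical, and the Random Forest training itself is left out.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, test) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(n_samples=10, k=5):
    # A classifier would be fitted on `train` and scored on `test` here.
    print(len(train), len(test))
```

Each sample is used for testing exactly once, so the averaged score does not depend on a single train/test split.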

Item Type:Bachelor's thesis (Trabajo Fin de Grado)
Additional Information:

Bachelor's thesis (Trabajo de Fin de Grado) in Computer Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2020/2021.

Garnica Alcázar, Óscar
Ruiz Giardín, José Manuel
Uncontrolled Keywords:Bacteremia, Comorbidity, Predictive medicine, Pathogenesis, Dataframe, Dirty category, String similarity, One hot encoding, Adjacency matrix, Adjacency list, Binary encoding, K-Nearest Neighbors (KNN), Bias and Variance, K-Fold Cross Validation, Random forest, ROC, SHAP
Subjects:Sciences > Computer science
Degree:Grado en Ingeniería Informática
ID Code:74572
Deposited On:19 Sep 2022 14:08
Last Modified:19 Sep 2022 14:28
