Imputación de datos mediante Random Forest

Barreñada Taleb, Lasai Alai

Files

TFM.pdf (4.52 MB)

Publication Date

2021-07-14

Authors

Barreñada Taleb, Lasai Alai

Advisors (or tutors)

Salgado Fernández, David

Rosa Pérez, Elena

Alonso Sanz, Rosa

Abstract

The volume of available information keeps growing, and national statistical institutes must make use of it to develop innovative and effective processes. Statistical learning is the set of techniques used to better understand data. Random forests, which are ensembles of decision trees, are among the most widely used supervised learning techniques. In this thesis, random forests are used to impute data in short-term business surveys, specifically in the Industrial Turnover Indices (Índices de Cifras de Negocios de la Industria). Imputation is the process of assigning a value to an item for which no information was previously available. The study develops the imputation methodology after analysing the quality criteria required to produce an official statistic. First, feature selection is carried out to identify the most relevant variables for computing turnover figures. Next, the hyperparameters are tuned to obtain the optimal random forest for the selected data set. Finally, the forest is applied to impute the missing values and the imputations are evaluated, with satisfactory results.
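The three steps named in the abstract (feature selection, tuning, imputation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the record does not state the thesis's tooling, so scikit-learn is swapped in here, and the function name `impute_with_forest`, the synthetic data, the impurity-based importance ranking, and the parameter grid are all hypothetical rather than the author's actual method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV


def impute_with_forest(X, y, top_k=3, seed=0):
    """Fill NaN entries of y using a random forest fitted on the observed rows.

    Hypothetical helper: X is assumed fully observed, y holds the item
    to impute (NaN where the value is missing).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    missing = np.isnan(y)
    X_obs, y_obs = X[~missing], y[~missing]

    # Step 1: feature selection. Rank predictors by impurity-based
    # importance from a probe forest and keep the top_k (an assumed criterion).
    probe = RandomForestRegressor(n_estimators=100, random_state=seed)
    probe.fit(X_obs, y_obs)
    ranked = np.argsort(probe.feature_importances_)[::-1]
    keep = np.sort(ranked[:top_k])

    # Step 2: tuning. Cross-validated grid search over two common
    # random forest hyperparameters (the grid itself is an assumption).
    grid = {"max_features": [0.5, 1.0], "min_samples_leaf": [1, 5]}
    search = GridSearchCV(
        RandomForestRegressor(n_estimators=100, random_state=seed),
        grid,
        cv=3,
        scoring="neg_mean_absolute_error",
    )
    search.fit(X_obs[:, keep], y_obs)

    # Step 3: imputation. Predict only the missing items; observed
    # values are left untouched.
    y_imputed = y.copy()
    if missing.any():
        y_imputed[missing] = search.predict(X[missing][:, keep])
    return y_imputed, keep, search.best_params_


# Synthetic demonstration data (not from the thesis): the target depends
# on the first two of five predictors, and 20 of 200 values are blanked out.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
truth = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
holes = rng.choice(200, size=20, replace=False)
y = truth.copy()
y[holes] = np.nan

y_imputed, keep, best_params = impute_with_forest(X, y, top_k=2)

# Compare against the naive baseline of imputing the observed mean.
forest_mae = np.mean(np.abs(y_imputed[holes] - truth[holes]))
mean_mae = np.mean(np.abs(np.nanmean(y) - truth[holes]))
print("selected features:", keep, "best params:", best_params)
print("forest MAE:", forest_mae, "mean-imputation MAE:", mean_mae)
```

Evaluating on values that were deliberately blanked out, as done here, is one simple way to assess imputation accuracy when the true values are known; in a production setting the quality criteria for official statistics would drive the choice of evaluation measure.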

Description

Grade: 10

UCM subjects

Mathematical statistics (Mathematics)

Unesco subjects

1209 Statistics


URI

https://hdl.handle.net/20.500.14352/5138

Collections

Trabajos Fin de Master (TFM)
