Publication: Selective Data Editing of Continuous Variables with Random Forests in Official Statistics
Publication Date
2020
Abstract
Technological advances and new demands arising from economic and socio-cultural changes regularly challenge National Statistical Institutes to adapt to their evolving environment. The application of machine learning methods as important and promising tools for official statistics is discussed in the context of these changes, of the opportunities arising from new digital data sources, and of the difficult task of balancing a variety of quality requirements at national and international level. Selective statistical data editing is an approach that detects influential units and selects them for manual follow-up in order to make the editing process more efficient. In this thesis, a simple approach and a two-step approach are developed to apply random forests to the selective editing of continuous variables in short-term business survey data. We present a score function based on random forest models that allows an efficient selection of the units relevant for the estimation of the final aggregates. The approach is also found to be applicable at the disaggregated levels of the autonomous communities and economic branches.
El avance tecnológico y nuevas demandas debidas a cambios económicos y socioculturales desafían regularmente a los Institutos Nacionales de Estadística a adaptarse a su entorno en constante evolución. La aplicación de métodos de aprendizaje automático como instrumentos importantes y prometedores para las estadísticas oficiales se analizan en el contexto de esos cambios, en el contexto de las oportunidades que surgen de nuevas fuentes de datos digitales, y teniendo en cuenta la difícil tarea de tener que equilibrar una variedad de requisitos de calidad a nivel nacional e internacional. La depuración selectiva es un conjunto de técnicas para detectar unidades influyentes y seleccionarlas para el seguimiento manual a fin de hacer el proceso más eficiente. En este trabajo se desarrolla un enfoque simple y uno en dos etapas para aplicar los bosques aleatorios a la depuración selectiva de variables continuas en el contexto de datos de encuestas económicas coyunturales. Se presenta una función de puntuación basada en modelos de bosques aleatorios que permite una selección eficiente de unidades relevantes para la estimación de los agregados finales. El enfoque desarrollado también es aplicable a los niveles desagregados de las comunidades autónomas y ramas de negocio para los datos usados.