Publication:
Entity Resolution y Deduplication con Blocking paralelo en Spark

Publication Date
2020-06
Abstract
In this work we present an algorithm that identifies which records in a dataset, although not identical, correspond to the same real-world entity (Entity Resolution). The classical algorithm for this task compares all records pairwise and therefore has at least quadratic complexity. Our solution improves on the classical algorithm through parallelization, which makes it scalable. The design is also generic: configuration parameters can be defined to adapt it to the specific dataset under study. The runs carried out to analyse its behaviour were very satisfactory, producing results very similar to those of the classical algorithm in significantly shorter execution times. This time difference grows as the size of the datasets increases.
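The contrast between all-pairs comparison and blocking can be illustrated with a minimal single-process sketch. It is not the implementation from this work, which runs on Spark. The similarity function (`difflib.SequenceMatcher`), the 0.8 threshold, and the first-letter blocking key are assumptions chosen only for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.8):
    """Hypothetical matcher: case-insensitive string similarity above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def pairwise_matches(records, threshold=0.8):
    """Classical approach: compare every pair of records, O(n^2) comparisons."""
    return {
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similar(a, b, threshold)
    }


def blocked_matches(records, block_key, threshold=0.8):
    """Blocking: group records by a key and compare only within each group.

    Each block is independent of the others, so blocks could be processed
    in parallel (for example, one task per block in a cluster framework).
    """
    blocks = defaultdict(list)
    for i, record in enumerate(records):
        blocks[block_key(record)].append(i)

    matches = set()
    for ids in blocks.values():
        for i, j in combinations(ids, 2):
            if similar(records[i], records[j], threshold):
                matches.add((i, j))
    return matches


# Illustrative data and blocking key (assumptions, not taken from the work).
records = ["John Smith", "Jon Smith", "MARIA GARCIA", "Maria Garcia", "Peter Christen"]
first_letter = lambda r: r[0].lower()

all_pairs = pairwise_matches(records)
within_blocks = blocked_matches(records, first_letter)
```

On this toy input both functions return the same pairs, while the blocked version performs 2 comparisons instead of 10. In general a blocking key can miss true matches that fall into different blocks, which is why the choice of key is a per-dataset configuration decision.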