Optimización de Apache Spark

Ortiz Hernández, Janira

Files

Trabajo fin de Grado_final.pdf (3.31 MB)

Publication Date

2021-07

Authors

Ortiz Hernández, Janira

Advisors (or tutors)

Gregorio Rodríguez, Carlos

Llana Díaz, Luis

Abstract

This work analyzes several aspects of performance optimization in Apache Spark applications. When working with Spark, performance is often worse than expected, and many variables have to be considered to address it: an inadequate cluster configuration, an unsuitable number of partitions for the data being processed, the presence of data skew, the data format, in-memory processing, the amount of data transferred between machines, the order in which data is processed, and the type of data serialization. All of these matter when studying the performance of our applications. Covering them all in depth would be too broad a goal for the size and purpose of a bachelor's thesis (Trabajo Fin de Grado), so we focus on the points that proved most interesting and best suited to the kind of applications typically run on Spark. The first part of the work is a theoretical study of Spark: its architecture and components, the transformations and operations defined on RDDs and DataFrames, the tools for monitoring executions, the installation and use of PySpark, and an introduction to running in cluster mode. The second part covers the project itself. It consists of creating and tuning several applications, running them on the cluster, and collecting the resulting data on six points of interest: cluster configuration (resource allocation), data partitioning, data persistence, data skew, shuffle control, and working with more than one dataset. Each point of study has its own execution scripts, but they all share a common flow.
This flow starts with a first script that selects the datasets and applications. Together with the remaining execution settings passed as arguments (point of interest, number of repetitions per test...), these produce the call to a second script, which creates the files where the results will be stored. The second script then launches the applications and collects the useful information contained in the log files that Spark produces. The executions were run on a cluster of 6 nodes, each with 4 cores and 8 GB of RAM. As a consequence, the results are not directly transferable to clusters with different resources, and the limits imposed by the characteristics of the available working environment must be taken into account. Even so, the general conclusions give us a better understanding of how to deal with these limits and apply what has been learned.
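The two-script flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's scripts: the names `Run`, `plan_runs` and `init_results_file`, the CSV layout, the default resource values and the `yarn` master are all assumptions made for the example. It only builds the `spark-submit` command lines and prepares a results file; it does not launch Spark or parse its logs.

```python
import csv
import itertools
from pathlib import Path
from typing import Iterable, Iterator, List, NamedTuple


class Run(NamedTuple):
    """One planned execution (hypothetical structure, not from the thesis)."""
    app: str
    dataset: str
    repetition: int
    command: List[str]


def plan_runs(
    apps: Iterable[str],
    datasets: Iterable[str],
    repetitions: int,
    master: str = "yarn",          # assumption: the cluster manager is not stated
    executor_cores: int = 2,       # assumption: illustrative value only
    executor_memory: str = "4g",   # assumption: illustrative value only
) -> Iterator[Run]:
    """First script: combine every application with every dataset and repeat."""
    for app, dataset in itertools.product(list(apps), list(datasets)):
        for repetition in range(1, repetitions + 1):
            command = [
                "spark-submit",
                "--master", master,
                "--executor-cores", str(executor_cores),
                "--executor-memory", executor_memory,
                app,
                dataset,
            ]
            yield Run(app, dataset, repetition, command)


def init_results_file(path: Path, runs: Iterable[Run]) -> None:
    """Second script, first step: create the file where results will be stored."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["app", "dataset", "repetition", "command", "elapsed_seconds"])
        for run in runs:
            # elapsed_seconds is left empty until the run has been executed
            writer.writerow([run.app, run.dataset, run.repetition, " ".join(run.command), ""])
```

In a real driver, each `command` could be passed to `subprocess.run` and the timing fields filled in from the logs Spark writes, which is the part the abstract attributes to the second script.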

UCM subjects

Informática (Informática)

Unesco subjects

1203.17 Informática

Citation

White, T. (2015). Hadoop: The Definitive Guide. O'Reilly Media.
Karau, H., Konwinski, A., Wendell, P. & Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media.
Drabas, T. & Lee, D. (2017). Learning PySpark. Packt.
Chambers, B. & Zaharia, M. (2018). Spark: The Definitive Guide. Big Data Processing Made Simple. O'Reilly Media.
The Apache Software Foundation (May 2020). Apache Spark. https://spark.apache.org/docs/2.3.0/
Databricks (January 2020). Apache Spark. https://databricks.com/spark/about
Grover, M. & Malaska, T. (16 June 2016). Top 5 Mistakes When Writing Spark Applications. https://databricks.com/session/top-5-mistakes-when-writing-spark-applications
Kozlowski, N. (2 November 2017). Partitioning in Apache Spark. https://medium.com/parrot-prediction/partitioning-in-apache-spark-8134ad840b0
Zvara, Z. (2016). Handling data skew adaptively in Spark using Dynamic Repartitioning. MTA SZTAKI.
Karau, H. & Warren, R. (2016). High Performance Spark: Best practices for scaling and optimizing Apache Spark. O'Reilly Media.
The Apache Software Foundation (April 2020). Tuning Spark. https://spark.apache.org/docs/2.3.0/tuning

URI

https://hdl.handle.net/20.500.14352/5348

Collections

Trabajos Fin de Grado (TFG) y Diplomas de Estudios Avanzados (DEA)
