Livy is an open-source REST interface for interacting with an Apache Spark cluster: it gives remote users access to the cluster and lets them submit jobs to it over HTTP. In simpler terms, Livy provides remote users with access to their Spark cluster. It also makes it easy to interact with and manage the SparkContext and SparkSession. Livy creates this entry point for the Spark application itself, so the user doesn't have to.
Livy is an Apache project, and it brings a number of features on top of plain job submission:
● Apache Livy can maintain long-running Spark contexts that can be reused across multiple Spark jobs, by multiple clients
● Supports sharing cached RDDs or DataFrames across multiple jobs and clients
● Multiple Spark contexts can be managed simultaneously, and the Spark contexts run on the cluster (YARN/Mesos) instead of on the Livy server, for good fault tolerance and concurrency
● Jobs can be submitted as precompiled jars, as snippets of code, or via the Java/Scala client API
● Ensures security via secure, authenticated communication
● Apache-licensed, 100% open source
Apache Livy provides two ways of interfacing with a Spark cluster.
● Interactive mode: provides a spark-shell-, pyspark-, or sparkR-like environment.
● Batch mode: provides a spark-submit-like environment for submitting a Spark application to the cluster without any interaction at run time.
Users can pick whichever mode suits their needs. Spark jobs can be launched through Livy with curl or with the Python Requests module, and if desired a Jupyter notebook can be set up as the working environment, letting users launch jobs with SQL, Scala, PySpark, and SparkR. In this write-up I will be launching jobs with curl commands.

Interactive Mode: In interactive mode, the user creates a context once and later uses it to run statements or tasks. This mode is similar to spark-shell or pyspark, where we get a development environment to write statements for different jobs. Before creating a session, the Livy server must be up and running. To start the Livy server, use the command given below:

$LIVY_HOME/bin/livy-server

Now launch a Spark interactive session with the curl command:

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions

Here, the data payload can hold many parameters that specify the context type and its properties: the user can impersonate another user by passing the proxyUser parameter, set the executor memory, choose the number of cores, and name the session. The accepted fields are listed below, followed by a Python sketch of the same workflow. Request body for Livy interactive mode:
| Name | Description | Type |
| --- | --- | --- |
| kind | The session kind | session kind |
| proxyUser | User to impersonate when starting the session | string |
| jars | Jars to be used in this session | list of strings |
| pyFiles | Python files to be used in this session | list of strings |
| files | Files to be used in this session | list of strings |
| driverMemory | Amount of memory to use for the driver process | string |
| driverCores | Number of cores to use for the driver process | int |
| numExecutors | Number of executors to launch for this session | int |
| archives | Archives to be used in this session | list of strings |
| queue | Name of the YARN queue to which the session is submitted | string |
| name | Name of the session | string |
| conf | Spark configuration properties to be used | map of key=val |
| heartbeatTimeoutInSecond | Session timeout in seconds | int |
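The same interactive workflow can also be driven from Python with the Requests module mentioned earlier. Below is a minimal sketch, not an official client: it assumes a Livy server on the default localhost:8998 endpoint, and the session name and sizing values are illustrative placeholders drawn from the table above.

```python
import time

import requests

# Assumed deployment: Livy on its default port 8998; adjust for your cluster.
LIVY_URL = "http://localhost:8998"

# Create a PySpark session. The name and sizing fields are illustrative
# placeholders taken from the request-body table above, not required settings.
session = requests.post(
    f"{LIVY_URL}/sessions",
    json={"kind": "pyspark", "name": "livy-demo", "driverMemory": "1g", "numExecutors": 2},
).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"

# Wait for the Spark context to spin up before sending statements.
while session["state"] in ("not_started", "starting"):
    time.sleep(2)
    session = requests.get(session_url).json()

# Run a trivial statement, then poll until its result is ready.
stmt = requests.post(f"{session_url}/statements", json={"code": "1 + 1"}).json()
stmt_url = f"{session_url}/statements/{stmt['id']}"
while stmt["state"] in ("waiting", "running"):
    time.sleep(1)
    stmt = requests.get(stmt_url).json()

# Assumes the statement succeeded; the result sits under output > data.
print(stmt["output"]["data"]["text/plain"])  # prints: 2

# Delete the session once done, like the DELETE curl call shown below.
requests.delete(session_url)
```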
● To check running sessions, use the curl command below:

curl localhost:8998/sessions | python -m json.tool

● To run a statement in a session, use:

curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'

● To check the result of your statement, you can use the curl command below:

curl localhost:8998/sessions/0/statements/0 | python -m json.tool

Output:

{
    "id": 0,
    "code": "1 + 1",
    "state": "available",
    "output": {
        "status": "ok",
        "execution_count": 0,
        "data": {
            "text/plain": "2"
        }
    },
    "progress": 1.0
}

The statement result can be read from output > data in the JSON. Once the work is finished, the user can delete the session:

curl localhost:8998/sessions/0 -X DELETE

Interactive mode works like a normal shell: all defined variables remain available to you for as long as the session is alive.

Batch Mode: Batch mode works like spark-submit, letting us submit an application along with its configuration parameters and application files. In batch mode a user can submit a jar or a .py file to the Spark cluster through the Livy server.

● To submit a jar file:

curl -X POST --data '{"file": "pathToJar/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi"}' -H "Content-Type: application/json" localhost:8998/batches

● To submit a .py file:

curl -X POST --data '{"file": "path/codeFile.py"}' -H "Content-Type: application/json" localhost:8998/batches

● To check the batch result:

curl localhost:8998/batches/0/log | python -m json.tool
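The batch workflow can be scripted the same way. Here is a hedged sketch with the Python Requests module, again assuming the default localhost:8998 endpoint; the application path is a placeholder and must point somewhere the cluster can actually read (e.g. local to the Livy server or on HDFS).

```python
import time

import requests

LIVY_URL = "http://localhost:8998"  # assumed default host/port

# Submit a Python application in batch mode; the file path is a placeholder.
batch = requests.post(
    f"{LIVY_URL}/batches", json={"file": "/path/to/codeFile.py"}
).json()
batch_url = f"{LIVY_URL}/batches/{batch['id']}"

# Poll the batch until it finishes, then fetch its log lines.
while batch["state"] in ("starting", "running"):
    time.sleep(5)
    batch = requests.get(batch_url).json()
print(batch["state"])  # e.g. "success" or "dead"
print("\n".join(requests.get(f"{batch_url}/log").json()["log"]))
```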
The data payload can hold many key-value pairs that specify the job: the number of cores, dependent jars, configuration properties, and more, as listed below. Request body for Livy batch mode:

| Name | Description | Type |
| --- | --- | --- |
| file | File containing the application to execute | path (required) |
| proxyUser | User to impersonate when running the job | string |
| className | Application Java/Spark main class | string |
| args | Command-line arguments for the application | list of strings |
| jars | Jars to be used in this session | list of strings |
| pyFiles | Python files to be used in this session | list of strings |
| files | Files to be used in this session | list of strings |
| driverMemory | Amount of memory to use for the driver process | string |
| driverCores | Number of cores to use for the driver process | int |
| executorMemory | Amount of memory to use per executor process | string |
| executorCores | Number of cores to use for each executor | int |
| numExecutors | Number of executors to launch for this session | int |
| archives | Archives to be used in this session | list of strings |
| queue | Name of the YARN queue to which the job is submitted | string |
| name | Name of this session | string |
| conf | Spark configuration properties | map of key=val |
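Putting several of the table's fields together, here is one more hedged sketch of a fuller batch payload; every path, argument, and size shown is an illustrative placeholder, not a recommended setting.

```python
import requests

LIVY_URL = "http://localhost:8998"  # assumed default host/port

# A fuller batch payload exercising several fields from the table above.
payload = {
    "file": "/path/to/spark-examples.jar",           # placeholder path
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["100"],                                  # passed to the app's main()
    "executorMemory": "2g",
    "executorCores": 2,
    "numExecutors": 4,
    "conf": {"spark.speculation": "true"},
}
batch = requests.post(f"{LIVY_URL}/batches", json=payload).json()
print(batch["id"], batch["state"])
```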
Apache Livy comes with many features that set it apart from other REST interfaces to Spark:
● Livy supports interactive Scala, Python, and R shells.
● Jobs can be submitted in batch mode in Scala, Java, or Python.
● Multiple users can share the same server, and each can submit jobs and monitor them independently (impersonation support).
● Jobs can be submitted from anywhere over REST.
● No code modification is needed.
● Livy works across Spark versions (it supports both Spark 1.x and 2.x), so there is no version-mismatch problem as with some other REST APIs.
● A Jupyter notebook (or any other notebook) can be used as an IDE with Livy, with support for Scala, PySpark, and SparkR.
Livy also supports user impersonation and is compatible with Apache Ranger, which helps secure your cluster against unauthorized users accessing or stealing your data. There will always be many REST API servers in a big-data system, but only a few give you what you want and what you need. Livy is one of them!