Why Livy?

Monday, August 13, 2018

What is Livy?

Livy is an open-source REST interface for interacting with an Apache Spark cluster: it gives remote users access to the cluster and lets them submit jobs to it over HTTP. In simpler terms, Livy lets users work with their Spark cluster without being logged in to it. It also makes interaction with and management of the SparkContext and SparkSession easy: Livy creates this entry point for the Spark application, so the user doesn't have to create it.
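Because the whole interface is plain HTTP, any tool that can make an HTTP request can talk to Livy. As a minimal sketch (assuming a Livy server is already running on localhost on the default port 8998), listing the currently running sessions with the Python requests module looks like this:

import requests

# Ask the Livy server for the sessions it currently manages.
resp = requests.get("http://localhost:8998/sessions")
resp.raise_for_status()
for session in resp.json().get("sessions", []):
    print(session["id"], session["kind"], session["state"])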

What Livy Offers

Livy is an Apache project, so it fits cleanly into the Spark ecosystem and adds a number of useful features on top of plain job submission.
● Apache Livy can keep long-running Spark contexts that can be reused for multiple Spark jobs, by multiple clients
● Supports sharing cached RDDs or DataFrames across multiple jobs and clients
● Multiple Spark contexts can be managed simultaneously, and the Spark contexts run on the cluster (YARN/Mesos) instead of on the Livy server, for good fault tolerance and concurrency
● Jobs can be submitted as precompiled jars, as snippets of code, or via the Java/Scala client API
● Ensures security via secure, authenticated communication
● Apache-licensed, 100% open source

Operation modes of Apache Livy:

Apache Livy provides two ways of interfacing with a Spark cluster.
● Interactive mode: provides a spark-shell, pyspark, or sparkR kind of environment.
● Batch mode: provides a spark-submit type environment, to submit a Spark application to the cluster without any interaction during run time.
A user can pick whichever mode suits the task. Spark jobs can be launched through Livy with curl or with the Python requests module, and if users prefer, a Jupyter notebook can be set up as the working environment, which lets them launch jobs with SQL, Scala Spark, PySpark, and SparkR. In this write-up I will be launching jobs with curl commands.

Interactive mode: In interactive mode, the user creates a context once and later uses it to run statements or tasks. This mode is similar to spark-shell or pyspark, where we get a development environment to write statements for different jobs. Before creating a session, the Livy server must be up and running. To start the Livy server, use the command given below.

$LIVY_HOME/bin/livy-server

Now launch a Spark interactive session with the curl command:

curl -X POST --data '{"kind": "pyspark"}' -H "Content-Type: application/json" localhost:8998/sessions

Here, the request body can hold many parameters that specify the context type and its properties: the user can impersonate another user by passing the proxyUser parameter, set the driver and executor memory, specify the number of cores, and give the session a name. The full list is below, and a Python requests sketch of the same call follows the table.

Request body for a Livy interactive session:

Name | Description | Type
kind | The session kind (spark, pyspark, sparkr, or sql) | session kind
proxyUser | User to impersonate when starting the session | string
jars | Jars to be used in this session | list of strings
pyFiles | Python files to be used in this session | list of strings
files | Files to be used in this session | list of strings
driverMemory | Amount of memory to use for the driver process | string
driverCores | Number of cores to use for the driver process | int
numExecutors | Number of executors to launch for this session | int
archives | Archives to be used in this session | list of strings
queue | The name of the YARN queue in which the job is to be submitted | string
name | The name of this session | string
conf | Spark configuration properties to be used | map of key=val
heartbeatTimeoutInSecond | Session timeout, in seconds | int
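The same session-creation call can be made from Python with the requests module instead of curl. The sketch below is only illustrative: it assumes Livy on localhost:8998, and the proxy user, memory settings, and session name are made-up example values for the fields listed in the table above.

import json
import time
import requests

livy_url = "http://localhost:8998"
headers = {"Content-Type": "application/json"}

# Request-body fields correspond to the table above; the concrete
# values (user, memory, name) are placeholders for illustration.
payload = {
    "kind": "pyspark",
    "proxyUser": "analyst",           # assumed user to impersonate
    "driverMemory": "1g",
    "numExecutors": 2,
    "name": "livy-demo-session",
    "conf": {"spark.ui.enabled": "false"},
}

resp = requests.post(f"{livy_url}/sessions", data=json.dumps(payload), headers=headers)
resp.raise_for_status()
session_id = resp.json()["id"]

# Poll until the session leaves the "starting" state.
while True:
    state = requests.get(f"{livy_url}/sessions/{session_id}/state").json()["state"]
    if state != "starting":
        break
    time.sleep(2)
print(f"session {session_id} is {state}")

Once the session reports the idle state, statements can be posted to it exactly as shown with the curl commands below.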

● To check running sessions, use the curl command below:
curl localhost:8998/sessions | python -m json.tool
● To run a statement in a session, use:
curl localhost:8998/sessions/0/statements -X POST -H 'Content-Type: application/json' -d '{"code":"1 + 1"}'
● To check the result of your code statement, you can use the curl command below:
curl localhost:8998/sessions/0/statements/0 | python -m json.tool

Output:
{
  "id": 0,
  "code": "1 + 1",
  "state": "available",
  "output": {
    "status": "ok",
    "execution_count": 0,
    "data": {
      "text/plain": "2"
    }
  },
  "progress": 1.0
}

The statement result appears in the JSON under output > data. The user can delete the session once it is no longer needed:
curl localhost:8998/sessions/0 -X DELETE
Interactive mode works like a normal shell: all the defined variables stay available to you for as long as your session is alive.

Batch mode: Batch mode works like spark-submit, where we submit our application along with its configuration parameters and application files. In batch mode the user can submit a jar or a .py file to the Spark cluster through the Livy server.
● To submit a jar file:
curl -X POST --data '{"file": "pathToJar/spark-examples.jar", "className": "org.apache.spark.examples.SparkPi"}' -H "Content-Type: application/json" localhost:8998/batches
● To submit a .py file:
curl -X POST --data '{"file": "path/codeFile.py"}' -H "Content-Type: application/json" localhost:8998/batches
● To check the batch result:
curl localhost:8998/batches/0/log | python -m json.tool
The request body can hold many key-value pairs in JSON that describe the job, such as the number of cores, dependent jars, and configuration properties, as listed below. A Python requests sketch of a batch submission follows the table.

Request body for Livy batch mode:

Name | Description | Type
file | File containing the application to execute | path (required)
proxyUser | User to impersonate when running the job | string
className | Application Java/Spark main class | string
args | Command line arguments for the application | list of strings
jars | Jars to be used in this session | list of strings
pyFiles | Python files to be used in this session | list of strings
files | Files to be used in this session | list of strings
driverMemory | Amount of memory to use for the driver process | string
driverCores | Number of cores to use for the driver process | int
executorMemory | Amount of memory to use per executor process | string
executorCores | Number of cores to use for each executor | int
numExecutors | Number of executors to launch for this session | int
archives | Archives to be used in this session | list of strings
queue | The name of the YARN queue in which the job is to be submitted | string
name | The name of this session | string
conf | Spark configuration properties | map of key=val
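As with the interactive session, the batch request body can be built up in Python as well. The sketch below is illustrative, assuming Livy on localhost:8998; the jar path, class name, and resource settings are placeholder values for the fields in the table above.

import json
import time
import requests

livy_url = "http://localhost:8998"
headers = {"Content-Type": "application/json"}

# Placeholder file path and class name; replace with your own application.
payload = {
    "file": "/path/to/spark-examples.jar",
    "className": "org.apache.spark.examples.SparkPi",
    "args": ["100"],
    "executorMemory": "1g",
    "executorCores": 1,
    "numExecutors": 2,
    "name": "livy-demo-batch",
}

resp = requests.post(f"{livy_url}/batches", data=json.dumps(payload), headers=headers)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch state until Livy reports a terminal state.
while True:
    state = requests.get(f"{livy_url}/batches/{batch_id}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)
print(f"batch {batch_id} finished with state: {state}")

After the batch finishes, its driver log can be fetched from /batches/{id}/log, as shown with curl above.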

 

What makes Apache Livy unique?

Apache Livy comes with several features that set it apart from other REST interfaces to Spark.
● Livy supports interactive Scala, Python, and R shells.
● Jobs can be submitted in batch mode in Scala, Java, or Python.
● Multiple users can share the same server, and each of them can submit jobs and monitor them independently (impersonation support).
● Jobs can be submitted from anywhere over REST.
● No code modification is needed in the Spark application.
● Livy works with both Spark 1.x and 2.x, so there is no version-mismatch problem as with some other REST servers.
● A Jupyter notebook (or any other notebook) can be used as an IDE with Livy, with support for SQL, Spark (Scala), PySpark, and SparkR.
Livy also supports user impersonation and is compatible with Apache Ranger, which helps secure your cluster against anonymous users getting in and stealing your data. There will always be many REST API servers in a Big Data system, but only a few give you what you want and what you need. Livy is one of them!
