#########################
Installation & Deployment
#########################

========================
Installing IReS-Platform
========================

This section serves as an installation and execution manual for IReS.

--------
Overview
--------

To get the IReS platform up and running, four steps are required:

1. Clone the IReS code to the server
2. Run install.sh
3. Validate the installation
4. Start the IReS server

------------------------------
Clone IReS code to the server
------------------------------

For a quick reference of how to use git, click `here `_.

Open a terminal (Linux) and navigate to the directory where the
IReS-Platform files will be cloned, e.g. ``asap``. Then clone the project
by entering the following command:

.. code:: bash

   git clone git@github.com:project-asap/IReS-Platform.git

---------------
Run install.sh
---------------

After successfully cloning the IReS platform, various folders and files
can be found inside $IRES_HOME. Among them is install.sh. Assuming that
the current working directory is $IRES_HOME, executing

.. code:: bash

   ./install.sh

will start building IReS. Upon a successful build you will be prompted to
provide the path where Hadoop YARN is located on your machine. By doing
this, IReS gets connected to Hadoop YARN. Alternatively, executing

.. code:: bash

   ./install.sh -c $YARN_HOME,$IRES_HOME

will connect IReS to YARN, where $YARN_HOME and $IRES_HOME correspond to
the absolute paths of YARN's and IReS's home folders.

Assuming that the connection has been established, update the file
$YARN_HOME/etc/hadoop/yarn-site.xml with the following properties:

.. code:: bash

   yarn.nodemanager.services-running.per-node
   yarn.nodemanager.services-running.check-availability
   yarn.nodemanager.services-running.check-status

These properties enable IReS to run workflows over YARN and to monitor
cluster resources and services.

-----------------------
Validate installation
-----------------------

If anything goes wrong during the build process of IReS, error messages
will be printed out and a log file will be provided.

----------------------
Start the IReS server
----------------------

Start the IReS server by running

.. code:: bash

   ./install.sh -r start

No exception should be raised, and the jps command should print a "Main"
process that corresponds to the ASAP server.

Open the ASAP server web user interface at
http://your_hostname:1323/web/main. The IReS home page should be
displayed.

Run a workflow, for example "hello_world" from the "Abstract Workflows"
tab, and observe what happens not only in the IReS web interface but also
in the YARN and HDFS web interfaces. Make sure that YARN has been started
before running any workflow. Click on the "Cockpit" tab to verify that
the services are running.
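To confirm from the shell that the server is up, the following checks can
be used (a minimal sketch; port 1323 and the "Main" process name are the
defaults mentioned above, and your_hostname is a placeholder for your
server's hostname):

.. code:: bash

   # The ASAP server appears in jps as a Java process named "Main"
   jps | grep Main

   # The web UI should answer with HTTP 200 on port 1323
   curl -s -o /dev/null -w "%{http_code}\n" http://your_hostname:1323/web/main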
--------
Monitor
--------

The Monitor is responsible for profiling the operators in every workflow
execution. It keeps the execution metrics (e.g. execution time, number of
cores, etc.) in a dictionary format and stores them in a MongoDB server.

To install the Monitor:

1. Install MongoDB. You can follow `this tutorial `_.
2. Copy the 'asap-tools' `subproject `_ to every IReS node and install the
   required dependencies.
3. Set the full path of the `asap` script (asap-tools/bin/asap) in the
   `asap.path` parameter of the `asap.properties` file
   (asap-server/target/conf/asap.properties).
4. In each node create a file `/etc/reporter_config.json` with the
   following content:

   .. code:: json

      {
          "backend": "mongo",
          "host": "the_mongo_db_host"
      }

==========================
Running a sample workflow
==========================

HelloWorld is a simple workflow consisting of just a single operator,
designed for demonstration purposes. To run HelloWorld, follow these
steps:

1. Go to the IReS UI: http://ires_host:1323/web/main

   .. figure:: ireshome.png

      IReS Home Page

2. Go to the **Abstract Workflows** tab and select the **HelloWorld**
   workflow.

   .. figure:: abstractworkflows.png

      Abstract Workflows Tab

3. Click on the **Materialize Workflow** button.

   .. figure:: abstracthello.png

      Abstract HelloWorld Workflow

4. Click on the **Execute Workflow** button to start the execution.

   .. figure:: materializedhello.png

      The materialized HelloWorld workflow

The figures below show the execution process.

.. figure:: exec1.png
   :width: 150%

   The execution has started

.. figure:: yarn.png
   :width: 150%

   The submitted YARN application

.. figure:: exec2.png
   :width: 150%

   The execution has finished

===============================
Create a workflow from scratch
===============================

This section describes the process of designing a new workflow from
scratch. We will create a workflow that consists of a single operator,
takes a text file as input and produces the number of its lines as
output.

-------------------
Dataset definition
-------------------

In order to create the workflow input dataset, you need to add the
dataset definition to the IReS library. Create a file named
'asapServerLog' in the asapLibrary/datasets/ folder and add the following
content:

.. code::

   Optimization.documents=1
   Execution.path=hdfs\:///user/root/asap-server.log
   Constraints.Engine.FS=HDFS

This step assumes that a file named 'asap-server.log' exists in HDFS. You
can download the log file used in this example `through this link
<./files/asap-server.log>`_.
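If the log file is not yet on HDFS, you can upload it first (a minimal
sketch, assuming the downloaded file sits in the current directory; the
target path follows the Execution.path value above):

.. code:: bash

   # Create the target directory and upload the sample log
   hdfs dfs -mkdir -p /user/root
   hdfs dfs -put asap-server.log /user/root/asap-server.log

   # Verify that the dataset is in place
   hdfs dfs -ls /user/root/asap-server.log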
------------------------------------------------
Materialized Operator Definition (Server-Side)
------------------------------------------------

To add a materialized operator, a folder with the minimum required files
is needed.

i. From the bash shell, go to the asapLibrary/operators folder in the
   IReS installation directory:

   .. code:: bash

      cd $ASAP_HOME/target/asapLibrary/operators

ii. Create a new folder named after the new materialized operator:

    .. code:: bash

       mkdir LineCount

iii. Create the description file and enter the information below. A
     description file should meet the standards of the template provided
     in `this link <./files/description_template>`_. The template
     contains all the required parameters for an operator to run, as well
     as all the optional parameters that can be used.

     .. code::

        Constraints.Engine=Spark
        Constraints.Output.number=1
        Constraints.Input.number=1
        Constraints.OpSpecification.Algorithm.name=LineCount
        Optimization.model.execTime=gr.ntua.ece.cslab.panic.core.models.UserFunction
        Optimization.model.cost=gr.ntua.ece.cslab.panic.core.models.UserFunction
        Optimization.outputSpace.execTime=Double
        Optimization.outputSpace.cost=Double
        Optimization.cost=1.0
        Optimization.execTime=1.0
        Execution.Arguments.number=2
        Execution.Argument0=In0.path.local
        Execution.Argument1=lines.out
        Execution.Output0.path=$HDFS_OP_DIR/lines.out
        Execution.copyFromLocal=lines.out
        Execution.copyToLocal=In0.path

iv. Create the .lua file with the execution instructions:

    .. code:: lua

       operator = yarn {
           name = "LineCount",
           timeout = 10000,
           memory = 1024,
           cores = 1,
           container = {
               instances = 1,
               --env = base_env,
               resources = {
                   ["count_lines.sh"] = {
                       file = "asapLibrary/operators/LineCount/count_lines.sh",
                       type = "file",              -- other value: 'archive'
                       visibility = "application"  -- other values: 'private', 'public'
                   }
               },
               command = {
                   base = "./count_lines.sh"
               }
           }
       }

v. Create the executable named count_lines.sh with the following content
   (a quick local test of this script is sketched after this list):

   .. code:: bash

      #!/bin/bash
      wc -l $1 >> $2

   and make it executable:

   .. code:: bash

      chmod +x count_lines.sh

vi. Restart the IReS server:

    .. code:: bash

       $IRES_HOME/asap-server/src/main/scripts/asap-server restart
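As noted in step v, you can verify the script's behaviour locally before
deploying it (a minimal sketch; the sample input is arbitrary):

.. code:: bash

   # count_lines.sh appends the "wc -l" output for $1 to $2
   printf 'one\ntwo\nthree\n' > sample.txt
   ./count_lines.sh sample.txt lines.out
   cat lines.out   # expected output: 3 sample.txt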
---------------------------------------------
Materialized Operator Definition (via REST)
---------------------------------------------

This example describes an alternative way to create a materialized
operator through the REST API. To do so, create a folder locally and add
the required description file as well as all other files needed for the
execution. In this case, an extra parameter that defines the execution
command (Execution.command) should be added to the description file.

i. Description file: create inside the folder a file named `description`
   with the following content:

   .. code::

      Constraints.Engine=Spark
      Constraints.Output.number=1
      Constraints.Input.number=1
      Constraints.OpSpecification.Algorithm.name=LineCount
      Optimization.model.execTime=gr.ntua.ece.cslab.panic.core.models.UserFunction
      Optimization.model.cost=gr.ntua.ece.cslab.panic.core.models.UserFunction
      Optimization.outputSpace.execTime=Double
      Optimization.outputSpace.cost=Double
      Optimization.cost=1.0
      Optimization.execTime=1.0
      Execution.Arguments.number=2
      Execution.Argument0=In0.path.local
      Execution.Argument1=lines.out
      Execution.Output0.path=$HDFS_OP_DIR/lines.out
      Execution.copyFromLocal=lines.out
      Execution.copyToLocal=In0.path
      Execution.command=./count_lines.sh

ii. Executable file: create the executable named 'count_lines.sh' with
    the following content:

    .. code:: bash

       #!/bin/bash
       wc -l $1 >> $2

    and make it executable:

    .. code:: bash

       chmod +x count_lines.sh

iii. Send the operator via the 'send_operator.sh' script:

     .. code:: bash

        ./send_operator.sh LOCAL_OP_FOLDER IRES_HOST LineCount

     The script is available at $IRES_HOME/asap-server/src/main/scripts.
     You can also `download it directly `_.

------------------------------
Abstract operator definition
------------------------------

Create the `LineCount` abstract operator by creating a file named
'LineCount' in the asapLibrary/abstractOperators folder with the
following content:

.. code::

   Constraints.Output.number=1
   Constraints.Input.number=1
   Constraints.OpSpecification.Algorithm.name=LineCount

-------------------------------------------
Abstract workflow definition (Server-Side)
-------------------------------------------

Create the `LineCountWorkflow` workflow by creating a folder named
'LineCountWorkflow' in asapLibrary/abstractWorkflows. The abstract
workflow folder should consist of three required components: the
`datasets` folder, the `operators` folder and a file named `graph`. The
resulting folder layout is sketched after the following list.

i. Create a folder named 'datasets' and copy the `asapServerLog` file
   from the `asapLibrary/datasets/` folder into it. Then create an empty
   file named 'd1' (touch d1).

ii. Create a file named 'graph' and add the following content:

    .. code::

       asapServerLog,LineCount,0
       LineCount,d1,0
       d1,$$target

    This `graph` file defines the workflow graph as follows: the
    `asapServerLog` dataset is given as input to the `LineCount` abstract
    operator, and the `LineCount` operator writes its result into `d1`.
    Finally, the `d1` node maps to the final result ($$target).

iii. Operators: create a folder named 'operators' that will contain the
     operators involved in the workflow. In the 'operators' folder create
     a file named 'LineCount' and add the following content:

     .. code::

        Constraints.Engine=Spark
        Constraints.Output.number=1
        Constraints.Input.number=1
        Constraints.OpSpecification.Algorithm.name=LineCount

iv. Restart the server for the changes to take effect:

    .. code:: bash

       $IRES_HOME/asap-platform/asap-server/src/main/scripts/asap-server restart
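As mentioned above, after these steps the workflow folder should look as
follows (a sketch of the expected layout, given the files created in this
section):

.. code:: bash

   $ find asapLibrary/abstractWorkflows/LineCountWorkflow
   asapLibrary/abstractWorkflows/LineCountWorkflow
   asapLibrary/abstractWorkflows/LineCountWorkflow/graph
   asapLibrary/abstractWorkflows/LineCountWorkflow/datasets
   asapLibrary/abstractWorkflows/LineCountWorkflow/datasets/asapServerLog
   asapLibrary/abstractWorkflows/LineCountWorkflow/datasets/d1
   asapLibrary/abstractWorkflows/LineCountWorkflow/operators
   asapLibrary/abstractWorkflows/LineCountWorkflow/operators/LineCount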
------------------------------------
Abstract Workflow Definition (GUI)
------------------------------------

Alternatively, the abstract workflow can be defined through the web UI as
follows.

i. Go to the `Abstract Workflows` tab. Enter the name "LineCountWorkflow"
   in the Name text box and click the `New Workflow` button.

ii. Add the workflow parts one by one. First add the asapServerLog
    dataset from the dataset library: select the `Materialized Dataset`
    radio button and enter the dataset name in the comma-separated list
    text box. Then click the `Add nodes` button to add the dataset node
    to the workflow graph. Repeat this step to add an output node named
    d1: just enter the name `d1` in the text box and click the `Add
    nodes` button.

iii. Add the LineCount abstract operator to the workflow: select the
     `Abstract Operator` radio button, enter the operator's name
     (LineCount) in the text box and click the `Add nodes` button again.

iv. Describe the workflow by connecting the graph nodes defined in the
    previous steps. Enter the following text in the large text box and
    click the `Change graph` button:

    .. code::

       asapServerLog,LineCount
       LineCount,d1
       d1,$$target

-------------------------
Workflow Materialization
-------------------------

To materialize the workflow, navigate to the `Abstract Workflows` tab and
click on the LineCountWorkflow created in the previous steps.

.. image:: ./images/lineCount/abstractLineCount.png
   :width: 150%

Click on the `Materialize Workflow` button.

.. image:: ./images/lineCount/lineCountMaterialized.png
   :width: 150%

Now you can see the materialized LineCount workflow. Click on the
`Execute Workflow` button to trigger the execution.

.. image:: ./images/lineCount/lineCountExecution.png
   :width: 150%

When the execution finishes, navigate to the HDFS file browser to see the
output, located in the appN folder.

.. image:: ./images/lineCount/lineCountHDFS.png
   :width: 150%

All resources and example files described in this section are available
`here <./files/LineCountExample.tar>`_.

====================================
Creating a text clustering workflow
====================================

This example describes how to define a text clustering workflow
consisting of two operators. The workflow takes a dataset of raw text
files as input. The first operator transforms the files into tf-idf
vectors. The vectors are then given as input to the next operator, which
performs the clustering using a k-means algorithm. We will use two
Cilk-based implementations for this example, and we will create all the
required files and directories using the server-side method.

-------------------
Dataset definition
-------------------

We will use a text file for this example. The file should exist in the
HDFS cluster under the name 'textData'. Create the dataset definition as
follows:

1. Create a file named 'textData' in the asapLibrary/datasets folder.
2. Add the following content:

   .. code::

      Constraints.Engine.FS = HDFS
      Constraints.type = text
      Execution.path = hdfs:///user/asap/input/textData
      Optimization.size = 932E06
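The definition above expects the raw text to already be on HDFS at the
declared path. A minimal upload sketch (the local file name 'textData' is
an assumption; adjust it to wherever your raw text lives):

.. code:: bash

   # Upload the raw text input to the path declared in Execution.path
   hdfs dfs -mkdir -p /user/asap/input
   hdfs dfs -put textData /user/asap/input/textData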
-------------------------------------
TF-IDF abstract operator definition
-------------------------------------

Next, we'll create the abstract definition of the TF-IDF operator:

1. Create a file named 'tf-idf' in the asapLibrary/abstractOperators
   folder.
2. Add the following content:

   .. code::

      Constraints.Input.number = 1
      Constraints.OpSpecification.Algorithm.name = TF_IDF
      Constraints.Output.number = 1

--------------------------------------
K-Means abstract operator definition
--------------------------------------

Create the abstract definition of the K-Means operator as follows:

1. Create a file named 'kmeans' in the asapLibrary/abstractOperators
   folder.
2. Add the following content:

   .. code::

      Constraints.Input.number = 1
      Constraints.OpSpecification.Algorithm.name = kmeans
      Constraints.Output.number = 1

-----------------------------
Abstract workflow definition
-----------------------------

In this step we'll connect the two aforementioned operators to define the
text clustering workflow:

1. Create a folder named 'TextClustering' in the
   asapLibrary/abstractWorkflows folder.
2. Specify the workflow graph by creating a file named 'graph' with the
   following content (the node names match the dataset and abstract
   operator files defined above):

   .. code::

      textData,tf-idf,0
      tf-idf,d1,0
      d1,kmeans,0
      kmeans,d2,0
      d2,$$target

Next, we will define the materialized operators. We will use Cilk for our
implementations.

-----------------------------------------------
TF-IDF materialized operator definition (Cilk)
-----------------------------------------------

1. Create a folder named 'TF_IDF_cilk' in the asapLibrary/operators
   folder.
2. Create the description file named 'description' and add the following
   content:

   .. code::

      Constraints.Output0.Engine.FS=HDFS
      Constraints.OpSpecification.Algorithm.name=TF_IDF
      Constraints.Input0.type=text
      Constraints.Output0.type=arff
      Constraints.Engine=Cilk
      Constraints.Output.number=1
      Constraints.Input.number=1
      Execution.LuaScript=TF_IDF_cilk.lua
      Execution.Arguments.number=2
      Execution.Argument0=In0.path.local
      Execution.Argument1=tfidf.out
      Execution.copyFromLocal=tfidf.out
      Execution.copyToLocal=In0.path
      Execution.Output0.path=$HDFS_OP_DIR/tfidf.out

3. Create the lua file named 'TF_IDF_cilk.lua' as follows:

   .. code:: lua

      operator = yarn {
          name = "Execute cilk tfidf",
          timeout = 10000,
          memory = 1024,
          cores = 1,
          container = {
              instances = 1,
              --env = base_env,
              resources = {
                  ["tfidf"] = {
                      file = "asapLibrary/operators/TF_IDF_cilk/tfidf",
                      type = "file",              -- other value: 'archive'
                      visibility = "application"  -- other values: 'private', 'public'
                  }
              },
              command = {
                  base = "export LD_LIBRARY_PATH=/0/asap/qub/gcc-5/lib64:$LD_LIBRARY_PATH ; ./tfidf"
              }
          }
      }

4. Add the 'tfidf' executable (it can be found in the tarball provided at
   the end of this article).

------------------------------------------------
K-Means materialized operator definition (Cilk)
------------------------------------------------

1. Create a folder named 'kmeans_cilk' in the asapLibrary/operators
   folder.
2. Create the description file named 'description' and add the following
   content:

   .. code::

      Constraints.Output0.Engine.FS=HDFS
      Constraints.OpSpecification.Algorithm.name=kmeans
      Constraints.Input0.Engine.FS=HDFS
      Constraints.Input0.type=arff
      Constraints.Engine=Cilk
      Constraints.Output.number=1
      Constraints.Input.number=1
      Execution.LuaScript=kmeans_cilk.lua
      Execution.Arguments.number=2
      Execution.Argument0=In0.path.local
      Execution.Argument1=kmeans.out
      Execution.copyFromLocal=kmeans.out
      Execution.copyToLocal=In0.path
      Execution.Output0.path=$HDFS_OP_DIR/kmeans.out

3. Create the lua file named 'kmeans_cilk.lua' as follows:

   .. code:: lua

      operator = yarn {
          name = "Execute kmeans",
          timeout = 10000,
          memory = 1024,
          cores = 1,
          container = {
              instances = 1,
              --env = base_env,
              resources = {
                  ["kmeans"] = {
                      file = "asapLibrary/operators/kmeans_cilk/kmeans",
                      type = "file",              -- other value: 'archive'
                      visibility = "application"  -- other values: 'private', 'public'
                  }
              },
              command = {
                  base = "export LD_LIBRARY_PATH=/0/asap/qub/gcc-5/lib64:$LD_LIBRARY_PATH ; ./kmeans"
              }
          }
      }

4. Add the 'kmeans' executable (it can also be found in the tarball).

---------------------
Execute the workflow
---------------------

After finishing the previous steps, restart the server for the changes to
take effect. Then:

1. Go to Abstract Workflows and click on TextClustering.

   .. image:: ./images/TextClustering/abstract.png
      :width: 150%

2. Materialize the workflow by clicking the 'Materialize' button.

   .. image:: ./images/TextClustering/materialized.png
      :width: 150%

3. Start the workflow execution by clicking the 'Execute' button.

   .. image:: ./images/TextClustering/running.png
      :width: 150%

The files used in this example can be downloaded
`here <./files/TextClustering.tar>`_.
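Once the execution has finished, the result can also be inspected from
the shell (a sketch; the output directory is assigned per execution, so
<path-to-appN> below is a placeholder that you can read off the HDFS file
browser or the IReS UI):

.. code:: bash

   # List finished YARN applications submitted by IReS
   yarn application -list -appStates FINISHED

   # Inspect the clustering output; <path-to-appN> is the per-execution
   # output folder that holds kmeans.out
   hdfs dfs -cat <path-to-appN>/kmeans.out | head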