Spark scheduling across applications [duplicate] - r

This question already has an answer here:
multiple spark application submission on standalone mode
(1 answer)
Closed 5 years ago.
I want to run a Spark word-count application on four different files at the same time.
I have a standalone cluster with 4 worker nodes, each node having one core and 1 GB of memory.
Spark works in standalone mode with:
1. 4 worker nodes
2. 1 core for each worker node
3. 1 GB memory for each node
4. core_max set to 1
./conf/spark-env.sh
export SPARK_MASTER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_OPTS="-Dspark.deploy.defaultCores=1"
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=4
I have executed the applications using a .sh file:
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R txt1 &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R txt2 &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R txt3 &
./bin/spark-submit --master spark://-Aspire-E5-001:7077 ./wordcount.R txt4
Is this a correct way to submit applications in parallel?
When one application runs alone, it takes about 2 seconds (using only one core).
When 4 applications are submitted simultaneously, each application takes more than 4 seconds.
How do I run Spark applications on different files in parallel?

When you submit multiple applications to a Spark cluster, the cluster manager (the standalone master in your case, or YARN's resource manager / application master when Spark runs on top of YARN) automatically schedules them in parallel, as long as resources are available.
You don't need to do any extra scheduling for that.
For the scenario you have shown, you could instead have read all the different files in a single Spark job.
Due to Spark's lazy evaluation, DAG optimizations, and RDD transformations (logical/physical plans), reading the different files and counting the words will run in parallel within that single job.
You can read all the files in a single job as:
sc.wholeTextFiles("<folder-path>")
where <folder-path> is the parent directory in which all the files reside.
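Since the question uses SparkR (wordcount.R), a rough single-job equivalent in R could look like the sketch below. This is only an illustration, assuming the Spark 2.x SparkR DataFrame API; /data/texts is a made-up parent directory holding the four input files.
library(SparkR)
# Start a SparkR session; when launched via spark-submit, the master URL is taken from --master
sparkR.session(appName = "wordcount-all-files")
# Read every text file under the folder into one SparkDataFrame (one line per row, column "value")
lines <- read.text("/data/texts")
# Split each line into words and count the occurrences of each word
words <- selectExpr(lines, "explode(split(value, ' ')) AS word")
counts <- count(groupBy(words, "word"))
head(counts)
sparkR.session.stop()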

Related

Install R on the nodes for Azure Batch services

I can create the Batch service resources using PowerShell as described here: https://learn.microsoft.com/en-us/azure/batch/batch-powershell-cmdlets-get-started
I want to run an R script on the nodes, and I need R installed on them, as none of the available VMs (Windows or Linux) come with R installed. I have currently installed R by manually logging into the VM, but I want to create the Batch resources and then install R on the nodes, preferably through a script, before I run the R code. How can I go about this?
There are 4 main ways to load the necessary software onto the VMs:
1. Create a start task, along with resource files if needed, to prep each compute node to your requirements.
2. Create a custom image that already contains all of your software preconfigured.
3. Use containers instead of directly loading software onto the compute node.
4. Utilize application packages.

AzerothCore: importing the database updates

Hello, I wanted to ask: to import the .sql updates (after a git pull), do I have to assemble and merge them with the bash script (app/db_assembler), or is it enough to just launch worldserver.exe and it will do it?
Thanks
Short answer
No, the worldserver process will NOT update your database.
You need to use the DB-assembler bash script, as the instructions say.
More details
This is different from TrinityCore, where updating the database is a feature of the worldserver process.
In AzerothCore this task is the responsibility of an external script, written in bash: the DB-assembler.
The advantages of having an external script do this task instead of the worldserver are:
You don't need to compile and run the worldserver if you only need to create the database (useful when using or developing tools that only need the DBs).
The DB-assembler is able to generate a single SQL update file for each DB (by merging all the individual SQL update files), which can be useful for debugging or development purposes.
In general, it is better to delegate different tasks to different software components, instead of having a monolith doing everything.
You can also write your own merge script and apply the updates manually, or just merge with db_assembler.sh and then apply them manually.
Otherwise, refer to Francesco's answer.

How do I set up and run SparkR projects and scripts (like a jar file)?

We have successfully gone through all the SparkR tutorials about setting it up and running basic programs in RStudio on an EC2 instance.
What we can't figure out now is how to then create a project with SparkR as a dependency, compile/jar it, and run any of the various R programs within it.
We're coming from Scala and Java, so we may be approaching this with the wrong mindset. Is this even possible in R, or is it done differently than with Java's build files and JARs, or do you just have to run each R script individually without a packaged jar?
do you just have to run each R script individually without a packaged jar?
More or less. While you can create an R package (or packages) to store reusable parts of your code (see for example devtools::create or R packages) and optionally distribute it over the cluster (since the current public API is limited to high-level interactions with the JVM backend, it shouldn't be required), what you pass to spark-submit is simply a single R script which (a sketch follows this list):
creates a SparkContext - SparkR::sparkR.init
creates a SQLContext / HiveContext - SparkR::sparkRSQL.init / SparkR::sparkRHive.init
executes the rest of your code
stops SparkContext - SparkR::sparkR.stop
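As a rough sketch of such a script (assuming the Spark 1.x SparkR API that these functions come from; the app name and the toy analysis are made up):
# analysis.R - submitted with something like ./bin/spark-submit analysis.R
library(SparkR)
# 1. create a SparkContext (the master URL is supplied by spark-submit)
sc <- sparkR.init(appName = "my-analysis")
# 2. create a SQLContext on top of it
sqlContext <- sparkRSQL.init(sc)
# 3. the rest of your code
df <- createDataFrame(sqlContext, faithful)
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
# 4. stop the SparkContext
sparkR.stop()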
Assuming that external dependencies are present on the workers, missing packages can be installed at runtime using the if-not-require pattern, for example:
if(!require("some_package")) install.packages("some_package")
or
if(!require("some_package")) devtools::install_github("some_user/some_package")

Efficient parallel Hadoop load from external sources?

Let's assume that I've got a text file with 33,000 lines, where each line is a URL pointing to an accessible 1 GB .gz file, downloadable over HTTPS. Let's also assume that I've got a Hadoop 2.6.0 cluster consisting of 20 nodes. What is the fastest, yet still simple and elegant, parallel way to load all of the files into HDFS?
The best approach that I've been able to think of so far is a bash script that connects via SSH to all of the other nodes and runs a series of wget commands piped to HDFS put commands. But in this scenario I am worried about the concurrency.
You can use Java's multithreaded ExecutorService (the sample example here).
You can read the text file of URLs, take, say, 10 lines at a time, and start downloading them in parallel using Java multithreading. You can set the number of threads to any number instead of 10.
You can download the files in multiple threads and then put them into HDFS using the Java HDFS APIs.
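The answer suggests Java, but the same idea (concurrent downloads streamed straight into HDFS) can also be driven from R, in keeping with the rest of this page. This is only a hedged sketch: it assumes wget and the hdfs CLI are on the PATH, and urls.txt and /data/raw are made-up names.
library(parallel)
urls <- readLines("urls.txt")
fetch_one <- function(url) {
  # stream the download directly into HDFS without touching local disk
  system(sprintf("wget -qO- '%s' | hdfs dfs -put - /data/raw/%s", url, basename(url)))
}
# 10 concurrent downloads on this machine; adjust mc.cores to taste
invisible(mclapply(urls, fetch_one, mc.cores = 10))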

Unix .pid file created automatically? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Reference for proper handling of PID file on Unix
Is the .pid file in Unix created automatically by the operating system whenever a process runs, or should the process create the .pid file programmatically, like "echo $$ > myprogram.pid"?
The latter holds true -- a process must create the .pid file itself (the OS won't do it). PID files are often used by daemon processes for various purposes. For example, they can be used to prevent a process from running more than once. They can also be used so that control processes (e.g. apache2ctl) know which process to send signals to.
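For illustration only (sticking with R, as in the rest of this page), a script can manage its own PID file along these lines; "myprogram.pid" is just an example name.
pid_file <- "myprogram.pid"
# refuse to start if another instance appears to be running
if (file.exists(pid_file)) {
  stop("PID file already exists - is another instance running?")
}
# record our own process id
writeLines(as.character(Sys.getpid()), pid_file)
# ... do the actual work here ...
# remove the PID file on normal exit
unlink(pid_file)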
