Efficient parallel Hadoop load from external sources?

Let's assume that I've got a text file with 33,000 lines, where each line is a URL pointing to an accessible 1 GB .gz file, downloadable over HTTPS. Let's also assume that I've got a Hadoop 2.6.0 cluster consisting of 20 nodes. What is the fastest, yet still simple and elegant, parallel way to load all of the files into HDFS?
The best approach that I've been able to think of so far is a bash script that connects via SSH to all of the other nodes and runs a series of wget commands piped to hdfs put. But in this scenario I am concerned about concurrency.

You can use a Java multithreaded ExecutorService. A sample example is here.
You can read the text file of URLs, take say 10 lines at a time, and start downloading them in parallel using Java multithreading. You can set the number of threads to any number instead of 10.
You can download the files in multiple threads and then put them into HDFS using the Java HDFS APIs. A rough sketch of this idea is shown below.
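For illustration only, here is a minimal sketch of that idea, written in Python rather than the Java suggested above (concurrent.futures plays the role of the ExecutorService). It streams each file straight into HDFS without a local copy; the URL file name, target HDFS directory, and thread count are assumptions to adapt, and curl could be swapped for wget -qO-.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    URL_FILE = "urls.txt"        # one HTTPS URL per line (assumed name)
    HDFS_DIR = "/data/incoming"  # assumed target directory in HDFS
    THREADS = 10                 # tune to your bandwidth and node count

    def fetch_to_hdfs(url):
        """Stream one remote .gz file straight into HDFS without a local copy."""
        name = url.rstrip("/").split("/")[-1]
        curl = subprocess.Popen(["curl", "-sSfL", url], stdout=subprocess.PIPE)
        put = subprocess.run(
            ["hdfs", "dfs", "-put", "-", HDFS_DIR + "/" + name],
            stdin=curl.stdout,
        )
        curl.stdout.close()
        curl.wait()
        return put.returncode

    with open(URL_FILE) as f:
        urls = [line.strip() for line in f if line.strip()]

    with ThreadPoolExecutor(max_workers=THREADS) as pool:
        results = list(pool.map(fetch_to_hdfs, urls))

    print("%d of %d files loaded" % (results.count(0), len(urls)))

To involve all 20 nodes rather than a single driver machine, the same script could be run on each node against a different slice of the URL list, much like the SSH approach described in the question.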

Related

Amazon Web Services - how to run a script daily

I have an R script that I run every day that scrapes data from a couple of different websites and then writes the scraped data to a couple of different CSV files. Each day, at a specific time (that changes daily), I open RStudio, open the file, and run the script. I check that it runs correctly each time, and then I save the output to a CSV file. It is a pain to have to do this every day (it takes ~10-15 minutes a day). I would love it if somehow I could have this script run automatically at a pre-defined time, and a buddy of mine said AWS is capable of doing this?
Is this true? If so, what is the specific feature / aspect of AWS that is able to do this, this way I can look more into it?
Thanks!
Two options come to mind:
Host an EC2 instance with R on it and configure a cron job to execute your R script regularly.
One easy way to get started: use this AMI.
To execute the script, R offers a command-line interface, Rscript. See e.g. here on how to set this up.
Go serverless: AWS Lambda is a hosted, serverless compute service. R is currently not natively supported, but on the official AWS Blog here they offer a step-by-step guide on how to run R. Basically, you execute R from Python using the rpy2 package.
Once you have this set up, schedule the function via CloudWatch Events (roughly a hosted cron job). Here you can find a step-by-step guide on how to do that.
One more thing: you say that your script outputs CSV files. To persist them you will need to put them into a file store like AWS S3. You can do this in R via the aws.s3 package. Another option would be to use the AWS SDK for Python, which is preinstalled in the Lambda runtime: you could, for example, write the CSV file to the /tmp/ directory and, after the R script is done, move the file to S3 via boto3's upload_file function (see the sketch below).
IMHO the first option is easier to set up, but the second one is more robust.
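To make the boto3 part concrete, a Lambda handler following the linked blog-post approach might look roughly like this sketch. It assumes R and the rpy2 package are bundled with the function as described there; the bucket name, script name, and output path are placeholders.

    import boto3
    import rpy2.robjects as robjects

    s3 = boto3.client("s3")

    BUCKET = "my-scrape-results"   # placeholder bucket name
    SCRIPT = "scrape.R"            # placeholder R script bundled with the function
    OUTPUT = "/tmp/results.csv"    # the R script is assumed to write its CSV here

    def lambda_handler(event, context):
        # Run the bundled R script; it scrapes the sites and writes OUTPUT.
        robjects.r["source"](SCRIPT)
        # /tmp/ is the only writable path in Lambda; copy the CSV to S3 from there.
        s3.upload_file(OUTPUT, BUCKET, "results.csv")
        return {"status": "ok"}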
It's a bit counterintuitive, but you'd use CloudWatch with an event rule to run periodically. It can run a Lambda or send a message to an SNS topic or SQS queue. The challenge you'll have is that Lambda doesn't natively support R, so you'd either have to have a Lambda kick off something else or have something listening on the SNS topic or SQS queue run the script for you. It isn't a perfect solution, as there are potentially quite a few moving parts.
@stdunbar is right about using CloudWatch Events to trigger a Lambda function. You can set a fixed rate for the trigger or use a cron expression. But as he mentioned, Lambda does not natively support R.
This may help you to use R with Lambda: R Statistics ready to run in AWS Lambda and x86_64 Linux VMs
If you are running Windows, one of the easier solutions is to write a .BAT script that runs your R script and then use the Windows Task Scheduler to run it as desired.
To call your R script from your batch file, use the following syntax:
"C:\Program Files\R\R-3.2.4\bin\Rscript.exe" C:\rscripts\hello.R
Just verify that the path to the Rscript executable and to your R code is correct.
Dockerize your script (write a Dockerfile, build an image)
Push the image to AWS ECR
Create an AWS ECS cluster and an AWS ECS task definition within the cluster that will run the image from AWS ECR every time it's spun up
Use EventBridge to create a time-based trigger that will run the AWS ECS task definition (a sketch of this step is shown below)
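As a rough illustration of the EventBridge step (EventBridge is reachable through the same API as CloudWatch Events), the scheduled rule targeting the ECS task definition could be created with boto3 along these lines. Every name and ARN below is a placeholder, and the cluster, task definition, IAM role, and subnet are assumed to exist already.

    import boto3

    events = boto3.client("events")

    # Placeholder schedule: run every day at 14:00 UTC.
    events.put_rule(
        Name="daily-r-scrape",
        ScheduleExpression="cron(0 14 * * ? *)",
        State="ENABLED",
    )

    # Point the rule at the ECS task definition (all ARNs are placeholders).
    events.put_targets(
        Rule="daily-r-scrape",
        Targets=[{
            "Id": "r-scrape-task",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/r-jobs",
            "RoleArn": "arn:aws:iam::123456789012:role/ecsEventsRole",
            "EcsParameters": {
                "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/r-scrape:1",
                "TaskCount": 1,
                "LaunchType": "FARGATE",
                "NetworkConfiguration": {
                    "awsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
        }],
    )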
I recently gave a seminar walking through this at the Why R? 2022 conference.
You can check out the video here: https://www.youtube.com/watch?v=dgkm0QkWXag
And the GitHub repo here: https://github.com/mrismailt/why-r-2022-serverless-r-in-the-cloud

How to easily execute R commands on remote server?

I use Excel + R on Windows on a rather slow desktop. I have full admin access to a very fast Ubuntu-based server. I am wondering: how can I remotely execute commands on the server?
What I can do is save the needed variables with saveRDS, load them on the server with readRDS, execute the commands on the server, and then save the results and load them back on Windows.
But it is all very interactive and manual, and can hardly be done on a regular basis.
Is there any way to do this directly from R, i.e.:
Connect to the server via e.g. ssh,
Transfer the needed objects (which can be specified manually),
Execute the given code on the server and wait for the result,
Get the result.
I could run R entirely remotely, but that would create network-related problems. Most R commands I run from within Excel are very fast and data-hungry. I just need to remotely execute some specific commands, not all of them.
Here is my setup.
Copy your code and data over using scp. (I use GitHub, so I clone my code from GitHub. This has the benefit of making sure that my work is reproducible.)
(optional) Use sshfs to mount the remote folder on your local machine. This allows you to edit the remote files using your local text editor instead of the ssh command line.
Put everything you want to run in an R script (on the remote server), then run it via ssh in R batch mode.
There are a few options; the simplest is to exchange SSH keys so that you don't have to enter SSH/SCP passwords manually all the time. After this, you can write a simple R script that will:
Save the necessary variables into a data file,
Use scp to upload the data file to the Ubuntu server,
Use ssh to run a remote script that processes the data you have just uploaded and stores the result in another data file,
Again use scp to transfer the results back to your workstation.
You can use R's system command to run scp and ssh with the necessary options (a sketch of the sequence is shown below).
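For concreteness, the four steps above might look like the following sketch. It is written in Python purely to show the command sequence; from R you would issue the same scp/ssh commands through system() or system2(). The host, paths, and remote script name are placeholders.

    import subprocess

    SERVER = "user@fast-server"   # placeholder host

    def run(*cmd):
        subprocess.run(cmd, check=True)

    # 1. The needed variables are assumed to be saved locally first, e.g. via saveRDS() -> input.rds
    # 2. Upload the data file to the server.
    run("scp", "input.rds", SERVER + ":/tmp/input.rds")
    # 3. Run the remote script that reads input.rds and writes result.rds.
    run("ssh", SERVER, "Rscript /home/user/process.R /tmp/input.rds /tmp/result.rds")
    # 4. Fetch the results back to the workstation.
    run("scp", SERVER + ":/tmp/result.rds", "result.rds")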
Another option is to set up a cluster worker on the remote machine; then you can export the data using clusterExport and evaluate expressions using clusterEvalQ and clusterApply.
There are a few more options:
1) You can do this directly from R by using Rserve. See: https://rforge.net/
Keep in mind that Rserve can accept connections from R clients; see for example how to connect to Rserve with an R client.
2) You can set up a cluster on your Linux machine and then use these cluster facilities from your Windows client. The simplest is to use snow (https://cran.r-project.org/package=snow); also see foreach and many other cluster libraries.

MPI parallel write to a TIFF file

I'm trying to write a TIFF file in MPI code. Different processors have different parts of the image, and I want to write the image to the file in parallel.
The write fails; only the first processor can write to it.
How do I do this?
There is no error in my implementation; it just does not work.
I used h = TIFFOpen(file, "a+") on each processor to open the same file (I am not sure whether this is the right way or not); then each processor responsible for a directory writes the header at its own place using TIFFSetDirectory(h, directorynumber), and then the contents of each directory are written. I finalize with TIFFWriteDirectory(h). The result is that only the first directory gets written to the file.
I thought that I might need to open the file using MPI-IO, but then I could not use TIFFOpen?
Different MPI tasks are independent programs, running as independent processes (possibly on independent hosts) from the OS point of view. In your case the TIFF library is not designed to handle parallel operations, so opening the file succeeds for the first process while all the rest fail, because they find the file already opened (on a shared filesystem).
Unless you are dealing with huge images (e.g. astronomical images), where parallel I/O is important for performance (you need a filesystem supporting it, however; I am aware of IBM GPFS), I would avoid writing a custom TIFF driver with MPI-IO.
Instead, the typical solution is to gather the image parts (MPI_Gather()) on the process with rank == 0 and let only that process save the TIFF file.
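Here is a minimal sketch of that gather-then-write pattern, shown with mpi4py rather than the C API (the structure is the same with MPI_Gather in C); the strip size and output file name are assumptions.

    # Run with e.g.: mpiexec -n 4 python gather_tiff.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank holds one horizontal strip of the image (assumed 100 x 512, uint8).
    local_strip = np.full((100, 512), rank, dtype=np.uint8)

    # Collect every strip on rank 0; 'strips' is None on all other ranks.
    strips = comm.gather(local_strip, root=0)

    if rank == 0:
        image = np.vstack(strips)
        # Only rank 0 touches the TIFF library, so it never sees concurrent writers;
        # e.g. with the tifffile package: tifffile.imwrite("output.tif", image)
        print("assembled image:", image.shape)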

Amazon EMR: Using R code in Amazon EMR

I have a very beginner question. I've just been reading through some of the documentation regarding Amazon's EMR. Before I sign up, I just wanted to ask about using R in it.
I have one R module that calls several other modules, and then, just before it finishes running, saves several variables as .txt files.
My rather basic question is, can I do this in Amazon's EMR? And will I be able to access the .txt output files? Finally, my R script reads in some data from Excel spreadsheets. Will it still be able to do this from the EMR if I upload the Excel files into the system?
Thanks
Mike
@Mike, answers to your 3 questions below:
Running R on EMR: Yes, you can.
You can run R programs on EMR once you have installed R on the EMR instance. I assume that you would write MapReduce modules if you plan to use a multi-instance cluster. If your program is just a "plain" R program, then you may only need one sizable instance; in that case I would rather use an EC2 instance with an R AMI (look for Louis Aslett).
Moving output files:
Yes, you can. It is possible to transfer your program output from EMR to an S3 bucket of your choice. You will have to add a step calling the S3DistCp command to move the files. An example from my project (a boto3 sketch of adding such a step is shown after this answer):
--jar /home/hadoop/lib/emr-s3distcp-1.0.jar --args '--src,hdfs:///contents,--dest,s3://<bucket-name>/'
Reading spreadsheets: AFAIK, if you are able to do this on a local installation of R, then you should also be able to do it on EMR. You have to ensure that the necessary packages/libraries are installed during the bootstrap process.
I was able to install squeezy-cran and rmr2 on an EMR instance with all their dependencies (Rcpp, reshape2, digest, RJSONIO, functional, etc.). I am still unable to call the R program as a step; I have to use an SSH session and run R CMD commands at the shell prompt. Being on Windows, putty.exe works for me.
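For reference, adding such an S3DistCp step can also be done programmatically; a rough boto3 sketch (the cluster id is a placeholder, and the jar path and arguments are taken from the example above) might look like this:

    import boto3

    emr = boto3.client("emr")

    emr.add_job_flow_steps(
        JobFlowId="j-XXXXXXXXXXXX",   # placeholder EMR cluster id
        Steps=[{
            "Name": "Copy HDFS output to S3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "/home/hadoop/lib/emr-s3distcp-1.0.jar",
                "Args": ["--src", "hdfs:///contents", "--dest", "s3://<bucket-name>/"],
            },
        }],
    )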

How do I scp a file to a Unix host so that a file polling service won't see it before the copy is complete?

I am trying to transfer a file to a remote Unix server using scp. On that server, there is a service which polls the target directory to detect incoming files for processing. I would like to ensure that the polling service does not pick up new files before the copy is complete. Is there a way of doing that?
My file transfer process is a simple scp command embedded in a larger Java program. Ideally, a solution which did not involve changing the Java would be best (for reasons involving change control processes).
You can scp the file to a different directory (e.g. /tmp) and move the file via ssh after the transfer is complete (see the sketch after this list). The temporary directory needs to be on the same partition as the final destination directory, otherwise the move becomes a copy operation and you'll face a similar problem. Another service on the destination machine can perform this move operation.
You can copy the file as hidden (prefix the filename with a .) and then rename it once the copy is complete.
If you can modify the polling service, you can check for active scp processes and ignore files matching their arguments.
You can check for open files with lsof +d $directory and ignore them in the polling service.
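A minimal sketch of the first option (copy to a temporary path, then rename), kept outside the Java program as the asker prefers; the host and paths are placeholders, and both paths are assumed to live on the same partition so the mv is an atomic rename.

    import subprocess

    REMOTE = "user@remotehost"                      # placeholder
    TMP_PATH = "/data/incoming/.tmp/report.csv"     # staging path the poller ignores (dir assumed to exist)
    FINAL_PATH = "/data/incoming/report.csv"        # directory the polling service watches

    # 1. Copy into the staging path first.
    subprocess.run(["scp", "report.csv", REMOTE + ":" + TMP_PATH], check=True)

    # 2. mv within one partition is an atomic rename, so the poller sees either
    #    the complete file or nothing at all.
    subprocess.run(["ssh", REMOTE, "mv " + TMP_PATH + " " + FINAL_PATH], check=True)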
I suggest copying the file using rsync instead of scp. rsync already copies new files to temporary filenames, and has many other useful features for file synchronization as well.
$ rsync -a source/path/ remotehost:/target/path/
Of course, you can also copy file-by-file if that's your preference.
If rsync's temporary filenames are sufficient to avoid being picked up by your polling service, then you could simply replace your scp command with a shell script that acts as a wrapper for rsync, eliminating the need to change your Java program.
You would need to know the precise format that your Java program uses to call the scp command, to make sure that the options you feed to rsync do what you expect.
You would also need to figure out how your Java program calls scp. If it does so by full pathname (i.e. /usr/bin/scp), then this solution might put other things at risk on your system that depend on scp (like you, for example, expecting scp to behave as it usually does instead of as a wrapper). Changing a package-installed binary like /usr/bin/scp may also "break" your package registration, making it difficult to install future security updates because a binary has changed to a shell script. And of course, there might be security implications to any change you make.
All in all, I suspect you're better off changing your Java program to make it do precisely what you want, even if that is to launch a shell script to handle aspects of automation that you want to be able to change in the future without modifying your Java.
Good luck!
