I am submitting a SparkR job to run on a YARN cluster in cluster mode with the ./bin/spark-submit script. I need to upload a file (an external dataset) with the --files option. This uploads the file to a temporary HDFS directory, but I need the path where the file is placed on the executors so I can reference it directly in my SparkR code.
For Java and PySpark, files distributed with --files can be accessed via the SparkFiles.get(filename) method, which returns the absolute path of filename. Is there an equivalent in SparkR?
I know the problem can be worked around in different ways:
Put the files manually on HDFS
Deploy the files on the worker nodes
But I want to use this option for convenience.
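For what it's worth, newer SparkR releases (2.1.0 and later, if I recall correctly) expose spark.getSparkFiles(), which mirrors SparkFiles.get(). A minimal sketch, assuming SparkR >= 2.1.0 and a file passed as --files data.csv (the file name is a placeholder):

# Sketch assuming SparkR >= 2.1.0 and spark-submit ... --files data.csv;
# spark.getSparkFiles() returns the absolute local path of the distributed copy.
library(SparkR)
sparkR.session()
path <- spark.getSparkFiles("data.csv")
dataset <- read.csv(path)  # plain R read of the local copy
# On older versions, the internal API can be tried (unsupported):
# path <- SparkR:::callJStatic("org.apache.spark.SparkFiles", "get", "data.csv")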
TL;DR: Convert the bash sftp command get Inbox/* (which downloads the files) to C++ or Python, given that we do not have execute permissions on the Inbox directory.
I am trying to read the files present in a directory on a remote server through SFTP. The catch is that I only have read and write permissions on the directory, not execute. This means any method that requires opening (cd-ing into) the folder will fail. I need to read the file names, since they are variable. From what I understand, ls does not require execute privileges, and if I can get a list of the files in the directory, then reading them would be fine. Here is the directory structure:
Inbox
--file-a.txt
--file_b.txt
...
I have tried libssh, but sftp_readdir requires a handle to the open directory. I also looked at Paramiko for Python, but that too requires opening the directory to read the file names.
I am able to do this in bash sftp using send "get Inbox/* ${destination_dir}". Is there any way I can use a similar pattern match in C++ or Python?
Also, I cannot execute bash commands through my binary. Does anyone know of any library in Python or C++ (preferred) that would support this?
I have not posted here in a while, so please excuse me if I am not following the formatting conventions. I will learn from your suggestions. Thank you!
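For what it's worth, Paramiko's listdir() issues the same SFTP opendir/readdir requests that the OpenSSH sftp client uses to expand the glob in get Inbox/*, so it may be worth testing whether it actually fails under your permissions. A minimal sketch (host, credentials, and the destination path are placeholders):

# Sketch using paramiko; host, credentials, and destination are placeholders.
import paramiko

transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="user", password="secret")
sftp = paramiko.SFTPClient.from_transport(transport)

# listdir() performs an SFTP READDIR, the same request the OpenSSH client
# uses to expand the glob in "get Inbox/*".
for name in sftp.listdir("Inbox"):
    sftp.get("Inbox/" + name, "/local/dest/" + name)

sftp.close()
transport.close()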
I'm using the JFrog CLI to download content from Artifactory. It seems that even though the destination contains the same files, the CLI tries to download them again: if I re-run the command without cleaning the destination folder, it takes the same amount of time.
Is there any option to speed up the process, i.e. skip a file if the destination folder already contains a file with the same SHA1?
Our command (download all folders a* in the repo):
jfrog rt dl --threads=`nproc` repo_name/a*/ $TMP_FOLDER/
JFrog CLI already skips the download when the file exists locally, which is validated using a checksum.
You can see this by setting the environment variable JFROG_CLI_LOG_LEVEL=DEBUG and then running the same download command again. In the debug log you will see, for some files, the line "File already exists locally" - this means the download was skipped because the file already exists.
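For example, assuming a bash shell (the download command itself is the one from the question):

export JFROG_CLI_LOG_LEVEL=DEBUG
jfrog rt dl --threads=`nproc` repo_name/a*/ $TMP_FOLDER/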
The relevant code can be found on GitHub - see the method "downloadFileIfNeeded".
Keep in mind that the CLI still has to fetch the file info from Artifactory and calculate the local file's checksum, so with many small files this won't have as strong an effect as it does for large file downloads.
I'm having issues on Windows with R failing to change the working directory to a mapped network drive (e.g. \\Share\Folder mapped to Z:) in batch mode. If I run the same script in an interactive console, I don't have any issues. I am doing this by running R.exe with the script specified inside a Windows batch (.bat) file. The .bat file contains the following:
"C:\RRO\R-3.2.1\bin\R.exe" CMD BATCH "C:/Scripts/Rscript.R"
The error is simply...
> setwd( 'Z:/' )
Error in setwd("Z:/") : cannot change working directory
I'd be open to an entirely different approach for scheduling these scripts via the Windows Task Scheduler if that helps avoid the issue. The reason for mapping the drive is that I need to supply credentials in order to access it, which is done automatically when it is mapped; I could test supplying them from R instead, if anyone knows how.
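One thing worth testing for the credentials point (an untested sketch; the server, share, domain, and credentials are placeholders) is mapping the drive from within the R script itself via net use before calling setwd():

# Untested sketch: map the drive with credentials from inside R, then change
# into it; \\Server\Share, DOMAIN\username and password are placeholders.
system('net use Z: "\\\\Server\\Share" /user:DOMAIN\\username password')
setwd("Z:/")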
I hope this can help with your question.
I reproduced the scenario without errors by using the Rscript command instead of R CMD BATCH.
My R code, which I saved as a script (test1.R):
library(openxlsx)
setwd("P:/Records/Indexing Operations/Indexing Data Analysis/Daily Reports")
my.data = read.xlsx("FSI Daily Project Status Report - 18 Mar 2016.xlsx", sheet = 1)
setwd("C:/Users/golieth/Documents/")
png(filename = "test.png", width = 500, height = 350 )
plot(my.data$Total.Images, my.data$Completed.Images.A,
main = Sys.time())
dev.off()
Note that I change the directory twice in this file: once to access data on a mapped network drive, and a second time to save the image to the local computer. I put the current time as the main plot title so you can run the batch file repeatedly and verify that it works.
My batch file:
cd C:\Program Files\R\R-3.2.3\bin\i386
Rscript C:\Users\golieth\Documents\test1.R
Note: in the batch file, if your code relies on 32-bit R, you need to change the directory (cd) to the 32-bit R program folder; likewise for 64-bit. The Rscript line should then reference where you have saved your .R file.
Finally, and this might be stating the obvious, but make sure you are connected to your VPN before running the batch file.
Imagine a batch file with
cd Z:\<Destination>
Z:
RScript "C:/Scripts/Rscript.R"
This lets Windows change to the directory with all credentials and then start R within that directory, so the working directory is the location from which R is started. Doing so requires that "C:\RRO\R-3.2.1\bin\" is part of your PATH variable.
Good luck!
When writing a .bat file, remember that cd is not used to change drive letters. To change drives you simply enter the drive letter followed by a colon, and this should be done before issuing the final cd to the working directory.
Like this:
sample.bat
z:
cd z:\your\working\directory\
C:\RRO\R-3.2.1\bin\Rscript.exe C:/Scripts/Rscript.R
You can save the files locally in your code and then use file.copy to copy them over to your network drive. Also try replacing the mapped drive letter in the file.copy path with the full UNC network address, e.g. \\....\.....\
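A minimal sketch of that idea (the file names and the UNC path are placeholders):

# Sketch: write locally first, then copy to the share via its full UNC path
# (\\Server\Share\Folder is a placeholder).
write.csv(my.data, "C:/temp/results.csv")
file.copy("C:/temp/results.csv", "\\\\Server\\Share\\Folder\\results.csv",
          overwrite = TRUE)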
I am currently trying out Docker links between my app and db containers. I've checked in my app container, and the environment variables are automatically set when I link the containers together.
What I want is for my config file, which is packaged into a jar file, to pick up those environment variables and set the required values from them. Any advice or help?
This is how I create the config file in my jar file to connect to MySQL:
database { url="jdbc:mysql://${MYSQL_PORT_3306_TCP_ADDR}:${MYSQL_PORT_3306_TCP_PORT}/mydb" driver="com.mysql.jdbc.Driver"}
Updating the config file inside the jar would be quite heavy-handed.
I think you have several choices:
read the environment variables directly in your program and use them, either directly or to generate the config file there (see the Scala sketch after this list)
create a launch script (the details depend on your guest OS in Docker; sh/bash for Linux, etc.) that generates a new config file from the environment and puts it on the classpath before the jar, so your program sees it
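For the first option, a minimal sketch in Scala (the localhost defaults are only for testing without a linked container, mirroring the launcher script below):

// Sketch: read the Docker link variables directly, with localhost fallbacks
// for running outside Docker; "mydb" is the database name from the question.
val host = sys.env.getOrElse("MYSQL_PORT_3306_TCP_ADDR", "127.0.0.1")
val port = sys.env.getOrElse("MYSQL_PORT_3306_TCP_PORT", "3306")
val url  = s"jdbc:mysql://$host:$port/mydb"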
EDIT: added example
You can save this kind of launcher script in the Docker image; it dynamically creates the configuration before launching the actual program.
#!/bin/bash
# some default values for testing even without links to other container
MYSQL_PORT_3306_TCP_ADDR=${MYSQL_PORT_3306_TCP_ADDR:-127.0.0.1}
MYSQL_PORT_3306_TCP_PORT=${MYSQL_PORT_3306_TCP_PORT:-3306}
cat << EOF > /opt/yourprogram/dbconfig.conf
database { url="jdbc:mysql://${MYSQL_PORT_3306_TCP_ADDR}:${MYSQL_PORT_3306_TCP_PORT}/mydb" driver="com.mysql.jdbc.Driver"
}
EOF
scala -classpath /opt/yourprogram YourProgram
What I did was write the sh file in my directory /tmp/restcore-1.0-SNAPSHOT/bin like this:
#!/bin/bash
echo "database { url=\"jdbc:mysql://${MYSQL_PORT_3306_TCP_ADDR}:${MYSQL_PORT_3306_TCP_PORT}/mydb\" driver=\"com.mysql.jdbc.Driver\" }" > myconf.conf
jar uf /tmp/restcore-SNAPSHOT/lib/com.organization.restcore-1.0-SNAPSHOT.jar /tmp/restcore-1.0-SNAPSHOT/bin/myconf.conf
After building the Dockerfile and running the sh file in CMD, I use cat myconf.conf to check the config file, and I can see that the environment variables have been substituted in.
I have bundled a jar from my Eclipse project. I would like to pass arguments to the jar - basically an input file. I would like to know how to supply an input file that is not in HDFS. I know that's not how Hadoop works, but this is for testing purposes. Eclipse has this feature for local files. Is there a way to do this via the command line?
You can run hadoop in 'local' mode by overriding the job tracker and file system properties from the command line:
hadoop jar <jar-file> <main-class> -fs local -jt local <other-args..>
You need to be using the GenericOptionsParser (which is the norm if you're using ToolRunner to launch your jobs).
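A minimal driver sketch (the class name is a placeholder; on older Hadoop versions the property is fs.default.name rather than fs.defaultFS):

// Sketch of a driver launched via ToolRunner so that GenericOptionsParser
// handles -fs local -jt local before your code runs.
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJobDriver extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects the generic options; args holds only the
        // remaining job-specific arguments (e.g. the local input file).
        System.out.println("fs = " + getConf().get("fs.defaultFS"));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new MyJobDriver(), args));
    }
}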