pyspark: how to show current directory?

Hi, I'm using pyspark interactively. I think I'm failing to load a LOCAL file correctly.
How do I check the current directory, so that I can go to a file browser and look at the actual file?
Or is the default directory the one where pyspark is located? Thanks

On a cluster you can't load a local file unless the same file exists on every worker under the same path. For example, to read data.csv in Spark, copy the file to every worker under the same path (say /tmp/data.csv). You can then use sc.textFile("file:///tmp/data.csv") to create an RDD.
The current working directory is the folder from which you started pyspark. You can start pyspark with ipython and run the pwd command to check the working directory.
[Set PYSPARK_DRIVER_PYTHON=/path/to/ipython in spark-env.sh to use ipython.]

import os
cwd = os.getcwd()
print(cwd)
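To go one step further than printing the directory, the sketch below resolves a relative file name against the current working directory and reports whether the file actually exists there. The helper name describe_local_path and the file name data.csv are illustrative assumptions, not part of pyspark; the resulting file:// URI is the kind of string the first answer passes to sc.textFile.

```python
import os
from pathlib import Path


def describe_local_path(relative_name):
    """Resolve a file name against the current working directory.

    Hypothetical helper for illustration; it does not touch Spark at all.
    """
    cwd = os.getcwd()
    absolute = Path(cwd) / relative_name
    return {
        "cwd": cwd,                     # where the driver process was started
        "absolute": str(absolute),      # full path on the driver machine
        "uri": absolute.as_uri(),       # file:// form of the same path
        "exists": absolute.exists(),    # True only if the file is really there
    }


info = describe_local_path("data.csv")  # "data.csv" is a placeholder name
for key, value in info.items():
    print(key, "=", value)
```

If "exists" is False, the relative name does not point where you expected. Note that this only checks the driver machine; on a cluster the workers still need the file under the same path.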

Related

How to put an R script into a dockerfile?

I am trying to add my R script into a dockerfile. The beginning of the file (loading base image, installing necessary packages) works fine when I run it in my terminal, but I keep getting this error when it gets to the line that contains the R script I want to run:
Step 15/17 : COPY /Users/emma/Documents/folder1/examples/question-1/model-1.r .
COPY failed: stat /var/lib/docker/tmp/docker-builder376572603/Users/emma/Documents/folder1/examples/question-1/model-1.r: no such file or directory
I am already running the terminal out of the "question-1" directory (my shell command looks like this) :
Emmas-MacBook-Air-2:question-1 emma$
and the R script file "model-1.r" is in that folder. What am I doing wrong in detailing the path to the R script? Do I have to somehow transform the script before adding it to the dockerfile?
Thank you
I believe you have to specify the path to copy from relative to your build folder. From the documentation:
Multiple resources may be specified but the paths of files and directories will be interpreted as relative to the source of the context of the build.
The file to copy must also be inside the build context. So if your Dockerfile is located in folder A, the objects you want to copy should be placed in folder A or its subfolders.
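The "inside the build context" rule can be checked on the host before running the build. The following is only a rough sketch of that containment check, not what Docker itself does internally; the function name is_inside_context and the directory layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def is_inside_context(context_dir, candidate):
    """Return True if candidate, taken relative to context_dir, stays inside it."""
    context = Path(context_dir).resolve()
    target = (context / candidate).resolve()
    try:
        # relative_to raises ValueError when target is not below context
        target.relative_to(context)
        return True
    except ValueError:
        return False


# A throwaway folder stands in for the directory that holds the Dockerfile.
build_context = Path(tempfile.mkdtemp()) / "question-1"
build_context.mkdir()

print(is_inside_context(build_context, "model-1.r"))
print(is_inside_context(build_context, "../elsewhere/model-1.r"))
```

Under this check a plain "model-1.r" next to the Dockerfile passes, while a path that climbs out of the folder does not, which matches the advice to write COPY sources relative to the build folder.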

Unable to load data file in Julia

There is a CSV file called orders_data stored on my system, but when I try to load it in Julia using the readdlm command in a Jupyter Notebook (running in my browser), it says "NO SUCH FILE DIRECTORY FOUND".
I'm not sure why this happens. Is there a specific location where files need to be stored to be accessible from Julia? Or do I need to install some packages first to load the file from the browser version of Jupyter?
//Error information
SystemError: opening file orders_data.csv: No such file or directory
Your working directory is set to your current location when you start a Julia session. You can see what it is by calling the pwd() function. You can change it by calling the cd() function. Unless you specify otherwise, or provide a more complete pathname, Julia looks for files in your current working directory (although it's different for modules).

Creating temp folder in Linux

I was using R installed on a Linux server over SSH. Everything was fine, but now I have been denied access to the temp folder, and when I load R it fails with the error cannot create 'R_TempDir' because it can't create its temp folder.
Can you please tell me how to create my own local temp folder so that R can create its temporary directory there?
You can try setting one of these environment variables:
TMPDIR, TMP, TEMP:
Consulted (in that order) when setting the temporary directory for the session: see tempdir. TMPDIR is also used by some of the utilities see the help for build
For instance, point TMPDIR at a directory you can write to before starting R:
export TMPDIR=/tmp
Hope this answers your question.
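As a sketch of the same idea from a script, the code below creates a private temp folder and builds an environment with TMPDIR pointing at it. The helper names make_private_tmp and env_with_tmpdir and the folder name r_tmp are made up for illustration, and the throwaway base directory stands in for a directory you own, such as your home directory.

```python
import os
import tempfile


def make_private_tmp(base_dir, name="r_tmp"):
    """Create base_dir/name if it is missing and return its path."""
    path = os.path.join(base_dir, name)
    os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return path


def env_with_tmpdir(tmp_path):
    """Return a copy of the current environment with TMPDIR set to tmp_path."""
    env = dict(os.environ)
    env["TMPDIR"] = tmp_path
    return env


base = tempfile.mkdtemp()  # stand-in for a directory you can write to
private_tmp = make_private_tmp(base)
env = env_with_tmpdir(private_tmp)
print("TMPDIR would be", env["TMPDIR"])

# To start R with this environment, something like the following could be
# used (assumes R is installed and on PATH; not executed here):
# import subprocess
# subprocess.run(["R", "--vanilla"], env=env)
```

Passing a modified copy of the environment to the child process leaves your own shell settings untouched, which is handy for testing before adding the export line to a startup file.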
From what I understand, you could also use the .bashrc file in your /home/username/ directory:
~# nano /home/username/.bashrc
Put the command that creates the folder in this file by adding the line mkdir -p /your/dir/path/yourDir (the -p flag keeps mkdir from failing when the folder already exists).
This file works like an autorun script: it runs every time you start a new interactive bash shell.
Note that this is a per-user setting.

Set working directory to mapped network drive in BATCH mode

I'm having issues on windows with R failing when changing the working directory to a mapped network drive (e.g. \Share\Folder mapped to Z:) in batch mode. If I run the same script in an interactive console I don't have any issues. I am accomplishing this by running R.exe with the script specified inside of a windows batch (.bat) file. The .bat file contains the following.
"C:\RRO\R-3.2.1\bin\R.exe" CMD BATCH "C:/Scripts/Rscript.R"
The error is simply...
> setwd( 'Z:/' )
Error in setwd("Z:/") : cannot change working directory
I'd be open to a different approach entirely for scheduling these scripts via the windows task scheduler if that helps avoid the issue. The reason for mapping the drive is that I need to supply some credentials in order to access it, which is done automatically when it is mapped, but can test to see if that's not the case in R if anyone knows how.
I hope this helps with your question.
I reproduced your setup and got no errors by using the Rscript command instead of R CMD BATCH.
My R code, which I saved as a script (test1.R):
library(openxlsx)
setwd("P:/Records/Indexing Operations/Indexing Data Analysis/Daily Reports")
my.data = read.xlsx("FSI Daily Project Status Report - 18 Mar 2016.xlsx", sheet = 1)
setwd("C:/Users/golieth/Documents/")
png(filename = "test.png", width = 500, height = 350 )
plot(my.data$Total.Images, my.data$Completed.Images.A,
main = Sys.time())
dev.off()
Note that I change the directory twice in this file: once to access data on a mapped network drive, and a second time to save the image to the computer. I use the current time as the main plot title so you can run the batch file repeatedly and verify that it works.
My batch file:
cd C:\Program Files\R\R-3.2.3\bin\i386
Rscript C:\Users\golieth\Documents\test1.R
Note: if your code relies on 32-bit R, the cd in the batch file must point to the directory of the 32-bit R program; the same applies to 64-bit R. The Rscript line should then reference the location where you saved your .R file.
Finally, and this might be stating the obvious, make sure you are connected to your VPN before running the batch file.
Imagine a batch file with
cd Z:\<Destination>
Z:
RScript "C:/Scripts/Rscript.R"
This lets Windows change to the directory with all credentials and then start R from it, so the working directory is the location R was started from. This requires "C:\RRO\R-3.2.1\bin\" to be on your PATH.
Good luck!
When writing a .bat file, remember that cd is not used to change drive letters. To change drive letters you simply enter the name of the drive letter, which should be done prior to issuing the final cd to the working directory.
Like this:
sample.bat
z:
cd z:\your\working\directory\
C:\RRO\R-3.2.1\bin\Rscript.exe C:/Scripts/Rscript.R
Alternatively, save the files locally in your code and use file.copy to copy them over to the network drive. Also try replacing the network drive letter in the file.copy path with the full network address, e.g. \\....\.....\

Setting IPython Notebook save directory when using through django_extensions

I am using IPython Notebook through django_extensions:
python manage.py shell_plus --notebook
This saves the Notebook files to the current folder (Django project folder). How can I change the save location for .ipynb files?
You can change the directories where files are stored and read from using these parameters in ~/.ipython/profile_projectname/ipython_notebook_config.py, where projectname is your Django project.
c.NotebookManager.notebook_dir = u'/path/to/files'
c.FileNotebookManager.notebook_dir = u'/path/to/files'
This seems to break the import path to Django, however. I've tried adjusting sys.path by adding startup scripts to the startup directory, but have not found a solution that works yet. If you find one, let me know, because I'd like to keep my notebook files outside of my project root directory as well.
A little late to the game, but I managed to save the notebooks in a different location + auto importing Django settings.
I start my notebooks with:
PYTHONPATH=/path/to/project/root DJANGO_SETTINGS_MODULE=settings python manage.py shell_plus --notebook --no-browser
The PYTHONPATH enables finding the correct modules of my project and all shell_plus imports work automatically like a charm.
P.S.
As I run this command from my host inside my Vagrant box, --no-browser prevents it from opening w3m ^^
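For the startup-script route mentioned in the earlier answer, the following is a minimal sketch of doing the same two things from Python instead of the command line: putting the project root on sys.path and setting DJANGO_SETTINGS_MODULE. The path is the placeholder from the command above, the function name add_project_to_path is hypothetical, and whether this alone is enough for shell_plus to auto-import models is an assumption that has not been verified here.

```python
import os
import sys

PROJECT_ROOT = "/path/to/project/root"  # placeholder, replace with your own


def add_project_to_path(project_root, settings_module="settings"):
    """Make the Django project importable and name its settings module."""
    if project_root not in sys.path:
        # Put the project first so its modules win over same-named packages.
        sys.path.insert(0, project_root)
    # setdefault keeps a value that was already exported in the shell.
    os.environ.setdefault("DJANGO_SETTINGS_MODULE", settings_module)


add_project_to_path(PROJECT_ROOT)
print(sys.path[0], os.environ["DJANGO_SETTINGS_MODULE"])
```

Saved as a .py file in the profile's startup directory, a script like this would run when the notebook kernel starts, which has the same effect as the PYTHONPATH and DJANGO_SETTINGS_MODULE variables in the command line version.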
