Running jobs in background in R

I am working with a 250 by 250 matrix that takes a very long time to compute, at least an hour.
Is it possible to store this matrix so that every time I open R, it is already there?
Ideally, I would also like to know whether it is possible to run a job in the background in R, so that I don't have to wait an hour before I can play around with the matrix.

1) You can save the R workspace when closing R. R usually asks "Save workspace image?" on exit; if you answer "Yes", it saves the workspace in a file named ".RData" and reloads it when you start a new R instance.
2) The better (safer) option is to save the matrix explicitly. There are several ways to do this. One option is to save it as an Rdata file:
save(m, file = "matrix.Rdata")
where m is your matrix.
You can load the matrix at any time with
load("matrix.Rdata")
provided the file is in your working directory.
3) There is no built-in background computing in an interactive R session, but you can open several R instances: run the computation in one instance and do something else in another.

What would help is to write the matrix to a file once you have computed it, and then read that file back in every time you open R. Write yourself a computeMatrix() function or script to produce a file with the matrix stored in a sensible format, and a loadMatrix() function or script that reads that file and loads the matrix into memory; then run loadMatrix() every time you start R and want to use the matrix.
In terms of running an R job in the background, you can run an R script from the command line with the syntax R CMD BATCH scriptName, with scriptName replaced by the name of your script.
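A minimal sketch of that workflow, assuming the long computation is wrapped in a hypothetical compute_matrix() function and the result is stored with saveRDS() (file names are arbitrary):

# compute_matrix.R -- run in the background from the command line with:
#   R CMD BATCH compute_matrix.R
m <- compute_matrix()              # hypothetical function holding the hour-long computation
saveRDS(m, file = "matrix.rds")    # write the finished matrix to disk in R's binary format

# in any later interactive session:
loadMatrix <- function(path = "matrix.rds") readRDS(path)
m <- loadMatrix()                  # the matrix is back without recomputing it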

It might be better to use the ff package and save the matrix as an ff object. This means that the actual matrix will be saved on the disk in an efficient manner, then when you start a new R session you can point to that same file without loading the entire matrix into memory. When you need part of the matrix, only the part you need will be loaded so it will be much quicker. Even if you need the entire matrix loaded into memory it should load faster than reading a text file.
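A sketch of what that could look like, assuming the as.ff()/ffsave()/ffload() workflow of the ff package and an already computed matrix m (untested here, so treat it as a starting point):

library(ff)

m_ff <- as.ff(m)                   # file-backed copy of the in-memory matrix
ffsave(m_ff, file = "matrix_ff")   # writes matrix_ff.ffData and matrix_ff.RData

# in a new R session:
library(ff)
ffload("matrix_ff")                # restores the ff object m_ff without reading the whole matrix into RAM
m_ff[1:5, 1:5]                     # only the requested block is pulled from disk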

Related

Open many xlsx files and run a package that calculates a set of non parametric variables for each file

I need some help for my master thesis.
I have a very large set of xlsx files and must calculate a series of indices for each file. I have the code for doing it one Excel file at a time, but it would take many days to do them one by one. Does anyone know how to open several Excel files, loop over the calculation, and collect all the indices in a matrix?
This is the code for one file at a time:
install.packages("nparACT")
library(nparACT)
# Import the data set of one file manually (I am new to R);
# P1_a_completo_Tmov is the name of that file, as an example.
Nuevo <- data.frame(as.factor(P1_a_completo_Tmov$Datetime), P1_a_completo_Tmov$Dist)
nparACT_base("Nuevo", SR = 1/30)
# This last command gives me many options; what I need is the data frame,
# so I copy nparACT_base("Nuevo", SR = 1/30) into the console and get the data frame from there.
I am stuck with a very inefficient, time-consuming way of working, but I hope one of you R experts can shed some light on how to speed up the process. Thank you.
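One way to loop over all the files is sketched below. It assumes the files sit in a single folder, that each file has Datetime and Dist columns, that they can be read with the readxl package, and that nparACT_base() returns the data frame of indices it prints; adjust as needed.

library(readxl)    # assumed reader for the xlsx files; install.packages("readxl") if needed
library(nparACT)

files   <- list.files("path/to/xlsx/folder", pattern = "\\.xlsx$", full.names = TRUE)  # hypothetical folder
results <- list()

for (f in files) {
  raw   <- read_excel(f)                                      # assumes columns named Datetime and Dist
  Nuevo <- data.frame(as.factor(raw$Datetime), raw$Dist)      # same preparation as for a single file
  results[[basename(f)]] <- nparACT_base("Nuevo", SR = 1/30)  # same call as for a single file
}

indices <- do.call(rbind, results)   # one row of indices per file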

Save or Retrieve the data in Rstudio

Sir, I am a student learning R. I have a question about how to store data in R, and how to retrieve data that has been erased.
Sir,
Using RStudio is not that different from using, say, Word or Notepad, with a few important differences.
First the similarities:
If you do not save your R script or data, it might not be available after you restart RStudio, or after you overwrite/erase your data.
The advantage of using R and RStudio is that you can script how you load and manipulate your data, and hence recreate it, provided you use a script and do not rely only on the interactive console.
As for the differences, RStudio can be set to save your current workspace, which is where all loaded data and variables reside. To change this setting, go to "Tools" --> "Global Options" and look at the workspace options there.
However, if you erase your data by overwriting it with other values or removing it with rm(), the data is lost. Your only recourse is to retrace how it was loaded/modified, using either your script or the "History" pane.
For saving data, see e.g. http://www.sthda.com/english/wiki/saving-data-into-r-data-format-rds-and-rdata. Note the difference between save() and saveRDS(): the former saves objects together with their names, whereas saveRDS() saves a single object without its name, so the result must be assigned to a variable when read back in.
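A quick illustration of that difference (a minimal sketch; the object and file names are made up):

x <- matrix(1:4, 2)

save(x, file = "x.RData")      # stores the object together with its name
rm(x)
load("x.RData")                # x reappears in the workspace under its old name

saveRDS(x, file = "x.rds")     # stores the object only, without a name
y <- readRDS("x.rds")          # must be assigned; you choose the name when reading it back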

Lock file when writing to it from parallel processes in R

I use parSapply() from the parallel package in R. I need to perform calculations on a huge amount of data. Even in parallel it takes hours to execute, so I decided to regularly write results to a file from the workers using write.table(), because the process crashes from time to time (running out of memory or for some other random reason) and I want to be able to continue the calculations from where they stopped. I noticed that some lines of the resulting csv files are cut off in the middle, probably because several processes write to the file at the same time. Is there a way to place a lock on the file while write.table() executes, so other workers can't access it, or is the only way out to write to a separate file from each worker and then merge the results?
It is now possible to create file locks using the filelock package (on GitHub).
To make this work with parSapply() you would need to edit your loop so that, if the file is locked, the process does not simply quit but either tries again or Sys.sleep()s for a short amount of time. However, I am not certain how this will affect your performance.
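A sketch of that idea with filelock (a starting point, not a tested solution; it uses the lock()/unlock() functions of that package and a hypothetical write_result() helper):

library(filelock)

write_result <- function(row, filename = "results.csv") {
  lck <- lock(paste0(filename, ".lock"))   # blocks until this process holds the lock
  on.exit(unlock(lck))                     # always release the lock, even on error
  write.table(row, file = filename, append = TRUE,
              col.names = FALSE, row.names = FALSE, sep = ",")
}

Each worker would call write_result() instead of calling write.table() directly, so only one process appends to the file at a time.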
Instead, I recommend you create worker-specific files to hold your data, eliminating the need for a lock file without reducing your performance. Afterwards you can merge these files into your final results file.
If size is an issue, you can use disk.frame to work with files that are larger than your system RAM.
The old Unix technique looks like this:
# Make sure other processes are not writing to the file by trying to create a
# lock directory: if the directory already exists, mkdir returns an error, so we
# keep trying and exit the repeat loop only once the lock directory is created.
repeat {
  if (system2(command = "mkdir", args = "lockdir", stderr = NULL) == 0) break
}
write.table(MyTable, file = filename, append = TRUE)
# get rid of the locking directory to release the lock
system2(command = "rmdir", args = "lockdir")

source() taking long time and often crashed

I used the dump() command to dump some data frames in R.
The dump files are about 200 MB each, and one is about 1.5 GB. Later I tried to retrieve them using source(); it takes a very long time, and Windows reports that R has stopped working after 3-4 hours. I am using 64-bit R 3.0.0 (I tried R 2.15.3 too) on Windows 7 with 48 GB of memory. For one of the files it threw a memory error (I don't have the log now) but loaded 4-5 of the roughly 15 datasets.
Is there any way I can load a particular dataset if I know its name?
Or is there any other way?
I have learned my lesson and will probably use the save() command when creating the data, keep the original data, or put one dataset per dump file (or R image file).
Thank you
Use save() and load() rather than dump() and source().
save() writes out a binary representation of the data to an .Rdata file, which can then be loaded back in using load().
dump() converts everything to a text representation, which source() then has to convert back into binary objects. Both ends of that process are very inefficient.
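For example (a minimal sketch; the object and file names are placeholders), and if you want to be able to load one particular dataset by name later, saving each object to its own file makes that straightforward:

# instead of dump("mydata", file = "mydata.R") followed by source("mydata.R"):
save(mydata, file = "mydata.RData")   # compact binary format
load("mydata.RData")                  # restores mydata by name

# one file per dataset, so any single one can be loaded on demand:
saveRDS(df1, "df1.rds")
df1 <- readRDS("df1.rds")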

Open large files with R

I want to process a file (1.9 GB) that contains 100,000,000 datasets in R.
Actually, I only want every 1000th dataset.
Each dataset contains 3 columns, separated by a tab.
I tried data <- read.delim("file.txt"), but R was not able to manage all the datasets at once.
Can I tell R to load only every 1000th dataset directly from the file?
After reading the file I want to bin the data in column 2.
Is it possible to bin the numbers in column 2 directly?
Is it possible to read the file line by line, without loading the whole file into memory?
Thanks for your help.
Sven
You should pre-process the file with another tool before reading it into R.
To write every 1000th line to a new file, you can use sed, like this:
sed -n '0~1000p' infile > outfile
Then read the new file into R:
datasets <- read.table("outfile", sep = "\t", header = F)
You may want to look at the manual devoted to R Data Import/Export.
Naive approaches always load all the data; you don't want that. You may want another script which reads the file line by line (written in awk, perl, python, C, ...) and emits only every N-th line. You can then read the output of that program directly into R via a pipe; see the help on Connections.
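Combining that with the sed one-liner above, a pipe does it in one step without an intermediate file (a sketch; adjust the file name):

# stream every 1000th line straight into R
datasets <- read.table(pipe("sed -n '0~1000p' file.txt"),
                       sep = "\t", header = FALSE)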
In general, working with very large data requires some understanding of R's memory handling. Be patient; you will get this to work, but once again, a naive approach requires lots of RAM and a 64-bit operating system.
Maybe the colbycol package could be useful to you.
