R: how to write a raster to disk without an auxiliary file?

I'm writing a dataset to file in ERMapper format (.ers) using the raster package in R, but I'm having issues with the resulting .aux.xml auxiliary file (which I'm actually not interested in).
Simple example:
rst <- raster(ncols=15000,nrows=10000)
rst[] <- 1.234
writeRaster(rst, filename='_test.ers', overwrite=TRUE)
The writeRaster() line takes some time to execute; the data file is quite large, about 1.2 GB on disk.
When checking what's happening while writeRaster() executes, I find that the .ers file (header file + associated data file) is typically generated in about 20 sec. It then takes writeRaster() another 20-25 sec to generate the .aux.xml file, which only contains statistics such as min, max, mean, and standard deviation (which likely explains why it takes so long to compute).
Since I don't care about the .aux.xml file, I would like writeRaster() to not bother with it at all and save me 20-25 sec of execution time (I'm writing lots of these datasets to disk, so a 50% speedup in my code is quite substantial).
Does anyone have any idea how to tell writeRaster() not to create an .aux.xml file? I suspect it's GDAL-related, but I haven't been able to find an answer yet after much research...
Any help most welcome!

Options related to the GDAL file format drivers can be set using the (not so easy to find) rgdal::setCPLConfigOption function.
In your case,
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
should disable the xml file creation.
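For example, applied to the question's own code (a sketch; the config option only needs to be set once per session, before writing):
library(raster)
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")   # turn off GDAL's .aux.xml (PAM) output
rst <- raster(ncols=15000, nrows=10000)
rst[] <- 1.234
writeRaster(rst, filename='_test.ers', overwrite=TRUE)   # no .aux.xml should be written now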
HTH

Related

Open many xlsx files and run a package that calculates a set of non parametric variables for each file

I need some help for my master's thesis.
I have a very large set of xlsx files and must calculate a series of indices for each file. I have the code for doing it one Excel file at a time, but it would take many days to do them one by one. So does anyone know how to open several Excel files, loop over the calculation, and put all the indices in a matrix?
This is the code for one file at the time:
install.packages("nparACT")
library(nparACT)
# Import the data set of one file manually (I am new to R);
# P1_a_completo_Tmov is the name of one example file
Nuevo <- data.frame(as.factor(P1_a_completo_Tmov$Datetime), P1_a_completo_Tmov$Dist)
nparACT_base("Nuevo", SR=1/30)
# This last command gives me many options; what I need is the data.frame,
# so what I do now is copy nparACT_base("Nuevo", SR=1/30) into the console
# and then I get the data frame.
Right now I am stuck with a very inefficient, time-consuming way of working, but I hope one of you R experts can shed some light on how to speed up the process. Thank you.
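Here is a rough, untested sketch of the kind of loop I have in mind (assuming the readxl package, that every file has the same Datetime and Dist columns, and that nparACT_base() returns the indices as a data frame):
library(readxl)
library(nparACT)
files <- list.files("my_folder", pattern = "\\.xlsx$", full.names = TRUE)  # hypothetical folder
results <- list()
for (f in files) {
  dat <- read_excel(f)
  Nuevo <- data.frame(as.factor(dat$Datetime), dat$Dist)
  results[[f]] <- nparACT_base("Nuevo", SR = 1/30)
}
indices <- do.call(rbind, results)   # one row of indices per file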

How to get data into h2o fast

What my question isn't:
Efficient way to maintain a h2o data frame
H2O running slower than data.table R
Loading data bigger than the memory size in h2o
Hardware/Space:
32 Xeon threads w/ ~256 GB Ram
~65 GB of data to upload (about 5.6 billion cells).
Problem:
It is taking hours to upload my data into h2o. There's no special processing involved, only "as.h2o(...)".
Using "fread" it takes less than a minute to get the text into the workspace, and then I make a few row/column transformations (diffs, lags) and try to import.
The total R memory use is ~56 GB before trying any sort of "as.h2o", so the 128 GB allocated to h2o shouldn't be too crazy, should it?
Question:
What can I do to make this take less than an hour to load into h2o? It should take from a minute to a few minutes, no longer.
What I have tried:
bumping RAM up to 128 GB in 'h2o.init'
using slam, data.table, and options( ...
converting to "as.data.frame" before "as.h2o"
writing to a csv file (R's write.csv chokes and takes forever; it is writing a lot of GB though, so I understand)
writing to sqlite3: too many columns for a table, which is weird
checking drive cache/swap to make sure there are enough GB there; perhaps Java is using the cache (still working on this)
Update:
So it looks like my only option is to make a giant text file and then use "h2o.importFile(...)" for it. I'm up to 15GB written.
Update2:
It is a hideous csv file, at ~22 GB (~2.4M rows, ~2300 cols). For what it's worth, it took from 12:53 PM until 2:44 PM to write the csv file. Importing it was substantially faster once it was written.
Think of as.h2o() as a convenience function that does these steps:
converts your R data to a data.frame, if it isn't one already
saves that data.frame to a temp file on local disk (using data.table::fwrite() if available (*), otherwise write.csv())
calls h2o.uploadFile() on that temp file
deletes the temp file
As your updates say, writing huge data files to disk can take a while. But the other pain point here is using h2o.uploadFile() instead of the quicker h2o.importFile(). The deciding factor between them is visibility:
With h2o.uploadFile() your client has to be able to see the file.
With h2o.importFile() your cluster has to be able to see the file.
When your client is running on the same machine as one of your cluster nodes, your data file is visible to both client and cluster, so always prefer h2o.importFile(). (It does a multi-threaded import.)
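So, instead of as.h2o(), the faster path looks roughly like this (a sketch; the paths are made up, and it assumes the client and cluster share a filesystem):
library(data.table)
library(h2o)
h2o.init(max_mem_size = "128g")
my_dt <- data.table(x = rnorm(10), y = 1:10)   # stand-in for your real 65 GB table
fwrite(my_dt, "/shared/bigdata.csv")           # multi-threaded write from R
hf <- h2o.importFile("/shared/bigdata.csv")    # multi-threaded, cluster-side parse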
Another couple of tips: only bring data into the R session that you actually need there. And remember both R and H2O are column-oriented, so cbind can be quick. If you just need to process 100 of your 2300 columns in R, have them in one csv file, and keep the other 2200 columns in another csv file. Then h2o.cbind() them after loading each into H2O.
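For example (a sketch; the file names and column split are hypothetical, and it assumes both files have their rows in the same order):
cols_r    <- h2o.importFile("/shared/cols_needed_in_R.csv")   # the ~100 columns you also touch in R
cols_rest <- h2o.importFile("/shared/other_cols.csv")         # the other ~2200 columns
full      <- h2o.cbind(cols_r, cols_rest)                     # column-wise bind inside H2O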
*: Use h2o:::as.h2o.data.frame (without parentheses) to see the actual code. For data.table writing you need to first do options(h2o.use.data.table = TRUE); you can also optionally switch it on/off with the h2o.fwrite option.

Multiple procedures in IDL program

I've written a procedure in IDL which performs some calculations on data and outputs an array of values. The calculations take about 2 minutes to run.
I need to then perform analysis on these results, and ideally I would like not to have to perform the initial calculations each time I want to perform some different analysis.
Is the best way to achieve this to save the output from the calculation to a data file and then read this in from a different program? Or is there a less cumbersome way to go about this?
Thanks in advance for any help
Yes, saving to a file is the easiest way to save the results from your first program for later use in the second (assuming you quit IDL between the two). There are many ways to save the data, depending on its type and your preferences.
Easiest Way:
An IDL .sav file created by the SAVE command can store any kind of data, IDL variables, and even the whole state of your IDL session. Unfortunately, it only works with IDL (no other languages), and the files may need to be regenerated when you upgrade your IDL version. You read these files with RESTORE, which even remembers the names of the variables.
my_variable = 'Some data here.'
SAVE, my_variable, FILENAME='myfile.sav' ; save variable(s)
... IDL opened and closed here ...
RESTORE, 'myfile.sav' ; read variable(s) from file
print, my_variable
Some data here.
Most Portable Way:
For simple tabular data, CSV has the advantage of being highly portable and human readable. However, it's also slow, since numbers are stored in ASCII. Use WRITE_CSV to write, and READ_CSV to read.
Most Portable Binary Formats:
For complex data that needs to be read by multiple languages, consider the HDF5 or NetCDF libraries. Both of these are binary formats that can store most types of IDL-supported data. Note that NetCDF is actually an easier-to-use subset of HDF5.
Simplest Binary Format:
Another option for tabular data is a simple binary dump. Use WRITEU to write to a normal file opened for writing. Use READU to read from a normal file open for reading.
Assuming that your data calculations will only change very rarely, then, yes, your best solution is to just save the calculations to an output file, and then read them back into your analysis program. You don't say what kind of data this is, so it's hard to give a more specific answer. Assuming that you have a two-dimensional array of data, you could just write the results as a "flat" binary file:
pro perform_calculations
...
; assume mydata is a float array of dimensions [m,n]
openw, 1, 'results.dat'
writeu, 1, mydata
close, 1
end
Then, in either the same file or preferably a different .pro file:
pro perform_analysis
mydata = fltarr(m, n) ; m and n must match the dimensions used when writing results.dat
openr, 1, 'results.dat'
readu, 1, mydata
close, 1
...
end
Hope this helps.
Saving is a good way to do it, but if you run in the same session and your second program won't mess up the data from the first one, you can just call one and then pass the result to the second one.
pro do_calculations,result1,result2,result3
result1=1
result2=1.
result3=result1/result2
return
end
pro use_calculations,result1,result2,result3,result4
result4=result1-result2+result3
return
end
Then
IDL> do_calculations,result1,result2,result3
IDL> use_calculations,result1,result2,result3,result4
If you edit use_calculations, you can go again by:
IDL> use_calculations,result1,result2,result3,result4
Because the earlier results will stay in memory unless use_calculations does something bad to them.
You could also set up the second procedure to check to see if it has valid results from the first one and call it as needed.

Running jobs in background in R

I am working with a 250 by 250 matrix. However, it takes loads and loads of time to compute it, at least an hour.
Is it possible to store this matrix in memory in R, such that every time I open up R, it is already there?
Ideally, I would like to know if it is possible to run a job in the background in R, so that I don't have to wait an hour to get the matrix out and be able to play around with it.
1) You can save the workspace of R when closing R. Usually R asks "Save workspace image?" when you are closing it. If you answer "Yes", it will save the workspace in a file named ".RData" and will load it when starting a new R instance.
2) The better (safer) option is to save the matrix explicitly. There are several ways to do this. One option is to save it as an .Rdata file:
save(m, file = "matrix.Rdata")
where m is your matrix.
You can load the matrix at any time with
load("matrix.Rdata")
if you are in the same working directory.
3) There is no built-in option for background computing in R. But you can open several R instances: do the computation in one instance and do something else in another.
What would help is to output it to a file once you have computed it, and then parse that file every time you open R. Write yourself a computeMatrix() function or script to produce a file with the matrix stored in a sensible format. Also write yourself a loadMatrix() function or script to read in that file and load the matrix into memory for use, then call or run loadMatrix every time you start R and want to use the matrix.
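For example, those two helpers could be as simple as this (a sketch using saveRDS()/readRDS(); the rnorm() fill is just a placeholder for your real hour-long computation):
computeMatrix <- function(file = "matrix.rds") {
  m <- matrix(rnorm(250 * 250), nrow = 250)   # placeholder for the real computation
  saveRDS(m, file)                            # store the finished matrix on disk
  invisible(m)
}
loadMatrix <- function(file = "matrix.rds") {
  readRDS(file)                               # fast reload in any later session
}
# run computeMatrix() once; call loadMatrix() at the start of every later session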
In terms of running an R job in the background, you can run an R script from the command line with the syntax "R CMD BATCH scriptName", where scriptName is the name of your script.
It might be better to use the ff package and save the matrix as an ff object. This means that the actual matrix will be saved on the disk in an efficient manner, then when you start a new R session you can point to that same file without loading the entire matrix into memory. When you need part of the matrix, only the part you need will be loaded so it will be much quicker. Even if you need the entire matrix loaded into memory it should load faster than reading a text file.
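A rough sketch of that approach (assuming the ff package and its ffsave()/ffload() helpers; the file names are arbitrary and the rnorm() fill is just a placeholder for the real result):
library(ff)
m <- ff(initdata = rnorm(250 * 250), vmode = "double",
        dim = c(250, 250), filename = "matrix.ff", finalizer = "close")
ffsave(m, file = "matrix_store")   # writes matrix_store.RData + matrix_store.ffData
# --- in a later R session ---
library(ff)
ffload("matrix_store")             # re-attaches m; the data stays on disk
m[1:5, 1:5]                        # only the requested block is read into RAM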

Reading in only part of a Stata .DTA file in R

I apologize in advance if this has a simple answer somewhere. It seems like the kind of thing that would, but I can't seem to locate it in the help files, by searching SO, or by Googling.
I'm working with some datasets that are several GB right now. It's enough to fit in memory on one of the cluster nodes I have access to, but takes quite a bit of time to load. For many debugging/programming activities with this data, I don't need the entire file loaded, just the first few thousand observations to have a dataset on which to test code. I can of course just read the whole file in and subset, but I was wondering if there's a way to tell read.dta() to only read in the first N rows? This would of course be much faster.
I could also use a more standard format like .csv and then use read.csv()'s nrows argument, but then I'd lose the factor labels in the Stata dataset (and would have to recreate quite a few GB of data from someone else's code that feeds into this project). So a direct solution for .dta files is preferred.
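For reference, the read-everything-then-subset workaround I mentioned is just this (using the foreign package; the file name is made up), which is exactly the slow step I'm trying to avoid:
library(foreign)
full <- read.dta("bigfile.dta")   # slow: reads the entire multi-GB file
test <- head(full, 5000)          # keep only the first few thousand observations
rm(full); gc()                    # free the rest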
Stata's binary files are written row-by-row, so you could change the R_LoadStataData function in stataread.c to limit the number of rows read in. However, this will only work if you do not need the value labels because they are written at the end of the file and would require you to read the entire file--which wouldn't save any time.
That's going to be a difficult one, as the do_readStata function under the hood is compiled code, only capable of taking in the whole file. I believe that in general binary files are hard to read line by line, and .dta is a binary format. Also, the native binary format of R doesn't allow you to select a number of lines from the dataset while reading it in.
In my humble opinion, you are better off just creating a set of test files from within Stata (e.g. the Stata command sample 1000, count will give you a sample of 1000 observations from the loaded dataset) and working with them. And if you have no access to Stata, someone else in the project should be able to do that for you.
To follow up on Joris Meys: For this kind of thing, I use a "test" data set and the "real" data set, each in separate folders. I keep a macro at the top of the .do file (with if/then statements below) to (1) take a sample of the data and (2) point input/output to the right folder containing one or the other. I probably do it differently for every project, but something like this:
data creation .do file
blah blah blah
save using data/myfile.dta
save if uniform()<.05 using test_data/myfile.dta // or bsample, then save for panel data
analysis .do file
local test = "test_"
// when you're ready to run the file with all the data, use the following
// local test = ""
use `test'data/myfile.dta
blah blah blah
outreg2 ... using `test'output/mytable.txt
