Use readOGR to load a large spatial file in R

For my processing in R I want to read in a 20-gigabyte file, which I received as an XML file.
I cannot load it in R with readOGR because it is too big. It gives me the error "cannot allocate vector of size 99.8 Mb".
Since my file is too big, the logical next step in my mind would be to split it. But since I cannot open it in R or in any other GIS package at hand, I cannot split the file before loading it. I am already using the best PC available to me.
Is there a solution?
UPDATE BECAUSE OF COMMENT
If I use head(), my line looks like the one below. Unfortunately, it does not work.
headfive <- head(readOGR('file.xml', layer = 'layername'), 5)

Related

Import .rds file to h2o frame directly

I have a large .rds file saved, and I am trying to import it directly into an H2O frame, because it is not feasible for me to read the file into the R environment and then convert it with as.h2o().
I am looking for a fast and efficient way to deal with it.
My attempts:
I have tried reading the file and then converting it into an H2O frame, but that is far too time-consuming.
I tried saving the file in .csv format and using h2o.import() with parse=T.
Due to memory constraints I was not able to save the complete data frame.
Please suggest an efficient way to do this.
Any suggestions would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
There is another option, which is effectively the same thing, though your RAM doesn't need to store two copies of the data at once (one in R and one in H2O), so it might be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().
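The second option could be sketched roughly as follows. This is only an illustration, assuming the data.table and h2o packages are installed and that the data still fits in R's memory once; rds_to_h2o, rds_path and csv_path are hypothetical names, not part of either package.

```r
# Sketch: convert an .rds file to CSV on disk, free the R copy,
# then let H2O's parallel reader load the CSV.
# Assumes data.table and h2o are installed; nothing is loaded until the
# function is actually called.
rds_to_h2o <- function(rds_path, csv_path) {
  df <- readRDS(rds_path)             # the data must fit in R once
  data.table::fwrite(df, csv_path)    # fast multi-threaded CSV writer
  rm(df)                              # drop the R copy before H2O reads the file
  gc()
  h2o::h2o.init()                     # start or connect to a local H2O cluster
  # H2O reads the file itself, so pass a path the H2O server can resolve.
  h2o::h2o.importFile(path = normalizePath(csv_path))
}
```

Calling rds_to_h2o("data.rds", "data.csv") (hypothetical file names) would return an H2O frame, with R holding only a reference to it rather than a second copy of the data.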

Read a sample from sas7bdat file in R

I have a sas7bdat file of around 80 GB. Since my PC has only 4 GB of memory, the only way forward I can see is reading some of its rows. I tried the sas7bdat package in R, which gives the error "big endian files are not supported".
The read_sas() function in haven seems to work, but it only supports selecting specific columns, while I need to read a subset of rows with all columns. For example, it would be fine if I could read 1% of the data to understand it.
Is there any way to do this? Is there a package that can do it?
Later on I plan to read the file in parts, dividing it into 100 or so sections.
If you have Windows you can use the SAS Universal Viewer, which is free, and export the dataset to CSV. Then you can import the CSV into R in more readable chunks using this method.
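One way to read an exported CSV in chunks is sketched below in base R. It is not necessarily the method the answer had in mind: process_csv_in_chunks is a hypothetical helper, and it assumes a comma-separated file with a single header line and no line breaks inside quoted fields.

```r
# Hypothetical helper: apply `fun` to successive chunks of a CSV file so
# that only `chunk_rows` rows are held in memory at a time.
process_csv_in_chunks <- function(path, chunk_rows, fun) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # Read the header once so every chunk gets the same column names.
  header <- scan(text = readLines(con, n = 1), what = "", sep = ",", quiet = TRUE)
  results <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break     # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    results[[length(results) + 1]] <- fun(chunk)
  }
  results
}

# Small demonstration on a temporary file standing in for the real export.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) * 2), tmp, row.names = FALSE)
chunk_sizes <- process_csv_in_chunks(tmp, chunk_rows = 4, fun = nrow)
```

For a real 80 GB export, fun could write each chunk to its own file or keep only a random sample of rows, so the full table never has to sit in memory.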

How to save raster data in R object format?

I don't know how to use save.image() and saveRDS() with raster data in R. As I understand it, the raster package opens a connection to the image file when you call raster(), so it doesn't actually load the file into the R workspace.
I want to save my workspace (data.frame, list, raster, etc.) with save.image() (or similar) and open it on a different computer. If I try to plot or process a raster object that was saved on a different computer, I always get the same error:
Error in .local(.Object, ...) :
`C:\path\to\file.tif' does not exist in the file system,
and is not recognised as a supported dataset name.
Is there a way to save a raster object (opened from an external file) in R format? I don't mean a raster format such as TIFF or grid.
At your own risk, you can use the readAll function to load the raster into memory before saving, e.g.
r <- raster(system.file("external/test.grd", package="raster"))
r <- readAll(r) # force data into memory
save(r, file = 'r.RData')
It can then be loaded on a different machine, as mentioned:
load('r.RData')
Beware: this will be problematic for very large rasters on memory-limited systems.
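The same idea can be wrapped in a small helper that only forces the values into memory when they are still file-backed. This is a sketch assuming the raster package is installed; save_raster_in_memory is a hypothetical name, and saveRDS is used here instead of save purely for illustration.

```r
# Hypothetical helper: make sure the cell values are in memory, then
# serialise the raster object so it no longer depends on the source file.
# Assumes the raster package is installed; it is only used when called.
save_raster_in_memory <- function(r, path) {
  if (!raster::inMemory(r)) {
    r <- raster::readAll(r)   # pull values from the file into memory
  }
  saveRDS(r, path)
  invisible(path)
}
```

On the other machine the object would be restored with readRDS(path), with the raster package loaded so the object can be plotted or processed.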
You can save rasters, like other R objects, using the save command.
save(r,file="r.Rdata")
On a different computer, you can load that file using
load("r.Rdata")
which will bring back the raster r in your workspace.
I have tried this across Windows and Linux and it has never given me problems.

Error while parsing a very large (10 GB) XML file in R, using the XML package

Context
I'm currently working on a project involving OSM (OpenStreetMap) data. In order to manipulate geographic objects, I have to convert the data (an OSM XML file) into an R object. The osmar package lets me do this, but it fails to parse the raw XML data.
The error
Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
The code
require(osmar)
osmar_obj <- get_osm("anything", source = osmsource_file("my filename"))
Inside the get_osm function, the code calls ret <- xmlParse(raw), which triggers the error after a few seconds.
The question
How am I supposed to read a large XML file (here 10 GB), given that I have 64 GB of memory?
Thanks a lot!
This is the solution I came up with, even though it is not 100% satisfying.
1) Transform the .osm file by removing every newline (except the last) in your shell.
2) Run the exact same code as before, skipping the paste, which is no longer needed (since you just did the equivalent in the shell).
3) Profit :)
Obviously, I'm not very happy with it, because modifying the data file in the shell is more of a trick than an actual solution :(

Running jobs in background in R

I am working with a 250 by 250 matrix. However, it takes a very long time to compute: at least an hour.
Is it possible to store this matrix in R so that every time I open R, it is already there?
Ideally, I would like to know whether it is possible to run a job in the background in R, so that I don't have to wait an hour to get the matrix and can work on something else in the meantime.
1) You can save the R workspace when closing R. Usually R asks "Save workspace image?" when you close it. If you answer "Yes", it will save the workspace in a file named ".RData" and load it when starting a new R instance.
2) The better (safer) option is to save the matrix explicitly. There are several ways to do this. One of them is to save it as an Rdata file:
save(m, file = "matrix.Rdata")
where m is your matrix.
You can load the matrix at any time with
load("matrix.Rdata")
if you are in the same working directory.
3) There is no such option as background computing in R, but you can open several R instances: run the computation in one instance and do something else in another.
What would help is to write the matrix to a file once you have computed it and then read that file every time you open R. Write yourself a computeMatrix() function or script that produces a file with the matrix stored in a sensible format. Also write yourself a loadMatrix() function or script that reads that file and loads the matrix into memory, then call or run loadMatrix every time you start R and want to use the matrix.
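A minimal sketch of that pair of functions is shown below. It assumes saveRDS/readRDS as the storage format (a binary format swapped in here for convenience, where the answer only asks for a "sensible format"), and it uses a cheap placeholder in place of the real hour-long computation; the path and n arguments are hypothetical.

```r
# Compute the matrix once and store it on disk.
computeMatrix <- function(path, n = 250) {
  # Placeholder for the real, slow computation (assumption for the sketch).
  m <- outer(seq_len(n), seq_len(n))
  saveRDS(m, path)
  invisible(path)
}

# Read the stored matrix back into memory.
loadMatrix <- function(path) {
  readRDS(path)
}

# Recompute only if the file does not exist yet; otherwise just load it.
matrix_file <- file.path(tempdir(), "matrix.rds")
if (!file.exists(matrix_file)) computeMatrix(matrix_file)
m <- loadMatrix(matrix_file)
```

In practice matrix_file would point to a permanent location rather than tempdir(), so the file survives between R sessions.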
In terms of running an R job in the background, you can run an R script from the command line with the syntax "R CMD BATCH scriptName" with scriptName replaced by the name of your script.
It might be better to use the ff package and save the matrix as an ff object. This means that the actual matrix will be saved on the disk in an efficient manner, then when you start a new R session you can point to that same file without loading the entire matrix into memory. When you need part of the matrix, only the part you need will be loaded so it will be much quicker. Even if you need the entire matrix loaded into memory it should load faster than reading a text file.
