Error while parsing a very large (10 GB) XML file in R, using the XML package

Context
I'm currently working on a project involving OSM data (OpenStreetMap). In order to manipulate geographic objects, I have to convert the data (an OSM XML file) into an R object. The osmar package lets me do this, but it fails while parsing the raw XML data.
The error
Error in paste(file, collapse = "\n") : result would exceed 2^31-1 bytes
The code
require(osmar)
osmar_obj <- get_osm("anything", source = osmsource_file("my filename"))
Inside the get_osm function, the code calls ret <- xmlParse(raw), which triggers the error after a few seconds.
The question
How am I supposed to read a large XML file (here 10 GB), knowing that I have 64 GB of memory?
Thanks a lot!

This is the solution I came up with, even though it is not 100% satisfying.
1. Transform the .osm file by removing every newline (but the last) in your shell.
2. Run the exact same code as before, skipping the paste() that is no longer needed, since you just did the equivalent in the shell (see the sketch below).
3. Profit :)
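A sketch of the steps above (file names are placeholders, and the exact osmar internals may differ between versions):

# Step 1, in the shell: collapse the .osm file onto a single line, e.g.
#   tr -d '\n' < my_file.osm > my_file_oneline.osm
# Step 2, in R: parse the single-line file straight from disk, so no
# multi-gigabyte string has to be built with paste() first.
library(XML)
library(osmar)
doc <- xmlParse("my_file_oneline.osm")
osmar_obj <- as_osmar(doc)  # as_osmar() converts the parsed document in the osmar versions I have seen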
Obviously, I'm not very happy with it because modifying the data file in the shell is more a trick than an actual solution :(

Related

How can I fix the 'line x did not have y elements' error when trying to use read.csv.sql?

I am a relative beginner to R trying to load and explore a large (7GB) CSV file.
It's from the Open Food Facts database and the file is downloadable here: https://world.openfoodfacts.org/data (the raw csv link).
It's too large to read straight into R and my searching has made me think the sqldf package could be useful. But when I try and read the file in with this code ...
library(sqldf)
library(here)
read.csv.sql(here("02. Data", "en.openfoodfacts.org.products.csv"), sep = "\t")
I get this error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 10 did not have 196 elements
Searching around made me think it's because there are missing values in the data. With read.csv, it looks like you can set fill = TRUE and get around this. But I can't work out how to do this with the read.csv.sql function. I also can't actually open the csv in Excel to inspect it because it's too large.
Does anyone know how to solve this or if there is a better method for reading in this large file? Please keep in mind I don't really know how to use SQL or other database tools, mostly just R (but can try and learn the basics if helpful).
Based on the error message, it seems unlikely that you can read the CSV file in toto into memory, even once. For analyzing the data within it, I suggest you may need to change your data-access approach to something else, such as:
DBMS, whether file-based/embedded (duckdb or RSQLite, lower cost of entry) or a full client-server DBMS (e.g., PostgreSQL, MariaDB, SQL Server). With this method, you would connect (using DBI) to the database (embedded or otherwise), query for the subset of data you want/need, and work on that data. It is feasible to do in-database aggregation as well, which might be a necessary step in your analysis.
Arrow/Parquet files. These are directly supported by dplyr functions in a lazy fashion: when you call open_dataset("path/to/my.parquet"), it immediately returns an object but does not load data; you then build your dplyr mutate/filter/select/summarize pipeline (with some limitations), and only when you finally call ... %>% collect() is the resulting data loaded into memory. Similar to SQL above in that you work on subsets at a time, but if you're already familiar with dplyr, it is a much smaller jump than learning SQL from scratch.
There are ways to get a large CSV file into each of these.
Arrow/Parquet: How to convert a csv file to parquet (python, arrow/drill) is one starting point, and a quick search in your favorite search engine should provide other possibilities; regardless of the language you want to do your analysis in ("R"), don't constrain yourself to solutions using that language.
SQL: DuckDB (https://duckdb.org/docs/data/csv.html), SQLite (https://www.sqlitetutorial.net/sqlite-import-csv/), and other DBMSes tend to have a "bulk" command for importing raw CSV.
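For concreteness, a minimal sketch of both routes, assuming the DBI, duckdb, arrow, and dplyr packages are installed and the CSV sits in the working directory (exact argument names can vary slightly across versions):

library(DBI)
library(duckdb)
# Route 1: let DuckDB scan the tab-separated CSV lazily and pull only a small result into R
con <- dbConnect(duckdb::duckdb())
peek <- dbGetQuery(con, "
  SELECT *
  FROM read_csv_auto('en.openfoodfacts.org.products.csv', delim = '\t')
  LIMIT 10")
dbDisconnect(con, shutdown = TRUE)

library(arrow)
library(dplyr)
# Route 2: open the CSV as a lazy Arrow dataset; rows are only read at collect()
ds <- open_dataset("en.openfoodfacts.org.products.csv",
                   format = "csv", delim = "\t")
first_rows <- ds %>% head(10) %>% collect()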

Does R have an equivalent to Python's io for saving file-like objects in memory?

In Python we can import io and then make a file-like object with some_variable = io.BytesIO(), then download any type of file into it and interact with it as if it were a locally saved file, except that it's in memory. Does R have something like that? To be clear, I'm not asking about what any particular OS does when you save some R object to a temp file.
This is kind of a duplicate of Can I write to and access a file in memory in R? but that is about 9 years old so maybe the functionality exists now either in base or with a package.
Yes, readBin.
readBin("/path", raw(), file.info("/path")$size)
This is a working example:
tfile <- tempfile()
writeBin(serialize(iris, NULL), tfile)
x <- readBin(tfile, raw(), file.info(tfile)$size)
unserialize(x)
…and you get back your iris data.
This is just an example, but for R objects, it is way more convenient to use readRDS/saveRDS().
However, if the object is an image you want to analyse, readBin gives a raw memory representation.
For text files, you should then use:
rawToChar(x)
but again there are readLines(), read.table(), etc., for these tasks.
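For instance, a small sketch of the text-file case (using a throwaway temp file):

tfile <- tempfile()
writeLines(c("a,b", "1,2"), tfile)
x <- readBin(tfile, raw(), file.info(tfile)$size)  # raw bytes held in memory
txt <- rawToChar(x)                                # back to a single string
read.csv(text = txt)                               # parse it without touching disk again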

R: how to write a raster to disk without auxiliary file?

I'm writing a dataset to file in ERMapper format (.ers) using the raster package in R, but I'm having issues with the resulting .aux.xml auxiliary file (which I'm actually not interested in).
Simple example:
library(raster)
rst <- raster(ncols = 15000, nrows = 10000)
rst[] <- 1.234
writeRaster(rst, filename = '_test.ers', overwrite = TRUE)
The writeRaster() line takes some time to execute; the data file is quite large, about 1.2 GB on disk.
When checking what's happening while writeRaster() is executed, I find that the .ers file (header file + associated data file) is typically generated in about 20 sec. Then, it takes writeRaster() another 20 - 25 sec to generate the .aux.xml file, which only contains statistics such as min, max, mean, and st. dev. (which likely explains why it takes so long to compute).
Since I don't care about the .aux.xml file, I would like writeRaster() to not bother with it at all, and save me 20 - 25 sec of exec time (I'm writing lots of these datasets to disk so a 50% speedup in my code is quite substantial).
Does anyone have any idea how to tell writeRaster() not to create a .aux.xml file? I suspect it's a GDAL-related issue, but I haven't been able to find an answer yet after much research...
Any help most welcome!
Options related to the GDAL file format drivers can be set using the (not so easy to find) rgdal::setCPLConfigOption function.
In your case,
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
should disable the xml file creation.
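For example, reusing the rst object from the question (a sketch; the option has to be set before writeRaster() runs):

rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")  # turn off GDAL's .aux.xml (PAM) sidecar files
writeRaster(rst, filename = '_test.ers', overwrite = TRUE)  # no _test.ers.aux.xml this time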
HTH

Split big data in R

I have a big data file (~1GB) and I want to split it into smaller ones. I have R in hand and plan to use it.
Loading the whole file into memory is not possible, as I would get the "cannot allocate memory for vector of xxx" error message.
Instead, I want to use the read.table() function with the skip and nrows parameters to read only parts of the file in, then save them out to individual files.
To do this, I'd like to know the number of lines in the big file first, so I can work out how many rows to put in each individual file and how many files to split into.
My question is: how can I get the number of lines from the big data file without fully loading it into R?
Suppose I can only use R, so I cannot use any other programming languages.
Thank you.
Counting the lines should be pretty easy -- check this tutorial: http://www.exegetic.biz/blog/2013/11/iterators-in-r/ (the "iterating through lines" part).
The gist is to use ireadLines to open an iterator over the file.
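A minimal sketch of that idea, assuming the iterators package is installed ("bigfile.csv" stands in for your file):

library(iterators)
it <- ireadLines("bigfile.csv", n = 1)  # iterator yielding one line at a time
n <- 0L
tryCatch(
  repeat {
    nextElem(it)   # signals a 'StopIteration' error at end of file
    n <- n + 1L
  },
  error = function(e) NULL
)
n  # total number of lines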
For Windows, something like this should work:
fname <- "blah.R"  # example file
# Windows' "find /v /c" command counts lines; the count is on the second line of its output
res <- system(paste("find /v /c \"\"", fname), intern = TRUE)[[2]]
# extract the trailing number, i.e. the line count
regmatches(res, gregexpr("[0-9]+$", res))[[1]]
# [1] "39"
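If you'd rather stay in base R on any platform, a chunked readLines() counter along these lines should also work (a sketch; the chunk size is arbitrary):

count_lines <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)  # read the next block of lines
    if (length(chunk) == 0L) break           # end of file reached
    n <- n + length(chunk)
  }
  n
}
count_lines("blah.R")  # should match the count from the find command above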

use readOGR to load in a large spatial file in R

For my processes in R I want to read in a 20-gigabyte file. I have it as an XML file.
In R I cannot load it with readOGR since it is too big. It gives me the error "cannot allocate vector of size 99.8 Mb".
Since my file is too big, the logical next step in my mind would be to split it. But since I cannot open it in R or in any other GIS package at hand, I cannot split the file before I load it in. I am already using the best PC available to me.
Is there a solution?
UPDATE BECAUSE OF COMMENT
If I use head(), my line looks like the one below. Unfortunately, it does not work.
headfive <- head(readOGR('file.xml', layer = 'layername'),5)
