Does R have an equivalent to Python's io for saving file-like objects to memory?

In Python we can import io and then make a file-like object with some_variable = io.BytesIO(), download any type of file into it, and interact with it as if it were a locally saved file, except that it lives in memory. Does R have something like that? To be clear, I'm not asking about what any particular OS does when you save some R object to a temp file.
This is nearly a duplicate of Can I write to and access a file in memory in R?, but that question is about 9 years old, so maybe the functionality exists now, either in base R or in a package.

Yes, readBin.
readBin("/path", raw(), file.info("/path")$size)
This is a working example:
tfile <- tempfile()
writeBin(serialize(iris, NULL), tfile)
x <- readBin(tfile, raw(), file.info(tfile)$size)
unserialize(x)
…and you get back your iris data.
This is just an example, but for R objects it is far more convenient to use saveRDS()/readRDS().
However, if the object is an image you want to analyse, readBin gives a raw memory representation.
For text files, you should then use:
rawToChar(x)
but again there are readLines(), read.table(), etc., for these tasks.
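For an even closer analogue to io.BytesIO, base R also has in-memory connections via rawConnection(), which let you write to and read from a buffer as if it were an open file. A minimal sketch building on the iris example above:

```r
# Write to an in-memory connection instead of a file
con <- rawConnection(raw(0), "r+")   # analogous to io.BytesIO()
serialize(iris, con)                 # write to memory, not disk
bytes <- rawConnectionValue(con)     # retrieve the accumulated raw bytes
close(con)

# Treat the raw bytes as a readable "file"
con2 <- rawConnection(bytes, "r")
y <- unserialize(con2)
close(con2)

identical(y, iris)                   # TRUE
```

Anything that accepts a connection (serialize(), saveRDS(), writeLines(), …) can write into such a buffer, so no temp file is involved at all.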

Related

Convert raw bytes into a NetCDF object

I am pulling in NetCDF data from a remote server using data <- httr::GET(my_url) in an R session. I can writeBin(content(data, "raw"), "my_file.nc") and then nc_open("my_file.nc"), but that is rather cumbersome (I am processing hundreds of NetCDF files).
Is there a way to convert the raw data straight into a ncdf4 object without going through the file system? For instance, would it be possible to pipe the raw data into nc_open()? I looked at the source code and the function prototype expects a named file, so I suppose a named pipe might work but how do I make a named pipe from a raw blob of bytes in R?
Any other suggestions welcome.

Create parquet file directory from CSV file in R

I'm running into more and more situations where I need out-of-memory (OOM) approaches to data analytics in R. I am familiar with other OOM approaches, like sparklyr and DBI, but I recently came across arrow and would like to explore it more.
The problem is that the flat files I typically work with are sufficiently large that they cannot be read into R without help. So I would ideally prefer a way to make the conversion without actually needing to read the dataset into R in the first place.
Any help you can provide would be much appreciated!
arrow::open_dataset() can work on a directory of files and query them without reading everything into memory. If you do want to rewrite the data into multiple files, potentially partitioned by one or more columns in the data, you can pass the Dataset object to write_dataset().
One (temporary) caveat: as of {arrow} 3.0.0, open_dataset() only accepts a directory, not a single file path. We plan to accept a single file path or list of discrete file paths in the next release (see issue), but for now if you need to read only a single file that is in a directory with other non-data files, you'll need to move/symlink it into a new directory and open that.
You can do it in this way:
library(arrow)
library(dplyr)
csv_file <- "obs.csv"
dest <- "obs_parquet/"
sch <- arrow::schema(checklist_id = float32(),
                     species_code = string())
csv_stream <- open_dataset(csv_file, format = "csv",
                           schema = sch, skip_rows = 1)
write_dataset(csv_stream, dest, format = "parquet",
              max_rows_per_file = 1000000L,
              hive_style = TRUE,
              existing_data_behavior = "overwrite")
In my case (a 56GB csv file), I ran into a really weird situation with the resulting parquet tables, so double-check your parquet tables to spot any funky new rows that didn't exist in the original csv. I filed a bug report about it:
https://issues.apache.org/jira/browse/ARROW-17432
If you also experience the same issue, use the Python Arrow library to convert the csv into parquet and then load it into R. The code is also in the Jira ticket.
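Once you have a parquet directory like the one written above, the payoff is that you can query it lazily with dplyr verbs and only collect() the result. A self-contained sketch (using mtcars as a stand-in, since the real obs.csv is exactly the thing that's too big to read into R):

```r
library(arrow)
library(dplyr)

# Write a tiny stand-in dataset to a parquet directory
td <- tempfile()
dir.create(td)
write_dataset(mtcars, td, format = "parquet")

# Query it lazily: nothing is read into RAM until collect()
open_dataset(td) %>%
  filter(cyl == 6) %>%
  summarise(n = n()) %>%
  collect()                 # n = 7
```

The filter and summarise are pushed down to the Arrow scanner, so only the final one-row result ever materializes in R.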

Is there a way to set character encoding when reading sas files to spark or when pulling the data to the r session?

So, I have sas7bdat files that are huge, and I would like to read them into Spark, process them, and then collect the results into an R session. I'm reading them into Spark using the package spark.sas7bdat and the function spark_read_sas. So far so good. The problem is that the character encoding of the sas7bdat files is ISO-8859-1, but to show the content correctly in R it would need to be UTF-8. When I pull the results into R, my data looks like this (let's first create an example that has the same raw bytes that my results have):
mydf <- data.frame(myvar = rawToChar(as.raw(c(0xef, 0xbf, 0xbd, 0x62, 0x63))))
head(mydf$myvar,1) # should get äbc if it was originally read correctly
> �bc
Changing the encoding afterwards doesn't work for some reason.
iconv(head(mydf$myvar,1), from = 'iso-8859-1', to = 'UTF-8')
> �bc
If I use the haven package and read_sas('myfile.sas7bdat', encoding = 'iso-8859-1') to read the file directly into my R session, everything works as expected.
head(mydf$myvar,1)
> äbc
I would be very grateful for a solution that enables me to do the processing in Spark and then collect only the results into the R session, because the files are so big. I guess this could potentially be solved either (a) when reading the file into Spark (but I did not find an option that would work) or (b) by correcting the encoding in R (I could not get it to work, but I don't understand why; maybe it has something to do with special-character encoding in the sas7bdat file).
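One note on why the iconv() call above cannot work: the bytes 0xef 0xbf 0xbd are already the UTF-8 replacement character (U+FFFD), so the original "ä" byte was destroyed before the data reached R, and no later conversion can recover it. Converting intact ISO-8859-1 bytes does work, as this small illustration (not from the original post) shows:

```r
# 0xe4 is "ä" in ISO-8859-1; converting intact latin1 bytes works
x <- rawToChar(as.raw(c(0xe4, 0x62, 0x63)))
iconv(x, from = "ISO-8859-1", to = "UTF-8")   # "äbc"

# By contrast, c(0xef, 0xbf, 0xbd) is already U+FFFD in UTF-8 --
# the re-encoding has to happen upstream, when Spark reads the
# sas7bdat file, not after collecting the mangled strings into R
```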

Import .rds file to h2o frame directly

I have a large .rds file saved, and I am trying to import the .rds file directly into an h2o frame using some functionality, because it is not feasible for me to read that file into the R environment and then use the as.h2o() function to convert it.
I am looking for some fast and efficient way to deal with it.
My attempts:
I have tried reading the file and then converting it into an h2o frame, but it is a very time-consuming process.
I tried saving the file in .csv format and using h2o.import() with parse = T, but due to memory constraints I was not able to save the complete dataframe.
Please suggest an efficient way to do it.
Any suggestions would be highly appreciated.
The native read/write functionality in R is not very efficient, so I'd recommend using data.table for that. Both options below make use of data.table in some way.
First, I'd recommend trying the following: Once you install the data.table package, and load the h2o library, set options("h2o.use.data.table"=TRUE). What that will do is make sure that as.h2o() uses data.table underneath for the conversion from an R data.frame to an H2O Frame. Something to note about how as.h2o() works -- it writes the file from R to disk and then reads it back again into H2O using h2o.importFile(), H2O's parallel file-reader.
There is another option, which is effectively the same thing, though your RAM doesn't need to store two copies of the data at once (one in R and one in H2O), so it might be more efficient if you are really strapped for resources.
Save the file as a CSV or a zipped CSV. If you are having issues saving the data frame to disk as a CSV, then you should make sure you're using an efficient file writer like data.table::fwrite(). Once you have the file on disk, read it directly into H2O using h2o.importFile().
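Both options above might be sketched together like this (my_df stands in for your real data frame, and h2o.init() plus the file path are illustrative, not from the original answer):

```r
library(data.table)
library(h2o)

h2o.init()                                # start/attach to a local H2O cluster

my_df <- mtcars                           # placeholder for your real data

# Option 1: let as.h2o() use data.table underneath for the conversion
options("h2o.use.data.table" = TRUE)
hf1 <- as.h2o(my_df)

# Option 2: avoid holding two copies in RAM -- write once with fwrite(),
# then read straight into H2O with its parallel file reader
f <- tempfile(fileext = ".csv")
data.table::fwrite(my_df, f)
hf2 <- h2o.importFile(f)
```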

read.sas7bdat unable to read compressed file

I am trying to read a .sas7bdat file in R. When I use the command
library(sas7bdat)
read.sas7bdat("filename")
I get the following error:
Error in read.sas7bdat("county2.sas7bdat") : file contains compressed data
I do not have experience with SAS, so any help will be highly appreciated.
Thanks!
According to the sas7bdat vignette [vignette('sas7bdat')], COMPRESS=BINARY (or COMPRESS=YES) is not currently supported as of 2013 (and this was the vignette active on 6/16/2014 when I wrote this). COMPRESS=CHAR is supported.
These are basically internal compression routines, intended to make file sizes smaller. They're not nearly as good as gzip or similar, but SAS supports them transparently while writing SAS programs. Obviously they change the file format significantly, hence the lack of implementation so far.
If you have SAS, you need to write these to an uncompressed dataset.
options compress=no;
libname lib '//drive/path/to/files';
data lib.want;
  set lib.have;
run;
That's the simplest way (of many), assuming you have a libname defined as lib as above. Change have and want to the correct names: have should usually be the filename without its extension, and want can be anything logical (A-Z or underscore only, 32 or fewer characters).
If you don't have SAS, you'll have to ask your data provider to make the data available uncompressed, or in a different format. If you're getting this from a PUDS somewhere on the web, you might post where you're getting it from, and there might be a way to help you identify an uncompressed source.
This admittedly is not a pure R solution, but in many situations (e.g. if you aren't on a pc and don't have the ability to write the SAS file yourself) the other solutions posted are not workable.
Fortunately, Python has a module (https://pypi.python.org/pypi/sas7bdat) which supports reading compressed SAS data sets - it's certainly better using this than needing to acquire SAS if you don't already have it. Once you extract the file and save it to text via Python, you can then access it in R.
from sas7bdat import SAS7BDAT
import pandas as pd

InFileName = "myfile.sas7bdat"
OutFileName = "myfile.txt"

with SAS7BDAT(InFileName) as f:
    df = f.to_data_frame()

df.to_csv(path_or_buf=OutFileName, sep="\t", encoding="utf-8", index=False)
The haven package can read compressed SAS-files:
library(haven)
df <- read_sas("sasfile.sas7bdat")
However, it can only read SAS files compressed with compress=char, not compress=binary.
So haven will be able to read this SAS-file:
data output.compressed_data_char (compress=char);
  set inputdata;
run;
But not this SAS-file:
data output.compressed_data_binary (compress=binary);
  set inputdata;
run;
https://cran.r-project.org/package=haven
http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a001002773.htm
"RevoScaleR" is a good package for reading SAS data sets (compressed or uncompressed). You can use the rxImport() function from this package. Below is an example:
# Import the library
library(RevoScaleR)

# Read the data
R_df_name <- rxImport("fake_path/file_name.sas7bdat")
The speed of this function is far better than haven/sas7bdat/sas7bdat.parso. I hope this helps anyone who struggles to read SAS data sets in R.
Cheers!
I found R to be the easiest for this kind of challenge, especially with compressed sas7bdat files; just a few simple lines:
library(haven)
data <- read_sas("yourfile.sas7bdat")
and then transform it to csv:
write.csv(data, "data.csv")
