In-memory data processing in R: save -> readBin -> ?

How can I access R data originally saved with the save() command and later read back with readBin()?
Let me try to explain:
I have saved some data (mostly matrices and lists) to a file using the save() command.
Later I transformed (encrypted) this file and saved it using writeBin().
Since the file is transformed, I cannot get the data back directly with load(); I need to read it with readBin() and perform the opposite transformation in memory.
The problem is that after reading with readBin and transforming, the data are in memory, but I cannot access them as R objects (such as matrices or lists), since they are not recognized as such (there is just a single binary object).
The easiest way would be to use this binary object as a connection for load().
Unfortunately, load() does not accept in-memory binary connections.
I guess .Internal(loadFromConn2(...)) may be the key to this, but I do not know the details of its internal workings.
Is there any way to make R recognize binary data stored in memory as the original R objects (matrices, lists, etc.)?
The encryption code I am using is available at: http://pastebin.com/eVfVQYwn
Thanks in advance.

(If you aren't interested in learning how to research this type of
problem in the future, skip to "Results", far below.)
Long Story ...
Knowing a few things about how R objects are stored by save
will tell you how to retrieve them with load. From help(save):
save(..., list = character(),
file = stop("'file' must be specified"),
ascii = FALSE, version = NULL, envir = parent.frame(),
compress = !ascii, compression_level,
eval.promises = TRUE, precheck = TRUE)
The default for compress will be !ascii which means compress will
be TRUE, so:
compress: logical or character string specifying whether saving to a
named file is to use compression. 'TRUE' corresponds to
'gzip' compression, ...
The key here is that it defaults to 'gzip' compression. From here,
let's look at help(load):
'load' ... can read a compressed file (see 'save') directly from a
file or from a suitable connection (including a call to
'url').
(Emphasis added by me.) This implies both that it will take a
connection (that is not an actual file), and that it "forces"
compressed-ness. My typical go-to function for faking file connections
is textConnection, though this does not work with binary files, and
its help page doesn't provide a reference for binary equivalence.
Continued from help(load):
A not-open connection will be opened in mode '"rb"' and closed after
use. Any connection other than a 'gzfile' or 'gzcon'
connection will be wrapped in 'gzcon' to allow compressed saves to
be handled ...
Diving a little tangentially (remember the previous mention of gzip
compression?), help(gzcon):
Compressed output will contain embedded NUL bytes, and so 'con'
is not permitted to be a 'textConnection' opened with 'open =
"w"'. Use a writable 'rawConnection' to compress data into
a variable.
Aha! Now we see that there is a function rawConnection which one
would (correctly) infer is the binary-mode equivalent of
textConnection.
Results (aka "long story short, too late")
Your pastebin code is interesting but unfortunately moot.
Reproducible examples
make things easier for people considering answering your question.
Your problem statement, restated:
set.seed(1234)
fn <- 'test-mjaniec.Rdata'
(myvar1 <- rnorm(5))
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
(myvar2 <- sample(letters, 5))
## [1] "s" "n" "g" "v" "x"
save(myvar1, myvar2, file=fn)
rm(myvar1, myvar2) ## ls() shows they are no longer available
x.raw <- readBin(fn, what=raw(), n=file.info(fn)$size)
head(x.raw)
## [1] 1f 8b 08 00 00 00
## how to access the data stored in `x.raw`?
The answer:
load(rawConnection(x.raw, open='rb'))
(Confirmation:)
myvar1
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
myvar2
## [1] "s" "n" "g" "v" "x"
(It works with your encryption code, too, by the way.)

Related

RStudio read_delim(): intermittently receive error std::bad_alloc upon opening files with unusual delimiter

I received a series of 100+ files from a client. This client received the files as part of litigation, so they didn't have to be transmitted in a convenient fashion; they just all had to be present. In a single .zip file, the files are all tracked with names like Folder1.001, Folder1.002, Folder3.001, etc. When I unpackaged these files using the 7-Zip program, they don't show up with a .txt, .csv, or any other file extension. Windows incorrectly interprets the unzipped files as a ".001 File" or ".002 File." This is not the issue, because I know that the files are delimited by a ~ and are 118 columns wide. Each file has between 2.5M and 4.9M rows, and each is about 1 GB in size when unzipped.
This is my first ever post here, so please excuse any breach of etiquette.
I am working in a .Rmd file on a virtual machine running Windows. I have R4.2.2 (64-bit), and RStudio 2022.12.0+353. All work is being done within a drive on the virtual machine that has 9+ GB free out of 300 GB total. The size of this virtual drive could be increased, if necessary.
My goal here is to examine one variable in each file, to see whether cases fall within a given range for that variable, and save those rows that do. I have been saving them as .rds files using write_rds().
I have been bringing in the files using a read_delim() statement specifying 'delim = "~"'. I created a vector of 120 column names which I use because the columns are not labeled. These commands on their own are not an issue. A successful import looks like the below.
work1 <- read_delim("Data\\Folder1\\File1.001", delim = "~", col_names = vNames1)
Rows: 2577668 Columns: 120
── Column specification ───────────────────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
It mishandles the columns named Skip1 and Skip9 as logical values, but those aren't a necessary part of my analysis.
I then filter and write the file using
work1 <- work1 %>% filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)
write_rds(work1, "Data\\Working\\Folder1_001.rds")
I have also done this with read_delim() and filter() piped into a single command. This is not the issue. NOTE: before I read in the next file (File1.002), work1 is down to at most 4,000 cases, from the millions it had when imported.
Since I have over 100 of these files, I have written multiple code chunks to do a few of these at a time. After one to three read_delim() statements in a row, I get the below error.
work2 <- read_delim("Data\\Folder1\\File1.002", delim = "~", col_names = vNames1)
Error std::bad_alloc
This, I understand, has to do with memory allocation. I can close RStudio and restart, and that allows me to do one or two more imports, filterings, and writings. Doing that for over 100 files is far too inefficient.
I condensed my code a step further by writing the read_delim() step within the write_rds() step, which looks like the below.
write_rds((read_delim("Data\\Folder1\\File003",
                      delim = "~", col_names = vNames1) %>%
             filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)),
          "Data\\Working\\Folder1_003.rds")
Rows: 2577668 Columns: 120
── Column specification ───────────────────────────────────────────────────────
Delimiter: "~"
chr  (16): Press_ZIP, Person1ID, Specialty, PCode, Retailer, ProdType, ProdGroupNo, Unk1, Skip2, Skip3, Skip4, Skip5, Skip6, Skip7...
dbl (102): Person2No, ReportNo, DateStr, BucketNo, Bu1, Bu2, Bu3, Bu4, Bu5, Bu6, Bu7, Bu8, Bu9, Bu10, Bu11, Bu12, Bu13, Bu14, Bu15, B...
lgl   (2): Skip1, Skip9
ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
Yet after 1 or 2 successful runs, I get the same
Error std::bad_alloc message.
Using traceback(), it seems like it is related to vroom::vroom(), but I'm not sure how to check any further.
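For reference, here is a minimal sketch of the per-file workflow described above, run as a single loop with explicit cleanup (rm() plus gc()) between files. The paths, vNames1, and the Press_ZIP filter come from the question; the loop structure and the gc() call are my own assumptions and are not guaranteed to avoid the std::bad_alloc error.
library(readr)
library(dplyr)

# All of the ~100 delimited files described above (names like File1.001, File1.002, ...)
files <- list.files("Data\\Folder1", full.names = TRUE)

for (f in files) {
  out <- file.path("Data\\Working", paste0(gsub("[.]", "_", basename(f)), ".rds"))

  work <- read_delim(f, delim = "~", col_names = vNames1, show_col_types = FALSE) %>%
    filter(as.numeric(Press_ZIP) > 78900, as.numeric(Press_ZIP) < 99900)

  write_rds(work, out)

  rm(work)  # drop the multi-million-row object before the next file
  gc()      # trigger garbage collection before reading the next file
}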

How do I download images from a server and then upload it to a website using R?

Okay, so I have approximately 2 GB worth of files (images and whatnot) stored on a server (I'm using Cygwin right now since I'm on Windows), and I was wondering whether I can get all of this data into R and then eventually publish it to a website where people can view/download those images.
I currently have installed the ssh package and have logged into my server using:
ssh::ssh_connect("name_and_server_ip_here")
I've been able to connect successfully; however, I am not particularly sure how to locate the files on the server through R. I assume I would use something like scp_download to download the files from the server, but as mentioned, I am not sure how to locate them, so I wouldn't be able to download them anyway (yet)!
Any sort of feedback and help would be appreciated! Thanks :)
You can use ssh::ssh_exec_internal and the shell's find command to locate files on the server.
sess <- ssh::ssh_connect("r2@myth", passwd = "...")
out <- ssh::ssh_exec_internal(sess, command = "find /home/r2/* -maxdepth 3 -type f -iname '*.log'")
str(out)
# List of 3
# $ status: int 0
# $ stdout: raw [1:70] 2f 68 6f 6d ...
# $ stderr: raw(0)
The stdout/stderr are raw (it's feasible that the remote command did not produce ascii data), so we can use rawToChar to convert. (This may not be console-safe if you have non-ascii data, but it is here, so I'll go with it.)
rawToChar(out$stdout)
# [1] "/home/r2/logs/dns.log\n/home/r2/logs/ping.log\n/home/r2/logs/status.log\n"
remote_files <- strsplit(rawToChar(out$stdout), "\n")[[1]]
remote_files
# [1] "/home/r2/logs/dns.log" "/home/r2/logs/ping.log" "/home/r2/logs/status.log"
For downloading, scp_download is not vectorized, so we can only download one file at a time.
for (rf in remote_files) ssh::scp_download(sess, files = rf, to = ".")
# 4339331 C:\Users\r2\.../dns.log
# 36741490 C:\Users\r2\.../ping.log
# 17619010 C:\Users\r2\.../status.log
For uploading, scp_upload is vectorized, so we can send all in one shot. I'll create a new directory (just for this example, and to not completely clutter my remote server :-), and then upload them.
ssh::ssh_exec_wait(sess, "mkdir '/home/r2/newlogs'")
# [1] 0
ssh::scp_upload(sess, files = basename(remote_files), to = "/home/r2/newlogs/")
# [100%] C:\Users\r2\...\dns.log
# [100%] C:\Users\r2\...\ping.log
# [100%] C:\Users\r2\...\status.log
# [1] "/home/r2/newlogs/"
(I find it odd that scp_upload is vectorized while scp_download is not. If this were on a shell/terminal, then each call to scp would need to connect, authenticate, copy, then disconnect, a bit inefficient; since we're using a saved session, I believe (unverified) that there is little efficiency lost due to not vectorizing the R function ... though it is still really easy to vectorize it.)
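To illustrate that last remark, here is a small wrapper (the name scp_download_many is hypothetical) that loops over scp_download within the already-open session; that is all the vectorization needed:
# Hypothetical convenience wrapper: download several remote files over one saved session
scp_download_many <- function(sess, files, to = ".") {
  for (f in files) ssh::scp_download(sess, files = f, to = to)
  invisible(files)
}

scp_download_many(sess, remote_files, to = ".")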

Can you convert an R raw vector representing an RDS file back into an R object without a round trip to disk?

I have an RDS file that is uploaded and then downloaded via curl::curl_fetch_memory() (via httr); this gives me a raw vector in R.
Is there a way to read that raw vector representing the RDS file to return the original R object? Or does it always have to be written to disk first?
I have a setup similar to below:
saveRDS(mtcars, file = "obj.rds")
# upload the obj.rds file
...
# download it again via httr::write_memory()
...
obj
# [1] 1f 8b 08 00 00 00 00 00 00 03 ad 56 4f 4c 1c 55 18 1f ca 02 bb ec b2 5d
# ...
is.raw(obj)
#[1] TRUE
It seems readRDS() should be used to uncompress it, but it takes a connection object and I don't know how to make a connection object from an R raw vector - rawConnection() looked promising but gave:
rawConnection(obj)
#A connection with
#description "obj"
#class "rawConnection"
#mode "r"
#text "binary"
#opened "opened"
#can read "yes"
#can write "no"
readRDS(rawConnection(obj))
#Error in readRDS(rawConnection(obj)) : unknown input format
Looking through readRDS, it looks like it uses gzlib() underneath, but I couldn't get that to work with the raw vector object.
If it's downloaded via httr::write_disk() -> curl::curl_fetch_disk() -> readRDS() then it's all fine, but this is a round trip to disk and I wondered whether it could be optimised for big files.
By default, RDS file streams are gzipped. To read a raw connection you need to manually wrap it into a gzcon:
con = rawConnection(obj)
result = readRDS(gzcon(con))
This works even when the stream isn’t gzipped. But it fails if a different supported compression method (e.g. 'bzip2') was used to create the RDS file, and R doesn’t seem to have a gzcon equivalent for bzip2 or xz. For those formats, the only recourse seems to be to write the data to disk.
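For completeness, a minimal sketch of that disk fallback using a temporary file (the helper name read_rds_raw is my own; obj is the raw vector from curl):
# Fallback for bzip2-/xz-compressed RDS payloads: round-trip through a temp file
read_rds_raw <- function(obj) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf), add = TRUE)
  writeBin(obj, tf)   # write the raw bytes to disk unchanged
  readRDS(tf)         # let readRDS detect the compression itself
}

result <- read_rds_raw(obj)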
I had exactly the same problem, and for me the above answer with gzcon did not work; however, I could load the raw object directly into R's memory using rawConnection:
load(rawConnection(obj))

R, GET and GZ compression

I am building clients for RESTful APIs. Some links let me download attachments (files) from the server, and in the best case these are .txt. I only mention the RESTful part because it means I have to send some headers and potentially a body with each request; the standard R 'filename' = URL logic won't work.
Sometimes people bundle many txts into a zip. These are awkward since I don't know what they contain until I download many of them.
For the moment, I am unpackaging these, gzipping the files (adds the .gz extension) and re-uploading them. They can then be indexed and downloaded.
I'm using Hadley's cute httr package, but I can't see an elegant way to decompress the gz files.
When using read.csv or similar, any files with a .gz ending are automatically decompressed (convenient!). What's the equivalent when using httr or curl?
content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz"))
[1] 1f 8b 08 08 4e 9e 9b 51 00 03 73 ...
That looks nice, a compressed byte stream with the correct header (1f 8b). Now I need the text contents, so I tried using memDecompress, which says it should do this:
memDecompress(content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz")),type="gzip")
Error in memDecompress(content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz")), :
internal error -3 in memDecompress(2)
What's the proper solution here?
Also, is there a way to get R to pull the INDEX of a remote .zip file without downloading all of it?
The following works, but seems a little convoluted:
> scan(gzcon(rawConnection(content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz")))),"",,,"\n")
Read 1 item
[1] "These are not the droids you are looking for"
You can add a parser to handle the mime type. Look at ?content and the line "You can add new parsers by adding appropriately named functions to httr:::parsers":
ls(httr:::parsers)
# [1] "application/json"  "application/x-www-form-urlencoded"  "image/jpeg"
# [4] "image/png"         "text/html"                           "text/plain"
# [7] "text/xml"
We can add one to handle gz content. I don't have a better answer at this point than the one you gave, so you can incorporate your function.
assign("application/octet-stream", function(x, ...) {scan(gzcon(rawConnection(x)),"",,,"\n")},envir = httr:::parsers)
content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz"), as = "parsed")
Read 1 item
[1] "These are not the droids you are looking for"
EDIT:
I hacked together an alternative:
assign("application/octet-stream", function(x, ...) {f <- tempfile(); writeBin(x,f);untar(f);readLines(f, warn = FALSE)},envir = httr:::parsers)
content(GET("http://glimmer.rstudio.com/alexbbrown/gz/sample.txt.gz"), as = "parsed")
#[1] "These are not the droids you are looking for"
With regard to listing the files in the archive, maybe you can adjust the function somewhat.
If we try to get the httr source files, they have the mime type "application/x-gzip":
assign("application/x-gzip", function(x, ...) {
f <- tempfile();
writeBin(x,f);
if(!is.null(list(...)$list)){
if(list(...)$list){
return(untar(f, list = TRUE))
}else{
untar(f, ...);
readLines(f)
}
}else{
untar(f, ...);
readLines(f)
}
}, envir = httr:::parsers)
content(GET("http://cran.r-project.org/src/contrib/httr_0.2.tar.gz"), as = "parsed", list = TRUE)
# > head(content(GET("http://cran.r-project.org/src/contrib/httr_0.2.tar.gz"), as = "parsed", list = TRUE))
#[1] "httr/" "httr/MD5" "httr/tests/"
#[4] "httr/tests/test-all.R" "httr/README.md" "httr/R/"

Logfile analysis in R?

I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. I might not be the first to think of doing it in R, but R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there an R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?
In connection with a project to build an analytics toolbox for our Network Ops guys,
I built one of these about two months ago. My employer has no problem if I open source it, so if anyone is interested I can put it up on my github repo. I assume it's most useful to this group if I build an R package. I won't be able to do that straight away, though,
because I need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but I have no idea).
I was actually surprised that I needed to undertake a project of this sort. There are at least several excellent open source and free log file parsers/viewers (including the excellent Webalyzer and AWStats), but neither parses server error logs (parsing server access logs is the primary use case for both).
If you are not familiar with error logs or with the difference between them and access
logs: in sum, Apache servers (likewise nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X,
that directory is in /var, just below root:
$> pwd
/var/log/apache2
$> ls
access_log error_log
For network diagnostics, error logs are often far more useful than the access logs.
They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and, more significantly, because the data file
you are left with after parsing is an irregular time series--you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth.
I wanted an app that I could toss raw error logs into (of any size, but usually several hundred MB at a time) and have something useful come out the other end--which in this case had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, I coded the raw-log parser in python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization I coded in R.
I have been building analytics tools for a long time, but only in the past
four years have I been using R. So my first impression--immediately upon parsing a raw log file and loading the data frame in R--was what a pleasure R is to work with and how well suited it is for tasks of this sort. A few welcome surprises:
Serialization. To persist working data in R is a single command
(save). I knew this, but I didn't know how efficient this binary
format is. The actual data: for every 50 MB of raw logfiles parsed, the
.RData representation was about 500 KB--100 : 1 compression. (Note: I
pushed this down further to about 300 : 1 by using the data.table
library and manually setting the compression level argument to the save
function; a short save() example appears after the code below);
IO. My Data Warehouse relies heavily on a lightweight datastructure
server that resides entirely in RAM and writes to disk
asynchronously, called redis. The project itself is only about two
years old, yet there's already a redis client for R on CRAN (by B.W.
Lewis, version 1.6.1 as of this post);
Primary Data Analysis. The purpose of this project was to build a
library for our Network Ops guys to use. My goal was a "one command =
one data view" type interface. So, for instance, I used the excellent
googleVis package to create professional-looking
scrollable/paginated HTML tables with sortable columns, in which I
loaded a data frame of aggregated data (>5,000 lines). Just those few
interactive elements--e.g., sorting a column--delivered useful
descriptive analytics. Another example: I wrote a lot of thin
wrappers over some basic data-juggling and table-like functions; each
of these functions I would, for instance, bind to a clickable button
on a tabbed web page. Again, this was a pleasure to do in R, in part
because quite often the function required no wrapper; the single
command with the arguments supplied was enough to generate a useful
view of the data.
A couple of examples of the last bullet:
# what are the most common issues that cause an error to be logged?
err_order = function(df) {
  t0 = xtabs(~Issue_Descr, df)
  m = cbind(names(t0), t0)
  rownames(m) = NULL
  colnames(m) = c("Cause", "Count")
  x = m[, 2]
  x = as.numeric(x)
  ndx = order(x, decreasing = TRUE)
  m = m[ndx, ]
  m1 = data.frame(Cause = m[, 1], Count = as.numeric(m[, 2]),
                  CountAsProp = 100 * as.numeric(m[, 2]) / dim(df)[1])
  subset(m1, CountAsProp >= 1.)
}
# calling this function, passing in a data frame, returns something like:
                            Cause Count CountAsProp
1 'connect to unix://var/ failed'   200        40.0
2  'object buffered to temp file'   185        37.0
3            'connection refused'    94        18.8
The primary data cube displayed for interactive analysis using googleVis:
[Figure: a contingency table (from an xtabs call) rendered as a googleVis table]
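Returning to the serialization bullet above, here is a minimal sketch of persisting a parsed data.table with an explicit compression setting; the object and file names are illustrative, not from the original project:
library(data.table)

# Illustrative parsed-log table; in practice this comes from the parser output
dt <- data.table(timestamp = Sys.time() + 1:3,
                 issue     = c("connection refused",
                               "connect to unix://var/ failed",
                               "object buffered to temp file"))

# save() lets you choose the compressor and level explicitly;
# higher levels trade CPU time for a smaller .RData file
save(dt, file = "parsed_errors.RData", compress = "gzip", compression_level = 9)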
It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning algorithms, and has three different regexp engines for parsing, etc.
And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: Read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.
I have used R to load and parse IIS log files with some success; here is my code.
Load IIS Log files
require(data.table)
setwd("Log File Directory")
# get a list of all the log files
log_files <- Sys.glob("*.log")
# This line
# 1) reads each log file
# 2) concatenates them
IIS <- do.call( "rbind", lapply( log_files, read.csv, sep = " ", header = FALSE, comment.char = "#", na.strings = "-" ) )
# Add field names - copy the "Fields" header line from one of the log files
colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query", "s_port", "cs_username", "c_ip", "cs_User_Agent", "sc_status", "sc_substatus", "sc_win32_status", "sc_bytes", "cs_bytes", "time-taken")
#Change it to a data.table
IIS <- data.table( IIS )
#Query at will
IIS[, .N, by = list(sc_status,cs_username, cs_uri_stem,sc_win32_status) ]
I did a logfile analysis recently using R. It was nothing really complex, mostly descriptive tables. R's built-in functions were sufficient for this job.
The problem was data storage, as my logfiles were about 10 GB. Revolution R does offer new methods to handle such big data, but I finally decided to use a MySQL database as a backend (which in fact reduced the size to 2 GB through normalization).
That could also solve your problem of reading logfiles into R.
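A minimal sketch of that database-backend approach, assuming the parsed log lines have been loaded into a MySQL table named access_log and using DBI with RMariaDB (all connection details and column names here are illustrative):
library(DBI)

# Connect to the MySQL/MariaDB backend that holds the parsed logs
con <- dbConnect(RMariaDB::MariaDB(), dbname = "weblogs",
                 host = "localhost", user = "analyst", password = "...")

# Pull only the aggregate into R; the 10 GB of raw lines stay in the database
status_counts <- dbGetQuery(con, "
  SELECT status, COUNT(*) AS n
  FROM access_log
  GROUP BY status
  ORDER BY n DESC")

dbDisconnect(con)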
#!python
import argparse
import csv
import cStringIO as StringIO


class OurDialect:
    escapechar = ','
    delimiter = ' '
    quoting = csv.QUOTE_NONE


parser = argparse.ArgumentParser()
parser.add_argument('-f', '--source', type=str, dest='line',
                    default=[['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"'''],
                             ['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"''']])
arguments = parser.parse_args()

try:
    with open(arguments.line, 'rb') as fin:
        line = fin.readlines()
except:
    pass
finally:
    # fall back to the embedded sample lines for this demo
    line = arguments.line

header = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint',
          'HTTP Version', 'HTTP Return code', 'Size in bytes', 'User-Agent']

# drop the trailing character and strip brackets/quotes from each raw log line
lines = [[l[:-1].replace('[', '"').replace(']', '"').replace('"', '') for l in l1] for l1 in line]

out = StringIO.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer = csv.writer(out, dialect=OurDialect)
writer.writerows([[l1 for l1 in l] for l in lines])

print(out.getvalue())
Demo output:
IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,User-Agent
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, -
This format can easily be read into R using read.csv, and it doesn't require any third-party libraries.
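For example, assuming the Python script above has been saved as parse_logs.py (a hypothetical name), its CSV output can be piped straight into read.csv; in the demo output above the header row has fewer fields than the data rows, so this sketch reads the data rows and skips the header:
# Pipe the parser's stdout straight into read.csv -- no intermediate file needed.
# skip = 1 drops the header row, which in the demo output is narrower than the data rows.
logs <- read.csv(pipe("python parse_logs.py"), header = FALSE, skip = 1,
                 strip.white = TRUE)
str(logs)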
