Having saved a data frame to HDFS, I get an error when I try to unserialize it after reading it back in using rhdfs

I have written a data frame to HDFS using the rhdfs library, and when I try to read it back in I get an error.
The code to write the data frame is as follows:
df.file <- hdfs.file("/mydir/df.Rdata", "w")
hdfs.write(df, df.file)
hdfs.close(df.file)
And to read it back in I use
df.file <- hdfs.file("/mydir/df.Rdata", "r")
m <- hdfs.read(df.file)
df <- unserialize(m)
hdfs.close(df.file)
But I get an error at the unserialize stage,
Error in unserialize(m) : read error
Does anyone have any idea what causes this error and what I can do to prevent it? Any help would be much appreciated.

This happens when the serialized object is bigger than 65536 bytes.
If you look at the RStudio Environment pane, you will see that the object returned by hdfs.read is raw[1:65536], so you have read only the first part of the file and unserialize fails on the truncated data.
You should read it in pieces, as in this code:
http://chingchuan-chen.github.io/posts/2015/04/08/installations-of-rhdfs-rmr2-plyrmr-and-hbase
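A minimal sketch of reading in pieces, assuming each read call returns a raw vector and signals the end of the data with NULL or an empty vector. A rawConnection stands in for HDFS so the example runs without Hadoop. read_all_chunks is a hypothetical helper, not part of rhdfs; with rhdfs you would pass something like function() hdfs.read(df.file), on the assumption that hdfs.read returns NULL at the end of the file.

```r
# Hypothetical helper: keep calling read_chunk() until it reports no more data,
# then concatenate the pieces into one raw vector.
read_all_chunks <- function(read_chunk) {
  chunks <- list()
  repeat {
    chunk <- read_chunk()
    if (is.null(chunk) || length(chunk) == 0L) break
    chunks[[length(chunks) + 1L]] <- chunk
  }
  do.call(c, chunks)
}

# Stand-in for HDFS: a serialized data frame served 65536 bytes at a time.
df <- data.frame(x = seq_len(50000), y = rnorm(50000))
con <- rawConnection(serialize(df, NULL), "r")
m <- read_all_chunks(function() readBin(con, what = "raw", n = 65536L))
close(con)

# Only the complete raw vector can be unserialized.
restored <- unserialize(m)
identical(restored, df)
```

Collecting the chunks in a list and concatenating once at the end avoids growing the raw vector on every iteration.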

Related

Reading .xls-file in R

I am trying to read a .xls file into an R data frame. I've tried:
library(readxl)
dfTest <- readxl::read_excel("file_path/file.xls")
Which gives me:
Error:
filepath: file_path/file.xls
libxls error: Unable to open file
Next I tried:
library(xlsx)
dfTest <- xlsx::read.xlsx("file_path/file.xls",1)
Which results in:
Error in .jcall("RJavaTools", "Ljava/lang/Object;", "invokeMethod", cl, :
java.io.IOException: block[ 1462 ] already removed - does your POIFS have circular or duplicate block references?
I tried:
library(openxlsx)
dfTest <- openxlsx::read.xlsx("file_path/file.xls")
Which results in:
Error in read.xlsx.default("file_path/file.xls") :
openxlsx can not read .xls or .xlm files!
Last thing that I tried was:
library(RODBC)
conn <- odbcConnectExcel("file_path/file.xls")
Which gives me:
Error in odbcConnectExcel("file_path/file.xls") :
odbcConnectExcel is only usable with 32-bit Windows
Would anyone have an idea how I can read the Excel file? Saving the file as a .csv file and loading it into R works perfectly fine. However, I have a large number of files that I ultimately want to read and process in a loop. Saving them all by hand as .csv is tedious, to say the least.
I'm restricted in changing the software installations on the computer I'm working on.
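If the file is really delimited text that was saved with an .xls extension (which would be consistent with the Excel readers failing), I believe read_delim from the readr package should work. Note that as.is is an argument of base R's read.delim, not of readr::read_delim, so it is dropped here, and the tab delimiter is an assumption about your files.
For example:
readr::read_delim("file_path/file.xls", delim = "\t")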
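One way to check which kind of file you have before choosing a reader is to look at its first bytes. A binary .xls is an OLE2 compound file and starts with a fixed 8-byte signature; a text file with an .xls extension does not. The sketch below uses only base R. is_binary_xls is a hypothetical helper name, and the temporary tab-delimited file is made up for the demonstration.

```r
# Signature that opens every OLE2 compound file, the container used by binary .xls.
ole2_magic <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))

# Hypothetical helper: TRUE if the file starts with the OLE2 signature.
is_binary_xls <- function(path) {
  header <- readBin(path, what = "raw", n = 8L)
  identical(header, ole2_magic)
}

# Demonstration: a tab-delimited text file that merely carries an .xls extension.
fake_xls <- tempfile(fileext = ".xls")
write.table(data.frame(id = 1:3, name = c("a", "b", "c")),
            fake_xls, sep = "\t", row.names = FALSE, quote = FALSE)

is_binary_xls(fake_xls)   # FALSE: this is text, so a delimited reader applies
parsed <- read.delim(fake_xls, stringsAsFactors = FALSE)
parsed
```

In a loop over many files, the same check could decide per file whether to call a delimited-text reader or an Excel reader.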

Difficulty opening a package data file of unknown type

I am trying to load the state map from the maps package into an R object. I am hoping it is a SpatialPolygonsDataFrame or something I can turn into one after I have inspected it. However I am failing at the first step – getting it into an R object. I do not know the file type.
I first tried to assign the map() output to an R object directly:
st_m <- maps::map(database = "state")
This draws the map, but str(st_m) appears to do nothing, unless it is redrawing the same map.
Then I tried loading it as a dataset: st_m <- data("stateMapEnv", package="maps") but this just returns a string:
> str(stateMapEnv)
chr "R_MAP_DATA_DIR"
I opened the maps directory win-library/3.4/maps/mapdata/ and found what I think is the map file, “state.L”.
I tried reading it with scan and got an error message I do not understand:
scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
scan() expected 'a real', got '#'
I then opened the file with Notepad++. It appears to be a binary or compressed file.
So I thought it might be an R data file with an unusual extension. But my attempt to load it returned a “bad magic number” error:
st_m <- load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
bad restore file magic number (file may be corrupted) -- no data loaded
Observing that these responses have progressed from the unhelpful through the incomprehensible to the occult, I thought it best to seek assistance from the wizards of stackoverflow.
This should be able to export the 'state' or any other maps dataset for you:
library(ggplot2)
state_dataset <- map_data("state")
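If you would rather stay within the maps package, map() can return the data without drawing anything. A small sketch, assuming the maps package is installed (the requireNamespace guard simply skips the example if it is not); plot and fill are documented arguments of maps::map.

```r
if (requireNamespace("maps", quietly = TRUE)) {
  # plot = FALSE returns the map object instead of drawing it;
  # fill = TRUE asks for closed polygons, one per named region.
  st_m <- maps::map(database = "state", plot = FALSE, fill = TRUE)

  # The result is a list of class "map" with components x, y, range and names.
  str(st_m)
  head(st_m$names)
}
```

The polygons are stored as x and y coordinate vectors separated by NA values, so this object still needs a conversion step before it becomes a spatial class.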

Loading .RData file into Data Science Experience

I am trying to load a .RData file into my R Notebook in DSX. I have followed the instructions in this notebook (https://apsportal.ibm.com/exchange/public/entry/view/90a34943032a7fde0ced0530d976ca82) but am still unable to load my data. So far, I have been successful in the following steps:
I have loaded my dataset into object storage.
I inserted my credentials using the Insert to code -> Insert Credentials button. This seemed to work as expected.
In the next cell, I chose the Insert to code -> Insert textConnection object option. This seemed to work as expected also.
The output of step # 3 was as follows:
Your data file was loaded into a textConnection object and you can process the data with your package of choice.
data.1 <- getObjectStorageFileWithCredentials_xxxxxxxxxx("projectname", "file.RData")
After this, since my file is a .RData file, I typed the following command:
data <- load("file.RDA")
When I ran this cell, I got the following output:
Warning message in readChar(con, 5L, useBytes = TRUE):
“cannot open compressed file 'file.RDA', probable reason 'No such file or directory'”
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
Traceback:
load("file.RDA")
readChar(con, 5L, useBytes = TRUE)
When I type in the following command to print the dataset:
data
I get the following output:
X.html..h1.Forbidden..h1..p.Access.was.denied.to.this.resource...p...html.
Please can someone help?
Thanks,
Venky
Here is a workaround. load can't read from a response object, and the only way to read objects from Object Storage is the REST API.
I tried to use rawConnection instead of textConnection, but it did not seem to help.
So instead of passing the object read from Object Storage directly to load or readRDS, you can write it to the GPFS of the attached Spark service and read it from there, the same as reading a local file.
Change these lines in the generated code:
rawdata <- content(httr::GET(url = access_url, add_headers("Content-Type" = "application/json", "X-Auth-Token" = x_subject_token)), as="raw")
rawdata
Basically, instead of returning text, return a raw object and then write it as a binary file to the local GPFS.
data.3 <- getObjectStorageFileWithCredentials_216c032f3f574763ae975c6a83a0d523("testObjectStorage", "sample.rdata")
writeBin(data.3,"sample.rdata")
Now read it back using readRDS or load.
load("sample.rdata")
To see the loaded data frame:
ls()
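The write-then-load step can be tried locally without Object Storage. In this sketch the raw vector comes from a temporary .RData file rather than from an HTTP response, and all file and object names are made up for illustration. It also shows that load returns the names of the restored objects, not the data itself, so the result of load should not be treated as the dataset.

```r
# Stand-in for the raw response body: the bytes of an .RData file.
src <- tempfile(fileext = ".RData")
sample_df <- data.frame(id = 1:3, score = c(0.5, 1.5, 2.5))
save(sample_df, file = src)
rawdata <- readBin(src, what = "raw", n = file.size(src))

# Write the bytes unchanged to a local file, then load from that file.
dest <- tempfile(fileext = ".RData")
writeBin(rawdata, dest)

# Loading into a separate environment avoids overwriting existing objects.
loaded_env <- new.env()
loaded_names <- load(dest, envir = loaded_env)

loaded_names                 # names of the restored objects
loaded_env[[loaded_names[1]]]
```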
I hope it helps.
Thanks,
Charles.

Error in ls(envir = envir, all.names = private)?

The error below keeps coming up inconsistently when I try to read Excel files into R using the 'XLConnect' package.
Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
I have actually run into this error even while using other packages that read Excel files, such as 'xlsx' and 'xlsReadWrite'. Restarting the R session often solves the problem, which leads me to think that something else I am doing in my R session is changing the environment and preventing me from loading Excel files. Below is the latest example of code that causes this error. In this case I know that the following sequence causes the error to appear, but why does that happen? And how can I get past this error if I need the chron package?
library("XLConnect")
wb2 <- loadWorkbook("excel_file", create = FALSE)
library(chron)
wb2 <- loadWorkbook("excel_file", create = FALSE)
Anyone else run into this issue before? Any help on this issue is greatly appreciated!
Before reopening the workbook, try removing the reference to the previously opened one:
rm(wb2)
wb2 <- loadWorkbook("excel_file", create = FALSE)
Also, make sure that "excel_file" is not open in Excel or any other program while you run the R test.
I've seen the same error come up when using XLConnect and the above seemed to help.
I had this problem a couple of times, and the call stack suggests that this message is generated when an "OutOfMemory" exception is thrown.
To solve this problem I used:
options( java.parameters = "-Xmx4g" )
to increase the heap size rJava is able to use.
Debugging with options(error=utils::recover) helped a lot, because the R error messages are not very specific.
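One detail worth adding: the JVM reads java.parameters only once, when rJava starts, so the option has to be set before XLConnect, xlsx or rJava is loaded in the session. The sketch below wraps that check in a hypothetical helper, set_java_heap, which is not part of any package; the 4g value is just the figure used above.

```r
# Hypothetical helper: set the JVM heap limit only while it can still take effect.
set_java_heap <- function(size = "4g") {
  if ("rJava" %in% loadedNamespaces()) {
    warning("rJava is already loaded; restart R for a new heap size to apply")
    return(invisible(FALSE))
  }
  options(java.parameters = paste0("-Xmx", size))
  invisible(TRUE)
}

set_java_heap("4g")
getOption("java.parameters")

# Load the Java-backed package only after the option is set, for example:
# library(XLConnect)
```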

getting the name of a dataframe from loading a .rda file in R

I am trying to load an .rda file in R that contains a saved data frame. I do not remember its name, though.
I have tried
a<-load("al.rda")
which then does not let me do anything with a. I get the error
Error:object 'a' not found
I have also tried to use the = sign.
How do I load this .rda file so I can use it?
I restarted R, ran load("al.rda"), and I now get the following error
Error: C stack usage is too close to the limit
Use 'attach' and then 'ls' with a name argument. Something like:
attach("al.rda")
ls("file:al.rda")
The data file is now on your search path in position 2, most likely. Do:
search()
ls(pos=2)
for enlightenment. Typing the name of any object saved in al.rda will now get it, unless you have something in search path position 1, but R will probably warn you with some message about a thing masking another thing if there is.
However I now suspect you've saved nothing in your RData file. Two reasons:
You say you don't get an error message
load says there's nothing loaded
I can duplicate this situation. If you do save(file="foo.RData") then you'll get an empty RData file - what you probably meant to do was save.image(file="foo.RData") which saves all your objects.
How big is this .rda file of yours? If it's under 100 bytes (my empty RData files are 42 bytes long) then I suspect that's what's happened.
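The empty-file case can be reproduced in a few lines. This is a sketch using a temporary file; the exact byte count depends on the R version and compression settings, so no size is asserted here. The suppressWarnings call is there because save may warn when it is given nothing to save.

```r
f <- tempfile(fileext = ".RData")

# save() called with a file but no objects still writes a (nearly empty) file.
suppressWarnings(save(file = f))

# Loading it restores nothing: load() returns a zero-length character vector.
empty_env <- new.env()
loaded <- load(f, envir = empty_env)
length(loaded)
length(ls(empty_env))

# Compare this size with the size of the suspect .rda file.
file.size(f)
```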
I had to reinstall R...somehow it was corrupt. The simple command which I expected of
load("al.rda")
finally worked.
I had a similar issue, and it was solved without reinstalling R. For example, doing
load("al.rda") works fine, however if you do
a <- load("al.rda") it will not work.
The load function does return the names of the objects it loaded, as a character vector. I suspect you actually get an error when you load "al.rda". What exactly does R output when you load?
Example of how it should work:
d <- data.frame(a=11:13, b=letters[1:3])
save(d, file='foo.rda')
a <- load('foo.rda')
a # prints "d"
Just to be sure, check that the load function you actually call is the original one:
find("load") # should print "package:base"
EDIT Since you now get an error when you load the file, it is probably corrupt in some way. Try this and say what it prints:
file.info("al.rda") # Prints the file size etc...
readBin("al.rda", "raw", 50) # reads first 50 bytes from the file
Without having access to the file, it's hard to investigate more... Maybe you could share the file somehow (http://www.filedropper.com or similar)?
I usually use save to save only a single object, and I then use the following utility method to retrieve that object into a given variable name using load, but into a temporary namespace to avoid overwriting existing objects. Maybe it will be helpful for others as well:
load_first_object <- function(fname){
  e <- new.env(parent = parent.frame())
  load(fname, e)
  return(e[[ls(e)[1]]])
}
The method can of course be extended to also return named objects and lists of objects, but this simple version is for me the most useful.
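A possible extension along those lines, returning every saved object as a named list. This is only a sketch: load_objects is a hypothetical name, and the temporary file and the objects d and n exist just for the demonstration.

```r
# Hypothetical helper: load a file into a private environment and
# return everything it contained as a named list.
load_objects <- function(fname) {
  e <- new.env(parent = emptyenv())
  load(fname, envir = e)
  mget(ls(e), envir = e)
}

# Demonstration with a temporary file holding two objects.
f <- tempfile(fileext = ".rda")
d <- data.frame(a = 11:13, b = letters[1:3])
n <- 42
save(d, n, file = f)

objs <- load_objects(f)
names(objs)   # the names under which the objects were saved
objs$d
```

Because the list is named, this also answers the original question: names(objs) tells you what the forgotten data frame was called.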
