SparkR reading and writing dataframe issue - r

I have a Spark DataFrame that I want to write to disk. I used the following code:
write.df(data_frame,"dataframe_temp.csv",source="csv",mode="overwrite",schema="true",header="true")
It completed, and I can see a new folder containing a _SUCCESS file.
Now, when I try to read the same file back using the following code:
dataframe2<-read.df("dataframe_temp.csv",inferSchema="true",header="true")
I get the following error:
ERROR RBackendHandler: loadDF on org.apache.spark.sql.api.r.SQLUtils
failed Error in invokeJava(isStatic = TRUE, className, methodName,
...) : org.apache.spark.sql.AnalysisException: Unable to infer
schema for ParquetFormat at dataframe.csv. It must be specified
manually;
I have even tried repartitioning to a single partition before writing:
data_frame<-repartition(data_frame,1)
Any help?

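You also have to specify the source as "csv" when reading. Without it, read.df falls back to the default data source (Parquet unless spark.sql.sources.default is changed), which is why the error mentions ParquetFormat: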
dataframe2<-read.df("dataframe_temp.csv", source="csv")
Regarding the header argument:
Currently there is also a bug in SparkR for Spark 2.0, where the variable arguments of the write.df function aren't passed to the options parameter (see https://issues.apache.org/jira/browse/SPARK-17442). That's why the header is not written to the csv even if you specify header="true" on write.df.

I solved it by using the Parquet file format instead; Parquet stores the schema together with the data by default.
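A minimal sketch of the Parquet round trip described above, assuming SparkR 2.x is installed and a local Spark session can be started. The built-in faithful dataset and the temporary path are stand-ins chosen for illustration, not taken from the question.

```r
library(SparkR)

# Assumption: SPARK_HOME is configured so a local session can start.
sparkR.session()

# Stand-in data: the built-in 'faithful' data set instead of the asker's DataFrame.
data_frame <- createDataFrame(faithful)

# Hypothetical output location inside the R session's temp directory.
out_path <- file.path(tempdir(), "dataframe_temp.parquet")

# Parquet keeps column names and types, so no header or schema options are needed.
write.df(data_frame, path = out_path, source = "parquet", mode = "overwrite")

# Name the source explicitly on read as well, rather than relying on the default.
dataframe2 <- read.df(out_path, source = "parquet")
printSchema(dataframe2)
```

Naming the source on both sides keeps the read and the write symmetric, which is the same fix the CSV answer relies on.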

Difficulty opening a package data file of unknown type

I am trying to load the state map from the maps package into an R object. I am hoping it is a SpatialPolygonsDataFrame or something I can turn into one after I have inspected it. However I am failing at the first step – getting it into an R object. I do not know the file type.
I first tried to assign the map() output to an R object directly:
st_m <- maps::map(database = "state")
This draws the map, but str(st_m) appears to do nothing, unless it is redrawing the same map.
Then I tried loading it as a dataset: st_m <- data("stateMapEnv", package="maps") but this just returns a string:
> str(stateMapEnv)
chr "R_MAP_DATA_DIR"
I opened the maps directory win-library/3.4/maps/mapdata/ and found what I think is the map file, “state.L”.
I tried reading it with scan and got an error message I do not understand:
scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in scan(file = "D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
scan() expected 'a real', got '#'
I then opened the file with Notepad++. It appears to be a binary or compressed file.
So I thought it might be an R data file with an unusual extension. But my attempt to load it returned a “bad magic number” error:
st_m <- load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L")
Error in load("D:/Documents/R/win-library/3.4/maps/mapdata/state.L") :
bad restore file magic number (file may be corrupted) -- no data loaded
Observing that these responses have progressed from the unhelpful through the incomprehensible to the occult, I thought it best to seek assistance from the wizards of Stack Overflow.
This should be able to export the 'state' or any other maps dataset for you:
library(ggplot2)
state_dataset <- map_data("state")
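If the goal is to inspect the map object itself rather than a data frame, a sketch along these lines may help. It assumes only that the maps package is installed; plot = FALSE suppresses drawing and fill = TRUE asks for closed polygons. Converting the result to a spatial class would need an additional package and is not shown here.

```r
library(maps)

# Return the map data without drawing it.
st_m <- maps::map(database = "state", plot = FALSE, fill = TRUE)

# The result is a list of class "map" holding coordinates and region names.
class(st_m)
names(st_m)
head(st_m$names)

# str() now has no plot to compete with and prints the structure.
str(st_m)
```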

Loading .RData file into Data Science Experience

I am trying to load a .RData file into my R Notebook in DSX. I have followed the instructions in this notebook (https://apsportal.ibm.com/exchange/public/entry/view/90a34943032a7fde0ced0530d976ca82) but am still unable to load my data. So far, I have been successful in the following steps:
1. I have loaded my dataset into object storage.
2. I inserted my credentials using the Insert to code -> Insert Credentials button. This seemed to work as expected.
3. In the next cell, I chose the Insert to code -> Insert textConnection object option. This also seemed to work as expected.
The output of step 3 was as follows:
Your data file was loaded into a textConnection object and you can process the data with your package of choice.
data.1 <- getObjectStorageFileWithCredentials_xxxxxxxxxx("projectname", "file.RData")
After this, since my file is a .RData file, I typed the following command:
data <- load("file.RDA")
When I ran this cell, I got the following output:
Warning message in readChar(con, 5L, useBytes = TRUE):
“cannot open compressed file 'file.RDA', probable reason 'No such file or directory'”
Error in readChar(con, 5L, useBytes = TRUE): cannot open the connection
Traceback:
load("file.RDA")
readChar(con, 5L, useBytes = TRUE)
When I type in the following command to print the dataset:
data
I get the following output:
X.html..h1.Forbidden..h1..p.Access.was.denied.to.this.resource...p...html.
Please can someone help?
Thanks,
Venky
Here is a workaround. load() cannot read from the response object, and the REST API is the only way to read objects from Object Storage.
I tried using rawConnection instead of textConnection, but that did not help.
So instead of passing the object read from Object Storage directly to load() or readRDS(), you can write it to the GPFS of the attached Spark service and read it from there, just as you would read a local file.
Change these lines in the generated code:
rawdata <- content(httr::GET(url = access_url, add_headers ("Content-Type" = "application/json", "X-Auth-Token" = x_subject_token)), as="raw")
rawdata
Basically, instead of returning text, return the raw object and then write it as a binary file to the local GPFS.
data.3 <- getObjectStorageFileWithCredentials_216c032f3f574763ae975c6a83a0d523("testObjectStorage", "sample.rdata")
writeBin(data.3,"sample.rdata")
Now read it back using load() (or readRDS(), if the file was written with saveRDS()).
load("sample.rdata")
To see the loaded data frame:
ls()
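The write-then-load step can be tried without object storage. The sketch below uses only base R; the raw vector is produced locally as a stand-in for the bytes returned by the generated download function, and example_df is a hypothetical object name.

```r
# Stand-in for the download: create an .RData file and read its bytes as raw.
src <- tempfile(fileext = ".RData")
example_df <- data.frame(id = 1:3, value = c("a", "b", "c"))
save(example_df, file = src)
raw_bytes <- readBin(src, what = "raw", n = file.info(src)$size)
rm(example_df)

# Write the raw bytes to a local file, as writeBin(data.3, "sample.rdata") does above.
local_copy <- file.path(tempdir(), "sample.rdata")
writeBin(raw_bytes, local_copy)

# load() restores the saved objects and returns their names, not the data itself.
loaded_names <- load(local_copy)
loaded_names
example_df
```

Because load() returns only the object names, assigning its result (as in data <- load(...)) gives a character vector rather than the data frame.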
I hope it helps.
Thanks,
Charles.

Spotfire TERR text mining error: "name must be a single string"

I am trying to create a script that does text mining (tm) combining property and action controls with TERR.
I have run my script successfully in open-source R but keep getting an error in TERR. I have narrowed down the function causing the error to VCorpus, part of the tm package. Here is the portion of the script causing errors:
myinput <- do.call(paste, c(as.list(col1), sep=" "))
col1 is a document property (string) set by the selection in a property control drop-down list.
b <- VCorpus(VectorSource(myinput), readerControl = list(language = 'eng'))
... and the error message I get in TERR is:
TIBCO Enterprise Runtime for R returned an error: 'Error in
getS3method("pGetElem", class(x), TRUE) : 'name' must be a single
string'.
I am stuck at this point too.
The script works with the open-source R engine, but in TERR I am still trying to resolve this error.
I suspect the problem is the data format that TERR expects.
I got a solution from the TIBCO developer community.
Answer:
You will not see this error if you use TERR 4.1.
There was a bug that was fixed in version 4.1.
Reference:
https://docs.tibco.com/pub/enterprise-runtime-for-R/4.1.0/TIB_terr_4.1.0_relnotes.pdf
See the following fix on page 16:
TERR-6049 The getS3method function now works when the class argument is of length greater than 1.

Unable to read an SBML file in SBMLR

I'm trying to read a SBML file (Test.xml) using the R package SBMLR. Below is the code I executed.
library(SBMLR)
file <- system.file("models","Test.xml",package = "SBMLR")
doc <- readSBML(file)
When I execute the third line, I get an error message saying:
Error in xmlEventParse(filename, handlers = sbmlHandler(),
ignoreBlanks = TRUE) : File does not exist
I have tried to read the file using the rsbml library as well, but I still get an error saying:
Error: File unreadable.
I'm following this guide at the moment. Any help regarding the issue is highly appreciated!
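One thing worth checking (a guess, since the thread has no answer): system.file() only resolves files that ship inside the installed package and returns an empty string for anything else, so readSBML() may be receiving "" rather than a real path. The sketch below shows that behaviour with the stats package as a stand-in; it does not confirm what SBMLR actually ships.

```r
# A file that is not part of the package resolves to an empty string, not an error.
missing_path <- system.file("models", "Test.xml", package = "stats")
nzchar(missing_path)

# A file that does ship with the package resolves to a full path.
found_path <- system.file("DESCRIPTION", package = "stats")
file.exists(found_path)

# mustWork = TRUE turns the silent empty string into an immediate error.
strict_result <- tryCatch(
  system.file("models", "Test.xml", package = "stats", mustWork = TRUE),
  error = function(e) "not found"
)
strict_result
```

If Test.xml is your own file, passing its path directly to readSBML() instead of going through system.file() would be the thing to try.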

Cannot read data from an xlsx file in RStudio

I have installed the required packages (gdata and ggplot2), and I have installed Perl.
library(gdata)
library(ggplot2)
# Read the data from the excel spreadsheet
df = data.frame(read.xls ("AssignmentData.xlsx", sheet = "Data", header = TRUE, perl = "C:\\Strawberry\\perl\\bin\\perl.exe"))
However, when I run this I get the following error:
Error in xls2sep(xls, sheet, verbose = verbose, ..., method = method, :
Intermediate file 'C:\Users\CLAIRE~1\AppData\Local\Temp\RtmpE3UYWA\file8983d8e1efc.csv' missing!
In addition: Warning message:
running command '"C:\STRAWB~1\perl\bin\perl.exe" "C:/Users/Claire1992/Documents/R/win-library/3.1/gdata/perl/xls2csv.pl" "AssignmentData.xlsx" "C:\Users\CLAIRE~1\AppData\Local\Temp\RtmpE3UYWA\file8983d8e1efc.csv" "Data"' had status 2
Error in file.exists(tfn) : invalid 'file' argument
Thanks to @Stibu, I realised I had to set my working directory. This is the command to run in RStudio: setwd("C/Documents..."). The path is the folder where the Excel file is located.
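An alternative to changing the working directory is to build the full path and confirm the file is visible before handing it to read.xls. This is a base R sketch; data_dir is a hypothetical stand-in folder and the file created here is only an empty placeholder.

```r
# Hypothetical folder holding the spreadsheet; tempdir() stands in for it here.
data_dir <- tempdir()
xls_path <- file.path(data_dir, "AssignmentData.xlsx")

# Empty placeholder so the check below has something to find.
file.create(xls_path)

# A bare file name is resolved against the current working directory.
getwd()
found_by_name <- file.exists("AssignmentData.xlsx")

# A full path does not depend on the working directory.
found_by_path <- file.exists(xls_path)
found_by_path
```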
I had the same issue, but I solved it differently.
My problem was that the file had been saved with an Excel extension (.xls) but was actually a plain-text file.
After I corrected the file, the R function ran without any further errors.
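A quick way to spot that kind of mismatch is to look at the first bytes of the file. An .xlsx workbook is a zip archive and so begins with the bytes "PK", whereas a text file will not. The helper name has_zip_signature is hypothetical, and the check covers .xlsx only, not the older binary .xls format.

```r
# Hypothetical helper: TRUE when the file starts with the zip signature "PK".
has_zip_signature <- function(path) {
  signature <- readBin(path, what = "raw", n = 2)
  identical(signature, charToRaw("PK"))
}

# A plain-text file wearing an Excel extension fails the check.
fake_xlsx <- tempfile(fileext = ".xlsx")
writeLines("a,b\n1,2", fake_xlsx)
text_result <- has_zip_signature(fake_xlsx)

# A file that starts with the zip signature passes it.
zip_like <- tempfile(fileext = ".xlsx")
writeBin(charToRaw("PK\003\004"), zip_like)
zip_result <- has_zip_signature(zip_like)

c(text = text_result, zip = zip_result)
```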
