Read a csv file in sparkR where columns have spaces - r

Normally, when we read a csv file in R, spaces in the column names are automatically converted to '.'
> df <- read.csv("report.csv")
> str(df)
'data.frame': 598 obs. of 61 variables:
$ LR.Number
$ Vehicle.Number
However, when we read the same csv file in sparkR, the space remains intact and is not handled implicitly by spark
#To read a csv file
df <- read.df(sqlContext, path = "report.csv", source = "com.databricks.spark.csv", inferSchema = "true", header="true")
printSchema(df)
root
|-- LR Number: string (nullable = true)
|-- Vehicle Number: string (nullable = true)
Because of this, any operation on these columns is awkward, and each column needs to be referenced with backticks like this
head(select(df, df$`LR Number`))
How can I handle this explicitly? Can sparkR handle it implicitly?
I am using sparkR version 1.5.0.

As a workaround, you could use the following piece of pseudocode
colnames_df<-colnames(df)
colnames_df<-gsub(" ","_",colnames_df)
colnames(df)<-colnames_df
Another solution is to save the file somewhere and read it using read.df()
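The renaming step itself is plain string handling, so it can be tried on an ordinary character vector before touching the Spark data frame. Below is a minimal sketch, assuming only the two column names shown in the question; `clean_names` is a hypothetical helper, not part of SparkR. It also shows `make.names()`, which base R's `read.csv()` applies by default (`check.names = TRUE`) and which explains the dots in the local R output. How the cleaned names are assigned back to a Spark data frame depends on the SparkR version, so that step is left out.

```r
# Column names as printed by printSchema() in the question.
spark_names <- c("LR Number", "Vehicle Number")

# Hypothetical helper: replace every space with an underscore.
clean_names <- function(x) gsub(" ", "_", x, fixed = TRUE)

underscored <- clean_names(spark_names)

# What base R's read.csv() does by default (check.names = TRUE).
dotted <- make.names(spark_names)

print(underscored)
print(dotted)
```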

The following worked for me
df = collect(df)
colnames_df<-colnames(df)
colnames_df<-gsub(" ","_",colnames_df)
colnames(df)<-colnames_df
df <- createDataFrame(sqlContext, df)
printSchema(df)
Here we need to collect the data locally first, which converts the Spark data frame to a normal R data frame. I am sceptical whether this is a good solution, as I don't want to call collect. However, I investigated and found that even to use the ggplot libraries we need to convert this into a local data frame.

Related

rjson reads 35-column 40-row file as one long row

I am trying to test unicode-heavy imports of various R packages. I'm through everything but JSON because of a persistent error: the file is read in as one long, single-row file. The file is available here.
I think I am following the instructions in the help. I have tried two approaches:
Read the data into an object, then convert to a data frame.
raw_json_data <- read_file("World Class.json")
test_json <- fromJSON(raw_json_data)
as.data.frame(test_json)
Read the file using fromJSON() then convert to a data frame. I happen to be using R's new pipe here, but that doesn't seem to matter.
rjson_json <- fromJSON(file = "World Class.json") |>
as.data.frame()
In every attempt, I get the same result: a data frame of 1 row and 1400 variables. Is there a step I am missing in this conversion?
EDIT: I am not looking for the answer "Use package X instead". The rjson package seems to read in the JSON data, which has a quite simple structure. The problem is that the as.data.frame() call results in one-row, 1400-character data frame, and I'm asking why that is.
Try the jsonlite package instead.
library(jsonlite)
## next line gives warning: JSON string contains (illegal) UTF8 byte-order-mark!
json_data <- fromJSON("World Class.json") # from file
dim(json_data)
[1] 40 35
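As for why the rjson route gives one long row: rjson's fromJSON() returns nested R lists rather than a data frame, and as.data.frame() on a list of records places every field of every record side by side as columns. A minimal sketch of that behaviour, assuming the file parses to a list of records (the `records` list below is made up for illustration and does not come from the file):

```r
# Hypothetical stand-in for what a list-of-records JSON file parses to.
records <- list(
  list(name = "a", score = 1),
  list(name = "b", score = 2)
)

# One row: every field of every record becomes its own column.
wide <- as.data.frame(records)

# One row per record: convert each record, then stack the rows.
long <- do.call(rbind, lapply(records, as.data.frame))

print(dim(wide))
print(dim(long))
```

With 40 records of 35 fields each, the first form would give 1 row and 40 * 35 = 1400 columns, which matches the result described in the question.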

Using unz() to read in SAS data set into R

I am trying to read in a data set from SAS using the unz() function in R. I do not want to unzip the file. I have successfully used the following to read one of them in:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files))
That works great. I'm able to read the data set in and conduct the analysis. When I try to read in another data set, though, I encounter an error:
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files))
Error in read_connection_(con, tempfile()) :
Evaluation error: error reading from the connection.
I have read other questions on here saying that the file path may be incorrectly specified. Some answers mentioned submitting list.files() to the console to see what is listed.
list.files()
[1] "example_data.zip" "data.zip"
As you can see, I can see the folders, and I was successfully able to read the data set in from "example_data.zip", but I cannot access the data.zip folder.
What am I missing? Thanks in advance.
Your "dir2_files" is a character vector holding the names of all the files inside "data.zip", but unz() needs a single file name. So, for example, if the files you want to read are at position "k" in "dir_files" and position "j" in "dir2_files", update your script like this:
dir <- "C:/Users/michael/data/"
setwd(dir)
dir_files <- as.character(unzip("example_data.zip", list = TRUE)$Name)
ds <- read_sas(unz("example_data.zip", dir_files[k]))
dir2_files <- as.character(unzip("data.zip", list = TRUE)$Name)
ds2 <- read_sas(unz("data.zip", dir2_files[j]))
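If the positions k and j are not known in advance, the entries can be selected by name instead. A minimal sketch, assuming the SAS data sets inside the archive end in .sas7bdat and that read_sas() comes from the haven package; the listing below is made up for illustration:

```r
# Hypothetical listing, standing in for unzip("data.zip", list = TRUE)$Name.
dir2_files <- c("readme.txt", "visits.sas7bdat", "labs.sas7bdat")

# Keep only entries that look like SAS data sets (assumed extension).
sas_entries <- grep("\\.sas7bdat$", dir2_files, value = TRUE, ignore.case = TRUE)

# unz() takes one entry name at a time, so index or loop over the matches.
first_entry <- sas_entries[1]

# With haven installed and the archive present, each entry could then be read:
# datasets <- lapply(sas_entries, function(f) haven::read_sas(unz("data.zip", f)))

print(sas_entries)
```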

How to get a vector of the file names contained in a tempfile in R?

I am trying to automatically download a bunch of zipfiles using R. These archives contain a wide variety of files, but I only need to load one as a data.frame to post-process it. It has a unique name, so I could catch it with str_detect(). However, using tempfile(), I cannot get a list of all files within it using list.files().
This is what I've tried so far:
temp <- tempfile()
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp) # this is where I only get "character(0)"
# After, I'd like to use something along the lines of:
data <- read.table(unz(temp, str_detect(files, "^file123.txt"), header = TRUE, sep = ";")
unlink(temp)
I know that the read.table() command probably won't work, but I think I'll be able to figure that out once I get a vector with the list of the files within temp.
I am on a Windows 7 machine and I am using R 3.6.0.
Following what was said before, this structure should let you check that the download succeeded, using a temporary file:
temp <- tempfile("test.zip")
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp)
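Note that list.files() lists the contents of a directory, while the downloaded zip is a single file, so it may still return character(0). One possible alternative is the unzip(..., list = TRUE) call used in the unz() question on this page, which lists the entries of an archive without extracting it. A minimal sketch; `zip_entries` and `match_entry` are hypothetical helper names, and the download itself is left as a comment because the URL in the question is a placeholder:

```r
# Hypothetical helper: names of the entries inside a zip archive.
zip_entries <- function(zip_path) {
  as.character(utils::unzip(zip_path, list = TRUE)$Name)
}

# Hypothetical helper: the single entry whose name matches `pattern`.
match_entry <- function(entries, pattern) {
  hits <- entries[grepl(pattern, entries)]
  if (length(hits) != 1L) {
    stop("expected exactly one match, found ", length(hits))
  }
  hits
}

# Intended use (not run here, since the URL in the question is a placeholder):
# temp <- tempfile(fileext = ".zip")
# download.file("https://url/file.zip", destfile = temp, mode = "wb")
# target <- match_entry(zip_entries(temp), "^file123\\.txt$")
# data <- read.table(unz(temp, target), header = TRUE, sep = ";")
# unlink(temp)
```

Passing a single entry name to unz() matters here: str_detect() returns a logical vector, not a file name.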

reading a subset of files within a folder using R

I'm quite new to R and am looking to build an R script that takes a csv file containing 3 elements:
Id
Type
Filename
The contents of the DataFrame look something like:
14261336 5 Test1.xml
16767594 8 Test2.xml
13601470 7 Test3.xml
12963658 5 Test4.xml
17771952 6 Test5.xml
I've tried to use the following code to get the filenames, and then use these to be able to parse the XML, but I seem to be hitting a bit of a wall (down to my inexperience with R):
headerNames <- c('Id','Type','Filename')
GetNames <- read.csv(file = 'c:/temp/XML/myXMLFiles.csv', header = FALSE, col.names = headerNames)
list(c(GetNames[3])) %>%
map(read_xml)
The outcome is that I get the message:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "list"
Can one of you experts point me in the right direction please?
Many Thanks
You can normally only read data from a single string (a file name), not from a list, which is what the error message is telling you. One way to read the XML is xmlParse() from the XML package:
library(XML) # install it with install.packages("XML") if needed
files_inp <- as.character(GetNames[,3]) # you will need the filenames as character
for (f in files_inp) {
assign(paste0("file",f), xmlParse(file = f)) # I never read XML files, but that should work! :-)
}
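Your output data should be variables named after each file: fileTest1.xml, fileTest2.xml, ... These are not syntactic R names, so access them with backticks or get().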
Hope that helps!
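The original error comes from how the file names were extracted: single brackets on a data frame return another data frame (a list), while double brackets return the column itself. A minimal sketch of the difference, using a made-up stand-in for the CSV contents; `read_each` is a hypothetical helper that keeps the parsed results in one named list instead of separate variables:

```r
# Hypothetical stand-in for the CSV contents shown in the question.
GetNames <- data.frame(
  Id = c(14261336, 16767594),
  Type = c(5, 8),
  Filename = c("Test1.xml", "Test2.xml"),
  stringsAsFactors = FALSE
)

# Single brackets keep the data frame wrapper; double brackets return the column.
wrapped <- GetNames[3]
files <- GetNames[[3]]

# Hypothetical helper: apply `reader` to each file name, results named by file.
read_each <- function(files, reader) {
  stats::setNames(lapply(files, reader), files)
}

# With xml2 installed and the files present, this could be used as:
# docs <- read_each(files, xml2::read_xml)

print(class(wrapped))
print(class(files))
```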

Error while trying to read .data file in R

I am trying to read the car.data file at this location - https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data using read.table as below. I tried various solutions listed earlier, but they did not work. I am using Windows 8, R version 3.2.3. I can save this file as a txt file and then read it, but I am not able to read the .data file with read.table, either directly from the URL or after saving it.
t <- read.table(
"https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data",
fileEncoding="UTF-16",
sep = ",",
header=F
)
Here are the warnings I am getting; the result is an empty data frame with a single cell containing "?":
Warning messages:
1: In read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", : invalid input found on input connection 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
2: In read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", :
incomplete final line found by readTableHeader on 'https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data'
Please help!
Don't use read.table when the data is not stored in a table. Data at that link is clearly presented in comma-separated format. Use the RCurl package instead and read the data as CSV:
library(RCurl)
x <- getURL("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
y <- read.csv(text = x)
Now y contains your data.
Thanks to cory, here is the solution - just use read.csv directly:
x <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data")
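One thing to watch with read.csv(): it defaults to header = TRUE, so if the file has no header row, the first record is used as column names. A minimal sketch of the difference, using a made-up two-line sample in the same comma-separated shape rather than the real download; the column names in `car_cols` are an assumption and are not stored in the file:

```r
# Made-up sample in the same shape as car.data: comma-separated, no header row.
sample_text <- "vhigh,vhigh,2,2,small,low,unacc\nlow,med,4,4,big,high,vgood"

# Assumed column names; they are not part of the file itself.
car_cols <- c("buying", "maint", "doors", "persons", "lug_boot", "safety", "class")

# Default header = TRUE: the first record is consumed as column names.
with_default <- read.csv(text = sample_text)

# header = FALSE keeps every record as data.
cars <- read.csv(text = sample_text, header = FALSE, col.names = car_cols)

print(dim(with_default))
print(dim(cars))
```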
