Different behavior while using createDataFrame and read.df in SparkR

I am using Spark 1.5.1. When I do this:
df <- createDataFrame(sqlContext, iris)
# creating a new column for category "setosa"
df$Species1 <- ifelse(df[[5]] == "setosa", 1, 0)
head(df)
output: new column created
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
But when I save the iris dataset as a CSV file and try to read it back and convert it to a SparkR DataFrame:
df <- read.df(sqlContext, "/Users/devesh/Github/deveshgit2/bdaml/data/iris/",
              source = "com.databricks.spark.csv", header = "true", inferSchema = "true")
Now when I try to create the new column:
df$Species1 <- ifelse(df[[5]] == "setosa", 1, 0)
I get the below error:
16/02/05 12:11:01 ERROR RBackendHandler: col on 922 failed
Error in select(x, x$"*", alias(col, colName)) :
  error in evaluating the argument 'col' in selecting a method for function 'select': Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  org.apache.spark.sql.AnalysisException: Cannot resolve column name "Sepal.Length" among (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species);
  at org.apache.spark.s

Spark SQL doesn't support column names with embedded dots. When you use createDataFrame, the names are automatically adjusted for you; for other methods you have to provide the schema explicitly:
schema <- structType(
  structField("Sepal_Length", "double"),
  structField("Sepal_Width", "double"),
  structField("Petal_Length", "double"),
  structField("Petal_Width", "double"),
  structField("Species", "string"))
df <- read.df(sqlContext, path, source = "com.databricks.spark.csv",
              header = "true", schema = schema)

Related

Passing vector of names to verify to assertr's verify in R

I am importing a dataset from a third party and would like to be able to validate that all of the columns in the incoming dataset are named as agreed and expected. To do this, I intended to use the verify function from the assertr package in R with has_all_names. I can accomplish this with no problem if I manually enter the column names to be verified, but I can't seem to accomplish it by passing in a vector that contains the names of the columns to be verified. So for example, using the built-in iris dataset, I can verify the existence of all the column names if I manually enter the names as arguments to the has_all_names function, but if I have the names stored in a vector and attempt to use that for verification, it does not work:
#Create a sample list of column names to be verified
#In my real work, I obtain this list from a database
(names(iris)->expected_variable_names)
Which outputs:
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
But then I run the following:
#This works:
iris %>% verify(has_all_names("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species"))
#But this does not:
iris %>% verify(has_all_names(expected_variable_names))
When I attempt to run the line that does not work, this generates:
verification [has_all_names(expected_variable_names)] failed! (1 failure)
verb redux_fn predicate column index value
1 verify NA has_all_names(expected_variable_names) NA 1 NA
Error: assertr stopped execution
Obviously, the failed attempt indicates that not all of the column names were found in the data frame, but since I'm passing in all the variable names that are indeed in the dataset, it should succeed. How can I pass verify a vector, or possibly even a list, of column names to validate? I've tried a number of variations of this last attempt with no success.
Thanks.
We may use invoke from purrr, which splices the vector so that each name is passed to has_all_names as a separate argument:
library(purrr)
library(dplyr)
library(assertr)
iris %>%
  verify(invoke(has_all_names, expected_variable_names))
Output:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
...
Or with exec from rlang:
library(rlang)
iris %>%
  verify(exec(has_all_names, !!!expected_variable_names))
Or with do.call from base R:
iris %>%
  verify(do.call(has_all_names, as.list(expected_variable_names)))
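The same membership check can also be written directly in base R, for comparison (a minimal sketch):
stopifnot(all(expected_variable_names %in% names(iris)))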

Writing an R loop to create new standardized columns

I'm using the Ionosphere dataset in R and am trying to write a loop that will create new columns that are standardized versions of existing columns and name them accordingly.
I've got "cname" as the new column name and "c" as the original. The code is:
install.packages("mlbench")
library(mlbench)
data('Ionosphere')
library(robustHD)
col <- colnames(Ionosphere)
for (c in col[1:length(col)-1]){
  cname <- paste(c,"Std")
  Ionosphere$cname <- standardize(Ionosphere$c)
}
But I get the following error:
"Error in `$<-.data.frame`(`*tmp*`, "cname", value = numeric(0)) :
replacement has 0 rows, data has 351
In addition: Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA"
I feel like there's something super-simple I'm missing but I just can't see it.
Any help gratefully received.
We can use lapply, a custom-made standardization function, setNames, and cbind.
I do not have access to your dataset, so I am using the iris dataset as an example:
df <- iris
cbind(df, setNames(lapply(df[1:4],
                          \(x) (x - mean(x))/sd(x)),
                   paste0(names(df)[1:4], '_Std')))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_Std Sepal.Width_Std Petal.Length_Std Petal.Width_Std
1 5.1 3.5 1.4 0.2 setosa -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 4.9 3.0 1.4 0.2 setosa -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 4.7 3.2 1.3 0.2 setosa -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 4.6 3.1 1.5 0.2 setosa -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 5.0 3.6 1.4 0.2 setosa -1.01843718 1.24503015 -1.33575163 -1.3110521482
...
I feel these transformations get easier with dplyr:
library(dplyr)
iris %>% mutate(across(where(is.numeric),
                       ~ (.x - mean(.x))/sd(.x),
                       .names = "{col}_Std"))

How can I use dplyr's "Select helpers" in paste()?

This works well, but it is troublesome:
> library(dplyr)
> mutate(iris, a = paste( Petal.Width, Petal.Length) ) %>>% head
Sepal.Length Sepal.Width Petal.Length Petal.Width Species a
1 5.1 3.5 1.4 0.2 setosa 0.2 1.4
2 4.9 3.0 1.4 0.2 setosa 0.2 1.4
3 4.7 3.2 1.3 0.2 setosa 0.2 1.3
4 4.6 3.1 1.5 0.2 setosa 0.2 1.5
5 5.0 3.6 1.4 0.2 setosa 0.2 1.4
6 5.4 3.9 1.7 0.4 setosa 0.4 1.7
How can I use dplyr's "Select helpers" in paste()?
> mutate(iris, a = paste( starts_with("Petal") ))
Error in mutate_impl(.data, dots) :
wrong result size (0), expected 150 or 1
> mutate_(iris, a = paste( starts_with("Petal") ))
Error in parse(text = x)[[1]] : subscript out of bounds
> mutate_(iris, a = paste( starts_with(Petal) ))
Error in is.string(match) : object 'Petal' not found
> mutate(iris, a = paste( grep("Petal", names(iris), value=T) ))
Error in mutate_impl(.data, dots) :
wrong result size (2), expected 150 or 1
And this did not work.
> mutate(iris, a = paste( names(iris)[base::startsWith(names(iris),"Petal")] ))
Error in mutate_impl(.data, dots) :
wrong result size (2), expected 150 or 1
I wrote a very cumbersome function, but it works. Maybe I'll use this, or keep searching for a simpler one.
> paste.colprefix <- function(DFNAME, PREFIX){
+ TMP <- eval(parse(text= paste0("grep(\"", PREFIX, "\",names(", DFNAME, "), v=T)")))
+ TMP <- paste0(DFNAME, "$",TMP)
+ TMP <- paste0(TMP, collapse = ",")
+ eval(parse(text= paste0( "paste(", TMP, ")")))
+ }
>
> iris$PetalPaste <- paste.colprefix("iris", "Petal")
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species PetalPaste
1 5.1 3.5 1.4 0.2 setosa 1.4 0.2
2 4.9 3.0 1.4 0.2 setosa 1.4 0.2
3 4.7 3.2 1.3 0.2 setosa 1.3 0.2
4 4.6 3.1 1.5 0.2 setosa 1.5 0.2
5 5.0 3.6 1.4 0.2 setosa 1.4 0.2
6 5.4 3.9 1.7 0.4 setosa 1.7 0.4
>
You cannot use select's helper functions inside the paste function.
The trick to get the expected output is to filter out the relevant column names of the data frame and use them as parameters to your paste function.
To filter out those column names you can use either of the following techniques.
base::startsWith(character vector, prefix string)
cn <- names(iris)[base::startsWith(names(iris),"Petal")]
stringr::str_detect(character vector, regex to find)
cn <- names(iris)[stringr::str_detect(names(iris), "Petal.*")]
Each of these methods returns a vector of the column names that start with "Petal".
You can then use it as follows to get your expected result:
iris$a <- do.call(paste, iris[cn])
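On newer tidyverse versions, tidyr's unite() accepts select helpers directly, which may be a simpler route (a sketch, assuming tidyr is installed):
library(tidyr)
# paste the Petal columns, space-separated, into a new column `a`, keeping the originals
iris %>% unite("a", starts_with("Petal"), sep = " ", remove = FALSE) %>% head()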

Unable to modify an object returned by an R function

I wrote a function in R to return the name of the first column of a data frame:
my_func <- function(table, firstColumnOnly = TRUE){
  if(firstColumnOnly)
    return(colnames(table)[1])
  else
    return(colnames(table))
}
If I call the function like this:
my_func(fertility)<-"foo"
I get the following error:
Error in my_func(fertility, FALSE)[1] <- "foo" :
could not find function "my_func<-"
Why am I getting this error? I can do this without an error:
colnames(fertility)[1]<-"Country"
It seems like you are expecting that this:
my_func(fertility)<-"foo"
will be understood by R as:
colnames(table)[1] <- "foo" # if firstColumnOnly
or
colnames(table) <- "foo" # if !firstColumnOnly
It will not. One reason for this is that colnames() and colnames()<- are two distinct functions. The first one returns the column names, the second one assigns new names. Your function can only return the names, not assign them.
One workaround would be to write your function using colnames()<-:
my_func <- function(table, rep, firstColumnOnly = TRUE){
  if(firstColumnOnly) colnames(table)[1] <- rep
  else colnames(table) <- rep
  return(table)
}
Test
head(my_func(iris,"foo"))
foo Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
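If the goal is really to make the my_func(fertility) <- "foo" syntax work, you can define a replacement function: a function whose name ends in <- and whose last argument is called value. A minimal sketch:
`my_func<-` <- function(table, firstColumnOnly = TRUE, value){
  if(firstColumnOnly) colnames(table)[1] <- value
  else colnames(table) <- value
  table
}
df <- iris
my_func(df) <- "foo"   # R rewrites this as df <- `my_func<-`(df, value = "foo")
names(df)[1]
# [1] "foo"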

accessing variables in data frame in R

I am trying to open all the CSV files in my working directory and read all the tables into a large list of data frames. I found a similar solution on Stack Overflow and it works. The code is:
load_data <- function(path)
{
  files <- dir(path, pattern = '\\.csv', full.names = TRUE)
  tables <- lapply(files, read.csv)
  do.call(rbind, tables)
}
pollutantmean <- load_data("specdata")
However, I am confused by some of the steps. If I delete or omit do.call(rbind, tables), I am not able to access the column variables by calling tables[index]$variable; it returns NULL in the console. Then when I try to print the output by calling tables[index], I do not see any of the column variables' names appearing in the first row of the table. Can someone explain what causes the column variables' names to go missing and a NULL value to be returned?
To see why you are getting NULL, let's create a reproducible example:
df1 <- head(mtcars)
df2 <- head(iris)
my_list <- list(df1, df2)
Test the subsetting with one bracket and two:
my_list[2]$Species
NULL
my_list[[2]]$Species
[1] setosa setosa setosa setosa setosa setosa
Levels: setosa versicolor virginica
Subsetting with two brackets produces the desired output.
Further Explanation
Why doesn't one bracket work?
> my_list[2]
# [[1]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
> my_list[[2]]
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1 5.1 3.5 1.4 0.2 setosa
# 2 4.9 3.0 1.4 0.2 setosa
# 3 4.7 3.2 1.3 0.2 setosa
# 4 4.6 3.1 1.5 0.2 setosa
# 5 5.0 3.6 1.4 0.2 setosa
# 6 5.4 3.9 1.7 0.4 setosa
If someone couldn't tell the difference between the two outputs I wouldn't blame them; they look alike. There's one small but important difference between using one bracket and two: the first returns a list, the second returns a data frame. To check, notice the [[1]] in the first line of the output of my_list[2]. That indicates the output is a list. As a list we cannot analyze it as we would a data frame; we must use the two brackets to get back a data frame.
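A quick way to confirm this is to compare the classes directly (using the same my_list as above):
class(my_list[2])      # "list": single brackets keep the list wrapper
class(my_list[[2]])    # "data.frame": double brackets extract the element
my_list[2]$Species     # NULL, because the list itself has no column named Species
my_list[[2]]$Species   # the factor column from iris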
