The gist of this question is that I have some R code which works fine on a local data frame, but fails on a Spark data frame, even though the two tables are otherwise identical.
In R, given a data frame of all character columns, one can dynamically cast to numeric every column that can be safely converted, using the following code:
require(dplyr)
require(varhandle)
require(sparklyr)
checkNumeric <- function(column)
{
column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
}
typeCast <- function(df)
{
columns <- colnames(df)
numericIdx <- df %>% mutate(across(columns, checkNumeric)) %>% .[1,]
doThese <- columns[which(numericIdx==T)]
df <- df %>% mutate_at(all_of(vars(doThese)), as.numeric)
return(df)
}
For a trivial example, one could run:
df <- iris
df$Sepal.Length <- as.character(df$Sepal.Length)
newDF <- df %>% typeCast
class(df$Sepal.Length)
class(newDF$Sepal.Length)
Now, this code will not work on a dataset like starwars, which has composite columns. But for other dataframes, I would expect this code to work just fine on a Spark data frame. It doesn't. That is:
sc <- spark_connect('yarn', config=config) # define your Spark configuration somewhere, that's outside the scope of this question
df <- copy_to(sc, iris, "iris")
newDF <- df %>% typeCast
This fails with the following error:
Error in .[1, ] : incorrect number of dimensions
When debugging, if we try to run this code:
columns <- colnames(df)
df %>% mutate(across(columns, checkNumeric))
This error is returned:
Error in UseMethod("escape") :
no applicable method for 'escape' applied to an object of class "function"
What gives? Why would the code work fine on a local data frame, but not a Spark data frame?
I didn't find an exact solution per se, but I did find a workaround.
typeCheckPartition <- function(df)
{
require(dplyr)
require(varhandle)
checkNumeric <- function(column)
{
column %>% as.data.frame %>% .[,1] %>% varhandle::check.numeric(.) %>% all
}
# this works on non-spark data frames
columns <- colnames(df)
numericIdx <- df %>% mutate(across(all_of(columns), checkNumeric)) %>% .[1,]
return(numericIdx)
}
typeCastSpark <- function(df, max_partitions = 1000, undo_coalesce = T)
{
# numericIdxDf will have these dimensions: num_partition rows x num_columns
# so long as num_columns is not absurd, this coalesce should make collect a safe operation
num_partitions <- sdf_num_partitions(df)
if (num_partitions > max_partitions)
{
undo_coalesce <- T && undo_coalesce
df <- df %>% sdf_coalesce(max_partitions)
} else
{
undo_coalesce <- F
}
columns <- colnames(df)
numericIdxDf <- df %>% spark_apply(typeCheckPartition, packages=T) %>% collect
numericIdx <- numericIdxDf %>% as.data.frame %>% apply(2, all)
doThese <- columns[which(numericIdx==T)]
df <- df %>% mutate_at(vars(all_of(doThese)), as.numeric)
if (undo_coalesce)
df <- df %>% sdf_repartition(num_partitions)
return(df)
}
Run the typeCastSpark function on your Spark data frame and it will cast to numeric every column that can be safely converted.
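For comparison, the per-column check does not need Spark or varhandle when the data is local. The following is only a minimal base R sketch: is_safely_numeric and type_cast_local are hypothetical names, and it assumes a column is "safely numeric" when as.numeric() introduces no new NA values, which may not match varhandle::check.numeric in every edge case.

```r
# Hypothetical helper: TRUE when every non-missing value of x parses as a number.
is_safely_numeric <- function(x) {
  x <- as.character(x)
  converted <- suppressWarnings(as.numeric(x))
  # A value is unsafe if it was present before conversion but is NA afterwards.
  !any(is.na(converted) & !is.na(x))
}

# Hypothetical local counterpart of typeCast: convert only the safe columns.
type_cast_local <- function(df) {
  convertible <- vapply(df, is_safely_numeric, logical(1))
  df[convertible] <- lapply(df[convertible],
                            function(col) as.numeric(as.character(col)))
  df
}

df <- iris
df$Sepal.Length <- as.character(df$Sepal.Length)
new_df <- type_cast_local(df)
class(new_df$Sepal.Length)  # converted back to numeric
class(new_df$Species)       # left untouched, because "setosa" is not a number
```

This runs on an ordinary data frame only; on a Spark data frame the check still has to be shipped to the workers, which is what spark_apply does in the workaround above.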
I'm trying to create a function that gets me the MODE, or MODE-X (the 2nd to Xth most common value), and the associated counts for each column in a data frame.
I can't figure out what I'm missing and I'm looking for some assistance. I believe it has to do with passing a variable into a dplyr function.
library(tidyverse)
myfunct_get_mode = function(x, rank=1){
mytable = dplyr::count(rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = table %>% dplyr::slice(rlang::sym(rank))
return(result)
}
mtcars %>% lapply(. %>% (function(x) myfunct_get_mode(x, rank=2)))
There are some problems with your function:
Your function call is not doing what you think. Check with mtcars %>% lapply(. %>% (function(x) print(x))) that x is actually a whole column of mtcars, not its name. To get the column names, apply the function to names(mtcars). But then you also have to pass in the data frame you're working on.
To evaluate the symbol returned by rlang::sym(x), you need to put !! in front of it.
rank is not a variable name, thus no need for rlang::sym here.
table should be mytable in second to last line of your function.
So how could it work (although there are probably better ways):
myfunct_get_mode = function(df, x, rank=1){
mytable = count(df, !!rlang::sym(x), sort = TRUE)
names(mytable)= c('variable','counts')
# return just the rank specified...such as mode or mode -1, etc
result = mytable %>% slice(rank)
return(result)
}
names(mtcars) %>% lapply(function(x) myfunct_get_mode(mtcars, x, rank=2))
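As an alternative to !!rlang::sym(x), dplyr's .data pronoun can look a column up by its name as a string. A minimal sketch, assuming a reasonably recent dplyr; get_mode is a hypothetical name, not part of the question or the answer above:

```r
library(dplyr)

# Hypothetical variant of myfunct_get_mode that uses the .data pronoun.
get_mode <- function(df, x, rank = 1) {
  # .data[[x]] refers to the column whose name is stored in the string x
  counts <- count(df, .data[[x]], sort = TRUE)
  names(counts) <- c("variable", "counts")
  # keep only the row at the requested rank (1 = mode, 2 = second most common, ...)
  slice(counts, rank)
}

# second most common value of cyl and its count
get_mode(mtcars, "cyl", rank = 2)

# one result per column, returned as a named list
all_modes <- lapply(setNames(names(mtcars), names(mtcars)),
                    function(x) get_mode(mtcars, x, rank = 2))
```

Ties are returned in whatever order count() sorts them, so the rank is only well defined when the counts differ.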
If we need this in a list, we can use purrr::imap
f1 <- function(dat, rank = 1) {
purrr::imap(dat, ~
dat %>%
count(!! rlang::sym(.y)) %>%
rename_all(~ c('variable', 'counts')) %>%
arrange(desc(counts)) %>%
slice(seq_len(rank))) #%>%
#bind_cols - convert to a data.frame
}
f1(mtcars, 2)
Given
base <- data.frame( a = 1)
f <- function() c(2,3,4)
I am looking for a solution that would result in a function f being applied to each row of base data frame and the result would be appended to each row. Neither of the following works:
result <- base %>% rowwise() %>% mutate( c(b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( (b,c,d) = f() )
result <- base %>% rowwise() %>% mutate( b,c,d = f() )
What is the correct syntax for this task?
This appears to be a similar problem (Assign multiple new variables on LHS in a single line in R) but I am specifically interested in solving this with functions from tidyverse.
I think the best you are going to do is a do() to modify the data.frame. Perhaps
base %>% do(cbind(., setNames(as.list(f()), c("b","c","d"))))
It would probably be best if f() returned a list in the first place, with one element per new column.
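With more recent versions of dplyr (1.0.0 or later, as far as I know), mutate() unpacks an unnamed data frame result into separate columns, which gets close to the syntax asked for. A minimal sketch under that assumption; f_df is a hypothetical wrapper that names the values returned by f():

```r
library(dplyr)

base <- data.frame(a = 1)
f <- function() c(2, 3, 4)

# Hypothetical wrapper: return the values of f() as a one-row data frame
# so that each value carries the name of the column it should become.
f_df <- function() {
  vals <- f()
  data.frame(b = vals[1], c = vals[2], d = vals[3])
}

result <- base %>%
  rowwise() %>%
  mutate(f_df()) %>%   # unnamed data frame: its columns are spliced into the result
  ungroup()

result
```

The key point is that the expression inside mutate() is left unnamed; writing mutate(x = f_df()) would instead create a single data frame column called x.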
In case you're willing to do this without dplyr:
# starting data frame
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
# the function you want applied to a given column
add_to <- function(x) { x + 100 }
# run this function on your base data frame, specifying the column you want to apply the function to:
add_computed_col <- function(frame, funct, col_choice) {
frame[paste(floor(runif(1, min=0, max=10000)))] = lapply(frame[col_choice], funct)
return(frame)
}
Usage:
df <- add_computed_col(base_frame, add_to, 'col_a')
head(df)
And add as many columns as needed:
df_b <- add_computed_col(df, add_to, 'col_b')
head(df_b)
The new columns get random numeric names, so rename them afterwards.
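If the random column names are inconvenient, the same idea can take the new name as an argument instead. A minimal base R sketch; add_named_col and new_name are hypothetical names that do not appear in the answer above:

```r
base_frame <- data.frame(col_a = 1:10, col_b = 10:19)
add_to <- function(x) { x + 100 }

# Hypothetical variant of add_computed_col with an explicit name for the new column.
add_named_col <- function(frame, funct, col_choice, new_name) {
  frame[[new_name]] <- funct(frame[[col_choice]])
  frame
}

df <- add_named_col(base_frame, add_to, "col_a", "col_a_plus_100")
head(df)
```

Passing the name explicitly also makes the result reproducible, since nothing depends on runif().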
Using a user-defined function, I have to join the lower and upper bounds of confidence intervals (named CIlow and CIhigh) for a selected number of columns from a data frame. The data frame has CIlow and CIhigh for a number of groups (named a, b and c) and for a number of rows (in this example just two). The data frame looks like this:
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
I would like to have a joined column for each group in a selected number of groups (e.g. a, b) among the existing ones (a, b and c).
Thus, the expected output should be the following:
output<-data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"),
CI_b=c("(2.1,2.3)","(2.2,2.4)"))
To build my own user-defined function I tried the following code:
f<-function(df,gr){
enquo_gr<-enquo(gr)
r<-df%>%
dplyr::mutate(UQ(paste("CI",quo_name(gr),sep="_")):=
sprintf("(%s,%s)",
paste("CIlow",UQ(enquo_gr),sep="_"),
paste("CIhigh",UQ(enquo_gr),sep="_")))%>%
dplyr::select(paste("CI",UQ(enquo_gr),sep="_"))
return(r)
}
However when using the above mentioned function in this way
library(dplyr)
group<-c("a","b")
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
f(df=dataframe,gr=group)
I get the following error message:
Error: expr must quote a symbol, scalar, or call
How could I solve this issue?
PS1: This question is similar to a previous one. However, this question goes one step further because it requires selecting the columns to be merged.
PS2: I would appreciate code suggestions following the approach of this question.
If we are passing quoted strings, use sym to convert them to symbols (for more than one element, use syms, which returns a list):
f <- function(df, gr){
sl <- rlang::syms(paste("CIlow", gr, sep="_"))
sh <- rlang::syms(paste("CIhigh", gr, sep="_"))
nmN <- paste("CI", gr, sep= "_")
df %>%
dplyr::mutate(!!(nmN[1]) := sprintf("(%s,%s)",
!!(sl[[1]]), !!(sh[[1]])),
!!(nmN[2]) := sprintf("(%s,%s)",
!!(sl[[2]]), !!(sh[[2]]))) %>%
dplyr::select(paste("CI", gr, sep="_"))
}
group <- c("a","b")
f(dataframe, group)
# CI_a CI_b
#1 (1.1,1.3) (2.1,2.3)
#2 (1.2,1.4) (2.2,2.4)
I would probably have answered differently based on the question alone, but after examining your answer I prepared the code below. It uses the lapply trick from "dplyr::unite across column patterns". I am not sure that dplyr/tidyr is the best option here; a simple for loop might be simpler.
output <- data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"),
CI_b=c("(2.1,2.3)","(2.2,2.4)"),
stringsAsFactors = F)
dataframe <- data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
tricky <- function(input_data, group_ids){
  # convert columns to character
  # (across() replaces the deprecated mutate_each()/funs())
  input_data <- input_data %>%
    mutate(across(everything(), as.character))
  # unite selected groups
  # (unite()/select() replace the deprecated unite_()/select_())
  output <- group_ids %>%
    lapply(function(group_id) {
      unite(input_data,
            !!paste0("CI_", group_id),
            all_of(paste0(c("CIlow_", "CIhigh_"), group_id)),
            sep = ',') %>%
        select(all_of(paste0("CI_", group_id)))
    }) %>%
    bind_cols() %>%
    mutate(across(everything(), ~ paste0("(", .x, ")")))
  return(output)
}
identical(tricky(dataframe, list("a", "b")), output)
I found a solution to my issue myself. The code below works:
output<-data.frame(CI_a=c("(1.1,1.3)","(1.2,1.4)"), CI_b=c("(2.1,2.3)","(2.2,2.4)"))
dataframe<-data.frame(CIlow_a=c(1.1,1.2),CIlow_b=c(2.1,2.2),CIlow_c=c(3.1,3.2),
CIhigh_a=c(1.3,1.4),CIhigh_b=c(2.3,2.4),CIhigh_c=c(3.3,3.4))
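f <- function(df, gr){
  # assign locally; <<- would leak these objects into the global environment
  sl <- rlang::syms(paste("CIlow", gr, sep="_"))
  sh <- rlang::syms(paste("CIhigh", gr, sep="_"))
  nmN <- paste("CI", gr, sep= "_")
  r <- df
  # seq_along() is safe even if gr is empty, unlike 1:length(gr)
  for(i in seq_along(gr)){
    r <- dplyr::mutate(r, !!(nmN[i]) := sprintf("(%s;%s)", !!(sl[[i]]), !!(sh[[i]])))
  }
  r <- dplyr::select(r, dplyr::all_of(nmN))
  return(r)
}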
group <- c("a","b")
x<-f(df=dataframe, gr=group)
The code works for an undefined number of elements in group. Thus, it works for c("a","b"), for c("a") or c("a","b","c").
I know loops are not recommended. Any better solution is appreciated.
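One loop-free possibility, sketched here in base R rather than with dplyr, is to build each joined column with lapply over the group names. join_ci is a hypothetical name, and the sketch assumes the CIlow_<group> and CIhigh_<group> naming used in this question:

```r
dataframe <- data.frame(CIlow_a = c(1.1, 1.2), CIlow_b = c(2.1, 2.2), CIlow_c = c(3.1, 3.2),
                        CIhigh_a = c(1.3, 1.4), CIhigh_b = c(2.3, 2.4), CIhigh_c = c(3.3, 3.4))

# Hypothetical helper: one "(low,high)" column per requested group.
join_ci <- function(df, gr) {
  cols <- lapply(gr, function(g) {
    # sprintf is vectorized, so this formats every row of the group at once
    sprintf("(%s,%s)", df[[paste0("CIlow_", g)]], df[[paste0("CIhigh_", g)]])
  })
  names(cols) <- paste0("CI_", gr)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

join_ci(dataframe, c("a", "b"))
```

Like the loop version, this works for any number of groups, for example c("a") or c("a", "b", "c").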
My data table df has a subject column (e.g. "SubjectA", "SubjectB", ...). Each subject answers many questions, and the table is in long format, so there are many rows for each subject. The subject column is a factor. I want to create a new column - call it subject.id - that is simply a numeric version of subject. So for all rows with "SubjectA", it would be 1; for all rows with "SubjectB", it would be 2; etc.
I know that an easy way to do this with dplyr would be to call df %>% mutate(subject.id = as.numeric(subject)). But I was trying to do it this way:
subj.list <- unique(as.character(df$subject))
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
And I get this error:
Error: wrong result size (12), expected 72 or 1
Why does this happen? I'm not interested in other ways to solve this particular problem. Rather, I worry that my inability to understand this error reflects a deep misunderstanding of dplyr or mutate. My understanding is that this call should be conceptually equivalent to:
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
But the latter works and the former doesn't. Why?
Reproducible example:
df <- InsectSprays %>% rename(subject = spray)
subj.list <- unique(as.character(df$subject))
# this works
df$subject.id <- NULL
for (i in 1:nrow(df)) {
df$subject.id[i] <- which(as.character(df$subject[i]) == subj.list)
}
# but this doesn't
df %>% mutate(subject.id = which(as.character(subject) == subj.list))
The issue is that operators and functions are applied in a vectorized way by mutate. Thus, which is applied to the vector produced by as.character(df$subject) == subj.list, not to each row (as in your loop).
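The number 12 in the error message can be reproduced outside of mutate. A short base R sketch using the InsectSprays data from the question: the comparison recycles the 6 elements of subj.list along the 72 rows, and which() then returns the positions that happen to match.

```r
subject <- as.character(InsectSprays$spray)
subj.list <- unique(subject)

# 72 values compared with 6 values: subj.list is recycled 12 times
mask <- subject == subj.list
length(mask)   # one logical per row
sum(mask)      # number of rows that happen to line up with the recycled list

# which() returns the positions of those matches, not one id per row,
# so its length differs from the number of rows mutate expects
hits <- which(mask)
length(hits)

# match() is the vectorized lookup: one index into subj.list per row
subject.id <- match(subject, subj.list)
```

match() gives the same ids as the for loop in the question without needing a row-by-row evaluation.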
Using rowwise as described here would solve the issue: https://stackoverflow.com/a/24728107/3772587
So, this will work:
df %>%
rowwise() %>%
mutate(subject.id = which(as.character(subject) == subj.list))
Since your df$subject is a factor, you could simply do:
df %>% mutate(subj.id=as.numeric(subject))
Or use a left join approach:
subj.df <- df$subject %>%
unique() %>%
as_tibble() %>%
rownames_to_column(var = 'subj.id')
df %>% left_join(subj.df,by = c("subject"="value"))