I am trying to create a function that uses ddply to summarize data about a particular column that I pass in. I am able to reference the column I want outside of ddply, but I'm not sure how to do it within ddply:
exp_group = c('test','test','control','control')
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
compare_means <- function(df, cols_detail, col_to_eval){
  df_int <- df[, c(cols_detail, col_to_eval)] # this part works fine
  summary <- ddply(df_int
                   , .(exp_group)
                   , summarize
                   , mean = t.test(col_to_eval)$estimate # these ones don't
                   , lo_bound = t.test(col_to_eval)$conf.int[1]
                   , hi_bound = t.test(col_to_eval)$conf.int[2]
  )
  return(summary)
}
test <- compare_means(df, 'exp_group','value')
When I do this, it returns the error object 'col_to_eval' not found. I've also tried it with df_int[, col_to_eval], as well as df_int[, 2] (referencing the column by position), and then it says object 'df_int' not found.
I want to find the means of the test and control groups. How do I reference the column I want inside the t.test() calls?
OK, I went through a few iterations and finally got it to work by doing this:
exp_group = c('test','test','control','control')
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
compare_means <- function(df, cols_detail, col_to_eval){
  df_int <- df[, c(cols_detail, col_to_eval)]
  summary <- ddply(df_int
                   , .(exp_group)
                   , function(x){
                       mean     <- t.test(x[, col_to_eval])$estimate
                       lo_bound <- t.test(x[, col_to_eval])$conf.int[1]
                       hi_bound <- t.test(x[, col_to_eval])$conf.int[2]
                       data.frame(mean, lo_bound, hi_bound)
                     }
  )
  return(summary)
}
test <- compare_means(df, 'exp_group','value')
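For what it's worth, the same thing can be written without the anonymous function using dplyr's .data pronoun, which looks columns up by string. A minimal sketch (the name compare_means_dplyr is mine; assumes a recent dplyr):

library(dplyr)

compare_means_dplyr <- function(df, group_col, col_to_eval) {
  df %>%
    group_by(.data[[group_col]]) %>%
    summarise(
      mean     = t.test(.data[[col_to_eval]])$estimate,
      lo_bound = t.test(.data[[col_to_eval]])$conf.int[1],
      hi_bound = t.test(.data[[col_to_eval]])$conf.int[2]
    )
}

test2 <- compare_means_dplyr(df, 'exp_group', 'value')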
Related
I need to calculate an aggregate using the native R function IQR.
df1 <- SparkR::createDataFrame(iris)
df2 <- SparkR::agg(SparkR::groupBy(df1, "Species"),
                   IQR_Sepal_Length = IQR(df1$Sepal_Length, na.rm = TRUE)
)
which returns:
Error in as.numeric(x): cannot coerce type 'S4' to vector of type 'double'
How can I do it?
This is exactly what gapply, dapply, and gapplyCollect were created for! Essentially, you can run a user-defined function in Spark. It will not run as optimally as native Spark functions, but at least you will get what you want.
I would suggest starting with gapplyCollect, then moving on to gapply.
df1 <- SparkR::createDataFrame(iris)

# gapplyCollect does not require you to specify an output schema,
# but it collects all the distributed workload back to the driver node;
# hence it is not efficient if you expect a huge output
df2 <- SparkR::gapplyCollect(
  df1,
  c("Species"),
  function(key, x){
    df_agg <- data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
  }
)
# This is how you do the same thing using gapply: specify the output schema
df3 <- SparkR::gapply(
  df1,
  c("Species"),
  function(key, x){
    df_agg <- data.frame(
      Species = key[[1]],
      IQR_Sepal_Length = IQR(x$Sepal_Length, na.rm = TRUE)
    )
  },
  schema = "Species STRING, IQR_Sepal_Length DOUBLE"
)
SparkR::head(df3)
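Note that gapplyCollect already returns a local R data.frame, while gapply returns a distributed SparkDataFrame. If the aggregated result is small, you can bring it back to the driver explicitly:

local_df3 <- SparkR::collect(df3)
local_df3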
I have the following situation: I have several dataframes, and for each of them I would like to create two new dataframes according to the value of one of the columns (log2FoldChange > 1 and log2FoldChange < -1).
For this I use the following code:
DJ29_T0_Overexpr = DJ29_T0[which(DJ29_T0$log2FoldChange > 1),]
DJ29_T0_Underexpr = DJ29_T0[which(DJ21_T0$log2FoldChange < -1),]
DJ29_T0 being one of my dataframes.
First problem: the sign for the dataframe where log2FoldChange < -1 is not taken into account.
But the main problem arises when I turn this into a function. I wrote the following:
spliteOverUnder <- function(res){
  nm <- deparse(substitute(res))
  assign(paste(nm, "_Overexpr", sep = ""), res[which(as.numeric(as.character(res$log2FoldChange)) > 1), ])
  assign(paste(nm, "_Underexpr", sep = ""), res[which(as.numeric(as.character(res$log2FoldChange)) < -1), ])
}
Which I then ran with:
spliteOverUnder(DJ29_T0)
No error message, but my objects are not exported to my global environment. I tried return(paste(nm, "_Overexpr", sep = "")), but it only returns the object name, not the associated dataframe.
Using paste() forces the use of assign(), so I can't do :
spliteOverUnder <- function(res){
  nm <- deparse(substitute(res))
  paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > 1), ]
  paste(nm, "_Underexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) < -1), ]
}
spliteOverUnder(DJ24_T0)
I encounter the following error:
Error in paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > :
could not find function "paste<-"
If you've encountered this difficulty before, I'd appreciate a little help. And if you know how, once the function works, to use a for loop going through a list containing all my dataframes and apply this function to each of them, I'd be glad to hear it too.
Thanks
When assigning, use the pos argument to hoist the new objects out of the function.
function(){
  assign(x = ..., value = ...,
         pos = 1 ## see below
  )
}
... where pos gives the position on the search list: pos = 1 is the global environment, while the default (pos = -1) is the function's own local environment.
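For instance, a minimal sketch (the names f and df_new are just for illustration):

f <- function(){
  assign(x = "df_new", value = data.frame(a = 1), pos = 1)
}
f()
exists("df_new") # TRUE: the object landed in the global environment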
edit
A general function to create the split dataframes in your global environment follows. However, you might rather save the new dataframes from within the function, or just forward them to downstream functions, than cram your workspace with intermediary objects.
splitOverUnder <- function(the_name_of_the_frame){
  df <- get(the_name_of_the_frame)
  df$cat <- cut(df$log2FoldChange,
                breaks = c(-Inf, -1, 1, Inf),
                labels = c('underexpr', 'normal', 'overexpr')
  )
  split_data <- split(df, df$cat)
  sapply(c('underexpr', 'overexpr'),
         function(n){
           new_df_name <- paste(the_name_of_the_frame, n, sep = '_')
           assign(x = new_df_name,
                  value = split_data[[n]], ## `[[` rather than `$`, so that n is evaluated
                  envir = .GlobalEnv
           )
         }
  )
}
## say, df1 and df2 are your initial dataframes to split:
sapply(c('df1', 'df2'), function(n) splitOverUnder(n))
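If you prefer an explicit for loop, as you asked: since splitOverUnder() looks the dataframe up by name with get(), loop over the names rather than the objects themselves:

for (n in c('df1', 'df2')) {
  splitOverUnder(n)
}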
When I try to add a column within a function using inputs from the function, a column is added with the wrong name. Here is a sample of the data:
AllGlut1 <- data.frame(Date = c("11/1/2021", "11/2/2021", "11/3/2021"), Row = c(3, 6, 8), d.15N.14N = c(-4.593, -4.427, -4.436))
known <- "d15N_known"
RefMaterials <- data.frame(d15N_known = c(6.485, 2.632, 9.235), d13C_known = c(-21.523, -23.344, -24.892))
colm <- "d.15N.14N"
driftcorr <- function(colm, known, df){
  AllGlut1 <- AllGlut1 %>% mutate(res_drift = RefMaterials[1, known] - AllGlut1[colm])
  return(AllGlut1)
}
results <- driftcorr(colm, known, AllGlut1)
When I just do:
res_drift <- RefMaterials[1,known] - AllGlut1[colm]
in the console, it works perfectly fine.
Anybody know what is happening here?
Use [, colm] instead of [colm] to reference the column of AllGlut1:
driftcorr <- function(colm, known, df){
  AllGlut1 <- AllGlut1 %>%
    mutate(res_drift = RefMaterials[1, known] - AllGlut1[, colm])
  return(AllGlut1)
}
Or, as @Martin Gal says, use RefMaterials[1, known] - !!sym(colm) (I checked, it does work):
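In full, that tidy-eval variant would look something like this (a sketch; driftcorr2 is just an illustrative name, and sym() is re-exported by dplyr from rlang):

driftcorr2 <- function(colm, known, df){
  df %>% mutate(res_drift = RefMaterials[1, known] - !!sym(colm))
}

results2 <- driftcorr2(colm, known, AllGlut1)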
AllGlut1[colm] returns a one-column data frame
AllGlut1[, colm] returns a vector if AllGlut1 is a data frame, or a one-column tibble if AllGlut1 is a tibble
AllGlut1[[colm]] always returns a vector (as does pull(AllGlut1, colm) or AllGlut1[,colm, drop=TRUE])
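A quick illustration of those three behaviors on a toy data frame:

d <- data.frame(a = 1:3)
d["a"]    # one-column data.frame
d[, "a"]  # vector: data frames drop to a vector by default
d[["a"]]  # vector, always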
It looks like you're using a mixture of base-R and tidyverse approaches, which can potentially get confusing ...
I have a Spark DataFrame with an ID column called "userid" that I am manipulating using sparklyr. Each userid can have anywhere from one row of data up to hundreds of rows of data. I am applying a function to each userid group which condenses the number of rows it contains based on certain event criteria. Something like
sdf %>%
group_by(userid) %>%
... %>% # using dplyr::filter and dplyr::mutate
ungroup()
I would like to wrap this function in an error handler such as purrr::possibly so that computation will not be interrupted if an error occurs in a single group.
So far, I have had the most success using the replyr package. Specifically, replyr::gapply "partitions from by values in grouping column, applies a generic transform to each group and then binds the groups back together." There are two methods for partitioning the data: "group_by" and "extract". The authors only recommend using "extract" in the case that the number of groups is 100 or less, but the "group_by" method does not work as I'd expect:
library(sparklyr)
library(dplyr)
library(replyr) # replyr::gapply
library(purrr)  # purrr::possibly

sc <- spark_connect(master = "local")

# Create a test data frame to use gapply on.
test_spark <- tibble(
  userid = c(1, 1, 2, 2, 3, 3),
  occurred_at = seq(1, 6)
) %>%
  sdf_copy_to(sc, ., "test_spark")

# Create a data frame that purrr::possibly should return in case of error.
default_spark <- tibble(userid = -1, max = -1, min = -1) %>%
  sdf_copy_to(sc, ., "default_spark")
#####################################################
# Method 1: gapply with partitionMethod = "group_by".
#####################################################
# Create a function which may throw an error. The group column, userid, is not
# included since gapply( , partitionMethod = "group_by") creates it.
# - A print statement is included to show that when gapply uses "group_by", the
# function is only called once.
fun_for_groups <- function(sdf) {
  temp <- sample(c(1, 2), 1)
  print(temp)
  if (temp == 2) {
    log("a") # errors: log() of a character
  } else {
    sdf %>%
      summarise(max = max(occurred_at),
                min = min(occurred_at))
  }
}
# Wrap the risky function to try and handle the error gracefully.
safe_for_groups <- purrr::possibly(fun_for_groups, otherwise = default_spark)
# Apply the safe function to each userid using gapply and "group_by".
# - The result is either a) only the default_spark data frame.
# b) the result expected if no error occurs in fun_for_groups.
# I would expect the answer to have a mixture of default_spark rows and correct rows.
replyr::gapply(
  test_spark,
  gcolumn = "userid",
  f = safe_for_groups,
  partitionMethod = "group_by"
)
#####################################################
# Method 2: gapply with partitionMethod = "extract".
#####################################################
# Create a function which may throw an error. The group column, userid, is
# included since gapply( , partitionMethod = "extract") doesn't create it.
# - Include a print statement to show that when gapply uses partitionMethod
#   "extract", the function is called once for each userid.
fun_for_extract <- function(df) {
  temp <- sample(c(1, 2), 1)
  print(temp)
  if (temp == 2) {
    log("a") # errors: log() of a character
  } else {
    df %>%
      summarise(max = max(occurred_at),
                min = min(occurred_at),
                userid = min(userid))
  }
}
safe_for_extract <- purrr::possibly(fun_for_extract, otherwise = default_spark)
# Apply that function to each userid using gapply and "extract".
# - The result dataframe has a mixture of "otherwise" rows and correct rows.
replyr::gapply(
  test_spark,
  gcolumn = "userid",
  f = safe_for_extract,
  partitionMethod = "extract"
)
How bad of an idea is it to use gapply when the grouping column has millions of values? Is there an alternative to the error handling strategies presented above?
replyr::gapply() is just a thin wrapper on top of dplyr (and in this case sparklyr).
For the grouped mode, the result can only be correct if no group errors out, since the calculation is issued all at once. This is the most efficient mode, but it can't really achieve any sort of error handling.
For the extract mode, it might be possible to add error handling, but the current code does not have it.
As the replyr author, I would actually suggest looking into sparklyr's spark_apply() method. replyr's gapply was designed when spark_apply() was not yet available in sparklyr (and when binding lists of data was also not yet available in sparklyr).
Also, replyr is mostly in "maintenance mode" (patching issues for clients who used it in larger projects) and probably not a good choice for new projects.
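For illustration, here is a rough sketch of that spark_apply() route with per-group error handling, reusing test_spark from the question. The safe_summary name and the columns specification are my assumptions, to be adapted to your schema:

library(sparklyr)

# tryCatch inside the per-group function: one bad group yields sentinel
# values instead of failing the whole job
safe_summary <- function(df) {
  tryCatch(
    data.frame(max = max(df$occurred_at), min = min(df$occurred_at)),
    error = function(e) data.frame(max = -1, min = -1)
  )
}

result <- spark_apply(
  test_spark,
  safe_summary,
  group_by = "userid",                 # spark_apply adds the grouping column back
  columns = c("userid", "max", "min")  # assumed output column names
)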
I'm having an issue in R running cor.test on a data frame that contains multiple groups.
I am trying to obtain the correlation coefficient for one dependent variable and multiple independent variables contained in a data frame. The data frame has 2 grouping columns for subsetting the data. Here is an example:
DF <- data.frame(group1=rep(1:4,3),group2=rep(1:2,6),x=rnorm(12),v1=rnorm(12),v2=rnorm(12),v3=rnorm(12))
I created the following script, which uses plyr to calculate the correlation coefficient for each of the groups and then loops through each of the variables.
library(plyr)
group_cor <- function(DF, x, y) {
  return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
resultDF <- ddply(DF, .(group1,group2), group_cor,3,4)
for(i in 5:6){
  resultDF2 <- ddply(DF, .(group1, group2), group_cor, 3, i)
  resultDF <- merge(resultDF, resultDF2, by = c("group1", "group2"))
  rm(resultDF2)
}
This works fine. The problem I'm running into is when there aren't enough values in a group to calculate the correlation coefficient. For example: when I change the data frame created above to now include a few key NA values and then try to run the same loop:
DF[c(2,6,10), 5] = NA
for(i in 5:6){
  resultDF2 <- ddply(DF, .(group1, group2), group_cor, 3, i)
  resultDF <- merge(resultDF, resultDF2, by = c("group1", "group2"))
  rm(resultDF2)
}
I get the following error "Error: not enough finite observations"
I understand why I get this error and am not expecting to get a correlation coefficient for these cases. But what I would like to do is pass out a null value and move on to the next group instead of stopping my code at an error.
I've tried using a wrapper with try() but can't seem to pass that variable into my result data frame.
Any help on how to get around this would be much appreciated.
I invariably forget how to use try if I haven't used it in, oh, a day or something. This link helped me remember the basics.
For your function, you could add it in like this:
group_cor = function(DF, x, y) {
  check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
  if(!inherits(check, "try-error"))
    return(data.frame(cor = check)) # reuse the estimate rather than re-running cor.test
}
However, this won't return anything for the group with the error. That's actually OK if you use the all argument when you merge. Here's another way to merge: save everything into a list with lapply and then merge with Reduce.
allcor = lapply(4:6, function(i) ddply(DF, .(group1,group2), group_cor, 3, i))
Reduce(function(...) merge(..., by = c("group1", "group2"), all = TRUE), allcor)
If you want to fill in with NA inside the function rather than waiting to fill in using merge, you could change your function to:
group_cor2 = function(DF, x, y) {
  check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
  if(inherits(check, "try-error"))
    return(data.frame(cor = NA))
  return(data.frame(cor = check))
}
Finally (and outside the scope of the question), depending on what you are doing with your output, you might consider naming the columns uniquely, based on which columns the cor.test compares, so merge doesn't name them all with suffixes. There is likely a better way to do this, maybe with merge and the suffixes argument.
group_cor3 = function(DF, x, y) {
  check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
  if(!inherits(check, "try-error")) {
    dat = data.frame(cor = check)
    names(dat) = paste("cor", x, "vs", y, sep = ".")
    dat
  }
}
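A usage sketch for group_cor3, mirroring the lapply/Reduce pattern above:

allcor3 = lapply(4:6, function(i) ddply(DF, .(group1, group2), group_cor3, 3, i))
Reduce(function(...) merge(..., by = c("group1", "group2"), all = TRUE), allcor3)

Because each result now carries a unique column name, merge no longer needs to add suffixes.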