ddply cor.test with error handling - r

I'm having an issue in R where I am running a cor.test on a data frame where there are multiple groups.
I am trying to obtain the correlation coefficient for one dependent variable and multiple independent variables contained in a data frame. The data frame has 2 grouping columns for subsetting the data. Here is an example:
DF <- data.frame(group1=rep(1:4,3),group2=rep(1:2,6),x=rnorm(12),v1=rnorm(12),v2=rnorm(12),v3=rnorm(12))
I created the following script that uses plyr to calculate the correlation coefficient for each of the groups and then loop through for each of the variables.
library(plyr)
group_cor <- function(DF,x,y)
{
return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
resultDF <- ddply(DF, .(group1,group2), group_cor,3,4)
for(i in 5:6){
resultDF2 <- ddply(DF, .(group1,group2), group_cor,3,i)
resultDF <- merge(resultDF,resultDF2,by=c("group1","group2"))
rm(resultDF2)
}
This works fine. The problem I'm running into is when there aren't enough values in a group to calculate the correlation coefficient. For example: when I change the data frame created above to now include a few key NA values and then try to run the same loop:
DF[c(2,6,10),5]=NA
for(i in 5:6){
resultDF2 <- ddply(DF, .(group1,group2), group_cor,3,i)
resultDF <- merge(resultDF,resultDF2,by=c("group1","group2"))
rm(resultDF2)
}
I get the following error "Error: not enough finite observations"
I understand why I get this error and am not expecting to get a correlation coefficient for these cases. But what I would like to do is to pass out a null value and move on the the next group instead of stopping my code at an error.
I've tried using a wrapper with try() but can't seem to pass that variable into my result data frame.
Any help on how to get around this would be much appreciated.

I invariably forget to use try if I haven't use it in, oh, a day or something. This link helped me remember the basics.
For your function, you could add it in like this:
group_cor = function(DF,x,y) {
check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
if(class(check) != "try-error")
return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
However, the won't return anything for the group with the error. That's actually OK if you use the all argument when you merge. Here's another way to merge, saving everything into a list with lapply and then merging with Reduce.
allcor = lapply(4:6, function(i) ddply(DF, .(group1,group2), group_cor, 3, i))
Reduce(function(...) merge(..., by = c("group1", "group2"), all = TRUE), allcor)
If you want to fill in with NA inside the function rather than waiting to fill in using merge, you could change your function to:
group_cor2 = function(DF,x,y) {
check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
if(class(check) == "try-error")
return(data.frame(cor = NA))
return(data.frame(cor = cor.test(DF[,x], DF[,y])$estimate))
}
Finally (and outside the scope of the question), depending on what you are doing with your output, you might consider naming your columns uniquely based on which columns you are doing the cor.test for so merge doesn't name them all with suffixes. There is likely a better way to do this, maybe with merge and the suffixes argument.
group_cor3 = function(DF,x,y) {
check = try(cor.test(DF[,x], DF[,y])$estimate, silent = TRUE)
if(class(check) != "try-error") {
dat = data.frame(cor = cor.test(DF[,x], DF[,y])$estimate)
names(dat) = paste("cor", x, "vs", y, sep = ".")
dat
}
}

Related

R : How to create objects with a function which name and value depend on an argument, and that these objects are found in the global environment?

I have the following situation: I have different dataframes, I would like to be able, for each dataframe, to create 2 dataframes according to the value of one of the columns (log2FoldChange>1 and logFoldChange<-1).
For this I use the following code:
DJ29_T0_Overexpr = DJ29_T0[which(DJ29_T0$log2FoldChange > 1),]
DJ29_T0_Underexpr = DJ29_T0[which(DJ21_T0$log2FoldChange < -1),]
DJ229_T0 being one of my dataframe.
First problem: the sign for the dataframe where log2FoldChange < -1 is not taken into account.
But the main problem is at the time of making the function, I wrote the following:
spliteOverUnder <- function(res){
nm <-deparse(substitute(res))
assign(paste(nm,"_Overexpr", sep=""), res[which(as.numeric(as.character(res$log2FoldChange)) > 1),])
assign(paste(nm,"_Underexpr", sep=""), res[which(as.numeric(as.character(res$log2FoldChange)) < -1),])
}
Which I then ran with :
spliteOverUnder(DJ29_T0)
No error message, but my objects are not exported in my global environment. I tried with return(paste(nm,"_Overexpr", sep="") but it only returns the object name but not the associated dataframe.
Using paste() forces the use of assign(), so I can't do :
spliteOverUnder <- function(res){
nm <-deparse(substitute(res))
paste(nm,"_Overexpr", sep="") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > 1),]
paste(nm,"_Underexpr", sep="") <<- res[which(as.numeric(as.character(res$log2FoldChange)) < -1),]
}
spliteOverUnder(DJ24_T0)
I encounter the following error:
Error in paste(nm, "_Overexpr", sep = "") <<- res[which(as.numeric(as.character(res$log2FoldChange)) > :
could not find function "paste<-"
If you've encountered this difficulty before, I'd appreciate a little help.
And if you knew, once the function works, how to use a For loop going through a list containing all my dataframes to apply this function to each of them, I'm also a taker.
Thanks
When assigning, use the pos argument to hoist the new objects out of the function.
function(){
assign(x = ..., value = ...,
pos = 1 ## see below
)
}
... where 0 = the function's local environment, 1 = the environment next up (in which the function is defined) etc.
edit
A general function to create the split dataframes in your global environment follows. However, you might rather want to save the new dataframes (from within the function) or just forward them to downstream functions than cram your workspace with intermediary objects.
splitOverUnder <- function(the_name_of_the_frame){
df <- get(the_name_of_the_frame)
df$cat <- cut(df$log2FoldChange,
breaks = c(-Inf, -1, 1, Inf),
labels = c('underexpr', 'normal', 'overexpr')
)
split_data <- split(df, df$cat)
sapply(c('underexpr', 'overexpr'),
function(n){
new_df_name <- paste(the_name_of_the_frame, n, sep = '_')
assign(x = new_df_name,
value = split_data$n,
envir = .GlobalEnv
)
}
)
}
## say, df1 and df2 are your initial dataframes to split:
sapply(c('df1', 'df2'), function(n) splitOverUnder(n))

Return names() of list in lapply() UDF for T-Test

I am trying to pass a list of dataframes through a UDF function for a t-test with lapply(). I'd like to print/preserve the argument name (df name) so that I know which test belongs to which df.
list <- list(arousSelf = arousSelf, arousOther = arousOther)
ttest <- function(data) {
d <- t.test(data$pre, data$post, paired = T)
cohend <- cohen.d(data$pre, data$post, paired = T)
print(substitute(data))
print(d)
print(cohend)
}
ttest(arousSelf)
lapply(names(list, ttest)
When I run the function for one df (i.e., arousSelf), print(substitute(data)) prints the actual argument's name. When I run the list through lapply(), numbers are printed instead. I have tried suggestions from different posts but can't seem to figure it out.
Any help would be much appreciated.
I don't see the point in printing out the objects in the inner function. You'll be printing them anyways when you call it through lapply. I would do this...
ttest <- function(data) {
d <- t.test(data$pre, data$post, paired = T)
cohend <- cohen.d(data$pre, data$post, paired = T)
list(d, cohend)
}
lapply(lt, ttest)

Delete data.frame columns and loop through data.frame assignment function

I found the following piece of code here at stackoverflow:
library(svDialogs)
columnFunction <- function (x) {
column.D <- dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res
if (!length((column.D))) {
cat("No column selected\n")
} else {
cat("The following columns are choosen:\n")
print(column.D)
x <- x[,!names(x) %in% column.D]
}
return(x)
}
df <- columnFunction(df)
So i wanted to use it for my own proposes, but it did not work out as planned.
What i try to archive is to use it in a for loop or with lapply to use it with multiple data.frames. Amongst others I tried:
d.frame1 <- iris
d.frame2 <- cars
l.frames <- c("d.frame1","d.frame2")
for (b in l.frames){
columnFunction(b)
}
but it yields the following error message:
Error in dlgList(names(x), multiple = T, title = "Spalten auswaehlen")$res :
$ operator is invalid for atomic vectors
Well, what i need additionally is that I can loop though that function so that i can iterate through different data.frames.
Last but not least I would need something like:
for (xyz in l.frames){
xyz <- columnFunction(xyz)
}
to automate the saving step.
Does anyone have any idea how i could loop though that function or how i could change the function so that it performs all those steps and is loopable.
I`m quite new to R so perhaps Im missing something obvious.
lapply was designed for this task:
l.frames <- list(d.frame1, d.frame2)
l.frames <- lapply(l.frames, columnFunction)
If you insist on using a for loop:
for (i in seq_along(l.frames)) l.frames[[i]] <- columnFunction(l.frames[[i]])

Referencing a column name within function and ddply

I am trying to create a function that uses ddply to summarize data about a particular column that I pass in. I am able to reference the column I want outside of ddply, but I'm not sure how to do it within ddply:
exp_group = c('test','test','control','control')
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
compare_means <- function(df,cols_detail, col_to_eval){
df_int <- df[, c(cols_detail, col_to_eval)] # this part works fine
summary <- ddply(df_int
, .(exp_group)
, summarize
, mean = t.test(col_to_eval)$estimate #these ones don't
, lo_bound = t.test(col_to_eval)$conf.int[1]
, hi_bound = t.test(col_to_eval)$conf.int[2]
)
return(summary)
}
test <- compare_means(df, 'exp_group','value')
When I do this, it returns col_to_eval not found. I've also tried it with df_int[,col_to_eval], as well as df_int[,2] (col reference value) and it says df_int not found.
Where I want to find the means of the test and control groups.
How do I reference the column I want in the t.test functions?
Ok, went through a few iterations and finally got it to work by doing this:
exp_group = c('test','test','control','control')
value = c(1,3,2,3)
df <- data.frame(exp_group, value)
compare_means <- function(df,cols_detail, col_to_eval){
df_int <- df[, c(cols_detail, col_to_eval)]
summary <- ddply(df_int
, .(exp_group)
, function(x){
mean = t.test(x[,col_to_eval])$estimate
lo_bound = t.test(x[,col_to_eval])$conf.int[1]
hi_bound = t.test(x[,col_to_eval])$conf.int[2]
data.frame(mean, lo_bound, hi_bound)
}
)
return(summary)
}
test <- compare_means(df, 'exp_group','value')

How do I convert this for loop into something cooler like by in R

uniq <- unique(file[,12])
pdf("SKAT.pdf")
for(i in 1:length(uniq)) {
dat <- subset(file, file[,12] == uniq[i])
names <- paste("Sample_filtered_on_", uniq[i], sep="")
qq.chisq(-2*log(as.numeric(dat[,10])), df = 2, main = names, pvals = T,
sub=subtitle)
}
dev.off()
file[,12] is an integer so I convert it to a factor when I'm trying to run it with by instead of a for loop as follows:
pdf("SKAT.pdf")
by(file, as.factor(file[,12]), function(x) { qq.chisq(-2*log(as.numeric(x[,10])), df = 2, main = paste("Sample_filtered_on_", file[1,12], sep=""), pvals = T, sub=subtitle) } )
dev.off()
It works fine to sort the data frame by this (now a factor) column. My problem is that for the plot title, I want to label it with the correct index from that column. This is easy to do in the for loop by uniq[i]. How do I do this in a by function?
Hope this makes sense.
A more vectorized (== cooler?) version would pull the common operations out of the loop and let R do the book-keeping about unique factor levels.
dat <- split(-2 * log(as.numeric(file[,10])), file[,12])
names(dat) <- paste0("IoOPanos_filtered_on_pc_", names(dat))
(paste0 is a convenience function for the common use case where normally one would use paste with the argument sep=""). The for loop is entirely appropriate when you're running it for its side effects (plotting pretty pictures) rather than trying to capture values for further computation; it's definitely un-cool to use T instead of TRUE, while seq_along(dat) means that your code won't produce unexpected results when length(dat) == 0.
pdf("SKAT.pdf")
for(i in seq_along(dat)) {
vals <- dat[[i]]
nm <- names(dat)[[i]]
qq.chisq(val, main = nm, df = 2, pvals = TRUE, sub=subtitle)
}
dev.off()
If you did want to capture values, the basic observation is that your function takes 2 arguments that vary. So by or tapply or sapply or ... are not appropriate; each of these assume that just a single argument is varying. Instead, use mapply or the comparable Map
Map(qq.chisq, dat, main=names(dat),
MoreArgs=list(df=2, pvals=TRUE, sub=subtitle))

Resources