I'm having difficulty using the unpaired t-test and the aggregate function.
Example
dd<-data.frame(names=c("1st","1st","1st","1st","2nd","2nd","2nd","2nd"),a=c(11,12,13,14,2.1,2.2,2.3,2.4),b=c(3.1,3.2,3.3,3.4,3.1,3.2,3.3,3.4))
dd
# Compare all the values in the "a" column that match with "1st" against the values in the "b" column that match "1st".
# Then, do the same thing with those matching "2nd"
t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(2.1,2.2,2.3,2.4),c(3.1,3.2,3.3,3.4))$p.value
# Also need to replace any errors from t.test that have too low variance with NA
# An example of the type of error I might run into would be if the "b" column was replaced with c(3,3,3,3,3,3,3,3).
For paired data, I found a workaround.
# Create Paired data.
data_paired<-dd[,3]-dd[,2]
# Create new t-test so that it doesn't crash upon the first instance of an error.
my_t.test<-function(x){
A<-try(t.test(x), silent=TRUE)
if (is(A, "try-error")) return(NA) else return(A$p.value)
}
# Use aggregate with new t-test.
aggregate(data_paired, by=list(dd$names),FUN=my_t.test)
This aggregate call works with a single column of input. However, I can't get it to work when several columns have to go into the function.
Example:
my_t.test2<-function(x,y){
A<-try(t.test(x,y,paired=FALSE), silent=TRUE)
if (is(A, "try-error")) return(NA) else return(A$p.value)
}
aggregate(dd[,c(2,3)],by=list(dd$name),function(x,y) my_t.test2(dd[,3],dd[,2]))
I had thought that the aggregate function would only send the rows matching the value in the list to the function my_t.test2 and then move on to the next list element. However, the results indicate that it performs a t-test on all values in the columns, as below, and then places that same value in every row of the results.
t.test(dd[,3],dd[,2])$p.value
What am I missing? Is this an issue with the original my_t.test2, an issue with how to structure the aggregate call, or something else? The way I applied it doesn't seem to aggregate.
These are the results I want.
t.test(c(11,12,13,14),c(3.1,3.2,3.3,3.4))$p.value
t.test(c(2.1,2.2,2.3,2.4),c(3.1,3.2,3.3,3.4))$p.value
Note that this is a toy example; the actual data set will have well over 100,000 entries that need to be grouped by the value in the names column, which is why I need the aggregate function.
Thanks for the help.
aggregate isn't the right function to use here because the summary function only works on one column at a time. It's not possible to get both the a and b values simultaneously with this method.
Another way you could approach the problem is to split the data, then apply the t-test to each subset. Here's one implementation:
sapply(
split(dd[-1], dd$names),
function(x) t.test(x[["a"]], x[["b"]])$p.value
)
Here I split dd into a list of subsets, one for each value of names. I use dd[-1] to drop the "names" column from the subsets so I just have a data.frame with two columns, one for a and one for b.
Then, for each subset in the list, I perform a t.test using the a and b columns and extract the p-value. The sapply wrapper calculates this p-value for each subset and returns a named vector of p-values, where the names of the entries correspond to the levels of dd$names:
1st 2nd
6.727462e-04 3.436403e-05
If you wanted to do a paired t-test this way, you could do
sapply(
split(dd[-1], dd$names),
function(x) t.test(x[["a"]] - x[["b"]])$p.value
)
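The question also asked for NA whenever t.test fails on low-variance data. A minimal sketch of how the split/sapply approach could be combined with error handling follows; the helper name safe_p and the modified toy data (group "2nd" made constant so that t.test errors) are assumptions for illustration, not part of the original answer.

```r
# Toy data: group "2nd" is constant in both columns, so t.test() errors there
dd <- data.frame(
  names = rep(c("1st", "2nd"), each = 4),
  a = c(11, 12, 13, 14, 3, 3, 3, 3),
  b = c(3.1, 3.2, 3.3, 3.4, 3, 3, 3, 3)
)

# Hypothetical helper: return the p-value, or NA if t.test() raises an error
safe_p <- function(x, y) {
  tryCatch(t.test(x, y)$p.value, error = function(e) NA_real_)
}

# One unpaired t-test per group; failing groups yield NA instead of stopping
p_values <- sapply(
  split(dd[-1], dd$names),
  function(g) safe_p(g[["a"]], g[["b"]])
)
p_values
```

The result is still a named vector with one entry per level of dd$names, so it can be used the same way as the unprotected version.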
As @MrFlick said, aggregate is not the right function for this. Here are some alternatives to the sapply approach, using the dplyr or data.table packages.
require(dplyr)
summarize(group_by(dd, names), t.test(a,b)$p.value)
require(data.table)
data.table(dd)[, t.test(a,b)$p.value, by=names]
Related
I'm trying to get the correlation coefficient for corresponding columns of two csv files. I simply use the following but get errors. Consider that each csv file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get the error 'x' must be numeric!
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be some categorical variable in X. So you can first separate that categorical variable from X and then pass X to the cor() function.
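The two suggestions above can be combined in one sketch: drop the non-numeric columns first, then correlate matching columns with mapply. The small data frames below are made-up stand-ins for the two csv files, and passing method through MoreArgs is one way (an assumption about what the asker wants) to keep the Spearman method from the question.

```r
# Made-up stand-ins for the two csv files, each with one categorical column
first_values <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 1, 4, 3, 5),
  label = c("a", "b", "c", "d", "e")
)
second_values <- data.frame(
  x1 = c(2, 4, 6, 8, 10),
  x2 = c(5, 4, 3, 2, 1),
  label = c("a", "b", "c", "d", "e")
)

# Keep only the columns that are numeric in both data frames
numeric_cols <- sapply(first_values, is.numeric) &
  sapply(second_values, is.numeric)

# Correlate column i of the first file with column i of the second
column_cor <- mapply(
  cor,
  first_values[numeric_cols],
  second_values[numeric_cols],
  MoreArgs = list(method = "spearman")
)
column_cor
```

This returns one coefficient per column pair, named after the columns of the first data frame.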
Dear friends, I would appreciate it if someone could help me with a question in R.
I have a data frame with 8 variables, let's say (v1,v2,...,v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I am able to produce 2^8-1=255 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
My goal is to produce a specific statistic based on these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations.
Thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all combinations of V1-V8 taken 3 at a time.
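To cover every subset size rather than only 3, the same combn call can be repeated for m = 1 to 8 and the results collected in one list. This is a minimal sketch; the name all_subsets is just an illustrative choice.

```r
varnames <- paste0("V", 1:8)

# For each subset size m, combn(..., simplify = FALSE) returns a list of
# character vectors; unlist(recursive = FALSE) flattens these into one list
all_subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(all_subsets)   # 2^8 - 1 non-empty subsets
all_subsets[[9]]      # first subset of size 2
```

Each element can then be used to pick columns, for example yourdataframe[, all_subsets[[9]], drop = FALSE], before computing the statistic of interest.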
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
library(data.table)
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
I have two columns of paired values in a data frame. I want to bin the data in one column using the cut2 function from the Hmisc package so that there are at least, say, 25 data points in each bin. However, I also need the corresponding values from the other column. Is there a convenient way to do that in R? I have to bin column B.
A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348
.........
It's not clear what you mean by wanting the "corresponding values of the other column". The first part is easy to accomplish using the g (# of groups) argument:
dfrm$Agrp <- cut2(dfrm$A, g=trunc(length(dfrm$A)/25) )
You can aggregate means or medians of B within Agrp's using tapply or ave or one of the Hmisc summary functions. There are several worked examples in one of today's questions: How to get Summary statistics by group as well as many other examples of using those functions or aggregate or the pkg:plyr functions.
Given that the number of B values will not necessarily be constant across groups the only way I can think to deliver the individual values by A-grouped-value would be with split. I added an extra row to illustrate that a non-even split might need to return a list rather than a more "rectangular" object :
library(Hmisc)  # provides cut2
dat <- read.table(text="A B
-10.834510 1.680173
11.012966 1.866603
-16.491415 1.868667
-14.485036 1.900002
2.629104 1.960929
-3.597291 2.005348\n 3.5943 3.796", header=TRUE)
dat$Agrp <- cut2(dat$A, g=trunc(length(dat$A)/3) )
split(dat$B, dat$Agrp)
#-----
$`[-16.49, 2.63)`
[1] 1.680173 1.868667 1.900002 2.005348
$`[ 2.63,11.01]`
[1] 1.866603 1.960929 3.796000
If you want the vector of values on which the splits were done then that can be accomplished by using regex on levels(dat$Agrp).
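As a rough sketch of that regex step, the numeric bounds can be pulled out of the level labels with base R. The labels below are copied from the output above rather than computed, so the block runs without Hmisc; with real data you would use levels(dat$Agrp) instead.

```r
# Level labels as printed above; in practice: lev <- levels(dat$Agrp)
lev <- c("[-16.49, 2.63)", "[ 2.63,11.01]")

# Extract every (possibly negative) decimal number from each label
bounds <- lapply(regmatches(lev, gregexpr("-?[0-9.]+", lev)), as.numeric)

# One row per bin, with lower and upper bound
bounds_matrix <- do.call(rbind, bounds)
colnames(bounds_matrix) <- c("lower", "upper")
bounds_matrix
```

The bounds are only as precise as the printed labels, so they reflect cut2's rounding rather than the exact cut points.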
I want to split a large dataframe into a list of dataframes according to the values in two columns. I then want to apply a common data transformation on all dataframes (lag transformation) in the resulting list. I'm aware of the split command but can only get it to work on one column of data at a time.
You need to put all the factors you want to split by in a list, eg:
split(mtcars,list(mtcars$cyl,mtcars$gear))
Then you can use lapply on this to do what else you want to do.
If you want to avoid having zero row dataframes in the results, there is a drop parameter whose default is the opposite of the drop parameter in the "[" function.
split(mtcars,list(mtcars$cyl,mtcars$gear), drop=TRUE)
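A minimal sketch of the full split-then-lapply workflow from the question, using mtcars as stand-in data. The helper lag1 and the new column name mpg_lag are hypothetical; a hand-written shift is used because base lag() is meant for time series objects and does not shift the values of a plain vector.

```r
# Split by two columns, dropping empty combinations
pieces <- split(mtcars, list(mtcars$cyl, mtcars$gear), drop = TRUE)

# Hypothetical helper: shift a vector down by one, padding with NA
lag1 <- function(x) c(NA, head(x, -1))

# Apply the same transformation to every data frame in the list
lagged <- lapply(pieces, function(d) {
  d$mpg_lag <- lag1(d$mpg)
  d
})

names(lagged)
lagged[["8.5"]][, c("mpg", "mpg_lag")]
```

If a single data frame is needed afterwards, do.call(rbind, lagged) stacks the pieces back together.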
How about this one:
library(plyr)
ddply(df, .(category1, category2), summarize, value1 = lag(value1), value2=lag(value2))
This seems like an excellent job for the plyr package and its ddply() function. If there are still open questions, please provide some sample data. Splitting should work on several columns as well:
df<- data.frame(value=rnorm(100), class1=factor(rep(c('a','b'), each=50)), class2=factor(rep(c('1','2'), 50)))
# split() needs a list of factors; c() would concatenate them into one vector
g <- list(df$class1, df$class2)
split(df$value, g)
You can also do the following:
split(x = df, f = ~ var1 + var2...)
This way, you can split the data frame by many variables without passing a list as the f parameter. Note that the formula form of f requires R 4.1.0 or later.
I've got a data frame with far too many rows to be able to do a spatial correlogram. Instead, I want to grab 40 rows for each species and run my correlogram on that subset.
I wrote a function to subset a data frame as follows:
samp <- function(dataf)
{
dataf[sample(1:dim(dataf)[1], size=40, replace=FALSE),]
}
Now I want to apply this function to each species in a larger data frame.
When I try something like
culled_data = ddply (larger_data, .(species), subset, samp)
I get this error:
Error in subset.data.frame(piece, ...) :
'subset' must evaluate to logical
Anyone got ideas on how to do this?
It looks like it should work once you remove ", subset" from your call.
Dirk's answer is of course correct, but I'm posting my own to add some explanation.
Why doesn't your call work?
First of all, your syntax is a shorthand. It's equivalent to
ddply(larger_data, .(species), function(dfrm) subset(dfrm, samp))
so you can clearly see that you are passing a function (see class(samp)) as the second argument of subset. You could use samp(dfrm) instead, but that won't work either, because samp returns a data.frame and subset needs a logical vector. So samp(dfrm) would only work if it returned a logical index.
How do you use subset in this case?
Make subset work by feeding it a logical vector:
ddply (larger_data, .(species), subset, sample(seq_along(species)<=40))
This creates a logical vector with 40 TRUE values and shuffles it. By the way, it also works when a species has fewer than 40 cases: it then returns all of them.
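For comparison, here is a base R sketch of the same per-species sampling without plyr. The data frame larger_data is made up for illustration, and this version of samp caps the sample size at the group size (an added assumption, since the original samp would fail for species with fewer than 40 rows).

```r
set.seed(1)  # only to make the example reproducible

# Made-up data: species "a" has 100 rows, species "b" only 10
larger_data <- data.frame(
  species = rep(c("a", "b"), times = c(100, 10)),
  value = seq_len(110)
)

# Sample up to n rows without replacement from one data frame
samp <- function(dataf, n = 40) {
  dataf[sample(nrow(dataf), size = min(n, nrow(dataf))), , drop = FALSE]
}

# Split by species, sample each piece, and stack the results
culled_data <- do.call(
  rbind,
  lapply(split(larger_data, larger_data$species), samp)
)
table(culled_data$species)
```

With plyr loaded, ddply(larger_data, .(species), samp) should give an equivalent result in one call.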