I am trying to run a t.test() on multiple columns of data within 'j' in a data.table. I've found a way that works, but isn't very elegant, and I feel like there's probably a more concise way using .SDcols, but haven't had any luck looking through here, or the data.table vignette. If this has been asked previously I apologize and please point me in the right direction.
My data.table has essentially the following format
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
y = c(rnorm(6, mean = 100, sd = 30)),
z = c(rnorm(6, mean = 10, sd = 3)),
group = rep(c('One', 'Two'), 3))
When I want to run a t.test comparing the values of y in group One and group Two, it's very straightforward:
DT[,t.test(y~group)]
If I want to get an output for both y and z the following works, but is clunky and inelegant. And with my actual data, I'm trying to do this over many columns so it would be more time consuming to type out each iteration I would like to run.
DT[,.(t.test(y~group), t.test(z~group))]
In the data.table vignette, using a function over a specific subset of columns is achieved by
DT[,lapply(.SD, mean), .SDcols = c('y', 'z')]
However replacing mean with t.test yields a one sample t.test, while I'm trying to get a two sample t.test. I've tried:
DT[,lapply(.SD, t.test, formula = .SDcols ~ group, data = DT), .SDcols = c('y', 'z')]
But this gives me a comparison between y and z, not both the comparisons of y~group and z~group.
I've tried several versions of lapply with a custom function to get the output I want, but I won't make anyone read through my walls of unsuccessful code. Needless to say I have been unable to get that to work.
Question:
Is there a way via lapply() or function() or a way currently unknown to me, to get t.test to run over multiple columns of data within 'j' in a data.table?
Thanks in advance for your help,
Chris
To pull together the parts of the answer and to rearrange to put the name in the first column (if desired for nicer printing):
library(data.table)
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
y = c(rnorm(6, mean = 100, sd = 30)),
z = c(rnorm(6, mean = 10, sd = 3)),
group = rep(c('One', 'Two'), 3))
result <-
DT[,lapply(.SD, function(x) t.test(x ~ group)), .SDcols = y:z][
,ttname:=names(t.test(1:5))][ # add names
,.(ttname,y,z)] # put names in first column
result
# ttname y z
# 1: statistic 0.1391646 0.1295093
# 2: parameter 3.468876 3.559917
# 3: p.value 0.8970165 0.9039359
# 4: conf.int -99.61786,109.47358 -8.209637, 8.972439
# 5: estimate 110.7286,105.8008 11.15414,10.77274
# 6: null.value 0 0
# 7: stderr 35.41031 2.94497
# 8: alternative two.sided two.sided
# 9: method Welch Two Sample t-test Welch Two Sample t-test
# 10: data.name x by group x by group
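If only one number per test is needed, the same lapply(.SD, ...) pattern can return just that element. This is a minimal sketch, assuming you only want the p-values; the seed and the cols vector are my own additions for illustration, not part of the original question.

```r
library(data.table)

set.seed(42)  # assumed seed, only so the sketch is reproducible
DT <- data.table(y = rnorm(6, mean = 100, sd = 30),
                 z = rnorm(6, mean = 10, sd = 3),
                 group = rep(c("One", "Two"), 3))

cols <- c("y", "z")  # hypothetical name: the columns to test

# One two-sample (Welch) t-test per column, keeping only the p-value
pvals <- DT[, lapply(.SD, function(x) t.test(x ~ group)$p.value),
            .SDcols = cols]
pvals
```

The result is a one-row data.table with one column per tested variable, which may be easier to scan than the full htest output when there are many columns.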
Here is my solution, wrapped as a function. In the accepted answer, I didn't like that the test outputs were rows and the variables columns; I prefer it the other way around, which I think is easier to read.
I also added an argument for rounding, and one that by default returns only the most important information: the p-value and the test statistic. The function requires purrr. The grouping variable and the variables to test are passed as character strings, e.g. dt_ttest(dtx, 'varname', c('z','y'))
dt_ttest <- function(dtx, grp, thecols, decimals=3, small=TRUE, ...) {
x1 <- dtx[, map(.SD, ~ t.test(.x ~ get(grp))), .SDcols = thecols]
x2 <- t(x1) %>% data.table()
setnames(x2, names(t.test(1:2)))
x2 <- x2[, var := thecols] [, !'data.name']
tcols <- c('p.value', 'statistic', 'stderr', 'null.value', 'parameter', 'method', 'alternative')
x2[, (tcols) := map(.SD, unlist), .SDcols=tcols ]
x2
thecols2 <- keep(x2, is.numeric) %>% names()
x2[, (thecols2) := map(.SD, ~ round(.x, decimals)), .SDcols=thecols2 ]
# go one level deeper to round the two list cols
thecols3 <- c('conf.int', 'estimate')
x2[, (thecols3) := modify_depth(.SD, 2, ~ round(.x, decimals)), .SDcols=thecols3 ]
# set order
setcolorder(x2, c('var', 'p.value', 'statistic', 'stderr', 'conf.int', 'estimate', 'parameter', 'method', 'alternative') )
if( small) x2[, .(var, p.value, statistic)] else x2[]
}
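For comparison, the same "one row per variable" layout can be built without purrr. This is only a sketch in base R, not the function above; ttest_rows, dat and toy are hypothetical names, and it keeps just the statistic, degrees of freedom and p-value.

```r
# Hypothetical helper: one row of t-test results per tested column
ttest_rows <- function(dat, grp, cols) {
  rows <- lapply(cols, function(v) {
    tt <- t.test(dat[[v]] ~ dat[[grp]])   # two-sample Welch t-test
    data.frame(var       = v,
               statistic = unname(tt$statistic),
               df        = unname(tt$parameter),
               p.value   = tt$p.value)
  })
  do.call(rbind, rows)                    # stack the one-row data frames
}

set.seed(42)  # assumed seed, for reproducibility only
toy <- data.frame(y = rnorm(6, mean = 100, sd = 30),
                  z = rnorm(6, mean = 10, sd = 3),
                  group = rep(c("One", "Two"), 3))

res <- ttest_rows(toy, "group", c("y", "z"))
res
```

Rounding could be applied afterwards to the numeric columns of res if desired.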
Related
I've been using some code to compute means for specific variable values (demographic breaks), but I now have data with a weight variable and need to calculate weighted means. I was wondering whether it's possible to adjust the function I already use for sample means so that it calculates the weighted mean instead. Here is some code to generate sample data:
df <- data.frame(gender=c(2,2,1,1,2,2,1,1,1,1,1,1,2,2,2,2,1,2,2,1),
agegroup=c(2,2,7,5,5,5,2,7,2,2,4,4,4,3,4,5,3,3,6,6),
attitude_1=c(4,3,4,4,4,4,4,4,5,2,5,5,5,4,3,2,3,4,2,4),
attitude_2=c(4,4,1,3,4,2,4,5,5,5,5,4,5,4,3,3,4,4,4,4),
attitude_3=c(2,2,1,1,3,2,5,1,4,2,2,2,3,3,4,1,4,1,3,1),
income=c(40794,74579,62809,47280,72056,57908,70784,96742,66629,117530,79547,54110,39569,111217,109146,56421,106206,28385,85830,71110),
weight=c(1.77,1.89,2.29,6.14,2.07,5.03,0.73,1.60,1.95,2.56,5.41,2.02,6.87,3.23,3.01,4.68,3.42,2.75,2.31,4.04))
So far I've been using this code to get sample means
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) mean(x, na.rm = TRUE))))
> Gender_Profile_1
sapply.subset.df..gender....1...FUN...function.x..mean.x..na.rm...TRUE..
gender 1.000
agegroup 4.200
attitude_1 4.000
attitude_2 4.000
attitude_3 2.300
income 77274.700
weight 3.016
As you can see it generates Gender_Profile_1 with the means for all variables.
In my attempt to calculate the weighted mean, I've tried to change the "FUN=" part to this
assign("Gender_Profile_1",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
I get the following error message
Error in weighted.mean.default(x, w = weight, na.rm = TRUE) :
'x' and 'w' must have the same length
I've been trying all kinds of permutations of df$weight and df$x, but nothing seems to work.
Any help or ideas would be great. Many thanks
Base R
If you want to stick to base R, you can do the following:
# define func to return all weighted means
all_wmeans <- function(data_subset) {
# which cols to summarise? all but gender and weight
summ_cols <- setdiff(names(data_subset), c('gender', 'weight'))
# for each col, calc weighted mean with weights from the 'weight' column
result <- lapply(data_subset[, summ_cols],
weighted.mean, w=data_subset$weight)
# squeeze the resulting list back to a data.frame and return
return(data.frame(result))
}
# now, split the df on gender, and apply the func to each chunk
lapply(split(df, df$gender), all_wmeans)
The result is a list of two data frames, for each value of gender:
$`1`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.397546 4.027851 3.950597 1.962202 74985.25
$`2`
agegroup attitude_1 attitude_2 attitude_3 income
1 4.092234 3.642666 3.676287 2.388872 64075.23
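If a single data frame is more convenient than a list, the per-group pieces can be stacked with do.call(rbind, ...). A minimal sketch follows; it uses a small made-up data frame (toy) rather than the df from the question, so the numbers are easy to check by hand.

```r
# Toy data (hypothetical), same layout as the question: group, values, weight
toy <- data.frame(gender = c(1, 1, 2, 2, 2),
                  score  = c(4, 2, 5, 3, 1),
                  income = c(10, 20, 30, 40, 50),
                  weight = c(1, 3, 2, 2, 1))

# Weighted means per gender, one single-row data frame per group
per_group <- lapply(split(toy, toy$gender), function(d) {
  cols <- setdiff(names(d), c("gender", "weight"))
  data.frame(lapply(d[cols], weighted.mean, w = d$weight))
})

# Stack the pieces and bring the group label back as a column
combined <- do.call(rbind, per_group)
combined <- cbind(gender = names(per_group), combined)
combined
```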
The fabulous data.table
If you don't mind using packages, dplyr and data.table are great packages that make this kind of stuff much simpler. Here's data.table:
# load library and create a data.table object
library(data.table)
my_dt <- data.table(df)
# now it's a one liner:
my_dt[, lapply(.SD, weighted.mean, w=.SD$weight), by=gender]
which returns:
gender agegroup attitude_1 attitude_2 attitude_3 income weight
1: 2 4.092234 3.642666 3.676287 2.388872 64075.23 4.099426
2: 1 4.397546 4.027851 3.950597 1.962202 74985.25 3.904483
The data.table code also groups the rows by gender, and uses lapply to apply a function and extra argument to each Subset of Data (that's what the .SD call is). Conceptually, it's the exact same as the base R code, just compact and fast.
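Note that the output above also contains a weighted mean of weight itself. If that column is not wanted, .SDcols can restrict which columns go into .SD. Here is a small sketch on made-up data; toy and value_cols are hypothetical names.

```r
library(data.table)

# Toy data (hypothetical), same layout as the question
toy <- data.table(gender = c(1, 1, 2, 2, 2),
                  score  = c(4, 2, 5, 3, 1),
                  income = c(10, 20, 30, 40, 50),
                  weight = c(1, 3, 2, 2, 1))

# Everything except the grouping and weight columns
value_cols <- setdiff(names(toy), c("gender", "weight"))

# weight is still visible inside j even though it is not part of .SD
res <- toy[, lapply(.SD, weighted.mean, w = weight),
           by = gender, .SDcols = value_cols]
res
```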
You can do the whole lot at once like this:
sapply(1:2, function(y)
sapply(subset(df, df$gender == y), function(x)
weighted.mean(x, df$weight[df$gender == y])))
#> [,1] [,2]
#> gender 1.000000 2.000000
#> agegroup 4.397546 4.092234
#> attitude_1 4.027851 3.642666
#> attitude_2 3.950597 3.676287
#> attitude_3 1.962202 2.388872
#> income 74985.247679 64075.232966
#> weight 3.904483 4.099426
I think the main problem with your code is that you refer to the weight column inside the sapply loop, but that column has not been subsetted the way df has. You can subset the weights before the sapply call and then use the subsetted weights inside the loop.
Using the code you posted:
weight <- subset(df, gender==1)[,"weight"]
#Exactly the same code you posted
assign("Gender_Profile_2",
data.frame(sapply(subset(df, gender==1), FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))))
Here is another solution using apply, that might be easier to implement:
#Apply the desired function by columns
apply(subset(df, gender==1), 2, FUN = function(x) mean(x, na.rm = TRUE))
#Get the weights of the rows that have gender == 1
weight <- subset(df, gender==1)[,7]
#Apply the weighted mean function
apply(subset(df[,-7], gender==1), 2, FUN = function(x) weighted.mean(x, w=weight,na.rm = TRUE))
I have the following dataframe:
df = data.frame(id=c("A","A","A","A","B","B","B","B","C","C","C","C","D","D","D","D"),
sub=rep(c(1:4),4),
acc1=runif(16,0,3),
acc2=runif(16,0,3),
acc3=runif(16,0,3),
acc4=runif(16,0,3))
What I want is to obtain the mean rows for each ID, which is to say I want to obtain the mean acc1, acc2, acc3 and acc4 for each level A, B, C and D by averaging the values for each sub (4 levels for each id), which would give something like this in the end (with the NAs replaced by the means I want of course):
dfavg = data.frame(id=c("A","B","C","D"),meanacc1=NA,meanacc2=NA,meanacc3=NA,meanacc4=NA)
Thanks in advance!
Try:
You can use one of the specialized packages, dplyr or data.table, or base R. Because you have many columns starting with acc to average, I chose dplyr. The idea is to first group by id and then use summarise_each to get the mean, by id, of each column that starts_with acc.
library(dplyr)
df1 <- df %>%
group_by(id) %>%
summarise_each(funs(mean=mean(., na.rm=TRUE)), starts_with("acc")) %>%
rename(meanacc1=acc1, meanacc2=acc2, meanacc3=acc3, meanacc4=acc4) #this works but it requires more typing.
I would rename using paste
# colnames(df1)[-1] <- paste0("mean", colnames(df1)[-1])
gives the result
# id meanacc1 meanacc2 meanacc3 meanacc4
#1 A 1.7061929 2.401601 2.057538 1.643627
#2 B 1.7172095 1.405389 2.132378 1.769410
#3 C 1.4424233 1.737187 1.998414 1.137112
#4 D 0.5468509 1.281781 1.790294 1.429353
Or using data.table
library(data.table)
nm1 <- paste0("acc", 1:4) #names of columns to do the `means`
dt1 <- setDT(df)[, lapply(.SD, mean, na.rm=TRUE), by=id, .SDcols=nm1]
Here, .SD means Subset of Data.table, and .SDcols are the columns to which we apply the mean operation.
setnames(dt1, 2:5, paste0("mean", nm1)) #change the names of the concerned columns in the result
dt1
(This must have been asked at least 20 times.) The aggregate function applies the same function (given as the third argument) to all the columns of its first argument within groups defined by its second argument:
aggregate(df[-(1:2)], df[1],mean)
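If you want to prepend "mean" to the names of the averaged columns (assuming the result above was saved as df2):
df2 <- aggregate(df[-(1:2)], df[1], mean)
names(df2)[-1] <- paste0("mean", names(df2)[-1])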
If you had wanted to do the column selection automatically then grep or grepl would work:
aggregate(df[ grepl("acc", names(df) )], df[1], mean)
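aggregate also has a formula interface, where a dot on the left-hand side stands for "all remaining columns". The sketch below assumes a small made-up data frame (toy) and drops the sub column first so that it is not averaged.

```r
set.seed(1)  # assumed seed, for reproducibility only
toy <- data.frame(id   = rep(c("A", "B"), each = 3),
                  sub  = rep(1:3, 2),
                  acc1 = runif(6, 0, 3),
                  acc2 = runif(6, 0, 3))

# "." means every column not named on the right-hand side;
# sub is removed beforehand so only the acc columns are averaged
res <- aggregate(. ~ id, data = toy[names(toy) != "sub"], FUN = mean)
res
```

Keep in mind that the formula method drops rows with missing values by default (na.action = na.omit), which differs from passing na.rm = TRUE to mean.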
Here are a couple of other base R options:
split + vapply (since we know vapply would simplify to a matrix whenever possible)
t(vapply(split(df[-c(1, 2)], df[, 1]), colMeans, numeric(4L)))
by (with a do.call(rbind, ...) to get the final structure)
do.call(rbind, by(data = df[-c(1, 2)], INDICES = df[[1]], FUN = colMeans))
Both will give you something like this as your result:
# acc1 acc2 acc3 acc4
# A 1.337496 2.091926 1.978835 1.799669
# B 1.287303 1.447884 1.297933 1.312325
# C 1.870008 1.145385 1.768011 1.252027
# D 1.682446 1.413716 1.582506 1.274925
The sample data used here was (with set.seed, for reproducibility):
set.seed(1)
df = data.frame(id = rep(LETTERS[1:4], 4),
sub = rep(c(1:4), 4),
acc1 = runif(16, 0, 3),
acc2 = runif(16, 0, 3),
acc3 = runif(16, 0, 3),
acc4 = runif(16, 0, 3))
Scaling up to 1M rows, these both perform quite well (though obviously not as fast as "dplyr" or "data.table").
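Another base R route, not one of the two options above, is rowsum(), which computes group sums that can then be divided by the group sizes. This is only a sketch on made-up data (toy), and it assumes there are no missing values.

```r
set.seed(1)  # assumed seed, for reproducibility only
toy <- data.frame(id   = rep(LETTERS[1:2], 3),
                  sub  = rep(1:3, each = 2),
                  acc1 = runif(6, 0, 3),
                  acc2 = runif(6, 0, 3))

sums   <- rowsum(toy[-c(1, 2)], toy$id)   # column sums per id
counts <- table(toy$id)                   # rows per id

# Divide each group's sums by its own size, matching on the row names
means <- sums / as.vector(counts[rownames(sums)])
means
```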
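You can do this in base R itself. Note that nlevels() and levels() assume df$id is a factor; since R 4.0.0, data.frame() no longer converts strings to factors by default, so run df$id <- factor(df$id) first if needed: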
a <- list();
for (i in 1:nlevels(df$id))
{
a[[i]] = colMeans(subset(df, id==levels(df$id)[i])[,c(3,4,5,6)]) ##select columns of df of which you want to compute the means. In your example, 3, 4, 5 and 6 are the columns
}
meanDF <- cbind(data.frame(levels(df$id)), data.frame(matrix(unlist(a), nrow=4, ncol=4, byrow=T)))
colnames(meanDF) = c("id", "meanacc1", "meanacc2", "meanacc3", "meanacc4")
meanDF
id meanacc1 meanacc2 meanacc3 meanacc4
A 1.464635 1.645898 1.7461862 1.026917
B 1.807555 1.097313 1.7135346 1.517892
C 1.350708 1.922609 0.8068907 1.607274
D 1.458911 0.726527 2.4643733 2.141865
I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr but for only one variable, I want to aggregate all my variables that way.
You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
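# grab the column names we want, generalized for V1:Vwhatever
# (named cols rather than c, to avoid masking base::c)
cols = colnames(dat)[-c(1,2)]
#get the weighted mean by date for each column
# (n) := assigns to the column whose name is stored in n;
# the older `with = FALSE` form is deprecated in current data.table
for(n in cols){
  dat[,
      (n) := weighted.mean(get(n), nwords),
      by = date]
}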
#keep only the unique dates and weighted means
wms = unique(dat[,nwords:=NULL])
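The same result can be had in base R by splitting the data frame first, so that the values and the weights are always subsetted together (the aggregate call in the question fails because w stays at full length while each column is cut into groups). A minimal sketch on made-up data; toy and value_cols are hypothetical names.

```r
toy <- data.frame(date   = rep(c("2012-01-01", "2013-01-01"), each = 3),
                  nwords = 1:6,
                  v1     = c(1, 2, 3, 4, 5, 6),
                  v2     = c(6, 5, 4, 3, 2, 1))

value_cols <- setdiff(names(toy), c("date", "nwords"))

# For each date, weight every value column by that date's nwords
wm <- t(sapply(split(toy, toy$date), function(d)
  sapply(d[value_cols], weighted.mean, w = d$nwords)))
wm
```

The result is a matrix with one row per date and one column per variable; as.data.frame(wm) converts it if a data frame is needed.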
Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case is 'date')
y <- c(rep("A", 7), rep("B", 3))
by(data=data.frame(mat), y, function(d) weighted.mean(d$x, d$ww))
If you want the results in a data frame, I suggest the plyr package:
plyr::ddply(data.frame(mat, y), "y", plyr::summarise, wm = weighted.mean(x, ww))
I'm new to plyr and want to take the weighted mean of values within a class to reshape a dataframe for multiple variables. Using the following code, I know how to do this for one variable, such as x2:
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE),
x=rnorm(20), x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class),function(x) data.frame(weighted.mean(x$x2, x$weights)))
However, I would like the code to create a new data frame for x and x2 (and any number of variables in the frame). Does anybody know how to do this? Thanks
You might find what you want in the ?summarise function. I can replicate your code with summarise as follows:
library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE), x=rnorm(20),
x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise,
x2 = weighted.mean(x2, weights))
To do this for x as well, just add that line to be passed into the summarise function:
ddply(frame, .(class), summarise,
x = weighted.mean(x, weights),
x2 = weighted.mean(x2, weights))
Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to the original form. Here's an example with colwise:
wmean.vars <- c("x", "x2")
ddply(frame, .(class), function(x)
colwise(weighted.mean, w = x$weights)(x[wmean.vars]))
Finally, if you don't like having to specify wmean.vars, you can also do:
ddply(frame, .(class), function(x)
numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))
which will compute a weighted-average for every numerical field, excluding the weights themselves.
A data.table answer for fun, which also doesn't require specifying all the variables individually.
library(data.table)
frame <- as.data.table(frame)
keynames <- setdiff(names(frame),c("class","weights"))
frame[, lapply(.SD,weighted.mean,w=weights), by=class, .SDcols=keynames]
Result:
class x x2
1: B 0.1390808 -1.7605032
2: D 1.3585759 -0.1493795
3: C -0.6502627 0.2530720
4: E 2.6657227 -3.7607866
I would like to aggregate a data.frame by an identifier variable called ensg. The data frame looks like this:
chromosome probeset ensg symbol XXA_00 XXA_36 XXB_00
1 X 4938842 ENSMUSG00000000003 Pbsn 4.796123 4.737717 5.326664
I want to compute the mean for each numeric column over rows with same ensg value. The problem here is that I would like to leave the other identity variables chromosome and symbol untouched as they are also the same for same ensg.
In the end I would like to have a data.frame with identity columns chromosome, ensg, symbol and mean of numeric columns over rows with same identifier. I implemented this in ddply, but it is very slow when compared to aggregate:
spec.mean <- function(eset.piece)
{
cbind(eset.piece[1,-numeric.columns],t(colMeans(eset.piece[,numeric.columns])))
}
mean.eset <- ddply(eset.consensus.grand,.(ensg),spec.mean,.progress="tk")
My first aggregate implementation looks like this,
mean.eset=aggregate(eset[,numeric.columns], by=list(eset$ensg), FUN=mean, na.rm=TRUE);
and is much faster. But the problem with aggregate is that I have to reattach the describing variables. I have not figured out how to use my custom function with aggregate since aggregate does not pass data frames but only vectors.
Is there an elegant way to do this with aggregate? Or is there some faster way to do it with ddply?
If speed is a primary concern, you should take a look at the data.table package. When the number of rows or grouping columns is large, data.table really seems to shine. The wiki for the package is here and has several links to other good introductory documents.
Here's how you'd do this aggregation with data.table()
library(data.table)
#Turn the data.frame above into a data.table
dt <- data.table(df)
#Aggregation
dt[, list(XXA_00 = .Internal(mean(XXA_00)),
XXA_36 = .Internal(mean(XXA_36)),
XXB_00 = .Internal(mean(XXB_00))),
by = c("ensg", "chromosome", "symbol")
]
Gives us
ensg chromosome symbol XXA_00 XXA_36 XXB_00
[1,] E1 A S1 0.18026869 0.13118997 0.6558433
[2,] E2 B S2 -0.48830539 0.24235537 0.5971377
[3,] E3 C S3 -0.04786984 -0.03139901 0.5618208
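With many numeric columns, typing each mean by hand gets tedious. A possible generalization is to pass the column names through .SDcols. This is a sketch on made-up data; toy, id_cols and num_cols are hypothetical names, and it uses plain mean rather than the .Internal call above.

```r
library(data.table)

set.seed(1)  # assumed seed, for reproducibility only
toy <- data.table(chromosome = rep(c("A", "B"), each = 3),
                  probeset   = rep(c("X", "Y"), each = 3),
                  ensg       = rep(c("E1", "E2"), each = 3),
                  symbol     = rep(c("S1", "S2"), each = 3),
                  XXA_00     = rnorm(6),
                  XXA_36     = rnorm(6))

id_cols  <- c("ensg", "chromosome", "symbol")       # descriptors to keep
num_cols <- grep("^XX", names(toy), value = TRUE)   # columns to average

# Grouping by all descriptors keeps them in the result without re-attaching
res <- toy[, lapply(.SD, mean), by = id_cols, .SDcols = num_cols]
res
```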
The aggregate solution provided above seems to fare pretty well when working with the 30 row data.frame by comparing the output from the rbenchmark package. However, when the data.frame contains 3e5 rows, data.table() pulls away as a clear winner. Here's the output:
benchmark(fag(), fdt(), replications = 10)
test replications elapsed relative user.self sys.self
1 fag() 10 12.71 23.98113 12.40 0.31
2 fdt() 10 0.53 1.00000 0.48 0.05
First let's define a toy example:
df <- data.frame(chromosome = gl(3, 10, labels = c('A', 'B', 'C')),
probeset = gl(3, 10, labels = c('X', 'Y', 'Z')),
ensg = gl(3, 10, labels = c('E1', 'E2', 'E3')),
symbol = gl(3, 10, labels = c('S1', 'S2', 'S3')),
XXA_00 = rnorm(30),
XXA_36 = rnorm(30),
XXB_00 = rnorm(30))
And then we use aggregate with the formula interface:
df1 <- aggregate(cbind(XXA_00, XXA_36, XXB_00) ~ ensg + chromosome + symbol,
data = df, FUN = mean)
> df1
ensg chromosome symbol XXA_00 XXA_36 XXB_00
1 E1 A S1 -0.02533499 -0.06150447 -0.01234508
2 E2 B S2 -0.25165987 0.02494902 -0.01116426
3 E3 C S3 0.09454154 -0.48468517 -0.25644569
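If there are too many numeric columns to list inside cbind(), a dot on the left-hand side of the formula selects every column not used on the right. The sketch below assumes made-up data (toy) and drops probeset first, since a non-numeric column on the left would not average sensibly.

```r
set.seed(1)  # assumed seed, for reproducibility only
toy <- data.frame(chromosome = rep(c("A", "B"), each = 3),
                  probeset   = rep(c("X", "Y"), each = 3),
                  ensg       = rep(c("E1", "E2"), each = 3),
                  symbol     = rep(c("S1", "S2"), each = 3),
                  XXA_00     = rnorm(6),
                  XXA_36     = rnorm(6))

# "." expands to all columns that are not grouping variables
res <- aggregate(. ~ ensg + chromosome + symbol,
                 data = toy[names(toy) != "probeset"],
                 FUN = mean)
res
```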