I know that there are many answers on this forum about how to get summary statistics (e.g. mean, se, N) for multiple groups using options like aggregate, ddply or data.table. I'm not sure, however, how to apply these functions over multiple columns at once.
More specifically, I would like to know how to extend the following ddply command over multiple columns (dv1, dv2, dv3) without re-typing the code with a different variable name each time.
library(reshape2)
library(plyr)
group1 <- c(rep(LETTERS[1:4], c(4,6,6,8)))
group2 <- c(rep(LETTERS[5:8], c(6,4,8,6)))
group3 <- c(rep(LETTERS[9:10], c(12,12)))
my.dat <- data.frame(group1, group2, group3, dv1=rnorm(24),dv2=rnorm(24),dv3=rnorm(24))
my.dat
data1 <- ddply(my.dat, c("group1", "group2", "group3"), summarise,
               N = length(dv1),
               mean = mean(dv1, na.rm = TRUE),
               sd = sd(dv1, na.rm = TRUE),
               se = sd / sqrt(N)
)
data1
How can I apply this ddply function over multiple columns such that the outcome will be data1, data2, data3... for each outcome variable? I thought this could be the solution:
dfm <- melt(my.dat, id.vars = c("group1", "group2","group3"))
lapply(list(.(group1, variable), .(group2, variable),.(group3, variable)),
ddply, .data = dfm, .fun = summarize,
mean = mean(value),
sd = sd(value),
N=length(value),
se=sd/sqrt(N))
This looks like it's in the right direction, but it's not exactly what I need: it provides the statistics for each grouping variable separately. What I need is an outcome like data1 (e.g. the first aggregated group is people who are in A, E and I; the second is those who are in groups B, E and I, etc.).
Here's an illustration of reshaping your data first. I've written a custom function to improve readability:
mysummary <- function(x, na.rm = FALSE){
  res <- list(mean = mean(x, na.rm = na.rm),
              sd = sd(x, na.rm = na.rm),
              N = length(x))
  res$se <- res$sd / sqrt(res$N)
  res
}
library(data.table)
res <- melt(setDT(my.dat),id.vars=c("group1","group2","group3"))[,mysummary(value),
by=.(group1,group2,group3,variable)]
> head(res)
group1 group2 group3 variable mean sd N se
1: A E I dv1 9.75 6.994045 4 3.497023
2: B E I dv1 9.50 7.778175 2 5.500000
3: B F I dv1 16.00 4.082483 4 2.041241
4: C G I dv1 14.50 10.606602 2 7.500000
5: C G J dv1 10.75 10.372239 4 5.186119
6: D G J dv1 13.00 4.242641 2 3.000000
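If you really need separate objects per outcome variable (data1, data2, ...), as asked, the long-format result can simply be split on variable; a minimal sketch based on the res object above:
# one summary table per outcome variable
res_list <- split(res, res$variable)
res_list$dv1   # the summary statistics for dv1 only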
Or, without the custom function, thanks to @Jaap:
melt(setDT(my.dat),
id=c("group1","group2","group3"))[, .(mean = mean(value),
sd = sd(value),
n = .N,
se = sd(value)/sqrt(.N)),
.(group1, group2, group3, variable)]
If you don't want to melt into long format, you can also do the following (the as.list(unlist(...)) idiom flattens the nested list of per-column statistics into a single wide row, with unlist() generating the dv1.mean-style column names):
library(data.table)
setDT(my.dat)[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x),
sd = sd(x),
n = .N,
se = sd(x)/sqrt(.N))))),
by = .(group1, group2, group3), .SDcols=c("dv1","dv2","dv3")]
which gives:
group1 group2 group3 dv1.mean dv1.sd dv1.n dv1.se dv2.mean dv2.sd dv2.n dv2.se dv3.mean dv3.sd dv3.n dv3.se
1: A E I 0.09959774 0.4704498 4 0.23522491 0.05020096 0.8098882 4 0.40494412 -0.134137210 0.7832841 4 0.3916420
2: B E I 0.72726477 0.3651544 2 0.25820315 0.73743314 1.4260172 2 1.00834641 -0.120188202 0.5532434 2 0.3912022
3: B F I -0.68661572 0.7212631 4 0.36063157 0.06670216 0.7678781 4 0.38393905 0.096275469 0.8993015 4 0.4496508
4: C G I -0.54577363 0.0798962 2 0.05649515 0.18293371 0.1022325 2 0.07228926 -0.947603264 2.3118016 2 1.6346906
5: C G J 0.17434075 0.8503874 4 0.42519369 -0.11485558 1.4184031 4 0.70920154 -0.005140781 0.6871591 4 0.3435796
6: D G J 0.17943465 0.4943486 2 0.34955725 -0.22223273 0.3679613 2 0.26018796 -0.373289114 1.0737512 2 0.7592568
7: D H J 0.38090937 0.7904832 6 0.32271340 0.02107597 1.0094695 6 0.41211422 0.118277330 0.9024006 6 0.3684035
Here is a solution using dplyr. This gives the result in a "wide" format (i.e. the stats for dv1, dv2, dv3 are on the same line).
library(dplyr)
se <- function(x) sd(x) / sqrt(length(x))
my.dat %>%
  group_by(group1, group2, group3) %>%
  summarise_each(funs(mean, sd, length, se), dv1, dv2, dv3) %>%
  ungroup()
If having the stats for dv1, dv2, and dv3 on separate lines is desired, this can be modified using melt or gather (from tidyr).
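Note that summarise_each() and funs() have since been deprecated in dplyr; a sketch of the same computation with across() (dplyr >= 1.0), assuming the same my.dat as above:
library(dplyr)

se <- function(x) sd(x) / sqrt(length(x))

my.dat %>%
  group_by(group1, group2, group3) %>%
  summarise(across(c(dv1, dv2, dv3),
                   list(mean = mean, sd = sd, N = length, se = se)),
            .groups = "drop")
This yields columns named dv1_mean, dv1_sd, dv1_N, dv1_se, and so on for dv2 and dv3.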
Related
I have a data frame with numeric columns and a character column with labels. See example:
library(tidyverse)
a <- c(0.036210845, 0.005546561, 0.004394322 ,0.006635205, 2.269306824 ,0.013542101, 0.006580308 ,0.006854309,0.009076331 ,0.006577178 ,0.099406840 ,0.010962796, 0.011491922,0.007454443 ,0.004463684,0.005836916,0.011119906 ,0.009543205, 0.003990476, 0.007793532 ,0.020776231, 0.011713687, 0.010045341, 0.008411304, 0.032514994)
b <- c(0.030677829, 0.005210211, 0.004164294, 0.006279456 ,1.095908581 ,0.012029876, 0.006193405 ,0.006486812, 0.008589699, 0.006167356, 0.068956516 ,0.010140064 ,0.010602171 ,0.006898081 ,0.004193735, 0.005447855 ,0.009936211, 0.008743681, 0.003774822, 0.007375678, 0.019695336, 0.010827791, 0.009258572, 0.007960328,0.026956408)
c <- c(0.025855453, 0.004882746 ,0.003946182, 0.005929399 ,0.466284591 ,0.010704604 ,0.005815709, 0.006125196, 0.008110854, 0.005769223, 0.046847336, 0.009356712, 0.009803620 ,0.006366758, 0.003936953 ,0.005072295, 0.008885989 ,0.007989028, 0.003565631, 0.006964512, 0.018636187, 0.010009413, 0.008540876, 0.007516569,0.022227924)
label <- c("fa05","fa05" ,"fa05", "fa10", "fa10", "fa10", "fa20","fa20", "faflat", "faflat", "sa05", "sa05", "sa10" , "sa10" , "sa10" , "sa10", "sa10", "sa10", "sa20", "sa20", "sa20" ,"sa20", "saflat", "saflat", "saflat")
dataframe <- as.data.frame(cbind(a,b,c,label))
dataframe <- dataframe %>%
transform(a = as.numeric(a)) %>%
transform(b = as.numeric(b)) %>%
transform(c = as.numeric(c))
I have written a function that takes a sample of rows for each label (the number of rows in the sample equals the number of rows for that specific label) and outputs the average of the samples. Example: in the source data (dataframe) there are 3 rows with the label "fa05". Let's call them fa05_1, fa05_2, fa05_3 (just for the sake of explanation). The function takes a sample of these three rows, each of which consists of the three columns a, b and c. The number of fa05 rows in the sample equals the number of fa05 rows in the source data, so 3 in this case. The function samples with replacement, so it could for example draw fa05_3, fa05_1, fa05_1. It then takes the average of those three sampled rows for each of the three columns a, b and c and returns the output. It looks like this:
samp <- function(df, col1, var){
df %>%
group_by(!!col1) %>%
nest() %>%
ungroup() %>%
mutate(n = !!var) %>%
mutate(samp = map2(data, n, sample_n, replace=T)) %>%
select(-data) %>%
unnest(samp) %>%
group_by(!!col1) %>%
dplyr::summarise(across("a":"c", mean))
}
list <- c(3,3,2,2,2,6,4,3) # the number of times each label occurs in the data
samp(dataframe, quo(label), quo(list))
label a b c
<chr> <dbl> <dbl> <dbl>
1 fa05 0.00439 0.00416 0.00395
2 fa10 0.00894 0.00820 0.00752
3 fa20 0.00672 0.00634 0.00597
4 faflat 0.00908 0.00859 0.00811
5 sa05 0.0552 0.0395 0.0281
6 sa10 0.00715 0.00657 0.00603
7 sa20 0.0101 0.00956 0.00903
8 saflat 0.0250 0.0211 0.0177
I would like to use this function on some data and repeat it 1000 times efficiently. At first it was not a function and I used rerun(), but that was very inefficient. I read that I could write it as a function and then use lapply, which should be more efficient, but it does not work when I do this:
lapply(dataframe, samp, col1=quo(Pattern), var=quo(list))
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('double', 'numeric')"
How do I make this work with lapply? And how do I tell lapply to rerun the function 1000 times? I hope you can help.
You can just use replicate (your lapply call fails because lapply iterates over the columns of dataframe, passing each numeric column to samp, hence the group_by_ error):
replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE)
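If you do want the lapply form asked about in the question, an equivalent version simply ignores the loop index; a small sketch:
# repeat samp() 1000 times; the index i is unused
res <- lapply(seq_len(1000), function(i) samp(dataframe, quo(label), quo(list)))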
Either way, however, this is really slow.
> system.time(replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE))
user system elapsed
33.83 0.03 33.87
To make it faster, we need to rewrite your samp function. Here is a tidyverse approach
group_sample_size <- c("fa05" = 3, "fa10" = 3, "fa20" = 2, "faflat" = 2, "sa05" = 2, "sa10" = 6, "sa20" = 4, "saflat" = 3)
prep <- function(df, grp_var, sample_size) {
df %>%
mutate(size = sample_size[.data[[grp_var]]]) %>%
group_by(across(!!grp_var))
}
rep_sample <- function(df, n) {
replicate(
n,
df %>%
slice(sample.int(n(), size[[1L]], replace = TRUE)) %>%
summarise(across(a:c, mean), .groups = "drop"),
simplify = FALSE
)
}
dataframe %>%
prep("label", group_sample_size) %>%
rep_sample(1000)
Performance has improved significantly but is still suboptimal IMO. It takes about 5-6 seconds to finish the simulation.
> system.time(dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000))
user system elapsed
5.80 0.01 5.81
For efficiency, I think the following data.table approach would be better.
library(data.table)
fsamp <- function(df, grp_var, size, nsim) {
df <- as.data.table(df)
group_info <- table(df[[grp_var]], dnn = list(grp_var))
simu_pool <- df[, -grp_var, with = FALSE]
simu_vars <- names(simu_pool)
simu_pool <- split(simu_pool, df[[grp_var]])
out <- data.table(
simu = rep(seq_len(nsim), each = length(group_info)),
group_info
)
out[
, size := size[out[[grp_var]]]
][
, (simu_vars) := lapply(simu_pool[[.BY[[grp_var]]]][sample.int(N, size, replace = TRUE)], mean),
by = c("simu", grp_var)
][]
}
This one is about four times faster than the optimised tidyverse approach.
> system.time(fsamp(dataframe, "label", group_sample_size, 1000))
user system elapsed
1.47 0.04 1.50
All three approaches produce the same set of results:
> set.seed(124)
> # rbindlist converts a list of tibbles into a single data.table
> dataframe %>% prep("label", group_sample_size) %>% rep_sample(1000) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
> set.seed(124)
> fsamp(dataframe, "label", group_sample_size, 1000)
simu label N size a b c
1: 1 fa05 3 3 0.015383909 0.013350778 0.011561460
2: 1 fa10 3 3 0.763161377 0.371405971 0.160972865
3: 1 fa20 2 2 0.006717308 0.006340109 0.005970452
4: 1 faflat 2 2 0.009076331 0.008589699 0.008110854
5: 1 sa05 2 2 0.055184818 0.039548290 0.028102024
---
7996: 1000 faflat 2 2 0.007826754 0.007378527 0.006940039
7997: 1000 sa05 2 2 0.099406840 0.068956516 0.046847336
7998: 1000 sa10 6 6 0.006648513 0.006118159 0.005626362
7999: 1000 sa20 4 4 0.020776231 0.019695336 0.018636187
8000: 1000 saflat 3 3 0.008411304 0.007960328 0.007516569
> set.seed(124)
> replicate(1000, samp(dataframe, quo(label), quo(list)), simplify = FALSE) %>% rbindlist()
label a b c
1: fa05 0.015383909 0.013350778 0.011561460
2: fa10 0.763161377 0.371405971 0.160972865
3: fa20 0.006717308 0.006340109 0.005970452
4: faflat 0.009076331 0.008589699 0.008110854
5: sa05 0.055184818 0.039548290 0.028102024
---
7996: faflat 0.007826754 0.007378527 0.006940039
7997: sa05 0.099406840 0.068956516 0.046847336
7998: sa10 0.006648513 0.006118159 0.005626362
7999: sa20 0.020776231 0.019695336 0.018636187
8000: saflat 0.008411304 0.007960328 0.007516569
Related
I have a list of the following structure:
myList <- replicate(5, data.frame(id = 1:10, mean = runif(10)), simplify =F)
and I want to reduce it with a merge
myList %>% reduce(function(x, y) merge(x, y, by = 'id'))
That, however, leads to the following colnames:
id mean.x mean.y mean.x mean.y mean
While I would like something like
id mean1 mean2 mean3 mean4 mean5
Where the numbers are based on the order of myList.
Obviously I could iterate over 1:length(myList), but I find that solution inelegant. Another option would be to introduce a check in the reducing function, but that would introduce a new linear-time search for each element of the list, so I don't believe it would be very efficient.
Is there another way to achieve this?
New answer:
Using rbindlist and dcast from the data.table-package:
library(data.table)
mydata <- rbindlist(myList, idcol = 'df')
dcast(mydata, id ~ paste0('mean',df), value.var = 'mean')
Or with the tidyverse packages:
library(dplyr)
library(tidyr)
myList %>%
bind_rows(., .id = 'df') %>%
spread(df, mean) %>%
rename_at(-1, funs(paste0('mean',.)))
which both give (the data.table output is shown):
id mean1 mean2 mean3 mean4 mean5
1: 1 0.6937674 0.005642891 0.4155868 0.74184186 0.54513885
2: 2 0.3602352 0.569412043 0.8018570 0.29177043 0.34521060
3: 3 0.6353133 0.512876032 0.8711914 0.44660086 0.16338451
4: 4 0.2106574 0.555638598 0.8240744 0.37495213 0.57443740
5: 5 0.9530160 0.059930577 0.0930678 0.39862717 0.91568414
6: 6 0.3723244 0.598526326 0.4970844 0.01978011 0.07832631
7: 7 0.2923137 0.712971846 0.3805590 0.25676592 0.11682605
8: 8 0.6208868 0.426853621 0.5533876 0.64054247 0.78949419
9: 9 0.9032609 0.274705843 0.3525957 0.46994429 0.32883110
10: 10 0.9707088 0.351394642 0.1089803 0.97969335 0.77791085
When there are duplicates in id in one or more of the dataframes in myList, you have to adapt the dcast-step to dcast(mydata, id + rowid(id,df) ~ paste0('mean',df), value.var = 'mean') to get the correct outcome. Check the following example to see the result:
myList <- replicate(5, data.frame(id = sample(1:10, 10, TRUE), mean = runif(10)), simplify = FALSE)
mydata <- rbindlist(myList, idcol = 'df')
dcast(mydata, id + rowid(id,df) ~ paste0('mean',df), value.var = 'mean')
This also works when there are no duplicates in id.
The tidyverse code then has to be adapted to:
myList %>%
bind_rows(., .id = 'df') %>%
group_by(df, id) %>%
mutate(ri = row_number()) %>%
ungroup() %>%
spread(df, mean) %>%
rename_at(3:7, funs(paste0('mean',.)))
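As spread() and rename_at()/funs() are superseded in current tidyr/dplyr, here is a sketch of the same reshape with pivot_wider(), which also handles the duplicate-id case via the row index:
library(dplyr)
library(tidyr)

myList %>%
  bind_rows(.id = 'df') %>%
  group_by(df, id) %>%
  mutate(ri = row_number()) %>%   # disambiguates duplicated ids within a data frame
  ungroup() %>%
  pivot_wider(names_from = df, values_from = mean, names_prefix = 'mean') %>%
  select(-ri)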
Old answer (still valid):
A possible solution:
# option 1
myList <- mapply(function(x,y) {names(x)[2] = paste0('mean',y); x}, myList, 1:length(myList), SIMPLIFY = FALSE)
Reduce(function(x, y) merge(x, y, by = 'id'), myList)
# option 2 (quite similar to #zx8754's solution)
mydata <- Reduce(function(x, y) merge(x, y, by = 'id'), myList)
setNames(mydata, c('id', paste0('mean', seq_along(myList))))
which gives:
id mean1 mean2 mean3 mean4 mean5
1 1 0.1119114 0.4193226 0.86619590 0.52543072 0.52879193
2 2 0.4630863 0.8786721 0.02012432 0.77274088 0.09227344
3 3 0.9832522 0.4687838 0.49074271 0.01611625 0.69919423
4 4 0.7017467 0.7845002 0.44692958 0.64485570 0.40808345
5 5 0.6204856 0.1687563 0.54407165 0.54236973 0.09947167
6 6 0.1480965 0.7654041 0.43591864 0.22468554 0.84557988
7 7 0.0179509 0.3610114 0.45420122 0.20612154 0.76899342
8 8 0.9862083 0.5579173 0.13540519 0.97311401 0.13947602
9 9 0.3140737 0.2213044 0.05187671 0.07870425 0.23880332
10 10 0.4515313 0.2367271 0.65728768 0.22149073 0.90578043
You can also try to modify the function in the Reduce (or reduce) call to make the adding of indices automatic:
Reduce(function(x, y){
  # get the indices of the columns that are not the common one, in x and y
  col_noby_x <- which(colnames(x) != "id")
  col_noby_y <- which(colnames(y) != "id")
  # maximum of the numeric suffixes at the end of the column names in x
  ind_x <- max(as.numeric(sub(".+(\\d+)$", "\\1", colnames(x)[col_noby_x])))
  # if there is no index yet, append 1 and 2; otherwise rename only in y,
  # using the maximum index found in x plus one
  if (!is.na(ind_x)) {
    colnames(y)[col_noby_y] <- paste0(colnames(y)[col_noby_y], ind_x + 1)
  } else {
    colnames(x)[col_noby_x] <- paste0(colnames(x)[col_noby_x], 1)
    colnames(y)[col_noby_y] <- paste0(colnames(y)[col_noby_y], 2)
  }
  # finally merge
  merge(x, y, by = "id")
}, myList)
# id mean1 mean2 mean3 mean4 mean5
#1 1 0.10698388 0.0277198 0.5109345 0.8885772 0.79983437
#2 2 0.29750846 0.7951743 0.9558739 0.9691619 0.31805857
#3 3 0.07115142 0.2401011 0.8106464 0.5101563 0.78697618
#4 4 0.39564336 0.7225532 0.7583893 0.4275574 0.77151883
#5 5 0.55860511 0.4111913 0.8403031 0.4284490 0.51489116
#6 6 0.92191777 0.9142926 0.4708712 0.2451099 0.84142501
#7 7 0.08218166 0.2741819 0.6772842 0.7939364 0.86930336
#8 8 0.35392512 0.2088531 0.0801731 0.2734870 0.62963218
#9 9 0.64068537 0.8427225 0.1904426 0.2389339 0.73145206
#10 10 0.31304719 0.9898133 0.8173664 0.2013031 0.04658273
Merge with Reduce, then update column names:
res <- Reduce(function(...) merge(..., all = TRUE, by = "id"), myList)
colnames(res)[2:ncol(res)] <- paste0("mean", 1:length(myList))
We can use set_names:
library(tidyverse)
myList %>%
reduce(merge, by = 'id') %>%
set_names(c("id", paste0("mean", 1:5)))
# id mean1 mean2 mean3 mean4 mean5
#1 1 0.07122593 0.480300675 0.34944190 0.48718226 0.9118796
#2 2 0.18375430 0.850652470 0.24780063 0.45148232 0.2587470
#3 3 0.18617054 0.526188340 0.48716956 0.53354343 0.9057241
#4 4 0.87838756 0.811985522 0.49024819 0.10412944 0.7830501
#5 5 0.29287646 0.974811919 0.31413846 0.01508965 0.4587954
#6 6 0.62304018 0.004421152 0.81053625 0.80032467 0.7630185
#7 7 0.78445890 0.006362844 0.73643248 0.15952795 0.4386658
#8 8 0.71568076 0.081139996 0.36933728 0.31771823 0.2794372
#9 9 0.25523328 0.081603285 0.00298272 0.33698950 0.2413859
#10 10 0.86274552 0.432177738 0.26064580 0.75639537 0.3125151
Here are two one-liners.
Using purrr::reduce2 and dplyr::inner_join in place of merge:
library(dplyr)
library(purrr)
myList %>% reduce2(map(2:length(.), ~ c("", .x)), inner_join, by = 'id', copy = FALSE)
# id mean mean2 mean3 mean4 mean5
# 1 1 0.44560715 0.4575765 0.6075921 0.06504922 0.90410342
# 2 2 0.60606716 0.5004711 0.7866959 0.89632285 0.09890028
# 3 3 0.59928281 0.4894146 0.4495071 0.66090212 0.56046997
# 4 4 0.55630819 0.4166869 0.1984523 0.08040737 0.18375885
# 5 5 0.97714203 0.1223497 0.7923596 0.53054508 0.93747149
# 6 6 0.07751312 0.6217220 0.3861749 0.30062805 0.03177210
# 7 7 0.22839323 0.3994350 0.6382234 0.98578452 0.27032222
# 8 8 0.73628572 0.8804618 0.8240999 0.44205508 0.73901477
# 9 9 0.81894510 0.2186181 0.9317510 0.60035660 0.65002083
# 10 10 0.26197059 0.5569660 0.9167330 0.58912675 0.81367176
Or using plyr::join_all and tibble::repair_names (same output):
myList %>% plyr::join_all('id', 'inner') %>% tibble::repair_names()
Related
I have a data table with 10 columns:
town
tc
one
two
three
four
five
six
seven
total
I need to generate the mean for the columns "one" through "total", for which I am using:
DTmean <- DT[,(lapply(.SD,mean)),by = .(town,tc),.SDcols=3:10]
This generates the means, but I then want the column names to be suffixed with "_mean". How can I do this? I want the first two columns to remain "town" and "tc". I tried the code below, but it renames all the columns from "one" to "total" to just "_mean":
for (i in 3:10) {
setnames(DTmean,i,paste0(names(i),"_mean"))
}
If you want to do it the data.table way, you should use setnames as follows (your loop fails because i is just an integer, so names(i) is NULL and paste0(names(i), "_mean") evaluates to plain "_mean"):
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))
or:
cols <- names(DT)[3:10]
setnames(DTmean, cols, paste0(cols, '_mean'))
Furthermore, you don't need the .SDcols statement as you are aggregating all the other columns. Using DT[, lapply(.SD,mean), by = .(town,tc)] should thus give you the same result as using DT[, (lapply(.SD,mean)), by = .(town,tc), .SDcols=3:10].
On the following example dataset:
set.seed(71)
DT <- data.table(town = rep(c('A','B'), each=10),
tc = rep(c('C','D'), 10),
one = rnorm(20,1,1),
two = rnorm(20,2,1),
three = rnorm(20,3,1),
four = rnorm(20,4,1),
five = rnorm(20,5,2),
six = rnorm(20,6,2),
seven = rnorm(20,7,2),
total = rnorm(20,28,3))
using:
DTmean <- DT[, lapply(.SD,mean), by = .(town,tc)]
setnames(DTmean, 3:10, paste0(names(DT)[3:10], '_mean'))
gives:
> DTmean
town tc one_mean two_mean three_mean four_mean five_mean six_mean seven_mean total_mean
1: A C 1.7368898 1.883586 3.358440 4.849896 4.742609 5.089877 6.792513 29.20286
2: A D 0.8906842 1.826135 3.267684 3.760931 6.210145 7.320693 5.571687 26.56142
3: B C 1.4037955 2.474836 2.587920 3.719658 3.446612 6.510183 8.309784 27.80012
4: B D 0.8103511 1.153000 3.360940 3.945082 5.555999 6.198380 8.652779 28.95180
In reply to your comment: if you want to calculate both the mean and the sd simultaneously, you could do the following (adapted from my answer here):
DT[, as.list(unlist(lapply(.SD, function(x) list(mean = mean(x), sd = sd(x))))), by = .(town,tc)]
which gives:
town tc one.mean one.sd two.mean two.sd three.mean three.sd four.mean four.sd five.mean five.sd six.mean six.sd seven.mean seven.sd total.mean total.sd
1: A C 0.2981842 0.3556520 1.578174 0.7788545 2.232366 0.9047046 4.896201 1.238877 4.625866 0.7436584 7.607439 1.7262628 7.949366 1.772771 28.94287 3.902602
2: A D 1.2099018 1.0205252 1.686068 1.5497989 2.671027 0.8323733 4.811279 1.404794 7.235969 0.7883873 6.765797 2.7719942 6.657298 1.107843 27.42563 3.380785
3: B C 0.9238309 0.6679821 2.525485 0.8054734 3.138298 1.0111270 3.876207 0.573342 3.843140 2.1991052 4.942155 0.7784024 6.783383 2.595116 28.95243 1.078307
4: B D 0.8843948 0.9384975 1.988908 1.0543981 3.673393 1.3505701 3.957534 1.097837 2.788119 1.9089660 6.463784 0.7642144 6.416487 2.041441 27.88205 3.807119
However, it is probably better to store this in long format. To get that, you could use data.table's melt function as follows:
cols <- names(DT)[3:10]
DT2 <- melt(DT[, as.list(unlist(lapply(.SD, function(x) list(mn = mean(x), sdev = sd(x))))), by = .(town,tc)],
id.vars = c('town','tc'),
measure.vars = patterns('.mn','.sdev'),
value.name = c('mn','sdev'))[, variable := cols[variable]]
or in a much simpler operation:
DT2 <- melt(DT, id.vars = c('town','tc'))[, .(mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)]
which results in:
> DT2
town tc variable mn sdev
1: A C one 0.2981842 0.3556520
2: A D one 1.2099018 1.0205252
3: B C one 0.9238309 0.6679821
4: B D one 0.8843948 0.9384975
5: A C two 1.5781743 0.7788545
6: A D two 1.6860675 1.5497989
7: B C two 2.5254855 0.8054734
8: B D two 1.9889082 1.0543981
9: A C three 2.2323655 0.9047046
10: A D three 2.6710267 0.8323733
11: B C three 3.1382982 1.0111270
12: B D three 3.6733929 1.3505701
.....
In response to your last comments, you can detect outliers as follows:
DT3 <- melt(DT, id.vars = c('town','tc'))
DT3[, `:=` (mn = mean(value), sdev = sd(value)), by = .(town,tc,variable)
][, outlier := +(value < mn - sdev | value > mn + sdev)]
which gives:
town tc variable value mn sdev outlier
1: A C one 0.5681578 0.2981842 0.355652 0
2: A D one 0.5528128 1.2099018 1.020525 0
3: A C one 0.5214274 0.2981842 0.355652 0
4: A D one 1.4171454 1.2099018 1.020525 0
5: A C one 0.5820994 0.2981842 0.355652 0
---
156: B D total 23.4462542 27.8820524 3.807119 1
157: B C total 30.5934956 28.9524305 1.078307 1
158: B D total 30.5618759 27.8820524 3.807119 0
159: B C total 27.5940307 28.9524305 1.078307 1
160: B D total 24.8378437 27.8820524 3.807119 0
Related
I am trying to calculate the mean reagent vectors across the variables RBC, WBC, and hemoglobin. I am fairly new to R, so my question is: can you show me an easier way to do the following calculations in R? The data is from Table 6.19 of Rencher, and I am trying to practice the computations in R as I follow the examples there.
reagent.dat <- read.table("https://dl.dropboxusercontent.com/u/28713619/reagent.dat")
colnames(reagent.dat) <- c("reagent", "subject", "RBC", "WBC", "hemoglobin")
reagent.dat$reagent <- factor(reagent.dat$reagent)
reagent.dat$subject <- factor(reagent.dat$subject)
library(plyr)
library(dplyr)
library(reshape2)
# Calculate the means per variable, across reagents
reagent.datm <- melt(reagent.dat)
group.means <- ddply(reagent.datm, c("variable","reagent"), summarise,mean=mean(value))
group.means <- tbl_df(group.means)
newdata <- group.means %>% select(reagent, mean)
# Store the group means into a matrix
y_bar <- matrix(c(rep(NA, times=12)), ncol=4)
for (i in 1:4)
y_bar[,i] <- as.matrix(filter(newdata, reagent == i)$mean, ncol=1)
y_bar
The dplyr package can actually simplify your code quite easily and is definitely worth learning because of how powerful it can be. As an example:
reagent.dat <- read.table("https://dl.dropboxusercontent.com/u/28713619/reagent.dat")
colnames(reagent.dat) <- c("reagent", "subject", "RBC", "WBC", "hemoglobin")
#Using dplyr
library(dplyr)
reagentmeans <- reagent.dat %>% select(reagent, RBC, WBC, hemoglobin) %>%
group_by(reagent) %>%
summarize(mean_RBC = mean(RBC), mean_WBC = mean(WBC),
mean_hemoglobin = mean(hemoglobin))
> reagentmeans
Source: local data frame [4 x 4]
reagent mean_RBC mean_WBC mean_hemoglobin
(fctr) (dbl) (dbl) (dbl)
1 1 7.290 4.9535 15.310
2 2 7.210 4.8985 15.725
3 3 7.055 4.8810 15.595
4 4 7.025 4.8915 15.765
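With current dplyr, the three explicit mean() calls can be collapsed using across(); a sketch assuming the same reagent.dat:
library(dplyr)

reagent.dat %>%
  group_by(reagent) %>%
  summarize(across(c(RBC, WBC, hemoglobin), mean, .names = "mean_{.col}"))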
You can use data.table:
library(data.table)
setDT(reagent.dat)[, lapply(.SD, mean), by = reagent, .SDcols = c('RBC', 'WBC', 'hemoglobin')]
# reagent RBC WBC hemoglobin
#1: 1 7.290 4.9535 15.310
#2: 2 7.210 4.8985 15.725
#3: 3 7.055 4.8810 15.595
#4: 4 7.025 4.8915 15.765
Related
This is the first time I have asked a question on Stack Overflow. I have tried searching for the answer, but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects, and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate(RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate(RT ~ su, data, mean)
as R returns this error:
Error in `$<-.data.frame`(`*tmp*`, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks something specifying that the mean for each subject has to be repeated for all of that subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows; if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows, and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in base R, ave is intended for exactly this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
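For instance, on a small made-up data frame (hypothetical values, just to illustrate how the group means are recycled across rows):
data <- data.frame(su = c(1, 1, 2, 2, 2), RT = c(300, 340, 280, 310, 296))
data$mean <- with(data, ave(x = RT, su, FUN = mean))
data
#   su  RT     mean
# 1  1 300 320.0000
# 2  1 340 320.0000
# 3  2 280 295.3333
# 4  2 310 295.3333
# 5  2 296 295.3333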
Simply merge your aggregated means with the full data frame, joined by subject:
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by = "su")
Another compelling way of handling this without generating extra data objects is using group_by from the dplyr package:
library(dplyr)
# Generating some data
data <- data.table::data.table(
  su = sample(letters[1:5], size = 14, replace = TRUE),
  RT = rnorm(14))[order(su)]
# Performing
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978