I have a dataframe similar to the one generated below. Some individuals have more than one observation for a particular variable, and each variable has an associated standard error (SE) for the estimate. I would like to create a new dataframe that contains only a single row for each individual. For individuals with more than one observation, such as Kim or Bob, I need to calculate a precision-weighted average based on the standard errors of the estimates, along with a variance for the newly calculated weighted mean. For example, for Bob's var1, this means that I would want his value in the new dataframe to be:
weighted.mean(c(example$var1[2], example$var1[10]),
c(1/example$SE1[2], 1/example$SE1[10]))
and for Bob's new SE1, which would be the variance of the weighted mean, to be:
1/sum(1/example$SE1[2] + 1/example$SE1[10])
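In general terms (my own notation, mirroring the 1/SE weighting used above), the weighted mean and its variance for a vector of estimates x with standard errors s would be:
# purely illustrative helper -- not part of the original question
pw <- function(x, s) {
  w <- 1/s    # precision weights, as defined above
  c(mean = sum(w * x, na.rm = TRUE) / sum(w, na.rm = TRUE),
    var  = 1/sum(w, na.rm = TRUE))
}
so that pw(example$var1[c(2, 10)], example$SE1[c(2, 10)]) reproduces the two expressions above for Bob.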
I have tried using the aggregate function and am able to calculate the arithmetic mean of the values, but the simple function I wrote neither uses the standard errors nor handles the NAs.
aggregate(example[,1:4], by = list(example[,5]), mean)
Would appreciate any help in developing some code to work through this problem. Here is the example dataset.
set.seed(1562)
example=data.frame(rnorm(10,8,2))
colnames(example)[1]=("var1")
example$SE1=rnorm(10,2,1)
example$var2=rnorm(10,8,2)
example$SE2=rnorm(10,2,1)
example$id = c("Kim","Bob","Joe","Sam","Kim","Kim","Joe","Sara","Jeff","Bob")
example$SE1[5]=NA
example$var1[5]=NA
example$SE2[10]=NA
example$var2[10]=NA
example
var1 SE1 var2 SE2 id
1 9.777769 2.451406 6.363250 2.2739566 Kim
2 8.753078 2.174308 6.219770 1.4978380 Bob
3 7.977356 2.107739 6.835998 2.1647437 Joe
4 11.113048 2.713242 11.091650 1.7018666 Sam
5 NA NA 11.769884 -0.1310218 Kim
6 5.271308 1.831475 6.818854 3.0294338 Kim
7 7.770062 2.094850 6.387607 0.2272348 Joe
8 9.837612 1.956486 8.517445 3.5126378 Sara
9 4.637518 2.516896 7.173460 2.0292454 Jeff
10 9.004425 1.592312 NA NA Bob
I like the plyr package for these sorts of problems. It should be functionally equivalent to aggregate, but I think it is nice and convenient to use. There are lots of examples and a great ~20 page intro to plyr on the website. For this problem, since the data starts as a data.frame and you want another data.frame on the other end, we use ddply()
library(plyr)
#f1()
ddply(example, "id", summarize,
newMean = weighted.mean(x=var1, 1/SE1, na.rm = TRUE),
newSE = 1/sum(1/SE1, na.rm = TRUE)
)
Which returns:
id newMean newSE
1 Bob 8.8982 0.91917
2 Jeff 4.6375 2.51690
3 Joe 7.8734 1.05064
4 Kim 7.1984 1.04829
5 Sam 11.1130 2.71324
6 Sara 9.8376 1.95649
Also check out ?summarize and ?transform for some other good background. You can also pass an anonymous function to the plyr functions if necessary for more complicated tasks.
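For instance, here is a minimal sketch of an anonymous-function version that summarises both variable/SE pairs in one pass (the output column names are my own):
ddply(example, "id", function(d) {
  data.frame(newMean1 = weighted.mean(d$var1, 1/d$SE1, na.rm = TRUE),
             newSE1   = 1/sum(1/d$SE1, na.rm = TRUE),
             newMean2 = weighted.mean(d$var2, 1/d$SE2, na.rm = TRUE),
             newSE2   = 1/sum(1/d$SE2, na.rm = TRUE))
})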
Or use the data.table package, which can be faster for some tasks:
library(data.table)
dt <- data.table(example, key="id")
#f2()
dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
newSE = 1/sum(1/SE1, na.rm = TRUE)),
by = "id"]
A quick benchmark:
library(rbenchmark)
# f1 = the plyr approach, f2 = the data.table approach, wrapped as functions
f1 <- function() ddply(example, "id", summarize,
                       newMean = weighted.mean(x = var1, 1/SE1, na.rm = TRUE),
                       newSE = 1/sum(1/SE1, na.rm = TRUE))
f2 <- function() dt[, list(newMean = weighted.mean(var1, 1/SE1, na.rm = TRUE),
                           newSE = 1/sum(1/SE1, na.rm = TRUE)),
                    by = "id"]
benchmark(f1(), f2(),
          replications = 1000,
          order = "elapsed",
          columns = c("test", "elapsed", "relative"))
test elapsed relative
2 f2() 3.580 1.0000
1 f1() 6.398 1.7872
So data.table is ~1.8x faster for this dataset on my simple laptop.
My data frame has two columns that are used as a grouping key, 17 columns that need to be summed in each group, and one column that should be averaged instead. Let me illustrate this on a different data frame, diamonds from ggplot2.
I know I could do it like this:
ddply(diamonds, ~cut, summarise, x=sum(x), y=sum(y), z=sum(z), price=mean(price))
But while it is reasonable for 3 columns, it is unacceptable for 17 of them.
When researching this, I found the colwise function, but the best I came up with is this:
cbind(ddply(diamonds, ~cut, colwise(sum, 7:9)), price=ddply(diamonds, ~cut, summarise, mean(price))[,2])
Is there a possibility to improve this even further? I would like to do it in a more straightforward way, something like (imaginary commands):
ddply(diamonds, ~cut, colwise(sum, 7:9), price=mean(price))
or:
ddply(diamonds, ~cut, colwise(sum, 7:9), colwise(mean, ~price))
To sum up:
I don't want to have to type all 17 columns explicitly, like the first example does with x, y, and z.
Ideally, I would like to do it with a single call to ddply, without resorting to cbind (or similar functions), as in the second example.
For reference, the result I expect is 5 rows and 5 columns:
cut x y z price
1 Fair 10057.50 9954.07 6412.26 4358.758
2 Good 28645.08 28703.75 17855.42 3928.864
3 Very Good 69359.09 69713.45 43009.52 3981.760
4 Premium 82385.88 81985.82 50297.49 4584.258
5 Ideal 118691.07 118963.24 73304.61 3457.542
I would like to suggest data.table solutions for this. You can easily predefine the columns you want to operate on, either by position or by name, and then reuse the same code no matter how many columns you are summarising.
Predefine the columns to operate on
Sums <- 7:9
Means <- "price"
Run the code
library(data.table)
data.table(diamonds)[, c(lapply(.SD[, Sums, with = FALSE], sum),
lapply(.SD[, Means, with = FALSE], mean))
, by = cut]
# cut x y z price
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
For your specific example, this could be simplified to just
data.table(diamonds)[, c(lapply(.SD[, 7:9, with = FALSE], sum), pe = mean(price)), by = cut]
# cut x y z pe
# 1: Ideal 118691.07 118963.24 73304.61 3457.542
# 2: Premium 82385.88 81985.82 50297.49 4584.258
# 3: Good 28645.08 28703.75 17855.42 3928.864
# 4: Very Good 69359.09 69713.45 43009.52 3981.760
# 5: Fair 10057.50 9954.07 6412.26 4358.758
Another solution uses dplyr. First, apply both aggregate functions to every variable you want aggregated; from the resulting variables, select only the desired function/variable combinations.
library(dplyr)
library(ggplot2)
diamonds %>%
group_by(cut) %>%
summarise_each(funs(sum, mean), x:z, price) %>%
select(cut, matches("[xyz]_sum"), price_mean)
Yet another approach (in my opinion easier to read) for your particular case, since mean = sum/n:
nCut <- ddply(diamonds, ~cut, nrow)
res <- ddply(diamonds, ~cut, colwise(sum, 6:9))
res$price <- res$price/nCut$V1
or, more generically,
do.call(merge,
lapply(c(colwise(sum, 7:9), colwise(mean, 6)),
function(cw) ddply(diamonds, ~cut, cw)))
Just to throw in another solution:
library(plyr)
library(ggplot2)
trans <- list(mean = 8:10, sum = 7)
makeList <- function(inL, mdat = diamonds, by = ~cut) {
  colN <- names(mdat)
  # build one named call per function/column pair, e.g. mean.x = mean(x)
  args <- unlist(llply(names(inL), function(n) {
    llply(inL[[n]], function(x) {
      ret <- list(call(n, as.symbol(colN[[x]])))
      names(ret) <- paste(n, colN[[x]], sep = ".")
      ret
    })
  }))
  # add the remaining arguments that ddply() expects
  args$.data <- as.symbol(deparse(substitute(mdat)))
  args$.variables <- by
  args$.fun <- as.symbol("summarise")
  args
}
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price
# 1 Fair 6.246894 6.182652 3.982770 7017600
# 2 Good 5.838785 5.850744 3.639507 19275009
# 3 Very Good 5.740696 5.770026 3.559801 48107623
# 4 Premium 5.973887 5.944879 3.647124 63221498
# 5 Ideal 5.507451 5.520080 3.401448 74513487
The idea is that the function makeList creates an argument list for ddply. In this way you can quite easily add terms to the list (as function.name = column.indices) and ddply will work as expected:
trans <- c(trans, sd = list(9:10))
do.call(ddply, makeList(trans))
# cut mean.x mean.y mean.z sum.price sd.y sd.z
# 1 Fair 6.246894 6.182652 3.982770 7017600 0.9563804 0.6516384
# 2 Good 5.838785 5.850744 3.639507 19275009 1.0515353 0.6548925
# 3 Very Good 5.740696 5.770026 3.559801 48107623 1.1029236 0.7302281
# 4 Premium 5.973887 5.944879 3.647124 63221498 1.2597511 0.7311610
# 5 Ideal 5.507451 5.520080 3.401448 74513487 1.0744953 0.6576481
This uses dplyr, but I believe it accomplishes the specified aim completely, in reasonably easy-to-read syntax:
diamonds %>%
  group_by(cut) %>%
  select(x:z) %>%
  summarize_each(funs(sum)) %>%
  merge(diamonds %>%
          group_by(cut) %>%
          summarize(price = mean(price)),
        by = "cut")
The only "trick" is that there is a piped expression inside of the merge that handles the calculation of the mean price separately from the calculation of sums.
I benchmarked this solution against the one provided by @David Arenburg (using data.table) and the one by @thothal (using plyr, as requested in the question), with 5000 replications. Here data.table came out slower than plyr, and dplyr was faster than both. One imagines that the benchmark results could change as a function of the number of columns, the number of levels in the grouping factor, and the particular functions applied. For example, MarkusN submitted an answer after I did my initial benchmarks that is substantially faster than the previously submitted answers for the sample data. He accomplishes this by calculating many summary statistics that aren't desired and then throwing them away; surely there is a point at which the costs of that approach outweigh the advantages.
test replications elapsed relative user.self sys.self user.child sys.child
2 dataTable 5000 119.686 2.008 119.611 0.127 0 0
1 dplyr 5000 59.614 1.000 59.676 0.004 0 0
3 plyr 5000 68.505 1.149 68.493 0.064 0 0
? MarkusN 5000 23.172 ????? 23.926 0 0 0
Certainly speed is not the only consideration. In particular, dplyr and plyr are picky about the order in which they are loaded (plyr before dplyr) and have several functions that mask each other.
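For instance (my own illustration, not from the original answers), the masking can be sidestepped by loading plyr before dplyr and, where needed, qualifying calls explicitly:
library(plyr)    # load plyr first ...
library(dplyr)   # ... then dplyr, so dplyr's summarise/mutate sit in front on the search path
# diamonds from ggplot2, as above; explicit namespaces remove any ambiguity
diamonds %>% dplyr::group_by(cut) %>% dplyr::summarise(price = mean(price))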
Not 100% what you are looking for, but it might give you another idea of how to do it. Using data.table you can do something like this:
diamonds2[, .(c = sum(c), p = sum(p), ce = sum(ce), pe = mean(pe)), by = cut]
To shorten the code (what you tried to do with colwise), you probably have to write some functions to achieve exactly what you want.
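A sketch of such a helper (the function and argument names are my own), which sums one set of columns and averages another:
sumsAndMeans <- function(DT, by_cols, sum_cols, mean_cols) {
  DT[, c(lapply(.SD[, sum_cols, with = FALSE], sum),
         lapply(.SD[, mean_cols, with = FALSE], mean)),
     by = by_cols]
}
# e.g. sumsAndMeans(data.table(diamonds), "cut", c("x", "y", "z"), "price")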
For completeness, here's a solution based on dplyr and answers posted by Veerendra Gadekar in another question and here by MarkusN.
In this particular case, it's possible to first apply sum to some of the columns and then mean to all columns of interest:
diamonds %>%
group_by(cut) %>%
mutate_each('sum', 8:10) %>%
summarise_each('mean', 8:10, price)
This is possible, because mean won't change the calculated sums of columns 8:10 and will calculate the required mean of prices. But if we wanted to calculate standard deviation of prices instead of mean, this approach wouldn't work as columns 8:10 would all be 0.
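A one-line illustration of that point, with made-up numbers:
mean(c(5, 5, 5))   # 5 -- the mean of a constant column returns the constant, so the group sum survives
sd(c(5, 5, 5))     # 0 -- the sd of a constant column collapses to 0, losing the value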
A more general approach could be:
diamonds %>%
group_by(cut) %>%
mutate_each('sum', 8:10) %>%
mutate_each('mean', price) %>%
summarise_each('first', 8:10, price)
One may not be pleased by summarise_each repeating column specifications that were named earlier, but this seems like an elegant solution nonetheless.
It has the advantage over MarkusN's solution that it doesn't require matching newly created columns and doesn't change their names.
The solution by Veerendra Gadekar should end with select(cut, 8:10, price) %>% arrange(cut) in order to produce the expected results (the subset of columns, with rows sorted by the grouping key). The suggestion by Hong Ooi is similar to the first one here, but assumes there are no other columns.
Finally, it seems to be more legible and easy to understand than a data.table solution, like the one proposed by David Arenburg.
I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
Condition Response Accuracy RT
1 1 A 1 110
2 1 A 1 133
3 1 A 0 121
4 1 A 1 122
5 1 B 1 145
6 1 A 0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
Condition N nAccurate RT
1 6 4 132.8333
2 6 4 362.5000
With plyr, you can do it as follows:
ddply(df, .(Condition), summarise,
N = length(Response),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1]))
This gives:
Condition N nAccurate RT
1: 1 6 4 127.50
2: 2 6 4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy==1])),
by = Condition]
Using dplyr package:
library(dplyr)
df %>%
group_by(Condition) %>%
summarise(N = n(),
nAccurate = sum(Accuracy),
RT = mean(RT[Accuracy == 1]))
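For completeness, here is the same idea in base R (subsetting inside the per-group function); this is my own sketch, not one of the original answers:
do.call(rbind, lapply(split(df, df$Condition), function(d)
  data.frame(Condition = d$Condition[1],
             N         = nrow(d),
             nAccurate = sum(d$Accuracy),
             RT        = mean(d$RT[d$Accuracy == 1]))))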
I have a data set that includes a whole bunch of data about students, including their current school, zipcode of former residence, and a score:
students <- read.table(text = "zip school score
43050 'Hunter' 202.72974236
48227 'NYU' 338.49571519
48227 'NYU' 223.48658339
32566 'CCNY' 310.40666224
78596 'Columbia' 821.59318662
78045 'Columbia' 853.09842034
60651 'Lang' 277.48624384
32566 'Lang' 315.49753763
32566 'Lang' 80.296556533
94941 'LIU' 373.53839238
",header = TRUE,sep = "")
I want a heap of summary data about it, per school: how many students from each school are in the data set, how many unique zipcodes per school, and the average and cumulative score. I know I can get this by using tapply to create a bunch of tmp frames:
tmp.mean <- data.frame(tapply(students$score, students$school, mean))
tmp.sum <- data.frame(tapply(students$score, students$school, sum))
tmp.unique.zip <- data.frame(tapply(students$zip, students$school, function(x) length(unique(x))))
tmp.count <- data.frame(tapply(students$zip, students$school, function(x) length(x)))
Giving them better column names:
colnames(tmp.unique.zip) <- c("Unique zips")
colnames(tmp.count) <- c("Count")
colnames(tmp.mean) <- c("Mean Score")
colnames(tmp.sum) <- c("Total Score")
And using cbind to tie them all back together again:
school.stats <- cbind(tmp.mean, tmp.sum, tmp.unique.zip, tmp.count)
I think the cleaner way to do this is:
library(plyr)
school.stats <- ddply(students, .(school), summarise,
record.count=length(score),
unique.r.zips=length(unique(zip)),
mean.dist=mean(score),
total.dist=sum(score)
)
The resulting data looks about the same (actually, the ddply approach is cleaner and includes the schools as a column instead of as row names). Two questions: is there a better way to find out how many records are associated with each school? And am I using ddply efficiently here? I'm new to it.
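On the first question, plyr itself can return a per-group row count directly; a quick sketch (not from the answers below):
ddply(students, .(school), nrow)        # two columns: school and the row count (named V1)
ddply(students, .(school), summarise,   # or inside an existing summarise call
      record.count = length(school))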
If performance is an issue, you can also use data.table:
require(data.table)
tab_s<-data.table(students)
setkey(tab_s,school)
tab_s[,list(total=sum(score),
avg=mean(score),
unique.zips=length(unique(zip)),
records=length(score)),
by="school"]
school total avg unique.zips records
1: Hunter 202.7297 202.7297 1 1
2: NYU 561.9823 280.9911 1 2
3: CCNY 310.4067 310.4067 1 1
4: Columbia 1674.6916 837.3458 2 2
5: Lang 673.2803 224.4268 2 3
6: LIU 373.5384 373.5384 1 1
Comments seem to be in general agreement: this looks good.
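For reference, a dplyr sketch of the same summary; n() and n_distinct() cover the two counts directly (column names are mine):
library(dplyr)
students %>%
  group_by(school) %>%
  dplyr::summarise(record.count  = n(),          # qualified to avoid plyr masking
                   unique.r.zips = n_distinct(zip),
                   mean.score    = mean(score),
                   total.score   = sum(score))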
I have a dataframe that has two sets of data that I need to multiply for a calculation. A simple version would be
sample = data.frame(apples=c(10,20,25,30,40,NA,NA,15))
sample$oranges = c(25,60,90,86,10,67,45,10)
sample$oats = c(65,75,85,95,105,115,125,135)
sample$eggs = c(23,22,21,20,19,18,17,16)
sample$consumer =c('john','mark','luke','paul','peter','thomas','matthew','brian')
sample$mealtime = c('breakfast','lunch','lunch','snack','lunch','breakfast','snack','dinner')
library(reshape2)  # assuming reshape2 for melt()/dcast()
s1 = melt(sample, id.vars = c(5,6), measure.vars = c(1:4))
and what I'm trying to do is something along the lines of
s2 = dcast(s1, mealtime ~ ., function(x) (x[variable == 'oranges'] * x[variable == 'apples'])/sum(x[variable == 'apples']))
In practice it's a much longer data.frame and a more elaborate calculation, but the principle should be the same. Thanks -- first post to SO, so apologies for any errors.
The output would be a data frame that has mealtimes as the Id var and the apple weighted average of the orange data as the values for each mealtime.
Something along the lines of
Group.1 x
1 breakfast 1.785714
2 dinner 1.071429
3 lunch 27.500000
4 snack 18.428571
This was calculated using
sample$wa = sample$oranges*sample$apples/sum(sample$apples)
aggregate(sample$wa,by=list(sample$mealtime),sum,na.rm=T)
which feels off mathematically but was meant to be a quick kludgy approximation.
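It is off mainly because the division uses the grand total of apples rather than the per-mealtime total. A base-R sketch of the intended apple-weighted average of oranges (my own, for comparison with the answer below):
sapply(split(sample, sample$mealtime), function(d) {
  ok <- !is.na(d$apples) & !is.na(d$oranges)   # drop rows where either value is missing
  weighted.mean(d$oranges[ok], d$apples[ok])
})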
This is a much better task for plyr than it is for reshape.
library(plyr)
s1 <- ddply(sample, .(mealtime), function(x) {return(sum(x$apples, x$oranges))})
And now you have clarified the output:
ddply(sample,.(mealtime), summarize,
wavg.oranges = sum(apples * oranges, na.rm=TRUE) / sum(apples, na.rm=TRUE))
# mealtime wavg.oranges
# 1 breakfast 25.00000
# 2 dinner 10.00000
# 3 lunch 45.29412
# 4 snack 86.00000
Does anyone know a slick way to order the results coming out of a ddply summarise operation?
This is what I'm doing to get the output ordered by descending depth.
ddims <- ddply(diamonds, .(color), summarise, depth = mean(depth), table = mean(table))
ddims <- ddims[order(-ddims$depth),]
With output...
> ddims
color depth table
7 J 61.88722 57.81239
6 I 61.84639 57.57728
5 H 61.83685 57.51781
4 G 61.75711 57.28863
1 D 61.69813 57.40459
3 F 61.69458 57.43354
2 E 61.66209 57.49120
Not too ugly, but I'm hoping for a way to do it nicely within ddply(). Anyone know how?
Hadley's ggplot2 book has this example for ddply and subset but it's not actually sorting the output, just selecting the two smallest diamonds per group.
ddply(diamonds, .(color), subset, order(carat) <= 2)
I'll use this occasion to advertise a bit for data.table, which is faster to run and (in my perception) at least as elegant to write:
library(data.table)
ddims <- data.table(diamonds)
system.time(ddims <- ddims[, list(depth=mean(depth), table=mean(table)), by=color][order(depth)])
user system elapsed
0.003 0.000 0.004
By contrast, without ordering, your ddply code already takes 30 times longer:
user system elapsed
0.106 0.010 0.119
With all the respect I have for Hadley's excellent work, e.g. on ggplot2, and general awesomeness, I must confess that for me, data.table entirely replaced ddply -- for speed reasons.
Yes, to sort you can just nest the ddply in another ddply. Here's how you would use ddply to sort on one column, for example your table column:
ddimsSortedTable <- ddply(ddply(diamonds, .(color),
summarise, depth = mean(depth), table = mean(table)), .(table))
color depth table
1 G 61.75711 57.28863
2 D 61.69813 57.40459
3 F 61.69458 57.43354
4 E 61.66209 57.49120
5 H 61.83685 57.51781
6 I 61.84639 57.57728
7 J 61.88722 57.81239
If you are using dplyr, I would recommend taking advantage of the %.% operator (the chaining operator in early versions of dplyr), which makes for more intuitive code.
data(diamonds, package = 'ggplot2')
library(dplyr)
diamonds %.%
group_by(color) %.%
summarise(
depth = mean(depth),
table = mean(table)
) %.%
arrange(desc(depth))
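Note that %.% was later deprecated in favour of the magrittr pipe; with current dplyr the same query would be written as:
diamonds %>%
  group_by(color) %>%
  summarise(depth = mean(depth), table = mean(table)) %>%
  arrange(desc(depth))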
A bit late to the party, but things might be a bit different with dplyr. Borrowing crayola's solution for data.table:
library(microbenchmark)
dat1 <- microbenchmark(
  dtbl <- data.table(diamonds)[, list(depth = mean(depth), table = mean(table)), by = color][order(-depth)],
  dplyr_dtbl <- arrange(summarise(group_by(tbl_dt(diamonds), color), depth = mean(depth), table = mean(table)), -depth),
  dplyr_dtfr <- arrange(summarise(group_by(tbl_df(diamonds), color), depth = mean(depth), table = mean(table)), -depth),
  times = 20,
  unit = "ms"
)
The results show that dplyr with tbl_dt is a bit slower than the data.table approach. However, dplyr with data.frame is faster:
expr min lq median uq max neval
data.table 9.606571 10.968881 11.958644 12.675205 14.334525 20
dplyr_data.table 13.553307 15.721261 17.494500 19.544840 79.771768 20
dplyr_data.frame 4.643799 5.148327 5.887468 6.537321 7.043286 20
Note: I have obviously changed the expression names so that the microbenchmark results are more readable.