How to use ddply to get weighted-mean of class in dataframe? - r

I'm new to plyr and want to take the weighted mean of values within a class to reshape a dataframe for multiple variables. Using the following code, I know how to do this for one variable, such as x2:
library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE),
x=rnorm(20), x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class),function(x) data.frame(weighted.mean(x$x2, x$weights)))
However, I would like the code to create a new data frame for x and x2 (and any number of variables in the frame). Does anybody know how to do this? Thanks

You might find what you want in the ?summarise function. I can replicate your code with summarise as follows:
library(plyr)
set.seed(123)
frame <- data.frame(class=sample(LETTERS[1:5], replace = TRUE), x=rnorm(20),
x2 = rnorm(20), weights=rnorm(20))
ddply(frame, .(class), summarise,
x2 = weighted.mean(x2, weights))
To do this for x as well, just add that line to be passed into the summarise function:
ddply(frame, .(class), summarise,
x = weighted.mean(x, weights),
x2 = weighted.mean(x2, weights))
Edit: If you want to do an operation over many columns, use colwise or numcolwise instead of summarise, or do summarise on a melted data frame with the reshape2 package, then cast back to the original form. Here's an example with colwise:
wmean.vars <- c("x", "x2")
ddply(frame, .(class), function(x)
colwise(weighted.mean, w = x$weights)(x[wmean.vars]))
Finally, if you don't like having to specify wmean.vars, you can also do:
ddply(frame, .(class), function(x)
numcolwise(weighted.mean, w = x$weights)(x[!colnames(x) %in% "weights"]))
which will compute a weighted-average for every numerical field, excluding the weights themselves.
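For readers on a newer toolchain, the same per-class weighted mean can be sketched with dplyr (plyr has since been retired in favour of it). This is only an illustration, assuming dplyr 1.0 or later for across(); the positive runif weights and the 20-row class sample are choices made here, not the question's data.

```r
library(dplyr)

set.seed(123)
# Assumption: positive weights (runif) and one class label drawn per row.
frame <- data.frame(class = sample(LETTERS[1:5], 20, replace = TRUE),
                    x = rnorm(20), x2 = rnorm(20), weights = runif(20))

# across() applies the lambda to every selected column; `weights` is
# looked up inside the current group, so the lengths always match.
wm <- frame %>%
  group_by(class) %>%
  summarise(across(c(x, x2), ~ weighted.mean(.x, weights)))
wm
```

To cover every numeric column except the weights, the selection c(x, x2) could be swapped for something like where(is.numeric) & !weights.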

A data.table answer for fun, which also doesn't require specifying all the variables individually.
library(data.table)
frame <- as.data.table(frame)
keynames <- setdiff(names(frame),c("class","weights"))
frame[, lapply(.SD,weighted.mean,w=weights), by=class, .SDcols=keynames]
Result:
   class          x         x2
1:     B  0.1390808 -1.7605032
2:     D  1.3585759 -0.1493795
3:     C -0.6502627  0.2530720
4:     E  2.6657227 -3.7607866

Related

Assign multiple columns when using mutate in dtplyr

Is there a way of getting my data table to look like my target table when using dtplyr and mutate?
A Dummy table
library(data.table)
library(dtplyr)
library(dplyr)
id <- rep(c("A","B"),each=3)
x1 <- rnorm(6)
x2 <- rnorm(6)
dat <- data.table(id,x1,x2)
A dummy function
my_fun <- function(x,y){
cbind(a = x+10,b=y-10)
}
And I would like to use this type of syntax
dat |>
group_by(id) |>
mutate(my_fun(x = x1,y = x2))
Where the end result will look like this
data.table(id, x1, x2, a=x1+10,b=x2-10)
I would like to have a generic solution that works for functions with variable number of columns returned but is that possible?
I think we would need more information about how this would work with a variable number of columns:
Are the columns named in a specific way?
Do the output columns need to be named in a specific way?
Are there standard calculations being done to each column dependent on name? E.g., x1 = +10 and x2 = -10?
At any rate, here is a solution that works with your provided data to return the data.table you specified:
my_fun <- function(data, ...){
dots <- list(...)
cbind(data,
a = data[[dots$x]] + 10,
b = data[[dots$y]] - 10
)
}
dat |>
my_fun(x = "x1", y = "x2")
   id         x1         x2         a          b
1:  A  0.8485309 -0.3532837 10.848531 -10.353284
2:  A  0.7248478 -1.6561564 10.724848 -11.656156
3:  A -1.3629114  0.4210139  8.637089  -9.578986
4:  B -1.7934827  0.6717033  8.206517  -9.328297
5:  B -1.0971890 -0.3008422  8.902811 -10.300842
6:  B  0.4396630 -0.7447419 10.439663 -10.744742
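If staying inside dtplyr's mutate is not a hard requirement, one possible sketch uses data.table's own := to assign however many columns the function returns, grouped by id. It assumes (this is not from the question) that my_fun is rewritten to return a named list, and that its output names can be discovered by calling it once on zero-length input.

```r
library(data.table)

set.seed(1)
dat <- data.table(id = rep(c("A", "B"), each = 3), x1 = rnorm(6), x2 = rnorm(6))

# Assumption: the function returns a named list, one element per new column.
my_fun <- function(x, y) list(a = x + 10, b = y - 10)

# Ask the function for its output names using empty inputs,
# so the column names are not hard-coded at the call site.
new_cols <- names(my_fun(dat$x1[0], dat$x2[0]))

# Assign all returned columns by reference, group by group.
dat[, (new_cols) := my_fun(x1, x2), by = id]
dat
```

Because the names come from the function itself, adding a third element to the returned list would add a third column without changing the last two lines.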

using t.test within data.table on multiple columns

I am trying to run a t.test() on multiple columns of data within 'j' in a data.table. I've found a way that works, but isn't very elegant, and I feel like there's probably a more concise way using .SDcols, but haven't had any luck looking through here, or the data.table vignette. If this has been asked previously I apologize and please point me in the right direction.
My data.table has essentially the following format
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
y = c(rnorm(6, mean = 100, sd = 30)),
z = c(rnorm(6, mean = 10, sd = 3)),
group = rep(c('One', 'Two'), 3))
When I want to run a t.test comparing the values of y in group One and group Two, it's very straightforward:
DT[,t.test(y~group)]
If I want to get an output for both y and z the following works, but is clunky and inelegant. And with my actual data, I'm trying to do this over many columns so it would be more time consuming to type out each iteration I would like to run.
DT[,.(t.test(y~group), t.test(z~group))]
In the data.table vignette, using a function over a specific subset of columns is achieved by
DT[,lapply(.SD, mean), .SDcols = c('y', 'z')]
However replacing mean with t.test yields a one sample t.test, while I'm trying to get a two sample t.test. I've tried:
DT[,lapply(.SD, t.test, formula = .SDcols ~ group, data = DT), .SDcols = c('y', 'z')]
But this gives me a comparison between y and z, not both the comparisons of y~group and z~group.
I've tried several versions of lapply with a custom function to get the output I want, but I won't make anyone read through my walls of unsuccessful code. Needless to say I have been unable to get that to work.
Question:
Is there a way via lapply() or function() or a way currently unknown to me, to get t.test to run over multiple columns of data within 'j' in a data.table?
Thanks in advance for your help,
Chris
To pull together the parts of the answer and to rearrange to put the name in the first column (if desired for nicer printing):
library(data.table)
DT <- data.table(name = c('a', 'b', 'c', 'a', 'b', 'c'),
y = c(rnorm(6, mean = 100, sd = 30)),
z = c(rnorm(6, mean = 10, sd = 3)),
group = rep(c('One', 'Two'), 3))
result <-
DT[,lapply(.SD, function(x) t.test(x ~ group)), .SDcols = y:z][
,ttname:=names(t.test(1:5))][ # add names
,.(ttname,y,z)] # put names in first column
result
# ttname y z
# 1: statistic 0.1391646 0.1295093
# 2: parameter 3.468876 3.559917
# 3: p.value 0.8970165 0.9039359
# 4: conf.int -99.61786,109.47358 -8.209637, 8.972439
# 5: estimate 110.7286,105.8008 11.15414,10.77274
# 6: null.value 0 0
# 7: stderr 35.41031 2.94497
# 8: alternative two.sided two.sided
# 9: method Welch Two Sample t-test Welch Two Sample t-test
# 10: data.name x by group x by group
Here is my solution, wrapped as a function. In the accepted answer, I didn't like that the test outputs were rows and the variables were columns; I prefer it the other way around, which I think is easier to read.
I also added an argument for rounding, and one that by default prints only the most important information: the p-value and the test statistic. The function requires purrr. The grouping variable and the variables to test are passed as character strings, e.g. dt_ttest(dtx, 'varname', c('z','y')).
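library(data.table)
library(purrr)
dt_ttest <- function(dtx, grp, thecols, decimals=3, small=TRUE, ...) {
# one two-sample t.test per column in thecols, split by the grouping variable
x1 <- dtx[, map(.SD, ~ t.test(.x ~ get(grp))), .SDcols = thecols]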
x2 <- t(x1) %>% data.table()
setnames(x2, names(t.test(1:2)))
x2 <- x2[, var := thecols] [, !'data.name']
tcols <- c('p.value', 'statistic', 'stderr', 'null.value', 'parameter', 'method', 'alternative')
x2[, (tcols) := map(.SD, unlist), .SDcols=tcols ]
x2
thecols2 <- keep(x2, is.numeric) %>% names()
x2[, (thecols2) := map(.SD, ~ round(.x, decimals)), .SDcols=thecols2 ]
# go one level deeper to round the two list cols
thecols3 <- c('conf.int', 'estimate')
x2[, (thecols3) := modify_depth(.SD, 2, ~ round(.x, decimals)), .SDcols=thecols3 ]
# set order
setcolorder(x2, c('var', 'p.value', 'statistic', 'stderr', 'conf.int', 'estimate', 'parameter', 'method', 'alternative') )
if( small) x2[, .(var, p.value, statistic)] else x2[]
}
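A shorter way to get one row per tested variable, without transposing, might look like the sketch below. It keeps only the test statistic and p-value, uses made-up data with a fixed seed, and the names test_cols and res are illustrative rather than taken from either answer.

```r
library(data.table)

set.seed(42)
DT <- data.table(y = rnorm(6, mean = 100, sd = 30),
                 z = rnorm(6, mean = 10, sd = 3),
                 group = rep(c("One", "Two"), 3))

test_cols <- c("y", "z")

# Run one Welch two-sample t-test per column and keep the headline numbers.
res <- rbindlist(lapply(test_cols, function(col) {
  tt <- t.test(DT[[col]] ~ DT$group)
  data.table(var = col,
             statistic = unname(tt$statistic),
             p.value = tt$p.value)
}))
res
```

Further elements of the htest object (for example tt$conf.int[1] and tt$conf.int[2]) can be added as extra columns in the same way.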

Transform multiple columns with a function that uses different arguments per column

I write and review a fair amount of R code like this:
df <- data.frame(replicate(10, sample(0:5, 10, rep = TRUE)))
my.func <- function(col, y) {col %in% y}
df$X2 <- my.func(df$X2, c(1,2))
df$X3 <- my.func(df$X3, c(4,5))
df$X5 <- my.func(df$X5, c(1,2))
df$X6 <- my.func(df$X6, c(4,5))
df$X8 <- my.func(df$X8, c(4,5))
df$X9 <- my.func(df$X9, c(1,2))
df$X10 <- my.func(df$X10, c(1))
That is, certain columns in a data.frame (or data.table) are transformed using a function, where one argument is a column and the other is some arbitrary, somewhat-unique-to-that-column value.
What's a more concise way to make such transformations?
I've tried using data.table's set (:=) operator, which makes things slightly cleaner, but still each column name must appear twice and the function must appear once for each column.
A concise way would be Map, with the dataset ('df') and a list of vectors as the input arguments; each vector is passed to my.func as its second argument. Here, each column of the data.frame is one unit, matched with the corresponding vector in the list.
df[] <- Map(my.func, df, list(1:2, 4:5, 3:4))
NOTE: The OP's function or a minimal reproducible example is not provided, so it is not tested
NOTE2: Here, the assumption is that the number of columns is 3. If it is more than 3, increase the length of the list as well
The above can also be converted to data.table syntax
library(data.table)
setDT(df)[, names(df) := Map(my.func, .SD, list(1:2, 4:5, 3:4))]
If only a subset of columns needs to be changed, specify the columns in .SDcols, and also change the names(df) to the subset of names
Or with tidyverse
library(tidyverse)
map2_dfc(df, list(1:2, 4:5, 3:4), my.func)
OP's request from a comment:
make the association between column names and function argument(s) for those columns more explicit
Adjusting the Map approach seen in the other answers:
yL <- list(X2 = 1:2, X3 = 4:5, X5 = 3:4, X6 = 4:5, X8 = 4:5, X9 = 1:2, X10 = 1)
df[names(yL)] <- Map(my.func, df[names(yL)], y = yL)
With data.table:
# this saves you from writing DT twice
DT[, names(yL) := Map(my.func, .SD, y = yL), .SDcols=names(yL)]

filter two data frames by the same group variables in dplyr

On many occasions, after grouping a data frame by some variables, I want to apply a function that uses data from another data frame that is grouped by the same variables. The best solution I found is to use semi_join inside the function as follows:
d1 <- data.frame(model = c(1,1,2,2), x = runif(4) )
d2 <- data.frame(model=c(1,1,1,2,2,2), y = runif(6) )
myfun <- function(df1, df2) {
subsetdf2 <- semi_join(df2, df1)
data.frame(z = sum(df1$x) - sum(subsetdf2$y)) # trivial manipulation just to exemplify
}
d1 %>% group_by(model) %>% do(myfun(., d2))
The problem is that semi_join returns 'Joining by...' messages and, as I am using the function for bootstrapping, I get many messages that flood the console. So, is there any way to reduce the verbosity of joins? Do you know a more elegant way to do something like this?
P.S. I asked a similar question a few years ago for plyr: subset inside a function by the variables specified in ddply
If all you want to do is stop the 'Joining by: ' statement, you just need to specify what column you are joining on with the by argument.
For example:
semi_join(d2, d1, by="model")
EDIT - As an alternative to using semi_join you can use a base solution. As the group_by function is passing the data by groups, you can filter using a simple indexing statement. This will avoid the need for an additional parameter. This also currently assumes that the column of interest is the first column.
myfun <- function(df1, df2) {
subsetdf2 <- df2[df2[,1] %in% unique(df1[,1]),]
data.frame(z = sum(df1$x) - sum(subsetdf2$y)) # trivial manipulation just to exemplify
}
I adapted the solution of @cdeterman. It is a bit redundant though.
d1 <- data.frame(model = c(1,1,2,2), x = runif(4) )
d2 <- data.frame(model=c(1,1,1,2,2,2), y = runif(6) )
myfun <- function(df1, df2, gv) {
subsetdf2 <- semi_join(df2, df1, by = gv)
data.frame(z = sum(df1$x) - sum(subsetdf2$y)) # trivial manipulation just to exemplify
}
group_var <- 'model'
d1 %>% group_by_(group_var) %>% do(myfun(., d2,group_var))
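When the per-group computation can be broken into separate summaries of each data frame, as in the trivial example here, another option is to summarise both frames by the grouping variable and join the results once. This avoids calling a join inside every group. The sketch below uses fixed numbers instead of runif so the result is easy to check; s1, s2 and res are illustrative names, and it will not fit functions that need the raw rows of both frames at the same time.

```r
library(dplyr)

d1 <- data.frame(model = c(1, 1, 2, 2), x = c(1, 2, 3, 4))
d2 <- data.frame(model = c(1, 1, 1, 2, 2, 2), y = c(1, 1, 1, 2, 2, 2))

# Summarise each frame by the shared grouping variable.
s1 <- d1 %>% group_by(model) %>% summarise(sx = sum(x))
s2 <- d2 %>% group_by(model) %>% summarise(sy = sum(y))

# One explicit join (by = "model" keeps it silent), then combine.
res <- inner_join(s1, s2, by = "model") %>%
  mutate(z = sx - sy)
res
```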

Aggregate an entire data frame with Weighted Mean

I'm trying to aggregate a data frame using the function weighted.mean and continue to get an error. My data looks like this:
dat <- data.frame(date, nWords, v1, v2, v3, v4 ...)
I tried something like:
aggregate(dat, by = list(dat$date), weighted.mean, w = dat$nWords)
but got
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
There is another thread which answers this question using plyr but for only one variable, I want to aggregate all my variables that way.
You can do it with data.table:
library(data.table)
#set up your data
dat <- data.frame(date = c("2012-01-01","2012-01-01","2012-01-01","2013-01-01",
"2013-01-01","2013-01-01","2014-01-01","2014-01-01","2014-01-01"),
nwords = 1:9, v1 = rnorm(9), v2 = rnorm(9), v3 = rnorm(9))
#make it into a data.table
dat = data.table(dat, key = "date")
# grab the column names we want, generalized for v1:vWhatever
value_cols = colnames(dat)[-c(1,2)]
#get the weighted mean by date for each column
for(n in value_cols){
dat[,
(n) := weighted.mean(get(n), nwords), # (n) assigns to the column named by n
by = date]
}
#keep only the unique dates and weighted means
wms = unique(dat[,nwords:=NULL])
Try using by:
# your numeric data
x <- 111:120
# the weights
ww <- 10:1
mat <- cbind(x, ww)
# the group variable (in your case is 'date')
y <- c(rep("A", 7), rep("B", 3))
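# pass each group's rows to weighted.mean, using ww as the weights
by(data=mat, y, function(d) weighted.mean(d$x, d$ww))
If you want the results in a data frame, I suggest the plyr package:
plyr::ddply(data.frame(mat, y), "y", plyr::summarise, wmean = weighted.mean(x, ww))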
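The error in the question arises because aggregate() hands FUN one column of one group at a time, while w = dat$nWords is still the full-length weight vector. A base R sketch that keeps values and weights together is to split the data frame by date and compute the weighted mean of every value column within each piece. The small data set and the names value_cols, per_date and res are made up for illustration.

```r
dat <- data.frame(date = rep(c("2012-01-01", "2013-01-01"), each = 3),
                  nWords = 1:6,
                  v1 = c(1, 2, 3, 4, 5, 6),
                  v2 = c(6, 5, 4, 3, 2, 1))

# Everything except the grouping and weight columns gets averaged.
value_cols <- setdiff(names(dat), c("date", "nWords"))

# For each date, weight every value column by that date's nWords.
per_date <- lapply(split(dat, dat$date), function(d) {
  data.frame(date = d$date[1],
             lapply(d[value_cols], weighted.mean, w = d$nWords))
})

res <- do.call(rbind, per_date)
rownames(res) <- NULL
res
```

Since each piece returned by split() carries its own nWords, the weights always have the same length as the values they are applied to.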
