plyr ddply and summarise use in R - r

Hi I want to avoid using loops and so want to use something from plyr to help solve my problem.
I would like to create a function that gets the sum of a specifically chosen column for each factor from a dataframe.
So if we have the following example data...
df <- data.frame(cbind(x=rnorm(100),y=rnorm(100),z=rnorm(100),f=sample(1:10,100, replace=TRUE)))
df$f <- as.factor(df$f)
i.e. I would like something like:
foo <- function(df.obj,colname){
some code
}
where df.obj would be the df variable above and the colname argument could be any of x, y or z.
I would like the output of the function to have a column of the unique factor levels (in the above case 1:10) and the sums of the values in the chosen column (x, say) for each level.
I expect the solution to be quite simple, probably using ddply or summarise somehow, but I can't work out how to do it so that I can pass the column name as an argument.
Thanks

Is this what you're after?
> ddply(df, .(f), colwise(sum))
f x y z
1 1 -0.4190284 2.61101681 1.2280026
2 2 1.1063977 2.40006922 4.9550079
3 3 0.4498366 -4.00610558 0.9964754
4 4 1.9325488 -2.81241212 -3.1185574
5 5 -4.1077670 -1.01232884 -3.9852388
6 6 -1.0488003 -2.42924689 3.5273636
7 7 2.2999306 0.85930085 -0.6245167
8 8 -4.8105311 -6.81352238 -2.1223436
9 9 -2.8187083 5.03391770 1.6433896
10 10 5.1323666 -0.06192382 1.8978994
Edit: correct answer as supplied by TS:
foo <- function(df.obj, colname) {
  # use the df.obj argument rather than the global df
  ddply(df.obj, .(f), colwise(sum))[, c("f", colname)]
}
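A variant that sums only the requested column, instead of summing every column and subsetting afterwards, might look like the sketch below. The name foo2 is hypothetical, and the sketch assumes the grouping column is always called f, as in the example data.

```r
library(plyr)

set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))

# foo2 is a hypothetical name; the grouping column is assumed to be "f".
foo2 <- function(df.obj, colname) {
  ddply(df.obj, "f", function(piece) {
    # piece holds the rows for one level of f
    out <- data.frame(sum(piece[[colname]]))
    names(out) <- colname
    out
  })
}

res_x <- foo2(df, "x")
head(res_x)
```

Because the column is looked up with [[colname]], the name can be passed as an ordinary character string.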

This seems a perfect fit for data.table and the lapply(.SD,FUN) and .SDcols arguments
.SD is a data.table containing the Subset of x's Data for each group, excluding the group column(s).
.SDcols is a vector containing the names of the columns to which you wish to apply the function (FUN)
An example
Set up the data.table
library(data.table)
DT <- as.data.table(df)
The sums of x,y,z columns by f
DT[, lapply(.SD, sum), by = f, .SDcols = c("x", "y", "z")]
## f x y z
## 1: 4 4.8041 3.9788 1.2519
## 2: 2 1.1255 -0.8147 2.9053
## 3: 3 0.9699 -0.1550 -8.5876
## 4: 9 2.2685 -1.2734 1.0506
## 5: 5 -0.1282 -2.5512 5.0668
## 6: 10 -2.7397 0.5290 -0.3638
## 7: 1 2.9544 -3.1139 -1.3884
## 8: 8 -4.3488 0.6894 1.4195
## 9: 7 2.3152 0.6474 2.7183
## 10: 6 -0.1569 1.0142 0.9156
The sums of x, and z columns by f
DT[, lapply(.SD, sum), by = f, .SDcols = c("x", "z")]
## f x z
## 1: 4 4.8041 1.2519
## 2: 2 1.1255 2.9053
## 3: 3 0.9699 -8.5876
## 4: 9 2.2685 1.0506
## 5: 5 -0.1282 5.0668
## 6: 10 -2.7397 -0.3638
## 7: 1 2.9544 -1.3884
## 8: 8 -4.3488 1.4195
## 9: 7 2.3152 2.7183
## 10: 6 -0.1569 0.9156
Examples calculating the mean
DT[, lapply(.SD, mean), by = f, .SDcols = c("x", "y", "z")]
## f x y z
## 1: 4 0.36955 0.30606 0.09630
## 2: 2 0.10232 -0.07407 0.26412
## 3: 3 0.07461 -0.01193 -0.66059
## 4: 9 0.15123 -0.08489 0.07004
## 5: 5 -0.01425 -0.28346 0.56298
## 6: 10 -0.21075 0.04069 -0.02799
## 7: 1 0.29544 -0.31139 -0.13884
## 8: 8 -0.54360 0.08617 0.17744
## 9: 7 0.38586 0.10790 0.45305
## 10: 6 -0.07844 0.50710 0.45782
DT[, lapply(.SD, mean), by = f, .SDcols = c("x", "z")]
## f x z
## 1: 4 0.36955 0.09630
## 2: 2 0.10232 0.26412
## 3: 3 0.07461 -0.66059
## 4: 9 0.15123 0.07004
## 5: 5 -0.01425 0.56298
## 6: 10 -0.21075 -0.02799
## 7: 1 0.29544 -0.13884
## 8: 8 -0.54360 0.17744
## 9: 7 0.38586 0.45305
## 10: 6 -0.07844 0.45782
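To tie this back to the original request for a function with the column name as an argument, the calls above could be wrapped roughly as follows. This is only a sketch: foo_dt and its fun argument are hypothetical names, and the grouping column is assumed to be f.

```r
library(data.table)

set.seed(1)
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))

# foo_dt is a hypothetical name; the grouping column is assumed to be "f".
foo_dt <- function(df.obj, colname, fun = sum) {
  DT <- as.data.table(df.obj)
  # .SDcols restricts .SD to the requested column(s);
  # keyby groups by f and sorts the result by f
  DT[, lapply(.SD, fun), keyby = f, .SDcols = colname]
}

sum_x   <- foo_dt(df, "x")
mean_xz <- foo_dt(df, c("x", "z"), mean)
```

Since .SDcols accepts a character vector, the same function handles one column or several.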

I haven't got enough rep to comment so will have to ask in answer form - why do you want to avoid using loops in R?
EDIT: Anyway using plyr I'd use count()
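For what it's worth, one way count() could be used here is through its wt_var argument, which makes it sum a weight column per group instead of counting rows. A minimal sketch, assuming the grouping column is f and the column to sum is x:

```r
library(plyr)

set.seed(1)
df <- data.frame(x = rnorm(100),
                 f = factor(sample(1:10, 100, replace = TRUE)))

# With wt_var set, count() sums that column within each level of f
# instead of counting rows; the result column is named "freq".
sums_x <- count(df, vars = "f", wt_var = "x")
sums_x
```

Both vars and wt_var are given as strings, so they can come straight from function arguments.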

Related

Avoid data.table forcing lists in j to return a column

Let's say that I have the following data.table
library(data.table)
set.seed(20200210)
data <- data.table(
x = 1:3,
y = list(
data.table(a=4:6, b=runif(3)),
data.table(a=7:10, b=runif(4)),
data.table(a=11:15, b=runif(5))
)
)
data[]
## x y
## 1: 1 <data.table>
## 2: 2 <data.table>
## 3: 3 <data.table>
When we look in the y's data.tables, we obtain the following
data[, y]
## [[1]]
## a b
## 1: 4 0.1019356
## 2: 5 0.5566203
## 3: 6 0.7020533
##
## [[2]]
## a b
## 1: 7 0.6080464
## 2: 8 0.4421555
## 3: 9 0.5070702
## 4: 10 0.8181770
##
## [[3]]
## a b
## 1: 11 0.8444425
## 2: 12 0.5701193
## 3: 13 0.8412783
## 4: 14 0.5692414
## 5: 15 0.8402453
Up until now, everything works fine. What I want to do next is to perform the operation a+b on each data.table and retrieve the result in a list using the data.table syntax. Intuitively, I would have written the following
data[, lapply(y, function(z){
z[, a+b]
})]
## V1 V2 V3
## 1: 4.101936 7.608046 11.84444
## 2: 5.556620 8.442156 12.57012
## 3: 6.702053 9.507070 13.84128
## 4: 4.101936 10.818177 14.56924
## 5: 5.556620 7.608046 15.84025
## Warning messages:
## 1: In as.data.table.list(jval, .named = NULL) :
## Item 1 has 3 rows but longest item has 5; recycled with remainder.
## 2: In as.data.table.list(jval, .named = NULL) :
## Item 2 has 4 rows but longest item has 5; recycled with remainder.
but it won't work. What I understand is that, since my lapply will return a list and that it's defined inside data.table[], it will force the return to be a data.table column, even if the result is of different length. For me, this behaviour is not desirable. I think it should simplify the result to a column only if the lengths match.
However, the following will actually work
lapply(data$y, function(z){
z[, a+b]
})
## [[1]]
## [1] 4.101936 5.556620 6.702053
##
## [[2]]
## [1] 7.608046 8.442156 9.507070 10.818177
##
## [[3]]
## [1] 11.84444 12.57012 13.84128 14.56924 15.84025
but I'd rather use the data.table syntax if it's possible to access the data object.
Any hint?
It is trying to convert to a single column, but the list elements are of different length. We can wrap it in a list
data[, lapply(y, function(z) list(z[, a + b]))]
Or if we need the same structure as in the input, wrap outside the lapply
out <- data[, list(lapply(y, function(z) z[, .(a +b)]))]
out
# V1
#1: <data.table>
#2: <data.table>
#3: <data.table>
Or it can be also
data[, .(lapply(y, function(z) z[, a +b]))]
# V1
#1: 4.101936,5.556620,6.702053
#2: 7.608046, 8.442156, 9.507070,10.818177
#3: 11.84444,12.57012,13.84128,14.56924,15.84025
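A small deterministic version of the last form may make the behaviour easier to check. This is a sketch with made-up data (two nested tables of different sizes); the column name s is arbitrary.

```r
library(data.table)

data <- data.table(
  x = 1:2,
  y = list(
    data.table(a = 1:2, b = c(0.5, 0.5)),
    data.table(a = 1:3, b = c(1, 1, 1))
  )
)

# Wrapping the lapply() result in .() makes it a single list column,
# so the vectors keep their own lengths instead of being recycled.
res <- data[, .(s = lapply(y, function(z) z[, a + b]))]
res$s
```

The result has one row per nested table, and each cell of s holds a numeric vector of that table's length.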

data.table version of split and repeat

I'm trying to convert some code to use data.table. In this situation, I need to create a graph structure from columns in a data.frame/data.table where rows have information containing the id and depth in the tree. My normal approach is a split/apply/combine, so I feel like it should be possible using by and some expression in data.table but I can't get it.
Here is an example,
## A data.table like this with ids and levels
dat <- data.table(level = rep(1:4, times=2^(0:3)), id = 1:15)
## my normal way, not using data table would involve a split and rep
levs <- split(dat$id, dat$level)
nodes <- unlist(mapply(function(a,b) rep(a, length.out=b), head(levs, -1L),
tail(lengths(levs), -1L)), use.names = FALSE)
## Desired result
res <- cbind(nodes, dat$id[-1L])
## To visualize
library(igraph)
plot(graph_from_edgelist(cbind(nodes, dat$id[-1L])), layout=layout.reingold.tilford,
asp=0.6)
Edit
I think the problem I'm having is that when I do a by=level I need information from two levels to get the proper repeat length.
Here's another way of getting your nodes column:
dat[, .N, by = .(level = level - 1)][
dat, on = 'level', nomatch = 0][
, .(nodes = rep(id, length.out = N[1])), by = level]
# level nodes
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 3
# 5: 2 2
# 6: 2 3
# 7: 3 4
# 8: 3 5
# 9: 3 6
#10: 3 7
#11: 3 4
#12: 3 5
#13: 3 6
#14: 3 7
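To get from the nodes column to the edge list in the question, the column can be bound to the ids of all non-root rows. The sketch below reruns both approaches on the example data and compares them; the names nodes_dt, nodes_base and edges are only illustrative.

```r
library(data.table)

dat <- data.table(level = rep(1:4, times = 2^(0:3)), id = 1:15)

# data.table route: count the rows of the next level, join the counts
# back onto the current level, then repeat the ids to that length
nodes_dt <- dat[, .N, by = .(level = level - 1)][
  dat, on = "level", nomatch = 0][
  , .(nodes = rep(id, length.out = N[1])), by = level]$nodes

# base route from the question, for comparison
levs <- split(dat$id, dat$level)
nodes_base <- unlist(mapply(function(a, b) rep(a, length.out = b),
                            head(levs, -1L), tail(lengths(levs), -1L)),
                     use.names = FALSE)

# edge list: parent node in the first column, child id in the second
edges <- cbind(from = nodes_dt, to = dat$id[-1L])
```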

Use of lapply .SD in data.table R

I am not very clear about use of .SD and by.
For instance, does the below snippet mean: 'change all the columns in DT to factor except A and B?' It also says in data.table manual: ".SD refers to the Subset of the data.table for each group (excluding the grouping columns)" - so columns A and B are excluded?
DT = DT[ ,lapply(.SD, as.factor), by=.(A,B)]
However, I also read that by means like 'group by' in SQL when you do aggregation. For instance, if I would like to sum (like colsum in SQL) over all the columns except A and B do I still use something similar? Or in this case, does the below code mean to take the sum and group by values in columns A and B? (take sum and group by A,B as in SQL)
DT[,lapply(.SD,sum),by=.(A,B)]
Then how do I do a simple colsum over all the columns except A and B?
Just to illustrate the comments above with an example, let's take
set.seed(10238)
# A and B are the "id" variables within which the
# "data" variables C and D vary meaningfully
DT = data.table(
A = rep(1:3, each = 5L),
B = rep(1:5, 3L),
C = sample(15L),
D = sample(15L)
)
DT
# A B C D
# 1: 1 1 14 11
# 2: 1 2 3 8
# 3: 1 3 15 1
# 4: 1 4 1 14
# 5: 1 5 5 9
# 6: 2 1 7 13
# 7: 2 2 2 12
# 8: 2 3 8 6
# 9: 2 4 9 15
# 10: 2 5 4 3
# 11: 3 1 6 5
# 12: 3 2 12 10
# 13: 3 3 10 4
# 14: 3 4 13 7
# 15: 3 5 11 2
Compare the following:
#Sum all columns
DT[ , lapply(.SD, sum)]
# A B C D
# 1: 30 45 120 120
#Sum all columns EXCEPT A, grouping BY A
DT[ , lapply(.SD, sum), by = A]
# A B C D
# 1: 1 15 38 43
# 2: 2 15 30 49
# 3: 3 15 52 28
#Sum all columns EXCEPT A
DT[ , lapply(.SD, sum), .SDcols = !"A"]
# B C D
# 1: 45 120 120
#Sum all columns EXCEPT A, grouping BY B
DT[ , lapply(.SD, sum), by = B, .SDcols = !"A"]
# B C D
# 1: 1 27 29
# 2: 2 17 30
# 3: 3 33 11
# 4: 4 23 36
# 5: 5 20 14
A few notes:
You said "does the below snippet... change all the columns in DT..."
The answer is no, and this is very important for data.table. The object returned is a new data.table, and all of the columns in DT are exactly as they were before running the code.
You mentioned wanting to change the column types
Referring to the point above again, note that your code (DT[ , lapply(.SD, as.factor)]) returns a new data.table and does not change DT at all. One (incorrect) way to do this, which is done with data.frames in base, is to overwrite the old data.table with the new data.table you've returned, i.e., DT = DT[ , lapply(.SD, as.factor)].
This is wasteful because it involves creating copies of DT, which can be an efficiency killer when DT is large. The correct data.table approach to this problem is to update the columns by reference using `:=`, e.g., DT[ , names(DT) := lapply(.SD, as.factor)], which creates no copies of your data. See data.table's reference semantics vignette for more on this.
You mentioned comparing efficiency of lapply(.SD, sum) to that of colSums. sum is internally optimized in data.table (you can note this is true from the output of adding the verbose = TRUE argument within []); to see this in action, let's beef up your DT a bit and run a benchmark (timings are shown in the comments at the end):
library(data.table)
set.seed(12039)
nn = 1e7; kk = seq(100L)
DT = setDT(replicate(26L, sample(kk, nn, TRUE), simplify=FALSE))
DT[ , LETTERS[1:2] := .(sample(100L, nn, TRUE), sample(100L, nn, TRUE))]
library(microbenchmark)
microbenchmark(
times = 100L,
colsums = colSums(DT[ , !c("A", "B")]),
lapplys = DT[ , lapply(.SD, sum), .SDcols = !c("A", "B")]
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# colsums 1624.2622 2020.9064 2028.9546 2034.3191 2049.9902 2140.8962 100
# lapplys 246.5824 250.3753 252.9603 252.1586 254.8297 266.1771 100
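Going back to the original question (everything except A and B), the notes above can be combined into a short sketch: a plain column sum with no grouping, followed by a by-reference conversion to factor. The small DT and the names cols and totals are made up for illustration.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 2L), B = rep(1:2, 3L),
                 C = 1:6, D = 6:1)

# Column sums over everything except A and B, with no grouping
totals <- DT[, lapply(.SD, sum), .SDcols = !c("A", "B")]

# Convert everything except A and B to factor, updating DT by reference
cols <- setdiff(names(DT), c("A", "B"))
DT[, (cols) := lapply(.SD, as.factor), .SDcols = cols]

sapply(DT, class)
```

The parentheses around cols on the left of := tell data.table to use the names stored in the variable instead of creating a column literally called cols.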

rbindlist two data.tables where one has factor and other has character type for a column

I just discovered this warning in my script that was a bit strange.
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
Observation 1: Here's a reproducible example:
require(data.table)
DT.1 <- data.table(x = letters[1:5], y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
# works fine
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: A 11
# 7: B 12
# 8: C 13
# 9: D 14
# 10: E 15
However, now if I convert column x to a factor (ordered or not) and do the same:
DT.1[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: NA 11
# 7: NA 12
# 8: NA 13
# 9: NA 14
# 10: NA 15
# Warning message:
# In rbindlist(list(DT.1, DT.2)) : NAs introduced by coercion
But rbind does this job nicely!
rbind(DT.1, DT.2) # where DT.1 has column x as factor
# do.call(rbind, list(DT.1, DT.2)) # also works fine
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: A 11
# 7: B 12
# 8: C 13
# 9: D 14
# 10: E 15
The same behaviour can be reproduced if column x is an ordered factor as well. Since the help page ?rbindlist says: Same as do.call("rbind",l), but much faster., I'm guessing this is not the desired behaviour?
Here's my session info:
# R version 3.0.0 (2013-04-03)
# Platform: x86_64-apple-darwin10.8.0 (64-bit)
#
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
#
# other attached packages:
# [1] data.table_1.8.8
#
# loaded via a namespace (and not attached):
# [1] tools_3.0.0
Edit:
Observation 2: Following @AnandaMahto's comment, another interesting observation, reversing the order:
# column x in DT.1 is still a factor
rbindlist(list(DT.2, DT.1))
# x y
# 1: A 11
# 2: B 12
# 3: C 13
# 4: D 14
# 5: E 15
# 6: 1 6
# 7: 2 7
# 8: 3 8
# 9: 4 9
# 10: 5 10
Here, the column from DT.1 is silently coerced to numeric.
Edit: Just to clarify, this is the same behaviour as that of rbind(DT.2, DT.1) with DT.1's column x being a factor. rbind seems to retain the class of the first argument. I'll leave this part here and mention that in this case, this is the desired behaviour since rbindlist is a faster implementation of rbind.
Observation 3: If now, both the columns are converted to factors:
# DT.1 column x is already a factor
DT.2[, x := factor(x)]
rbindlist(list(DT.1, DT.2))
# x y
# 1: a 6
# 2: b 7
# 3: c 8
# 4: d 9
# 5: e 10
# 6: a 11
# 7: b 12
# 8: c 13
# 9: d 14
# 10: e 15
Here, the column x from DT.2 is lost (/ replaced with that of DT.1). If the order is reversed, the exact opposite happens (column x of DT.1 gets replaced with that of DT.2).
In general, there seems to be a problem with handling factor columns in rbindlist.
UPDATE - This bug (#2650) was fixed on 17 May 2013 in v1.8.9
I believe that rbindlist when applied to factors is combining the numerical values of the factors and using only the levels associated with the first list element.
As in this bug report:
http://r-forge.r-project.org/tracker/index.php?func=detail&aid=2650&group_id=240&atid=975
# Temporary workaround:
levs <- c(as.character(DT.1$x), as.character(DT.2$x))
DT.1[, x := factor(x, levels=levs)]
DT.2[, x := factor(x, levels=levs)]
rbindlist(list(DT.1, DT.2))
As another view of what's going on:
DT3 <- data.table(x=c("1st", "2nd"), y=1:2)
DT4 <- copy(DT3)
DT3[, x := factor(x, levels=x)]
DT4[, x := factor(x, levels=x, labels=rev(x))]
DT3
DT4
# Have a look at the difference:
rbindlist(list(DT3, DT4))$x
# [1] 1st 2nd 1st 2nd
# Levels: 1st 2nd
do.call(rbind, list(DT3, DT4))$x
# [1] 1st 2nd 2nd 1st
# Levels: 1st 2nd
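The suspected mechanism (integer codes concatenated, then labelled with the first element's levels) can be mimicked in base R. This is only an illustration of the idea, not rbindlist's actual code, and the names codes and mangled are made up.

```r
f1 <- factor(letters[1:5])
f2 <- factor(LETTERS[1:5])

# Concatenate the underlying integer codes of both factors
codes <- c(as.integer(f1), as.integer(f2))

# Relabel every code using only the levels of the first factor
mangled <- factor(levels(f1)[codes], levels = levels(f1))
as.character(mangled)
```

The upper-case values disappear because their codes 1 to 5 are read against the lower-case levels, which matches Observation 3.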
Edit as per comments:
as for observation 1, what's happening is similar to:
x <- factor(LETTERS[1:5])
x[6:10] <- letters[1:5]
x
# Notice however, if you are assigning a value that is already present
x[11] <- "S" # warning, since `S` is not one of the levels of x
x[12] <- "D" # all good, since `D` *is* one of the levels of x
rbindlist is superfast because it doesn't do the checking of rbind.fill or do.call(rbind.data.frame,...)
You can use a workaround like this to ensure that factors are coerced to characters.
DT.1 <- data.table(x = factor(letters[1:5]), y = 6:10)
DT.2 <- data.table(x = LETTERS[1:5], y = 11:15)
# DDL is the list of data.tables to bind
DDL <- list(DT.1, DT.2)
for(ii in seq_along(DDL)){
ff <- Filter(function(x) is.factor(DDL[[ii]][[x]]), names(DDL[[ii]]))
for(fn in ff){
set(DDL[[ii]], j = fn, value = as.character(DDL[[ii]][[fn]]))
}
}
rbindlist(DDL)
or (less memory efficiently)
rbindlist(rapply(DDL, classes = 'factor', f= as.character, how = 'replace'))
The bug is not fixed in R 4.0.2 and data.table 1.13.0. When I try to rbindlist() two DTs, one of which has factor columns and the other of which is empty, the resulting column is broken and the factor values are mangled (\n occurring randomly; levels are broken, NAs are introduced).
The workaround is to avoid rbindlisting a DT with an empty one, and instead rbindlist it only with other DTs that also have payload data, although this requires some boilerplate code.

Using .BY with a lookup table--unexpected results

I'd like to create a variable in dt according to a lookup table k. I'm getting some unexpected results depending on how I extract the variable of interest in k.
dt <- data.table(x=c(1:10))
setkey(dt, x)
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, x)
dt[,b:=k[.BY, list(b)],by=x]
dt #unexpected results
# x b
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# 6: 6 6
# 7: 7 7
# 8: 8 8
# 9: 9 9
# 10: 10 10
dt <- data.table(x=c(1:10))
setkey(dt, x)
dt[,b:=k[.BY]$b,by=x]
dt #expected results
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
Can anyone explain why this is happening?
You don't have to use by= here at all.
First solution:
Set appropriate keys and use X[Y] syntax from data.table:
require(data.table)
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
k[dt]
# x b
# 1: 1 a
# 2: 2 b
# 3: 3 c
# 4: 4 d
# 5: 5 e
# 6: 6 NA
# 7: 7 NA
# 8: 8 NA
# 9: 9 NA
# 10: 10 d
OP said that this creates a new data.table and it is undesirable for him.
Second solution
Again, without by:
dt <- data.table(x=c(1:10))
setkey(dt, "x")
k <- data.table(x=c(1:5,10), b=c(letters[1:5], "d"))
setkey(k, "x")
# solution
dt[k, b := i.b]
This does not create a new data.table and gives the solution you're expecting.
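The same update join can also be written without setting keys, using the on argument (available in data.table 1.9.6 and later). A minimal sketch with the data from the question:

```r
library(data.table)

dt <- data.table(x = 1:10)
k  <- data.table(x = c(1:5, 10L), b = c(letters[1:5], "d"))

# For every row of k, find the matching x in dt and assign k's b
# (referred to as i.b) by reference; unmatched rows of dt stay NA
dt[k, b := i.b, on = "x"]
dt
```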
To explain why the unexpected result happens:
For the first case you do, dt[,b:=k[.BY, list(b)],by=x]. Here, k[.BY, list(b)] itself returns a data.table. For example:
k[list(x=1), list(b)]
# x b
# 1: 1 a
So, basically, if you would do:
k[list(x=dt$x), list(b)]
That would give you the desired solution as well. As for why you get what you get with b := k[.BY, list(b)]: the RHS returns a data.table (a list of columns) and you're assigning it to a single column, so it takes the first element (here the x column) and drops the rest. For example, do this:
dt[, c := dt[1], by=x]
# you'll get the whole column to be 1
For the second case, to understand why it works, you'll have to know the subtle difference between, accessing a data.table as k[6] and k[list(6)], for example:
In the first case, k[6], you are accessing the 6th element of k, which is 10 d. But in the second case, you're asking for a J, join. So, it searches for x = 6 (key column) and since there isn't any in k, it returns 6 NA. In your case, since you use k[.BY] which returns a list, it is a J operation, which fetches the right value.
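The difference between the two forms is easy to see side by side. A short sketch, assuming k is keyed on x as in the question (by_row and by_key are illustrative names):

```r
library(data.table)

k <- data.table(x = c(1:5, 10L), b = c(letters[1:5], "d"), key = "x")

by_row <- k[6L]      # row number 6 of k
by_key <- k[.(6L)]   # join on the key: look for x == 6

by_row
by_key
```

by_row is the row with x equal to 10, while by_key has x equal to 6 and b missing, because no row of k has that key value.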
I hope this helps.

Resources