matrix operations and component-wise addition using data.table - r

What is the best way to do component-wise matrix addition if the number of matrices to be summed is not known in advance? More generally, is there a good way to perform matrix (or multi-dimensional array) operations in the context of data.table? I use data.table for its efficiency at sorting and grouping data by several fixed variables, or categories, each comprising a different number of observations.
For example:
Find the outer product of vector components given in each observation (row) of the data, returning a matrix for each row.
Sum the resulting matrices component-wise over all rows of each grouping of data categories.
Here illustrated with 2x2 matrices and only one category:
library(data.table)
# example data, number of rows differs by category t
N <- 5
dt <- data.table(t = rep(c("a", "b"), each = 3, len = N),
x1 = rep(1:2, len = N), x2 = rep(3:5, len = N),
y1 = rep(1:3, len = N), y2 = rep(2:5, len = N))
setkey(dt, t)
> dt
t x1 x2 y1 y2
1: a 1 3 1 2
2: a 2 4 2 3
3: a 1 5 3 4
4: b 2 3 1 5
5: b 1 4 2 2
I attempted a function that computes the matrix sum of the outer products (%o%):
mat_sum <- function(x1, x2, y1, y2){
x <- c(x1, x2) # x vector
y <- c(y1, y2) # y vector
xy <- x %o% y # outer product (i.e. 2x2 matrix)
sum(xy) # <<< THIS RETURNS A SINGLE VALUE, NOT WHAT I WANT.
}
which, of course, does not work because sum adds up all the elements across the arrays.
I saw this answer using Reduce('+', .list) but that seems to require already having a list of all the matrices to be added. I haven't figured out how to do that within data.table, so instead I've got a cumbersome work-around:
# extract each outer product component first...
mat_comps <- function(x1, x2, y1, y2){
x <- c(x1, x2) # x vector
y <- c(y1, y2) # y vector
xy <- x %o% y # outer product (i.e. 2x2 matrix)
xy11 <- xy[1,1]
xy21 <- xy[2,1]
xy12 <- xy[1,2]
xy22 <- xy[2,2]
return(c(xy11, xy21, xy12, xy22))
}
# ...then running this function on dt,
# taking extra step (making column 'n') to apply it row-by-row...
dt[, n := 1:nrow(dt)]
dt[, c("xy11", "xy21", "xy12", "xy22") := as.list(mat_comps(x1, x2, y1, y2)),
by = n]
# ...then sum them individually, now grouping by t
s <- dt[, list(s11 = sum(xy11),
s21 = sum(xy21),
s12 = sum(xy12),
s22 = sum(xy22)),
by = key(dt)]
> s
t s11 s21 s12 s22
1: a 8 26 12 38
2: b 4 11 12 23
and that gives the summed components, which can finally be converted back to matrices.
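The Reduce('+', ...) idea mentioned in the question can be applied per group by first building the list of outer-product matrices with Map. A minimal base R sketch, assuming the example data above copied into a plain data frame; the helper name sum_outer is hypothetical:

```r
# Example data from the question, as a plain data frame (assumed copy)
df <- data.frame(t  = c("a", "a", "a", "b", "b"),
                 x1 = c(1, 2, 1, 2, 1), x2 = c(3, 4, 5, 3, 4),
                 y1 = c(1, 2, 3, 1, 2), y2 = c(2, 3, 4, 5, 2))

# sum_outer (hypothetical helper): build one 2x2 outer product per row,
# then add the matrices component-wise with Reduce
sum_outer <- function(g) {
  mats <- Map(function(p, q, r, s) c(p, q) %o% c(r, s),
              g$x1, g$x2, g$y1, g$y2)
  Reduce(`+`, mats)
}

# One summed matrix per category t
res <- lapply(split(df, df$t), sum_outer)
res$a
```

Inside data.table, the same helper could presumably be called per group by wrapping the matrix in a list column, for example dt[, .(m = list(sum_outer(.SD))), by = t]; treat that call as an untested suggestion.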

In general, data.table is designed to work with columns. The more you transform your problem to col-wise operations, the more you can get out of data.table.
Here's an attempt at accomplishing this operation col-wise. Probably there are better ways. This is intended more as a template, to provide an idea on approaching the problem (even though I understand it may not be possible in all cases).
xcols <- grep("^x", names(dt))
ycols <- grep("^y", names(dt))
combs <- CJ(V1 = ycols, V2 = xcols) # name the columns explicitly; newer data.table versions auto-name them after the inputs
len <- seq_len(nrow(combs))
cols = paste("V", len, sep="")
for (i in len) {
c1 = combs$V2[i]
c2 = combs$V1[i]
set(dt, i=NULL, j=cols[i], value = dt[[c1]] * dt[[c2]])
}
# t x1 x2 y1 y2 V1 V2 V3 V4
# 1: a 1 3 1 2 1 3 2 6
# 2: a 2 4 2 3 4 8 6 12
# 3: a 1 5 3 4 3 15 4 20
# 4: b 2 3 1 5 2 3 10 15
# 5: b 1 4 2 2 2 8 2 8
This basically applies the outer product col-wise. Now it's just a matter of aggregating it.
dt[, lapply(.SD, sum), by=t, .SDcols=cols]
# t V1 V2 V3 V4
# 1: a 8 26 12 38
# 2: b 4 11 12 23
HTH
Edit: Modified cols, c1, c2 a bit to get the output with the correct order for V2 and V3.
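Another column-oriented way to look at the problem, offered as a sketch rather than as part of the answer above: the sum over rows of the outer products x_i %o% y_i equals t(X) %*% Y, where X and Y hold the x and y columns, so crossprod() returns the summed matrix directly. The data frame df is an assumed copy of the example data, and sum_outer_cp is a hypothetical helper name.

```r
# Example data from the question, as a plain data frame (assumed copy)
df <- data.frame(t  = c("a", "a", "a", "b", "b"),
                 x1 = c(1, 2, 1, 2, 1), x2 = c(3, 4, 5, 3, 4),
                 y1 = c(1, 2, 3, 1, 2), y2 = c(2, 3, 4, 5, 2))

# sum_outer_cp (hypothetical helper): t(X) %*% Y is the sum of the
# per-row outer products of the x columns with the y columns
sum_outer_cp <- function(g) {
  crossprod(as.matrix(g[c("x1", "x2")]), as.matrix(g[c("y1", "y2")]))
}

res_cp <- lapply(split(df, df$t), sum_outer_cp)

# Column-major order gives s11, s21, s12, s22
as.vector(res_cp$a)
```

This avoids building one matrix per row, which should matter more as the groups grow.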

EDIT:
To handle more than two elements in the "x" and "y" vectors, a modified function could be:
ff2 = function(x_ls, y_ls)
{
combs_ls = lapply(seq_along(x_ls[[1]]),
function(i) list(sapply(x_ls, "[[", i),
sapply(y_ls, "[[", i)))
rowSums(sapply(combs_ls, function(x) as.vector(do.call(outer, x))))
}
where "x_ls" and "y_ls" are lists of the respective column vectors.
Using it:
dt[, as.list(ff2(list(x1, x2), list(y1, y2))), by = t]
# t V1 V2 V3 V4
#1: a 8 26 12 38
#2: b 4 11 12 23
And on other "data.frames/tables":
set.seed(101)
DF = data.frame(group = rep(letters[1:3], c(4, 2, 3)),
x1 = sample(1:20, 9, T), x2 = sample(1:20, 9, T),
x3 = sample(1:20, 9, T), x4 = sample(1:20, 9, T),
y1 = sample(1:20, 9, T), y2 = sample(1:20, 9, T),
y3 = sample(1:20, 9, T), y4 = sample(1:20, 9, T))
DT = as.data.table(DF)
DT[, as.list(ff2(list(x1, x2, x3, x4),
list(y1, y2, y3, y4))), by = group]
# group V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
#1: a 338 661 457 378 551 616 652 468 460 773 536 519 416 766 442 532
#2: b 108 261 171 99 29 77 43 29 154 386 238 146 161 313 287 121
#3: c 345 351 432 293 401 421 425 475 492 558 621 502 510 408 479 492
I don't know, though, how to avoid stating the columns explicitly inside the function call in "data.table", i.e. how to do the equivalent of:
do.call(rbind, lapply(split(DF[-1], DF$group),
function(x)
do.call(ff2, c(list(x[grep("^x", names(x))]),
list(x[grep("^y", names(x))])))))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
#a 338 661 457 378 551 616 652 468 460 773 536 519 416 766 442 532
#b 108 261 171 99 29 77 43 29 154 386 238 146 161 313 287 121
#c 345 351 432 293 401 421 425 475 492 558 621 502 510 408 479 492
OLD ANSWER:
Perhaps you could define your function like:
ff1 = function(x1, x2, y1, y2)
rowSums(sapply(seq_along(x1),
function(i) as.vector(c(x1[i], x2[i]) %o% c(y1[i], y2[i]))))
dt[, as.list(ff1(x1, x2, y1, y2)), by = list(t)]
# t V1 V2 V3 V4
#1: a 8 26 12 38
#2: b 4 11 12 23

Related

Creating new columns from long strings split into 300 substrings?

I have a column containing 1200 character strings. In each one, every four character group is hexadecimal for a number. i.e. 300 numbers in hexadecimal crammed into a 1200 character string, in every row. I need to get each number out into decimal, and into its own column (300 new columns) named 1-300.
Here's what I've figured out so far:
Data.frame:
BigString
[1] 0043003E803C0041004A...(etc...)
Here's what I've done so far:
decimal.fours <- function(x) {
strtoi(substring(BigString[x], seq(1,1197,4), seq(4,1200,4)), 16L)
}
decimal.fours(1)
[1] 283 291 239 177 ...
But now I'm stuck. How can I output these individual numbers (and the remaining 296) into new columns? I have fifty rows/strings in total. It would be great to do them all at once, i.e. 300 new columns containing the split-up substrings from 50 strings.
You can use read.fwf, which reads files with a fixed width for each column:
# an example vector of big strings
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
n = 5 # n is the number of columns for your result(300 for your real case)
as.data.frame(
lapply(read.fwf(file = textConnection(BigString),
widths = rep(4, n),
colClasses = "character"),
strtoi, base = 16))
# V1 V2 V3 V4 V5
#1 67 62 32828 65 74
#2 67 62 32828 65 74
#3 67 62 32828 65 74
If you'd like to keep the decimal.fours function, you can modify it as follows and call lapply to convert each element of BigString to a vector of integers, which can then be combined into a matrix (and from there a data.frame) with the do.call(rbind, ...) pattern:
decimal.fours <- function(x) {
strtoi(substring(x, seq(1,1197,4), seq(4,1200,4)), 16L) # end positions must run to 1200 so starts and ends pair up
}
do.call(rbind, lapply(BigString, decimal.fours))
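As a sketch of the same idea without hard-coded positions, the start offsets can be derived from the string length, and the columns can be named 1 to n as the question asks. The names width, starts and hex_to_int are hypothetical, and the code assumes every string has the same length, a multiple of 4.

```r
# Example strings (5 four-character hex groups each)
BigString <- c("0043003E803C0041004A", "0043003E803C0041004A")

width  <- 4
n      <- nchar(BigString[1]) %/% width          # number of hex groups
starts <- seq(1, by = width, length.out = n)     # 1, 5, 9, ...

# hex_to_int (hypothetical helper): split one string and convert each piece
hex_to_int <- function(s) {
  strtoi(substring(s, starts, starts + width - 1), base = 16L)
}

# One row per string, one column per hex group
vals <- t(vapply(BigString, hex_to_int, integer(n), USE.NAMES = FALSE))
colnames(vals) <- seq_len(n)
out <- as.data.frame(vals)
out
```

For the real data, the same code would give 300 columns for each of the 50 strings.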
Obligatory tidyverse example:
library(tidyverse)
Set up some data
set.seed(1492)
bet <- c(0:9, LETTERS[1:6]) # alphabet for hex digit sequences
i <- 8 # number of rows
n <- 10 # number of 4-hex-digit sequences
df <- tibble( # tibble() replaces the deprecated data_frame()
some_other_col=LETTERS[1:i],
big_str=map_chr(1:i, ~sample(bet, 4*n, replace=TRUE) %>% paste0(collapse=""))
)
df
## # A tibble: 8 × 2
## some_other_col big_str
## <chr> <chr>
## 1 A 432100D86CAA388C15AEA6291E985F2FD3FB6104
## 2 B BC2673D112925EBBB3FD175837AF7176C39B4888
## 3 C B4E99FDAABA47515EADA786715E811EE0502ABE8
## 4 D 64E622D7037D35DE6ADC40D0380E1DC12D753CBC
## 5 E CF7CDD7BBC610443A8D8FCFD896CA9730673B181
## 6 F ED86AEE8A7B65F843200B823CFBD17E9F3CA4EEF
## 7 G 2B9BCB73941228C501F937DA8E6EF033B5DD31F6
## 8 H 40823BBBFDF9B14839B7A95B6E317EBA9B016ED5
Do the manipulation
read_fwf(paste0(df$big_str, collapse="\n"),
fwf_widths(rep(4, n)),
col_types=paste0(rep("c", n), collapse="")) %>%
mutate_all(strtoi, base=16) %>%
bind_cols(df) %>%
select(some_other_col, everything(), -big_str)
## # A tibble: 8 × 11
## some_other_col X1 X2 X3 X4 X5 X6 X7 X8 X9
## <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
## 1 A 17185 216 27818 14476 5550 42537 7832 24367 54267
## 2 B 48166 29649 4754 24251 46077 5976 14255 29046 50075
## 3 C 46313 40922 43940 29973 60122 30823 5608 4590 1282
## 4 D 25830 8919 893 13790 27356 16592 14350 7617 11637
## 5 E 53116 56699 48225 1091 43224 64765 35180 43379 1651
## 6 F 60806 44776 42934 24452 12800 47139 53181 6121 62410
## 7 G 11163 52083 37906 10437 505 14298 36462 61491 46557
## 8 H 16514 15291 65017 45384 14775 43355 28209 32442 39681
## # ... with 1 more variables: X10 <int>
Just an attempt using base R:
BigString = c("0043003E803C0041004A", "0043003E803C0041004A", "0043003E803C0041004A")
df = data.frame(BigString)
t(sapply(df$BigString, function(x) strtoi(substring(x, seq(1, 297, 4)[1:5],
seq(4, 300, 4)[1:5]), base = 16)))
# [,1] [,2] [,3] [,4] [,5]
#[1,] 67 62 32828 65 74
#[2,] 67 62 32828 65 74
#[3,] 67 62 32828 65 74
# you can set the column names at the end using `paste0("new_col", 1:300)`
# [1:5] was only used for this example, because the strings here are 20 characters long

Apply a family of functions over nested list -R

I need to apply a family of functions of the form (a*x + b + c) to
a nested list, e.g. functions like the following
map_function <- function(x,y){
return(linear_function(x[1],x[2],x[3],y))
}
linear_function <- function(x1,x2,x3,y){
g <- sapply(y, function(x){x1*x+x2+x3})%>% min(.)
return(g)
}
over two lists, so that map_function is applied to every pairing of an element of
pr_list <- list(c(1,2,3),c(4,5,6))
with an element of
f_list <- list(c(234,34),c(456,34,567),c(111,222))
The result should be a nested list (or matrix) with 3 elements of 2 values each. What is the idiomatic R way to do this without a for loop?
e.g. if the output is a matrix, examples of the elements will be
M11 <- linear_function(pr_list[[1]][1],pr_list[[1]][2],pr_list[[1]][3],f_list[[1]] )
M12 <- linear_function(pr_list[[1]][1],pr_list[[1]][2],pr_list[[1]][3],f_list[[2]] )
M13 <- linear_function(pr_list[[1]][1],pr_list[[1]][2],pr_list[[1]][3],f_list[[3]] )
M21 <- linear_function(pr_list[[2]][1],pr_list[[2]][2],pr_list[[2]][3],f_list[[1]] )
M22 <- linear_function(pr_list[[2]][1],pr_list[[2]][2],pr_list[[2]][3],f_list[[2]] )
M23 <- linear_function(pr_list[[2]][1],pr_list[[2]][2],pr_list[[2]][3],f_list[[3]] )
M <- list(c(M11,M21),c(M12,M22),c(M13,M23))
print(M)
[[1]]
[1] 39 147
[[2]]
[1] 39 147
[[3]]
[1] 116 455
This is my best guess for you.
x <- data.frame(x1 = c(1,2,3), x2 = c(4,5,6))
y <- data.frame(y1 = c(234,34,NA),y2 = c(456,34,567), y3 = c(111,222,NA))
linear_function <- function(x, y){x[[1]]*y +x[[2]]+x[[3]]}
Which when applied like this, results in the following.
> linear_function(x$x1, y)
y1 y2 y3
1 239 461 116
2 39 39 227
3 NA 572 NA
> linear_function(x$x2, y)
y1 y2 y3
1 947 1835 455
2 147 147 899
3 NA 2279 NA
If you want a single object.
> z <- lapply(x, linear_function, y)
> z
$x1
y1 y2 y3
1 239 461 116
2 39 39 227
3 NA 572 NA
$x2
y1 y2 y3
1 947 1835 455
2 147 147 899
3 NA 2279 NA
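The answer above leaves out the min() step from the question. A minimal sketch that keeps it and reproduces the list M from the question uses nested lapply/sapply over the two lists; linear_min is a hypothetical name for the vectorised form of linear_function.

```r
pr_list <- list(c(1, 2, 3), c(4, 5, 6))
f_list  <- list(c(234, 34), c(456, 34, 567), c(111, 222))

# linear_min (hypothetical helper): evaluate p[1]*y + p[2] + p[3] for all y
# at once and keep the minimum, as linear_function does in the question
linear_min <- function(p, y) min(p[1] * y + p[2] + p[3])

# Nested list: one element per f_list entry, one value per pr_list entry
M <- lapply(f_list, function(y) sapply(pr_list, linear_min, y = y))

# Matrix form: rows index pr_list, columns index f_list
M_mat <- sapply(f_list, function(y) sapply(pr_list, linear_min, y = y))
M_mat
```

Because y enters the formula as a whole vector, no inner loop over its elements is needed.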

R programming(sum of products)

I'm trying to find the sum of products of two data frames.
data<-w1 w2 w3 w4
4 6 8 5
where w1 w2 w3 w4 are column names
and I have one more dataframe
data2<-p1 p2 p3 p4
3 4 5 6
5 6 8 4
4 6 6 8
3 5 8 9
my result should be like this:
result <- w1*p1+w2*p2+w3*p3+w4*p4
result1 <- 4*3+6*4+8*5+5*6 # result on row 1
result2 <- 4*5+6*6+8*8+5*4 # result on row 2
and so on for each row in data2
How can I do this in general?
Thanks
The fastest way is to fall back on R's linear algebra (even more so if you have big data frames):
> as.matrix(data2) %*% unlist(data)
# [,1]
#[1,] 106
#[2,] 140
#[3,] 140
#[4,] 151
Or sweep:
> rowSums(sweep(as.matrix(data2), 2, unlist(data), `*`))
#[1] 106 140 140 151
Data
data=data.frame(a=4,b=6,c=8,d=5)
data2=data.frame(a=c(3,5,4,3),b=c(4,6,6,5),c=c(5,8,6,8),d=c(6,4,8,9))
You could use mapply:
df1 <- data.frame(w1 = 4, w2 = 6, w3 = 8, w4 = 5)
df2 <- data.frame(p1 = c(3, 5, 4, 3), p2 = c(4, 6, 6, 5),
p3 = c(5, 8, 6, 8), p4 = c(6, 4, 8, 9))
This multiplies each element of df2 with each element of df1 (by element I mean column - the data frame is treated as a list in this context):
> (tmp <- mapply(`*`, df2, df1))
p1 p2 p3 p4
[1,] 12 24 40 30
[2,] 20 36 64 20
[3,] 16 36 48 40
[4,] 12 30 64 45
> sum(tmp)
[1] 537
Edit If you want to get the sum of each row from the above matrix you can use either apply(tmp, 1, sum) or rowSums:
> rowSums(tmp)
[1] 106 140 140 151

Condense a matrix in R

I have loaded a table of integer data with 2,200 columns. What I'd like to do is condense the data down by averaging the values in every 5 columns and placing that in a new column in a new table.
For example, if I had:
Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col8 | Col9 | Col10
2 4 6 8 10 12 14 16 18 20
I would get:
Col1 | Col2
6 16
Which is just the average of the values in columns 1-5 from the original table in Col1 and the average of the values in columns 6-10 in Col2.
I haven't quite wrapped my head around R syntax, so any help would be appreciated.
Here's one approach that's applicable if the number of elements to be grouped is divisible by n (5, in your case):
x <- 1:100
n <- 5
tapply(x, rep(seq(1, length(x), n), each=n), mean)
# 1 6 11 16 21 26 31 36 41 46 51 56 61 66 71 76 81 86 91 96
# 3 8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93 98
The first row of output contains element names, and the second row contains means of successive groups of n elements.
To apply this to all rows of a matrix or data.frame, you can do, e.g.:
m <- matrix(1:1000, ncol=100)
apply(m, 1, function(x) tapply(x, rep(seq(1, length(x), n), each=n), mean))
EDIT
This alternative approach will give you some performance gains due to vectorisation with rowMeans:
t(mapply(function(x, y) rowMeans(m[, x:y]),
seq(1, ncol(m), n), seq(n, ncol(m), n)))
Oops, I see this is the comment of @user20650 on @jbaums's answer. The rowsum function splits rows of a matrix by a factor, and sums the columns of each split. So for
m <- matrix(1:1000, ncol=100)
n <- 5
we have
rowsum(t(m), rep(seq_len(ncol(m) / n), each=n)) / n
This is fast, if that's important
library(microbenchmark)
f0 = function(m, n) rowsum(t(m), rep(seq_len(ncol(m) / n), each=n)) / n
f1 = function(m, n)
apply(m, 1, function(x) tapply(x, rep(seq(1, length(x), n), each=n), mean))
f2 = function(m, n)
t(mapply(function(x, y) rowMeans(m[, x:y]),
seq(1, ncol(m), n), seq(n, ncol(m), n)))
all.equal(f0(m, n), f1(m, n), check.attributes=FALSE)
## [1] TRUE
all.equal(f0(m, n), f2(m, n), check.attributes=FALSE)
## [1] TRUE
microbenchmark(f0(m, n), f1(m, n), f2(m, n))
## Unit: microseconds
## expr min lq median uq max neval
## f0(m, n) 164.351 170.1675 176.730 187.8570 237.419 100
## f1(m, n) 8060.639 8513.3035 8696.742 8908.5190 9771.019 100
## f2(m, n) 540.894 588.3820 603.787 634.1615 732.209 100
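The rowsum approach assumes the number of columns is divisible by n. A sketch of a variant that also copes with a shorter final group builds the grouping index with ceiling() and divides each group by its own size; condense_cols is a hypothetical helper name.

```r
# condense_cols (hypothetical helper): average every n columns of a matrix;
# a final group with fewer than n columns is averaged over what it has
condense_cols <- function(m, n) {
  grp  <- ceiling(seq_len(ncol(m)) / n)   # 1,1,1,1,1,2,2,... for n = 5
  sums <- rowsum(t(m), grp)               # one row per column group
  t(sums / as.vector(table(grp)))         # divide each group by its size
}

# The one-row example from the question: 2, 4, ..., 20
m_q <- matrix(seq(2, 20, by = 2), nrow = 1)
condense_cols(m_q, 5)

# 22 columns, so the last group has only 2 columns
m22 <- matrix(c(rep(1:100, 2), 1:20), ncol = 22)
condense_cols(m22, 5)
```

The result has one row per original row and one column per group of columns.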
Here's another approach using a loop and rowMeans instead, in case you prefer a loop in this case. Will work for matrices, but needs adjustment for vectors.
# example data
dat <- as.data.frame( matrix(1:20,ncol=10,byrow=TRUE) )
# pick range
range <- 5
ind <- seq(1,ncol(dat),range)
newdat <- NULL
for(i in ind){
newcol <- rowMeans(dat[,i:(i+range-1)])
newdat <- cbind(newdat, newcol)
}
Will result in:
> newdat
newcol newcol
[1,] 3 8
[2,] 13 18
@jbaums's answer looks pretty good. Since I had already started this answer, I thought I would post my solution as well.
#Make some fake data
require(data.table)
data <- data.table(t(iris[,1:4]))
#Transpose since rows are easier to deal with than columns
data <- data.table(t(data))
data[ , row := .I]
#Average every 5 rows (the question asks for means rather than sums)
data <- data[ , lapply(.SD,mean), by=cut(row,seq(0,nrow(data),5))]
#Transpose back to original results
result <- data.table(t(data))
If you wanted to get the means of the elements from col1-col5, col6-col10, etc.
m1 <- matrix(c(rep(1:100, 2), 1:20), ncol=22)
n <- 5
p1 <- prod(dim(m1))
n1 <- nrow(m1)*n
n2 <- p1-p1%%n1
c(rowMeans(matrix(m1[1:n2], nrow=p1%/%n1, byrow=TRUE)), mean(m1[(n2+1):p1]))
#[1] 25.5 75.5 25.5 75.5 10.5
Or
sapply(seq(1,ncol(m1), by=n), function(i) mean(m1[,i:(min(c(i+n-1), ncol(m1)))]) )
#[1] 25.5 75.5 25.5 75.5 10.5
With some labels
indx <- seq(1,n2/nrow(m1), by=n)
indx1 <- paste("Col",paste(indx, indx+4, sep="-"),sep="_")
indx2 <- paste("Col", paste(seq(p1%%n1+1, ncol(m1)),collapse="-"), sep="_")
c(rowMeans(matrix(m1[1:n2], nrow=p1%/%n1, byrow=TRUE, dimnames=list(indx1, NULL))), setNames(mean(m1[(n2+1):p1]), indx2))
# Col_1-5 Col_6-10 Col_11-15 Col_16-20 Col_21-22
# 25.5 75.5 25.5 75.5 10.5
Update
I realized that you wanted the rowMeans by splitting up columns 1:5, 6:10, 11:15 etc. If that is the case:
res1 <- cbind( colMeans(aperm(array(m1[1:n2], dim=c(nrow(m1), n, p1%/%n1)), c(2,1,3))),
rowMeans(m1[,(ncol(m1)-ncol(m1)%%n+1):ncol(m1)]))
which is equal to manually splitting the columns
res2 <- cbind(rowMeans(m1[,1:5]), rowMeans(m1[,6:10]), rowMeans(m1[,11:15]),
rowMeans(m1[,16:20]), rowMeans(m1[,21:22]))
identical(res1,res2)
#[1] TRUE
colnames(res1) <- c(indx1,indx2)
res1
# Col_1-5 Col_6-10 Col_11-15 Col_16-20 Col_21-22
#[1,] 21 71 21 71 6
#[2,] 22 72 22 72 7
#[3,] 23 73 23 73 8
#[4,] 24 74 24 74 9
#[5,] 25 75 25 75 10
#[6,] 26 76 26 76 11
#[7,] 27 77 27 77 12
#[8,] 28 78 28 78 13
#[9,] 29 79 29 79 14
#[10,] 30 80 30 80 15

Difference between tilde and "by" while using aggregate function in R

Every time I do an aggregate on a data.frame I default to using the "by = list(...)" parameter. But I do see solutions on stackoverflow and elsewhere where tilde (~) is used in the "formula" parameter. I kinda see the "by" parameter as the "pivot" around these variables.
In some cases, the output is exactly the same. For example:
aggregate(cbind(df$A, df$B, df$C), FUN = sum, by = list("x" = df$D, "y" = df$E))
AND
aggregate(cbind(df$A, df$B, df$C) ~ df$D + df$E, FUN = sum)
What is the difference between the two and when do you use which?
I would not entirely disagree that it doesn't really matter which approach you use. However, it is important to note that they do behave differently.
I'll illustrate with a small example.
Here's some sample data:
set.seed(1)
mydf <- data.frame(A = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
B = LETTERS[c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)],
matrix(sample(100, 36, replace = TRUE), nrow = 12))
mydf[3:5] <- lapply(mydf[3:5], function(x) {
x[sample(nrow(mydf), 1)] <- NA
x
})
mydf
# A B X1 X2 X3
# 1 1 A 27 69 27
# 2 1 A 38 NA 39
# 3 1 A 58 77 2
# 4 2 A 91 50 39
# 5 2 A 21 72 87
# 6 3 B 90 100 35
# 7 3 B 95 39 49
# 8 3 B 67 78 60
# 9 3 B 63 94 NA
# 10 4 B NA 22 19
# 11 4 B 21 66 83
# 12 4 B 18 13 67
First, the formula interface. The following three commands will all yield the same output.
aggregate(cbind(X1, X2, X3) ~ A + B, mydf, sum)
aggregate(cbind(X1, X2, X3) ~ ., mydf, sum)
aggregate(. ~ A + B, mydf, sum)
# A B X1 X2 X3
# 1 1 A 85 146 29
# 2 2 A 112 122 126
# 3 3 B 252 217 144
# 4 4 B 39 79 150
Here's a related command for the "by" interface. Pretty cumbersome to type (but that can be addressed by using with, if required).
aggregate(cbind(mydf$X1, mydf$X2, mydf$X3),
by = list(mydf$A, mydf$B), sum)
# Group.1 Group.2 V1 V2 V3
# 1 1 A 123 NA 68
# 2 2 A 112 122 126
# 3 3 B 315 311 NA
# 4 4 B NA 101 169
Now, stop and make note of any differences.
The two that pop into my mind are:
The formula method does a nicer job of preserving names but it doesn't let you control the names directly in your command, which you can do in the data.frame method:
aggregate(cbind(NewX1 = mydf$X1, NewX2 = mydf$X2, NewX3 = mydf$X3),
by = list(NewA = mydf$A, NewB = mydf$B), sum)
The formula method and the data.frame method treat NA values differently. To get the same result with the formula method as you do with the data.frame method, you need to use na.action = na.pass.
aggregate(. ~ A + B, mydf, sum, na.action=na.pass)
Again, it is not entirely wrong to say "I don't think it really matters", and I'm not going to state my preference here since that's not really what Stack Overflow is about, but it is important to always read the function documentation carefully before making such decisions.
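The NA difference is easier to see on a tiny data set. The sketch below uses made-up data (the data frame d and its columns are assumptions for illustration) and compares the formula interface with its default na.action, the formula interface with na.pass, and the "by" interface with and without na.rm = TRUE.

```r
# Made-up data: one NA in v1, in group "a"
d <- data.frame(g  = c("a", "a", "b", "b"),
                v1 = c(1, NA, 3, 4),
                v2 = c(10, 20, 30, 40))

# Formula interface, default na.action = na.omit:
# the whole second row is dropped, so v2 for "a" is 10, not 30
f_default <- aggregate(cbind(v1, v2) ~ g, d, sum)

# Formula interface keeping NA rows: v1 for "a" becomes NA, v2 is 30
f_pass <- aggregate(cbind(v1, v2) ~ g, d, sum, na.action = na.pass)

# "by" interface: same as na.pass, NA propagates only within v1
by_na <- aggregate(d[c("v1", "v2")], by = list(g = d$g), sum)

# "by" interface with na.rm passed on to sum(): only the NA cell is ignored
by_rm <- aggregate(d[c("v1", "v2")], by = list(g = d$g), sum, na.rm = TRUE)

f_default
by_rm
```

Note that f_default and by_rm differ in v2 for group "a": row-wise deletion discards the valid value 20 together with the NA.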
From the help page,
aggregate.formula is a standard formula interface to aggregate.data.frame
So I don't think it really matters. Use whichever approach you're comfortable with, or which fits existing variables and formulas in your workspace.
