Sum each list of numerical values within a list - r

I would like to attach values to labels in a riverplot in R.
I have a list of lists of values, which is required to show flows between nodes, like this:
edges <- list( A= list( C= 10, E= 5 ),
B= list( C= 10 ),
C=list(D = 13, E = 7 ))
I am aware of a function that would reduce (or sum) one element of a list like this:
Reduce("+",edges$A)
Is there a way to reduce such a list of lists to a data frame (or something similar) so that I get the following sums:
Node Sum
A 15
B 10
C 20
D 13
E 12
Edit:
I just realised there is a confusion:
It looks like I need two outputs and it may be a little more complicated:
1. if 'edges' list has a sublist with a category name, sum up
2. if not, get the sum of all occurrences of this item
Case 1: Categories A, B, C (these are the starting nodes)
Case 2: Categories D, E (these are the end nodes in a riverplot)
I am sorry for the confusion.

I think you can do
lapply(edges, function (x) sum(unlist(x)))
This returns a list. Using sapply will simplify the result to a vector.
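A minimal sketch of the sapply variant, using the edges list from the question. The vapply line is an extra I am adding as an assumption about what you might want: it does the same thing but fails loudly if a sublist does not reduce to a single number.

```r
edges <- list(A = list(C = 10, E = 5),
              B = list(C = 10),
              C = list(D = 13, E = 7))

# sapply simplifies the per-node sums to a named numeric vector
out_sums <- sapply(edges, function(x) sum(unlist(x)))

# stricter variant: declare that every result must be one numeric value
out_sums_strict <- vapply(edges, function(x) sum(unlist(x)), numeric(1))
```

Note that this only covers the outgoing sums of the starting nodes (A, B, C).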

We can also use base R
v1 <- unlist(edges)
tapply(v1, sub("\\..*", "", names(v1)), sum)
# A B C
#15 10 20
Or in a single step
r1 <- tapply(unlist(edges), rep(names(edges), lengths(edges)), FUN = sum)
r1
# A B C
#15 10 20
If we need to sum based on the part of the name after the dot:
r2 <- tapply(v1, sub("[^.]+\\.", "", names(v1)), FUN = sum)
r2
# C D E
#20 13 12
c(r1, r2)[!duplicated(c(names(r1), names(r2)))]
# A B C D E
#15 10 20 13 12
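If a Node/Sum data frame is wanted, as in the question, the two tapply results can probably be combined along these lines. This is only a sketch: the names from, to, end_only and result are mine, and it assumes node names contain no dots, since unlist joins the outer and inner names with a dot.

```r
edges <- list(A = list(C = 10, E = 5),
              B = list(C = 10),
              C = list(D = 13, E = 7))

v1 <- unlist(edges)                    # names look like "A.C", "A.E", ...
from <- sub("\\..*", "", names(v1))    # part before the dot (start node)
to <- sub("[^.]+\\.", "", names(v1))   # part after the dot (end node)

out_sums <- tapply(v1, from, sum)      # case 1: sum of outgoing flows
in_sums <- tapply(v1, to, sum)         # case 2: sum of incoming flows

# keep incoming sums only for nodes that never appear as a start node
end_only <- setdiff(names(in_sums), names(out_sums))
node_sums <- c(out_sums, in_sums[end_only])

result <- data.frame(Node = names(node_sums),
                     Sum = unname(node_sums),
                     stringsAsFactors = FALSE)
```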
Or using aggregate/stack
aggregate(values~., stack(edges), FUN = sum)
# ind values
#1 A 15
#2 B 10
#3 C 20

Another option is to use purrr package:
library(purrr)
stack(map(edges, compose(sum, unlist)))
# values ind
# 1 15 A
# 2 10 B
# 3 20 C
where compose(sum, unlist) is equivalent to function(x) sum(unlist(x)).

Or using your own proposed Reduce function:
unlist(lapply(edges, function(a) Reduce(sum, a)))
# A B C
#15 10 20

Related

random sampling of columns based on column group

I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is to take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
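One caveat with calling sample directly on the index vectors: if a group has exactly one column, sample(x, 1) with a single number x draws from 1:x rather than returning x. A hedged sketch that sidesteps this by sampling positions instead; safe_sample and per_group are hypothetical names of mine, not part of the original answer.

```r
set.seed(42)
dframe <- as.data.frame(matrix(rnorm(60), ncol = 30))
colnames(dframe) <- rep(c("A", "B", "C"), times = c(10, 14, 6))

per_group <- 7  # hypothetical number of columns wanted per group

# column indices per group
idx <- split(seq_along(dframe), names(dframe))

# sample positions within x, so a length-one x is returned as-is
safe_sample <- function(x, n) x[sample.int(length(x), min(n, length(x)))]

keep <- unlist(lapply(idx, safe_sample, n = per_group), use.names = FALSE)
subset_df <- dframe[, keep, drop = FALSE]
```

With the example data this keeps 7 columns from A, 7 from B and all 6 from C.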
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
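res <- do.call(cbind,
  lapply(unique(colnames(dframe)), function(x) {
    # positions of all columns belonging to group x
    cols <- which(colnames(dframe) == x)
    # sample nc columns only if the group has more than nc of them
    if (length(cols) > nc) cols <- sample(cols, nc)
    # drop = FALSE keeps a data frame even when one column is selected
    dframe[, cols, drop = FALSE]
  }))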
It might look complicated, but it really just takes all columns of a group if the group has nc or fewer, and samples nc random columns otherwise.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+$','',colnames(res))

data.table execute function on groups of columns

If I have the following data table
m = matrix(1:12, ncol=4)
colnames(m) = c('A1','A2','B1','B2')
d = data.table(m)
is it possible to execute a function on sets of columns?
For example the following would be the sum of A1,A2 and B1,B2.
A B
1: 5 17
2: 7 19
3: 9 21
The solution would preferably work with a 500k x 100 matrix
Solution
A trick would be to split the columns into groups.
Then you can use rowSums as Frank suggests (see comments on question):
# using your data example
m <- matrix(1:12, ncol = 4)
colnames(m) <- c('A1', 'A2', 'B1', 'B2')
d <- data.table(m)
# 1) group columns
groups <- split(colnames(d), substr(colnames(d), 1, 1))
# 2) group wise row sums
d[,lapply(groups, function(i) {rowSums(d[, i, with = FALSE])})]
Result
This will return the data.table:
A B
1: 5 17
2: 7 19
3: 9 21
Explanation
split creates a list of column names for each group, defined by a factor (or something coercible to one).
substr(colnames(m), 1, 1) takes the first letter as the group id; for group names with a variable number of letters, use a different approach (e.g. sub("([A-Z]+).*", "\\1", colnames(m))).
lapply is commonly used to apply functions over multiple columns in a data.table. Here we create a list output, named as the groups, containing the rowSums. with = FALSE is important to use the value of i to get the respective columns from d.
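To see the split-then-rowSums idea in isolation, here is a small sketch on the plain matrix in base R, so it runs without data.table. The `[A-Z]+` prefix pattern is my assumption about how group names are formed (one or more capital letters followed by a number).

```r
m <- matrix(1:12, ncol = 4)
colnames(m) <- c("A1", "A2", "B1", "B2")

# group id = leading run of capital letters in each column name
group_id <- sub("([A-Z]+).*", "\\1", colnames(m))

# list of column names per group: $A = A1, A2 and $B = B1, B2
groups <- split(colnames(m), group_id)

# one column of row sums per group
group_sums <- sapply(groups, function(cols) rowSums(m[, cols, drop = FALSE]))
```

group_sums is a 3 x 2 matrix with columns A and B holding the same values as the data.table result above.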
Definitely possible...
d[, ":=" (A = A1 + A2, B = B1 + B2)]
d
A1 A2 B1 B2 A B
1: 1 4 7 10 5 17
2: 2 5 8 11 7 19
3: 3 6 9 12 9 21
# Want to drop the old columns?
set(d, j = which(names(d) %in% c("A1", "B1", "A2", "B2")), value = NULL)
d
A B
1: 5 17
2: 7 19
3: 9 21
Whether it is desirable I shall not tell. Probably better to follow Frank's advice (see comments).

Does aggregate() guarantee that the result will be ordered by the grouping columns?

I've noticed that aggregate() appears to return its result ordered by the grouping column(s). Is this a guarantee? Can this be relied upon in surrounding logic?
A couple of examples:
set.seed(1); df <- data.frame(group=sample(letters[1:3],10,replace=T),value=1:10);
aggregate(value~group,df,sum);
## group value
## 1 a 16
## 2 b 22
## 3 c 17
And with two groups (notice the second group is ordered first, then the first group to break ties):
set.seed(1); df <- data.frame(group1=sample(letters[1:3],10,replace=T),group2=sample(letters[4:6],10,replace=T),value=1:10);
aggregate(value~group1+group2,df,sum);
## group1 group2 value
## 1 a d 1
## 2 b d 2
## 3 b e 9
## 4 c e 10
## 5 a f 15
## 6 b f 11
## 7 c f 7
Note: I'm asking because I just came up with an answer for Aggregating while merging two dataframes in R which, at least in its current form at the time of writing, depends on aggregate() returning its result ordered by the grouping column.
Yes, as long as you understand the natural ordering of factors to be by their integer keys. You can see this in the code:
y <- as.data.frame(by, stringsAsFactors = FALSE)
... # y becomes the "integerized" dataframe of index vectors
grp <- rank(do.call(paste, c(lapply(rev(y), ident), list(sep = "."))),
ties.method = "min")
y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
...
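If you would rather not depend on this internal behaviour, the ordering can be made explicit with order. A minimal sketch with made-up data (the data frame below is my own example, not from the question):

```r
df <- data.frame(
  group1 = c("b", "a", "c", "a", "b"),
  group2 = c("e", "d", "e", "d", "d"),
  value  = 1:5
)

res <- aggregate(value ~ group1 + group2, df, sum)

# sort explicitly by group1, then group2, instead of relying on aggregate
res_sorted <- res[order(res$group1, res$group2), ]
rownames(res_sorted) <- NULL
```

This costs one extra sort but makes the assumption visible in the surrounding logic.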

How to create a table in R that includes column totals

I'm somewhat new to R programming and am in need of assistance.
I'm looking to take the sum of 4 columns in a dataframe and list these totals in a simple table.
Essentially, take the sum of 4 columns (A, B, C, D) and list the total in a table (table = column 1: A, B, C, D column 2: sum of column A, B, C, D) - something along the lines of:
A = 3
B = 4
C = 4
D = 3
Does anyone know how to get this output? Also, the less "manual" the response, the better (i.e. trying to avoid having to input several lines of code to get this output if possible).
Thank you.
If your data looks like this:
a <- c(1:4)
b <- c(2:5)
c <- c(3:6)
d <- c(4:7)
df <- data.frame(a,b,c,d)
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Use
> res <- sapply(df,sum)
to get
a b c d
10 14 18 22
To apply the function only to numeric columns, try
> res <- colSums(df[sapply(df,is.numeric)])
There is colSums:
colSums(Filter(is.numeric, df))
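To get the two-column table described in the question (column name next to its total), the named vector from colSums can be wrapped in a data frame. A sketch, where the column names "column" and "total" are just placeholders I chose:

```r
df <- data.frame(a = 1:4, b = 2:5, c = 3:6, d = 4:7)

# named vector of totals for the numeric columns only
totals <- colSums(Filter(is.numeric, df))

# one row per original column: its name and its total
totals_table <- data.frame(column = names(totals),
                           total = unname(totals),
                           stringsAsFactors = FALSE)
```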

Assign results of apply to multiple columns of data frame

I would like to process all rows in data frame df by applying function f to every row. As function f returns numeric vector with two elements I would like to assign individual elements to new columns in df.
Sample df, trivial function f returning two elements and my trial with using apply
df <- data.frame(a = 1:3, b = 3:5)
f <- function (a, b) {
c(a + b, a * b)
}
df[, c('apb', 'amb')] <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
This does not work; the results are filled in column by column:
> df
a b apb amb
1 1 3 4 8
2 2 4 3 8
3 3 5 6 15
You could also use Reduce instead of apply. With only two columns it calls f once on the whole columns rather than once per row, which is generally more efficient. You just need to modify your function slightly to use cbind instead of c:
f <- function (a, b) {
cbind(a + b, a * b) # modified to use `cbind` instead of `c`
}
df[c('apb', 'amb')] <- Reduce(f, df)
df
# a b apb amb
# 1 1 3 4 3
# 2 2 4 6 8
# 3 3 5 8 15
Note: This will only work nicely if you have exactly two columns (as in your example), so if you have more columns in your data set, run this only on a two-column subset.
You need to transpose the result of apply to get what you want:
df[, c('apb', 'amb')] <- t(apply(df, 1, function(x) f(a = x[1], b = x[2])))
> df
a b apb amb
1 1 3 4 3
2 2 4 6 8
3 3 5 8 15
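The reason the transpose is needed can be checked by looking at the shape of what apply returns: each call to f becomes a column, not a row. A small sketch using the df and f from the question (raw is a name I made up):

```r
df <- data.frame(a = 1:3, b = 3:5)
f <- function(a, b) c(a + b, a * b)

# apply over rows returns one COLUMN per input row
raw <- apply(df, 1, function(x) f(a = x[1], b = x[2]))
raw_dim <- dim(raw)   # 2 rows (elements of f's result), 3 columns (rows of df)

# transposing gives one row per input row, matching the shape of df
df[, c("apb", "amb")] <- t(raw)
```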
