If I have the following data table
library(data.table)
m = matrix(1:12, ncol=4)
colnames(m) = c('A1','A2','B1','B2')
d = data.table(m)
is it possible to execute a function on sets of columns?
For example, the following would be the row-wise sums of A1,A2 and of B1,B2.
A B
1: 5 17
2: 7 19
3: 9 21
The solution would preferably also work with a 500k x 100 matrix.
Solution
A trick would be to split the columns into groups.
Then you can use rowSums as Frank suggests (see comments on question):
# using your data example
m <- matrix(1:12, ncol = 4)
colnames(m) <- c('A1', 'A2', 'B1', 'B2')
d <- data.table(m)
# 1) group columns
groups <- split(colnames(d), substr(colnames(d), 1, 1))
# 2) group wise row sums
d[, lapply(groups, function(i) rowSums(d[, i, with = FALSE]))]
Result
This will return the data.table:
A B
1: 5 17
2: 7 19
3: 9 21
Explanation
split creates a list of column names for each group, where the grouping is defined by a factor (or anything coercible to one).
substr(colnames(m), 1, 1) takes the first letter as the group id; use a different approach (e.g. sub("([A-Z]+).*", "\\1", colnames(m))) if the group prefix can have a variable number of letters.
lapply is commonly used to apply a function over multiple columns of a data.table. Here it creates a list output, named after the groups, containing the rowSums. with = FALSE is needed so that the value of i is used to select the respective columns from d.
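For instance, with hypothetical two-letter prefixes (column names invented for illustration), the regex variant groups them where substr(..., 1, 1) would not:
cn <- c("AB1", "AB2", "CD1")
split(cn, sub("([A-Z]+).*", "\\1", cn))
# $AB
# [1] "AB1" "AB2"
#
# $CD
# [1] "CD1"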
Definitely possible...
d[, ":=" (A = A1 + A2, B = B1 + B2)]
d
A1 A2 B1 B2 A B
1: 1 4 7 10 5 17
2: 2 5 8 11 7 19
3: 3 6 9 12 9 21
# Want to drop the old columns?
set(d, j = which(names(d) %in% c("A1", "B1", "A2", "B2")), value = NULL)
d
A B
1: 5 17
2: 7 19
3: 9 21
Whether it is desirable I shall not tell. Probably better to follow Frank's advice (see comments).
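For the 500k x 100 case, one possible sketch (not from either answer) combines the grouping trick with :=, adding one row-sum column per group by reference and avoiding a full copy of the table. It assumes d still holds the original A1, A2, B1, B2 columns and that the first letter of each column name identifies its group:
groups <- split(names(d), substr(names(d), 1, 1))
# Wrap the LHS in () so the character vector of new column names is evaluated
# rather than taken literally as a single column name.
d[, (names(groups)) := lapply(groups, function(cols) rowSums(d[, cols, with = FALSE]))]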
Related
I have two tables, one with time-series data (dat), and another with some reference points (pts), for a bunch of different observations (time.group and well). Please see the minimal example tables:
library(data.table)
set.seed(5)
dat = data.table(time.group = c(rep("base", 42), rep("4h", 42)),
                 well       = c(rep("A1", 20), rep("B1", 22), rep("A1", 19), rep("B1", 23)),
                 frame      = c(1:20, 1:22, 1:19, 1:23),
                 signal     = runif(84, 0, 1))
pts = data.table(time.group  = c(rep("base", 2), rep("4h", 2)),
                 well        = rep(c("A1", "B1"), 2),
                 frame.start = c(3, 4, 3, 6),
                 frame.stop  = c(17, 18, 12, 19))
head(dat)
time.group well frame signal
1: base A1 1 0.2002145
2: base A1 2 0.6852186
3: base A1 3 0.9168758
4: base A1 4 0.2843995
5: base A1 5 0.1046501
6: base A1 6 0.7010575
head(pts)
time.group well frame.start frame.stop
1: base A1 3 17
2: base B1 4 18
3: 4h A1 3 12
4: 4h B1 6 19
I would like to extract, for each time.group and well, the frame at which the signal is highest in the dat table, restricted to the frames between frame.start and frame.stop from the pts table.
What is the most efficient way to do so? I have pretty large data sets with lots of time.groups and wells, and a few other "signal"-like data columns.
These are the strategies I have come up with so far:
Example 1: This works, but I feel that this is redundant/slow, as it essentially has to perform the "by" grouping twice:
dat [pts, .(time.group, well, frame = x.frame, signal), # returns dat's frame column (desired)
on = .(time.group, well, frame >= frame.start, frame <= frame.stop) # non-equi join, groups once
][ ,
.SD [which.max (signal), .(plus = frame)], # extracting frame at max (signal)
by = .(time.group, well)] # groups again
Output:
time.group well plus
1: base A1 9
2: base B1 8
3: 4h A1 12
4: 4h B1 8
Example 2: Here, I would get the right numbers if I added the i.plus column to the first frame column and subtracted 1, but I can't do that because there are two columns named "frame" in the output after the join.
Also, it wouldn't work if frame didn't start from 1 for every group:
dat [pts,
on = .(time.group, well, frame >= frame.start, frame <= frame.stop), # non-equi join
.(i.plus = which.max(signal)), # adding i.plus to the first frame column, minus 1, would give what I want, but there are two columns named frame
by = .EACHI
]
Output:
time.group well frame frame i.plus
1: base A1 3 17 7
2: base B1 4 18 5
3: 4h A1 3 12 10
4: 4h B1 6 19 3
Example 3: This also works and gives the same table from example 1, but just seems like lots of code:
tmp =
dat [pts,
on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
.(plus = .I[which.max(signal)]), # returns row indices from the original data.table (dat)
by = .EACHI][["plus"]]
dat [tmp, .(time.group, well, plus = frame)] # extract from original table
Example 4: This does not return the original frame column from dat (only the columns from pts), so I can't access the frame that corresponds to max(signal) in dat:
dat [pts,
on = .(time.group, well, frame >= frame.start, frame <= frame.stop), # non-equi join
.SD [which.max (signal) ], # does not return original frame column (x.frame), so I can't extract it
by = .EACHI
]
Output:
time.group well frame frame signal
1: base A1 3 17 0.9565001
2: base B1 4 18 0.9659641
3: 4h A1 3 12 0.9758776
4: 4h B1 6 19 0.9304595
I'm not sure if I should approach this from an entirely different angle and try to join pts into dat instead. Any insight into whether there are more elegant ways of accomplishing this is greatly appreciated!
I'd also like to note that coming up with an optimal strategy for this is pretty important, as I will be doing these types of data extractions many times, so I've been racking my brain about it for a while now :(
Thank you!
Is this what you're looking for?
dat[pts, on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
.(plus = x.frame[which.max(signal)]),
by = .EACHI]
# time.group well frame frame plus
# 1: base A1 3 17 9
# 2: base B1 4 18 8
# 3: 4h A1 3 12 12
# 4: 4h B1 6 19 8
For some reason, using frame instead of x.frame, i.e. frame[which.max(signal)], returns all NA, which I suppose is a bug. Could you please file an issue linking to this post? Thanks.
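As a small follow-up (a sketch, not part of the answer): if the two identically named frame columns in the result are a nuisance, they can be renamed by position, since they hold the frame.start and frame.stop values from pts (the new names below are just illustrative):
res <- dat[pts, on = .(time.group, well, frame >= frame.start, frame <= frame.stop),
           .(plus = x.frame[which.max(signal)]),
           by = .EACHI]
# Columns 3 and 4 both print as "frame"; rename them by position.
setnames(res, 3:4, c("frame.start", "frame.stop"))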
I have a simple problem which can be solved in a dirty way, but I'm looking for a clean way using data.table.
I have the following data.table with n columns belonging to m unequal groups. Here is an example of my data.table:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
A A A A A A
1 -0.7431185 -0.06356047 -0.2247782 -0.15423889 -0.03894069 0.1165187
2 -1.5891905 -0.44468389 -0.1186977 0.02270782 -0.64950716 -0.6844163
A A A A B B B
1 -1.277307 1.8164195 -0.3957006 -0.6489105 0.3498384 -0.463272 0.8458673
2 -1.644389 0.6360258 0.5612634 0.3559574 1.9658743 1.858222 -1.4502839
B B B B B B B
1 0.3167216 -0.2919079 0.5146733 0.6628149 0.5481958 -0.01721261 -0.5986918
2 -0.8104386 1.2335948 -0.6837159 0.4735597 -0.4686109 0.02647807 0.6389771
B B B B C C
1 -1.2980799 0.3834073 -0.04559749 0.8715914 1.1619585 -1.26236232
2 -0.3551722 -0.6587208 0.44822253 -0.1943887 -0.4958392 0.09581703
C C C C
1 -0.1387091 -0.4638417 -2.3897681 0.6853864
2 0.1680119 -0.5990310 0.9779425 1.0819789
What I want to do is take a random subset of the columns (of a specific size), keeping the same number of columns per group (if the chosen sample size is larger than the number of columns belonging to one group, take all of the columns of this group).
I have tried an updated version of the method mentioned in this question:
sample rows of subgroups from dataframe with dplyr
but I'm not able to map the column names to the by argument.
Can someone help me with this?
Here's another approach, IIUC:
idx <- split(seq_along(dframe), names(dframe))
keep <- unlist(Map(sample, idx, pmin(7, lengths(idx))))
dframe[, keep]
Explanation:
The first step splits the column indices according to the column names:
idx
# $A
# [1] 1 2 3 4 5 6 7 8 9 10
#
# $B
# [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24
#
# $C
# [1] 25 26 27 28 29 30
In the next step we use
pmin(7, lengths(idx))
#[1] 7 7 6
to determine the sample size in each group and apply this to each list element (group) in idx using Map. We then unlist the result to get a single vector of column indices.
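If this is needed repeatedly, the same idea could be wrapped in a small helper (a sketch; the function name and the n argument are my own, not part of the answer):
sample_cols <- function(df, n) {
  idx  <- split(seq_along(df), names(df))
  keep <- unlist(Map(sample, idx, pmin(n, lengths(idx))))
  # Caveat: if a group ever has exactly one column, base sample() would treat
  # that single index as 1:index; such groups would need special handling.
  df[, keep]
}

sub_df <- sample_cols(dframe, 7)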
Not sure if you want a solution with dplyr, but here's one with just lapply:
dframe <- as.data.frame(matrix(rnorm(60), ncol=30))
cletters <- rep(c("A","B","C"), times=c(10,14,6))
colnames(dframe) <- cletters
# Number of columns to sample per group
nc <- 8
res <- do.call(cbind,
               lapply(unique(colnames(dframe)), function(x) {
                 cols <- which(colnames(dframe) == x)
                 dframe[, if (length(cols) <= nc) cols else sample(cols, nc, replace = FALSE)]
               }))
It might look complicated, but it really just takes all of a group's columns if there are at most nc of them, and samples nc random columns otherwise.
And to restore your original column-name scheme, gsub does the trick:
colnames(res) <- gsub('\\.[[:digit:]]+', '', colnames(res))
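As a quick sanity check (a sketch): with nc = 8 and group sizes of 10, 14, and 6, the result should keep min(8, group size) columns per group, i.e. 8 + 8 + 6 = 22 columns in total:
dim(res)
# [1]  2 22
table(colnames(res))
#  A  B  C
#  8  8  6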
I would like to attach values to labels in a riverplot in R
I have a list of lists of values, which is required to show flows between nodes, like this:
edges <- list(A = list(C = 10, E = 5),
              B = list(C = 10),
              C = list(D = 13, E = 7))
I am aware of a function that would reduce (or sum) one element of a list like this:
Reduce("+",edges$A)
Is there a way of reducing such a list of lists of values to a data frame (or similar) so that I could get the sums:
Node Sum
A 15
B 10
C 20
D 13
E 12
Edit:
I just realised there is some confusion:
It looks like I need two outputs and it may be a little more complicated:
1. If the 'edges' list has a sublist with a category name, sum up that sublist.
2. If not, get the sum of all occurrences of this item across the sublists.
Case 1: Categories A, B, C (these are the starting nodes)
Case 2: Categories D, E (these are the end nodes in a riverplot)
I am sorry for the confusion.
I think you can do
lapply(edges, function (x) sum(unlist(x)))
This returns a list. Using sapply will simplify the result to a vector.
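For example, the sapply variant returns a named numeric vector directly (the values agree with the tapply results further down):
sapply(edges, function(x) sum(unlist(x)))
#  A  B  C
# 15 10 20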
We can also use base R
v1 <- unlist(edges)
tapply(v1, sub("\\..*", "", names(v1)), sum)
# A B C
#15 10 20
Or in a single step
r1 <- tapply(unlist(edges), rep(names(edges), lengths(edges)), FUN = sum)
r1
# A B C
#15 10 20
If we need to sum based on the names after the .:
r2 <- tapply(v1, sub("[^.]+\\.", "", names(v1)), FUN = sum)
r2
# C D E
#20 13 12
c(r1, r2)[!duplicated(c(names(r1), names(r2)))]
# A B C D E
#15 10 20 13 12
Or using aggregate/stack
aggregate(values~., stack(edges), FUN = sum)
# ind values
#1 A 15
#2 B 10
#3 C 20
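If a single table in the exact Node/Sum shape from the question is needed, the pieces above can be assembled into a data frame (a small sketch reusing r1 and r2 from this answer):
sums <- c(r1, r2)[!duplicated(c(names(r1), names(r2)))]
data.frame(Node = names(sums), Sum = unname(sums), row.names = NULL)
#   Node Sum
# 1    A  15
# 2    B  10
# 3    C  20
# 4    D  13
# 5    E  12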
Another option is to use the purrr package:
library(purrr)
stack(map(edges, compose(sum, unlist)))
# values ind
# 1 15 A
# 2 10 B
# 3 20 C
where compose(sum, unlist) is equivalent to function(x) sum(unlist(x)).
Or using your own proposed Reduce function:
unlist(lapply(edges, function(a) Reduce(sum, a)))
# A B C
#15 10 20
I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which the label is assigned to every element of the sequence spanned by the coordinate range:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for (i in 1:dim(a)[1]) {
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
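If there can be more than one row per group, one workaround (a sketch, not from the original answer) is to expand each row separately by grouping on a row index and then dropping that helper column:
# One group per physical row, so repeated group labels expand independently.
setDT(a)[, .(group = group, V1 = start:end), by = .(rowid = 1:nrow(a))][, .(group, V1)]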
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "group" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
Then, if you want some larger sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)
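To quantify the difference, here is a rough timing sketch using base system.time (slowFun below is just the question's loop wrapped in a function, and x is the 4000-row sample built above; neither name comes from the original answer):
# The question's rbind-in-a-loop approach, wrapped so it can be timed.
slowFun <- function(indf) {
  df <- data.frame()
  for (i in seq_len(nrow(indf))) {
    s  <- seq(indf[i, "start"], indf[i, "end"])
    df <- rbind(df, data.frame(s, rep(indf[i, "group"], length(s))))
  }
  colnames(df) <- c("V1", "V2")
  df
}

system.time(slowFun(x))  # slow: rbind copies the growing data frame each iteration
system.time(myFun(x))    # fast: one mapply plus one rep/unlist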
I've noticed that aggregate() appears to return its result ordered by the grouping column(s). Is this a guarantee? Can this be relied upon in surrounding logic?
A couple of examples:
set.seed(1); df <- data.frame(group=sample(letters[1:3],10,replace=T),value=1:10);
aggregate(value~group,df,sum);
## group value
## 1 a 16
## 2 b 22
## 3 c 17
And with two grouping columns (notice the result is ordered by the second group first, with the first group breaking ties):
set.seed(1); df <- data.frame(group1=sample(letters[1:3],10,replace=T),group2=sample(letters[4:6],10,replace=T),value=1:10);
aggregate(value~group1+group2,df,sum);
## group1 group2 value
## 1 a d 1
## 2 b d 2
## 3 b e 9
## 4 c e 10
## 5 a f 15
## 6 b f 11
## 7 c f 7
Note: I'm asking because I just came up with an answer for Aggregating while merging two dataframes in R which, at least in its current form at the time of writing, depends on aggregate() returning its result ordered by the grouping column.
Yes, as long as you understand the natural ordering of factors to be by their integer keys. You can see this in the code:
y <- as.data.frame(by, stringsAsFactors = FALSE)
... # y becomes the "integerized" dataframe of index vectors
grp <- rank(do.call(paste, c(lapply(rev(y), ident), list(sep = "."))),
ties.method = "min")
y <- y[match(sort(unique(grp)), grp, 0L), , drop = FALSE]
...
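A small illustration of that caveat (a sketch, not from the original answer): with a factor whose levels are not in alphabetical order, the result should follow the integer level codes rather than the labels:
df2 <- data.frame(group = factor(c("b", "a", "b", "a"), levels = c("b", "a")),
                  value = 1:4)
aggregate(value ~ group, df2, sum)
# 'b' (values 1 + 3 = 4) should appear before 'a' (2 + 4 = 6) here,
# because 'b' has the lower integer code, even though 'a' sorts first alphabetically.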