Melt an array into a data.frame while converting one dimension into columns

I would like to convert an array with multiple dimensions (e.g. x, y, z; see 'arr' below) into a data.frame, but keep one dimension in the columns (e.g. z; see 'df2' below).
Currently, I use melt and dcast functions in reshape2 package.
set.seed(1111)
num <- 1000
dim_names <- list(x = seq(num), y = seq(num),
                  z = paste0('Z', 1:5))
dim_arr <- as.numeric(lapply(dim_names, length))
arr <- array(runif(prod(dim_arr)), dim = dim_arr)
dimnames(arr) <- dim_names
library(reshape2)
df <- melt(arr)
head(df)
system.time(df2 <- dcast(df, x + y ~ z, value.var = 'value'))
head(df2)
x y Z1 Z2 Z3 Z4 Z5
1 1 1 0.4655026 0.8027921 0.1950717 0.0403759 0.04669389
2 1 2 0.5156263 0.5427343 0.5799924 0.1911539 0.26069063
3 1 3 0.2788747 0.9394142 0.9081274 0.7712205 0.68748300
4 1 4 0.2827058 0.8001632 0.6995503 0.9913805 0.25421346
5 1 5 0.7054767 0.8013255 0.2511769 0.6556174 0.07780849
6 1 6 0.5576141 0.6452644 0.3362980 0.7353494 0.93147223
However, it took about 10 seconds to convert the 5 M values:
user system elapsed
8.13 1.11 9.39
Are there more efficient methods? Thanks for any suggestions.

Here's a slightly more generalized solution for a 4-dimensional array using a combination of aperm(...) and matrix(...). I'm not wizard enough to generalize this further.
nx <- 2 ; ny <- 3 ; nz <- 4 ; nw <- 5
original <- array(rnorm(nx*ny*nz*nw), dim=c(nx,ny,nz,nw),
                  dimnames=list(x=sprintf('x%s', 1:nx), y=sprintf('y%s', 1:ny),
                                z=sprintf('z%s', 1:nz), w=sprintf('w%s', 1:nw)))
This is your existing method that uses melt(...) and dcast(...) to remove all but the last dimension:
f.dcast <- function(a) dcast(melt(a), x + y + z ~ w)
The following does the same thing by using aperm(...) to write out the data as a vector in a particular order so that it winds up as a properly formatted matrix, then cbinds with the variable names:
f.aperm <- function(a) {
  d <- dim(a)
  ## reverse the dimensions so the last one varies fastest in the vector,
  ## then refold by row so each row holds all values of the last dimension
  data <- matrix(as.vector(aperm(a, c(4,3,2,1))), ncol=d[4], byrow=TRUE)
  colnames(data) <- dimnames(a)[[4]]
  # specify levels in the same order as the input so they don't wind up alphabetical
  varnames <- data.frame(
    factor(rep(dimnames(a)[[1]], times=1, each=d[2]*d[3]), levels=dimnames(a)[[1]]),
    factor(rep(dimnames(a)[[2]], times=d[1], each=d[3] ), levels=dimnames(a)[[2]]),
    factor(rep(dimnames(a)[[3]], times=d[1]*d[2], each=1 ), levels=dimnames(a)[[3]])
  )
  names(varnames) <- names(dimnames(a))[1:3]
  cbind(varnames, data)
}
They both give me the same result:
> desired <- f.dcast(original)
> test <- f.aperm(original)
> all.equal(desired, test)
[1] TRUE
The second method is 6 times faster for an array this size:
> microbenchmark::microbenchmark(f.dcast(original), f.aperm(original))
Unit: milliseconds
expr min lq mean median uq max neval
f.dcast(original) 7.270208 7.343227 7.703360 7.481984 7.698812 10.392141 100
f.aperm(original) 1.218162 1.244595 1.327204 1.259987 1.291986 4.182391 100
If I increase the size of the original array:
nx <- 10 ; ny <- 20 ; nz <- 30 ; nw <- 40
Then the second method is over ten times faster:
> microbenchmark::microbenchmark(f.dcast(original), f.aperm(original))
Unit: milliseconds
expr min lq mean median uq max neval
f.dcast(original) 303.21812 385.44857 381.29150 392.11693 394.81721 472.80343 100
f.aperm(original) 18.62788 22.25814 28.85363 23.90133 24.54939 97.96776 100
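For what it's worth, the rep()/aperm() pattern does generalize. Below is a sketch of a version for an arbitrary number of dimensions, keeping the last one as columns; this is my own extrapolation of f.aperm(...) above, it assumes the array has complete dimnames, and it has only been checked against the 4-d case:
f.aperm.general <- function(a) {
  d <- dim(a); n <- length(d); dn <- dimnames(a)
  ## reverse the dimensions so the last one varies fastest, then refold by row
  data <- matrix(as.vector(aperm(a, n:1)), ncol=d[n], byrow=TRUE)
  colnames(data) <- dn[[n]]
  ## dimension i repeats in blocks of prod(d[(i+1):(n-1)]) and cycles
  ## prod(d[1:(i-1)]) times; levels kept in input order
  varnames <- lapply(seq_len(n - 1L), function(i)
    factor(rep(dn[[i]],
               times = prod(d[seq_len(i - 1L)]),
               each  = prod(d[seq.int(i + 1L, length.out = n - 1L - i)])),
           levels = dn[[i]]))
  names(varnames) <- names(dn)[-n]
  cbind(as.data.frame(varnames), data)
}
all.equal(f.dcast(original), f.aperm.general(original))  ## should be TRUE for the 4-d example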

cbind(x=rep(1:1000, times=1000),   # the first array dimension varies fastest,
      y=rep(1:1000, each=1000),    # so x cycles within y, unlike the dcast() row order
      matrix(arr, ncol=5, dimnames=list(NULL, dimnames(arr)$z)))
Elapsed time for that was around a tenth of a second. This was the result of str():
num [1:1000000, 1:7] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:7] "x" "y" "Z1" "Z2" ..
I suppose you could put in row.names, although it does increase elapsed time to a bit over a second.
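If you need an actual data.frame rather than a matrix, a thin wrapper over the same idea keeps most of the speed. A sketch, using the arr from the question (the rows again come out with x varying fastest, unlike the dcast() ordering):
df3 <- data.frame(x = rep(1:1000, times = 1000),
                  y = rep(1:1000, each  = 1000),
                  matrix(arr, ncol = 5, dimnames = list(NULL, dimnames(arr)$z)))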


R-like way to add a % column to a data frame

I want to add a column that shows the percentage each row contributes to the sum of the column.
> trees['Heigth_%'] <- round((100 / sum(trees$Height) * trees$Height), digits=2)
> head(trees)
Girth Height Volume Heigth_%
1 8.3 70 10.3 2.97
2 8.6 65 10.3 2.76
3 8.8 63 10.2 2.67
4 10.5 72 16.4 3.06
5 10.7 81 18.8 3.44
6 10.8 83 19.7 3.52
This works.
But the question is whether this is a good, R-like way.
e.g. Is sum() called for each row? Or is R intelligent enough here?
To answer your question whether sum is called for every row or R is intelligent enough, you can use trace:
df = data.frame(a = 1:10, b = 21:30)
df['b_%'] = round((100 / sum(df$b) * df$b), digits=2)
trace('sum')
round((100 / sum(df$b) * df$b), digits=2)
untrace('sum')
This shows only one call to the sum function. Afterwards, R recognizes that the lengths of trees$Height and sum(trees$Height) differ and recycles the shorter one until it has the same length as the longer one.
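For illustration, here is a minimal sketch of that recycling (values invented):
h <- c(70, 65, 63)
s <- sum(h)   # evaluated once; a length-1 vector
100 / s * h   # the length-1 result of 100/s is recycled across h
#[1] 35.35354 32.82828 31.81818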
Converting first to data.table and then using prop.table is a bit faster than f1-f3, which are defined in the answer below (if you neglect the conversion to a data.table, as this is usually done only once for all subsequent commands).
# Get data
data(trees)
# Load package & convert to data.table
library(data.table)
trees <- as.data.table(trees)
# data.table way to create the new variable
f4 <- function(trees) {
  trees[, Heigth_percentage:=round(prop.table(Height)*100,2)]
}
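(For a plain vector, prop.table(x) is simply x/sum(x), so f4 computes the same quantity as the sum()-based versions up to rounding. A quick sanity check:)
h <- trees$Height
all.equal(prop.table(h), h / sum(h))
#[1] TRUE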
Here are the benchmark results:
# > microbenchmark(r1 <- f1(trees),
# + r2 <- f2(trees),
# + r3 <- f3(trees),
# + r4 <- f4(trees),
# + times = 10000)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# r1 <- f1(trees) 616.616 666.290 730.8883 683.122 708.164 8390.818 10000 b
# r2 <- f2(trees) 617.437 666.701 730.3211 683.533 709.191 8100.574 10000 b
# r3 <- f3(trees) 596.500 655.616 721.1057 672.243 697.080 55048.757 10000 b
# r4 <- f4(trees) 551.342 612.922 680.7581 633.037 665.059 54672.712 10000 a
To begin, Vandenman's answer is much more adequate and precise. What follows is not really worth an answer but, as usual, it was not readable as a comment.
I have added the prop.table() and data.table() (see majom's answer) approaches to the timings. With 40k rows, data.table() is a bit closer to the rest but still slower (~3.7 ms vs. ~3 ms); with 400k rows it starts to be comparable; and with 4M rows it is finally faster than the rest:
library(microbenchmark)
trees <- data.frame(Height = runif(4000000, 9, 11),
                    Heigth_PCT = numeric(4000000))
library(data.table)
trees_dt <- as.data.table(trees)
f1 <- function(trees) {
  trees$Heigth_PCT <- round((100 / sum(trees$Height) * trees$Height), digits = 2)
  return(trees)
}
f2 <- function(trees) {
  sum_trees <- sum(trees$Height)
  trees$Heigth_PCT <- round((100 / sum_trees * trees$Height), digits = 2)
  return(trees)
}
f3 <- function(trees) {
  trees$Heigth_PCT <- round(prop.table(trees$Height)*100, digits = 2)
  return(trees)
}
f4 <- function(trees_dt) {
  trees_dt[, Heigth_PCT := round(prop.table(Height)*100, 2)]
}
# Time all four functions
microbenchmark(r1 <- f1(trees),
               r2 <- f2(trees),
               r3 <- f3(trees),
               r4 <- f4(trees_dt),
               times = 100)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# r1 <- f1(trees) 296.4452 309.3853 373.5945 318.7987 400.0373 639.8556 100 a
# r2 <- f2(trees) 296.3453 310.6638 381.4048 323.0655 474.9295 682.2172 100 a
# r3 <- f3(trees) 304.3185 317.0654 383.9600 328.5494 395.6238 783.2435 100 a
# r4 <- f4(trees_dt) 304.3327 315.4685 361.9526 325.8711 366.1153 722.7629 100 a
sapply(list(r2, r3, as.data.frame(r4)), identical, r1)
# [1] TRUE TRUE TRUE
Edit: prop.table() added.
Edit 2: data.table() added.

Create groups from vector of 0,1 and NA

Background:
I'm trying to strip out a corpus where the speaker is identified. I've reduced the problem of removing a particular speaker from the corpus to the following stream of 1, 0, and NA values (x). 0 means that person is speaking, 1 means someone else is speaking, and NA means that whoever spoke last is still speaking.
Here's a visual example:
0 1 S0: Hello, how are you today?
1 2 S1: I'm great thanks for asking!
NA 3 I'm a little tired though!
0 4 S0: I'm sorry to hear that. Are you ready for our discussion?
1 5 S1: Yes, I have everything I need.
NA 7 Let's begin.
So from this frame, I'd like to take 2, 3, 5, and 7. Or, I would like the result to be 0,1,1,0,1,1.
How do I pull out the positions of each run of 1 and NA up to the position before the next 0 in a vector?
Here is an example, and my desired output:
Example input:
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
Example output:
These are the positions that I want, because they identify that "speaker 1" is talking (a 1, or a 1 followed by NAs, up to the next 0):
pos <- c(6,8,9,10,11,15,16,17)
An alternative output would be a filling:
fill <- c(0,0,0,0,0,1,0,1,1,1,1,0,0,0,1,1,1,0)
Here the NA values are filled with the previous 1 or 0, up to the next new value.
s <- which(x==1);
e <- c(which(x!=1),length(x)+1L);
unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L));
## [1] 6 8 9 10 11 15 16 17
Every occurrence of a 1 in the input vector is the start of a sequence of position indexes applicable to speaker 1. We capture this in s with which(x==1).
For each start index, we must find the length of its containing sequence. The length is determined by the closest forward occurrence of a 0 (or, more generally, any non-NA value other than 1, if such was possible). Hence, we must first compute which(x!=1) to get these indexes. Because the final occurrence of a 1 may not have a forward occurrence of a 0, we must append an extra virtual index one unit past the end of the input vector, which is why we must call c() to combine length(x)+1L. We store this as e, reflecting that these are (potential) end indexes. Note that these are exclusive end indexes; they are not actually part of the (potential) preceding speaker 1 sequence.
Finally, we must generate the actual sequences. To do this, we must make one call to seq() for each element of s, also passing its corresponding end index from e. To find the end index we can use findInterval() to find the index into e whose element value (that is, the end index into x) falls just before each respective element of s. (The reason why it is just before is that the algorithm used by findInterval() is v[i[j]] ≤ x[j] < v[i[j]+1] as explained on the doc page.) We must then add one to it to get the index into e whose element value falls just after each respective element of s. We then index e with it, giving us the end indexes into x that follow each respective element of s. We must subtract one from that because the sequence we generate must exclude the (exclusive) end element. The easiest way to make the calls to seq() is to Map() the two endpoint vectors to it, returning a list of each sequence, which we can unlist() to get the required output.
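To make the index arithmetic concrete, here is a walkthrough of the intermediate values on the example input (a sketch; the comments show what each step yields):
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);
s <- which(x==1);                  ## 6 8 15       -- run starts
e <- c(which(x!=1),length(x)+1L);  ## 5 7 12 18 19 -- exclusive end candidates
e[findInterval(s,e)+1L]-1L;        ## 6 11 17      -- inclusive end of each run
unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L));
## [1] 6 8 9 10 11 15 16 17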
s <- which(!is.na(x));
rep(c(0,x[s]),diff(c(1L,s,length(x)+1L)));
## [1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0
Every occurrence of a non-NA value in the input vector is the start of a segment which, in the output, must become a repetition of the element value at that start index. We capture these indexes in s with which(!is.na(x)).
We must then repeat each start element a sufficient number of times to reach the following segment. Hence we can call rep() on x[s] with a vectorized times argument whose values consist of diff() called on s. To handle the final segment, we must append an index one unit past the end of the input vector, length(x)+1L. Also, to deal with the possible case of NAs leading the input vector, we must prepend a 0 to x[s] and a 1 to the diff() argument, which will repeat 0 a sufficient number of times to cover the leading NAs, if such exist.
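Again, a concrete walkthrough on the example input (a sketch):
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);
s <- which(!is.na(x));         ## 5 6 7 8 12 15 18 -- segment starts
diff(c(1L,s,length(x)+1L));    ## 4 1 1 1 4 3 3 1  -- segment lengths
rep(c(0,x[s]),diff(c(1L,s,length(x)+1L)));
## [1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0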
Benchmarking (Position)
library(zoo);
library(microbenchmark);
library(stringi);
marat <- function(x) { v <- na.locf(zoo(x)); index(v)[v==1]; };
rawr <- function(x) which(zoo::na.locf(c(0L, x))[-1L] == 1L);
jota1 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); unlist(gregexpr("1", stringx)); };
jota2 <- function(x) { stringx <- paste(x, collapse = ""); stringx <- gsub("NA", "N", stringx, fixed = TRUE); while(grepl("(?<=1)N", stringx, perl = TRUE)) stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE); newx <-unlist(strsplit(stringx, "")); which(newx == 1); };
jota3 <- function(x) {x[is.na(x)] <- "N"; stringx <- stri_flatten(x); ones <- stri_locate_all_regex(stringx, "1N*")[[1]]; unlist(lapply(seq_along(ones[, 1]), function(ii) seq.int(ones[ii, "start"], ones[ii, "end"]))); };
bgoldst <- function(x) { s <- which(x==1); e <- c(which(x!=1),length(x)+1L); unlist(Map(seq,s,e[findInterval(s,e)+1L]-1L)); };
## OP's test case
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);
ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE
microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: microseconds
## expr min lq mean median uq max neval
## marat(x) 411.830 438.5580 503.24486 453.7400 489.2345 2299.915 100
## rawr(x) 115.466 143.0510 154.64822 153.5280 163.7920 276.692 100
## jota1(x) 448.180 469.7770 484.47090 479.6125 491.1595 835.633 100
## jota2(x) 440.911 464.4315 478.03050 472.1290 484.3170 661.579 100
## jota3(x) 53.885 65.4315 74.34808 71.2050 76.9785 158.232 100
## bgoldst(x) 34.212 44.2625 51.54556 48.5395 55.8095 139.843 100
## scale test, high probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE
microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## marat(x) 189.34479 196.70233 226.72926 233.39234 237.45738 293.95154 100
## rawr(x) 24.46984 27.46084 43.91167 29.92112 68.86464 79.53008 100
## jota1(x) 154.91450 157.09231 161.73505 158.18326 160.42694 206.04889 100
## jota2(x) 149.47561 151.68187 155.92497 152.93682 154.79668 201.13302 100
## jota3(x) 82.30768 83.89149 87.35308 84.99141 86.95028 129.94730 100
## bgoldst(x) 80.94261 82.94125 87.80780 84.02107 86.10844 130.56440 100
## scale test, low probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- marat(x);
identical(ex,rawr(x));
## [1] TRUE
identical(ex,jota1(x));
## [1] TRUE
identical(ex,jota2(x));
## [1] TRUE
identical(ex,jota3(x));
## [1] TRUE
identical(ex,bgoldst(x));
## [1] TRUE
microbenchmark(marat(x),rawr(x),jota1(x),jota2(x),jota3(x),bgoldst(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## marat(x) 178.93359 189.56032 216.68963 226.01940 234.06610 294.6927 100
## rawr(x) 17.75869 20.39367 36.16953 24.44931 60.23612 79.5861 100
## jota1(x) 100.10614 101.49238 104.11655 102.27712 103.84383 150.9420 100
## jota2(x) 94.59927 96.04494 98.65276 97.20965 99.26645 137.0036 100
## jota3(x) 193.15175 202.02810 216.68833 209.56654 227.94255 295.5672 100
## bgoldst(x) 253.33013 266.34765 292.52171 292.18406 311.20518 387.3093 100
Benchmarking (Fill)
library(microbenchmark);
bgoldst <- function(x) { s <- which(!is.na(x)); rep(c(0,x[s]),diff(c(1L,s,length(x)+1L))); };
user31264 <- function(x) { x[is.na(x)]=2; x.rle=rle(x); val=x.rle$v; if (val[1]==2) val[1]=0; ind = (val==2); val[ind]=val[which(ind)-1]; rep(val,x.rle$l); };
## OP's test case
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0);
ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE
microbenchmark(bgoldst(x),user31264(x));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(x) 10.264 11.548 14.39548 12.403 13.258 73.557 100
## user31264(x) 31.646 32.930 35.74805 33.785 35.068 84.676 100
## scale test, high probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.8; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE
microbenchmark(bgoldst(x),user31264(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(x) 10.94491 11.21860 12.50473 11.53015 12.28945 50.25899 100
## user31264(x) 17.18649 18.35634 22.50400 18.91848 19.53708 65.02668 100
## scale test, low probability of NA
set.seed(1L);
N <- 1e5L; probNA <- 0.2; x <- sample(c(NA,T),N,T,c(probNA,1-probNA)); x[which(x)] <- rep(c(0,1),len=sum(x,na.rm=T));
ex <- bgoldst(x);
identical(ex,user31264(x));
## [1] TRUE
microbenchmark(bgoldst(x),user31264(x));
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(x) 5.24815 6.351279 7.723068 6.635454 6.923264 45.04077 100
## user31264(x) 11.79423 13.063710 22.367334 13.986584 14.908603 55.45453 100
You can make use of na.locf from the zoo package:
library(zoo)
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
v <- na.locf(zoo(x))
index(v)[v==1]
#[1] 6 8 9 10 11 15 16 17
x <- c(NA,NA,NA,NA,0,1,0,1,NA,NA,NA,0,NA,NA,1,NA,NA,0)
x[is.na(x)]=2                # encode NA as 2 so rle() can run on the vector
x.rle=rle(x)
val=x.rle$v
if (val[1]==2) val[1]=0      # leading NAs belong to speaker 0
ind = (val==2)
val[ind]=val[which(ind)-1]   # each NA run inherits the value of the preceding run
rep(val,x.rle$l)             # expand the runs back to the full length
Output:
[1] 0 0 0 0 0 1 0 1 1 1 1 0 0 0 1 1 1 0
Pasting the sequence into a string, then using a while loop that checks (with grepl) whether any NAs are preceded by 1s and substitutes (with gsub) a 1 in such cases, will do it:
# substitute NA for "N" for later ease of processing and locating 1s by position
x[is.na(x)] <- "N"
# Collapse vector into a string
stringx <- paste(x, collapse = "")
while(grepl("(?<=1)N", stringx, perl = TRUE)) {
  stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE)
}
Then you can use gregexpr to get the indices of 1s.
unlist(gregexpr("1", stringx))
#[1] 6 8 9 10 11 15 16 17
Or you can split the string and look through to find the indices of 1s in the resulting vector:
newx <- unlist(strsplit(stringx, ""))
#[1] "N" "N" "N" "N" "0" "1" "0" "1" "1" "1" "1" "0" "N" "N" "1" "1" "1" "0"
which(newx == "1")
#[1] 6 8 9 10 11 15 16 17
Using stri_flatten from the stringi package instead of paste and stri_locate_all_fixed rather than gregexpr or a string splitting route can provide a little bit more performance if it's a larger vector you're processing. If the vector isn't large, no performance gains will result.
library(stringi)
x[is.na(x)] <- "N"
stringx <- stri_flatten(x)
while(grepl("(?<=1)N", stringx, perl = TRUE)) {
  stringx <- gsub("(?<=1)N", "1", stringx, perl = TRUE)
}
stri_locate_all_fixed(stringx, "1")[[1]][,"start"]
The following approach is fairly straightforward and performs relatively well (based on bgoldst's excellent benchmarking example) on small and large samples (very well on bgoldst's high probability of NA example)
x[is.na(x)] <- "N"
stringx <- stri_flatten(x)
ones <- stri_locate_all_regex(stringx, "1N*")[[1]]
# start end
#[1,] 6 6
#[2,] 8 11
#[3,] 15 17
unlist(lapply(seq_along(ones[, 1]),
              function(ii) seq.int(ones[ii, "start"], ones[ii, "end"])))
#[1] 6 8 9 10 11 15 16 17

Convert list of lists of vectors to data frame in R

I have a list of lists of numeric vectors. I would like to
sum each vector
make each inner list of scalars into a vector
combine the vectors into one data frame where the vectors are rows
I hacked together the following, but it seems very clumsy and not the R way. I suspect that if I am using rapply and four steps, I am doing something wrong (not in terms of results, but in terms of efficiency and understanding). Is there a right way to do this operation?
dat1 <- list(list(1:2, 3:4), list(5:6, 7:8))
dat2 <- rapply(dat1, sum, how="list")  # sum each vector
dat3 <- lapply(dat2, unlist)           # each inner list of scalars -> a vector
dat4 <- do.call(rbind, dat3)           # stack the vectors as rows of a matrix
dat5 <- data.frame(dat4)               # matrix -> data frame
So the desired output is
> dat5
X1 X2
1 3 7
2 11 15
> class(dat5)
[1] "data.frame"
Well, I'm not sure if it's any more efficient, but here's something similar:
as.data.frame(matrix(rapply(dat1,sum), nrow=length(dat1), byrow=TRUE))
X1 X2
1 3 7
2 11 15
This seems to get you there in one go, with little difference in efficiency from the matrix approach:
> as.data.frame(do.call(rbind, rapply(dat1, sum, how = "list")))
# X1 X2
# 1 3 7
# 2 11 15
> f1 <- function()
as.data.frame(matrix(rapply(dat1,sum), nrow=length(dat1), byrow=TRUE))
> f2 <- function()
as.data.frame(do.call(rbind, rapply(dat1, sum, how = "list")))
> library(microbenchmark)
> microbenchmark(f1(), f2())
# Unit: microseconds
# expr min lq median uq max neval
# f1() 91.047 92.7105 93.94 102.2855 199.305 100
# f2() 95.213 97.3150 98.43 101.8240 134.403 100
Almost a wash for this example.
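For completeness, a base-R variant using vapply in place of rapply gets there in one line as well (a sketch; note that the default column names come out as V1/V2 rather than X1/X2):
as.data.frame(do.call(rbind, lapply(dat1, vapply, sum, numeric(1))))
#   V1 V2
# 1  3  7
# 2 11 15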

R: Finding the closest match from a number of vectors

I have the following vectors
> X <- c(1,1,3,4)
> a <- c(1,1,2,2)
> b <- c(2,1,4,3)
> c <- c(2,1,4,6)
I want to compare each element of X with the corresponding elements of a, b and c, and finally I need a class assigned to each row of X. For example:
The first element of X is 1 and it has a match in the corresponding element of vector a, so the class is '1-1' (no matter which vector the match came from).
The second element of X is 1 and it also has a match (in fact three of them), so again the class is '1-1'.
The third element of X is 3 and it doesn't have a match, so I look for the next integer value up, which is 4, and there is a 4 (in b and c). So the class should be '3-4'.
The fourth element of X is 4 and it doesn't have a match. There is also no 5 (the next integer up), so I look for the previous integer, which is 3, and there is a 3. So the class should be '4-3'.
Actually I have thousands of rows for each vector and I have to do this for each row. Any suggestions for doing it in a less complicated way? I would prefer to use base R functions.
Based on rbatt's comment and answer I realized my original answer was quite lacking. Here's a redo...
match_nearest <- function( x, table )
{
  ## signed distances; x is recycled down the columns of 'table'
  dist <- x - table
  ## where the table value exceeds x, shrink the distance by 0.5 so that
  ## ties are broken in favour of the next integer *up*
  tgt <- which( dist < 0, arr.ind=TRUE, useNames=FALSE )
  dist[tgt] <- abs( dist[tgt] + .5 )
  ## per row, take the table entry with the smallest adjusted distance
  table[ cbind( seq_along(x), max.col( -dist, ties.method="first" ) ) ]
}
X <- c(1,1,3,4)
a <- c(1,1,2,2)
b <- c(2,1,4,3)
c <- c(2,1,4,6)
paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
## [1] "1-1" "1-1" "3-4" "4-3"
Comparing against the original answer and rbatt's, we find neither was correct!
set.seed(1)
X <- rbinom(n=1E4, size=10, prob=0.5)
a <- rbinom(n=1E4, size=10, prob=0.5)
b <- rbinom(n=1E4, size=10, prob=0.5)
c <- rbinom(n=1E4, size=10, prob=0.5)
T <- current_solution(X,a,b,c)
R <- rbatt_solution(X,a,b,c)
all.equal( T, R )
## [1] "195 string mismatches"
# Look at mismatched rows...
mismatch <- head( which( T != R ) )
cbind(X,a,b,c)[mismatch,]
## X a b c
## [1,] 4 6 3 3
## [2,] 5 7 4 7
## [3,] 5 8 3 9
## [4,] 5 7 7 4
## [5,] 4 6 3 7
## [6,] 5 7 4 2
T[mismatch]
## [1] "4-3" "5-4" "5-3" "5-4" "4-3" "5-4"
R[mismatch]
## [1] "4-6" "5-7" "5-8" "5-7" "4-6" "5-7"
and needlessly slow...
library(microbenchmark)
bm <- microbenchmark( current_solution(X,a,b,c),
previous_solution(X,a,b,c),
rbatt_solution(X,a,b,c) )
print(bm, order="median")
## Unit: milliseconds
## expr min lq median uq max neval
## current_solution(X, a, b, c) 7.088 7.298 7.996 8.268 38.25 100
## rbatt_solution(X, a, b, c) 33.920 38.236 46.524 53.441 85.50 100
## previous_solution(X, a, b, c) 83.082 93.869 101.997 115.961 135.98 100
Looks like the current_solution is getting it right; but without an expected output ...
Here are the functions...
current_solution <- function(X,a,b,c) {
  paste(X, match_nearest(X, cbind(a,b,c) ), sep="-")
}
# DO NOT USE... it is wrong!
previous_solution <- function(X,a,b,c) {
  dat <- rbind(X,a,b,c)
  v <- apply(dat,2, function(v) {
    v2 <- v[1] - v
    v2[v2<0] <- abs( v2[v2<0]) - 1
    v[ which.min( v2[-1] ) + 1 ]
  })
  paste("X", v, sep="-")
}
# DO NOT USE... it is wrong!
rbatt_solution <- function(X,a,b,c) {
  mat <- cbind(X,a,b,c)
  diff.signed <- mat[,"X"]-mat[,c("a","b","c")]
  diff.break <- abs(diff.signed) + sign(diff.signed)*0.5
  min.ind <- apply(diff.break, 1, which.min)
  ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2)
  match.value <- mat[,c("a","b","c")][ind.array]
  ref.class <- paste(X, match.value, sep="-")
  ref.class
}
This solution should provide the output you want. Also, it is ~ 3x faster than Thell's solution, because the differences are vectorized and are not calculated row-wise with apply.
I compare times for the two approaches below. Note that if you want the "class" as another column in a data.frame, just uncomment the last line of my function. I commented it out to make the calculation times between the two answers more comparable (creating a data.frame is quite slow).
# Example data from Thell, plus 1 more
X1 <- c(1,1,3,4,7,1, 5)
a1 <- c(1,1,2,2,2,2, 9)
b1 <- c(2,1,4,3,3,3, 3)
c1 <- c(2,1,4,6,6,6, 7)
# Random example data, much larger
# X1 <- rbinom(n=1E4, size=10, prob=0.5)
# a1 <- rbinom(n=1E4, size=10, prob=0.5)
# b1 <- rbinom(n=1E4, size=10, prob=0.5)
# c1 <- rbinom(n=1E4, size=10, prob=0.5)
My answer:
rbTest <- function(){
  mat <- cbind(X1,a1,b1,c1)
  diff.signed <- mat[,"X1"]-mat[,c("a1","b1","c1")] # differences (with sign)
  diff.break <- abs(diff.signed) + sign(diff.signed)*0.5 # penalize for differences that are negative by adding 0.5 to them (break ties by preferring higher integer)
  min.ind <- apply(diff.break, 1, which.min) # index of smallest difference (prefer larger integers when there is a tie)
  ind.array <- matrix(c(1:nrow(mat),min.ind), ncol=2) # array index format
  match.value <- mat[,c("a1","b1","c1")][ind.array] # value of the smallest difference (value of the match)
  ref.class <- paste(X1, match.value, sep="-") # the 'class' in the format 'ref-match'
  ref.class
  # data.frame(class=ref.class, mat)
}
Thell's answer:
thTest <- function(){
  dat <- rbind(X1,a1,b1,c1)
  apply(dat,2, function(v) {
    # Get distance
    v2 <- v[1] - v
    # Prefer values >= v[1]
    v2[v2<0] <- abs( v2[v2<0]) - 1
    # Obtain and return nearest v excluding v[1]
    v[ which.min( v2[-1] ) + 1 ]
  })
}
Benchmark on large matrix (10,000 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: milliseconds
# expr min lq median uq max neval
# rbTest() 47.95451 52.01729 59.36161 71.94076 103.1314 100
# thTest() 167.49798 180.69627 195.02828 204.19916 315.0610 100
Benchmark on small matrix (7 rows)
# > microbenchmark(rbTest(), thTest())
# Unit: microseconds
# expr min lq median uq max neval
# rbTest() 108.299 112.3550 115.4225 119.4630 146.722 100
# thTest() 147.727 152.2015 155.9005 159.3115 235.898 100
Example output (small matrix):
# > rbTest()
# [1] "1-1" "1-1" "3-4" "4-3" "7-6" "1-2" "5-7" "6-1"
# > thTest()
# [1] 1 1 4 3 6 2 7

Aggregate rows in a large matrix by rowname

I would like to aggregate the rows of a matrix by adding the values in rows that have the same rowname. My current approach is as follows:
> M
a b c d
1 1 1 2 0
1 2 3 4 2
2 3 0 1 2
3 4 2 5 2
> index <- as.numeric(rownames(M))
> M <- cbind(M,index)
> Dfmat <- data.frame(M)
> Dfmat <- aggregate(. ~ index, data = Dfmat, sum)
> M <- as.matrix(Dfmat)
> rownames(M) <- M[,"index"]
> M <- subset(M, select= -index)
> M
a b c d
1 3 4 6 2
2 3 0 1 2
3 4 2 5 2
The problem with this approach is that I need to apply it to a number of very large matrices (up to 1,000 rows and 30,000 columns). In these cases the computation time is very high (the same problem arises when using ddply). Is there a more efficient way to come up with the solution? Does it help that the original input matrices are DocumentTermMatrix objects from the tm package? As far as I know they are stored in a sparse matrix format.
Here's a solution using by and colSums, but it requires some fiddling due to the default output of by.
M <- matrix(1:9,3)
rownames(M) <- c(1,1,2)
t(sapply(by(M,rownames(M),colSums),identity))
V1 V2 V3
1 3 9 15
2 3 6 9
There is now an aggregate function in Matrix.utils. This can accomplish what you want with a single line of code and is about 10x faster than the combineByRow solution and 100x faster than the by solution:
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
> microbenchmark(a<-t(sapply(by(m,rownames(m),colSums),identity)),b<-combineByRow(m),c<-aggregate.Matrix(m,row.names(m)),times = 10)
Unit: milliseconds
expr min lq mean median uq max neval
a <- t(sapply(by(m, rownames(m), colSums), identity)) 6000.26552 6173.70391 6660.19820 6419.07778 7093.25002 7723.61642 10
b <- combineByRow(m) 634.96542 689.54724 759.87833 732.37424 866.22673 923.15491 10
c <- aggregate.Matrix(m, row.names(m)) 42.26674 44.60195 53.62292 48.59943 67.40071 70.40842 10
> identical(as.vector(a),as.vector(c))
[1] TRUE
EDIT: Frank is right, rowsum is somewhat faster than any of these solutions. You would want to consider using another one of these other functions only if you were using a Matrix, especially a sparse one, or if you were performing an aggregation besides sum.
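For reference, the rowsum() route Frank mentions is a base R one-liner; a sketch using the same m as above (rows come back ordered by sorted group name):
d <- rowsum(m, group = rownames(m))   # sums rows within groups of identical row names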
The answer by James works as expected, but it is quite slow for large matrices. Here is a version that avoids creating new objects:
combineByRow <- function(m) {
  m <- m[ order(rownames(m)), ]
  ## keep track of previous row name
  prev <- rownames(m)[1]
  i.start <- 1
  i.end <- 1
  ## cache the rownames -- profiling shows that it takes
  ## forever to look at them
  m.rownames <- rownames(m)
  stopifnot(all(!is.na(m.rownames)))
  ## go through matrix in a loop, as we need to combine some unknown
  ## set of rows
  for (i in 2:(1+nrow(m))) {
    curr <- m.rownames[i]
    ## if we found a new row name (or are at the end of the matrix),
    ## combine all rows and mark invalid rows
    if (prev != curr || is.na(curr)) {
      if (i.start < i.end) {
        ## NB: this version combines duplicate rows with max;
        ## swap in sum to add them, as the question asks
        m[i.start,] <- apply(m[i.start:i.end,], 2, max)
        m.rownames[(1+i.start):i.end] <- NA
      }
      prev <- curr
      i.start <- i
    } else {
      i.end <- i
    }
  }
  m[ which(!is.na(m.rownames)),]
}
Testing it shows that it is about 10x faster than the answer using by (2 vs. 20 seconds in this example):
N <- 10000
m <- matrix( runif(N*100), nrow=N)
rownames(m) <- sample(1:(N/2),N,replace=T)
start <- proc.time()
m1 <- combineByRow(m)
print(proc.time()-start)
start <- proc.time()
m2 <- t(sapply(by(m,rownames(m),function(x) apply(x, 2, max)),identity))
print(proc.time()-start)
all(m1 == m2)
