Insert value to vector [duplicate]

I want to interlace two vectors of the same mode and equal length. Say:
a <- rpois(lambda=3,n=5e5)
b <- rpois(lambda=4,n=5e5)
I would like to interweave or interlace these two vectors, creating a vector equivalent to c(a[1], b[1], a[2], b[2], ..., a[length(a)], b[length(b)]).
My first attempt was this:
sapply(X=rep.int(c(3,4),times=5e5),FUN=rpois,n=1)
but it requires rpois to be called far more times than needed.
My best attempt so far has been to bind them into a two-row matrix and convert that back into a vector:
d <- c(rbind(rpois(lambda=3,n=5e5),rpois(lambda=4,n=5e5)))
d <- c(rbind(a,b))
Is there a better way to go about doing it? Or is there a function in base R that accomplishes the same thing?

Your rbind method should work well. You could also use
rpois(lambda=c(3,4),n=1e6)
because R will automatically replicate the vector of lambda values to the required length. There's not much difference in speed:
library(rbenchmark)
benchmark(rpois(1e6,c(3,4)),
c(rbind(rpois(5e5,3),rpois(5e5,4))))
# test replications elapsed relative
# 2 c(rbind(rpois(5e+05, 3), rpois(5e+05, 4))) 100 23.390 1.112168
# 1 rpois(1e+06, c(3, 4)) 100 21.031 1.000000
And elegance is in the eye of the beholder ... Of course, the c(rbind(...)) method works in general for constructing alternating vectors, while the other solution is specific to rpois or other functions that recycle their arguments in that way.
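A quick sanity check on toy vectors (a minimal sketch; the values are arbitrary):
a <- c(1, 3, 5)
b <- c(2, 4, 6)
c(rbind(a, b))
# [1] 1 2 3 4 5 6
x <- rpois(6, lambda = c(3, 100))
# the recycled lambda alternates too: odd positions of x are draws with
# lambda = 3, even positions with lambda = 100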

Some speed tests, incorporating Ben Bolker's answer:
benchmark(
c(rbind(rpois(lambda=3,n=5e5),rpois(lambda=4,n=5e5))),
c(t(sapply(X=list(3,4),FUN=rpois,n=5e5))),
sapply(X=rep.int(c(3,4),times=5e5),FUN=rpois,n=1),
rpois(lambda=c(3,4),n=1e6),
rpois(lambda=rep.int(c(3,4),times=5e5),n=1e6)
)
test
1 c(rbind(rpois(lambda = 3, n = 5e+05), rpois(lambda = 4, n = 5e+05)))
2 c(t(sapply(X = list(3, 4), FUN = rpois, n = 5e+05)))
4 rpois(lambda = c(3, 4), n = 1e+06)
5 rpois(lambda = rep.int(c(3, 4), times = 5e+05), n = 1e+06)
3 sapply(X = rep.int(c(3, 4), times = 5e+05), FUN = rpois, n = 1)
replications elapsed relative user.self sys.self user.child sys.child
1 100 6.14 1.000000 5.93 0.15 NA NA
2 100 7.11 1.157980 7.02 0.02 NA NA
4 100 14.09 2.294788 13.61 0.05 NA NA
5 100 14.24 2.319218 13.73 0.21 NA NA
3 100 700.84 114.143322 683.51 0.50 NA NA

Related

Create a sequence of sequences of numbers

I would like to make the following sequence in R, by using rep or any other function.
c(1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
Basically, c(1:5, 2:5, 3:5, 4:5, 5:5).
Use sequence.
sequence(5:1, from = 1:5)
[1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
The first argument, nvec, is the length of each sequence (5:1); the second, from, is the starting point for each sequence (1:5).
Note: this works only for R >= 4.0.0. From R News 4.0.0:
sequence() [...] gains arguments [e.g. from] to generate more complex sequences.
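A small illustration of the generalized arguments (a sketch; the values are arbitrary):
sequence(c(2, 3), from = c(10, 1))
# [1] 10 11  1  2  3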
unlist(lapply(1:5, function(i) i:5))
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
Some speed tests on all the answers provided (note: the OP mentioned n around 10K somewhere, if I recall correctly):
s1 <- function(n) {
unlist(lapply(1:n, function(i) i:n))
}
s2 <- function(n) {
unlist(lapply(seq_len(n), function(i) seq(from = i, to = n, by = 1)))
}
s3 <- function(n) {
vect <- 0:n
unlist(replicate(n, vect <<- vect[-1]))
}
s4 <- function(n) {
m <- matrix(1:n, ncol = n, nrow = n, byrow = TRUE)
m[lower.tri(m)] <- 0
c(t(m)[t(m != 0)])
}
s5 <- function(n) {
m <- matrix(seq.int(n), ncol = n, nrow = n)
m[lower.tri(m, diag = TRUE)]
}
s6 <- function(n) {
out <- c()
for (i in 1:n) {
out <- c(out, (1:n)[i:n])
}
out
}
library(rbenchmark)
n = 5L
benchmark(
"s1" = { s1(n) },
"s2" = { s2(n) },
"s3" = { s3(n) },
"s4" = { s4(n) },
"s5" = { s5(n) },
"s6" = { s6(n) },
replications = 1000,
columns = c("test", "replications", "elapsed", "relative")
)
Do not be fooled by the "fast" solutions at this small n: they make hardly any function calls (and calls are what take time here), and the differences are multiplied by the 1000 replications.
test replications elapsed relative
1 s1 1000 0.05 2.5
2 s2 1000 0.44 22.0
3 s3 1000 0.14 7.0
4 s4 1000 0.08 4.0
5 s5 1000 0.02 1.0
6 s6 1000 0.02 1.0
n = 1000L
benchmark(
"s1" = { s1(n) },
"s2" = { s2(n) },
"s3" = { s3(n) },
"s4" = { s4(n) },
"s5" = { s5(n) },
"s6" = { s6(n) },
replications = 10,
columns = c("test", "replications", "elapsed", "relative")
)
As the poster already flagged as a "not to do", the for loop becomes quite slow compared to every other method at n = 1000L:
test replications elapsed relative
1 s1 10 0.17 1.000
2 s2 10 0.83 4.882
3 s3 10 0.19 1.118
4 s4 10 1.50 8.824
5 s5 10 0.29 1.706
6 s6 10 28.64 168.471
n = 10000L
benchmark(
"s1" = { s1(n) },
"s2" = { s2(n) },
"s3" = { s3(n) },
"s4" = { s4(n) },
"s5" = { s5(n) },
# "s6" = { s6(n) },
replications = 10,
columns = c("test", "replications", "elapsed", "relative")
)
At large n we see that the matrix-based methods become very slow compared to the others. Using seq inside the lapply might be neater, but it comes with a trade-off: calling that function n times adds a lot of processing time (although seq_len(n) is nicer than 1:n and runs only once). Interesting to see that the replicate method is the fastest.
test replications elapsed relative
1 s1 10 5.44 1.915
2 s2 10 9.98 3.514
3 s3 10 2.84 1.000
4 s4 10 72.37 25.482
5 s5 10 35.78 12.599
Your mention of rep reminded me of replicate, so here's a very stateful solution. I'm presenting this because it's short and unusual, not because it's good. This is very unidiomatic R.
vect <- 0:5
unlist(replicate(5, vect <<- vect[-1]))
[1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
You can do it with a combination of rep and lapply, but it's basically the same as Merijn van Tilborg's answer.
Of course, the truly fearless unidiomatic R user does this and refuses to elaborate further.
mat <- matrix(1:5, ncol = 5, nrow = 5, byrow = TRUE)
mat[lower.tri(mat)] <- 0
c(t(mat)[t(mat != 0)])
[1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
You could use a loop like so:
out=c();for(i in 1:5){ out=c(out, (1:5)[i:5]) }
out
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
but that's not a good idea!
Why not use a loop?
Using a loop is:
slower,
less memory efficient, and
harder to read and understand.
By contrast, using a vectorised function like sequence is the opposite (faster, more efficient, and easy to read).
Further info
From ?sequence:
The default method for sequence generates the sequence seq(from[i], by = by[i], length.out = nvec[i]) for each element i in the parallel (and recycled) vectors from, by and nvec. It then returns the result of concatenating those sequences.
and about the from argument:
from: each element specifies the first element of a sequence.
Also, since the vector grown inside the loop is not preallocated, R has to copy it on every iteration, which costs extra memory and time; a preallocated version is sketched below.
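For completeness, a minimal sketch of that preallocated loop (the name s7 is mine, not from the thread); it avoids the repeated copying, though a vectorised function like sequence remains the idiomatic choice:
s7 <- function(n) {
  out <- integer(n * (n + 1) / 2)  # the result length is known up front
  pos <- 1L
  for (i in seq_len(n)) {
    len <- n - i + 1L
    out[pos:(pos + len - 1L)] <- i:n  # fill in place instead of growing
    pos <- pos + len
  }
  out
}
s7(5)
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5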

Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one = c(1, 1, NA, 13),
two = c(2, NA,10, 14),
three = c(NA,NA,11, NA),
four = c(4, 9, 12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Each row contains measurements for weeks 1, 2, 3 and 4 respectively (one column per week). Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value for weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by spreading each measurement evenly over itself and all preceding weeks in which no measurement took place. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I struggle a bit with how best to approach this. One candidate solution would be to get the indices of all missing values, then count the lengths of the NA runs, and use that to fill in the values somehow. However, my real data is large, and I suspect such a strategy might be time consuming. Is there an easier, more efficient way?
A base R solution would be to first identify the indices that need to be replaced, then determine groupings of those indices, and finally assign the grouped values with the ave function:
clean <- function(x) {
  # positions to fill: every NA, plus the first non-NA following an NA run
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  # group each NA run together with the measurement that ends it
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  # spread each group's trailing value evenly over the whole group
  x[to.rep] <- ave(x[to.rep], groups, FUN=function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
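To see the mechanics on a single row first (a toy call):
clean(c(1, NA, NA, 9))
# [1] 1 3 3 3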
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
  const int n = x.size();
  NumericVector y = clone(x);  // deep copy, so the input vector is left untouched
  int consecNA = 0;            // length of the current run of NAs
  for (int i = 0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      ++consecNA;
    } else if (consecNA > 0) {
      // spread the value over the preceding NA run plus the current element
      const double replacement = x[i] / (consecNA + 1);
      for (int j = i - consecNA; j <= i; ++j) {
        y[j] = replacement;
      }
      consecNA = 0;
    }
  }
  return y;
}")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
  nc <- ncol(df)
  # append a -Inf sentinel column, flatten row-wise, then reverse
  x <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
  # in the reversed vector, each non-NA value heads a group containing
  # the NAs that preceded it in the original row
  ii <- cumsum(!is.na(x))
  f <- tabulate(ii)        # group sizes
  v <- x[!duplicated(ii)]  # the value heading each group
  xx <- v[ii]/f[ii]        # spread each value evenly over its group
  xx[xx == -Inf] <- NA     # sentinel groups (trailing NAs) revert to NA
  # un-reverse, reshape row-wise, and drop the sentinel column
  m <- matrix(rev(xx), ncol=nc+1, byrow=TRUE)[,seq_len(nc)]
  as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
Here are benchmarks for all currently posted solutions:
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR = t(apply(mat, 1, clean)),
josilberRcpp = t(apply(mat, 1, cleanRcpp)),
Josh = spread_left(df),
Henrik = t(apply(df, 1, fn)),
replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base R possibility: first create a grouping variable (grp), over which the 'spread' is then done with ave.
fn <- function(x){
  # group each value with the NAs that precede it, counting from the right
  grp <- rev(cumsum(!is.na(rev(x))))
  # each group contains exactly one non-NA (its last element); spread it evenly
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  # trailing NAs (no later measurement) fall in group 0 and stay NA
  res[grp == 0] <- NA
  res
}
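For instance, on a single row (a toy call):
fn(c(NA, 10, 11, 12))
# [1]  5  5 11 12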
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df){
# switch to long format
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
mDT[,badv := is.na(value)]
mDT[
# subset to rows that need modification
badv|shift(badv),
# apply @Henrik's function, more or less
value:={
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
}]
# revert to wide format
(setDF(dcast(mDT,id~variable)[,id:=NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(
c(NA,1:3),
nr*nc,
replace=TRUE,
prob=c(nafreq,rep((1-nafreq)/3,3))
),nrow=nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT <- data.table(id=(1:nrow(df))*ncol(df),df)
mDT <- setkey(melt(DT,id.vars="id"),id)
mDT[,value := as.numeric(value)]
fill_naseq_long <- function(mDT){
mDT[,badv := is.na(value)]
mDT[badv|shift(badv),value:={
g = ave(!badv,id,FUN=function(x)rev(cumsum(rev(x))))+id
ave(value,g,FUN=function(x){n = length(x); x[n]/n})
}]
mDT
}
benchmark(
F2=fill_naseq_long(mDT),F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

Efficiently replicate matrices in R

I have a matrix and am looking for an efficient way to replicate it n times (where n is the number of observations in the dataset). For example, if I have a matrix A
A <- matrix(1:15, nrow=3)
then I want an output of the form
rbind(A, A, A, ...) #n times.
Obviously, there are many ways to construct such a large matrix, for example using a for loop, apply, or similar functions. However, the call to the "matrix-replication function" takes place at the very core of my optimization algorithm, where it is called tens of thousands of times during one run of my program. Therefore, loops, apply-type functions, and anything similar are not efficient enough. (Such a solution would basically mean that a loop over n is performed tens of thousands of times, which is obviously inefficient.) I have already tried the ordinary rep function, but haven't found a way to arrange its output into a matrix of the desired shape.
The solution
do.call("rbind", replicate(n, A, simplify=F))
is also too inefficient, because rbind is called too often (about 30% of my program's total runtime is spent performing the rbinds).
Does anyone know a better solution?
Two more solutions:
The first is a modification of the example in the question
do.call("rbind", rep(list(A), n))
The second involves unrolling the matrix, replicating it, and reassembling it.
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE)
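To see that the round trip really does stack copies of A, a toy check (values arbitrary):
A <- matrix(1:6, nrow = 2)
matrix(rep(t(A), 2), ncol = ncol(A), byrow = TRUE)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6
# [3,]    1    3    5
# [4,]    2    4    6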
Since efficiency is what was requested, benchmarking is necessary:
library("rbenchmark")
A <- matrix(1:15, nrow=3)
n <- 10
benchmark(rbind(A, A, A, A, A, A, A, A, A, A),
do.call("rbind", replicate(n, A, simplify=FALSE)),
do.call("rbind", rep(list(A), n)),
apply(A, 2, rep, n),
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE),
order="relative", replications=100000)
which gives:
test replications elapsed
1 rbind(A, A, A, A, A, A, A, A, A, A) 100000 0.91
3 do.call("rbind", rep(list(A), n)) 100000 1.42
5 matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE) 100000 2.20
2 do.call("rbind", replicate(n, A, simplify = FALSE)) 100000 3.03
4 apply(A, 2, rep, n) 100000 7.75
relative user.self sys.self user.child sys.child
1 1.000 0.91 0 NA NA
3 1.560 1.42 0 NA NA
5 2.418 2.19 0 NA NA
2 3.330 3.03 0 NA NA
4 8.516 7.73 0 NA NA
So the fastest is the raw rbind call, but that assumes n is fixed and known ahead of time. If n is not fixed, then the fastest is do.call("rbind", rep(list(A), n)). These timings were for a 3x5 matrix and 10 replications; different sized matrices might give different orderings.
EDIT:
For n=600, the results are in a different order (leaving out the explicit rbind version):
A <- matrix(1:15, nrow=3)
n <- 600
benchmark(do.call("rbind", replicate(n, A, simplify=FALSE)),
do.call("rbind", rep(list(A), n)),
apply(A, 2, rep, n),
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE),
order="relative", replications=10000)
giving
test replications elapsed
4 matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE) 10000 1.74
3 apply(A, 2, rep, n) 10000 2.57
2 do.call("rbind", rep(list(A), n)) 10000 2.79
1 do.call("rbind", replicate(n, A, simplify = FALSE)) 10000 6.68
relative user.self sys.self user.child sys.child
4 1.000 1.75 0 NA NA
3 1.477 2.54 0 NA NA
2 1.603 2.79 0 NA NA
1 3.839 6.65 0 NA NA
If you include the explicit rbind version, it is slightly faster than the do.call("rbind", rep(list(A), n)) version, but not by much, and slower than either the apply or matrix versions. So a generalization to arbitrary n does not require a loss of speed in this case.
Probably this is more efficient:
apply(A, 2, rep, n)
(Here rep is applied column by column, so each column of the result is n stacked copies of the corresponding column of A, which is exactly rbind(A, ..., A).)
There's also this way, using the Kronecker product:
rep(1, n) %x% A
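%x% is the Kronecker product; the length-n vector of ones is treated as an n x 1 matrix, so the product stacks n copies of A (a toy check with arbitrary values):
rep(1, 2) %x% matrix(1:4, nrow = 2)
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4
# [3,]    1    3
# [4,]    2    4
# note the result is numeric (double) even when A is integer, since %x% multiplies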
You can use indexing
A[rep(seq(nrow(A)), n), ]
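rep(seq(nrow(A)), n) repeats the row indices 1, ..., nrow(A) n times, so the subset is exactly n stacked copies of A (a toy check with arbitrary values):
A <- matrix(1:4, nrow = 2)
identical(A[rep(seq(nrow(A)), 2), ], rbind(A, A))
# [1] TRUE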
I came here for the same reason as the original poster and ultimately updated @Brian Diggs's comparison to include all of the other posted answers. Hopefully I did this correctly:
#install.packages("rbenchmark")
library("rbenchmark")
A <- matrix(1:15, nrow=3)
n <- 600
benchmark(do.call("rbind", replicate(n, A, simplify=FALSE)),
do.call("rbind", rep(list(A), n)),
apply(A, 2, rep, n),
matrix(rep(t(A),n), ncol=ncol(A), byrow=TRUE),
A[rep(seq(nrow(A)), n), ],
rep(1, n) %x% A,
apply(A, 2, rep, n),
matrix(rep(as.integer(t(A)),n),nrow=nrow(A)*n,byrow=TRUE),
order="relative", replications=10000)
# test replications elapsed relative user.self sys.self user.child sys.child
#5 A[rep(seq(nrow(A)), n), ] 10000 0.32 1.000 0.33 0.00 NA NA
#8 matrix(rep(as.integer(t(A)), n), nrow = nrow(A) * n, byrow = TRUE) 10000 0.36 1.125 0.35 0.02 NA NA
#4 matrix(rep(t(A), n), ncol = ncol(A), byrow = TRUE) 10000 0.38 1.188 0.37 0.00 NA NA
#3 apply(A, 2, rep, n) 10000 0.59 1.844 0.56 0.03 NA NA
#7 apply(A, 2, rep, n) 10000 0.61 1.906 0.58 0.03 NA NA
#6 rep(1, n) %x% A 10000 1.44 4.500 1.42 0.02 NA NA
#2 do.call("rbind", rep(list(A), n)) 10000 1.67 5.219 1.67 0.00 NA NA
#1 do.call("rbind", replicate(n, A, simplify = FALSE)) 10000 5.03 15.719 5.02 0.01 NA NA
What about unrolling the matrix into a vector, replicating its contents, and building a new matrix with the updated number of rows? Note that the unrolling needs to be row-wise (hence the t() below), otherwise the stacked copies do not line up with the rows of A:
A <- matrix(...)
n = 2 # just a test
a = as.integer(t(A)) # row-major unroll
multi.a = rep(a, n)
multi.A = matrix(multi.a, nrow = nrow(A) * n, byrow = TRUE)
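A quick check that the row-wise unroll round-trips correctly (a toy example):
A <- matrix(1:6, nrow = 2)
identical(matrix(rep(as.integer(t(A)), 2), nrow = nrow(A) * 2, byrow = TRUE), rbind(A, A))
# [1] TRUE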

Replacing column names using a data frame in R

I have the matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,dimnames = list(c("s1", "s2", "s3"),c("tom", "dick","bob")))
tom dick bob
s1 1 2 3
s2 4 5 6
s3 7 8 9
#and the data frame
current<-c("tom", "dick","harry","bob")
replacement<-c("x","y","z","b")
df<-data.frame(current,replacement)
current replacement
1 tom x
2 dick y
3 harry z
4 bob b
# I need to replace the existing column names with df$replacement wherever
# colnames(m) match df$current, thereby producing the following matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,dimnames = list(c("s1", "s2", "s3"),c("x", "y","b")))
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
Any advice? Should I use an 'if' loop? Thanks.
You can use which to match the colnames from m with the values in df$current. Then, when you have the indices, you can subset the replacement colnames from df$replacement.
colnames(m) = df$replacement[which(df$current %in% colnames(m))]
In the above:
%in% tests for TRUE or FALSE for any matches between the objects being compared.
which(df$current %in% colnames(m)) identifies the indexes (in this case, the row numbers) of the matched names.
df$replacement[...] subsets the column df$replacement, returning only the rows matched in step 2.
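One caveat worth noting: the %in%/which approach returns the replacements in the order the matches appear in df$current, so it lines up here only because the matched names happen to occur in the same relative order as colnames(m). If the orders can differ, match (see the next answer) is the safer tool. A small demonstration with reordered columns (m2 and the reordering are mine, for illustration):
m2 <- matrix(1:9, nrow = 3, dimnames = list(c("s1", "s2", "s3"), c("bob", "tom", "dick")))
df$replacement[which(df$current %in% colnames(m2))] # x, y, b -- wrong order for bob, tom, dick
df$replacement[match(colnames(m2), df$current)]     # b, x, y -- correct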
A slightly more direct way to find the indices is to use match:
> id <- match(colnames(m), df$current)
> id
[1] 1 2 4
> colnames(m) <- df$replacement[id]
> m
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
%in% is generally the more intuitive of the two, and the difference in efficiency is marginal unless the sets are relatively large, e.g.
> n <- 50000 # size of full vector
> m <- 10000 # size of subset
> query <- paste("A", sort(sample(1:n, m)))
> names <- paste("A", 1:n)
> all.equal(which(names %in% query), match(query, names))
[1] TRUE
> library(rbenchmark)
> benchmark(which(names %in% query))
test replications elapsed relative user.self sys.self user.child sys.child
1 which(names %in% query) 100 0.267 1 0.268 0 0 0
> benchmark(match(query, names))
test replications elapsed relative user.self sys.self user.child sys.child
1 match(query, names) 100 0.172 1 0.172 0 0 0
