Create a sequence of sequences of numbers in R

I would like to make the following sequence in R, by using rep or any other function.
c(1, 2, 3, 4, 5, 2, 3, 4, 5, 3, 4, 5, 4, 5, 5)
Basically, c(1:5, 2:5, 3:5, 4:5, 5:5).

Use sequence.
sequence(5:1, from = 1:5)
[1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
The first argument, nvec, is the length of each sequence (5:1); the second, from, is the starting point for each sequence (1:5).
Note: this works only for R >= 4.0.0. From R News 4.0.0:
sequence() [...] gains arguments [e.g. from] to generate more complex sequences.
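If you are on an older version of R, a minimal base equivalent (a sketch using Map, not what sequence() does internally) would be:
unlist(Map(seq, 1:5, 5))
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5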

unlist(lapply(1:5, function(i) i:5))
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
Some speed tests on all the answers provided (note: the OP mentioned n = 10K somewhere, if I recall correctly).
s1 <- function(n) {
  unlist(lapply(1:n, function(i) i:n))
}
s2 <- function(n) {
  unlist(lapply(seq_len(n), function(i) seq(from = i, to = n, by = 1)))
}
s3 <- function(n) {
  vect <- 0:n
  unlist(replicate(n, vect <<- vect[-1]))
}
s4 <- function(n) {
  m <- matrix(1:n, ncol = n, nrow = n, byrow = TRUE)
  m[lower.tri(m)] <- 0
  c(t(m)[t(m != 0)])
}
s5 <- function(n) {
  m <- matrix(seq.int(n), ncol = n, nrow = n)
  m[lower.tri(m, diag = TRUE)]
}
s6 <- function(n) {
  out <- c()
  for (i in 1:n) {
    out <- c(out, (1:n)[i:n])
  }
  out
}
library(rbenchmark)
n = 5L
benchmark(
  "s1" = { s1(n) },
  "s2" = { s2(n) },
  "s3" = { s3(n) },
  "s4" = { s4(n) },
  "s5" = { s5(n) },
  "s6" = { s6(n) },
  replications = 1000,
  columns = c("test", "replications", "elapsed", "relative")
)
Do not be fooled by the seemingly "fast" solutions at this size: they call hardly any functions, so the timings mostly reflect call overhead, multiplied across the 1000 replications.
test replications elapsed relative
1 s1 1000 0.05 2.5
2 s2 1000 0.44 22.0
3 s3 1000 0.14 7.0
4 s4 1000 0.08 4.0
5 s5 1000 0.02 1.0
6 s6 1000 0.02 1.0
n = 1000L
benchmark(
  "s1" = { s1(n) },
  "s2" = { s2(n) },
  "s3" = { s3(n) },
  "s4" = { s4(n) },
  "s5" = { s5(n) },
  "s6" = { s6(n) },
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)
As its author already flagged it as a "what not to do" example, the for loop (s6) becomes very slow compared to every other method at n = 1000L:
test replications elapsed relative
1 s1 10 0.17 1.000
2 s2 10 0.83 4.882
3 s3 10 0.19 1.118
4 s4 10 1.50 8.824
5 s5 10 0.29 1.706
6 s6 10 28.64 168.471
n = 10000L
benchmark(
  "s1" = { s1(n) },
  "s2" = { s2(n) },
  "s3" = { s3(n) },
  "s4" = { s4(n) },
  "s5" = { s5(n) },
  # "s6" = { s6(n) },
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative")
)
At large n we see that the matrix-based methods become very slow compared to the others.
Using seq inside the lapply might look neater, but it comes with a trade-off: calling that function n times adds a lot of processing time. (seq_len(n) is still nicer than 1:n, since it is run only once.) Interesting to see that the replicate method is the fastest.
test replications elapsed relative
1 s1 10 5.44 1.915
2 s2 10 9.98 3.514
3 s3 10 2.84 1.000
4 s4 10 72.37 25.482
5 s5 10 35.78 12.599

Your mention of rep reminded me of replicate, so here's a very stateful solution. I'm presenting this because it's short and unusual, not because it's good. This is very unidiomatic R.
vect <- 0:5
unlist(replicate(5, vect <<- vect[-1]))
[1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
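To see the state evolve, here is a short illustrative trace (simplify = FALSE just makes the list explicit); each evaluation drops the first element and returns the shortened vector:
vect <- 0:5
replicate(2, vect <<- vect[-1], simplify = FALSE)
# [[1]]
# [1] 1 2 3 4 5
#
# [[2]]
# [1] 2 3 4 5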
You can do it with a combination of rep and lapply, but it's basically the same as Merijn van Tilborg's answer.
Of course, the truly fearless unidiomatic R user does this and refuses to elaborate further.
mat <- matrix(1:5, ncol = 5, nrow = 5, byrow = TRUE)
mat[lower.tri(mat)] <- 0
c(t(mat)[t(mat != 0)])
[1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5

You could use a loop like so:
out=c();for(i in 1:5){ out=c(out, (1:5)[i:5]) }
out
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5
but that's not a good idea!
Why not use a loop?
Using a loop is:
slower,
less memory efficient, and
harder to read and understand.
By contrast, using a vectorised function like sequence is the opposite: faster, more memory efficient, and easier to read.
Further info
From ?sequence:
The default method for sequence generates the sequence seq(from[i], by = by[i], length.out = nvec[i]) for each element i in the parallel (and recycled) vectors from, by and nvec. It then returns the result of concatenating those sequences.
and about the from argument:
from: each element specifies the first element of a sequence.
Also, since the output vector in the loop is grown rather than preallocated, it requires more memory (each c() call copies the vector) and is slower still; preallocation fixes that part, as the sketch below shows.
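A minimal preallocated version of the loop (hypothetical s7; the name and code are mine, for illustration only):
s7 <- function(n) {
  out <- integer(n * (n + 1) / 2)  # exact final length, allocated once
  k <- 1
  for (i in 1:n) {
    out[k:(k + n - i)] <- i:n      # fill the i-th tail sequence in place
    k <- k + n - i + 1
  }
  out
}
s7(5)
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5 5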

Efficiently change elements in data based on neighbouring elements

Let me delve right in. Imagine you have data that looks like this:
df <- data.frame(one   = c(1,  1, NA, 13),
                 two   = c(2, NA, 10, 14),
                 three = c(NA, NA, 11, NA),
                 four  = c(4,  9, 12, NA))
This gives us:
df
# one two three four
# 1 1 2 NA 4
# 2 1 NA NA 9
# 3 NA 10 11 12
# 4 13 14 NA NA
Each row contains measurements for weeks 1, 2, 3 and 4, respectively. Suppose the numbers represent some accumulated measure since the last time a measurement happened. For example, in row 1, the "4" in column "four" represents a cumulative value for weeks 3 and 4.
Now I want to "even out" these numbers (feel free to correct my terminology here) by spreading each measurement evenly across all the weeks it covers, i.e. across itself and any immediately preceding weeks in which no measurement took place. For instance, row 1 should read
1 2 2 2
since the 4 in the original data represents the cumulative value of 2 weeks (week "three" and "four"), and 4/2 is 2.
The final end result should look like this:
df
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
I'm struggling a bit with how best to approach this. One candidate solution would be to get the indices of all missing values, then count the lengths of the runs (NAs occurring multiple times in a row), and use that to fill in the values somehow. However, my real data is large, and I think such a strategy might be time-consuming. Is there an easier and more efficient way?
A base R solution is to first identify the indices that need to be replaced, then determine the groupings of those indices, and finally assign the grouped values with the ave function:
clean <- function(x) {
  # indices to replace: every NA, plus any value immediately preceded by an NA
  to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
  # group each run of NAs together with the measurement that ends it
  groups <- cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))
  # within each group, spread the final (non-NA) value evenly
  x[to.rep] <- ave(x[to.rep], groups, FUN = function(y) {
    rep(tail(y, 1) / length(y), length(y))
  })
  return(x)
}
t(apply(df, 1, clean))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
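For intuition, here is what the intermediate pieces look like for row 2, c(1, NA, NA, 9) (an illustrative trace, not part of the answer):
x <- c(1, NA, NA, 9)
to.rep <- which(is.na(x) | c(FALSE, head(is.na(x), -1)))
to.rep   # the two NAs plus the 9 that closes the run
# [1] 2 3 4
cumsum(c(TRUE, head(!is.na(x[to.rep]), -1)))   # all three cells form one group
# [1] 1 1 1
# ave() then replaces each cell of that group with 9 / 3 = 3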
If efficiency is important (your question implies it is), then an Rcpp solution could be a good option:
library(Rcpp)
cppFunction(
"NumericVector cleanRcpp(NumericVector x) {
  const int n = x.size();
  NumericVector y(x);
  int consecNA = 0;
  for (int i = 0; i < n; ++i) {
    if (R_IsNA(x[i])) {
      ++consecNA;
    } else if (consecNA > 0) {
      const double replacement = x[i] / (consecNA + 1);
      for (int j = i - consecNA; j <= i; ++j) {
        y[j] = replacement;
      }
      consecNA = 0;
    } else {
      consecNA = 0;
    }
  }
  return y;
}")
t(apply(df, 1, cleanRcpp))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
We can compare performance on a larger instance (10000 x 100 matrix):
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
all.equal(apply(mat, 1, clean), apply(mat, 1, cleanRcpp))
# [1] TRUE
system.time(apply(mat, 1, clean))
# user system elapsed
# 4.918 0.035 4.992
system.time(apply(mat, 1, cleanRcpp))
# user system elapsed
# 0.093 0.016 0.120
In this case the Rcpp solution provides roughly a 40x speedup compared to the base R implementation.
Here's a base R solution that's nearly as fast as josilber's Rcpp function:
spread_left <- function(df) {
  nc <- ncol(df)
  x  <- rev(as.vector(t(as.matrix(cbind(df, -Inf)))))
  ii <- cumsum(!is.na(x))
  f  <- tabulate(ii)
  v  <- x[!duplicated(ii)]
  xx <- v[ii] / f[ii]
  xx[xx == -Inf] <- NA
  m  <- matrix(rev(xx), ncol = nc + 1, byrow = TRUE)[, seq_len(nc)]
  as.data.frame(m)
}
spread_left(df)
# one two three four
# 1 1 2 2 2
# 2 1 3 3 3
# 3 5 5 11 12
# 4 13 14 NA NA
It manages to be relatively fast by vectorizing everything and completely avoiding time-expensive calls to apply(). (The downside is that it's also relatively obfuscated; to see how it works, do debug(spread_left) and then apply it to the small data.frame df in the OP.)
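To make it slightly less opaque, here is a quick trace on the first row only (an illustrative sketch; the real function processes the whole frame at once):
x  <- rev(c(1, 2, NA, 4, -Inf))   # row 1 plus the -Inf sentinel, reversed
ii <- cumsum(!is.na(x))           # reversal puts each value before its NAs,
ii                                # so cumsum groups them together
# [1] 1 2 2 3 4
f  <- tabulate(ii)                # size of each group
v  <- x[!duplicated(ii)]          # the value carried by each group
v[ii] / f[ii]                     # spread each value over its group
# [1] -Inf    2    2    2    1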
Here are benchmarks for all currently posted solutions (fn is @Henrik's ave-based function, defined in the next answer):
library(rbenchmark)
set.seed(144)
mat <- matrix(sample(c(1:3, NA), 1000000, replace=TRUE), nrow=10000)
df <- as.data.frame(mat)
## First confirm that it produces the same results
identical(spread_left(df), as.data.frame(t(apply(mat, 1, clean))))
# [1] TRUE
## Then compare its speed
benchmark(josilberR    = t(apply(mat, 1, clean)),
          josilberRcpp = t(apply(mat, 1, cleanRcpp)),
          Josh         = spread_left(df),
          Henrik       = t(apply(df, 1, fn)),
          replications = 10)
# test replications elapsed relative user.self sys.self
# 4 Henrik 10 38.81 25.201 38.74 0.08
# 3 Josh 10 2.07 1.344 1.67 0.41
# 1 josilberR 10 57.42 37.286 57.37 0.05
# 2 josilberRcpp 10 1.54 1.000 1.44 0.11
Another base R possibility: I first create a grouping variable (grp), over which the 'spread' is then done with ave.
fn <- function(x){
  grp <- rev(cumsum(!is.na(rev(x))))
  res <- ave(x, grp, FUN = function(y) sum(y, na.rm = TRUE) / length(y))
  res[grp == 0] <- NA
  res
}
t(apply(df, 1, fn))
# one two three four
# [1,] 1 2 2 2
# [2,] 1 3 3 3
# [3,] 5 5 11 12
# [4,] 13 14 NA NA
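The grouping variable is worth a quick look. For row 4, which has trailing NAs (this trace is mine, for illustration):
x <- c(13, 14, NA, NA)
rev(cumsum(!is.na(rev(x))))   # grp: each NA joins the next measurement's group
# [1] 2 1 0 0
# trailing NAs end up in group 0, which fn() sets back to NA at the end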
I was thinking that if NAs are relatively rare, it might be better to make the edits by reference. (I'm guessing this is how the Rcpp approach works.) Here's how it can be done in data.table, borrowing @Henrik's function almost verbatim and converting to long format:
require(data.table) # 1.9.5
fill_naseq <- function(df){
  # switch to long format
  DT  <- data.table(id = (1:nrow(df)) * ncol(df), df)
  mDT <- setkey(melt(DT, id.vars = "id"), id)
  mDT[, value := as.numeric(value)]
  mDT[, badv := is.na(value)]
  mDT[
    # subset to rows that need modification
    badv | shift(badv),
    # apply @Henrik's function, more or less
    value := {
      g = ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
      ave(value, g, FUN = function(x){ n = length(x); x[n]/n })
    }]
  # revert to wide format
  (setDF(dcast(mDT, id ~ variable)[, id := NULL]))
}
identical(fill_naseq(df),spread_left(df)) # TRUE
To show the best-case scenario for this approach, I simulated so that NAs are very infrequent:
nr = 1e4
nc = 100
nafreq = 1/1e4
mat <- matrix(sample(
  c(NA, 1:3),
  nr*nc,
  replace = TRUE,
  prob = c(nafreq, rep((1-nafreq)/3, 3))
), nrow = nr)
df <- as.data.frame(mat)
benchmark(F=fill_naseq(df),Josh=spread_left(df),replications=10)[1:5]
# test replications elapsed relative user.self
# 1 F 10 3.82 1.394 3.72
# 2 Josh 10 2.74 1.000 2.70
# I don't have Rcpp installed and so left off josilber's even faster approach
So, it's still slower. However, with data kept in a long format, reshaping wouldn't be necessary:
DT  <- data.table(id = (1:nrow(df)) * ncol(df), df)
mDT <- setkey(melt(DT, id.vars = "id"), id)
mDT[, value := as.numeric(value)]
fill_naseq_long <- function(mDT){
  mDT[, badv := is.na(value)]
  mDT[badv | shift(badv), value := {
    g = ave(!badv, id, FUN = function(x) rev(cumsum(rev(x)))) + id
    ave(value, g, FUN = function(x){ n = length(x); x[n]/n })
  }]
  mDT
}
benchmark(
  F2 = fill_naseq_long(mDT), F = fill_naseq(df), Josh = spread_left(df),
  replications = 10)[1:5]
# test replications elapsed relative user.self
# 2 F 10 3.98 8.468 3.81
# 1 F2 10 0.47 1.000 0.45
# 3 Josh 10 2.72 5.787 2.69
Now it's a little faster. And who doesn't like keeping their data in long format? This also has the advantage of working even if we don't have the same number of observations per "id".

insert elements in a vector in R

I have a vector in R,
a = c(2,3,4,9,10,2,4,19)
let us say I want to efficiently insert the following vectors, b and d,
b = c(2,1)
d = c(0,1)
right after the 3rd and 7th positions (the "4" entries), resulting in,
e = c(2,3,4,2,1,9,10,2,4,0,1,19)
How would I do this efficiently in R, without repeatedly calling cbind or the like?
I found the package R.basic, but it's not on CRAN, so I'd rather use a supported alternative.
Try this:
result <- vector("list",5)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (c(3,7)+1)))
result[c(FALSE,TRUE)] <- list(b,d)
f <- unlist(result)
identical(f, e)
#[1] TRUE
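The trick: split a at the insertion points, drop the pieces into the odd slots of result, and the insertions into the even slots. The split step looks like this:
split(a, cumsum(seq_along(a) %in% (c(3,7) + 1)))
# $`0`
# [1] 2 3 4
#
# $`1`
# [1]  9 10  2  4
#
# $`2`
# [1] 19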
EDIT: generalization to arbitrary number of insertions is straightforward:
insert.at <- function(a, pos, ...){
  dots <- list(...)
  stopifnot(length(dots) == length(pos))
  result <- vector("list", 2*length(pos) + 1)
  result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos + 1)))
  result[c(FALSE, TRUE)] <- dots
  unlist(result)
}
> insert.at(a, c(3,7), b, d)
[1] 2 3 4 2 1 9 10 2 4 0 1 19
> insert.at(1:10, c(4,7,9), 11, 12, 13)
[1] 1 2 3 4 11 5 6 7 12 8 9 13 10
> insert.at(1:10, c(4,7,9), 11, 12)
Error: length(dots) == length(pos) is not TRUE
Note the bonus error checking if the number of positions and insertions do not match.
You can use the following function,
ins(a, list(b, d), pos = c(3, 7))
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
where:
ins <- function(a, to.insert = list(), pos = c()) {
  c(a[seq(pos[1])],
    to.insert[[1]],
    a[seq(pos[1] + 1, pos[2])],
    to.insert[[2]],
    a[seq(pos[2] + 1, length(a))]
  )
}
Here's another function, using Ricardo's syntax, Ferdinand's split and @Arun's interleaving trick from another question:
ins2 <- function(a, bs, pos){
  as  <- split(a, cumsum(seq(a) %in% (pos + 1)))
  idx <- order(c(seq_along(as), seq_along(bs)))
  unlist(c(as, bs)[idx])
}
The advantage is that this should extend to more insertions. However, it may produce weird output when passed invalid arguments, e.g., with any(pos > length(a)) or length(bs)!=length(pos).
You can change the last line to unname(unlist(... if you don't want a's items named.
The straightforward approach:
b.pos <- 3
d.pos <- 7
c(a[1:b.pos],b,a[(b.pos+1):d.pos],d,a[(d.pos+1):length(a)])
[1] 2 3 4 2 1 9 10 2 4 0 1 19
Note the importance of the parentheses around the boundaries of the : operator, since : binds more tightly than + (see the quick illustration below).
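A quick illustration of that precedence pitfall (example values are mine):
b.pos <- 3
b.pos + 1:5     # parsed as b.pos + (1:5) -- not what you want
# [1] 4 5 6 7 8
(b.pos + 1):5   # the intended index range
# [1] 4 5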
After using Ferdinand's function, I tried to write my own and, surprisingly, it is far more efficient.
Here's mine:
insertElems = function(vect, pos, elems) {
  l = length(vect)
  j = 0
  for (i in 1:length(pos)){
    if (pos[i] == 1)
      vect = c(elems[j+1], vect)
    else if (pos[i] == length(vect) + 1)
      vect = c(vect, elems[j+1])
    else
      vect = c(vect[1:(pos[i]-1+j)], elems[j+1], vect[(pos[i]+j):(l+j)])
    j = j + 1
  }
  return(vect)
}
tmp = 1:5
insertElems(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
insert.at(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
And here's the benchmark result:
> microbenchmark(insertElems(tmp, c(2,4,5), c(NA,NA,NA)), insert.at(tmp, c(2,4,5), c(NA,NA,NA)), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
insertElems(tmp, c(2, 4, 5), c(NA, NA, NA)) 9.660 11.472 13.44247 12.68 13.585 1630.421 10000
insert.at(tmp, c(2, 4, 5), c(NA, NA, NA)) 58.866 62.791 70.36281 64.30 67.923 2475.366 10000
My code also behaves better in some edge cases; note how insert.at misfires here:
> insert.at(tmp, c(1,4,5), c(NA,NA,NA))
# [1] 1 2 3 NA 4 NA 5 NA 1 2 3
# Warning message:
# In result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos))) :
# number of items to replace is not a multiple of replacement length
> insertElems(tmp, c(1,4,5), c(NA,NA,NA))
# [1] NA 1 2 3 NA 4 NA 5
Here's an alternative that uses append. It's fine for small vectors, but I can't imagine it being efficient for large ones, since a new vector is created upon each iteration of the loop (which is, obviously, bad). The trick is to apply the insertions in decreasing order of position, so that each append still refers to positions in the original vector.
a = c(2,3,4,9,10,2,4,19)
b = c(2,1)
d = c(0,1)
pos <- c(3, 7)
z <- setNames(list(b, d), pos)
z <- z[order(names(z), decreasing=TRUE)]
for (i in seq_along(z)) {
  a <- append(a, z[[i]], after = as.numeric(names(z)[[i]]))
}
a
# [1] 2 3 4 2 1 9 10 2 4 0 1 19

Alternate, interweave or interlace two vectors

I want to interlace two vectors of the same mode and equal length. Say:
a <- rpois(lambda=3,n=5e5)
b <- rpois(lambda=4,n=5e5)
I would like to interweave or interlace these two vectors, creating a vector equivalent to c(a[1], b[1], a[2], b[2], ..., a[length(a)], b[length(b)]).
My first attempt was this:
sapply(X=rep.int(c(3,4),times=5e5),FUN=rpois,n=1)
but it requires rpois to be called far more times than needed.
My best attempt so far has been to transform it into a matrix and reconvert back into a vector:
d <- c(rbind(rpois(lambda=3,n=5e5), rpois(lambda=4,n=5e5)))
# or equivalently, reusing a and b:
d <- c(rbind(a, b))
Is there a better way to go about doing it? Or is there a function in base R that accomplishes the same thing?
Your rbind method should work well. You could also use
rpois(lambda=c(3,4),n=1e6)
because R will automatically replicate the vector of lambda values to the required length. There's not much difference in speed:
library(rbenchmark)
benchmark(rpois(1e6, c(3,4)),
          c(rbind(rpois(5e5, 3), rpois(5e5, 4))))
# test replications elapsed relative
# 2 c(rbind(rpois(5e+05, 3), rpois(5e+05, 4))) 100 23.390 1.112168
# 1 rpois(1e+06, c(3, 4)) 100 21.031 1.000000
and elegance is in the eye of the beholder ... of course, the c(rbind(...)) method works in general for constructing alternating vectors, while the other solution is specific to rpois or other functions that replicate their arguments in that way.
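As a reusable form of that trick, a small wrapper might look like this (the interleave name and the stopifnot guard are mine, for illustration):
interleave <- function(a, b) {
  stopifnot(length(a) == length(b))
  # rbind stacks a and b as two rows; column-major flattening alternates them
  as.vector(rbind(a, b))
}
interleave(1:3, 4:6)
# [1] 1 4 2 5 3 6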
Some speed tests, incorporating Ben Bolker's answer:
benchmark(
  c(rbind(rpois(lambda = 3, n = 5e5), rpois(lambda = 4, n = 5e5))),
  c(t(sapply(X = list(3, 4), FUN = rpois, n = 5e5))),
  sapply(X = rep.int(c(3, 4), times = 5e5), FUN = rpois, n = 1),
  rpois(lambda = c(3, 4), n = 1e6),
  rpois(lambda = rep.int(c(3, 4), times = 5e5), n = 1e6)
)
test
1 c(rbind(rpois(lambda = 3, n = 5e+05), rpois(lambda = 4, n = 5e+05)))
2 c(t(sapply(X = list(3, 4), FUN = rpois, n = 5e+05)))
4 rpois(lambda = c(3, 4), n = 1e+06)
5 rpois(lambda = rep.int(c(3, 4), times = 5e+05), n = 1e+06)
3 sapply(X = rep.int(c(3, 4), times = 5e+05), FUN = rpois, n = 1)
replications elapsed relative user.self sys.self user.child sys.child
1 100 6.14 1.000000 5.93 0.15 NA NA
2 100 7.11 1.157980 7.02 0.02 NA NA
4 100 14.09 2.294788 13.61 0.05 NA NA
5 100 14.24 2.319218 13.73 0.21 NA NA
3 100 700.84 114.143322 683.51 0.50 NA NA

Replacing column names using a data frame in R

I have the matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(c("s1", "s2", "s3"), c("tom", "dick", "bob")))
tom dick bob
s1 1 2 3
s2 4 5 6
s3 7 8 9
#and the data frame
current <- c("tom", "dick", "harry", "bob")
replacement <- c("x", "y", "z", "b")
df <- data.frame(current, replacement)
current replacement
1 tom x
2 dick y
3 harry z
4 bob b
#I need to replace the existing names i.e. df$current with df$replacement if
#colnames(m) are equal to df$current thereby producing the following matrix
m <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE,dimnames = list(c("s1", "s2", "s3"),c("x", "y","b")))
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
Any advice? Should I use an 'if' loop? Thanks.
You can use which to match the colnames from m with the values in df$current. Then, when you have the indices, you can subset the replacement colnames from df$replacement.
colnames(m) = df$replacement[which(df$current %in% colnames(m))]
In the above:
%in% returns TRUE or FALSE for each element, indicating matches between the objects being compared.
which(df$current %in% colnames(m)) identifies the indices (in this case, the row numbers) of the matched names.
df$replacement[...] subsets the column df$replacement, returning only the rows matched in the previous step.
A slightly more direct way to find the indices is to use match, which also returns them in the order of colnames(m) (the %in% approach above implicitly assumes the rows of df appear in the same order as the columns of m):
> id <- match(colnames(m), df$current)
> id
[1] 1 2 4
> colnames(m) <- df$replacement[id]
> m
x y b
s1 1 2 3
s2 4 5 6
s3 7 8 9
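Another equivalent idiom, shown here for illustration (starting again from the original m, since its colnames were just replaced; the lookup name is mine): build a named lookup vector and index it by the current column names.
lookup <- setNames(as.character(df$replacement), df$current)
colnames(m) <- unname(lookup[colnames(m)])
m
#    x y b
# s1 1 2 3
# s2 4 5 6
# s3 7 8 9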
That said, %in% is generally more intuitive to use and, as the benchmarks below show, the difference in efficiency is marginal unless the sets are relatively large, e.g.
> n <- 50000 # size of full vector
> m <- 10000 # size of subset
> query <- paste("A", sort(sample(1:n, m)))
> names <- paste("A", 1:n)
> all.equal(which(names %in% query), match(query, names))
[1] TRUE
> library(rbenchmark)
> benchmark(which(names %in% query))
test replications elapsed relative user.self sys.self user.child sys.child
1 which(names %in% query) 100 0.267 1 0.268 0 0 0
> benchmark(match(query, names))
test replications elapsed relative user.self sys.self user.child sys.child
1 match(query, names) 100 0.172 1 0.172 0 0 0
