I am trying to understand how a for loop over a vector works in R. I found a solution to my problem, but I am left with a doubt about the underlying mechanics.
I ran into this while writing a function: the for loop iterates over the elements of the vector, but the replacements only happen up to a certain index.
## the output is partially complete, seems like it didn't loop through all the values however the loop counter is perfect
temp_vector<- c(1, NA,Inf, NaN,3,2,4,6,4,6,7,3,2,5,NaN, NA, 3,3,NaN, Inf, Inf, NaN, NA, 3,5,6,7)
ctr<- 0
for(i in temp_vector){
temp_vector[i]<- ifelse((!is.na(temp_vector[i])) & (!is.infinite(temp_vector[i])), temp_vector[i], 0 )
## replace the element of vector by 0 if they are Inf or NA or NaN
ctr<- ctr+1
}
temp_vector
print(ctr)
# output
> temp_vector
[1] 1 0 0 0 3 2 4 6 4 6 7 3 2 5 NaN NA 3 3 NaN Inf Inf NaN NA 3 5 6 7
> print(ctr)
[1] 27
## this is generating correct output
temp_vector<- c(1, NA,Inf, NaN,3,2,4,6,4,6,7,3,2,5,NaN, NA, 3,3,NaN, Inf, Inf, NaN, NA, 3,5,6,7)
for(i in 1:length(temp_vector)){
temp_vector[i]<- ifelse((!is.na(temp_vector[i])) & (!is.infinite(temp_vector[i])), temp_vector[i], 0 )
## replace the element of vector by 0 if they are Inf or NA or NaN
}
temp_vector
# output
> temp_vector
[1] 1 0 0 0 3 2 4 6 4 6 7 3 2 5 0 0 3 3 0 0 0 0 0 3 5 6 7
Below are a few variants of the for loop that I tried, each of which generates different output. I am trying to understand how the loop fundamentally works; it would be helpful if you could shed some light on it. Thanks!
## variant-0
y <- c(2,5,3,9,8,11,6)
count <- 0
for (val in y) {
if(val %% 2 == 0)
count = count+1
}
print(count)
# output
[1] 3
## variant-1
x<- c(2,4,6,4,6,7,3,2,5,6)
for(i in x){
x[i]<- ifelse(x[i]==6, NaN, x[i])
}
x
# output, Last element of the vector is not a NaN
[1] 2 4 NaN 4 NaN 7 3 2 5 6
## variant-2
x<- c(2,4,6,4,6,7,3,2,5,6)
ctr<- 0
for(i in x){
x[i]<- ifelse(x[i]==6, NaN, x[i])
ctr<- ctr+1
}
x
print(ctr)
# output, Note: Last element of the vector is not a NaN
> x
[1] 2 4 NaN 4 NaN 7 3 2 5 6
> print(ctr)
[1] 10
## variant-3
x<- c(2,4,6,4,6,7,3,2,5,6)
ctr<- 0
for(i in x){
x[ctr]<- ifelse(x[ctr]==6, NaN, x[ctr])
ctr<- ctr+1
}
x
print(ctr)
# output. Note: the counter is perfect
> x
[1] 2 4 NaN 4 NaN 7 3 2 5 6
> print(ctr)
[1] 10
## variant-4
x<- c(2,4,6,4,6,7,3,2,5,6)
ctr<- 0
for(i in x){
i<- ifelse(i==6, NaN, i)
ctr<- ctr+1
}
x
print(ctr)
# output
> x
[1] 2 4 6 4 6 7 3 2 5 6
> print(ctr)
[1] 10
Consider the following example:
> y <- c(2, 5, 3, 9, 8, 11, 6)
A for loop iterates over whatever vector you provide. In the first case you iterate over the elements of vector y:
> for (val in y) {
+ print(val)
+ }
[1] 2
[1] 5
[1] 3
[1] 9
[1] 8
[1] 11
[1] 6
In the second case you are iterating over the elements of vector 1:length(y), meaning c(1, 2, 3, 4, 5, 6, 7):
> for (val in 1:length(y)) {
+ print(val)
+ }
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
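You mixed these two up in your code above: for (i in temp_vector) hands you the values of the vector, but you then used those values as positions in temp_vector[i]. Hope this clears things up!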
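To make the difference concrete, here is a minimal sketch (not the asker's exact code; by_index and vectorised are names chosen for illustration, and the vector is shortened). It assumes the goal is simply "replace NA, NaN and Inf by 0", first by looping over positions with seq_along() and then without any loop:

```r
temp_vector <- c(1, NA, Inf, NaN, 3, 2, 4, 6)

# Loop over positions: seq_along() yields 1, 2, ..., length(by_index)
by_index <- temp_vector
for (i in seq_along(by_index)) {
  if (!is.finite(by_index[i])) by_index[i] <- 0
}

# Vectorised equivalent: logical indexing replaces all offending elements at once
vectorised <- temp_vector
vectorised[!is.finite(vectorised)] <- 0
```

is.finite() is FALSE for NA, NaN, Inf and -Inf, so a single test covers all the cases the original ifelse() was checking.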
The for loop you are using works as follows (I will use the first variant, variant-0, as the example).
This is the usual setup:
y <- c(2,5,3,9,8,11,6)
count <- 0
Here is where the business begins:
for (val in y)
Here, val takes the values of vector y in turn, one per iteration.
For example:
val for iteration 1: 2;
val for iteration 2: 5;
val for iteration 3: 3;
and so on.
{
if(val %% 2 == 0)
count = count+1
}
print(count)
So count is incremented whenever val is even, i.e. in iterations 1, 5 and 7 (values 2, 8 and 6).
The final value of count is therefore 3.
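Two things follow from this, sketched below with the vectors from the question (x_before and unchanged are illustrative names): the count can be computed without a loop, and assigning to the loop variable, as in variant-4, changes only a copy and never the vector itself.

```r
y <- c(2, 5, 3, 9, 8, 11, 6)
count <- sum(y %% 2 == 0)   # each TRUE counts as 1

x <- c(2, 4, 6, 4, 6, 7, 3, 2, 5, 6)
x_before <- x
for (i in x) {
  i <- NaN                  # overwrites the loop variable only
}
unchanged <- identical(x, x_before)

# To actually replace the 6s, index the vector itself
x[x == 6] <- NaN
```

After the last line the final element is NaN as well, which is what variants 1 to 3 were missing.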
Let me try to make this question as general as possible.
Let's say I have two variables a and b.
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
So b has 17 observations and is a subset of a which has 20 observations.
My question is the following: how would I use these two variables to generate a third variable c which, like a, has 20 observations, but in which observations 7, 11 and 15 are missing and the other observations are identical to b, in the order of a?
Or, to put it differently: how could I squeeze these missing observations into variable b at locations 7, 11 and 15?
It seems pretty straightforward (and it probably is), but I have been failing to get this to work for a bit too long now.
1) loop Try this loop:
# test data
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
# let's work with vectors
A <- a[[1]]
B <- b[[1]]
j <- 1
C <- A
for(i in seq_along(A)) if (A[i] == B[j]) j <- j+1 else C[i] <- NA
which gives:
> C
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
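As written, the one-line loop would stop with an error if the removed observations included the last element of A, because B[j] is then NA and if() cannot test it. Here is a slightly more defensive sketch of the same greedy idea (mark_missing is a hypothetical helper name, and it assumes neither vector contains NA):

```r
mark_missing <- function(A, B) {
  C <- A
  j <- 1
  for (i in seq_along(A)) {
    # Only compare while there are elements of B left to match
    if (j <= length(B) && A[i] == B[j]) {
      j <- j + 1
    } else {
      C[i] <- NA
    }
  }
  C
}

A_small <- c(2, 7, 4, 8, 9)
B_small <- c(2, 7, 8)        # A_small with elements 3 and 5 removed
C_small <- mark_missing(A_small, B_small)
```

Like the loop above, this is greedy: it matches each element of B to the first possible position in A, which may not be the intended one when values repeat.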
2) Reduce Here is a loop-free version:
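f <- function(j, a) j + (a == B[j])
# start the index j at 1 explicitly; without init, Reduce() would use A[1] as the first j
r <- Reduce(f, A, 1, accumulate = TRUE)
# j stays unchanged exactly where A[i] had no match in B
ifelse(duplicated(r)[-1], NA, A)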
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
3) dtw. Using dtw in the package of the same name we can get a compact loop-free one-liner:
library(dtw)
ifelse(duplicated(dtw(A, B)$index2), NA, A)
giving:
[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
REVISED Added additional solutions.
Here's a more complicated way of doing it, using the Levenshtein distance algorithm, that does a better job on more complicated examples (it also seemed faster in a couple of larger tests I tried):
# using same data as G. Grothendieck:
set.seed(123) # for reproducibility
a <- as.integer(runif(20, min = 0, max = 10))
a <- as.data.frame(a)
b <- as.data.frame(a[c(-7, -11, -15),])
A = a[[1]]
B = b[[1]]
# compute the transformation between the two, assigning infinite weight to
# insertion and substitution
# using +1 here because the integers fed to intToUtf8 have to be larger than 0
# could also adjust the range more dynamically based on A and B
transf = attr(adist(intToUtf8(A+1), intToUtf8(B+1),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] 2 7 4 8 9 0 NA 8 5 4 NA 4 6 5 NA 8 2 0 3 9
More complex matching example (where the greedy algorithm would perform poorly):
A = c(1,1,2,2,1,1,1,2,2,2)
B = c(1,1,1,2,2,2)
transf = attr(adist(intToUtf8(A), intToUtf8(B),
costs = c(Inf,1,Inf), counts = TRUE), 'trafos')
C = A
C[substring(transf, 1:nchar(transf), 1:nchar(transf)) == "D"] <- NA
#[1] NA NA NA NA 1 1 1 2 2 2
# the greedy algorithm would return this instead:
#[1] 1 1 NA NA 1 NA NA 2 2 2
The data frame version, which isn't terribly different from G.'s above.
(Assumes a,b setup as above).
j <- 1
c <- a
for (i in (seq_along(a[,1]))) {
if (a[i,1]==b[j,1]) {
j <- j+1
} else
{
c[i,1] <- NA
}
}
Here is my problem. I have a large vector of positive data. My goal is to remove every sequence of at least N consecutive values that is repeated elsewhere in the vector (all of the repeated values must be strictly > 0).
I've written a program that works; it is as follows:
X is my vector of numeric values;
N is the minimum length of repeated sequences.
rmpParNASerieRepetee <- function(X, N)
{
X_ <- paste("T", paste(X, collapse="T"), "T", sep="")
ind.parcours <- 1
ind.sup <- c()
# Loop on the values
while ( ind.parcours <= (length(X)-N+1) )
{
# indices of my sequence of N values
deb <- ind.parcours
fin <- ind.parcours + N-1
# sequence of N values to search in the vector
serie <- X[deb:fin]
serie_ <- paste("T", paste(serie, collapse="T"), "T", sep="")
borne <- 1*(ind.parcours < (length(X)-N+1)) + 0*(ind.parcours == (length(X)-N+1))
if (sum(X[(length(X)-N+1):length(X)]==serie)==N) borne <- 0
# split my string vector by my sequence vector of N values and count the pieces of result
if ( length(unlist(strsplit(X_, serie_)))-1 > borne && length(which(serie!=0))>=N)
{ ind.sup <- unique(c(ind.sup, deb:fin)) }
ind.parcours <- ind.parcours+1
}
if (length(ind.sup) != 0) { X[ind.sup] <- NA }
list_return <- list(X=X, Ind.sup=unique(sort(ind.sup)))
return (list_return)
}
I think my function is far from optimal (computation time of about 1 h 15 min for a vector of 92,000 values with N=18). And I have to do this step 1600 times, which would take around 3 months.
Does anyone have a better idea?
Example :
x <- c(1,2,3,4,0,4,1,2,3,8,9,1,2,3,4,0)
N <- 3
# (1,2,3) is a sequence of 3 elements which is repeated
# (1,2,3,4) is sequence of 4 elements which is repeated
# no other sequence of length at least 3 repeats
# so my result should be:
# NA NA NA NA 0 4 NA NA NA 8 9 NA NA NA NA 0
# The result of my program is :
# $X
# [1] NA NA NA NA 0 4 NA NA NA 8 9 NA NA NA NA 0
#$Ind.sup
# [1] 1 2 3 4 7 8 9 12 13 14 15
A way:
f <- function(X, N)
{
.rle <- rle(sort(X))
res <- .rle$values[.rle$lengths >= N]
res <- res[res > 0]
inds <- X %in% res
X[inds] <- NA
list(X = X, Ind = which(inds))
}
#> f(X, 3)
#$X
# [1] NA NA NA NA 0 0 0 0 NA NA NA NA NA NA 8 9 NA NA NA NA NA NA 0 0 0
#
#$Ind
# [1] 1 2 3 4 9 10 11 12 13 14 17 18 19 20 21 22
Try using table and %in% to get faster speed due to vectorisation.
rmpParNASerieRepetee<-function(X,N){
tab<-table(X[X>0])
over.n<-as.numeric(names(tab)[tab>=N])
ind<-X %in% over.n
Ind.sup<-which(ind)
X<-ifelse(ind,NA,X)
list(Ind.sup,X)
}
X <- c(1,2,3,4,0,0,0,0,1,4,1,2,3,4,8,9,1,2,3,1,4,1,0,0,0)
rmpParNASerieRepetee(X,3)
# [[1]]
# [1] 1 2 3 4 9 10 11 12 13 14 17 18 19 20 21 22
#
# [[2]]
# [1] NA NA NA NA 0 0 0 0 NA NA NA NA NA NA 8 9 NA NA NA NA NA NA 0 0 0
And a little test for 92000 values:
X<-sample(1:10000,92000,TRUE)
system.time(rmpParNASerieRepetee(X,3))
# user system elapsed
# 0.14 0.00 0.14
One way to think about this is that in a sequence, each element differs from the last one by 1, so:
X <- c(1,2,3,4,0,0,0,0,1,4,1,2,3,4,8,9,1,2,3,1,4,1,0,0,0)
y <- X[-1]
diff <- y - X[1:(length(X)-1)]  # parentheses needed: 1:length(X)-1 means (1:length(X))-1
diff
[1] 1 1 1 -4 0 0 0 1 3 -3 1 1 1 4 1 -8 1 1 -2 3 -3 -1 0 0
And now you're looking for runs of at least N-1 consecutive 1's in diff (N consecutive increasing values produce N-1 differences of 1).
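One way to locate such runs is rle() on the logical vector diff(X) == 1. The sketch below uses illustrative names (run_start, run_length) and only finds runs of values that increase by 1; whether a run is repeated elsewhere in the vector would still be a separate check.

```r
X <- c(1,2,3,4,0,0,0,0,1,4,1,2,3,4,8,9,1,2,3,1,4,1,0,0,0)
N <- 3

runs   <- rle(diff(X) == 1)
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# N consecutive increasing values give N - 1 differences equal to 1
keep <- runs$values & runs$lengths >= N - 1
run_start  <- starts[keep]             # position in X where each run begins
run_length <- runs$lengths[keep] + 1   # number of values in each run
```

For this X the runs begin at positions 1, 11 and 17, covering 1 2 3 4, 1 2 3 4 and 1 2 3.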
I have optimized my function, and now it takes "only" 10 minutes for a vector of length 92000.
Maybe someone can find a faster solution than mine.
Imagine my vector is X<-c(1,2,3,4,0,7,8,1,2,3,NA,8,9,1,2,3,4) and N=3.
c(1,2,3) and c(1,2,3,4) are the only repeated sequences of length at least N without NA or 0. So my result should be NA NA NA NA 0 7 8 NA NA NA NA 8 9 NA NA NA NA.
To solve my problem, I use this principle:
I create a big string like this: X_ <- T1T2T3T4T0T7T8T1T2T3TNAT8T9T1T2T3T4T, in which all X values are concatenated with T as the separator. For each little string of length N=3 (e.g. the first is T1T2T3T), I split the big string X_ with strsplit, using the little string as the pattern. If the result has more than 2 pieces, the sequence is repeated.
Care must be taken not to include zero values in the series, and some adaptation is needed to handle edge effects (borne in my function).
I created these functions, which work:
# Function to count NA in a vector
count.na <- function(vec) { return (length(which(is.na(vec)))) }
# Function to detect sequences of strictly positive numbers of length at least N
rmpParNASerieRepetee <- function(X, N, val.min=0)
{
# Collapse the vector to make a big string
X_ <- paste("T", paste(X, collapse="T"), "T", sep="")
# Index term
ind.parcours <- 1
ind.sup <- c()
# Loop on X values
while ( ind.parcours <= (length(X)-N+1) )
{
# Selection of the sequence to be detected
deb <- ind.parcours
fin <- ind.parcours + N-1
serie <- X[deb:fin]
# All values are > 0
if ( length(which(serie>0)) >= (N-count.na(serie)) )
{
# Search for repetitions with strsplit
serie_ <- paste("T", paste(serie, collapse="T"), "T", sep="")
borne <- 1*(ind.parcours < (length(X)-N+1)) + 0*(ind.parcours == (length(X)-N+1))
if (sum(X[(length(X)-N+1):length(X)]==serie, na.rm=TRUE)==N) borne <- 0
if (length(unlist(strsplit(X_, serie_)))-1 > borne)
ind.sup <- unique( c(ind.sup, deb:fin) )
# Incrementation
ind.parcours <- ind.parcours + 1
}
# Contains 0
else
{ ind.parcours <- ind.parcours + max(which(serie==0))
}
}
# Invalidation of repeated sequences
if (length(ind.sup) != 0) { X[ind.sup] <- NA }
# Return
list_return <- list(X=X, Ind.sup=unique(sort(ind.sup)))
return (list_return)
}
I hope someone will find another way to solve my problem!
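One possible other way, offered only as a sketch: build one key per window of N consecutive values and let duplicated() find the keys that occur more than once, instead of calling strsplit once per window. mask_repeated_windows is a hypothetical name, and the sketch assumes that overlapping occurrences of the same window also count as repeats, which may differ from the strsplit version.

```r
mask_repeated_windows <- function(X, N) {
  n <- length(X)
  if (n < N) return(X)
  starts <- seq_len(n - N + 1)

  # One string key per window of N consecutive values
  keys <- vapply(starts, function(s) {
    paste(X[s:(s + N - 1)], collapse = "T")
  }, character(1))

  # A window qualifies only if every value is non-missing and strictly positive
  valid <- vapply(starts, function(s) {
    w <- X[s:(s + N - 1)]
    !anyNA(w) && all(w > 0)
  }, logical(1))

  # Keys that occur at least twice, looking from both directions
  repeated <- valid & (duplicated(keys) | duplicated(keys, fromLast = TRUE))

  # Every position covered by a repeated window is set to NA
  idx <- unique(unlist(lapply(starts[repeated], function(s) s:(s + N - 1))))
  X[idx] <- NA
  X
}

X <- c(1, 2, 3, 4, 0, 7, 8, 1, 2, 3, NA, 8, 9, 1, 2, 3, 4)
res <- mask_repeated_windows(X, 3)
```

Longer repeats such as c(1,2,3,4) are picked up automatically, because each of their length-N windows is itself repeated. I have not timed this on a vector of 92,000 values.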
I have a vector in R,
a = c(2,3,4,9,10,2,4,19)
let us say I want to efficiently insert the following vectors, b, and c,
b = c(2,1)
d = c(0,1)
right after the 3rd and 7th positions (the "4" entries), resulting in,
e = c(2,3,4,2,1,9,10,2,4,0,1,19)
How would I do this efficiently in R, without recursively using cbind or similar?
I found a package, R.basic, but it's not part of the CRAN packages, so I would rather use something supported.
Try this:
result <- vector("list",5)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (c(3,7)+1)))
result[c(FALSE,TRUE)] <- list(b,d)
f <- unlist(result)
identical(f, e)
#[1] TRUE
EDIT: generalization to arbitrary number of insertions is straightforward:
insert.at <- function(a, pos, ...){
dots <- list(...)
stopifnot(length(dots)==length(pos))
result <- vector("list",2*length(pos)+1)
result[c(TRUE,FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos+1)))
result[c(FALSE,TRUE)] <- dots
unlist(result)
}
> insert.at(a, c(3,7), b, d)
[1] 2 3 4 2 1 9 10 2 4 0 1 19
> insert.at(1:10, c(4,7,9), 11, 12, 13)
[1] 1 2 3 4 11 5 6 7 12 8 9 13 10
> insert.at(1:10, c(4,7,9), 11, 12)
Error: length(dots) == length(pos) is not TRUE
Note the bonus error checking if the number of positions and insertions do not match.
You can use the following function,
ins(a, list(b, d), pos=c(3, 7))
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
where:
ins <- function(a, to.insert=list(), pos=c()) {
c(a[seq(pos[1])],
to.insert[[1]],
a[seq(pos[1]+1, pos[2])],
to.insert[[2]],
a[seq(pos[2]+1, length(a))]  # start after pos[2] so that a[pos[2]] is not repeated
)
}
Here's another function, using Ricardo's syntax, Ferdinand's split and @Arun's interleaving trick from another question:
ins2 <- function(a,bs,pos){
as <- split(a,cumsum(seq_along(a)%in%(pos+1)))
idx <- order(c(seq_along(as),seq_along(bs)))
unlist(c(as,bs)[idx])
}
The advantage is that this should extend to more insertions. However, it may produce weird output when passed invalid arguments, e.g., with any(pos > length(a)) or length(bs)!=length(pos).
You can change the last line to unname(unlist(... if you don't want a's items named.
The straightforward approach:
b.pos <- 3
d.pos <- 7
c(a[1:b.pos],b,a[(b.pos+1):d.pos],d,a[(d.pos+1):length(a)])
[1] 2 3 4 2 1 9 10 2 4 0 1 19
Note the importance of the parentheses around the boundaries of the : operator.
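The reason is operator precedence: : binds more tightly than + and -, so an unparenthesised bound is applied to the whole sequence. A small illustration (n, shifted and shorter are names chosen here):

```r
n <- 5
shifted <- 1:n - 1     # parsed as (1:n) - 1, i.e. 0 1 2 3 4
shorter <- 1:(n - 1)   # 1 2 3 4
```

Used as an index, the first form happens to behave almost like the second, because a 0 index is silently dropped, which makes the mistake easy to miss.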
After using Ferdinand's function, I tried to write my own and surprisingly it is far more efficient.
Here's mine :
insertElems = function(vect, pos, elems) {
l = length(vect)
j = 0
for (i in 1:length(pos)){
if (pos[i]==1)
vect = c(elems[j+1], vect)
else if (pos[i] == length(vect)+1)
vect = c(vect, elems[j+1])
else
vect = c(vect[1:(pos[i]-1+j)], elems[j+1], vect[(pos[i]+j):(l+j)])
j = j+1
}
return(vect)
}
tmp = c(seq(1:5))
insertElems(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
insert.at(tmp, c(2,4,5), c(NA,NA,NA))
# [1] 1 NA 2 3 NA 4 NA 5
And here is the benchmark result:
> microbenchmark(insertElems(tmp, c(2,4,5), c(NA,NA,NA)), insert.at(tmp, c(2,4,5), c(NA,NA,NA)), times = 10000)
Unit: microseconds
expr min lq mean median uq max neval
insertElems(tmp, c(2, 4, 5), c(NA, NA, NA)) 9.660 11.472 13.44247 12.68 13.585 1630.421 10000
insert.at(tmp, c(2, 4, 5), c(NA, NA, NA)) 58.866 62.791 70.36281 64.30 67.923 2475.366 10000
My code also handles some cases better:
> insert.at(tmp, c(1,4,5), c(NA,NA,NA))
# [1] 1 2 3 NA 4 NA 5 NA 1 2 3
# Warning message:
# In result[c(TRUE, FALSE)] <- split(a, cumsum(seq_along(a) %in% (pos))) :
# number of items to replace is not a multiple of replacement length
> insertElems(tmp, c(1,4,5), c(NA,NA,NA))
# [1] NA 1 2 3 NA 4 NA 5
Here's an alternative that uses append. It's fine for small vectors, but I can't imagine it being efficient for large vectors since a new vector is created upon each iteration of the loop (which is, obviously, bad). The trick is to reverse the vector of things that need to be inserted to get append to insert them in the correct place relative to the original vector.
a = c(2,3,4,9,10,2,4,19)
b = c(2,1)
d = c(0,1)
pos <- c(3, 7)
z <- setNames(list(b, d), pos)
z <- z[order(as.numeric(names(z)), decreasing=TRUE)]  # sort numerically, so that "10" comes after "9"
for (i in seq_along(z)) {
a <- append(a, z[[i]], after = as.numeric(names(z)[[i]]))
}
a
# [1] 2 3 4 2 1 9 10 2 4 0 1 19
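If the repeated copying is a concern, one alternative is to concatenate everything once and reorder it with a sort key. This is only a sketch: insert_after is a hypothetical name, values is assumed to be a list with one vector per entry of pos, and I have not benchmarked it against the other answers.

```r
insert_after <- function(a, pos, values) {
  stopifnot(length(pos) == length(values))
  # Original elements are keyed by their own index; inserted elements by
  # the position they should follow
  keys <- c(seq_along(a), rep(pos, lengths(values)))
  combined <- c(a, unlist(values))
  # Ties are broken by the order in 'combined', so a[pos] comes before
  # the values inserted after it
  combined[order(keys, seq_along(keys))]
}

a <- c(2, 3, 4, 9, 10, 2, 4, 19)
b <- c(2, 1)
d <- c(0, 1)
e <- insert_after(a, c(3, 7), list(b, d))
```

A position of 0 puts the values in front of the first element, and the positions do not need to be sorted.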
I've got a large nxn matrix and would like to take off-diagonal slices of varying sizes. For example:
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
I'd like an R function which, when given the matrix and "width of diagonal slice" would return an nxn matrix of just those values. So for the matrix above and, say, 3, I'd get:
1 x x x x x
1 2 x x x x
1 2 3 x x x
x 2 3 4 x x
x x 3 4 5 x
x x x 4 5 6
At the moment I'm using (forgive me) a for loop which is incredibly slow:
getDiags<-function(ndiags, cormat){
resmat=matrix(ncol=ncol(cormat),nrow=nrow(cormat))
dimnames(resmat)<-dimnames(cormat)
for(j in 1:ndiags){
resmat[row(resmat) == col(resmat) + j] <-
cormat[row(cormat) == col(cormat) + j]
}
return(resmat)
}
I realise that this is a very "un-R" way to go about solving this problem. Is there a better way to do it, probably using diag or lower.tri?
size <- 6
mat <- matrix(seq_len(size ^ 2), ncol = size)
low <- 0
high <- 3
delta <- rep(seq_len(ncol(mat)), nrow(mat)) -
rep(seq_len(nrow(mat)), each = ncol(mat))
#or Ben Bolker's better alternative
delta <- row(mat) - col(mat)
mat[delta < low | delta > high] <- NA
mat
This works with 5000 x 5000 matrices on my machine.
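Wrapped as a function, the same idea might look like the sketch below. band_slice and width are names I made up, and I am assuming that "width 3" means the main diagonal plus the two diagonals below it, as in the question's example (the snippet above, with high <- 3, keeps one more diagonal).

```r
band_slice <- function(mat, width) {
  delta <- row(mat) - col(mat)
  # Keep the main diagonal (delta == 0) and the width - 1 diagonals below it
  mat[delta < 0 | delta >= width] <- NA
  mat
}

m <- matrix(1:6, nrow = 6, ncol = 6, byrow = TRUE)
res <- band_slice(m, 3)
```

row() and col() return integer matrices of the same shape as the input, so the comparison is fully vectorised and no loop over diagonals is needed.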
If you want to use upper.tri and lower.tri you could write functions like these:
cormat <- mapply(rep, 1:6, 6)
u.diags <- function(X, n) {
X[n:nrow(X),][lower.tri(X[n:nrow(X),])] <- NA
return(X)
}
or
l.diags <- function(X, n) {
X[,n:ncol(X)][upper.tri(X[,n:ncol(X)])] <- NA
return(X)
}
or
n.diags <- function(X, n.u, n.l) {
X[n.u:nrow(X),][lower.tri(X[n.u:nrow(X),])] <- NA
X[,n.l:ncol(X)][upper.tri(X[,n.l:ncol(X)])] <- NA
return(X)
}
l.diags(cormat, 3)
u.diags(cormat, 3)
n.diags(cormat, 3, 1)
You can do:
matrix:
m <- matrix(1:6, ncol = 6, nrow = 6, byrow = TRUE)
function:
n_diag <- function (x, n) {
d <- dim(x)
ndiag <- .row(d) - n >= .col(d)
x[upper.tri(x) | ndiag] <- NA
return(x)
}
call:
n_diag(m,3)
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 1 NA NA NA NA NA
#[2,] 1 2 NA NA NA NA
#[3,] 1 2 3 NA NA NA
#[4,] NA 2 3 4 NA NA
#[5,] NA NA 3 4 5 NA
#[6,] NA NA NA 4 5 6
just for fun:
#lapply(1:6, n_diag, x = m)