R - identify sequences in a vector - r

Suppose I have a vector ab containing A's and B's. I want to identify sequences and create a vector v with length(ab) that indicates the sequence length at the beginning and end of a given sequence and NA otherwise.
However, there is a restriction: another 0/1 vector x marks positions where a sequence must end, even if the letter does not change.
So for example:
rep("A", 6)
"A" "A" "A" "A" "A" "A"
x <- c(0,0,1,0,0,0)
0 0 1 0 0 0
should give
v <- c(3, NA, 3, 3, NA, 3)
An example could be the following:
ab <- c(rep("A", 5), "B", rep("A", 3))
"A" "A" "A" "A" "A" "B" "A" "A" "A"
x <- c(rep(0,3),1,0,1,rep(0,3))
0 0 0 1 0 1 0 0 0
Here the output should be:
4 NA NA 4 1 1 3 NA 3
(without the restriction it would be)
5 NA NA NA 5 1 3 NA 3
So far, my code without the restriction looks like this:
ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0,3),1,0,1,rep(0,3))
cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t the previous value?
idx <- which(cng) # where do the changes take place?
idx <- c(idx,length(ab)) # include the last value
seq_length <- diff(c(0, idx)) # how long are the sequences?
# create v
v <- rep(NA, length(ab))
v[idx] <- seq_length # sequence end
v[idx-(seq_length-1)] <- seq_length # sequence start
v
Does anyone have an idea how I can implement the restriction? (And since my vector has 2 million observations, I wonder whether there is a more efficient way than my approach.)
I would appreciate any comments! Many thanks in advance!

You may do something like this
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
#creating result of lengths
res <- as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
> res
[1] 4 4 4 4 1 1 1 3 3 3
#creating intermediate NAs
replace(res,
        with(rle(res),
             setdiff(seq_along(res),
                     c(length(res) + 1 - cumsum(rev(lengths)),  # run starts
                       cumsum(lengths),                          # run ends
                       which(res == 1)))),                       # runs of length 1
        NA)
[1] 4 NA NA 4 1 1 1 3 NA 3
As per edited scenario
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- 'B'
as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
[1] 2 2 1 1 1 1 1 3 3 3
ab
[1] "A" "A" "B" "A" "A" "B" "A" "A" "A" "A"
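Another possibility, staying close to the code in the question, is to treat a position as the end of a run whenever the next letter differs or x is 1 there. This is only a sketch under that reading of the restriction; the name ends is introduced here for illustration and is not part of the original code. It uses vectorized operations only, so it should scale to long vectors, though it has not been timed on 2 million observations.

```r
ab <- c(rep("A", 5), "B", rep("A", 3))
x  <- c(rep(0, 3), 1, 0, 1, rep(0, 3))
n  <- length(ab)

# A run ends at position i if the next letter differs, or x[i] == 1,
# or i is the last position (assumed reading of the restriction).
ends <- c(ab[-1L] != ab[-n] | x[-n] == 1, TRUE)
idx  <- which(ends)                 # positions where runs end
seq_length <- diff(c(0L, idx))      # length of each run

v <- rep(NA_integer_, n)
v[idx] <- seq_length                    # run ends
v[idx - seq_length + 1L] <- seq_length  # run starts
v
# [1]  4 NA NA  4  1  1  3 NA  3
```

With rep("A", 6) and x <- c(0,0,1,0,0,0) the same code gives 3 NA 3 3 NA 3, matching the first example in the question.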

Related

implementation in R: Finding the distance between two lists of characters

I am new to R and trying to understand how I can implement the algorithm below in R.
I have two lists of characters and want to see what is the minimum distance between these two.
List 1: "a", "b", "c"
List 2: "a", "b", "c", "d"
At the first step, I created a table like this:
a b c
a 0 1 2
b 1
c 2
d 3
At the second step, I filled the rest of the matrix with 0s:
a b c
a 0 1 2
b 1 0 0
c 2 0 0
d 3 0 0
Now, I want to start calculating the distance among these two lists by this algorithm and update the matrix:
if (characters_in_header_of_matrix[i]==characters_in_column_of_matrix [j] & value[i,j] == value[i+1][j-1] )
then {get the 'diagonal value' #diagonal value= value[i, j-1]}
else{
value[i,j] = min(value[i-1, j], value[i-1, j-1], value[i, j-1]) + 1
}
endif
To find the difference between the two lists shown in the header and the first column of the matrix, I used the strcmp() function, but I failed to implement this.
The final result should look like this :
a b c
a 0 1 2
b 1 0 1
c 2 1 0
d 3 2 1
I would appreciate your help. Thanks.
In R, many things that you can do by brute force with for loops and conditionals can be done more easily in a vectorized way. The benefit might be speed (though not always), but it more often shows up as simpler code and, once you grok the functions, better readability and maintainability.
Take for example this problem:
l1 <- c("a", "b", "c")
l2 <- c("a", "b", "c", "d")
You want to find the "distance" (absolute distance, to be specific) between each letter in l1 and each letter in l2. The outer function does an "outer product" of the two vectors. For instance, if we were to conceptually (not actually) do outer(a:c, 1:3), it would pair a1, a2, a3, b1, ..., c3. (That is not legal R code, just used for demonstration, though it can be done quite easily with a couple of minor additions.)
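As a rough illustration of the pairing just described, one legal way to write it is to pass character letters and use paste0 as the combining function. The variable name pairs is only for this sketch.

```r
# Pair each of "a", "b", "c" with each of 1, 2, 3.
# paste0 is vectorized, so outer() can call it once on the expanded vectors.
pairs <- outer(letters[1:3], 1:3, FUN = paste0)
pairs
#      [,1] [,2] [,3]
# [1,] "a1" "a2" "a3"
# [2,] "b1" "b2" "b3"
# [3,] "c1" "c2" "c3"
```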
In our case, if we do outer(l1, l2), the function it uses defaults to multiplication (*), since its initial use is often in linear algebra, but this function can easily be overridden with FUN=. Internally, what it is doing is creating two (much longer) vectors doing all of the pairing. We can see what is happening under the hood if we introduce a debugging function to inspect the state.
debugfunc <- function(a, b) { browser(); 1; }
(The 1 is there solely as a place-holder.)
outer(l1, l2, FUN=debugfunc)
# Called from: FUN(X, Y, ...)
# Browse[2]>
a # <--- the object 'a' here is the first argument to this function
# [1] "a" "b" "c" "a" "b" "c" "a" "b" "c" "a" "b" "c"
# Browse[2]>
b # <--- the second argument
# [1] "a" "a" "a" "b" "b" "b" "c" "c" "c" "d" "d" "d"
In order, this pairs "a" with "a", then "b" with "a", then "c" with "a", etc. It exhausts the first (l1) vector and then increments the second vector, repeating until both are exhausted. At this point, the debugfunc is called exactly once with these two vectors (not once per pair, as some might suspect), so your FUN= function must be able to do all operations in one call.
One might want to look at the distances here. You can determine an individual letter's position in the alphabet by using match("a", letters) (the companion LETTERS is all-caps). In general, match finds the position of the first argument(s) in the second. So continuing within debugfunc:
# Browse[2]>
match(a, letters)
# [1] 1 2 3 1 2 3 1 2 3 1 2 3
# Browse[2]>
match(b, letters)
# [1] 1 1 1 2 2 2 3 3 3 4 4 4
So what you want is really the difference between those two numeric vectors. One could easily do:
# Browse[2]>
match(a, letters) - match(b, letters)
# [1] 0 1 2 -1 0 1 -2 -1 0 -3 -2 -1
but since we need the absolute distance, we really need
# Browse[2]>
abs( match(a, letters) - match(b, letters) )
# [1] 0 1 2 1 0 1 2 1 0 3 2 1
Ok, so I think we have our function here. Let's break out of the debugger (Q) and try this a little more formally:
distfunc <- function(a, b) abs( match(a, letters) - match(b, letters) )
outer(l1, l2, FUN=distfunc)
# [,1] [,2] [,3] [,4]
# [1,] 0 1 2 3
# [2,] 1 0 1 2
# [3,] 2 1 0 1
Note that the first argument becomes the rows, so l1 with a length of 3 gives us 3 rows. If you need to apply the row/column names, then:
o <- outer(l1, l2, FUN=distfunc)
dimnames(o) <- list(l1, l2)
o
# a b c d
# a 0 1 2 3
# b 1 0 1 2
# c 2 1 0 1
(Changing the order of the arguments will give you precisely the matrix you are seeking.)
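For completeness, here is a sketch of that swapped call, reusing l1, l2 and distfunc as defined above. Note that it measures distance in the alphabet, which coincides with the matrix in the question for these inputs but is not a general edit distance.

```r
l1 <- c("a", "b", "c")
l2 <- c("a", "b", "c", "d")
distfunc <- function(a, b) abs(match(a, letters) - match(b, letters))

# l2 first, so its 4 elements become the rows
o <- outer(l2, l1, FUN = distfunc)
dimnames(o) <- list(l2, l1)
o
#   a b c
# a 0 1 2
# b 1 0 1
# c 2 1 0
# d 3 2 1
```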

R data.frame manipulation: convert to NA after specific column

I have a large data.frame and need a row-wise conversion: in each row, every value after the first occurrence of a specific character should become NA.
For example I provide little sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
For example, in sample_df I want to turn all values after the first "I" in each row to NA, which gives result_df.
I tried base R, dplyr, and purrr but could not come up with an algorithm.
Thanks for your help.
Try this:
Find "I" values
I_true<-sample_df=="I"
I_true
a b c d
[1,] FALSE TRUE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
Find positions from the first "I" seen
out<-t(apply(t(I_true),2,cumsum))
out
a b c d
[1,] 0 1 1 1
[2,] 1 1 1 1
[3,] 0 0 1 2
[4,] 0 0 0 0
Replace needed values
output<-out
output[out>=1]<-NA
output[output==0]<-"V"
output[I_true]<-"I"
output[out>=2]<-NA
Your output
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "V" "V" "I" NA
[4,] "V" "V" "V" "V"
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "I" NA NA NA
[4,] "V" "V" "V" "V"
Here is a brute force approach, which should be the easiest to come up with but the least preferred. Anyway, here it is:
df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
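rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
  # column right after the first "I" in this row (NA if the row has no "I")
  first <- which(as.character(df[i, ]) == "I")[1] + 1
  # skip rows without an "I" and rows whose first "I" is in the last column
  if (!is.na(first) && first <= rowlength) {
    df[i, first:rowlength] <- NA
  }
}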
Here's a possible answer using ddply from the plyr package
library(plyr)
ddply(sample_df, .(a, b, c, d), function(x) {
  idx <- which(x == "I")[1] + 1  # column after the first "I"
  if (!is.na(idx)) {             # an "I" was found
    if (idx <= ncol(x)) {        # prevent out-of-bounds indexing
      x[, idx:ncol(x)] <- NA
    }
  }
  x
})
The plyr approach :
plyr::adply(sample_df, 1L, function(x) {
  if (all(x != "I"))
    return(x)
  x[1L:min(which(x == "I"))]
})
You need the if because which(x == "I") is empty for rows without any "I", so min() of it does not give a usable column index.
My Solution:
Following @Julien Navarre's recommendation, I first created a toNA() function:
toNA <- function(x) {
  temp <- grep("INVALID", unlist(x))  # which can be generalized for any string
  lt <- length(x)
  loc <- min(temp, 100) + 1  # 100 is an arbitrary number bigger than the actual column count
  # print(lt)  # debug purposes
  if (loc < lt + 1) {
    x[loc:lt] <- NA
  }
  x
}
First, I tried the plyr::adply() and purrrlyr::by_row() functions to apply my toNA() function to my data.frame, which has over 3 million rows.
Both are very slow (for 1000 rows they take 9 and 6 seconds respectively). These approaches are also slow with a simple function(x) x, so I am not sure what the overhead is.
So I tried the base::apply() function (result is my data set):
as.tibble(t(apply(result, 1, toNA ) ))
It only takes 0.2 seconds for 1000 rows.
I am not sure about programming style but for now this solution works for me.
Thanks for all your recommendations.
A pure base solution: we build a boolean matrix marking which cells equal "I", then a double cumsum by row shows where the NAs must be placed:
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1
result_df
# a b c d
# 1 V I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V V I <NA>
# 4 V V V V
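To see why the double cumsum works, it may help to trace a single row by hand. The sketch below uses row 3 of sample_df == "I"; the names first_pass, second_pass and to_na are only for illustration.

```r
# Row 3 of sample_df is V V I I, so the "I" indicator is:
x <- c(FALSE, FALSE, TRUE, TRUE)

first_pass  <- cumsum(x)           # 0 0 1 2: at least 1 from the first "I" onward
second_pass <- cumsum(first_pass)  # 0 0 1 3: exactly 1 only at the first "I" itself
to_na <- second_pass > 1           # FALSE FALSE FALSE TRUE
to_na
```

Everything strictly after the first "I" ends up above 1, which is exactly the set of cells to blank out.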

R: Merging two vectors and shuffle them with a maximum number of repetitions

I want to merge two vectors in R. Each has 30 equal values, e.g.:
a <- rep("a",30)
b <- rep("b",30)
How do I achieve that there are never more than three consecutive appearances of a's or b's in the resulting vector?
c <- c("a","a","b","b","b","a","b","a","a",...)
Try
f1 <- function(x, y, n1, n2) {
  repeat {
    v1 <- sample(c(x, y), n1, replace = TRUE)
    if (all(rle(v1)$lengths <= n2)) break
  }
  return(v1)
}
res <- f1(a,b, 20, 3)
res
#[1] "a" "a" "b" "b" "a" "b" "b" "a" "b" "a" "b" "b" "b" "a" "a"
# "a" "b" "a" "b"
#[20] "a"
rle(res)$lengths
#[1] 2 2 1 2 1 1 1 3 3 1 1 1 1
rle(f1(a,b, 30, 2))$lengths
#[1] 1 2 1 1 2 2 2 1 1 2 2 1 2 1 1 1 1 2 1 2 1
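The acceptance test inside the loop relies on rle(), which collapses a vector into runs of equal values. A small sketch on the fragment from the question shows what it checks; the names v and r are only for this example.

```r
v <- c("a", "a", "b", "b", "b", "a", "b", "a", "a")
r <- rle(v)
r$lengths              # 2 3 1 1 2
r$values               # "a" "b" "a" "b" "a"
all(r$lengths <= 3)    # TRUE: no value repeats more than three times in a row
all(r$lengths <= 2)    # FALSE: the run of three "b" is too long
```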
Based on @jbest's comments, if you need the result to have an equal number of "a" and "b" elements:
f1N <- function(x, y, n1, n2) {
  repeat {
    v1 <- sample(c(x, y), n1, replace = TRUE)
    if (all(rle(v1)$lengths <= n2) & !diff(table(v1))) break
  }
  return(v1)
}
res <- f1N(a,b,36,2)
table(res)
#res
# a b
#18 18
Update
You may also not be looking for a variable n1. The function below shuffles the two vectors as they are and returns a vector of 60 elements (30 "a" and 30 "b"):
f2 <- function(x, y, n) {
  repeat {
    v1 <- sample(c(x, y))
    if (all(rle(v1)$lengths <= n)) break
  }
  return(v1)
}
res <- f2(a,b,3)
rle(res)$lengths
#[1] 2 1 3 1 3 1 1 2 1 1 1 1 1 1 1 2 3 1 1 2 2 1 1 2 1 3 1 3 1 3 2 1 2 2 2 2 1
table(res)
#res
#a b
#30 30

How to create a factor from a binary indicator matrix?

Say I have the following matrix mat, which is a binary indicator matrix for the levels A, B, and C for a set of 5 observations:
mat <- matrix(c(1,0,0,
1,0,0,
0,1,0,
0,1,0,
0,0,1), ncol = 3, byrow = TRUE)
colnames(mat) <- LETTERS[1:3]
> mat
A B C
[1,] 1 0 0
[2,] 1 0 0
[3,] 0 1 0
[4,] 0 1 0
[5,] 0 0 1
I want to convert that into a single factor such that the output is equivalent to fac defined as:
> fac <- factor(rep(LETTERS[1:3], times = c(2,2,1)))
> fac
[1] A A B B C
Levels: A B C
Extra points if you get the labels from the colnames of mat, but a set of numeric codes (e.g. c(1,1,2,2,3)) would also be acceptable as desired output.
Elegant solution with matrix multiplication (and shortest up to now):
as.factor(colnames(mat)[mat %*% 1:ncol(mat)])
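To unpack the one-liner: multiplying each indicator row by 1:ncol(mat) leaves just the column index of the single 1 in that row. The sketch below spells this out; the name codes and the explicit levels= argument are additions for illustration (the latter keeps a level even if some column contains no 1, which as.factor would drop).

```r
mat <- matrix(c(1,0,0,
                1,0,0,
                0,1,0,
                0,1,0,
                0,0,1), ncol = 3, byrow = TRUE)
colnames(mat) <- LETTERS[1:3]

# Each row has a single 1, so row %*% (1, 2, 3) is the index of that column
codes <- drop(mat %*% seq_len(ncol(mat)))
codes   # 1 1 2 2 3

fac <- factor(colnames(mat)[codes], levels = colnames(mat))
fac
```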
This solution makes use of the arr.ind=TRUE argument of which, returning the matching positions as array locations. These are then used to index the colnames:
> factor(colnames(mat)[which(mat==1, arr.ind=TRUE)[, 2]])
[1] A A B B C
Levels: A B C
Decomposing into steps:
> which(mat==1, arr.ind=TRUE)
row col
[1,] 1 1
[2,] 2 1
[3,] 3 2
[4,] 4 2
[5,] 5 3
Use the values of the second column, i.e. which(...)[, 2] and index colnames:
> colnames(mat)[c(1, 1, 2, 2, 3)]
[1] "A" "A" "B" "B" "C"
And then convert to a factor
One way is to replicate the names out by row number and index directly with the matrix, then wrap that with factor to restore the levels:
factor(rep(colnames(mat), each = nrow(mat))[as.logical(mat)])
[1] A A B B C
Levels: A B C
If this is from model.matrix, the colnames have fac prepended, and so this should work the same but removing the extra text:
factor(gsub("^fac", "", rep(colnames(mat), each = nrow(mat))[as.logical(mat)]))
You could use something like this:
lvls<-apply(mat, 1, function(currow){match(1, currow)})
fac<-factor(lvls, 1:3, labels=colnames(mat))
Here is another one
factor(rep(colnames(mat), colSums(mat)))
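If numeric codes are enough, base R's max.col() may be one more option, since the position of the row maximum in an indicator row is the position of its 1. This is a sketch that assumes every row contains exactly one 1; ties.method = "first" is set only to make the result deterministic.

```r
mat <- matrix(c(1,0,0,
                1,0,0,
                0,1,0,
                0,1,0,
                0,0,1), ncol = 3, byrow = TRUE)
colnames(mat) <- LETTERS[1:3]

codes <- max.col(mat, ties.method = "first")  # column of the maximum in each row
codes   # 1 1 2 2 3

# Optional: attach the column names as labels
fac2 <- factor(codes, levels = seq_len(ncol(mat)), labels = colnames(mat))
fac2
```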

in R, how to retrieve a complete matrix using combn?

My problem, stripped of its specific purpose, is how to transform a combination like this.
First, use combn(letters[1:4], 2) to compute the combinations:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "a" "a" "a" "b" "b" "c"
[2,] "b" "c" "d" "c" "d" "d"
Then use each column to compute one value, which gives another data frame:
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 2 3 4 5 6
Each element is computed from the corresponding column above (for example, the first element comes from the first column).
How can I then transform this data frame into a matrix such as the following?
a b c d
a 0 1 2 3
b 1 0 4 5
c 2 4 0 6
d 3 5 6 0
Elements with the same row and column name are zero; every other element takes the value computed for that pair.
Here is one way that works:
inputs <- letters[1:4]
combs <- combn(inputs, 2)
N <- seq_len(ncol(combs))
nams <- unique(as.vector(combs))
out <- matrix(ncol = length(nams), nrow = length(nams))
out[lower.tri(out)] <- N
out <- t(out)
out[lower.tri(out)] <- N
out <- t(out)
diag(out) <- 0
rownames(out) <- colnames(out) <- inputs
Which gives:
> out
a b c d
a 0 1 2 3
b 1 0 4 5
c 2 4 0 6
d 3 5 6 0
If I had to do this a lot, I'd wrap those function calls into a function.
Another option is to use as.matrix.dist() to do the conversion for us by setting up a "dist" object by hand. Using some of the objects from earlier:
## Far easier
out2 <- N
class(out2) <- "dist"
attr(out2, "Labels") <- as.character(inputs)
attr(out2, "Size") <- length(inputs)
attr(out2, "Diag") <- attr(out2, "Upper") <- FALSE
out2 <- as.matrix(out2)
Which gives:
> out2
a b c d
a 0 1 2 3
b 1 0 4 5
c 2 4 0 6
d 3 5 6 0
Again, I'd wrap this in a function if I had to do it more than once.
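As a sketch of what such a wrapper might look like, the function below fills the matrix by indexing with the pair positions from combn(). The name pairs_to_matrix and its arguments are hypothetical, and it assumes the values are supplied in combn() column order.

```r
pairs_to_matrix <- function(inputs, values) {
  n <- length(inputs)
  stopifnot(length(values) == n * (n - 1) / 2)  # one value per pair
  out <- matrix(0, nrow = n, ncol = n, dimnames = list(inputs, inputs))
  idx <- t(combn(n, 2))      # one row per pair: (row index, column index)
  out[idx] <- values         # upper triangle, in combn() order
  out[idx[, 2:1]] <- values  # mirrored lower triangle
  out
}

m <- pairs_to_matrix(letters[1:4], 1:6)
m
#   a b c d
# a 0 1 2 3
# b 1 0 4 5
# c 2 4 0 6
# d 3 5 6 0
```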
Does it have to be a mirror matrix with zeros over the diagonal?
combo <- combn(letters[1:4], 2)
in.combo <- matrix(1:6, nrow = 1)
combo <- rbind(combo, in.combo)
out.combo <- matrix(rep(NA, 16), ncol = 4)
colnames(out.combo) <- letters[1:4]
rownames(out.combo) <- letters[1:4]
for (cols in 1:ncol(combo)) {
  vec1 <- combo[, cols]
  out.combo[vec1[1], vec1[2]] <- as.numeric(vec1[3])
}
> out.combo
a b c d
a NA 1 2 3
b NA NA 4 5
c NA NA NA 6
d NA NA NA NA
