Get the number of matching characters between two strings in R

I have a data frame where I need to compare two columns and find the number of matching characters between two elements.
For example, x and y are two elements to be compared, which look like this:
x<- "1/2"
y<-"2/3"
I split them on '/' and unlisted the result, as below:
unlist(strsplit(x,"/"))->a
unlist(strsplit(y,"/"))->b
Then I used pmatch:
pmatch(a,b,nomatch =0)
[1] 0 1
I used sum() to find out how many characters match:
sum(pmatch(a,b,nomatch =0))
[1] 1
However, when the comparison is done the other way:
pmatch(b,a,nomatch = 0)
[1] 2 0
Since there is only one match between the two strings, why does it show 2? It could be an index. I need the number of characters the two strings have in common, regardless of whether a is compared with b or b with a.
Could someone help me get this?

Per ?pmatch, pmatch seeks matches for the elements of its first argument among those of its second, and it returns the positions of those matches in the second argument.
For example, "2" in the first vector matches the second element of the second vector:
> pmatch(c("2", "1"),c("3","2"),nomatch =0)
# [1] 2 0
One way to find out how many elements matched is to count the non-zero entries:
sum(pmatch(c("2", "1"),c("3","2"),nomatch =0) != 0)
# [1] 1
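To make the point about positions concrete, here is a small sketch using the vectors from the question. It also illustrates a caveat worth keeping in mind: pmatch performs partial matching, so a token can match a longer string that merely starts with it, whereas match is exact. The variable names are illustrative only.

```r
a <- c("1", "2")
b <- c("2", "3")

# pmatch returns positions in its second argument, not a count
hits_ab <- pmatch(a, b, nomatch = 0)   # "2" sits at position 1 of b
hits_ba <- pmatch(b, a, nomatch = 0)   # "2" sits at position 2 of a

# counting the non-zero entries gives the same answer in both directions
n_ab <- sum(hits_ab != 0)
n_ba <- sum(hits_ba != 0)

# partial versus exact matching
partial <- pmatch("1", "12", nomatch = 0)  # "1" is a unique prefix of "12"
exact   <- match("1", "12", nomatch = 0)   # no exact match
```

If the tokens can be prefixes of one another, match (or %in%) is probably the safer choice.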

Both
sum(pmatch(b, a, nomatch = 0) != 0) # 1
sum(pmatch(a, b, nomatch = 0) != 0) # 1
return the same value.

Another option could be
sum(b %in% a)
[1] 1
sum(a %in% b)
[1] 1
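Since the question mentions comparing two columns of a data frame, the idea could be wrapped up roughly as follows. This is only a sketch: count_shared, the column names, and the sample rows are made up for illustration, and it counts distinct shared tokens via intersect, which is symmetric by construction.

```r
# Hypothetical helper: number of distinct tokens shared by x[i] and y[i]
count_shared <- function(x, y, sep = "/") {
  a <- strsplit(x, sep, fixed = TRUE)
  b <- strsplit(y, sep, fixed = TRUE)
  mapply(function(p, q) length(intersect(p, q)), a, b, USE.NAMES = FALSE)
}

# Made-up example data
df <- data.frame(x = c("1/2", "1/2", "3/4"),
                 y = c("2/3", "2/1", "5/6"),
                 stringsAsFactors = FALSE)
df$n_shared <- count_shared(df$x, df$y)
```

Note that sum(a %in% b) counts repeated tokens in a separately, while intersect counts each distinct token once; which one is wanted depends on the data.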

Related

Count occurrences in a cell, with a condition - RStudio

I have a string like this one:
0|294|314|20|314|SC49TST57ASG75A|1428.0
Using R, I want to extract only the data between two | characters (for example, SC49TST57ASG75A), and then count only the numbers that are greater than 20 (in this case the numbers are 49, 57 and 75, so the code needs to return 3).
I want to apply this to a column in a data frame.
Eventually, I want a new column that specifies, for each row, how many numbers greater than 20 there are inside the |....|.
Thanks!
You can try strsplit with split = '\\|'. If you only want to count the fields that lie between two pipes, exclude the first and last elements. Since you want values greater than 20, the solution uses the > operator. Fields that are not purely numeric become NA (with a coercion warning) and are dropped by na.rm = TRUE.
I am assuming here that every row of your column has the same structure as the example in your question.
st <- '0|294|314|20|314|SC5GSC12ASG266T|1428.0'
Solution:
lapply(strsplit(st, '\\|'), function(x)sum(as.numeric(x[2:(length(x)-1)]) > 20, na.rm=TRUE))
I am not sure whether this is what you are looking for; if not, please tell me what your expected result is.
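# r holds every field enclosed by two pipes; the digit runs are then pulled out of each field
cnt <- Map(function(x) sum(as.numeric(x) > 20),
           regmatches(r <- unlist(regmatches(s, gregexpr("(?<=\\|).*?(?=\\|)", s, perl = TRUE))),
                      gregexpr("\\d+\\.?\\d+?", r)))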
such that
> cnt
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 1
[[5]]
[1] 1
DATA
s <- "0|294|314|20|314|SC5GSC12ASG266T|1428.0"
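For the original request (counting only the numbers embedded in the alphanumeric field), a minimal sketch might look like the following. It assumes the field of interest is always the sixth pipe-separated field, as in the example string; count_above and its arguments are hypothetical names, not part of any package.

```r
# Hypothetical helper: count the integers above `threshold` inside one field.
# Assumes the field of interest is always number `field` (6 in the example).
count_above <- function(s, threshold = 20, field = 6) {
  parts <- strsplit(s, "|", fixed = TRUE)
  vapply(parts, function(p) {
    token <- p[field]
    # pull every run of digits out of the token, e.g. "49" "57" "75"
    nums <- as.numeric(regmatches(token, gregexpr("[0-9]+", token))[[1]])
    sum(nums > threshold)
  }, integer(1))
}

rows <- c("0|294|314|20|314|SC49TST57ASG75A|1428.0",
          "0|294|314|20|314|SC5GSC12ASG266T|1428.0")
counts <- count_above(rows)  # 3 for the first row, 1 for the second
```

Because the helper is vectorised over s, it could be assigned straight to a new column, provided the source column is character rather than factor.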

finding matching consecutive rows in r

Is there a one-line solution possible for this example?
df = data.frame('First' = c('T','T','V','V','A','E'),'Last' = c(rep('Ng',3),'Smith','Wolf','Wolf'))
matches = (df$First[-1] == df$First)
which(matches == 'TRUE')
# [1] 1 3
I want the indices, but would rather not use a temporary variable.
Perhaps you could use the rleid function from data.table in combination with diff, like this:
library(data.table)
which(diff(rleid(df$First)) == 0)
[1] 1 3
You could argue that it is the 2nd and 4th elements of df$First that match their previous value (rather than the 1st and 3rd matching the next one). In that case which(c(FALSE, diff(rleid(df$First)) == 0)) might be more appropriate; it yields [1] 2 4.
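If pulling in data.table is not desirable, a base R one-liner along these lines should give the same indices. This is a sketch rather than the answerer's method: it compares each element with its successor using head and tail, so both sides have the same length and nothing is recycled. The names first_of_pair and second_of_pair are illustrative.

```r
df <- data.frame(First = c("T", "T", "V", "V", "A", "E"),
                 Last = c(rep("Ng", 3), "Smith", "Wolf", "Wolf"),
                 stringsAsFactors = FALSE)

# index of the first row in each matching consecutive pair
first_of_pair <- which(head(df$First, -1) == tail(df$First, -1))

# index of the second row in each pair, if that convention is preferred
second_of_pair <- first_of_pair + 1L
```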

return indices of duplicated elements corresponding to the unique elements in R

Does anyone know if there's a built-in function in R that can return the indices of duplicated elements corresponding to the unique elements?
For instance, I have a vector
a <- ["A","B","B","C","C"]
unique(a) will give ["A","B","C"]
duplicated(a) will give [F,F,T,F,T]
Is there a built-in function that returns a vector of indices, of the same length as the original vector a, showing the location of a's elements in the unique vector (which is [1,2,2,3,3] in this example)?
That is, something like the output variable "ic" of the Matlab function "unique" (if we let c = unique(a), then a = c(ic,:)).
http://www.mathworks.com/help/matlab/ref/unique.html
Thank you!
We can use match
match(a, unique(a))
#[1] 1 2 2 3 3
Or convert to factor and coerce to integer
as.integer(factor(a, levels = unique(a)))
#[1] 1 2 2 3 3
data
a <- c("A","B","B","C","C")
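This should work:
cumsum(!duplicated(sort(a))) # once you replace the Matlab syntax with R syntax
Or just:
as.numeric(factor(a))
Both of these index into sort(unique(a)) rather than unique(a). The first one additionally returns the indices in the order of sort(a), so it lines up with a only when a is already sorted, as it is here.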
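The approaches above agree on the example because a happens to be sorted. A small sketch with an unsorted vector (a2 is made up for illustration) shows where they differ: match indexes into unique(a2) in order of first appearance, while factor indexes into the sorted levels.

```r
a2 <- c("B", "A", "B", "C", "A")

u  <- unique(a2)           # order of first appearance: "B" "A" "C"
ic <- match(a2, u)         # positions in u

f         <- factor(a2)    # levels are sorted: "A" "B" "C"
ic_sorted <- as.integer(f) # positions in levels(f)

# either index vector rebuilds the original from its own lookup table
rebuilt_from_unique <- u[ic]
rebuilt_from_levels <- levels(f)[ic_sorted]
```

The match version is the one that mirrors Matlab's "ic" relative to unique(a2) in appearance order; the factor version corresponds to a sorted set of unique values.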

Get indices of a vector between the 1st and 2nd appearance of the number 1

Suppose we have a vector:
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
Expected output:
v_index <- c(5,6,7)
v always starts and ends with 0. There is only one possible cluster of zeros between two 1s.
It seems simple enough, but I can't get my head around it...
I think this will do
which(cumsum(v == 1L) == 1L)[-1L]
## [1] 5 6 7
The idea here is to use cumsum to split the vector into groups that each start at a 1, select the first such group, and drop its first element, which is the 1 itself (because you only want the zeros).
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index<-seq(which(v!=0)[1]+1,which(v!=0)[2]-1,1)
> v_index
[1] 5 6 7
Explanation: I ask which indices are not equal to 0:
which(v!=0)
then I take the first and second index from that vector and create a sequence out of it.
This is probably one of the simplest answers out there. Find which items are equal to one, then produce a sequence using the first two indexes, incrementing the first and decrementing the other.
block <- which(v == 1)
start <- block[1] + 1
end <- block[2] - 1
v_index <- start:end
v_index
[1] 5 6 7
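One edge case that the start:end form does not cover is two adjacent 1s, where start would exceed end and the colon operator would count downwards. The question's data rules this out between the first two 1s, but a defensive sketch could use seq_len instead. The function name below is hypothetical.

```r
# Hypothetical helper: indices strictly between the first two 1s in v.
# Returns integer(0) when there are fewer than two 1s or they are adjacent.
between_first_two_ones <- function(v) {
  ones <- which(v == 1)
  if (length(ones) < 2) return(integer(0))
  n_between <- ones[2] - ones[1] - 1
  ones[1] + seq_len(n_between)
}

v <- c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0)
v_index  <- between_first_two_ones(v)              # 5 6 7
adjacent <- between_first_two_ones(c(0, 1, 1, 0))  # empty
```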

removing sequences of positive values between sequences of "0"

I would like to write a small function that, within a data frame, detects (and sets to 0) sequences of positive values located between sequences of values equal to 0, but only if these sequences of positive values are no more than 5 values long.
Here is a small example showing what my data looks like (initial_data column) and what I would like to obtain at the end (final_data column):
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
This sentence also summarizes the requirement:
"If there's a sequence of positive values, not longer than 5 values, and located between at least two or three 0-values (before and after this sequence of positive values), then set also this sequence to 0"
Any advice for doing this easily?
Thanks a lot!!!
Here's a possible approach using the rle function:
DF<-data.frame(initial_data=c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),
final_data=c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
# using rle create an object with the sequences of consecutive elements
# having the same sign (-1 means negative, 0 means zero, 1 means positive)
enc <- rle(sign(DF$initial_data))
# find the positive sequences having maximum 5 elements
posSequences <- which(enc$values == 1 & enc$lengths <= 5)
# remove index=1 or index=length(enc$values) if present because
# they can't be surrounded by 0
posSequences <- posSequences[posSequences != 1 &
                             posSequences != length(enc$values)]
# check if they're preceded and followed by at least 2 zeros
# (if not remove the index); vapply always returns a logical vector,
# even when posSequences is empty
toForceToZero <- vapply(posSequences, FUN = function(idx) {
  enc$values[idx - 1] == 0 &&
    enc$lengths[idx - 1] >= 2 &&
    enc$values[idx + 1] == 0 &&
    enc$lengths[idx + 1] >= 2
}, FUN.VALUE = logical(1))
posSequences <- posSequences[toForceToZero]
# reverse the run-length encoding, setting NA where we want to force to zero
v <- enc$values
v[posSequences] <- NA
# create the final data vector by forcing NAs to 0
final_data <- DF$initial_data
final_data[is.na(rep.int(v, enc$lengths))] <- 0
# check if is equal to your desired output
all(DF$final_data == final_data)
# > [1] TRUE
My best friend rle to the rescue:
notzero<-rle(as.logical(unlist(DF)))
Run Length Encoding
lengths: int [1:7] 4 3 6 8 20 8 7
values : logi [1:7] FALSE TRUE FALSE TRUE FALSE TRUE ...
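Now find all runs where values is TRUE and lengths is at most 5 (the question allows sequences up to 5 values long), set values to FALSE for those runs, and call inverse.rle to get a logical mask of the positive values to keep. Note that unlist(DF) concatenates both columns, which is why seven runs are shown; in practice, apply rle to DF$initial_data alone.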
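The rle and inverse.rle idea could be completed roughly as follows. This is a simplified sketch, not the answerer's code: zero_short_runs and max_len are hypothetical names, the input is assumed to contain no NA values, and the extra condition about at least two zeros on each side is deliberately left out.

```r
# Hypothetical helper: zero out interior runs of positive values that are
# at most `max_len` long. Assumes x has no NA values.
zero_short_runs <- function(x, max_len = 5) {
  enc <- rle(x > 0)
  idx <- seq_along(enc$values)
  # positive, short, and neither the first nor the last run
  short <- enc$values & enc$lengths <= max_len &
    idx > 1 & idx < length(idx)
  enc$values <- short
  x[inverse.rle(enc)] <- 0   # expand the run flags back to element level
  x
}

DF <- data.frame(
  initial_data = c(0,0,0,0,100,2,85,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0),
  final_data   = c(0,0,0,0,0,0,0,0,0,0,0,0,0,3,455,24,10,7,6,15,42,0,0,0,0,0,0,0))
result <- zero_short_runs(DF$initial_data)
```

On the example data, the run 100, 2, 85 is zeroed while the run of eight positive values is kept.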
