R: pairwise matrix of the number of characters that differ among strings

I have a vector containing a large number of strings that are all of the same length. For example:
vec = c("keep", "teem", "meat", "weep")
I would like to compare every possible pair of strings from within this vector and count the number of characters that differ between them. Using the vector above, "keep" would be compared to every other string in the vector, "teem" would be compared to every other string, and so on.
I'm only interested in counting the number of characters from the same position within each string that are different. So for example "keep" vs. "teem" would have 2 differences, "keep" vs. "meat" 3 differences, etc. I'd like to output the results as a pairwise matrix, where the strings in the vector make up the row names and column names.
I've learned from another post (How can I compare two strings to find the number of characters that match in R, using substitution distance?) that I can pass adist to mapply to calculate the number of differences between two strings:
mapply(adist,string1,string2)
But I'm not sure how to modify this to operate over every possible pairwise combination in my vector, and to place the results in a pairwise matrix. Any ideas for how to do that? Thanks!!

Do you mean using adist like below?
> `dimnames<-`(adist(vec),rep(list(vec),2))
     keep teem meat weep
keep    0    2    3    1
teem    2    0    3    2
meat    3    3    0    3
weep    1    2    3    0

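An option with stringdistmatrix from the stringdist package. Passing method = "hamming" restricts the count to position-wise substitutions, which is what the question asks for (the default method, "osa", also allows insertions, deletions and transpositions):
library(stringdist)
out <- as.matrix(stringdistmatrix(vec, method = "hamming"))
dimnames(out) <- list(vec, vec)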
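Because adist computes a generalized Levenshtein distance (insertions and deletions count as edits too), its result can be lower than the number of position-wise differences; for instance, "abcd" and "bcda" differ in all 4 positions but are only 2 edits apart. Below is a minimal base R sketch of the strict position-wise count, assuming all strings have the same length as stated in the question. The names chars and hamming are only illustrative.

```r
vec <- c("keep", "teem", "meat", "weep")

# One row per string, one column per character position
# (assumes equal-length strings, as in the question).
chars <- do.call(rbind, strsplit(vec, ""))

# For each position, build an n x n logical matrix of mismatches,
# then add the matrices up to get the mismatch count per pair.
hamming <- Reduce(`+`, lapply(seq_len(ncol(chars)), function(k) {
  outer(chars[, k], chars[, k], "!=")
}))
dimnames(hamming) <- list(vec, vec)
hamming
```

This works one character position at a time rather than one pair at a time, which should help for a large vector, but the full n x n matrix still has to fit in memory.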

Related

number some patterns in the string using R

I have a string that contains some patterns, like this:
my_string = "`d#k`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`0.55`n$l`0.4`0.1`0.25`0.28`0.18`0.3`0.17`0.2`0.03`!lk`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`0.04`vnabgjd`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`0.02`pogk(`1.01`0.71`0.86`0.89`0.79`0.91`0.78`0.81`0.64`r!#^##niw`0.0014`0.0020`9.9999`9.9999`0.0020`0.0022`0.0032`9.9999`0.0000`"
As you can see, the pattern is a [`nonnumber] token followed by repeated [`number.num~] tokens.
So I want to count how many [`number.num~] tokens lie between consecutive [`nonnumber] tokens.
I tried to use a regex:
index <- gregexpr("`(\\w{2,20})`\\d\\.\\d(.*?)`\\D",cle)
regmatches(cle,index)
But with this code the [`\D] part overlaps with the start of the next match, so it can't count how many times the pattern occurs.
If you know a method for this, please leave a reply.
Using strsplit: we split at the backtick, coerce the pieces to numeric, and take the differences between the positions of the pieces that come out as NA (the non-numeric labels). Note that we need to drop the first element after strsplit (the empty string before the leading backtick) and append an NA to the numerics so that the last group is closed off. The result is a vector named with the non-numeric elements via setNames (not very good names, actually, but it demonstrates what's going on).
s <- el(strsplit(my_string, "\\`"))[-1]
s.num <- suppressWarnings(as.numeric(s))
setNames(diff(which(is.na(c(s.num, NA)))) - 1,
         s[is.na(s.num)])
#       d#k       n$l       !lk   vnabgjd     pogk( r!#^##niw
#         9         9         9         9         9         9
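The same idea can be sketched with a grouping index instead of position differences. This is only an illustration under stated assumptions: the helper name count_numeric_tokens is hypothetical, the string is assumed to start with a backtick followed by a label, and any piece that as.numeric can parse is treated as a number.

```r
# Hypothetical helper: count the numeric tokens that follow each non-numeric label.
count_numeric_tokens <- function(x) {
  # Split on literal backticks and drop the empty piece before the first one
  s <- strsplit(x, "`", fixed = TRUE)[[1]][-1]
  is_num <- !is.na(suppressWarnings(as.numeric(s)))
  # Each label opens a new group; numbers fall into the group of the preceding label
  grp <- cumsum(!is_num)
  counts <- as.vector(tapply(is_num, grp, sum))
  setNames(counts, s[!is_num])
}

res <- count_numeric_tokens("`ab`0.1`0.2`c$d`3.5`")
res
```

Unlike the position-difference version, this also reports a count of 0 for a label that is immediately followed by another label.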

Remove single value from vector leaving other occurrences of the same value

Suppose I have a large vector of integers in which a single integer can occur in the vector multiple times. I do not know the order of the values within the vector. Consider the code below: I have vector and I want to remove a single 1 to get newVector. Since the order within the vector is not known outside this example, I cannot simply use vector[-1].
vector<-c(1,1,2,2,3)
newVector<-c(1,2,2,3)
Some background: I iteratively pick two values from the vector (using sample) and then want to remove the values I picked from the vector.
Of course I could loop through the vector until I find the first occurrence of the value I wish to remove and remove it using the index, however, that is very time consuming. All the other results I found end up removing all occurrences of the value, which I don't want.
I think this would work: which.max returns the index of the first match, and we can then remove that element using negative subsetting.
vector[-which.max(vector == 1)]
#[1] 1 2 2 3
Also, match does the same
vector[-match(1, vector)]
#[1] 1 2 2 3
You could use match, which finds the first occurrence of the specified value and returns its index:
vector<-c(1,1,2,2,3)
vector[-match(1, vector)]
# [1] 1 2 2 3
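One caveat with both approaches: if the value is not in the vector, which.max on an all-FALSE comparison returns 1 (so the first element is silently dropped), and match returns NA, so neither leaves the vector unchanged. Below is a minimal sketch of a guarded version, plus a way to handle the sampling use case from the question by drawing positions instead of values. The helper name remove_first is hypothetical.

```r
# Hypothetical helper: drop only the first occurrence of `value` from `x`.
remove_first <- function(x, value) {
  i <- match(value, x)          # index of the first occurrence, NA if absent
  if (is.na(i)) x else x[-i]    # leave x untouched when the value is absent
}

v <- c(1, 1, 2, 2, 3)
remove_first(v, 1)
remove_first(v, 9)

# For the sampling use case: draw two positions rather than two values,
# so the drawn elements can be removed directly by index.
idx <- sample.int(length(v), 2)
picked <- v[idx]
rest <- v[-idx]
```

Sampling positions avoids the lookup altogether, since the index of each picked element is already known.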

How can I compare two strings to find the number of characters that match in R, using substitution distance?

In R, I have two character vectors, a and b.
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
I want a function that counts the character mismatches between each element of a and the corresponding element of b. Using the example above, such a function should return c(2,3,1). There is no need to align the strings.
I need to compare each pair of strings character-by-character and count matches and/or mismatches in each pair. Does any such function exist in R?
Or, to ask the question in another way, is there a function to give me the edit distance between two strings, where the only allowed operation is substitution (ignore insertions or deletions)?
Using some mapply fun:
mapply(function(x,y) sum(x!=y),strsplit(a,""),strsplit(b,""))
#[1] 2 3 1
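Another option is adist, which computes the approximate string distance between character vectors. Note that this is a generalized Levenshtein distance, so insertions and deletions are allowed as well as substitutions, and the result can be smaller than the plain mismatch count:
mapply(adist,a,b)
   abcdefg  hijklmnop qrstuvwxyz
         2          3          1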
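The strsplit approach silently recycles characters (with a warning) if the two strings in a pair differ in length. A small sketch that makes the equal-length assumption explicit; the function name hamming_pairs is hypothetical:

```r
# Hypothetical wrapper: substitution-only distance between corresponding
# elements of a and b, refusing input where the lengths do not line up.
hamming_pairs <- function(a, b) {
  stopifnot(length(a) == length(b), all(nchar(a) == nchar(b)))
  mapply(function(x, y) sum(x != y), strsplit(a, ""), strsplit(b, ""))
}

a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
mismatches <- hamming_pairs(a, b)
mismatches
```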

Adding numbers within a vector in r

I have a vector
v<-c(1,2,3)
I need to add the numbers in the vector in the following fashion
1,1+2,1+2+3
producing a second vector
v1<-c(1,3,6)
This is probably quite simple...but I am a bit stuck.
Use the cumulative sum function:
cumsum(v)
#[1] 1 3 6
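For reference, cumsum amounts to accumulating with +. The sketch below spells that out with base R's Reduce, purely to illustrate what cumsum computes; cumsum itself remains the idiomatic choice.

```r
v <- c(1, 2, 3)

# Running total built explicitly: each element is the previous total plus the next value
running <- Reduce(`+`, v, accumulate = TRUE)

running
cumsum(v)
```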

Different behaviour of intersect on vectors and factors

I am trying to compare multiple vectors of Entrez IDs (integer vectors) using Reduce(intersect,...). The vectors are selected from a database using "DISTINCT", so a single vector does not contain duplicates.
length(factor(c(l1$entrez)))
gives the same length (and the same IDs w/o the length function) as
length(c(l1$entrez))
When I compare multiple vectors with
length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
or
length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
the result is not the same. I know that factor!=originalVector but I cannot understand why the result differs although the length and the levels of the initial factors/vectors are the same.
Could somebody please explain the different behaviour of the intersect function on vectors and factors? Is it that the intersect of two factors is again a factor, so that duplicates are treated differently?
Edit - Example:
> head(l1)
  entrez
1      1
2 503538
3  29974
4  87769
5      2
6 144568
> head(l2)
  entrez
1   1743
2   1188
3   8915
4   7412
5  51082
6   5538
The lists contain around 500 to 20K Entrez IDs. The vectors therefore contain only integers, and intersect should give the IDs common to all tested vectors.
> length(Reduce(intersect,list(c(factor(l1$entrez)),c(factor(l2$entrez)),c(factor(l3$entrez)),c(factor(l4$entrez)))))
[1] 514
> length(Reduce(intersect,list(c(l1$entrez),c(l2$entrez),c(l3$entrez),c(l4$entrez))))
[1] 338
> length(Reduce(intersect,list(l1$entrez,l2$entrez,l3$entrez,l4$entrez)))
[1] 494
I have to apologize profusely. The different behaviour of the intersect function may be caused by a problem with the data. I have found fields in the dataset containing comma-separated Entrez IDs (22038, 23207, ...). I should have had a more detailed look at the data first. Thank you for the answers and your time. Although I do not understand the different results yet, I am sure that this is the cause of the different behaviour. Can somebody confirm that?
As Roman says, an example would be very helpful.
Nevertheless, one possibility is that your variables l1$entrez, l2$entrez etc have the same levels but in different orders.
intersect converts its arguments via as.vector, which turns factors into character variables. This is usually the right thing to do, as it means that varying level order doesn't make any difference to the result.
Passing factor(l1$entrez) as an argument to intersect also removes the impact of varying level order, as it effectively creates a new factor with level ordering set to the default. However, if you pass c(l1$entrez), you strip the factor attributes off your variable and what you're left with is the raw integer codes which will depend on level ordering.
Example:
a <- factor(letters[1:3], levels=letters)
b <- factor(letters[1:3], levels=rev(letters))
# returns 1 2 3
intersect(c(factor(a)), c(factor(b)))
# returns integer(0)
intersect(c(a), c(b))
I don't see any reason why you should use c() in here. Just let R handle factors by itself (although to be fair, there are other scenarios where you do want to step in).
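Note that from R 4.1.0 onwards, c() applied to factors returns a factor rather than the bare integer codes, so the c() calls above may not reproduce the outputs shown on a current R installation. Here is a version-independent sketch of the same point, using as.integer() to expose the codes explicitly (the variable names are illustrative):

```r
a <- factor(letters[1:3], levels = letters)
b <- factor(letters[1:3], levels = rev(letters))

# Same labels, different underlying integer codes
codes_a <- as.integer(a)   # positions of a, b, c in letters
codes_b <- as.integer(b)   # positions of a, b, c in rev(letters)

# Comparing raw codes finds nothing in common ...
by_code <- intersect(codes_a, codes_b)

# ... while comparing labels finds all three values
by_label <- intersect(as.character(a), as.character(b))

by_code
by_label
```

For IDs that ended up stored as factors, converting with as.integer(as.character(x)) recovers the original numbers rather than the level codes.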
