How can I compare two strings to find the number of characters that match in R, using substitution distance? - r

In R, I have two character vectors, a and b.
a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
I want a function that counts the character mismatches between each element of a and the corresponding element of b. Using the example above, such a function should return c(2,3,1). There is no need to align the strings.
I need to compare each pair of strings character-by-character and count matches and/or mismatches in each pair. Does any such function exist in R?
Or, to ask the question in another way, is there a function to give me the edit distance between two strings, where the only allowed operation is substitution (ignore insertions or deletions)?

Using some mapply fun:
mapply(function(x,y) sum(x!=y),strsplit(a,""),strsplit(b,""))
#[1] 2 3 1

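Another option is adist, which computes the approximate string distance between character vectors. Note that adist computes the generalized Levenshtein distance, so it also allows insertions and deletions; for this example it agrees with the mismatch count:
mapply(adist,a,b)
#   abcdefg  hijklmnop qrstuvwxyz
#         2          3          1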
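To see where the two approaches can diverge, here is a minimal sketch. The helper name hamming is hypothetical (it is not part of base R), and it assumes both vectors have the same length and that paired strings have the same number of characters.

```r
# Hypothetical helper (not from the original answers): position-wise
# mismatch count for pairs of equal-length strings.
hamming <- function(x, y) {
  stopifnot(length(x) == length(y), all(nchar(x) == nchar(y)))
  mapply(function(p, q) sum(p != q), strsplit(x, ""), strsplit(y, ""))
}

a <- c("abcdefg", "hijklmnop", "qrstuvwxyz")
b <- c("abXdeXg", "hiXklXnoX", "Xrstuvwxyz")
hamming(a, b)                      # 2 3 1

# A rotated string: every position differs, but one deletion plus one
# insertion is enough for adist.
hamming("abc", "bca")              # 3
as.vector(adist("abc", "bca"))     # 2
```

If the strings can be shifted relative to each other like this, the strsplit approach is the one that matches the "substitution only" requirement.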

Related

R: pairwise matrix of the number of characters that differ among strings

I have a vector containing a large number of strings that are all of the same length. For example:
vec = c("keep", "teem", "meat", "weep")
I would like to compare every possible pair of strings from within this vector and count the number of characters that differ between them. Using the vector above, "keep" would be compared to every other string in the vector, "teem" would be compared to every other string, and so on.
I'm only interested in counting the number of characters from the same position within each string that are different. So for example "keep" vs. "teem" would have 2 differences, "keep" vs. "meat" 3 differences, etc. I'd like to output the results as a pairwise matrix, where the strings in the vector make up the row names and column names.
I've learned from another post (How can I compare two strings to find the number of characters that match in R, using substitution distance?) that I can use the adist argument in mapply to calculate the number of differences between two strings:
mapply(adist,string1,string2)
But I'm not sure how to modify this to operate over every possible pairwise combination in my vector, and to place the results in a pairwise matrix. Any ideas for how to do that? Thanks!!
Do you mean using adist like below?
`dimnames<-`(adist(vec),rep(list(vec),2))
#      keep teem meat weep
# keep    0    2    3    1
# teem    2    0    3    2
# meat    3    3    0    3
# weep    1    2    3    0
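Another option is stringdistmatrix from the stringdist package. Its default method is "osa", which also allows insertions, deletions and transpositions; method = "hamming" counts position-wise differences only and is defined for strings of equal length:
library(stringdist)
out <- as.matrix(stringdistmatrix(vec, method = "hamming"))
dimnames(out) <- list(vec, vec)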
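For a strictly position-wise count without extra packages, a base R sketch could look like the following. The helper pair_diff is a hypothetical name, and the code assumes all strings in vec have the same number of characters.

```r
vec <- c("keep", "teem", "meat", "weep")
chars <- strsplit(vec, "")           # one character vector per string

# Hypothetical helper: number of differing positions between strings i and j.
pair_diff <- function(i, j) sum(chars[[i]] != chars[[j]])

# outer() passes whole index vectors, so the helper is wrapped in Vectorize().
m <- outer(seq_along(vec), seq_along(vec), Vectorize(pair_diff))
dimnames(m) <- list(vec, vec)
m["keep", "teem"]                    # 2
```

This evaluates every pair twice (the matrix is symmetric), which is fine for small vectors; for a large number of strings a dedicated package is likely to be faster.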

Remove single value from vector leaving other occurrences of the same value

Suppose I have a large vector of integers in which a single integer can occur in the vector multiple times. I do not know the order of the values within the vector. Consider the code below: I have vector and I want to remove a single 1 to get newVector. Since the order within the vector is not known outside this example, I cannot simply use vector[-1].
vector<-c(1,1,2,2,3)
newVector<-c(1,2,2,3)
Some background: I iteratively pick two values from the vector (using sample) and then want to remove the values I picked from the vector.
Of course I could loop through the vector until I find the first occurrence of the value I wish to remove and remove it using the index, however, that is very time consuming. All the other results I found end up removing all occurrences of the value, which I don't want.
I think this would work: which.max returns the index of the first TRUE, and we can then remove that element using negative subsetting.
vector[-which.max(vector == 1)]
#[1] 1 2 2 3
Also, match does the same:
vector[-match(1, vector)]
#[1] 1 2 2 3
You could use match, which returns the index of the first occurrence of the specified value:
vector<-c(1,1,2,2,3)
vector[-match(1, vector)]
# [1] 1 2 2 3
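Both one-liners assume the value is actually present. If it might be absent, which.max on an all-FALSE vector still returns 1 and match returns NA, so a small guard may be useful. A minimal sketch, where remove_first is a hypothetical helper name:

```r
# Hypothetical helper: drop only the first occurrence of `value`,
# and return `x` unchanged when `value` does not occur.
remove_first <- function(x, value) {
  i <- match(value, x)
  if (is.na(i)) x else x[-i]
}

vector <- c(1, 1, 2, 2, 3)
remove_first(vector, 1)        # 1 2 2 3
remove_first(vector, 9)        # unchanged: 1 1 2 2 3

which.max(c(2, 2, 3) == 1)     # 1, although 1 is not in the vector
```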

I have a number of data sequences and I want to select the longest sequence out of them using R

I am working on a large number of sequences (nucleotide sequences) and I want to select the longest sequence (the sequence with the biggest length) out of them.
My sequences are elements of a list.
I am working in R.
Can anyone help with the code? Which functions should I use?
If your list is named l, sapply(l,length) will return a vector with the length of each element in your list. To select the longest sequence use
s<-sapply(l,length) # or use s<-lengths(l) (Richard Scriven's comment)
longest<-l[[match(max(s),s)]]
Example :
x<-rnorm(100)
y<-rnorm(1000)
z<-rnorm(10)
l<-list(x,y,z)
s<-sapply(l,length)
longest<-l[[match(max(s),s)]]
length(longest)
#[1] 1000
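The example above measures each element with length, which works when every sequence is stored as a vector. If instead each nucleotide sequence is a single character string (an assumption, since the question does not say), length would be 1 for every element and nchar is the relevant measure. A minimal sketch with made-up sequences:

```r
# Made-up example data: each sequence is one character string.
seqs <- list("ACGT", "ACGTACGTAC", "AC")

lens <- nchar(unlist(seqs))              # number of characters per sequence
longest <- seqs[[which.max(lens)]]       # first longest sequence
all_longest <- seqs[lens == max(lens)]   # every sequence tied for longest
longest                                  # "ACGTACGTAC"
```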

Matlab or R: replace elements in matrix by values from another matrix in order

I have a problem to solve in either Matlab or R (preferably in R).
Imagine I have a vector A with 10 elements.
I have also a vector B with 30 elements, of which 10 have value 'x'.
Now, I want to replace all the 'x' in B by the corresponding values taken from A, in the order that is established in A. Once a value in A is taken, the next one is ready to be used when the next 'x' in B is found.
Note that the sizes of A and B are different, it's the number of 'x' cells that coincides with the size of A.
I have tried different ways to do it. Any suggestion on how to program this?
As long as the number of x entries in B matches the length of A, this will do what you want:
B[B=='x'] <- A
(It should be clear that this is the R solution.)
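A small worked example of the R assignment, using made-up data. The stopifnot check is an added safeguard, not part of the original answer. Note that because B contains 'x' it is a character vector, so the values taken from A are stored as strings.

```r
A <- 1:3                                  # made-up replacement values
B <- c("a", "x", "b", "x", "x")           # made-up vector with three 'x' entries

# Guard: the number of 'x' entries must equal the length of A.
stopifnot(sum(B == "x") == length(A))

B[B == "x"] <- A                          # fills the 'x' positions in order
B                                         # "a" "1" "b" "2" "3"
```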
MATLAB Solution
In MATLAB it's quite simple, use logical indexing:
B(B == 'x') = A;

Adding numbers within a vector in r

I have a vector
v<-c(1,2,3)
I need to add the numbers in the vector in the following fashion
1,1+2,1+2+3
producing a second vector
v1<-c(1,3,6)
This is probably quite simple...but I am a bit stuck.
Use the cumulative sum function:
cumsum(v)
#[1] 1 3 6
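For reference, the same running total can be written with Reduce, which makes the accumulation step explicit and also works with operations other than addition. This is only an illustrative equivalent; cumsum is the simpler choice for sums.

```r
v <- c(1, 2, 3)

running <- Reduce(`+`, v, accumulate = TRUE)   # 1, 1+2, 1+2+3
running                                        # 1 3 6

# Same idea with a different operation: running product.
Reduce(`*`, v, accumulate = TRUE)              # 1 2 6
```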
