How to output a string of a specific length using agrep in R

I have a load of DNA sequences.
I want to match a part of the sequence and return the match up to a specific length.
The data frame df has two columns, V1 and V2:
>chr1:61695-62229 aattccaagagtattattgcaccaaaaggcatggacttaaaattcttgatacatgatttcaaaatattttctttaaggtttgaatcagtctatattccctccagcagcgtataaaagtgccaatttctctgatccttagccagtttgggtaataataattgtaaaacttttttttctttttttttgagacagagtctccctctgtcgccaggctgaagtgcagtggcgcaatctcggctcactgcaacctccgcctcccggggtcaagctattctcctgcctcagcctcccaagtagctgggactacaggcatgcaccaccatgcccagctaatttttgttatttttagtagagatggagtttccccatgttggacaggatggtctcgatctcttgacctcgtgatccaccctcctcggcctcccaaagtgctgggataacaggcgtgaacaaccatgcccggcctgtaaaactttttcctaatttaacagaaaaataatagtattatattttatcatatttctttgatttcta
>chr1:101718-102194 taaaaataaatgtattaagtatgaacaacaaaaaagctagtaaaggttgaacaacaactatccttaggaaagtggaaataatgtattaataaatatgaaagcaggctagccacggtgactcacatctgtaatcccagcactttgggaggctgaggcaggcagatcacctgaggtcaggagttccagaccagcctggccaacatggtgaaatcttgtctctcctacaaatacaaaaactagccaggcttggttgtgcactcctgtaattcgagctacttgggaggctgaggcaggagaatctcttgaacctgagaggcagaggttgcagtgagccaagatcatgccactgcactccagctggggcaacagagtgacactccatctcaaaataaataaataagaaagcagaaactaataaactagaaaacagaaacatagaactaatttataaatcaaagcactatgccttgaaaaga
I have used agrep to get the match.
RepeatAlusSequencesdfMatch <- RepeatAlusSequencesdf[agrep("aacctcaaagactggcctca", RepeatAlusSequencesdf[, 2], ignore.case = TRUE, max.distance = 0.3), ]
but I also want to return 146 characters from the end of the match. At the moment it is giving me the entire sequence, which I can't use.

See my comment above. I don't think you can achieve what you are trying to do here with agrep. If the DNA sequence you are trying to fuzzy-match has a predictable number and positions of nucleotide insertions/deletions/substitutions, then just use one (or several) regular expressions with capture groups to extract what you need.
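For example, if the query were matched literally (no fuzziness), base R's regexpr/regmatches could pull out the match plus the 146 characters that follow it; this is purely illustrative, reusing the data frame and pattern from the question:
# Illustrative only: literal (non-fuzzy) match of the pattern plus up to 146 trailing characters
pattern <- "(?i)aacctcaaagactggcctca.{0,146}"
regmatches(RepeatAlusSequencesdf[, 2],
           regexpr(pattern, RepeatAlusSequencesdf[, 2], perl = TRUE))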
If the differences are not predictable and you truly need a fuzzy match, you could use a brute-force algorithm like the following:
1. Break up each of your DNA strings at random positions so as to produce different 146-nucleotide-long sequences.
2. Using a "threshold" max.distance, run agrep on the resulting sequences and select those that match.
3. From that set, select the best matches by running agrep with successively smaller max.distance values until you get the one(s) you are interested in.
The more different 146-nucleotide sequences you select, the more accurate the result will be. If you want to find the best match, use an exhaustive search: walk each DNA sequence from the beginning, break it into 146-nucleotide pieces, run the above algorithm, then leave one nucleotide out from the beginning, select the next partitions, and so on until the end.
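A rough sketch of that exhaustive sliding-window idea, using the pattern from the question and a 146-nucleotide window (the max.distance thresholds are only examples):
# Exhaustive sketch: test every 146-nucleotide window of one sequence with agrep,
# then tighten max.distance to keep only the best-matching windows
best_windows <- function(dna, pattern, window = 146) {
  n <- nchar(dna)
  if (n < window) return(character(0))
  starts <- seq_len(n - window + 1)
  pieces <- substring(dna, starts, starts + window - 1)   # one window per start position
  hits <- pieces[agrep(pattern, pieces, ignore.case = TRUE, max.distance = 0.3)]
  for (d in c(0.2, 0.1, 0.05)) {                          # successively stricter thresholds
    refined <- hits[agrep(pattern, hits, ignore.case = TRUE, max.distance = d)]
    if (length(refined) == 0) break
    hits <- refined
  }
  hits
}
# e.g. applied to every sequence in the question's data frame:
# lapply(RepeatAlusSequencesdf[, 2], best_windows, pattern = "aacctcaaagactggcctca")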
Hope this helps or gives better ideas.

Related

Gathering the correct number of digits for numbers when text mining

I need to search for specific information within a set of documents that follows the same standard layout.
After I used grep to find the keywords in every document, I went on collecting the numbers or characters of interest.
One piece of data I have to collect is the Total Power that appears as following:
TotalPower: 986559. (UoPow)
Since I had already correctly selected this excerpt, I created the following function that takes the characters between positions n and m, where n and m start counting up from right to left.
substrRight <- function(x, n, m) {
  substr(x, nchar(x) - n + 1, nchar(x) - m)
}
It's important to say that from the ":" to the number 986559, there are 2 spaces; and from the "." to the "(", there's one space.
So I wrote:
TotalP = substrRight(myDf[i], 17, 9)   # [1]
where myDf is a character vector with all the relevant observations.
Line [1], after I loop over all my observations, gives me the numbers I want, but I noticed that when the number was 986559, the result was 98655. It simply doesn't "see" 9 as the last number.
The code seems to work fine for the rest of the data. This number (986559) is indeed the highest number in the data and is the only one with order 10^5 of magnitude.
How can I make sure that I will gather all digits in every number?
Thank you for the help.
We can extract the digits before a . by using a regex lookaround:
library(stringr)
str1 <- "TotalPower:  986559. (UoPow)"   # the excerpt from the question (two spaces after the colon)
str_extract(str1, "\\d+(?=\\.)")
#[1] "986559"
The \\d+ matches one or more digits, and the lookahead (?=\\.) requires that they be immediately followed by a literal "." character.
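If stringr is not available, a base-R equivalent using a Perl lookahead should give the same result:
# Base R alternative: perl = TRUE enables the (?=\.) lookahead
regmatches(str1, regexpr("\\d+(?=\\.)", str1, perl = TRUE))
#[1] "986559"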

Counting specific characters in a string, across a data frame, with sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting these entries out into their individual elements to get the following (i.e., for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to parse this into two separate locations.
i.e.
[hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually doing
grep -c
on the same data in the command line shows I have 10 entries containing ','.
So initially I would like to get this working, but I am also a bit stumped for ideas as to how to then extract the two (or more) locations and put them on their own rows, filling in the adjacent data.
Actually what I intended to do was to stick to something I know (on the command line): grep out the rows with ',', duplicate the file, split and awk the selected columns (first and second location in respective files), then cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer.
gregexpr does in fact return a match object of length 1 for each string, even when there is no match: a failed match is reported as -1. If you want to find the rows which have a match vs. the ones which don't, you need to look at the returned value, not the length.
Try foo <- sapply(gregexpr(",", testdat$genome_coordinates), function(m) any(m > 0)) to get the rows with a comma. (Note that coercing the raw positions with as.logical would not work, because -1 is non-zero and is therefore also treated as TRUE.)
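For instance, counting the actual matches per row (rather than the length of the match object) could look like this, reusing the column name from the question:
# Count commas per entry: gregexpr returns -1 when there is no match,
# so count only the positive match positions
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates),
                           function(m) sum(m > 0))
table(testdat$multiple)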

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since e.g. "Barack Obama" is much more probable than "Obama Barack"; however, the problem I am trying to solve is more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to approach your question is by shaping the data into a 2x2 matrix

             obama   not obama
barack         A         B
not barack     C         D

and scoring all occurring bi-grams in the matrix. That way you can, for instance, use a simple chi-squared test.
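For instance, with hypothetical counts filled in for A-D, base R's chisq.test can score the table:
# Hypothetical co-occurrence counts for the 2x2 table above (A, B, C, D)
tab <- matrix(c(150,   50,      # barack & obama,     barack & not obama
                 30, 9770),     # not barack & obama, not barack & not obama
              nrow = 2, byrow = TRUE,
              dimnames = list(c("barack", "not barack"),
                              c("obama", "not obama")))
chisq.test(tab)   # a large statistic / small p-value suggests association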
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector X[i] of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of the vector is 1 if word[i] is either the jth word or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2){
    if (text[j] == word[i] OR text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the lengths (the length is the square root of the number of appearances of the word, well maybe twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a] and word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
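A rough R translation of this adjacency-based measure (the names tokens, word_a, and word_b are assumptions; tokens is taken to be a character vector of the text's words in reading order):
# Sketch of the adjacency correlation described above
adjacency_cor <- function(tokens, word_a, word_b) {
  n <- length(tokens)
  # indicator of length n - 1: entry j is 1 if the word is at position j or j + 1
  ind <- function(w) {
    hits <- tokens == w
    as.numeric(hits[-n] | hits[-1])
  }
  xa <- ind(word_a)
  xb <- ind(word_b)
  # dot product counts adjacencies; divide by the vectors' Euclidean lengths
  sum(xa * xb) / (sqrt(sum(xa)) * sqrt(sum(xb)))
}
# tokens <- c("barack", "obama", "is", "the", "democratic", "candidate")
# adjacency_cor(tokens, "barack", "obama")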
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "In how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
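A toy version of that count in R, using the lower-cased example sentence (combn enumerates the 28 unordered pairs):
# The 8 words of the example sentence give choose(8, 2) = 28 unordered pairs
words <- c("barack", "obama", "is", "the", "democratic", "candidate", "for", "president")
pairs <- combn(words, 2)
sum(apply(pairs, 2, function(p) "barack" %in% p))                    # 7 pairs contain "barack"
sum(apply(pairs, 2, function(p) all(c("barack", "obama") %in% p)))   # only 1 of them also contains "obama"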

Find powerset of all unique combinations of vector of strings

I am trying to find all of the unique groupings of a vector/list of items, length 39. Below is the code I have:
x <- c("Dominion","progress","scarolina","tampa","tva","TminKTYS",
"TmaxKTYS","TminKBNA","TmaxKBNA","TminKMEM","TmaxKMEM",
"TminKCRW","TmaxKCRW","TminKROA","TmaxKROA","TminKCLT",
"TmaxKCLT","TminKCHS","TmaxKCHS","TminKATL","TmaxKATL",
"TminKCMH","TmaxKCMH","TminKJAX","TmaxKJAX","TminKLTH",
"TmaxKLTH","TminKMCO","TmaxKMCO","TminKMIA","TmaxKMIA",
"TminKPTA","TmaxKTPA","TminKPNS","TmaxKPNS","TminKLEX",
"TmaxKLEX","TminKSDF","TmaxKSDF")
# Generate a list with the combinations
zz <- sapply(seq_along(x), function(y) combn(x,y))
# Filter out all the duplicates
sapply(zz, function(z) t(unique(t(z))))
However, the code causes my computer to run out of memory. Is there a better way to do this? I realize I have a large list. thanks.
To calculate all unique subsets, you are simply creating all binary vectors with the same length as the cardinality of the original set of items. If there are 39 items, then you are looking at all binary vectors of length 39. Each element of each vector identifies, yes or no, whether or not the item is in the corresponding subset.
As there are 39 items, and each can either be in or not-in a given subset, then there are 2^39 possible subsets. Excluding the empty set, i.e. the all-0 vector, you have 2^39 - 1 possible subsets.
That is, as #joran said, about 549B vectors. Given that the binary vectors are most compactly representing the data (i.e. without strings), then you will need 549B * 39 bits to return all of the subsets. I don't think you want to store this: that's about 2.68E12 bytes. If you insist on using the characters, you're likely to be in the many tens of terabytes.
It's certainly feasible to buy a system that can support this, but not very cost-effective.
At a meta-level, it is very likely, as #JD said, that this is not the path you really need to go. I recommend posting a new question and maybe it can be refined here or on the statistics-related SE site.
You might try using expand.grid.
Create a data frame from all combinations of the supplied vectors or
factors. See the description of the return value for precise details
of the way this is done.
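For instance, expand.grid can enumerate the binary inclusion vectors directly; shown here for just the first three items from the question's vector, since with all 39 it would still be 2^39 rows:
# Illustrative only: subsets of the first 3 items via TRUE/FALSE inclusion vectors
items <- c("Dominion", "progress", "scarolina")
grid  <- expand.grid(rep(list(c(FALSE, TRUE)), length(items)))
subsets <- apply(grid, 1, function(keep) items[keep])
subsets[-1]   # drop the first entry, the empty set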

Probability of 3-character string appearing in a randomly generated password

If you have a randomly generated password, consisting of only alphanumeric characters, of length 12, and the comparison is case insensitive (i.e. 'A' == 'a'), what is the probability that one specific string of length 3 (e.g. 'ABC') will appear in that password?
I know the number of total possible combinations is (26+10)^12, but beyond that, I'm a little lost. An explanation of the math would also be most helpful.
The string "abc" can appear in the first position, making the string look like this:
abcXXXXXXXXX
...where the X's can be any letter or number. There are (26 + 10)^9 such strings.
It can appear in the second position, making the string look like:
XabcXXXXXXXX
And there are (26 + 10)^9 such strings also.
Since "abc" can appear at anywhere from the first through 10th positions, there are 10*36^9 such strings.
But this overcounts, because it counts (for instance) strings like this twice:
abcXXXabcXXX
So we need to count all of the strings like this and subtract them off of our total.
Since there are 6 X's in this pattern, there are 36^6 strings that match this pattern.
I get 7+6+5+4+3+2+1 = 28 patterns like this. (If the first "abc" is at the beginning, the second can be in any of 7 places. If the first "abc" is in the second place, the second can be in any of 6 places. And so on.)
So subtract off 28*36^6.
...but that subtracts off too much, because it subtracted off strings like this three times instead of just once:
abcXabcXabcX
So we have to add back in the strings like this, twice. I get 4+3+2+1 + 3+2+1 + 2+1 + 1 = 20 of these patterns, meaning we have to add back in 2*20*(36^3).
But that math counted this string four times:
abcabcabcabc
...so we have to subtract off 3.
Final answer:
10*36^9 - 28*36^6 + 2*20*(36^3) - 3
Divide that by 36^12 to get your probability.
See also the Inclusion-Exclusion Principle. And let me know if I made an error in my counting.
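In R, the formula above evaluates as follows (the simulation is just an optional, coarse sanity check):
# Inclusion-exclusion count divided by the total number of 12-character passwords
p_exact <- (10 * 36^9 - 28 * 36^6 + 2 * 20 * 36^3 - 3) / 36^12
p_exact                                   # roughly 0.000214

# Optional Monte Carlo sanity check (coarse, purely illustrative)
set.seed(1)
alphabet <- c(letters, 0:9)
pw <- replicate(2e5, paste(sample(alphabet, 12, replace = TRUE), collapse = ""))
mean(grepl("abc", pw, fixed = TRUE))      # should land near p_exact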
If A is not equal to C, the probability P(n) of ABC occurring in a string of length n (assuming every alphanumeric symbol is equally likely) is
P(n) = P(n-1) + P(3) * [1 - P(n-3)]
where
P(0) = P(1) = P(2) = 0 and P(3) = 1/36^3
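A small sketch of that recurrence in R (36-symbol alphabet assumed, as in the question):
# P(n) = P(n-1) + P(3) * (1 - P(n-3)), with P(0) = P(1) = P(2) = 0 and P(3) = 1/36^3
p_abc <- function(n) {
  if (n < 3) return(0)
  p  <- numeric(n)
  p3 <- 1 / 36^3
  p[3] <- p3
  if (n > 3) for (k in 4:n) p[k] <- p[k - 1] + p3 * (1 - p[k - 3])
  p[n]
}
p_abc(12)   # roughly 0.000214 for a 12-character password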
To expand on Paul R's answer: probability (for equally likely outcomes) is the number of possible outcomes of your event divided by the total number of possible outcomes.
There are 10 possible places where a string of length 3 can be found in a string of length 12. And there are 9 more spots that can be filled with any other alphanumeric characters, which leads to 36^9 possibilities. So the number of possible outcomes of your event is 10 * 36^9.
Divide that by your total number of outcomes, 36^12, and your answer is 10 * 36^(-3) ≈ 0.000214.
EDIT: This is not completely correct. In this solution, some cases are double counted. However, they make only a very small contribution to the probability (about 28/36^6 ≈ 1.3e-8), so this answer agrees with the exact one to roughly seven decimal places. If you want the full answer, see Nemo's answer.
