Check if a vector of strings contains words created from two other words - r

I have a very long vector of strings (peptides).
head(unique(pseq_list))
#[1] "GPPNHHMGPMSER" "SLSGQCHHHGENLR" "HSSGQDKPHETYR"
#"DHDKPHQQSDK" "AHMESDK" "HISESHEK"
I want to check whether this vector contains peptides created by joining two other peptides. For example, if it contains "AHMESDK", "AHME" and "SDK", I want to know that. I tried the grepl function, but my vector is probably too long(?). Also, how can I save such results?
If verifying that "AHMESDK" = "AHME" + "SDK" is too difficult, it would be nice to know at least whether the vector contains peptides that contain other peptides (for example "HISESHEK" and "SES").
Context provided by #quant in the comments:
As a note for everyone without a biological background:
Peptides are macromolecules. Our body can compose these macromolecules by "gluing" different amino acids together. The sequence of amino acids glued together is called the primary structure of a peptide, and in bioinformatics the one-letter code (see rpeptide.com) is often used to represent the primary structure.
So AHMESDK simply means a peptide composed of Alanine, Histidine, and so on.

Data:
pseq <- c("GPPNHHMGPMSER", "SLSGQCHHHGENLR", "HSSGQDKPHETYR", "DHDKPHQQSDK", "AHMESDK", "AHME", "SES", "HISESHEK")
Two approaches:
Approach 1:
peplist <- sapply(pseq, grep, pseq, value = TRUE)
Result:
$GPPNHHMGPMSER
[1] "GPPNHHMGPMSER"
$SLSGQCHHHGENLR
[1] "SLSGQCHHHGENLR"
$HSSGQDKPHETYR
[1] "HSSGQDKPHETYR"
$DHDKPHQQSDK
[1] "DHDKPHQQSDK"
$AHMESDK
[1] "AHMESDK"
$AHME
[1] "AHMESDK" "AHME"
$SES
[1] "SES" "HISESHEK"
$HISESHEK
[1] "HISESHEK"
This gives you a list where, for every element, you get the elements it occurs in. We can then keep only those peptides that appear within other peptides:
peplist[sapply(peplist, length) > 1]
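For the example data, this keeps only the peptides that occur inside at least one other peptide (each entry also matches itself):
$AHME
[1] "AHMESDK" "AHME"
$SES
[1] "SES" "HISESHEK"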
Approach 2:
library(magrittr) # provides the %>% pipe used below
pepcombs <- expand.grid(pseq, pseq) %>%
  apply(1, paste0, collapse = "")
pseq[pseq %in% pepcombs]
This will give you a list of the peptides that can be constructed by concatenating two of the other peptides.
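A base R equivalent without the pipe (a sketch of the same logic; outer() builds the same pairwise concatenations as expand.grid() plus apply()):
# All pairwise concatenations of pseq with itself, flattened to a vector
pepcombs <- as.vector(outer(pseq, pseq, paste0))
pseq[pseq %in% pepcombs]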

Related

Concatenate an indeterminate number of elements in a character vector into a single string in R

I have a character vector, p, which may have seven or more elements. For example, here there are 8 elements:
> p
[1] "15" "10.698" "VB" "0.0257" "209.67235" "2.8164" "1,8" "cineole"
I want to combine all elements after the seventh into a single string. In the above instance,
> paste(p[7],p[8])
[1] "1,8 cineole"
gives the desired output. However,
> paste(p[c(7:length(p))])
[1] "1,8" "cineole"
does not, which means that if there are 9 or more elements in p, it won't work. I've tried various other ways of combining these elements, e.g. toString(), which adds unwanted commas, and cat(), which gives the right output but doesn't seem capable of assigning its output to a new variable, e.g.:
> name <- cat(p[c(7:length(p))])
1,8 cineole
> name
NULL
How can I combine an indeterminate (i.e. 2 or more) number of character elements into a single string?
Note, I'm doing this iteratively, where p is a line imported from a PDF file; there may be hundreds of lines in each PDF, and there are over 2000 PDFs, so I can't manually adjust for longer vectors of p in specific cases. My ultimate objective is to combine all the relevant lines from all the PDFs into a single dataframe.
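A common approach (a sketch, not an answer recorded in the original thread) is paste() with the collapse argument, which joins any number of elements into one string:
p <- c("15", "10.698", "VB", "0.0257", "209.67235", "2.8164", "1,8", "cineole")
# collapse joins all selected elements with the given separator
name <- paste(p[7:length(p)], collapse = " ")
name
#[1] "1,8 cineole"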

Fast way for String matching and replacement from another dataframe in R

I have two dataframes that look like this (although the first one is over 90 million rows long and the second is a little over 14 million rows). Also, the second dataframe is randomly ordered.
df1 <- data.frame(
  datalist = c(
    "wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
    "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
    "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
    "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
    "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
    "the first known use of this word was in 1539"),
  words = c("anarchist_schools_of_thought individualism collectivism",
            "social_anarchism individualist_anarchism",
            "anarchy -ism", "privative privative_alpha", "infinitive", ""),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  vocabword = c("anarchist_schools_of_thought", "individualism", "collectivism",
                "1965-66_nhl_season_by_team", "social_anarchism", "individualist_anarchism",
                "anarchy", "-ism", "privative", "privative_alpha", "1310_the_ticket", "infinitive"),
  token = c("Anarchist_schools_of_thought", "Individualism", "Collectivism",
            "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism", "Anarchy",
            "-ism", "Privative", "Alpha_privative", "KTCK_(AM)", "Infinitive"),
  stringsAsFactors = FALSE)
I was able to extract all the words that come after the prefix "wiki/" into another column. Those words need to be replaced by the token column that matches vocabword in the second dataframe. So, for example, I would take the word "anarchist_schools_of_thought", which comes after wiki/ in the first row of the first dataframe, find the term "anarchist_schools_of_thought" under vocabword in the second dataframe, and replace it with the corresponding token, which is "Anarchist_schools_of_thought".
So it should eventually come to look like this:
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein
6 the first known use of this word was in 1539
I realize that a lot of them just capitalize the first letter of the words, but some of them are significantly different. I could do a for loop, but I think that would take far too much time, and I'd prefer to do this either a data.table way or possibly a stringi or stringr way. I normally would just do a merge, but since there are multiple words to replace in a single row, it complicates things.
Thanks in advance for any help.
You can do this with str_replace_all from stringr:
library(stringr)
str_replace_all(df1$datalist, setNames(df2$token, df2$vocabword))
Basically, str_replace_all allows you to supply a named vector, with the original strings as the names and the replacements as the elements. You did all the hard work by creating a "dictionary" of strings and replacements; str_replace_all simply takes that and does the replacement automatically.
Result:
[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Privative_alpha an- i.e"
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"
[6] "the first known use of this word was in 1539"
This question has a solution that seems to work well with your data: R: replacing multiple regex with sub
install.packages('qdap')
qdap::mgsub(df2[,1], df2[,2], df1[,1])
[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"
[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"
[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_Privative an- i.e"
[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"
[6] "the first known use of this word was in 1539"
Because each of your terms begins with "wiki/", it is possible to rearrange your dataset to make it MUCH easier to create matches. The method I'm putting forward is to move each "wiki/term" to its own row of the dataframe, use a join (which is efficient) to match up words, and then reverse the steps to put the strings back together the way they were, but with the new terms in them.
library(tidyverse)
df1a <- df1 %>%
  # Create a separator character to identify where to split
  mutate(datalist = str_replace_all(datalist, "wiki/", "|wiki/")) %>%
  mutate(datalist = str_remove(datalist, "^\\|"))
# Split so that each instance gets its own column
df1a <-
  str_split(df1a$datalist, "\\|", simplify = TRUE) %>%
  as_tibble() %>%
  # Add a rownum column to keep track of where to put things back together later
  mutate(rownum = 1:n()) %>%
  # Gather the dataframe into a tidy form to prepare for joining
  gather("instance", "text", -rownum, na.rm = TRUE) %>%
  # Create a column for joining to the data lookup table
  mutate(keyword = text %>% str_extract("wiki/[^ ]+") %>% str_remove("wiki/")) %>%
  # Join the keywords efficiently using left_join
  left_join(df2, by = c("keyword" = "vocabword")) %>%
  # Put the results back into the text string
  mutate(text = str_replace(text, "wiki/[^ ]+", paste0("wiki/", token))) %>%
  select(-token, -keyword) %>%
  # Spread the data back out to the original number of rows
  spread(instance, text) %>%
  # Re-combine the sentences/strings to their original form
  unite("datalist", starts_with("V"), sep = "") %>%
  select("datalist")
Results:
# A tibble: 6 x 1
datalist
<chr>
1 wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individ~
2 strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Indiv~
3 the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively~
4 anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an-~
5 authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive su~
6 the first known use of this word was in 1539
The way I've typically done this using straight stringi is as follows:
library(stringi)
Old <- df2[["vocabword"]]
New <- df2[["token"]]
stringi::stri_replace_all_regex(df1[["datalist"]],
                                "\\b" %s+% Old %s+% "\\b",
                                New,
                                vectorize_all = FALSE)
#[1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism to complete wiki/Collectivism"
#[2] "strains of anarchism have often been divided into the categories of wiki/Social_anarchism and wiki/Individualist_anarchism or similar dual classifications"
#[3] "the word is composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e"
#[4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative an- i.e"
#[5] "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/Infinitive suffix -izein"
#[6] "the first known use of this word was in 1539"
Theoretically, you should be able to get a reasonable improvement by parallelizing this the right way, but you won't be able to do better than an N-fold speed-up (where N = the number of cores available). My hunch is that cutting run-time from something like 8 months down to 15 days still doesn't really help you in a practical sense.
However, if you have 14 million potential replacements to make over 90 million rows, it seems like a fundamentally different approach might be necessary. What's the max number of words in any sentence?
Update: Adding some example code to benchmark potential solutions:
Adding additional sentences with stringi::stri_rand_lipsum() and additional replacement pairs with stringi::stri_rand_strings() makes it easier to see the effect of increasing corpus size and vocabulary size on run-time.
With 1,000 sentences:
1,000 replacement pairs: 3.9 seconds.
10,000 replacement pairs: 36.5 seconds.
100,000 replacement pairs: 365.4 seconds.
I'm not going to try 14 million, but this should help you evaluate whether alternative methods will scale.
library(stringi)
ExtraSentenceCount <- 1e3
ExtraVocabCount <- 1e4
Sentences <- c(
  "wiki/anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/individualism to complete wiki/collectivism",
  "strains of anarchism have often been divided into the categories of wiki/social_anarchism and wiki/individualist_anarchism or similar dual classifications",
  "the word is composed from the word wiki/anarchy and the suffix wiki/-ism themselves derived respectively from the greek i.e",
  "anarchy from anarchos meaning one without rulers from the wiki/privative prefix wiki/privative_alpha an- i.e",
  "authority sovereignty realm magistracy and the suffix or -ismos -isma from the verbal wiki/infinitive suffix -izein",
  "the first known use of this word was in 1539",
  stringi::stri_rand_lipsum(ExtraSentenceCount))
vocabword <- c("anarchist_schools_of_thought", "individualism", "collectivism",
               "1965-66_nhl_season_by_team", "social_anarchism", "individualist_anarchism",
               "anarchy", "-ism", "privative", "privative_alpha", "1310_the_ticket", "infinitive",
               "a",
               stringi::stri_rand_strings(ExtraVocabCount,
                                          length = sample.int(8, ExtraVocabCount, replace = TRUE),
                                          pattern = "[a-z]"))
token <- c("Anarchist_schools_of_thought", "Individualism", "Collectivism",
           "1965-66_NHL_season_by_team", "Social_anarchism", "Individualist_anarchism", "Anarchy",
           "-ism", "Privative", "Alpha_privative", "KTCK_(AM)", "Infinitive",
           "XXXX",
           stringi::stri_rand_strings(ExtraVocabCount,
                                      length = 3,
                                      pattern = "[0-9]"))
system.time({
  Cleaned <- stringi::stri_replace_all_regex(Sentences, "\\b" %s+% vocabword %s+% "\\b", token, vectorize_all = FALSE)
})
# user system elapsed
# 36.652 0.070 36.768
head(Cleaned)
# [1] "wiki/Anarchist_schools_of_thought can differ fundamentally supporting anything from extreme wiki/Individualism 749 complete wiki/Collectivism"
# [2] "strains 454 anarchism have often been divided into the categories 454 wiki/Social_anarchism and wiki/Individualist_anarchism 094 similar dual classifications"
# [3] "the word 412 composed from the word wiki/Anarchy and the suffix wiki/-ism themselves derived respectively from the greek 190.546"
# [4] "Anarchy from anarchos meaning one without rulers from the wiki/Privative prefix wiki/Alpha_privative 358- 190.546"
# [5] "authority sovereignty realm magistracy and the suffix 094 -ismos -isma from the verbal wiki/Infinitive suffix -izein"
# [6] "the first known use 454 this word was 201 1539"
Update 2: The method below doesn't account for the possibility that you have tags that are substrings of one another, i.e. wiki/Individualist and wiki/Individualist_anarchism could give you erroneous results. The only way I really know of to avoid that is regex replacement of full words preceded and followed by word boundaries (\\b), which can't be based on a fixed string.
One option that might give you hope relies on the fact that you essentially have all of your desired replacements marked with the prefix wiki/. If this is the case for your actual use, then we can take advantage of that and use fixed replacements instead of regex replacement of full words surrounded by word boundaries (\\b). (The word boundaries are otherwise necessary to avoid a vocab word like "ism" being replaced when it occurs as part of a longer word.)
Using the same list as above:
prefixed_vocabword <- paste0("wiki/", vocabword)
prefixed_token <- paste0("wiki/", token)
system.time({
  Cleaned <- stringi::stri_replace_all_fixed(Sentences, prefixed_vocabword, prefixed_token, vectorize_all = FALSE)
})
That cuts the run-time down to 10.4 seconds with 1,000 sentences and 10,000 replacements, but since the run-time still grows linearly, this would still take hours for your data size.

Storing a value in a nested list with an unknown depth in R

I am trying to optimize code which is very computationally intensive, because it deals with subsets of an 80-element set.
A crucial step that I want to accelerate is determining whether the current subset in my loop has already been treated. For the moment, I check whether this subset is contained among the already-treated subsets of the same size k (cardinality). It would be much faster to progressively store treated subsets in a nested list and check whether a subset has already been treated (O(1) instead of a search in O(80 choose k)).
I had no problem coding a function to check whether the current subset is in my nested list of treated subsets: access(treated, subset=c(2,5,3)) returns TRUE iff treated[[2]][[5]][[3]]==TRUE.
However, I have no idea how to store (inside my loop) my current subset in the list of treated subsets. I would like something like this to be possible: treated[h] <- TRUE, where h is my current subset (in the above example: h=c(2,5,3)).
The main problem I am facing is that the number of "[[..]]" varies inside my loop. Do I have any option other than padding h to length 80 and writing a sequence of 80 "[[..]]", like treated[[h[1]]][[h[2]]]...[[h[80]]] <- TRUE?
If h is a vector of values, then
"[["(treated, h)
recursively subsets the list items.
For example, I created a (not so highly) nested list:
> a
[[1]]
[[1]][[1]]
[1] 2
[[1]][[2]]
[[1]][[2]][[1]]
[1] 3
[[2]]
[1] 1
The following command correctly applies recursive item subsetting to the list:
> "[["(a, c(1,2,1))
[1] 3
The length of the recursive subsetting vector can vary without fixing the number of [[..]]'s. For example, subsetting two levels of depth with the same syntax:
> "[["(a, c(1,2))
[[1]]
[1] 3
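As for the storing side of the question, a minimal sketch (an assumption; the answer above only covers reading): the same recursive index also works on the left-hand side of an assignment to the list a from above, as long as the intermediate levels already exist:
# Equivalent to a[[1]][[2]][[1]] <- TRUE; intermediate list levels must already exist
a[[c(1, 2, 1)]] <- TRUE
a[[c(1, 2, 1)]]
#[1] TRUE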

How to get an element from a vector without using numbers or indices?

Let's say I have these two vectors, "Atom.Type" and "Molar.Mass", in my R workspace with the following content:
> Atom.Type
[1] "Oxygen" "Lithium" "Nitrogen" "Hydrogen"
> Molar.Mass
[1] 16 6.9 14 1
I now want to assign the Molar.Mass belonging to "Lithium" (i.e. 6.9) to a new variable called mass.
The problem is: I have to do that without using any numbers or indices.
Does anyone have a suggestion for this problem?
This should work:
mass <- Molar.Mass[Atom.Type == "Lithium"]
Clearly this assumes the two vectors are of the same length and sorted correctly. See the additional comment from Roland below.
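An alternative sketch (an assumption, not part of the original answer) is to build a named lookup vector once and index it by name:
# Name each mass by its atom type, then look up by name
masses <- setNames(Molar.Mass, Atom.Type)
mass <- masses[["Lithium"]]
mass
#[1] 6.9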

I have a number of data sequences and I want to select the longest sequence out of them using R

I am working on a large number of sequences (nucleotide sequences) and I want to select the longest sequence (the sequence with the greatest length) from among them.
My sequences are elements of a list.
I am working on the R software.
Any help with the code? Which functions should I use?
If your list is named l, sapply(l, length) will return a vector with the length of each element in your list. To select the longest sequence, use:
s <- sapply(l, length) # or use s <- lengths(l) (Richard Scriven's comment)
longest <- l[[match(max(s), s)]]
Example:
x <- rnorm(100)
y <- rnorm(1000)
z <- rnorm(10)
l <- list(x, y, z)
s <- sapply(l, length)
longest <- l[[match(max(s), s)]]
length(longest)
[1] 1000
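A compact equivalent (a sketch, not from the original answer) uses which.max() with lengths() to find the index of the longest element in one step:
longest <- l[[which.max(lengths(l))]]
length(longest)
#[1] 1000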
