Splitting sequence of letters, whilst retaining original sequence position - r

I need to split the following sequence of letters into distinct chunks
SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC
I have used the following code provided from a previous user to achieve what I initially wanted, which was to split the sequence after every C.
library(dplyr)
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
Test <- strsplit(TestSequence, "(?<=[C])", perl = TRUE) %>% unlist
df <- data.frame(Fragment = Test) %>%
mutate("position" = cumsum(nchar(Test)))
This allowed me to split the sequence after every C and retain its position in the sequence, for example C at positions 2, 11, etc.
Now I need to split the same sequence at different locations, which I can do using the following to split after P,A,G or S:
Test2 <- strsplit(TestSequence, "(?<=[P,A,G,S])", perl = TRUE) %>% unlist
This is fine if I want to split after a given character, but if I try to split before a character, for example D, I cannot seem to retain the D in the fragment. It is only retained if I split after the D.
I have tried every combination of lookbehind and lookahead I can think of. The following cuts both before and after every D, which isn't that useful.
Test3 <- strsplit(TestSequence, "(?=[D])", perl = TRUE) %>% unlist
Also, is there a way to retain the exact position of every C in the original sequence?
So if I were to split the test sequence after the initial K, I'd have a fragment that was SCDK. Could I have a separate column that tells me where the C was in the original sequence? As a second example, the next fragment would be SFNRGECSCDK, and in that separate column it would say the C was originally in position 11.

Zero-length matches produced by lookahead-only patterns are not handled properly by strsplit.
In this case, you need to "anchor" the matches on the left, too. Either use a non-word boundary, or a lookbehind that disallows the match at the start of string:
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"
strsplit(TestSequence, "\\B(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC" "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"
strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)
# => [[1]]
# => [1] "SC" "DKSFNRGECSC" "DKSFNRGECSC" "DKSFNRGEC"
The \B(?=D) pattern matches a location that is immediately preceded by a word character and immediately followed by D.
The (?<!^)(?=D) pattern matches a location that is not immediately preceded by the start of the string (i.e. not at the start of the string) and is immediately followed by D.
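To make the difference concrete, here is a small sketch that runs the lookahead-only pattern from the question next to the two left-anchored patterns. The variable names are only illustrative.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Lookahead only: after each split the remaining text starts with D, so the
# zero-length match at its very start peels off a single character
naive <- strsplit(TestSequence, "(?=D)", perl = TRUE)[[1]]

# Left-anchored variants: the split point can no longer sit at the start
non_word_boundary <- strsplit(TestSequence, "\\B(?=D)", perl = TRUE)[[1]]
not_at_start <- strsplit(TestSequence, "(?<!^)(?=D)", perl = TRUE)[[1]]

print(naive)
print(non_word_boundary)
print(identical(non_word_boundary, not_at_start))
```

Note that \B relies on the neighbouring characters being word characters, which holds for a sequence made only of letters. For other data, the (?<!^) form is probably the safer choice.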
Also, note that [P,A,G,S] matches P, A, G, S and a comma. You should use [PAGS] to match one of the letters.
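The question also asks how to keep the original position of every C for each fragment. The patterns above do not cover that, so the following is only one possible sketch: it derives the start and end of each fragment from the cumulative fragment lengths and looks up which C positions fall inside that range. The column names start, end and C_positions are made up for illustration.

```r
TestSequence <- "SCDKSFNRGECSCDKSFNRGECSCDKSFNRGEC"

# Split after every K, as in the example from the question
fragments <- strsplit(TestSequence, "(?<=K)", perl = TRUE)[[1]]

# Start and end of each fragment in the original sequence
frag_end <- cumsum(nchar(fragments))
frag_start <- frag_end - nchar(fragments) + 1

# Positions of every C in the original sequence
c_pos <- as.integer(gregexpr("C", TestSequence, fixed = TRUE)[[1]])

df <- data.frame(
  Fragment = fragments,
  start = frag_start,
  end = frag_end,
  # Comma-separated original positions of the C's inside each fragment
  C_positions = sapply(seq_along(fragments), function(i) {
    paste(c_pos[c_pos >= frag_start[i] & c_pos <= frag_end[i]], collapse = ",")
  }),
  stringsAsFactors = FALSE
)

print(df)
```

A fragment can contain more than one C, which is why the positions are collapsed into a single string here. A list column would be an alternative if the positions are needed as numbers.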

Related

R Is there a way to remove special character from beginning of a string

A field in my dataset has some observations that start with a ".", e.g. ".TN34AB1336" instead of "TN34AB1336".
I've tried
truck_log <- truck_log %>%
filter(BookingID != "WDSBKTP49392") %>%
filter(BookingID != "WDSBKTP44502") %>%
mutate(GpsProvider = str_replace(GpsProvider, "NULL", "UnknownGPS")) %>%
mutate(vehicle_no = str_replace(vehicle_no, ".TN34AB1336", "TN34AB1336"))
The last mutate command in my code worked, but there are more such issues in another field, e.g. "###TN34AB1336" instead of "TN34AB1336".
So I need a way to automate the process, so that every observation that doesn't start with an alphanumeric character is trimmed from the left by a single command in R.
I've attached a picture of a filtered field from the spreadsheet to make the question clearer.
We can use a regular expression to replace everything before the first alphanumeric character with "", which removes anything that is not a number or letter from the beginning of a string:
names <- c("###TN34AB1336",
".TN34AB1336",
",TN654835A34",
":+?%TN735345")
# Matches the run of non-alphanumeric characters at the start of the string and replaces it with ""
stringr::str_replace(names, "^[^[:alnum:]]+", "")
[1] "TN34AB1336" "TN34AB1336" "TN654835A34" "TN735345"
You can use sub.
s <- c(".TN34AB1336", "###TN34AB1336")
sub("^[^A-Z]*", "", s)
#[1] "TN34AB1336" "TN34AB1336"
Here ^ is the start of the string, [^A-Z] matches anything that is not A, B, C, ..., Z, and * matches it 0 to n times.
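Note that [^A-Z] also strips leading digits and lowercase letters. If some identifiers may start with a digit (an assumption, since the example data does not show such a case), a POSIX class could be used instead. A minimal sketch with two hypothetical extra values:

```r
# The last two values are hypothetical and not taken from the question
s <- c(".TN34AB1336", "###TN34AB1336", "-12AB77", "TN654835A34")

# Removes everything before the first uppercase letter
by_upper <- sub("^[^A-Z]*", "", s)

# Removes only leading characters that are neither letters nor digits
by_alnum <- sub("^[^[:alnum:]]+", "", s)

print(by_upper)
print(by_alnum)
```

The two results differ only for "-12AB77", where the first pattern also drops the leading digits.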

easy way to extract uppercase in string in R

I am a beginner R programmer.
I have "cCt/cGt" and I want to extract C and G and write them as C>G.
test ="cCt/cGt"
str_extract(test, "[A-Z]+$")
Try this:
gsub(".*([A-Z]).*([A-Z]).*", "\\1>\\2", test )
[1] "C>G"
Here, we capture the two occurrences of the upper case letters in capturing groups given in parentheses (...). This enables us to refer to them (and only to them but not the rest of the string!) in gsub's replacement clause using backreferences \\1 and \\2. In the replacement clause we also include the desired >.
You seem to be looking for a mutation in two concatenated strings. This function should solve your problem:
extract_mutation <- function(text){
splitted <- strsplit(text, split = "/")[[1]]
pos <- regexpr("[[:upper:]]", splitted)
uppercases <- regmatches(splitted, pos)
mutation <- paste0(uppercases, collapse = ">")
return(mutation)
}
If the two base exchanges are always at the same index, you could also return the position if you're interested:
position <- pos[1]
return(list(mutation, position))
instead of the return(mutation)
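You might also capture the two uppercase characters, each surrounded by optional lowercase characters, with a / in between.
library(stringr) # provides str_match
test = "cCt/cGt"
# res is a matrix: column 1 is the full match, columns 2 and 3 are the groups
res = str_match(test, "([A-Z])[a-z]*/[a-z]*([A-Z])")
sprintf("%s>%s", res[2], res[3])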
Output
[1] "C>G"
An exact match for the whole string could be:
^[a-z]([A-Z])[a-z]/[a-z]([A-Z])[a-z]$
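For a whole vector of such strings, a base R sketch could collect every uppercase letter per element and join them with ">". The vector codons is hypothetical, and the sketch assumes the uppercase letters in each element are exactly the ones to report.

```r
# Hypothetical input; only the first element comes from the question
codons <- c("cCt/cGt", "Gaa/Aaa", "ccT/ccA")

# All uppercase letters in each element, as a list of character vectors
ref_alt <- regmatches(codons, gregexpr("[A-Z]", codons))

# Join the letters of each element with ">"
mutations <- vapply(ref_alt, paste, character(1), collapse = ">")

print(mutations)
```

Unlike the exact-match pattern, this does not check the layout of the string, so it is better suited to input that is already known to be well formed.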

RegEx for a conditional pattern in a string

I need to extract substrings from some strings, for example:
My data is a vector: c("Shigella dysenteriae","PREDICTED: Ceratitis")
a = "Shigella dysenteriae"
b = "PREDICTED: Ceratitis"
If the string starts with "PREDICTED:", I want to extract the word that follows (here "Ceratitis"). If the string doesn't start with "PREDICTED:", I want to extract the first word (here "Shigella").
In this example, the result would be:
result_of_a = "Shigella"
result_of_b = "Ceratitis"
This looks like a typical case for a conditional regular expression, but my attempts have all failed.
Since R supports Perl-compatible regular expressions, I tried to use regexpr and regmatches to extract the substrings I want.
The code is :
pattern = "(?<=PREDICTED:)?(?(1)(\\s+\\w+\\b)|(\\w+\\b))"
a = c("Shigella dysenteriae")
m_a = regexpr(pattern,a,perl = TRUE)
result_a = regmatches(a,m_a)
b = c("PREDICTED: Ceratitis")
m_b = regexpr(pattern,b,perl = TRUE)
result_b = regmatches(b,m_b)
Finally, the result is:
# result_a = "Shigella"
# result_b = "PREDICTED"
This is not the result I expect: result_a is right, but result_b is wrong.
Why? It seems that the condition didn't work.
PS:
I tried to read up on conditional regular expressions at https://www.regular-expressions.info/conditional.html and imitated the pattern from that page. I also tried the RegexBuddy software to find the reason.
EDIT:
To use the function below on a vector, one can do:
myvec <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
lapply(myvec,extractor)
[[1]]
[1] "Shigella"
[[2]]
[1] "Ceratitis"
Or:
unlist(lapply(myvec,extractor))
[1] "Shigella" "Ceratitis"
This assumes that the strings are always in the format shown above:
extractor<- function(string){
if(grepl("^PREDICTED",string)){
strsplit(string,": ")[[1]][2]
}
else{
strsplit(string," ")[[1]][1]
}
}
extractor(b)
#[1] "Ceratitis"
extractor(a)
#[1] "Shigella"
I think it does not work because (?(1)...) checks whether capture group 1 has been set, but at that point no group has been set: the optional lookbehind (?<=PREDICTED:)? contains no capturing group.
The first and second capturing groups only appear in the branches that follow. The condition checks group 1, finds it unset, and so takes the "else" branch, which is group 2.
If you made it the only capturing group, (?<=(PREDICTED: )?), and omitted the other two, the condition could be true, but you would get an error because the lookbehind assertion is not fixed length.
Instead of using a conditional pattern, to get both words you might use a capturing group and make PREDICTED: optional:
^(?:PREDICTED: )?(\w+)
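One way to apply this pattern in R is to let it consume the rest of the string and keep only the captured word. This is a sketch, and the trailing .* is an addition that is not part of the pattern above.

```r
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")

# Match the optional prefix, capture the first word, swallow the rest,
# then replace the whole match with the captured word
first_word <- sub("^(?:PREDICTED: )?(\\w+).*", "\\1", inp, perl = TRUE)

print(first_word)
```

Because sub is vectorised, this works on the whole vector without a loop.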
If I understand correctly, the OP wants to extract
the first word after "PREDICTED:" if the string starts with "PREDICTED:"
the first word of the string if the string does not start with "PREDICTED:".
So, if there is no specific requirement to use only one regex, this is what I would do:
Remove any leading "PREDICTED:" (if any)
Extract the first word from the intermediate result.
For working with regex, I prefer to use Hadley Wickham's stringr package:
inp <- c("Shigella dysenteriae", "PREDICTED: Ceratitis")
library(magrittr) # piping used to improve readability
inp %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")
[1] "Shigella" "Ceratitis"
To be on the safe side, I would remove any leading spaces beforehand:
inp %>%
stringr::str_trim() %>%
stringr::str_replace("^PREDICTED:\\s*", "") %>%
stringr::str_extract("^\\w+")

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by dots.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names differ in length, but only the first two words of each name are important.
My goal is to get names of up to 7 characters: the first 3 characters of each of the first two words, with a dot as a separator between them.
Very close to my request are these examples, but I do not know how to apply these code variations to my case.
R How to remove characters from long column names in a data frame and
how to append names to " column names" of the output data frame in R?
What should I do to get output names that look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we match up to 3 word characters (\\w{1,3}), then skip anything that is not a dot [^\\.]*, match a dot \\., and then again match up to 3 word characters (\\w{1,3}). Finally, .* matches anything that comes after. We then keep only the two captured groups (the parts in parentheses) and join them with a dot: \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i){
paste(substr(i[ 1 ], 1, 3), substr(i[ 2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not a regex expert.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from first two words and merges them
cleaner_fun <- function(ugly_string) {
words <- strsplit(ugly_string, "\\.")[[1]]
short_words <- substr(words, 1, 3)
new_name <- paste(short_words[1:2], collapse = ".")
return(new_name)
}
# Testing function
sapply(x, cleaner_fun, USE.NAMES = FALSE) # drop the input strings as names
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

R: find a specific string next to another string with for loop

I have the text of a novel in a single vector, novel.vector.words, which has been split into words. I am looking for all instances of the string "blood of". However, since the vector is split by words, each word is its own string and I don't know how to search for adjacent strings in a vector.
I have a basic understanding of what for loops do, and following some instructions from a textbook, I can use this for loop to target all positions of "blood" and the context around it to create a tab-delimited KWIC (key words in context) display.
node.positions <- grep("blood", novel.vector.words)
output.conc <- "D:/School/U Alberta/Classes/Winter 2019/LING 603/dracula_conc.txt"
cat("LEFT CONTEXT\tNODE\tRIGHT CONTEXT\n", file=output.conc) # tab-delimited header
#This establishes the range of how many words we can see in our KWIC display
context <- 10 # specify a window of ten words before and after the match
for (i in 1:length(node.positions)){ # access each match...
# access the current match
node <- novel.vector.words[node.positions[i]]
# access the left context of the current match
left.context <- novel.vector.words[(node.positions[i]-context):(node.positions[i]-1)]
# access the right context of the current match
right.context <- novel.vector.words[(node.positions[i]+1):(node.positions[i]+context)]
# concatenate and print the results
cat(left.context,"\t", node, "\t", right.context, "\n", file=output.conc, append=TRUE)}
What I am not sure how to do, however, is use something like an if statement to capture only instances of "blood" followed by "of". Do I need another variable in the for loop? Basically, for every instance of "blood" that it finds, I want to check whether the word that immediately follows is "of". I want the loop to find all of those instances and tell me how many there are in my vector.
You can create an index using dplyr::lead to match 'of' following 'blood':
library(dplyr)
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")
which(grepl("blood", novel.vector.words) & grepl("of", lead(novel.vector.words)))
[1] 1 5
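Note that grepl matches substrings, so it would also pick up words such as "bloody" or "often". If only the exact words should count, a base R sketch could compare each word with its successor directly. This assumes the words have already been lowercased and stripped of punctuation.

```r
novel.vector.words <- c("blood", "of", "blood", "red", "blood", "of", "blue", "blood")

n <- length(novel.vector.words)

# Pair every word (except the last) with the word that follows it
is_blood_of <- novel.vector.words[-n] == "blood" & novel.vector.words[-1] == "of"

# Positions of "blood" that are immediately followed by "of"
blood_of_positions <- which(is_blood_of)

# Number of such instances
blood_of_count <- length(blood_of_positions)

print(blood_of_positions)
print(blood_of_count)
```

The resulting positions could be used in place of node.positions in the loop from the question, so that the KWIC display is limited to "blood of".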
In response to the question in the comments:
This certainly could be done with a loop based approach but there is little point in re-inventing the wheel when there are already packages better designed and optimized to do the heavy lifting in text mining tasks.
Here is an example of how to find how frequently the words 'blood' and 'of' appear within five words of each other in Bram Stoker's Dracula using the tidytext package.
library(tidytext)
library(dplyr)
library(stringr)
## Read Dracula into dataframe and add explicit line numbers
fulltext <- data.frame(text=readLines("https://www.gutenberg.org/ebooks/345.txt.utf-8", encoding = "UTF-8"), stringsAsFactors = FALSE) %>%
mutate(line = row_number())
## Pair of words to search for and word distance
word1 <- "blood"
word2 <- "of"
word_distance <- 5
## Create ngrams using skip_ngrams token
blood_of <- fulltext %>%
unnest_tokens(output = ngram, input = text, token = "skip_ngrams", n = 2, k = word_distance - 1) %>%
filter(str_detect(ngram, paste0("\\b", word1, "\\b")) & str_detect(ngram, paste0("\\b", word2, "\\b")))
## Return count
blood_of %>%
nrow
[1] 54
## Inspect first six line number indices
head(blood_of$line)
[1] 999 1279 1309 2192 3844 4135

Resources