Parsing a string and splitting it in R

I have a regex problem with handling strings in R.
I have a data structure produced by the RNAfold software that looks like this:
"....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
This is a typical secondary structure for miRNAs, but I also have other sequences that are not miRNAs and that look somewhat like this:
...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
This second sequence has two hairpin loops, one at the beginning and another one in the middle, whereas the first sequence just has one hairpin loop in the middle.
Dots (".") represent nucleotides that are not paired, while "(" represents nucleotides that are paired with their counterparts, represented as ")".
I want to split this string so that I can get the stems in the structure.
The output I would like to obtain is:
Input:
[1] "....(((..((((((((.(((((((((((.........))))))))))).))))))))..))).."
Output:
[1] "....(((..((((((((.(((((((((((........."
[2] "))))))))))).))))))))..))).."
That way I can count the number of split strings and, from that, the number of stems.
The result for the second sequence would be:
Input:
[1] ...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......
Output:
[1] "...((((....."
[2] "))))...........(((((((...((..(((..((((...((((((....."
[3] ")))).))...)))).))).))...)))))))......."
So, in essence, I want to parse the strings so that they are split where a ")" symbol is found, while keeping all the symbols of the string.
I have tried using strsplit() and some regex variations, but I haven't been able to find the trick...
Any help?
Thanks

You could use a lookahead to find an opening parenthesis that is immediately followed by a run of dots ending in a closing parenthesis, and split there.
x <- c("....(((..((((((((.(((((((((((..))))))))))).))))))))..)))..",
"...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
strsplit(x, "\\((?=(\\.+\\)))", perl = TRUE)
# [[1]]
# [1] "....(((..((((((((.((((((((((" "..))))))))))).))))))))..))).."
#
# [[2]]
# [1] "...(((" ".....))))...........(((((((...((..(((..((((...((((("
# [3] ".....)))).))...)))).))).))...)))))))......."

If you are looking to count characters, it might be more convenient to do this:
x <- "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...)))))))......."
with(rle(strsplit(x, "")[[1]]), setNames(lengths, values))
## . ( . ) . ( . ( . ( . ( . ( . ) . ) . ) . ) . ) . ) .
## 3 4 5 4 11 7 3 2 2 3 2 4 3 6 5 4 1 2 3 4 1 3 1 2 3 7 7

You can get close to the output you specified using DavidArenburg's logic, but with a twist. David uses a lookahead to find the ( that precedes the pattern .{N}), where N can be any number. A variable-length lookbehind (one whose pattern contains an unspecified number of characters) would be ideal, but it is not allowed. The trick is to reverse the string and use a variable-length lookahead, which then behaves much like a variable-length lookbehind would. Note that, as in David's answer, the parenthesis matched by the split pattern is consumed, so each cut drops one character.
Data
S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..", "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")
Functions
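reverse_string <- function(S) {
  paste(rev(unlist(strsplit(S, ""))), collapse = "")
}
myfun <- function(S) {
  # reverse the input so the lookahead acts like a variable-length lookbehind
  # (named 'reversed' rather than 'T' to avoid masking the TRUE shorthand)
  reversed <- reverse_string(S)
  result <- unlist(strsplit(reversed, "\\)(?=(\\.+\\())", perl = TRUE))
  # undo the reversal of each piece and of the piece order
  rev(unname(sapply(result, reverse_string)))
}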
Result
lapply(S, myfun)
# [[1]]
# [1] "....(((..((((((((.(((((((((((........."
# [2] ")))))))))).))))))))..))).."
# [[2]]
# [1] "...((((....."
# [2] ")))...........(((((((...((..(((..((((...((((((....."
# [3] "))).))...)))).))).))...)))))))......."
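Both regex splits above consume one parenthesis at every cut, so the pieces no longer add up to the original string. A possible way around this, sketched here under the assumption that every hairpin loop appears as "(" followed by one or more dots and then ")", is to locate the loops with gregexpr() and cut with substring() instead of strsplit(). The helper name split_at_loops is made up for this illustration.

```r
# Hypothetical helper: cut a dot-bracket string just before the ")" that closes
# each hairpin loop, without dropping any character.
split_at_loops <- function(s) {
  m <- gregexpr("\\(\\.+\\)", s)[[1]]   # each "(....)" match is one hairpin loop
  if (m[1] == -1) return(s)             # no loop found: nothing to split
  # position of the ")" that ends each match
  close_pos <- as.integer(m) + attr(m, "match.length") - 1L
  # pieces run from the start (or a closing ")") up to just before the next cut
  substring(s, c(1L, close_pos), c(close_pos - 1L, nchar(s)))
}

S <- c("....(((..((((((((.(((((((((((.........))))))))))).))))))))..)))..",
       "...((((.....))))...........(((((((...((..(((..((((...((((((.....)))).))...)))).))).))...))))))).......")

pieces <- lapply(S, split_at_loops)
n_loops <- lengths(pieces) - 1L   # one cut per hairpin loop
```

Because nothing is consumed, pasting the pieces of each element back together reproduces the input, and n_loops gives one hairpin loop for the first sequence and two for the second.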

Related

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
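Another base R option, offered only as a sketch, is to describe the whole string with two capture groups and pull the groups out with regexec() and regmatches(). It assumes every element starts with digits followed by whitespace; elements that do not match come back as character(0).

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# group 1: the leading digits, group 2: everything after the separating whitespace
m <- regmatches(sample.text, regexec("^([0-9]+)\\s+(.*)$", sample.text))

# each element of m holds the full match first, then the two groups;
# drop the full match and keep the groups
parts <- lapply(m, function(p) p[-1])
```

Here parts is a list with one character vector of length two per input string, in the same shape as the desired output in the question.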

plot function type

For digits I have done so:
digits <- c("0","1","2","3","4","5","6","7","8","9")
You can use the [:punct:] class to detect punctuation. This detects
[!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
Either with gregexpr
x = c("we are friends!, Good Friends!!")
R> gregexpr("[[:punct:]]", x)
[[1]]
[1] 15 16 30 31
attr(,"match.length")
[1] 1 1 1 1
attr(,"useBytes")
[1] TRUE
or via stringi
# Gives 4
stringi::stri_count_regex(x, "[:punct:]")
Notice the , is counted as punctuation.
The question seems to be about getting individual counts of particular punctuation marks. @Joba provides a neat answer in the comments:
## Create a vector of punctuation marks you are interested in
punct = strsplit('[]?!"\'#$%&(){}+*/:;,._`|~[<=>@^-]\\', '')[[1]]
Then count how often they appear
counts = stringi::stri_count_fixed(x, punct)
Name each count after its punctuation mark
setNames(counts, punct)
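If stringi is not available, a rough base R equivalent is to extract every punctuation character with regmatches() and tabulate the result. This is only a sketch: unlike the stringi version above, it reports just the marks that actually occur rather than a fixed list of marks.

```r
x <- "we are friends!, Good Friends!!"

# pull out every punctuation character, in order of appearance
marks <- regmatches(x, gregexpr("[[:punct:]]", x))[[1]]

# count how often each distinct mark occurs
counts <- table(marks)
```

For this input, counts holds one entry for "!" and one for ",".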
You can use regular expressions.
stringi::stri_count_regex("amdfa, ad,a, ad,. ", "[:punct:]")
https://en.wikipedia.org/wiki/Regular_expression
might help too.

How to recognise and extract alpha numeric characters

I want to extract alphanumeric strings from a particular sentence in R.
I have tried the following:
aa=grep("[:alnum:]","abc")
This should return integer(0), but it returns 1, which should not be the case, as "abc" is not alphanumeric.
What am I missing here?
Essentially, I am looking for a function that only matches strings that combine both letters and numbers, for example "ABC-0112" or "PCS12SCH".
Thanks in advance for your help.
[[:alnum:]] matches letters or digits. To match only strings that contain both, you should use:
x <- c("ABC", "ABc12", "--A-1", "abc--", "89=A")
grep("(.*[[:alpha:]].*[[:digit:]]|.*[[:digit:]].*[[:alpha:]])", x)
# [1] 2 3 5
or
which(grepl("[[:alpha:]]", x) & grepl("[[:digit:]]", x))
# [1] 2 3 5
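As for why the original call returned 1: with R's default regex engine, "[:alnum:]" without the outer brackets is read as an ordinary bracket expression listing the characters ":", "a", "l", "n", "u" and "m", so "abc" matches on its "a". The sketch below illustrates the difference and wraps the second solution above in a small helper; the name has_letter_and_digit is made up for this example.

```r
# single brackets: a plain set of the characters : a l n u m
single_abc <- grepl("[:alnum:]", "abc")    # "a" is in the set
single_xyz <- grepl("[:alnum:]", "xyz")    # none of x, y, z is in the set
# double brackets: the POSIX class of letters and digits
double_xyz <- grepl("[[:alnum:]]", "xyz")

# hypothetical helper: TRUE only when a string has at least one letter AND one digit
has_letter_and_digit <- function(x) {
  grepl("[[:alpha:]]", x) & grepl("[[:digit:]]", x)
}

x <- c("ABC", "ABc12", "--A-1", "abc--", "89=A", "ABC-0112", "PCS12SCH")
mixed <- x[has_letter_and_digit(x)]
```

With this input, mixed keeps every element except "ABC" and "abc--".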

Print occurrence / positions of words

I have tried several packages in order to build an R program that takes a text file as input and produces a list of the words inside that file. Each word should have a vector with all the positions at which that word occurs in the file.
As an example, if the text file has the string:
"this is a nice text with nice characters"
The output should be something like:
$this
[1] 1
$is
[1] 2
$a
[1] 3
$nice
[1] 4 7
$text
[1] 5
$with
[1] 6
$characters
[1] 8
I came across a useful post, http://r.789695.n4.nabble.com/Memory-usage-in-R-grows-considerably-while-calculating-word-frequencies-td4644053.html, but it does not include the positions of each word.
I found a similar function called "str_locate", however I want to count "words" and not "characters".
Any guidance of what packages / techniques to use on that, would be really appreciated
You can do this with base R (which, curiously, produces exactly your suggested output):
# data
x <- "this is a nice text with nice characters"
# split on whitespace
words <- strsplit(x, split = ' ')[[1]]
# find positions of every word
sapply(unique(words), function(x) which(x == words))
### result ###
$this
[1] 1
$is
[1] 2
$a
[1] 3
$nice
[1] 4 7
$text
[1] 5
$with
[1] 6
$characters
[1] 8
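One thing to watch with the sapply() call above: when no word is repeated, every result has length one and sapply() simplifies the list to a named vector. A sketch that always returns a list is to split the word indices by word, using a factor whose levels follow the order of first appearance. Splitting on runs of whitespace is an assumption made here so that lines read from a file with irregular spacing still work.

```r
x <- "this is a nice text with nice characters"

# split on one or more whitespace characters
words <- strsplit(x, "\\s+")[[1]]

# group the positions 1..n by word, keeping words in order of first appearance
positions <- split(seq_along(words), factor(words, levels = unique(words)))
```

positions is a named list with one integer vector per distinct word, for example positions$nice holds 4 and 7.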

How to read in list from file in R?

I have a list written to a file, "file.txt", that was created by sink(). The file contains a single list, made up only of numbers, which looks like this:
[[1]]
[1] 1 2
[[2]]
[1] 1 2 3
How can I read the data back in as a list from such a file?
EDIT:
I'm going to try reading it as a string, then use some regex to remove '[[*]]' and replace '[*]' with a special symbol, say '#'. Then I'll take every substring between '#' symbols, split it into a vector and put it into an empty list.
Something like this should do the trick. (The exact details may vary, but at least this will give you some ideas to work with.)
l <- readLines("file.txt")
l2 <- gsub("\\[{2}\\d+\\]{2}", "#", l) # Replace [[*]] with '#'
l3 <- gsub("\\[\\d+\\]\\s", "", l2)[-1] # Remove all [*]
l4 <- paste(l3, collapse=" ") # Paste together into one string
l5 <- strsplit(l4, "#")[[1]] # Break into list
lapply(l5, function(X) scan(text = X, quiet = TRUE)) # Use scan to convert to numeric (text= avoids leaving connections open)
# [[1]]
# [1] 1 2
#
# [[2]]
# [1] 1 2 3
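The same idea can be sketched as a self-contained round trip that writes a small list with sink() and reads it back. The helper name read_sunk_list and the temporary file are illustrative only, and the sketch assumes a flat, unnamed list of numeric vectors printed in the default format.

```r
# write a small list with sink(), as in the question (temporary file for illustration)
path <- tempfile(fileext = ".txt")
sink(path)
print(list(c(1, 2), c(1, 2, 3)))
sink()

# hypothetical helper: rebuild the list from the printed text
read_sunk_list <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(trimws(lines))]               # drop blank lines
  is_header <- grepl("^\\[\\[[0-9]+\\]\\]$", lines)   # lines such as [[1]]
  group <- cumsum(is_header)                          # list element each line belongs to
  body <- sub("^\\s*\\[[0-9]+\\]\\s*", "", lines[!is_header])  # strip the [1] index prefix
  chunks <- split(body, group[!is_header])
  unname(lapply(chunks, function(chunk) {
    scan(text = paste(chunk, collapse = " "), quiet = TRUE)
  }))
}

result <- read_sunk_list(path)
```

Grouping lines by the preceding [[n]] header means a long vector that wraps over several printed lines should still end up in a single list element.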
