Extract inside of first square brackets - r

I know there are a few similar questions, but they did not help me, perhaps because I lack a grasp of the basics of string manipulation.
I have a string, and I want to extract the contents of its first pair of square brackets.
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
I have looked all over the internet to assemble the following code, but it gives me the contents of the second pair of brackets:
sub(".*\\[(.*)\\].*", "\\1", x, perl=TRUE)
The code returns 2. I expect to get 4.
I would appreciate it if someone could point out the missing piece.
---- update ----
Replacing .* with .*? in the first two instances worked, but I do not know why. I am leaving the question open for anyone who can explain why this works:
sub(".*?\\[(.*?)\\].*", "\\1", x, perl=TRUE)

You're almost there:
sub("^[^\\]]*\\[(\\d+)\\].*", "\\1", x, perl=TRUE)
## [1] "4"
The original problem is that .* is greedy: it matches as much as possible, so the leading .* runs all the way to the last [ in the string. Your fix uses *?, the lazy (non-greedy, reluctant) version of *, which matches as little as it can.
That is completely valid. The alternative I used is [^\\]]*, which translates to "match any run of characters that are not ]".
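To see the three behaviours side by side, here is a small sketch that reuses x from the question. The variable names greedy, lazy and negated are only illustrative, and the third pattern is a variant of the one above that skips characters other than [ rather than characters other than ]:

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

# Greedy: the leading .* runs to the LAST "[" that still allows a match
greedy <- sub(".*\\[(.*)\\].*", "\\1", x, perl = TRUE)

# Lazy: .*? stops at the FIRST "[" and (.*?) stops at the first "]" after it
lazy <- sub(".*?\\[(.*?)\\].*", "\\1", x, perl = TRUE)

# Negated classes: skip non-"[" characters, then capture non-"]" characters
negated <- sub("^[^\\[]*\\[([^\\]]*)\\].*", "\\1", x, perl = TRUE)

print(c(greedy = greedy, lazy = lazy, negated = negated))
```

The greedy version returns "2", while the lazy and negated-class versions both return "4".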

stringr
You can solve this with base R, but I usually prefer the functions from the stringr package when handling such 'problems'.
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"
If you want only the first string between brackets, use str_extract:
stringr::str_extract(x, "(?<=\\[).+?(?=\\])")
# [1] "4"
If you want all the strings between brackets, use str_extract_all:
stringr::str_extract_all(x, "(?<=\\[).+?(?=\\])")
# [[1]]
# [1] "4" "2"
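For readers who would rather stay in base R, the same lookaround idea can probably be expressed with regexpr/gregexpr and regmatches. This is only a sketch: it assumes perl = TRUE (lookarounds need the PCRE engine), and it swaps the lazy .+? for the negated class [^\\]]+, which should behave the same on this input:

```r
x <- "cons/mod2/det[4]/rost2/rost_act[2]/Q2w5"

# Text preceded by "[" and followed by "]", without including the brackets
pattern <- "(?<=\\[)[^\\]]+(?=\\])"

# regexpr finds only the first match in each string
first_match <- regmatches(x, regexpr(pattern, x, perl = TRUE))

# gregexpr finds every match; [[1]] picks the result for the single input
all_matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]

print(first_match)
print(all_matches)
```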

Related

Regex get string between intervals underscores

I've seen a lot of similar questions, but I wasn't able to get the desired output.
I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the second and third underscores: textimput
I'm using R, stringr, I've tried many things, but none solved the issue:
my_string <- "means_variab_textimput_x2_200.txt"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"
str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closer I got was this
"_textimput_x2_200."
Any ideas? PS: I'm VERY new to regex, so details would be much appreciated :)
Additional question: can I also get only a part of the word? Say, instead of textimput, only text, but without counting the words? It would be good to know both possibilities.
Several related answers were helpful, but I couldn't get the final expected results. Thanks in advance.
stringr uses ICU-based regular expressions. One option would be regex lookarounds, but the length of the preceding text is not fixed here, so a lookbehind ((?<=...)) wouldn't work. Another option is either to remove the unwanted substrings with str_remove, or to use str_replace to match the whole string, capture the third word, i.e. a run of characters other than _ (([^_]+)), and replace the match with the backreference (\\1) to the captured word:
library(stringr)
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
[1] "textimput"
If we need only the substring
str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
[1] "text"
In base R, it is easier to use strsplit and pick the third word by indexing
strsplit(my_string, "_")[[1]][3]
# [1] "textimput"
Or use perl = TRUE in regexpr
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"
For the substring
regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
[1] "text"
Following up on the question asked in the comments about restricting the size of the extracted word: this can easily be achieved with a quantifier. If, for example, you want to extract only the first 4 letters:
sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
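If the position of the wanted field may change, the strsplit idea can be wrapped in a small function. The helper below, nth_field, is a hypothetical name and not part of base R or stringr; it is a minimal sketch that assumes the separator is a literal string and returns NA when a string has too few fields:

```r
# Hypothetical helper: return the n-th sep-delimited field of each string,
# or NA when a string has fewer than n fields.
nth_field <- function(x, n, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

my_string <- "means_variab_textimput_x2_200.txt"

third <- nth_field(my_string, 3)   # whole third field
prefix <- substr(third, 1, 4)      # only its first four characters

print(third)
print(prefix)
```

Splitting first and trimming afterwards keeps the two concerns (which field, how many characters) separate, at the cost of not being a single regex.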

Extract character string in middle of string with R

I have character strings which look something like this:
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")
I want to extract the middle portion only eg. hsa-mir-521-3p and hsa-mir-947b
I have tried the following so far:
a1 <- substr(a, 8,21)
[1] "hsa-mir-521-3p" "hsa-mir-947b.r"
this obviously does not work because my desired substrings have varying lengths
a2 <- sub('miRNA__', '', a)
[1] "hsa-mir-521-3p.iso.t5:" "hsa-mir-947b.ref.t5:"
this works to remove the upstream string (“miRNA__”), but I still need to remove the downstream string
Could someone please advise what else I could try or if there is a simpler way to achieve this? I am still learning how to code with R. Thank you very much!
You haven't clearly defined the "middle portion" but based on the data shared we can extract everything between the last underscore ("_") and a dot (".").
sub('.*_(.*?)\\..*', '\\1', a)
#[1] "hsa-mir-521-3p" "hsa-mir-947b"
You can try the following regex
> gsub(".*_|\\..*","",a)
[1] "hsa-mir-521-3p" "hsa-mir-947b"
which removes the left-most (.*_) and right-most (\\..*) parts, therefore keeping the middle part.
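The single gsub call above can also be read as two separate removals. The following sketch splits it into two sub calls purely for illustration; the intermediate names no_prefix and middle are made up here:

```r
a <- c("miRNA__hsa-mir-521-3p.iso.t5:", "miRNA__hsa-mir-947b.ref.t5:")

# Step 1: greedy .*_ removes everything up to and including the LAST underscore
no_prefix <- sub(".*_", "", a)

# Step 2: \\..* removes everything from the FIRST literal dot onwards
middle <- sub("\\..*", "", no_prefix)

print(no_prefix)
print(middle)
```

Note that this relies on the wanted part containing no underscore and no dot, which holds for the sample data but is an assumption about the real data.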
We could also use trimws from base R
trimws(a, whitespace = '.*_|\\..*')
#[1] "hsa-mir-521-3p" "hsa-mir-947b"

strsplit returns invisible element

I found a very strange behavior in strsplit(). It's similar to this question, however I would love to know why it is returning an empty element in the first place. Does someone know?
unlist(strsplit("88F5T7F4T13F", "\\d+"))
[1] "" "F" "T" "F" "T" "F"
Since I use that string for reproducing a long logical vector (88*FALSE 5*TRUE 7*FALSE 4*TRUE 13*FALSE) I have to trust it...
The answer unlist(strsplit("88F5T7F4T13F", "\\d+"))[-1] works, but is it robust?
The empty element appears because the string starts with digits. Since you split at digits, the first delimiter (88) sits between the start of the string and the first F, so the piece in front of it is an empty string, and that empty string is added to the result.
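A quick way to convince yourself is to compare a string that starts with the delimiter against one that does not. This is only a small illustration with made-up inputs:

```r
# Starts with digits: the piece before the first delimiter is empty
leading <- strsplit("88F5T", "\\d+")[[1]]

# Starts with a letter (and ends with digits): no empty piece is produced
no_leading <- strsplit("F5T7", "\\d+")[[1]]

print(leading)
print(no_leading)
```

The second call also shows that strsplit does not add an empty string for a delimiter at the end of the input, so dropping the first element with [-1] is safe only as long as every input really starts with digits.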
You may use your own solution since it is already working well. If you are interested in alternative solutions, see below:
unlist(strsplit(sub("^\\d+", "", "88F5T7F4T13F"), "\\d+"))
It makes the empty element in the resulting split disappear, since sub with the ^\d+ pattern removes all leading digits (^ is the start of the string and \d+ matches 1 or more digits). However, it is not robust, since it uses 2 regexps.
library(stringr)
s <- "88F5T7F4T13F"
res <- str_extract_all(s, "\\D+")
This only requires one matching regex, \D+ - 1 or more non-digit symbols, and one external library.
If you want to do a similar thing with base R, use regmatches with gregexpr:
regmatches(s, gregexpr("\\D+", s))
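Since the stated goal is to rebuild a long logical vector from the run-length string, the matching approach can be taken one step further by extracting the counts as well. This is a sketch under the assumption that the string strictly alternates counts and single T/F letters; the names counts, flags and decoded are illustrative:

```r
s <- "88F5T7F4T13F"

# Run lengths: every run of digits, converted to integer
counts <- as.integer(regmatches(s, gregexpr("\\d+", s))[[1]])

# Run values: every run of non-digits, TRUE where the letter is "T"
flags <- regmatches(s, gregexpr("\\D+", s))[[1]] == "T"

# Repeat each value by its count to rebuild the logical vector
decoded <- rep(flags, times = counts)

print(length(decoded))
print(sum(decoded))
```

Because both pieces come from matches rather than from a split, there is no empty leading element to drop.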

Removing a character from within a vector element

I have a vector of strings:
str.vect<-c ("abcR.1", "abcL.1", "abcR.2", "abcL.2")
str.vect
[1] "abcR.1" "abcL.1" "abcR.2" "abcL.2"
How can I remove the third character from the right in each vector element?
Here is the desired result:
"abc.1" "abc.1" "abc.2" "abc.2"
Thank you very much in advance
You can use nchar to find the length of each element of the vector
> nchar(str.vect)
[1] 6 6 6 6
Then you combine this with strtrim to get the beginning of each string
> strtrim(str.vect, nchar(str.vect)-3)
[1] "abc" "abc" "abc" "abc"
To get the end of the word you can then use substr (actually, you could use substr to get the beginning too...)
> substr(str.vect, nchar(str.vect)-1, nchar(str.vect))
[1] ".1" ".1" ".2" ".2"
And finally you use paste0 (which is paste with sep="") to stick them together
> paste0(strtrim(str.vect, nchar(str.vect)-3), # Beginning
substr(str.vect, nchar(str.vect)-1, nchar(str.vect))) # End
[1] "abc.1" "abc.1" "abc.2" "abc.2"
There are easier ways if you know your strings have some special characteristics
For instance, if the length is always 6 you can directly substitute the nchar calls with the appropriate value.
EDIT: alternatively, R also supports regular expressions, which make this task much easier.
> gsub(".(..)$", "\\1", str.vect)
[1] "abc.1" "abc.1" "abc.2" "abc.2"
The syntax is a bit more obscure, but not that difficult once you know what you are looking at.
The first parameter (".(..)$") is what you want to match
. matches any character, $ denotes the end of the string.
So ...$ indicates the last 3 characters in the string.
We put the last two in parentheses, so that we can store them in memory.
The second parameter tells us what you want to substitute the matched substring with. In our case we put \\1 which means "whatever was in the first pair of parentheses".
So essentially this command means: "find the last three characters in the string and change them with the last two".
The solution provided by @nico seems fine, but a simpler alternative might be to use sub:
sub('.(.{2})$', '\\1', str.vect)
This searches for the pattern of: "any character (represented by .) followed by 2 of any character (represented by .{2}), followed by the end of the string (represented by $)". By wrapping the .{2} in parentheses, R captures whatever those last two characters were. The second argument is the string to replace the matched substrings with. In this case, we refer to the first string captured in the matched pattern. This is represented by \\1. (If you captured multiple parts of the pattern, with multiple sets of parentheses, you would refer to subsequent captured regions with, e.g. \\2, \\3, etc.)
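The same pattern can be generalised to "remove the k-th character from the right" by building the quantifier with sprintf. The function name remove_from_right is hypothetical, and this is a sketch rather than a hardened implementation:

```r
# Hypothetical helper: drop the k-th character counted from the right.
# Strings shorter than k characters are returned unchanged.
remove_from_right <- function(x, k) {
  stopifnot(length(k) == 1, k >= 1)
  # One character to drop, followed by the (k - 1) characters to keep
  pattern <- sprintf(".(.{%d})$", as.integer(k) - 1L)
  sub(pattern, "\\1", x)
}

str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")

third_removed <- remove_from_right(str.vect, 3)
second_removed <- remove_from_right(str.vect, 2)

print(third_removed)
print(second_removed)
```

With k = 3 this builds exactly the pattern .(.{2})$ used above.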
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
a <- strsplit(str.vect, split = "")
b <- unlist(lapply(a, FUN = function(x) {
  x[4] <- ""  # blank out the 4th character
  paste(x, collapse = "")
}))
If you want to parameterize it further, change 4 to a variable holding the index of the character you want to remove.
Not sure how general or efficient this is, but it seems to work with your example string:
(This seems very similar to nico's answer although I am not using the strtrim function.)
my.string <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
n.char <- nchar(my.string)
the.beginning <- substr(my.string, n.char-(n.char-1), n.char-3)
the.end <- substr(my.string, n.char-1, n.char)
new.string <- paste0(the.beginning, the.end)
new.string
# [1] "abc.1" "abc.1" "abc.2" "abc.2"
The 3rd character from the right of each element is removed.
sapply(str.vect, function(x) gsub(substr(x, nchar(x)-2,nchar(x)-2), "", x))
This is a very quick and dirty answer, but that's what is needed sometimes:
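One caveat with this approach is that gsub treats the extracted character as a regular expression and removes every occurrence of it, not just the one at that position. The inputs below are made up to show the difference against the positional sub pattern:

```r
# Made-up inputs: one repeats the character to drop, one has "." in that slot
tricky <- c("RbcR.1", "abc..1")

# gsub on the extracted character: removes ALL matches of it as a regex
via_gsub <- unname(sapply(tricky, function(x) {
  gsub(substr(x, nchar(x) - 2, nchar(x) - 2), "", x)
}))

# Positional pattern: removes only the third character from the right
via_sub <- sub(".(.{2})$", "\\1", tricky)

print(via_gsub)
print(via_sub)
```

In the first string both R characters are removed, and in the second the extracted "." acts as a wildcard and empties the whole string. Passing fixed = TRUE to gsub would address the wildcard case, but not the repeated-character one.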
#Define vector
str.vect <- c("abcR.1", "abcL.1", "abcR.2", "abcL.2")
#Use gsub to remove both 'R' and 'L' independently.
str.vect2 <- gsub("R", '', str.vect )
str.vect_final <- gsub("L", '', str.vect2 )
> str.vect_final
[1] "abc.1" "abc.1" "abc.2" "abc.2"

Extract websites links from a text in R

I have multiple texts that may each contain references to one or more web links. For example:
text1= "s#1212a as www.abcd.com asasa11"
How do I extract:
"www.abcd.com"
from this text in R? In other words I am looking to extract patterns that start with www and end with .com
regmatches
This approach uses regexpr/gregexpr and regmatches. I expanded the test data to include more examples.
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"blargwww.test.comasdf")
# Regular expressions take some practice.
# check out ?regex or the wikipedia page on regular expressions
# for more info on creating them yourself.
pattern <- "www\\..*?\\.com"
# Get information about where the pattern matches text1
m <- gregexpr(pattern, text1)
# Extract the matches from text1
regmatches(text1, m)
Which gives
> regmatches(text1, m) ##
[[1]]
[1] "www.abcd.com" "www.cats.com"
[[2]]
[1] "www.boo.com"
[[3]]
character(0)
[[4]]
[1] "www.test.com"
Notice it returns a list. If we want a vector you can just use unlist on the result. This is because we used gregexpr which implies there could be multiple matches in our string. If we know there is at most one match we could use regexpr instead
> m <- regexpr(pattern, text1)
> regmatches(text1, m)
[1] "www.abcd.com" "www.boo.com" "www.test.com"
Notice, however, that this returns all results as a vector and only returns a single result from each string (note that www.cats.com isn't in the results). On the whole, though, I think either of these two methods is preferable to the gsub method because that way will return the entire input if there is no result found. For example take a look:
> gsub(text1, pattern=".*(www\\..*?\\.com).*", replace="\\1")
[1] "www.abcd.com" "www.boo.com" "asdf" "www.test.com"
And that's even after modifying the pattern to be a little more robust. We still get 'asdf' in the results even though it clearly doesn't match the pattern.
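If you want one result per input string but with NA instead of the unmatched input, the regexpr result can be used to fill a pre-allocated vector. This is a sketch, not part of the original answer; it adds perl = TRUE as an explicit choice for the lazy quantifier, and the names hit, found and first_url are illustrative:

```r
text1 <- c("s#1212a www.abcd.com www.cats.com",
           "www.boo.com",
           "asdf",
           "blargwww.test.comasdf")
pattern <- "www\\..*?\\.com"

# regexpr returns -1 for strings without a match
hit <- regexpr(pattern, text1, perl = TRUE)
found <- as.vector(hit) != -1

# Start with all NA, then fill in only the positions that matched
first_url <- rep(NA_character_, length(text1))
first_url[found] <- regmatches(text1, hit)

print(first_url)
```

This keeps the output aligned with the input, which the plain regmatches(text1, m) call does not, because it silently drops non-matching elements.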
Shameless silly self promotion: regmatches was introduced with R 2.14 so if you're stuck with an earlier version of R you might be out of luck. Unless you're able to install the future2.14 package from my github repo which provides some support for functions introduced in 2.14 to earlier versions of R.
strapplyc
An alternative which gives the same result as ## above is:
library(gsubfn)
strapplyc(text1, pattern)
The regular expression
Here is some explanation on how to decipher the regular expression:
pattern <- "www\\..*?\\.com"
Explanation:
www matches the www portion
\\. We need to escape an actual 'dot' character using \\ because a plain . represents "any character" in regular expressions.
.*? The . represents any character, the * tells to match 0 or more times, and the ? following the * tells it to not be greedy. Otherwise "asdf www.cats.com www.dogs.com asdf" would match all of "www.cats.com www.dogs.com" as a single match instead of recognizing that there are two matches in there.
\\. Once again we need to escape an actual dot character
com This part matches the ending 'com' that we want to match
Putting it all together it says: start with www. then match any characters until you reach the first ".com"
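The effect of the ? can be checked directly on the example string from the explanation. The sketch below runs both variants; perl = TRUE is added here as an explicit choice, and the names lazy and greedy are illustrative:

```r
s <- "asdf www.cats.com www.dogs.com asdf"

# Lazy: stop at the first ".com" after each "www."
lazy <- regmatches(s, gregexpr("www\\..*?\\.com", s, perl = TRUE))[[1]]

# Greedy: run on to the last ".com" in the string
greedy <- regmatches(s, gregexpr("www\\..*\\.com", s, perl = TRUE))[[1]]

print(lazy)
print(greedy)
```

The lazy pattern yields two separate matches, while the greedy one yields a single match spanning both addresses.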
Check out the gsub function:
x = "s#1212a as www.abcd.com asasa11"
gsub(x=x, pattern=".*(www.*com).*", replace="\\1")
The basic idea is to surround the text you want to retain in parentheses, then replace the entire line with it. The replace parameter of gsub "\\1" refers to what was found in the parentheses.
The solutions here are great and in base. For those that want a quick solution you can use qdap's genXtract. This function basically takes a left and a right element(s) and it will extract everything in between. By setting with = TRUE it will include those elements:
text1 <- c("s#1212a www.abcd.com www.cats.com",
"www.boo.com",
"asdf",
"http://www.talkstats.com/ and http://stackoverflow.com/",
"blargwww.test.comasdf")
library(qdap)
genXtract(text1, "www.", ".com", with=TRUE)
## > genXtract(text1, "www.", ".com", with=TRUE)
## $`www. : .com1`
## [1] "www.abcd.com" "www.cats.com"
##
## $`www. : .com2`
## [1] "www.boo.com"
##
## $`www. : .com3`
## character(0)
##
## $`www. : .com4`
## [1] "www.talkstats.com"
##
## $`www. : .com5`
## [1] "www.test.com"
PS if you look at code for the function it is a wrapper for Dason's solution.

Resources