Get string in between many other strings [R]

Here I want to extract the string part "wanted1part". I could do it like this:
string <- "foo_bar_doo_xwanted1part_more_junk"
gsub("\\_.*", "", gsub(".*?_x", "", string))
#> [1] "wanted1part"
But I was hoping that someone could suggest a one-line solution.

If you want to stick with using gsub, you can use a capture group that is backreferenced in the replacement:
gsub('^.+_x(\\w+?)_.+$', '\\1', string, perl = TRUE)
The key here is to have the pattern match the whole string while a capture group, specified using parentheses, matches the part of the string you would like to keep. This group, here "(\\w+?)", then replaces the entire string when we reference it as "\\1" in the replacement.
I've found that using str_extract from stringr can make this kind of thing a bit easier, as it lets me avoid capture groups.
library(stringr)
str_extract(string, '(?<=_x)\\w+?(?=_)')
Here, I use a lookbehind, (?<=_x), and a lookahead, (?=_), instead to identify the part of the string we want to extract. Neither is included in the match itself.
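If you would rather not load stringr, the same lookbehind idea can be sketched in base R with regexpr and regmatches. This is only one possible variant: it swaps \\w+? for [^_]+ (an assumption on my part that the wanted part never contains an underscore), which removes the need for the lookahead.

```r
string <- "foo_bar_doo_xwanted1part_more_junk"

# regexpr() locates the first match; regmatches() pulls it out.
# perl = TRUE is required because lookbehind is a PCRE feature.
wanted <- regmatches(string, regexpr("(?<=_x)[^_]+", string, perl = TRUE))
wanted
#> [1] "wanted1part"
```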

Related

Create a function in R to extract character from string by using position? The positions of characters are figured out based on pattern condition

I want to create a function that extracts characters from strings using substring, but I am having trouble finding the end_position at which to cut.
I have a string stored in a log file like this:
string = ("{\"country\":\"UNITED STATES\",\"country_code\":\"US\"}")
My idea is to identify the position of each description in the log and cut out the characters that follow it:
start_position = as.numeric(str_locate(string,'\"country\":\"')[,2])
end_position = ??????
country = substring(string,start_position,end_position)
The end of the part I want to cut is marked by the "," that follows it. FOR EXAMPLE: \"country\":\"UNITED STATES\",
Could you tell me a way to get the position of the "," that follows a specific pattern? I intend to write a function later that extracts characters based on the recognized pattern. In this example, the patterns are "country" and "country code".
Instead of using substring, have a look at strsplit, which splits a string according to a pattern.
string = ("{\"country\":\"UNITED STATES\",\"country_code\":\"US\"}")
strsplit(string,",")[[1]][1]
[1] "{\"country\":\"UNITED STATES\""
You can change the pattern to any regex you like.
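If the goal is the value itself rather than a position, a capture group can do the cutting directly. Below is a minimal sketch, not a full log parser: extract_field is a hypothetical helper name, the key is assumed to be spelled country_code, and the key is pasted into the regex unescaped, so it must not contain regex metacharacters.

```r
string <- "{\"country\":\"UNITED STATES\",\"country_code\":\"US\"}"

# Hypothetical helper: returns the quoted value that follows "key":
extract_field <- function(x, key) {
  # Builds the pattern  "key":"(anything except a double quote)"
  pattern <- paste0("\"", key, "\":\"([^\"]*)\"")
  m <- regmatches(x, regexec(pattern, x))
  # Element 1 of each hit is the full match, element 2 the capture group
  vapply(m, function(hit) if (length(hit) >= 2) hit[2] else NA_character_,
         character(1))
}

extract_field(string, "country")
#> [1] "UNITED STATES"
extract_field(string, "country_code")
#> [1] "US"
```

For real JSON logs a proper JSON parser is probably the safer choice; the regex above only illustrates the pattern-then-value idea from the question.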

Match everything up until first instance of a colon

Trying to code up a Regex in R to match everything before the first occurrence of a colon.
Let's say I have:
time = "12:05:41"
I'm trying to extract just the 12. My strategy was to do something like this:
grep(".+?(?=:)", time, value = TRUE)
But I'm getting the error that it's an invalid Regex. Thoughts?
Your regex looks fine to me. You are getting the error because you are missing perl = TRUE, which lookaheads require; beyond that, I don't think grep is the right function here.
I would recommend using:
stringr::str_extract( time, "\\d+?(?=:)")
grep works differently from how it is being used here: it is good for filtering a vector down to the elements that match a pattern, but it cannot pluck a value out of a string.
If you want to use Base R you can also go for sub:
sub("^(\\d+?)(?=:)(.*)$","\\1",time, perl=TRUE)
Also, you may split the string using strsplit and filter out the first string like below:
strsplit(time, ":")[[1]][1]
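To make the difference between filtering and extracting concrete, here is a small base R sketch using the original pattern. With perl = TRUE the error goes away, but grep still returns the whole element, whereas regexpr plus regmatches returns only the matched part.

```r
time <- "12:05:41"

# grep() filters: it returns every element that contains a match, unchanged.
whole <- grep(".+?(?=:)", time, value = TRUE, perl = TRUE)
whole
#> [1] "12:05:41"

# regexpr() finds the first match and regmatches() extracts just that part.
hour <- regmatches(time, regexpr(".+?(?=:)", time, perl = TRUE))
hour
#> [1] "12"
```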

Repeating a regex pattern for date parsing

I have the following string
"31032017"
and I want to use regular expressions in R to get
"31.03.2017"
What is the best function to do it?
And a general question: how can I repeat the matched part, as in sed in bash? There, we use \1 to repeat the first matched part.
You need to put the single parts in round brackets like this:
sub("([0-9]{2})([0-9]{2})([0-9]{4})", "\\1.\\2.\\3", "31032017")
You can then use \\1 to access the part matched by the first group, \\2 for the second and so on.
Note that if your string is a date, there are better ways to parse / reformat it than directly using regex.
date_vector = c("31032017","28052017","04052022")
as.character(format(as.Date(date_vector, format = "%d%m%Y"), format = "%d.%m.%Y"))
#[1] "31.03.2017" "28.05.2017" "04.05.2022"
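If you want to do math with the dates, keep the Date object returned by as.Date and only call format when you need the text; format already returns character strings, so as.character can be omitted.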
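As a short sketch of what working with the dates can look like, the snippet below keeps the parsed Date objects around, does some arithmetic, and only formats at the end. The variable names are just for illustration.

```r
date_vector <- c("31032017", "28052017", "04052022")

# Parse once and keep the Date objects for arithmetic
dates <- as.Date(date_vector, format = "%d%m%Y")

# Days between the first two dates
gap <- as.numeric(dates[2] - dates[1])
gap
#> [1] 58

# Add one day, then format back to dd.mm.yyyy for display
next_day <- format(dates + 1, "%d.%m.%Y")
next_day
#> [1] "01.04.2017" "29.05.2017" "05.05.2022"
```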

str_extract - How to disable default regex

library(stringr)
namesfun<-(sapply(mxnames, function (x)(str_extract(x,sapply(jockeys, function (y)y)))))%>%as.data.frame(stringsAsFactors = F)
I am trying to apply str_extract over two vectors with sapply. The "jockeys" vector, which I use as the pattern argument in str_extract, has elements with special characters like "-" or "/" that interfere with the regex.
I want an exact, literal ("human") match rather than a regex-based one. How can I stop str_extract from treating the pattern as a regex by default?
I hope I got my point across!
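One way to get literal matching, sketched here in base R so it runs without extra packages, is the fixed = TRUE argument of regexpr; stringr offers a fixed() pattern wrapper for the same purpose. The data and the helper name extract_literal below are made up for illustration and are not from the question.

```r
# Hypothetical example data
jockeys <- c("J. Smith (AUS)", "A-B/C Rider")
mxnames <- c("R1 J. Smith (AUS) 58kg", "R2 A-B/C Rider 56kg", "R3 JX Smith AUS")

# Hypothetical helper: extract a literal pattern from each element of x,
# giving NA where the pattern does not occur.
extract_literal <- function(x, pattern) {
  pos <- regexpr(pattern, x, fixed = TRUE)  # fixed = TRUE: no regex interpretation
  hit <- as.integer(pos) > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, pos)            # regmatches() returns matches only
  out
}

# One column per jockey, one row per element of mxnames
res <- sapply(jockeys, function(j) extract_literal(mxnames, j))

# Read as a regex, "J. Smith (AUS)" matches "JX Smith AUS" but not the literal text
regex_hits <- grepl(jockeys[1], mxnames)
literal_hits <- grepl(jockeys[1], mxnames, fixed = TRUE)
```

Here regex_hits flags only the third element and literal_hits only the first, which is the kind of interference the question describes.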

Replace the last occurrence of a string (and only it) using regular expression

I have a string, let's say MyString = "aabbccawww". I would like to use a gsub expression to replace the last "a" in MyString with "A", and only that one. The result should be "aabbccAwww". I have found similar questions on the website, but they all asked to replace the last occurrence and everything coming after it.
I have tried gsub("a[^a]*$", "A", MyString), but it gives "aabbccA". I know that I can use stringi functions for that purpose but I need the solution to be implemented in a part of a code where using such functions would be complicated, so I would like to use a regular expression.
Any suggestion?
You can use the stringi library, which makes dealing with strings very easy, i.e.
library(stringi)
x <- "aabbccawww"
stri_replace_last_fixed(x, 'a', 'A')
#[1] "aabbccAwww"
We can use sub to match 'a' followed by zero or more characters that are not an 'a' ([^a]*), capture it as group ((...)) until the end of the string ($) and replace it with "A" followed by the backreference of the captured group (\\1)
sub("a([^a]*)$", "A\\1", MyString)
#[1] "aabbccAwww"
While akrun's answer should solve the problem (not sure, haven't worked with \1 etc. yet), you can also use a negative lookahead:
a(?!(.|\n)*a)
This is basically saying: find an a that is NOT followed by any number of characters and then another a. The (?!x) construct is a negative lookahead: it asserts that x does not follow at this point, and whatever it inspects is not included in the match.
You need (.|\n) since . matches all characters except line breaks.
For reference on lookarounds and other regex features, I can recommend http://regexr.com/.
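In R, that pattern could be used roughly as follows. This is a sketch: the backslash in \n has to be doubled inside an R string, and perl = TRUE is needed because lookaheads are a PCRE feature.

```r
MyString <- "aabbccawww"

# Replace the one "a" that has no further "a" anywhere after it
res <- sub("a(?!(.|\\n)*a)", "A", MyString, perl = TRUE)
res
#> [1] "aabbccAwww"
```

Since only one "a" can satisfy the lookahead, gsub would give the same result here.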

Resources