Match a substring with character, digits and spaces with gsub - r

I have a string like:
a <- '{:name=>"krill", :priority=>2, :count=>1}, {:name=>"vit a", :priority=>2]}, {:name=>"vit-b", :priority=>2, :count=>1}, {:name=>"vit q10", :priority=>2]}'
I would like to use str_match to extract the values between :name=>" and the closing ":
krill
vit a
vit-b
vit q10
So far I tried:
str_match(a, ':name=>\\"([A-Za-z]{3})')
But it doesn't work.
Any help is appreciated

You may extract those values with
> regmatches(a, gregexpr(':name=>"\\K[^"]+', a, perl=TRUE))
[[1]]
[1] "krill" "vit a" "vit-b" "vit q10"
The :name=>"\\K[^"]+ pattern matches
:name=>" - a literal substring
\K - omits the substring from the match
[^"]+ - one or more chars other than ".
If you need to use stringr package, use str_extract_all:
> library(stringr)
> str_extract_all(a, '(?<=:name=>")[^"]+')
[[1]]
[1] "krill" "vit a" "vit-b" "vit q10"
In (?<=:name=>")[^"]+, the (?<=:name=>") matches any location that is immediately preceded with :name=>".
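For comparison, the two patterns can be run side by side in base R. This is only a minimal sketch that assumes the string a from the question; extract_names is a hypothetical helper name, not a function from any package.

```r
a <- '{:name=>"krill", :priority=>2, :count=>1}, {:name=>"vit a", :priority=>2]}, {:name=>"vit-b", :priority=>2, :count=>1}, {:name=>"vit q10", :priority=>2]}'

# Hypothetical helper: return every match of a PCRE pattern as a character vector
extract_names <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# \K drops the already-matched prefix from the reported match
with_k <- extract_names(a, ':name=>"\\K[^"]+')

# The lookbehind only checks the prefix and never consumes it
with_lookbehind <- extract_names(a, '(?<=:name=>")[^"]+')

print(with_k)
print(identical(with_k, with_lookbehind))
```

Both variants need perl = TRUE in base R, since \K and lookbehinds are PCRE features.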

Using stringr and positive lookbehind:
library(stringr)
str_match_all(a, '(?<=:name=>")[^"]+')
[[1]]
[,1]
[1,] "krill"
[2,] "vit a"
[3,] "vit-b"
[4,] "vit q10"

Related

fetch specific word or number from url address [duplicate]

I have a dataset say
x <- c('test/test/my', 'et/tom/cat', 'set/eat/is', 'sk / handsome')
I'd like to remove everything before (including) the last slash, the result should look like
my cat is handsome
I googled this code which gives me everything before the last slash
gsub('(.*)/\\w+', '\\1', x)
[1] "test/test" "et/tom" "set/eat" "sk / handsome"
How can I change this code, so that the other part of the string after the last slash can be shown?
Thanks
You can use basename:
paste(trimws(basename(x)),collapse=" ")
# [1] "my cat is handsome"
Using strsplit
> sapply(strsplit(x, "/\\s*"), tail, 1)
[1] "my" "cat" "is" "handsome"
Another way for gsub
> gsub("(.*/\\s*(.*$))", "\\2", x) # without 'unwanted' spaces
[1] "my" "cat" "is" "handsome"
Using str_extract from stringr package
> library(stringr)
> str_extract(x, "\\w+$") # without 'unwanted' spaces
[1] "my" "cat" "is" "handsome"
You can basically just move where the parentheses are in the regex you already found:
gsub('.*/ ?(\\w+)', '\\1', x)
You could use
x <- c('test/test/my', 'et/tom/cat', 'set/eat/is', 'sk / handsome')
gsub('^(?:[^/]*/)*\\s*(.*)', '\\1', x)
Which yields
[1] "my" "cat" "is" "handsome"
To have it in one sentence, you could paste it:
(paste0(gsub('^(?:[^/]*/)*\\s*(.*)', '\\1', x), collapse = " "))
The pattern here is:
^ # start of the string
(?:[^/]*/)* # not a slash, followed by a slash, 0+ times
\\s* # optional whitespace
(.*) # capture the rest of the string
The whole match is replaced by \\1, i.e. the content of the first capture group.
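Because .* is greedy, a shorter pattern may be enough here: delete everything up to and including the last slash, plus any whitespace right after it. A minimal sketch, assuming the vector x from the question:

```r
x <- c('test/test/my', 'et/tom/cat', 'set/eat/is', 'sk / handsome')

# Greedy .* runs to the last "/", then \\s* eats any spaces that follow it
last_part <- sub(".*/\\s*", "", x)
print(last_part)

# Join the pieces into one sentence
sentence <- paste(last_part, collapse = " ")
print(sentence)
```

Elements that contain no slash do not match the pattern and are returned unchanged.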

gsub replace word not followed by : with word: R

I thought this would be simpler, but I have strings not followed by ':', and strings with : inside the string. I want to append : to strings that don't end in :, and ignore strings that have : inside.
words
[1] "Bajos" "Ascensor" "habs.:3"
gsub('\\b(?!:)', '\\1:', words, perl = TRUE)
[1] ":Bajos:" ":Ascensor:" ":habs:.::3:"
grep('\\W', words)
[1] 3
grep('\\w', words)
[1] 1 2 3 # ?
Desired output:
'Bajos:' 'Ascensor:' 'habs.:3'
sub("^([^:]*)$", "\\1:", words)
# [1] "Bajos:" "Ascensor:" "habs.:3"
or
nocolon <- !grepl(":", words)
words[nocolon] <- paste0(words[nocolon], ":")
words
# [1] "Bajos:" "Ascensor:" "habs.:3"
Use
"(\\p{L}+)\\b(?![\\p{P}\\p{S}])"
EXPLANATION
(\p{L}+) - one or more letters (group #1)
\b - a word boundary
(?![\p{P}\p{S}]) - a negative lookahead: no punctuation or symbol allowed immediately to the right
R code snippet:
gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", "\\1:", words, perl=TRUE)
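The answers in this thread agree on the sample data but not in general, which a quick check can show. The sketch below assumes the words vector from the question; the extra string "dos palabras" is a made-up test case, not from the original post.

```r
words <- c("Bajos", "Ascensor", "habs.:3")

# Whole-string approach: append ":" only if the element contains no colon at all
by_sub <- sub("^([^:]*)$", "\\1:", words)

# Word-level approach: append ":" after every run of letters not followed by punctuation
by_lookahead <- gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", "\\1:", words, perl = TRUE)

print(by_sub)
print(by_lookahead)

# Hypothetical multi-word input: here the two approaches differ
multi <- "dos palabras"
multi_sub <- sub("^([^:]*)$", "\\1:", multi)
multi_lookahead <- gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", "\\1:", multi, perl = TRUE)
print(multi_sub)
print(multi_lookahead)
```

The sub version treats each element as a unit, while the lookahead version works word by word, so the choice depends on whether elements can contain several words.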

Regex: how to keep all digits when splitting a string?

Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit is consumed by the match and so is not present in the output. Use a lookbehind for the digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
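Another option is to capture the two parts explicitly instead of splitting. This is a sketch in base R under the assumption that every element starts with digits followed by whitespace; elements that do not fit that shape come back as character(0).

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# Capture the leading digits and the remaining text as two groups
m <- regmatches(sample.text,
                regexec("^([0-9]+)[[:space:]]+(.*)$", sample.text))

# Each list element is c(full match, group 1, group 2); drop the full match
parts <- lapply(m, function(p) p[-1])
print(parts)
```

Unlike the lookbehind split, this never splits again on a later digit-plus-space inside the text part.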

Extract both occurrences of pattern Regex

I have an input vector as follows:
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)", "fdsfs thistoo (1,1,1,1)")
And I would like to use a regex to extract the following:
> output
[1] "iwantthis iwantthisaswell" "thistoo"
I have managed to extract every word that is before an opening bracket.
I tried this to get only the first word:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1", input)
[1] "iwantthis" "thistoo"
But I cannot get it to work for multiple occurrences:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1 \\2", input)
[1] "iwantthis iwantthisaswell" "fdsfs thistoo (1,1,1,1)"
The closest I have managed is the following:
library(stringr)
> str_extract_all(input, "(\\S*)\\s\\(")
[[1]]
[1] "iwantthis (" "iwantthisaswell ("
[[2]]
[1] "thistoo ("
I am sure I am missing something in my regex (not that good at it) but what?
You may use
> sapply(str_extract_all(input, "\\S+(?=\\s*\\()"), paste, collapse=" ")
[1] "iwantthis iwantthisaswell" "thistoo"
The \\S+(?=\\s*\\() pattern extracts every run of 1+ non-whitespace chars that is followed by 0+ whitespace chars and then a ( char. sapply with paste then joins the matches found in each element with a space (collapse=" ").
Pattern details
\S+ - 1 or more non-whitespace chars
(?=\s*\() - a positive lookahead ((?=...)) that requires the presence of 0+ whitespace chars (\s*) and then a ( char (\() immediately to the right of the current position.
Here is an option using base R
unlist(regmatches(input, gregexpr("\\w+(?= \\()", input, perl = TRUE)))
#[1] "iwantthis" "iwantthisaswell" "thistoo"
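Note that unlist flattens the result, so the link between matches and input elements is lost. A base R sketch that keeps one string per input element, assuming the input vector from the question:

```r
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
           "fdsfs thistoo (1,1,1,1)")

# One character vector of matches per input element
matches <- regmatches(input, gregexpr("\\S+(?=\\s*\\()", input, perl = TRUE))

# Collapse each element's matches into a single space-separated string
output <- vapply(matches, paste, character(1), collapse = " ")
print(output)
```

vapply is used rather than sapply so that the result is guaranteed to be a character vector with one entry per input.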
This works in R:
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', input, perl=TRUE)
Result:
[1] "iwantthis iwantthisaswell" "thistoo"
UPDATED to work for the general case. E.g. now finds "i_wantthisaswell2" by searching on non-spaces between the other matches.
Using other suggested general case inputs:
general_cases <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
"fdsfs thistoo (1,1,1,1) ",
"GaGa iwant_this (1,1,1,1)",
"lal2!##$%^&*()_+a i_wantthisaswell2 (2,3,4,5)")
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', general_cases, perl=TRUE)
results:
[1] "iwantthis iwantthisaswell" "thistoo "
[3] "iwant_this" "i_wantthisaswell2"

Extract all values in a string that occur after another substring in R

Let's say I have a string:
fgjh=621729_&ioij_fgjh7=twenty-_-One-_-Forty
I want to extract the following substrings from this string:
1. "621729"
2. "twenty"
3. "One"
4. "Forty"
Basically I want to extract anything after the "fgjh=" and "fgjh7=" substrings.
I've found that this formula works in Excel:
=TRIM(RIGHT(SUBSTITUTE(A1,"fgjh=",REPT(" ",LEN(A1))),LEN(A1)))
But the Excel file is too large and I need to perform the same operation in R.
How would I deal with leading and trailing characters? Let's say the string was "lmnop_82137_hhgia=77789_pasdk_ikuk_fgjh=621729_&ioij_fgjh7=twenty--One--Forty_dsaoij_882390=lkuk" and I need to extract the data after "fgjh=", i.e. 621729, and everything after "fgjh7=" to get only "twenty", "One" and "Forty".
You could use the package stringr and the function str_match for example to parse out the interesting bits with regular expressions
> library(stringr)
> s <- "fgjh=621729_&ioij_fgjh7=twenty--One--Forty"
> str_match(s, "^fgjh=([0-9]+)_&ioij_fgjh7=(.+)--(.+)--(.+)$")
[,1] [,2] [,3] [,4] [,5]
[1,] "fgjh=621729_&ioij_fgjh7=twenty--One--Forty" "621729" "twenty" "One" "Forty"
library(stringr)
string <- "fgjh=621729_&ioij_fgjh7=twenty--One--Forty"
unlist(strsplit(str_extract_all(string,'(?<=\\=)([^_]+)')[[1]],'--'))
[1] "621729" "twenty" "One" "Forty"
Using sub with regular expression is more flexible than splitting by position:
> sub(".*=(.*)_&.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "621729"
> sub(".*=(.*)--.*--.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "twenty"
> sub(".*--(.*)--.*", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "One"
> sub(".*--(.*)$", "\\1", "fgjh=621729_&ioij_fgjh7=twenty--One--Forty")
[1] "Forty"
In one line :
strsplit(sub(".*=(.*)_&.*=(.*)--(.*)--(.*)", "\\1\\|\\2\\|\\3\\|\\4",
"fgjh=621729_&ioij_fgjh7=twenty--One--Forty" ), split="\\|")[[1]]
[1] "621729" "twenty" "One" "Forty"
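For the follow-up case with leading and trailing characters, anchoring on the key names rather than on the string's start and end may be more robust. This is a sketch in base R that assumes the values are separated by "--" (as in the answers above) and that no value contains an underscore.

```r
s <- "lmnop_82137_hhgia=77789_pasdk_ikuk_fgjh=621729_&ioij_fgjh7=twenty--One--Forty_dsaoij_882390=lkuk"

# Match "fgjh=" or "fgjh7=", discard it with \K, then take everything up to the next "_"
fields <- regmatches(s, gregexpr("fgjh7?=\\K[^_]+", s, perl = TRUE))[[1]]

# Split the multi-valued field on the literal "--" delimiter (assumed delimiter)
values <- unlist(strsplit(fields, "--", fixed = TRUE))
print(values)
```

Other key=value pairs in the string, such as hhgia=77789, are ignored because the pattern only accepts the two fgjh keys.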

Resources