I thought this would be simpler. Some of my strings do not end in ':', and others have a ':' somewhere inside the string. I want to append ':' to the strings that don't end in ':', and leave alone the strings that have a ':' inside.
words
[1] "Bajos" "Ascensor" "habs.:3"
gsub('\\b(?!:)', '\\1:', words, perl = TRUE)
[1] ":Bajos:" ":Ascensor:" ":habs:.::3:"
grep('\\W', words)
[1] 3
grep('\\w', words)
[1] 1 2 3 # ?
Desired output:
'Bajos:' 'Ascensor:' 'habs.:3'
sub("^([^:]*)$", "\\1:", words)
# [1] "Bajos:" "Ascensor:" "habs.:3"
or
nocolon <- !grepl(":", words)
words[nocolon] <- paste0(words[nocolon], ":")
words
# [1] "Bajos:" "Ascensor:" "habs.:3"
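As a quick sanity check, the two approaches above can be compared on a slightly larger vector. This is only a sketch: the fourth element, "piso:", is a made-up test value (a string that already ends in a colon) and is not part of the question.

```r
# The first three elements come from the question; "piso:" is a hypothetical
# extra value that already ends in ":".
words <- c("Bajos", "Ascensor", "habs.:3", "piso:")

# Approach 1: the pattern only matches strings with no ":" anywhere,
# so any string containing a colon is returned unchanged.
res_sub <- sub("^([^:]*)$", "\\1:", words)

# Approach 2: flag the colon-free strings, then append ":" to just those.
nocolon <- !grepl(":", words, fixed = TRUE)
res_idx <- words
res_idx[nocolon] <- paste0(res_idx[nocolon], ":")

print(res_sub)
print(identical(res_sub, res_idx))
```

Both approaches treat "contains a colon anywhere" as the condition for leaving a string alone, which is why they agree here.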
Use
"(\\p{L}+)\\b(?![\\p{P}\\p{S}])"
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
(\p{L}+) one or more letters (group #1)
--------------------------------------------------------------------------------
\b word boundary
--------------------------------------------------------------------------------
(?![\p{P}\p{S}]) no punctuation allowed on the right
--------------------------------------------------------------------------------
R code snippet:
gsub("(\\p{L}+)\\b(?![\\p{P}\\p{S}])", "\\1:", words, perl=TRUE)
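Note that this pattern works per word rather than per string, so it may behave differently from an anchored pattern when an element contains several words. A small sketch, where the "planta baja" input is an invented example and not from the question:

```r
pattern <- "(\\p{L}+)\\b(?![\\p{P}\\p{S}])"

# Vector from the question
words <- c("Bajos", "Ascensor", "habs.:3")
res_words <- gsub(pattern, "\\1:", words, perl = TRUE)
print(res_words)

# Invented multi-word input: every run of letters that is not followed by
# punctuation or a symbol gets its own ":".
multi <- "planta baja"
res_multi <- gsub(pattern, "\\1:", multi, perl = TRUE)
print(res_multi)
```

If only whole strings should receive a trailing colon, an anchored pattern is probably the safer choice.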
Trying to figure out how to pull everything after the second number using the sub function in R. I understand the basics with the lazy and greedy matching, but how do I take it one step further and pull everything after the second number?
str <- 'john02imga-04'
#lazy: pulls everything after first number
sub(".*?[0-9]", "", str)
#output: "2imga-04"
#greedy: pulls everything after last number
sub(".*[0-9]", "", str)
#output: ""
#desired output: "imga-04"
You can use
sub("\\D*[0-9]+", "", str)
## Or,
## sub("\\D*\\d+", "", str)
## => [1] "imga-04"
See the regex demo. Also, see the R demo online.
sub will find and replace the first occurrence of
\D* (=[^0-9]*) - any zero or more non-digit chars
[0-9]+ (=\d+) - one or more digits.
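To see which part of the string that pattern consumes, the match can be extracted instead of replaced. A minimal sketch in base R, using the string from the question:

```r
str <- "john02imga-04"

# First match of \D*[0-9]+ : the leading non-digits plus the first run of digits
removed <- regmatches(str, regexpr("\\D*[0-9]+", str))
print(removed)

# sub() deletes that first match and keeps whatever follows it
kept <- sub("\\D*[0-9]+", "", str)
print(kept)
```

Here `removed` is the "john02" prefix and `kept` is the remainder, "imga-04".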
Alternative ways
Match one or more letters, -, one or more digits at the end of the string:
> regmatches(str, regexpr("[[:alpha:]]+-\\d+$", str))
[1] "imga-04"
> library(stringr)
> str_extract(str, "\\p{L}+-\\d+$")
[1] "imga-04"
You can use a capture group for the second part and use that in the replacement
^\D+\d+(\D+\d+)
^ Start of string
\D+\d+ Match 1+ non digits, then 1+ digits
(\D+\d+) Capture group 1, match 1+ non digits and match 1+ digits
Regex demo | R demo
str <- 'john02imga-04'
sub("^\\D+\\d+(\\D+\\d+)", "\\1", str)
Output
[1] "imga-04"
If you want to remove all after the second number:
^\D+\d+(\D+\d+).*
Regex demo
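For illustration, here is that variant on a hypothetical input with extra text after the second number (the trailing "_final.png" is made up for this sketch):

```r
# Hypothetical input: the question's string plus an invented suffix
str2 <- "john02imga-04_final.png"

# Without the trailing .* the suffix is left in place
without_tail <- sub("^\\D+\\d+(\\D+\\d+)", "\\1", str2)
print(without_tail)

# With the trailing .* the whole string is matched, so only group 1 survives
with_tail <- sub("^\\D+\\d+(\\D+\\d+).*", "\\1", str2)
print(with_tail)
```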
Alternatively, to get only the match, enable PCRE with perl=T and use \K to discard everything matched so far:
str <- 'john02imga-04'
regmatches(str, regexpr("^\\D+\\d+\\K\\D+\\d+", str, perl = T))
Output
[1] "imga-04"
See an R demo
I'm trying to extract twitter handles from tweets using R's stringr package. For example, suppose I want to get all words in a vector that begin with "A". I can do this like so
library(stringr)
# Get all words that begin with "A"
str_extract_all(c("hAi", "hi Ahello Ame"), "(?<=\\b)A[^\\s]+")
[[1]]
character(0)
[[2]]
[1] "Ahello" "Ame"
Great. Now let's try the same thing using "#" instead of "A"
str_extract_all(c("h#i", "hi #hello #me"), "(?<=\\b)\\#[^\\s]+")
[[1]]
[1] "#i"
[[2]]
character(0)
Why does this example give the opposite of the result I was expecting, and how can I fix it?
It looks like you probably mean
str_extract_all(c("h#i", "hi #hello #me", "#twitter"), "(?<=^|\\s)#[^\\s]+")
# [[1]]
# character(0)
# [[2]]
# [1] "#hello" "#me"
# [[3]]
# [1] "#twitter"
The \b in a regular expression is a word boundary, and it occurs "Between two characters in the string, where one is a word character and the other is not a word character." (see here). Since a space and "#" are both non-word characters, there is no boundary before the "#".
With this revision you match either the start of the string or values that come after spaces.
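The boundary rule can be checked directly. This sketch uses grepl with perl = TRUE to test whether a word boundary exists immediately before a "#"; the three test strings are illustrative only:

```r
x <- c("h#i", "hi #hello", "#twitter")

# \b before "#" only exists when the preceding character is a word character
has_boundary <- grepl("\\b#", x, perl = TRUE)
print(has_boundary)

# Start of string or whitespace before "#", as in the revised pattern
at_start_or_space <- grepl("(^|\\s)#", x, perl = TRUE)
print(at_start_or_space)
```

The two results are complementary for these inputs: the only string where \b# matches is the one where the "#" sits inside a word.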
A couple of things about your regex:
(?<=\b) is the same as \b because a word boundary is already a zero width assertion
\# is the same as #, as # is not a special regex metacharacter and you do not have to escape it
[^\s]+ is the same as \S+, almost all shorthand character classes have their negated counterparts in regex.
So, your regex, \b#\S+, matches #i in h#i because there is a word boundary between h (a letter, a word char) and # (a non-word char, not a letter, digit or underscore). Check this regex debugger.
\b is an ambiguous pattern whose meaning depends on the regex context. In your case, you might want to use \B, a non-word boundary, that is, \B#\S+, and it will match # that are either preceded with a non-word char or at the start of the string.
x <- c("h#i", "hi #hello #me")
regmatches(x, gregexpr("\\B#\\S+", x))
## => [[1]]
## character(0)
##
## [[2]]
## [1] "#hello" "#me"
See the regex demo.
If you want to get rid of this \b/\B ambiguity, use unambiguous word boundaries using lookarounds with stringr methods or base R regex functions with perl=TRUE argument:
regmatches(x, gregexpr("(?<!\\w)#\\S+", x, perl=TRUE))
regmatches(x, gregexpr("(?<!\\S)#\\S+", x, perl=TRUE))
where:
(?<!\w) - an unambiguous starting word boundary - is a negative lookbehind that makes sure there is a non-word char immediately to the left of the current location or start of string
(?<!\S) - a whitespace starting word boundary - is a negative lookbehind that makes sure there is a whitespace char immediately to the left of the current location or start of string.
See this regex demo and another regex demo here.
Note that the corresponding right hand boundaries are (?!\w) and (?!\S).
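As a sketch of how a left-hand and a right-hand whitespace boundary combine, the pattern below only keeps hashtags that form a complete whitespace-delimited token. The input string and the choice of \w+ for the tag body are assumptions made for this example:

```r
# Invented input: "#me!" is followed by punctuation, the others by space or end
x <- "hi #hello #me! #ok"

# (?<!\S) : start of string or whitespace on the left
# (?!\S)  : end of string or whitespace on the right
tags <- regmatches(x, gregexpr("(?<!\\S)#\\w+(?!\\S)", x, perl = TRUE))[[1]]
print(tags)
```

"#me!" is skipped because the character after the tag body is neither whitespace nor the end of the string.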
The answer above should suffice. This will remove the # symbol in case you are trying to get the users' names only.
str_extract_all(c("#tweeter tweet", "h#is", "tweet #tweeter2"), "(?<=\\B\\#)[^\\s]+")
[[1]]
[1] "tweeter"
[[2]]
character(0)
[[3]]
[1] "tweeter2"
While I am no expert with regex, it seems the issue is that the # symbol is not a word character, so \\b (which matches the empty string at the edge of a word) does not match in front of a # that follows a space or the start of the string.
Here are two great regex resources in case you hadn't seen them:
stat545
Stringr's Regex page, also available as a vignette:
vignette("regular-expressions", package = "stringr")
Question
Using a regular expression, how do I keep all digits when splitting a string?
Overview
I would like to split each element within the character vector sample.text into two elements: one of only digits and one of only the text.
Current Attempt is Dropping Last Digit
This regular expression - \\d\\s{1} - inside of base::strsplit() removes the last digit. Below is my attempt, along with my desired output.
# load necessary data -----
sample.text <-
c("111110 Soybean Farming", "0116 Soybeans")
# split string by digit and one space pattern ------
strsplit(sample.text, split = "\\d\\s{1}")
# [[1]]
# [1] "11111" "Soybean Farming"
#
# [[2]]
# [1] "011" "Soybeans"
# desired output --------
# [[1]]
# [1] "111110" "Soybean Farming"
#
# [[2]]
# [1] "0116" "Soybeans"
# end of script #
Any advice on how I can split sample.text to keep all digits would be much appreciated! Thank you.
Because you're splitting on \\d, the digit there is consumed in the regex, and not present in the output. Use lookbehind for a digit instead:
strsplit(sample.text, split = "(?<=\\d) ", perl=TRUE)
http://rextester.com/GDVFU71820
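The lookbehind split can be checked against the desired output from the question. A minimal sketch; the `codes` and `labels` vectors are just illustrative names for the two pieces:

```r
sample.text <- c("111110 Soybean Farming", "0116 Soybeans")

# Split on a space that is preceded by a digit; the digit itself is not consumed
parts <- strsplit(sample.text, split = "(?<=\\d) ", perl = TRUE)
print(parts)

# Optionally pull the two pieces into separate vectors
codes  <- vapply(parts, `[`, character(1), 1)
labels <- vapply(parts, `[`, character(1), 2)
print(codes)
print(labels)
```

The space inside "Soybean Farming" is not a split point because it follows a letter, not a digit.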
Some alternative solutions, using very simple pattern matching on the first occurrence of space:
1) Indirectly by using sub to substitute your own delimiter, then strsplit on your delimiter:
E.g. you can substitute ';' for the first space (if you know that character does not exist in your data):
strsplit( sub(' ', ';', sample.text), split=';')
2) Using regexpr and regmatches
You can effectively match on the first " " (space character), and split as follows:
regmatches(sample.text, regexpr(" ", sample.text), invert = TRUE)
Result is a list, if that's what you are after per your sample desired output:
[[1]]
[1] "111110" "Soybean Farming"
[[2]]
[1] "0116" "Soybeans"
3) Using stringr library:
library(stringr)
str_split_fixed(sample.text, " ", 2) #outputs a character matrix
[,1] [,2]
[1,] "111110" "Soybean Farming"
[2,] "0116" "Soybeans"
I have an input vector as follows:
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)", "fdsfs thistoo (1,1,1,1)")
And I would like to use a regex to extract the following:
> output
[1] "iwantthis iwantthisaswell" "thistoo"
I have managed to extract every word that is before an opening bracket.
I tried this to get only the first word:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1", input)
[1] "iwantthis" "thistoo"
But I cannot get it to work for multiple occurrences:
> gsub(".*?[[:space:]](.*?)[[:space:]]\\(.*?[[:space:]](.*?)[[:space:]]\\(.*", "\\1 \\2", input)
[1] "iwantthis iwantthisaswell" "fdsfs thistoo (1,1,1,1)"
The closest I have managed is the following:
library(stringr)
> str_extract_all(input, "(\\S*)\\s\\(")
[[1]]
[1] "iwantthis (" "iwantthisaswell ("
[[2]]
[1] "thistoo ("
I am sure I am missing something in my regex (not that good at it) but what?
You may use
> sapply(str_extract_all(input, "\\S+(?=\\s*\\()"), paste, collapse=" ")
[1] "iwantthis iwantthisaswell" "thistoo"
See the regex demo. The \\S+(?=\\s*\\() will extract all 1+ non-whitespace chunks from a text before a ( char preceded with 0+ whitespaces. sapply with paste will join the found matches with a space (with collapse=" ").
Pattern details
\S+ - 1 or more non-whitespace chars
(?=\s*\() - a positive lookahead ((?=...)) that requires the presence of 0+ whitespace chars (\s*) and then a ( char (\() immediately to the right of the current position.
Here is an option using base R
unlist(regmatches(input, gregexpr("\\w+(?= \\()", input, perl = TRUE)))
#[1] "iwantthis" "iwantthisaswell" "thistoo"
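The unlist call above flattens everything into one vector. To get one string per input element, as in the desired output, the matches can be pasted together per element instead. A sketch in base R, reusing the same pattern:

```r
input <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
           "fdsfs thistoo (1,1,1,1)")

# One character vector of matches per input element
matches <- regmatches(input, gregexpr("\\w+(?= \\()", input, perl = TRUE))

# Collapse each element's matches into a single space-separated string
output <- vapply(matches, paste, character(1), collapse = " ")
print(output)
```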
This works in R:
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', input, perl=TRUE)
Result:
[1] "iwantthis iwantthisaswell" "thistoo"
UPDATED to work for the general case. E.g. now finds "i_wantthisaswell2" by searching on non-spaces between the other matches.
Using other suggested general case inputs:
general_cases <- c("fdsfs iwantthis (1,1,1,1) fdsaaa iwantthisaswell (2,3,4,5)",
"fdsfs thistoo (1,1,1,1) ",
"GaGa iwant_this (1,1,1,1)",
"lal2!##$%^&*()_+a i_wantthisaswell2 (2,3,4,5)")
gsub('\\w.+? ([^\\s]+) \\(.+?\\)','\\1', general_cases, perl=TRUE)
results:
[1] "iwantthis iwantthisaswell" "thistoo "
[3] "iwant_this" "i_wantthisaswell2"
I have a list of strings as follows: "/home/ricardo/MultiClass/data//F10/1036.txt"
> library(stringr)
> strsplit(cls[1], split= "/")
gives me:
#> [[1]]
#> [1] "" "home" "ricardo" "MultiClass" "data"
#> [6] "" "F10" "1036.txt"
How can I keep only the 7th position?
#> "F10"
If you want to extract one or more chars after // up to the first / or end of string use
> library(stringr)
> s <- "/home/ricardo/MultiClass/data//F10/1036.txt"
> str_extract(s, "(?<=//)[^/]+")
[1] "F10"
The (?<=//)[^/]+ regex pattern will find a position that is preceded with 2 slashes (see (?<=//)) and then matches one or more characters other than / (see [^/]+).
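If stringr is not available, the same lookbehind pattern should also work in base R, since regexpr accepts PCRE syntax with perl = TRUE. A minimal sketch:

```r
s <- "/home/ricardo/MultiClass/data//F10/1036.txt"

# regexpr finds the first match; regmatches extracts it
folder <- regmatches(s, regexpr("(?<=//)[^/]+", s, perl = TRUE))
print(folder)
```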
A base R solution with sub will look like
> sub("^.*/([^/]*)/[^/]*$", "\\1", s)
[1] "F10"
Details:
^ - start of string
.* - any 0+ chars as many as possible
/ - a slash (last but one in the string as the previous pattern is greedy)
([^/]*) - capturing group #1 matching any 0+ chars other than /
/ - last slash
[^/]* - any 0+ chars other than /
$ - end of string.
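Two regex-free alternatives, shown here only as a sketch: since the string is a file path, the name of the containing directory can be taken with basename and dirname, or the 7th piece of the split can be indexed directly as the question asks. The variable names are illustrative.

```r
s <- "/home/ricardo/MultiClass/data//F10/1036.txt"

# Name of the directory that contains the file
by_path <- basename(dirname(s))
print(by_path)

# 7th piece after splitting on "/" (the empty strings count as pieces)
by_split <- strsplit(s, "/", fixed = TRUE)[[1]][7]
print(by_split)
```

The path-based version does not depend on how many directories precede the folder, whereas the index-based version assumes the folder is always the 7th piece.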
Using function word of stringr,
library(stringr)
word(sub('.*//', '', s), 1, sep = '/')
#[1] "F10"
#where
s <- '/home/ricardo/MultiClass/data//F10/1036.txt'
It can be done in base R this way.
I have defined the function gret to extract a pattern from a string
gret <- function(pattern, text, ignore.case = TRUE) {
  # return the first match of a PCRE pattern in text
  regmatches(text, regexpr(pattern, text, perl = TRUE, ignore.case = ignore.case))
}
then
gsub("data|/*", "", gret("(?=data/).*(?<=/)", "/home/ricardo/MultiClass/data//F10/1036.txt"))
#>[1] "F10"