Replace multiple consecutive hyphens in R

Replace multiple consecutive hyphens in R - r

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing

We can try using sub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single letter
\\1+ then match the same letter, at least one or possibly more times
Then, we replace with just the single captured letter.

you can do it with gsub and using regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"

t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, you're just searching for a pattern that is a hyphen, so pattern = "-", but we add the plus so that the search is 'greedy' and can include many hyphens, so we get pattern = "-+"

Related

regex substitution "." to "_"

I have a specific issue with character substitution in strings:
If I have the following strings
"..A.B....c...A..D.."
"A..S.E.Q.......AW.."
".B.C..a...R......Ds"
Which regex substitution should I use to replace the dots and obtain the following strings:
"A_B_c_A_D"
"A_S_E_Q_AW"
"B_C_a_R_Ds"
I am using R.
Thanks in advance!

Using stringr from the ever fantastic tidyverse.
str1 <- "..A.B....c...A..D.."
str1 %>%
#replace all dots that follow any word character ('\\.' escapes search, '+' matches one or more, '(?<=\\w)' followed by logic)
str_replace_all('(?<=\\w)\\.+(?=\\w)', '_') %>%
#delete remaining dots (i.e. at the start)
str_remove_all('\\.')
As always plenty of ways to skin the cat with regex

Here a solution using gsub in two parts
string = c("..A.B....c...A..D..","A..S.E.Q.......AW..",".B.C..a...R......Ds")
first remove start and end points
string2 = gsub("^\\.+|\\.+$", "", string)
finally replace one or more points with _
string2 = gsub("\\.+", "_", string2)

Using x shown in the Note at the end, use trimws to trim dot off both ends. dot means any character so we have to escape it with backslashes to remove that meaning. Then replace every dot with underscore using chartr. No packages are used.
x |> trimws("both", "\\.") |> chartr(old = ".", new = "_")
## [1] "A_B____c___A__D" "A__S_E_Q_______AW" "B_C__a___R______Ds"
Note
x <- c("..A.B....c...A..D..",
"A..S.E.Q.......AW..",
".B.C..a...R......Ds")

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

I need to capture the numbers out of a string that come after a certain parameter name.
I have it working for most, but there is one parameter that is sometimes at the end of the string, but not always. When using the regular expression, it seems to matter.
I've tried different things, but nothing seems to work in both cases.
# Regular expression to capture the digit after the phrase "AppliedWhenID="
p <- ".*&AppliedWhenID=(.\\d*)"
# Tried this, but when at end, it just grabs a blank
#p <- ".*&AppliedWhenID=(.\\d*)&.*|.*&AppliedWhenID=(.\\d*)$"
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
# What should be returned is "2"
gsub(p, "\\1", testAtEnd) # works
gsub(p, "\\1", testNotAtEnd) # doesn't work, it captures 2 + &AgDateTypeID=1

Note that sub and gsub replace the found text(s), thus, in order to extract a part of the input string with a capturing group + a backreference, you need to actually match (and consume) the whole string.
Hence, you need to match the string to the end by adding .* at the end of the pattern:
p <- ".*&AppliedWhenID=(\\d+).*"
sub(p, "\\1", testNotAtEnd)
# => [1] "2"
sub(p, "\\1", testAtEnd)
# => [1] "2"
See the regex demo and the R online demo.
Note that gsub matches multiple occurrences, you need a single one, so it makes sense to replace gsub with sub.
Regex details
.* - any zero or more chars as many as possible
&AppliedWhenID= - a &AppliedWhenID= string
(\d+) - Group 1 (\1): one or more digits
.* - any zero or more chars as many as possible.

You could try using the string look behind conditional "(?<=)" and str_extract() from the stringr library.
testAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2"
testNotAtEnd <- "ReportType=233&ReportConID=171&MonthQuarterYear=0TimePeriodLabel=Year%202020&AppliedWhenID=2&AgDateTypeID=1"
p <- "(?<=AppliedWhenID=)\\d+"
# What should be returned is "2"
library(stringr)
str_extract(testAtEnd, p)
str_extract(testNotAtEnd, p)
Or in base R
p <- ".*((?<=AppliedWhenID=)\\d+).*"
gsub(p, "\\1", testAtEnd, perl=TRUE)
gsub(p, "\\1", testNotAtEnd, perl=TRUE)

Camel Case format conversion using regular expressions in R

I have two related questions regarding regular expressions in R:
[1]
I would like to convert sub-strings, containing punctuation followed by a letter, to an upper case letter.
Example:
Dr_dre to: DrDre
Captain.Spock to: CaptainSpock
spider-man to: spiderMan
[2]
I would like convert camel case strings to lower case strings with underscore delimiter.
Example:
EndOfFile to: End_of_file
CamelCase to: Camel_Case
ABC to: A_B_C
Thanks much,
Kamashay

We can use sub. We match one or more punctuation characters ([[:punct:]]+) followed by a single character which is captured as a group ((.)). In the replacement, the backreference for the capture group (\\1) is changed to upper case (\\U).
sub("[[:punct:]]+(.)", "\\U\\1", str1, perl = TRUE)
#[1] "DrDre" "CaptainSpock" "spiderMan"
For the second case, we use regex lookarounds i.e. match a letter ((?<=[A-Za-z])) followed by a capital letter and replace with _.
gsub("(?<=[A-Za-z])(?=[A-Z])", "_", str2, perl = TRUE)
#[1] "End_Of_File" "Camel_Case" "A_B_C"
data
str1 <- c("Dr_dre", "Captain.Spock", "spider-man")
str2 <- c("EndOfFile", "CamelCase", "ABC")

How to delete string or digits after certain pattern?

If there is a vector x that is,
x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
Is there a way to delete the following numbers after 'ad_'?
so the converted x appears as
'/name12/?ad_' '/name13/?ad_' '/name14/?ad_'
I was trying to use gsub function but it didn't work because of the digits followed by 'name'.

You may use a regex with sub (since you perform a single search and replace, you do not need gsub) and use a pattern depending on what you need to include or exclude in the result.
You might use "(\\?ad_)[0-9]+$" to remove ?ad_ + digits and replace with "\\1" to restore the ?ad_ value, or just match the _ and then digits (and replace with _).
See demo code:
> x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
> sub("(\\?ad_)[0-9]+$", "\\1", x)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
> sub("_[0-9]+$", "_", x)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
See the regex demo
Pattern details:
_ - matches an underscore
[0-9]+ - 1 or more (due to the + quantifier matching one or more occurrences, as many as possible)
$ - the end of string.

Since the prefix is the same length for all of them:
x <- c('/name12/?ad_2','/name13/?ad_3','/name14/?ad_4')
substr(x,1,12)
[1] "/name12/?ad_" "/name13/?ad_" "/name14/?ad_"
Otherwise I would grep it.

Retrieving a specific part of a string in R

I have the next vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr but for it to work I have to know exactly where the equal sign is placed in the string and where the part i want to retrieve ends

We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")

Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"

Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
With Sotos comment to get results as vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})

Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package which provides a more powerful regex engine than base R (in particular, supporting look-behind).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=T)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any alphanumeric character, so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Replace multiple consecutive hyphens in R - r

I have a string which looks like this: something-------another--thing I want to replace the multiple dashes with a single one. So the expected output would be: something-another-thing

you can do it with gsub and using regex. > text='something-------another--thing' > gsub('-{2,}','-',text) [1] "something-another-thing"

Related

regex substitution "." to "_"

R Use Regular Expression to capture number when sometimes the capture is at the end of the string or not

Camel Case format conversion using regular expressions in R

How to delete string or digits after certain pattern?

Retrieving a specific part of a string in R

Categories

Resources