Inserting character dynamically into string in R - r

I'm trying to insert a "+" symbol into the middle of a postcode. The postcodes following a pattern of AA111AA or AA11AA. I want the "+" to be inserted before the final number, so an output of either AA11+1AA or AA1+1AA. I've found a way to do this using stringr, but it feels like there's an easier way to do this that how I'm currently doing it. Below is my code.
pc <- "bt43xx"
pc <- str_c(
str_sub(pc, start = 1L, end = -4L),
"+",
str_sub(pc, start = -3L, end = -1L)
)
pc
[1] "bt4+3xx"

Here are some alternatives. All solutions work if pc is a scalar or vector. No packages are needed. Of them (3) seems particularly short and simple.
1) Match everything (.*) up to the last digit (\\d) and then replace that with the first capture (i.e. the match to the part within the first set of parens), a plus and the second capture (i.e. a match to the last digit).
sub("(.*)(\\d)", "\\1+\\2", pc)
2) An alternative which is even shorter is to match a digit followed by a non-digit and replace that with a plus followed by the match:
sub("(\\d\\D)", "+\\1", pc)
## [1] "bt4+3xx"
3) This one is even shorter than (2). It matches the last 3 characters replacing the match with a plus followed by the match:
sub("(...)$", "+\\1", pc)
## [1] "bt4+3xx"
4) This one splits the string into individual characters, inserts a plus in the appropriate position using append and puts the characters back together.
sapply(Map(append, strsplit(pc, ""), after = nchar(pc) - 3, "+"), paste, collapse = "")
## [1] "bt4+3xx"
If pc were known to be a scalar (as is the case in the question) it could be simplified to:
paste(append(strsplit(pc, "")[[1]], "+", nchar(pc) - 3), collapse = "")
[1] "bt4+3xx"

This regular expression with sub and two back references should work.
sub("(\\d?)(\\d[^\\d]*)$", "\\1+\\2", pc)
[1] "bt4+3xx"
\\d? matches 1 or 0 numeric characters, 0-9, and is captured by (). It will match if at least two numeric characters are present.
\\d[^\\d]* matches a numeric character followed by all non numeric characters, and is captured by ()
$ anchors the regular expression to the end of the string
"\\1+\\2" replaces the matched elements in the first two points with themselves and a "+" in the middle.

sub('(\\d)(?=\\D+$)','+\\1',pc,perl=T)

Related

Replace multiple consecutive hyphens in R

I have a string which looks like this:
something-------another--thing
I want to replace the multiple dashes with a single one.
So the expected output would be:
something-another-thing
We can try using sub here:
x <- "something-------another--thing"
gsub("-{2,}", "-", x)
[1] "something-another-thing"
More generally, if we want to replace any sequence of two or more of the same character with just the single character, then use this version:
x <- "something-------another--thing"
gsub("(.)\\1+", "\\1", x)
The second pattern could use an explanation:
(.) match AND capture any single letter
\\1+ then match the same letter, at least one or possibly more times
Then, we replace with just the single captured letter.
you can do it with gsub and using regex.
> text='something-------another--thing'
> gsub('-{2,}','-',text)
[1] "something-another-thing"
t2 <- "something-------another--thing"
library(stringr)
str_replace_all(t2, pattern = "-+", replacement = "-")
which gives:
[1] "something-another-thing"
If you're searching for the right regex to search for a string, you can test it out here https://regexr.com/
In the above, you're just searching for a pattern that is a hyphen, so pattern = "-", but we add the plus so that the search is 'greedy' and can include many hyphens, so we get pattern = "-+"

Extracting string between punctuation, when present

I'm trying to extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
The current code is able to do between : and ; OR after : but cannot handle ; alone or : alone.
Also, gsub(|(OF 100); SEE NOTE) isn't working, and I'm not sure why the initial : isn't being excluded and needs the gsub at all.
test<-c("Score (ABC): 2 (of 100); see note","Amount of ABC; 30%","Presence of ABC: negative","ABC not tested")
#works for :/;
toupper((regmatches(toupper(test), gregexpr(":\\s* \\K.*?(?=;)", toupper(test), perl=TRUE))))
#works for :
test<-toupper((regmatches(toupper(test), gregexpr(":\\s* (.*)", toupper(test), perl=TRUE))))
#removes extra characters:
test<-gsub(": |(OF 100); SEE NOTE|%|; ","",test)
#Negative to numeric:
test[grepl("NEGATIVE|<1",test)]<-0
test
Expected result: 2 30 0
Here are some solutions.
The first two are base. The first only uses very simple regular expressions. The second is shorter and the regular expression is only a bit more complicated. In both cases we return NA if there is no match but you can replace NAs with 0 (using ifelse(is.na(x), 0, x) where x is the answer with NAs) afterwards if that is important to you.
The third is almost the same as the second but uses strapply in gsubfn. It returns 0 instead of NA.
1) read.table Replace all colons with semicolons and read it in as semicolon-separated fields. Pick off the second such field and remove the first non-digit and everything after it. Then convert what is left to numeric.
DF <- read.table(text = gsub(":", ";", test),
as.is = TRUE, fill = TRUE, sep = ";", strip.white = TRUE)
as.numeric(sub("\\D.*", "", DF$V2))
##[1] 2 30 NA
2) strcapture Match from the start characters which are not colon or semicolon and then match a colon or semicolon and then match a space and finally capture digits. Return the captured digits converted to numeric.
strcapture("^[^:;]+[;:] (\\d+)", test, list(num = numeric(0)))$num
##[1] 2 30 NA
3) strapply Using the same pattern as in (2) convert the match to numeric and return 0 if the match is empty.
library(gsubfn)
strapply(test, "^[^:;]+[;:] (\\d+)", as.numeric, simplify = TRUE, empty = 0)
## [1] 2 30 0
Another approach:
out <- gsub('(^.+?[;:][^0-9]+)(\\d+)(.*$)|^.+', '\\2', test)
out[out == ''] <- 0
as.numeric(out)
## [1] 2 30 0
Per the OP's description (italics is mine):
extract a string after a : or ; and before a ; if the 2nd punctuation is present, then to remove everything after a ; if present. Goal result is a number.
I think some of the other suggestions may miss that italicized criteria. So here is the OP's test set with one extra condition at the end to test that:
test<-c( "Score (ABC): 2 (of 100); see note",
"Amount of ABC; 30%",
"Presence of ABC: negative",
"...and before a ; if the second punctuation is present, then remove everything after a ; if present [so 666 should not be returned]")
One-liner to return results as requested:
sub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='\\1',
x=test, perl=TRUE)
Results matching OP's request:
[1] "2" "30" "" ""
If the OP really wants an integer with zeros where no match is found, set the sub() replacement = '0\\1' and wrap with as.integer() as follows:
as.integer( gsub( pattern='.+?[:;]\\D*?[^;](\\d*).*?;*.*',
replacement='0\\1',
x=test, perl=TRUE) )
Result:
[1] 2 30 0 0
Fully working online R (R 3.3.2) example:
https://ideone.com/TTuKzG
Regexp explanation
OP wants to find just one match in a string so the sub() function works just fine.
Technique for using sub() is to make a pattern that matches all strings, but use a capture group in the middle to capture zero or more digits if conditions around it are met.
The pattern .+?[:;]\\D*?[^;](\\d*).*?;*.* is read as follows
.+? Match any character (except for line terminators) + between one and unlimited times ? as few times as possible, expanding as needed (lazy)
[:;] Match a single character in the list between the square brackets, in this case : or ;
\\D Match any character that's NOT a digit (equal to [^0-9])
*? Quantifier * Matches between zero and unlimited times ? as few times as possible, expanding as needed (lazy)
[^;] The ^ hat as first character between square brackets means: Match a single character NOT present in the list between the square brackets, in this case match any character NOT ;
(\d*) Everything between curved brackets is a capturing group - this is the 1st capturing croup: \\d* matches a digit (equal to [0-9]) between zero and unlimited times, as many times as possible(greedy)
;* Match the ; character * between zero and unlimited times [so ; does not have to be present but is matched if it is there: This is the key to excluding anything after the second delimiter as the OP requested]
.* Match any character * between zero and unlimited times, as many times as possible (greedy) [so picks up everything to the end of the line]
The replacement = \\1 refers to the 1st capture group in our pattern. We replace everything that was matched by the pattern with what we found in the capture group. \\d* can match no digits, so will return an empty string if there is no number found where we are expecting it.

Using gsub to replace last occurence of string in R

I have the following character vector than I need to modify with gsub.
strings <- c("x", "pm2.5.median", "rmin.10000m", "rmin.2500m", "rmax.5000m")
Desired output of filtered strings:
"x", "pm2.5.median", "rmin", "rmin", "rmax"
My current attempt works for everything except the pm2.5.median string which has dots that need to be preserved. I'm really just trying to remove the buffer size that is appended to the end of each variable, e.g. 1000m, 2500m, 5000m, 7500m, and 10000m.
gsub("\\..*m$", "", strings)
"x", "pm2", "rmin", "rmin", "rmax"
Match a dot, any number of digits, m and the end of string and replace that with the empty string. Note that we prefer sub to gsub here because we are only interested in one replacement per string.
sub("\\.\\d+m$", "", strings)
## [1] "x" "pm2.5.median" "rmin" "rmin" "rmax"
The .* pattern matches any 0 or more chars, as many as possible. The \..*m$ pattern matches the first (leftmost) . in the string and then grab all the text after it if it ends with m.
You need
> sub("\\.[^.]*m$", "", strings)
[1] "x" "pm2.5.median" "rmin" "rmin" "rmax"
Here, \.[^.]*m$ matches ., then 0 or more chars other than a dot and then m at the end of the string.
See the regex demo.
Details
\. - a dot (must be escaped since it is a special regex char otherwise)
[^.]* - a negated character class matching any char but . 0 or more times
m - an m char
$ - end of string.

Replace all characters between the 3rd occurrence of “-” and the ":" in each element of a vector

Here is what I am trying to do:
Given a string, I want to remove everything after the third occurrence of the '-' and the character — assuming there is a third occurrence, which there may not be.
This is my expected result :
Initial string
yy-aa-bbb-cccc1:HYT => yy-aa-bbb:HYT
yy-aa-vvv-vv:ZTR => yy-aa-vvv:ZTR
yy-aa-ddd:YTLM => yy-aa-ddd:YTLM
Any help?
gsub('(.*-.*-.*)\\-.*(\\:.*)','\\1\\2',string)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
We match two instances of characters that are not a - followed by - ([^-]+-) followed by another set of characters that are not a -, capture it as a group i.e. inside the (), followed by a - and set of characters that are not a : ([^:]+) followed by the second capture group that starts with : ((:.*)) and replace it with the backreference of the capture groups
sub("(([^-]+-){2}[^-]+)-*[^:]+(:.*)", "\\1\\3", str1)
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa-ddd:YTLM"
data
str1 <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM"
Match the the first two fields and everything afterwards to colon and replace that with the first two fields and colon. Note that \w matches any word character and the \ needs to be doubled inside "..." :
sub("(\\w+-\\w+)-.+:", "\\1:", xx)
## [1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa:YTLM"
Note: The input xx in reproducible form is:
xx <- c("yy-aa-bbb-cccc1:HYT", "yy-aa-vvv-vv:ZTR", "yy-aa-ddd:YTLM")
Just throwing a stringi solution in there.
library(stringi)
sub('_.*:' ,':', stri_replace_last_fixed(x, '-', '_'))
#[1] "yy-aa-bbb:HYT" "yy-aa-vvv:ZTR" "yy-aa:YTLM"

Retrieving a specific part of a string in R

I have the next vector of strings
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of the string that is placed after the equal sign meaning I would like to get a vector like this
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr but for it to work I have to know exactly where the equal sign is placed in the string and where the part i want to retrieve ends
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
With Sotos comment to get results as vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package which provides a more powerful regex engine than base R (in particular, supporting look-behind).
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=T)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any alphanumeric character, so you could use "(?<==)\\w+$" (the \ must be escaped so you end up with \\w).

Resources