Replacing the second occurrence of a string with a different string - R

I have a very simple issue with replacing the second occurrence of a string with a new string.
Let's say we have this string:
string <- c("A12A32")
and we want to replace the second A with B. The expected output is A12B32.
Following this relevant post,
How to replace second or more occurrences of a dot from a column name
I tried:
replace_second_A <- sub("(\\A)\\A","\\1B", string)
print(replace_second_A)
[1] "A12A32"
It seems the second A is unchanged. Why?

Note that .*? matches the shortest string until the next A:
string <- "A12A32"
sub("(A.*?)A", "\\1B", string)
## [1] "A12B32"
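To see why the lazy quantifier matters, here is a small sketch comparing .*? with the greedy .* on a string that has three A's. The input "A12A32A5" is made up for illustration, and perl = TRUE is my addition so that both calls use PCRE quantifier semantics; the call above uses R's default engine.

```r
# Hypothetical input with three A's (not from the question)
s <- "A12A32A5"

# Lazy: (A.*?) stops at the earliest following A, so the second A is replaced
lazy <- sub("(A.*?)A", "\\1B", s, perl = TRUE)

# Greedy: (A.*) runs up to the last A, so the final A is replaced instead
greedy <- sub("(A.*)A", "\\1B", s, perl = TRUE)

print(lazy)    # "A12B32A5"
print(greedy)  # "A12A32B5"
```

With only two A's in the string, as in the question, both forms give the same result; the difference only shows up once a third A is present.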

First, there is no need to escape the letter A with backslashes. They are only required to escape special characters that have other meanings, e.g. "." means "any character" while "\\." means "period".
Second, your regular expression "(\\A)\\A" reads "match A followed by another A, keeping the first A for reuse." You don't have two consecutive "A"s; they are separated by digits.
So this works ("\\d+" means "match 1 or more digits"):
sub("(A\\d+)A","\\1B", "A12A32")
[1] "A12B32"
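The "\\d+" version only works when nothing but digits sits between the two A's. As a hedged sketch, a negated character class "[^A]*" could be used when the separator may contain other characters; the input "A1xA32" below is a hypothetical example, not from the question.

```r
string <- "A12A32"
mixed  <- "A1xA32"  # hypothetical input with a letter between the A's

# "\\d+" requires one or more digits, and only digits, between the A's
digits_ok    <- sub("(A\\d+)A", "\\1B", string)
digits_mixed <- sub("(A\\d+)A", "\\1B", mixed)   # no match, so unchanged

# "[^A]*" accepts any run of characters other than A between the A's
class_mixed  <- sub("(A[^A]*)A", "\\1B", mixed)

print(digits_ok)     # "A12B32"
print(digits_mixed)  # "A1xA32"
print(class_mixed)   # "A1xB32"
```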

Related

conditionally removing first two letter of every entry in a string in R

I have a vector with some codes. However, for an unknown reason, some of the codes start with x# (# being a digit 0-9). If a vector item starts with x#, I need to remove its first two characters.
Examples:
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
expectedResult <- c('fa319-432f39-4fre78', '23weq0-4fsf198-417203', '431-5435-1242-qewf')
I tried using str_replace and gsub, but I couldn't get it right:
gsub("X\\d", "", codes)
but this would remove the x# even if it were in the middle of the string.
Any ideas?
You can use
codes <- c('x0fa319-432f39-4fre78', '23weq0-4fsf198-417203', 'x2431-5435-1242-qewf')
sub("^x\\d", "", codes, ignore.case=TRUE)
The ^x\d pattern matches x and any digit at the start of a string.
sub replaces the first occurrence only.
ignore.case=TRUE enables case insensitive matching.
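The three points above can be checked with a short sketch that runs the anchored sub next to an unanchored gsub. The last two codes ("ab-x1cd" and "X9abc") are hypothetical additions to show the middle-of-string and upper-case cases.

```r
codes <- c("x0fa319-432f39-4fre78", "23weq0-4fsf198-417203",
           "x2431-5435-1242-qewf",
           "ab-x1cd", "X9abc")  # last two are hypothetical extra cases

# Anchored: only an x followed by a digit at the very start is removed
anchored <- sub("^x\\d", "", codes, ignore.case = TRUE)

# Unanchored: an x followed by a digit is removed wherever it occurs
unanchored <- gsub("x\\d", "", codes, ignore.case = TRUE)

print(anchored)
# "fa319-432f39-4fre78" "23weq0-4fsf198-417203" "431-5435-1242-qewf" "ab-x1cd" "abc"
print(unanchored[4])
# "ab-cd"
```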

Regex returns digit string with leading "_"

I am using an R script in the PowerBI Query Editor to find a six-digit numeric string in a description column and add it as a new column to the table. It works EXCEPT where the number string is preceded by a "_" (underscore character).
# 'dataset' holds the input data for this script ##
library(stringr)
# assign regex to variable #
pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"
# define function to use pattern ##
isNewSiteNum = function(x) substr(str_extract(x,pattern),1,6)
# output statement - within adds new column to dataset ##
output <- within(dataset,{NewSiteNum=isNewSiteNum(dataset$LineItemComment)})
The number string can be at the start, at the end, or in the middle of the description text. When the number string is preceded by an underscore (_123456 for example), the regex returns _12345 instead of 123456. I am not sure how to tell it to skip the underscore but still grab the six digits (without breaking the cases with no leading underscore that currently work).
regex101.com shows the full match as '_123456' and group 1 as '123456', but my result column has '_12345'. For the case with a leading space the full match is ' 123456', yet my result column is correct. I seem to be missing something, since the full match has 7 characters and the desired group 1 has 6.
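The problem was with str_extract, which returns the whole match (including the leading non-digit consumed by (?:^|\\D)) rather than the capture group, so I could not get it to work. However, by using str_match and selecting the group column I get what I am looking for.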
# 'dataset' holds input data
library(stringr)
pattern<-"(?:^|\\D)(\\d{6})(?!\\d)"
SiteNum = function(x) str_match(x, pattern)[,2]
output<-within(dataset,{R_SiteNum2=SiteNum(dataset$ReqComments)})
This does not pick up the non-numeric leading character.
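A minimal sketch of the difference between the two stringr functions, using the same pattern. The comments vector is hypothetical sample text, not the original dataset.

```r
library(stringr)

pattern <- "(?:^|\\D)(\\d{6})(?!\\d)"

# Hypothetical comments, not from the original dataset
comments <- c("123456 at start", "ref_123456 mid", "no site number")

# str_extract returns the whole match, including the leading non-digit
full <- str_extract(comments, pattern)

# Taking the first six characters of the whole match cuts off the last digit
truncated <- substr(full, 1, 6)

# str_match returns a matrix: column 1 is the whole match, column 2 is group 1
group <- str_match(comments, pattern)[, 2]

print(full)       # "123456"  "_123456" NA
print(truncated)  # "123456"  "_12345"  NA
print(group)      # "123456"  "123456"  NA
```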

Replace every single character at the start of string that matches a regex pattern

I want to replace every single character in a string that matches a certain pattern. Take the following string
mystring <- c("000450")
I want to match all single zeros up to the first element that is non-zero. I tried something like
gsub("^0[^1-9]*", "x", mystring)
[1] "x450"
This expression replaces all the leading zeros with a single x. But instead, I want to replace all three leading zeros with xxx. The preferred result would be
[1] "xxx450"
Can anyone help me out?
You may use
mystring <- c("000450")
gsub("\\G0", "x", mystring, perl=TRUE)
## => [1] "xxx450"
The \\G0 regex matches a 0 at the start of the string, and any 0 that comes immediately after a previous successful match.
Details
\G - an anchor that matches ("asserts") the position at the start of the string or right after a successful match
0 - a 0 char.
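As a rough illustration of the \G anchor, the sketch below compares it with a plain "0" pattern. The second input, "0040050", is a hypothetical value added to show that zeros after the first non-zero character are left alone.

```r
s <- c("000450", "0040050")  # second value is hypothetical

# \G chains each match to the previous one, so only the leading run is replaced
leading_only <- gsub("\\G0", "x", s, perl = TRUE)

# Without \G every zero is replaced, wherever it occurs
every_zero <- gsub("0", "x", s)

print(leading_only)  # "xxx450" "xx40050"
print(every_zero)    # "xxx45x" "xx4xx5x"
```

Note that \G needs perl = TRUE, as in the call above.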

How to extract first occurrence of alphabets in a string in R?

I have a character column having values like "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE". I want to extract characters "CHELSEAFC", "BARCAFC" and so on. Currently I am using
regmatches(x$symbol,regexpr("[A-z]+",x$symbol))
but getting an error:
Error in $<-.data.frame(*tmp*, "cg", value = c("CHELSEAFC",
"CHELSEAFC", "TOTTENHAMFC", : replacement has 11366767 rows, data
has 11366772 Calls: $<- -> $<-.data.frame Execution halted
I can't seem to find the problem row. Please somebody help with debugging or suggest a better way to do this :)
Assuming that we need to extract the non-numeric part, one option is to remove the other characters by matching one or more digits ([0-9]+) followed by any other characters (.*) and replacing the match with "":
sub("[0-9]+.*", "", str1)
#[1] "CHELSEAFC" "BARCAFC"
Or capture the upper case letters as a group (([A-Z]+)) from the start (^) of the string and replace it with the backreference (\\1) for that group
sub("^([A-Z]+).*", "\\1", str1)
#[1] "CHELSEAFC" "BARCAFC"
data
str1 <- c( "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
Instead of [A-z]+ you should use ^[A-Za-z]+. See this answer for why [A-z] should be avoided: https://stackoverflow.com/a/29771926/4082217
The error appears because some values in the input vector contain neither letters nor any of the extra symbols that [A-z] matches. regmatches returns nothing for an element with no match, so the result is shorter than the input and cannot be assigned as a column: the number of matches does not equal the number of rows in the data frame.
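The length mismatch is easy to reproduce on a small vector. The sketch below also shows one possible base R way to keep the result aligned with the input by filling non-matches with NA; the variable names are illustrative only.

```r
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")

# regexpr returns -1 for elements without a match
m <- regexpr("^[A-Za-z]+", x)

# regmatches silently drops those elements, so the result is shorter than x
matched <- regmatches(x, m)
print(length(x))        # 3
print(length(matched))  # 2

# Keep alignment: start from all NA and fill in only the matched positions
aligned <- rep(NA_character_, length(x))
aligned[m > 0] <- matched
print(aligned)          # NA "CHELSEAFC" "BARCAFC"
```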
What you may do is:
1) Use sub
x <- c("------", "CHELSEAFC17FEB640CE", "BARCAFC17FEB1400CE")
sub("^([a-zA-Z]+).*|.*", "\\1", x)
## [1] "" "CHELSEAFC" "BARCAFC"
# On the data frame from the question (where x is the data frame):
x$symbol <- sub("^([a-zA-Z]+).*|.*", "\\1", x$symbol)
The ^([a-zA-Z]+).*|.* pattern has two branches. The first matches and captures one or more ASCII letters at the start of the string (^), with .* matching the rest of the string (replace [a-zA-Z]+ with [[:alpha:]]+ to match non-ASCII letters, too). Otherwise (|), the second branch .* matches the whole string. Either way the match is replaced with the contents of the capturing group, so the result is either the leading letters or an empty string.
2) If you want to keep NA for the values with no match, use stringr str_extract:
library(stringr)
x$symbol <- str_extract(x$symbol, "^[A-Za-z]+")
## => 1 <NA>
## 2 CHELSEAFC
## 3 BARCAFC
Note that ^[A-Za-z]+ matches 1+ ASCII letters ([A-Za-z]+) at the start of the string only (^).

get character before second underscore [duplicate]

What regular expression can retrieve (e.g. with sub()) the characters before the second period? Given a character vector like:
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")
I would like to have returned this:
[1] "m_s.E1" "m_xs.P1"
sub("(^[^.]+[.][^.]+)(.+$)", "\\1", v)
[1] "m_s.E1" "m_xs.P1"
Now to explain it: The symbols inside the first and third paired "[ ]" match any character except a period ("character classes"), and the "+"s that follow them allow an arbitrary number of such characters. The [.] therefore matches only the first period, and the second period terminates the match. Parenthesis pairs let you specify partial sections of the matched characters, and there are two sections here. The second section is any character (the period symbol) repeated an arbitrary number of times until the end of the string, $. The "\\1" specifies only the first partial match as the returned value.
The ^ operator means different things inside and outside the square-brackets. Outside it refers to the length-zero beginning of the string. Inside at the beginning of a character class specification, it is the negation operation.
This is a good use case for "character classes" which are described in the help page found by typing:
?regex
Not regex, but the qdap package has beg2char ("beginning of string to n-th character") to handle this:
library(qdap)
beg2char(v, ".", 2)
## [1] "m_s.E1" "m_xs.P1"
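If installing qdap is not an option, a similar result can be sketched in base R by splitting on the literal period and re-joining the first n pieces. The helper name before_nth and its arguments are hypothetical, introduced only for this example.

```r
v <- c("m_s.E1.m_x.R1PE1", "m_xs.P1.m_s.R2E12")

# Hypothetical helper: keep everything before the n-th occurrence of sep
before_nth <- function(x, sep = ".", n = 2) {
  # fixed = TRUE treats sep as a literal string, not a regex
  parts <- strsplit(x, sep, fixed = TRUE)
  # Re-join the first n pieces of each element with the same separator
  vapply(parts, function(p) paste(head(p, n), collapse = sep), character(1))
}

result <- before_nth(v, ".", 2)
print(result)  # "m_s.E1" "m_xs.P1"
```

If a string has fewer than n separators, this sketch returns it unchanged.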
