Replace text that appears at the end of a string - r

Consider "artikelnr". I want to replace "nr" by "nummer", but when I consider "inrichting", I do NOT want to replace "nr". So I just want to replace "nr" by "nummer" if it's at the end of a word.

Regex is your friend here:
sub('nr$', 'nummer', 'artikelnr')
# [1] "artikelnummer"
The $ indicates "end of string", so nr will only be replaced with nummer when it appears at the end of the string.
sub can operate on an entire vector, e.g. for a character vector x, do:
sub('nr$', 'nummer', x)
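As a small sketch of the vectorised call (the vector x and its contents are made up for illustration), the anchored pattern leaves "inrichting" alone. If "end of a word" inside a longer string is meant rather than the end of the whole string, a word boundary (\\b) is one possible option; perl = TRUE is set here only to pin down the regex engine:

```r
# Hypothetical input vector, for illustration only
x <- c("artikelnr", "inrichting", "ordernr")

# "$" anchors the match at the end of each string
end_of_string <- sub("nr$", "nummer", x)
# [1] "artikelnummer" "inrichting"    "ordernummer"

# "\\b" is a word boundary, so "nr" is replaced at the end of any word
end_of_word <- gsub("nr\\b", "nummer", "artikelnr en inrichting", perl = TRUE)
# [1] "artikelnummer en inrichting"
```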

If you don't mind using the stringr package, str_replace is also handy:
library(stringr)
str_replace("artikelnr", "nr$", "nummer")

Related

Parsing String - Extract Numeric Characters At End

Parsing string fields in R data frames is a bit of a mystery to me I'm afraid...would be grateful for help.
I have a string field which always ends in an indeterminate number of numeric characters. I'd like to write a bit of code to just extract the numeric part at the end of each.
An example of the data format is:
df_test <- data.frame(my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432", "XXX-2345", "XXX1234"))
What I'd like is to put the numeric part at the end into a new field but to keep any leading zeros - so presumably the new field would have to be chr rather than int. So my output would look like:
c("0387", "999999", "12345432", "2345", "1234")
Is there an easy way to do this please?
Thank you.
One way is to use sub to capture the run of digits at the end of the string.
sub('.*?(\\d+)$', '\\1', df_test$my_string)
#[1] "0387" "999999" "12345432" "2345" "1234"
Using stringr :
stringr::str_extract(df_test$my_string, '\\d+$')
You can use regexpr with \\d+$ to find the numbers at the end and extract them with regmatches.
regmatches(df_test$my_string, regexpr("\\d+$", df_test$my_string))
#[1] "0387" "999999" "12345432" "2345" "1234"
We can use stri_extract_last from stringi
library(stringi)
stri_extract_last(df_test$my_string, regex = "\\d+")
#[1] "0387" "999999" "12345432" "2345" "1234"
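The question asks for the digits in a new field. A minimal base R sketch of that step follows; the column name my_number and the extra "no digits" row are assumptions added for illustration. One thing to be aware of is that regmatches with regexpr drops elements that do not match, so the result can be shorter than the data frame; the version below keeps one value per row and uses NA where there are no trailing digits.

```r
# Example data from the question, plus one hypothetical row without digits
df_test <- data.frame(
  my_string = c("XXX-0387", "XXXX-1-999999", "XXX 12345432",
                "XXX-2345", "XXX1234", "no digits"),
  stringsAsFactors = FALSE
)

# TRUE where the string ends in at least one digit
has_digits <- grepl("[0-9]+$", df_test$my_string)

# Keep the trailing digits as character (preserves leading zeros), else NA
df_test$my_number <- ifelse(
  has_digits,
  sub(".*?([0-9]+)$", "\\1", df_test$my_string, perl = TRUE),
  NA_character_
)
# df_test$my_number
# [1] "0387"     "999999"   "12345432" "2345"     "1234"     NA
```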

sub function in r does not replace the first match

I am trying to manipulate a character vector and want to delete all characters before the first occurrence of a specific string using the sub function in R. As I understand it, sub replaces the first match, but in my code it seems to replace the last match instead of the first.
Here below is an example
Vec <- c("ID1.P.001", "ID2.P.002") # character vector
# I want to get rid of all characters before the first dot (including the dot)
# So i want to get this vector
c("P.001", "P.002")
#[1] "P.001" "P.002"
# my code
sub('.*\\.', "", Vec )
#[1] "001" "002"
# sub replace the last not the first match !!
How can I use sub to get rid of the characters before the first match (including the pattern itself)?
You can make the * quantifier lazy (as opposed to the default greedy matching) by adding a ? after it, i.e.:
sub('.*?\\.', "", Vec)
#[1] "P.001" "P.002"
We can anchor at the start of the string (^), match one or more characters that are not a dot ([^.]+), followed by a literal dot (\\., escaped because . is a metacharacter outside a character class; inside [] it is already literal), and replace the match with an empty string ("").
sub("^[^.]+\\.", "", Vec)
#[1] "P.001" "P.002"
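If a regex is not required, another possible approach is to locate the first literal dot and keep everything after it. This is only a sketch; the helper names first_dot and after_first_dot are made up here.

```r
Vec <- c("ID1.P.001", "ID2.P.002")

# Position of the first literal dot in each string (fixed = TRUE disables regex)
first_dot <- as.integer(regexpr(".", Vec, fixed = TRUE))

# Keep everything after that position
after_first_dot <- substring(Vec, first_dot + 1)
# [1] "P.001" "P.002"
```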

How to take only that part of a string which occurs before a pattern of 2 dots?

I had a regular expression that kept only the text before the second occurrence of a dot. This is the code:
colnames(final1)[i] <- gsub("^([^.]*.[^.]*)..*$", "\\1", colnames(final)[i])
But now I realize I want the text before the first occurrence of a run of two dots.
I tried
gsub(",.*$", "", colnames(final)[i]) (changed the , to ..)
gsub("...*$", "", colnames(final)[i])
But it didn't work
The example to try on
KC1.Comdty...PX_LAST...USD......Comdty........
converted to
KC1.Comdty.
or
"LIT.US.Equity...PX_LAST...USD......Comdty........"
to
"LIT.US.Equity."
Can anyone suggest anything?
Thanks
We could use sub to match 2 or more dots followed by other characters and replace it with blank
sub("\\.{2,}.*", "", str1)
#[1] "KC1.Comdty" "LIT.US.Equity"
The . is a metacharacter that matches any character, so we need to escape it (\\.) to match a literal dot.
data
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
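To see why the unescaped attempt in the question fails, the sketch below compares it with two ways of writing a literal dot. It uses one of the example strings; the variable names are illustrative only.

```r
x <- "KC1.Comdty...PX_LAST...USD......Comdty........"

# Unescaped: each "." matches any character, so the whole string is consumed
unescaped <- sub("...*$", "", x)
# [1] ""

# Escaped with backslashes: two literal dots, then the rest of the string
escaped <- sub("\\.\\..*$", "", x)
# [1] "KC1.Comdty"

# Inside a character class the dot is literal as well
bracketed <- sub("[.]{2}.*$", "", x)
# [1] "KC1.Comdty"
```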
Another solution with strsplit:
str1 <- c("KC1.Comdty...PX_LAST...USD......Comdty.......", "LIT.US.Equity...PX_LAST...USD......Comdty........")
sapply(strsplit(str1, "\\.{2}\\w"), "[", 1)
# [1] "KC1.Comdty." "LIT.US.Equity."
To also keep the dot at the end, @akrun's answer can be adapted as follows:
sub("\\.{2}\\w.*", "", str1)
# [1] "KC1.Comdty." "LIT.US.Equity."

Retrieving a specific part of a string in R

I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, so that I get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results as a vector:
sapply(str1, function(x){
strsplit(x, "=")[[1]][2]
})
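Note that sapply over a character vector returns a named vector, with the input strings as names. If plain values are preferred, one possible variant is to split the whole vector once and pick the second piece with vapply; the names parts and ids are made up for this sketch.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# fixed = TRUE treats "=" as a literal string rather than a regex
parts <- strsplit(str1, "=", fixed = TRUE)

# Take the second piece of each split; vapply checks each result is one string
ids <- vapply(parts, function(p) p[2], character(1))
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```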
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, whose ICU-based regex engine supports look-behind. Base R's default engine does not; it needs perl = TRUE for that.
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
library(stringi)
stri_extract_all_regex(str1, pattern = "(?<==).+$", simplify = TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more specific class if you are confident about the format of what you match. For example, \w matches any word character (letters, digits and underscore), so you could use "(?<==)\\w+$" (the backslash must be escaped in an R string, so you end up with \\w).
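For comparison, the same look-behind pattern can be sketched in base R by switching to the PCRE engine with perl = TRUE. This is an alternative to the stringi call, not a replacement for it, and the names m and ids are illustrative.

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# Look-behind needs perl = TRUE in base R
m <- regexpr("(?<==)\\w+$", str1, perl = TRUE)

# Extract the matched text; elements without a match would be dropped
ids <- regmatches(str1, m)
# [1] "BRYANPHI01"  "WILLIROB027" "THOMPWIL01"
```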

R character conversion

From an import, I have a date being read in as a factor:
user$registrationDate[1]
[1] "2004-07-23 14:19:32"
15551 Levels: " "1" "2004-07-23 14:19:32" "2004-07-25 03:29:18" "2004-07-25 08:35:20" ... i10yo."
I convert it apparently successfully into a character vector
as.character(user$registrationDate[1])
[1] "\"2004-07-23 14:19:32\""
Whatever I try to strip off the leading and trailing quote, I still end up with a trailing quote (or something like it)
sub('"', "", as.character(user$registrationDate[10]), fixed=TRUE)
[1] "2004-09-12 22:39:21\""
I tried many variations of sub and keep getting the same result. Tips?
From ?sub: "sub replaces only the first occurrence of a pattern whereas gsub replaces all occurrences". So use gsub instead.
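A short sketch of the difference, using the value shown in the question (the variable names are made up for illustration). The last line is one way to strip only a leading and a trailing quote, in case quotes inside the value should be kept:

```r
x <- "\"2004-07-23 14:19:32\""

# sub removes only the first quote, so the trailing one remains
first_only <- sub('"', "", x, fixed = TRUE)
# [1] "2004-07-23 14:19:32\""

# gsub removes every quote
all_quotes <- gsub('"', "", x, fixed = TRUE)
# [1] "2004-07-23 14:19:32"

# Anchored alternative: remove a quote only at the start or end of the string
edges_only <- gsub('^"|"$', "", x)
# [1] "2004-07-23 14:19:32"
```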
