Regex replace everything after first digit including digit with another string - r

I have some strings with this pattern: abc_def_10_cat_dog and I want to use gsub to replace everything after abc_def with _diff. So in the end it should be abc_def_diff. What regex expression would I need to do this? Another way of thinking about it could be "How do I keep all the values before the first digit and then add _diff?" I am using dplyr, gsub, on R.
I know '^[^0-9]*' keeps what I want, but I'm not sure how to just keep those characters and drop the stuff afterwards. I tried using str_extract as well but it kept saying 'object not found'. My object is just a list of names.
object <- df %>% select(ends_with("10_cat_dog")) %>% names()

You can match the underscore, the first digit, and every character after it up to the end of the string, and then replace that match with _diff:
library(stringr)
str_replace("abc_def_10_cat_dog", "_\\d.*$", "_diff")
#> [1] "abc_def_diff"
Created on 2020-10-30 by the reprex package (v0.3.0)

You can use base R sub:
x <- 'abc_def_10_cat_dog'
sub('\\d+.*', 'diff', x)
#[1] "abc_def_diff"

Another option with sub
sub("[0-9]+.*", "diff", x)
#[1] "abc_def_diff"
data
x <- 'abc_def_10_cat_dog'
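The question also asks how to keep everything before the first digit, starting from the '^[^0-9]*' pattern. A minimal sketch of that idea with a capture group follows; the vector nms and its second element are made up for illustration and are not from the original post.

```r
# Hypothetical names, made up for illustration
nms <- c("abc_def_10_cat_dog", "xyz_10_cat_dog")

# Capture everything before the first digit, discard the rest,
# then append "diff" after the captured prefix
renamed <- sub("^([^0-9]*).*$", "\\1diff", nms)
renamed
```

A name that contains no digit at all would be captured whole and simply get diff appended, so it may be worth checking the input first.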

Related

R turn 6 digit number into HMS i.e. "130041" into "13:00:41"

As the question states, I want to turn "130041" into "13:00:41" i.e. HMS data
lubridate::ymd("20220413") works without problems, but lubridate::hms("130041") does not.
I assume there should be a reasonably simple solution?!
Thank you.
If you need the output as a lubridate Period object rather than a character vector (for example, because you need to perform operations on it), you can use the approach suggested by Tim Biegeleisen: add colon separators to the character vector, then pass the result to lubridate:
x <- "130041"
gsub("(\\d{2})(?!$)", "\\1:", x, perl = TRUE) |>
lubridate::hms()
# [1] "13H 0M 41S"
The output is similar but it is a Period object. I used a slightly different regex as well (add a colon when there are two digits not followed by the end of string) but it is fundamentally the same approach.
You could use sub here:
x <- "130041"
output <- sub("(\\d{2})(\\d{2})(\\d{2})", "\\1:\\2:\\3", x)
output
[1] "13:00:41"
Another regex, which also works when the hour uses only one digit:
gsub("(?=(..){1,2}$)", ":", c("130041", "30041"), perl=TRUE)
#[1] "13:00:41" "3:00:41"
The result can then be passed to, e.g., lubridate::hms or hms::as_hms.
With base::as.difftime, a format can be given directly: as.difftime("130041", "%H%M%S")
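If only the "13:00:41" text is needed, the same format string can be used with base R's strptime instead of a regex. This is a swapped-in alternative, not one of the answers above; the zero-padding step is an assumption added here so that a five-digit input such as "30041" still parses.

```r
x <- c("130041", "30041")

# Left-pad to six digits so a single-digit hour still parses
padded <- sprintf("%06d", as.integer(x))

# Parse as a time of day, then print it with colon separators
hms_chr <- format(strptime(padded, "%H%M%S", tz = "UTC"), "%H:%M:%S")
hms_chr
```

Unlike the regex versions, this returns NA for strings that are not a valid time, which may or may not be what you want.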

What is the best way in R to identify the first character in a string?

I am trying to find a way to loop through some data in R that contains both numbers and letters and, at the point where the first letter is found, return that letter and everything after it. For example:
column
000HU89
87YU899
902JUK8
result
HU89
YU899
JUK8
I have tried str_detect / grepl, but the value of the first letter is by nature unknown, so I am having difficulty pulling it out.
We could use str_extract
stringr::str_extract(x, "[A-Z].*")
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
Ronak's answer is simple.
Though I would also like to provide another method:
column <- c("000HU89", "87YU899", "902JUK8")
# Find the position of the first non-digit character in each string
loc <- regexpr("[^[:digit:]]", column)
# Extract everything from that character to the end of each string
substring(column, loc)
We can use sub from base R to match one or more digits (\\d+) at the start (^) of the string and replace with blank ("")
sub("^\\d+", "", x)
#[1] "HU89" "YU899" "JUK8"
data
x <- c("000HU89", "87YU899", "902JUK8")
In base R we can do
x <- c("000HU89", "87YU899", "902JUK8")
regmatches(x, regexpr("\\D.+", x))
# [1] "HU89" "YU899" "JUK8"
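The base R answers above behave differently when a string contains no letter at all. A small sketch of that difference follows; the all-digit element "12345" is made up for illustration.

```r
# Second element is a hypothetical all-digit case
x <- c("000HU89", "12345")

# sub() always returns one result per input (here an empty string)
kept <- sub("^\\d+", "", x)

# regmatches() with regexpr() silently drops inputs that have no match
matched <- regmatches(x, regexpr("\\D.+", x))

length(kept)
length(matched)
```

If the result has to line up with the rows of a data frame, the sub version is probably the safer choice.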

Regular expression in R - extract only match

My strings look like as follows:
crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt
I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").
My regular expression is:
regex.f = "_f([[:alnum:]]+)_"
There is no string with more than one part matching the pattern. Why does the following command not work?
sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")
The command only removes "_f" from the string and returns the remaining string.
This can easily be achieved with qdapRegex:
df <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)
We can use sub to extract the strings by matching the character f followed by one or more characters that are not an underscore or a digit ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), followed by an _ and the remaining characters. Replace with the backreference (\\1) of the captured group. Note that this drops the trailing digit, so it returns "weo" rather than "weo2":
sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv" "weo" "weo"
data
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.xml",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
"crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")
regmatches(x,gregexpr("(?<=_f).*?(?=_)", x, perl=TRUE))
Or with the stringr package.
library(stringr)
str_extract(x, "(?<=_f).*?(?=_)")
edited to start the match on _f instead of f.
NOTE
akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
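Because gregexpr finds all matches, the base R call above returns a list with one element per string. When only the first match per string is needed, regexpr gives a plain character vector instead. A sketch under that assumption, with the lazy .*? swapped for [^_]* (a variation of mine, not the answer's exact pattern):

```r
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")

# First run of non-underscore characters that follows "_f"
# and is itself followed by "_"
first_match <- regmatches(x, regexpr("(?<=_f)[^_]*(?=_)", x, perl = TRUE))
first_match
```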
update: capture match using str_match
library(stringr)
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[, 2])
#> [1] "weo2"
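Your regex does not work because sub only replaces the part of the string that the pattern matches, so the pattern needs a leading and a trailing .* to consume the rest of the string. You can also use \w as a shorthand close to [[:alnum:]] (note that \w additionally matches the underscore, which is why the quantifier below is lazy):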
sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")
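To make the difference concrete, here is a short side-by-side sketch of the pattern from the question and the same pattern wrapped in .* on both sides. The variable names are mine.

```r
s <- "crb_gdp_g_100000_16_20_fweo2_all.txt"

# Original pattern: only the matched "_fweo2_" is replaced by "weo2",
# everything around it is left in place
partial <- sub("_f([[:alnum:]]+)_", "\\1", s)

# With .* on both sides the whole string is matched,
# so only the captured group survives
full <- sub(".*_f([[:alnum:]]+)_.*", "\\1", s)

partial
full
```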
We could use the package unglue :
library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
"crb_gdp_g_100000_16_20_fweo2_all.txt",
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
pattern <-
"crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv" "weo2" "weo2"
Created on 2019-10-09 by the reprex package (v0.3.0)

Extract a part of string on a particular reference

I need to extract number that comes after "&r=" in the below link.
http://asdf.com/product/eyewear/eyeglasses?Brand[]=Allen%20Solly&r=472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}
Here's what i tried
My link is stored in c.
sub(".*&r=", "",c)
"472020&ck-source=google-adwords&ck-campaign=eyeglasses-cat-brand-broad&ck-adgroup=eyeglasses-dersdc-cat-brand-broad&keyword={keyword}&matchtype={matchtype}&network={network}&creative={creative}&adposition={adposition}"
This gives me the whole part of the string after "&r=".
I only need the number, i.e. 472020.
Any idea?
Here is how to get it using sub
sub(".*=(\\d+)&.*", "\\1", z)
#[1] "472020"
or
as.integer(sub(".*=(\\d+)&.*", "\\1", z))
#[1] 472020
For completeness sake, here it is with the base R regmatches/regexpr combo:
regmatches(z, regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE))
It uses the same Perl-flavoured regex as @akrun's stringr version. regexpr (or gregexpr if several matches of the same pattern are expected in the same string) matches the pattern, while regmatches extracts it (it is vectorized, so several strings can be matched/extracted at once).
as.integer(regmatches(z,regexpr("(?<=\\&r\\=)\\d+",z,perl=TRUE)))
#[1] 472020
We can use str_extract
library(stringr)
as.numeric(str_extract(z, "(?<=\\&r\\=)\\d+"))
#[1] 472020
If there are several matches use str_extract_all in place of str_extract
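One caveat with the generic sub pattern ".*=(\\d+)&.*" is that it picks up any numeric parameter, not specifically r. A sketch of the difference follows; the shortened URL and its page=3 parameter are hypothetical and only there to show the effect.

```r
# Hypothetical, shortened URL with a second numeric parameter
z <- "http://asdf.com/eyeglasses?Brand[]=Allen%20Solly&r=472020&page=3&keyword={keyword}"

# Generic pattern: the greedy .* backtracks to the LAST "=digits&"
generic <- sub(".*=(\\d+)&.*", "\\1", z)

# Anchoring on the parameter name keeps the match tied to r
anchored <- sub(".*[?&]r=(\\d+).*", "\\1", z)

generic
anchored
```

The lookbehind versions above already anchor on "&r=", so they are not affected.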

Extracting part of string by position in R

I have a vector of strings string which look like this
ABC_EFG_HIG_ADF_AKF_MNB
From each of these elements I want to extract the 3rd set of characters (from the left), i.e. in this case HIG. How can I achieve this in R?
substr extracts a substring by position:
substr('ABC_EFG_HIG_ADF_AKF_MNB', 9, 11)
returns
[1] "HIG"
Here's one more possibility:
strsplit(str1,"_")[[1]][3]
#[1] "HIG"
The command strsplit() does what its name suggests: it splits a string. The second parameter is the character on which the string is split, wherever it is found within the string.
Perhaps somewhat surprisingly, strsplit() returns a list. So we can either use unlist() to access the resulting split parts of the original string, or in this case address them with the index of the list [[1]] since the list in this example has only one member, which consists of six character strings (cf. the output of str(strsplit(str1,"_"))).
To access the third entry of this list, we can specify [3] at the end of the command.
The string str1 is defined here as in the answer by @akrun.
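Since the question mentions a vector of strings, the [[1]][3] indexing needs to be applied to every list element. A minimal sketch with vapply follows; the second string is made up for illustration.

```r
# Second element is hypothetical, added to show the vectorised case
strs <- c("ABC_EFG_HIG_ADF_AKF_MNB", "QRS_TUV_WXY_ZAB")

# strsplit() returns one character vector per input string;
# `[` with index 3 pulls the third piece out of each
third <- vapply(strsplit(strs, "_", fixed = TRUE), `[`, character(1), 3)
third
```

fixed = TRUE treats the separator as a literal string rather than a regex, which is a small safety measure if the separator is ever changed to something like ".".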
We can use sub. We match one or more characters that are not _ ([^_]+) followed by a _, and keep this in a capture group. Since we want the third set of non-_ characters, we repeat that group two times ({2}), follow it with a second capture group of one or more non-_ characters, and match the rest of the string with .*. In the replacement, we use the backreference to the second capture group (\\2).
sub("^([^_]+_){2}([^_]+).*", "\\2", str1)
#[1] "HIG"
Or another option is with scan
scan(text=str1, sep="_", what="", quiet=TRUE)[3]
#[1] "HIG"
A similar option to the one mentioned by @RHertel would be to use read.table/read.csv on the string
read.table(text=str1,sep = "_", stringsAsFactors=FALSE)[,3]
data
str1 <- "ABC_EFG_HIG_ADF_AKF_MNB"
If you know the position of the pattern you are looking for, and you know that it is fixed (here, between the 9th and 11th characters), you can simply use str_sub() from the stringr package.
library(stringr)
MyString = 'ABC_EFG_HIG_ADF_AKF_MNB'
str_sub(MyString, 9, 11)
A newer option is the function str_split_i from stringr (in the development version at the time of writing, and released in stringr 1.5.0), which splits a string on a given separator and extracts the piece at a given position. Here is a reproducible example:
# devtools::install_github("tidyverse/stringr")
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 3)
#> [1] "HIG"
Created on 2022-09-10 with reprex v2.0.2
As you can see, it extracted the third piece. If you want the 6th, you can change the 3 to 6 like this:
library(stringr)
x <- c("ABC_EFG_HIG_ADF_AKF_MNB")
str_split_i(x, "_", 6)
#> [1] "MNB"
Created on 2022-09-10 with reprex v2.0.2

Resources