substitute string when there is a dot + number + ':' - r

I have strings that look like these:
> ABCD.1:f_HJK
> ABFD.1:f_HTK
> CJD:f_HRK
> QQYP.2:f_HDP
So there is always a string in the first part, optionally followed by a '.' and a number, and then always a ':' and another string.
I would like to remove the '.' + number part whenever it is present, using R.
I know regular expressions could be useful here, but I have no idea how to apply them in this context. I know I can substitute the '.' with gsub, but not how to include the number and the ':' in the pattern.
Thank you for your help.

Does this work:
v <- c('ABCD.1:f_HJK','ABFD.1:f_HTK','CJD:f_HRK','QQYP.2:f_HDP')
v
[1] "ABCD.1:f_HJK" "ABFD.1:f_HTK" "CJD:f_HRK" "QQYP.2:f_HDP"
gsub('([A-Z]{,4})(\\.\\d)?(:.*)','\\1\\3',v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"

You could also use any of the following, depending on the structure of your strings.
If there is no other period followed by digits in the string:
sub("\\.\\d+", "", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
If you only want to match the pattern at the start of the string (letters, then '.' and digits, then ':'):
sub("^([A-Z]+)\\.\\d+:", "\\1:", v)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"
Same as above, but using PCRE (perl = TRUE) with \K, so no capture group is needed:
sub("^[A-Z]+\\K\\.\\d+", "", v, perl = TRUE)
[1] "ABCD:f_HJK" "ABFD:f_HTK" "CJD:f_HRK" "QQYP:f_HDP"

If I understood your explanation correctly, this should do the trick:
gsub("(\\.\\d+)", "", string)
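The patterns in these answers behave differently once a string contains more than one dot-plus-digits group. A minimal sketch, assuming a hypothetical third element (QQYP.2:f_H.3P) that does not appear in the question:

```r
# Hypothetical input: the last element has a second ".digit" after the colon
v <- c("ABCD.1:f_HJK", "CJD:f_HRK", "QQYP.2:f_H.3P")

# gsub() removes every ".digits" group, wherever it occurs
removed_all <- gsub("\\.\\d+", "", v)

# sub() with a lookahead removes only the ".digits" directly before a colon
removed_before_colon <- sub("\\.\\d+(?=:)", "", v, perl = TRUE)

print(removed_all)
print(removed_before_colon)
```

If the part after the colon can never contain a dot followed by digits, both calls give the same result and the simpler gsub() is enough.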

Related

how to extract the sub-string which is enclosed in between '$' and '/' (special characters) and then substitute the value of that sub-string?

I have a main string, str, and the substring I need to replace, su_str:
su_str <- "$Region_Name"
str <- " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name/Billing/report.csv"
The value of $Region_Name is ap-southeast-1, which is stored in another file.
I tried:
r <- unlist(stri_extract_all(p,"$ /"))
which gives this error:
Error in stri_extract_all(p, "$ /") :
you have to specify either regex, fixed, coll, or charclass
c_prop will be:
Key: Value
$DirectorName : DF-1C
$DirectorPortName : Ports, DF-1C
$MaskingViewName : 000197801199, IS_LGLW9062_VIEW
$MaskingInitiatorPortName : Initiator Ports, IS_LGLW9062_VIEW
$MaskingAssDeviceName :Associated Devices, IS_LGLW9062_VIEW
$PoolName : 000197801199, SRP_1
$PoolBoundDevice : Bound Devices, SRP_1
$PortName : DF-1C:12
$Region_Name : ap-southeast-1
How can I solve this? Any suggestions are welcome. Thanks in advance!
This works for your example; does it solve your problem? When using regex (regular expressions) in R, special characters such as $ must be escaped with two backslashes (\\), because the backslash itself has to be escaped inside an R string. It looks like stringi requires special characters in the replacement to be escaped as well, but I do not use stringi often, so hopefully someone can chime in with a better way to do it using stringi.
> library(stringi)
>
> str <- " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name/Billing/report.csv"
>
> # If you just want to extract the sequence of letters and underscores after "$" and before the "/"
> unlist(stri_extract_all(str, regex = "\\$[[:alpha:]_]*\\b"))
[1] "$Region_Name"
>
> # If you want to replace it with something else using base R
>
> some_string <- "$Region_Name = ap-southeast-1"
>
> gsub("\\$[[:alpha:]_]*\\b", some_string, str)
[1] " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name = ap-southeast-1/Billing/report.csv"
>
> # Using stringi package
>
> # Special characters have to be escaped
> some_string <- "\\$Region_Name \\= ap\\-southeast\\-1"
>
> stri_replace_all(str, some_string, regex = "\\$[[:alpha:]_]*\\b")
[1] " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name = ap-southeast-1/Billing/report.csv"
EDIT: if you want multiple replacements for the same substring:
# If the substring will always be "$Region_Name"
su_str <- "$Region_Name"
replacements <- c("$Region_Name = ap-southeast-1/", "$Region_Name = ap-southeast-2/")
stri_replace_all(str, replacements, fixed = su_str)
[1] " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name = ap-southeast-1//Billing/report.csv"
[2] " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name = ap-southeast-2//Billing/report.csv"
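The escaping rule mentioned in this answer can be checked directly in base R. A minimal sketch, using a shortened, hypothetical URL (example.invalid) in place of the one from the question:

```r
# Shortened, hypothetical URL standing in for the one in the question
str <- "https://example.invalid/Regions/$Region_Name/Billing/report.csv"

# Unescaped: "$" is read as the end-of-string anchor, so nothing matches
unescaped <- grepl("$Region_Name", str)

# Escaped: "\\$" in R source is the regex \$, a literal dollar sign
escaped <- grepl("\\$Region_Name", str)

# fixed = TRUE: the pattern is plain text, no regex interpretation at all
literal <- grepl("$Region_Name", str, fixed = TRUE)

print(c(unescaped = unescaped, escaped = escaped, literal = literal))
```

When the text to find is known exactly, fixed = TRUE is usually the least error-prone option, since nothing has to be escaped.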
The title of your question and the question you actually ask are two different issues, but I will try to address both.
Regarding the error you get from stri_extract_all(): you need to specify what kind of pattern you want to match. I believe you are trying to match a fixed pattern, in which case you can use the stri_extract_all_fixed() function instead.
However, my solution does not use stri_extract_all() to remove and substitute your substring. Here it is.
str <- " https://lglbw.pqr.xyz.com:58443/APG/lookup/All/Report%20Library/Amazon%20S3/Inventory/Regions/$Region_Name/Billing/report.csv"
reg <- "$Region_Name"
replce <- "ap-southeast-1"
# Custom function to return the position of a substring
strpos_fixed <- function(x, y) {
  a <- regexpr(y, x, fixed = TRUE)
  b <- a[1]
  return(b)
}
part1 <- substr(str, 1, strpos_fixed(str, reg) - 1)
part2 <- substr(str, strpos_fixed(str, reg) + nchar(reg), nchar(str))
part1 # Everything before "$Region_Name"
part2 # Everything after "$Region_Name"
new <- paste(part1, replce, part2, sep = "")
new
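Since the question lists several $Key : Value pairs, the same substitution can be driven by a lookup vector. This is only a sketch: the template path and the two-entry props vector are hypothetical stand-ins for the real URL and the c_prop table, which would first have to be read from the other file.

```r
# Hypothetical template with two placeholders, and a lookup in the style of c_prop
template <- "/Inventory/Regions/$Region_Name/Pools/$PoolName/report.csv"
props <- c("$Region_Name" = "ap-southeast-1", "$PoolName" = "SRP_1")

filled <- template
for (key in names(props)) {
  # fixed = TRUE treats the key as plain text, so "$" needs no escaping
  filled <- gsub(key, props[[key]], filled, fixed = TRUE)
}
print(filled)
```

One caveat with this design: if one key is a prefix of another (for example a hypothetical $PoolName and $PoolNameFull), the longer key should be processed first so that it is not partially replaced.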

Return number from string

I'm trying to extract the "Number" of "Humans" in the string below, for example:
string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
The position of the text in the string will constantly change, so I need R to search the string and find "Species|Human|Number|" and return 1.
Apologies if this is a duplicate of another thread, but I've looked here (extract a substring in R according to a pattern) and here (R extract part of string). But I'm not having any luck.
Any ideas?
Use a capturing approach - capture 1 or more digits (\d+) after the known substring (just escape the | symbols):
> string <- c("ProjectObjectives|Objectives_NA, PublishDate|PublishDate_NA, DeploymentID|DeploymentID_NA, Species|Human|Gender|Female, Species|Cat|Number|1, Species|Human|Number|1, Species|Human|Position|Left")
> pattern = "Species\\|Human\\|Number\\|(\\d+)"
> unlist(regmatches(string,regexec(pattern,string)))[2]
[1] "1"
A variation is to use a PCRE regex with regmatches/regexpr
> pattern="(?<=Species\\|Human\\|Number\\|)\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Here, the left side context is put inside a non-consuming pattern, a positive lookbehind, (?<=...).
The same functionality can be achieved with \K operator:
> pattern="Species\\|Human\\|Number\\|\\K\\d+"
> regmatches(string,regexpr(pattern,string, perl=TRUE))
[1] "1"
Simplest way I can think of:
as.integer(gsub("^.+Species\\|Human\\|Number\\|(\\d+).+$", "\\1", string))
It will introduce NAs (with a coercion warning) where there is no mention of Species|Human|Number, because gsub returns such strings unchanged. Also, there will be artefacts if any of the strings is itself a number (but I assume that this won't be an issue).
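For a vector of strings where some entries lack the Species|Human|Number| field, the capturing approach can return NA explicitly rather than relying on a coercion warning. A minimal sketch, with two made-up input strings:

```r
# Hypothetical inputs: the second string has no "Species|Human|Number|" entry
strings <- c(
  "Species|Cat|Number|1, Species|Human|Number|12, Species|Human|Position|Left",
  "Species|Human|Gender|Female, Species|Cat|Number|3"
)
pattern <- "Species\\|Human\\|Number\\|(\\d+)"

# regexec() keeps one list element per input string; element 2 is the capture group
matches <- regmatches(strings, regexec(pattern, strings))
human_counts <- vapply(
  matches,
  function(m) if (length(m) >= 2) as.integer(m[2]) else NA_integer_,
  integer(1)
)
print(human_counts)
```

Because the result has the same length as the input, it can be assigned straight into a data frame column.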

R: How to extract specific digits from a string?

I want to retrieve the first number (here: 344002) from a string:
string <- '<a href="/Archiv-Suche/!344002&amp;s=&amp;SuchRahmen=Print/" ratiourl-ressource="344002"'
Preferably, I am looking for a regular expression that finds the digits after the ! and before the &amp.
All I came up with is this but this catches the ! as well (!344002):
regmatches(string, gregexpr("\\!([[:digit:]]+)", string, perl =TRUE))
Any ideas?
Use this regex:
(?<=\!)\d+(?=&amp)
Use this code:
regmatches(string, gregexpr("(?<=\\!)\\d+(?=&amp)", string, perl=TRUE))
(?<=\!) is a lookbehind, the match will start following !
\d+ matches one digit or more
(?=&amp) stops the match if next characters are &amp
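The lookaround pattern can be tried with both regexpr (first match only) and gregexpr (all matches). This sketch assumes the href still contains the HTML entity &amp; right after the digits, as the question text implies; with a bare & the lookahead would not match.

```r
# Assumes the href still contains the HTML entity "&amp;" after the digits
string <- '<a href="/Archiv-Suche/!344002&amp;s=&amp;SuchRahmen=Print/" ratiourl-ressource="344002"'

# In R source, the regex \d must be written as "\\d"
pattern <- "(?<=!)\\d+(?=&amp)"

# regexpr() finds the first match; the result is a character vector
first_match <- regmatches(string, regexpr(pattern, string, perl = TRUE))

# gregexpr() finds all matches; the result is a list, one element per input string
all_matches <- regmatches(string, gregexpr(pattern, string, perl = TRUE))

print(first_match)
print(all_matches)
```

Since only the first number is wanted, regexpr avoids having to unlist the result.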
library(gsubfn)
strapplyc(string, "!(\\d+)")[[1]]
[Old answer]
Test this code.
library(stringr)
str_extract(string, "[0-9]+")
A similar question and answer can be found here:
Extract a regular expression match in R version 2.10
You may capture the digits (\d+) in between ! and &amp and get it with regexec/regmatches:
> string <- '<a href="/Archiv-Suche/!344002&amp;s=&amp;SuchRahmen=Print/" ratiourl-ressource="344002"'
> pattern = "!(\\d+)&"
> res <- unlist(regmatches(string,regexec(pattern,string)))
> res[2]
[1] "344002"
See the online R demo

Retrieving a specific part of a string in R

I have the following vector of strings:
[1] "/players/playerpage.htm?ilkidn=BRYANPHI01"
[2] "/players/playerpage.htm?ilkidhh=WILLIROB027"
[3] "/players/playerpage.htm?ilkid=THOMPWIL01"
I am looking for a way to retrieve the part of each string that comes after the equal sign, i.e. I would like to get a vector like this:
[1] "BRYANPHI01"
[2] "WILLIROB027"
[3] "THOMPWIL01"
I tried using substr, but for it to work I have to know exactly where the equal sign is in the string and where the part I want to retrieve ends.
We can use sub to match the zero or more characters that are not a = ([^=]*) followed by a = and replace it with ''.
sub("[^=]*=", "", str1)
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
data
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
Using stringr,
library(stringr)
word(str1, 2, sep = '=')
#[1] "BRYANPHI01" "WILLIROB027" "THOMPWIL01"
Using strsplit,
strsplit(str1, "=")[[1]][2]
# [1] "BRYANPHI01"
Following Sotos's comment, to get the results for all elements as a vector:
sapply(str1, function(x) {
  strsplit(x, "=")[[1]][2]
})
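Note that sapply() over a character vector names each result after its input string. If a plain, unnamed character vector is wanted, one possible variation is to split once and pick the second piece with vapply():

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# strsplit() is vectorised: it returns one character vector of pieces per input
parts <- strsplit(str1, "=", fixed = TRUE)

# Take the second piece of each; vapply() enforces one string per element
ids <- vapply(parts, function(p) p[2], character(1))
print(ids)
```

This sketch assumes each string contains exactly one "="; with more than one, only the text between the first and second "=" would be returned.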
Another solution based on regex, but extracting instead of substituting, which may be more efficient.
I use the stringi package, which provides a more powerful regex engine (ICU) than base R's default one; in particular it supports look-behind, which base R only offers with perl = TRUE.
library(stringi)
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
"/players/playerpage.htm?ilkidhh=WILLIROB027",
"/players/playerpage.htm?ilkid=THOMPWIL01")
stri_extract_all_regex(str1, pattern="(?<==).+$", simplify=TRUE)
(?<==) is a look-behind: regex will match only if preceded by an equal sign, but the equal sign will not be part of the match.
.+$ matches everything until the end. You could replace the dot with a more precise symbol if you are confident about the format of what you match. For example, '\w' matches any word character (a letter, a digit, or an underscore), so you could use "(?<==)\\w+$" (the \ must be escaped, so you end up with \\w).
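The same look-behind pattern also runs in base R when perl = TRUE is set, which avoids the extra package. A minimal sketch using the data from the question:

```r
str1 <- c("/players/playerpage.htm?ilkidn=BRYANPHI01",
          "/players/playerpage.htm?ilkidhh=WILLIROB027",
          "/players/playerpage.htm?ilkid=THOMPWIL01")

# perl = TRUE switches to PCRE, which supports the (?<=...) look-behind
ids <- regmatches(str1, regexpr("(?<==)\\w+$", str1, perl = TRUE))
print(ids)
```

Keep in mind that regmatches() with regexpr() silently drops elements that do not match, so the result can be shorter than the input.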

Newb regex help: string with ampersand, using R

I know this should be simple, but I cannot return a subset of characters from a string using regex in R.
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
Test <- grep(pattern=Reg, x=Foo, value=TRUE)
This captures the entire string for me and I want to capture just the R206411. The string I want to capture might vary in length and content, so the key is to have the capture begin after the '=' in propertyid=, and then end the capture once it sees the '&' in '&state_id'.
Thanks for your time.
You have to use positive lookbehind and lookahead assertions like this:
Foo <- 'propertyid=R206411&state_id='
Reg <- gregexpr('(?<=propertyid=).*(?=&state_id=)', Foo, perl=TRUE)
regmatches(Foo, Reg)
Well, grep doesn't play well with capture groups, which is what you are trying to use. What you probably want is gsub:
Foo <- 'propertyid=R206411&state_id='
Reg <- 'propertyid=(.*)&state_id='
gsub(Reg, "\\1", Foo)
# [1] "R206411"
Here we take your pattern and replace the match with "\1" (written "\\1", since R requires backslashes to be escaped in strings), which stands for the first capture group (which is what the parentheses indicate). Since the pattern matches the whole string, the whole string is replaced with just the captured portion.
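One thing to watch with the replacement approach: when the pattern does not match, or matches only part of the string, the unmatched text is left in place. The sketch below uses hypothetical extra inputs and a slightly stricter capture group ([^&]* instead of .*) to return NA for strings without a property id.

```r
# Hypothetical inputs: the second has extra parameters, the third has no property id
Foo <- c("propertyid=R206411&state_id=",
         "x=1&propertyid=R9&state_id=TX",
         "state_id=TX")
Reg <- "propertyid=([^&]*)&state_id="

# Pad the pattern with .* so the whole string is consumed, then keep group 1;
# strings without a match become NA instead of being returned unchanged
ids <- ifelse(grepl(Reg, Foo),
              sub(paste0("^.*", Reg, ".*$"), "\\1", Foo),
              NA_character_)
print(ids)
```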
The strapplyc function in the gsubfn package can do exactly that. Using Foo and Reg from the question:
> library(gsubfn)
>
> strapplyc(Foo, Reg, simplify = TRUE)
[1] "R206411"

Resources