grep at the beginning of the string with fixed =T in R?

grep at the beginning of the string with fixed =T in R? - r

How to grep with fixed=T, but only at the beginning of the string?
grep("a.", c("a.b", "cac", "sss", "ca.f"), fixed = T)
# 1 4
I would like to get only the first occurrence.
[Edit: the string to match is not known in advance, and can be anything. "a." is just for the sake of example]
Thanks.
[Edit: I sort of solved it now, but any other ideas are highly welcome. I will accept as an answer any alternative solution.
s <- "a."
res <- grep(s, c("a.b", "cac", "sss", "ca.f"), fixed = T, value = T)
res[substring(res, 1, nchar(s)) == s]
]

If you want to match an exact string (string 1) at the beginning of the string (string 2), then just subset your string 2 to be the same length as string 1 and use ==, should be fairly fast.

Actually, Greg -and you- have mentioned the cleanest solution already. I would even drop the grep altogether:
> name <- "a#"
> string <- c("a#b", "cac", "sss", "ca#f")
> string[substring(string, 1, nchar(name)) == name]
[1] "a#b"
But if you really insist on grep, you can use Dwins approach, or following mindboggling solution:
specialgrep <- function(x,y,...){
grep(
paste("^",
gsub("([].^+?|[#\\-])","\\\\\\1",x)
,sep=""),
y,...)
}
> specialgrep(name,string,value=T)
[1] "a#b"
It might be I forgot to include some characters in the gsub. Be sure you keep the ] symbol first and the - last in the characterset, otherwise you'll get errors. Or just forget about it, use your own solution. This one is just for fun's sake :-)

Do you want to use fixed=T because of the . in the pattern? In that case you can just escape the . this would work:
grep("^a\\.", c("a.b", "cac", "sss", "ca.f"))

If you only want the focus on the first two characters, then only present that much information to grep:
> grep("a.", substr(c("a.b", "cac", "sss", "ca.f"), 1,2) ,fixed=TRUE)
[1] 1
You could easily wrap it into a function:
> checktwo <- function (patt,vec) { grep(patt, substr(vec, 1,nchar(patt)) ,fixed=TRUE) }
> checktwo("a.", c("a.b", "cac", "sss", "ca.f") )
[1] 1

I think Dr. G had the key to the solution in his answer, but didn't explicitly call it out: "^" in the pattern specifies "at the beginning of the string". ("$" means at the end of the string)
So his "^a." pattern means "at the beginning of the string, look for an 'a' followed by one character of anything [the '.']".
Or you could just use "^a" as the pattern unless you don't want to match the one character string containing only "a".
Does that help?
Jeffrey

Related

Checking if a vector starts with a number

I have a pretty straight forward question. Sorry if this has already been asked somewhere, but I could not find the answer...
I want to check if genenames start with a number, and if they do start with a number, I want to add 'aaa_' to the genename. Therefor I used the following code:
geneName <- "2310067B10Rik"
if (is.numeric(substring(geneName, 1, 1))) {
geneName <<- paste("aaaa_", geneName, sep="")
}
What I want to get back is aaaa_2310067B10Rik. However, is.numeric returns a FALSE, because the substring gives "2" in quotations as a character. I've also tries to use noquote(), but that didnt work, and as.numeric() around the substring, but then it also applies the if code to genes that don't start with a number. Any suggestions? Thanks!

Here is a solution with regex (Learning Regular Expressions ):
geneName <- c("2310067B10Rik", "Z310067B10Rik")
sub("^(\\d)", "aaa_\\1", geneName)
or as PERL-flavoured variant (thx to #snoram):
sub("^(?=\\d)", "aaa_", geneName, perl = TRUE)

Using the replace() function:
start_nr <- grep("^\\d", geneName)
replace(geneName, start_nr, paste0("aaaa_", geneName[start_nr]))
[1] "aaaa_2310067B10Rik" "foo" "aaaa_9bar"
Where:
geneName <- c("2310067B10Rik", "foo", "9bar")

geneName <- c("2310067B10Rik", "foo")
ifelse(substring(geneName, 1,1) %in% c(0:9), paste0("aaaa_", geneName), geneName)
[1] "aaaa_2310067B10Rik" "foo"
Or based on above comment, you could replace substring(geneName, 1,1) %in% c(0:9) by grepl("^\\d", geneName)

Using regex:
You can first check the first character of your geneName and if it is digit then you can append as follow:
geneName <- "2310067B10Rik"
ifelse(grepl("^[0-9]*$", substring(geneName, 1,1)),paste("aaaa",geneName,sep="_"),)
Output:
[1] "aaaa_2310067B10Rik"

geneName=function(x){
if( grepl("^[0-9]",x) ){
as.character(glue::glue('aaaa_{x}'))
}else{x}
}
> geneName("2310067B10Rik")
[1] "aaaa_2310067B10Rik"
> geneName("sdsad")
[1] "sdsad"

gsub to remove unwanted precision

Could anyone please help to achieve the following with gsub in R?
input string: a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02
desired output: a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200, g=850.02
Practically, removing the redundant 0s after the decimal point if they are all just 0s, don't remove if real fractions exist.

I couldn't get this to work using gsub alone, but we can try splitting your input vector on comma, and then using an apply function with gsub:
x <- "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
input <- sapply(unlist(strsplit(x, ",")), function(x) gsub("(?<=\\d)\\.$", "", gsub("(\\.[1-9]*)0+$", "\\1", x), perl=TRUE))
input <- paste(input, collapse=",")
input
[1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Demo
I actually make two calls to gsub. The first call strips off all trailing zeroes appearing after a decimal point, should the number have one. And the second call removes stray decimal points, in the case of a number like 5.00, which the first call would leave as 5. and not 5, the latter which we want.

To remove trailing 0s after the decimal, try this:
EDIT Forgot 5.00
x = c('5.00', '0.500', '120', '0.0003', '0.02', '5.20', '1200', '850.02')
gsub("\\.$" "", gsub("(\\.(|[1-9]+))0+$", "\\1", x))
# [1] "5" "0.5" "120" "0.0003" "0.02" "5.2" "1200" "850.02"
HT #TimBiegeleisen: I misread input as a vector of strings. For a single-string input, convert to vector of strings, which you can call gsub on, then collapse output back to a single string:
paste(
gsub("\\.$", "", gsub("(\\.(|[1-9]+))0+$", "\\1",
unlist(strsplit(x, ", ")))),
collapse=", ")
[1] "a=5, b=0.5, c=120, d=0.0003, e=0.02, f=5.2, g=1200, h=850.02"

gsub is a text processing tool that works on character level. It’s ignorant of any semantic interpretation.
However, you are specifically interested in manipulating this semantic interpretation, namely, the precision of numbers encoded in your text.
So use that: parse the numbers in the text, and write them out with the desired precision:
parse_key_value_pairs = function (text) {
parse_pair = function (pair) {
pair = strsplit(pair, "\\s*=\\s*")[[1]]
list(key = pair[1], value = as.numeric(pair[2]))
}
pairs = unlist(strsplit(text, "\\s*,\\s*"))
structure(lapply(pairs, parse_pair), class = 'kvp')
}
as.character.kvp = function (x, ...) {
format_pair = function (pair) {
sprintf('%s = %g', pair[1], pair[2])
}
pairs = vapply(x, format_pair, character(1))
paste(pairs, collapse = ", ")
}
And use it as follows:
text = "a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02"
parsed = parse_key_value_pairs(text)
as.character(parsed)
This uses several interesting features of R:
For text processing, it still uses regular expressions (inside strsplit).
To process multiple values, use lapply to apply a parsing function to parts of the string in turn
To reconstruct a key–value pair, format the string using sprintf. sprintf is a primitive text formatting tool adapted from C. But it’s fairly universal and it works OK in our case.
The parsed value is tagged with an S3 class name. This is how R implements object orientation.
Provide an overload of the standard generic as.character for our type. This means that any existing function that takes an object and displays it via as.character can deal with our parsed data type. In particular, this works with the {glue} library:
> glue::glue("result: {parsed}")
result: a = 5, b = 120, c = 0.0003, d = 0.02, e = 5.2, f = 1200, g = 850.02

This is probably not the most ideal solution, but for educational purposes, here is one way to call gsub only once using conditional regex:
x = 'a=5.00,b=120,c=0.0003,d=0.02,e=5.20, f=1200.0,g=850.02'
gsub('(?!\\d+(?:,|$))(\\.[0-9]*[1-9])?(?(1)0+\\b|\\.0+(?=(,|$)))', '\\1', x, perl = TRUE)
# [1] "a=5,b=120,c=0.0003,d=0.02,e=5.2, f=1200,g=850.02"
Notes:
(?!\\d+(?:,|$)) is a negative lookbehind that matches a digit one or more times following a comma or end of string. This effectively excludes the pattern from the overall regex match.
(\\.[0-9]*[1-9])? matches a literal dot, a digit zero or more times and a digit (except zero). The ? makes this pattern optional, and is crucial to how the conditional handles the back reference.
(?(1)0+\\b|\\.0+(?=(,|$))) is a conditional with the logic (?(IF)THEN|ELSE)
(1) is the (IF) part which checks if capture group 1 matched. This refers to (\\.[0-9]*[1-9])
0+\\b is the (THEN) part which matches only if (IF) is TRUE. In this case, only if (\\.[0-9]*[1-9]) matched, will the regex try to match a zero one or more times following a word boundary
\\.0+(?=(,|$)) is the (ELSE) part which matches only if (IF) is FALSE. In this case only if (\\.[0-9]*[1-9]) didn't match, will the regex try to match a literal dot, a zero one or more times following a comma or end of string
If we put 2. and 3. together, we get either (\\.[0-9]*[1-9])0+\\b or \\.0+(?=(,|$))
\\1 as a replacement therefore turns either (\\.[0-9]*[1-9])0+\\b to the pattern matched by (\\.[0-9]*[1-9]) or \\.0+(?=(,|$)) to blank. which translates to:
5.20 to 5.2 for the former
5.00 to 5 and 1200.0 to 1200 for the latter

Use capture group within gsub() as index/name for another object (R) [duplicate]

I'm new to R and am stuck with backreferencing that doesn't seem to work. In:
gsub("\\((\\d+)\\)", f("\\1"), string)
It correctly grabs the number in between parentheses but doesn't apply the (correctly defined, working otherwise) function f to replace the number --> it's actually the string "\1" that passes through to f.
Am I missing something or is it just that R does not handle this? If so, any idea how I could do something similar, i.e. applying a function "on the fly" to the (actually many) numbers that occur in between parentheses in the text I'm parsing?
Thanks a lot for your help.

R does not have the option of applying a function directly to a match via gsub. You'll actually have to extract the match, transform the value, then replace the value. This is relativaly easy with the regmatches function. For example
x<-"(990283)M (31)O (29)M (6360)M"
f<-function(x) {
v<-as.numeric(substr(x,2,nchar(x)-1))
paste0(v+5,".1")
}
m <- gregexpr("\\(\\d+\\)", x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
# [1] "990288.1M 36.1O 34.1M 6365.1M"
Of course you can make f do whatever you like just make sure it's vector-friendly. Of course, you could wrap this in your own function
gsubf <- function(pattern, x, f) {
m <- gregexpr(pattern, x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
}
gsubf("\\(\\d+\\)", x, f)
Note that in these examples we're not using a capture group, we're just grabbing the entire match. There are ways to extract the capture groups but they are a bit messier. If you wanted to provide an example where such an extraction is required, I might be able to come up with something fancier.

To use a callback within a regex-capable replacement function, you may use either gsubfn or stringr functions.
When choosing between them, note that stringr is based on ICU regex engine and with gsubfn, you may use either the default TCL (if the R installation has tcltk capability, else it is the default TRE) or PCRE (if you pass the perl=TRUE argument).
Also, note that gsubfn allows access to all capturing groups in the match object, while str_replace_all will only allow to manipulate the whole match only. Thus, for str_replace_all, the regex should look like (?<=\()\d+(?=\)), where 1+ digits are matched only when they are enclosed with ( and ) excluding them from the match.
With stringr, you may use str_replace_all:
library(stringr)
string <- "(990283)M (31)O (29)M (6360)M"
## Callback function to increment found number:
f <- function(x) { as.integer(x) + 1 }
str_replace_all(string, "(?<=\\()\\d+(?=\\))", function(m) f(m))
## => [1] "(990284)M (32)O (30)M (6361)M"
With gsubfn, pass perl=TRUE and backref=0 to be able to use lookarounds and just modify the whole match:
gsubfn("(?<=\\()\\d+(?=\\))", ~ f(m), string, perl=TRUE, backref=0)
## => [1] "(990284)M (32)O (30)M (6361)M"
If you have multiple groups in the pattern, remoe backref=0 and enumerate the group value arguments in the callback function declaration:
gsubfn("(\\()(\\d+)(\\))", function(m,n,o) paste0(m,f(n),o), string, perl=TRUE)
^ 1 ^^ 2 ^^ 3 ^ ^^^^^^^ ^^^^

This is for multiple different replacements.
text="foo(200) (300)bar (400)foo (500)bar (600)foo (700)bar"
f=function(x)
{
return(as.numeric(x[[1]])+5)
}
a=strsplit(text,"\\(\\K\\d+",perl=T)[[1]]
b=f(str_extract_all(text,perl("\\(\\K\\d+")))
paste0(paste0(a[-length(a)],b,collapse=""),a[length(a)]) #final output
#[1] "foo(205) (305)bar (405)foo (505)bar (605)foo (705)bar"

Here's a way by tweaking a bit stringr::str_replace(), in the replace argument, just use a lambda formula as the replace argument, and reference the captured group not by ""\\1" but by ..1, so your gsub("\\((\\d+)\\)", f("\\1"), string) will become str_replace2(string, "\\((\\d+)\\)", ~f(..1)), or just str_replace2(string, "\\((\\d+)\\)", f) in this simple case :
str_replace2 <- function(string, pattern, replacement, type.convert = TRUE){
if(inherits(replacement, "formula"))
replacement <- rlang::as_function(replacement)
if(is.function(replacement)){
grps_mat <- stringr::str_match(string, pattern)[,-1, drop = FALSE]
grps_list <- lapply(seq_len(ncol(grps_mat)), function(i) grps_mat[,i])
if(type.convert) {
grps_list <- type.convert(grps_list, as.is = TRUE)
replacement <- rlang::exec(replacement, !!! grps_list)
replacement <- as.character(replacement)
} else {
replacement <- rlang::exec(replacement, !!! grps_list)
}
}
stringr::str_replace(string, pattern, replacement)
}
str_replace2(
"foo (4)",
"\\((\\d+)\\)",
sqrt)
#> [1] "foo 2"
str_replace2(
"foo (4) (5)",
"\\((\\d+)\\) \\((\\d+)\\)",
~ sprintf("(%s)", ..1 * ..2))
#> [1] "foo (20)"
Created on 2020-01-24 by the reprex package (v0.3.0)

How to use grep to match exactly with space?

I have a list as following:
S = "Alicia Chang"
N=c("Alicia Chang", "Heather May", "Alicia Chang J")
I want to use grep to turn the first one only. How could I do it. When I use grep(S, N), it return 3 of them. When I use grep(^S$, N), it gave me error.

We need to use paste to create the pattern for grep.
grep(paste0('^', S, '$'), N)
#[1] 1

Applying a function to a backreference within gsub in R

I'm new to R and am stuck with backreferencing that doesn't seem to work. In:
gsub("\\((\\d+)\\)", f("\\1"), string)
It correctly grabs the number in between parentheses but doesn't apply the (correctly defined, working otherwise) function f to replace the number --> it's actually the string "\1" that passes through to f.
Am I missing something or is it just that R does not handle this? If so, any idea how I could do something similar, i.e. applying a function "on the fly" to the (actually many) numbers that occur in between parentheses in the text I'm parsing?
Thanks a lot for your help.

R does not have the option of applying a function directly to a match via gsub. You'll actually have to extract the match, transform the value, then replace the value. This is relativaly easy with the regmatches function. For example
x<-"(990283)M (31)O (29)M (6360)M"
f<-function(x) {
v<-as.numeric(substr(x,2,nchar(x)-1))
paste0(v+5,".1")
}
m <- gregexpr("\\(\\d+\\)", x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
# [1] "990288.1M 36.1O 34.1M 6365.1M"
Of course you can make f do whatever you like just make sure it's vector-friendly. Of course, you could wrap this in your own function
gsubf <- function(pattern, x, f) {
m <- gregexpr(pattern, x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
}
gsubf("\\(\\d+\\)", x, f)
Note that in these examples we're not using a capture group, we're just grabbing the entire match. There are ways to extract the capture groups but they are a bit messier. If you wanted to provide an example where such an extraction is required, I might be able to come up with something fancier.

To use a callback within a regex-capable replacement function, you may use either gsubfn or stringr functions.
When choosing between them, note that stringr is based on ICU regex engine and with gsubfn, you may use either the default TCL (if the R installation has tcltk capability, else it is the default TRE) or PCRE (if you pass the perl=TRUE argument).
Also, note that gsubfn allows access to all capturing groups in the match object, while str_replace_all will only allow to manipulate the whole match only. Thus, for str_replace_all, the regex should look like (?<=\()\d+(?=\)), where 1+ digits are matched only when they are enclosed with ( and ) excluding them from the match.
With stringr, you may use str_replace_all:
library(stringr)
string <- "(990283)M (31)O (29)M (6360)M"
## Callback function to increment found number:
f <- function(x) { as.integer(x) + 1 }
str_replace_all(string, "(?<=\\()\\d+(?=\\))", function(m) f(m))
## => [1] "(990284)M (32)O (30)M (6361)M"
With gsubfn, pass perl=TRUE and backref=0 to be able to use lookarounds and just modify the whole match:
gsubfn("(?<=\\()\\d+(?=\\))", ~ f(m), string, perl=TRUE, backref=0)
## => [1] "(990284)M (32)O (30)M (6361)M"
If you have multiple groups in the pattern, remoe backref=0 and enumerate the group value arguments in the callback function declaration:
gsubfn("(\\()(\\d+)(\\))", function(m,n,o) paste0(m,f(n),o), string, perl=TRUE)
^ 1 ^^ 2 ^^ 3 ^ ^^^^^^^ ^^^^

This is for multiple different replacements.
text="foo(200) (300)bar (400)foo (500)bar (600)foo (700)bar"
f=function(x)
{
return(as.numeric(x[[1]])+5)
}
a=strsplit(text,"\\(\\K\\d+",perl=T)[[1]]
b=f(str_extract_all(text,perl("\\(\\K\\d+")))
paste0(paste0(a[-length(a)],b,collapse=""),a[length(a)]) #final output
#[1] "foo(205) (305)bar (405)foo (505)bar (605)foo (705)bar"

Here's a way by tweaking a bit stringr::str_replace(), in the replace argument, just use a lambda formula as the replace argument, and reference the captured group not by ""\\1" but by ..1, so your gsub("\\((\\d+)\\)", f("\\1"), string) will become str_replace2(string, "\\((\\d+)\\)", ~f(..1)), or just str_replace2(string, "\\((\\d+)\\)", f) in this simple case :
str_replace2 <- function(string, pattern, replacement, type.convert = TRUE){
if(inherits(replacement, "formula"))
replacement <- rlang::as_function(replacement)
if(is.function(replacement)){
grps_mat <- stringr::str_match(string, pattern)[,-1, drop = FALSE]
grps_list <- lapply(seq_len(ncol(grps_mat)), function(i) grps_mat[,i])
if(type.convert) {
grps_list <- type.convert(grps_list, as.is = TRUE)
replacement <- rlang::exec(replacement, !!! grps_list)
replacement <- as.character(replacement)
} else {
replacement <- rlang::exec(replacement, !!! grps_list)
}
}
stringr::str_replace(string, pattern, replacement)
}
str_replace2(
"foo (4)",
"\\((\\d+)\\)",
sqrt)
#> [1] "foo 2"
str_replace2(
"foo (4) (5)",
"\\((\\d+)\\) \\((\\d+)\\)",
~ sprintf("(%s)", ..1 * ..2))
#> [1] "foo (20)"
Created on 2020-01-24 by the reprex package (v0.3.0)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

grep at the beginning of the string with fixed =T in R? - r

If you want to match an exact string (string 1) at the beginning of the string (string 2), then just subset your string 2 to be the same length as string 1 and use ==, should be fairly fast.

Do you want to use fixed=T because of the . in the pattern? In that case you can just escape the . this would work: grep("^a\\.", c("a.b", "cac", "sss", "ca.f"))

Related

Checking if a vector starts with a number

gsub to remove unwanted precision

Use capture group within gsub() as index/name for another object (R) [duplicate]

How to use grep to match exactly with space?

Applying a function to a backreference within gsub in R

Categories

Resources