I have the following vector:
X <- c("mama.log", "papa.log", "mimo.png", "mentor.log")
How do I retrieve another vector that only contains elements starting with "m" and ending with ".log"?
You can use grepl with a regular expression (the trailing $ anchors ".log" to the end of the string):
X[grepl("^m.*\\.log$", X)]
Try this:
grep("^m.*[.]log$", X, value = TRUE)
## [1] "mama.log" "mentor.log"
A variation of this is to use a glob rather than a regular expression:
grep(glob2rx("m*.log"), X, value = TRUE)
## [1] "mama.log" "mentor.log"
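To see why the glob version is equivalent to the regular expression above, it can help to print what glob2rx() produces. This is a small sketch using only base R and the X vector from the question; the variable names rx and res_glob are just for illustration.

```r
X <- c("mama.log", "papa.log", "mimo.png", "mentor.log")

# glob2rx() (from utils) translates a wildcard pattern into an anchored regex
rx <- glob2rx("m*.log")
print(rx)
# [1] "^m.*\\.log$"

# Using the translated pattern is the same as writing the regex by hand
res_glob <- grep(rx, X, value = TRUE)
print(res_glob)
# [1] "mama.log"   "mentor.log"
```

Note that the translated pattern is anchored at both ends, so the glob only matches whole strings.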
The documentation on the stringr package says:
str_subset() is a wrapper around x[str_detect(x, pattern)], and is equivalent to grep(pattern, x, value = TRUE). str_which() is a wrapper around which(str_detect(x, pattern)), and is equivalent to grep(pattern, x).
So, in your case, a more elegant way to accomplish your task using the tidyverse instead of base R is as follows.
library(tidyverse)
c("mama.log", "papa.log", "mimo.png", "mentor.log") %>%
str_subset(pattern = "^m.*\\.log$")
which produces the output:
[1] "mama.log" "mentor.log"
Using pipes...
library(tidyverse)
c("mama.log", "papa.log", "mimo.png", "mentor.log") %>%
.[grepl("^m.*\\.log$", .)]
[1] "mama.log" "mentor.log"
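If no real pattern matching is needed, the same filter can be written without any regular expression. The sketch below uses base R's startsWith() and endsWith(), and then shows on a made-up file name ("mama.log.bak", an assumption for illustration) why the trailing $ matters in the regex versions.

```r
X <- c("mama.log", "papa.log", "mimo.png", "mentor.log")

# Fixed-string tests: no escaping of "." required
keep <- startsWith(X, "m") & endsWith(X, ".log")
res_fixed <- X[keep]
print(res_fixed)
# [1] "mama.log"   "mentor.log"

# Hypothetical names: the second one contains ".log" but does not end with it
Y <- c("mama.log", "mama.log.bak")
loose  <- grepl("^m.*\\.log", Y)    # no end anchor: TRUE TRUE
strict <- grepl("^m.*\\.log$", Y)   # end anchor:    TRUE FALSE
```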
I would like to find the rows in which a given character string occurs.
The expected answer would be the four row numbers whose values contain the string BTC.
library(stringr)
library(quantmod)
symbols <- stockSymbols()
symbols <- symbols[,1]
u <- symbols
a <- "BTC"
str_detect(a, u)
table(str_detect(a, u))
We could use grepl with which:
which(grepl(a, u))
You could either use the tidyverse way, using dplyr's filter() function (note that == keeps only rows whose value is exactly "BTC", not rows that merely contain it):
filter(dataset, column == "BTC")
Or using the grep() function from base R:
grep("BTC", dataset$column)
That will give you the index (i.e. the position) of what you are looking for.
Another base R option might be which + regexpr (but I think grep or grepl is more efficient and straightforward):
which(regexpr(a, u)>0)
You can use grep to get the index where pattern a occurs.
#Index
grep(a, u)
#[1] 3437
#Value
grep(a, u, value = TRUE)
#[1] "EBTC"
Using stringr:
library(stringr)
#Index
str_which(u, a)
#Value
str_subset(u, a)
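Since stockSymbols() needs a network connection, here is a self-contained sketch of the same ideas on a small made-up vector (the symbols in u are assumptions for illustration, not real query results). It contrasts substring matching, exact matching, and case-insensitive matching in base R.

```r
u <- c("AAPL", "BTCS", "EBTC", "MSFT", "btc")  # hypothetical symbols
a <- "BTC"

# Positions of elements that contain the literal string "BTC"
idx <- grep(a, u, fixed = TRUE)
print(idx)
# [1] 2 3

# The matching values themselves
vals <- u[idx]
print(vals)
# [1] "BTCS" "EBTC"

# Exact equality finds nothing here, because no element is exactly "BTC"
exact <- which(u == a)

# Case-insensitive regex match also picks up "btc"
ci <- grep(a, u, ignore.case = TRUE)
print(ci)
# [1] 2 3 5
```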
I'm new to R and am stuck with backreferencing that doesn't seem to work. In:
gsub("\\((\\d+)\\)", f("\\1"), string)
It correctly grabs the number in between parentheses but doesn't apply the (correctly defined, working otherwise) function f to replace the number --> it's actually the string "\1" that passes through to f.
Am I missing something or is it just that R does not handle this? If so, any idea how I could do something similar, i.e. applying a function "on the fly" to the (actually many) numbers that occur in between parentheses in the text I'm parsing?
Thanks a lot for your help.
R does not have the option of applying a function directly to a match via gsub. You'll actually have to extract the match, transform the value, then replace the value. This is relatively easy with the regmatches function. For example:
x<-"(990283)M (31)O (29)M (6360)M"
f<-function(x) {
v<-as.numeric(substr(x,2,nchar(x)-1))
paste0(v+5,".1")
}
m <- gregexpr("\\(\\d+\\)", x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
# [1] "990288.1M 36.1O 34.1M 6365.1M"
Of course, you can make f do whatever you like; just make sure it's vectorized, because it receives all matches from a string at once. You could also wrap this in your own function:
gsubf <- function(pattern, x, f) {
m <- gregexpr(pattern, x)
regmatches(x, m) <- lapply(regmatches(x, m), f)
x
}
gsubf("\\(\\d+\\)", x, f)
Note that in these examples we're not using a capture group, we're just grabbing the entire match. There are ways to extract the capture groups but they are a bit messier. If you wanted to provide an example where such an extraction is required, I might be able to come up with something fancier.
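One way to keep the parentheses in the output without dealing with capture groups is to match only the digits, using lookarounds with perl = TRUE. This is a sketch of that variation on the regmatches approach; the helper add5 is a made-up name, and it adds 5 only to mirror the earlier example.

```r
x <- "(990283)M (31)O (29)M (6360)M"

# Vectorized transformation applied to the matched digits
add5 <- function(d) as.character(as.numeric(d) + 5)

# Lookbehind/lookahead require the parentheses but keep them out of the match
m <- gregexpr("(?<=\\()\\d+(?=\\))", x, perl = TRUE)
regmatches(x, m) <- lapply(regmatches(x, m), add5)
print(x)
# [1] "(990288)M (36)O (34)M (6365)M"
```

Because only the digits are part of the match, the replacement function never sees the parentheses and does not need to strip or restore them.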
To use a callback within a regex-capable replacement function, you may use either gsubfn or stringr functions.
When choosing between them, note that stringr is based on the ICU regex engine, while gsubfn uses either the default Tcl engine (if the R installation has tcltk capability; otherwise the default is TRE) or PCRE (if you pass the perl=TRUE argument).
Also, note that gsubfn gives the callback access to all capturing groups in the match, while str_replace_all only lets you manipulate the whole match. Thus, for str_replace_all, the regex should look like (?<=\()\d+(?=\)), where one or more digits are matched only when they are enclosed in ( and ), with the parentheses themselves excluded from the match.
With stringr, you may use str_replace_all:
library(stringr)
string <- "(990283)M (31)O (29)M (6360)M"
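## Callback function to increment the number found (it returns a character vector, which recent stringr versions expect from a replacement function):
f <- function(x) { as.character(as.integer(x) + 1) }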
str_replace_all(string, "(?<=\\()\\d+(?=\\))", function(m) f(m))
## => [1] "(990284)M (32)O (30)M (6361)M"
With gsubfn, pass perl=TRUE and backref=0 to be able to use lookarounds and just modify the whole match:
library(gsubfn)
gsubfn("(?<=\\()\\d+(?=\\))", ~ f(m), string, perl=TRUE, backref=0)
## => [1] "(990284)M (32)O (30)M (6361)M"
If you have multiple groups in the pattern, remove backref=0 and enumerate the group value arguments in the callback function declaration:
gsubfn("(\\()(\\d+)(\\))", function(m,n,o) paste0(m,f(n),o), string, perl=TRUE)
## Groups 1, 2 and 3 are passed to the callback as m, n and o; only n (the digits) goes through f.
This is for multiple different replacements.
text="foo(200) (300)bar (400)foo (500)bar (600)foo (700)bar"
f=function(x)
{
return(as.numeric(x[[1]])+5)
}
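a=strsplit(text,"\\(\\K\\d+",perl=TRUE)[[1]]
# extract the digits with base R; \K is PCRE syntax and is not supported by the ICU engine used in current stringr
b=f(regmatches(text,gregexpr("\\(\\K\\d+",text,perl=TRUE)))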
paste0(paste0(a[-length(a)],b,collapse=""),a[length(a)]) #final output
#[1] "foo(205) (305)bar (405)foo (505)bar (605)foo (705)bar"
Here's a way that tweaks stringr::str_replace() a bit: pass a lambda formula as the replacement argument and reference the captured group not by "\\1" but by ..1. Your gsub("\\((\\d+)\\)", f("\\1"), string) then becomes str_replace2(string, "\\((\\d+)\\)", ~f(..1)), or just str_replace2(string, "\\((\\d+)\\)", f) in this simple case:
str_replace2 <- function(string, pattern, replacement, type.convert = TRUE){
if(inherits(replacement, "formula"))
replacement <- rlang::as_function(replacement)
if(is.function(replacement)){
grps_mat <- stringr::str_match(string, pattern)[,-1, drop = FALSE]
grps_list <- lapply(seq_len(ncol(grps_mat)), function(i) grps_mat[,i])
if(type.convert) {
grps_list <- type.convert(grps_list, as.is = TRUE)
replacement <- rlang::exec(replacement, !!! grps_list)
replacement <- as.character(replacement)
} else {
replacement <- rlang::exec(replacement, !!! grps_list)
}
}
stringr::str_replace(string, pattern, replacement)
}
str_replace2(
"foo (4)",
"\\((\\d+)\\)",
sqrt)
#> [1] "foo 2"
str_replace2(
"foo (4) (5)",
"\\((\\d+)\\) \\((\\d+)\\)",
~ sprintf("(%s)", ..1 * ..2))
#> [1] "foo (20)"
Created on 2020-01-24 by the reprex package (v0.3.0)
Let's say I have a function named Fun1 in which I use many different built-in R functions for various tasks. How can I get a list of the built-in functions used inside Fun1?
Fun1 <- function(x,y){
sum(x,y)
mean(x,y)
c(x,y)
print(x)
print(y)
}
So my output should be a character vector, i.e. sum, mean, c, print, because these are the built-in functions I have used inside Fun1.
I have tried using the grep function:
grep("\\(",body(Fun1),value=TRUE)
# [1] "sum(x, y)" "mean(x, y)" "c(x, y)" "print(x)" "print(y)"
It looks OK, but the arguments (x and y) should not appear. I only want the names of the functions used inside the body of Fun1.
So my overall goal is to print the unique list of built-in functions, or any user-created functions, called inside a particular function, here Fun1.
Any help on this is highly appreciated. Thanks.
You could use all.vars() to get all the variable names (including functions) that appear inside the body of Fun1, then compare that with some prepared list of functions. You mention in-built functions, so I will compare it with the base package object names.
## full list of variable names inside the function body
(vars <- all.vars(body(Fun1)[-1], functions = TRUE))
# [1] "sum" "x" "y" "mean" "c" "print"
## compare it with the base package object names
intersect(vars, ls(baseenv()))
# [1] "sum" "mean" "c" "print"
I removed the first element of the function body because presumably you don't care about {, which would have been matched against the base package list.
Another possibility, albeit a bit less reliable, would be to compare the formal arguments of Fun1 to all the variable names in the function. Like I said, likely less reliable though because if you make assignments inside the function you will end up with incorrect results.
setdiff(vars, names(formals(Fun1)))
# [1] "sum" "mean" "c" "print"
These are fun though, and you can fiddle around with them.
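Access to the parser tokens is available with functions from utils. Note that keep.source=TRUE is needed for getParseData() to return anything when R runs non-interactively (e.g. under Rscript).
tokens <- utils::getParseData(parse(text=deparse(body(Fun1)), keep.source=TRUE))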
unique(tokens[tokens[["token"]] == "SYMBOL_FUNCTION_CALL", "text"])
[1] "sum" "mean" "c" "print"
This should be somewhat helpful, although it returns every function call, including repeats.
library(magrittr) # provides %>%
func_list = Fun1 %>%
body() %>% # extracts the function body
toString() %>% # converts to single string
gsub("[{}]", "", .) %>% # removes curly braces
gsub("\\s*\\([^\\)]+\\)", "", .) %>% # removes all contents between brackets
strsplit(",") %>% # splits strings at commas
unlist() %>% # converts to vector
trimws(., "both") # removes all white spaces before and after
[1] "" "sum" "mean" "c" "print" "print"
> table(func_list)
func_list
          c  mean print   sum
    1     1     1     2     1
This is extremely limited to your example... you could modify this to be more robust. It will fall over where a function has brackets nesting other functions etc.
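For nested calls, one alternative to string manipulation is to walk the parsed expression directly. The following is a minimal sketch in base R, not a complete solution: called_functions and Fun2 are made-up names for illustration, and the walker does not handle calls with empty arguments such as x[1, ].

```r
# Recursively collect the names of all functions called in an expression.
# Limitation of this sketch: calls with empty arguments, such as x[1, ],
# are not handled.
called_functions <- function(expr) {
  if (!is.call(expr)) {
    return(character(0))
  }
  fn <- expr[[1]]
  # The first element of a call is usually a symbol such as `sum`
  own <- if (is.symbol(fn)) as.character(fn) else character(0)
  # Descend into every element to pick up nested calls
  nested <- unlist(lapply(as.list(expr), called_functions))
  unique(c(own, nested))
}

# Hypothetical function with nested calls
Fun2 <- function(x, y) {
  z <- sum(sqrt(x), abs(y))
  print(round(z, 2))
}

found <- called_functions(body(Fun2))
# Drop syntax-level functions such as `{` and `<-`
found <- setdiff(found, c("{", "<-"))
print(found)
# [1] "sum"   "sqrt"  "abs"   "print" "round"
```

Working on the call tree rather than on deparsed text means nesting depth and whitespace do not matter, at the cost of also reporting operators (which are functions in R) unless they are filtered out.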
This is not so pretty, but it works:
Fun1 <- function(x,y){
sum(x,y)
mean(x,y)
c(x,y)
print(x)
print(y)
}
getFNamesInFunction <- function(f.name){
f <- deparse(body(get(f.name)))
f <- f[grepl(pattern = "\\(", x = f)]
f <- sapply(X = strsplit(split = "\\(", x = f), FUN = function(x) x[1])
unique(trimws(f[f != ""]))
}
getFNamesInFunction("Fun1")
[1] "sum" "mean" "c" "print"
as.list(Fun1)[3]
gives you the part of the function between the curly braces.
{
sum(x, y)
mean(x, y)
c(x, y)
print(x)
print(y)
}
Hence
gsub("\\(.*$", "", as.list(Fun1)[3])
gives you everything before the first "(", which ends with what is presumably the name of the first function.
Taking this as a starting point, you should be able to add a loop that gives you the other functions and not only the first one.
I have a data frame sp which contains several species names, but as they come from different databases, they are written in different ways.
For example, one species can appear as both Urtica dioica and Urtica dioica L..
To correct this, I use the following code, which extracts only the first two words from a row:
paste(strsplit(sp[i,"sp"]," ")[[1]][1],strsplit(sp[i,"sp"]," ")[[1]][2],sep=" ")
For now, this code is integrated in a for loop, which works but takes ages to finish:
for (i in seq_along(sp$sp)) {
sp[i,"sp2"] = paste(strsplit(sp[i,"sp"]," ")[[1]][1],
strsplit(sp[i,"sp"]," ")[[1]][2],
sep=" ")
}
Is there a way to improve this basic code using vectors or an apply function?
You could just use vectorized regular expression functions:
library(stringr)
x <- c("Urtica dioica", "Urtica dioica L.")
> str_extract(string = x,"\\w+ \\w+")
[1] "Urtica dioica" "Urtica dioica"
I happen to have found stringr convenient here, but with the right regex for your specific data you could do this just as well with base functions like gsub.
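As a sketch of the base R route mentioned above, sub() with one capture group keeps the first two whitespace-separated words and drops the rest. The small data frame below is made up to mirror the question; the column names sp and sp2 follow the question's loop.

```r
# Hypothetical data frame mirroring the question
sp <- data.frame(sp = c("Urtica dioica", "Urtica dioica L.", "Urtica"),
                 stringsAsFactors = FALSE)

# Keep the first two words; the whole column is processed in one call
sp$sp2 <- sub("^(\\S+\\s+\\S+).*$", "\\1", sp$sp)
print(sp$sp2)
# [1] "Urtica dioica" "Urtica dioica" "Urtica"
```

Strings with fewer than two words do not match the pattern and are returned unchanged, so no explicit loop or length check is needed.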
You might want to check to see if there are more than 2 words in the string before doing each extraction:
if((sapply(gregexpr("\\W+", i), length) + 1) > 2){
...
}
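The word-count check can itself be vectorized rather than run per element. The sketch below, in base R, counts words with lengths(strsplit(...)) and trims only the entries that have more than two; the vector x and the names n_words, long and x2 are assumptions for illustration.

```r
x <- c("Urtica dioica", "Urtica dioica L.", "Urtica")

# Number of whitespace-separated words in each string
words   <- strsplit(x, "\\s+")
n_words <- lengths(words)
print(n_words)
# [1] 2 3 1

# Rebuild only the strings that have more than two words
long <- n_words > 2
x2 <- x
x2[long] <- vapply(words[long],
                   function(w) paste(w[1:2], collapse = " "),
                   character(1))
print(x2)
# [1] "Urtica dioica" "Urtica dioica" "Urtica"
```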
There's a function for that.
Also from stringr, the word function
> choices <- c("Urtica dioica", "Urtica dioica L..")
> library(stringr)
> word(choices, 1:2)
# [1] "Urtica" "dioica"
> word(choices, rep(1:2, 2))
# [1] "Urtica" "dioica" "Urtica" "dioica"
These return individual words. For strings containing the first two words,
> word(choices, 1, 2)
# [1] "Urtica dioica" "Urtica dioica"
The final line gets the first two words from each string in the vector choices.