String replacement using sub function - r

I am attempting to extract the names of NBA players from a column in a database. However, the names in the Names column have the following format:
"LeBron James\\jamesle01"
I used the following regex expression inside a sub function to attempt to keep only the name portion:
sub("([A-Z]\\w+\\s*-*'*[a-z]*\\s*\\.*|[A-Z]\\.\\s*)\\*\\*[a-z]*\\d*\\d*", replacement = "\\1", x = nba_salaries$Names)
The expression is meant to account for unusual names that contain more than just alphanumeric characters (e.g. Michael Kidd-Gilchrist, De'Andre Jordan, Luc Mbah a Moute, etc.)
However, when I run the following,
head(nba_salaries$Names)
The names end up being in the same format.
I have used regexr.com to ensure that the regex expression captures the strings properly.

How about this: you can split the text on the "\\" string and then take only the first element:
text <- c( "LeBron James\\jamesle01", "Michael Jordan\\jamesle01" )
sapply( strsplit( text, "\\\\" ), "[", 1 )
Which gives
[1] "LeBron James" "Michael Jordan"
To explain: "[" is a function*, which is being called within sapply. We pass the result of strsplit as the X argument of sapply and apply the [ function* to each element, with the parameter 1, to take the first element. Here's another way to put it:
text <- strsplit( text, "\\\\" )
This will output a list, with each list element containing a vector, where the first element is the text before the "\\" string, and the second element contains any text after it. Then we use the "[" function*, passing the parameter 1, to take the first element of each of those vectors:
text <- sapply( X = text, FUN = "[", 1 )
Edit to add, I personally like using the magrittr pipe for things like this, just to make it a little more readable:
library( magrittr )
text <- strsplit( x = text, split = "\\\\" ) %>%
sapply( FUN = "[", 1 )
* The "[" function is the function called when you subset with [], e.g. vector[1:3] or, in this case, vector[1] (thanks #MathewLundberg for the suggestion here).
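As a side note on the original attempt, here is a minimal sketch of a sub-based alternative. It assumes each string holds exactly one literal backslash (which R prints as "\\"); player_names and clean_names are hypothetical names used only for illustration. In R source, "\\*" is the regex \*, which matches a literal asterisk, so a literal backslash has to be written as "\\\\". The result of sub also has to be assigned back, since sub does not modify its input in place.

```r
# Assumed sample data: each string contains one literal backslash,
# which R prints as "\\".
player_names <- c("LeBron James\\jamesle01", "De'Andre Jordan\\jordade01")

# "\\\\" in R source is the regex \\ (one literal backslash).
# Remove the backslash and everything after it, then assign the result.
clean_names <- sub("\\\\.*$", "", player_names)

print(clean_names)
```

Because the pattern only looks for the backslash, it should not need special handling for hyphens, apostrophes or multi-word names.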

Related

In r, use string just as if I had typed it in

I am dealing with one aspect of r that really confuses me. What I have built is a line of code invoking str_remove saved as a string. If I was to copy-paste that string into where I want to use this line of code, it works perfectly as intended.
However I cannot get r to interpret this code correctly. I have tried using e.g. parse, but the escape characters intended for str_remove regular expression throw up errors.
Is there not a simple way to just treat a string as if it was a line of typed code?
Here is my reproducible example:
Make toy data:
maf_list_context <- list(as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either"))) )
maf_list_context <- lapply(maf_list_context, function(x)
{colnames(x) <- c("CONTEXT", "want_it")
return(x)
})
The idea is that context will be an argument to a function and that it can be flexible, so the user can supply any number of contexts of interest separated by commas. These will be stringr regular expressions designed to look for particular contexts in DNA within a string of 11 bases. Here for example we can use two contexts of interest. The code that follows combines these to make an expression for use later in selecting the appropriate rows from the dataframes in the list.
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for(i in 1:length(contextvec)){
contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
'contextexpression' is now:
[1] "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')"
If I were to paste this expression directly into apply, it works precisely as I would want it.
> lapply(maf_list_context, function(x){
+
+ x[str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
+
+ })
[[1]]
CONTEXT want_it
1 ATTATCGAATT this one
[[2]]
CONTEXT want_it
1 ATTACGTAATT this one too
But of course if I use the string there, it does not.
> lapply(maf_list_context, function(x){
+
+ x[contextexpression, ]
+
+ })
[[1]]
CONTEXT want_it
NA <NA> <NA>
[[2]]
CONTEXT want_it
NA <NA> <NA>
I have tried many different functions but none of them make this work. Is there a way of having r interpret this string as if I had typed it in directly?
The whole reprex:
if (!require("stringr")) {
install.packages("stringr", dependencies = TRUE)
library("stringr")
}
maf_list_context <- list(as.data.frame(cbind(c("ATTATCGAATT", "ATTATTTTAAA"), c("this one", "not that one"))),
as.data.frame(cbind(c("ATTACGTAATT", "ATTATTTTAAA"), c("this one too", "not that one either"))) )
maf_list_context <- lapply(maf_list_context, function(x){
colnames(x) <- c("CONTEXT", "want_it")
return(x)
})
context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- unlist(str_split(context, pattern = ", "))
contextexpression <- c()
for(i in 1:length(contextvec)){
contextexpression <- paste0(contextexpression, "str_detect(x$CONTEXT, pattern = '", contextvec[i], "') |")
}
contextexpression <- str_remove(contextexpression, pattern = " \\|$")
maf_list_select <- lapply(maf_list_context, function(x){
x[contextexpression, ]
})
I'm not sure I completely follow what you want your input to be and how to apply it, but your problem seems to be with what you're passing to the subset operator, i.e. x[<codehere>]
To filter rows this way, the subset operator needs a logical vector. When you "paste the expression" you are actually pasting an expression that gets evaluated to a logical vector, hence it properly subsets. When you pass the variable contextexpression, you are actually passing a string, which R treats as a row name; no row has that name, so you get a row of NAs. As R sees it:
x[ "str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}')", ]
Instead of (notice the missing quotes):
x[ str_detect(x$CONTEXT, pattern = '\\w{5}CG\\w{4}') |str_detect(x$CONTEXT, pattern = '\\w{4}CG\\w{5}'), ]
You want to apply each context to each member of the list to get a logical vector and then subset.
purrr::map2(maf_list_context, contextvec, ~.x[str_detect(.x$CONTEXT, .y), ])
If you want to compare every item in contextvec to every item in maf_list_context, then it's a little more complicated but doable.
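purrr::map2(
maf_list_context,
purrr::map(
maf_list_context,
function(data){
# OR together the matches for every pattern, starting from all FALSE (one entry per row)
purrr::reduce(contextvec,
function(prev, cond) str_detect(data$CONTEXT, cond) | prev,
.init = logical(nrow(data))
)
}
),
# the trailing comma selects rows rather than columns
~.x[.y, ]
)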
There's probably a more efficient way to short circuit the matching against the items in maf_list_context, but the general approach applies. The str_detect call handles the comparison of a single condition against a single maf_list_context item. The reduce call combines the results of comparing every item in contextvec against a single item in maf_list_context into one logical vector with one entry per row. The inner map iterates through maf_list_context. The outer map2 iterates through maf_list_context and the list of logical vectors created by the inner map in parallel, to subset for matches.
If maf_list_context has n items and contextvec has m items:
reduce makes m comparisons, resulting in 1 logical vector
map makes n calls to reduce, resulting in n logical vectors
map2 makes n iterations to subset maf_list_context
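Since each context is itself a regular expression, another option is to join the patterns with "|" and test each data frame once. The following is a minimal base R sketch under that assumption, not the approach from the answer above; combined_pattern is a hypothetical name, and grepl is used in place of str_detect so the block has no package dependencies.

```r
# Toy data rebuilt from the question.
maf_list_context <- list(
  data.frame(CONTEXT = c("ATTATCGAATT", "ATTATTTTAAA"),
             want_it = c("this one", "not that one")),
  data.frame(CONTEXT = c("ATTACGTAATT", "ATTATTTTAAA"),
             want_it = c("this one too", "not that one either"))
)

context <- "\\w{5}CG\\w{4}, \\w{4}CG\\w{5}"
contextvec <- strsplit(context, ", ", fixed = TRUE)[[1]]

# Join the user-supplied patterns into a single alternation.
combined_pattern <- paste(contextvec, collapse = "|")

# Keep the rows whose CONTEXT matches at least one pattern.
maf_list_select <- lapply(maf_list_context, function(x) {
  x[grepl(combined_pattern, x$CONTEXT, perl = TRUE), , drop = FALSE]
})

print(maf_list_select)
```

This avoids building code as a string altogether: only the pattern is assembled from text, and the subsetting expression stays ordinary R code.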

How to extract bracket from string into new columns

I need to export information from a string into different columns.
More specifically the content of the brackets within the string;
Lets say I have a string
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
What I am trying to output is a vector with the contents of the brackets. Where a bracket contains several terms separated by semicolons, each term should become its own bracketed string, with the parentheses and their contents removed.
e.g.
tmp <- function(a)
Result
tmp
"[K89]" , "[K96]", "[N-Term]", "[S87]", "[S93]"
My approach so far:
pattern <- "(\\[.*?\\])"
hits <- gregexpr(pattern, a)
matches <- regmatches(a, hits)
unlisted_matches <- unlist(matches)
Results
"[K89; K96]" "[N-Term]" "[S87(100); S93(100)]"
This does give me the brackets but still doesn't split the terms, and for some reason I am not able to efficiently separate the ";" terms.
Here's a way using the tidyverse :
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
library(tidyverse)
a %>%
# extract the text between square brackets, not keeping the brackets, and unlist
str_extract_all("(?<=\\[).*?(?=\\])") %>%
unlist() %>%
# remove round brackets and content
str_replace_all("\\(.*?\\)", "") %>%
# split by ";" and unlist
str_split("; ") %>%
unlist() %>%
# put the brackets back
str_c("[",.,"]")
#> [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
You may use
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"
pattern <- "(?:\\G(?!^)(?:\\([^()]*\\))?\\s*;\\s*|\\[)\\K[^][;()]+"
matches <- regmatches(a, gregexpr(pattern, a, perl=TRUE))
unlisted_matches <- paste0("[", unlist(matches),"]")
unlisted_matches
## => [1] "[K89]" "[K96]" "[N-Term]" "[S87]" "[S93]"
See the R demo and the regex demo.
Pattern details
(?:\G(?!^)(?:\([^()]*\))?\s*;\s*|\[) - either the end of the previous successful match (\G(?!^)) followed with any substring inside round parentheses (optional, see (?:\([^()]*\))?) and then a ; enclosed with optional 0+ whitespaces (see \s*;\s*) or a [ char
\K - match reset operator discarding all text matched so far
[^][;()]+ - one or more chars other than [, ], ;, ( and ).
The paste0("[", unlist(matches),"]") part wraps the matches with square brackets.
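For comparison, the same result can be reached in base R by continuing from the question's own gregexpr/regmatches step and handling each cleanup separately. This is a minimal sketch, assuming the terms inside a bracket are always separated by ";" and that parentheses are never nested; the variable names inside, terms and result are hypothetical.

```r
a <- "2xExp [K89; K96]; 1xExp [N-Term]; 2xNum [S87(100); S93(100)]"

# 1. Pull out each bracketed chunk, e.g. "[K89; K96]".
inside <- regmatches(a, gregexpr("\\[[^]]*\\]", a))[[1]]

# 2. Strip the surrounding square brackets.
inside <- gsub("^\\[|\\]$", "", inside)

# 3. Remove round brackets together with their contents.
inside <- gsub("\\([^()]*\\)", "", inside)

# 4. Split on ";" and trim the leftover whitespace.
terms <- trimws(unlist(strsplit(inside, ";", fixed = TRUE)))

# 5. Put the square brackets back around each term.
result <- paste0("[", terms, "]")
print(result)
```

It is longer than the single-regex version, but each step can be inspected on its own when the input format changes.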

How to do a replace with backreferences, when the number of occurrences is unknown?

In order to make a few corrections to a .tex file generated by Bookdown, I need to replace occurrences of }{ with , when it is used in a citation, i.e.
s <- "Text.\\autocites{REF1}{REF2}{REF3}. More text \\autocites{REF4}{REF5} and \\begin{tabular}{ll}"
Should become
"Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Because I need to keep the references I tried to look into backreferences, but I cannot seem to get it right, because the number of groups to match is unknown beforehand. Also, I cannot do stringr::str_replace_all(s, "\\}\\{", ","), because }{ occurs in other places in the document as well.
My best approach so far is to use a look-behind to only do the replace when the occurrence is after \\autocites, but then I cannot get the backreferences and grouping right:
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)(\\}\\{)", "\\1,")
[1] "Text.\\autocites{REF1,REF2}{REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
stringr::str_replace_all(s, "(?<=\\\\autocites\\{)([:alnum:]+)((\\}\\{)([:alnum:]+))*", "\\1,\\4")
[1] "Text.\\autocites{REF1,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
I might be missing some completely obvious approach, so I hope someone can help.
pat matches autocites followed by the shortest string that ends in } and is followed by end of string or a non-{ character.
It then uses gsubfn to replace each occurrence of }{ in that with a comma. It uses formula notation to express the replacement function -- the body of the function is on the RHS of the ~ and because the body contains ..1 the arguments are taken to be ... . It does not use zero width lookahead or lookbehind.
library(gsubfn)
pat <- "(autocites.*?\\}($|[^{]))"
gsubfn(pat, ~ gsub("}{", ",", ..1, fixed = TRUE), s)
giving:
[1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
Variation
One minor simplification of the regular expression shown above is to remove the outer parentheses from pat and instead specify backref = 0 in gsubfn. That tells it to pass the entire match to the function. We could use ..1 to specify the argument as above, but since we know that only one argument is passed we can refer to it as x in the body of the function. Any variable name would do, as gsubfn assumes that any free variable is an argument. The output would be the same as above.
pat2 <- "autocites.*?\\}($|[^{])"
gsubfn(pat2, ~ gsub("}{", ",", x, fixed = TRUE), s, backref = 0)
Cool problem - I got to learn a new trick with str_replace_all. You can pass a function as the replacement, and it is applied to each string that the pattern picks out.
replace_brakets <- function(str) {
str_replace_all(str, "\\}\\{", ",")
}
s %>% str_replace_all("(?<=\\\\autocites\\{)([:alnum:]+\\}\\{)+", replace_brakets)
# [1] "Text.\\autocites{REF1,REF2,REF3}. More text \\autocites{REF4,REF5} and \\begin{tabular}{ll}"
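The same "match the whole citation, then fix it up" idea can also be sketched in base R with gregexpr and the replacement form of regmatches, without extra packages. This is a minimal sketch, assuming the references never contain nested braces; pat and m are illustrative names.

```r
s <- "Text.\\autocites{REF1}{REF2}{REF3}. More text \\autocites{REF4}{REF5} and \\begin{tabular}{ll}"

# Match \autocites followed by one or more {...} groups.
pat <- "\\\\autocites(\\{[^{}]*\\})+"
m <- gregexpr(pat, s, perl = TRUE)

# Within each matched citation, turn every "}{" into a comma,
# then write the edited matches back into the string.
regmatches(s, m) <- lapply(regmatches(s, m), function(hit) {
  gsub("}{", ",", hit, fixed = TRUE)
})

print(s)
```

Because \begin{tabular}{ll} is never matched by pat, its }{ is left untouched.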

How to insert backslash followed by single quote using paste0 in R?

I'm trying to separate the elements in a vector with \' and a comma using paste0. For example:
test_vector = c("test1", "test2", "test3")
I would like to use paste0 to generate the following output:
\'test1\', \'test2\', \'test3\'
Because the backslash is itself an escape character,
paste0(test_vector, collapse = "\', \'")
generates the following:
"test1', 'test2', 'test3"
How about
(x <- paste0("\\'", test_vector, "\\'", collapse = ", "))
# [1] "\\'test1\\', \\'test2\\', \\'test3\\'"
We can check the actual result with cat() (since the second backslash is only present when printed to the console).
cat(x)
# \'test1\', \'test2\', \'test3\'
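To see why the original attempt lost the backslash, it can help to compare the two string literals directly. The following is a small sketch using only base R; the comparison of "\'" with "'" relies on R treating \' as an escape for a plain single quote.

```r
test_vector <- c("test1", "test2", "test3")

# "\'" is just an escaped single quote, so no backslash is stored.
print(identical("\'", "'"))

# "\\'" stores two characters: one backslash and one single quote.
print(nchar("\\'"))

x <- paste0("\\'", test_vector, "\\'", collapse = ", ")

# writeLines() shows the raw characters, like cat() but with a newline.
writeLines(x)
```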

R text mining filtering string from text

I was wondering if there's an existing R function that given a text and a list of strings as input, will filter out the matching strings in the list that are found within the text?
For example,
x <- "This is a new way of doing things."
mywords <- c("This is", "new", "not", "maybe", "things.")
filtered_words <- Rfunc(x, mywords)
Then filtered_words will contain "This is", "new" and "things.".
Is there any such function?
We can use str_extract_all from library(stringr). The output will be a list, which can be unlisted to convert it to a vector.
library(stringr)
unlist(str_extract_all(x, mywords))
#[1] "This is" "new" "things."
filterWords = function(x, mywords){
splitwords = unlist(strsplit(x, split = " "))
return(splitwords[splitwords%in%mywords])
}
This is one possible approach. However, it will not find entries made up of more than one word, like "This is". But I thought it might give you a little more information on what you asked.
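If the entries in mywords are meant as literal text and not regular expressions, a base R sketch with fixed matching handles multi-word entries as well. This is an assumption about the intent, not something stated in the question; found is a hypothetical helper name.

```r
x <- "This is a new way of doing things."
mywords <- c("This is", "new", "not", "maybe", "things.")

# Test each entry as a literal substring of x; fixed = TRUE means
# the "." in "things." is not treated as a regex wildcard.
found <- vapply(mywords, function(w) grepl(w, x, fixed = TRUE), logical(1))

filtered_words <- mywords[found]
print(filtered_words)
```

Note that this matches substrings, so "new" would also be found inside a word such as "renew"; adding word boundaries would require switching back to a regular expression.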
