How can I dynamically get words surrounding a keyword? - r

I have a sentence that may contain keywords. I search for them, if one is true, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to function without directly inputing the keywords?
Final note: I do not completely understand the syntax used to get surrounding objects. Basic r books have not covered this. Can someone explain please?

You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"

Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"

Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.

Related

In R: Searching a column for different string patterns and replace all of them

I have a column with different game titles. In order to collect them, I have to change all of them to a singluar spelling.
For example, I have:
str_replace_all(FavouriteGames_DF$FavGame1, pattern = c("SKYRIM|
THE ELDER SCROLLS V: SKYRIM|
ELDER SCROLLS SKYRIM|
ELDER SCROLLS V SKYRIM|
SKYRIM (BETHESDA 2011)|
SKYRIM (MODDED)|
THE ELDERSCROLLS V: SKYRIM"),
replacement = "THE ELDER SCROLLS 5: SKYRIM")
The problem is, that str_replace_all is kinda bad for this, as it can't just search for any matching pattern and replace it with the replacement, but apparently has to go through it in order and I can't predict where in the DataSet which term will arrive.
I do not want the function to replace incomplete matches (ie., turning "The ELDERSCROLLS V: SKYRIM" to THE ELDERSCOLLS V: THE ELDER SCROLL 5: Skyrim")
Putting the patterns into pattern = c("1", "2") it will not work at all, because it can only check for the patterns in order.
I also tried the FindReplace function from the DataCombine package, but that one doesn't seem to work either for reasons I do not quite understand (claiming I am missing dimensions and the vector not being a character vector). Anyway, I want to use as few packages as possible and would prefer to stay in the tidyverse.
Does anybody have a good solution? I do not want to search for each term on it's own as I have to do this a lot and I already have to do it for 6 columns as mutate_at doesn_t seem to work with str_replace.
Thanks!
My comment as an answer:
FavouriteGames_DF[FavouriteGames_Df$FavGame1 %in% pattern, ]$FavGame1 <- replacement
A handy solution would be to just use "SKYRIM" as a pattern, as it is the common word on all the patterns you specified. You could define a very simple function to check for that pattern and then use lapply on the specific column you want to check for:
check <- function(x){
y <- unlist(strsplit(x, " "))
if("SKYRIM" %in% y)
return("THE ELDER SCROLLS 5: SKYRIM")
else
return(x)
}
FavouriteGames_DF["FavGame1"] <- lapply(FavouriteGames_DF["FavGame1"], check)

how to only use sub on when there are multiple values

So this is a short example of a dataframe:
x<- c("WB (16)","CT (14)WB (15)","ET (13)CITG-TILm (16)EE-SS (17)TN-SE (17)")
My question is how to get sub(".*?)", "", x)(or a different function) to work such that this will be the result:
x<-c("WB (16)","WB (15)","TN-SE(17)")
instead of
x<-c("","WB (15)")
I got different types of letters (so not only WB, CT and TN-SE),such as:
"NBIO(15)" "CITG-TP(08)" "BK-AR(10)"
So it should be a general function...
Thanks!
Could you please try following.
sub(".*[0-9]+[^)]\\)?([^)$])", "\\1", x)
Output will be as follows.
[1] "WB (16)" "WB (15)" "TN-SE (17)"
Where Input will be as follows.
> x
[1] "WB (16)" "CT (14)WB (15)"
[3] "ET (13)CITG-TILm (16)EE-SS (17)TN-SE (17)"
Explanation: Following is only for explanation purposes.
sub(" ##Using sub function of Base R here.
##sub works on method of sub(regex_to_match_current_line's_stuff, new_string/variable/value out of matched,regex, variable)
.*[0-9]+[^)]\\) ##Using look ahead method of regex by mentioning .*(everything till) a ) is NOT found then mentioning ) there to cover it too so it will match till a ) which is NOt on end of line.
? ##? this makes sure above regex is matched first and it will move for next regex condition as per look ahead functoianlity.
([^)$])", ##() means in R to put a value into R's memory to remember it kind of place holder in memory, I am mentioning here to keep everything till a ) found at last.
"\\1", ##Substitute whole line with \\1 means first place holder's value.
x) ##Mentioning variable/vector's name here.
I think that I understand what you want. This certainly works on your example.
sub(".*?([^()]+\\(\\d+\\))$", "\\1", x)
[1] "WB (16)" "WB (15)" "TN-SE (17)"
Details: This looks for something of the form SomeStuff (Numbers) at the end of the string and throws away anything before it. SomeStuff is not allowed to contain parentheses.

r replace text within a string by lookup table

I already have tried to find a solutions on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite knew at programing so pleace be patient :D...
I have a (in reality much larger) text string which look like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R! to look what's inside the brackets, to look for that same argument in the lookup table and give out what's standing in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use a regex look around to match only numbers that are inside a square bracket
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test [3]."
Or without regex lookaround by pasteing the square bracket on each column of 'lookup'
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a 2 column table) into a data frame. If your source information be a flat text file, then you can easily use read.csv to obtain a data frame. In the example below, I hard code a data frame with just two entries. Then, I iterate over it and make replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test 5 test. Test 3."
Demo
Note: As #Frank wisely pointed out, my solution would fail if your number markers (e.g. [438]) happen to have replacements which are numbers also appearing as other markers. That is, if replacing a key with a value results in yet another key, there could be problems. If this be a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])

remove multiple patterns from text vector r

I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing" `
I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex seperator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
Try combining your subpatterns using |. For example
>s<-"#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants
In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)
I had a vector with statement "my final score" and I wanted to keep on the word final and remove the rest. This what worked for me based on Marian suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.

Regular expression to find function calls in a function body

Please consider the body of read.table as a text file, created with the following code:
sink("readTable.txt")
body(read.table)
sink()
Using regular expressions, I'd like to find all function calls of the form foo(a, b, c) (but with any number of arguments) in "readTable.txt". That is, I'd like the result to contain the names of all called functions in the body of read.table. This includes nested functions of the form
foo(a, bar(b, c)). Reserved words (return, for, etc) and functions that use back-ticks ('=='(), '+'(), etc) can be included since I can remove them later.
So in general, I'm looking for the pattern text( or text ( then possible nested functions like text1(text2(, but skipping over the text if it's an argument, and not a function. Here's where I'm at so far. It's close, but not quite there.
x <- readLines("readTable.txt")
regx <- "^(([[:print:]]*)\\(+.*\\))"
mat <- regexpr(regx, x)
lines <- regmatches(x, mat)
fns <- gsub(".*( |(=|(<-)))", "", lines)
head(fns, 10)
# [1] "default.stringsAsFactors()" "!missing(text))"
# [3] "\"UTF-8\")" "on.exit(close(file))" "(is.character(file))"
# [6] "(nzchar(fileEncoding))" "fileEncoding)" "\"rt\")"
# [9] "on.exit(close(file))" "\"connection\"))"
For example, in [9] above, the calls are there, but I do not want file in the result. Ideally it would be on.exit(close(.
How can I go about improving this regular expression?
If you've ever tried to parse HTML with a regular expression you know what a nightmare it can be. It's always better to use some HTML parser and extract info that way. I feel the same way about R code. The beauty of R is that it's functional and you inspect any function via code.
Something like
call.ignore <-c("[[", "[", "&","&&","|","||","==","!=",
"-","+", "*","/", "!", ">","<", ":")
find.funcs <- function(f, descend=FALSE) {
if( is.function(f)) {
return(find.funcs(body(f), descend=descend))
} else if (is(f, "name") | is.atomic(f)) {
return(character(0))
}
v <- list()
if (is(f, "call") && !(deparse(f[[1]]) %in% call.ignore)) {
v[[1]] <- deparse(f)
if(!descend) return(v[[1]])
}
v <- append(v, lapply(as.list(f), find.funcs, descend=descend))
unname(do.call(c, v))
}
could work. Here we iterate over each object in the function looking for calls, ignoring those you don't care about. You would run it on a function like
find.funcs(read.table)
# [1] "default.stringsAsFactors()"
# [2] "missing(file)"
# [3] "missing(text)"
# [4] "textConnection(text, encoding = \"UTF-8\")"
# [5] "on.exit(close(file))"
# [6] "is.character(file)"
# ...
You can set the descend= parameter to TRUE if you want to look in calls to functions for other functions.
I'm sure there are plenty of packages that make this easier, but I just wanted to show how simple it really is.
Recursive Regex in Perl Mode
In the general case, I am sure you're aware of the hazards of trying to match such constructions: what if your file contains things like if() that you don't want to match?
That being said, I believe this recursive regex fits the requirements as I understand them
[a-z]+(\((?:`[()]|[^()]|(?1))*\))
See demo.
I'm not completely up to scratch on R syntax, but something like this should work, and you can tweak the function name and arguments to suit your needs:
grepl("[a-z]+(\\((?:`[()]|[^()]|(?1))*\\))", subject, perl=TRUE);
Explanation
[a-z]+ matches the letters before the opening parenthesis
( starts Group 1
\( matches an opening parenthesis
(?: starts a non-capture group that will be repeated. The capture group matches several possibilities:
BACKTICK[()] matches a backtick + ( or ) (sorry, don't know how to make the backtick appear in this editor
|[^()] OR match one character that is not a parenthesis
|(?1) OR match the pattern defined by the Group 1 parentheses (recurse)
)* close non-capture group, repeat zero or more times
\) matches a closing parenthesis
) ends Group 1

Resources