remove multiple patterns from text vector r - r

I want to remove multiple patterns from multiple character vectors. Currently I am going:
a.vector <- gsub("#\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]], "", a.vector)
etc etc.
This is painful. I was looking at this question & answer: R: gsub, pattern = vector and replacement = vector but it's not solving the problem.
Neither the mapply nor the mgsub are working. I made these vectors
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")
Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.
a.vector looks like this:
[4951] "#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "#stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
I want:
[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing" `

I know this answer is late on the scene but it stems from my dislike of having to manually list the removal patterns inside the grep functions (see other solutions here). My idea is to set the patterns beforehand, retain them as a character vector, then paste them (i.e. when "needed") using the regex seperator "|":
library(stringr)
remove <- c("#\\w+", "http\\w+", "[[:punct:]]")
a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))
Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.

Try combining your subpatterns using |. For example
>s<-"#karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("#\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
Consider creating your remove vector as you suggested, then applying it in a loop
> s1 <- s
> remove<-c("#\\w+","http\\w+","[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
This approach will need to be expanded to apply it to the entire table or vector, of course. But if you put it into a function which returns the final string, you should be able to pass that to one of the apply variants

In case the multiple patterns that you are looking for are fixed and don't change from case-to-case, you can consider creating a concatenated regex that combines all of the patterns into one uber regex pattern.
For the example you provided, you can try:
removePat <- "(#\\w+)|(http\\w+)|([[:punct:]])"
a.vector <- gsub(removePat, "", a.vector)

I had a vector with statement "my final score" and I wanted to keep on the word final and remove the rest. This what worked for me based on Marian suggestion:
str_remove_all("my final score", "my |score")
note: "my final score" is just an example. I was dealing with a vector.

Related

Nb_words code counting partial matches in R

before I get started, I would like you to know that I am completely new to coding in R. For a group assignment our professor set up a database by scraping data from Amazon. Within the database, which is called 'dat', there is a column named 'product_name'. We were given a set group of utilitarian words. I think you can guess where this is going. Within the column 'product_name' we have to find for each product name whether any of the utilitarian words appeared. If yes, how many times. We were given the following code by our professor to use for this assignment:
nb_words <- function(lexicon,corpus){
rowSums(sapply(lexicon, function(x) grepl(x, corpus)))
}
after which i created the following codes:
uti_words <-c("additives","antioxidant","artificial", "busy", "calcium","calories", "carb", "carbohydrates", "chemicals", "cholesterol", "convenient", "dense", "diet", "fast")
sentences <- (dat$product_name)
nb_words (lexicon=uti_words,corpus=sentences)
when i run nb_words, however, I noticed something went wrong. A sentence contained the word 'breakfast'. My code counted this as a match because the word 'fast' from 'uti_words' matched with it. I don't want this to happen, does anyone know how to make it so that I only get exact matches and no partial matches?
We may have to add word boundary (\\b) to avoid partial matches
uti_words <- paste0("\\b", trimws(uti_words), "\\b")
Or another option is to change the grepl part of the code with fixed = TRUE
nb_words <- function(lexicon,corpus){
rowSums(sapply(lexicon, function(x) grepl(x, corpus, fixed = TRUE)))
}

How can I dynamically get words surrounding a keyword?

I have a sentence that may contain keywords. I search for them, if one is true, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to function without directly inputing the keywords?
Final note: I do not completely understand the syntax used to get surrounding objects. Basic r books have not covered this. Can someone explain please?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
Hover the cursor over each element of the expression at this demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.

In R: Searching a column for different string patterns and replace all of them

I have a column with different game titles. In order to collect them, I have to change all of them to a singluar spelling.
For example, I have:
str_replace_all(FavouriteGames_DF$FavGame1, pattern = c("SKYRIM|
THE ELDER SCROLLS V: SKYRIM|
ELDER SCROLLS SKYRIM|
ELDER SCROLLS V SKYRIM|
SKYRIM (BETHESDA 2011)|
SKYRIM (MODDED)|
THE ELDERSCROLLS V: SKYRIM"),
replacement = "THE ELDER SCROLLS 5: SKYRIM")
The problem is, that str_replace_all is kinda bad for this, as it can't just search for any matching pattern and replace it with the replacement, but apparently has to go through it in order and I can't predict where in the DataSet which term will arrive.
I do not want the function to replace incomplete matches (ie., turning "The ELDERSCROLLS V: SKYRIM" to THE ELDERSCOLLS V: THE ELDER SCROLL 5: Skyrim")
Putting the patterns into pattern = c("1", "2") it will not work at all, because it can only check for the patterns in order.
I also tried the FindReplace function from the DataCombine package, but that one doesn't seem to work either for reasons I do not quite understand (claiming I am missing dimensions and the vector not being a character vector). Anyway, I want to use as few packages as possible and would prefer to stay in the tidyverse.
Does anybody have a good solution? I do not want to search for each term on it's own as I have to do this a lot and I already have to do it for 6 columns as mutate_at doesn_t seem to work with str_replace.
Thanks!
My comment as an answer:
FavouriteGames_DF[FavouriteGames_Df$FavGame1 %in% pattern, ]$FavGame1 <- replacement
A handy solution would be to just use "SKYRIM" as a pattern, as it is the common word on all the patterns you specified. You could define a very simple function to check for that pattern and then use lapply on the specific column you want to check for:
check <- function(x){
y <- unlist(strsplit(x, " "))
if("SKYRIM" %in% y)
return("THE ELDER SCROLLS 5: SKYRIM")
else
return(x)
}
FavouriteGames_DF["FavGame1"] <- lapply(FavouriteGames_DF["FavGame1"], check)

obtaining count of phrases contained between parentheses and containing specific character

There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) are always of this format: open parentheses, word beginning with capital letter, possible spaces and other words, v., another word beginning with a capital letter, and possibly more spaces and words, close parentheses.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!
I believe I have figured out what you want, but it is hard to tell without example data. I have made and example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
"(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
"I am trying to confuse you (this is not a court case)",
"this one is also confusing (But with Capital Letters)",
"this is confusing (With Capitols and v. d)"),
stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
df$text, perl=T))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
As for italics, I would really need an example of your data. usually that kind of formatting isn't stored when you read in a string in R, so the italics effectively don't exist anymore.
Change your regex like below,
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]
DEMO

extracting only relevant comments from a list of comments

Continuing with my exploration into text analysis, i have encountered yet another roadblock.I understand the logic but don't know how to do it in R.
Here's what i want to do:
I have 2 CSVs- 1. contains 10,000 comments 2. containing a list of words
I want to select all those comments that have any of the words in the 2nd CSV. How can i go about it?
example:
**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment
**CSV 2**
sample
set
comment
**Expected output:**
this is a sample set
the comments are not real
this is a random set of words
feel free to comment
Please note:
the different forms of words is also considered, eg, comment and comments are both considered.
We can use grep after pasteing the elements in the second dataset.
v1 <- scan("file2.csv", what ="")
lines1 <- readLines("file1.csv")
grep(paste(v1, collapse="|"), lines1, value=TRUE)
#[1] "this is a sample set" "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"
First create two objects called lines and words.to.match from your files. You could do it like this:
lines <- read.csv('csv1.csv', stringsAsFactors=F)[[1]]
words.to.match <- read.csv('csv2.csv', stringsAsFactors=F)[[1]]
Let's say they look like this:
lines <- c(
'this is a sample set',
'the comments are not real',
'this is a random set of words',
'hope this helps the problem case',
'thankyou for helping out',
'i have learned a lot here',
'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')
You can then compute the matches with two nested *apply-functions:
matches <- mapply(
function(words, line)
any(sapply(words, grepl, line, fixed=T)),
list(words.to.match),
lines
)
matched.lines <- lines[which(matches)]
What's going on here? I use mapply to compute a function over each line in lines, taking words.to.match as the other argument. Note that the cardinality of list(words.to.match) is 1. I just recycle this argument across each application. Then, inside the mapply function I call an sapply function to check whether any of the words match the line (I check for the match via grepl).
This is not necessarily the most efficient solution, but it's a bit more intelligible to me. Another way you could compute matches is:
matches <- lapply(words.to.match, grepl, lines, fixed=T)
matches <- do.call("rbind", matches)
matches <- apply(matches, c(2), any)
I dislike this solution because you need to do a do.call("rbind",...), which is a bit hacky.

Resources