I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word
"hello"
"greetings"
replacement:
replacement
"hi"
"hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question my be a repeat of a previously unanswered question: Replace strings in text based on dictionary
stringr's str_replace_all would be handy in this case:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacment = c('hi','hi'), stringsAsFactors = F)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
[1] "hi my name is this" "hi friend"
This is similar to #amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .) %>%
str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
'greetings friend',
'hello, hellomy is not a word',
'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
replacement = c('hi','hi'),
stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".
Related
I have been wondering how to extract string in R using stringr or another package between the exact word "to the" (which is always lowercase) and the very second comma in a sentence.
For instance:
String: "This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want"
Desired output: "THIS IS WHAT I WANT, DO YOU SEE IT?"
I have this vector:
x<-c("This not what I want to the THIS IS WHAT I WANT, DO YOU SEE IT?, this is not what I want",
"HYU_IO TO TO to the I WANT, THIS, this i dont, want", "uiui uiu to the xxxx,,this is not, what I want")
and I am trying to use this code
str_extract(string = x, pattern = "(?<=to the ).*(?=\\,)")
but I cant seem to get it to work to properly give me this:
"THIS IS WHAT I WANT, DO YOU SEE IT?"
"I WANT, THIS"
"xxxx,"
Thank you guys so much for your time and help
You were close!
str_extract(string = x, pattern = "(?<=to the )[^,]*,[^,]*")
# [1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
# [2] "I WANT, THIS"
# [3] "xxxx,"
The look-behind stays the same, [^,]* matches anything but a comma, then , matches exactly one comma, then [^,]* again for anything but a comma.
Alternative approach, by far not comparable with Gregor Thomas approach, but somehow an alternative:
vector to tibble
separate twice by first to the then by ,
paste together
pull for vector output.
library(tidyverse)
as_tibble(x) %>%
separate(value, c("a", "b"), sep = 'to the ') %>%
separate(b, c("a", "c"), sep =",") %>%
mutate(x = paste0(a, ",", c), .keep="unused") %>%
pull(x)
[1] "THIS IS WHAT I WANT, DO YOU SEE IT?"
[2] "I WANT, THIS"
[3] "xxxx,"
I've got a list of specific words to remove a list of sentences. Do I have to loop over the list and apply a function to each regex or can I somehow call them all at once? I've tried to do so with lapply, but I'm hoping to find a better way.
string <- 'This is a sample sentence from which to gather some cool
knowledge'
words <- c('a','from','some')
lapply(words,function(x){
string <- gsub(paste0('\\b',words,'\\b'),'',string)
})
My desired output is:
This is sample sentence which to gather cool knowledge.
You can collapse across your character vector of words-to-remove with the regex OR operator("|") sometimes referred to as the "pipe" symbol.
gsub(paste0('\\b',words,'\\b', collapse="|"), '', string)
[1] "This is sample sentence which to gather cool \n knowledge"
Or:
gsub(paste0('\\b',words,'\\b\\s{0,1}', collapse="|"), '', string)
[1] "This is sample sentence which to gather cool \n knowledge"
string<-'This is a sample sentence from which to gather some cool knowledge'
words<-c('a', 'from', 'some')
library(tm)
string<-removeWords(string, words = words)
string
[1] "This is sample sentence which to gather cool knowledge"
With the tm library you can use the removeWords().
or you can loop with gsub like:
string<-'This is a sample sentence from which to gather some cool knowledge'
words<-c('a', 'from', 'some')
for(i in 1:length(words)) {
string<-gsub(pattern = words[i], replacement = '', x = string)
}
string
[1] "This is sample sentence which to gather cool knowledge"
hope that helps.
You need to use the "|" to use or in the regex :
string2 <- gsub(paste(words,'|',collapse =""),'',string)
> string2
[1] "This is sample sentence which to gather cool knowledge"
I have a list of phrases, in which I want to replace certain words with a similar word, in case it is misspelled.
library(stringr)
a4 <- "I would like a cheseburger and friees please"
badwords.corpus <- c("cheseburger", "friees")
goodwords.corpus <- c("cheeseburger", "fries")
vect.corpus <- goodwords.corpus
names(vect.corpus) <- badwords.corpus
str_replace_all(a4, vect.corpus)
# [1] "I would like a cheeseburger and fries please"
everything works perfectly, until it finds a similar string, and replaces it with another word
if I have a pattern like the following:
"plea", the correct one is "please", but when I execute it removes it and replaces it with "pleased".
What I am looking for is that if a string is already correct, it is no longer modified, in case it finds a similar pattern.
Perhaps you need to perform progressive replace. e.g. you should have multiple set of badwords and goodwords. First replace with badwords having more letters so that matching pattern is not found and then got go for smaller ones.
From the list provided by you, I would create 2 sets as:
goodwords1<-c( "three", "teasing")
badwords1<- c("thre", "teeasing")
goodwords2<-c("tree", "testing")
badwords2<- c("tre", "tesing")
First replace with 1st set and then with 2nd set. You can create many such sets.
str_replace_all takes regex as the pattern, so you can paste0 word boundaries \\b around each badwords so that a replacement will only be made if the whole word is matched:
library(stringr)
string <- c("tre", "tree", "teeasing", "tesing")
goodwords <- c("tree", "three", "teasing", "testing")
badwords <- c("tre", "thre", "teeasing", "tesing")
# Paste word boundaries around badwords
badwords <- paste0("\\b", badwords, "\\b")
vect.corpus <- goodwords
names(vect.corpus) <- badwords
str_replace_all(string, vect.corpus)
[1] "tree" "tree" "teasing" "testing"
The advantage of this is that you don't have to keep track of which strings are the longer strings.
This is what badwords looks like after pasting:
> badwords
[1] "\\btre\\b" "\\bthre\\b" "\\bteeasing\\b" "\\btesing\\b"
I have a list containing verbs. I have another list containing sentences. How do I return the index of the sentence list that contains at least a verb in the verb list for that sentence?
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
I want it to return indexes 1, 3, and 4
Using no additional packages, we can sort of "or" different search terms together using | as follows:
Original question:
verbList <- list("punching, kicking, jumping, hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- gsub(", ", "|", verbList)
grep(v, sentenceList)
New question:
verbList <- c("punching", "kicking", "jumping", "hopping")
sentenceList <- c("I am punching", "I like pineapples", "I am hopping", "I am kicking and jumping")
v <- paste(verbList, collapse = "|")
grep(v, sentenceList)
A solution from stringr and rebus. We can first split the string, and then use str_which to check if the pattern is in the vector to return the index.
library(stringr)
library(rebus)
# Check the index
result <- str_which(sentenceList, or1(verbList))
result
# [1] 1 3 4
I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.
You can take the list of words and paste them with a look-behind ((?<=)). I added the (?=.{2,}) because this will also match the "AT" in "DATE" since "AT" is in the list of words, so whatever is in the list of words will need to be followed by 2 or more characters to be split with an underscore.
The second gsub just does the capitalization
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"