Text Replacement -- Pattern is a set list of strings [r] - r

I have a string variable in a large data set that I want to cleanse based on a set list of strings. ex. pattern <- c("dog","cat") but my list will be about 400 elements long.
vector_to_clean == a
black Dog
white dOG
doggie
black CAT
thatdamcat
Then I want to apply a function to yield
new
dog
dog
dog
cat
cat
I have tried str_extract, grep, grepl etc.. Since I can pick a pattern based on one string at a time. I think what I want is to use dapply with one of these text cleansing functions. Unfortunately, I'm stuck. Below is my latest attempt. Thank you for your help!
new <- vector()
lapply(pattern, function(x){
where<- grep(x,a,value = FALSE, ignore.case = TRUE)
new[where]<-x
})

We paste the 'pattern' vector together to create a single string, use that to extract the words from 'vec1' after we change it to lower case (tolower(vec1)).
library(stringr)
str_extract(tolower(vec1), paste(pattern, collapse='|'))
#[1] "dog" "dog" "dog" "cat" "cat"
data
pattern <- c("dog","cat")
vec1 <- c('black Dog', 'white dOG', 'doggie','black CAT', 'thatdamcat')

Another way using base R is:
#data
vec <- c('black Dog', 'white dOG', 'doggie','black CAT','thatdamcat')
#regexpr finds the locations of cat and dog ignoring the cases
a <- regexpr( 'dog|cat', vec, ignore.case=TRUE )
#regmatches returns the above locations from vec (here we use tolower in order
#to convert to lowercase)
regmatches(tolower(vec), a)
[1] "dog" "dog" "dog" "cat" "cat"

Related

Replace strings that start with "X" ,with "Y", in R

I am trying to take all strings that start with a certain few letters and replace them with a different string.
If I have:
y <- "pet"
x <-c("Cat","Cats","Catss")
z=cbind(y,x)
Which gives:
y x
pet Cat
pet Cats
pet Catss
How can I get
y x
pet Cat
pet Cat
pet Cat
gsub(pattern = "^Cat.*", replacement = "Cat", x)
for more complex patterns you should take a look on Regular Expressions
sub("(Cat).*", "\\1", x)
[1] "Cat" "Cat" "Cat"
This solution uses backreference: sub's replacement argument (i.e. the second) 'recalls' only the capturing group in the pattern argument (i.e, the first), here (Cat) but not the endings thus effectively removing them.

R: using \\b and \\B in regex

I read about regex and came accross word boundaries. I found a question that is about the difference between \b and \B. Using the code from this question does not give the expected output. Here:
grep("\\bcat\\b", "The cat scattered his food all over the room.", value= TRUE)
# I expect "cat" but it returns the whole string.
grep("\\B-\\B", "Please enter the nine-digit id as it appears on your color - coded pass-key.", value= TRUE)
# I expect "-" but it returns the whole string.
I use the code as described in the question but with two backslashes as suggested here. Using one backslash does not work either. What am I doing wrong?
You can to use regexpr and regmatches to get the match. grep gives where it hits. You can also use sub.
x <- "The cat scattered his food all over the room."
regmatches(x, regexpr("\\bcat\\b", x))
#[1] "cat"
sub(".*(\\bcat\\b).*", "\\1", x)
#[1] "cat"
x <- "Please enter the nine-digit id as it appears on your color - coded pass-key."
regmatches(x, regexpr("\\B-\\B", x))
#[1] "-"
sub(".*(\\B-\\B).*", "\\1", x)
#[1] "-"
For more than 1 match use gregexpr:
x <- "1abc2"
regmatches(x, gregexpr("[0-9]", x))
#[[1]]
#[1] "1" "2"
grepreturns the whole string because it just looks to see if the match is present in the string. If you want to extract cat, you need to use other functions such as str_extractfrom package stringr:
str_extract("The cat scattered his food all over the room.", "\\bcat\\b")
[1] "cat"
The difference betweeen band Bis that bmarks word boundaries whereas Bis its negation. That is, \\bcat\\b matches only if cat is separated by white space whereas \\Bcat\\B matches only if cat is inside a word. For example:
str_extract_all("The forgot his education and scattered his food all over the room.", "\\Bcat\\B")
[[1]]
[1] "cat" "cat"
These two matches are from education and scattered.

Create a function that tidies up any text in r

i am new to r and trying to create a regular expression that abbreviates any word over 5 letters in the input vector (e.g hungry would become hungr.). At the moment i am using gsub, but it is removing the first 5 letters of words over five letters rather than following letters.
current code:
gsub(pattern = "[a-z]{5}", replacement = ".", x = text_converter)
So here is the edited version after you added your explanation in the comment section:
We need the following function:
Your string is "the very reliable dog jumped over the fences"
strsplit splits this string whenever there is a space in the string and gives you a list of those words
[[1]]
[1] "the" "very" "reliable" "dog" "jumped" "over" "the" "fences"
lapply applys the function, gsub("([a-zA-Z]{5}).*","\1", x), to each word in the list from step 2, which reduces the letters in each word to 5 or less.
[[1]]
[1] "the" "very" "relia" "dog" "jumpe" "over" "the" "fence"
finally paste with collapse, concatenates the vector of strings from step 3 to a string and the split_reduce function returns this string as the final output.
[1] "the very relia dog jumpe over the fence"
So here is the function:
split_reduce <- function(text){
text<-strsplit(text, " ")
text_reduced <- lapply(text,function(x)gsub("([a-zA-Z]{5}).*","\\1", x))
paste(unlist(text_reduced), collapse = ' ')
}
split_reduce("the very reliable dog jumped over the fences")
#[1] "the very relia dog jumpe over the fence"

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.
You can take the list of words and paste them with a look-behind ((?<=)). I added the (?=.{2,}) because this will also match the "AT" in "DATE" since "AT" is in the list of words, so whatever is in the list of words will need to be followed by 2 or more characters to be split with an underscore.
The second gsub just does the capitalization
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"

Using R to compare two words and find letters unique to second word (across c. 6000 cases)

I have a dataframe comprising two columns of words. For each row I'd like to identify any letters that occur in only the word in the second column e.g.
carpet carpelt #return 'l'
bag flag #return 'f' & 'l'
dog dig #return 'i'
I'd like to use R to do this automatically as I have 6126 rows.
As an R newbie, the best I've got so far is this, which gives me the unique letters across both words (and is obviously very clumsy):
x<-(strsplit("carpet", ""))
y<-(strsplit("carpelt", ""))
z<-list(l1=x, l2=y)
unique(unlist(z))
Any help would be much appreciated.
The function you’re searching for is setdiff:
chars_for = function (str)
strsplit(str, '')[[1]]
result = setdiff(chars_for(word2), chars_for(word1))
(Note the inverted order of the arguments in setdiff.)
To apply it to the whole data.frame, called x:
apply(x, 1, function (words) setdiff(chars_for(words[2]), chars_for(words[1])))
Use regex :) Paste your word with brackets [] and then use replace function for regex. This regex finds any letter from those in brackets and replaces it with empty string (you can say that it "removes" these letters).
require(stringi)
x <- c("carpet","bag","dog")
y <- c("carplet", "flag", "smog")
pattern <- stri_paste("[",x,"]")
pattern
## [1] "[carpet]" "[bag]" "[dog]"
stri_replace_all_regex(y, pattern, "")
## [1] "l" "fl" "sm"
x <- c("carpet","bag","dog")
y <- c("carpelt", "flag", "dig")
Following (somewhat) with what you were going for with strsplit, you could do
> sx <- strsplit(x, "")
> sy <- strsplit(y, "")
> lapply(seq_along(sx), function(i) sy[[i]][ !sy[[i]] %in% sx[[i]] ])
#[[1]]
#[1] "l"
#
#[[2]]
#[1] "f" "l"
#
#[[3]]
#[1] "i"
This uses %in% to logically match the characters in y with the characters in x. I negate the matching with ! to determine those those characters that are in y but not in x.

Resources