Removing a group of words from a character vector in R

Let's say that I have a character vector of random names. I also have another character vector with a number of car makes, and I want to remove any occurrence of a car make from the original vector.
So given the vectors:
dat = c("Tonyhonda","DaveFord","Alextoyota")
car = c("Honda","Ford","Toyota","honda","ford","toyota")
I want to end up with something like below:
dat = c("Tony","Dave","Alex")
How can I remove part of a string in R?

gsub(x = dat, pattern = paste(car, collapse = "|"), replacement = "")
[1] "Tony" "Dave" "Alex"

Just formalizing 42-'s comment above. Rather than using
car = c("Honda","Ford","Toyota","honda","ford","toyota")
You can just use:
carlist = c("Honda","Ford","Toyota")
gsub(x = dat, pattern = paste(carlist, collapse = "|"), replacement = "", ignore.case = TRUE)
[1] "Tony" "Dave" "Alex"
That allows you to only put each word you want to exclude in the list one time.
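As a minimal sketch of the same idea, the pattern building can be wrapped in a small function. The name strip_words and the extra test strings are hypothetical, and the sketch assumes the words contain no regex metacharacters (such as "." or "+"), since they are pasted into the pattern unescaped.

```r
# Hypothetical helper (not from the answers above): strip every listed word,
# in any letter case, from each element of a character vector.
# Assumption: the words contain no regex metacharacters.
strip_words <- function(x, words) {
  pattern <- paste(words, collapse = "|")   # e.g. "Honda|Ford|Toyota"
  gsub(pattern, "", x, ignore.case = TRUE)
}

dat <- c("TonyHONDA", "DaveFord", "Alextoyota", "Bob")
makes <- c("Honda", "Ford", "Toyota")
cleaned <- strip_words(dat, makes)
print(cleaned)
# [1] "Tony" "Dave" "Alex" "Bob"
```

Because ignore.case = TRUE is set, each make only needs to be listed once, and elements without a match ("Bob") pass through unchanged.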

Related

Keep strings with partial match

I would like to keep only the strings in my vector that partially match strings in another vector.
Take a look at this example:
> dput(cc)
c("BLANK_0", "Greg_10", "Luke_40", "Luke_10", "Mark_10", "NA_40", "BLANK_10", "Joe_15", "Jane_10", "BLANK_40", "Greg_40", "Hvserk_40", "NA_10")
and I would like to keep strings starting like elements from a vector below:
> dput(vec_all_compounds)
c("Greg", "Luke", "Mark", "Joe", "Jane", "Hvserk")
That means all of Greg_10, Luke_10, Hvserk_40, etc. should be kept and remain unchanged. Is this doable?
I would suggest the following approach, indexing your vector with grepl():
#Code
cc[grepl(pattern = paste0(vec_all_compounds,collapse = '|'),cc)]
Output:
[1] "Greg_10" "Luke_40" "Luke_10" "Mark_10" "Joe_15" "Jane_10" "Greg_40" "Hvserk_40"
You can also use grep with value = TRUE :
grep(paste0(vec_all_compounds, collapse = "|"), cc, value = TRUE)
#[1] "Greg_10" "Luke_40" "Luke_10" "Mark_10" "Joe_15" "Jane_10" "Greg_40" "Hvserk_40"
Same with stringr::str_subset :
stringr::str_subset(cc, paste0(vec_all_compounds, collapse = "|"))
You can use gsub + %in%
> cc[gsub("_.*","",cc) %in% vec_all_compounds]
[1] "Greg_10" "Luke_40" "Luke_10" "Mark_10" "Joe_15" "Jane_10"
[7] "Greg_40" "Hvserk_40"
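One caveat with the grepl/grep patterns above: they are not anchored, so a name is matched anywhere in the string, while the question asks for strings starting with a name. A minimal sketch of an anchored variant follows; the shortened cc vector and the entry "NotGreg_10" are made up for illustration.

```r
# Illustrative data: "NotGreg_10" is a hypothetical entry added to show
# the difference between anchored and unanchored patterns.
cc <- c("BLANK_0", "Greg_10", "Luke_40", "NotGreg_10", "NA_40", "Hvserk_40")
vec_all_compounds <- c("Greg", "Luke", "Mark", "Joe", "Jane", "Hvserk")

# Unanchored: "Greg" is found anywhere, so "NotGreg_10" slips through.
unanchored <- cc[grepl(paste(vec_all_compounds, collapse = "|"), cc)]
print(unanchored)
# [1] "Greg_10"    "Luke_40"    "NotGreg_10" "Hvserk_40"

# Anchored: a listed name must sit at the start and be followed by "_".
pattern <- paste0("^(", paste(vec_all_compounds, collapse = "|"), ")_")
anchored <- cc[grepl(pattern, cc)]
print(anchored)
# [1] "Greg_10"   "Luke_40"   "Hvserk_40"
```

The gsub + %in% answer already behaves like the anchored version, because it compares the whole prefix before the underscore.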

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name using elements from another vector in R, but I don't know how to get grep() to match the words of the name when they are not contiguous in the string.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or, if you receive the names as they are and can't change them, you can split on whitespace and insert .* between the words:
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed as follows:
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms and generate an individual regex pattern for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then we iterate over the terms and use grepl to check whether each term appears in each of the strings. Finally, we retain only those strings that contain all of the terms.
We can run grepl on each piece of the split string, Reduce the resulting list of logical vectors to a single logical vector, and use it to extract the matching elements of 'string':
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
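Note that sub only replaces the first space, which is enough for a two-word name. A small sketch with a hypothetical three-word name (name3 is made up for illustration) shows why gsub is the safer choice when the name can have more than two words:

```r
string <- c("Yesterday Once More", "Are You happy", "Cry Me A River")
name3 <- "Cry Me River"   # hypothetical three-word name

# sub() rewrites only the first space: the pattern becomes "Cry.*Me River",
# which requires the literal text "Me River" and therefore finds nothing.
first_only <- grep(sub(" ", ".*", name3), string, value = TRUE)
print(first_only)
# character(0)

# gsub() rewrites every run of whitespace: the pattern is "Cry.*Me.*River".
all_gaps <- grep(gsub("\\s+", ".*", name3), string, value = TRUE)
print(all_gaps)
# [1] "Cry Me A River"
```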
Here's an approach using stringr. Is order important? Is case important? Is it important to match whole words? If you would just like to match 'Cry' and 'River' as whole words, in any order, and don't care about case, you can use:
library(stringr)
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]
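For comparison, here is a base R sketch of the same whole-word, any-order, case-insensitive match that builds the terms from name instead of hard-coding them. The function name match_all_words is hypothetical, and it assumes the words in name contain no regex metacharacters.

```r
# Hypothetical helper: keep the strings that contain every word of `name`
# as a whole word, in any order, ignoring case.
# Assumption: the words contain no regex metacharacters.
match_all_words <- function(name, strings) {
  terms <- strsplit(name, "\\s+")[[1]]
  hits <- lapply(terms, function(term) {
    grepl(paste0("\\b", term, "\\b"), strings, ignore.case = TRUE, perl = TRUE)
  })
  # Combine the per-term logical vectors: TRUE only where all terms matched.
  strings[Reduce(`&`, hits)]
}

name <- "Cry River"
string <- c("Yesterday Once More",
            "Are You happy",
            "Cry Me A River",
            "Take me to the River or I'll Cry",
            "The Cryogenic River Rag",
            "Crying on the Riverside")
matched <- match_all_words(name, string)
print(matched)
# [1] "Cry Me A River"                   "Take me to the River or I'll Cry"
```

The word boundaries are what exclude "Cryogenic" and "Riverside"; dropping the \\b markers would let those strings through as well.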

Remove words from stopword list

I previously asked how to remove the words on a stop list from a character vector while keeping the original format. The task was to remove the words in "words_to_remove" from the vector "words".
I accepted this solution:
words_to_remove = c("the", "This")
pattern <- paste0("\\b", words_to_remove, "\\b", collapse="|")
words = c("the", "The", "Intelligent", "this", "This")
res <- grepl(pattern, words, ignore.case=TRUE)
words[!res]
Now my problem is that an entry of "words" may contain multiple words, and the whole entry is deleted if it contains a stop word.
words = c("the", "The Book", "Intelligent", "this", "This")
I receive the output
[1] "Intelligent"
but I want it to be
[1] "Book" "Intelligent"
Is this possible?
You can try using gsub, i.e.
v1 <- trimws(gsub(paste(words_to_remove, collapse = '|'), '', words, ignore.case = TRUE))
#Tidy up your output: drop the entries that are now empty
v1[v1 != '']
#[1] "Book" "Intelligent"
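Alternatively, change the pattern to
pattern <- paste0("^", words_to_remove, "$", collapse="|")
to use start- and end-of-string markers rather than just word boundaries. The rest of your code should work fine with this one change, but note that an entry is then dropped only when it consists entirely of a stop word, so "The Book" is kept as is rather than reduced to "Book".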
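One thing to watch with the gsub approach: without word boundaries the stop words are also removed from inside longer words. The sketch below combines gsub with the \\b markers from the original pattern; the entry "Theory of Other Things" is a made-up example added to show the difference.

```r
words_to_remove <- c("the", "This")
# "Theory of Other Things" is a hypothetical entry added for illustration.
words <- c("the", "The Book", "Intelligent", "this", "This",
           "Theory of Other Things")

# Without boundaries, "the" is also cut out of "Theory" and "Other".
mangled <- gsub(paste(words_to_remove, collapse = "|"), "",
                "Theory of Other Things", ignore.case = TRUE)
print(mangled)
# [1] "ory of Or Things"

# With boundaries, only whole words are removed.
pattern <- paste0("\\b(", paste(words_to_remove, collapse = "|"), ")\\b")
stripped <- gsub(pattern, "", words, ignore.case = TRUE, perl = TRUE)
stripped <- trimws(gsub("\\s+", " ", stripped))  # squeeze leftover spaces
result <- stripped[stripped != ""]               # drop emptied entries
print(result)
# [1] "Book"                   "Intelligent"            "Theory of Other Things"
```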

Insert vertical bar between each character of a string in R

How would I be able to insert a vertical bar in between every character of a string in R? For example, say I have a string "ABC123". How could I obtain the output to be "A|B|C|1|2|3"? If anyone could vectorize this idea for a vector of character strings, that would be great.
First, separate into individual characters and then collapse
paste(unlist(strsplit("ABC123", "")), collapse = "|")
#[1] "A|B|C|1|2|3"
For vector of strings, use sapply to loop through them
mystrings = c("ABC123", "PASDP")
sapply(strsplit(mystrings, ""), paste, collapse = "|")
#[1] "A|B|C|1|2|3" "P|A|S|D|P"
Here is an option using regex
gsub("(?<=.)(?=.)", "|", "ABC123", perl = TRUE)
#[1] "A|B|C|1|2|3"
Or with more than one string
mystrings <- c("ABC123", "PASDP")
gsub("(?<=.)(?=.)", "|", mystrings, perl = TRUE)
#[1] "A|B|C|1|2|3" "P|A|S|D|P"
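If this goes into a function, vapply may be a slightly safer choice than sapply, because it always returns a character vector, even for empty input (sapply returns an empty list in that case). A minimal sketch, where insert_bars is a hypothetical name:

```r
# Hypothetical helper: put "|" between the characters of each string.
insert_bars <- function(x) {
  vapply(strsplit(x, ""), paste, character(1), collapse = "|")
}

bars <- insert_bars(c("ABC123", "PASDP"))
print(bars)
# [1] "A|B|C|1|2|3" "P|A|S|D|P"

empty <- insert_bars(character(0))
print(empty)
# character(0)
```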

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.
You can take the list of words and paste them into a look-behind ((?<=)). I added the (?=.{2,}) because otherwise the pattern would also match after the "AT" in "DATE", since "at" is in the list of words; with the lookahead, a word from the list must be followed by two or more characters for an underscore to be inserted after it.
The second gsub just does the capitalization:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"
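To apply this to an actual data frame, the two steps can be wrapped in a function and assigned to names(). This is only a sketch: prettify_names and the small data frame df are hypothetical, and the capitalization step uses a capture group, '(^|_)([a-z])', instead of the look-behind, which should give the same result here.

```r
# Hypothetical helper wrapping the two gsub() steps shown above.
prettify_names <- function(x, words) {
  # Step 1: insert "_" after any known word that is followed by 2+ characters.
  split_pattern <- sprintf("(?<=%s)(?=.{2,})", paste(words, collapse = "|"))
  split_words <- gsub(split_pattern, "_", tolower(x), perl = TRUE)
  # Step 2: upper-case the first letter of the string and of each "_" part.
  gsub("(^|_)([a-z])", "\\1\\U\\2", split_words, perl = TRUE)
}

list_of_words <- c("production", "speed", "percent", "load", "at",
                   "current", "sensor")

# Made-up data frame, only to show the renaming.
df <- data.frame(PRODUCTIONDATE = 1, SPEEDRPM = 2,
                 PERCENTLOADATCURRENTSPEED = 3, SENSOR01 = 4)
names(df) <- prettify_names(names(df), list_of_words)
print(names(df))
# [1] "Production_Date"               "Speed_Rpm"
# [3] "Percent_Load_At_Current_Speed" "Sensor_01"
```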
