How to find if a string contains certain characters without considering sequence?

I'm trying to match a name against the elements of another vector in R, but I don't know how to make grep() disregard the sequence, so that the name still matches when other words sit between its parts.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.

Use .* in the pattern to allow any characters between the two words:
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or, if the names arrive as they are and you can't change them, you can split on whitespace and join the words with .*:
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed as follows:
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
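One caveat worth checking: a pattern built this way keeps the words in the order given, so it will not match a string in which they appear reversed. A small sketch of that behaviour (the second test string is made up for illustration and is not from the question):

```r
name <- "Cry River"
# "River of Cry" is a hypothetical string added to show the order dependence
string <- c("Cry Me A River", "River of Cry")

# Same construction as above: join the words of `name` with ".*"
pattern <- paste(strsplit(name, "\\s+")[[1]], collapse = ".*")

# Only the string with "Cry" before "River" matches
matched <- grepl(pattern, string)
print(matched)
```

If the order of the words should not matter, each word has to be tested separately and the results combined.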

Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
#[1] "Cry Me A River"
The strategy here is to first split the string containing the search terms and generate an individual regex pattern for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then we iterate over the terms and use grepl to check whether each term appears in each of the strings. Finally, we retain only those strings that contain all terms.

We can apply grepl to each word of the split string, Reduce the resulting list of logical vectors to a single logical vector with &, and use it to extract the matching elements of 'string':
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub (sub replaces only the first space, so this assumes a two-word name):
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
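If name can hold more than two words, sub is not enough because it stops after the first space, whereas gsub replaces every space. A quick sketch, assuming a hypothetical three-word name:

```r
name3 <- "Cry Me River"   # hypothetical three-word name, not from the question
string <- c("Yesterday Once More", "Are You happy", "Cry Me A River")

sub_pattern  <- sub(" ", ".*", name3)   # only the first space is replaced
gsub_pattern <- gsub(" ", ".*", name3)  # every space is replaced

# The sub() pattern still requires the literal text "Me River"
sub_hits  <- grep(sub_pattern, string, value = TRUE)
gsub_hits <- grep(gsub_pattern, string, value = TRUE)

print(sub_pattern)
print(gsub_pattern)
print(gsub_hits)
```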

Here's an approach using stringr. The right pattern depends on your requirements: Is order important? Is case important? Is it important to match whole words? If you would just like to match 'Cry' and 'River' as whole words, in any order and regardless of case:
library(stringr)
name <- "Cry River"
string <- c("Yesterday Once More",
            "Are You happy",
            "Cry Me A River",
            "Take me to the River or I'll Cry",
            "The Cryogenic River Rag",
            "Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
       str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]
#[1] "Cry Me A River"                   "Take me to the River or I'll Cry"
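The three questions above (order, case, whole words) can be turned into switches. Here is a rough base R sketch, not a definitive implementation: match_terms and its arguments ignore_case and whole_word are names invented for this example, and the terms are always matched in any order.

```r
# Hypothetical helper: keep the candidates that contain every word of `name`
match_terms <- function(name, candidates, ignore_case = TRUE, whole_word = TRUE) {
  terms <- strsplit(name, "\\s+")[[1]]
  # Optionally wrap each term in word boundaries
  patterns <- if (whole_word) paste0("\\b", terms, "\\b") else terms
  # One logical vector per term, then AND them together
  hits <- lapply(patterns, grepl, x = candidates, ignore.case = ignore_case)
  candidates[Reduce(`&`, hits)]
}

string <- c("Yesterday Once More",
            "Are You happy",
            "Cry Me A River",
            "Take me to the River or I'll Cry",
            "The Cryogenic River Rag",
            "Crying on the Riverside")

whole   <- match_terms("Cry River", string)                      # whole words only
partial <- match_terms("Cry River", string, whole_word = FALSE)  # substrings allowed
strict  <- match_terms("cry river", string, ignore_case = FALSE) # case-sensitive
print(whole)
```

With whole_word = FALSE the "Cryogenic" and "Crying" titles are kept as well, which is the difference the word boundaries make.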

Related

Extract expressions with square brackets

The example I have is as follow:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
I want to extract the elements of names that contain one of the patterns in toMatch.
This is what I tried:
grep(toMatch, names, value=T)
But it didn't work for me. Any suggestions?

The problem is that the [ character used in toMatch is a reserved character with a special meaning in regex patterns. Hence, we first need to replace each [ with \\[. Note also that grep uses only the first element when it is given a pattern vector of length greater than one.
So we collapse toMatch with | and then use the result as the pattern in grepl (or grep) to search for matches in names.
The results are:
#Just for indexes
grepl(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names)
#[1] TRUE FALSE TRUE
#For values
grep(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names, value = TRUE)
#[1] "apple[1]" "apple[3]"
Data:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
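As an alternative to escaping, grepl accepts fixed = TRUE, which treats the pattern as a literal string. A fixed pattern cannot use | for alternation, so this sketch checks each element of toMatch separately and combines the results with Reduce (the intermediate names hits, keep and result are just for illustration):

```r
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")

# One logical vector per literal pattern; no escaping needed with fixed = TRUE
hits <- lapply(toMatch, grepl, x = names, fixed = TRUE)

# An element is kept if any of the patterns was found in it
keep <- Reduce(`|`, hits)
result <- names[keep]
print(result)
```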

We could also remove the letter part and create a logical vector with %in%
names[sub("^[^[]*", "", names) %in% toMatch]
#[1] "apple[1]" "apple[3]"

Remove words from stopword list

I previously asked how to remove the words in a stop list from a character vector while keeping the original format. The task was to remove the words in "words_to_remove" from the vector "words".
I accepted this solution:
words_to_remove = c("the", "This")
pattern <- paste0("\\b", words_to_remove, "\\b", collapse="|")
words = c("the", "The", "Intelligent", "this", "This")
res <- grepl(pattern, words, ignore.case=TRUE)
words[!res]
Now my problem is that an entry of "words" can contain multiple words, and the whole entry is deleted if it contains a stop word.
words = c("the", "The Book", "Intelligent", "this", "This")
I receive the output
[1] "Intelligent"
but I want it to be
[1] "Book" "Intelligent"
Is this possible?

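You can try gsub, which deletes the stop words from within each entry instead of dropping the whole entry. Note that, without word boundaries, this pattern also removes matches inside longer words: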
v1 <- gsub(paste(words_to_remove, collapse = '|'), '', words, ignore.case = TRUE)
#Tidy up your output
trimws(v1)[v1 != '']
#[1] "Book" "Intelligent"
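If matches inside longer words are a concern, the same gsub idea can be combined with the \\b word boundaries from the question. A minimal sketch, assuming one extra made-up entry ("Other Theme") to show the difference:

```r
words_to_remove <- c("the", "This")
# "Other Theme" is a hypothetical entry added to show the effect of \\b
words <- c("the", "The Book", "Intelligent", "this", "This", "Other Theme")

# Group the alternatives so that both boundaries apply to every stop word
pattern <- paste0("\\b(", paste(words_to_remove, collapse = "|"), ")\\b")

stripped <- gsub(pattern, "", words, ignore.case = TRUE, perl = TRUE)
stripped <- trimws(gsub("\\s+", " ", stripped))  # tidy leftover whitespace
result <- stripped[nzchar(stripped)]             # drop entries that became empty
print(result)
```

Here "Other Theme" survives untouched because neither "the" inside "Other" nor "The" at the start of "Theme" is a whole word.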

Change the pattern to
pattern <- paste0("^", words_to_remove, "$", collapse="|")
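to include start and end of string markers, rather than just word boundaries. The rest of your code should work fine with this one change. Note that this keeps multi-word entries such as "The Book" whole; it does not strip the stop word from inside them.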

R count the number of words starts with given letter in a phrase

I would like to count how many words in a given string start with a given letter.
For example, in the phrase: "that pattern is great but pigs likes milk"
if I want to find the number of words starting with "g", there is only 1 ("great"), but right now I get 2 ("great" and "pigs").
This is the code I use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)

We need either a space or a word boundary to prevent the letter from matching characters other than the start of a word. In addition, it may be better to use ignore.case = TRUE, as some words may begin with an uppercase letter:
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat) {
  lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
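For comparison, here is a regex-free sketch of the same count: split the phrase into words and test each one with startsWith. The function name count_starting is made up for this example, and it assumes that words are separated by whitespace only.

```r
# Hypothetical helper: count the words of `phrase` that begin with `letter`
count_starting <- function(phrase, letter) {
  # Lower-case both sides so the comparison ignores case
  words <- strsplit(tolower(phrase), "\\s+")[[1]]
  sum(startsWith(words, tolower(letter)))
}

phrase <- "that pattern is great but pigs likes milk"
g_count <- count_starting(phrase, "g")  # counts "great" only
p_count <- count_starting(phrase, "p")  # counts "pattern" and "pigs"
print(c(g = g_count, p = p_count))
```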

You can also do it with the stringr library. str_count already counts every match within the string, so there is no need to split it first:
library(stringr)
str_count(x, "\\bg")
#[1] 1

Extract words that include numbers in R

I have words that include numbers within them, or that begin or end with numbers. How do I extract only those?
s <- c("An ex4mple", "anothe 3xample", "A thir7", "And sentences w1th w0rds as w3ll")
Expected output:
c("ex4mple", "3xample", "thir7", "w1th w0rds w3ll")
Words could include more than one number.

We can split the strings on whitespace into a list and loop through its elements with sapply. For each element, grep matches the words that consist only of letters from start (^) to end ($); with invert = TRUE and value = TRUE it returns the words that don't fit that pattern, which we then paste together:
sapply(strsplit(s, "\\s+"), function(x)
  paste(grep("^[A-Za-z]+$", x, invert = TRUE, value = TRUE), collapse = ' '))
#[1] "ex4mple" "3xample" "thir7" "w1th w0rds w3ll"
Or we can use str_extract_all:
library(stringr)
sapply(str_extract_all(s, '[A-Za-z]*[0-9]+[A-Za-z]*'), paste, collapse=' ')
#[1] "ex4mple" "3xample" "thir7" "w1th w0rds w3ll"
Data:
s <- c("An ex4mple", "anothe 3xample", "A thir7", "And sentences w1th w0rds as w3ll")
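Since the question says a word may contain more than one number, note that a pattern of the form letters-digits-letters breaks a word like "d1g1ts" into two pieces. A base R sketch that keeps such words whole, assuming words consist of letters and digits only (extract_numbered is a made-up name, and the last two test strings are hypothetical):

```r
# Hypothetical helper: keep every alphanumeric word that contains a digit
extract_numbered <- function(x) {
  # Any run of letters/digits that includes at least one digit
  hits <- regmatches(x, gregexpr("[[:alnum:]]*[0-9][[:alnum:]]*", x))
  # Paste the hits of each input string back together ("" if none)
  vapply(hits, paste, character(1), collapse = " ")
}

s <- c("An ex4mple", "anothe 3xample", "A thir7", "And sentences w1th w0rds as w3ll")
res <- extract_numbered(s)
res_multi <- extract_numbered(c("many d1g1ts here", "none here"))
print(res)
```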

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.

You can take the list of words and paste them into a look-behind ((?<=...)). I added the look-ahead (?=.{2,}) because the pattern would otherwise also match the "AT" in "DATE", since "AT" is in the list of words; with it, a word from the list must be followed by 2 or more characters before an underscore is inserted after it.
The second gsub just does the capitalization:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"
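To apply this to an actual data frame, the two gsub calls can be wrapped in a function and assigned back through names(). A sketch under the same assumptions as above; prettify_names and the three-column data frame are invented for illustration:

```r
# Hypothetical wrapper around the two gsub() steps shown above
prettify_names <- function(x, words) {
  # Zero-width position after a known word that is followed by 2+ characters
  pattern <- sprintf("(?i)(?<=%s)(?=.{2,})", paste(words, collapse = "|"))
  split_words <- gsub(pattern, "_", tolower(x), perl = TRUE)
  # Upper-case the first letter at the start and after each underscore
  gsub("(?<=^|_)([a-z])", "\\U\\1", split_words, perl = TRUE)
}

list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')

# Made-up data frame using three of the column names from the question
df <- data.frame(PRODUCTIONDATE = 1, SPEEDRPM = 2, SENSOR01 = 3)
names(df) <- prettify_names(names(df), list_of_words)
print(names(df))
```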
