Extract expressions with square brackets - r

The example I have is as follows:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
I want to extract the elements of names that contain one of the patterns in toMatch.
This is what I tried:
grep(toMatch, names, value=T)
But it didn't work for me. Any suggestions?

The problem is that the [ character used in toMatch is a reserved character with special meaning in a regex pattern, so it must first be escaped by replacing [ with \\[. In addition, grep uses only the first element of a pattern vector.
So collapse the escaped toMatch with | into a single pattern, then pass it to grepl (or grep) to search for matches in names.
The results are:
#Logical vector of matches
grepl(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names)
#[1] TRUE FALSE TRUE
#For values
grep(paste0(gsub("(\\[)","\\\\[",toMatch), collapse = "|"), names, value = TRUE)
#[1] "apple[1]" "apple[3]"
Data:
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")
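If escaping feels fragile, one possible alternative is to skip regex interpretation entirely with fixed = TRUE and test each pattern separately. This is only a sketch on the same data; the variable hits is introduced here for illustration.

```r
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")

# One logical column per pattern; fixed = TRUE treats "[1]" as literal text
hits <- sapply(toMatch, grepl, x = names, fixed = TRUE)

# Keep every element that matched at least one pattern
names[rowSums(hits) > 0]
#[1] "apple[1]" "apple[3]"
```

This avoids having to know which characters are special, at the cost of one grepl call per pattern.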

We could also strip everything before the first [ and test the remainder against toMatch with %in%:
names[sub("^[^[]*", "", names) %in% toMatch]
#[1] "apple[1]" "apple[3]"
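To see why this works, it may help to look at the intermediate vector that sub produces. A small sketch on the same data (suffix is a name chosen here for illustration):

```r
toMatch <- c("[1]", "[2]", "[3]")
names <- c("apple[1]", "apple", "apple[3]")

# Drop the leading run of characters that are not "["
suffix <- sub("^[^[]*", "", names)
suffix
#[1] "[1]" ""    "[3]"

# Exact comparison against the patterns, no regex involved
suffix %in% toMatch
#[1]  TRUE FALSE  TRUE
```

Note that this compares the whole remainder, so it assumes the bracketed part is the last thing in each element.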

Related

Substitute strings by their first match in a dictionary

I have a vector long_strings defined as
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1",
"*/2/2/1", "*/3/1/1/1")
and I have a dictionary, short_strings, containing the initial patterns (with differing lengths) of those strings, for example
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")
How can I "simplify" the contents of long_strings to match their corresponding value on short_strings?
The results should look like
"*/1/1", "*/1/2", "*/2", "*/2", "*/3"
I can find the occurrences of a single element of short_strings using grep("\\*/2", long_strings), but I want to avoid looping over short_strings.
An option with sapply
as.character(with(stack(sapply(setNames(paste0("\\", short_strings), short_strings),
grep, x = long_strings)), ind[order(values)]))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
Or using str_extract
library(stringr)
str_extract(long_strings, str_c(str_c("\\", short_strings), collapse="|"))
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
We can programmatically create a capture group and use it in sub to extract it
sub(paste0(".*(",paste0("\\", short_strings, collapse = "|"), ").*"), "\\1",long_strings)
#[1] "*/1/1" "*/1/2" "*/2" "*/2" "*/3"
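Since the dictionary entries are meant to be literal prefixes, another possibility is to avoid regex escaping altogether with startsWith. This is a sketch rather than a drop-in replacement: first_prefix is a hypothetical helper, and it does loop over long_strings (via vapply), though not over short_strings.

```r
long_strings <- c("*/1/1/1/1", "*/1/2/1/1", "*/2/1",
                  "*/2/2/1", "*/3/1/1/1")
short_strings <- c("*/1/1", "*/3", "*/2", "*/1/2")

# Hypothetical helper: first dictionary entry that is a literal prefix of s
first_prefix <- function(s) {
  short_strings[which(startsWith(s, short_strings))[1]]
}

result <- vapply(long_strings, first_prefix, character(1), USE.NAMES = FALSE)
result
#[1] "*/1/1" "*/1/2" "*/2"   "*/2"   "*/3"
```

A string with no matching prefix comes back as NA, which makes unmatched entries easy to spot.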

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I would first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we define groups for collapsing, looking for character elements that come directly after other character elements so that they land in the same group. Finally, we collapse all the values in the same group.
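The grouping step is the least obvious part, so here is a sketch that spells out its intermediate values on the same vector. The names prev_is_char and starts_group are introduced only for illustration.

```r
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')

is_char <- grepl("[[:alpha:]]", myVec)

# Was the previous element a letter element? (FALSE for the first one)
prev_is_char <- c(FALSE, head(is_char, -1))

# A new group starts at every non-letter element, and at every letter
# element that does not directly follow another letter element
starts_group <- !is_char | (is_char & !prev_is_char)

grp <- cumsum(starts_group)
grp
#[1] 1 2 2 3 4 5 5 6 7 8
```

Elements 2 and 3 share group 2 and elements 6 and 7 share group 5, which is why tapply pastes them together.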
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elements containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))
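The last line packs three steps into one call. A sketch that separates them, starting from the vector as it looks after the non-letter characters have been removed (cleaned, joined, merged and result are illustrative names):

```r
# myVec after stripping non-letters from the letter-containing elements
cleaned <- c("1.2", "asd", "gkd", "232", "4343", "zyz", "fva", "3213", "1232", "dasd")

# 1. Join everything with a delimiter
joined <- paste(cleaned, collapse = "|")

# 2. Delete the delimiter wherever it sits between two letters
merged <- gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", joined, perl = TRUE)
merged
#[1] "1.2|asdgkd|232|4343|zyzfva|3213|1232|dasd"

# 3. Split back into a vector
result <- strsplit(merged, split = "\\|")[[1]]
result
#[1] "1.2"    "asdgkd" "232"    "4343"   "zyzfva" "3213"   "1232"   "dasd"
```

This relies on "|" never appearing in the data; a different delimiter would be needed otherwise.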

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name against the elements of another vector in R, but I don't know how to make grep() match when other words come between the words of the name.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or, if you receive name as is and can't change it, you can split it on whitespace and insert .* between the words:
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed as follows:
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
#[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms and generate an individual regex pattern for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then, we iterate over each term, and using grepl we check that the term appears in each of the strings. Finally, we retain only those matches which contained all terms.
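One consequence of checking each term separately is that word order no longer matters, while the \b anchors still reject partial words. A sketch with a slightly different, made-up set of strings to show both effects:

```r
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")

# Illustrative inputs: reversed word order, and partial-word matches
string <- c("Cry Me A River", "The River Made Me Cry", "Crying on the Riverside")

# One logical column per search term
result <- sapply(parts, function(p) grepl(p, string))

keep <- rowSums(result) == length(parts)
string[keep]
#[1] "Cry Me A River"        "The River Made Me Cry"
```

With a single input string sapply returns a plain vector instead of a matrix, so rowSums would need adjusting in that case.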
We can run grepl on each piece of the split string, Reduce the resulting list of logical vectors to a single logical vector, and use it to extract the matching element of string:
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
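Also, instead of strsplit, we can insert the .* with sub (sub replaces only the first space, which is enough for a two-word name; use gsub for longer names):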
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
Here's an approach using stringr. Is order important? Is case important? Is it important to match whole words? If you would just like to match the whole words 'Cry' and 'River' in any order and don't care about case, you can do the following:
library(stringr)
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]
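For comparison, roughly the same filter can be written in base R with grepl and ignore.case = TRUE. This is a sketch of an equivalent, not the stringr answer itself; has_cry and has_river are illustrative names.

```r
string <- c("Yesterday Once More",
            "Are You happy",
            "Cry Me A River",
            "Take me to the River or I'll Cry",
            "The Cryogenic River Rag",
            "Crying on the Riverside")

# Whole-word, case-insensitive tests for each term
has_cry   <- grepl("\\bcry\\b",   string, ignore.case = TRUE, perl = TRUE)
has_river <- grepl("\\briver\\b", string, ignore.case = TRUE, perl = TRUE)

string[has_cry & has_river]
#[1] "Cry Me A River"                   "Take me to the River or I'll Cry"
```

"Cryogenic" and "Crying" are rejected by the word boundaries, and the fourth string is kept even though the words appear in reverse order.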

Sapply grepl data frames exact/complete matches

I have the same problem as in:
How to apply grepl for data frame
But I'm getting undesired matches, as in:
Complete word matching using grepl in R
How do I apply the \< or \b solution inside sapply, when grepl is looping through vectors?
You can use an anonymous function that is applied to each column of the data frame.
vec1 <- c("I don't want to match this", "This is what I want to match")
vec2 <- c('Why would I match this?', "What is a good match for this?")
df <- data.frame(vec1,vec2)
sapply(df, function(x) grepl("\\<is\\>", x))
#       vec1  vec2
# [1,] FALSE FALSE
# [2,]  TRUE  TRUE
I found a solution myself.
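It's sufficient to paste a single blank space before and after each element of the vector to be matched against the sentences. Use paste0 for this, because paste would insert its own separator and pad each word with two spaces on either side.
vector <- paste0(" ", vector, " ")
matches <- sapply(vector, grepl, sentences, ignore.case = TRUE)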
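Padding with spaces will miss a word at the very start or end of a sentence, or next to punctuation. A sketch of the word-boundary variant applied over a vector of words; the vectors words and sentences below are made-up stand-ins, since the original data is not shown.

```r
# Hypothetical inputs for illustration
words <- c("is", "match")
sentences <- c("I don't want to match this", "This is what I want to match")

# Wrap each word in word boundaries instead of spaces
patterns <- paste0("\\b", words, "\\b")

# One logical column per word, one row per sentence
hits <- sapply(patterns, grepl, x = sentences, ignore.case = TRUE)
colnames(hits) <- words
hits
#         is match
# [1,] FALSE  TRUE
# [2,]  TRUE  TRUE
```

Here "match" is found at the end of the second sentence, and "is" is not matched inside "this".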

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words and have them all lower case except for the first letter of each word (following the R style for data frames from Advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub to put _ wherever it finds a word from the above list and then capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more elegantly using gsub. Thanks.
You can take the list of words and paste them into a look-behind ((?<=)), so that an underscore is inserted right after each word. I added the (?=.{2,}) look-ahead because otherwise the pattern would also match the "AT" in "DATE", since "AT" is in the list of words. With it, a word from the list is only split off when it is followed by 2 or more characters.
The second gsub just does the capitalization.
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"
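If this needs to be reused, the two steps can be wrapped in a small function. A sketch, assuming the words contain no regex metacharacters; the name title_snake is hypothetical.

```r
# Hypothetical helper wrapping the two gsub steps shown above
title_snake <- function(x, words) {
  # Insert "_" after any listed word that is followed by 2+ characters
  split_pat <- sprintf("(?i)(?<=%s)(?=.{2,})", paste(words, collapse = "|"))
  split_words <- gsub(split_pat, "_", tolower(x), perl = TRUE)
  # Upper-case the first letter of the string and of every "_"-separated part
  gsub("(?<=^|_)([a-z])", "\\U\\1", split_words, perl = TRUE)
}

list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
new_names <- title_snake(c("PRODUCTIONDATE", "SPEEDRPM", "SENSOR01"), list_of_words)
new_names
#[1] "Production_Date" "Speed_Rpm"       "Sensor_01"
```

For a data frame df (also hypothetical here), this could then be applied as names(df) <- title_snake(names(df), list_of_words).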

Resources