Extract words the include numbers in R - r

I have words that include numbers within, or begin with or end with numbers. How do i extract those only.
s <- c("An ex4mple". "anothe 3xample" "A thir7", "And sentences w1th w0rds as w3ll")
Expected output:
c("ex4mple", "3xample", "thir7", "w1th w0rds w3ll")
Words could include more than one number.

We can split the strings by space into a list, loop through the elements with sapply, then match all words that have only letters from start (^) to end ($), specify invert=TRUE with value=TRUE to get those elements that don't fit the criteria, paste them together
sapply(strsplit(s, "\\s+"), function(x)
paste(grep("^[A-Za-z]+$", x, invert = TRUE, value = TRUE), collapse=' '))
#[1] "ex4mple" "3xample" "thir7" "w1th w0rds w3ll"
Or we can use str_extract
library(stringr)
sapply(str_extract_all(s, '[A-Za-z]*[0-9]+[A-Za-z]*'), paste, collapse=' ')
#[1] "ex4mple" "3xample" "thir7" "w1th w0rds w3ll"
data
s <- c("An ex4mple", "anothe 3xample", "A thir7", "And sentences w1th w0rds as w3ll")

Related

find alphanumeric elements in vector

I have a vector
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
In this vector, I want to do two things:
Remove any numbers from an element that contains both numbers and letters and then
If a group of letters is followed by another group of letters, merge them into one.
So the above vector will look like this:
'1.2','asdgkd','232','4343','zyzfva','3213','1232','dasd'
I thought I will first find the alphanumeric elements and remove the numbers from them using gsub.
I tried this
gsub('[0-9]+', '', myVec[grepl("[A-Za-z]+$", myVec, perl = T)])
"asd" "gkd" ".zyz" "fva" "dasd"
i.e. it retains the . which I don't want.
This seems to return what you are after
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
clean <- function (x) {
is_char <- grepl("[[:alpha:]]", x)
has_number <- grepl("\\d", x)
mixed <- is_char & has_number
x[mixed] <- gsub("[\\d\\.]+","", x[mixed], perl=T)
grp <- cumsum(!is_char | (is_char & !c(FALSE, head(is_char, -1))))
unname(tapply(x, grp, paste, collapse=""))
}
clean(myVec)
# [1] "1.2" "asdgkd" "232" "4343" "zyzfva" "3213" "1232" "dasd"
Here we look for numbers and letters mixed together and remove the numbers. Then we defined groups for collapsing, looking for characters that come after other characters to put them in the same group. Then we finally collapse all the values in the same group.
Here's my regex-only solution:
myVec <- c('1.2','asd','gkd','232','4343','1.3zyz','fva','3213','1232','dasd')
# find all elemnts containing letters
lettrs = grepl("[A-Za-z]", myVec)
# remove all non-letter characters
myVec[lettrs] = gsub("[^A-Za-z]" ,"", myVec[lettrs])
# paste all elements together, remove delimiter where delimiter is surrounded by letters and split string to new vector
unlist(strsplit(gsub("(?<=[A-Za-z])\\|(?=[A-Za-z])", "", paste(myVec, collapse="|"), perl=TRUE), split="\\|"))

How to find if a string contain certain characters without considering sequence?

I'm trying to match a name using elements from another vector with R. But I don't know how to escape sequence when using grep() in R.
name <- "Cry River"
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
grep(name, string, value = TRUE)
I expect the output to be "Cry Me A River", but I don't know how to do it.
Use .* in the pattern
grep("Cry.*River", string, value = TRUE)
#[1] "Cry Me A River"
Or if you are getting names as it is and can't change it, you can split on whitespace and insert the .* between the words like
grep(paste(strsplit(name, "\\s+")[[1]], collapse = ".*"), string, value = TRUE)
where the regex is constructed in the below fashion
strsplit(name, "\\s+")[[1]]
#[1] "Cry" "River"
paste(strsplit(name, "\\s+")[[1]], collapse = ".*")
#[1] "Cry.*River"
Here is a base R option, using grepl:
name <- "Cry River"
parts <- paste0("\\b", strsplit(name, "\\s+")[[1]], "\\b")
string <- c("Yesterday Once More","Are You happy","Cry Me A River")
result <- sapply(parts, function(x) { grepl(x, string) })
string[rowSums(result) == length(parts)]
[1] "Cry Me A River"
The strategy here is to first split the string containing the various search terms, and generating individual regex patterns for each term. In this case, we generate:
\bCry\b and \bRiver\b
Then, we iterate over each term, and using grepl we check that the term appears in each of the strings. Finally, we retain only those matches which contained all terms.
We can do the grepl on splitted string and Reduce the list of logical vectors to a single logicalvector` and extract the matching element in 'string'
string[Reduce(`&`, lapply(strsplit(name, " ")[[1]], grepl, string))]
#[1] "Cry Me A River"
Also, instead of strsplit, we can insert the .* with sub
grep(sub(" ", ".*", name), string, value = TRUE)
#[1] "Cry Me A River"
Here's an approach using stringr. Is order important? Is case important? Is it important to match whole words. If you would just like to match 'Cry' and 'River' in any order and don't care about case.
name <- "Cry River"
string <- c("Yesterday Once More",
"Are You happy",
"Cry Me A River",
"Take me to the River or I'll Cry",
"The Cryogenic River Rag",
"Crying on the Riverside")
string[str_detect(string, pattern = regex('\\bcry\\b', ignore_case = TRUE)) &
str_detect(string, regex('\\bRiver\\b', ignore_case = TRUE))]

R count the number of words starts with given letter in a phrase

i would like to get the count times that in a given string a word start with the letter given.
For example, in that phrase: "that pattern is great but pigs likes milk"
if i want to find the number of words starting with "g" there is only 1 "great", but right now i get 2 "great" and "pigs".
this is the code i use:
x <- "that pattern is great but pogintless"
sapply(regmatches(x, gregexpr("g", x)), length)
We need either a space or word boundary to avoid th letter from matching to characters other than the start of the word. In addition, it may be better to use ignore.case = TRUE as some words may begin with uppercase
lengths(regmatches(x, gregexpr("\\bg", x, ignore.case = TRUE)))
The above can be wrapped as a function
fLength <- function(str1, pat){
lengths(regmatches(str1, gregexpr(paste0("\\b", pat), str1, ignore.case = TRUE)))
}
fLength(x, "g")
#[1] 1
You can also do it with stringr library
library(stringr)
str_count(str_split(x," "),"\\bg")

Remove others in a string except a needed word including certain patterns in R

I have a vector including certain strings, and I would like remove other parts in each string except the word including certain patter (here is mir).
s <- c("a mir-96 line (kk27)", "mir-133a cell",
"d mir-14-3p in", "m mir133 (sas)", "mir_23_5p r 27")
I want to obtain:
mir-96, mir-133a, mir-14-3p, mir133, mir_23_5p
I know the idea: use the gsub() and pattern is: a word beginning with (or including) mir.
But I have no idea how to construct such patter.
Or other idea?
Any help will be appreciated!
One way in base R would be splitting every string into words and then extracting only those with mir in it
unlist(lapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE)))
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
We can save the unlist step in lapply by using sapply as suggested by #Rich Scriven in comments
sapply(strsplit(s, " "), function(x) grep("mir", x, value = TRUE))
We can use sub to match zero or more characters (.*) followed by a word boundary (\\b) followed by the string (mir and one or more characters that are not a white space (\\S+), capture it as a group by placing inside the (...) followed by other characters, and in the replacement use the backreference of the captured group (\\1)
sub(".*\\b(mir\\S+).*", "\\1", s)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p"
Update
If there are multiple 'mir.*' substring, then we want to extract strings having some numeric part
sub(".*\\b(mir[^0-9]*[0-9]+\\S*).*", "\\1", s1)
#[1] "mir-96" "mir-133a" "mir-14-3p" "mir133" "mir_23_5p" "mir_23-5p"
data
s1 <- c("a mir-96 line (kk27)", "mir-133a cell", "d mir-14-3p in", "m mir133 (sas)",
"mir_23_5p r 27", "a mir_23-5p 1 mir-net")

Changing column names in dataframe using gsub

I have an atomic vector like:
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
I'd like to have _ between words, have them all lower case, except first letters of words (following R Style for dataframes from advanced R). I'd like to have something like this:
new_col_names <- c("Production_Date", "Percent_Load_At_Current_Speed", sprintf("Sensor_%02d", 1:18))
Assume that my words are limited to this list:
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
I am thinking of an algorithm that uses gsub, puts _ wherever it finds a word from the above list and then Capitalizes the first letter of each word. Although I can do this manually, I'd like to learn how this can be done more beautifully using gsub. Thanks.
You can take the list of words and paste them with a look-behind ((?<=)). I added the (?=.{2,}) because this will also match the "AT" in "DATE" since "AT" is in the list of words, so whatever is in the list of words will need to be followed by 2 or more characters to be split with an underscore.
The second gsub just does the capitalization
list_of_words <- c('production', 'speed', 'percent', 'load', 'at', 'current', 'sensor')
col_names_to_be_changed <- c("PRODUCTIONDATE", "SPEEDRPM", "PERCENTLOADATCURRENTSPEED", sprintf("SENSOR%02d", 1:18))
(pattern <- sprintf('(?i)(?<=%s)(?=.{2,})', paste(list_of_words, collapse = '|')))
# [1] "(?i)(?<=production|speed|percent|load|at|current|sensor)(?=.{2,})"
(split_words <- gsub(pattern, '_', tolower(col_names_to_be_changed), perl = TRUE))
# [1] "production_date" "speed_rpm" "percent_load_at_current_speed"
# [4] "sensor_01" "sensor_02" "sensor_03"
gsub('(?<=^|_)([a-z])', '\\U\\1', split_words, perl = TRUE)
# [1] "Production_Date" "Speed_Rpm" "Percent_Load_At_Current_Speed"
# [4] "Sensor_01" "Sensor_02" "Sensor_03"

Resources