R: extract substring with capital letters from string

I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?
This is an example:
fecha incident
1 2020-12-01 Check GENERATOR
2 2020-12-01 Check BLADE
3 2020-12-02 Problem in GENERATOR
4 2020-12-01 Check YAW
5 2020-12-02 Alarm in SAFETY SYSTEM
And I would like to create another column as follows:
fecha incident system
1 2020-12-01 Check GENERATOR GENERATOR
2 2020-12-01 Check BLADE BLADE
3 2020-12-02 Problem in GENERATOR GENERATOR
4 2020-12-01 Check YAW YAW
5 2020-12-02 Alarm in SAFETY SYSTEM SAFETY SYSTEM
I have tried with str_sub or str_extract_all using a regex, but I believe I'm doing things wrong.

You can use str_extract if you want to work in a data frame and tie it into a tidyverse workflow.
The regex matches either capital letters or spaces and requires two or more consecutive ones, so it does not match ordinary capitalized words, which start with only a single capital letter. str_trim removes the whitespace that can get picked up when the capitalized word is not at the end of the string.
Note that this code snippet will only extract the first run of capitalized words connected by spaces. If capitalized words appear in different parts of the string, only the first run is returned.
library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL" "CAP" "MULTIPLE CAPITAL" "CAP"
Created on 2021-01-08 by the reprex package (v0.3.0)
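Applied to the data frame from the question, the same pattern can go inside mutate(). A minimal sketch, assuming the data frame is called df with fecha and incident columns (the values below are re-typed from the example, so treat this as a sketch rather than a tested reprex):
library(tidyverse)
df <- tibble(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)
# extract the first run of 2+ capitals/spaces, then trim the picked-up space
df <- df %>%
  mutate(system = str_trim(str_extract(incident, "([:upper:]|[:space:]){2,}")))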

library(tidyverse)
string <- data.frame(test="does this WORK")
string$new <- str_extract_all(string$test, "[A-Z]+")
string
test new
1 does this WORK WORK

If there are cases where the runs of upper-case letters are not next to each other, you can use str_extract_all to extract all of them from a sentence and then paste them together.
sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'), paste0, collapse = ' ')
#[1] "GENERATOR" "BLADE" "GENERATOR" "YAW" "SAFETY SYSTEM"

Related

regex: extract segments of a string containing a word, between symbols

Hello, I have a data frame that looks something like this:
dataframe <- data_frame(text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12',
'WUFF;other stuff to keep;WIFF2;yes yes IGWIFF'))
print(dataframe)
# A tibble: 2 × 1
text
<chr>
1 WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12
2 WUFF;other stuff to keep;WIFF2;yes yes IGWIFF
I want to extract the segment of each string containing the word "keep". Note that these segments can be separated from other parts by different symbols, for example , and ;.
The final dataset should look something like this:
final_dataframe <- data_frame(text = c('some words to keep',
'other stuff to keep'))
print(final_dataframe)
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Does anyone know how I could do this?
With stringr ...
library(stringr)
library(dplyr)
dataframe %>%
  mutate(text = trimws(str_extract(text, "(?<=[,;]).*keep")))
# A tibble: 2 × 1
text
<chr>
1 some words to keep
2 other stuff to keep
Created on 2022-02-01 by the reprex package (v2.0.1)
I've made great use of the positive lookbehind and positive lookahead group constructs -- check this out: https://regex101.com/r/Sc7h8O/1
If you want to assert that the text you're looking for comes after a character/group -- in your first case the apostrophe, use (?<=').
If you want to do the same but match something before ' then use (?=')
And you want to match between 0 and unlimited characters surrounding "keep" so use .* on either side, and you wind up with (?<=').*keep.*(?=')
I did find in my test that a string like text = c('WAFF, some words to keep, ciao, WOFF hey ;other ;WIFF12', will also match the c(, which I didn't intend. But I assume your strings are all enclosed in pairs of apostrophes.
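A quick way to see those lookarounds at work in R (a sketch on a made-up one-element string, not the asker's data):
library(stringr)
x <- "'some words to keep'"
# lookbehind for an opening apostrophe, lookahead for a closing one
str_extract(x, "(?<=').*keep.*(?=')")
#> [1] "some words to keep"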

Matching Wildcard Pattern and Character String in R

I am trying to count how many times a keyword appears in a string. In the following variable text, I would like to count how many times keyword appears in text. The result should display 3 because AWAY appears twice and WIN appears once in the string.
text<- "AWAYTEAM IS XXX. I THINK THEAWAYTEAM WILL WIN"
keyword<- c("AWAY","WIN")
Any ideas?
We may use str_count with sum
library(stringr)
sum(str_count(text, keyword))
[1] 3
One possibility using stringr
library(stringr)
text<- "AWAYTEAM IS XXX. I THINK THEAWAYTEAM WILL WIN"
keyword<- c("AWAY","WIN")
length(unlist(str_extract_all(text, keyword)))
#> [1] 3
Created on 2021-08-22 by the reprex package (v2.0.0)

Extract specified number of words after a string in R

I am trying to extract the 4 words after the string "source:" in this example below.
library(stringr)
x <- data.frame(end = c("source: from animal origin as Vitamin A / all-trans-Retinol: Fish in general, liver and dairy products;", "source: Eggs, liver, certain fish species such as sardines, certain mushroom species such as shiitake", "source: Leafy green vegetables such as spinach; egg yolks; liver"))
x$source = str_extract(x$end, '[^source: ](.)*')
When I try the code above, I can extract all the text after "source:" into a new column. I was wondering if there is a way to extract only the first 4 words following "source:", either using stringr or any other package.
You can use:
trimws(stringr::str_extract(x$end, '(?<=source:\\s)(\\w+,?\\s){4}'))
#[1] "from animal origin as" "Eggs, liver, certain fish"
# "Leafy green vegetables such"
(?<=...) is a positive lookbehind that searches for 'source:' followed by whitespace.
We then capture 4 "words" after it, each one being word characters followed by an optional comma and a whitespace.

splitting strings using regex in R

I have a really long list of strings that look like the following, and I want to split each one into several pieces.
strings <- c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
             "https://www.website.com/stats/stat.229.y2019.eoff.t476.html")
and the desired output is as below:
links Year Seas Tour
https://www.website.com/stats/stat.227. y2020 eon t879
https://www.website.com/stats/stat.229. y2019 eoff t476
How can I achieve this using regex?
Using str_match:
stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
You can use the same regex in tidyr::extract if you put strings in a dataframe.
tidyr::extract(data.frame(strings), strings, c("Year", "Seas", "Tour"),
               '\\.(y\\d+)\\.(\\w+)\\.(t\\d+)', remove = FALSE)
#                                                      strings  Year Seas Tour
#1  https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020  eon t879
#2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476
Here, we capture data in 3 parts (capture groups)
1st part - 'y' followed by a number
2nd part - the next word following part 1
3rd part - 't' followed by a number
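If you prefer to assemble the data frame from str_match() yourself, a minimal sketch (str_match() returns a matrix whose first column is the full match and whose remaining columns are the capture groups; here links keeps the full URL rather than the truncated prefix shown in the desired output):
m <- stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
out <- data.frame(links = strings, Year = m[, 2], Seas = m[, 3], Tour = m[, 4])
out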
You could use {unglue}:
library(unglue)
unglue::unglue_data(
  strings, "{links}.{Year=[^.]+}.{Seas=[^.]+}.{Tour=[^.]+}.html")
#> links Year Seas Tour
#> 1 https://www.website.com/stats/stat.227 y2020 eon t879
#> 2 https://www.website.com/stats/stat.229 y2019 eoff t476
here "[^.]+" means "one or more non dot characters", which is what we want for Year, Seas, and Tour.

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets which have mentions starting with '#'. I need to extract all of them and save the mentions in each tweet as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a single string separated by spaces, as mentioned earlier?
Thanks in advance.
I think it would be best if you used an as-is (AsIs) list column in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want a single comma-separated string per row, you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
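If you want the space-separated "#mention1 #mention2" format from the question rather than a comma-separated one, the same idea fits in a mutate() call; a sketch assuming tweets_date has a Tweet column as in the question:
library(dplyr)
library(stringr)
tweets_date <- tweets_date %>%
  mutate(Mentions = sapply(str_extract_all(Tweet, "#\\w+"), paste, collapse = " "))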
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice
