Matching Wildcard Pattern and Character String in R

I am trying to count how many times a set of keywords appears in a string. In the variable text below, I would like to count how many times the entries of keyword occur. The result should be 3, because AWAY appears twice and WIN appears once in the string.
text<- "AWAYTEAM IS XXX. I THINK THEAWAYTEAM WILL WIN"
keyword<- c("AWAY","WIN")
Any ideas?

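We may use str_count with sum. str_count is vectorised over the pattern, so it returns one count per keyword (2 and 1 here), and sum adds them up: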
library(stringr)
sum(str_count(text, keyword))
[1] 3
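For comparison, the same count can be sketched in base R without stringr. This is only an illustration: the helper name count_one is made up for this example, and it treats the keywords as fixed strings rather than regular expressions.

```r
text <- "AWAYTEAM IS XXX. I THINK THEAWAYTEAM WILL WIN"
keyword <- c("AWAY", "WIN")

# Hypothetical helper: count non-overlapping occurrences of one fixed keyword
count_one <- function(k, s) {
  m <- gregexpr(k, s, fixed = TRUE)[[1]]
  sum(m > 0)  # gregexpr returns -1 when there is no match
}

# One count per keyword, then the overall total
per_keyword <- vapply(keyword, count_one, integer(1), s = text)
total <- sum(per_keyword)
total
```

Keeping per_keyword around can be handy if the breakdown by keyword is needed as well as the total.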

One possibility using stringr
library(stringr)
text<- "AWAYTEAM IS XXX. I THINK THEAWAYTEAM WILL WIN"
keyword<- c("AWAY","WIN")
length(unlist(str_extract_all(text, keyword)))
#> [1] 3
Created on 2021-08-22 by the reprex package (v2.0.0)

Related

R: extract substring with capital letters from string

I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?
This is an example:
fecha incident
1 2020-12-01 Check GENERATOR
2 2020-12-01 Check BLADE
3 2020-12-02 Problem in GENERATOR
4 2020-12-01 Check YAW
5 2020-12-02 Alarm in SAFETY SYSTEM
And I would like to create another column as follows:
fecha incident system
1 2020-12-01 Check GENERATOR GENERATOR
2 2020-12-01 Check BLADE BLADE
3 2020-12-02 Problem in GENERATOR GENERATOR
4 2020-12-01 Check YAW YAW
5 2020-12-02 Alarm in SAFETY SYSTEM SAFETY SYSTEM
I have tried with str_sub or str_extract_all using a regex but I believe I'm doing things wrong.
You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.
The regex asks either for capital letters or space and there need to be two or more consecutive ones (so it does not find capitalized words). str_trim removes the white-space that can get picked up if the capitalized word is not at the end of the string.
Note that this code snippet will only extract the first run of upper-case words connected by spaces. If there are upper-case words in different parts of the string, only the first run is returned.
library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL" "CAP" "MULTIPLE CAPITAL" "CAP"
Created on 2021-01-08 by the reprex package (v0.3.0)
library(tidyverse)
string <- data.frame(test="does this WORK")
string$new <-str_extract_all(string$test, "[A-Z]+")
string
test new
1 does this WORK WORK
If there are cases when the upper-case letters are not next to each other you can use str_extract_all to extract all the capital letters in a sentence and then paste them together.
sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'),paste0, collapse = ' ')
#[1] "GENERATOR" "BLADE" "GENERATOR" "YAW" "SAFETY SYSTEM"
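Applied to the data frame from the question, the same idea can be sketched in base R. The data frame df below is retyped from the question's example, and perl = TRUE is used to avoid locale-dependent character ranges; both are assumptions of this sketch rather than part of the original answers.

```r
# Example data retyped from the question
df <- data.frame(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM"),
  stringsAsFactors = FALSE
)

# Find every run of two or more capital letters in each incident
caps <- regmatches(df$incident, gregexpr("[A-Z]{2,}", df$incident, perl = TRUE))

# Join the runs found in each row with a space and store them in a new column
df$system <- vapply(caps, paste, character(1), collapse = " ")
df
```

Rows without any upper-case run end up with an empty string in system, which may or may not be what is wanted.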

splitting strings using regex in R

I have a really long list of strings that look like the following, and I want to split each one into several pieces.
strings<-c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
"https://www.website.com/stats/stat.229.y2019.eoff.t476.html")
and the desired output is as below:
links Year Seas Tour
https://www.website.com/stats/stat.227. y2020 eon t879
https://www.website.com/stats/stat.229. y2019 eoff t476
How can I achieve this using regex?
Using str_match :
stringr::str_match(strings, '.*\\.(y\\d+)\\.(\\w+)\\.(t\\d+)')
You can use the same regex in tidyr::extract if you put strings in a dataframe.
tidyr::extract(data.frame(strings), strings, c("Year","Seas", "Tour"),
'\\.(y\\d+)\\.(\\w+)\\.(t\\d+)', remove = FALSE)
# strings Year Seas Tour
#1 https://www.website.com/stats/stat.227.y2020.eon.t879.html y2020 eon t879
#2 https://www.website.com/stats/stat.229.y2019.eoff.t476.html y2019 eoff t476
Here, we capture data in 3 parts (capture groups)
1st part - 'y' followed by a number
2nd part - next word following part 1
3rd part - 't' followed by a number.
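The capture-group approach can also be sketched in base R with regexec and regmatches. The pattern below is an adaptation, not the answer's exact regex: it adds a fourth group for the leading part of the URL so that a links column can be built, and it assumes every string matches (a non-matching string would break the rbind step).

```r
strings <- c("https://www.website.com/stats/stat.227.y2020.eon.t879.html",
             "https://www.website.com/stats/stat.229.y2019.eoff.t476.html")

# Group 1: everything up to and including the dot before the year
# Group 2: 'y' + digits, Group 3: next word, Group 4: 't' + digits
pattern <- "^(.*\\.)(y[0-9]+)\\.([[:alnum:]_]+)\\.(t[0-9]+)\\.html$"

# Each list element holds the full match followed by the four groups
m <- regmatches(strings, regexec(pattern, strings))
parts <- do.call(rbind, m)

out <- data.frame(links = parts[, 2], Year = parts[, 3],
                  Seas = parts[, 4], Tour = parts[, 5],
                  stringsAsFactors = FALSE)
out
```

Column 1 of parts is the full match, which is why the groups start at column 2.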
You could use {unglue} :
library(unglue)
unglue::unglue_data(
strings, "{links}.{Year=[^.]+}.{Seas=[^.]+}.{Tour=[^.]+}.html")
#> links Year Seas Tour
#> 1 https://www.website.com/stats/stat.227 y2020 eon t879
#> 2 https://www.website.com/stats/stat.229 y2019 eoff t476
Here, "[^.]+" means "one or more non-dot characters", which is what we want for Year, Seas, and Tour.

Extract words starting with # in R dataframe and save as new column

My dataframe column looks like this:
head(tweets_date$Tweet)
[1] b"It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac
[2] b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81
[3] b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!
[4] b'CHAMPIONS - 2018 #IPLFinal
[5] b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.
[6] b"Final. It's all over! Chennai Super Kings won by 8 wickets
These are tweets containing hashtags (mentions starting with '#'). I need to extract all of them and save the mentions from each tweet as a single string such as "#mention1 #mention2". Currently my code just extracts them as lists.
My code:
tweets_date$Mentions<-str_extract_all(tweets_date$Tweet, "#\\w+")
How do I collapse those lists in each row to form a string separated by spaces, as mentioned earlier?
Thanks in advance.
I think it would be best to use an AsIs list column (created with I()) in this case:
extract words:
library(stringr)
Mentions <- str_extract_all(lis, "#\\w+")
some data frame:
df <- data.frame(col = 1:6, lett = LETTERS[1:6])
create a list column:
df$Mentions <- I(Mentions)
df
#output
col lett Mentions
1 1 A #DineshK....
2 2 B #IPL, #p....
3 3 C
4 4 D
5 5 E #ChennaiIPL
6 6 F
I think this is better since it allows for quite easy subsetting:
df$Mentions[[1]]
#output
[1] "#DineshKarthik" "#KKRiders"
df$Mentions[[1]][1]
#output
[1] "#DineshKarthik"
and it succinctly shows what's inside the column when printing the df.
data:
lis <- c("b'It is #DineshKarthik's birthday and here's a rare image of the captain of #KKRiders. Have you seen him do this before? Happy birthday, DK\\xf0\\x9f\\x98\\xac",
"b'The awesome #IPL officials do a wide range of duties to ensure smooth execution of work! Here\\xe2\\x80\\x99s #prabhakaran285 engaging with the #ChennaiIPL kid-squad that wanted to meet their daddies while the presentation was on :) #cutenessoverload #lineofduty \\xf0\\x9f\\x98\\x81",
"b'\\xf0\\x9f\\x8e\\x89\\xf0\\x9f\\x8e\\x89\\n\\nCHAMPIONS!!",
"b'CHAMPIONS - 2018 #IPLFinal",
"b'Chennai are Super Kings. A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy \\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86\\xf0\\x9f\\x8f\\x86. This is their moment to cherish, a moment to savour.",
"b'Final. It's all over! Chennai Super Kings won by 8 wickets")
The str_extract_all function from the stringr package returns a list of character vectors. So, if you instead want one comma-separated string per tweet, you may try using sapply for a base R option:
tweets <- str_extract_all(tweets_date$Tweet, "#\\w+")
tweets_date$Mentions <- sapply(tweets, function(x) paste(x, collapse=", "))
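Since the question asks for the mentions separated by spaces, a variant that collapses with a single space may be closer to the requested output. The sketch below uses base R only; the three sample tweets are taken from the question's data (the second one is shortened), and the pattern "#[[:alnum:]_]+" is an assumed stand-in for "#\\w+".

```r
# Sample tweets from the question (second one shortened for readability)
tweets <- c(
  "b'CHAMPIONS - 2018 #IPLFinal",
  "A fairytale comeback as #ChennaiIPL beat #SRH by 8 wickets to seal their third #VIVOIPL Trophy",
  "b'Final. It's all over! Chennai Super Kings won by 8 wickets"
)

# List with one character vector of hashtags per tweet
hits <- regmatches(tweets, gregexpr("#[[:alnum:]_]+", tweets))

# Collapse each vector into a single space-separated string
mentions <- vapply(hits, paste, character(1), collapse = " ")
mentions
```

A tweet without any hashtag yields an empty string, so the result always has one entry per tweet and can be assigned directly to a data frame column.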
Via Twitter's help site: "Your username cannot be longer than 15 characters. Your real name can be longer (20 characters), but usernames are kept shorter for the sake of ease. A username can only contain alphanumeric characters (letters A-Z, numbers 0-9) with the exception of underscores, as noted above. Check to make sure your desired username doesn't contain any symbols, dashes, or spaces."
Note that email addresses can be in tweets as can URLs with #'s in them (and not just the silly URLs with username/password in the host component). Thus, something like:
(^|[^[[:alnum:]_]#/\\!?=&])#([[:alnum:]_]{1,15})\\b
is likely a better, safer choice

Converting a column in a single cell in R

I have a dataframe "animal" like the following:
Word Frequency
Dog 5
Cat 6
I want it to look like the following:
Word
"Dog","Cat"
I have used as.vector and as.list but haven't been successful. Please help. Thanks
You can use toString, which collapses the column into a single comma-separated string:
toString(animal$Word)
#[1] "Dog, Cat"
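toString separates the elements with a comma and a space and does not add quotes. If the literal form "Dog","Cat" from the question is wanted, one option is paste0 with collapse. The data frame below is retyped from the question and is an assumption about its structure.

```r
# Example data retyped from the question
animal <- data.frame(Word = c("Dog", "Cat"), Frequency = c(5, 6),
                     stringsAsFactors = FALSE)

# Wrap each word in double quotes, then join with commas into one string
quoted <- paste0('"', animal$Word, '"', collapse = ",")

# cat prints the string without escaping the embedded quotes
cat(quoted, "\n")
```

Printing quoted with print would show the inner quotes escaped with backslashes; the underlying string is the same.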

Trimming records: everything after 7 characters

I am trying to remove/trim everything after 7 characters.
issue #1 removing everything after 6 or 7 characters
example: 1Q7 4B7 MY NAME IS MARY
Results I want: 1Q7 4B7
issue #2 removing one space
example: EQ9 2IQ
Results I want: EQ9 2IQ
Please assist.
Thanks
SELECT SUBSTR(column, 1, 7) AS MYCOLUMN FROM TABLE
Something like this should do the trick; you may have to fiddle with the numbers.