Stringr pattern to detect capitalized words - r

I am trying to write a function to detect words that are written entirely in capital letters.
My current code:
df <- data.frame(title = character(), id = numeric())%>%
add_row(title= "THIS is an EXAMPLE where I DONT get the output i WAS hoping for", id = 6)
df <- df %>%
mutate(sec_code_1 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][1])
, sec_code_2 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][2])
, sec_code_3 = unlist(str_extract_all(title," [A-Z]{3,5} ")[[1]][3]))
df
Where the output is:
title                                                            id  sec_code_1  sec_code_2  sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for  6   DONT        WAS
The first 3-5 letter capitalized word is "THIS", the second should skip "EXAMPLE" (more than 5 letters) and be "DONT", and the third should be "WAS".
i.e.:
title                                                            id  sec_code_1  sec_code_2  sec_code_3
THIS is an EXAMPLE where I DONT get the output i WAS hoping for  6   THIS        DONT        WAS
Does anyone know where I'm going wrong? Specifically, how can I denote "space or beginning of string" or "space or end of string" in a stringr pattern?

If you run the code with your regex you'll realise 'THIS' is not included in the output at all.
str_extract_all(df$title," [A-Z]{3,5} ")[[1]]
#[1] " DONT " " WAS "
This is because you are extracting words with leading and trailing whitespace. 'THIS' does not have leading whitespace because it is at the start of the sentence, hence it does not satisfy the regex pattern. You can use word boundaries (\\b) instead.
str_extract_all(df$title,"\\b[A-Z]{3,5}\\b")[[1]]
#[1] "THIS" "DONT" "WAS"
Your code would work if you use the above pattern in it.
Or you could also use:
library(tidyverse)
df %>%
mutate(code = str_extract_all(title,"\\b[A-Z]{3,5}\\b")) %>%
unnest_wider(code) %>%
rename_with(~paste0('sec_code_', seq_along(.)), starts_with('..'))
# title id sec_code_1 sec_code_2 sec_code_3
# <chr> <dbl> <chr> <chr> <chr>
#1 THIS is an EXAMPLE where I DONT get t… 6 THIS DONT WAS
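The question also asks how to say "space or start of string" and "space or end of string" literally. One way to sketch that, as an alternative to \\b, is with negative lookarounds on \\S (non-whitespace). The hyphenated input below is a made-up example, added only to show where the two patterns differ.

```r
library(stringr)

title <- "THIS is an EXAMPLE where I DONT get the output i WAS hoping for"

# (?<!\\S): not preceded by a non-space character (start of string or whitespace)
# (?!\\S) : not followed by a non-space character (whitespace or end of string)
ws_pattern <- "(?<!\\S)[A-Z]{3,5}(?!\\S)"
codes_ws <- str_extract_all(title, ws_pattern)[[1]]

# Hypothetical input: \\b treats the hyphen as a boundary, the lookarounds do not
hyphenated <- "ABC-DEF and GHI"
codes_b   <- str_extract_all(hyphenated, "\\b[A-Z]{3,5}\\b")[[1]]
codes_ws2 <- str_extract_all(hyphenated, ws_pattern)[[1]]
```

Which of the two is preferable depends on whether codes attached to punctuation (hyphens, commas, brackets) should count as matches.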

Related

split cell at special character if comma found after first word

Hi, I've got some budget data with names and titles that read "Last, First - Title", and other rows in the same column that read "anything really - ,asd;flkajsd". I'd like to split the column at the "-" only IF the first word ends in a ",".
I've tried this:
C22$ITEM2 <- ifelse(grepl(",", C22$ITEM), C22$ITEM, NA)
test <- str_split_fixed(C22$ITEM2, "-", 2)
C22 <- cbind(C22, test)
But I'm also getting cells that have commas elsewhere; I need to limit the split to just "if first word ends in comma".
library(tidyverse)
data <- tibble(data = c("Doe, John - Mr", "Anna, Anna - Ms", " ,asd;flkajsd"))
data
data %>%
# first word must end with a comma
filter(data %>% str_detect("^[A-Za-z]+,")) %>%
separate(data, into = c("Last", "First", "Title"), sep = "[,-]") %>%
mutate_all(str_trim)
# A tibble: 2 × 3
# Last First Title
# <chr> <chr> <chr>
#1 Doe John Mr
#2 Anna Anna Ms
We may use extract to do this. The regex captures two groups ((...)): the first group matches a word (\\w+) at the start (^) of the string followed by a ",", zero or more spaces (\\s*), and another word (\\w+); then comes the "-", surrounded by zero or more spaces; the second capture group matches the word (\\w+) before the end ($) of the string.
library(tidyr)
library(dplyr)
# remove = FALSE keeps ITEM so that coalesce() can fall back to it
extract(C22, ITEM, into = c("Name", "Title"),
"^(\\w+,\\s*\\w+)\\s*-\\s*(\\w+)$", remove = FALSE) %>%
mutate(Name = coalesce(Name, ITEM), .keep = 'unused')
NOTE: The mutate is added for rows where the regex did not match and extract returned NA; coalesce then falls back to the original ITEM value for those rows, and .keep = 'unused' drops ITEM afterwards.
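To check what the pattern does and does not match, here is a minimal sketch using stringr::str_match on plain vectors instead of extract. The three sample strings are made up for illustration (the real C22$ITEM values are not shown in the question), and the fallback to the original string mirrors the coalesce step above.

```r
library(stringr)

# Hypothetical sample values standing in for C22$ITEM
items <- c("Doe, John - Director",
           "anything really - ,asd;flkajsd",
           "Budget, line 4")

# Column 1 of the match matrix is the full match, columns 2 and 3 are the groups
m <- str_match(items, "^(\\w+,\\s*\\w+)\\s*-\\s*(\\w+)$")

result <- data.frame(
  # keep the original text where the pattern did not match
  Name  = ifelse(is.na(m[, 2]), items, m[, 2]),
  Title = m[, 3]
)
```

Only the first string is split. The second has its comma after the dash, and the third has no dash at all, so both are left untouched with an NA title.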

Extract sentences from texts in data frame

I have a data frame with a column "text", and in each row "text" contains several sentences (maybe only two, maybe 100 or more). I would like to search the text in every row for specific keywords. If a keyword is found in a row, I would like to extract the sentences that contain keywords to a separate column, e.g.
needles = c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c("This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."),
findings = c("This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA)
)
So column "text" contains the sentences I want to check for keywords, "findings" is the result I would like to have in the end.
Can anyone help me how to apply the solution for all rows of the data frame?
Thank you!
We can work with list columns: split each row into sentences, then look for the needles inside each resulting sentence.
The reduce calls collapse the nested results: the first pastes multiple matching sentences of a row into one string, the second flattens the list column into a character vector.
Code:
library(tidyverse)
needles <- c("first", "hope", "analyze", "happy")
mydata <- data.frame(
text = c(
"This is the first sentence. It is the beginning of this project",
"My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
"And this is the last sentence. Finally my work ends. I am really happy about that.",
"These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
),
findings = c(
"This is the first sentence.",
"I hope this project will work fine. Then I will analyze the third sentence.",
"I am really happy about that.",
NA
)
)
df <- as_tibble(mydata) %>%
mutate(mydata, findings = str_split(text, "\\.\\s") %>%
map(~str_subset(., rebus::or1(needles))) %>%
map_if(~length(.) > 1, ~reduce(., ~paste(.x, .y, sep = '. '))),
findings = map_if(findings, ~length(.) == 0, ~NA) %>% reduce(c))
df
#> # A tibble: 4 × 2
#> text findings
#> <chr> <chr>
#> 1 This is the first sentence. It is the … This is the first sentence
#> 2 My second sentence is this. I hope thi… I hope this project will work fine. T…
#> 3 And this is the last sentence. Finally… I am really happy about that.
#> 4 These sentences do not contain any rel… <NA>
Created on 2021-11-27 by the reprex package (v2.0.1)
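The same split-then-filter idea can be sketched with stringr alone, for readers who would rather not depend on rebus. This is an illustration, not the answer's code: it splits on the whitespace after a period (a lookbehind, so the period stays with its sentence) and assumes sentences end only with ".".

```r
library(stringr)

needles <- c("first", "hope", "analyze", "happy")
text <- c(
  "This is the first sentence. It is the beginning of this project",
  "My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.",
  "And this is the last sentence. Finally my work ends. I am really happy about that.",
  "These sentences do not contain any relevant information. There is no keyword. And it is not relevant."
)

# One alternation pattern: first|hope|analyze|happy
pattern <- paste(needles, collapse = "|")

# Split after each period, keeping the period attached to the sentence
sentences <- str_split(text, "(?<=\\.)\\s+")

# Keep the matching sentences per row; NA when nothing matches
findings <- vapply(sentences, function(s) {
  hits <- str_subset(s, pattern)
  if (length(hits) == 0) NA_character_ else paste(hits, collapse = " ")
}, character(1))
```

Because the keywords are matched as plain substrings, "first" would also hit "firsthand"; wrapping each needle in \\b would restrict it to whole words.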
What about something like this:
find_sentence <- function(text, word){
require(stringr)
x <- c(str_split(text, "\\..", simplify=TRUE))
inds <- which(str_detect(x, word))
if(length(inds) > 0){
list(x[inds])
}else{
list(NA)
}
}
mydata %>%
rowwise %>%
mutate(res = find_sentence(text, "the")) %>%
unnest(res)
# # A tibble: 4 × 3
# text findings res
# <chr> <chr> <chr>
# 1 This is the first sentence. It is the beginning of this project This is the first sentence. This is the fi…
# 2 This is the first sentence. It is the beginning of this project This is the first sentence. It is the begi…
# 3 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence. I hope this project will wo… Then I will an…
# 4 And this is the last sentence. Finally my work ends. I am really happy about that. I am really happy about tha… And this is th…
This returns a new variable called res that has a separate row for each sentence containing the keyword. So, if two sentences contain the word (as in the first row of text), the text and findings columns are repeated for each of the matching sentences in res.
With Base R,
lookup <- strsplit(as.character(mydata[,1]),"\\.")
out <- lapply(lookup,function(x) {
logic <- grepl(paste0(needles,collapse="|"),x)
paste0(x[logic],collapse=".")
})
data.frame(findings = do.call(rbind,out) )
gives,
# findings
#1 This is the first sentence
#2 I hope this project will work fine. Then I will analyze the third sentence
#3 I am really happy about that
#4
This uses grep and a strsplit to get the matches.
mydata$findings <- sapply( strsplit( t(mydata), "\\. " ), function(x)
x[unlist( lapply( needles, function(y) grep(y, x) ) )] )
text
1 This is the first sentence. It is the beginning of this project
2 My second sentence is this. I hope this project will work fine. Then I will analyze the third sentence.
3 And this is the last sentence. Finally my work ends. I am really happy about that.
4 These sentences do not contain any relevant information. There is no keyword. And it is not relevant.
findings
1 This is the first sentence
2 I hope this project will work fine, Then I will analyze the third sentence.
3 I am really happy about that.
4

Split String with second (single) Backslash / R Emojis (Unicode) without Modifier

I have a tibble with a chr column that contains the Unicode escapes for emojis. I want to split these strings into two columns when needed, i.e. when the string contains more than one backslash. So I need to split at the 2nd backslash. It would also be enough to just delete everything from the 2nd backslash on.
Here is what I tried:
df <- tibble::tribble(
~RUser, ~REmoji,
"User1", "\U0001f64f\U0001f3fb",
"User2", "\U0001f64f",
"User2", "\U0001f64f\U0001f3fc"
)
df %>% mutate(newcol = gsub("\\\\*", "", REmoji))
I found the solution Replace single backslash in R. But in my case I have only one backslash, and I don't understand how to separate the column here.
The result should look like this output:
df2 <- tibble::tribble(
~RUser, ~REmoji1, ~newcol,
"User1", "\U0001f64f", "\U0001f3fb",
"User2", "\U0001f64f", "", #This Field is empty, since there was no Emoji-Modification
"User2", "\U0001f64f", "\U0001f3fc"
)
Thanks a lot!
We could also use substring from base R
df$newcol <- substring(df$REmoji, 2)
Note these \U... are single Unicode code points, not just a backslash + digits/letters.
Using the ^. PCRE regex with sub provides the expected results:
> df %>% mutate(newcol = sub("^.", "", REmoji, perl=TRUE))
# A tibble: 3 x 3
RUser REmoji newcol
<chr> <chr> <chr>
1 User1 "\U0001f64f\U0001f3fb" "\U0001f3fb"
2 User2 "\U0001f64f" ""
3 User2 "\U0001f64f\U0001f3fc" "\U0001f3fc"
Make sure you pass the perl=TRUE argument.
And in order to do the reverse, i.e. keep the first code point only, you can use:
df %>% mutate(newcol = sub("^(.).+", "\\1", REmoji, perl=TRUE))
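A short sketch of why no backslash handling is needed, assuming a UTF-8 session: each \U... escape is stored as one character, so the split is a matter of character positions. The vector name emoji is chosen here for illustration.

```r
emoji <- c("\U0001f64f\U0001f3fb", "\U0001f64f", "\U0001f64f\U0001f3fc")

# Each escape is a single code point: the strings are 2, 1 and 2 characters long
n_chars <- nchar(emoji, type = "chars")

# There is no literal backslash in the data to split on
has_backslash <- grepl("\\", emoji, fixed = TRUE)

# Split by position: first character is the emoji, the rest is the modifier
base_emoji <- substr(emoji, 1, 1)
modifier   <- substring(emoji, 2)
```

This relies on the modifier always being the second code point, which holds for the three rows shown but not for longer emoji sequences.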

extracting names and numbers using regex

I think I might have some issues with understanding the regular expressions in R.
I need to extract phone numbers and names from a sample vector and create a data-frame with corresponding columns for names and numbers using stringr package functionality.
The following is my sample vector.
phones <- c("Ann 077-789663", "Johnathan 99656565",
"Maria2 099-65-6569 office")
The code that I came up with to extract those is as follows
numbers <- str_remove_all(phones, pattern = "[^0-9]")
numbers <- str_remove_all(numbers, pattern = "[a-zA-Z]")
numbers <- trimws(numbers)
names <- str_remove_all(phones, pattern = "[A-Za-z]+", simplify = T)
phones_data <- data.frame("Name" = names, "Phone" = numbers)
It doesn't work, as it takes the digit in the name and joins with the phone number. (not optimal code as well)
I would appreciate some help in explaining the simplest way for accomplishing this task.
Not a regex expert, however with stringr package we can extract a number pattern with optional "-" in it and replace the "-" with empty string to extract numbers without any "-". For names, we extract the first word at the beginning of the string.
library(stringr)
data.frame(Name = str_extract(phones, "^[A-Za-z]+"),
Number = gsub("-","",str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
If you want to stick completely with stringr we can use str_replace_all instead of gsub
data.frame(Name = str_extract(phones, "[A-Za-z]+"),
Number=str_replace_all(str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+"), "-",""))
# Name Number
#1 Ann 077789663
#2 Johnathan 99656565
#3 Maria 099656569
I think Ronak's answer is good for the name part, I don't really have a good alternative to offer there.
For numbers, I would go with "numbers and hyphens, with a word boundary at either end", i.e.
numbers = str_extract(phones, "\\b[-0-9]+\\b") %>%
str_remove_all("-")
# Can also specify that you need at least 5 numbers/hyphens
# in a row to match
numbers2 = str_extract(phones, "\\b[-0-9]{5,}\\b") %>%
str_remove_all("-")
That way, you're not locked into a fixed format for the number of hyphens that appear in the number (my suggested regex allows for any number).
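To make the difference concrete, here is a small comparison of the fixed two-hyphen pattern with the word-boundary pattern. The fourth entry ("Bob ...") is a hypothetical number with three hyphens, added only for this illustration.

```r
library(stringr)

phones <- c("Ann 077-789663", "Johnathan 99656565",
            "Maria2 099-65-6569 office",
            "Bob 1-800-555-0199")  # hypothetical extra entry

# Fixed format: at most two hyphens, so a third group of digits is cut off
fixed <- str_extract(phones, "[0-9]+[-]?[0-9]+[-]?[0-9]+")

# Flexible format: any run of at least 5 digits/hyphens between word boundaries
flexible <- str_extract(phones, "\\b[-0-9]{5,}\\b")

fixed_digits    <- str_remove_all(fixed, "-")
flexible_digits <- str_remove_all(flexible, "-")
```

Both patterns agree on the three original entries; only the extra entry is truncated by the fixed pattern.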
If you (like me) prefer to use base-R and want to keep the regex as simple as possible you could do something like this:
phone_split <- lapply(
strsplit(phones, " "),
function(x) {
name_part <- grepl("[^-0-9]", x)
c(
name = paste(x[name_part], collapse = " "),
phone = x[!name_part]
)
}
)
phone_split
[[1]]
name phone
"Ann" "077-789663"
[[2]]
name phone
"Johnathan" "99656565"
[[3]]
name phone
"Maria2 office" "099-65-6569"
do.call(rbind, phone_split)
name phone
[1,] "Ann" "077-789663"
[2,] "Johnathan" "99656565"
[3,] "Maria2 office" "099-65-6569"

Looping through and replacing text in a data frame

I have a dataframe that consists of a variable with multiple words, such as:
variable
"hello my name is this"
"greetings friend"
And another dataframe that consists of two columns, one of which is words, the other of which is replacements for those words, such as:
word
"hello"
"greetings"
replacement:
replacement
"hi"
"hi"
I'm trying to find an easy way to replace the words in "variable" with the replacement words, looping over both all the observations, and all the words in each observation. The desired result is:
variable
"hi my name is this"
"hi friend"
I've looked into some methods that use cSplit, but it's not feasible for my application (there are too many words in any given observation of "variable", so this creates too many columns). I'm not sure how I would use strsplit for this, but am guessing that is the correct option?
EDIT: From my understanding of this question, my question may be a repeat of a previously unanswered question: Replace strings in text based on dictionary
stringr's str_replace_all would be handy in this case:
df = data.frame(variable = c('hello my name is this','greetings friend'))
replacement <- data.frame(word = c('hello','greetings'), replacment = c('hi','hi'), stringsAsFactors = F)
stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
Output:
> stringr::str_replace_all(df$variable,replacement$word,replacement$replacment)
[1] "hi my name is this" "hi friend"
This is similar to @amrrs's solution, but I am using a named vector instead of supplying two separate vectors. This also addresses the issue mentioned by OP in the comments:
library(dplyr)
library(stringr)
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .) %>%
str_replace_all(df1$variable, .)
# [1] "hi my name is this" "hi friend" "hi, hellomy is not a word"
# [4] "hi! my friend"
This is the named vector with regex as the names and string to replace with as elements:
df2$word %>%
paste0("\\b", ., "\\b") %>%
setNames(df2$replacement, .)
# \\bhello\\b \\bgreetings\\b
# "hi" "hi"
Data:
df1 = data.frame(variable = c('hello my name is this',
'greetings friend',
'hello, hellomy is not a word',
'greetings! my friend'))
df2 = data.frame(word = c('hello','greetings'),
replacement = c('hi','hi'),
stringsAsFactors = F)
Note:
In order to address the issue of root words also being converted, I wrapped the regex with word boundaries (\\b). This makes sure that I am not converting words that live inside another, like "helloguys".
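A further reason to prefer the named vector: when pattern and replacement are passed as two separate vectors, str_replace_all pairs them with the strings element by element, so the result depends on row order. The sketch below reverses the two example rows to show this; the variable names are chosen for illustration.

```r
library(stringr)

# Same data as above, but with the rows in reverse order
variable    <- c("greetings friend", "hello my name is this")
word        <- c("hello", "greetings")
replacement <- c("hi", "hi")

# Pairwise: "hello" is only tried on row 1, "greetings" only on row 2
pairwise <- str_replace_all(variable, word, replacement)

# Named vector: every pattern is applied to every string
dictionary <- setNames(replacement, paste0("\\b", word, "\\b"))
lookup <- str_replace_all(variable, dictionary)
```

Here pairwise comes back unchanged, while lookup replaces both words regardless of row order.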

Resources