Why isn't str_count working with multiple strings? - r

I have a string with text like this:
Text <- c("How are you","What is your name","Hi my name is","You ate your cake")
And I want an output that counts the number of times the word "you" or "your" appears
Text NumYou
"How are you" 1
"What is your name" 1
"Hi my name is" 0
"You ate your cake" 2
I tried using the str_count function but it was missing occurrences of "you" and "your"
NumYou = str_count(Text,c("you","your"))
Why isn't str_count working correctly?

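str_count is vectorised over the pattern as well as the string, so a vector of two patterns is recycled along Text ("you" for the first element, "your" for the second, and so on) instead of being combined. Pass the pattern as one string: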
stringr::str_count(tolower(Text),'you|your')
#[1] 1 1 0 2
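To make the recycling visible, here is a minimal sketch that runs both calls side by side. It assumes stringr is installed; the word-boundary pattern is an illustrative variant, not part of the answer above.

```r
library(stringr)

Text <- c("How are you", "What is your name", "Hi my name is", "You ate your cake")

# A vector pattern is paired element-wise with Text:
# "you", "your", "you", "your" are tried against elements 1 to 4 in turn.
recycled <- str_count(Text, c("you", "your"))

# One regex with an optional "r" and word boundaries counts both words
# and ignores case without needing tolower().
combined <- str_count(Text, regex("\\byour?\\b", ignore_case = TRUE))

print(recycled)  # 1 1 0 1
print(combined)  # 1 1 0 2
```

The last element is where the two differ: it is only ever tested against "your", so the leading "You" is never counted.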

Related

Get the word before exclamation mark in R tidyverse

I'm wondering how to get the words that occur before an exclamation mark. I have a data frame with a different string in each row. I have tried the following:
text %>%
str_match("!",lines)
I don't really get what I want and I'm a bit lost. Does anyone have advice?
You can str_extract_all the words before the ! using lookahead:
Data:
text <- c("Hello!", "This a test sentence", "That's another test sentence, yes! It is!", "And that's one more")
Solution:
library(stringr)
unlist(str_extract_all(text, "\\b\\w+\\b(?=!)"))
[1] "Hello" "yes" "is"
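To see what the lookahead contributes, the sketch below compares a pattern that consumes the "!" with one that only requires it to follow. It assumes stringr is installed and uses a shortened copy of the data above.

```r
library(stringr)

text <- c("Hello!", "That's another test sentence, yes! It is!")

# Without lookahead the "!" is part of the match.
with_mark <- unlist(str_extract_all(text, "\\w+!"))

# With lookahead the "!" must follow but is not returned.
without_mark <- unlist(str_extract_all(text, "\\w+(?=!)"))

print(with_mark)     # "Hello!" "yes!" "is!"
print(without_mark)  # "Hello" "yes" "is"
```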
If you seek a dplyr solution:
library(dplyr)
data.frame(text) %>%
mutate(Word_before_excl = str_extract_all(text, "\\b\\w+\\b(?=!)"))
text Word_before_excl
1 Hello! Hello
2 This a test sentence
3 That's another test sentence, yes! It is! yes, is
4 And that's one more
Maybe we can use regmatches
> sapply(regmatches(text, gregexpr("\\b\\w+\\b(?=!)", text, perl = TRUE)), toString)
[1] "Hello" "" "yes, is" ""
You could also use strsplit, provided every word is directly followed by an exclamation mark:
> unlist(strsplit("Dog!Cat!", "!"))
[1] "Dog" "Cat"

Find and replace numbers in an R txt file

I am attempting to find all sentences in a text file in R that contain numbers of any format and surround those numbers with hashtags.
For example, take the input below:
ex <- c("I have $5.78 in my account","Hello my name is blank","do you want 1,785 puppies?",
"I love stack overflow!","My favorite numbers are 3, 14,568, and 78")
as the output of the function, I'm looking for:
> "I have #$5.78# in my account"
> "do you want #1,785# puppies?"
> "My favorite numbers are #3#, #14,568#, and #78#"
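Surrounding numbers is straightforward, assuming that any run of digits, periods, commas, and dollar signs counts as a number. Note that \\b does not match between a space and "$", so the dollar sign ends up outside the hashtags: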
gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex)
# [1] "I have $#5.78# in my account"
# [2] "Hello my name is blank"
# [3] "do you want #1,785# puppies?"
# [4] "I love stack overflow!"
# [5] "My favorite numbers are #3#, #14,568#, and #78#"
To filter out just the numbered entries:
grep("\\d", gsub("\\b([-$0-9.,]+)\\b", "#\\1#", ex), value = TRUE)
# [1] "I have $#5.78# in my account"
# [2] "do you want #1,785# puppies?"
# [3] "My favorite numbers are #3#, #14,568#, and #78#"
We can use gsub
gsub("(?<=\\s)(?=[$0-9])|(?<=[0-9])(?=,?[ ]|$)", "#", ex, perl = TRUE)
#[1] "I have #$5.78# in my account" "Hello my name is blank"
#[3] "do you want #1,785# puppies?" "I love stack overflow!"
#[5] "My favorite numbers are #3#, #14,568#, and #78#"
Another step-by-step approach: use grep to identify the elements of the text that match the pattern "[0-9]", subset those elements with ex[....], and use the pipe operator %>% (available after library(dplyr)) to pass the subset to gsub. Then use @r2evans' logic to place hashtags around the numeric entries, as shown below:
library(dplyr)
ex[do.call(grep,list("[0-9]",ex))] %>% gsub("\\b([-$0-9.,]+)\\b", "#\\1#",.)
The do.call(grep,list("[0-9]",ex)) portion of the code returns the indices for the text elements in ex with numeric entries.
Output
[1] "I have $#5.78# in my account" "do you want #1,785# puppies?"
[3] "My favorite numbers are #3#, #14,568#, and #78#"
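If the dollar sign should sit inside the hashtags, as in the requested output, one possibility is to describe a number explicitly instead of relying on \\b. This is a sketch in base R only; the pattern is an assumption about what counts as a number (an optional "$", digits, then any number of ".digits" or ",digits" groups) and is not taken from the answers above.

```r
ex <- c("I have $5.78 in my account", "Hello my name is blank",
        "do you want 1,785 puppies?", "I love stack overflow!",
        "My favorite numbers are 3, 14,568, and 78")

# Optional "$", then digits, then any number of ".digits" or ",digits" groups.
# This pattern is an assumption about what counts as a number.
num_pattern <- "(\\$?\\d+(?:[.,]\\d+)*)"

# Keep only the elements that contain a digit, then wrap each number.
tagged <- gsub(num_pattern, "#\\1#", ex[grepl("\\d", ex)], perl = TRUE)
print(tagged)
```

Because a separator only counts when digits follow it, the comma after "3" and the one after "14,568" stay outside the hashtags.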

Selecting only strings with 2 words in vector via regex

I have a simple character string like:
test<-c("two words", "three more words", "something else", "this has a lot of words", "more of this", "pick me")
I would need a function that returns the indices of test where there are only 2 words in the element (in this example this would be index 1, 3 and 6, but 2, 4 and 5 are completely uninteresting). More context: I am searching for "real" names of persons among a large vector that is mixed also with company names (which have often 3 or more words). I have no clue how to perhaps get regex (or any other technique) to do this...
We can use grep to match a word (\\w+) followed by a space followed by another word (\\w+), anchored from the start (^) to the end ($) of the string
grep("^\\w+ \\w+$", test)
#[1] 1 3 6
Or with str_count
library(stringr)
which(str_count(test, "\\w+") == 2)
#[1] 1 3 6
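Another option uses stringr's word(), which returns NA when a string has fewer than three words (so it would also pick up one-word elements):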
which(is.na(word(test, 1, 3, fixed(" "))))
[1] 1 3 6
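Since the goal is to find personal names, it may help to count whitespace-separated tokens rather than \\w+ runs, so that hyphens and apostrophes do not split a name. A minimal base R sketch follows; the vector people is hypothetical and only there to illustrate the difference.

```r
test <- c("two words", "three more words", "something else",
          "this has a lot of words", "more of this", "pick me")

# Split on runs of whitespace and count the pieces per element.
n_tokens <- lengths(strsplit(trimws(test), "\\s+"))
two_word_idx <- which(n_tokens == 2)
print(two_word_idx)  # 1 3 6

# Hypothetical names: the hyphenated one is still two tokens here,
# whereas "^\\w+ \\w+$" would reject it because "-" is not a word character.
people <- c("Jean-Luc Picard", "Acme Trading Company")
people_idx <- which(lengths(strsplit(trimws(people), "\\s+")) == 2)
print(people_idx)  # 1
```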

R - put space at word begins with capital letter, for full column

I have a column imported from an XLSX file into R, where each row holds a sentence without spaces, but each word begins with a capital letter. I tried to use
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
but this is working, if I start converting each row.
Example
1 HowDoYouWorkOnThis
2 ThisIsGreatExample
3 ProgrammingIsGood
Expected is
1 How Do You Work On This
2 This Is Great Example
3 Programming Is Good
Is this what you're after?
s <- c("HowDoYouWorkOnThis", "ThisIsGreatExample", "ProgrammingIsGood")
sapply(s, function(x) trimws(gsub("([A-Z])", " \\1", x)))
# HowDoYouWorkOnThis ThisIsGreatExample ProgrammingIsGood
#"How Do You Work On This" "This Is Great Example" "Programming Is Good"
Or using stringr::str_replace_all:
library(stringr)
trimws(str_replace_all(s, "([A-Z])", " \\1"))
#[1] "How Do You Work On This" "This Is Great Example"
#[3] "Programming Is Good"
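Because gsub is vectorised, the same idea can be applied to a whole data frame column in one call. The sketch below uses a hypothetical data frame df with a column sentence standing in for the imported XLSX data, and a slightly different pattern that inserts the space between a lowercase letter and the uppercase letter after it, so no trimws is needed.

```r
# Hypothetical data frame standing in for the imported XLSX column.
df <- data.frame(sentence = c("HowDoYouWorkOnThis", "ThisIsGreatExample",
                              "ProgrammingIsGood"),
                 stringsAsFactors = FALSE)

# Insert a space between a lowercase letter and the uppercase letter after it.
# gsub() is vectorised, so the whole column is converted in one call.
df$spaced <- gsub("([[:lower:]])([[:upper:]])", "\\1 \\2", df$sentence)
print(df$spaced)
```

This variant does not separate consecutive capitals (for example a one-letter word followed by a capitalised word), so the "([A-Z])" approach above may suit such data better.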

R sorting by most commonly occurring

This is probably very, very simple, but I have a vector of phrases, some of which repeat and some of which don't, and I would like a list of unique phrases, sorted by the most commonly occurring.
e.g.
vec <- c("hello","hi","hi","greetings","good day", "hi", "hello", "good day","good morning","hello","good day")
sort(unique(vec))
[1] "good day" "good morning" "greetings" "hello" "hi"
I would expect "hi" to be first then followed by "hello" then followed by "good day" etc....
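Just use sort() on table(vec). Note that "hi", "hello" and "good day" each occur three times in this example, so they tie for first place: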
sort(table(vec), decreasing=TRUE)
# vec
# good day hello hi good morning greetings
# 3 3 3 1 1
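If only the phrases are wanted, without the counts, names() can be applied to the sorted table. A minimal base R sketch; the variable names counts and phrases are illustrative.

```r
vec <- c("hello", "hi", "hi", "greetings", "good day", "hi", "hello",
         "good day", "good morning", "hello", "good day")

counts <- sort(table(vec), decreasing = TRUE)

# names() drops the counts and keeps the phrases in frequency order.
phrases <- names(counts)
print(phrases)
```

The three phrases that occur three times come first, followed by the two that occur once; the order within each tied group is not determined by frequency.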
