Simple question here, perhaps a duplicate of an existing one. I'm trying to figure out how to count the number of times a word appears in a vector. I know I can count the number of rows a word appears in, as shown here:
library(tidyverse)

temp <- tibble(idvar = 1:3,
               response = c("This sounds great",
                            "This is a great idea that sounds great",
                            "What a great idea"))

temp %>% count(grepl("great", response)) # lots of ways to do this line
# answer = 3
The answer in the code above is 3 since "great" appears in three rows. However, the word "great" appears 4 times in the "response" vector. How do I find that count instead?
We could use str_count from stringr to get the number of instances of 'great' in each row, and then take the sum of that count:
library(tidyverse)
temp %>%
  mutate(n = str_count(response, 'great')) %>%
  summarise(n = sum(n))
# A tibble: 1 x 1
# n
# <int>
#1 4
Or using regmatches/gregexpr from base R:
sum(lengths(regmatches(temp$response, gregexpr('great', temp$response))))
#[1] 4
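If you happen to use stringi, a fixed-pattern variant (a sketch, not from the original answers) should give the same count:
library(stringi)
sum(stri_count_fixed(temp$response, "great"))
#[1] 4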
Off the top of my head, this should solve your problem:
library(tidyverse)
temp$response %>%
  str_extract_all('great') %>%
  unlist() %>%
  length()
Let's say I have a string as follows:
string <- "the home home on the range the friend"
All I want to do is determine which words in the string appear at least 2 times.
The pseudocode here is:
Count how many times each word appears
Return a list of words that appear at least two times in the string
The final result should be a list featuring both "the" and "home", in that order.
I am hoping to do this using the tidyverse, ideally with stringr or dplyr. I was attempting to use tidytext as well but have been struggling.
We can split the string by whitespace, build a frequency table, and subset based on frequency:
out <- table(strsplit(string, "\\s+")[[1]])
out[out >=2]
home the
2 3
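Since the question asks for "the" before "home" (i.e. ordered by frequency), a small follow-up sketch sorts that subset:
sort(out[out >= 2], decreasing = TRUE)
the home
  3    2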
Yet another possible solution:
library(tidyverse)
data.frame(x = str_split(string, "\\s+", simplify = TRUE) %>% t()) %>%
  add_count(x) %>%
  filter(n >= 2) %>%
  distinct() %>%
  pull(x)
#> [1] "the" "home"
Another option is separate_rows from tidyr:
library(tidyverse)

data.frame(string) %>%
  separate_rows(string) %>%
  count(string, sort = TRUE) %>%
  filter(n >= 2)
Result:
# A tibble: 2 × 2
string n
<chr> <int>
1 the 3
2 home 2
Here's an approach using quanteda that prints "the" before "home", as requested in the original post.
library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]
...and the output:
> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the" "home"
Here is another option using tidytext and tidyverse. We first split out each word (unnest_tokens), then count each word, sorting by frequency. We keep only words with at least two occurrences, then use tibble::deframe to return a named vector.
library(tidytext)
library(tidyverse)
tibble(string) %>%
  unnest_tokens(word, string) %>%
  count(word, sort = TRUE) %>%
  filter(n >= 2) %>%
  deframe()
Output
the home
3 2
Or, if you want to keep the result as a data frame, just skip the last step with deframe.
So basically I have a vector of tags that I want to find in my Transcript column (row by row). If I find any word from the tags in my Transcript string, I want to create a separate column concatenating all the matched tags, as in the example below:
tags <- c("loan", "deposit", "quarter", "morning")
So the output should be the data frame with an added column concatenating all matched tags per row (the original post showed this as an image).
Currently, I am able to do this tagging using two for loops: one to go over the tags vector and the other to go over my data frame's Transcript column row by row. But I have a tag list of around 500 words and the data frame has more than 100,000 rows, so I am concerned about the run time. Is there a better way to optimize my R code, using an apply function or another method?
I am using the following code to tag all the rows of the Transcript column one by one:
library(stringr)
library(stringi)

for (i in seq_along(tags)) {
  for (j in seq_len(nrow(FinalData))) {
    check_tag <- str_extract(string = FinalData$Cleaned_Transcript[j], pattern = tags[i])
    if (!is.na(check_tag)) {
      FinalData$Tags[j] <- stri_remove_empty(paste(FinalData$Tags[j], check_tag, sep = ","))
    }
  }
}
Not sure if you are open to avoiding the for loop, but if so, here's a tidyverse approach.
library(tidyverse)
dat <- data.frame(Transcript = c("This is example text a", "this is loan", "deposit is not quarter"))
# as per the OP's comment, we want to provide an input vector of tags
my_tags <- c("loan", "deposit", "quarter", "morning")
my_tags_collapsed <- str_c(my_tags, collapse = "|")
# We can now use the collapsed tags in the str_extract_all function
dat %>%
  mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
  unnest_wider(test) %>%
  mutate(across(-Transcript, ~ replace_na(.x, ""))) %>%
  mutate(Tags_Marked = apply(across(-Transcript), 1, str_c, collapse = ",")) %>%
  select(Transcript, Tags_Marked)
Which gives:
# A tibble: 3 x 2
Transcript Tags_Marked
<chr> <chr>
1 This is example text a ,
2 this is loan loan,
3 deposit is not quarter deposit,quarter
Admittedly, this is not 100% ok, since rows with zero-length matches still get the comma separator.
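A sketch that sidesteps the stray comma entirely: collapse each row's list of matches directly with base paste(), using the same my_tags_collapsed pattern as above:
dat %>%
  mutate(Tags_Marked = sapply(str_extract_all(Transcript, my_tags_collapsed),
                              paste, collapse = ","))
# rows with no match now yield "" instead of ","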
An alternative could be to not concatenate the strings into one column, but to keep them as separate columns, which means you could stop much earlier:
dat %>%
  mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
  unnest_wider(test)
which would give you:
# A tibble: 3 x 3
Transcript ...1 ...2
<chr> <chr> <chr>
1 This is example text a NA NA
2 this is loan loan NA
3 deposit is not quarter deposit quarter
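As a variation, unnest_wider() accepts a names_sep argument if you'd rather have predictable column names than ...1 and ...2 (a sketch):
dat %>%
  mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
  unnest_wider(test, names_sep = "_") # columns become test_1, test_2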
I have a data frame with a column containing code numbers and another with dates. I am trying to use dplyr and intersect to find the elements common to all days.
Sample data:
df <- data.frame(
  A = c(2289, 490, 3940, 1745, 855, 3954, 2289, 555, 3940, 667, 855, 3954,
        2289, 490, 12, 1745, 3000, 3954, 2289, 490, 3940, 28, 855, 3954),
  B = as.Date(c("2019-08-01", "2019-08-01", "2019-08-01", "2019-08-01", "2019-08-01", "2019-08-01",
                "2019-08-02", "2019-08-02", "2019-08-02", "2019-08-02", "2019-08-02", "2019-08-02",
                "2019-08-03", "2019-08-03", "2019-08-03", "2019-08-03", "2019-08-03", "2019-08-03",
                "2019-08-04", "2019-08-04", "2019-08-04", "2019-08-04", "2019-08-04", "2019-08-04")))
I am trying something like this:
df %>% group_by(B) %>% intersect(A)
The expected output is the codes that are common to every single day. For instance, 2289 is an expected value but 28 is not.
I wonder whether I can use intersect in this case.
Appreciate any help.
Regards
Here's one way:
df %>%
  # filter(!duplicated(.)) %>% # add this if there can be duplicates
  count(A) %>%
  filter(n == n_distinct(df$B))
# A tibble: 2 x 2
A n
<dbl> <int>
1 2289 4
2 3954 4
A base R solution if you prefer intersect, although I'd guess the method above is faster when the number of groups is high:
Reduce(intersect, split(df$A, df$B))
[1] 2289 3954
As a side note, you can also do this in base R:
sort(unique(df$A))[rowMeans(table(df)) == 1]
#2289 3954
You can also try:
df %>%
  group_by(A) %>%
  summarize(if_all = length(intersect(B, unique(df$B))) == length(unique(df$B)))
which uses intersect.
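If you want the codes themselves rather than a logical flag per code, a sketch extending that idea (on_all_days is just a helper column name):
df %>%
  group_by(A) %>%
  summarize(on_all_days = length(intersect(B, unique(df$B))) == n_distinct(df$B)) %>%
  filter(on_all_days) %>%
  pull(A)
#[1] 2289 3954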
I have a dataset containing a column of strings from which I wish to calculate an overall sentiment score, and a data frame containing all the unique words that appear in those strings, each of which is assigned a score:
library(stringr)
df <- data.frame(text = c('recommend good value no problem',
                          'terrible quality no good',
                          'good service excellent quality commend'),
                 score = 0)
words <- c('recommend','good','value','problem','terrible','quality','service','excellent','commend')
scores <- c(1,2,1,-2,-3,1,0,3,1)
wordsdf <- data.frame(words,scores)
The only way I have been able to get close to this is by using a nested for loop and the str_count function from the stringr package:
for (i in 1:3) {
  count <- 0
  for (j in 1:9) {
    count <- count + (str_count(df$text[i], as.character(wordsdf$words[j])) * wordsdf$scores[j])
  }
  df$score[i] <- count
}
This almost achieves what I want:
text score
1 recommend good value no problem 3
2 terrible quality no good 0
3 good service excellent quality commend 7
However, since the word 'commend' is also contained in the word 'recommend', my code calculates the score as if both words were present in the string.
So I have two queries:
1 - Is there a way to get it to match only to exact words?
2 - Is there a way to achieve this without using the nested loop?
One tidyverse possibility could be:
library(tidyverse)

df %>%
  rowid_to_column() %>%
  mutate(text = strsplit(text, " ", fixed = TRUE)) %>%
  unnest() %>%
  full_join(wordsdf, by = c("text" = "words")) %>%
  group_by(rowid) %>%
  summarise(text = paste(text, collapse = " "),
            scores = sum(scores, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-rowid)
text scores
<chr> <dbl>
1 recommend good value no problem 2
2 terrible quality no good 0
3 good service excellent quality commend 7
First, it splits the "text" column into separate words. Second, it performs a full join on these words. Finally, it pastes the words from the "text" column back together and sums the scores.
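For comparison, a compact base R sketch of the same exact-word matching, using strsplit() plus match(), so 'commend' no longer matches inside 'recommend':
df$score <- sapply(strsplit(as.character(df$text), " ", fixed = TRUE), function(w) {
  # match() compares whole words only; unmatched words give NA and are dropped
  sum(wordsdf$scores[match(w, as.character(wordsdf$words))], na.rm = TRUE)
})
df$score
#[1] 2 0 7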
I have a 3-column data frame where the 3rd (last) column contains a text body, something like one sentence.
Additionally, I have a vector of words.
How can I elegantly find the 15 most frequent words (with their number of occurrences) in the whole 3rd column, restricted to the words that occur in the vector mentioned above?
The sentence can look like:
I like dogs and my father like cats
vector <- c("dogs", "like")
Here, the most frequent words are "dogs" and "like".
You can try with this:
library(tidytext)
library(tidyverse)
df %>%                          # your data
  unnest_tokens(word, text) %>% # clean the data a bit and split the phrases
  group_by(word) %>%            # group by word
  summarise(Freq = n()) %>%     # count them
  arrange(-Freq) %>%            # order decreasing
  top_n(2)                      # here the top 2; you can use 15
Result:
# A tibble: 2 x 2
word Freq
<chr> <int>
1 dogs 3
2 i 2
If you already have the words split, you can skip the second line.
With data:
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
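Note that the pipeline above counts every word in the column; to restrict the count to the vector from the question, a sketch with an added filter() step (keep_words is my name for that vector):
keep_words <- c("dogs", "like")

df %>%
  unnest_tokens(word, text) %>%
  filter(word %in% keep_words) %>% # keep only words from the vector
  count(word, sort = TRUE) %>%     # shorthand for group_by + summarise + arrange
  slice_head(n = 15)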