The most frequent words in a column of a dataframe - r

I have a 3-column dataframe, where the 3rd (last) column contains a text body, something like one sentence.
Additionally, I have a vector of words.
How can I compute the following in an elegant way:
find the 15 most frequent words (with their number of occurrences) across the whole
3rd column, restricted to the words that occur in the vector mentioned above?
A sentence can look like:
I like dogs and my father like cats
vector <- c("dogs", "like")
Here, the most frequent words are dogs and like.

You can try this:
library(tidytext)
library(tidyverse)

df %>%                            # your data
  unnest_tokens(word, text) %>%   # clean the text a bit and split the phrases into words
  group_by(word) %>%              # group by word
  summarise(Freq = n()) %>%       # count the occurrences
  arrange(-Freq) %>%              # order decreasing
  top_n(2)                        # here the top 2; you can use 15
Result:
# A tibble: 2 x 2
  word   Freq
  <chr> <int>
1 dogs      3
2 i         2
If you already have the words split into separate tokens, you can skip the unnest_tokens() step.
With data:
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)

Related

Group strings that have the same words but in a different order

I have an example concatenated text field (please see the sample data below) that is created from two or three different fields; however, there is no guarantee that the words will appear in the same order. I would like to create a new dataset where fields with the same words, regardless of order, are collapsed. Since I do not know in advance which words will be concatenated together, the code has to recognize that all the words in both strings match.
Code for example data:
var1<-c("BLUE|RED","RED|BLUE","WHITE|BLACK|ORANGE","BLACK|WHITE|ORANGE")
freq<-c(1,1,1,1)
have<-as.data.frame(cbind(var1,freq))
Have:
var1                 freq
BLUE|RED                1
RED|BLUE                1
WHITE|BLACK|ORANGE      1
BLACK|WHITE|ORANGE      1
How can I collapse the data into what I want below?
color                freq
BLUE|RED                2
WHITE|BLACK|ORANGE      2
data.frame(table(sapply(strsplit(have$var1, '\\|'),
                        function(x) paste(sort(x), collapse = '|'))))

                Var1 Freq
1 BLACK|ORANGE|WHITE    2
2           BLUE|RED    2
Note that sorting puts the words of each group in alphabetical order, so WHITE|BLACK|ORANGE comes back as BLACK|ORANGE|WHITE.
In the world of piping (the native pipe and the \(x) lambda shorthand require R >= 4.1):
have$var1 |>
  strsplit('\\|') |>
  sapply(\(x) paste0(sort(x), collapse = "|")) |>
  table() |>
  data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)

have %>%
  group_by(id = row_number()) %>%                 # one group per original row
  separate_rows(var1) %>%                         # split each string into words
  arrange(var1, .by_group = TRUE) %>%             # sort the words within each row
  mutate(var1 = paste(var1, collapse = "|")) %>%  # rebuild the sorted string
  slice(1) %>%                                    # keep one row per original row
  ungroup() %>%
  count(var1, name = "freq")
  var1                freq
  <chr>              <int>
1 BLACK|ORANGE|WHITE     2
2 BLUE|RED               2

Rowwise find most frequent term in dataframe column and count occurrences

I am trying to find the most frequent category within every row of a dataframe. A category can consist of multiple words separated by a /.
library(tidyverse)
library(DescTools)

# example data
id <- c(1, 2, 3, 4)
categories <- c("apple,shoes/socks,trousers/jeans,chocolate",
                "apple,NA,apple,chocolate",
                "shoes/socks,NA,NA,NA",
                "apple,apple,chocolate,chocolate")
df <- data.frame(id, categories)

# the solution I would like to achieve
solution <- df %>%
  mutate(winner = c("apple", "apple", "shoes/socks", "apple"),
         winner_count = c(1, 2, 1, 2))
Based on these answers I have tried the following:
Write a function that finds the most common word in a string of text using R
trial <- df %>%
  rowwise() %>%
  mutate(winner = names(which.max(table(categories %>% str_split(",")))),
         winner_count = which.max(table(categories %>% str_split(",")))[[1]])
I also tried to follow this approach; however, it does not give me the required results either:
How to find the most repeated word in a vector with R
trial2 <- df %>%
  mutate(winner = DescTools::Mode(str_split(categories, ","), na.rm = TRUE))
I am mainly struggling because my most frequent category is not just one word but something like "shoes/socks", and because I also have NAs; I don't want the NAs to be the "winner".
I don't care too much about ties right now: I already have a follow-up process in place that handles the cases where winner_count = 2.
Split the categories on commas into separate rows, count their occurrences for each id, drop the NA values, and select the top-occurring row for each id:
library(dplyr)
library(tidyr)

df %>%
  separate_rows(categories, sep = ',') %>%
  count(id, categories, name = 'winner_count') %>%
  filter(categories != 'NA') %>%   # the NAs are literal "NA" strings here
  group_by(id) %>%
  slice_max(winner_count, n = 1, with_ties = FALSE) %>%
  ungroup %>%
  rename(winner = categories) %>%
  left_join(df, by = 'id') -> result
result
#     id winner      winner_count categories
#  <dbl> <chr>              <int> <chr>
#1     1 apple                  1 apple,shoes/socks,trousers/jeans,chocolate
#2     2 apple                  2 apple,NA,apple,chocolate
#3     3 shoes/socks            1 shoes/socks,NA,NA,NA
#4     4 apple                  2 apple,apple,chocolate,chocolate
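For comparison, a base R sketch of the same idea (assuming the df from the question; the counts come back as character because they pass through a character matrix, so wrap with as.integer() if needed):

winner_info <- t(sapply(strsplit(df$categories, ","), function(x) {
  x <- x[x != "NA"]                         # drop the literal "NA" tokens
  tab <- sort(table(x), decreasing = TRUE)  # tally the remaining categories
  c(winner = names(tab)[1], winner_count = unname(tab[1]))
}))
cbind(df, winner_info)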

Count word frequency from multiple strings in dataframe column

I have a DF like the following, with about 33000 rows:
tibble(ID = c(1, 2, 3),
       desc = c("This is a description.", "Also a description!", "This is yet another desciption"))
I would like to count every word across all rows, to get a resulting df like:
tibble(word = c("this", "is", "a", "description", "also", "yet", "another"),
       count = c(2, 2, 2, 3, 1, 1, 1))
There are several text-mining packages available: tidytext, quanteda, tm, ...
Below is an example using tidytext; a quanteda sketch follows the result.
library(tibble)
df1 <- tibble(ID = c(1, 2, 3),
              desc = c("This is a description.", "Also a description!", "This is yet another desciption"))

library(dplyr)
library(tidytext)

df1 %>%
  unnest_tokens(words, desc) %>%
  group_by(words) %>%   # count() already groups by words, so this mainly affects the printed grouping
  count(words)
# A tibble: 8 x 2
# Groups:   words [8]
  words           n
  <chr>       <int>
1 a               2
2 also            1
3 another         1
4 desciption      1
5 description     2
6 is              2
7 this            2
8 yet             1
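As mentioned above, quanteda can produce the same count; here is a minimal sketch (assuming the df1 defined above):

library(quanteda)

toks <- tokens(tolower(df1$desc), remove_punct = TRUE)  # lowercase and tokenize
topfeatures(dfm(toks), n = 8)                           # named vector of the 8 most frequent words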
Might be something like:
table(unlist(strsplit(paste(collection_df$desc), "\\W")))
It is hard to answer your question, as you did not provide a clear problem statement, example, and expected output.
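Splitting on \W leaves empty strings behind and treats differently cased words as distinct, so a hedged refinement might look like this (using df1 from above in place of collection_df):

words <- unlist(strsplit(tolower(df1$desc), "\\W+"))  # split on runs of non-word characters
sort(table(words[nzchar(words)]), decreasing = TRUE)  # drop empty strings, count, order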

Extract words from text using dplyr and stringr

I'm trying to find an effective way to extract words from a text column in a dataset. The approach I'm using is:
library(dplyr)
library(stringr)

Text <- c("A little bird told me about the dog", "A pig in a poke", "As busy as a bee")
data <- as.data.frame(Text)
keywords <- paste0(c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle"), collapse = "|")
data %>% mutate(Word = str_extract(Text, keywords))
This is just an example, but I have more than 2000 possible words to extract from each row. I don't know another approach to use; will such a big regex make things slow, or does the size of the regex not matter? I think no more than one of these words will appear in each row, but is there a way to automatically create multiple columns if more than one word does appear in a row?
We can use str_extract_all to return a list, convert the list elements to a named list or tibble, and use unnest_wider:
library(purrr)
library(stringr)
library(tidyr)
library(dplyr)
data %>%
  mutate(Words = str_extract_all(Text, keywords),
         Words = map(Words, ~ as.list(unique(.x)) %>%
                       set_names(str_c('col', seq_along(.))))) %>%
  unnest_wider(Words)
# A tibble: 3 x 3
#  Text                                col1  col2
#  <fct>                               <chr> <chr>
#1 A little bird told me about the dog bird  dog
#2 A pig in a poke                     pig   <NA>
#3 As busy as a bee                    bee   <NA>
Try intersect, with the keywords kept as a character vector rather than the collapsed pattern. Because a row can match more than one keyword (row 1 matches both "bird" and "dog"), the matches are pasted together so the result stays a plain data.frame:
keyword_vec <- c("bird", "dog", "pig", "wolf", "cat", "bee", "turtle")
data <- data.frame(Text = Text,
                   Word = sapply(Text,
                                 function(v) paste(intersect(unlist(strsplit(v, split = " ")), keyword_vec),
                                                   collapse = ","),
                                 USE.NAMES = FALSE))

Count number of times a word appears (dplyr)

Simple question here, perhaps a duplicate of this?
I'm trying to figure out how to count the number of times a word appears in a vector. I know I can count the number of rows a word appears in, as shown here:
temp <- tibble(idvar = 1:3,
               response = c("This sounds great",
                            "This is a great idea that sounds great",
                            "What a great idea"))
temp %>% count(grepl("great", response))  # lots of ways to do this line
# answer = 3
The answer in the code above is 3, since "great" appears in three rows. However, the word "great" appears 4 times in the vector response. How do I find that instead?
We could use str_count from stringr to get the number of instances of 'great' in each row and then take the sum of that count:
library(tidyverse)
temp %>%
  mutate(n = str_count(response, 'great')) %>%
  summarise(n = sum(n))
# A tibble: 1 x 1
#       n
#   <int>
# 1     4
Or using regmatches/gregexpr from base R
sum(lengths(regmatches(temp$response, gregexpr('great', temp$response))))
#[1] 4
Off the top of my head, this should solve your problem:
library(tidyverse)
temp$response %>%
  str_extract_all('great') %>%  # list of matches per row
  unlist() %>%                  # flatten to one vector of matches
  length()                      # total number of matches
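If counts for several words are needed at once, here is a minimal extension of the str_count idea (the words vector is hypothetical):

library(stringr)

words <- c("great", "idea")  # hypothetical set of words to count
sapply(words, function(w) sum(str_count(temp$response, fixed(w))))
# great  idea
#     4     2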
