Count word frequency from multiple strings in dataframe column - r

I have a DF like the following with about 33000 rows:
tibble(ID = c(1,2,3), desc = c("This is a description.", "Also a description!","This is yet another desciption"))
I would like to count every word for all rows, to get a resulting df like:
tibble(word = c("this", "is", "a", "description", "also", "yet", "another"), count = c(2,2,2,3,1,1,1))

There are several text-mining packages available: tidytext, quanteda, tm, and others. Below is an example using tidytext.
library(tibble)
df1 <- tibble(ID = c(1,2,3), desc = c("This is a description.", "Also a description!","This is yet another desciption"))
library(dplyr)
library(tidytext)
df1 %>%
  unnest_tokens(words, desc) %>%
  group_by(words) %>%
  count(words)
# A tibble: 8 x 2
# Groups:   words [8]
  words           n
  <chr>       <int>
1 a               2
2 also            1
3 another         1
4 desciption      1
5 description     2
6 is              2
7 this            2
8 yet             1

A base R one-liner might be something like:
table(unlist(strsplit(paste(collection_df$desc), "\\W")))
It is hard to answer your question precisely, as you did not provide a clear problem statement, example data, and expected output.
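Filling out that base-R idea a little (a sketch; `collection_df` and its `desc` column are assumed from the question's data): lower-casing first and dropping the empty tokens the punctuation split leaves behind gives counts comparable to the tidytext output.

```r
# Base R sketch: split each description on non-word characters and tabulate.
collection_df <- data.frame(
  desc = c("This is a description.", "Also a description!",
           "This is yet another desciption")
)
words <- unlist(strsplit(tolower(collection_df$desc), "\\W+"))
words <- words[words != ""]              # drop empty tokens from the split
sort(table(words), decreasing = TRUE)
```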

Related

Counting all elements, row-wise, in a list column

My dataframe looks like this:
V1
c("cheese","bread","sugar","cream","milk","butter")
c("milk","butter","apples","cream","bread")
c("butter","milk","toffee")
c("cream","milk","butter","sugar")
I am trying to count the number of times each element appears and put the totals in a new column. I would like to end up with something like this:
V2 V3
cheese 1
bread 2
sugar 2
cream 3
milk 4
butter 4
apples 1
toffee 1
I have tried the following code:
counts <- unlist(V1, use.names = FALSE)
counts <- table(counts)
But for some reason the counts are wrong and values are being skipped.
If I understand you correctly, and your data is organized as provided below, then we could do it this way:
Use separate_rows to put each string on its own row.
Remove the "c" tokens and empty rows.
Use fct_inorder from the forcats package (part of the tidyverse) to keep the order as provided.
Then apply count with the name argument:
library(tidyverse)
df %>%
  separate_rows(V1) %>%
  filter(!(V1 == "c" | V1 == "")) %>%
  mutate(V1 = fct_inorder(V1)) %>%
  count(V1, name = "V3")
  V1        V3
  <fct>  <int>
1 cheese     1
2 bread      2
3 sugar      2
4 cream      3
5 milk       4
6 butter     4
7 apples     1
8 toffee     1
df <- structure(list(V1 = c("c(\"cheese\",\"bread\",\"sugar\",\"cream\",\"milk\",\"butter\")",
"c(\"milk\",\"butter\",\"apples\",\"cream\",\"bread\")", "c(\"butter\",\"milk\",\"toffee\")",
"c(\"cream\",\"milk\",\"butter\",\"sugar\")")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
A couple of small issues with the question: it was hard to reproduce exactly, so I took some liberties with the data frame and present a couple of options that might help:
Option 1 - data in one column
library(tidyverse)
df <- data.frame(V1 = c("cheese","bread","sugar","cream","milk","butter",
                        "milk","butter","apples","cream","bread",
                        "butter","milk","toffee",
                        "cream","milk","butter","sugar"))
df <- df %>%
  dplyr::group_by(V1) %>%
  summarise(V3 = n())
Option 2 - data in columns (NAs added so the vectors form a data frame)
library(tidyverse)
df <- data.frame(c("cheese","bread","sugar","cream","milk","butter"),
                 c("milk","butter","apples","cream","bread",NA),
                 c("butter","milk","toffee",NA,NA,NA),
                 c("cream","milk","butter","sugar",NA,NA))
df <- data.frame(V1 = unlist(df)) %>%
  select(V1) %>%
  drop_na() %>%
  group_by(V1) %>%
  summarise(V3 = n())
hope this helps!
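A base-R alternative for the same padded-column layout (a sketch): `unlist()` flattens the columns and `table()` excludes the NA padding by default.

```r
# Same padded-column layout as Option 2 above
df <- data.frame(a = c("cheese","bread","sugar","cream","milk","butter"),
                 b = c("milk","butter","apples","cream","bread",NA),
                 c = c("butter","milk","toffee",NA,NA,NA),
                 d = c("cream","milk","butter","sugar",NA,NA),
                 stringsAsFactors = FALSE)
table(unlist(df))   # NAs are excluded automatically
```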

Counting Specific Word Frequency In R

I have a data set where I have split the text from journal abstracts into one word per row. This has led to over 5 million rows, but I just want the counts of certain words. Below is an example of the data:
1 rna
1 synthesis
1 resembles
1 copy
1 choice
1 rna
1 recombination
1 process
1 nascent
1 rna
So in that example, let's say I want just the rna count: I would get 3 and that's it. I have done that word count on the whole set:
wordCount <- m3 %>%
  count(word, sort = TRUE)
But many of the words aren't helpful for what I am trying to do.
Any help would be welcome.
You can group_by the word, count occurrences of each unique word, and then subset the ones you want.
library(tidyverse)
data <- data.frame(word = c("rna", "synthesis", "resembles", "copy", "choice",
                            "rna", "recombination", "process", "nascent", "rna"))
counts <- data %>%
  group_by(word) %>%
  count()
counts[which(counts$word == "rna"), ]
# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3
or using dplyr subsetting:
counts %>% filter(word == "rna")
# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3
Piping it all through at once:
data %>%
  group_by(word) %>%
  count() %>%
  filter(word == "rna")
A one-liner with a data.table solution:
library(data.table)
setDT(data)
data[word == "rna", .N, by = word]
   word N
1:  rna 3
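The same data.table call extends naturally to several words at once via `%in%` (a sketch using the example data from above):

```r
library(data.table)

data <- data.table(word = c("rna", "synthesis", "resembles", "copy", "choice",
                            "rna", "recombination", "process", "nascent", "rna"))
# Count only a chosen set of words in one pass
data[word %in% c("rna", "synthesis"), .N, by = word]
```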

The most frequent in column of dataframe

I have a 3-column dataframe, where the 3rd (last) column contains a text body, something like one sentence.
Additionally, I have a vector of words.
How do I compute, in an elegant way, the following: find the 15 most frequent words (with their number of occurrences) in the whole 3rd column that also occur in the vector mentioned above?
A sentence can look like:
I like dogs and my father like cats
vector <- c("dogs", "like")
Here, the most frequent words are dogs and like.
You can try with this:
library(tidytext)
library(tidyverse)
df %>%                               # your data
  unnest_tokens(word, text) %>%      # clean the data a bit and split the phrases
  group_by(word) %>%                 # group by word
  summarise(Freq = n()) %>%          # count them
  arrange(-Freq) %>%                 # order decreasing
  top_n(2)                           # here the top 2; you can use 15
Result:
# A tibble: 2 x 2
  word   Freq
  <chr> <int>
1 dogs      3
2 i         2
If you already have the words split, you can skip the second line.
With data:
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
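The answer above finds the most frequent words overall; to restrict the counts to the supplied vector, as the question asks, a `filter(word %in% vector)` step can be added before taking the top n. A sketch with the example data (the variable name `keep` is mine):

```r
library(dplyr)
library(tidytext)

df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
keep <- c("dogs", "like")        # the vector of words of interest

df %>%
  unnest_tokens(word, text) %>%
  filter(word %in% keep) %>%     # keep only words from the vector
  count(word, sort = TRUE) %>%   # count and order decreasing
  slice_head(n = 15)             # at most 15 rows (only 2 here)
```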

Dplyr keeps automatically adding one of my columns [duplicate]

This question already has answers here:
"Adding missing grouping variables" message in dplyr in R
(4 answers)
Closed 4 years ago.
So, I have a large data.frame with multiple columns, of which "trial.number" and "indexer" are two.
It annoys me that dplyr constantly, no matter what, adds the indexer column.
A simple example:
saccade.df %>%
  distinct(trial.number, .keep_all = F)
I would expect to see the unique trial.numbers and only the trial.number column. However, the output also includes the indexer column.
How do I stop dplyr from doing this? And why isn't it showing the unique trial.numbers, only the unique indexer values (which I didn't even ask for)?
example.df <- data.frame(trial.number = rep(1:10, each = 10),
                         time = seq(1:100),
                         indexer = rep(21:30, each = 10))
example.df %>%
  distinct(trial.number, .keep_all = F)
This does give the right output; it turned out I had somehow grouped my own variables.
Thanks!
Try ungroup():
df <- data.frame(trial.number = 1:2, indexer = 3:4)
df %>% distinct(trial.number)
#   trial.number
# 1            1
# 2            2
df %>% group_by(trial.number, indexer) %>% distinct(trial.number)
## A tibble: 2 x 2
## Groups:   trial.number, indexer [2]
#   trial.number indexer
#          <int>   <int>
# 1            1       3
# 2            2       4
df %>% group_by(trial.number, indexer) %>% ungroup %>% distinct(trial.number)
## A tibble: 2 x 1
#   trial.number
#          <int>
# 1            1
# 2            2

Paste string values from different rows, if values from another column are the same [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Concatenate strings by group with dplyr [duplicate]
(4 answers)
Closed 4 years ago.
I would like to create a new dataframe based on an existing one. As the title suggests, I would like to paste together all string values in one column whenever the values in another column are the same.
Due to my poor writing skills, I think I'm not being very clear what I mean by this. To clarify, I've created an example.
Existing Dataframe
If I have something like this:
DF <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  value = c("I", "ate", "cereals", "for", "breakfast", "it",
            "was", "delicious", "!!!", ":)"))
New Dataframe
I would like to create something like this:
DF2 <- data.frame(
  ID = c(1, 2, 3, 4),
  value = c(paste("I"), paste("ate", "cereals"),
            paste("for", "breakfast", "it"),
            paste("was", "delicious", "!!!", ":)")))
All strings from column value are consolidated using paste when they share the same value in column ID. I'm having trouble building a function that can do this; could you please help me?
I am comfortable with either dplyr or data.table.
In dplyr you can use group_by with summarise
DF %>%
  group_by(ID) %>%
  summarise(value = paste(value, collapse = " "))
## A tibble: 4 x 2
#      ID value
#   <dbl> <chr>
# 1    1. I
# 2    2. ate cereals
# 3    3. for breakfast it
# 4    4. was delicious !!! :)
You can just group_by(ID) and summarise with a concatenation function. Here I use str_c with the collapse argument.
library(tidyverse)
DF <- data.frame(
ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
value = c("I", "ate", "cereals", "for", "breakfast", "it", "was", "delicious", "!!!", ":)")
)
DF %>%
  group_by(ID) %>%
  summarise(value = str_c(value, collapse = " "))
#> # A tibble: 4 x 2
#> ID value
#> <dbl> <chr>
#> 1 1 I
#> 2 2 ate cereals
#> 3 3 for breakfast it
#> 4 4 was delicious !!! :)
Created on 2018-08-26 by the reprex package (v0.2.0).
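Since the question mentions data.table as an acceptable alternative, the equivalent there (a sketch) is the same `paste(..., collapse = " ")` applied `by = ID`:

```r
library(data.table)

DF <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  value = c("I", "ate", "cereals", "for", "breakfast", "it",
            "was", "delicious", "!!!", ":)"))
# One concatenated string per ID
setDT(DF)[, .(value = paste(value, collapse = " ")), by = ID]
```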
