I have a data frame like the following, with about 33,000 rows:
tibble(ID = c(1,2,3), desc = c("This is a description.", "Also a description!","This is yet another desciption"))
I would like to count every word for all rows, to get a resulting df like:
tibble(word = c("this", "is", "a", "description", "also", "yet", "another"), count = c(2,2,2,3,1,1,1))
There are several text-mining packages available: tidytext, quanteda, tm, ...
Below is an example using tidytext.
library(tibble)
df1 <- tibble(ID = c(1,2,3), desc = c("This is a description.", "Also a description!","This is yet another desciption"))
library(dplyr)
library(tidytext)
df1 %>%
  unnest_tokens(words, desc) %>%
  group_by(words) %>%
  count(words)
# A tibble: 8 x 2
# Groups: words [8]
words n
<chr> <int>
1 a 2
2 also 1
3 another 1
4 desciption 1
5 description 2
6 is 2
7 this 2
8 yet 1
In base R it might be something like:
table(unlist(strsplit(paste(collection_df$desc), "\\W")))
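Note that splitting on "\\W" leaves empty strings behind punctuation and is case-sensitive, so "This" and "this" would count separately. A minimal base-R sketch (assuming the desc column from the question) that lowercases first and drops the empties:

```r
# The desc column from the question
desc <- c("This is a description.", "Also a description!",
          "This is yet another desciption")

# Lowercase, split on runs of non-word characters, drop empty strings
words <- unlist(strsplit(tolower(desc), "\\W+"))
table(words[words != ""])
```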
It is hard to answer your question, as you did not provide a clear problem statement, example data, or expected output.
My dataframe looks like this:
V1
c("cheese","bread","sugar","cream","milk","butter")
c("milk","butter","apples","cream","bread")
c("butter","milk","toffee")
c("cream","milk","butter","sugar")
I am trying to count the number of times each element appears and record the total in a new column. I would like to end up with something like this:
V2 V3
cheese 1
bread 2
sugar 2
cream 3
milk 4
butter 4
apples 1
toffee 1
I have tried using the following code
counts <- unlist(V1, use.names = FALSE)
counts <- table(counts)
But for some reason the counts are wrong and values are being skipped.
If I understand you correctly and your data is organized as provided below, then we could do it this way:
Use separate_rows to split each string so that every word gets its own row.
Remove the "c" tokens and the empty rows.
Use fct_inorder from the forcats package (part of the tidyverse) to keep the order as provided.
Then apply count with the name argument:
library(tidyverse)
df %>%
  separate_rows(V1) %>%
  filter(!(V1 == "c" | V1 == "")) %>%
  mutate(V1 = fct_inorder(V1)) %>%
  count(V1, name = "V3")
V1 V3
<fct> <int>
1 cheese 1
2 bread 2
3 sugar 2
4 cream 3
5 milk 4
6 butter 4
7 apples 1
8 toffee 1
df <- structure(list(V1 = c("c(\"cheese\",\"bread\",\"sugar\",\"cream\",\"milk\",\"butter\")",
"c(\"milk\",\"butter\",\"apples\",\"cream\",\"bread\")", "c(\"butter\",\"milk\",\"toffee\")",
"c(\"cream\",\"milk\",\"butter\",\"sugar\")")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L))
A couple of little issues with the question: I found it hard to reproduce exactly, so I took some liberties with the data frame and present a couple of options that might help:
Option 1 - data in one column
library(tidyverse)
df <- data.frame(V1 = c("cheese","bread","sugar","cream","milk","butter",
"milk","butter","apples","cream","bread",
"butter","milk","toffee",
"cream","milk","butter","sugar"))
df <- df %>% dplyr::group_by(V1) %>%
summarise(
V3 = n()
)
Option 2 - data spread across columns (NAs added so the columns have equal length and make a data frame)
library(tidyverse)
df <- data.frame(c("cheese","bread","sugar","cream","milk","butter"),
c("milk","butter","apples","cream","bread",NA),
c("butter","milk","toffee",NA,NA,NA),
c("cream","milk","butter","sugar",NA,NA))
df <- data.frame(V1 = unlist(df)) %>%
  select(V1) %>%
  drop_na() %>%
  group_by(V1) %>%
  summarise(V3 = n())
Hope this helps!
I have a data set where I have split the text of journal abstracts to one word per row. This has led to over 5 million rows, but I just want the counts of certain words. Below is an example of the data:
1 rna
1 synthesis
1 resembles
1 copy
1 choice
1 rna
1 recombination
1 process
1 nascent
1 rna
So in that example, let's say I want just the rna count: I would get 3 and that's it. I have done the word count on the whole set, but this is not as useful to me.
wordCount <- m3 %>%
  count(word, sort = TRUE)
Many of the words aren't helpful for what I am trying to get at.
Any help would be welcome.
You can group_by the word, count the occurrences of each unique word, and then subset the ones you want.
library(tidyverse)
data <- data.frame(word = c("rna",
"synthesis",
"resembles",
"copy",
"choice",
"rna",
"recombination",
"process",
"nascent",
"rna"))
counts <- data %>%
  group_by(word) %>%
  count()
counts[which(counts$word == "rna"),]
# A tibble: 1 x 2
# Groups: word [1]
word n
<fct> <int>
1 rna 3
or using dplyr subsetting:
counts %>% filter(word == "rna")
# A tibble: 1 x 2
# Groups: word [1]
word n
<fct> <int>
1 rna 3
Piping it all through at once:
data %>%
  group_by(word) %>%
  count() %>%
  filter(word == "rna")
A one-liner with a data.table solution:
library(data.table)
setDT(data)
data[word == "rna", .N, by = word]
word N
1: rna 3
I have a 3-column data frame, where the 3rd (last) column contains a text body, something like one sentence.
Additionally, I have a vector of words.
How can I elegantly find the 15 most frequent words (with their number of occurrences) across the whole 3rd column, restricted to the words that occur in the vector above?
The sentence can look like:
I like dogs and my father like cats
vector <- c("dogs", "like")
Here, the most frequent words are dogs and like.
You can try this:
library(tidytext)
library(tidyverse)
df %>%                            # your data
  unnest_tokens(word, text) %>%   # tokenize: lowercase and split the phrases
  group_by(word) %>%              # group by word
  summarise(Freq = n()) %>%       # count them
  arrange(-Freq) %>%              # order decreasing
  top_n(2)                        # here the top 2; use 15 for your case
Result:
# A tibble: 2 x 2
word Freq
<chr> <int>
1 dogs 3
2 i 2
If you already have the words split, you can skip the second line.
With data:
df <- data.frame(
id = c(1,2,3),
group = c(1,1,1),
text = c("I like dogs","I don't hate dogs", "dogs are the best"), stringsAsFactors = F)
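Note that the question also asks to restrict the counts to the supplied word vector, while the pipeline as written ranks all words. A sketch of that extra step (assuming the same data and tidytext tokenization; the name keep is my own):

```r
library(tidytext)
library(dplyr)

df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)

keep <- c("dogs", "like")  # the word vector from the question

df %>%
  unnest_tokens(word, text) %>%
  filter(word %in% keep) %>%    # keep only words from the vector
  count(word, sort = TRUE) %>%  # count and order decreasing
  top_n(15, n)                  # at most the 15 most frequent of those
```

With this toy data the result is just dogs (3) and like (1), since only those two words are in the vector.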
This question already has answers here:
"Adding missing grouping variables" message in dplyr in R
(4 answers)
Closed 4 years ago.
So, I have a large data.frame with multiple columns of which "trial.number" and "indexer" are 2.
It annoys me that dplyr constantly, no matter what, adds the indexer column.
A simple example:
saccade.df %>%
distinct(trial.number, .keep_all = F)
I would expect to see the unique trial.numbers and only the trial.number column. However, the output looks like this:
How do I stop dplyr from doing this? And why isn't it showing the unique trial.numbers, but only the unique indexer (for which I didn't even ask)?
example.df <- data.frame(trial.number = rep(1:10, each = 10),
                         time = seq(1:100),
                         indexer = rep(21:30, each = 10))
example.df %>%
  distinct(trial.number, .keep_all = F)
This does give the right output. However, I had somehow grouped my own variables.
Thanks!
Try ungroup:
df <- data.frame(trial.number=1:2,indexer=3:4)
df %>% distinct(trial.number)
# trial.number
#1 1
#2 2
df %>% group_by(trial.number,indexer) %>% distinct(trial.number)
## A tibble: 2 x 2
## Groups: trial.number, indexer [2]
# trial.number indexer
# <int> <int>
#1 1 3
#2 2 4
df %>% group_by(trial.number,indexer) %>% ungroup %>% distinct(trial.number)
## A tibble: 2 x 1
# trial.number
# <int>
#1 1
#2 2
This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Concatenate strings by group with dplyr [duplicate]
(4 answers)
Closed 4 years ago.
I would like to create a new data frame based on an existing one. As the title suggests, I would like to paste together all string values in a certain column whenever the values in another column match.
Due to my poor writing skills, I think I'm not being very clear about what I mean by this. To clarify, I've created an example.
Existing Dataframe
If I have something like this:
DF <- data.frame(
ID = c(1,2,2,3,3,3,4,4,4,4),
value = c("I","ate","cereals","for","breakfast","it","was","delicious","!!!",":)"))
New Dataframe
I would like to create something like this:
DF2 <- data.frame(
ID = c(1,2,3,4),
value = c(paste("I"), paste("ate","cereals"), paste("for","breakfast","it"), paste("was","delicious","!!!",":)")))
All strings from column value are consolidated using paste when they have the same value in column ID. I'm having trouble building a function that does this. Could you please help me?
I am comfortable with either dplyr or data.table.
In dplyr you can use group_by with summarise
DF %>%
  group_by(ID) %>%
  summarise(value = paste(value, collapse = " "))
## A tibble: 4 x 2
# ID value
# <dbl> <chr>
#1 1. I
#2 2. ate cereals
#3 3. for breakfast it
#4 4. was delicious !!! :)
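Since the question mentions being comfortable with data.table as well, here is a sketch of the equivalent one-liner with that package, using the DF from the question:

```r
library(data.table)

DF <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
  value = c("I", "ate", "cereals", "for", "breakfast", "it",
            "was", "delicious", "!!!", ":)"))

# Collapse value into one space-separated string per ID
setDT(DF)[, .(value = paste(value, collapse = " ")), by = ID]
```

setDT converts the data frame by reference, and `.(value = ...)` with `by = ID` plays the same role as group_by plus summarise.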
You can just group_by(ID) and summarise with a concatenation function. Here I use str_c with the collapse argument.
library(tidyverse)
DF <- data.frame(
ID = c(1, 2, 2, 3, 3, 3, 4, 4, 4, 4),
value = c("I", "ate", "cereals", "for", "breakfast", "it", "was", "delicious", "!!!", ":)")
)
DF %>%
  group_by(ID) %>%
  summarise(value = str_c(value, collapse = " "))
#> # A tibble: 4 x 2
#> ID value
#> <dbl> <chr>
#> 1 1 I
#> 2 2 ate cereals
#> 3 3 for breakfast it
#> 4 4 was delicious !!! :)
Created on 2018-08-26 by the reprex package (v0.2.0).