Counting Specific Word Frequency in R

I have a data set where I have split the text of journal abstracts into one word per row. This has led to over 5 million rows, but I just want the counts of certain words. Below is an example of the data:
1 rna
1 synthesis
1 resembles
1 copy
1 choice
1 rna
1 recombination
1 process
1 nascent
1 rna
So in that example, if I wanted just the rna count, I would get 3 and that's it. I have done the word count on the whole set, but this is not as useful to me, since many of the words aren't helpful for what I am trying to get at:
wordCount <- m3 %>%
  count(word, sort = TRUE)
Any help would be welcome.

You can group_by the word, count occurrences of each unique word, and then subset the ones you want.
library(tidyverse)

data <- data.frame(word = c("rna", "synthesis", "resembles", "copy",
                            "choice", "rna", "recombination", "process",
                            "nascent", "rna"))

counts <- data %>%
  group_by(word) %>%
  count()

counts[which(counts$word == "rna"), ]
# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3
or using dplyr subsetting:
counts %>% filter(word == "rna")
# A tibble: 1 x 2
# Groups:   word [1]
  word      n
  <fct> <int>
1 rna       3
Piping it all through at once:
data %>%
  group_by(word) %>%
  count() %>%
  filter(word == "rna")
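Note that count(word) on an ungrouped data frame does the group_by() and count() in one step and, unlike the grouped version above, returns an ungrouped result:

data %>%
  count(word) %>%
  filter(word == "rna")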
A one-liner data.table solution:
library(data.table)
setDT(data)
data[word == "rna", .N, by = word]
   word N
1:  rna 3
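If you need counts for several specific words at once, filtering before counting avoids tallying all 5 million rows. A minimal dplyr sketch, assuming the same m3 and word column as in the question; words_of_interest is a hypothetical vector you would replace with your own:

library(dplyr)

# Hypothetical set of target words; swap in the ones you care about
words_of_interest <- c("rna", "recombination", "synthesis")

m3 %>%
  filter(word %in% words_of_interest) %>%
  count(word, sort = TRUE)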

Related

Group strings that have the same words but in a different order

I have an example concatenated text field (see sample data below) that is created from two or three different fields; however, there is no guarantee that the words will be in the same order. I would like to create a new dataset where fields containing the same words, regardless of order, are collapsed. Since I do not know in advance which words will be concatenated together, the code has to recognize that all words in both strings match.
Code for example data:
var1 <- c("BLUE|RED", "RED|BLUE", "WHITE|BLACK|ORANGE", "BLACK|WHITE|ORANGE")
freq <- c(1, 1, 1, 1)
have <- as.data.frame(cbind(var1, freq))
Have:
var1                 freq
BLUE|RED             1
RED|BLUE             1
WHITE|BLACK|ORANGE   1
BLACK|WHITE|ORANGE   1
How can I collapse the data into what I want below?
color                freq
BLUE|RED             2
WHITE|BLACK|ORANGE   2
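One caution about the example construction before the answers: as.data.frame(cbind(var1, freq)) routes both vectors through a character matrix, so freq ends up as character (and, before R 4.0, both columns become factors). If the column types matter downstream, building the frame directly preserves them:

# Keeps var1 character (in R >= 4.0) and freq numeric
have <- data.frame(var1 = var1, freq = freq)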
data.frame(table(sapply(strsplit(have$var1, '\\|'),
                        function(x) paste(sort(x), collapse = '|'))))
                Var1 Freq
1 BLACK|ORANGE|WHITE    2
2           BLUE|RED    2
In the world of piping, with the native pipe (R >= 4.1):
have$var1 |>
  strsplit('\\|') |>
  sapply(\(x) paste0(sort(x), collapse = "|")) |>
  table() |>
  data.frame()
Here is a tidyverse approach:
library(dplyr)
library(tidyr)
have %>%
  group_by(id = row_number()) %>%
  separate_rows(var1) %>%
  arrange(var1, .by_group = TRUE) %>%
  mutate(var1 = paste(var1, collapse = "|")) %>%
  slice(1) %>%
  ungroup() %>%
  count(var1, name = "freq")
  var1               freq
  <chr>             <int>
1 BLACK|ORANGE|WHITE    2
2 BLUE|RED              2
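All of the answers above count rows, which works here because every freq is 1. If freq could hold other values, you would instead sum it over the normalized key. A base R sketch, assuming freq is numeric (see the data.frame() note above):

# Normalize each string to a canonical word order, then sum freq per key
have$key <- sapply(strsplit(have$var1, '\\|'),
                   function(x) paste(sort(x), collapse = '|'))
aggregate(freq ~ key, data = have, FUN = sum)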

Tally()ing Multiple Observations In an Entire Data Frame

I'm having trouble figuring out how to deal with a column that features several observations I would like to tally. For example:
HTML/CSS;Java;JavaScript;Python;SQL
This is one of the cells in a column of a data frame, and I'd like to tally the occurrences of each programming language. Is this something that should be tackled with str_detect(), with corpus(), or is there another way I'm not seeing?
My goal is to make each of these languages (HTML, CSS, Java, JavaScript, Python, SQL, etc.) a column name, with a tally of how many times it occurs in this column of the data frame.
I feel like I might've phrased this strangely, so let me know if you need any clarification.
In the tidyverse you can use separate_rows() and count():
library(dplyr)
df %>% tidyr::separate_rows(PL, sep = ';') %>% count(PL)
In base R, we can split the string on the semicolon and count with table():
table(unlist(strsplit(df$PL, ';')))
#If you need a dataframe
#stack(table(unlist(strsplit(df$PL, ';'))))
If you just want a total count of each label, you can use unnest_longer and a grouped count:
# using #DPH's example data
library(dplyr)
library(tidyr)

df %>%
  mutate(across(PL, strsplit, ";")) %>%
  unnest_longer(PL) %>%
  group_by(PL) %>%
  count()
# A tibble: 6 x 2
# Groups:   PL [6]
  PL             n
  <chr>      <int>
1 HTML/CSS       2
2 Java           1
3 JavaScript     2
4 Python         1
5 R              3
6 SQL            2
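One caveat: in dplyr >= 1.1, passing extra arguments through across() (as in across(PL, strsplit, ";")) is deprecated. A sketch of the same pipeline without the warning, doing the split in a plain mutate():

df %>%
  mutate(PL = strsplit(PL, ";")) %>%
  unnest_longer(PL) %>%
  count(PL)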
If I understood your problem correctly, this would be a solution:
library(dplyr)
library(tidyr)
# demo data
df <- dplyr::tibble(
  ID = c("Line 1: ", "Line 2:"),
  PL = c("HTML/CSS;JavaScript;Python;SQL;R", "R;HTML/CSS;Java;JavaScript;SQL;R")
)

# calculations
df %>%
  dplyr::mutate(PLANG = stringr::str_split(PL, ";")) %>%
  tidyr::unnest(c(PLANG)) %>%
  dplyr::group_by(ID, PLANG) %>%
  dplyr::count() %>%
  tidyr::pivot_wider(names_from = "PLANG", values_from = "n", values_fill = 0)
  ID         `HTML/CSS` JavaScript Python     R   SQL  Java
  <chr>           <int>      <int>  <int> <int> <int> <int>
1 "Line 1: "          1          1      1     1     1     0
2 "Line 2:"           1          1      0     2     1     1

The most frequent words in a column of a dataframe

I have a 3-column dataframe where the 3rd (last) column contains a text body, something like one sentence.
Additionally, I have a vector of words.
How can I elegantly find the 15 most frequent words (with their number of occurrences) across the whole 3rd column, restricted to the words that occur in the vector above?
The sentence can look like:
I like dogs and my father like cats
vector <- c("dogs", "like")
Here, the most frequent words are dogs and like.
You can try this:
library(tidytext)
library(tidyverse)

df %>%                             # your data
  unnest_tokens(word, text) %>%    # clean the data a bit and split the phrases
  group_by(word) %>%               # group by word
  summarise(Freq = n()) %>%        # count occurrences
  arrange(-Freq) %>%               # order decreasing
  top_n(2)                         # here the top 2; use 15 in your case
Result:
# A tibble: 8 x 2
  word   Freq
  <chr> <int>
1 dogs      3
2 i         2
If you already have the words split, you can skip the second line.
With data:
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
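The pipeline above counts every token; to honor the question's restriction to a given vector of words, filter before counting. A sketch with the vector from the question (slice_head() assumes dplyr >= 1.0):

words <- c("dogs", "like")

df %>%
  unnest_tokens(word, text) %>%
  filter(word %in% words) %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 15)  # top 15; the example data has fewer distinct words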

How to transpose character data for unique IDs

I'm trying to perform a sum to count the number of interactions for unique IDs.
So I have something like this:
Client ID
JOE12_EMI
ABC12_CANC
ABC12_EMI
ABC12_RENE
and so on...
It'll also have a column next to it that counts how many times each unique ID repeats:
Frequency
1
2
2
1
Is there a way that I can have all the activity types (EMI, TELI, PFL) summed for each ID and then placed into new columns?
I've tried to transpose the data by separating the actual ID from the activity type, but this doesn't return the sums. I'm not sure whether that's the best approach, or whether I should reshape the data to wide format and then do another sum, but I am unsure how to go about it. Thank you for any help.
separate(frequency, id, c("id", "act_code"))
nd <- melt(frequency, id = c("id"))
Try this:
library(dplyr)
data <- data.frame(Client_ID = c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE"),
                   frequency = c(1, 2, 2, 1))

client_and_id <- as.data.frame(do.call(rbind, strsplit(as.character(data$Client_ID), "_")))
names(client_and_id) <- c("client", "id")
data <- cbind(data, client_and_id)

data_sum <- data %>% group_by(id) %>% mutate(sum_freq = sum(frequency))
The output:
> data_sum
# A tibble: 4 x 5
# Groups:   id [3]
  Client_ID  frequency client id    sum_freq
  <fct>          <dbl> <fct>  <fct>    <dbl>
1 JOE12_EMI          1 JOE12  EMI          3
2 ABC12_CANC         2 ABC12  CANC         2
3 ABC12_EMI          2 ABC12  EMI          3
4 ABC12_RENE         1 ABC12  RENE         1
You can also display the output by ID:
distinct(data_sum %>% dplyr::select(id, sum_freq))
# A tibble: 3 x 2
# Groups:   id [3]
  id    sum_freq
  <fct>    <dbl>
1 EMI          3
2 CANC         2
3 RENE         1
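Equivalently, the mutate() plus distinct() steps can be collapsed into a single summarise(), which returns one row per id directly:

data %>%
  group_by(id) %>%
  summarise(sum_freq = sum(frequency))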
You're on the right track; I think the only thing you need is a group_by. Something like this:
library(dplyr)
library(tidyr)
df <- data.frame(ClientID = c("JOE12_EMI", "ABC12_CANC", "ABC12_EMI", "ABC12_RENE"))

df %>%
  separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
  group_by(id) %>%
  mutate(frequency = n()) %>%
  ungroup() %>%
  group_by(id, act_code) %>%
  mutate(act_frequency = n()) %>%
  ungroup() %>%
  spread(act_code, act_frequency)
(This does the sum by user and the pivot by activity type separately; it's possible to calculate the sum by user after pivoting, but this way is easier for me to read.)
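Note that spread() has been superseded by pivot_wider() in current tidyr. A sketch of the same reshape that keeps only the per-activity tallies (omitting the per-user frequency column):

df %>%
  separate(ClientID, into = c("id", "act_code"), sep = "_") %>%
  count(id, act_code, name = "act_frequency") %>%
  pivot_wider(names_from = act_code,
              values_from = act_frequency,
              values_fill = 0)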

Dplyr keeps automatically adding one of my columns [duplicate]

This question is a duplicate of: "Adding missing grouping variables" message in dplyr in R.
So, I have a large data.frame with multiple columns, of which trial.number and indexer are two.
It annoys me that dplyr constantly, no matter what, adds the indexer column.
A simple example:
saccade.df %>%
  distinct(trial.number, .keep_all = F)
I would expect to see the unique trial.numbers and only the trial.number column. However, the output also includes the indexer column. How do I stop dplyr from doing this? And why isn't it showing the unique trial.numbers, but only the unique indexer values, which I didn't even ask for?
example.df <- data.frame(trial.number = rep(1:10, each = 10),
                         time = 1:100,
                         indexer = rep(21:30, each = 10))

example.df %>%
  distinct(trial.number, .keep_all = F)
This does give the right output, so I must somehow have grouped my own variables.
Thanks!
Try ungroup():
df <- data.frame(trial.number = 1:2, indexer = 3:4)

df %>% distinct(trial.number)
#  trial.number
#1            1
#2            2

df %>% group_by(trial.number, indexer) %>% distinct(trial.number)
## A tibble: 2 x 2
## Groups:   trial.number, indexer [2]
#  trial.number indexer
#         <int>   <int>
#1            1       3
#2            2       4

df %>% group_by(trial.number, indexer) %>% ungroup %>% distinct(trial.number)
## A tibble: 2 x 1
#  trial.number
#         <int>
#1            1
#2            2
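Applied to the question's data, the fix is simply to drop the leftover grouping before calling distinct(). A sketch, assuming saccade.df still carries a group_by() that includes indexer:

saccade.df %>%
  ungroup() %>%
  distinct(trial.number, .keep_all = FALSE)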
