Most frequent phrases from text data in R

Does anyone here have experience identifying the most common phrases (3 to 7 consecutive words) in text data? I understand that most frequency analysis focuses on the most frequent/common words (along with plotting a word cloud) rather than on phrases.
# Assuming a particular column in a data frame (df) with n rows that is all text data.
# I'm not able to provide sample data, as using dput() on a large text file
# won't be feasible here.
library(tm)
Text <- df$Text_Column
docs <- Corpus(VectorSource(Text))
...
Thanks in advance!

You have several options to do this in R. Let's grab some data first. I use the books by Jane Austen from the janeaustenr package and do some cleaning so that each paragraph is in a separate row:
library(janeaustenr)
library(tidyverse)

books <- austen_books() %>%
  mutate(paragraph = cumsum(text == "" & lag(text) != "")) %>%
  group_by(paragraph) %>%
  summarise(book = head(book, 1),
            text = trimws(paste(text, collapse = " ")),
            .groups = "drop")
With tidytext:
library(tidytext)

# Using multiple values for n is not directly implemented in tidytext,
# so map over n = 3:7 and bind the results.
map_df(3L:7L, ~ unnest_tokens(books, ngram, text, token = "ngrams", n = .x)) %>%
  count(ngram) %>%
  filter(!is.na(ngram)) %>%
  slice_max(n, n = 10)
#> # A tibble: 10 × 2
#> ngram n
#> <chr> <int>
#> 1 i am sure 415
#> 2 i do not 412
#> 3 she could not 328
#> 4 it would be 258
#> 5 in the world 247
#> 6 as soon as 236
#> 7 a great deal 214
#> 8 would have been 211
#> 9 she had been 203
#> 10 it was a 202
With quanteda:
library(quanteda)

books %>%
  corpus(docid_field = "paragraph",
         text_field = "text") %>%
  tokens(remove_punct = TRUE,
         remove_symbols = TRUE) %>%
  tokens_ngrams(n = 3L:7L) %>%
  dfm() %>%
  topfeatures(n = 10) %>%
  enframe()
#> # A tibble: 10 × 2
#> name value
#> <chr> <dbl>
#> 1 i_am_sure 415
#> 2 i_do_not 412
#> 3 she_could_not 328
#> 4 it_would_be 258
#> 5 in_the_world 247
#> 6 as_soon_as 236
#> 7 a_great_deal 214
#> 8 would_have_been 211
#> 9 she_had_been 203
#> 10 it_was_a 202
With text2vec:
library(text2vec)

itoken(books$text, tolower, word_tokenizer) %>%
  create_vocabulary(ngram = c(3L, 7L), sep_ngram = " ") %>%
  filter(str_detect(term, "[[:alpha:]]")) %>% # keep terms with at least one alphabetic character
  slice_max(term_count, n = 10)
#> Number of docs: 10293
#> 0 stopwords: ...
#> ngram_min = 3; ngram_max = 7
#> Vocabulary:
#> term term_count doc_count
#> 1: i am sure 415 384
#> 2: i do not 412 363
#> 3: she could not 328 288
#> 4: it would be 258 233
#> 5: in the world 247 234
#> 6: as soon as 236 233
#> 7: a great deal 214 209
#> 8: would have been 211 192
#> 9: she had been 203 179
#> 10: it was a 202 194
Created on 2022-08-03 by the reprex package (v2.0.1)
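If you would rather stay close to the tm-style workflow from the question and avoid extra packages, here is a minimal base-R sketch (the count_ngrams() helper is just illustrative):

# Minimal base-R sketch: lowercase and split into words, paste consecutive
# words into n-grams, and tabulate their frequencies. Note that the whole
# vector is treated as one running text, so n-grams can span element boundaries.
count_ngrams <- function(text, n) {
  words <- unlist(strsplit(tolower(text), "[^[:alnum:]']+"))
  words <- words[words != ""]
  if (length(words) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(words) - n + 1),
                  function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}
head(count_ngrams(books$text, 3), 10)  # ten most frequent 3-word phrases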

Related

simplifying tidyeval with multiple symbols

The following function behaves as desired: several variables can be passed to group_by() without the need to put them into a list or dplyr::vars():
library(tidyverse)
library(rlang)

mean_by_grp <- function(df, meanvar, grp) {
  grouping <- enexpr(grp) %>%
    expr_deparse() %>%
    str_split(",", simplify = TRUE) %>% `[`(1, ) %>%
    map(str_remove, "c\\(") %>% map(str_remove, "\\)") %>% map(str_trim) %>%
    unlist() %>% syms()
  df %>%
    group_by(!!!syms(grouping)) %>%
    summarise("average_{{meanvar}}" := mean({{meanvar}}, na.rm = TRUE),
              .groups = "drop")
}
starwars %>% mean_by_grp(height, species)
starwars %>% mean_by_grp(height, c(species, homeworld))
However, it is complicated. I need to turn c(var1, ..., varn) into a string, split it, and turn it into a list of symbols so I can use it with syms().
Isn't there a much easier way to do this?
Of course, I could use the ellipsis instead of grp, but then I could only have one argument that passes multiple symbols to another function.
One option would be dplyr::across:
mean_by_grp <- function(df, meanvar, grp) {
  df %>%
    group_by(across({{ grp }})) %>%
    summarise("average_{{meanvar}}" := mean({{meanvar}}, na.rm = TRUE),
              .groups = "drop")
}
library(dplyr)
starwars %>% mean_by_grp(height, species)
#> # A tibble: 38 × 2
#> species average_height
#> <chr> <dbl>
#> 1 Aleena 79
#> 2 Besalisk 198
#> 3 Cerean 198
#> 4 Chagrian 196
#> 5 Clawdite 168
#> 6 Droid 131.
#> 7 Dug 112
#> 8 Ewok 88
#> 9 Geonosian 183
#> 10 Gungan 209.
#> # … with 28 more rows
starwars %>% mean_by_grp(height, c(species, homeworld))
#> # A tibble: 58 × 3
#> species homeworld average_height
#> <chr> <chr> <dbl>
#> 1 Aleena Aleen Minor 79
#> 2 Besalisk Ojom 198
#> 3 Cerean Cerea 198
#> 4 Chagrian Champala 196
#> 5 Clawdite Zolan 168
#> 6 Droid Naboo 96
#> 7 Droid Tatooine 132
#> 8 Droid <NA> 148
#> 9 Dug Malastare 112
#> 10 Ewok Endor 88
#> # … with 48 more rows
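If you are on dplyr >= 1.1.0, a roughly equivalent sketch uses pick(), which selects columns with tidyselect semantics inside data-masking verbs such as group_by() (the function name mean_by_grp2 is just illustrative):

mean_by_grp2 <- function(df, meanvar, grp) {
  df %>%
    group_by(pick({{ grp }})) %>%
    summarise("average_{{meanvar}}" := mean({{meanvar}}, na.rm = TRUE),
              .groups = "drop")
}
starwars %>% mean_by_grp2(height, c(species, homeworld))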

R classification of a number

I am working in R, but I don't really know how to extract a series of data from a number. For example, from the number 20102168056 I want to subdivide it like this:
2010 -> year
2 -> semester
168 -> university career
056 -> unique number
I tried to do it with an if, but I kept getting more errors. I am new to this and would like to know if you can help me. (By the way, it has to work for any number, such as 20211888070, which is why the if approach I tried is not suitable.)
You can use tidyr::separate.
library(tidyverse)

df <- tibble(original = c(20102168056, 20141152013, 20182008006))
df %>%
  separate(original, into = c("year", "semester", "university_career", "unique_number"),
           sep = c(4, 5, 8, 11))
# A tibble: 3 × 4
year semester university_career unique_number
<chr> <chr> <chr> <chr>
1 2010 2 168 056
2 2014 1 152 013
3 2018 2 008 006
You may want to convert some of the columns to an integer:
df %>%
  separate(original, into = c("year", "semester", "university_career", "unique_number"),
           sep = c(4, 5, 8, 11)) %>%
  mutate(across(year:unique_number, as.integer))
# A tibble: 3 × 4
year semester university_career unique_number
<int> <int> <int> <int>
1 2010 2 168 56
2 2014 1 152 13
3 2018 2 8 6
We can use stringr::str_match().
library(tidyverse)

data <- c(20102168056, 20102168356)
str_match(data, '^(\\d{4})(\\d{1})(\\d{3})(\\d{3})') %>%
  as.data.frame() %>%
  set_names(c('value', 'year', 'semester', 'university_career', 'unique_number'))
#> value year semester university_career unique_number
#> 1 20102168056 2010 2 168 056
#> 2 20102168356 2010 2 168 356
Created on 2021-12-08 by the reprex package (v2.0.1)
You can use the substr() function if you first make the number into a character with as.character().
test <- '20102168056'
data <- list()
data$year <- substr(test, 1, 4)
data$semester <- substr(test, 5, 5)
data$uni_career <- substr(test, 6, 8)
data$unique_num <- substr(test, 9, 11)
print(data)
#> $year
#> [1] "2010"
#>
#> $semester
#> [1] "2"
#>
#> $uni_career
#> [1] "168"
#>
#> $unique_num
#> [1] "056"
Created on 2021-12-08 by the reprex package (v2.0.1)
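The same idea works on a whole numeric vector; a small sketch (the column names are just illustrative):

# Coerce the numbers to character first, then take fixed-width substrings.
nums <- c(20102168056, 20211888070)
chr  <- as.character(nums)
data.frame(
  year              = substr(chr, 1, 4),
  semester          = substr(chr, 5, 5),
  university_career = substr(chr, 6, 8),
  unique_number     = substr(chr, 9, 11)
)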

How to join, group and summarise large dataframes in R with multidplyr and parallel

This question is similar to other problems with very large data in R, but I can't find an example of how to merge/join and then perform calculations on two dfs (as opposed to reading in lots of dataframes and using mclapply to do the calculations). Here the problem is not loading the data (takes ~20 min but they do load), but rather the merging & summarising.
I've tried all the data.table approaches I could find, different types of joins, and ff, and I still run into the vecseq limit of 2^31 rows. Now I'm trying to use multidplyr to do it in parallel, but I can't figure out where the error is coming from.
Dataframes:
species_data # df with ~ 65 million rows with cols <- c("id","species_id")
lookup # df with ~ 17 million rows with cols <- c("id","cell_id","rgn_id")
Not all ids in the lookup appear in the species_data.
## make sample dataframes:
lookup <- data.frame(id = seq(2001, 2500, by = 1),
                     cell_id = seq(1, 500, by = 1),
                     rgn_id = seq(801, 1300, by = 1))

library(stringi)
species_id <- sprintf("%s%s%s",
                      stri_rand_strings(n = 1000, length = 3, pattern = "[A-Z]"),
                      pattern = "-",
                      stri_rand_strings(1000, length = 5, '[1-9]'))
id <- sprintf("%s%s%s",
              stri_rand_strings(n = 1000, length = 1, pattern = "[2]"),
              stri_rand_strings(n = 1000, length = 1, pattern = "[0-4]"),
              stri_rand_strings(n = 1000, length = 1, pattern = "[0-9]"))
species_data <- data.frame(species_id, id)
Merge and join dfs with multidplyr:
library(tidyverse)
install.packages("devtools")
library(devtools)
devtools::install_github("hadley/multidplyr")
library(multidplyr)
library(parallel)
species_summary <- species_data %>%
  # partition the species data by species id
  partition(species_id, cluster = cluster) %>%
  left_join(species_data, lookup, by = "id") %>%
  dplyr::select(-id) %>%
  group_by(species_id) %>%
  ## total number of cells each species occurs in
  mutate(tot_count_cells = n_distinct(cell_id)) %>%
  ungroup() %>%
  dplyr::select(c(cell_id, species_id, rgn_id, tot_count_cells)) %>%
  group_by(rgn_id, species_id) %>%
  ## number of cells each species occurs in each region
  summarise(count_cells_eez = n_distinct(cell_id)) %>%
  collect() %>%
  as_tibble()
## Error in partition(., species_id, cluster = cluster) : unused argument (species_id)
## If I change to:
species_summary <- species_data %>%
  group_by(species_id) %>%
  partition(cluster = cluster) %>% ...
## I get: "Error in worker_id(data, cluster) : object 'cluster' not found"
This is my first attempt at parallel processing and big data, and I'm struggling to diagnose the errors.
Thanks!
First I load dplyr and multidplyr
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(multidplyr)
my_clusters <- new_cluster(3) # I have 4 cores
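As a side note, you can size the cluster from the machine rather than hard-coding it; a sketch (cluster_library() loads packages on the workers, which you may need depending on what the partitioned pipeline calls):

n_workers   <- max(1, parallel::detectCores() - 1)  # leave one core free
my_clusters <- new_cluster(n_workers)
cluster_library(my_clusters, "dplyr")               # load dplyr on each worker if needed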
Then I load the same data that you propose:
library(stringi)

lookup <- tibble(
  id = as.character(seq(2001, 2500, by = 1)),
  cell_id = seq(1, 500, by = 1),
  rgn_id = sprintf("%s", stri_rand_strings(n = 500, length = 3, pattern = "[0-9]"))
)

species_id <- sprintf(
  "%s%s%s",
  stri_rand_strings(n = 1000, length = 3, pattern = "[A-Z]"),
  pattern = "-",
  stri_rand_strings(n = 1000, length = 5, "[1-9]")
)

id <- sprintf(
  "%s%s%s",
  stri_rand_strings(n = 1000, length = 1, pattern = "[2]"),
  stri_rand_strings(n = 1000, length = 1, pattern = "[0-4]"),
  stri_rand_strings(n = 1000, length = 1, pattern = "[0-9]")
)

species_data <- tibble(species_id, id)
Check the result
species_data
#> # A tibble: 1,000 x 2
#> species_id id
#> <chr> <chr>
#> 1 CUZ-98293 246
#> 2 XDG-61673 234
#> 3 WFZ-94338 230
#> 4 UIH-97549 226
#> 5 AGE-35257 229
#> 6 BMD-75361 249
#> 7 MJB-78799 226
#> 8 STS-15141 225
#> 9 RXD-18645 245
#> 10 SKZ-58666 243
#> # ... with 990 more rows
lookup
#> # A tibble: 500 x 3
#> id cell_id rgn_id
#> <chr> <dbl> <chr>
#> 1 2001 1 649
#> 2 2002 2 451
#> 3 2003 3 532
#> 4 2004 4 339
#> 5 2005 5 062
#> 6 2006 6 329
#> 7 2007 7 953
#> 8 2008 8 075
#> 9 2009 9 008
#> 10 2010 10 465
#> # ... with 490 more rows
Now I can run the code using a multidplyr approach. I divide the dplyr code into two steps, according to the two group_by() calls.
first_step <- species_data %>%
  left_join(lookup, by = "id") %>%
  select(-id) %>%
  group_by(species_id) %>%
  partition(my_clusters) %>%
  mutate(tot_count_cells = n_distinct(cell_id)) %>%
  collect() %>%
  ungroup()
first_step
#> # A tibble: 1,000 x 4
#> species_id cell_id rgn_id tot_count_cells
#> <chr> <dbl> <chr> <int>
#> 1 UIH-97549 NA <NA> 1
#> 2 BMD-75361 NA <NA> 1
#> 3 STS-15141 NA <NA> 1
#> 4 RXD-18645 NA <NA> 1
#> 5 HFI-78676 NA <NA> 1
#> 6 KVP-45194 NA <NA> 1
#> 7 SGW-29988 NA <NA> 1
#> 8 WBI-79521 NA <NA> 1
#> 9 MFY-86277 NA <NA> 1
#> 10 BHO-37621 NA <NA> 1
#> # ... with 990 more rows
and
second_step <- first_step %>%
  group_by(rgn_id, species_id) %>%
  partition(my_clusters) %>%
  summarise(count_cells_eez = n_distinct(cell_id)) %>%
  collect() %>%
  ungroup()
second_step
#> # A tibble: 1,000 x 3
#> rgn_id species_id count_cells_eez
#> <chr> <chr> <int>
#> 1 <NA> ABB-24645 1
#> 2 <NA> ABY-98559 1
#> 3 <NA> AEQ-42462 1
#> 4 <NA> AFO-58569 1
#> 5 <NA> AKQ-44439 1
#> 6 <NA> AMF-23978 1
#> 7 <NA> ANF-49159 1
#> 8 <NA> APD-85367 1
#> 9 <NA> AQH-64126 1
#> 10 <NA> AST-77513 1
#> # ... with 990 more rows
Created on 2020-03-21 by the reprex package (v0.3.0)
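Note that with this toy data the generated species_data$id values are 3-character strings while lookup$id runs from "2001" to "2500", so the keys never match and the joined cell_id/rgn_id columns come back as NA. With your real data it is worth checking the key overlap before the expensive join; a quick sketch:

# How many species_data rows have a matching id in lookup?
sum(species_data$id %in% lookup$id)    # 0 with the toy data above
mean(species_data$id %in% lookup$id)   # share of rows that will actually join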

How to bin the summarised frequency table with dplyr

I have the following data frame:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- nycflights13::flights %>%
  select(distance) %>%
  group_by(distance) %>%
  summarise(n = n()) %>%
  arrange(distance) %>%
  ungroup()
df
#> # A tibble: 214 x 2
#> distance n
#> <dbl> <int>
#> 1 17 1
#> 2 80 49
#> 3 94 976
#> 4 96 607
#> 5 116 443
#> 6 143 439
#> 7 160 376
#> 8 169 545
#> 9 173 221
#> 10 184 5504
#> # … with 204 more rows
What I want to do is bin the distance column into bins of size 100 and sum the n column accordingly. How can I do that?
So you get something like:
bin_distance sum_n
1-100 1633 #(1 + 49 + 976 + 607)
101-200 21344 # (443 + ... + 5327)
#etc
The simplest approach would be to use cut(), creating groups with seq() every 100 values, and then sum the values for each group.
library(dplyr)

df %>%
  group_by(group = cut(distance, breaks = seq(0, max(distance), 100))) %>%
  summarise(n = sum(n))
# group n
# <fct> <int>
# 1 (0,100] 1633
# 2 (100,200] 21344
# 3 (200,300] 28310
# 4 (300,400] 7748
# 5 (400,500] 21292
# 6 (500,600] 26815
# 7 (600,700] 7846
# 8 (700,800] 48904
# 9 (800,900] 7574
#10 (900,1e+03] 18205
# ... with 17 more rows
which can be translated to base R using aggregate() like this:
aggregate(n ~ distance,
          transform(df, distance = cut(distance, breaks = seq(0, max(distance), 100))),
          sum)
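If you want labels in the 1-100 style from the question rather than interval notation, you can pass labels to cut(); a sketch using the same df:

breaks <- seq(0, ceiling(max(df$distance) / 100) * 100, by = 100)
df %>%
  group_by(bin_distance = cut(distance, breaks = breaks,
                              labels = paste(head(breaks, -1) + 1, tail(breaks, -1), sep = "-"))) %>%
  summarise(sum_n = sum(n))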
A different tidyverse solution. It closely follows the logic of @Ronak Shah's code, but instead of cut() it uses cut_width() from ggplot2:
nycflights13::flights %>%
  select(distance) %>%
  group_by(ints = cut_width(distance, width = 100, boundary = 0)) %>%
  summarise(n = n())
ints n
<fct> <int>
1 [0,100] 1633
2 (100,200] 21344
3 (200,300] 28310
4 (300,400] 7748
5 (400,500] 21292
6 (500,600] 26815
7 (600,700] 7846
8 (700,800] 48904
9 (800,900] 7574
10 (900,1e+03] 18205

Tidy text: Compute Zipf's law from the following term-document matrix

I tried the code from http://tidytextmining.com/tfidf.html. My result can be seen in this image.
My question is: How can I rewrite the code to produce the negative relationship between the term frequency and the rank?
The following is the term-document matrix. Any comments are highly appreciated.
# Zipf's law
freq_rk <- DTM_words %>%
  group_by(document) %>%
  mutate(rank = row_number(),
         term_frequency = count / total)

freq_rk %>%
  ggplot(aes(rank, term_frequency, color = document)) +
  geom_line(size = 1.2, alpha = 0.8)
DTM_words
# A tibble: 4,530 x 5
document term count n total
<chr> <chr> <dbl> <int> <dbl>
1 1 activ 1 1 109
2 1 agencydebt 1 1 109
3 1 assess 1 1 109
4 1 avail 1 1 109
5 1 balanc 2 1 109
# ... with 4,520 more rows
To use row_number() to get rank, you need to make sure that your data frame is ordered by count, the number of times a word is used in a document. Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to a DTM from quanteda.)
library(tidyverse)
library(tidytext)
data("data_corpus_inaugural", package = "quanteda")
inaug_dfm <- quanteda::dfm(data_corpus_inaugural, verbose = FALSE)
ap_td <- tidy(inaug_dfm)
ap_td
#> # A tibble: 44,725 x 3
#> document term count
#> <chr> <chr> <dbl>
#> 1 1789-Washington fellow 3
#> 2 1793-Washington fellow 1
#> 3 1797-Adams fellow 3
#> 4 1801-Jefferson fellow 7
#> 5 1805-Jefferson fellow 8
#> 6 1809-Madison fellow 1
#> 7 1813-Madison fellow 1
#> 8 1817-Monroe fellow 6
#> 9 1821-Monroe fellow 10
#> 10 1825-Adams fellow 3
#> # ... with 44,715 more rows
Notice that here, you have a tidy data frame with one word per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, it isn't meaningful because the words are all jumbled up in order.
Instead, we can arrange this by descending count.
ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count))
ap_td
#> # A tibble: 44,725 x 3
#> # Groups: document [58]
#> document term count
#> <chr> <chr> <dbl>
#> 1 1841-Harrison the 829
#> 2 1841-Harrison of 604
#> 3 1909-Taft the 486
#> 4 1841-Harrison , 407
#> 5 1845-Polk the 397
#> 6 1821-Monroe the 360
#> 7 1889-Harrison the 360
#> 8 1897-McKinley the 345
#> 9 1841-Harrison to 318
#> 10 1881-Garfield the 317
#> # ... with 44,715 more rows
Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it.
ap_td <- tidy(inaug_dfm) %>%
  group_by(document) %>%
  arrange(desc(count)) %>%
  mutate(rank = row_number(),
         total = sum(count),
         `term frequency` = count / total)
ap_td
#> # A tibble: 44,725 x 6
#> # Groups: document [58]
#> document term count rank total `term frequency`
#> <chr> <chr> <dbl> <int> <dbl> <dbl>
#> 1 1841-Harrison the 829 1 9178 0.09032469
#> 2 1841-Harrison of 604 2 9178 0.06580954
#> 3 1909-Taft the 486 1 5844 0.08316222
#> 4 1841-Harrison , 407 3 9178 0.04434517
#> 5 1845-Polk the 397 1 5211 0.07618499
#> 6 1821-Monroe the 360 1 4898 0.07349939
#> 7 1889-Harrison the 360 1 4744 0.07588533
#> 8 1897-McKinley the 345 1 4383 0.07871321
#> 9 1841-Harrison to 318 4 9178 0.03464807
#> 10 1881-Garfield the 317 1 3240 0.09783951
#> # ... with 44,715 more rows
ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_line(alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()
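To put a number on that negative relationship, you can fit the slope on the log-log scale and overlay it on the plot; a sketch (for ideal Zipf behaviour the slope is close to -1):

zipf_fit <- lm(log10(`term frequency`) ~ log10(rank), data = ap_td)
coef(zipf_fit)

ap_td %>%
  ggplot(aes(rank, `term frequency`, color = document)) +
  geom_abline(intercept = coef(zipf_fit)[1], slope = coef(zipf_fit)[2],
              color = "gray50", linetype = 2) +
  geom_line(alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()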
A graph that describes a linear regression (i.e. not Zipf's law) would just add a smooth with a linear regression model (lm):
freq_rk %>%
  ggplot(aes(rank, term_frequency, color = document)) +
  geom_line(size = 1.2, alpha = 0.8) +
  geom_smooth(method = lm)
To identify the differences between Austen's distributions and yours, run the following code:
Austen:
ggplot(freq_by_rank, aes(rank, fill = book)) + geom_density(alpha = 0.5) + labs(title = "Austen linear")
ggplot(freq_by_rank, aes(rank, fill = book)) + geom_density(alpha = 0.5) + scale_x_log10() + labs(title = "Austen Logarithmic")
Tom's sample:
ggplot(freq_rk, aes(rank, fill = document)) + geom_density(alpha = 0.5) + labs(title = "Sample linear")
ggplot(freq_rk, aes(rank, fill = document)) + geom_density(alpha = 0.5) + scale_x_log10() + labs(title = "Sample Logarithmic")
