Quanteda: changing relative frequency of a term over time in R

I have a corpus of news articles with date and time of publication as 'docvars'.
readtext object consisting of 6 documents and 8 docvars.
# Description: df [6 × 10]
doc_id text year month day hour minute second title source
* <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <chr>
1 2014_01_01_10_51_00… "\"新华网伦敦1… 2014 1 1 10 51 0 docid报告称若不减… RMWenv
2 2014_01_01_11_06_00… "\"新华网北京1… 2014 1 1 11 6 0 docid盘点2013… RMWenv
3 2014_01_02_08_08_00… "\"原标题:报告… 2014 1 2 8 8 0 docid报告称若不减… RMWenv
4 2014_01_03_08_42_00… "\"地球可能毁灭… 2014 1 3 8 42 0 docid地球可能毁灭… RMWenv
5 2014_01_03_08_44_00… "\"北美鼠兔看起… 2014 1 3 8 44 0 docid北美鼠兔为应… RMWenv
6 2014_01_06_10_30_00… "\"欣克力C点核… 2014 1 6 10 30 0 docid英国欲建50… RMWenv
I would like to measure the changing relative frequency with which a particular term, e.g. 'development', occurs in these articles (either as a proportion of the total terms in the article, or as a proportion of the total terms in all the articles published on a particular day or in a particular month). I know that I can count the number of times the term occurs in all the articles in a month, using:
dfm(corp, select = "term", groups = "month")
and that I can get the frequency of the word relative to the total words in the document using:
dfm_weight(dfm, scheme = "prop")
But how do I combine these together to get the frequency of a specific term relative to the total number of words on a particular day or in a particular month?
What I would like to be able to do is measure the change in the number of times a term is used over time, while accounting for the fact that the total number of words used is also changing. Thanks for any help!

@DaveArmstrong gives a good answer below, and I upvoted it, but I can add a bit of efficiency using some of the newest quanteda syntax, which is a bit simpler.
The key here is preserving the date format created by zoo::as.yearmon(), since the dfm grouping would otherwise coerce it to a character. So we pack it into a docvar, which is preserved by the grouping, and then retrieve it in the ggplot() call.
load(file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1"))
library("quanteda")
## Package version: 2.1.1

## create corpus and dfm
corp <- corpus(m, text_field = "body_text")
corp$date <- m$first_publication_date %>%
  zoo::as.yearmon()
D <- dfm(corp, remove = stopwords("english")) %>%
  dfm_group(groups = "date") %>%
  dfm_weight(scheme = "prop")

library("ggplot2")
convert(D[, "wonderfully"], to = "data.frame") %>%
  ggplot(aes(x = D$date, y = wonderfully, group = 1)) +
  geom_line() +
  labs(x = "Date", y = "Wonderfully/Total # Words")

I suspect someone will come up with a better solution within quanteda, but in the event they don't, you could always extract the word from the dfm, put it in a dataset along with the date, and then make the graph. In the code below, I'm using some music reviews I scraped from the Guardian's website. I've commented out the lines that read the data in from an .rda file on Dropbox. You're welcome to use it if you like - it's clean, but I don't want someone to inadvertently download a file from the web they're not aware of.
# f <- file("https://www.dropbox.com/s/kl2cnd63s32wsxs/music.rda?raw=1")
# load(f)
library("quanteda")
library("dplyr")
library("ggplot2")
## create corpus and dfm
corp <- corpus(as.character(m$body_text))
docvars(corp, "date") <- m$first_publication_date
D <- dfm(corp, remove = stopwords("english"))
## take the word frequencies for "wonderfully" from the dfm
## along with the date
tmp <- tibble(
  word = as.matrix(D)[, "wonderfully"],
  date = docvars(corp)$date,
  ## calculate the total number of words in each document
  total = rowSums(D)
)
tmp <- tmp %>%
  ## turn date into year-month
  mutate(yearmon = zoo::as.yearmon(date)) %>%
  ## group by year-month
  group_by(yearmon) %>%
  ## calculate the sum of the instances of "wonderfully"
  ## divided by the sum of the total words across all
  ## documents in the month
  summarise(prop = sum(word) / sum(total))
## make a plot
ggplot(tmp, aes(x = yearmon, y = prop)) +
  geom_line() +
  labs(x = "Date", y = "Wonderfully/Total # Words")
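If you want daily rather than monthly proportions, the same pattern works; a minimal variation on the aggregation step above (a sketch, assuming first_publication_date is a date-time so as.Date() strips the time):
## daily instead of monthly proportions
tmp_day <- tibble(word = as.matrix(D)[, "wonderfully"],
                  date = as.Date(docvars(corp)$date),
                  total = rowSums(D)) %>%
  group_by(date) %>%
  summarise(prop = sum(word) / sum(total))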

Related

Quanteda dfm_weight() results in relative frequency > 1

I'm using quanteda and trying to compute the relative frequencies of specific words in a corpus which is organized by date and party. However, after converting the corpus to a dfm and using dfm_weight(dfmat, scheme = "prop") followed by textstat_frequency(), I get scores bigger than 1.
Here is my code (I also stem and clean my tokens, though that is not shown here):
corp <- corpus(title_df, text_field = "text", meta = list(title_df[,-4]))
toks <- tokens(corp)
dfmat <- dfm(toks, verbose=TRUE)
dfm_rel_freq <- dfm_weight(dfmat, scheme = "prop")
rel_freq_all <- quanteda.textstats::textstat_frequency(dfm_rel_freq, groups = year)
# arrange by max frequency:
rel_freq_all %>% arrange(frequency) %>% tail()
       feature frequency rank docfreq group
81093  pension  5.802529    1     117  2004
40971  pension  6.117154    1      97  1998
148372 peopl    6.430454    1     220  2014
65747  pension  6.721089    1     138  2002
53303  pension  7.871011    1     153  2000
74391  pension  8.153381    1     156  2003
This is the expected behaviour: quanteda.textstats::textstat_frequency(x, groups = year) will sum the dfm within the year groups. So your proportions from the dfm are being summed, and these can exceed 1.0.
If you wanted a different operation on the groups, for instance the mean relative frequency per year, then you should not sum via the groups argument. Instead, keep one row per feature per document, attach the year docvar, and average with dplyr, for example:
library(dplyr)
quanteda.textstats::textstat_frequency(dfm_rel_freq, groups = docnames(dfm_rel_freq)) %>%
  left_join(data.frame(group = docnames(dfm_rel_freq),
                       year = docvars(dfm_rel_freq, "year")),
            by = "group") %>%
  group_by(year, feature) %>%
  summarize(mean_rel_freq = mean(frequency))

Having difficulty using the rle command within a mutate step in R to count the max number of consecutive characters in a word

I created this function to count the maximum number of consecutive characters in a word.
max(rle(unlist(strsplit("happy", split = "")))$lengths)
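For illustration, the one-liner decomposes like this:
strsplit("happy", split = "")                     # list of single characters: "h" "a" "p" "p" "y"
rle(unlist(strsplit("happy", "")))                # run lengths 1 1 2 1 for values "h" "a" "p" "y"
max(rle(unlist(strsplit("happy", "")))$lengths)   # 2 (the double "p")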
The function works on individual words, but when I try to use the function within a mutate step it doesn't work. Here is the code that involves the mutate step.
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
text3_df %>%
unnest_tokens(word, text3) %>%
mutate(
num_letters = nchar(word),
num_vowels = get_count(word),
num_consec_char = max(rle(unlist(strsplit(word, split = "")))$lengths)
)
The variables num_letters and num_vowels work fine, but I get a 2 for every value of num_consec_char. I can't figure out what I'm doing wrong.
The call rle(unlist(strsplit(word, split = "")))$lengths is not vectorized: inside mutate(), word is the entire column, so strsplit() splits every word, unlist() concatenates all of those letters into one long vector, and max(rle(...)$lengths) is computed once over that vector, giving the same result for every row.
You will need to use some type of loop (i.e. for, apply, purrr::map) to solve it.
library(dplyr)
library(tidytext)
text3 <- "The most pressing of those issues, considering the franchise's
stated goal of competing for championships above all else, is an apparent
disconnect between Lakers vice president of basketball operations and general manager"
text3_df <- tibble(line = 1:1, text3)
output <- text3_df %>%
  unnest_tokens(word, text3) %>%
  mutate(
    num_letters = nchar(word),
    # num_vowels = get_count(word),
  )
output$num_consec_char <- sapply(output$word, function(word){
  max(rle(unlist(strsplit(word, split = "")))$lengths)
})
output
# A tibble: 32 × 4
line word num_letters num_consec_char
<int> <chr> <int> <int>
1 1 the 3 1
2 1 most 4 1
3 1 pressing 8 2
4 1 of 2 1
5 1 those 5 1
6 1 issues 6 2
7 1 considering 11 1
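If you would rather keep everything inside the mutate() step, one of the loop constructs mentioned above, purrr::map_int(), can be dropped in directly (an equivalent sketch):
library(purrr)
text3_df %>%
  unnest_tokens(word, text3) %>%
  mutate(
    num_letters = nchar(word),
    ## apply the rle computation to each word individually
    num_consec_char = map_int(word, ~ max(rle(strsplit(.x, split = "")[[1]])$lengths))
  )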

Obtaining Percentage for Date Observations

I am very new to R and am struggling with this concept. I have a data frame that looks like this:
[screenshot of the data frame omitted; columns include ScoreRecent, DateRecent, Facility_Type_Description, Premise_zip, and Opening_Date]
I have used summary(FoodFacilityInspections$DateRecent) to get the observations for each date listed. I have 3932 observations, though, and wanted to get a summary of:
the dates with the most observations and the percentage for each
the percentage of observations for each date in the DateRecent category
I have tried:
count(FoodFacilityInspections$DateRecent)
Error in UseMethod("count") : no applicable method for 'count' applied to an object of class "factor"
Using built-in data, as you did not provide example data:
library(data.table)
dtcars <- data.table(mtcars, keep.rownames = TRUE)
Solution
dtcars[, .("count"=.N, "percent"=.N/dtcars[, .N]*100),
by=cyl]
You can use the table function to find out which date occurs the most. Then you can loop through each item in the table (each date, in your case) and divide it by the total number of rows, like this (also using the mtcars dataset):
table(mtcars$cyl)
percent <- c()
for (i in 1:length(table(mtcars$cyl))) {
  percent[i] <- table(mtcars$cyl)[i] / nrow(mtcars) * 100
}
output <- cbind(table(mtcars$cyl), percent)
output
     percent
4 11  34.375
6  7  21.875
8 14  43.750
A one-liner using table and proportions in within.
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- proportions(Freq)*100)
# cyl Freq Pc
# 1 4 11 34.375
# 2 6 7 21.875
# 3 8 14 43.750
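(One caveat: proportions() was only added in R 4.0.1; on older versions, prop.table() is the equivalent:)
within(as.data.frame.table(with(mtcars, table(cyl))), Pc <- prop.table(Freq) * 100)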
An updated solution with total, percent, and cumulative percent, based on your data.
library(dplyr)
data <- data.frame(
  "ScoreRecent" = c(100, 100, 100, 100, 100, 100, 100, 100, 100),
  "DateRecent" = c("7/23/2021", "7/8/2021", "5/25/2021", "5/19/2021", "5/20/2021", "5/13/2021", "5/17/2021", "5/18/2021", "5/18/2021"),
  "Facility_Type_Description" = c("Retail Food Stores", "Retail Food Stores", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment", "Food Service Establishment"),
  "Premise_zip" = c(40207, 40207, 40207, 40206, 40207, 40206, 40207, 40206, 40206),
  "Opening_Date" = c("6/27/1988", "6/29/1988", "10/20/2009", "2/28/1989", "10/20/2009", "10/20/2009", "10/20/2009", "10/20/2009", "10/20/2009"))
tab <- function(dataset, var){
  dataset %>%
    group_by({{var}}) %>%
    summarise(n = n()) %>%
    mutate(total = cumsum(n),
           percent = n / sum(n) * 100,
           cumulativepercent = cumsum(n / sum(n) * 100))
}
tab(data, Facility_Type_Description)
Facility_Type_Description n total percent cumulativepercent
<chr> <int> <int> <dbl> <dbl>
1 Food Service Establishment 7 7 77.8 77.8
2 Retail Food Stores 2 9 22.2 100
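Since the question is about dates, the same helper applied to the DateRecent column gives the per-date counts and percentages directly:
## percentage of observations for each date
tab(data, DateRecent)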

R dplyr - running total with row-wise calculations

I have a dataframe that keeps track of the activities associated with a bank account (example below).
The initial balance is $5,000 (type "initial"). If type is "in", that means a cash deposit; in this example each deposit is $1,000. If type is "out", that means a withdrawal from the account; in this example each withdrawal is 10% of the account balance.
data <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000))
Is there a dplyr solution to keep track of the balance after each activity? I have tried several ways, but I can't seem to find one that efficiently calculates both the running total and the withdrawal amount (which depends on the running total).
For this example the output should be:
result <- tibble(
activity=1:6,
type=c("initial","in","out","out","in","in"),
input=c(5000,1000,10,10,1000,1000),
balance=c(5000,6000,5400,4860,5860,6860))
Thanks in advance for any suggestions or recommendations!
You can use purrr::accumulate2() to condition the calculation on the value of type:
library(dplyr)
library(purrr)
library(tidyr)
data %>%
  mutate(balance = accumulate2(input, type[-1],
    .f = function(x, y, type) if (type == "out") x - x * y / 100 else x + y)) %>%
  unnest(balance)
# A tibble: 6 x 4
activity type input balance
<int> <chr> <dbl> <dbl>
1 1 initial 5000 5000
2 2 in 1000 6000
3 3 out 10 5400
4 4 out 10 4860
5 5 in 1000 5860
6 6 in 1000 6860
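For readers unfamiliar with purrr: accumulate2() walks input and type[-1] in parallel, feeding each intermediate balance back in as x. The same logic written as a plain loop (an equivalent sketch, not part of the original answer):
balance <- numeric(nrow(data))
balance[1] <- data$input[1]                        # opening balance
for (i in 2:nrow(data)) {
  balance[i] <- if (data$type[i] == "out") {
    balance[i - 1] * (1 - data$input[i] / 100)     # percentage withdrawal
  } else {
    balance[i - 1] + data$input[i]                 # cash deposit
  }
}
data$balance <- balance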

Creating a graph of frequency of a specific word from a dataframe over a time period in R

I have a dataframe of tweets in R, looking like this:
tweet_text tweet_time rdate twt
<chr> <dttm> <date> <dbl>
1 No New England cottage is complete without nautical t.. 2016-08-25 09:21:00 2016-08-25 1
2 Justice Scalia spent his last hours with members of co… 2016-11-24 16:28:00 2016-11-24 1
3 WHAT THE FAILED OKLAHOMA ABORTION BILL TELLS US http:/… 2016-11-24 16:27:00 2016-11-24 1
4 Bipartisan bill in US Senate to restrict US arms sales… 2016-10-26 07:03:00 2016-10-26 1
5 #MustResign campaign is underway with the heat p his S… 2016-10-01 08:15:00 2016-10-01 1
Each tweet has a specific date assigned, and all tweets in the dataframe are from a period of one year. I want to find the frequency of one specific word ("Senate", for example) over the entire period and plot a graph capturing how that frequency changed over time. I am fairly new to R and could only think of very complicated ways to do it, but I am sure there must be a way that's really easy and simple.
I appreciate any suggestions.
textFreq <- function(pattern, text){
  freq <- gregexpr(pattern = pattern, text = text, ignore.case = TRUE)
  freq <- lapply(freq, FUN = function(x){
    ## gregexpr() returns -1 when there is no match
    if (length(x) == 1 && x == -1) {
      return(0)
    } else {
      return(length(x))
    }
  })
  freq <- unlist(freq)
  return(freq)
}
test.text <- c("senate.... SENate.. sen","Working in the senate...", "I like dogs")
textFreq(pattern = "senate", test.text)
# [1] 2 1 0
You can then use dplyr to group by time periods and use mutate():
library(dplyr)
library(magrittr)
data <- data %>%
  group_by(*somedatefactor*) %>% # if you wanted to aggregate every 10 days or something
  mutate(SenateFreqPerTweet = textFreq(pattern = "Senate", text = tweet_text),
         SenateFreqTotal = sum(SenateFreqPerTweet)) # counts the sum based on the current grouping
You may even wrap the previous statement into another function. To do so, check out programming with dplyr.
Regardless, using this approach you can easily plot SenateFreqTotal with the ggplot2 package:
data2 <- data %>% # may be helpful to reduce the size of the dataframe before plotting
  select(SenateFreqTotal, *somedatefactor*) %>%
  distinct()
ggplot(data2, aes(y = SenateFreqTotal, x = *somedatefactor*)) +
  geom_bar(stat = "identity")
If you do not want to aggregate the frequencies, you can just plot like so:
ggplot(data, aes(y = SenateFreqPerTweet, x = tweet_time)) +
  geom_bar(stat = "identity")
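To make this concrete with the data frame from the question (a sketch; it assumes the tweet_text and rdate columns shown there and aggregates by month):
library(dplyr)
library(ggplot2)
monthly <- data %>%
  mutate(month = format(rdate, "%Y-%m"),
         senate_n = textFreq(pattern = "senate", text = tweet_text)) %>%
  group_by(month) %>%
  summarise(senate_total = sum(senate_n))
ggplot(monthly, aes(x = month, y = senate_total, group = 1)) +
  geom_line() +
  labs(x = "Month", y = "Mentions of 'senate'")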
