Interpretation question: textstat_simil() in quanteda (R)

I have a dataset of 310,225 tweets. I want to find out how many tweets are the same or similar. I calculated the similarity between the tweets using quanteda's textstat_simil(). The counts of the similarity values 0.9999 and 1 in the resulting similarity object are shown below:
0.9999      1
  2288 162743
Here's my code:
dfmat_users <- dfm_data %>%
  dfm_select(min_nchar = 2) %>%
  dfm_trim(min_termfreq = 10)
dfmat_users <- dfmat_users[ntoken(dfmat_users) > 10, ]
tstat_sim <- textstat_simil(dfmat_users, method = "cosine", margin = "documents", min_simil = 0.9998)
table(tstat_sim@x)  # result of this code is given above
I need to find out the number of similar or same tweets in the dataset. How should I interpret the results above?

The easiest way is to convert the textstat_simil() output to a data.frame of unique pairs, and then filter for the ones whose cosine value is above your threshold (here, .9999).
To illustrate, we can reshape the built-in inaugural address corpus into sentences, compute the similarity matrix on these, coerce it to a data.frame, and use dplyr to filter the results you want.
library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0.9000
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
sim_df <- data_corpus_inaugural %>%
  corpus_reshape(to = "sentences") %>%
  dfm() %>%
  textstat_simil(method = "cosine") %>%
  as.data.frame()
nrow(sim_df)
## [1] 12508670
You can adjust the condition below for your data to 0.9999 - here I'm using 0.99 as an illustration.
library("dplyr", warn.conflicts = FALSE)
filter(sim_df, cosine > .99)
## document1 document2 cosine
## 1 1861-Lincoln.69 1861-Lincoln.71 1
## 2 1861-Lincoln.69 1861-Lincoln.73 1
## 3 1861-Lincoln.71 1861-Lincoln.73 1
## 4 1953-Eisenhower.6 1985-Reagan.6 1
## 5 1953-Eisenhower.6 1989-Bush.15 1
## 6 1985-Reagan.6 1989-Bush.15 1
## 7 1989-Bush.140 2009-Obama.108 1
## 8 1989-Bush.140 2013-Obama.87 1
## 9 2009-Obama.108 2013-Obama.87 1
## 10 1989-Bush.140 2017-Trump.9 1
## 11 2009-Obama.108 2017-Trump.9 1
## 12 2013-Obama.87 2017-Trump.9 1
(And: yeah, that's a very fast computation of cosine similarity between 12.5 million sentence pairs!)
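Applied to the tweet data in the question, the same coercion gives a direct count. This is only a sketch, assuming the tstat_sim object computed above (with min_simil = 0.9998, only pairs at or above that threshold are returned):
sim_pairs <- as.data.frame(tstat_sim)
# pairs of (near-)duplicate tweets at cosine >= 0.9999
dup_pairs <- subset(sim_pairs, cosine >= 0.9999)
nrow(dup_pairs)
# number of distinct tweets that have at least one (near-)duplicate
length(unique(c(as.character(dup_pairs$document1),
                as.character(dup_pairs$document2))))
Note that 162,743 pairs with cosine 1 does not mean 162,743 identical tweets: a cluster of k identical tweets contributes k*(k-1)/2 pairs, so counting distinct document names is the safer way to answer "how many tweets".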

Related

is there a way to filter words by length in a bag of words matrix in r?

I have created a matrix in R (called bag_of_words). I need to compute the top 100 most popular words (most occurrences), but filter tokens by length (min. size = 4 and max. size = 20) and indicate the total occurrences of those words.
I have created code to find the top 100 words without this filter, which works, but I cannot find a way of filtering words in the matrix by length. Any help would be appreciated.
My attempt:
#view the top 100 most common words
term_f <- colSums(bag_of_words)
term_f <- sort(term_f, decreasing = T)
term_f[1:100]
Maybe I did not understand your question, but I think a vector might be easier to handle, especially if it is a column of a data.table:
library(data.table)
list_words <- data.table(x = as.character(bag_of_words))
If you only want words between 4 and 20 characters, use nchar():
list_words <- list_words[nchar(x) %between% c(4, 20)]
Count the number of occurrences of each word:
list_words <- list_words[, .(n = .N), by = "x"]
Get the top 100:
list_words <- list_words[order(-n)][1:100]
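For completeness, a minimal base R sketch that stays with the bag_of_words matrix from the question, building on the colSums() attempt there (assuming the words are the column names of the matrix):
term_f <- colSums(bag_of_words)              # total occurrences per word
term_f <- term_f[nchar(names(term_f)) >= 4 &
                 nchar(names(term_f)) <= 20] # keep words of 4 to 20 characters
head(sort(term_f, decreasing = TRUE), 100)   # top 100 by frequency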
I am not sure what NLP infrastructure you are using, but my recommendation is to use quanteda. If you don't have the package, just install it from CRAN with install.packages("quanteda").
Please find below a way to easily solve your issue ahead of computing token frequencies.
library(quanteda)
text <- c("some short tokens, but maybe just fine.",
          "thesearesomeverylongtokens.",
          "v e r y s hort tokens")
mycorp <- corpus(text)
mytok <- tokens(mycorp)
my_selected_tok <- tokens_keep(mytok, min_nchar = 4, max_nchar = 20)
mydfm <- dfm(my_selected_tok)
frequencies <- textstat_frequency(mydfm)
> frequencies
feature frequency rank docfreq group
1 tokens 2 1 2 all
2 some 1 2 1 all
3 short 1 2 1 all
4 maybe 1 2 1 all
5 just 1 2 1 all
6 fine 1 2 1 all
7 hort 1 2 1 all
> class(frequencies)
[1] "frequency" "textstat" "data.frame"

how to create a contingency table for each row of a data frame

I have a large data frame with rows as species and counts from 2 years as columns. I want to create a contingency table for each row in order to test if there was a significant change (decrease) from the first to the second year. Here is similar pretend data:
Species 2016 2017
cat 14 8
dog 16 12
bird 10 5
and then for each row I want a table like:
cat      2016 2017
present    14    8
absent      0    6
dog      2016 2017
present    16   12
absent      0    4
bird     2016 2017
present    10    5
absent      0    5
With this I will then do a Fisher's Exact Test on each table to test if the decrease was significant or not.
I think this can be done with dplyr or apply looping through the rows, similar to the linked question below, but I am unsure how to build the correct list of tables first: How to convert data frame to contingency table in R?
I started with one row at a time:
A <- df[1,1:3]
A[2,] <- 0
A[2,3] <- (A[1,2] - A[1,3])
fisher.test(A[2:3])
Suggestions on how to apply this to a large number of rows would be greatly appreciated! My brain really struggles with coding.
One tidyverse possibility could be:
library(tidyverse)
library(broom)
df %>%
  rowid_to_column() %>%
  gather(var, present, -c(Species, rowid)) %>%
  arrange(rowid, var) %>%
  group_by(rowid) %>%
  mutate(absent = lag(present, default = first(present)) - present) %>%
  ungroup() %>%
  select(-rowid, -var) %>%
  nest(present, absent) %>%
  mutate(p_value = data %>%
           map(~ fisher.test(.)) %>%
           map(tidy) %>%
           map_dbl(pluck, "p.value")) %>%
  select(-data)
Species p_value
<chr> <dbl>
1 cat 0.0159
2 dog 0.101
3 bird 0.0325
Here it first performs a wide-to-long transformation, excluding the "Species" column and the column holding the row ID. Second, it arranges the data according to the row ID and the original column names referring to years, and groups by the row ID. Third, it calculates the differences between the years. Finally, it nests the present and absent variables per species, performs the fisher.test(), and returns the p-value for each species.
Here is a solution using base R. You can probably use some of the ideas in this answer to make a much more concise answer. Let me know if this works for you!
# Create data frame
df <- data.frame(Species = c("cat", "dog", "bird"),
                 year_2016 = c(14, 16, 10),
                 year_2017 = c(8, 12, 5),
                 stringsAsFactors = FALSE)
# Create columns to later convert to a matrix
df$absent <- 0
df$present <- df$year_2016 - df$year_2017
# Transpose the data frame to use lapply
df_t <- t(df)
colnames(df_t) <- as.vector(df_t[1, ])
df_t <- df_t[-1, ]
class(df_t) <- "numeric"
# Use lapply to create matrices
matrix_list <- lapply(1:ncol(df_t), function(x) matrix(as.vector(df_t[, x]), 2, 2, byrow = TRUE))
names(matrix_list) <- colnames(df_t)
matrix_list
matrix_list
$cat
[,1] [,2]
[1,] 14 8
[2,] 0 6
$dog
[,1] [,2]
[1,] 16 12
[2,] 0 4
$bird
[,1] [,2]
[1,] 10 5
[2,] 0 5
# Lots of fisher.tests
lapply(matrix_list, fisher.test)
$cat
Fisher's Exact Test for Count Data
data: X[[i]]
p-value = 0.01594
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.516139 Inf
sample estimates:
odds ratio
Inf
$dog
Fisher's Exact Test for Count Data
data: X[[i]]
p-value = 0.1012
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.7200866 Inf
sample estimates:
odds ratio
Inf
$bird
Fisher's Exact Test for Count Data
data: X[[i]]
p-value = 0.03251
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.195396 Inf
sample estimates:
odds ratio
Inf
And then if you want the p-values, you can get them in a vector using sapply() (after storing the test results, e.g. tests <- lapply(matrix_list, fisher.test)):
sapply(tests, "[[", "p.value")
cat dog bird
0.01594203 0.10122358 0.03250774
EDIT: this is probably a slight improvement. It is a little more concise. I can check how it scales with microbenchmark later today if you are concerned with performance (or you have a large number of tests to run). Also, remember to penalize those p-values with all those tests ;). Also, @tmfmnk posted a great tidyverse solution if you prefer tidyverse over base.
# Create columns to later convert to a matrix
df$absent <- 0
df$present <- df$year_2016 - df$year_2017
df_t <- t(df[-1])  # transpose the data frame, excluding the Species column
# Use lapply to create the list of matrices
matrix_list <- lapply(1:ncol(df_t), function(x) matrix(as.vector(df_t[, x]), 2, 2, byrow = TRUE))
names(matrix_list) <- df$Species
# Running the fisher's test on every matrix
# in the list and extracting the p-values
tests <- lapply(matrix_list, fisher.test)
sapply(tests, "[[", "p.value")
cat dog bird
0.01594203 0.10122358 0.03250774
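On the note above about penalizing the p-values: with many species you would typically adjust for multiple testing, for example (a sketch using the tests list from above; the adjustment method is a choice, not part of the original answer):
pvals <- sapply(tests, "[[", "p.value")
p.adjust(pvals, method = "BH")  # Benjamini-Hochberg adjusted p-values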
Last EDIT: I was able to run them through microbenchmark and wanted to post the results for anyone who comes across this post in the future:
Unit: milliseconds
          expr    min     lq   mean median     uq    max neval
 tidyverse_sol 12.506 13.497 15.130 14.560 15.827 26.205   100
      base_sol  1.120  1.162  1.339  1.225  1.296  5.712   100
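For reference, the timings above come from a call along these lines; tidyverse_fisher() and base_fisher() are hypothetical wrappers around the two solutions shown above, not functions from the original answer:
library(microbenchmark)
# hypothetical wrapper functions around the two solutions above
microbenchmark(
  tidyverse_sol = tidyverse_fisher(df),
  base_sol      = base_fisher(df),
  times = 100
)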

R tm TermDocumentMatrix based on a sparse matrix

I have a collection of books in txt format and want to apply some procedures of the tm R library to them. However, I prefer to clean the texts in bash rather than in R because it is much faster.
Suppose I am able to get from bash a data.frame such as:
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
I know that TermDocumentMatrices are actually sparse matrices with metadata. In fact, I can create a sparse matrix from the TDM using the TDM's i, j and v entries for the i, j and x ones of the sparseMatrix function. Please help me if you know how to do the inverse, or in this case, how to construct a TDM by using the three columns in the data.frame above. Thanks!
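(For reference, the direction the asker already has works roughly as below, since a TermDocumentMatrix stores its data as a slam simple triplet matrix; this is a sketch assuming an existing TDM called tdm.)
library(Matrix)
# the i/j/v components of the triplet representation map directly onto sparseMatrix()
sparse_tdm <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                           dims = dim(tdm), dimnames = dimnames(tdm))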
You could try
library(tm)
library(reshape2)
txt <- readLines(n = 7)
book term frequency
--------------------
1 the 10
1 zoo 2
2 animal 2
2 car 3
2 the 20
df <- read.table(header=T, text=txt[-2])
dfwide <- dcast(data = df, book ~ term, value.var = "frequency", fill = 0)
mat <- as.matrix(dfwide[, -1])
dimnames(mat) <- setNames(dimnames(dfwide[-1]), names(df[, 1:2]))
(tdm <- as.TermDocumentMatrix(t(mat), weighting = weightTf))
# <<TermDocumentMatrix (terms: 4, documents: 2)>>
# Non-/sparse entries: 5/3
# Sparsity : 38%
# Maximal term length: 6
# Weighting : term frequency (tf)
as.matrix(tdm)
# Docs
# Terms 1 2
# animal 0 2
# car 0 3
# the 10 20
# zoo 2 0
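Alternatively, because the class is triplet-based internally, you can skip the wide reshape and build the object straight from the three columns with slam. This is a sketch (assuming the df read in above, and assuming tm's as.TermDocumentMatrix() coercion for simple triplet matrices):
library(slam)
terms <- factor(df$term)
books <- factor(df$book)
# build the triplet representation directly from the long data frame
stm <- simple_triplet_matrix(i = as.integer(terms),
                             j = as.integer(books),
                             v = df$frequency,
                             dimnames = list(Terms = levels(terms),
                                             Docs = levels(books)))
tdm2 <- as.TermDocumentMatrix(stm, weighting = weightTf)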

Return values from a Correlation Matrix in R

I have a correlation matrix (called correl) that is 390 x 390, and I would like to scan for values that are between 0.80 and 0.99. I have written the following loop:
cc1 <- NA  # creates an NA vector to store values between 0.80 & 0.99
cc2 <- NA  # creates an NA vector to store desired values
p <- dim(correl)[2]  # dim returns the size of the correlation matrix
i <- 1
while (i <= p) {
  cc1 <- correl[, correl[, i] >= 0.80 & correl[, i] < 1.00]
  cc2 <- cbind(cc2, cc1)
  i <- i + 1
}
The problem I am having is that undesired correlations (those below 0.80) also end up in cc2.
#Sample of what I mean:
SPY.Adjusted AAPL.Adjusted CHL.Adjusted CVX.Adjusted
1 SPY.Adjusted 1.0000000 0.83491778 0.6382930 0.8568000
2 AAPL.Adjusted 0.8349178 1.00000000 0.1945304 0.1194307
3 CHL.Adjusted 0.6382930 0.19453044 1.0000000 0.2991739
4 CVX.Adjusted 0.8568000 0.11943067 0.2991739 1.0000000
5 GE.Adjusted 0.6789054 0.13729877 0.3356743 0.5219169
6 GOOGL.Adjusted 0.5567947 0.10986655 0.2552149 0.2128337
I only want to return the correlations within the desired range (0.80 to 0.99) without losing the row names or column names, as otherwise I would not know which is which.
Let's create a simple reproducible example
m = matrix(runif(100), ncol=10)
rownames(m) = LETTERS[1:10]
colnames(m) = rownames(m)
The tricky part is getting a nice return structure that contains the variable names. So I would collapse the matrix into a standard data frame
dd = data.frame(cor = as.vector(m),
                id1 = rownames(m),
                id2 = rep(rownames(m), each = nrow(m)))
Remove duplicate entries
dd = dd[as.vector(upper.tri(m, TRUE)),]
Then select as usual
dd[dd$cor > 0.8 & dd$cor < 0.99,]
Glad you found an answer, but here's another that puts the results in a tidy data frame just in case others are looking for this.
This solution uses the corrr package (and using dplyr functions that are attached with it):
library(corrr)
mtcars %>%
correlate() %>%
shave() %>%
stretch(na.rm = TRUE) %>%
filter(between(r, .8, .99))
#> # A tibble: 3 × 3
#> x y r
#> <chr> <chr> <dbl>
#> 1 cyl disp 0.9020329
#> 2 cyl hp 0.8324475
#> 3 disp wt 0.8879799
Explanation:
mtcars is the data.
correlate() creates a correlation data frame.
shave() is optional and removes the upper triangle (to remove duplicates).
stretch() converts the data frame (in matrix format) to a long format.
filter(between(r, .8, .99)) selects only the correlations between .8 and .99
If I understood your problem correctly, one wouldn't expect a symmetric matrix as the return object. For every one of your variables, you want to extract the other variables that are highly correlated with it, but how many there are differs from variable to variable, so you cannot work with a matrix.
If you insist on a matrix/data frame, I would rather replace small correlations with NA
correl[correl<0.8] <- NA
and then access the column names for highly correlated with variable (e.g. in the first row) like this
colnames(correl)[!is.na(correl[1,])]
(Although then the NA step is kind of useless, as you could access the column names directly with the constraint
colnames(correl)[correl[1, ] > 0.8]
)
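Another base R option that keeps the names is which() with arr.ind = TRUE; a minimal sketch for the correl matrix from the question:
# indices of correlations in [0.80, 0.99], upper triangle only to avoid duplicate pairs
hits <- which(correl >= 0.80 & correl <= 0.99 & upper.tri(correl), arr.ind = TRUE)
data.frame(var1 = rownames(correl)[hits[, "row"]],
           var2 = colnames(correl)[hits[, "col"]],
           cor = correl[hits])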

list of word frequencies using R

I have been using the tm package to run some text analysis.
My problem is with creating a list of words and their associated frequencies.
library(tm)
library(RWeka)
txt <- read.csv("HW.csv",header=T)
df <- do.call("rbind", lapply(txt, as.data.frame))
names(df) <- "text"
myCorpus <- Corpus(VectorSource(df$text))
myStopwords <- c(stopwords('english'),"originally", "posted")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
#building the TDM
btm <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
myTdm <- TermDocumentMatrix(myCorpus, control = list(tokenize = btm))
I typically use the following code for generating a list of words in a frequency range:
frq1 <- findFreqTerms(myTdm, lowfreq=50)
Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that I face is with converting the term-document matrix into a data frame. As I am working on large samples of data, I run into memory errors.
Is there a simple solution for this?
Try this
data("crude")
myTdm <- as.matrix(TermDocumentMatrix(crude))
FreqMat <- data.frame(ST = rownames(myTdm),
Freq = rowSums(myTdm),
row.names = NULL)
head(FreqMat, 10)
# ST Freq
# 1 "(it) 1
# 2 "demand 1
# 3 "expansion 1
# 4 "for 1
# 5 "growth 1
# 6 "if 1
# 7 "is 2
# 8 "may 1
# 9 "none 2
# 10 "opec 2
I have the following lines in R that can help to create word frequencies and put them in a table. The code reads a file of text in .txt format and creates the frequencies of words; I hope this can help anyone interested.
avisos <- scan("anuncio.txt", what = "character", sep = "\n")
avisos1 <- tolower(avisos)
avisos2 <- strsplit(avisos1, "\\W")
avisos3 <- unlist(avisos2)
freq <- table(avisos3)
freq1 <- sort(freq, decreasing = TRUE)
temple.sorted.table <- paste(names(freq1), freq1, sep = "\t")
# write to a separate file so the input file is not overwritten
cat("Word\tFREQ", temple.sorted.table, file = "anuncio_freq.txt", sep = "\n")
Looking at the source of findFreqTerms, it appears that the function slam::row_sums does the trick when called on a term-document matrix. Try, for instance:
data(crude)
slam::row_sums(TermDocumentMatrix(crude))
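To get the same kind of frequency table as in the first answer without densifying the matrix (which helps with the memory errors mentioned in the question), the sparse row sums can be used directly; a sketch:
library(tm)
library(slam)
data("crude")
myTdm <- TermDocumentMatrix(crude)
freqs <- slam::row_sums(myTdm)              # term totals, computed on the sparse matrix
FreqMat <- data.frame(ST = names(freqs),
                      Freq = as.integer(freqs),
                      row.names = NULL)
FreqMat <- FreqMat[order(-FreqMat$Freq), ]  # most frequent terms first
head(FreqMat, 10)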
Depending on your needs, using some tidyverse functions might be a rough solution that offers some flexibility in terms of how you handle capitalization, punctuation, and stop words:
text_string <- 'I have been using the tm package to run some text analysis. My problem is with creating a list with words and their frequencies associated with the same. I typically use the following code for generating list of words in a frequency range. Is there any way to automate this such that we get a dataframe with all words and their frequency?
The other problem that i face is with converting the term document matrix into a data frame. As i am working on large samples of data, I run into memory errors. Is there a simple solution for this?'
stop_words <- c('a', 'and', 'for', 'the') # just a sample list of words I don't care about
library(tidyverse)
data_frame(text = text_string) %>%
  mutate(text = tolower(text)) %>%
  mutate(text = str_remove_all(text, '[[:punct:]]')) %>%
  mutate(tokens = str_split(text, "\\s+")) %>%
  unnest() %>%
  count(tokens) %>%
  filter(!tokens %in% stop_words) %>%
  mutate(freq = n / sum(n)) %>%
  arrange(desc(n))
# A tibble: 64 x 3
tokens n freq
<chr> <int> <dbl>
1 i 5 0.0581
2 with 5 0.0581
3 is 4 0.0465
4 words 3 0.0349
5 into 2 0.0233
6 list 2 0.0233
7 of 2 0.0233
8 problem 2 0.0233
9 run 2 0.0233
10 that 2 0.0233
# ... with 54 more rows
library(plyr)  # count() with the vars argument comes from plyr
a <- scan(file = '~/Desktop//test.txt', what = "character")
a1 <- data.frame(lst = a)
count(a1, vars = "lst")
This seems to work to get simple frequencies. I've used scan() because I had a txt file, but it should work with read.csv too.
Does apply(myTdm, 1, sum) or rowSums(as.matrix(myTdm)) give the ngram counts you're after?
