Creating LDA model using gensim from bag-of-words vectors

I want to create a topic model from data provided by JSTOR (e.g. https://www.jstor.org/dfr/about/sample-datasets). Because of copyright, they do not allow full-text access; instead, I can request a list of unigrams followed by their frequencies in the document (supplied as plain .txt), e.g.:
his 295
old 181
he 165
age 152
p 110
from 79
life 74
de 71
petrarch 58
book 51
courtier 47
This should be easy to convert to a bag-of-words vector. However, I have only found examples of gensim LDA models being built from full text. Would it be possible to pass it these vectors instead?

Yes: you only need to convert each (word, frequency) pair to (word_id, frequency) and pass a list of such tuple lists as the corpus to any gensim model. To convert a word to an id, first collect the vocabulary of the whole corpus; if it contains V distinct words, each word can be represented as an integer between 1 and V (gensim's Dictionary class does this for you, numbering words from 0 to V - 1).
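For instance, a minimal sketch (the file names and the number of topics are placeholders; the parsing assumes the two-column format shown in the question):
from gensim import corpora, models

def read_unigrams(path):
    # Parse lines like "petrarch 58" into (word, count) pairs.
    with open(path) as f:
        for line in f:
            word, count = line.split()
            yield word, int(count)

# One JSTOR unigram file per document (hypothetical file names).
docs = [dict(read_unigrams(p)) for p in ["doc1.txt", "doc2.txt"]]

# Map each distinct word across the corpus to an integer id.
dictionary = corpora.Dictionary([list(d.keys()) for d in docs])

# A gensim corpus is just a list of (word_id, frequency) tuple lists,
# so the full text is never needed.
corpus = [[(dictionary.token2id[w], n) for w, n in d.items()] for d in docs]

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10)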

Related

Rolling subset of data frame within for loop in R

Big picture explanation: I am trying to do a sliding-window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based on other biological factors) for two years (2014 and 2015), with one value of PAR per day. See below the first few lines of the data frame (the data frame is named "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two-week windows (14 rows) from start to finish, sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14, the second window rows 8 to 21, and so forth. After subsetting, the data needs to be reshaped (currently using the melt function in the reshape2 package) so that the PAR values are in one column and the variable par14 or par15 is in the other. Then I need to get rid of the NaN data and finally perform a Wilcoxon rank-sum test on each window, comparing PAR by year (par14 or par15). Below is the code I wrote as a proof of concept; for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of, I would love to be enlightened. I did try rollapply but ran into problems finding a way to apply it to an entire data frame and not just one column. I have searched the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated by a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply over a sequence stepping by 7 through the 365 days of the year (the last day excluded so the final grouping doesn't begin on a single leftover day), returning a list of data frames of Wilcoxon test p-values with a week indicator. Then row-bind the list items into a final, single data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
  par.sub <- rollingpar[i:(i+13), ]
  par.sub <- melt(par.sub)
  par.sub <- na.omit(par.sub)
  par.sub$variable <- as.factor(par.sub$variable)
  data.frame(week = paste0("Week: ", i %/% 7 + 1, "-", i %/% 7 + 2),
             p.values = wilcox.test(value ~ variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
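As an aside on the rollapply issue mentioned in the question: zoo's rollapply() can operate on a whole matrix rather than column by column if you set by.column = FALSE, so a sketch of an equivalent approach (untested, reusing the same rollingpar data and reshape2's melt from above) would be:
library(zoo)
# by = 7 slides the 14-row window a week at a time; by.column = FALSE
# hands each window to FUN as a complete 14 x 2 matrix.
pvals <- rollapply(as.matrix(rollingpar), width = 14, by = 7,
                   by.column = FALSE, align = "left",
                   FUN = function(w) {
                     w <- na.omit(melt(as.data.frame(w)))
                     wilcox.test(value ~ variable, w)$p.value
                   })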

Single line user input in R

I want to input numeric values from the user in R, all on one line. readline() does read the values but returns them as a single character string, so I can't do statistical operations on them, whereas scan() doesn't take multiple numeric values in one line. Please help.
Sample Input
630 135 146 233 144 498 729 120 511 670
Can you suggest a way to prompt the user for these values and store them as numeric, so that I can perform basic statistical operations on them?
as.numeric(unlist(strsplit(readline()," ")))
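For example, a quick interactive check with the sample input from the question (the prompt text is just illustrative):
x <- as.numeric(unlist(strsplit(readline("Enter values: "), " ")))
# user types: 630 135 146 233 144 498 729 120 511 670
mean(x)
# [1] 381.6
Splitting on "\\s+" instead of a single space makes this robust to repeated spaces between values.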

R Text Mining with quanteda

I have a data set of Facebook posts (obtained via Netvizz) and I use the quanteda package in R. Here is my R code.
# Load the relevant dictionary (relevant for analysis)
liwcdict <- dictionary(file = "D:/LIWC2001_English.dic", format = "LIWC")
# Read File
# Facebooks posts could be generated by FB Netvizz
# https://apps.facebook.com/netvizz
# Load FB posts as .csv-file from .zip-file
fbpost <- read.csv("D:/FB-com.csv", sep=";")
# Define the relevant column(s)
fb_test <- as.character(fbpost$comment_message) # one column with 2700 entries
# Define as corpus
fb_corp <-corpus(fb_test)
class(fb_corp)
# LIWC Application
fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
View(fb_liwc)
Everything works until:
> fb_liwc<-dfm(fb_corp, dictionary=liwcdict)
Creating a dfm from a corpus ...
... indexing 2,760 documents
... tokenizing texts, found 77,923 total tokens
... cleaning the tokens, 1584 removed entirely
... applying a dictionary consisting of 68 key entries
Error in `dimnames<-.data.frame`(`*tmp*`, value = list(docs = c("text1", :
invalid 'dimnames' given for data frame
How would you interpret the error message? Are there any suggestions to solve the problem?
There was a bug in quanteda version 0.7.2 that caused dfm() to fail when using a dictionary if one of the documents contained no features. Your example fails because some of the Facebook post "documents" end up having all of their features removed in the cleaning stage.
This is not only fixed in 0.8.0, but we also changed the underlying implementation of dictionaries in dfm(), resulting in a significant speed improvement. (The LIWC is still a large and complicated dictionary, and its regular expressions still mean it is much slower to use than simply indexing tokens. We will work on optimising this further.)
# install the development version of quanteda from GitHub
devtools::install_github("kbenoit/quanteda")
liwcdict <- dictionary(file = "LIWC2001_English.dic", format = "LIWC")
mydfm <- dfm(inaugTexts, dictionary = liwcdict)
## Creating a dfm from a character vector ...
## ... indexing 57 documents
## ... lowercasing
## ... tokenizing
## ... shaping tokens into data.table, found 134,024 total tokens
## ... applying a dictionary consisting of 68 key entries
## ... summing dictionary-matched features by document
## ... indexing 68 feature types
## ... building sparse matrix
## ... created a 57 x 68 sparse dfm
## ... complete. Elapsed time: 14.005 seconds.
topfeatures(mydfm, decreasing=FALSE)
## Fillers Nonfl Swear TV Eating Sleep Groom Death Sports Sexual
## 0 0 0 42 47 49 53 76 81 100
It will also work if a document contains zero features after tokenization and cleaning, which is probably what is breaking the older dfm you are using with your Facebook texts.
mytexts <- inaugTexts
mytexts[3] <- ""
mydfm <- dfm(mytexts, dictionary = liwcdict, verbose = FALSE)
which(rowSums(mydfm)==0)
## 1797-Adams
## 3

Can I check the frequencies of predetermined words or phrases in document clustering using R?

I'm doing text mining using the "tm" package in R, and I can get word frequencies after I generate a document-term matrix:
freq <- colSums(as.matrix(dtm))
ord <- order(freq)
freq[head(ord)]
# abit acal access accord across acsess
# 1 1 1 1 1 1
freq[tail(ord)]
# direct save month will thank list
# 106 107 116 122 132 154
This only gives me the word frequencies in sorted order. I was wondering if I can check a word's frequency individually, and also a phrase's frequency. For example, how many times does the word "thank" appear in the corpus, and what is the frequency of the phrase "contact number"?
Many thanks for any hints and suggestions.
I'll show this using the crude data that comes with the tm package:
library(tm)
data(crude)
dtm <- as.matrix(DocumentTermMatrix(crude))
#find the column that contains the word "demand"
columnindices <- which(colnames(dtm)=="demand")
#how often does the word "demand" show up?
sum(dtm[,columnindices])
# [1] 6
If you want to do this with phrases, your dtm must contain those phrases as terms, not just the bag of single words used in most cases. If such data is available, the procedure is the same as for a single word.
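For illustration, a sketch of one way to get phrase counts from the same crude data: supply a bigram tokenizer to DocumentTermMatrix() so that two-word terms end up as columns (the tokenizer below is a minimal base-R version, and "crude oil" is just an example phrase from this corpus):
# Minimal bigram tokenizer: lowercase, strip punctuation,
# then paste adjacent word pairs together.
BigramTokenizer <- function(x) {
  words <- unlist(strsplit(gsub("[[:punct:]]", "", tolower(as.character(x))), "\\s+"))
  words <- words[nzchar(words)]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}
dtm2 <- as.matrix(DocumentTermMatrix(crude,
  control = list(tokenize = BigramTokenizer)))
# frequency of the phrase "crude oil" across the corpus
sum(dtm2[, colnames(dtm2) == "crude oil"])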

Interpret knn.cv (R) results after applying on data set

I have encountered a problem while using the k-nearest-neighbors algorithm (with cross-validation) on a data set in R: knn.cv from the FNN package.
The data set consists of 4601 email cases with 58 attributes: 57 describe character or word frequencies in the emails (numeric, range [0, 100]), and the last one indicates whether the email is spam (value 1) or ham (value 0).
After specifying the train and cl arguments and using 10 neighbors, running the function returns a list of all the emails with values like 7.4032 in each column, which I don't know how to use. I need to find the percentages of spam and ham the function classifies and compare them with the correct percentages. How should I interpret these results?
Given that the data set you describe matches (exactly) the spam data set in the ElemStatLearn package accompanying the well-known book by the same title, I'm wondering if this is in fact a homework assignment. If that's the case, it's ok, but you should add the homework tag to your question.
Here are some pointers.
The documentation for the function knn.cv says that it returns a vector of classifications, along with the distances and indices of the k nearest neighbors as "attributes". So when I run this:
library(FNN)
library(ElemStatLearn) # provides the spam data set
out <- knn.cv(spam[, -58], spam[, 58], k = 10)
The object out looks sort of like this:
> head(out)
[1] spam spam spam spam spam email
Levels: email spam
The other values you refer to are sort of "hidden" as attributes, but you can see that they are there using str:
> str(out)
Factor w/ 2 levels "email","spam": 2 2 2 2 2 1 1 1 2 2 ...
- attr(*, "nn.index")= int [1:4601, 1:10] 446 1449 500 5 4 4338 2550 4383 1470 53 ...
- attr(*, "nn.dist")= num [1:4601, 1:10] 8.10e-01 2.89 1.50e+02 2.83e-03 2.83e-03 ...
You can access those additional attributes via something like this:
nn.index <- attr(out,'nn.index')
nn.dist <- attr(out,'nn.dist')
Note that both of these objects end up being matrices of dimension 4601 x 10, which makes sense, since the documentation said that they recorded the index (i.e. row number) of the k = 10 nearest neighbors as well as the distances to each.
For the last bit, you will probably find the table() function useful, as well as prop.table().
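For instance, a sketch of that last step, continuing from out above (spam[, 58] holds the true labels):
# confusion matrix of predicted vs. actual classes
table(predicted = out, actual = spam[, 58])
# predicted and actual spam/email proportions
prop.table(table(out))
prop.table(table(spam[, 58]))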
