lda.collapsed.gibbs.sampler model and top words ranking - r

I have a model generated by the function lda.collapsed.gibbs.sampler from the lda package, and I need to know the "relevance" of the top words.
When using
top.topic.words(result$topics, 10, by.score=TRUE)
I get a list of the top 10 words for each topic, but I'd like to see what percentage of the topic those 10 words represent. I guess the information exists, because there is a "score", but I'm not really familiar with the statistical methods behind the Gibbs sampler.
Thanks in advance!

I think something like this may be what you want:
for (ii in 1:nrow(result$topics)) {
  print(
    head(
      cumsum(
        sort(result$topics[ii, ], decreasing = TRUE)  # word counts for topic ii, largest first
      ),
      n = 20
    ) / result$topic_sums[ii]                         # divide by the total assignments to topic ii
  )
}
Let's break it down. If you want the fraction of Gibbs assignments, that is easy. The LDA routine returns the number of assignments to each (word, topic) pair, so all you have to do is sort each row of result$topics to get the top words (this is essentially what top.topic.words does if you set by.score=FALSE). Once you have a row in sorted order you can see, for each topic, how many counts occur for each word versus for the entire topic; to do that I divide by result$topic_sums, which contains the total number of assignments to each topic. Finally, I use cumsum so you can see the running total weight of the top words in that topic.

Related

Making a for loop in r

I am just getting started with R, so I am sorry if I say things that don't make sense.
I am trying to make a for loop that does the following:
l_dtest[[1]]<-vector()
l_dtest[[2]]<-vector()
l_dtest[[3]]<-vector()
l_dtest[[4]]<-vector()
l_dtest[[5]]<-vector()
all the way up to whatever number is assigned to n. For example, if n were chosen to be 100, it would repeat this all the way to l_dtest[[100]]<-vector().
I have tried multiple different attempts at doing this and here is one of them.
n <- 4
p <- (1:n)
l_dtest <- list()
for (i in p) {
  print((l_dtest[i] <- vector()) <- i)
}
Again I am VERY new to R so I don't know what I am doing or what is wrong with this loop.
The detailed background for why I need to do this is that I need to write an R function that receives as input the population size n, runs a simulation of the model below with that population size, and returns the number of generations it took to reach an MRCA (most recent common ancestor).
Here is the model:
We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made randomly and equally likely over the n possibilities, and each individual chooses twice. All choices are made independently. Thus, for example, it is possible that, when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.
I don't understand the specific step at the beginning of this post or why I need to do it, but my teacher said I do. I don't know if this helps, but the next step is choosing parents for the first person and then combining the lists from the step I posted with a previous step. It looks like this:
sample(1:5, 2, replace=T)
#[1] 1 2
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[1]]) # To my understanding, l_dtemp[[1]] now receives the list of descendants from l_d[[1]], because the latter chose l_dtemp[[1]] as its first parent
l_dtemp[[2]]<-union(l_dtemp[[2]], l_d[[1]]) # Same as above, but for l_d[[1]]'s second choice, which is l_dtemp[[2]]
sample(1:5, 2, replace=T)
#[1] 1 3
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[2]])
l_dtemp[[3]]<-union(l_dtemp[[3]], l_d[[2]])
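For the list-initialization step this question starts with, a minimal sketch (not from the original thread; n is assumed to already hold the population size, e.g. n <- 100):
l_dtest <- vector("list", n)   # pre-allocate a list with n elements
for (i in seq_len(n)) {
  l_dtest[[i]] <- vector()     # l_dtest[[1]] <- vector(), ..., l_dtest[[n]] <- vector()
}
The same list can be built without an explicit loop as replicate(n, vector(), simplify = FALSE).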

r + tfidf and inverse document frequency

I was hoping that someone could explain a specific part of an academic paper and assist in writing R code for that section:
Name of paper: Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health (https://cs.stanford.edu/people/jure/pubs/counseling-tacl16.pdf)
On page 5, we have the following snippet:
"...build a TF-IDF vector of word occurrences to represent the language of counselors within this subset. We use the global inverse document (i.e., conversation) frequencies instead of the ones from each subset to make the vectors directly comparable and control for different counselors having different numbers of conversations by weighting conversations so all counselors have equal contributions."
What does the paper mean by "global inverse document frequency"?
How can I code this in R with the different subsets (positive and negative counsellors, for example)?
Here is my sample code:
library(tm)    # Corpus, DocumentTermMatrix, weightTfIdf
library(slam)  # crossprod_simple_triplet_matrix, col_sums

corp_pos_1 <- Corpus(VectorSource(positive_chats$Text1))
# corp_pos_1 <- tm_map(corp_pos_1, removeWords, stopwords("english"))
tdm_pos_1 <- DocumentTermMatrix(corp_pos_1,
                                control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE)))

ui <- unique(tdm_pos_1$i)     # keep only documents that contain at least one term
tdm_pos_1 <- tdm_pos_1[ui, ]

cosine_tdm_pos_1 <- crossprod_simple_triplet_matrix(tdm_pos_1) /
  (sqrt(col_sums(tdm_pos_1^2) %*% t(col_sums(tdm_pos_1^2))))
In the code, 'pos' stands for positive and 'neg' would stand for negative.
The number at the end of each variable name indicates which chunk is being calculated.
I currently have the data chunked into 5 different parts, trying to follow this paper. But how would I be able to calculate the "global inverse document frequency"?
I think I have found this related Stack Overflow question, but I am still not understanding the paper and what I need to do in R:
R: weighted inverse document frequency (tfidf) similarity between strings
TF/IDF is a well-known measure in information retrieval. For more information on it, and formulae that describe how to calculate it, see the Wikipedia page.
In short, you want to have words that are specific to texts; words that occur in all texts do not add any distinctive information. So, the inverse document frequency is the number of all documents divided by the number of documents that contain a given word. For common words such as "the" or "of", the IDF would be 1.0, as we would assume they occurred in all texts. For that reason they are often excluded as stop words. IDF can also be scaled, e.g. by taking the logarithm.
If I understand your application correctly, you would take a term and divide the total number of documents by the number of negative documents that contain the term.
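As a rough illustration (a sketch, not from the paper or this answer; it assumes two plain count matrices dtm_all and dtm_pos that share the same vocabulary in their columns), you would compute the IDF once on the full corpus and reuse it when weighting any subset:
# Global IDF: computed over ALL conversations, not just the positive subset
global_idf <- log(nrow(dtm_all) / colSums(dtm_all > 0))

# TF from the positive subset, weighted with the global IDF
tfidf_pos <- sweep(dtm_pos, 2, global_idf, `*`)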

Randomly pairing elements of a vector in R to count unique arrangements

Background:
On this combinatorics question, the issue is how to determine the sample space: the number of ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see comments on the OP) and 105 (see the edit within the OP and the answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now-deleted answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, are the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of unlist is from #Song Zhengyi, and if his answer is un-deleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 is to avoid over-counting since for every possible unique draw, each one of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
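As a direct arithmetic check of that count (independent of the simulation), the same 105 falls out of 8! / (4! * 2^4):
> factorial(8) / (factorial(4) * 2^4)
[1] 105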
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.

How to run a function on EACH of my observations in R?

My problem is as follows:
I have a dataset of 6000 observations containing information from customers (each observation is one client's information).
I'm optimizing a given function (in my case a profit function) in order to find an optimal value for my variable of interest. Specifically, I'm looking for the optimal interest rate I should offer in order to maximize my expected profits.
I don't have any doubt about my function. The problem is that I don't know how I should proceed in order to apply this function to EACH OBSERVATION, so as to obtain an OPTIMAL INTEREST RATE for EACH OF MY 6000 CLIENTS (or observations, as you prefer).
Until now, it has been easy to find the UNIQUE optimum (the same for all clients) that maximizes my profits (this is the global maximum, I guess). But what I need to know is how I should proceed in order to apply my optimization problem to EACH of my 6000 observations, INDIVIDUALLY, so that I end up with the optimal interest rate to offer to each customer (that is, 6000 optimal interest rates, one for each of them).
I guess I should do something similar to a for loop, but my experience in this area is limited, and I'm quite frustrated already. What's more, I've tried to use mapply(myfunction, mydata) as usual, but I only get error messages.
This is how my (really) simple code now looks like:
profits <- function(Rate)
  sum((Amount * (Rate - 1.2) / 100) *
        (1 / (1 + exp(0.600002438 - 0.140799335888812 *
                        ((Previous.Rate - Rate) + (Competition.Rate - Rate))))))
And the results for ONE optimum for the entire sample:
> optimise(profits, lower = 0, upper = 100, maximum = TRUE)
$maximum
[1] 6.644821
$objective
[1] 1347291
So the thing is, how do I rewrite my code in order to maximize this and obtain the optimal value of my variable of interest for EACH of my rows?
Hope I've been clear! Thank you all in advance!
It appears each of your customers is independent, so you can just put lapply() around the optimise() call:
lapply(customer_list, function(one_customer) {
  optimise(profits, lower = 0, upper = 100, maximum = TRUE)
})
This will return a very big list, where each list element has a $maximum and an $objective. You can then run sapply to add up the $objective values and see just how rich you have become!
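Note that for each customer's data to actually enter the objective, profits() has to take those values as arguments; a minimal sketch (assuming a data frame mydata with columns Amount, Previous.Rate and Competition.Rate, i.e. the names used in the question):
# Hypothetical per-row version of the profit function
profits_one <- function(Rate, Amount, Previous.Rate, Competition.Rate) {
  (Amount * (Rate - 1.2) / 100) *
    (1 / (1 + exp(0.600002438 - 0.140799335888812 *
                    ((Previous.Rate - Rate) + (Competition.Rate - Rate)))))
}

# One optimisation per row; the extra arguments are passed through optimise()'s ...
optimal <- lapply(seq_len(nrow(mydata)), function(i) {
  optimise(profits_one, lower = 0, upper = 100, maximum = TRUE,
           Amount = mydata$Amount[i],
           Previous.Rate = mydata$Previous.Rate[i],
           Competition.Rate = mydata$Competition.Rate[i])
})

optimal_rates    <- sapply(optimal, `[[`, "maximum")    # one optimal rate per customer
expected_profits <- sapply(optimal, `[[`, "objective")  # and the corresponding profit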

Calculate correlation coefficient between words?

For a text analysis program, I would like to analyze the co-occurrence of certain words in a text. For example, I would like to see that e.g. the words "Barack" and "Obama" appear more often together (i.e. have a positive correlation) than others.
This does not seem to be that difficult. However, to be honest, I only know how to calculate the correlation between two numbers, but not between two words in a text.
How can I best approach this problem?
How can I calculate the correlation between words?
I thought of using conditional probabilities, since, e.g., "Barack Obama" is much more probable than "Obama Barack"; however, the problem I'm trying to solve is more fundamental and does not depend on the ordering of the words.
The Ngram Statistics Package (NSP) is devoted precisely to this task. They have a paper online which describes the association measures they use. I haven't used the package myself, so I cannot comment on its reliability/requirements.
Well, a simple way to approach your question is by shaping the data into a 2x2 contingency table
             obama   not obama
barack         A         B
not barack     C         D
and scoring all occurring bigrams in the matrix. That way you can, for instance, use a simple chi-squared test.
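A sketch of that scoring in R, with made-up counts for the four cells (A..D would come from your corpus):
# Hypothetical counts: rows = barack / not barack, columns = obama / not obama
cont <- matrix(c(120,   15,      # barack:     with obama, without obama
                  10, 9855),     # not barack: with obama, without obama
               nrow = 2, byrow = TRUE,
               dimnames = list(c("barack", "not barack"),
                               c("obama", "not obama")))

chisq.test(cont)  # a small p-value suggests the two words co-occur more often than chance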
I don't know how this is commonly done, but I can think of one crude way to define a notion of correlation that captures word adjacency.
Suppose the text has length N, say it is an array
text[0], text[1], ..., text[N-1]
Suppose the following words appear in the text
word[0], word[1], ..., word[k]
For each word word[i], define a vector of length N-1
X[i] = array(); // of length N-1
as follows: the jth entry of the vector is 1 if word[i] is either the jth or the (j+1)th word of the text, and zero otherwise.
// compute the vector X[i]
for (j = 0:N-2) {
    if (text[j] == word[i] OR text[j+1] == word[i])
        X[i][j] = 1;
    else
        X[i][j] = 0;
}
Then you can compute the correlation coefficient between word[a] and word[b] as the dot product between X[a] and X[b] (note that the dot product is the number of times these words are adjacent) divided by the lengths (the length is the square root of the number of appearances of the word; well, maybe twice that). Call this quantity COR(X[a],X[b]). Clearly COR(X[a],X[a]) = 1, and COR(X[a],X[b]) is larger if word[a] and word[b] are often adjacent.
This can be generalized from "adjacent" to other notions of near - for example we could have chosen to use 3 word (or 4, 5, etc.) blocks instead. One can also add weights, probably do many more things as well if desired. One would have to experiment to see what is useful, if any of it is of use at all.
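A rough translation of that idea into R (a sketch under the same assumptions; tokens is assumed to be a character vector holding the words of the text in order):
# jth entry is 1 if word w is the jth or the (j+1)th token
adjacency_indicator <- function(tokens, w) {
  n <- length(tokens)
  as.integer(tokens[-n] == w | tokens[-1] == w)
}

# COR(X[a], X[b]): dot product of the indicator vectors divided by their lengths
cor_words <- function(tokens, a, b) {
  xa <- adjacency_indicator(tokens, a)
  xb <- adjacency_indicator(tokens, b)
  sum(xa * xb) / (sqrt(sum(xa)) * sqrt(sum(xb)))
}

# e.g. cor_words(tokens, "barack", "obama")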
This problem sounds like a bigram, a sequence of two "tokens" in a larger body of text. See this Wikipedia entry, which has additional links to the more general n-gram problem.
If you want to do a full analysis, you'd most likely take any given pair of words and do a frequency analysis. E.g., the sentence "Barack Obama is the Democratic candidate for President," has 8 words, so there are 8 choose 2 = 28 possible pairs.
You can then ask statistical questions like, "In how many pairs does 'Obama' follow 'Barack', and in how many pairs does some other word (not 'Obama') follow 'Barack'?" In this case, there are 7 pairs that include 'Barack', but in only one of them is it paired with 'Obama'.
Do the same for every possible word pair (e.g., "in how many pairs does 'candidate' follow 'the'?"), and you've got a basis for comparison.
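A quick way to enumerate those pairs in R (a sketch using the example sentence above):
words <- c("Barack", "Obama", "is", "the", "Democratic", "candidate", "for", "President")
pairs <- combn(words, 2)                               # all 2-word combinations
ncol(pairs)                                            # 28 pairs
sum(apply(pairs, 2, function(p) "Barack" %in% p))      # 7 of them contain "Barack"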
