Clustering unstructured text based on similarity and calculating optimum number of clusters - r

I am a data mining beginner and am trying to first formulate an approach to a clustering problem I am solving.
Suppose we have x writers, each with a particular style (use of unique words etc.). They each write multiple short texts, let's say a haiku. We collect many hundreds of these haikus from the authors and try to understand from the haikus, using context analysis, how many authors we had in the first place (we somehow lost records of how many authors there were, after a great war!)
Let's assume I create a hash table of words for each of these haikus. Then I could write a distance function that would look at the repetition of similar words between each vector. This could allow me to implement some sort of k-means clustering function.
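Roughly, I picture the first step looking something like the sketch below (just an illustration; tm and a plain document-term matrix are placeholder choices on my part):
library(tm)

# haikus is assumed to be a character vector with one haiku per element
corpus <- VCorpus(VectorSource(haikus))
dtm <- DocumentTermMatrix(corpus, control = list(tolower = TRUE, removePunctuation = TRUE))
X <- as.matrix(dtm)              # rows = haikus, columns = word counts

# k-means for one fixed guess at the number of authors
km <- kmeans(X, centers = 4)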
My problem now is to measure, probabilistically, the number of clusters, i.e. the number of authors, that would give me the optimum fit.
Something like:
number of authors | probability
1 | 0.05
2 | 0.1
3 | 0.2
4 | 0.4
5 | 0.1
6 | 0.05
7 | 0.03
8 | 0.01
The only constraint here is that as the number of authors (or clusters) goes to infinity, the sum of the probabilities should converge to 1, I think.
Does anyone have any thoughts or suggestions on how to implement this second part?

Let's formulate an approach using Bayesian statistics.
Pick a prior P(K) on the number of authors, K. For example, you might say K ~ Geometric(p) with support {1, 2, ... } where E[K] = 1 / p is the number of authors you expect there to be prior to seeing any writings.
Pick a likelihood function L(D|K) that assigns a likelihood to the writing data D given a fixed number of authors K. For example, you might take L(D|K) from the fit of a K-component GMM found by expectation-maximization (e.g., its maximized log-likelihood, so that better fits get higher values). To be really thorough, you could learn L(D|K) from data: the internet is full of haikus with known authors.
Find the value of K that maximizes the posterior probability P(K|D) - your best guess at the number of authors. Note that since P(K|D) = P(D|K)P(K)/P(D), P(D) is constant, and L(D|K) is proportional to P(D|K), you have:
argmax { P(K|D) | K = 1, 2, ... } = argmax { L(D|K)P(K) | K = 1, 2, ... }
With respect to your question, the first column in your table corresponds to K and the second column corresponds to a normalized P(K|D); that is, it is proportional to L(D|K)P(K).
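Here is a minimal sketch of this recipe in R. The geometric prior, the cap K_max, and the use of mclust's Gaussian mixtures for the likelihood are illustrative assumptions, and X stands for whatever numeric feature matrix you extract from the haikus:
library(mclust)

posterior_over_K <- function(X, K_max = 10, p = 0.25) {
  prior <- dgeom(seq_len(K_max) - 1, prob = p)      # P(K = k) for K ~ Geometric(p) on {1, 2, ...}
  loglik <- sapply(seq_len(K_max), function(k) {
    fit <- Mclust(X, G = k)                         # k-component Gaussian mixture fit via EM
    if (is.null(fit)) -Inf else fit$loglik          # log L(D | K = k)
  })
  unnorm <- exp(loglik - max(loglik)) * prior       # subtract max(loglik) to avoid underflow
  data.frame(K = seq_len(K_max), posterior = unnorm / sum(unnorm))
}

# Toy check with fake 2-D "style" features from 4 authors
set.seed(1)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2), matrix(rnorm(200, mean = 4), ncol = 2),
           matrix(rnorm(200, mean = 8), ncol = 2), matrix(rnorm(200, mean = 12), ncol = 2))
posterior_over_K(X)
In practice you would probably swap fit$loglik for a penalized criterion such as fit$bic, since the raw maximized likelihood never decreases as K grows.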

Related

How can I find dependencies in my data in R? (A + B + C -> D)

I want to reduce my data by sorting out dependent variables, e.g. A + B + C -> D, so I can leave out D without losing any information.
d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))
The last value of D is wrong; that is because the data can be inaccurate.
How can I identify these dependencies with R and how can I influence the accuracy of the correlation? (e.g., use a cutoff of 80 or 90 percent)
For example, findCorrelation only considers pair-wise correlations. Is there a function for multiple correlations?
You want to find dependencies in your data, and you contrast findCorrelation with what you want by asking 'is there a function for multiple correlations'. To answer that, we need to clarify which technique is appropriate for you...
Do you want partial correlation:
Partial correlation is the correlation of two variables while controlling for a third or more other variables
or semi-partial correlation?
Semi-partial correlation is the correlation of two variables with variation from a third or more other variables removed only from the second variable.
Definitions are from {ppcor}. There is a decent YouTube video on the topic, although the speaker may have some of the details of the relationship to regression slightly confused or confusing.
As for Fer Arce's suggestion... it is about right. Regression is quite related to these methods; however, when predictors are correlated (multicollinearity), this can cause issues (see the answer by gung). You could force your predictors to be orthogonal (uncorrelated) via PCA, but then you would make interpreting the coefficients quite hard.
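As an aside, a minimal sketch of that PCA route (purely illustrative, reusing the d data frame from the question):
# Orthogonalize the predictors, then regress D on the principal components.
# The components are uncorrelated by construction, but their coefficients
# no longer map cleanly onto A, B, and C.
pcs <- prcomp(d[, c("A", "B", "C")], scale. = TRUE)
summary(lm(d$D ~ pcs$x))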
Implementation:
library(ppcor)
d <- data.frame(A = c(1, 2, 3, 4, 5),
                B = c(2, 4, 6, 4, 2),
                C = c(3, 4, 2, 1, 0),
                D = c(6, 10, 11, 9, 9))
# partial correlations
pcor(d, method = "pearson")
# semi-partial correlations
spcor(d, method = "pearson")
You can get a 'correlation' if you fit a lm:
summary(lm(D ~ A + B + C, data = d))
But I am not sure what exactly you are asking for. I mean, with this you can get R^2, which I guess is what you are looking for?
Although correlation matrices are helpful and perfectly legitimate, one approach I find particularly useful is to look at the variance inflation factor (VIF). Wikipedia's article describing the VIF is quite good.
A few reasons why I like to use the VIF:
Instead of scanning rows or columns of a correlation matrix to divine which variables are more collinear with the other covariates jointly rather than singly, you get a single number describing an aspect of a given predictor's relationship to all the others in the model.
It's easy to use the VIF in a stepwise fashion to, in most cases, eliminate collinearity within your predictor space.
It's easy to obtain, either through the vif() function in the car package, or by writing your own function to calculate it.
The VIF essentially works by regressing each predictor you've included, in turn, against all the other covariates/predictors in your model. For each one it obtains the R^2 value and takes the ratio 1/(1 - R^2), which gives a number vif >= 1. If you think of R^2 as the amount of variation explained by the covariate model, then a covariate with a high R^2 of, say, 0.80 has a vif of 5.
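A quick sketch of that calculation, reusing the d data frame from above (car::vif is one convenient implementation; the manual version just restates 1/(1 - R^2)):
library(car)
fit <- lm(D ~ A + B + C, data = d)
vif(fit)                                              # one VIF per predictor

# Manual VIF for predictor A: regress A on the remaining predictors
r2_A <- summary(lm(A ~ B + C, data = d))$r.squared
1 / (1 - r2_A)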
You choose what your threshold of comfort is. The Wikipedia article suggests that a vif of 10 indicates a predictor should go. I was taught that 5 is a good threshold. Often, I've found it's easy to get the vif down to less than 2 for all of my predictors without a big impact on my final model's adjusted R^2.
I feel that even a vif of 5 (meaning a predictor can be modeled by its companion predictors with an R^2 of 0.80) indicates that the predictor's marginal information contribution is quite low and not worth it. I try to minimize all of my vifs for a given model without a huge impact (say, a > 0.1 reduction in R^2) on my main model. That sort of impact gives me a sense that, even if the vif is higher than I'd like, the predictor still holds a lot of information.
There are other approaches. You might look into Lawson's paper on an alias-matrix-guided variable selection method as well - I feel it's particularly clever, though harder to implement than what I've discussed above.
The question is about how to detect dependencies in larger data sets.
For one, this is possible by manually checking every possibility, as proposed in other answers with summary(lm(D ~ A + B + C, data = d)), for example. But this means a lot of manual work.
I see a few possibilities. One is filter methods, such as RReliefF or Spearman correlation; they look at the correlation and measure distance within the data set.
Possibility two is using feature extraction methods like PCA, LDA or ICA, all of which try to find the independent components (meaning eliminating any correlations...).
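For example, a quick PCA-based sketch on the d data frame from the question (just an illustration of the idea):
pc <- prcomp(d, scale. = TRUE)
summary(pc)     # a component explaining ~0% of the variance signals a (near-)linear dependency
pc$rotation     # the loadings of that smallest component show which variables are involved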

How are random numbers generated in R when random seed is fixed?

I am running some stochastic simulation experiments and in one step I want to estimate the correlation between random numbers when the underlying source of randomness is the same, i.e., common U(0,1) random numbers.
I thought the following two code segments should produce the same result.
set.seed(1000)
a_1 = rgamma(100, 3, 4)
set.seed(1000)
b_1 = rgamma(100, 4, 5)
cor(a_1,b_1)
set.seed(1000)
u = runif(100)
a_2 = qgamma(u, 3, 4)
b_2 = qgamma(u, 4, 5)
cor(a_2,b_2)
But the results are different
> cor(a_1,b_1)
[1] -0.04139218
> cor(a_2,b_2)
[1] 0.9993478
With a fixed random seed, I expect the correlations to be close to 1 (as is the case in the second code segment). However, the output of the first code segment is surprising.
For this specific seed (1000), the correlation in the first segment has a negative sign and a very small magnitude. Neither the sign nor the magnitude makes sense...
When playing around with different seeds (e.g., 1, 10, 100, 1000), the correlation in the first segment changes significantly while the correlation in the second segment is quite stable.
Can anyone give some insights about how R samples random numbers from the same seed?
Thanks in advance!
set.seed(1)
u = runif(1000)
Seems to be a typo for
set.seed(1000)
u = runif(100)
If so, the only reason that I see for you thinking that the two experiments should be equivalent is that you are hypothesizing that rgamma(100, 3, 4) is generated by inverse transform sampling: start with runif(100) and then run those 100 numbers through the inverse of the cdf for a gamma random variable with parameters 3 and 4. However, the Wikipedia article on gamma random variables suggests that more sophisticated methods are used in generating gamma random variables (methods which involve multiple calls to the underlying PRNG).
?rgamma shows that R uses the Ahrens-Dieter algorithm that the Wikipedia article discusses. Thus there is no reason to expect that your two computations will yield the same result.
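A quick way to see this (the exact values are beside the point; only the comparison matters):
set.seed(1000)
x1 <- rgamma(1, 3, 4)            # Ahrens-Dieter sampler; may consume several draws from the RNG stream
set.seed(1000)
x2 <- qgamma(runif(1), 3, 4)     # explicit inverse transform of a single uniform
x1 == x2                         # FALSE in general: the two methods use the RNG stream differently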
If what I took to be a typo at the beginning of my answer is what you actually intended then I have absolutely no idea why you would think that they should be equivalent, since they would then lack the "same seed" that you mention and furthermore correspond to different sample sizes.

Identifying Data Bands based on Distance between Centroids with Clustering in R

I'm trying to use clustering to identify bands in my data set. I'm working with supply chain data; the relevant column is the Price per Each.
The problem is that sometimes a product is incorrectly recorded as coming in a case of 100 instead of 10, so the Price per Each would look like (2, 0.25, 3). I want to write code that only creates an additional cluster if its mean price is at least 2 times greater or smaller than that of every existing cluster.
For example, if my prices per each were (4, 5, 6, 13, 14, 15), I want it to return 2 clusters with centroids of 5 and 14. If, on the other hand, my data looked like (3, 4, 5, 6), it should return one cluster.
The goal is code that returns the product codes of items for which multiple clusters have been generated, so that I can audit those product codes for bad units of measure (case of 100 vs case of 10).
I'm thinking about using divisive hierarchical clustering, but I don't know how to introduce the centroid distance rule for creating new clusters.
I'm fairly new to R, but I have SQL and Stata experience, so I'm looking for a package that would do this or help with the syntax I need to accomplish this.
Don't use clustering here.
While you can probably use HAC with a ratio-like distance function and a threshold of 8x, this will be rather unreliable and expensive: clustering will take O(n²) or O(n³) usually.
If you know that these errors happen, but not frequently, then I'd rather use a classic statistical approach. For example, compute the median and then report values that are 9x larger or smaller than the median as errors. If errors are infrequent enough, you could even use the mean, but the median is more robust.
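A rough base-R sketch of that rule. The data frame supply and its columns ProductCode and PricePerEach are made-up names for illustration:
# Median price per product code, then flag prices far from that median
med <- ave(supply$PricePerEach, supply$ProductCode, FUN = median)
suspect <- supply$PricePerEach > 9 * med | supply$PricePerEach < med / 9
unique(supply$ProductCode[suspect])   # product codes to audit for bad units of measure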

What is the probability of a TERM for a specific TOPIC in Latent Dirichlet Allocation (LDA) in R

I'm working in R, package "topicmodels". I'm trying to work through and better understand the code/package. In most of the tutorials and documentation I'm reading, I see people define topics by their 5 or 10 most probable terms.
Here is an example:
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], k = 5)
topics(lda)
terms(lda)
terms(lda,5)
So the last line of code returns the 5 most probable terms associated with each of the 5 topics I've defined.
In the lda object, I can access the gamma element, which contains, per document, the probability of belonging to each topic. Based on this I can extract the topics with a probability greater than any threshold I prefer, instead of having the same number of topics for every document.
But my second step would then be to know which words are most strongly associated with the topics. I can use the terms(lda) function to pull this out, but this only gives me the top N.
In the output I've also found the
lda@beta
which contains the beta value per word per topic, but I'm having a hard time interpreting it. The values are all negative, and though I see some around -6 and others around -200, I can't interpret them as a probability or as a measure of which words associate with a topic and how strongly. Is there a way to pull out or calculate anything that can be interpreted as such a measure?
many thanks
Frederik
The beta slot gives you a matrix with dimensions #topics x #terms. The values are on a log scale, so you exponentiate them. The resulting probabilities are of the type
P(word|topic), and they only add up to 1 if you sum over the words, not over the topics: the sum of P(word|topic) over all words is 1, but the sum of P(word|topic) over all topics is NOT 1.
What you are searching for is P(topic|word), but I actually do not know how to access or calculate it in this context. You will need P(word) and P(topic), I guess. P(topic) should be:
colSums(lda@gamma) / sum(lda@gamma)
This becomes more obvious if you look at the gamma matrix, which is #documents x #topics. The given probabilities are P(topic|document) and can be interpreted as "what is the probability of topic x given document y". The sum over all topics should be 1, but not the sum over all documents.
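For completeness, here is one way the Bayes-rule step could look in code. This is only a sketch of my guess above, not something taken from the package documentation:
# P(word | topic): exponentiate the log values in the beta slot (rows = topics)
p_word_given_topic <- exp(lda@beta)                          # #topics x #terms, rows sum to 1

# Rough estimate of P(topic) from the document-topic proportions in gamma
p_topic <- colSums(lda@gamma) / sum(lda@gamma)

# Bayes' rule: P(topic | word) is proportional to P(word | topic) * P(topic)
unnorm <- p_word_given_topic * p_topic                       # p_topic is recycled down each column
p_topic_given_word <- sweep(unnorm, 2, colSums(unnorm), "/") # now each column sums to 1
colnames(p_topic_given_word) <- lda@terms                    # terms slot holds the vocabulary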

Statistical functions for correlation between 2 data sets in R

This is more of a general question that I haven't been able to find an answer to. I am trying to find the correlation between 2 data sets, with the goal of matching them at a certain correlation percentage. They won't be exact matches, but will mostly be within 1%, though there will likely be some outliers. For example, every 100th point might be off by 5%, possibly more.
I am also trying to find instances where one data set matches another but at a different magnitude; for example, if you multiplied all of the data by some multiplier, you would get a match. It obviously wouldn't make sense to loop through a ton of possible multipliers. I'm contemplating reducing positive and negative slopes to +1/-1 and matching on those, since the raw slope values would not match. Though this would not work in some instances: the data is very granular, so it might match the overall shape, but if you zoom in the slopes would be off.
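Something like this is what I have in mind for the slope idea (just a sketch):
# Proportion of steps where two series move in the same direction
slope_match <- function(a, b) mean(sign(diff(a)) == sign(diff(b)))
slope_match(c(1, 3, 2, 5), c(2, 6, 7, 9))   # 2 of the 3 steps move the same way, so 0.667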
Are there any built-in functions in R for this? I don't have a statistical background, and my searches mostly turned up how to handle a single data set and outliers within it.
For a basic Pearson, Spearman, or Kendall correlation, you can use the cor() function:
x <- c(1, 2, 5, 7, 10, 15)
y <- c(2, 4, 6, 9, 12, 13)
cor(x, y, use="pairwise.complete.obs", method="pearson")
You're going to want to adjust the "use" and "method" options based on your data. Since you didn't provide the nature of your data, I can't give you any more specific guidance.
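One detail worth adding for the magnitude question: Pearson correlation is unchanged by multiplying one series by a positive constant, so a rescaled match will still show up. A quick check, reusing x and y from above:
cor(x, y, method = "pearson")
cor(x, 3 * y, method = "pearson")   # identical: scaling by a positive constant does not change Pearson correlation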
