Estimate Dictionary size using Zipf’s Law - information-retrieval

How would one go about calculating the dictionary size (number of unique words) of a collection using Zipf's law?

Dictionary size is approximated by Heaps' law, not directly by Zipf's law.
Unique words = constant1 x (total words)^constant2, where constant1 is usually between 10 and 100 and constant2 is usually between 0.5 and 0.6.
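As a rough illustration of the formula above, here is a minimal Python sketch. The function name and the default constants (k = 44, beta = 0.5) are illustrative assumptions picked from within the quoted ranges; in practice both constants would be fitted to the collection at hand.

```python
def heaps_vocabulary_size(total_tokens, k=44.0, beta=0.5):
    """Estimate the number of unique words as k * total_tokens ** beta.

    k and beta are collection-specific constants; the defaults here are
    assumed, illustrative values and not fitted to any real corpus.
    """
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return k * total_tokens ** beta


if __name__ == "__main__":
    # Estimated dictionary size for a collection of one million tokens.
    print(round(heaps_vocabulary_size(1_000_000)))
```

Because the exponent is below 1, the estimated dictionary keeps growing with the collection, but more and more slowly.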

Related

Generating groups of skewed size but whose elements add to a fixed sum

I have some fixed number of people (e.g. 1000). I would like to split these 1000 people into some random number of classes Y (e.g. 5), but not equally. I want them to be distributed unevenly, according to some probability distribution that is heavily skewed (something like a power-law distribution).
My intuition is that I need to generate a distribution of probabilities that is (1) skewed and (2) which also adds up to 1.
My ad hoc solution was to generate random numbers from a power law distribution, multiply these by some scalar that ensures these add up to something close to my target number, adjust my target number to that new number, and then split accordingly.
But it seems awfully inelegant, and y_sizes doesn't always sum to 1000, which requires looping through and trying again. What's a better approach?
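library(poweRlaw)
library(magrittr)  # provides the %>% pipe
x <- 1000  # total number of people
y <- 10    # number of groups
# draw y group sizes from a discrete power law, then rescale them towards x
y_sizes <- rpldis(y, xmin = 5, alpha = 2, discrete_max = x)
y_sizes <- round(y_sizes * x / sum(y_sizes))
newx <- sum(y_sizes)  # newx is only approximately equal to x
people <- 1:newx
# assign each person to a group via the cumulative group sizes
groups <- cut(
  people,
  c(0, cumsum(y_sizes))
) %>% as.numeric
data.frame(
  people = people,
  group = groups
)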
The algorithm presented by Smith and Tromble in "Sampling Uniformly from the Unit Simplex" shows a solution. I have pseudocode on this algorithm in my section "Random Integers with a Given Positive Sum".
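A minimal Python sketch of two ways to get integer group sizes that always add up to the target. The first function follows the cut-point idea commonly used for sampling positive integers with a fixed sum (uniform over all such splits, so not skewed by itself). The second function is a different, swapped-in technique (largest-remainder rounding) that turns any skewed weights, for example power-law draws, into sizes with an exact sum. Both function names are hypothetical.

```python
import random


def random_positive_integers_with_sum(total, parts, rng=random):
    """Return `parts` positive integers summing to `total`.

    Picks parts - 1 distinct cut points in 1..total-1 and takes the gaps
    between consecutive cuts, which is uniform over all such splits.
    """
    if not 1 <= parts <= total:
        raise ValueError("need 1 <= parts <= total")
    cuts = sorted(rng.sample(range(1, total), parts - 1))
    bounds = [0] + cuts + [total]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]


def apportion(total, weights):
    """Split `total` into integers proportional to positive `weights`.

    Largest-remainder rounding: floor every quota, then hand the leftover
    units to the entries with the largest fractional parts.
    """
    scale = total / sum(weights)
    quotas = [w * scale for w in weights]
    sizes = [int(q) for q in quotas]
    leftover = total - sum(sizes)
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: quotas[i] - sizes[i], reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    print(random_positive_integers_with_sum(1000, 5))
    # Skewed weights (assumed values) turned into sizes that sum to 1000.
    print(apportion(1000, [64, 16, 8, 4, 1]))
```

With the second function no retry loop is needed: whatever skewed weights are drawn, the returned sizes sum to the target exactly.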

Competing risk survival random forest with large data

I have a data set with 500,000 observations with events and a competing risk as well as a time-to-event variable (survival analysis).
I want to run a survival random forest.
The R package randomForestSRC is great for it. However, it is impossible to use more than 100,000 rows due to memory limitations (100,000 rows use 40GB of RAM), even though I limit my number of predictors to 15 to 20.
I have a hard time finding a solution. Does anyone have a recommendation?
I looked at h2o and spark mllib, both of which do not support survival random forests.
Ideally I am looking for a somewhat R-based solution but I am happy to explore anything else if anyone knows a way to use large data + competing risk random forest.
In general, the memory profile for an RF-SRC data set is n x p x 8 on a 64-bit machine. With n=500,000 and p=20, RAM usage is approximately 80MB. This is not large.
You also need to consider the size of the forest, $nativeArray. With the default nodesize = 3, you will have n / 3 = 166,667 terminal nodes. Assuming symmetric trees for convenience, the total number of internal/external nodes will be approximately 2 * n / 3 = 333,333. With the default ntree = 1000, and assuming no factors, $nativeArray will be of dimensions [2 * n / 3 * ntree] x [5]. A simple example will show you why we have [5] columns in the $nativeArray to tag the split parameter and split value. Memory usage for the forest will thus be 2 * n / 3 * ntree * 5 * 8 bytes: that is 1.67 billion array entries at 8 bytes each, or about 13.3GB.
So now we are getting into some serious memory usage.
Next consider the ensembles. You haven't said how many events you have in your competing risk data set, but let's assume there are two for simplicity.
The big arrays here are the cause-specific hazard function (CSH) and the cause-specific cumulative incidence function (CIF). These are both of dimension [n] x [time.interest] x [2]. In a worst case scenario, if all your times are distinct, and there are no censored events, time.interest = n. So each of these outputs is n * n * 2 * 8 bytes. This will blow up most machines. It's time.interest that is your enemy. In big-n situations, you need to constrain the time.interest vector to a subset of the actual event times. This can be controlled with the parameter ntime.
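To make the effect of time.interest concrete, here is a small back-of-the-envelope calculation in Python. The helper name is hypothetical, and the grid of 100 time points is an assumed example value for ntime, not a recommendation from the package.

```python
def ensemble_bytes(n, time_points, n_events=2, bytes_per_value=8):
    """Size in bytes of one [n] x [time.interest] x [events] array of doubles."""
    return n * time_points * n_events * bytes_per_value


if __name__ == "__main__":
    n = 500_000
    # Worst case: every observed time is a distinct event time.
    worst_case = ensemble_bytes(n, n)
    # Constrained case: an assumed grid of 100 time points.
    constrained = ensemble_bytes(n, 100)
    print(f"worst case:  {worst_case / 1e9:,.1f} GB per array")
    print(f"constrained: {constrained / 1e9:,.1f} GB per array")
```

The figures are per array, so the CSH and the CIF each need this much.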
From the documentation:
ntime: Integer value used for survival families to constrain ensemble calculations to a grid of time values of no more than ntime time points. Alternatively if a vector of values of length greater than one is supplied, it is assumed these are the time points to be used to constrain the calculations (note that the constrained time points used will be the observed event times closest to the user supplied time points). If no value is specified, the default action is to use all observed event times.
My suggestion would be to start with a very small value of ntime, just to test whether the data set can be analyzed in its entirety without issue. Then increase it gradually and observe your RAM usage. Note that if you have missing data, then RAM usage will be much larger. Also note that I did not mention other arrays such as the terminal node statistics that also lead to heavy RAM usage. I only considered the ensembles, but the reality is that each terminal node will contain arrays of dimension [time.interest] x 2 for the node specific estimator of the CSH and CIF that is used in creating the forest ensemble.
In the future, we will be implementing a Big Data option that will suppress ensembles and optimize the memory profile of the package to better accommodate big-n scenarios. In the meantime, you will have to be diligent in using the existing options like ntree, nodesize, and ntime to reduce your RAM usage.

What is the probability of a TERM for a specific TOPIC in Latent Dirichlet Allocation (LDA) in R

I'm working in R, package "topicmodels". I'm trying to work out and better understand the code/package. In most of the tutorials, documentation I'm reading I'm seeing people define topics by the 5 or 10 most probable terms.
Here is an example:
library(topicmodels)
data("AssociatedPress", package = "topicmodels")
lda <- LDA(AssociatedPress[1:20,], k = 5)
topics(lda)
terms(lda)
terms(lda,5)
so the last part of the code returns me the 5 most probable terms associated with the 5 topics I've defined.
In the lda object, I can access the gamma element, which contains, per document, the probability of belonging to each topic. So based on this I can extract the topics with a probability greater than any threshold I prefer, instead of having the same number of topics for every document.
But my second step would then be to know which words are most strongly associated with the topics. I can use the terms(lda) function to pull this out, but this only gives me the top N terms.
In the output I've also found
lda@beta
which contains the beta per word per topic, but this is a beta value, which I'm having a hard time interpreting. They are all negative values, and though I see some values around -6 and others around -200, I can't interpret this as a probability or as a measure of which words associate with a topic and how strongly. Is there a way to pull out or calculate anything that can be interpreted as such a measure?
many thanks
Frederik
The beta matrix has dimension #topics x #terms. The values are log probabilities, so you exponentiate them (exp) to get probabilities. The resulting probabilities are of the type
P(word|topic), and they add up to 1 only if you sum over the words, not over the topics: the sum over all words of P(word|topic) is 1, but the sum over all topics of P(word|topic) is NOT 1.
What you are searching for is P(topic|word), but I do not know how to access it directly in this context. You will need P(word) and P(topic), I guess. P(topic) should be:
colSums(lda@gamma)/sum(lda@gamma)
This becomes more obvious if you look at the gamma matrix, which is #documents x #topics. The given probabilities are P(topic|document) and can be interpreted as "what is the probability of topic x given document y". The sum over all topics should be 1, but not the sum over all documents.
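One way to combine these pieces is Bayes' rule, P(topic|word) = P(word|topic) P(topic) / P(word), with P(word) obtained by summing P(word|topic) P(topic) over the topics. The Python sketch below shows only the arithmetic on tiny made-up matrices; the numbers and function names are hypothetical, and the topic prior weights every document equally (ignoring document length), which is an assumption.

```python
import math

# Toy stand-ins for the model output (2 topics, 3 terms, 3 documents).
LOG_BETA = [[math.log(0.7), math.log(0.2), math.log(0.1)],
            [math.log(0.1), math.log(0.3), math.log(0.6)]]
GAMMA = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]


def word_given_topic(log_beta):
    """exp() of the log values: each row is P(word|topic) and sums to 1."""
    return [[math.exp(v) for v in row] for row in log_beta]


def topic_prior(gamma):
    """P(topic) as column sums of gamma divided by the grand total."""
    total = sum(sum(doc) for doc in gamma)
    return [sum(doc[k] for doc in gamma) / total for k in range(len(gamma[0]))]


def topic_given_word(p_word_topic, p_topic):
    """Bayes' rule; returns one row per term, each row summing to 1."""
    n_topics, n_terms = len(p_word_topic), len(p_word_topic[0])
    rows = []
    for w in range(n_terms):
        joint = [p_word_topic[k][w] * p_topic[k] for k in range(n_topics)]
        p_word = sum(joint)  # P(word), marginalised over topics
        rows.append([j / p_word for j in joint])
    return rows


if __name__ == "__main__":
    posterior = topic_given_word(word_given_topic(LOG_BETA), topic_prior(GAMMA))
    for term, row in enumerate(posterior):
        print(term, [round(p, 3) for p in row])
```

Each output row answers "given this term, how likely is each topic", which is the kind of measure the question asks for.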

Mathematical representation of a set of points in N dimensional space?

Given some x data points in an N dimensional space, I am trying to find a fixed length representation that could describe any subset s of those x points? For example the mean of the s subset could describe that subset, but it is not unique for that subset only, that is to say, other points in the space could yield the same mean therefore mean is not a unique identifier. Could anyone tell me of a unique measure that could describe the points without being number of points dependent?
In short - it is impossible (as you would achieve infinite lossless compression). You either need a variable-length representation (or a fixed-length one whose length is proportional to the maximum number of points), or you have to deal with "collisions" (as your mapping will not be injective). In the first scenario you can simply store the coordinates of each point. In the second one you approximate your point clouds with more and more complex descriptors to balance collisions and memory usage. Some possibilities are:
- storing the mean and covariance (so basically performing maximum likelihood estimation over the Gaussian family)
- performing some fixed-complexity density estimation, like a Gaussian Mixture Model, or training a generative neural network
- using a set of simple geometrical/algebraic properties such as:
  - number of points
  - mean, max, min, median distance between each pair of points
  - etc.
Any subset can be identified by a bit mask of length x, where bit i is 1 if the corresponding element belongs to the subset. There is no fixed-length representation that is not a function of x.
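The bit-mask idea can be sketched in a few lines of Python: bit i of an integer is set when point i belongs to the subset. The function names are hypothetical, and points are assumed to be numbered from 0.

```python
def subset_to_mask(n_points, subset_indices):
    """Encode a subset of points 0..n_points-1 as an integer bit mask."""
    mask = 0
    for i in subset_indices:
        if not 0 <= i < n_points:
            raise IndexError(f"point index {i} out of range")
        mask |= 1 << i  # set bit i
    return mask


def mask_to_subset(mask):
    """Decode a bit mask back into the sorted list of point indices."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    mask = subset_to_mask(8, [0, 3, 5])
    print(bin(mask), mask_to_subset(mask))
```

The encoding is unique and reversible, but it identifies the subset by membership rather than by any geometric property of the points.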
EDIT
I was wrong. PCA is a good way to perform dimensionality reduction for this problem, but it won't work for some sets.
However, you can almost do it. Where "almost" is formally defined by the Johnson-Lindenstrauss Lemma, which states that for a given large dimension N, there exists a much lower dimension n, and a linear transformation that maps each point from N to n, while keeping the Euclidean distance between every pair of points of the set within some error ε from the original. Such linear transformation is called the JL Transform.
In other words, your problem is only solvable for sets of points where each pair of points is separated by at least ε. For this case, the JL Transform gives you one possible solution. Moreover, there exists a relationship between N, n and ε (see the lemma), such that, for example, if N=100, the JL Transform can map each point to a point in 5D (n=5) and uniquely identify each subset if and only if the minimum distance between any pair of points in the original set is at least ~2.8 (i.e. the points are sufficiently different).
Note that in the usual statement of the lemma, n is governed by ε and grows only logarithmically with the number of points x (n = O(log(x)/ε²)), so it is a solution to your problem, albeit with some constraints.

substitution matrix based on spatial autocorrelation transformation

I would like to measure the Hamming sequence similarity in which the substitution costs are not based on the substitution rates in the observed sequences but on the spatial autocorrelation of the different states within the study area (the states are thus not related to DNA but to something else).
I divided my study area into grid cells of equal size (e.g. 1000m) and measured how often the same "state" is observed in a neighboring cell (Rook case). Consequently, the weight matrix indicates that going from state A to A (staying within the same state) has a much higher probability than going from A to B, B to C or A to C. This already indicates that the states have a high spatial autocorrelation.
The problem is that, if you want to measure sequence similarity, the substitution matrix should be 0 on the diagonal. Therefore I was wondering whether there is some kind of transformation to go from an "autocorrelation matrix" to a substitution matrix with 0 values along the diagonal. By means of this we would like to account for spatial autocorrelation in the study area in our sequence similarity measure. To do my analysis I am using the package TraMineR.
Example matrix in R for sequences consisting of four states (A,B,C,D):
Sequence example: AAAAAABBBBCCCCCCCCCCCCDDDDDDDDDDDDDDDDDDDDDDDAAAAAAAAA
Autocorrelation matrix:
A = c(17.50,3.00,1.00,0.05)
B = c(3.00,10.00,2.00,1.00)
C = c(1.00,2.00,30.00,3.00)
D = c(0.05,1.00,3.00,20.00)
subm = rbind(A,B,C,D)
colnames(subm) = c("A","B","C","D")
How can this matrix be transformed into a substitution matrix?
First, TraMineR computes the Hamming distance, i.e., a dissimilarity, not a similarity.
The simple Hamming distance is just the count of mismatches between two sequences. For example, the Hamming distance between AABBCC and ABBBAC is 2, and between AAAAAA and AAAAAA it is 0 since there are no mismatches.
Generalized Hamming allows weighting mismatches (not matches!) with substitution costs. For example, if the substitution cost is 1.5 between A and B and 2 between A and C, then the distance would be the weighted sum of mismatches, i.e., 3.5 between the first two sequences. It would still be zero between one sequence and itself.
From what I understand, the shown matrix is not the matrix of substitution costs. It is the matrix of what you call 'spatial autocorrelations', and you are looking for a way to turn this information into substitution costs.
The idea is to assign a high substitution cost (mismatch weight) when the autocorrelation (a rate in your case) is low, i.e., when there is a low probability of finding, say, state B in the neighborhood of state A, and to assign a low substitution cost when the probability is high. Since your probability matrix is symmetric, a simple solution is to use $1 - p(A|B)$ for all off-diagonal terms, and leave 0 on the diagonal for the reason explained above.
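The two distances can be sketched in Python as follows. This is only an illustration of the definitions, not TraMineR's implementation; the function name and the cost table are hypothetical, and the table lists a cost for every pair of states so that any mismatch can be looked up.

```python
def hamming(seq_a, seq_b, costs=None):
    """Hamming distance between two equal-length sequences.

    Without `costs`, every mismatch counts 1. With `costs` (a dict keyed
    by frozenset pairs of states), each mismatch adds its substitution
    cost. Matches never contribute.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    total = 0.0
    for s, t in zip(seq_a, seq_b):
        if s != t:
            total += 1.0 if costs is None else costs[frozenset((s, t))]
    return total


# Illustrative (assumed) substitution costs for three states.
COSTS = {
    frozenset("AB"): 1.5,
    frozenset("AC"): 2.0,
    frozenset("BC"): 2.0,
}

if __name__ == "__main__":
    print(hamming("AABBCC", "ABBBAC"))         # simple count of mismatches
    print(hamming("AABBCC", "ABBBAC", COSTS))  # weighted sum of mismatches
    print(hamming("AAAAAA", "AAAAAA", COSTS))  # a sequence versus itself
```

Keying the table by unordered pairs keeps the costs symmetric by construction.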
sm <- 1 - subm/100
diag(sm) <- 0
sm
For non-symmetric probabilities, you could use a formula similar to the one used for deriving the costs from transition rates, i.e., $2 - p(A|B) - p(B|A)$.
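Both transformations can be written out as a short Python sketch using the example matrix from the question. The function name is hypothetical, and dividing by 100 assumes, as the R snippet does, that the matrix entries are percentages.

```python
STATES = ["A", "B", "C", "D"]
RATES = [
    [17.50, 3.00, 1.00, 0.05],
    [3.00, 10.00, 2.00, 1.00],
    [1.00, 2.00, 30.00, 3.00],
    [0.05, 1.00, 3.00, 20.00],
]


def substitution_costs(rates, scale=100.0, symmetric=True):
    """Turn neighbourhood rates into substitution costs with a zero diagonal.

    symmetric=True  uses 1 - p(A|B)
    symmetric=False uses 2 - p(A|B) - p(B|A)
    """
    n = len(rates)
    costs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # matches cost nothing
            p_ij = rates[i][j] / scale
            p_ji = rates[j][i] / scale
            costs[i][j] = 1.0 - p_ij if symmetric else 2.0 - p_ij - p_ji
    return costs


if __name__ == "__main__":
    for state, row in zip(STATES, substitution_costs(RATES)):
        print(state, [round(c, 4) for c in row])
```

Rare neighbour pairs such as A and D end up with the highest cost, and frequent ones such as A and B with a slightly lower one.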
