How to represent a random variable in a program - probability-theory

I am not sure whether Stack Overflow is the right place to ask this question, but I am curious how to represent a random variable in a program.
Say a random variable X ~ N(mu, sigma); then we can represent it by its mean and covariance. However, I know this only works for a Gaussian distribution. If I want to represent a Poisson distribution, this data type can no longer hold my random variable.
My question boils down to this: are there any languages or libraries where I can represent a random variable like any other data structure? Personally, I find the concept of a random variable very difficult to understand, and such a representation would be a great help.
My ideal concept would be
RandomVariable rv = new RandomVariable(mu, sigma) // Assume 1-Dimension as of now
I know that in MATLAB there is a function mvnpdf() which gives an instance of the distribution, but there is no notion of representing a random variable.

In general, in a programming language you don't declare a random variable so much as create a function that generates (pseudo)random numbers according to whatever distribution you want to use. There are many libraries for generating random numbers; the most common ones typically produce uniformly or Gaussian distributed values. When programming, you have to think in terms of unambiguous steps that solve your problem rather than in terms of an abstract notation which, while useful, may not correspond directly to an algorithm. Most likely you will feed the random numbers produced by the chosen generator into a function that substitutes them into your abstract model and produces a set of values that constitutes a single sample of your statistical evaluation of the model.
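To make the "a random variable is a function that generates numbers" idea concrete, here is a minimal Python sketch using only the standard library. The class name RandomVariable and the helpers gaussian() and poisson() are made up for this illustration, not taken from any particular library, and the Poisson sampler uses Knuth's simple multiplication method, which is only sensible for small rates.

```python
import math
import random
from typing import Callable, List


class RandomVariable:
    """A random variable represented by a function that draws one sample."""

    def __init__(self, sampler: Callable[[random.Random], float], seed: int = 0):
        self._sampler = sampler
        self._rng = random.Random(seed)  # private generator, reproducible

    def sample(self, n: int = 1) -> List[float]:
        """Draw n independent samples."""
        return [self._sampler(self._rng) for _ in range(n)]

    def mean(self, n: int = 10_000) -> float:
        """Monte Carlo estimate of the expected value from n fresh draws."""
        return sum(self.sample(n)) / n


def gaussian(mu: float, sigma: float) -> Callable[[random.Random], float]:
    """Sampler for N(mu, sigma), with sigma the standard deviation."""
    return lambda rng: rng.gauss(mu, sigma)


def poisson(lam: float) -> Callable[[random.Random], float]:
    """Sampler for Poisson(lam) using Knuth's method (small lam only)."""
    def draw(rng: random.Random) -> float:
        threshold = math.exp(-lam)
        k, p = 0, 1.0
        while True:
            p *= rng.random()
            if p <= threshold:
                return float(k)
            k += 1
    return draw


# The same data type now holds a Gaussian and a Poisson variable.
x = RandomVariable(gaussian(mu=2.0, sigma=1.0), seed=1)
y = RandomVariable(poisson(lam=4.0), seed=2)
```

Calling x.sample(5) or y.mean() then works the same way regardless of the distribution, which is roughly what the question's RandomVariable rv = new RandomVariable(mu, sigma) was after.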


rpart variable importance shows more variables than decision tree plots

I fitted an rpart model with leave-one-out cross-validation on my data using the caret library in R. Everything works, but I want to understand the difference between the model's variable importance and the decision tree plot.
Calling the variable importance function varImp() shows nine variables. Plotting the decision tree with functions such as fancyRpartPlot() or rpart.plot() shows a tree that uses only two variables to classify all subjects.
How can that be? Why does the decision tree plot not show the same nine variables as the variable importance table?
Thank you.
Similar to rpart(), caret has a cool property: it deals with surrogate variables, i.e. variables that are not chosen for splits but that came close to winning the competition.
Let me be clearer. Say that at a given split the algorithm decided to split on x1. Suppose there is also another variable, say x2, which would be almost as good as x1 for splitting at that stage. We call x2 a surrogate, and we assign it a variable importance just as we do for x1.
This is why the importance ranking can include variables that are not actually used for splitting. You may even find that such variables are more important than others that are actually used!
The rationale for this is explained in the documentation for rpart(): suppose we have two identical covariates, say x3 and x4. Then rpart() is likely to split on one of them only, e.g., x3. How can we say that x4 is not important?
To conclude, variable importance considers the increase in fit for both primary variables (i.e., the ones actually chosen for splitting) and surrogate variables. So the importance of x1 takes into account both the splits for which x1 is chosen as the splitting variable and the splits for which another variable is chosen but x1 is a close competitor.
Hope this clarifies your doubts. For more details, see here. Just a quick quotation:
The following methods for estimating the contribution of each variable to the model are available [speaking of how variable importance is computed]:
- Recursive Partitioning: The reduction in the loss function (e.g. mean squared error) attributed to each variable at each split is tabulated and the sum is returned. Also, since there may be candidate variables that are important but are not used in a split, the top competing variables are also tabulated at each split. This can be turned off using the maxcompete argument in rpart.control.
I am not used to caret, but from this quote it appears that the package actually uses rpart() to grow trees, thus inheriting the property about surrogate variables.
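To see how a variable can show up in the importance table without appearing in the plot, here is a toy tally in Python. The split records and improvement numbers are invented for illustration, and the plain sum is a simplification of the idea in the quote above; it is not the exact weighting that rpart or caret applies.

```python
from collections import defaultdict

# Made-up numbers: for each split, the improvement in fit achieved by the
# variable that was chosen and by the competitors that came close.
splits = [
    {"chosen": ("x1", 0.40), "competitors": [("x2", 0.38), ("x3", 0.10)]},
    {"chosen": ("x1", 0.25), "competitors": [("x2", 0.20)]},
    {"chosen": ("x4", 0.15), "competitors": [("x3", 0.14)]},
]


def tally_importance(splits):
    """Sum the improvement credited to each variable over all splits."""
    importance = defaultdict(float)
    for split in splits:
        name, gain = split["chosen"]
        importance[name] += gain
        for name, gain in split["competitors"]:
            importance[name] += gain
    return dict(importance)


def variables_in_plot(splits):
    """Only the chosen variables appear in the drawn tree."""
    return {split["chosen"][0] for split in splits}


importance = tally_importance(splits)
plotted = variables_in_plot(splits)
```

Here the tree would draw only x1 and x4, yet the tally has entries for all four variables, and x2 ends up with a larger total than x4 even though it was never used for a split.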

K-Means Distance Measure - Large Data and mixed Scales

I have a question regarding k-means clustering. We have a dataset with 120,000 observations and need to compute a k-means cluster solution in R. The problem is that k-means usually uses Euclidean distance. Our dataset consists of 3 continuous variables, 11 ordinal variables (Likert 0-5; I think it would be okay to treat them as continuous) and 5 binary variables. Do you have any suggestions for a distance measure that we can use for our k-means approach, given the "large" dataset? We are sticking with k-means, so I really hope one of you has a good idea.
One approach would be to normalize the features and then just use the 19-dimensional Euclidean distance (3 continuous + 11 ordinal + 5 binary variables). Cast the binary values to 0/1 (well, it's R, so it does that anyway) and go from there.
I don't see an immediate problem with this method, other than that k-means in 19 dimensions will definitely be hard to interpret. You could try a dimensionality reduction technique to make the k-means output easier to read, but you know far more about the data set than we ever could, so our ability to help you is limited.
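As a rough illustration of the normalize-then-Euclidean idea, here is a small Python sketch. The rows are hypothetical (one continuous, one Likert and one binary column), and z-score standardization is just one possible choice of normalization.

```python
import math
from statistics import mean, pstdev


def standardize_columns(rows):
    """Rescale every column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    scales = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [
        [(value - c) / s for value, c, s in zip(row, centers, scales)]
        for row in rows
    ]


def euclidean(a, b):
    """Plain Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical rows: (continuous, Likert 0-5, binary encoded as 0/1)
rows = [
    [1200.0, 4, 1],
    [300.0, 1, 0],
    [800.0, 5, 1],
    [450.0, 2, 0],
]
scaled = standardize_columns(rows)
distance = euclidean(scaled[0], scaled[1])
```

Without the rescaling, the first column would dominate the distance simply because its values are hundreds of times larger than the others.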
You can certainly encode the binary variables as 0/1 too.
It is best practice in statistics not to treat Likert scale variables as numeric, because of their uneven distribution.
But I don't think you will get meaningful k-means clusters. That algorithm is all about computing means, which makes sense for continuous variables. Discrete variables usually lack the "resolution" for this to work well. The mean then degrades to a "frequency", and the data should be handled very differently.
Do not choose the problem to fit the hammer. Maybe your data is not a nail, and even if you'd like to make it work with k-means, it won't solve your problem. Instead, formulate your problem, then choose the right tool. So, given your data, what is a good cluster? Until you have an equation that measures this, hammering the data won't solve anything.
Encoding the variables as binary will not solve the underlying problem. Rather, it only increases the dimensionality of the data, which is an added burden. It is best practice in statistics not to convert the original data to another form, such as continuous to categorical or vice versa. If you do convert the data, the conversion must fit the question you are trying to answer, and you must provide a valid justification for it.
Continuing further, as others have stated, try to reduce the dimensionality of the dataset first. Check for issues such as missing values, outliers and zero-variance features, and consider principal component analysis (for continuous variables) or correspondence analysis (for categorical variables). This can help you reduce the dimensionality. After all, data preprocessing is often said to make up 80% of an analysis.
Regarding a distance measure for mixed data types: the mean in k-means only works for continuous variables, so I do not understand the logic of using k-means for mixed data types.
Consider choosing another algorithm such as k-modes. k-modes is an extension of k-means. Instead of distances it uses dissimilarities (that is, a count of the total mismatches between two objects: the smaller this number, the more similar the two objects). And instead of means, it uses modes. A mode is a vector of elements that minimizes the dissimilarities between the vector itself and each object in the data.
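The two ingredients of k-modes described above, the mismatch count and the mode, can be sketched in a few lines of Python. This shows only those two building blocks, not a full clustering loop, and the records are invented for the example.

```python
from collections import Counter


def mismatch_dissimilarity(a, b):
    """Number of positions in which two categorical records differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def cluster_mode(records):
    """Column-wise most frequent value; ties go to the value seen first."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


# Invented categorical records belonging to one cluster.
records = [
    ("yes", "low", "a"),
    ("yes", "high", "a"),
    ("no", "low", "a"),
]
mode = cluster_mode(records)
total = sum(mismatch_dissimilarity(r, mode) for r in records)
```

A k-modes iteration would assign each record to the cluster whose mode has the smallest mismatch count, then recompute the modes, in the same way that k-means alternates between assignment and mean updates.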
Mixture models can be used to cluster mixed data.
You can use the R package VarSelLCM which models, within each cluster, the continuous variables by Gaussian distributions and the ordinal/binary variables.
Moreover, missing values can be managed by the model at hand.
A tutorial is available at:

HMM training with multiple observations and mhsmm package in R

I want to train a new HMM using Poisson observations, which are the only thing I know.
I'm using the mhsmm package for R.
The first thing that bugs me is the initialization of the model, which in the examples is:
J <- 3  # number of hidden states, matching the three lambda values below
initial <- rep(1/J, J)
P <- matrix(1/J, nrow = J, ncol = J)
b <- list(lambda = c(1, 3, 6))
model <- hmmspec(init = initial, trans = P, parms.emission = b, dens.emission = dpois.hsmm)
In my case I don't have initial values for the emission distribution parameters; that is what I want to estimate. How do I do that?
Secondly, if I only have observations, how do I pass them to
h1 = hmmfit(list_of_observations, model ,mstep=mstep.pois)
in order to obtain the trained model?
list_of_observations, in the examples, contains a vector of states, one of observations and one of observation sequence length and is usually obtained by a simulation of the model:
list_of_observations = simulate(model, N, rand.emis = rpois.hsmm)
EDIT: Found this old question with an answer that partially solved my problem:
MHSMM package in R-Input Format?
These two lines did the trick:
train <- list(x = data.df$sequences, N = N)
class(train) <- ""
where data.df$sequences is the array containing all observation sequences and N is the array containing the number of observations in each sequence.
Still, the initial model is totally random, but I guess this is how it is meant to be, since it will be re-estimated. Am I right?
The problem of initialization is critical not only for HMMs and HSMMs, but for all learning methods based on some form of the Expectation-Maximization (EM) algorithm. EM converges to a local optimum of the likelihood of the data under the model, but it is not guaranteed to reach the global optimum.
Goal: find estimates of the emission distribution parameters (the same applies to the initial probabilities and the transition matrix)
Algorithm: needs initial estimate to start the optimisation from
You: have to provide an initial "guess" of the parameters
This may seem confusing at first, but the EM algorithm needs a starting point for the optimisation. It then performs some computations and gives you a better estimate than your initial guess (re-estimation, as you said). It cannot find the best parameters on its own without being initialised.
In my experience, there is no general way of initialising the parameters that is guaranteed to converge to a global optimum; it depends on the case at hand. That's why initialisation plays a critical role (mostly for the emission distribution).
What I used to do in such a case is to separate the training data in different groups (e.g. percentiles of a certain parameter in the set), estimate the parameters on these groups, and then use them as initial parameter estimates for the EM algorithm. Basically, you have to try different methods and see which one works best.
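One way to read the percentile idea above, sketched in Python: sort the pooled observations, cut them into as many equal-sized groups as there are hidden states, and use each group's mean as the starting Poisson rate. The function name and the example counts are hypothetical, and this is only one of many possible heuristics.

```python
from statistics import mean


def initial_poisson_rates(observations, n_states):
    """Split the sorted data into n_states equal-sized groups and
    return each group's mean as a starting value for lambda."""
    ordered = sorted(observations)
    n = len(ordered)
    if n < n_states:
        raise ValueError("need at least one observation per state")
    rates = []
    for j in range(n_states):
        group = ordered[j * n // n_states:(j + 1) * n // n_states]
        rates.append(mean(group))
    return rates


# Hypothetical pooled count data from all observation sequences.
counts = [0, 1, 1, 2, 3, 3, 4, 6, 7, 9, 10, 12]
lambdas = initial_poisson_rates(counts, n_states=3)
```

The resulting values would take the place of the c(1, 3, 6) guess in the emission parameter list from the question. If the lowest group consists only of zeros, its mean is 0, which is not a usable Poisson rate, so a small positive floor would be needed in that case.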
I'd recommend searching the literature to see whether similar problems have been solved with HMMs, and trying their initialisation methods.

Formula interface for glmnet

In the last few months I've worked on a number of projects where I've used the glmnet package to fit elastic net models. It's great, but the interface is rather bare-bones compared to most R modelling functions. In particular, rather than specifying a formula and data frame, you have to give a response vector and predictor matrix. You also lose out on many quality-of-life things that the regular interface provides, e.g. sensible (?) treatment of factors, missing values, putting variables into the correct order, etc.
So I've generally ended up writing my own code to recreate the formula/data frame interface. Due to client confidentiality issues, I've also ended up leaving this code behind and having to write it again for the next project. I figured I might as well bite the bullet and create an actual package to do this. However, a couple of questions before I do so:
Are there any issues that complicate using the formula/data frame interface with elastic net models? (I'm aware of standardisation and dummy variables, and wide datasets maybe requiring sparse model matrices.)
Is there any existing package that does this?
Well, it looks like there's no pre-built formula interface, so I went ahead and made my own. You can download it from Github:
Or in R, using devtools::install_github:
From the readme:
Some quality-of-life functions to streamline the process of fitting elastic net models with glmnet, specifically:
- glmnet.formula provides a formula/data frame interface to glmnet.
- cv.glmnet.formula does a similar thing for cv.glmnet.
- Methods for predict and coef for both the above.
- A function cvAlpha.glmnet to choose both the alpha and lambda parameters via cross-validation, following the approach described in the help page for cv.glmnet. Optionally does the cross-validation in
- Methods for plot, predict and coef for the above.
Incidentally, while writing the above, I think I realised why nobody has done this before. Central to R's handling of model frames and model matrices is a terms object, which includes a matrix with one row per variable and one column per main effect and interaction. In effect, that's (at minimum) roughly a p x p matrix, where p is the number of variables in the model. When p is 16000, which is common these days with wide data, the resulting matrix is about a gigabyte in size.
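The gigabyte figure can be checked with a quick back-of-the-envelope calculation, shown here in Python. It assumes the matrix is dense and that each entry takes 4 bytes (a typical integer size); both are assumptions made for the arithmetic, not something verified against R's internals here.

```python
def terms_matrix_bytes(p, bytes_per_entry=4):
    """Size of a dense p x p matrix, assuming 4-byte integer entries."""
    return p * p * bytes_per_entry


for p in (1_000, 10_000, 16_000):
    size_gb = terms_matrix_bytes(p) / 1e9
    print(f"p = {p:>6}: {size_gb:.3f} GB")
```

Under those assumptions, p = 16000 gives 16000 x 16000 x 4 = 1,024,000,000 bytes, i.e. roughly one gigabyte, and the cost grows quadratically with p.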
Still, I haven't had any problems (yet) working with these objects. If it becomes a major issue, I'll see if I can find a workaround.
Update Oct-2016
I've pushed an update to the repo, to address the above issue as well as one related to factors. From the documentation:
There are two ways in which glmnetUtils can generate a model matrix out of a formula and data frame. The first is to use the standard R machinery comprising model.frame and model.matrix; and the second is to build the matrix one variable at a time. These options are discussed and contrasted below.
Using model.frame
This is the simpler option, and the one that is most compatible with other R modelling functions. The model.frame function takes a formula and data frame and returns a model frame: a data frame with special information attached that lets R make sense of the terms in the formula. For example, if a formula includes an interaction term, the model frame will specify which columns in the data relate to the interaction, and how they should be treated. Similarly, if the formula includes expressions like exp(x) or I(x^2) on the RHS, model.frame will evaluate these expressions and include them in the output.
The major disadvantage of using model.frame is that it generates a terms object, which encodes how variables and interactions are organised. One of the attributes of this object is a matrix with one row per variable, and one column per main effect and interaction. At minimum, this is (approximately) a p x p square matrix where p is the number of main effects in the model. For wide datasets with p > 10000, this matrix can approach or exceed a gigabyte in size. Even if there is enough memory to store such an object, generating the model matrix can take a significant amount of time.
Another issue with the standard R approach is the treatment of factors. Normally, model.matrix will turn an N-level factor into an indicator matrix with N-1 columns, with one column being dropped. This is necessary for unregularised models as fit with lm and glm, since the full set of N columns is linearly dependent. With the usual treatment contrasts, the interpretation is that the dropped column represents a baseline level, while the coefficients for the other columns represent the difference in the response relative to the baseline.
This may not be appropriate for a regularised model as fit with glmnet. The regularisation procedure shrinks the coefficients towards zero, which forces the estimated differences from the baseline to be smaller. But this only makes sense if the baseline level was chosen beforehand, or is otherwise meaningful as a default; otherwise it is effectively making the levels more similar to an arbitrarily chosen level.
Manually building the model matrix
To deal with the problems above, glmnetUtils by default will avoid using model.frame, instead building up the model matrix term-by-term. This avoids the memory cost of creating a terms object, and can be noticeably faster than the standard approach. It will also include one column in the model matrix for all levels in a factor; that is, no baseline level is assumed. In this situation, the coefficients represent differences from the overall mean response, and shrinking them to zero is meaningful (usually).
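The difference between the two factor encodings can be shown with a small Python sketch. The helper indicator_columns and the colour data are made up for illustration; this is neither the glmnetUtils code nor R's model.matrix, just the same idea in miniature.

```python
def indicator_columns(values, drop_first):
    """Build 0/1 indicator columns for a categorical variable.

    drop_first=True mimics a baseline-level encoding (N-1 columns);
    drop_first=False keeps one column per level (N columns).
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


colour = ["red", "green", "blue", "green"]
baseline_names, baseline_rows = indicator_columns(colour, drop_first=True)
full_names, full_rows = indicator_columns(colour, drop_first=False)
```

With the baseline encoding, "blue" becomes the all-zero row, so shrinking the other coefficients pulls "green" and "red" towards "blue" specifically. With one column per level, no level is singled out in this way.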
The main downside of not using model.frame is that the formula can only be relatively simple. At the moment, only straightforward formulas like y ~ x1 + x2 + ... + x_p are handled by the code, where the x's are columns already present in the data. Interaction terms and computed expressions are not supported. Where possible, you should compute such expressions beforehand.
Update Apr-2017
After a few hiccups, this is finally on CRAN.

Histogram matching - image processing - c/c++

I have two histograms.
int Hist1[10] = {1,4,3,5,2,5,4,6,3,2};
int Hist2[10] = {1,4,3,15,12,15,4,6,3,2};
Hist1's distribution is of type multi-modal;
Hist2's distribution is of type uni-modal with single prominent peak.
My questions are
Is there any way that I could determine the type of distribution programmatically?
How can I quantify whether these two histograms are similar or dissimilar?
I posted a C function in your other question ( automatically compare two series -Dissimilarity test ) that will compute divergence between two sets of similar data. It's actually intended to tell you how closely real data matches predicted data but I suspect you could use it for your purpose.
Basically, the smaller the error, the more similar the two sets are.
These are just guesses, but I would try fitting each distribution as a Gaussian and use something like the R-squared value to determine whether the distribution is uni-modal or not.
As to the similarity between the two distributions, I would try doing an autocorrelation and using the peak positive value in the autocorrelation as a similarity measure. These ideas are pretty rough, but hopefully they give you some ideas.
For #2, you could calculate their cross-correlation (so long as the buckets themselves can be sorted). That would give you a rough estimate of their "similarity".
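A minimal sketch of the cross-correlation idea in Python, using the two histograms from the question. The mean-centred, normalized form used here is one common variant and is my assumption about what is meant; the lag range of -3 to 3 bins is likewise arbitrary.

```python
import math


def normalized_cross_correlation(a, b, lag=0):
    """Pearson-style correlation between a and b shifted by `lag` bins."""
    if lag >= 0:
        xs, ys = a[lag:], b[:len(b) - lag]
    else:
        xs, ys = a[:len(a) + lag], b[-lag:]
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                     sum((y - mean_y) ** 2 for y in ys))
    return cov / norm if norm else 0.0


hist1 = [1, 4, 3, 5, 2, 5, 4, 6, 3, 2]
hist2 = [1, 4, 3, 15, 12, 15, 4, 6, 3, 2]

# Similarity with the bins aligned, and the shift that matches best.
similarity = normalized_cross_correlation(hist1, hist2)
best_lag = max(range(-3, 4),
               key=lambda k: normalized_cross_correlation(hist1, hist2, k))
```

The result lies between -1 and 1, with 1 meaning the two histograms have the same shape up to scale and offset. Scanning over lags only makes sense when the buckets are ordered, as noted above.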
Comparison of Histograms (For Use in Cloud Modeling).
(That's an MS .doc file.)
There are a variety of software packages that will "fit" your distributions to known discrete distributions for you - Minitab, STATA, R, etc. A reference to fitting distributions in R is here. I wouldn't advise programming this from scratch.
Regarding distribution comparisons, if neither distribution fits a known distribution (Poisson, Binomial, etc.), then you need to use non-parametric methods described here.
