Assessing LDA predictions with textmineR in R - Calculating perplexity?

I am working on an LDA model with textmineR, have calculated coherence and log-likelihood measures, and have optimized my model.
As a last step I would like to see how well the model predicts topics on unseen data. Thus, I am using the predict() function from the textmineR package in combination with Gibbs sampling on my test-set sample.
This results in predicted "theta" values for each document in my test-set sample.
While I have read in another post that perplexity calculations are not available with the textmineR package (see this post: How do i measure perplexity scores on a LDA model made with the textmineR package in R?), I am now wondering what the purpose of the prediction function is, then. Especially with a large dataset of over 100,000 documents it is hard to assess just by eye whether the prediction has performed well or not.
I do not want to use perplexity for model selection (I am using coherence/log-likelihood instead), but as far as I understand it, perplexity would help me understand how good the prediction is and how "surprised" the model is by new, previously unseen data.
Since this does not seem to be available for textmineR, I am not sure how to assess the model prediction. Is there anything else that I could use to measure the prediction quality of my textmineR model?
Thank you!
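If held-out perplexity is still wanted, one option is to compute it by hand from the predicted theta and the fitted model's phi. Below is a minimal sketch, assuming `model` comes from FitLdaModel() and `dtm_test` from CreateDtm(); the predict() settings (iterations, burnin) and the vocabulary alignment are illustrative, not a prescribed textmineR workflow.

# hold-out perplexity from predicted theta and the model's phi (sketch)
library(textmineR)
library(Matrix)

# keep only test terms the model knows about, and align the columns
vocab    <- intersect(colnames(dtm_test), colnames(model$phi))
dtm_test <- dtm_test[, vocab]
phi      <- model$phi[, vocab]

theta_test <- predict(model, dtm_test, method = "gibbs",
                      iterations = 200, burnin = 175)

# per-document log-likelihood: sum over observed tokens of
# log( sum_k theta[d, k] * phi[k, w] )
log_lik <- vapply(seq_len(nrow(dtm_test)), function(d) {
  idx <- which(dtm_test[d, ] > 0)
  p_w <- as.numeric(theta_test[d, ] %*% phi[, idx, drop = FALSE])
  sum(dtm_test[d, idx] * log(p_w))
}, numeric(1))

# perplexity = exp( - total log-likelihood / total token count )
perplexity <- exp(-sum(log_lik) / sum(dtm_test))
perplexity

Lower values mean the model is less "surprised" by the held-out documents, which is the quantity asked about; it is still most useful as a relative comparison between models rather than as an absolute number.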

Related

Correct presentation of model development with post-estimation shrinkage after internal validation in R

I'm developing a prediction model (binary outcome) in R and I have the following question:
After fitting my model with fit.mult.impute from the Hmisc package (I dealt with some missing values in my data), I used the validate function from the rms package to assess optimism with bootstrapping and obtained my optimism-corrected performance measures. I then did a post-estimation shrinkage, using the index.corrected calibration slope as the shrinkage factor, to obtain shrunken coefficients. My question is about correct model presentation.
It is common practice to report the apparent performance measures (with CIs), then the optimism, and then the optimism-corrected measures (with CIs). But what about the final post-shrinkage model? Should performance measures be reported for this final model as well, or is this final model only to be used for presenting the model equation or a nomogram? The same question applies to calibration curves, for which I intended to use the CalibrationCurves package or the calibrate function from rms, but I don't know whether I also have to plot the performance of the shrunken model.
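For reference, a minimal sketch of the validate-then-shrink step described above, assuming a binary outcome fitted with lrm() (the data frame `d`, predictors `x1`/`x2`, and B = 200 resamples are hypothetical stand-ins; the actual fit in the question uses fit.mult.impute):

library(rms)

fit <- lrm(y ~ x1 + x2, data = d, x = TRUE, y = TRUE)

# bootstrap validation; the optimism-corrected calibration slope is the shrinkage factor
val   <- validate(fit, B = 200)
slope <- val["Slope", "index.corrected"]

# shrink the non-intercept coefficients, then re-estimate the intercept
# with the shrunken linear predictor as an offset
shrunk     <- coef(fit)
shrunk[-1] <- shrunk[-1] * slope
lp         <- as.vector(fit$x %*% shrunk[-1])
shrunk[1]  <- coef(glm(fit$y ~ 1 + offset(lp), family = binomial))[1]
shrunk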

SVM for data prediction in R

I'd like to use the 'e1071' library for fitting an SVM model. So far, I've made a model that fits a regression curve to the data set
(take a look at the purple curve):
However, I want the SVM model to "follow" the data, so that the prediction for each value is as close as possible to the actual data. I think this is possible because of this graph, which shows how an SVM (model 2) can be similar to an ARIMA model (model 1):
I tried changing the kernel, to no avail. Any help will be much appreciated.
Fine-tuning an SVM is no easy task. Have you considered other models, for example GAMs (generalized additive models)? These work well on very curvy data.
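That said, if the goal is mainly to make the e1071 fit hug the data more closely, a small grid search over cost, gamma and epsilon is a reasonable first step. A sketch, with a hypothetical data frame `d` holding columns x and y and purely illustrative grids:

library(e1071)

# smaller epsilon and larger cost/gamma generally let the regression
# curve follow the data more closely (at the risk of overfitting)
tuned <- tune(svm, y ~ x, data = d, kernel = "radial",
              ranges = list(cost    = 2^(0:6),
                            gamma   = 2^(-4:2),
                            epsilon = c(0.01, 0.05, 0.1, 0.2)))

best <- tuned$best.model

# plot the data with the tuned fit overlaid
ord <- order(d$x)
plot(d$x, d$y)
lines(d$x[ord], predict(best, d)[ord], col = "purple")

tune() cross-validates over the grid (10-fold by default), so the chosen parameters should track the data without simply memorising it.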

Feature selection and prediction accuracy in regression Forest in R

I am attempting to solve a regression problem where the input feature set is of size ~54.
Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y, so I am trying to find additional important features using a regression forest (i.e., random forest regression). The selected 'X1' later turns out to be the most important feature.
My dataset has ~14,500 entries. I have split it into training and test sets in a 9:1 ratio.
I have the following questions:
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared on both the training and the test set. I am getting high MSE and low R2 on the training data, and the reverse on the test data (shown below). Is this unusual?
forest <- randomForest(fmla, dTraining, ntree=501, importance=T)

mean((dTraining$y - predict(forest, data=dTraining))^2)
# 0.9371891
rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))
# 0.7431078
mean((dTest$y - predict(forest, newdata=dTest))^2)
# 0.009771256
rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))
# 0.9950448
Please suggest. Also, are R-squared and MSE good metrics for this problem, or should I look at other metrics to evaluate whether the model is good?
You should also try asking this on Cross Validated.
when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
Only on the training data. You want to prevent overfitting, which is why you do a train-test split in the first place.
Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
Yes, but the purpose of feature selection is not necessarily to speed up computation. With enough features, it is possible to fit any pattern in the data (i.e., to overfit). With feature selection, you're hoping to prevent overfitting by using only a few 'robust' features.
For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?
Yes, it's unusual. You want low MSE and high R2 for both your training and test data. (I would double-check your calculations.) If you're getting high MSE and low R2 on your training data, it means the training was poor, which is very surprising. Also, I haven't used rSquared, but maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
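One thing worth ruling out first: randomForest's predict() takes a `newdata` argument, so predict(forest, data=dTraining) silently ignores data= and returns the out-of-bag predictions, which explains why the "training" numbers are not the near-perfect resubstitution fit one would expect from a random forest. A sketch that computes both sets of metrics the same way (same fmla, dTraining and dTest as above; R-squared computed directly):

library(randomForest)

forest <- randomForest(fmla, data = dTraining, ntree = 501, importance = TRUE)

# use newdata= for both sets; omitting it returns out-of-bag predictions
pred_train <- predict(forest, newdata = dTraining)
pred_test  <- predict(forest, newdata = dTest)

mse <- function(obs, pred) mean((obs - pred)^2)
r2  <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(train = mse(dTraining$y, pred_train), test = mse(dTest$y, pred_test))
c(train = r2(dTraining$y, pred_train),  test = r2(dTest$y, pred_test))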

Output posterior distribution from bayesian network in R (bnlearn)

I'm experimenting with Bayesian networks in R and have built some networks using the bnlearn package. I can use them to make predictions for new observations with predict(); however, I would also like to have the posterior distribution over the possible classes. Is there a way of retrieving this information?
It seems like there is a prob argument that does this for the naive Bayes implementation in the bnlearn package, but not for networks fitted with bn.fit.
Thankful for any help with this.
See the documentation of bnlearn.
The predict function implements prob only for naive.bayes and TAN.
In short, that is because all other methods do not necessarily compute posterior probabilities.
From the bnlearn documentation: predict returns the predicted values for node given the data specified by data. Depending on the value of method, the predicted values are computed using either the parents method or the bayes-lw method.
When using bayes-lw, likelihood weighting simulations are performed to make the predictions.
Hope this helps. :)
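As a workaround for fits where prob is not available, an approximate posterior over the class node can be obtained by sampling with cpdist() and likelihood weighting. A sketch, assuming `fitted` comes from bn.fit() on discrete data; the node name "Class" and the evidence values are purely hypothetical:

library(bnlearn)

# hypothetical observed values for the predictor nodes of one new case
evidence <- list(x1 = "yes", x2 = "high")

# weighted samples of the class node given the evidence
samples <- cpdist(fitted, nodes = "Class", evidence = evidence,
                  method = "lw", n = 10^5)
w <- attr(samples, "weights")

# weighted relative frequencies approximate P(Class | evidence)
posterior <- tapply(w, samples$Class, sum) / sum(w)
posterior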

prediction intervals with caret

I've been using the caret package in R to run some boosted regression tree and random forest models and am hoping to generate prediction intervals for a set of new cases using the inbuilt cross-validation routine.
The trainControl function allows you to save the hold-out predictions at each of the n folds, but I'm wondering whether unknown cases can also be predicted at each fold using the built-in functions, or whether I need to use a separate loop to build the models n times.
Any advice is much appreciated.
Check the R package quantregForest, available on CRAN. It can easily calculate prediction intervals for random forest models. There's a nice paper by the package author explaining the background of the method. (Sorry, I can't say anything about prediction intervals for BRT models; I'm looking for those myself...)
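A sketch of that route for the random-forest case, with hypothetical training data x_train/y_train and new cases x_new; note that the quantile argument is what in recent package versions (older versions used quantiles):

library(quantregForest)

# quantile regression forest on the training data
qrf <- quantregForest(x = x_train, y = y_train, ntree = 500)

# conditional 5th, 50th and 95th percentiles for the new cases;
# the outer two bound an approximate 90% prediction interval
intervals <- predict(qrf, newdata = x_new, what = c(0.05, 0.5, 0.95))
head(intervals)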
