Evaluate stm Model - r

I'm working on an STM model (topic modelling) and I'd like to evaluate and verify the model, but I'm not sure how to do it. My code is:
Corpus.STM <- readCorpus(dtm, type = "slam")
#Model choice:
BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
str(BestM1.)
plot.searchK(BestM1.)
plot.searchK(BestM2.)
plot.searchK(BestM3.)
#27 seems to be a good choice
#Heldout
set.seed(1)
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1)
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100 )
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing)
heldout.evaluation
#evaluation heldout
labelTopics(stm.mod1)
plot.STM(stm.mod1, type="labels", n=5, frexweight = 0.25)
cloud(stm.mod1, topic=5)
plot.STM(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)
I'm not sure how to interpret the output of "eval.heldout". Additionally, I want to make sure that the model doesn't overfit, but I'm not sure how to check that.

eval.heldout() calculates the held-out log-likelihood using document completion. The number you want is heldout.evaluation$expected.heldout, which is the average of the per-document held-out log-likelihood values. Unfortunately, there is no unambiguous measure of whether or not the model is "overfit." The plot.searchK() call you have will plot the held-out log-likelihood over different values of K; if that number decreases as K goes up, one explanation is overfitting.
Sorry not to have a clearer answer, but there are no hard and fast rules here.
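A minimal sketch of reading that output, assuming the stm package is installed. It uses the poliblog5k data bundled with stm rather than the corpus from the question, and K = 3 and max.em.its = 5 are arbitrary small values chosen only to keep the run short.

```r
library(stm)

# Bundled example data (an assumption for illustration, not the asker's corpus)
data(poliblog5k)

# Hold out a share of the tokens in a subset of documents
heldout <- make.heldout(poliblog5k.docs, poliblog5k.voc, seed = 1)

# Fit a deliberately small, short model on the remaining tokens
mod <- stm(heldout$documents, heldout$vocab, K = 3,
           init.type = "Spectral", max.em.its = 5, verbose = FALSE)

ev <- eval.heldout(mod, heldout$missing)

# Average held-out log-likelihood (higher, i.e. closer to zero, is better)
ev$expected.heldout

# Per-document values that the average is computed from
summary(ev$doc.heldout)
```

Comparing ev$expected.heldout across models fitted with different K on the same held-out split is what the searchK() plot does for you.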

Related

Output of keras in R can not be used to predict

I use keras and tensorflow to run an LSTM in R to predict some stock market prices.
Here I am providing the code where, instead of stock market prices, I just use one randomly generated vector VECTOR of length 100. Then I consider a training period of the first 80 values and try to predict the 20 test values...
What am I doing wrong?
I am getting an error:
Error in UseMethod("predict") :
no applicable method for 'predict' applied to an object of class "keras_training_history"
Thank you
library(tensorflow)
library(keras)
set.seed(12345)
VECTOR=rnorm(100,2,5)
VECTOR_training=VECTOR[1:80]
VECTOR_test=VECTOR[81:100]
training_rescaled=scale(VECTOR_training)
#I also calculate the scale factors because I will need them when I will be coming
#back to the original data
scale_factors=matrix(NA,nrow=1,ncol=2)
scale_factors=c(mean(VECTOR_training), sd(VECTOR_training))
#We want to predict 20 days, so we need to base each prediction on 20 data points.
prediction_stocks=20
lag_stocks=prediction_stocks
test_rescaled =training_rescaled[(length(VECTOR_training)- prediction_stocks + 1):length(VECTOR_training)]
#We lag the data 20 times, so that each prediction is based on 20 values, and arrange lagged values into columns. Then we transform it into the desired 3D form.
x_train_data_stocks=t(sapply(1:(length(VECTOR_training)-lag_stocks-prediction_stocks+1),
function(x) training_rescaled[x:(x+lag_stocks-1),1]
))
# now we transform it into 3D form
x_train_arr_stocks=array(
data=as.numeric(unlist(x_train_data_stocks)),
dim=c(
nrow(x_train_data_stocks),
lag_stocks,
1
)
)
#Now we apply similar transformation for the Y values.
y_train_data_stocks=t(sapply(
(1 + lag_stocks):(length(training_rescaled) - prediction_stocks + 1),
function(x) training_rescaled[x:(x + prediction_stocks - 1)]
))
y_train_arr_stocks= array(
data = as.numeric(unlist(y_train_data_stocks)),
dim = c(
nrow(y_train_data_stocks),
prediction_stocks,
1
)
)
#In the same manner we need to prepare input data for the prediction
#list_test_rescaled
# this time our array just has one sample, as we intend to perform one 20-days prediction
x_pred_arr_stocks=array(
data = test_rescaled,
dim = c(
1,
lag_stocks,
1
)
)
###lstm forecast prova
set.seed(12345)
lstm_model <- keras_model_sequential()
lstm_model_prova=
layer_lstm(lstm_model,units = 70, # size of the layer
batch_input_shape = c(1, 20, 1), # batch size, timesteps, features
return_sequences = TRUE,
stateful = TRUE) %>%
# fraction of the units to drop for the linear transformation of the inputs
layer_dropout(rate = 0.5) %>%
layer_lstm(units = 50,
return_sequences = TRUE,
stateful = TRUE) %>%
layer_dropout(rate = 0.5) %>%
time_distributed(keras::layer_dense(units = 1))
lstm_model_compile=compile(lstm_model_prova,loss = 'mae', optimizer = 'adam', metrics = 'accuracy')
lstm_fit_prova=fit(lstm_model_compile,
x = x_train_arr_stocks[[1]],
y = y_train_arr_stocks[[1]],
batch_size = 1,
epochs = 20,
verbose = 0,
shuffle = FALSE
)
lstm_forecast_prova=predict(lstm_fit_prova,x_pred_arr_stocks, batch_size = 1)
It works if I use
lstm_forecast_prova=predict(lstm_model_compile,x_pred_arr_stocks, batch_size = 1)
But shouldn't I use the fitted model in order to make the predictions?
Also, if I plot the fitted model, the accuracy is 0. And actually on my real data the predictions do not make any sense. So what does it mean that the accuracy is 0? Maybe something is wrong with the lstm parameters?
Thank you in advance!!

MXNET softmax output: label shape confusion

I don't have a clear idea of how labels for the softmax classifier should be shaped.
What I could understand from my experiments is that a scalar label indicating the index of the class probability output is one option, while another is a 2D label where the rows are class probabilities, or a one-hot encoded variable, like c(1, 0, 0).
What puzzles me though is that:
I can use scalar label values that go beyond the index range, like 4 in my example below, without warning or error. Why is that?
When my label is a negative scalar or an array with a negative value, the model converges to a uniform probability distribution over classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
I am trying to use the MXNET softmax classifier to produce the policy gradient for reinforcement learning, and my negative rewards lead to the issue above: uniform probability. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
factor_type = "avg",
magnitude = 0.0001)
actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
# NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
data = actor_nn_data
, num_hidden = 3 )
actor_output <- mx.symbol.SoftmaxOutput(
data = actor_fc3
, label = actor_nn_label
, name = 'actor' )
crossentfunc <- function(label, pred)
{
- sum(label * log(pred)) }
actor_loss <- mx.metric.custom(
feval = crossentfunc
, name = "log-loss"
)
# initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
symbol = actor_output,
X = actor_train.x,
y = actor_train.y,
ctx = device.cpu,
num.round = 100,
array.batch.size = 1,
optimizer = 'adam',
eval.metric = actor_loss,
clip_gradient = 1,
wd = 0.01,
initializer = actor_initializer,
array.layout = "rowmajor" )
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It seems quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected for negative targets. I ran simulations in R using a simple NN and the optim function and got the same result.
It looks like adam or SGD may be buggy in the case of multinomial classification... I also used to get stuck on the fact that those optimizers did not converge to a perfect solution on just 1 example, while rmsprop does! Be aware!
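To make the two label encodings from the question concrete, here is a small base-R sketch (it does not use MXNET). It assumes zero-based class indices for scalar labels; softmax() and the logits are a hypothetical helper and made-up values chosen for illustration. For a valid class index, the scalar label and the one-hot row give the same cross-entropy.

```r
# Numerically stable softmax (hypothetical helper, not part of MXNET)
softmax <- function(z) {
  e <- exp(z - max(z))
  e / sum(e)
}

logits <- c(2.0, 0.5, -1.0)   # made-up network outputs for 3 classes
p <- softmax(logits)

# Encoding 1: scalar class index (zero-based, assumed)
label_index <- 0
loss_index <- -log(p[label_index + 1])

# Encoding 2: one-hot row for the same class
one_hot <- as.numeric(seq_along(p) - 1 == label_index)
loss_onehot <- -sum(one_hot * log(p))

# Both encodings describe the same loss
all.equal(loss_index, loss_onehot)

# In this sketch, the same one-hot construction for an out-of-range index
# such as 4 yields all zeros, so the loss term vanishes without any error
as.numeric(seq_along(p) - 1 == 4)
```

Whether MXNET treats out-of-range or negative labels the same way internally is not something this sketch establishes; it only shows why such labels need not produce an error in a one-hot style computation.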

Understanding num_classes for xgboost in R

I'm having a lot of trouble figuring out how to correctly set num_class for xgboost.
I've got an example using the Iris data
df <- iris
y <- df$Species
num.class = length(levels(y))
levels(y) = 1:num.class
head(y)
df <- df[,1:4]
y <- as.matrix(y)
df <- as.matrix(df)
param <- list("objective" = "multi:softprob",
"num_class" = 3,
"eval_metric" = "mlogloss",
"nthread" = 8,
"max_depth" = 16,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 1,
"colsample_bytree" = 1,
"min_child_weight" = 12)
model <- xgboost(param=param, data=df, label=y, nrounds=20)
This returns an error
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label
If I change the num_class to 2 I get the same error. If I increase the num_class to 4 then the model runs, but I get 600 predicted probabilities back, which makes sense for 4 classes.
I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.
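The error message states the requirement: label must be in [0, num_class).
In your script, add y <- as.numeric(y) - 1 before model <- ... (after as.matrix() on a factor, y is a character matrix, so it has to be converted to numeric before subtracting).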
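A minimal base-R sketch of that re-coding, using the iris data from the question. It only prepares and checks the labels and does not call xgboost; the variable names are illustrative.

```r
# Factor -> integer codes 1..3, then shift to the zero-based range 0..2
y <- as.integer(iris$Species) - 1L
num_class <- length(levels(iris$Species))

# Every label must lie in [0, num_class)
stopifnot(all(y >= 0), all(y < num_class))

# Class counts per zero-based label
table(y)

# multi:softprob returns one probability per row and class, so the number
# of predicted values is nrow * num_class (150 * 4 = 600 in the question)
nrow(iris) * num_class
```

With labels in 0..2, num_class = 3 matches the data and there is no need for a spare fourth class.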
I ran into this rather weird problem as well. It seemed in my case to be a result of not properly encoding the labels.
First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.
I re-encoded the labels as integers and then num_class worked fine when set to N.
# Convert classes to integers for xgboost (uses the data.table package)
library(data.table)
class <- data.table(interest_level=c("low", "medium", "high"), class=c(0,1,2))
t1 <- merge(t1, class, by="interest_level", all.x=TRUE, sort=F)
and
param <- list(booster="gbtree",
objective="multi:softprob",
eval_metric="mlogloss",
#nthread=13,
num_class=3,
eta_decay = .99,
eta = .005,
gamma = 1,
max_depth = 4,
min_child_weight = .9,#1,
subsample = .7,
colsample_bytree = .5
)
For example.
I was seeing the same error; my issue was that I was using an eval_metric that was only meant for multiclass labels when my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all of the options.
I had this problem and it turned out that I was trying to subtract 1 from my response variable, which was already coded as 0 and 1. Probably a novice mistake, but in case anyone else is running into this with a binary response variable that is already 0 and 1, it is something to make note of.
Tutorial said:
label = as.integer(iris$Species)-1
What worked for me (response is high_end):
label = as.integer(high_end)

Try to figure out "target dose" from DoseFinding package

I tried to run the code to find the target dose and a D-optimal design, but it always says my length is wrong. I do not understand how to code "weights" and Delta. Can someone help me out here?
Here is my code:
library(DoseFinding)
doses <- c(0,5, 25, 125, 200)
fmodels <- Mods(linear = NULL, emax = 14,
doses = doses, placEff=-0.17, maxEff=-1.4)
weights <- rep(1/5, 5)
desTD <- optDesign(fmodels, probs=1, designCrit="TD&Dopt",Delta=0.5)
plot(fmodels, plotTD = TRUE, Delta = 0.2)
When I used the example it showed all the probabilities for all doses:
data(IBScovars)
doses <- c(0, 10, 25, 50, 100, 150)
fmodels <- Mods(linear = NULL, emax = 25, exponential = 85,
logistic = c(50, 10.8811),
doses = doses, placEff=0, maxEff=0.4)
plot(fmodels, plotTD = TRUE, Delta = 0.2)
weights <- rep(1/4, 4)
desTD <- optDesign(fmodels, weights, Delta=0.2, designCrit="TD")
Calculated TD - optimal design:
0 10 25 50 100 150
0.34960 0.09252 0.00366 0.26760 0.13342 0.15319
But for mine only three doses show up. Does that mean the other doses are not important?
Well, from the help page, you need to have the same number of weights as doses. The models retain info about the doses used. What you have looks fine, but you could also do
ds <- attr(fmodels, "doses")
weights <- rep.int(1, length(ds))/length(ds)
to extract the info from the fmodel object.
Also, when running your optDesign function, I had problems with the probs and designCrit parameters you specified. The length of probs should correspond to the number of models in fmodels. An internal calculation does this to find the total length:
Reduce("+",lapply(fmodels, function(x) {
if (is.vector(x))
return(1)
if (is.matrix(x))
return(nrow(x))
}))
# [1] 2
so there should be two probabilities. Also, I think designCrit="TD&Dopt" should be designCrit="Dopt&TD", so the following seems to run without error:
desTD <- optDesign(fmodels, probs=c(.5,.5),
weights=weights, designCrit="Dopt&TD",Delta=0.5)
It's unclear exactly what your question about Delta is. According to the help page, that's just your estimated effect size.

how do I select the smoothing parameter for smooth.spline()?

I know that the smoothing parameter (lambda) is quite important for fitting a smoothing spline, but I did not see any post here on how to select a reasonable lambda (spar=?). I was told that spar normally ranges from 0 to 1. Could anyone share their experience using smooth.spline()? Thanks.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list(), tol = 1e-6 * IQR(x))
agstudy provides a visual way to choose spar. What I remember learning in a linear models class (though not exactly) is to use cross-validation to pick the "best" spar. Here's a toy example borrowed from agstudy:
x = seq(1:18)
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4))
splineres <- function(spar){
res <- rep(0, length(x))
for (i in 1:length(x)){
mod <- smooth.spline(x[-i], y[-i], spar = spar)
res[i] <- predict(mod, x[i])$y - y[i]
}
return(sum(res^2))
}
spars <- seq(0, 1.5, by = 0.001)
ss <- rep(0, length(spars))
for (i in 1:length(spars)){
ss[i] <- splineres(spars[i])
}
plot(spars, ss, 'l', xlab = 'spar', ylab = 'Cross Validation Residual Sum of Squares' , main = 'CV RSS vs Spar')
spars[which.min(ss)]
R > spars[which.min(ss)]
[1] 0.381
The code is not the neatest, but it is easy to understand. Also, if you specify cv=T in smooth.spline:
R > xyspline <- smooth.spline(x, y, cv=T)
R > xyspline$spar
[1] 0.3881
From the help of smooth.spline you have the following:
The computational λ used (as a function of spar) is λ = r * 256^(3*spar - 1)
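The quoted formula means lambda grows exponentially in spar: raising spar by 1/3 multiplies lambda by 256. Here is a short sketch that checks this arithmetic and compares two fits on the toy data used elsewhere in this thread. The helper lambda_factor() is a name made up for illustration, and r is the data-dependent constant from the help page.

```r
# Multiplier of r in the documented formula (hypothetical helper)
lambda_factor <- function(spar) 256^(3 * spar - 1)

# Raising spar by 1/3 multiplies lambda by 256
step_ratio <- lambda_factor(0.5 + 1/3) / lambda_factor(0.5)
step_ratio

# Compare with the lambda values stored in two fitted objects
x <- 1:18
y <- c(1:3, 5, 4, 7:3, 2 * (2:5), rep(10, 4))
fit_lo <- smooth.spline(x, y, spar = 0.3)
fit_hi <- smooth.spline(x, y, spar = 0.6)

# If lambda follows the formula with the same r, these two numbers agree
c(observed = fit_hi$lambda / fit_lo$lambda,
  formula  = lambda_factor(0.6) / lambda_factor(0.3))
```

This is why small changes in spar can move the fit from nearly interpolating to nearly linear.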
spar can be greater than 1 (but I guess not by too much). I think you can vary this parameter and choose it graphically by plotting the fitted values for different spars. For example:
library(lattice) ## xyplot() and the panel functions come from lattice
spars <- seq(0.2,2,length.out=10) ## I will choose between 10 values
dat <- data.frame(
spar= as.factor(rep(spars,each=18)), ## spar to group data(to get different colors)
x = seq(1:18), ## recycling here to repeat x and y
y = c(1:3,5,4,7:3,2*(2:5),rep(10,4)))
xyplot(y~x|spar,data =dat, type=c('p'), pch=19,groups=spar,
panel =function(x,y,groups,...)
{
s2 <- smooth.spline(x,y,spar=spars[panel.number()])
panel.lines(s2)
panel.xyplot(x,y,groups,...)
})
Here, for example, I get the best results for spar = 0.4.
If you don't have duplicated points at the same x value, then try generalized cross-validation (GCV), which smooth.spline() uses by default (cv = FALSE) when neither spar nor df is supplied. GCV is a clever way of picking a good value for lambda (spar). One neat detail about GCV is that it doesn't actually have to do the calculations for every single leave-one-out set of points, as highlighted in Simon Wood's book. For lots of detail on this, have a look at the notes on Simon Wood's web page on mgcv.
Adrian Bowman's R package sm has a function h.select(), which is intended specifically for doing the grunt work of choosing a value of lambda (though I'm not 100% sure that it is compatible with the smooth.spline() function in the base package).
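A minimal sketch of letting smooth.spline() choose the smoothing parameter itself, on the toy data from the earlier answer. In base R the cv argument controls the criterion: cv = TRUE is ordinary leave-one-out cross-validation and cv = FALSE (the default) is GCV. Either is used only when neither spar nor df is supplied.

```r
x <- 1:18
y <- c(1:3, 5, 4, 7:3, 2 * (2:5), rep(10, 4))

fit_gcv <- smooth.spline(x, y)             # default: generalized CV
fit_loo <- smooth.spline(x, y, cv = TRUE)  # ordinary leave-one-out CV

# Chosen smoothing parameter, equivalent degrees of freedom, criterion value
data.frame(
  criterion = c("GCV", "LOO-CV"),
  spar      = c(fit_gcv$spar, fit_loo$spar),
  df        = c(fit_gcv$df, fit_loo$df),
  cv.crit   = c(fit_gcv$cv.crit, fit_loo$cv.crit)
)
```

The two criteria need not agree, so it is still worth plotting the fitted curves before settling on a value.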
