Random data generation leading to good prediction on random labels - r

I've been playing around with implementing cross-validation (CV) in R, but I encountered a weird problem with the values returned across folds in LOOCV.
First I randomly generate data as well as labels, then I fit a randomForest on what should be pure noise. From the resulting loop I get not only a good AUC but also a significant p-value from a t-test. I don't understand how this could happen theoretically, so I was curious whether the ways I tried to generate the data and labels are sound.
Here is a code snippet that shows my issue.
library(randomForest)
library(pROC)

n <- 30
p <- 900
set.seed(3)

XX <- matrix(rnorm(n * p, 0, 1), nrow = n)               # random predictors
YY <- as.factor(sample(c('P', 'C'), n, replace = TRUE))  # random labels

# leave-one-out cross-validation
resp <- vector()
for (i in 1:n) {
  fit <- randomForest(XX[-i, ], YY[-i])
  pred <- predict(fit, XX[i, ], type = "prob")[2]
  resp[i] <- pred
}

t.test(resp ~ YY)$p.value
roc(YY, resp)$auc
I tried multiple ways of generating the data, all of which give the same result:
XX <- matrix(runif(n * p), nrow = n)
XX <- matrix(rnorm(n * p, 0, 1), nrow = n)
and
random_data <- matrix(0, n, p)
for (i in 1:n) {
  random_data[i, ] <- jitter(runif(p), factor = 1, amount = 10)
}
XX <- as.matrix(random_data)
Since randomForest is finding relevant predictors in this scenario, that leads me to believe the data may not be truly random. Is there a better way I could generate the data, or the random labels? Is it possible that this is an issue with R?

This is a partial answer: I modified your roc() call so that the AUC values can fall anywhere between 0 and 1 (by fixing the direction), then ran the whole thing 20 times. The mean AUC and mean p-value are 0.73 and 0.12, respectively. Improved, but still better than random...
library(ROCR)
library(randomForest)
library(pROC)

n <- 30
p <- 900
pvs <- vector()
aucs <- vector()

for (j in seq(20)) {
  XX <- matrix(rnorm(n * p, 0, 1), nrow = n)
  YY <- as.factor(sample(c('C', 'P'), n, replace = TRUE))
  resp <- vector()
  for (i in 1:n) {
    fit <- randomForest(XX[-i, ], YY[-i])
    pred <- predict(fit, XX[i, ], type = "prob")[2]
    resp[i] <- pred
  }
  pvs[j] <- t.test(resp ~ YY)$p.value
  aucs[j] <- roc(YY, resp, direction = '>')$auc
}
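As an aside, here is a minimal sketch (my addition, not part of the original answer) of why fixing the direction matters: pROC's default direction = "auto" picks the direction from the data itself (based on which group has the higher median), so under the null the AUC is biased upward, whereas a direction fixed in advance gives a null distribution centred on 0.5. The runif scores below are hypothetical stand-ins for the random-forest predictions.
library(pROC)
set.seed(1)

auc_default <- replicate(200, {
  y <- factor(sample(c("C", "P"), 30, replace = TRUE))  # random labels
  s <- runif(30)                                        # random "predictions"
  as.numeric(roc(y, s)$auc)                             # direction = "auto" (the default)
})

auc_fixed <- replicate(200, {
  y <- factor(sample(c("C", "P"), 30, replace = TRUE))
  s <- runif(30)
  as.numeric(roc(y, s, direction = ">")$auc)            # direction fixed in advance
})

mean(auc_default)  # noticeably above 0.5 even though everything is random
mean(auc_fixed)    # close to 0.5, as it should be under the null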

Related

R: Cross-validation of a lasso parameter and comparison to cv.glmnet

I have two questions regarding cross-validation. I tried to write my own cross-validation code to improve my intuition; in particular, I want to find the optimal penalty parameter lambda for the lasso.
Here is my code for the CV:
CV_lambda <- matrix(0, length(lambda_grid), 2)
colnames(CV_lambda) <- c("lambda", "CV-est")

# Loop over the vector of lambda values
for (l in 1:length(lambda_grid)) {
  # Number of folds
  nfolds <- 5
  # Fold IDs (note: re-drawn for every lambda)
  foldid <- sample(rep(seq(nfolds), length = N))
  # Output storage
  cv_out <- as.list(seq(nfolds))
  mspe_out <- numeric(nfolds)
  # Cross-validation
  for (i in seq(nfolds)) {
    y_train <- y[foldid != i, ]
    X_train <- X[foldid != i, ]
    # lasso() is a self-written function estimating a lasso regression;
    # it is assumed to return the vector of coefficients
    cv_out[[i]] <- lasso(y_train, X_train, lambda = lambda_grid[l])
    pred <- X[foldid == i, ] %*% cv_out[[i]]
    mspe_out[i] <- mean((y[foldid == i, ] - pred)^2)
  }
  CV_lambda[l, ] <- c(lambda_grid[l], mean(mspe_out))
}
lambda_opt <- CV_lambda[which.min(CV_lambda[, 2]), 1]
I have two questions regarding the CV. I have been sitting here for hours trying to find a solution to both, and I would be very grateful if some of you could help me. The code is not very polished, since I am a beginner. My questions are the following:
For general understanding: lasso regression requires a penalty parameter, which has a strong influence on the results. The goal is to choose the lambda that minimises the mean squared prediction error.
1) Should the split into folds be the same for every value in the lambda sequence, or should it differ (as in my code)? I'm a bit skeptical about this.
2) Since I am very uncertain about my code, I want to compare my results to glmnet. Is there a way to do this? cv.glmnet also draws its folds at random, so a direct comparison doesn't seem possible. Does someone have an idea how to "reproduce" the cv.glmnet results with my code in order to find mistakes in it? (One way is sketched below.)
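A minimal sketch of one way to line the two up (my addition, not part of the original question): draw the fold assignment once, reuse it for every lambda, and pass that very same assignment to cv.glmnet through its foldid argument, so both procedures evaluate identical folds. The data and grid below are made up for illustration, and exact agreement also requires the self-written lasso() to use the same standardisation and intercept conventions as glmnet.
library(glmnet)

set.seed(1)
N <- 100
p <- 10
X <- matrix(rnorm(N * p), N, p)
y <- X[, 1] - 2 * X[, 2] + rnorm(N)
lambda_grid <- exp(seq(log(1), log(0.01), length.out = 50))  # decreasing grid

nfolds <- 5
foldid <- sample(rep(seq(nfolds), length = N))  # drawn once, reused for every lambda

# reference fit on exactly the same folds and the same lambda grid
cv_ref <- cv.glmnet(X, y, lambda = lambda_grid, foldid = foldid,
                    standardize = FALSE, intercept = FALSE)
cv_ref$lambda.min  # compare with lambda_opt from the hand-written loop above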

Implementing the bootstrap method for resampling the data set, assuming that log prices follow a random walk but using an ARMA model

# install.packages("quantmod")
# install.packages("dataframes2xls")
# install.packages("bootstrap")
# install.packages("fArma")
library(bootstrap)
library(quantmod)
library(dataframes2xls)
library(fArma)
require(TTR)

getSymbols("SNE", src = "yahoo", from = as.Date("2011-04-20"), to = as.Date("2015-04-22"))
SNElog <- diff(log(Cl(SNE)))
SNElog <- SNElog[-1, ]
SNElogT <- as.ts(tail(SNElog, 1000))
SNElogTimeArma <- armaFit(formula = ~arima(0, 1, 0), data = SNElogT)

SNE.Adjusted.boot.sum <- numeric(1000)
for (i in 1:1000) {
  this.samp <- SNElog[sample(1000, 1000, replace = T, prob = ???)]
  SNE.Adjusted.boot.sum[i] <- sum(this.samp)
}
This is my code.
My professor's requirement: implement the bootstrap method for resampling the data set, assuming that log prices follow a random walk, using an ARMA model.
A random walk just reminds me of ARIMA(0,1,0), but I have no idea how to combine the bootstrap with an ARMA model.
Simply put, the bootstrap just repeatedly draws samples with replacement and fits the model to each sample; the results are then aggregated.
Below is a quick attempt to obtain bootstrap coefficients, assuming ARIMA(1, 0, 1). Since the requirement is not stated clearly, I'm not sure this is exactly what is being asked for.
library(fArma)

set.seed(1237)
price <- diff(sample(log(100:120), 101, replace = TRUE))

# bootstrap: fit the model to a resampled series and return its coefficients
boot <- function(trial, formula, data) {
  mod <- armaFit(formula, sample(data, trial, replace = TRUE))
  c(mod@fit$coef)
}

coef <- do.call(rbind, lapply(rep(length(price), 2), boot,
                              formula = ~ arima(1, 0, 1), data = price))
apply(coef, 2, mean)
        ar1         ma1   intercept
-0.66724275  0.67331811 -0.00551791
Note that I only drew 2 bootstrap samples (rep(length(price), 2)), and your result will be different with a different setup, or even with the same setup: recall that the bootstrap generates random samples.
The key idea of the bootstrap is in armaFit(formula, sample(data, trial, replace = TRUE)), where the model is fit to a bootstrap sample rather than to the actual data.
I hope this is helpful.
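As a rough sketch of how the same idea plugs back into the resampling loop from the question (again my addition, not the original answer): under the random-walk assumption the log returns are treated as iid, so each observation gets equal resampling weight, which is sample()'s default behaviour. The simulated returns below are a hypothetical stand-in for the Yahoo data.
set.seed(42)
log_ret <- rnorm(1000, mean = 0, sd = 0.02)  # stand-in for SNElog

boot_sum <- numeric(1000)
for (i in 1:1000) {
  # equal-probability resampling with replacement (prob left at its default)
  this.samp <- log_ret[sample(1000, 1000, replace = TRUE)]
  boot_sum[i] <- sum(this.samp)
}

quantile(boot_sum, c(0.025, 0.975))  # bootstrap interval for the cumulative log return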

Generating correlated ordinal data

I'm using the GenOrd package to generate correlated ordinal data. The basic idea is to get correlated ordinal data with correlation 0.5. Now I want to repeat the whole code 1000 times and save the resulting correlations, to see how close I get to 0.5, and then change the sample size and the marginal probabilities to see what changes.
library(GenOrd)

R <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
Marginal <- list(c(0.2, 0.5, 0.7, 0.9), c(0.1, 0.3, 0.4, 0.5))
DataOrd <- ordsample(100, Marginal, R)
correlation <- cor(DataOrd)
correlation[1, 2]  # 0.5269
Here is a simple solution:
sim.cor <- function(R, Marginal, n, K) {
  res <- numeric(length = K)
  for (i in 1:K)
    res[i] <- cor(ordsample(n, Marginal, R))[1, 2]
  res
}
where n is the sample size and K is the number of times you want to repeat. So, in your example, you can call this function and save the result (a vector of size K with the correlations) in an object:
set.seed(1234)
correlations <- sim.cor(R = R, Marginal = Marginal, n = 100, K = 1000)
mean(correlations)
[1] 0.5009389
A faster and more elegant solution is to use the replicate function as suggested by jaysunice3401:
set.seed(1234)
n <- 100
correlations <- replicate(n = 1000, expr = cor(ordsample(n, Marginal, R))[1,2])
mean(correlations)
[1] 0.5009389
I hope this can help!
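If it helps, here is a possible follow-up along the lines of the second part of the question (my addition, not from the original answer): reuse sim.cor() with the R and Marginal defined above to see how the spread of the estimated correlation shrinks as the sample size grows.
set.seed(1234)
sizes <- c(50, 100, 500, 1000)
spread <- sapply(sizes, function(n)
  sd(sim.cor(R = R, Marginal = Marginal, n = n, K = 200)))
data.frame(n = sizes, sd_of_correlation = spread)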

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamma distributions. One distribution has (shape, scale) = (shapeRef, scaleRef) while the other has (shape, scale) = (shapeTarget, scaleTarget). I try to do an analysis of variance with the following code:
library(mgcv)  # assuming gam() comes from mgcv; shapeRef, scaleRef, shapeTarget, scaleTarget are defined elsewhere

n <- 10000
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
glmm1 <- gam(y ~ x, family = Gamma(link = log))
anova(glmm1)
The resulting p-values keep changing and can be anywhere from < 0.1 to > 0.9.
Am I going about this the wrong way?
Edit: I now use the following code instead:
f <- gl(2, n)
x <- rgamma(n, shape = shapeRef, scale = scaleRef)
y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)), test = "F")
But every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n = 100, shapeRef = 2, shapeTarget = 2,
                   scaleRef = 1, scaleTarget = 2) {
  f <- gl(2, n)
  x <- rgamma(n, shape = shapeRef, scale = scaleRef)
  y <- rgamma(n, shape = shapeTarget, scale = scaleTarget)
  xy <- c(x, y)
  data.frame(xy, f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
  aa <- anova(glm(xy ~ f, family = Gamma(link = log), data = d), test = "F")
  aa["f", "Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)
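To illustrate the uniform-under-the-null point made above, here is a quick check (my addition, not from the original answer): simulate with identical shape and scale in both groups, so the null hypothesis is actually true, and look at the resulting p-values.
set.seed(102)
r0 <- replicate(500, sumfun(simfun(scaleTarget = 1)))  # both groups are Gamma(shape = 2, scale = 1)

hist(r0, breaks = 20, col = "gray", main = "p-values under the null")  # roughly flat
mean(r0 < 0.05)  # close to the nominal 0.05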

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modelling, and I need to come up with the optimal number of topics. So I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) on the other nine batches (180 documents in total) with 10 to 60 topics. Now I have to calculate the perplexity or log-likelihood for the holdout set.
I found the code below in one of CV's discussions, but I really don't understand several of its lines. I have a dtm matrix for the holdout set (20 documents), but I don't know how to calculate its perplexity or log-likelihood.
Questions:
Can anybody explain to me what seq(2, 100, by = 1) means here? Also, what does AssociatedPress[21:30] mean? What is function(k) doing here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate the perplexity or log-likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I'm new I cannot figure out how to apply them to my holdout matrix, called dtm.
How can I do ten-fold cross-validation with my corpus of 200 documents? Is there existing code I can invoke? I found caret for this purpose, but again I cannot figure that out either.
The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.
Using perplexity for simple validation
Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to fit with the perplexity function, which takes as arguments a previously fit topic model and a new set of data, and returns a single number. The lower the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)

data("AssociatedPress", package = "topicmodels")

burnin <- 1000
iter <- 1000
keep <- 50

full_data <- AssociatedPress
n <- nrow(full_data)

#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]

fitted <- LDA(train_set, k = k, method = "Gibbs",
              control = list(burnin = burnin, iter = iter, keep = keep))

perplexity(fitted, newdata = train_set)  # about 2700
perplexity(fitted, newdata = valid_set)  # about 4300
The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.
Using perplexity and cross-validation to determine a good number of topics
The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.
The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1)  # leave one CPU spare...
registerDoParallel(cluster)

# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
  library(topicmodels)
})

folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300)  # candidates for how many topics

# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep",
                         "splitfolds", "folds", "candidate_k"))

# we parallelise by the different numbers of topics: a processor is allocated a value
# of k and does the cross-validation serially. This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise this way
system.time({
  results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar% {
    k <- candidate_k[j]
    results_1k <- matrix(0, nrow = folds, ncol = 2)
    colnames(results_1k) <- c("k", "perplexity")
    for (i in 1:folds) {
      train_set <- full_data[splitfolds != i, ]
      valid_set <- full_data[splitfolds == i, ]
      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep))
      results_1k[i, ] <- c(k, perplexity(fitted, newdata = valid_set))
    }
    return(results_1k)
  }
})
stopCluster(cluster)

results_df <- as.data.frame(results)

ggplot(results_df, aes(x = k, y = perplexity)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
          "(ie five different models fit for each candidate number of topics)") +
  labs(x = "Candidate number of topics",
       y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
I wrote the answer on CV that you refer to; here's a bit more detail.
seq(2, 100, by = 1) simply creates a sequence of numbers from 2 to 100 in steps of one, i.e. 2, 3, 4, 5, ..., 100. Those are the numbers of topics that I want to use in the models: one model with 2 topics, another with 3 topics, another with 4 topics, and so on up to 100 topics.
AssociatedPress[21:30] is simply a subset of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster.
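For orientation, here is a quick check of what that subset is (my addition; the dimensions shown are for the copy of AssociatedPress bundled with topicmodels):
library(topicmodels)
data("AssociatedPress", package = "topicmodels")

dim(AssociatedPress)           # a document-term matrix: 2246 documents x 10473 terms
dim(AssociatedPress[21:30, ])  # the subset keeps only documents 21 to 30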
Regarding the general question of optimal topic numbers, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (section 4.3.3 of his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
library(topicmodels)

# get some of the example data that's bundled with the package
data("AssociatedPress", package = "topicmodels")

harmonicMean <- function(logLikelihoods, precision = 2000L) {
  library("Rmpfr")
  llMed <- median(logLikelihoods)
  as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
                                       prec = precision) + llMed))))
}

# The log-likelihood values are determined by first fitting the model, for example:
k <- 20
burnin <- 1000
iter <- 1000
keep <- 50
fitted <- LDA(AssociatedPress[21:30, ], k = k, method = "Gibbs",
              control = list(burnin = burnin, iter = iter, keep = keep))

# keep indicates that every `keep` iterations the log-likelihood is evaluated and stored.
# This returns all log-likelihood values including burnin, i.e. these need to be omitted
# before calculating the harmonic mean (assuming that burnin is a multiple of keep):
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]

harmonicMean(logLiks)
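For reference, this is my reading of what harmonicMean() computes (not part of the original answer): the log of the harmonic mean of the sampled likelihoods, i.e. the harmonic mean estimator of the marginal likelihood used in Ponweiser's section 4.3.3, with the median log-likelihood subtracted for numerical stability and Rmpfr supplying extra precision for the exponentials:
$$
\log \hat p(w \mid k) = -\log\left(\frac{1}{S}\sum_{s=1}^{S} e^{-\ell_s}\right)
= \ell_{\text{med}} - \log\left(\frac{1}{S}\sum_{s=1}^{S} e^{-(\ell_s - \ell_{\text{med}})}\right),
$$
where $\ell_1, \dots, \ell_S$ are the stored log-likelihood values and $\ell_{\text{med}}$ is their median.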
So to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1)  # in this case a sequence of numbers from 2 to 50, by ones
fitted_many <- lapply(sequ, function(k)
  LDA(AssociatedPress[21:30, ], k = k, method = "Gibbs",
      control = list(burnin = burnin, iter = iter, keep = keep)))

# extract log-likelihoods from each topic model
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])

# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))

# inspect
plot(sequ, hm_many, type = "l")

# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.
Cross-validation of topic models is pretty well documented in the docs that come with the package; see for example http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf. Give that a try and then come back with a more specific question about coding CV with topic models.
