Question regarding k-fold cross-validation for KNN using R

I am trying to fit 5-fold cross-validation for several values of k. I used the OJ data set from the ISLR package.
My code so far is as follows:
library(ISLR)
library(class)
ks=c(1:5)
err.rate.test <- numeric(length = 5)
folds <- cut(seq(1,nrow(OJ)),breaks=5,labels=FALSE)
for (j in seq(along = ks)) {
  set.seed(123)
  cv.knn <- sapply(1:5, FUN = function(i) {
    testID <- which(folds == i, arr.ind = TRUE)
    test.X <- OJ[testID, 3]
    test.Y <- OJ[testID, 1]
    train.X <- OJ[-testID, 3]
    train.Y <- OJ[-testID, 1]
    knn.test <- knn(data.frame(train.X), data.frame(test.X), train.Y, k = ks[j])
    cv.test.est <- mean(knn.test != test.Y)
    return(cv.test.est)
  })
  err.rate.test[j] <- mean(cv.knn)
}
err.rate.test
[1] 0.3757009 0.3757009 0.3757009 0.3757009 0.3757009
The code doesn't give any errors, but for some reason my test error rate is the same for every value of k. This seems weird to me, so I assume there is something wrong with my code.
Can anyone help me figure it out?
Thank you

Remove set.seed(123); that is what causes the repeated error rates.
set.seed() is used for reproducibility: it ensures that any random steps remain constant, so everything random that goes into fitting the knn model is the same across executions, resulting in the same predictions and therefore the same error rates.
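If you still want the script as a whole to be reproducible, one option is to call set.seed() once before the loop instead of resetting it for every value of k. A minimal sketch, reusing the ks, folds and OJ objects defined above:
set.seed(123)
err.rate.test <- numeric(length(ks))
for (j in seq_along(ks)) {
  cv.knn <- sapply(1:5, function(i) {
    testID <- which(folds == i)
    knn.pred <- knn(data.frame(OJ[-testID, 3]), data.frame(OJ[testID, 3]),
                    OJ[-testID, 1], k = ks[j])
    mean(knn.pred != OJ[testID, 1])
  })
  err.rate.test[j] <- mean(cv.knn)   # CV error estimate for k = ks[j]
}
err.rate.test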

Related

storing data from a loop in a matrix

I would like to code a loop for cross-validation: computing the MSE for one- to four-step-ahead forecasts and storing the results in a matrix. The problem is that the columns for the 1- to 3-step forecasts get overwritten and I end up with the 4-step forecast in all columns. Can anybody help?
k <- 20
n <- length(xy) - 1
h <- 4
start <- tsp(xy)[1] + k
j <- n - k
mseQ1 <- matrix(NA, j, h)
colnames(mseQ1) <- paste0('h=', 1:h)
for (i in 1:j)
{
  xtrain <- window(xy, end = start + (i - 1))
  xvalid <- window(xy, start = start + i, end = start + i)
  qualifiedETS <- ets(xtrain, alpha = NULL, beta = NULL, additive.only = TRUE, opt.crit = "mse")
  fcastHW <- forecast(qualifiedETS, h = h)
  mseQ1[i, ] <- ((fcastHW[['mean']] - xvalid)^2)
}
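One likely culprit is that xvalid covers only a single time point, so the same value is compared against (and recycled across) all h forecast steps in a row. Below is a hedged sketch, using a made-up annual series because the original xy is not shown, of comparing each of the h forecast steps against the next h observations:
library(forecast)

set.seed(1)
xy <- ts(cumsum(rnorm(60)), start = 2000, frequency = 1)  # hypothetical annual series

k <- 20                      # initial training length
h <- 4                       # forecast horizon
start <- tsp(xy)[1] + k
j <- length(xy) - k - h      # leave room for h validation points at the end

mseQ1 <- matrix(NA, j, h)
colnames(mseQ1) <- paste0('h=', 1:h)

for (i in 1:j) {
  xtrain <- window(xy, end = start + (i - 1))
  # validate against the next h observations, not just one
  xvalid <- window(xy, start = start + i, end = start + i + h - 1)
  fit <- ets(xtrain, additive.only = TRUE, opt.crit = "mse")
  fc <- forecast(fit, h = h)
  mseQ1[i, ] <- (fc[['mean']] - xvalid)^2
}
colMeans(mseQ1)   # average squared error per horizon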

R issue: performing lm and then boxCox to find a proper lambda value

I have a dataset of climate data in a data.frame (columns are measuring stations, rows are times of measurement), and I'm trying to find the proper lambda values for a Yeo-Johnson transform to limit the impact of skewness on a principal component analysis.
Obviously, the first step is to compute log-likelihoods to find the best lambda. I use the following, where i is the index of a column:
getYeoJohsnonLambda <- function(myClimateData, cols, lambda_min, lambda_max, eps)
...
  lambda <- seq(lambda_min, lambda_max, eps)
  for (i in cols)
  {
    formula <- as.formula(paste("myClimateData$", colnames(myClimateData)[i], "~1"))
    currentModel <- lm(formula, myClimateData)
    print(currentModel)
    myboxCox <- boxCox(currentModel, lambda = lambda, family = "yjPower", plotit = FALSE)
    ...
  }
When I try to call it on a climateData time series, which could for example be:
`climateData <- data.frame(c(8.2,6.83,5.46,4.1,3.73,3.36,3,3,3,3,3.7), c(0,0.66,1.33,2,2,2,2,2,2,2,1.6))`
I get this error: Error in is.data.frame(data) : object 'myClimateData' not found
This is weird, as lm seems to find it and returns a correct fit, and myClimateData should be found since it is one of the arguments of the function, right?
Sadly, it seems that the problem comes from the boxCox function rather than from your getYeoJohsnonLambda function. As BrodieG pointed out in a related question, boxCox uses parent.frame as an argument to eval, which is considered bad practice in the documentation.
One way to work around this is to build the models before the call, as suggested in Adam Quek's answer:
library(car)
climateData <- data.frame(c(8.2,6.83,5.46,4.1,3.73,3.36,3,3,3,3,3.7),
                          c(0,0.66,1.33,2,2,2,2,2,2,2,1.6))
names(climateData) <- c("a","b")
modelList <- list()
for (k in 1:ncol(climateData)) {
  modelList[[k]] <- lm(as.formula(paste0(names(climateData)[k], "~1")), data = climateData)
}
getYeoJohnsonLambda <- function(myClimateData, cols, lambda_min, lambda_max, eps)
{
  # Recommended values: lambda_min = -0.5, lambda_max = 2.0, eps = 0.1
  myboxCox <- list()
  lmd <- seq(lambda_min, lambda_max, eps)
  for (i in cols)
  {
    cat("Creating model for column #", i, "\n")
    currentModel <- modelList[[i]]
    myboxCox[[i]] <- boxCox(currentModel, lambda = lmd, family = "yjPower", plotit = FALSE)
  }
  return(myboxCox)
}
test <- getYeoJohnsonLambda(climateData, c(1,2), -0.5, 2, 0.1)
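The object returned for each column by boxCox(..., plotit = FALSE) should, as with MASS::boxcox, be a list holding the lambda grid (x) and the corresponding profile log-likelihoods (y), so the best lambda per column can then be read off directly. A small usage sketch for the test object above, under that assumption:
# pick, for each column, the lambda with the highest profile log-likelihood
best_lambda <- sapply(test, function(bc) bc$x[which.max(bc$y)])
best_lambda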
Another solution (arguably cleaner) is to use yeo.johnson from the VGAM package:
library(VGAM)
getYeoJohnsonLambda_VGAM <- function(myClimateData, cols, lambda_min, lambda_max, eps)
{
  # Recommended values: lambda_min = -0.5, lambda_max = 2.0, eps = 0.1
  lmd <- seq(lambda_min, lambda_max, eps)
  return(apply(myClimateData[, cols, drop = FALSE], 2, yeo.johnson, lambda = lmd))
}
test2 <- getYeoJohnsonLambda_VGAM(climateData, c(1,2), -0.5, 2, 0.1)
Here's a solution without troubleshooting the function getYeoJohsnonLambda, using car::boxCox (which, unlike MASS::boxcox, accepts family = "yjPower"):
library(car)
iris.dat <- iris[-5]
vars <- names(iris.dat)
lmd <- seq(.1, 1, .1)  # lambda_min, lambda_max, eps
all.form <- lapply(vars, function(x) as.formula(paste0(x, "~ 1")))
all.lm <- lapply(all.form, lm, data = iris.dat)
all.bcox <- lapply(all.form, boxCox, data = iris.dat,
                   lambda = lmd, family = "yjPower", plotit = FALSE)

plm cross-validation runtime-error

I am trying to cross-validate a panel data analysis, but I am having trouble with the prediction function. The code below is the section of the CV function where the problem occurs; when I try to run it I get the error message shown after the code.
compare <- data.frame()
train.model <- list()
for (f in 1:numoffolds)
{
  train.model[[f]] <- plm(logit(II1) ~ ., data = select(filter(train, cv != f), -Incomegroup, -cv),
                          model = "within", effect = "twoways", index = c("Year", "Country"))
  II1.p <- predict(train.model[[f]],
                   newdata = select(filter(train, cv == f), -Country, -Year, -cv, -Incomegroup),
                   type = "response")
  II1 <- filter(train, cv == f)$II1
  compare <- rbind(compare, data.frame(II1.p = II1.p, II1 = II1))
}
Error in crossprod(beta, t(X)) : non-conformable arguments
Called from: crossprod(beta, t(X))
I've had a look at the prediction function, and the X and beta terms have conflicting dimensions. Does anyone have a suggestion for how I can overcome this problem?
function (object, newdata = NULL, ...)
{
    tt <- terms(object)
    if (is.null(newdata)) {
        result <- fitted(object, ...)
    }
    else {
        Terms <- delete.response(tt)
        m <- model.frame(Terms, newdata)
        X <- model.matrix(Terms, m)
        beta <- coef(object)
        result <- as.numeric(crossprod(beta, t(X)))
    }
    result
}
I've been thinking about resorting to other methods such as tree-based regression, but I wanted to use plm to make a specific point. Any helpful comments and/or code would be greatly appreciated.
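For what it's worth, the dimension mismatch can be reproduced on a public panel data set. The sketch below uses plm's bundled Grunfeld data (not the original panel) to show that a "within"/"twoways" model stores only the slope coefficients, while the model matrix built from newdata contains an intercept column, which is what makes crossprod(beta, t(X)) non-conformable:
library(plm)
data("Grunfeld", package = "plm")

# same model/effect options as above, on example data
fit <- plm(inv ~ value + capital, data = Grunfeld,
           model = "within", effect = "twoways", index = c("firm", "year"))

length(coef(fit))   # 2: slopes for "value" and "capital" only (no intercept, no fixed effects)

newdat <- subset(Grunfeld, year == 1954)
X <- model.matrix(~ value + capital, data = newdat)   # what predict() effectively builds
ncol(X)             # 3: "(Intercept)", "value", "capital"

# crossprod(coef(fit), t(X)) is t(beta) %*% t(X): (1 x 2) %*% (3 x nrow(newdat)), hence the error
Even if the intercept column were dropped, coef() for a within model still excludes the estimated fixed effects, so out-of-fold predictions would also need the individual and time effects (see fixef()) added back in.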

How to include p-values < 0.05 in qgraph?

I am following up on an old question that never got an answer (https://stackoverflow.com/questions/31653029/r-thresholding-networks-with-inputted-p-values-in-q-graph). I'm trying to assess relations between my variables, and for this I've used a correlation network map. Now I would like to add a significance threshold: for instance, I only want to show results with p-values < 0.05. Any idea how I could implement this in my code?
Data set: https://www.dropbox.com/s/xntc3i4eqmlcnsj/d100_partition_all3.csv?dl=0
My code:
library(qgraph)
cor_d100_partition_all3 <- cor(d100_partition_all3)
qgraph(cor_d100_partition_all3, layout = "spring",
       label.cex = 0.9, labels = names(d100_partition_all3),
       label.scale = FALSE, details = TRUE)
Output: (correlation network plot omitted)
Additionally, I have this small piece of code that computes a p-value for each pairwise correlation:
Code:
cor.mtest <- function(mat, ...) {
  mat <- as.matrix(mat)
  n <- ncol(mat)
  p.mat <- matrix(NA, n, n)
  diag(p.mat) <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      tmp <- cor.test(mat[, i], mat[, j], ...)
      p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
    }
  }
  colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
  p.mat
}
p.mat <- cor.mtest(d100_partition_all3)
Cheers
There are a few ways to plot only the significant correlations. First, you can pass additional arguments to the qgraph() function; see the documentation for details. The call below should be close to what is needed:
qgraph(cor_d100_partition_all3
       , layout = "spring"
       , label.cex = 0.9
       , labels = names(d100_partition_all3)
       , label.scale = FALSE
       , details = TRUE
       , minimum = 'sig'   # minimum based on statistical significance
       , alpha = 0.05      # significance criterion
       , bonf = FALSE      # should a Bonferroni correction be used
       , sampleSize = 6    # number of observations
)
A second option is to create a modified correlation matrix: whenever a correlation is not statistically significant according to your cor.mtest() function, its value is set to NA, and this modified matrix is plotted. The main visual difference between the two solutions seems to be the relative line weights.
# initialize the modified correlation matrix
cor_d100_partition_all3_mod <- cor_d100_partition_all3
# loop through all elements and set values to NA when the p-value is greater than 0.05
for (i in 1:nrow(cor_d100_partition_all3)) {
  for (j in 1:nrow(cor_d100_partition_all3)) {
    if (p.mat[i, j] > 0.05) {
      cor_d100_partition_all3_mod[i, j] <- NA
    }
  }
}
# plot the result
qgraph(cor_d100_partition_all3_mod
       , layout = "spring"
       , label.cex = 0.7
       , labels = names(d100_partition_all3)
       , label.scale = FALSE
       , details = FALSE
)
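As a side note, the double loop can be written more compactly with logical matrix indexing; this should give the same modified matrix:
cor_d100_partition_all3_mod <- cor_d100_partition_all3
cor_d100_partition_all3_mod[p.mat > 0.05] <- NA   # blank out non-significant correlations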

R training and test sets

So I need some help with the train and test sets that I am creating in R. The goal of the code is to break a data set into k folds, with fold i used as the test set, and then return the training and test sets. We assume that k will be 5 or 10.
This is what I have so far:
create_sets <- function(df, k, i)
{
  n <- dim(df)[1]
  # fold size
  size <- n / k
  # beginning of test set
  test_start <- (size * i) - (size) + 1
  # end of test set
  test_end <- size * i
  indices <- df(test_start, test_end)
  train <- df[indices, ]
  test <- df[-indices, ]
  return(list(train = train, test = test))
}
df is just a random data frame of x and y. That is:
x<-c(1,6,7,4,3,5,7,8,9,8,7,6,5,4,3,4,5,3,2,1)
y<-c(3,5,6,7,5,4,3,5,7,8,9,0,2,3,4,5,6,7,5,6)
df<-data.frame(x,y)
When I run the code I get this error:
Error in df(test_start, test_end) :
argument "df2" is missing, with no default
This is how I would approach it:
n <- nrow(df)
k <- 5
set.seed(10272015)
s <- sample(1:k, n, replace=TRUE)
result <- rep(NA, k)
for (i in 1:k) {
  train <- df[s != i, ]
  test <- df[s == i, ]
  # fit model
  # evaluate
  # result[i] <- evalscore
}
mean(result)
I think you just need an index for the different subsets, like this:
k <- 5
folds <- sample(rep(1:k,length=nrow(df)))
Then you can get any one of the k subsets (take fold 1, for example):
df[folds==1,]
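For completeness, here is a sketch of the original create_sets() with its two problems fixed, assuming (as in the question) that nrow(df) is divisible by k: the index vector should be built with : (or seq()) rather than by calling df(), which is the F-distribution density function and is what triggers the 'df2' error, and the rows inside fold i are the test set:
create_sets <- function(df, k, i) {
  n <- nrow(df)
  size <- n / k                        # fold size (assumes n divisible by k)
  test_start <- (size * i) - size + 1  # first row of fold i
  test_end <- size * i                 # last row of fold i
  indices <- test_start:test_end       # a sequence of row numbers, not df(...)
  test <- df[indices, ]                # rows in fold i form the test set
  train <- df[-indices, ]              # remaining rows form the training set
  list(train = train, test = test)
}

sets <- create_sets(df, 5, 1)
nrow(sets$test)    # 4 with the 20-row df above
nrow(sets$train)   # 16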
