Using KDD cup 99 dataset and machine learning with R - r

I am trying to use the KDD cup 99 dataset with R but unfortunately, I get very bad results. Basically, the predictor is guessing (~50% error on the cross-validation set). There is probably a bug in my code but I can't find where.
The KDD cup 99 dataset is composed of around 4 millions examples which are separated in 4 different classes of attacks + the class "normal". First, I split the dataset into 5 files (one for each class + one for the class "normal") and I convert the non-numerical data into numerical ones. For the moment, I am working on the class "Remote to Local" (r2l). I select some features according to the results of a paper on the topic. Afterwards, I sample a number of "normal" instances equal to the number of r2l instances to avoid the problem of skewed class. I also replace all the labels for the different types of r2l attacks by the label "attack" so I can train a two-classes classifier. Then I join the sample to the r2l instances in one new dataset. Finally, I apply a 10-fold cross-validation to assess my model which is built using SVM and I get the worst results in the history of machine learning... :(
Here is my code:
r2l <- read.table("kddcup_r2l.data",sep=",",header=T)
#u2r <- read.table("kddcup_u2r.data",sep=",",header=T)
#probe_original <- read.table("kddcup_probe.data",sep=",",header=T)
#dos <- read.table("kddcup_dos.data",sep=",",header=T)
normal <- read.table("kddcup_normal.data",sep=",",header=T)
#probe <- probe_original[sample(1:dim(probe_original)[1],10000),]
# Features selected by the three algorithms svm, lgp and mars
# for the different classes of attack
########################################################################
features.r2l.svm <- c("srv_count","service","duration","count","dst_host_count")
features.r2l.lgp <- c("is_guest_login","num_access_files","dst_bytes","num_failed_logins","logged_in")
features.r2l.mars <- c("srv_count","service","dst_host_srv_count","count","logged_in")
features.r2l.combined <- unique(c(features.r2l.svm,features.r2l.lgp,features.r2l.mars))
# Sample the training set containing the normal labels
# for each class of attack in order to have the same number
# of training data belonging to the "normal" class and the
# "attack" class
#######################################################################
normal_sample.r2l <- normal[sample(1:dim(normal)[1],dim(r2l)[1]),]
# This part was useful before the separation normal/attack because
# attack was composed of different types for each class
######################################################################
normal.r2l.Y <- matrix(normal_sample.r2l[,c("label")])
#######################################################################
# Class of attack Remote to Local (r2l)
#######################################################################
# Select the features according to the algorithms(svm,lgp and mars)
# for this particular type of attack. Combined contains the
# combination of the features selected by the 3 algorithms
#######################################################################
#features.r2l.svm <- c(features.r2l.svm,"label")
r2l_svm <- r2l[,features.r2l.svm]
r2l_lgp <- r2l[,features.r2l.lgp]
r2l_mars <- r2l[,features.r2l.mars]
r2l_combined <- r2l[,features.r2l.combined]
r2l_ALL <- r2l[,colnames(r2l) != "label"]
r2l.Y <- matrix(r2l[,c("label")])
r2l.Y[,1] = "attack"
# Merge the "normal" instances and the "r2l" instances and shuffle the result
###############################################################################
r2l_svm.tr <- rbind(normal_sample.r2l[,features.r2l.svm],r2l_svm)
r2l_svm.tr <- r2l_svm.tr[sample(1:nrow(r2l_svm.tr),replace=F),]
r2l_lgp.tr <- rbind(normal_sample.r2l[,features.r2l.lgp],r2l_lgp)
r2l_lgp.tr <- r2l_lgp.tr[sample(1:nrow(r2l_lgp.tr),replace=F),]
r2l_mars.tr <- rbind(normal_sample.r2l[,features.r2l.mars],r2l_mars)
r2l_mars.tr <- r2l_mars.tr[sample(1:nrow(r2l_mars.tr),replace=F),]
r2l_ALL.tr <- rbind(normal_sample.r2l[,colnames(normal_sample.r2l) != "label"],r2l_ALL)
r2l_ALL.tr <- r2l_ALL.tr[sample(1:nrow(r2l_ALL.tr),replace=F),]
r2l.Y.tr <- rbind(normal.r2l.Y,r2l.Y)
r2l.Y.tr <- matrix(r2l.Y.tr[sample(1:nrow(r2l.Y.tr),replace=F),])
#######################################################################
#
# 10-fold CROSS-VALIDATION to assess the models accuracy
#
#######################################################################
# CV for Remote to Local
########################
cv(r2l_svm.tr, r2l_lgp.tr, r2l_mars.tr, r2l_ALL.tr, r2l.Y.tr)
And the cross-validation function:
cv <- function(svm.tr, lgp.tr, mars.tr, ALL.tr, Y.tr){
Jcv.svm_mean <- NULL
#Compute the size of the cross validation
# =======================================
index=sample(1:dim(svm.tr)[1])
size.CV<-floor(dim(svm.tr)[1]/10)
Jcv.svm <- NULL
#Start 10-fold Cross validation
# =============================
for (i in 1:10) {
# if m is the size of the training set
# (nr of rows in svm.tr for example)
# take n observations for test and (m-n) for training
# with n << m (here n = m/10)
# ===================================================
i.ts<-(((i-1)*size.CV+1):(i*size.CV))
i.tr<-setdiff(index,i.ts)
Y.tr.tr <- as.factor(Y.tr[i.tr])
Y.tr.ts <- as.factor(matrix(Y.tr[i.ts],ncol=1))
svm.tr.tr <- svm.tr[i.tr,]
svm.tr.ts <- svm.tr[i.ts,]
# Get the model for the algorithms
# ==============================================
model.svm <- svm(Y.tr.tr~.,svm.tr.tr,type="C-classification")
# Compute the prediction
# ==============================================
Y.hat.ts.svm <- predict(model.svm,svm.tr.ts)
# Compute the error
# ==============================================
h.svm <- NULL
h.svm <- matrix(Y.hat.ts.svm,ncol=1)
Jcv.svm <- c(Jcv.svm ,sum(!(h.svm == Y.tr.ts))/size.CV)
print(table(h.svm,Y.tr.ts))
}
Jcv.svm_mean <- c(Jcv.svm_mean, mean(Jcv.svm))
d <- 10
print(paste("Jcv.svm_mean: ", round(Jcv.svm_mean,digits=d) ))
}
I obtain very strange results. It seems that the algorithm doesn't really see any difference between the instances. It looks like a guess more than a prediction. I also tried with the class of attack "Probe" but obtain the same results. The paper that I mentioned earlier had a 30% accuracy on the class r2l and 60-98% (depending on the polynomial degree) on probe.
Here is the prediction for one of the 10 fold of the cross-validation:
h.svm(attack) & Y.tr.ts(attack) --> 42 instances
h.svm(attack) & Y.tr.ts(normal.) --> 44 instances
h.svm(normal.) & Y.tr.ts(attack) --> 71 instances
h.svm(normal.) & Y.tr.ts(normal.) --> 68 instances
I would be really grateful if somebody could tell me what is wrong with my code.
Thank you in advance

couldn't be sure if this is your problem, but there are known problems with that data set.
http://www.bruggerink.com/~zow/GradSchool/KDDCup99Harmful.html
sorry I couldn't help with code, I don't know R :/

Related

How to capture the most important variables in Bootstrapped models in R?

I have several models that I would like to compare their choices of important predictors over the same data set, Lasso being one of them. The data set I am using consists of census data with around a thousand variables that have been renamed to "x1", "x2" and so on for convenience sake (The original names are extremely long). I would like to report the top features then rename these variables with a shorter more concise name.
My attempt to solve this is by extracting the top variables in each iterated model, put it into a list, then finding the mean of the top variables in X amount of loops. However, my issue is I still find variability with the top 10 most used predictors and so I cannot manually alter the variable names as each run on the code chunk yields different results. I suspect this is because I have so many variables in my analysis and due to CV causing the creation of new models every bootstrap.
For the sake of a simple example I used mtcars and will look for the top 3 most common predictors due to only having 10 variables in this data set.
library(glmnet)
data("mtcars") # Base R Dataset
df <- mtcars
topvar <- list()
for (i in 1:100) {
# CV and Splitting
ind <- sample(nrow(df), nrow(df), replace = TRUE)
ind <- unique(ind)
train <- df[ind, ]
xtrain <- model.matrix(mpg~., train)[,-1]
ytrain <- df[ind, 1]
test <- df[-ind, ]
xtest <- model.matrix(mpg~., test)[,-1]
ytest <- df[-ind, 1]
# Create Model per Loop
model <- glmnet(xtrain, ytrain, alpha = 1, lambda = 0.2)
# Store Coeffecients per loop
coef_las <- coef(model, s = 0.2)[-1, ] # Remove intercept
# Store all nonzero Coefficients
topvar[[i]] <- coef_las[which(coef_las != 0)]
}
# Unlist
varimp <- unlist(topvar)
# Count all predictors
novar <- table(names(varimp))
# Find the mean of all variables
meanvar <- tapply(varimp, names(varimp), mean)
# Return top 3 repeated Coefs
repvar <- novar[order(novar, decreasing = TRUE)][1:3]
# Return mean of repeated Coefs
repvar.mean <- meanvar[names(repvar)]
repvar
Now if you were to rerun the code chunk above you would notice that the top 3 variables change and so if I had to rename these variables it would be difficult to do if they are not constant and changing every run. Any suggestions on how I could approach this?
You can use function set.seed() to ensure your sample will return the same sample each time. For example
set.seed(123)
When I add this to above code and then run twice, the following is returned both times:
wt carb hp
98 89 86

Perceptron in R not converging

I am trying to understand Neural Networks better so I am trying to implement a simple perceptron from scratch in R. I know that this is very inefficient as there are many libraries that do this extemely well optimized but my goal is to understand the basics of neural networks better and work my way forward to more complex models.
I have created some artificial test data with a very simple linear decision boundary and split this into a training set and a test set. I then ran a logistic regression on the training data and checked the predictions from the test-set and got +99% accuray, which was to be expected given the simple nature of the data. I then tried implementing a perceptron with 2 inputs, 1 neuron, 1000 iterations, a learning rate of 0.1 and a sigmoid activation function.
I would expect to get very similar accuracy to the logistic regression model but my results are a lot worse (around 70% correct classifications in the training set). so I definitly did something wrong. The predictions only seem to get better after the first couple of iterations and then just go back and forth around a specific value (I tried with many different learning rates, no success). I'm attaching my script and I#m thankful for any advice! I think the problem lies in the calculation of the error or the weight adjustment but I can't put my finger on it...
### Reproducible Example for StackOverflow
#### Setup
# loading libraries
library(data.table)
#remove scientifc notation
options(scipen = 999)
# setting seed for random number generation
seed <- 123
#### Selfmade Test Data
# input points
x1 <- runif(10000,-100,100)
x2 <- runif(10000,-100,100)
# setting decision boundary to create output
output <- vector()
output[0.5*x1 + -1.2*x2 >= 50] <- 0
output[0.5*x1 + -1.2*x2 < 50] <- 1
# combining to dataframe
points <- cbind.data.frame(x1,x2,output)
# plotting all data points
plot(points$x1,points$x2, col = as.factor(points$output), main = "Self-created data", xlab = "x1",ylab = "x2")
# split into test and training sets
trainsize = 0.2
set.seed(seed)
train_rows <- sample(1:dim(points)[1], size = trainsize * dim(points)[1])
train <- points[train_rows,]
test <- points[-c(train_rows),]
# plotting training set only
plot(train$x1,train$x2, col = as.factor(train$output), main = "Self-created data (training set)", xlab = "x1",ylab = "x2")
#### Approaching the problem with logistic regression
# building model
train_logit <- glm(output ~ x1 + x2, data = train, family = "binomial", maxit = 10000)
summary(train_logit)
# testing performance in training set
table(round(train_logit$fitted.values) == train$output)
# testing performance of train_logit model in test set
table(test$output == round(predict(train_logit,test[,c(1,2)], type = "response")))
# We get 100% accuracy in the training set and near 100% accuracy in the test set
#### Approaching Problem with a Perceptron from scratch
# setting inputs, outputs and weights
inputs <- as.matrix(train[,c(1,2)])
output <- as.matrix(train[,3])
set.seed(123456)
weights <- as.matrix(runif(dim(inputs)[2],-1,1))
## Defining activation function + derivative
# defining sigmoid and it's derivative
sigmoid <- function(x) {1 / (1 + exp(-x))}
sig_dir <- function(x){sigmoid(x)*(1 - sigmoid(x))}
## Perceptron nitial Settings
bias <- 1
# number of iterations
iterations <- 1000
# setting learning rate
alpha <- 0.1
## Perceptron
# creating vectors for saving results per iteration
weights_list <- list()
weights_list[[1]] <- weights
errors_vec <- vector()
outputs_vec <- vector()
# saving results across iterations
weights_list_all <- list()
outputs_list <- list()
errors_list <- list()
# looping through the backpropagation algorithm "iteration" # times
for (j in 1:iterations) {
# Loop for backpropagation with updating weights after every datapoint
for (i in 1:dim(train)[1]) {
# taking the weights from the last iteration of the outer loop as a starting point
if (j > 1) {
weights_list[[1]] <- weights
}
# Feed Forward (Should we really round this?!)
output_pred <- round(sigmoid(sum(inputs[i,] * as.numeric(weights)) + bias))
error <- output_pred - output[i]
# Backpropagation (Do I need the sigmoid derivative AND a learning rate? Or should I only take one of them?)
weight_adjustments <- inputs[i,] * (error * sig_dir(output_pred)) * alpha
weights <- weights - weight_adjustments
# saving progress for later plots
weights_list[[i + 1]] <- weights
errors_vec[i] <- error
outputs_vec[[i]] <- output_pred
}
# saving results for each iteration
weights_list_all[[j]] <- weights_list
outputs_list[[j]] <- outputs_vec
errors_list[[j]] <- errors_vec
}
#### Formatting Diagnostics for easier plotting
# implementing empty list to transform weightslist
WeightList <- list()
# collapsing individual weightslist into datafames
for (i in 1:iterations) {
WeightList[[i]] <- t(data.table::rbindlist(weights_list_all[i]))
}
# pasting dataframes together
WeightFrame <- do.call(rbind.data.frame, WeightList)
colnames(WeightFrame) <- paste("w",1:dim(WeightFrame)[2], sep = "")
# pasting dataframes together
ErrorFrame <- do.call(rbind.data.frame, errors_list)
OutputFrame <- do.call(rbind.data.frame, outputs_list)
##### Plotting Results
# Development of Mean Error per iteration
plot(rowMeans(abs(ErrorFrame)),
type = "l",
xlab = "Sum of absolute Error terms")
# Development of Weights over time
plot(WeightFrame$w1, type = "l",xlim = c(1,dim(train)[1]), ylim = c(min(WeightFrame),max(WeightFrame)), ylab = "Weights", xlab = "Iterations")
lines(WeightFrame$w2, col = "green")
# lines(WeightFrame$w3, col = "blue")
# lines(WeightFrame$w4, col = "red")
# lines(WeightFrame$w5, col = "orange")
# lines(WeightFrame$w6, col = "cyan")
# lines(WeightFrame$w7, col = "magenta")
# Empty vector for number of correct categorizations per iteration
NoCorr <- vector()
# Computing percentage of correct predictions per iteration
colnames(OutputFrame) <- paste("V",1:dim(OutputFrame)[2], sep = "")
Output_mat <- as.matrix(OutputFrame)
for (i in 1:iterations) {
NoCorr[i] <- sum(output == Output_mat[i,]) / nrow(train)
}
# plotting number of correct predictions per iteration
plot(NoCorr, type = "l")
# Performance in training set after last iteration
table(output,round(OutputFrame[iterations,]))
First of all, welcome to the world of Neural Networks :).
Secondly, I want to recommend a great article to you, which I personally used to get a better understanding of backtracking and the whole NN learning stuff: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/. Might be a bit rough to get through sometimes, and for the general implementation I think it is much easier to follow pseudocode from a NN book. However, to understand what is going on this is article is very nice!
Thirdly, I will hopefully solve your problem :)
You comment yourself already with whether you should really round that output_pred. Yes you should.. if you want to use that output_pred to make a prediction! However, if you want to use it for learning it is generally not good! The reason for this is that if you round it for learning, than an output which was rounded up from 0.51 to 1 with target output 1 will not learn anything as the output was the same as the target and thus is perfect. However, 0.99 would have been a lot better of a prediction than 0.51 and thus there is definitely something to learn!
I am not 100% sure if this solves all your problems (im not an R programmer) and gets your accuracy up to 99%, but it should solve some of it, and hopefully the intuition is also clear :)

MM Estimation in Robust Regression

I am working with different linear regression models in R. I used the DATASET, which has 21263 rows and 82 columns.
All of the regression models have acceptable time consumption except the MM-estimate regression using the R function lmrob.
I was waiting for more than 10 hours to run the first for loop (#Block A), and it does not work. By "does not work", I mean It may give me an output after two days. I tried this code with a smaller DATASET which has 9568 rows, 5 columns and it runs in a one minute.
I am using my standard Laptop.
The steps of my analysis as follows
Uploading and scaling the dataset and then used k-folds split with k=30 because I want to calculate the variance of coefficients for each variable within the k split.
Could you please provide me with any guide?
wdbc = read.csv("train.csv") #critical_temp is the dependent varaible.
wdbcc=as.data.frame(scale(wdbc)) # scaling the variables
### k-folds split ###
set.seed(12345)
k = 30
folds <- createFolds(wdbcc$critical_temp, k = k, list = TRUE, returnTrain = TRUE)
############ Start of MM Regression Model #################
#Block A
lmrob = list()
for (i in 1:k) {
lmrob[[i]] = lmrob(critical_temp~ .,
data = wdbcc[folds[[i]],],setting="KS2014")
}
#Block B
lmrob_coef = list()
lmrob_coef_var = list()
for(j in 1:(lmrob[[1]]$coefficients %>% length())){
for(i in 1:k){
lmrob_coef[[i]] = lmrob[[i]]$coefficients[j]
lmrob_coef_var[[j]] = lmrob_coef %>% unlist() %>% var()
}
}
#Block C
lmrob_var = unlist(lmrob_coef_var)
lmrob_df = cbind(coefficients = lmrob[[1]]$coefficients %>% names() %>% as.data.frame()
, variance = lmrob_var %>% as.data.frame())
colnames(lmrob_df) = c("coefficients", "variance_lmrob")
#Block D
lmrob_var_sum = sum(lmrob_var)
Not an answer, but some code to help you test this for yourself. I didn't run lmrob() on the full dataset, but everything I show below suggests that one full realization of the model (all observations, all predictors) should run in about 10-20 minutes [on a 10-year old MacOS desktop machine], which would extrapolate to approximately 5 hours for 30-fold cross-validation. (It looks like the time scales a little worse than the square root of the number of observations, and nonlinearly even on the log scale with the number of predictors ...) You can try the code below to see if things are much slower on your machine, and to predict how long you think it should take to do the whole problem. Other general suggestions:
is there a chance you're running out of memory? Memory constraints can make things run much slower
if the problem is just that things are too slow, you can easily parallelize across folds if you have access to multiple cores (probably don't do this on a laptop, you'll burn it up)
AWS and other cloud services can be very useful
I set up a test function to record the time taken by lmrob() running on a random subset of predictors and observations from your data set.
Extract data, load packages:
unzip("superconduct.zip")
xx <- read.csv("train.csv")
library(robustbase)
library(ggplot2); theme_set(theme_bw())
library(cowplot)
Define a test function for timing lmrob runs for different numbers of observations and predictors:
nc <- ncol(xx) ## response vble is last column, "critical_temp"
test <- function(nobs=1000,npred=10,seed=NULL, ...) {
if (!is.null(seed)) set.seed(seed)
dd <- xx[sample(nrow(xx),size=nobs),
c(sample(nc-1,size=npred),nc)]
tt <- system.time(fit <- lmrob(critical_temp ~ ., data=dd, ...))
tt[c("user.self","sys.self","elapsed")]
}
t0 <- test()
The minimal example here (1000 observations, 10 predictors) is very fast (0.2 seconds).
This is the basic loop I ran:
res <- expand.grid(nobs=seq(1000,10000,by=1000), npred=seq(10,30,by=2))
res$user.self <- res$sys.self <- res$elapsed <- NA
for (i in seq(nrow(res))) {
cat(res$nobs[i],res$npred[i],"\n")
res[i,-(1:2)] <- test(res$nobs[i],res$npred[i],seed=101)
}
(As you can see in the plot below, I did this again with larger numbers of observations and predictors and used rbind() to combine the results into a single data frame.) I also tried fitting linear models to make predictions of the time taken to do the full data set with all predictors. (Plotting [see below] suggests that the time is log-log-linear in number of observations but nonlinear in number of predictors ...)
m1 <- lm(log10(elapsed)~poly(log10(npred),2)*log10(nobs), data=resc)
pp <- predict(m1, newdata=data.frame(npred=ncol(xx)-1,nobs=nrow(xx)),
interval="confidence")
10^pp ## convert from log10(predicted seconds) to seconds
Test the full data set.
t_all <- test(nobs=nrow(xx),npred=ncol(xx)-1)
I then realized that you were using setting = "KS2014" (as suggested in the documentation) rather than the default: this is at least 5x slower, as suggested by the following comparison:
test(nobs=10000,npred=30)
test(nobs=10000,npred=30,setting = "KS2014")
I re-ran some of the stuff above with setting="KS2014". Making the prediction for the full data set suggested a run-time of about 700 seconds (CI from 300 to 2000 seconds) - still nowhere near as slow as you're suggesting.
gg0 <- ggplot(resc2,aes(x=npred,y=elapsed,colour=nobs,linetype=setting))+
geom_point()+geom_line(aes(group=interaction(nobs,setting)))+
scale_x_log10()+scale_y_log10()
gg1 <- ggplot(resc2,aes(x=nobs,y=elapsed,colour=npred, linetype=setting))+
geom_point()+geom_line(aes(group=interaction(npred,setting)))+
scale_x_log10()+scale_y_log10()
plot_grid(gg0,gg1,nrow=1)
ggsave("lmrob_times.pdf")

poLCA - Latent Class how to do the adjusted Lo-Mendell-Rubin (LMR) test with R

Good afternoon,
I am trying to perform Lo, Mendell and Rubin's (2001) adjusted test (LMR) in order to decide the optimal number of classes in LCA. I performed the command with poLCA, but I didn't find any command to perform it.
Is there someone that can help me?
Thank you very much!
Here is an example of a (ad-hoc adjusted) LMR test comparing a LCA with 3 groups (alternative model) against 2 groups (baseline model).
# load packages/install if needed
library(poLCA)
library(tidyLPA)
data("election")
# Fit LCA with 2 classes (NULL model)
mod_null <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 2, verbose = F)
# store values baseline model
n <- mod_null$Nobs #number of observations (should be equal in both models)
null_ll <- mod_null$llik #log-likelihood
null_param <- mod_null$npar # number of parameters
null_classes <- length(mod_null$P) # number of classes
# Fit LCA with 3 classes (ALTERNATIVE model)
mod_alt <- poLCA(formula = cbind(MORALG, CARESG, KNOWG) ~ 1,
data = election, nclass = 3, verbose = F)
# Store values alternative model
alt_ll <- mod_alt$llik #log-likelihood
alt_param <- mod_alt$npar # number of parameters
alt_classes <- length(mod_alt$P) # number of classes
# use calc_lrt from tidyLPA package
calc_lrt(n, null_ll, null_param, null_classes, alt_ll, alt_param, alt_classes)
Wow really late to the game but as Im looking at similar things Ill leave for the next person.
The Lo-Mendell-Rubin test involves a transformation of the data and then a chi-sq test to determine if K classes is a better fit than K-1 classes... basically.
However there is reasonable research out there suggesting that a better measure of this is the bootstrap likelihood ratio.
The former is still in common use with MPlus users, the latter is far more common in LCA packages in R, e.g. mclust. Dunno about poLCA though...

Local prediction modelling approach in R

users
I am trying to develop a local model (PLSR) which is predicting a query sample by a model built on the 10 most similar samples using the code below (not the full model yet, just a part of it). I got stuck when trying to predict the query sample (second to last line). The model is actually predicting something, ("prd") but not the query sample!
Here is my code:
require("pls")
set.seed(10000) # generate some sample data
mat <- replicate(100, rnorm(100))
y <- as.matrix(mat[,1], drop=F)
x <- mat[,2:100]
eD <- dist(x, method="euclidean") # create a distance matrix
eDm <- as.matrix(eD)
Looping over all 100 samples and extracting their 10 most similar samples for subsequent model building and prediction of query sample:
for (i in 1:nrow(eDm)) {
kni <- head(order(eDm[,i]),11)[-1] # add 10 most similar samples to kni
pls1 <- plsr(y[kni,] ~ x[kni,], ncomp=5, validation="CV") # run plsr on sel. samples
prd <- predict(pls1, ncomp=5, newdata=x[[i]]) # predict query sample ==> I suspect there is something wrong with this expression: newdata=x[[i]]
}
I can't figure out how to address the query sample properly - many thanks i.a. for any help!
Best regards,
Chega
You are going to run into all sorts of pain building models with formulae like that. Also the x[[i]] isn't doing what you think it is - you need to supply a data frame usually to these modelling functions. In this case a matrix seems fine too.
I get all your code working OK if I use:
prd <- predict(pls1, ncomp=5, newdata=x[i, ,drop = FALSE])
giving
> predict(pls1, ncomp=5, newdata=x[i,,drop = FALSE])
, , 5 comps
y[kni, ]
[1,] 0.6409897
What you were seeing with your code are the fitted values for the training data.
> fitted(pls1)[, , 5, drop = FALSE]
, , 5 comps
y[kni, ]
1 0.1443274
2 0.2706769
3 1.1407780
4 -0.2345429
5 -1.0468221
6 2.1353091
7 0.8267103
8 3.3242296
9 -0.5016016
10 0.6781804
This is convention in R when you either don't supply newdata or the object you are supplying makes no sense and doesn't contain the covariates required to generate predictions.
I would have fitted the model as follows:
pls1 <- plsr(y ~ x, ncomp=5, validation="CV", subset = kni)
where I use the subset argument for its intended purpose; to select the rows of the input data to fit the model with. You get nicer output from the models; the labels use y instead of y[kni, ] etc, plus this general convention will serve you well in other modelling tools, where R will expect newdata to be a data frame with names exactly the same as those mentioned in the model formula. In your case, with your code, that would mean creating a data frame with names like x[kni, ] which are not easy to do, for good reason!

Resources