I would like to see whether the SOM algorithm can be used for classification.
I used the code below, but the classification results are far from right. For example, on the test dataset I get many more distinct values than the 3 classes in the training target variable. How can I build a prediction model whose output aligns with the training target variable?
library(kohonen)
library(HDclassif)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, ])
Xtest <- scale(wine[-training, ],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
som.wine <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))
som.prediction$pred <- predict(som.wine, newdata = Xtest,
trainX = Xtraining,
trainY = factor(Xtraining$class))
And the result:
$unit.classif
[1] 7 7 1 7 1 11 6 2 2 7 7 12 11 11 12 2 7 7 7 1 2 7 2 16 20 24 25 16 13 17 23 22
[33] 24 18 8 22 17 16 22 18 22 22 18 23 22 18 18 13 10 14 15 4 4 14 14 15 15 4
This might help:
SOM is an unsupervised (clustering) algorithm, so you shouldn't expect it to be trained on a dataset that contains a class label (if you do that, it will need this information to work and will be useless with unlabelled datasets).
The idea is that it "converts" an input numeric vector into a network unit number (try running your code again with a 1 by 3 grid and you'll get the output you expected).
You then need to convert those network unit numbers back into the categories you are looking for (that is the key part missing from your code).
The reproducible example below outputs a classical classification accuracy. It includes one implementation option for the "convert back" part missing from your original post.
Note that for this particular dataset the model overfits pretty quickly: 3 units give the best results.
library(kohonen)
library(HDclassif)
#Set and scale a training set (-1 to drop the classes)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, -1])
#Scale a test set (-1 to drop the classes)
Xtest <- scale(wine[-training, -1],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
#Set 2D grid resolution
#WARNING: it overfits pretty quickly
#Accuracy (fraction correct, as computed on the last line) is 36% for 1 unit, 63% for 2, 93% for 3, 89% for 4
som_grid <- somgrid(xdim = 1, ydim=3, topo="hexagonal")
#Create a trained model
som_model <- som(Xtraining, som_grid)
#Make a prediction on test data
som.prediction <- predict(som_model, newdata = Xtest)
#Put together original classes and SOM classifications
error.df <- data.frame(real = wine[-training, 1],
predicted = som.prediction$unit.classif)
#Return the category number that has the strongest association with the unit
#number (0 stands for ambiguous)
switch <- sapply(unique(som_model$unit.classif), function(x, df){
#Use the training data passed in as df, not the test labels in error.df
cat <- as.numeric(names(which.max(table(
df[df$predicted==x,1]))))
if(length(cat)<1){
cat <- 0
}
return(c(x, cat))
}, df = data.frame(real = wine[training, 1], predicted = som_model$unit.classif))
#Translate units numbers into classes
error.df$corrected <- apply(error.df, MARGIN = 1, function(x, switch){
cat <- switch[2, which(switch[1,] == x["predicted"])]
if(length(cat)<1){
cat <- 0
}
return(cat)
}, switch = switch)
#Compute the classification accuracy (proportion of correct predictions)
sum(error.df$corrected == error.df$real)/length(error.df$real)
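The "convert back" step can also be seen in isolation on toy data. The following is only a minimal base-R sketch: the vectors train_unit, train_class and test_unit are made-up stand-ins for som_model$unit.classif, the training classes and the predicted test units, and unit_to_class is a hypothetical name.

```r
# Toy stand-ins (hypothetical): the unit each training row was mapped to,
# and the true class of that row
train_unit  <- c(1, 1, 2, 2, 2, 3, 3)
train_class <- c(1, 1, 2, 2, 3, 3, 3)

# Majority class per unit, learned from the training data only
unit_to_class <- vapply(
  split(train_class, train_unit),
  function(cl) as.numeric(names(which.max(table(cl)))),
  numeric(1)
)

# Units predicted for new rows; unit 4 never occurred in training
test_unit <- c(3, 1, 2, 4)

# Translate units into classes, using 0 for units with no known class
pred <- unname(unit_to_class[as.character(test_unit)])
pred[is.na(pred)] <- 0
pred
```

Learning the lookup from the training rows keeps the test labels out of the model, so the accuracy measured on the test set stays an honest estimate.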
I've got this data processing:
library(text2vec)
##Using perplexity for hold out set
t1 <- Sys.time()
perplex <- c()
for (i in 3:25){
set.seed(17)
lda_model2 <- LDA$new(n_topics = i)
doc_topic_distr2 <- lda_model2$fit_transform(x = dtm, progressbar = F)
set.seed(17)
sample.dtm2 <- itoken(rawsample$Abstract,
preprocessor = prep_fun,
tokenizer = tok_fun,
ids = rawsample$id,
progressbar = F) %>%
create_dtm(vectorizer,vtype = "dgTMatrix", progressbar = FALSE)
set.seed(17)
new_doc_topic_distr2 <- lda_model2$transform(sample.dtm2, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE)
perplex[i] <- text2vec::perplexity(sample.dtm2, topic_word_distribution =
lda_model2$topic_word_distribution,
doc_topic_distribution = new_doc_topic_distr2)
}
print(difftime(Sys.time(), t1, units = 'sec'))
I know there are a lot of questions like this, but I haven't been able to find an answer that fits my situation. Above you see the perplexity calculation for a Latent Dirichlet Allocation model with 3 to 25 topics. I want to pick the most suitable number of topics among those, meaning that I want to find the elbow (or knee) of these values, which can be treated as a simple numeric vector that looks like this:
1 NA
2 NA
3 222.6229
4 210.3442
5 200.1335
6 190.3143
7 180.4195
8 174.2634
9 166.2670
10 159.7535
11 153.7785
12 148.1623
13 144.1554
14 141.8250
15 138.8301
16 134.4956
17 131.0745
18 128.8941
19 125.8468
20 123.8477
21 120.5155
22 118.4426
23 116.4619
24 113.2401
25 114.1233
plot(perplex)
This is what the plot looks like.
I would say that the elbow is at 13 or 16, but I'm not completely sure, and I want the exact number as an outcome. I saw in this paper that f''(x) / (1+f'(x)^2)^1.5 is the knee formula. I tried it like this, and it says the knee is at 18:
> d1 <- diff(perplex) # first derivative
> d2 <- diff(d1) / diff(perplex[-1]) # second derivative
> knee <- (d2)/((1+(d1)^2)^1.5)
Warning message:
In (d2)/((1 + (d1)^2)^1.5) :
longer object length is not a multiple of shorter object length
> which.min(knee)
[1] 18
I can't fully figure this thing out. Would someone like to share how I could get the exact ideal topics number according to perplexity as an outcome?
Found this: "The LDA model with the optimal coherence score, obtained with an elbow method (the point with maximum absolute second derivative) (...)" in this paper, so this code does the job: d1 <- diff(perplex); k <- which.max(abs(diff(d1) / diff(perplex[-1])))
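For comparison, here is a minimal sketch of the two quantities mentioned in this thread, computed with plain finite differences and unit spacing between topic counts. It assumes (this is not taken from either paper) that a second difference centered on each interior topic count is an acceptable discrete stand-in for f''. The names y, d1c, d2, kappa and k_inner are made up for illustration.

```r
# Perplexity values from the question; topic counts 1 and 2 were not fitted
perplex <- c(NA, NA, 222.6229, 210.3442, 200.1335, 190.3143, 180.4195,
             174.2634, 166.2670, 159.7535, 153.7785, 148.1623, 144.1554,
             141.8250, 138.8301, 134.4956, 131.0745, 128.8941, 125.8468,
             123.8477, 120.5155, 118.4426, 116.4619, 113.2401, 114.1233)

k <- 3:25            # topic counts that were actually fitted
y <- perplex[k]
n <- length(y)

# Centered first and second differences at the interior points
d1c <- (y[3:n] - y[1:(n - 2)]) / 2
d2  <- y[3:n] - 2 * y[2:(n - 1)] + y[1:(n - 2)]
k_inner <- k[2:(n - 1)]   # topic count each difference is centered on

# Elbow as the point with maximum absolute second difference
k_elbow <- k_inner[which.max(abs(d2))]

# Curvature from the knee formula f'' / (1 + f'^2)^1.5
kappa  <- d2 / (1 + d1c^2)^1.5
k_knee <- k_inner[which.max(abs(kappa))]

c(elbow = k_elbow, knee = k_knee)
```

Indexing through k_inner maps each result back to a topic count instead of a position in the vector. Raw second differences are sensitive to noise, so a single late wiggle in the curve can dominate; smoothing the values first may give a more stable answer.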
I have the following PCA data, on which I am doing k-means clustering:
head(pcdffinal)
PC1 PC2 PC3 PC4 PC5 PC6
1 -9.204228 -2.73517110 2.7975063 0.6794614 -0.84627095 0.4455297
2 2.927245 0.05666389 0.5085896 0.1472800 0.18193152 0.1041490
3 -4.667932 -1.98176361 2.2751862 0.5347725 -0.43314927 0.3222719
4 -1.366505 -0.40858595 0.5005192 0.4507366 -0.54996933 0.5533013
5 -4.689454 -2.77185636 2.4323856 0.7387788 0.49237229 -0.4817083
6 -3.477046 -1.84904214 1.5539558 0.5463861 -0.03231143 0.2814843
opt.cluster<-3
set.seed(115)
pccomp.km <- kmeans(pcdffinal,opt.cluster,nstart=25)
head(pccomp.km$cluster)
[1] 2 1 2 2 2 2
barplot(table(pccomp.km$cluster), col="steelblue")
pccomp.km$tot.withinss #For total within cluster sum of squares.
[1] 13172.59
We can also use a plot to illustrate the groups that the data have been arranged into.
par(mfrow=c(1,1))
plot(pcdffinal[,1:2],col=(pccomp.km$cluster+1),main=paste('K-Means Clustering result with k = ', opt.cluster,sep=" "),pch=20,cex=2)
points(pccomp.km$centers, pch=15,cex=2)#plotting the centres of the cluster as black squares
library("factoextra")
fviz_cluster(pccomp.km, data = pcdffinal, frame.type = "convex")+ theme_minimal()
df.num_kmeans<-df.num
df.num_kmeans$cluster.kmeans <- pccomp.km$cluster # vector of cluster assignments from kmeans(), added as a column to the original dataset
# save this dataset & kmeans model for further use
saveRDS(pccomp.km, "kmeans_model.RDS")
write.csv(df.num_kmeans,"dfnum_kmeans.cluster.csv")
library(cluster)
clusplot(df.num_kmeans,pccomp.km$cluster,color = TRUE,shade=TRUE,labels = 2,lines = 0)
library(ggfortify)
autoplot(pccomp.km, data=pcdffinal, frame=TRUE,frame.type='norm')
I would like to run k-means iteratively for a range of k values, say k = 2:6, each time making the plots for that k and saving the model and the data as a CSV, with separate files for each k.
I need help converting the code above into a loop with the counter i going from 2 to 6.
original data:
head(df.num_kmeans)
datausage mou revenue calldrop handset2g handset3g smartphone
1 896804.7 2854801 40830.404 27515 7930 19040 20810
2 155932.1 419109 5512.498 5247 2325 2856 3257
3 674983.3 2021183 25252.265 21068 6497 13056 14273
4 522787.2 1303221 14547.380 8865 4693 9439 10746
5 523465.7 1714641 24177.095 25441 8668 12605 14766
6 527062.3 1651303 20153.482 18219 6822 11067 12994
rechargecount rechargesum arpu subscribers
1 4461 235430 197704.10 105822
2 843 39820 34799.21 18210
3 2944 157099 133842.38 71351
4 2278 121697 104681.58 44975
5 2802 144262 133190.55 75860
6 2875 143333 119389.91 63740
Using random forest for accuracy comparison
dfnum.kmeans <- read.csv("dfnum_kmeans.cluster.csv")
table(dfnum.kmeans$cluster.kmeans) # size of each cluster
convert cluster var into a factor
dfnum.kmeans$cluster.kmeans <- as.factor(dfnum.kmeans$cluster.kmeans)
is.factor(dfnum.kmeans$cluster.kmeans)
create training and test sets (75:25 split) using 'caret' package
set.seed(128) # for reproducibility
inTrain_kmeans <- caret::createDataPartition(y = dfnum.kmeans$cluster.kmeans, p = 0.75, list = FALSE)
training_kmeans <- dfnum.kmeans[inTrain_kmeans, ]
testing_kmeans <- dfnum.kmeans[-inTrain_kmeans, ]
set.seed(122)
control <- trainControl(method = "repeatedcv", number = 10,allowParallel = TRUE)
modFit.rfcaret_kmeans <- caret::train(cluster.kmeans~ ., method = "rf",data = training_kmeans, trControl = control, number = 25)
modFit.rfcaret_kmeans$finalModel
pred.test_kmeans = predict(modFit.rfcaret_kmeans, testing_kmeans); confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )
confusionMatrix(pred.test_kmeans, testing_kmeans$cluster.kmeans )$overall[1]
Assuming that your original dataframe is df.num, the following could save all the files (for different k values) in your working directory:
for (k in 2:6) {
set.seed(115)
pccomp.km <- kmeans(pcdffinal,k,nstart=25)
print(head(pccomp.km$cluster)) # print() is needed for output inside a for loop
print(paste(k, pccomp.km$tot.withinss)) #For total within cluster sum of squares.
png(paste0('kmeans_proj_',k, '.png'))
par(mfrow=c(1,1))
plot(pcdffinal[,1:2],col=(pccomp.km$cluster+1),main=paste('K-Means Clustering result with k = ', k,sep=" "),pch=20,cex=2)
points(pccomp.km$centers, pch=15,cex=2)#plotting the centres of the cluster as black squares
dev.off()
png(paste0('kmeans_fviz_',k, '.png'))
print(fviz_cluster(pccomp.km, data = pcdffinal, frame.type = "convex")+ theme_minimal())
dev.off()
df.num_kmeans<-df.num
df.num_kmeans$cluster.kmeans <- pccomp.km$cluster# is a vector of cluster assignment from kmeans() added as a column to the original dataset as
saveRDS(pccomp.km, paste0("kmeans_model_", k, ".RDS"))
write.csv(df.num_kmeans,paste0("dfnum_kmeans_", k, ".cluster.csv"))
png(paste0('clusplot_',k, '.png'))
clusplot(df.num_kmeans,pccomp.km$cluster,color = TRUE,shade=TRUE,labels = 2,lines = 0)
dev.off()
png(paste0('autoplot_',k, '.png'))
print(autoplot(pccomp.km, data=pcdffinal, frame=TRUE,frame.type='norm'))
dev.off()
}
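If the fitted models are also needed in memory afterwards, one option is to collect them in a named list instead of overwriting pccomp.km on every pass. This is only a sketch under assumptions: the scaled numeric columns of the built-in iris data stand in for pcdffinal, and the names X, fits and wss are hypothetical.

```r
# Stand-in data (assumption): scaled numeric columns of the built-in iris data
X <- scale(iris[, 1:4])

ks <- 2:6

# Fit one k-means model per k, resetting the seed so each fit is reproducible
fits <- lapply(ks, function(k) {
  set.seed(115)
  kmeans(X, centers = k, nstart = 25)
})
names(fits) <- paste0("k", ks)

# Total within-cluster sum of squares for each k
wss <- sapply(fits, function(f) f$tot.withinss)
wss
```

A single model can then be retrieved by name, for example fits$k3, without reading the saved RDS file back in.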
After fitting a Tree with party::ctree() I want to create a table to characterise the branches.
I have fitted these variables
> summary(juridicos_segmentar)
actividad_economica
Financieras : 89
Gubernamental : 48
Sector Primario : 34
Sector Secundario:596
Sector Terciario :669
ingresos_cut
(-Inf,1.03e+08] :931
(1.03e+08,4.19e+08]:252
(4.19e+08,1.61e+09]:144
(1.61e+09, Inf] :109
egresos_cut
(-Inf,6e+07] :922
(6e+07,2.67e+08] :256
(2.67e+08,1.03e+09]:132
(1.03e+09, Inf] :126
patrimonio_cut
(-Inf,2.72e+08] :718
(2.72e+08,1.46e+09]:359
(1.46e+09,5.83e+09]:191
(5.83e+09, Inf] :168
op_ingreso_cut
(-Inf,3] :1308
(3,7] : 53
(7,22] : 44
(22, Inf]: 31
The first one is categorical and the others are ordinal, and I fitted them to another factor variable
> summary(as.factor(segmento))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
27 66 30 39 36 33 39 15 84 70 271 247 101 34 100 74 47 25 48 50
I used the following code
library(party)
fit_jur <- ctree(cluster ~ .,
data=data.frame(juridicos_segmentar, cluster=as.factor(segmento)))
to get this tree
> fit_jur
Conditional inference tree with 31 terminal nodes
Response: cluster
Inputs: actividad_economica, ingresos_cut, egresos_cut, patrimonio_cut, op_ingreso_cut
Number of observations: 1436
1) actividad_economica == {Financieras}; criterion = 1, statistic = 4588.487
2) ingresos_cut <= (4.19e+08,1.61e+09]; criterion = 1, statistic = 62.896
3) egresos_cut <= (6e+07,2.67e+08]; criterion = 1, statistic = 22.314
4)* weights = 70
3) egresos_cut > (6e+07,2.67e+08]
5)* weights = 10
2) ingresos_cut > (4.19e+08,1.61e+09]
6)* weights = 9
plot of part of the tree
What I want is a table where every row is a path from the root node to a leaf, giving the prediction of the variable segmento, and every column is the condition on one split variable. Something like this:
actividad economica ingresos (rango) egresos (rango) patrimonio (rango) operaciones de ingreso segmento
Sector Primario <=261.000.000 18
Sector Primario >261.000.000 20
The problem is that there are several leaves to characterise, and sometimes a variable appears several times in one path, so I'd like to intersect the conditions, i.e. intersect the ranges.
I've thought of data.tree::ToDataFrameTable, but I have no idea how it works with party.
Thank you very much guys!
library(partykit)
fit_jur <- ctree(cluster ~ .,
data=data.frame(juridicos_segmentar, cluster=as.factor(segmento)))
pathpred <- function(object, ...)
{
## coerce to "party" object if necessary
if(!inherits(object, "party")) object <- as.party(object)
## get standard predictions (response/prob) and collect in data frame
rval <- data.frame(response = predict(object, type = "response", ...))
rval$prob <- predict(object, type = "prob", ...)
## get rules for each node
rls <- partykit:::.list.rules.party(object)
## get predicted node and select corresponding rule
rval$rule <- rls[as.character(predict(object, type = "node", ...))]
return(rval)
}
ct_pred_jur <- unique(pathpred(fit_jur)[c(1,3)])
write.csv2(ct_pred_jur,'parametrizacion_juridicos.csv')
Thank you, Achim Zeileis, for pointing me in this direction. I couldn't intersect the rules on the same variable, i.e. evaluate the '&'s. That problem is still open.
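For the open part, one possible direction is to turn each condition on an ordered variable into the set of levels it allows and then intersect those sets. The sketch below is not partykit functionality and does not parse the rule strings. It assumes the conditions have already been extracted into a list, and the names lv, conds and allowed are hypothetical.

```r
# Levels of one ordered split variable (taken from ingresos_cut above)
lv <- c("(-Inf,1.03e+08]", "(1.03e+08,4.19e+08]",
        "(4.19e+08,1.61e+09]", "(1.61e+09, Inf]")

# Two conditions on the same variable along one path (assumed already extracted)
conds <- list(
  list(op = "<=", level = lv[3]),
  list(op = ">",  level = lv[1])
)

# Levels allowed by a single condition
allowed <- function(op, level, lv) {
  pos <- match(level, lv)
  idx <- seq_along(lv)
  if (op == "<=") lv[idx <= pos] else lv[idx > pos]
}

# Intersect the allowed levels of all conditions on the path
keep <- Reduce(intersect, lapply(conds, function(cn) allowed(cn$op, cn$level, lv)))
keep
```

The surviving levels are contiguous for an ordered variable, so the first and last entries of keep give the bounds of the intersected range.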
You can convert both party class (from partykit) and BinaryTree (from party) to a data.tree, and use it for conversion to data frame and/or printing. For example like this:
library(party)
library(data.tree) # provides as.Node() and ToDataFrameTable()
airq <- subset(airquality, !is.na(Ozone))
airct <- ctree(Ozone ~ ., data = airq,
controls = ctree_control(maxsurrogate = 3))
tree <- as.Node(airct)
df <- ToDataFrameTable(tree,
"pathString",
"label",
criterion = function(x) round(x$criterion$maxcriterion, 3),
statistic = function(x) round(max(x$criterion$statistic), 3)
)
df
This will print like so:
pathString label criterion statistic
1 1/2/3 weights = 10 0.000 0.000
2 1/2/4/5 weights = 48 0.936 6.141
3 1/2/4/6 weights = 21 0.891 5.182
4 1/7/8 weights = 30 0.675 3.159
5 1/7/9 weights = 7 0.000 0.000
Plotting:
#print subtree
subtree <- Clone(tree$`2`)
SetNodeStyle(subtree,
style = "filled,rounded",
shape = "box",
fillcolor = "GreenYellow",
fontname = "helvetica",
label = function(x) x$label,
tooltip = function(x) round(x$criterion$maxcriterion, 3))
plot(subtree)
And the result will look like this:
I am trying to use code for fully reproducible parallel models in caret, but I do not understand how to set the size of the vectors in the seeds object. For gbm I have 4 tuning parameters with a total of 11 different levels, and I have 54 rows in my tuning grid. If I specify any value < 18 as the last value in the "for(i in 1:10)" line below, I get an error: "Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer." Why 18? It also runs without errors for values > 18 (e.g., 54). Why? Many thanks for the help. The following is based on http://topepo.github.io/caret/training.html, with some additions.
library(mlbench)
data(Sonar)
str(Sonar[, 1:10])
library(caret)
library(doParallel)
set.seed(998)
inTraining <- createDataPartition(Sonar$Class, p = .75, list = FALSE)
training <- Sonar[ inTraining,]
testing <- Sonar[-inTraining,]
grid <- expand.grid(n.trees = seq(50,150,by=50), interaction.depth = seq(1,3,by=1),
shrinkage = seq(.09,.11,by=.01),n.minobsinnode=seq(8,10,by=2))
# set seed to run fully reproducible model in parallel mode using caret
set.seed(825)
seeds <- vector(mode = "list", length = 11) # length is = (n_repeats*nresampling)+1
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 11) # ...the number of tuning parameter...
seeds[[11]]<-sample.int(1000, 1) # for the last model
fitControl <- trainControl(method = "cv",number = 10,seeds=seeds)
# run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
gbmFit1 <- train(Class ~ ., data = training,method = "gbm",
trControl = fitControl,tuneGrid=grid,verbose = FALSE)
gbmFit1
I will address your question in two parts:
1 - Setting the seeds:
The code to do it, as you stated:
set.seed(825)
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]]<- sample.int(n=1000, 54)
#for the last model
seeds[[11]]<-sample.int(1000, 1)
The 11 in seeds <- vector(mode = "list", length = 11) is (n_repeats*nresampling)+1. In your case you're using 10-fold CV, so it is 10+1 = 11. If you were using repeatedcv with number = 10 and repeats = 5, you would replace the 11 with (5*10)+1 = 51.
The 10 in for(i in 1:10) is (n_repeats*nresampling). In your case it is 10 because you're using 10-fold CV. Similarly, if you were using repeatedcv with number = 10 and repeats = 5, it would be for(i in 1:50).
The 54 in sample.int(n=1000, 54) is the number of tuning parameter combinations. In your case you have 4 parameters with 3, 3, 3 and 2 values, so it is 3*3*3*2 = 54. However, I remember reading somewhere that for gbm the model is fit with max(n.trees) in the grid, and the models with fewer trees are derived from it. This explains why caret calculates the number of seeds from interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 and not 3*3*3*2 = 54, as we will see later.
But if you were using an SVM model with a grid svmGrid <- expand.grid(sigma= 2^c(-25, -20, -15,-10, -5, 0), C= 2^c(0:5)), your value would be 6 * 6 = 36.
Remember, the goal of using seeds is to allow reproducible research by setting the seeds for the models fit at each resampling iteration.
The seeds[[11]]<-sample.int(1000, 1) is used to set the seed for the last (optimum) model fit to the complete dataset.
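The sizing rules above can be wrapped in a small helper. This is only a sketch: make_seeds is a hypothetical function, not part of caret, and the caller still has to know how many models are fit per resample (18 for the gbm grid here, 36 for the SVM grid).

```r
# Hypothetical helper (not part of caret): build a seeds list for trainControl()
# n_resamples: number of resampling iterations, e.g. number * repeats
# n_models:    number of models fit per resample
make_seeds <- function(n_resamples, n_models, seed = 825) {
  set.seed(seed)
  seeds <- vector(mode = "list", length = n_resamples + 1)
  for (i in seq_len(n_resamples)) {
    seeds[[i]] <- sample.int(n = 1000, n_models)
  }
  # single seed for the final model fit on the complete dataset
  seeds[[n_resamples + 1]] <- sample.int(1000, 1)
  seeds
}

seeds_cv  <- make_seeds(10, 18)      # 10-fold CV, gbm grid above
seeds_rcv <- make_seeds(10 * 5, 36)  # repeatedcv, number = 10, repeats = 5, SVM grid

lengths(seeds_cv)
```

The resulting list would then be passed as trainControl(..., seeds = seeds_cv).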
2 - Why you get an error if you specify a value < 18, but no error with a value >= 18
I was able to reproduce the same error on my machine:
Error in train.default(x, y, weights = w, ...) :
Bad seeds: the seed object should be a list of length 11 with 10 integer vectors of size 18 and the last list element having a single integer
So, by inspecting the train.default I was able to find its source. The error message is triggered by the stop in lines 7 to 10 based on the test badSeed in lines 4 and 5.
else {
if (!(length(trControl$seeds) == 1 && is.na(trControl$seeds))) {
numSeeds <- unlist(lapply(trControl$seeds, length))
4 badSeed <- (length(trControl$seeds) < length(trControl$index) +
5 1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
if (badSeed)
7 stop(paste("Bad seeds: the seed object should be a list of length",
8 length(trControl$index) + 1, "with", length(trControl$index),
9 "integer vectors of size", nrow(trainInfo$loop),
10 "and the last list element having a", "single integer"))
}
}
The number 18 is coming from nrow(trainInfo$loop), so we need to find the value of trainInfo$loop. The object trainInfo is assigned a value trainInfo <- models$loop(tuneGrid) in line 3:
if (trControl$method != "none") {
if (is.function(models$loop) && nrow(tuneGrid) > 1) {
3 trainInfo <- models$loop(tuneGrid)
if (!all(c("loop", "submodels") %in% names(trainInfo)))
stop("The 'loop' function should produce a list with elements 'loop' and 'submodels'")
}
Now, we need to find the object models. It is assigned the value of models <- getModelInfo(method, regex = FALSE)[[1]] in line 2:
else {
2 models <- getModelInfo(method, regex = FALSE)[[1]]
if (length(models) == 0)
stop(paste("Model", method, "is not in caret's built-in library"))
}
Since we are using method = "gbm", we can see the value of getModelInfo("gbm", regex = FALSE)[[1]]$loop and inspect the result below:
> getModelInfo("gbm", regex = FALSE)[[1]]$loop
function(grid) {
3 loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(n.trees = max(x$n.trees)))
submodels <- vector(mode = "list", length = nrow(loop))
for(i in seq(along = loop$n.trees)) {
index <- which(grid$interaction.depth == loop$interaction.depth[i] &
grid$shrinkage == loop$shrinkage[i] &
grid$n.minobsinnode == loop$n.minobsinnode[i])
trees <- grid[index, "n.trees"]
submodels[[i]] <- data.frame(n.trees = trees[trees != loop$n.trees[i]])
}
list(loop = loop, submodels = submodels)
}
>
The loop (in line 3 above) is assigned the value:
loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(n.trees = max(x$n.trees)))
Now, let's pass your grid with 54 rows to the line above and inspect the result:
> nrow(grid)
[1] 54
>
> loop <- ddply(grid, c("shrinkage", "interaction.depth", "n.minobsinnode"),
+ function(x) c(n.trees = max(x$n.trees)))
> loop
shrinkage interaction.depth n.minobsinnode n.trees
1 0.09 1 8 150
2 0.09 1 10 150
3 0.09 2 8 150
4 0.09 2 10 150
5 0.09 3 8 150
6 0.09 3 10 150
7 0.10 1 8 150
8 0.10 1 10 150
9 0.10 2 8 150
10 0.10 2 10 150
11 0.10 3 8 150
12 0.10 3 10 150
13 0.11 1 8 150
14 0.11 1 10 150
15 0.11 2 8 150
16 0.11 2 10 150
17 0.11 3 8 150
18 0.11 3 10 150
>
There it is: the value 18 comes from nrow(trainInfo$loop), which in turn comes from getModelInfo("gbm", regex = FALSE)[[1]]$loop, shown above, whose result has just 18 rows.
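The same count can be checked without caret or plyr. The sketch below uses base R's aggregate() as a stand-in for the ddply() call; it is an equivalent grouping for this grid, not the code caret itself runs.

```r
# The tuning grid from the question
grid <- expand.grid(n.trees = seq(50, 150, by = 50),
                    interaction.depth = seq(1, 3, by = 1),
                    shrinkage = seq(.09, .11, by = .01),
                    n.minobsinnode = seq(8, 10, by = 2))

# One row per combination of the other parameters, keeping only max(n.trees)
loop <- aggregate(n.trees ~ shrinkage + interaction.depth + n.minobsinnode,
                  data = grid, FUN = max)

nrow(grid)  # all tuning combinations
nrow(loop)  # models actually fit per resample
```

The number of rows in loop is the minimum length each per-resample seed vector needs.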
Now, going back to the test that triggered the error:
badSeed <- (length(trControl$seeds) < length(trControl$index) +
1) || (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop)))
The first part of the test, (length(trControl$seeds) < length(trControl$index) + 1), is FALSE, but the second part, (any(numSeeds[-length(numSeeds)] < nrow(trainInfo$loop))), is TRUE for all values less than 18 [coming from nrow(trainInfo$loop)] and FALSE for all values greater than or equal to 18. That's why the error is triggered for a value < 18 and not for >= 18. As I said above, caret calculates the number of seeds from interaction.depth * shrinkage * n.minobsinnode, in your case 3 * 3 * 2 = 18 (a model is fit with max(n.trees) and the others are derived from it, so there is no need for 54 integers).
I have the R iris dataset, which I am using for a PNN. The 3 species have been recoded as levels 0 to 2 as follows: 0 is setosa, 1 is versicolor, 2 is virginica. The training set is 75% of the data.
Q1. I don't understand the function pred_pnn. If anyone is good at R, perhaps you can explain how it works.
Q2. The output for the test set (the prediction) is shown below. I don't understand the output, because it is supposed to be something close to 0, 1 or 2.
data = read.csv("c:/iris-recoded.csv" , header = T)
size = nrow(data)
length = ncol(data)
index <- 1:size
positions <- sample(index, trunc(size * 0.75))
training <- data[positions,]
testing <- data[-positions,1:(length-1)]
result = data[-positions,]
result$actual = result[,length]
result$predict = -1
nn1 <- smooth(learn(training), sigma = 0.9)
pred_pnn <- function(x, nn){
xlst <- split(x, 1:nrow(x))
pred <- foreach(i = xlst, .combine = rbind) %dopar% {
data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names =NULL)
}
}
print(pred_pnn(testing, nn1))
prob
1 1.850818e-03
2 9.820653e-03
3 6.798603e-04
4 7.421435e-03
5 2.168817e-03
6 3.277354e-03
7 6.541173e-03
8 1.725332e-04
9 2.081845e-03
10 2.491388e-02
11 7.679823e-03
12 1.291811e-03
13 2.197234e-06
14 1.316366e-03
15 1.421219e-05
16 4.639239e-05
17 3.671907e-04
18 1.460001e-04
19 4.382849e-05
20 2.387543e-05
21 1.011196e-05
22 2.719982e-04
23 4.445472e-04
24 1.281762e-04
25 5.931106e-09
26 9.741870e-08
27 9.236434e-09
28 8.384690e-08
29 3.311667e-07
30 6.045306e-11
31 2.949265e-08
32 2.070014e-10
33 8.043735e-06
34 2.136666e-08
35 5.604398e-08
36 2.455841e-07
37 3.445977e-07
38 7.314647e-07
I'm assuming you're using the pnn package. The documentation for ?guess suggests that it does something similar to what predict does for other models. In other words, it predicts which class the observation belongs to. Everything else in there is for bookkeeping. Why do you get only the probabilities? Because the person who wrote the function made it that way, by extracting guess(x)$probabilities and returning only that. If you look at the raw output, you will also find the predicted class tucked away in the $category list element.
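To make the difference concrete without needing the package, here is a sketch that uses a hand-made list in place of the real return value. It assumes guess() returns a list with the elements $category and $probabilities, as described above. The numbers and the class names on the probabilities are made up for illustration.

```r
# Stand-in for the list returned by guess() (values are made up)
res <- list(
  category = "1",
  probabilities = c("0" = 0.002, "1" = 0.990, "2" = 0.008)
)

# What pred_pnn keeps: only the first probability, i.e. one class's score
first_prob <- res$probabilities[1]

# What the question was expecting: the predicted class itself
pred_class <- res$category

# The class with the highest probability
best <- names(which.max(res$probabilities))

first_prob
pred_class
best
```

Inside pred_pnn, returning guess(nn, as.matrix(i))$category instead of $probabilities[1] would therefore give one class label per test row, assuming the element is named as above.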