Difference in Cross validated results between R and python - r

I have a data frame like so:
log.comb CDEM_TWI Gruber_Ruggedness dNBR TC_Change_Sexton_Rel \
0 8.714914 10.70240 0.626106 0.701591 -27.12220
1 6.501334 10.65650 1.146360 0.693891 -35.52890
2 8.946111 13.58910 1.146360 0.513136 7.00000
3 8.955151 9.85036 1.126980 0.673891 13.81380
4 7.751379 7.28264 0.000000 0.256136 10.06940
5 8.895197 8.36555 0.000000 0.506000 -27.61340
6 8.676571 12.92650 0.000000 0.600627 -44.48400
7 8.562267 12.76980 0.519255 0.747009 -29.84790
8 9.052766 11.81580 0.519255 0.808336 -29.00900
9 9.133744 9.42046 0.484616 0.604891 -18.53550
10 8.221441 9.53682 0.484616 0.817336 -21.39920
11 8.398913 12.32050 0.519255 0.814745 -18.12080
12 7.587468 11.08880 1.274430 0.590282 92.85710
13 7.983136 8.95073 1.274430 0.316000 -10.34480
14 9.044404 11.18440 0.698818 0.608600 -14.77000
15 8.370293 11.96980 0.687634 0.323000 -9.60452
16 7.938134 12.42380 0.709549 0.374027 36.53140
17 8.183456 12.73490 1.439180 0.679627 -12.94420
18 8.322246 9.61600 0.551689 0.642900 37.50000
19 7.934997 7.77564 0.519255 0.690936 -25.29880
20 9.049387 11.16000 0.519255 0.789064 -35.73880
21 8.071323 6.17036 0.432980 0.574355 -22.43590
22 6.418345 5.98927 0.432980 0.584991 4.34783
23 7.950516 5.49527 0.422882 0.689009 25.22520
24 6.355529 7.35982 0.432980 0.419045 -18.81920
25 8.043683 5.18300 0.763596 0.582555 50.56180
26 6.013468 5.34018 0.493781 0.241155 -3.01205
27 7.961675 5.43264 0.493781 0.421527 -21.72290
28 8.074614 11.94630 0.493781 0.451800 11.61620
29 8.370570 6.34100 0.492384 0.550127 -12.50000
Pct_Pima Sand._15cm
0 75.62120 44.6667
1 69.30690 41.8333
2 59.47490 41.8333
3 66.08800 41.5000
4 34.31250 39.6667
5 35.04750 39.2424
6 62.32120 41.6667
7 57.14320 43.3333
8 57.35020 43.3333
9 72.90980 41.0000
10 57.61790 38.8333
11 57.35020 39.8333
12 69.30690 47.8333
13 69.30690 47.3333
14 76.58910 42.8333
15 75.62120 45.3333
16 76.69440 41.7727
17 59.47090 37.8333
18 61.10130 42.8333
19 72.67650 38.1818
20 57.35020 40.6667
21 23.15380 48.0000
22 17.15050 51.5000
23 0.00000 47.5000
24 6.67001 58.0000
25 15.18050 54.8333
26 5.89344 49.0000
27 5.89344 49.1667
28 13.18900 48.5000
29 13.30450 49.0000
I want to run a linear model through 10-fold cross validation repeated 100 times.
In python I do this:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold
from sklearn.metrics import r2_score
X = df[['CDEM_TWI', 'Gruber_Ruggedness', 'dNBR', 'TC_Change_Sexton_Rel', 'Pct_Pima', 'Sand._15cm']].copy()
y = df[['log.comb']].copy()
all_r2 = []
rskf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=42)
for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    lm = LinearRegression(fit_intercept = True)
    lm.fit(X_train, y_train)
    pred = lm.predict(X_test)
    r2 = r2_score(y_test, pred)
    all_r2.append(r2)
avg = np.mean(all_r2)
and here avg returns -0.11
In R I do this:
library(caret)
library(klaR)
train_control <- trainControl(method="repeatedcv", number=10, repeats=10)
model <- train(log.comb~., data=df, trControl=train_control, method="lm")
and model returns:
RMSE Rsquared MAE
0.7868838 0.6132806 0.7047198
I am curious why these results are so inconsistent with each other. I realize the folds differ between the two languages, but since I am repeating the procedure so many times, I don't understand why the numbers aren't more similar.
I also tried a nested grid search in sklearn like so:
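from sklearn.model_selection import KFold, GridSearchCV, cross_val_score
inner_cv = KFold(n_splits=10, shuffle=True, random_state=10)
outer_cv = KFold(n_splits=10, shuffle=True, random_state=10)
# Note: 'normalize' was removed from LinearRegression in scikit-learn 1.2
param_grid = {'fit_intercept': [True, False],
              'normalize': [True, False]}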
# Non_nested parameter search and scoring
clf = GridSearchCV(estimator=LinearRegression(), param_grid = param_grid, cv=inner_cv)
clf.fit(X, y)
non_nested_score = clf.best_score_
# Pass the gridSearch estimator to cross_val_score
clf = GridSearchCV(estimator= LinearRegression(), param_grid = param_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv).mean()
but nested_score and non_nested_score are both still negative.

The Python code is returning the average of the results, the R code is returning the best model found.
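One further likely contributor, not mentioned above, is that the two numbers are not the same statistic. scikit-learn's r2_score is the coefficient of determination, 1 - SS_res / SS_tot, which becomes negative whenever the predictions are worse than predicting the test-fold mean. caret's Rsquared is, by default, the squared correlation between observed and predicted values, which can never be negative. With 30 rows and 10 folds, each test fold holds only 3 points, so the two can diverge sharply. A minimal sketch in plain Python, using made-up numbers for a single 3-point fold (the function names are illustrative, not part of either library):

```python
def r2_determination(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (can be negative)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def r2_squared_correlation(y_true, y_pred):
    """Squared Pearson correlation between observed and predicted (never negative)."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Hypothetical test fold: predictions track the trend but are biased by -1
y_true = [8.0, 8.5, 9.0]
y_pred = [7.0, 7.5, 8.0]

r2_det = r2_determination(y_true, y_pred)
r2_cor = r2_squared_correlation(y_true, y_pred)
print(r2_det, r2_cor)
```

Here the predictions are perfectly correlated with the observations yet systematically off, so the squared correlation is 1 while the coefficient of determination is strongly negative.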

Related

DFFITs for Beta Regression

I am trying to calculate DFFITS for a GLM whose response follows a beta distribution, using the betareg R package. I think this package doesn't support influence.measures(), because calling dffits() fails with the code below.
Code
require(betareg)
df<-data("ReadingSkills")
y<-ReadingSkills$accuracy
n<-length(y)
bfit<-betareg(accuracy ~ dyslexia + iq, data = ReadingSkills)
DFFITS<-dffits(bfit, infl=influence(bfit, do.coef = FALSE))
DFFITS
it yields
Error in if (model$rank == 0) { : argument is of length zero
I am new to R and don't know how to resolve this problem. Please help me solve it, or give me some tips, with R code, on how to calculate DFFITS manually.
Regards
dffits are not implemented for "betareg" objects, but you could try to calculate them manually.
According to this Stack Overflow Q/A we could write this function:
dffits1 <- function(x1, bres.type="response") {
  stopifnot(class(x1) %in% c("lm", "betareg"))
  sapply(1:length(x1$fitted.values), function(i) {
    x2 <- update(x1, data=x1$model[-i, ]) # leave one out
    h <- hatvalues(x1)
    nm <- rownames(x1$model[i, ])
    num_dffits <- suppressWarnings(predict(x1, x1$model[i, ]) -
                                     predict(x2, x1$model[i, ]))
    residx <- if (class(x1) == "betareg") {
      betareg:::residuals.betareg(x2, type=bres.type)
    } else {
      x2$residuals
    }
    denom_dffits <- sqrt(c(crossprod(residx)) / x2$df.residual*h[i])
    return(num_dffits / denom_dffits)
  })
}
It works well for lm:
fit <- lm(mpg ~ hp, mtcars)
dffits1(fit)
stopifnot(all.equal(dffits1(fit), dffits(fit)))
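For the ordinary linear model, the leave-one-out quantity that dffits1 computes (change in the fitted value, divided by the refit's residual standard deviation times the square root of the leverage) agrees with the usual closed-form DFFITS. The following is a minimal sketch in plain Python for a single predictor, with made-up data; the function names are illustrative and it says nothing about whether the same scaling is appropriate for beta regression.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def leverages(x):
    """Hat values h_i of the full fit (intercept plus one predictor)."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]

def dffits_leave_one_out(x, y):
    """DFFITS obtained by refitting the model without observation i."""
    n, p = len(x), 2  # p = number of coefficients (intercept and slope)
    a, b = fit_line(x, y)
    h = leverages(x)
    out = []
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a_i, b_i = fit_line(xs, ys)
        rss_i = sum((yj - (a_i + b_i * xj)) ** 2 for xj, yj in zip(xs, ys))
        s_i = math.sqrt(rss_i / (n - 1 - p))  # residual SD of the refit
        change = (a + b * x[i]) - (a_i + b_i * x[i])  # shift in fitted value
        out.append(change / (s_i * math.sqrt(h[i])))
    return out

def dffits_closed_form(x, y):
    """DFFITS from the full fit only: t_i * sqrt(h_i / (1 - h_i))."""
    n, p = len(x), 2
    a, b = fit_line(x, y)
    h = leverages(x)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    rss = sum(e * e for e in resid)
    out = []
    for e, hi in zip(resid, h):
        s_i = math.sqrt((rss - e * e / (1 - hi)) / (n - p - 1))
        t_i = e / (s_i * math.sqrt(1 - hi))  # externally studentized residual
        out.append(t_i * math.sqrt(hi / (1 - hi)))
    return out

# Made-up data with one point pulling away from the line
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.8, 5.3, 9.0]
loo = dffits_leave_one_out(x, y)
closed = dffits_closed_form(x, y)
```

The two lists agree up to floating-point error, which is the same check the all.equal call performs against dffits for the lm fit.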
Now let's try betareg:
library(betareg)
data("ReadingSkills")
bfit <- betareg(accuracy ~ dyslexia + iq, data=ReadingSkills)
dffits1(bfit)
# 1 2 3 4 5 6 7
# -0.07590185 -0.21862047 -0.03620530 0.07349169 -0.11344968 -0.39255172 -0.25739032
# 8 9 10 11 12 13 14
# 0.33722706 0.16606198 0.10427684 0.11949807 0.09932991 0.11545263 0.09889406
# 15 16 17 18 19 20 21
# 0.21732090 0.11545263 -0.34296030 0.09850239 -0.36810187 0.09824013 0.01513643
# 22 23 24 25 26 27 28
# 0.18635669 -0.31192106 -0.39038732 0.09862045 -0.10859676 0.04362528 -0.28811277
# 29 30 31 32 33 34 35
# 0.07951977 0.02734462 -0.08419156 -0.38471945 -0.43879762 0.28583882 -0.12650591
# 36 37 38 39 40 41 42
# -0.12072976 -0.01701615 0.38653773 -0.06440176 0.15768684 0.05629040 0.12134228
# 43 44
# 0.13347935 0.19670715
Looks not bad.
Notes:
Even if this works in code, you should check if it meets your statistical requirements!
I've used suppressWarnings around the two predict calls in dffits1. predict(bfit, ReadingSkills) drops the contrasts somehow, whereas predict(bfit) does not (the two should practically be the same). However, the results are identical: all.equal(predict(bfit, ReadingSkills), predict(bfit)), so ignoring the warnings should be safe.

Implementing XGBoost on Methyl450k data set in R

I'm attempting to implement XGBoost on a Methyl450k data set. The data has more than 480,000 specific CpG sites, each with beta values between 0 and 1. Here is a look at the data (a sample of 10 columns plus the response):
cg13869341 cg14008030 cg12045430 cg20826792 cg00381604 cg20253340 cg21870274 cg03130891 cg24335620 cg16162899 response
1 0.8612869 0.6958909 0.07918330 0.10816711 0.03484078 0.4875475 0.7475878 0.11051578 0.7120003 0.8453396 0
2 0.8337106 0.6276754 0.09811698 0.08934333 0.03348864 0.6300766 0.7753453 0.08652890 0.6465146 0.8137132 0
3 0.8516102 0.6575332 0.13310207 0.07990076 0.04195286 0.4325115 0.7257208 0.14334007 0.7384455 0.8054013 0
4 0.8970384 0.6955810 0.08134887 0.08950676 0.03578006 0.4711689 0.7214661 0.08299838 0.7718571 0.8151683 0
5 0.8562323 0.7204416 0.08078766 0.14902533 0.04274820 0.4769631 0.8034706 0.16473891 0.7143823 0.8475410 0
6 0.8613325 0.6527599 0.10158672 0.15459204 0.04839691 0.4805285 0.8004808 0.12598627 0.8218743 0.8222552 0
7 0.9168869 0.5963966 0.11457045 0.13245761 0.03720798 0.5067649 0.6806004 0.13601034 0.7063457 0.8509160 0
8 0.9002366 0.6898320 0.07029171 0.07158694 0.03875135 0.7065322 0.8167016 0.15394095 0.7226098 0.8310477 0
9 0.8876504 0.6172154 0.13511072 0.15276686 0.06149520 0.5642073 0.7177438 0.14752285 0.6846876 0.8360360 0
10 0.8992898 0.6361644 0.15423780 0.19111275 0.05700406 0.4941239 0.7819968 0.10109936 0.6680640 0.8504023 0
11 0.8997905 0.5906462 0.10411472 0.15006796 0.04157008 0.4931531 0.7857664 0.13430963 0.6946644 0.8326747 0
12 0.9009607 0.6721858 0.09081460 0.11057752 0.05824153 0.4683763 0.7655608 0.01755990 0.7113345 0.8346149 0
13 0.9036750 0.6313643 0.07477824 0.12089404 0.04738597 0.5502747 0.7520128 0.16332395 0.7036665 0.8564414 0
14 0.8420276 0.6265071 0.15351674 0.13647090 0.04901864 0.5037902 0.7446693 0.10534171 0.7727812 0.8317943 0
15 0.8995276 0.6515500 0.09214429 0.08973162 0.04231420 0.5071999 0.7484940 0.21822470 0.6859165 0.7775508 0
16 0.9071643 0.7945852 0.15809474 0.11264440 0.04793316 0.5256078 0.8425513 0.17150603 0.7581367 0.8271037 0
17 0.8691358 0.6206902 0.11868549 0.15944891 0.03523320 0.4581166 0.8058461 0.11557264 0.6960848 0.8579109 1
18 0.8330247 0.7030860 0.12832663 0.12936172 0.03534059 0.4687507 0.7630222 0.12176819 0.7179690 0.8775521 1
19 0.9015574 0.6592869 0.12693119 0.14671845 0.03819418 0.4395692 0.7420882 0.10293369 0.7047038 0.8435531 1
20 0.8568249 0.6762936 0.18220218 0.10123198 0.04963466 0.5781550 0.6324743 0.06676272 0.6805745 0.8291353 1
21 0.8799152 0.6736554 0.15056617 0.16070673 0.04944037 0.4015415 0.4587438 0.10392791 0.7467060 0.7396137 1
22 0.8730770 0.6663321 0.10802390 0.14481460 0.04448009 0.5177664 0.6682854 0.16747621 0.7161234 0.8309462 1
23 0.9359656 0.7401368 0.16730300 0.11842173 0.03388908 0.4906018 0.5730439 0.15970761 0.7904663 0.8136450 1
24 0.9320397 0.6978085 0.10474803 0.10607080 0.03268366 0.5362214 0.7832729 0.15564091 0.7171350 0.8511477 1
25 0.8444256 0.7516799 0.16767449 0.12025258 0.04426417 0.5040725 0.6950104 0.16010829 0.7026808 0.8800469 1
26 0.8692707 0.7016945 0.10123979 0.09430876 0.04037325 0.4877716 0.7053603 0.09539885 0.8316933 0.8165352 1
27 0.8738410 0.6230674 0.12793232 0.14837137 0.04878595 0.4335648 0.6547601 0.13714725 0.6944921 0.8788708 1
28 0.9041870 0.6201079 0.12490195 0.16227251 0.04812720 0.4845896 0.6619842 0.13093443 0.7415606 0.8479339 1
29 0.8618622 0.7060291 0.09453812 0.14068246 0.04799782 0.5474036 0.6088231 0.23338428 0.6772588 0.7795908 1
30 0.8776350 0.7132561 0.12100425 0.17367148 0.04399987 0.5661632 0.6905305 0.12971867 0.6788903 0.8198201 1
31 0.9134456 0.7249370 0.07144695 0.08759897 0.04864476 0.6682650 0.7445900 0.16374150 0.7322691 0.8071598 1
32 0.8706637 0.6743936 0.15291891 0.11422262 0.04284591 0.5268217 0.7207478 0.14296945 0.7574967 0.8609048 1
33 0.8821504 0.6845216 0.12004074 0.14009196 0.05527732 0.5677475 0.6379840 0.14122421 0.7090634 0.8386022 1
34 0.9061180 0.5989445 0.09160787 0.14325261 0.05142950 0.5399465 0.6718870 0.08454002 0.6709083 0.8264233 1
35 0.8453511 0.6759766 0.13345672 0.16310764 0.05107034 0.4666146 0.7343603 0.12733287 0.7062292 0.8471812 1
36 0.9004188 0.6114532 0.11837118 0.14667433 0.05050403 0.4975502 0.7258132 0.14894363 0.7195090 0.8382364 1
37 0.9051729 0.6652954 0.15153241 0.14571184 0.05026702 0.4855397 0.7226850 0.12179138 0.7430388 0.8342340 1
38 0.9112012 0.6314450 0.12681305 0.16328649 0.04076789 0.5382251 0.7404122 0.13971506 0.6607798 0.8657917 1
39 0.8407927 0.7148585 0.12792107 0.15447060 0.05287096 0.6798039 0.7182050 0.06549068 0.7433669 0.7948445 1
40 0.8554747 0.7356683 0.22698080 0.21692162 0.05365043 0.4496654 0.7353112 0.13341649 0.8032266 0.7883068 1
41 0.8535359 0.5729331 0.14392737 0.16612463 0.04651752 0.5228045 0.7397588 0.09967424 0.7906682 0.8384434 1
42 0.8059968 0.7148594 0.16774123 0.19006840 0.04990847 0.5929818 0.7011064 0.17921090 0.8121909 0.8481069 1
43 0.8856906 0.6987405 0.19262137 0.18327412 0.04816967 0.4340002 0.6569263 0.13724290 0.7600389 0.7788117 1
44 0.8888717 0.6760166 0.17025712 0.21906969 0.04812641 0.4173613 0.7927178 0.17458413 0.6806101 0.8297604 1
45 0.8691575 0.6682723 0.11932277 0.13669098 0.04014911 0.4680455 0.6186511 0.10002737 0.8012731 0.7177891 1
46 0.9148742 0.7797494 0.13313955 0.15166151 0.03934042 0.4818276 0.7484973 0.16354624 0.6979735 0.8164431 1
47 0.9226736 0.7211714 0.08036409 0.10395457 0.04063595 0.4014187 0.8026643 0.17762644 0.7194800 0.8156545 1
I've attempted to implement the algorithm in R but I'm continuing to get errors.
Attempt:
> train <- beta_values1_updated[training1, ]
> test <- beta_values1_updated[-training1, ]
> labels <- train$response
> ts_label <- test$response
> new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
Error in `[.data.frame`(train, , -c("response"), with = F) :
unused argument (with = F)
> new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
Error in `[.data.frame`(test, , -c("response"), with = F) :
unused argument (with = F)
I am attempting to follow the tutorial here:
https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/
Any insight as to how I could correctly implement the XGBoost algorithm would be greatly appreciated.
Edit:
I'm adding additional code to show the point in the tutorial where I get stuck:
train<-data.table(train)
test<-data.table(test)
new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
#convert factor to numeric
labels <- as.numeric(labels)-1
ts_label <- as.numeric(ts_label)-1
#preparing matrix
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 20 rounds.
[11] train-error:0.000000+0.000000 test-error:0.000000+0.000000
[21] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Stopping. Best iteration:
[1] train-error:0.000000+0.000000 test-error:0.000000+0.000000
Warning messages:
1: 'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated").
2: 'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
The author of the tutorial is using the data.table package. As you can read here, with = F is data.table syntax that makes the column argument be treated as a vector of column names, as in a data.frame; a plain data.frame does not accept it, hence the "unused argument" error. Make sure you have installed and loaded data.table and the other packages the tutorial uses. Also, make sure your data set is a data.table object.

How can I use SOM algorithm for classification prediction

I would like to see If SOM algorithm can be used for classification prediction.
I used the code below, but the classification results are far from right. For example, in the test dataset I get many more than the 3 values present in the training target variable. How can I create a prediction model that is aligned with the training target variable?
library(kohonen)
library(HDclassif)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, ])
Xtest <- scale(wine[-training, ],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
som.wine <- som(Xtraining, grid = somgrid(5, 5, "hexagonal"))
som.prediction$pred <- predict(som.wine, newdata = Xtest,
trainX = Xtraining,
trainY = factor(Xtraining$class))
And the result:
$unit.classif
[1] 7 7 1 7 1 11 6 2 2 7 7 12 11 11 12 2 7 7 7 1 2 7 2 16 20 24 25 16 13 17 23 22
[33] 24 18 8 22 17 16 22 18 22 22 18 23 22 18 18 13 10 14 15 4 4 14 14 15 15 4
This might help:
SOM is an unsupervised algorithm, so it should not be trained on a dataset that includes the class label (if it is, it will need that label to work and will be useless on unlabelled data).
The idea is that it "converts" a numeric input vector into a network unit number (rerun your code with a 1 by 3 grid and you will get the output you expected).
You then need to convert those unit numbers back into the categories you are looking for (this is the key part missing from your code).
The reproducible example below outputs a classification score (the share of correct predictions). It includes one way to implement the "convert back" step missing from your original post.
For this particular dataset, though, the model overfits pretty quickly: 3 units give the best results.
#Set and scale a training set (-1 to drop the classes)
data(wine)
set.seed(7)
training <- sample(nrow(wine), 120)
Xtraining <- scale(wine[training, -1])
#Scale a test set (-1 to drop the classes)
Xtest <- scale(wine[-training, -1],
center = attr(Xtraining, "scaled:center"),
scale = attr(Xtraining, "scaled:scale"))
#Set 2D grid resolution
#WARNING: it overfits pretty quickly
#Share of correct predictions: 36% for 1 unit, 63% for 2, 93% for 3, 89% for 4
som_grid <- somgrid(xdim = 1, ydim=3, topo="hexagonal")
#Create a trained model
som_model <- som(Xtraining, som_grid)
#Make a prediction on test data
som.prediction <- predict(som_model, newdata = Xtest)
#Put together original classes and SOM classifications
error.df <- data.frame(real = wine[-training, 1],
predicted = som.prediction$unit.classif)
#Return the category number that has the strongest association with the unit
#number (0 stands for ambiguous)
switch <- sapply(unique(som_model$unit.classif), function(x, df){
  cat <- as.numeric(names(which.max(table(
    error.df[error.df$predicted==x,1]))))
  if(length(cat)<1){
    cat <- 0
  }
  return(c(x, cat))
}, df = data.frame(real = wine[training, 1], predicted = som_model$unit.classif))
#Translate units numbers into classes
error.df$corrected <- apply(error.df, MARGIN = 1, function(x, switch){
  cat <- switch[2, which(switch[1,] == x["predicted"])]
  if(length(cat)<1){
    cat <- 0
  }
  return(cat)
}, switch = switch)
#Compute the classification accuracy (share of correct predictions)
sum(error.df$corrected == error.df$real)/length(error.df$real)
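The "convert back" step is essentially a majority vote: each unit is assigned the class that occurs most often among the labelled observations mapped to it, and test observations inherit the class of their unit. A minimal sketch in plain Python, assuming the unit assignments are already available; the unit numbers and labels below are made up and the function names are illustrative. The mapping here is learned from training labels only.

```python
from collections import Counter, defaultdict

def unit_to_class(train_units, train_labels):
    """Map each SOM unit to the most frequent training label assigned to it."""
    votes = defaultdict(Counter)
    for unit, label in zip(train_units, train_labels):
        votes[unit][label] += 1
    return {unit: counts.most_common(1)[0][0] for unit, counts in votes.items()}

def predict_classes(test_units, mapping, unknown=0):
    """Translate unit numbers into classes; units never seen in training get `unknown`."""
    return [mapping.get(unit, unknown) for unit in test_units]

# Hypothetical unit assignments and class labels for a 3-unit map
train_units = [1, 1, 1, 2, 2, 3, 3, 3]
train_labels = [1, 1, 2, 2, 2, 3, 3, 1]

mapping = unit_to_class(train_units, train_labels)
predicted = predict_classes([1, 2, 3, 4], mapping)
print(mapping, predicted)
```

Unit 4 never appears in the training data, so it falls back to 0, mirroring the "0 stands for ambiguous" convention used above.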

Feeding new data into predict() for multiple regression? [duplicate]

This question already has an answer here:
r predict function returning too many values [closed]
(1 answer)
Closed 6 years ago.
Assume I have fit a regression model with multiple predictor variables in R, as in the following toy example:
n <- 20
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)
m <- lm(z ~ x + y + I(y^2))
Now I have new data, consisting of x and y values, and I want to predict the corresponding z values:
x.new <- rnorm(5)
y.new <- rnorm(5)
Question: How should I best call predict to apply the fitted model to the new data?
Here are a few things I tried, which do not work:
Attempt 1. Trying to use the x.new and y.new as the columns of a new data frame:
> predict(m, data=data.frame(x=x.new, y=y.new))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
This gave 20 values instead of 5, so cannot be right.
Attempt 2: Maybe predict got confused because the y^2 values were not supplied? Try to use model.frame to provide data in the correct form.
> predict(m, model.frame(~ x.new + y.new + I(y.new^2)))
1 2 3 4 5 6 7
-0.0157090 1.1667958 -1.3797101 0.1185750 0.7786496 1.7666232 -0.6692865
8 9 10 11 12 13 14
1.9720532 0.3514206 1.1677019 0.6441418 -2.3010431 -0.3228424 -0.2181511
15 16 17 18 19 20
-0.8883275 0.4549592 -1.0377040 0.1750522 -2.4542843 1.2250101
Warning message:
'newdata' had 5 rows but variables found have 20 rows
Again, this results in 20 values (plus a warning), so cannot be right.
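The argument is newdata (not data) when telling predict what to predict for. predict does not recognise data, silently ignores it, and returns the fitted values for the 20 training rows. The columns of the new data frame must be named x and y, as in the model formula; I(y^2) is then computed automatically.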
predict(m, newdata = data.frame(x = x.new, y = y.new))

R - Probabilistic Neural Network - Iris dataset

I have the R iris dataset, which I am using for a PNN. The 3 species have been recoded as levels 0 to 2 as follows: 0 is setosa, 1 is versicolor, 2 is virginica. The training set is 75% of the data.
Q1. I don't understand the function pred_pnn, if anyone is good in R perhaps you can explain how it works
Q2. The output of the test set or prediction is shown below, I don't understand the output because it is supposed to be something close to either 0,1,2
data = read.csv("c:/iris-recoded.csv" , header = T)
size = nrow(data)
length = ncol(data)
index <- 1:size
positions <- sample(index, trunc(size * 0.75))
training <- data[positions,]
testing <- data[-positions,1:length-1]
result = data[-positions,]
result$actual = result[,length]
result$predict = -1
nn1 <- smooth(learn(training), sigma = 0.9)
pred_pnn <- function(x, nn){
  xlst <- split(x, 1:nrow(x))
  pred <- foreach(i = xlst, .combine = rbind) %dopar% {
    data.frame(prob = guess(nn, as.matrix(i))$probabilities[1], row.names =NULL)
  }
}
print(pred_pnn(testing, nn1))
prob
1 1.850818e-03
2 9.820653e-03
3 6.798603e-04
4 7.421435e-03
5 2.168817e-03
6 3.277354e-03
7 6.541173e-03
8 1.725332e-04
9 2.081845e-03
10 2.491388e-02
11 7.679823e-03
12 1.291811e-03
13 2.197234e-06
14 1.316366e-03
15 1.421219e-05
16 4.639239e-05
17 3.671907e-04
18 1.460001e-04
19 4.382849e-05
20 2.387543e-05
21 1.011196e-05
22 2.719982e-04
23 4.445472e-04
24 1.281762e-04
25 5.931106e-09
26 9.741870e-08
27 9.236434e-09
28 8.384690e-08
29 3.311667e-07
30 6.045306e-11
31 2.949265e-08
32 2.070014e-10
33 8.043735e-06
34 2.136666e-08
35 5.604398e-08
36 2.455841e-07
37 3.445977e-07
38 7.314647e-07
I'm assuming you're using the pnn package. The documentation in ?guess suggests that it does much the same as predict does for other models: it predicts the class to which an observation belongs. Everything else in pred_pnn is bookkeeping. Why do you get only probabilities? Because whoever wrote the function made it that way, extracting guess(...)$probabilities[1] and returning only that. If you look at the raw output of guess, you will also find the predicted class in the $category list element.
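To make the distinction between the probabilities and the predicted category concrete, here is a minimal sketch of a probabilistic neural network in plain Python. It is not the pnn package's implementation: it assumes a Gaussian kernel with width sigma, averages the kernel values per class, and normalises them so they sum to 1. The function name pnn_guess and the toy data are made up for illustration.

```python
import math

def pnn_guess(train_x, train_y, x, sigma=0.9):
    """Return (category, probabilities) for a single observation x."""
    sums, counts = {}, {}
    for xi, label in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(xi, x))  # squared distance to a training point
        sums[label] = sums.get(label, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
        counts[label] = counts.get(label, 0) + 1
    # Average kernel value per class, then normalise across classes
    density = {label: sums[label] / counts[label] for label in sums}
    total = sum(density.values())
    probabilities = {label: d / total for label, d in density.items()}
    category = max(probabilities, key=probabilities.get)  # most probable class
    return category, probabilities

# Toy data: two well-separated classes coded 0 and 1
train_x = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
train_y = [0, 0, 1, 1]

category, probabilities = pnn_guess(train_x, train_y, (0.05, 0.0))
print(category, probabilities)
```

Returning only the first probability, as pred_pnn does, discards the category; a value close to 0, 1 or 2 would come from the category, not from a single class probability.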
