Evaluating neural network performance in R

I trained my neural network with a sigmoid activation function, so the predicted values lie in the range (0,1). However, the real data, on which a z-score transformation has been performed, go beyond that range. In this case, what would be the appropriate way to evaluate my model? Should I also rescale the original test data to the same range and then evaluate with criteria like mean squared forecast error?
> real_predicted_neural
predicted real
1 1.909219e-07 -3.57877473
2 4.161819e-08 -2.28704595
3 1.754706e-11 -1.08509429
4 1.149891e-13 -0.46573114
5 7.777560e-02 0.42381300
6 4.173448e-07 -0.44060297
7 1.119703e-01 0.21075550
8 8.682557e-01 -0.01292402
9 4.736056e-08 -0.29830701
10 7.506821e-08 -1.20302227
11 7.341235e-01 -0.03986571
12 7.501776e-05 -0.94315815
13 1.145697e-04 0.49730175
14 2.214929e-13 0.04252241
15 4.597199e-01 -0.38539901
16 2.324931e-03 -0.74468628
17 4.366025e-06 -0.77037244
18 1.394450e-06 0.16679048
19 5.869884e-11 -0.75876486
20 1.817941e-04 0.04303387
21 7.060773e-04 0.06099372
22 8.267170e-06 -1.21687318
23 9.388680e-02 0.61135319
24 1.099290e-01 0.55715201
25 9.757236e-01 -0.33480226
26 9.544055e-01 0.09061006
27 7.322074e-07 0.09290822
28 1.014327e-06 -0.61658893
29 7.848382e-08 -0.78739456
30 1.791908e-04 -0.44073540
31 1.357918e-03 -0.22099008
32 5.192233e-06 -0.32744703
33 2.624779e-06 -0.37644068
34 6.414216e-02 -0.36947939
35 1.388143e-06 -0.00994845
36 3.010872e-05 -0.05984833
37 9.873201e-03 -0.21815268
38 3.896163e-04 -0.24009094
39 2.718760e-02 0.33383333
40 1.025650e-02 0.09779867
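A minimal sketch of the two evaluation routes the question describes, assuming the training targets were min-max scaled into the unit interval before fitting; y_min and y_max are hypothetical scaling constants that would have been saved from the training data:
predicted <- real_predicted_neural$predicted
real <- real_predicted_neural$real
# Option 1: map the predictions back to the z-score scale, then score there
pred_back <- predicted * (y_max - y_min) + y_min
mean((pred_back - real)^2)  # mean squared forecast error on the original scale
# Option 2: map the test targets into [0, 1] with the SAME constants used in training
real_scaled <- (real - y_min) / (y_max - y_min)
mean((predicted - real_scaled)^2)  # mean squared forecast error on the unit scale
Either way, the key point is that both series must be compared on one consistent scale, using the transformation constants taken from the training data.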

Related

rpart -- number of splits

Using printcp I got output resembling the following (this is only a portion):
CP nsplit rel error xerror xstd
1 3.254666e-01 0 1.0000000 1.0000000 0.003976889
2 5.395058e-02 1 0.6745334 0.6745334 0.003567289
3 4.125633e-02 3 0.5666322 0.5878145 0.003401065
4 1.726150e-02 4 0.5253759 0.5492028 0.003317552
5 1.222830e-02 7 0.4735914 0.4925069 0.003183022
6 1.193864e-02 10 0.4364909 0.4744730 0.003137010
7 9.243634e-03 12 0.4126137 0.4489081 0.003068901
8 5.238899e-03 13 0.4033700 0.4277007 0.003009687
9 3.878800e-03 14 0.3981311 0.4183311 0.002982702
10 3.664710e-03 16 0.3903735 0.4115054 0.002962714
11 3.261718e-03 18 0.3830441 0.4098935 0.002957953
12 2.934287e-03 20 0.3765207 0.4063421 0.002947406
13 2.871320e-03 24 0.3647835 0.4044783 0.002941839
14 2.770571e-03 25 0.3619122 0.4000201 0.002928437
15 2.052742e-03 26 0.3591416 0.3973503 0.002920351
16 1.989774e-03 28 0.3550361 0.3924892 0.002905511
17 1.813465e-03 29 0.3530464 0.3911795 0.002901486
18 1.763091e-03 30 0.3512329 0.3880563 0.002891845
19 1.737904e-03 31 0.3494698 0.3863688 0.002886609
20 1.674936e-03 32 0.3477319 0.3832708 0.002876947
21 1.670739e-03 35 0.3422915 0.3830693 0.002876317
22 1.662343e-03 39 0.3355666 0.3827167 0.002875212
23 1.653947e-03 40 0.3339042 0.3824900 0.002874502
Which value shows the total number of splits in the tree -- nsplit, or the largest index (left-most column)? (I.e., 23 or 40?)
The table you are seeing from the printcp function is the $cptable object from your CART model. The "nsplit" column does indeed show the number of splits.
So, you can get the total number of splits in the tree with
max(carttree$cptable[,"nsplit"])
where carttree is the name of your CART tree.
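As a runnable illustration on rpart's built-in kyphosis data (not the asker's model):
library(rpart)
carttree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(carttree)                   # prints the cptable, as in the question
max(carttree$cptable[, "nsplit"])   # total number of splits in the tree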

When do you need Kinhom rather than Kest?

The envelope of the K function (and its variants such as L) is very useful for validating a fitted spatial point process model. For instance, I fit a Poisson model to a dataset J1a2, which is as follows:
J1a2.points:
# X.1 X Y
1 1 118.544 1638.445
2 2 325.995 1761.223
3 3 681.625 1553.771
4 4 677.392 1816.261
5 5 986.451 1685.016
6 6 1469.093 1354.787
7 7 1608.805 1625.744
8 8 1994.071 1782.391
9 9 1968.669 1375.955
10 10 2362.403 1337.852
11 11 2701.099 1773.924
12 12 2900.083 1820.495
13 13 2963.588 1668.081
14 14 3412.360 1676.549
15 15 3378.490 1456.396
16 16 3721.420 1464.863
17 17 3823.028 1701.951
18 18 4072.817 1790.859
19 19 4089.751 1388.656
20 20 97.375 715.497
21 21 376.799 1033.025
22 22 563.082 1126.166
23 23 935.647 1206.607
24 24 512.277 486.876
25 25 935.647 757.834
26 26 1409.821 410.670
27 27 1435.223 639.290
28 28 1706.180 1045.726
29 29 1968.669 876.378
30 30 2307.365 711.263
31 31 2624.892 897.546
32 32 2654.528 1236.243
33 33 2857.746 423.371
34 34 3039.795 639.290
35 35 3298.050 707.029
36 36 3111.767 1011.856
37 37 3361.555 1227.775
38 38 4047.414 1185.438
39 39 3569.007 508.045
40 40 4250.632 469.942
41 41 4386.110 872.144
42 42 93.141 237.088
43 43 554.614 186.283
44 44 757.832 148.180
45 45 965.283 220.153
46 46 1723.115 296.360
47 47 1744.283 423.371
48 48 1913.631 203.218
49 49 2167.653 292.126
50 50 2629.126 211.685
51 51 3217.610 283.658
52 52 3827.262 325.996
and:
J1a2.Win <- owin(c(0, 4500.42), c(0, 1917.87))
If you draw the envelope for the data with Lest:
library(spatstat)
env.data <- envelope(J1a2, Lest, correction = "border",
nsim = 19, global = TRUE)
plot(env.data, . - r ~ r, shade = NULL, legend = FALSE,
xlab = expression(paste("r(", mu, "m)")), ylab = "L(r)-r", main = "")
The Lest() curve goes outside the envelope. However, if you use Linhom instead of Lest, you will find that the Linhom() curve stays entirely inside the envelope.
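For comparison, a sketch of the corresponding inhomogeneous envelope under the same settings:
env.inhom <- envelope(J1a2, Linhom, correction = "border",
nsim = 19, global = TRUE)
plot(env.inhom, . - r ~ r, shade = NULL, legend = FALSE,
xlab = expression(paste("r(", mu, "m)")), ylab = "L(r)-r", main = "")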
This seems to suggest an inhomogeneous intensity for the data, so I used y as a covariate in the fit:
poisson.J1a2 <- ppm(J1a2 ~ 1, Poisson(), correction = "border")
y.J1a2 <- ppm(J1a2 ~ y, correction = "border")
anova(poisson.J1a2, y.J1a2, test = "LR")  # p = 0.6484
I don't find any evidence of a spatial trend in intensity along y, x, or their combinations.
Then why does Linhom() outperform Lest() in this case?
Furthermore, when should one decide to use Linhom() instead of Lest()?
You should first decide whether or not the intensity can be assumed to be constant. To help you with this, you can look at kernel density estimates or do formal tests such as a quadrat test. If you decide that the intensity can be assumed to be constant, use Lest(); if that is not the case, use Linhom().
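A minimal sketch of those two checks in spatstat, assuming J1a2 is the ppp object built from J1a2.points and J1a2.Win:
library(spatstat)
plot(density(J1a2))                 # kernel estimate of the intensity surface
quadrat.test(J1a2, nx = 4, ny = 2)  # chi-squared test of constant intensity
A small p-value from the quadrat test is evidence against constant intensity, pointing towards Linhom().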

How can I look at a specific train and test set generated in a for loop?

My program divides my dataset into a train and a test set, builds a decision tree based on the train and test sets, and calculates the accuracy, sensitivity, and specificity of the confusion matrix.
I added a for loop to rerun my program 100 times. This means I get 100 train and test sets. The output of the for loop is a result_df with columns for accuracy, sensitivity, and specificity.
This is the for loop:
result_df <- matrix(ncol = 3, nrow = 100)
colnames(result_df) <- c("Acc", "Sens", "Spec")
for (g in 1:100)
{
# Divide into train and test set
smp_size <- floor(0.8 * nrow(mydata1))
train_ind <- sample(seq_len(nrow(mydata1)), size = smp_size)
train <- mydata1[train_ind, ]
test <- mydata1[-train_ind, ]
# REST OF MY CODE
}
My result_df (first 20 rows) looks like this:
> result_df[1:20,]
Acc Sens Spec id
1 26 22 29 1
2 10 49 11 2
3 37 43 36 3
4 4 79 4 4
5 21 21 20 5
6 31 17 34 6
7 57 4 63 7
8 33 3 39 8
9 56 42 59 9
10 65 88 63 10
11 6 31 7 11
12 57 44 62 12
13 25 10 27 13
14 32 24 32 14
15 19 8 19 15
16 27 27 29 16
17 38 89 33 17
18 54 32 56 18
19 35 62 33 19
20 37 6 40 20
I use ggplot() to plot the specificity and the sensitivity as a scatterplot.
What I want to do :
I want to see e.g. the train and test set of datapoint 17.
I think I can do this by using the set.seed function, but I am very unfamiliar with this function.
First, clearly, if your code stored your estimated models, e.g., in a list, then you could recover your data from those models. However, it doesn't look like that's the case.
With your current code, all you can do is see the last train and test set (number 100). That is because you keep redefining the test, train, and train_ind variables. The cheapest (in terms of memory) way to achieve what you want would be to store train_ind from each iteration. For instance, you could use
train_inds <- vector("list", 100)
for (g in 1:100)
{
smp_size <- floor(0.8 * nrow(mydata1))
train_inds[[g]] <- sample(seq_len(nrow(mydata1)), size = smp_size)
train <- mydata1[train_inds[[g]], ]
test <- mydata1[-train_inds[[g]], ]
# The rest
}
and in this way you would always know which observations were in which set. If you are interested in only one specific iteration, you could save only that one.
Lastly, set.seed isn't really going to help here. If all you were doing was running rnorm(1) a hundred times, then yes, by using set.seed you could quickly recover the n-th generated value later. In your case, however, you are not only using sample for train_ind; the model estimation functions are also very likely generating random values.
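For example, to pull out the train and test set of datapoint 17 afterwards:
train17 <- mydata1[train_inds[[17]], ]
test17 <- mydata1[-train_inds[[17]], ]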

Moran's correlogram with only one point. What is wrong?

I'm trying Moran's I and the respective plot in R, but the plot has only one point. I have no idea what is going wrong. The code is based on
http://rstudio-pubs-static.s3.amazonaws.com/9688_a49c681fab974bbca889e3eae9fbb837.html
My data, called "coordenata":
resid x y
1 0.07785411 -53.20342 -22.66700
2 -0.28358702 -53.20389 -22.66864
3 -0.64011338 -53.21392 -22.68122
4 1.22071249 -53.21311 -22.72369
5 0.95734778 -53.28469 -22.75289
6 0.35345302 -53.25822 -22.74850
7 -0.68357738 -53.28344 -22.70694
8 -1.24596010 -53.32950 -22.72872
9 -0.19944162 -53.33669 -22.73561
10 0.67544909 -53.36756 -22.80767
11 0.64002961 -53.35947 -22.79958
12 0.04564233 -53.21889 -22.67419
13 0.01618436 -53.24522 -22.70144
14 -2.65436794 -53.23017 -22.69292
15 0.72096256 -53.25539 -22.69978
16 0.89656515 -53.28489 -22.72222
17 1.85358579 -53.33069 -22.79161
18 -0.03590077 -53.33200 -22.78336
19 0.32348975 -53.33494 -22.78586
20 2.06771402 -53.37781 -22.77869
21 -1.02190709 -53.30492 -22.77244
22 -2.02813250 -53.53917 -22.79856
23 -1.20702445 -53.53858 -22.79406
24 -1.24091732 -53.55272 -22.80536
25 -1.13491596 -53.56181 -22.82914
26 -0.82934613 -53.56422 -22.83417
27 1.23418758 -53.60017 -22.85531
28 -1.72808514 -53.65900 -22.97828
29 -0.02144049 -53.65908 -22.97497
30 0.49174568 -53.64597 -22.95439
31 -0.54408149 -53.64217 -22.91033
32 -0.37111342 -53.61447 -22.86269
33 -0.31121931 -53.27153 -22.70036
34 0.32419211 -53.30308 -22.72183
35 1.57980287 -53.33053 -22.72947
36 -1.91156060 -53.34633 -22.74722
37 -0.79036645 -53.23667 -22.68925
The code:
coordinates(coordenata)<-c("x","y")
fit2<-correlog(coordenata$x,coordenata$y,coordenata$resid,increment=5,resamp=100,quiet=T)
plot(fit2)
Thanks in advance for any help!
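One plausible cause, offered as an assumption rather than a confirmed diagnosis (and assuming correlog() here is ncf's, which matches this call signature): correlog() bins pairwise distances in the units of x and y. Here x and y are longitude/latitude in degrees, and the whole extent of the data is well under one degree, so increment=5 puts every pair into a single distance class, which plots as a single point. A sketch of two possible fixes, assuming coordenata is still the plain data frame shown above; fit3 is a hypothetical name:
library(ncf)
# (a) treat the coordinates as lon/lat and bin great-circle distances in km
fit2 <- correlog(coordenata$x, coordenata$y, coordenata$resid,
increment = 5, resamp = 100, latlon = TRUE, quiet = TRUE)
# (b) or keep planar distances but use an increment smaller than the extent
fit3 <- correlog(coordenata$x, coordenata$y, coordenata$resid,
increment = 0.05, resamp = 100, quiet = TRUE)
plot(fit2)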

glm giving different results in R 3.2.0 and R 3.2.2. Was there any substantial change in usage?

I use glm monthly to calculate a binomial model on the payment behaviour of a credit database, using a call like:
modelx = glm(paid ~ ., data = credit_db, family = binomial())
For the last month, I used R version 3.2.2 (just recently upgraded) and the results were very different from the previous month's (done with R version 3.2.0). In order to check the code, I repeated the previous month's calculations with version 3.2.2 and got different results from the previous calculation done in R 3.2.0.
The coefficients are also wildly different. At the beginning I use an exploratory model, with a variable that is the average number of delinquency days during the month, which should yield low coefficients for a low average. In version 3.2.0, an extract of summary(modelx) was:
## Coefficients: Estimate Std. Error z value
## delinquency_avg_days1 -0.59329 0.18581 -3.193
## delinquency_avg_days2 -1.32286 0.19830 -6.671
## delinquency_avg_days3 -1.47359 0.21986 -6.702
## delinquency_avg_days4 -1.64158 0.21653 -7.581
## delinquency_avg_days5 -2.56311 0.25234 -10.158
## delinquency_avg_days6 -2.59042 0.25886 -10.007
and for version 3.2.2
## Coefficients Estimate Std. Error z value
## delinquency_avg_days.L -1.320e+01 1.083e+03 -0.012
## delinquency_avg_days.Q -1.140e+00 1.169e+03 -0.001
## delinquency_avg_days.C 3.439e+00 1.118e+03 0.003
## delinquency_avg_days^4 8.454e+00 1.020e+03 0.008
## delinquency_avg_days^5 3.733e+00 9.362e+02 0.004
## delinquency_avg_days^6 -4.988e+00 9.348e+02 -0.005
The summary output is a little different, since the Pr(>|z|) column is shown. Notice that the coefficient names changed too.
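A hedged guess at what those names mean: .L, .Q, .C, ^4, ... are the labels R gives to polynomial contrasts, which are the default for ordered factors. If delinquency_avg_days was read as an ordered factor in one session but as an unordered factor (or numeric) in the other, the coefficients and their names would change exactly like this, independently of any change in glm itself. A toy illustration, not the asker's data:
set.seed(1)
x <- sample(0:5, 300, replace = TRUE)
y <- rbinom(300, 1, plogis(-0.3 * x))
coef(glm(y ~ factor(x), family = binomial()))   # names like factor(x)1, factor(x)2, ...
coef(glm(y ~ ordered(x), family = binomial()))  # names like .L, .Q, .C, ^4, ^5
The huge standard errors in the 3.2.2 output would also be consistent with treating the variable as a factor with empty or near-empty levels (e.g., 23-25 and 27-28 in the table below).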
In the dataset, the delinquency_avg_days variable has the following distribution (0 is "not paid", 1 is "paid"). As you can see, coefficients might be large for average days larger than 20 or so. The number of "paid" cases was sampled to closely match the number of "not paid".
0 1
0 140 663
1 59 209
2 62 118
3 56 87
4 66 50
5 69 41
6 64 40
7 78 30
8 75 31
9 70 29
10 77 23
11 69 18
12 79 17
13 61 13
14 53 5
15 67 18
16 50 10
17 40 9
18 39 8
19 23 9
20 24 2
21 36 9
22 35 1
23 17 0
24 11 0
25 11 0
26 7 1
27 3 0
28 0 0
29 0 1
30 1 0
In previous months, I used this exploratory model to create a second binomial model using ranges of average delinquency days, and this other model gives similar results with a few levels.
Now, I'd like to know whether there are substantial changes that require specifying other parameters, or whether there is an issue with glm in version 3.2.2.
