Solve systems of nonlinear equations in R / BlackScholesMerton Model - r

I am writing my Master's final project, in which I derive the probability of default using the Black-Scholes-Merton model. I am stuck on the R code. Mathematically, I want to solve this system of nonlinear equations with the package nleqslv:
library(nleqslv)
T <- 1
D1 <- 20010.75
R <- 0.8516
sigmaS <- .11
SO1 <- 1311.74
fnewton <- function(x){
y <- numeric(2)
d1 <- (log(x[1]/D1)+(R+x[2]^2/2)*T)/x[2]*sqrt(T)
d2 <- d1 - x[2]*sqrt(T)
y[1] <- SO1 - (x[1]*pnorm(d1) - exp(-R*T)*D1*pnorm(d2))
y[2] <- sigmaS*SO1 - pnorm(d1)*x[2]*x[1]
y
}
xstart <- c(1311.74,0.11)
nleqslv(xstart, fnewton, method="Broyden")
# $x
# [1] 1311.74 0.11
# $fvec
# [1] 1311.7400 144.2914
# $termcd
# [1] 6
# $message
# [1] "Jacobian is singular (see allowSingular option)"
# $scalex
# [1] 1 1
# $nfcnt
# [1] 0
# $njcnt
# [1] 1
# $iter
# [1] 1
I have tried this with many values of the 5 inputs (stated above, which I computed for 2 companies for different years), but I am not getting the final values of S0 and sigma V.
I get the message "Jacobian is singular (see allowSingular option)". If I allow a singular Jacobian using "control=list(trace=1,allowSingular=TRUE)", still no answer is displayed. I do not know how to obtain the solution for these 2 variables.
I really don't know what I am doing wrong, as I based my model on Tetereva's slides (her model code is on slide no. 5), whose presentation is the first result when googling
https://www.google.de/search?q=moodys+KMV+in+R&rlz=1C1SVED_enDE401DE401&aq=f&oq=moodys+KMV+in+R&aqs=chrome.0.57.13309j0&sourceid=chrome&ie=UTF-8#q=distance+to+default+in+R
Like me, but more successfully, she calculates the Distance to Default risk measure via the Black-Scholes-Merton approach. In this model, the value of equity (usually represented by the market capitalization, SO1) can be written as a European call option.
The other variables are:
x[1]: the variable I want to derive, value of total assets
x[2]: the variable I want to derive, volatility of total assets
D1: the book value of debt (1998-2009)
R: a riskfree interest rate
T: is set to 1 year (time)
sigmaS: estimated (historical) equity volatility
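For reference, the two equations that fnewton encodes can be written in the usual Merton notation. This is only a sketch: the symbols are assumptions that do not appear in the post, with V = x[1], sigma_V = x[2], E = SO1, sigma_E = sigmaS and D = D1. Note that the code computes d1 as (...)/x[2]*sqrt(T), which agrees with the formula below only because T is 1.

```latex
\begin{aligned}
E &= V\,N(d_1) - D\,e^{-RT}\,N(d_2),\\
\sigma_E\,E &= N(d_1)\,\sigma_V\,V,\\
d_1 &= \frac{\ln(V/D) + \left(R + \sigma_V^2/2\right)T}{\sigma_V\sqrt{T}},
\qquad d_2 = d_1 - \sigma_V\sqrt{T}.
\end{aligned}
```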

You should be able to use the initial values of SO1 and sigmaS as starting values for nleqslv.
First of all the R code given by Tetereva doesn't seem quite correct (the variable Z should be D1 as you have named it; similar changes for her S0 and D).
I have modified Tetereva's into this:
library(nleqslv)
T <- 1
D1 <- 33404048
R <- 2.32
sigmaS <- .02396919
SO1 <- 4740291 # Ve?
fnewton <- function(x){
y <- numeric(2)
d1 <- (log(x[1]/D1)+(R+x[2]^2/2)*T)/(x[2]*sqrt(T)) # divide by sigma*sqrt(T); same value as before while T is 1
d2 <- d1 - x[2]*sqrt(T)
y[1] <- SO1 - (x[1]*pnorm(d1) - exp(-R*T)*D1*pnorm(d2))
y[2] <- sigmaS*SO1 - pnorm(d1)*x[2]*x[1]
y
}
xstart <- c(SO1,sigmaS)
nleqslv(xstart, fnewton, method="Broyden",control=list(trace=1))
nleqslv(xstart, fnewton, method="Newton",control=list(trace=1))
which will give the solution given by Tetereva. (I use trace=1 here just to check the iteration steps.)
I believe the value you give for R should be 8.516 rather than 0.8516. Using your values for the parameters
T <- 1
D1 <- 20010.75
R <- 8.516 # modified
sigmaS <- .11
SO1 <- 1311.74
like this
xstart <- c(1311.74,0.11)
nleqslv(xstart, fnewton, method="Broyden")
nleqslv(xstart, fnewton, method="Newton")
With these values nleqslv converges very quickly.
With R <- 2.32 (as in Tetereva's example) nleqslv also converges, albeit with more iterations.
I cannot help you with what R should actually be but from Tetereva's presentation I assume R is in percentages. Since I don't have enough knowledge on the Black-Scholes model I can't be of any help for finding out what the correct values are for the various parameters. It's up to you.
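A quick base-R check may help explain why the original value of R stalls. It is only a sketch, assuming the same starting point c(SO1, sigmaS); the helper name d1_at_start is made up for illustration. At the start d1 is so negative that pnorm(d1) is numerically zero, so both equations are flat in x, which is consistent with the reported $fvec of 1311.74 and 144.2914 and with the singular Jacobian.

```r
# Parameter values copied from the question
T <- 1
D1 <- 20010.75
sigmaS <- 0.11
SO1 <- 1311.74

# d1 evaluated at the starting point x = c(SO1, sigmaS) for a given rate R
# (d1_at_start is a hypothetical helper, not part of nleqslv)
d1_at_start <- function(R, x = c(SO1, sigmaS)) {
  (log(x[1] / D1) + (R + x[2]^2 / 2) * T) / (x[2] * sqrt(T))
}

d1_small <- d1_at_start(0.8516)  # rate as given in the question
d1_large <- d1_at_start(8.516)   # rate as modified above

# pnorm(d1) multiplies x in both equations; if it is ~0 the system is flat
print(c(d1 = d1_small, pnorm_d1 = pnorm(d1_small)))
print(c(d1 = d1_large, pnorm_d1 = pnorm(d1_large)))
```

With the larger rate pnorm(d1) is essentially 1 at the start, so the solver has a usable slope from the first iteration.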

Related

Is it possible to solve nonlinear equations in R?

I want to build a function that takes E[x] and Var[X] and give me the mean and standard error of a univariate lognormal variable.
E[x] = exp(mu + theta)
Var[x] = exp(2*mu + theta)*(exp(theta) - 1)
The function would take E[x] and Var[x] as input and as output would give me theta and mu
There are several packages that provide ways and means to solve a system of nonlinear equations. One of these is nleqslv.
You need to provide a function that returns the differences between the actual value of the equations and the desired value.
Load package nleqslv and define the following function
library(nleqslv)
f <- function(x,Ex,Varx) {
y<- numeric(length(x))
mu <- x[1]
theta <- x[2]
y[1] <- exp(mu+theta) - Ex
y[2] <- exp(2*mu+theta)*(exp(theta)-1) - Varx
y
}
The vector x in the function contains the values of mu and theta.
An example with Ex=2 and Varx=3 and some random starting values
xstart <- c(1,1)
nleqslv(xstart,f,Ex=2,Varx=3)
gives the following
$x
[1] -0.6931472 1.3862944
$fvec
[1] -8.095125e-11 -8.111645e-11
$termcd
[1] 1
$message
[1] "Function criterion near zero"
$scalex
[1] 1 1
$nfcnt
[1] 31
$njcnt
[1] 2
$iter
[1] 22
See the manual of nleqslv for the meaning of the different elements of the return value of nleqslv.
If you want to investigate the effect of the different solving methods try this
testnslv(xstart,f,Ex=2,Varx=3)
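For these two particular equations, taken exactly as written in the question, a closed form also seems to exist: dividing Varx by Ex^2 gives 1 - exp(-theta). The sketch below uses a made-up helper name, solve_closed_form, and can serve as a cross-check on the nleqslv result.

```r
# Closed-form solution of the equations exactly as posted:
#   Ex   = exp(mu + theta)
#   Varx = exp(2*mu + theta) * (exp(theta) - 1)
# Varx / Ex^2 = 1 - exp(-theta), so theta can be isolated directly.
# (solve_closed_form is a hypothetical helper name)
solve_closed_form <- function(Ex, Varx) {
  # a real solution requires 0 <= Varx < Ex^2 under these equations
  stopifnot(Ex > 0, Varx >= 0, Varx < Ex^2)
  theta <- -log(1 - Varx / Ex^2)
  mu <- log(Ex) - theta
  c(mu = mu, theta = theta)
}

sol <- solve_closed_form(Ex = 2, Varx = 3)
print(sol)
```

For Ex=2 and Varx=3 this gives mu = -log(2) and theta = log(4), matching the $x values printed above.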

Cosine similarity (patient similarity metric) between 48k patients' data with predictive variables

I have to calculate cosine similarity (a patient similarity metric) in R between the data of 48k patients with some predictive variables. Here is the equation: PSM(P1,P2) = P1.P2 / (||P1|| ||P2||)
where P1 and P2 are the predictor vectors corresponding to two different patients. For example, P1 is the index patient and P2 is compared with the index patient (P1), and finally the pairwise patient similarity metric PSM(P1,P2) is calculated.
This process will go on for all 48k patients.
I have added a sample data set for 300 patients in a .csv file. Please find the sample data set here: https://1drv.ms/u/s!AhoddsPPvdj3hVTSbosv2KcPIx5a
First things first: You can find more rigorous treatments of cosine similarity at either of these posts:
Find cosine similarity between two arrays
Creating co-occurrence matrix
Now, you clearly have a mixture of data types in your input, at least
decimal
integer
categorical
I suspect that some of the integer values are Booleans or additional categoricals. Generally, it will be up to you to transform these into continuous numerical vectors if you want to use them as input into the similarity calculation. For example, what's the distance between admission types ELECTIVE and EMERGENCY? Is it a nominal or ordinal variable? I will only be modelling the columns that I trust to be numerical dependent variables.
Also, what have you done to ensure that some of your columns don't correlate with others? Using just a little awareness of data science and biomedical terminology, it seems likely that the following are all correlated:
diasbp_max, diasbp_min, meanbp_max, meanbp_min, sysbp_max and sysbp_min
I suggest going to a print shop and ordering a poster-size printout of psm_pairs.pdf. :-) Your eyes are better at detecting meaningful (but non-linear) dependencies between variables. Including multiple measurements of the same fundamental phenomenon may over-weight that phenomenon in your similarity calculation. Don't forget that you can derive variables like
diasbp_range <- diasbp_max - diasbp_min
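As a rough numerical complement to eyeballing a pair plot, a correlation screen can flag strongly linearly related columns. This is a minimal sketch on the built-in iris data as a stand-in for the patient file, and the 0.9 cutoff is an arbitrary assumption, not a recommendation.

```r
# Stand-in data: numeric columns of iris (not the patient data)
num <- iris[, 1:4]

# Pairwise Pearson correlations between columns
cors <- cor(num)

# Keep each pair once (upper triangle) where |r| exceeds an assumed cutoff
cutoff <- 0.9
idx <- which(upper.tri(cors) & abs(cors) > cutoff, arr.ind = TRUE)
high <- data.frame(
  var.A = rownames(cors)[idx[, "row"]],
  var.B = colnames(cors)[idx[, "col"]],
  r = cors[idx]
)
print(high)
```

This only catches linear relationships, so it does not replace the visual check.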
Now, I'm not especially good at linear algebra, so I'm importing a cosine similarity function from the lsa text analysis package. I'd love to see you write out the formula in your question as an R function. I would write it to compare one row to another, and use two nested apply loops to get all comparisons. Hopefully we'll get the same results!
After calculating the similarity, I try to find two different patients with the most dissimilar encounters.
Since you're working with a number of rows that's relatively large, you'll want to compare various algorithmic methodologies for efficiency. In addition, you could use SparkR/some other Hadoop solution on a cluster, or the parallel package on a single computer with multiple cores and lots of RAM. I have no idea whether the solution I provided is thread-safe.
Come to think of it, the transposition alone (as I implemented it) is likely to be computationally costly for a set of 1 million patient-encounters. Overall (if I remember my computational complexity correctly), the number of pairwise comparisons grows quadratically with the number of rows in your input, so performance will degrade quickly as the input grows.
library(lsa)
library(reshape2)
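psm_sample <- read.csv("psm_sample.csv", stringsAsFactors = TRUE) # the type check below expects factors; R >= 4.0 no longer creates them by default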
row.names(psm_sample) <-
make.names(paste0("patid.", as.character(psm_sample$subject_id)), unique = TRUE)
temp <- sapply(psm_sample, class)
temp <- cbind.data.frame(names(temp), as.character(temp))
names(temp) <- c("variable", "possible.type")
numeric.cols <- (temp$possible.type %in% c("factor", "integer") &
(!(grepl(
pattern = "_id$", x = temp$variable
))) &
(!(
grepl(pattern = "_code$", x = temp$variable)
)) &
(!(
grepl(pattern = "_type$", x = temp$variable)
))) | temp$possible.type == "numeric"
psm_numerics <- psm_sample[, numeric.cols]
row.names(psm_numerics) <- row.names(psm_sample)
psm_numerics$gender <- as.integer(psm_numerics$gender)
psm_scaled <- scale(psm_numerics)
pair.these.up <- psm_scaled
# checking for independence of variables
# if the following PDF pair plot is too big for your computer to open,
# try pair-plotting some random subset of columns
# keep.frac <- 0.5
# keep.flag <- runif(ncol(psm_scaled)) < keep.frac
# pair.these.up <- psm_scaled[, keep.flag]
# pdf device sizes are in inches
dev <-
pdf(
file = "psm_pairs.pdf",
width = 50,
height = 50,
paper = "special"
)
pairs(pair.these.up)
dev.off()
#transpose the dataframe to get the
#similarity between patients
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("enc.A", "enc.B", "similarity")
extract.pat <- function(enc.col) {
my.patients <-
sapply(enc.col, function(one.pat) {
temp <- (strsplit(as.character(one.pat), ".", fixed = TRUE))
return(temp[[1]][[2]])
})
return(my.patients)
}
cs.melt$pat.A <- extract.pat(cs.melt$enc.A)
cs.melt$pat.B <- extract.pat(cs.melt$enc.B)
same.pat <- cs.melt[cs.melt$pat.A == cs.melt$pat.B ,]
different.pat <- cs.melt[cs.melt$pat.A != cs.melt$pat.B ,]
most.dissimilar <-
different.pat[which.min(different.pat$similarity),]
dissimilar.pat.frame <- rbind(psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.A) ,],
psm_numerics[rownames(psm_numerics) ==
as.character(most.dissimilar$enc.B) ,])
print(t(dissimilar.pat.frame))
which gives
patid.68.49 patid.9
gender 1.00000 2.00000
age 41.85000 41.79000
sysbp_min 72.00000 106.00000
sysbp_max 95.00000 217.00000
diasbp_min 42.00000 53.00000
diasbp_max 61.00000 107.00000
meanbp_min 52.00000 67.00000
meanbp_max 72.00000 132.00000
resprate_min 20.00000 14.00000
resprate_max 35.00000 19.00000
tempc_min 36.00000 35.50000
tempc_max 37.55555 37.88889
spo2_min 90.00000 95.00000
spo2_max 100.00000 100.00000
bicarbonate_min 22.00000 26.00000
bicarbonate_max 22.00000 30.00000
creatinine_min 2.50000 1.20000
creatinine_max 2.50000 1.40000
glucose_min 82.00000 129.00000
glucose_max 82.00000 178.00000
hematocrit_min 28.10000 37.40000
hematocrit_max 28.10000 45.20000
potassium_min 5.50000 2.80000
potassium_max 5.50000 3.00000
sodium_min 138.00000 136.00000
sodium_max 138.00000 140.00000
bun_min 28.00000 16.00000
bun_max 28.00000 17.00000
wbc_min 2.50000 7.50000
wbc_max 2.50000 13.70000
mingcs 15.00000 15.00000
gcsmotor 6.00000 5.00000
gcsverbal 5.00000 0.00000
gcseyes 4.00000 1.00000
endotrachflag 0.00000 1.00000
urineoutput 1674.00000 887.00000
vasopressor 0.00000 0.00000
vent 0.00000 1.00000
los_hospital 19.09310 4.88130
los_icu 3.53680 5.32310
sofa 3.00000 5.00000
saps 17.00000 18.00000
posthospmort30day 1.00000 0.00000
Usually I wouldn't add a second answer, but that might be the best solution here. Don't worry about voting on it.
Here's the same algorithm as in my first answer, applied to the iris data set. Each row contains four spatial measurements of flowers from three different varieties of iris plants.
Below that you will find the iris analysis, written out as nested loops so you can see the equivalence. But that's not recommended for production with large data sets.
Please familiarize yourself with starting data and all of the intermediate dataframes:
The input iris data
psm_scaled (the spatial measurements, scaled to mean=0, SD=1)
cs (the matrix of pairwise similarities)
cs.melt (the pairwise similarities in long format)
At the end I have aggregated the mean similarities for all comparisons between one variety and another. You will see that comparisons between individuals of the same variety have mean similarities approaching 1, and comparisons between individuals of different varieties have lower mean similarities, approaching negative 1 for the most dissimilar pair (setosa vs. virginica).
library(lsa)
library(reshape2)
temp <- iris[, 1:4]
iris.names <- paste0(iris$Species, '.', rownames(iris))
psm_scaled <- scale(temp)
rownames(psm_scaled) <- iris.names
cs <- lsa::cosine(t(psm_scaled))
# this is super inefficient, because cs contains
# two identical triangular matrices
cs.melt <- melt(cs)
cs.melt <- as.data.frame(cs.melt)
names(cs.melt) <- c("flower.A", "flower.B", "similarity")
class.A <-
strsplit(as.character(cs.melt$flower.A), '.', fixed = TRUE)
cs.melt$class.A <- sapply(class.A, function(one.split) {
return(one.split[1])
})
class.B <-
strsplit(as.character(cs.melt$flower.B), '.', fixed = TRUE)
cs.melt$class.B <- sapply(class.B, function(one.split) {
return(one.split[1])
})
cs.melt$comparison <-
paste0(cs.melt$class.A , '_vs_', cs.melt$class.B)
cs.agg <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg[order(cs.agg$x),])
which gives
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
If you’re still not comfortable with performing lsa::cosine() on a scaled, numerical dataframe, we can certainly do explicit pairwise calculations.
The formula you gave for PSM, or cosine similarity of patients, is expressed in two formats at Wikipedia
Remembering that vectors A and B represent the ordered list of attributes for PatientA and PatientB, the PSM is the dot product of A and B, divided by (the scalar product of [the magnitude of A] and [the magnitude of B])
The terse way of saying that in R is
cosine.sim <- function(A, B) { A %*% B / sqrt(A %*% A * B %*% B) }
But we can rewrite that to look more similar to your post as
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
I guess you could even re-write that (the calculations of similarity between a single pair of individuals) as a bunch of nested loops, but in the case of a manageable amount of data, please don't. R is highly optimized for operations on vectors and matrices. If you're new to R, don't second guess it. By the way, what happened to your millions of rows? This will certainly be less stressful now that you're down to tens of thousands.
Anyway, let’s say that each individual only has two elements.
individual.1 <- c(1, 0)
individual.2 <- c(1, 1)
So you can think of individual.1 as a line that passes between the origin (0, 0) and (1, 0), and individual.2 as a line that passes between the origin and (1, 1).
some.data <- rbind.data.frame(individual.1, individual.2)
names(some.data) <- c('element.i', 'element.j')
rownames(some.data) <- c('individual.1', 'individual.2')
plot(some.data, xlim = c(-0.5, 2), ylim = c(-0.5, 2))
text(
some.data,
rownames(some.data),
xlim = c(-0.5, 2),
ylim = c(-0.5, 2),
adj = c(0, 0)
)
segments(0, 0, x1 = some.data[1, 1], y1 = some.data[1, 2])
segments(0, 0, x1 = some.data[2, 1], y1 = some.data[2, 2])
So what’s the angle between vector individual.1 and vector individual.2? You guessed it, 0.785 radians, or 45 degrees.
cosine.sim <- function(A, B) { A %*% B / (sqrt(A %*% A) * sqrt(B %*% B)) }
cos.sim.result <- cosine.sim(individual.1, individual.2)
angle.radians <- acos(cos.sim.result)
angle.degrees <- angle.radians * 180 / pi
print(angle.degrees)
# [,1]
# [1,] 45
Now we can use the cosine.sim function I previously defined, in two nested loops, to explicitly calculate the pairwise similarities between each of the iris flowers. Remember, psm_scaled has already been defined as the scaled numerical values from the iris dataset.
cs.melt <- lapply(rownames(psm_scaled), function(name.A) {
inner.loop.result <-
lapply(rownames(psm_scaled), function(name.B) {
individual.A <- psm_scaled[rownames(psm_scaled) == name.A, ]
individual.B <- psm_scaled[rownames(psm_scaled) == name.B, ]
similarity <- cosine.sim(individual.A, individual.B)
return(list(name.A, name.B, similarity))
})
inner.loop.result <-
do.call(rbind.data.frame, inner.loop.result)
names(inner.loop.result) <-
c('flower.A', 'flower.B', 'similarity')
return(inner.loop.result)
})
cs.melt <- do.call(rbind.data.frame, cs.melt)
Now we repeat the calculation of cs.melt$class.A, cs.melt$class.B, and cs.melt$comparison as above, and calculate cs.agg.from.loops as the mean similarity between the various types of comparisons:
cs.agg.from.loops <-
aggregate(cs.melt$similarity, by = list(cs.melt$comparison), mean)
print(cs.agg.from.loops[order(cs.agg.from.loops$x),])
# Group.1 x
# 3 setosa_vs_virginica -0.7945321
# 7 virginica_vs_setosa -0.7945321
# 2 setosa_vs_versicolor -0.4868352
# 4 versicolor_vs_setosa -0.4868352
# 6 versicolor_vs_virginica 0.3774612
# 8 virginica_vs_versicolor 0.3774612
# 5 versicolor_vs_versicolor 0.4134413
# 9 virginica_vs_virginica 0.7622797
# 1 setosa_vs_setosa 0.8698189
Which, I believe, is identical to the result we got with lsa::cosine.
So what I'm trying to say is... why wouldn't you use lsa::cosine?
Maybe you should be more concerned with
selection of variables, including removal of highly correlated variables
scaling/normalizing/standardizing the data
performance with a large input data set
identifying known similars and dissimilars for quality control
as previously addressed
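On the performance point, one base-R option (a sketch, not a benchmark, and not what lsa::cosine does internally as far as I know) is to scale each row to unit length and take a single matrix product, then keep only one triangle so each pair appears once. The function name cosine_matrix is made up for illustration.

```r
# All pairwise cosine similarities between the rows of a numeric matrix.
# (cosine_matrix is a hypothetical helper; rows of all zeros would give NaN)
cosine_matrix <- function(m) {
  norms <- sqrt(rowSums(m^2))  # Euclidean length of each row
  unit <- m / norms            # divides row i by norms[i]
  tcrossprod(unit)             # unit %*% t(unit)
}

psm_scaled <- scale(iris[, 1:4])
cs <- cosine_matrix(psm_scaled)

# Keep only the upper triangle: one record per unordered pair
idx <- which(upper.tri(cs), arr.ind = TRUE)
pairs_long <- data.frame(
  row.A = idx[, 1],
  row.B = idx[, 2],
  similarity = cs[idx]
)
print(head(pairs_long))
```

For n rows this still builds an n-by-n matrix, so 48k patients would need roughly 48000^2 doubles in memory; processing the rows in blocks is one way around that.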

Integrating in R

I am trying to compute in R. I have the following values.
nb <- 100
tb <- 25
ns <- 90
ts <- 15
A0 <- 1
S_norm <- 0.4
R <-tb/ts
y_meas <- (ns-nb/R)/A0
sigma_meas = sqrt(ns+(nb+1)/R^2)/A0
I am very confused on how I can integrate L(psi), say from -10 to 10. Because I am integrating with respect to log A.
You can substitute for logA and for a fixed value of psi you can integrate as follows:
psi <- 5
integrate(function(x) exp(-0.5*(((x/A0)/S_norm)^2 + ((psi-y_meas*A0/exp(x))/sigma_meas)^2)),
-10, 10)
# 0.1775989 with absolute error < 6.6e-05
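To evaluate the integral for several values of psi at once, the same integrand can be wrapped in a function. This is a sketch that reuses the values from the question; the names integrand, L_of_psi and psi_grid are made up for illustration.

```r
# Values from the question
nb <- 100; tb <- 25; ns <- 90; ts <- 15
A0 <- 1; S_norm <- 0.4
R <- tb / ts
y_meas <- (ns - nb / R) / A0
sigma_meas <- sqrt(ns + (nb + 1) / R^2) / A0

# Same integrand as above, with psi as an explicit argument; x stands for log A
integrand <- function(x, psi) {
  exp(-0.5 * (((x / A0) / S_norm)^2 +
              ((psi - y_meas * A0 / exp(x)) / sigma_meas)^2))
}

# Integrate over x for each psi in turn (integrate() handles one psi at a time)
L_of_psi <- function(psi, lower = -10, upper = 10) {
  sapply(psi, function(p) integrate(integrand, lower, upper, psi = p)$value)
}

psi_grid <- c(0, 5, 10)
print(L_of_psi(psi_grid))
```

The entry for psi = 5 should agree with the single call above.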
On top of an excellent answer by @SandipanDey: if you can extend the limits to -Infinity...+Infinity, there is a better way to integrate functions with an e^(-x^2) kernel: Gauss-Hermite quadrature, and there is an R package for that.
Simple example:
library(gaussquad)
n.quad <- 128 # integration order
# get the particular (weights,abscissas) as data frame
# with 2 observables and n.quad observations
rule <- ghermite.h.quadrature.rules(n.quad, mu = 0.0)[[n.quad]]
# test function - integrate 1 against the weight exp(-x^2) from -Inf to Inf
# should get sqrt(pi) as an answer
f <- function(x) {
1.0
}
q <- ghermite.h.quadrature(f, rule)
print(q - sqrt(pi))
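If the gaussquad package is not at hand, the same test integral can be cross-checked in base R, since integrate() accepts infinite limits. This is only a sanity check of the sqrt(pi) target and says nothing about which method is more accurate for the actual integrand.

```r
# Base-R cross-check: the weight function itself, integrated over the real line
gauss_kernel <- function(x) exp(-x^2)
res <- integrate(gauss_kernel, lower = -Inf, upper = Inf)

# Difference from the exact value sqrt(pi)
print(res$value - sqrt(pi))
```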

How to construct a sequence with a pattern in R

I would like to construct a sequence with length 50 of the following type:
X_{n+1} = 4*X_n*(1 - X_n). For your information, this is the Logistic Map for r = 4. In the case of the Logistic Map with parameter r = 4 and an initial state in (0,1), the attractor is also the interval (0,1) and the probability measure corresponds to the beta distribution with parameters a = 0.5 and b = 0.5. (The Logistic Map is a polynomial mapping (equivalently, recurrence relation) of degree 2, often cited as an archetypal example of how complex, chaotic behaviour can arise from very simple non-linear dynamical equations.) How can I do this in R?
There are some ready-to-use solutions on the net. I cite the general solution from mage's blog, where you can find a more detailed description.
logistic.map <- function(r, x, N, M){
  ## r: bifurcation parameter
  ## x: initial value
  ## N: number of iterations
  ## M: how far back from the last iterate to start returning values
  z <- numeric(N)
  z[1] <- x
  for(i in seq_len(N-1)){
    z[i+1] <- r * z[i] * (1 - z[i])
  }
  ## Return iterates N-M to N, i.e. the last M+1 values
  z[(N-M):N]
}
For the OP's example (this returns all 50 values, since M = 49 gives the last M+1 iterates):
logistic.map(4,0.2,50,49)
This isn't really an R question, is it? More basic programming. Anyway, you probably need an accumulator and a value to process.
values <- 0.2 ## this accumulates as a vector, starting with 0.2
xn <- values ## xn gets the first value
for (it in 2:50) { ## start the loop from the second iteration
xn <- 4L*xn*(1L-xn) ## perform the sequence function
values <- c(values, xn) ## add the new value to the vector
}
values
# [1] 0.2000000000 0.6400000000 0.9216000000 0.2890137600 0.8219392261 0.5854205387 0.9708133262 0.1133392473 0.4019738493 0.9615634951 0.1478365599 0.5039236459
# [13] 0.9999384200 0.0002463048 0.0009849765 0.0039360251 0.0156821314 0.0617448085 0.2317295484 0.7121238592 0.8200138734 0.5903644834 0.9673370405 0.1263843622
# [25] 0.4416454208 0.9863789723 0.0537419811 0.2034151221 0.6481496409 0.9122067356 0.3203424285 0.8708926280 0.4497546341 0.9899016128 0.0399856390 0.1535471506
# [37] 0.5198816927 0.9984188732 0.0063145074 0.0250985376 0.0978744041 0.3531800204 0.9137755744 0.3151590962 0.8633353611 0.4719496615 0.9968527140 0.0125495222
# [49] 0.0495681269 0.1884445109
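The same accumulation can also be written without growing a vector inside a loop, using Reduce with accumulate = TRUE. This is just an alternative sketch; logistic_step is a made-up name, and its second argument is ignored because Reduce only uses the index vector to count the steps.

```r
# One step of the logistic map with r = 4; 'unused' is the loop counter
logistic_step <- function(x, unused) 4 * x * (1 - x)

# Start at 0.2 and apply the step 49 times, keeping every intermediate value
values <- Reduce(logistic_step, seq_len(49), 0.2, accumulate = TRUE)

length(values)
head(values, 4)
```

Because the map is chaotic, later terms are very sensitive to floating-point details, so only the first few values are worth comparing by eye.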

Understand scores from bestScores function in DMwR package

I'm going through the code in Chapter 2 of Luis Torgo's Data Mining with R textbook, also found here:
http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR/code2.html.
I would like to understand exactly what the scores are in the bestScores function output. I think that they are the NMSE (normalized mean squared error), but I was under the impression that normalization means that these scores are between 0 and 1. Ostensibly, the lower the score the better, but I would like to make sure. Please note that the experimentalComparison function takes about 1-2 minutes to run.
if (require(rpart)==F) install.packages("rpart"); require(rpart)
if (require(DMwR)==F) install.packages("DMwR"); require(DMwR)
data(algae)
algae <- algae[-manyNAs(algae), ]
clean.algae <- knnImputation(algae, k = 10)
lm.a1 <- lm(a1 ~ .,data=clean.algae[,1:12])
rt.a1 <- rpart(a1 ~ .,data=algae[,1:12])
final.lm <- step(lm.a1)
lm.predictions.a1 <- predict(final.lm,clean.algae)
rt.predictions.a1 <- predict(rt.a1,algae)
cv.rpart <- function(form,train,test,...) {
m <- rpartXse(form,train,...)
p <- predict(m,test)
mse <- mean((p-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
cv.lm <- function(form,train,test,...) {
m <- lm(form,train,...)
p <- predict(m,test)
p <- ifelse(p < 0,0,p)
mse <- mean((p-resp(form,test))^2)
c(nmse=mse/mean((mean(resp(form,train))-resp(form,test))^2))
}
res <- experimentalComparison(
c(dataset(a1 ~ .,clean.algae[,1:12],'a1')),
c(variants('cv.lm'),
# 3 tree models each with a different
# complexity
variants('cv.rpart',se=c(0,0.5,1))),
# 3 times 10-fold cross-validation
# 1234 is seed
cvSettings(3,10,1234))
getVariant('cv.rpart.v1',res)
DSs <- sapply(names(clean.algae)[12:18],
function(x,names.attrs) {
f <- as.formula(paste(x,"~ ."))
# dataset is a class of objects that represent all necessary
# information on a predictive task
# dataset(formula, data, name)
dataset(f,clean.algae[,c(names.attrs,x)],x)
},
names(clean.algae)[1:11])
res.all <- experimentalComparison(
DSs,
c(variants('cv.lm'),
variants('cv.rpart',se=c(0,0.5,1))
),
cvSettings(5,10,1234))
bestScores(res.all)
Here is the output:
> bestScores(res.all)
$a1
system score
nmse cv.rpart.v1 0.64231
$a2
system score
nmse cv.rpart.v3 1
$a3
system score
nmse cv.rpart.v2 1
$a4
system score
nmse cv.rpart.v2 1
$a5
system score
nmse cv.lm.v1 0.9316803
$a6
system score
nmse cv.lm.v1 0.9359697
$a7
system score
nmse cv.rpart.v3 1.029505
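Based on this website, NMSE can be greater than 1. In the code above, the score is the model's MSE divided by the MSE of always predicting the training mean, so a score of 1 matches that baseline and a score above 1 is worse than it. The smaller the number, the better the model performs in space and time.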
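A toy calculation, with made-up numbers and the same ratio as in cv.lm and cv.rpart above, shows how the score can exceed 1. The function name nmse and all values here are illustrative assumptions, not DMwR output.

```r
# NMSE as in the question's code: model MSE over the MSE of the training mean
# (nmse is a hypothetical stand-alone helper)
nmse <- function(pred, actual, train_mean) {
  mean((pred - actual)^2) / mean((train_mean - actual)^2)
}

actual <- c(1, 2, 3)   # made-up test responses
train_mean <- 2        # made-up mean of the training responses

good <- nmse(c(1.1, 2, 2.9), actual, train_mean)           # close predictions
baseline <- nmse(rep(train_mean, 3), actual, train_mean)   # always the mean
bad <- nmse(c(3, 2, 1), actual, train_mean)                # worse than the mean

print(c(good = good, baseline = baseline, bad = bad))
```

So the score is normalized relative to a baseline, not squeezed into the interval from 0 to 1.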
