Caret using C5.0 method, how to plot the final tree

I am using caret's train() with method = "C5.0" and would like to see the finalModel plotted as a tree.
The resulting model is described as:
The final values used for the model were trials = 15, model = tree and winnow = FALSE.
When I try to plot the tree using plot or rattle's fancyRpartPlot, I get the errors below:
Using plot:
plot(diabetes.c50$finalModel,trials=15)
Error in plot(diabetes.c50$finalModel, trials = 15) :
object 'diabetes.c50' not found
Using rattle:
fancyRpartPlot(diabetes.C50$finalModel,trials=15)
Error in if (model$method == "class") { : argument is of length zero
The finalModel prints as:
> diabetes.C50$finalModel
Call:
C5.0.default(x = structure(c(6, 8, 0, 8, 4, 10, 10, 1, 5, 7, 1, 1, 3, 8, 7, 9, 11, 10, 7, 1, 13, 5, 5, 3, 6, 4, 11, 9, 4, 3, 9, 7, 0,
"outcome", seed = 2187L), .Names = c("subset", "bands", "winnow", "noGlobalPruning", "CF", "minCases", "fuzzyThreshold",
"sample", "earlyStopping", "label", "seed")), verbose = FALSE)
Classification Tree
Number of samples: 538
Number of predictors: 8
Number of boosting iterations: 15
Average tree size: 12.9
Non-standard options: attempt to group attributes

The data structure representing a C5.0 tree is different from that representing an rpart tree. Rattle's fancyRpartPlot() assumes an rpart tree, hence the error (recent versions of rattle check the model class and explain this rather than failing with the indecipherable message above).
Your first error, though, looks like a typo, and the error message is self-explanatory: you meant diabetes.C50$finalModel rather than diabetes.c50$finalModel (capital C50 rather than lower-case c50).
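If you want to plot the C5.0 tree itself, the C50 package provides its own plot() method for C5.0 objects. Here is a sketch (assuming the fitted object from the question, and a model built as a tree rather than rules); for a boosted model, the trial argument selects a single boosting iteration:
library(C50)
# C50's plot method draws one tree per boosting iteration;
# trial = 0 (the default) shows the tree from the first iteration
plot(diabetes.C50$finalModel, trial = 0)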

variogram function in R returns one single observation

I'm trying to construct a variogram cloud in R using the variogram function from the gstat package. I'm not sure if there's something about the topic that I've misunderstood, but surely I should get more than one observation, right? Here's my code:
library(dplyr)  # provides rename(new = old) as used below
library(sp)     # provides coordinates()<-
library(gstat)  # provides variogram()
data = data.frame(matrix(c(2, 4, 8, 7, 6, 4, 7, 9, 4, 4, -1.01, .05, .47, 1.36, 1.18), nrow=5, ncol=3))
data = rename(data, X=X1, Y=X2, Z=X3)
coordinates(data) = c("X","Y")
var.cld = variogram(Z ~ 1, data=data, cloud = TRUE)
And here's the output:
> var.cld
dist gamma dir.hor dir.ver id left right
1 1 0.0162 0 0 var1 5 4
I found the problem! Apparently the default value of the cutoff argument was too low for my specific set of data. Specifying a higher value resulted in additional observations.
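As a sketch of the fix, using the code above (the value 10 is an assumption, chosen simply to exceed the largest pairwise distance in this toy data; gstat's default cutoff is one third of the diagonal of the bounding box):
# raise the cutoff so every point pair enters the variogram cloud
var.cld = variogram(Z ~ 1, data=data, cloud = TRUE, cutoff = 10)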

STAN IRT via R programming, issue with parameter declaration?

I'm following along with this official IRT w/ STAN tutorial. The details of the model are copied below:
data {
  int<lower=1> J;              // number of students
  int<lower=1> K;              // number of questions
  int<lower=1> N;              // number of observations
  int<lower=1,upper=J> jj[N];  // student for observation n
  int<lower=1,upper=K> kk[N];  // question for observation n
  int<lower=0,upper=1> y[N];   // correctness for observation n
}
parameters {
  real delta;     // mean student ability
  real alpha[J];  // ability of student j - mean ability
  real beta[K];   // difficulty of question k
}
model {
  alpha ~ std_normal();     // informative true prior
  beta ~ std_normal();      // informative true prior
  delta ~ normal(0.75, 1);  // informative true prior
  for (n in 1:N)
    y[n] ~ bernoulli_logit(alpha[jj[n]] - beta[kk[n]] + delta);
}
I'm not certain which variables do and do not need to be declared in R code.
toy_data <- list(
  J = 5,
  K = 4,
  N = 20,
  y = c(1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)
fit <- stan(file = '1PL_stan.stan', data = toy_data)
However, the following error is triggered.
Error in mod$fit_ptr() :
Exception: variable does not exist; processing stage=data initialization; variable name=jj; base type=int (in 'model920c4330dff_1PL_stan' at line 5)
In addition: Warning messages:
1: In readLines(file, warn = TRUE) :
incomplete final line found on 'C:\Users\jacob.moore\Downloads\1PL_stan.stan'
2: In system(paste(CXX, ARGS), ignore.stdout = TRUE, ignore.stderr = TRUE) :
'-E' not found
failed to create the sampler; sampling not done
In my past work, I've used Python almost exclusively, so learning R has been quite the learning curve; additionally, I'm very new to Stan, hence the toy example.
The core idea is that there are 20 child/question pairings. 5 children and 4 different questions. I'm uncertain why my code is triggering the error, and what I should do to correct it. Can you clarify what needs adjustment for this code to run without triggering an error?
Every variable declared in the data block (J, K, N, jj, kk, and y) needs to be included in the variable toy_data. You've left out jj and kk.
You have 5 students (J=5) answering 4 questions each (K=4). jj is the student ID, and kk is the question ID, so assuming your responses are ordered by student and then by question, you would have something like
jj = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)
kk = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4)
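Putting it all together, the complete data list becomes (a sketch assembled from the values in the question):
toy_data <- list(
  J = 5,
  K = 4,
  N = 20,
  jj = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5),  # student for each observation
  kk = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),  # question for each observation
  y  = c(1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0)
)
fit <- stan(file = '1PL_stan.stan', data = toy_data)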

Some "train" columns aren't present in "test"

Hello everyone.
I have a problem: I have to perform a kNN classification in R using leave-one-out (LOO) cross-validation. I've found the packages "knncat" and "loo" for this, and I've written the following code (without LOO):
library(knncat)
x <- c(1, 2, 3, 4)
y <- c(5, 6, 7, 8)
train <- data.frame(x, y)
x1 <- c(9, 10, 11, 12)
y1 <- c(13, 14, 15, 16)
test <- data.frame(x1, y1)
answer <- knncat(train, test, classcol = 2)
And I've got the error "Some "train" columns aren't present in "test"". I don't understand what I'm doing wrong. How can I fix this error?
If something's wrong with my English, sorry, I'm from Russia :)
Well, there are some problems with your approach and with knncat:
You have to specify class labels for both the train and test data sets and set classcol accordingly.
Only class labels which appear in train may be present in test.
The column names of train and test must be the same, or knncat will throw the error you've mentioned: "Some "train" columns aren't present in "test"".
Moreover, if you are using integer values as class labels, they have to start from zero, or knncat will throw an error: "Number in class 0 is 0! Abort!" (illustrated just below).
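For instance (my own toy data, with labels 1 and 2 rather than 0 and 1):
# integer class labels that do not start at zero make knncat abort
train_bad <- data.frame(x1=1:4, x2=5:8, y=c(1, 1, 2, 2))
test_bad <- data.frame(x1=9:12, x2=13:16, y=c(2, 1, 1, 2))
# knncat(train_bad, test_bad, classcol=3)   # "Number in class 0 is 0! Abort!"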
Here is a working example:
train <- data.frame(x1=1:4, x2=5:8, y=c(0, 0, 1, 1))
test <- data.frame(x1=9:12, x2=13:16, y=c(1, 0, 0, 1))
knncat(train, test, classcol = 3)
With the result:
Test set misclass rate: 50%

R's equivalent of numpy.linalg.lstsq

I have multiple linear regressions of the form vc = x1 * va + x2 * vb.
(Note: the first example below is overly minimal - all three vectors are identical, which leads to warnings in R. A second data set illustrating my issue follows further down.)
In Python, I wrote:
#!/usr/bin/env python3
import numpy as np
va = np.array([1, 2, 3, 4, 5])
vb = np.array([1, 2, 3, 4, 5])
vc = np.array([1, 2, 3, 4, 5])
A = np.vstack([va, vb]).T
print(A)
result = np.linalg.lstsq(A, vc)
print(result)
Output:
(array([ 0.5, 0.5]), array([], dtype=float64), 1, array([ 1.04880885e+01, 3.14018492e-16]))
I thought the following R code would be equivalent:
#!/usr/bin/Rscript
va <- c(1, 2, 3, 4, 5)
vb <- c(1, 2, 3, 4, 5)
vc <- c(1, 2, 3, 4, 5)
reg <- lm(vc ~ va + vb)
reg
summary(reg)
However, I get the following output (excerpt):
Coefficients:
A1 A2
1 NA
Residual standard error: 7.022e-16 on 4 degrees of freedom
In summary.lm(reg) : essentially perfect fit: summary may be unreliable
Even if I adjust the numbers, R still keeps complaining.
I assume I am doing something basic wrong, but I can't figure out what. I also tried to construct a matrix A (containing va and vb as columns) and then use reg <- lm(vc ~ 0 + A). There, I get 3 degrees of freedom, but the same coefficients.
2nd data set
va = np.array([1, 2, 3, 4, 5])
vb = np.array([2, 2, 2, 2, 2])
vc = np.array([3.1, 3.2, 3.3, 3.4, 3.5])
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
If I add 0 + (which results in lm(vc ~ 0 + va + vb)), I get 3 degrees of freedom and the same result. Looks good.
The 0 + removes the "implied intercept term" (whatever that is).
The problem is that you have a singular fit, and multiple combinations of coefficients represent it equally well. IMHO, both numpy and R should really throw an error in this case by default. You can get R to give you an error by adding singular.ok = FALSE to your arguments. Additionally, although your intercept in this case is zero, your regression equation indicates that you're not looking to fit one. To fit a linear model without an intercept in R, use a formula of the form:
lm(vc ~ va + vb - 1)
So, to (properly) get an error on this singular fit, you would call:
reg <- lm(vc ~ va + vb - 1, singular.ok = FALSE)
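If you want a direct analogue of numpy.linalg.lstsq that operates on a design matrix rather than a formula, base R offers least-squares solvers as well. A sketch using the second data set (qr.solve and lsfit are base R; no intercept column is added, matching the Python code):
va <- c(1, 2, 3, 4, 5)
vb <- c(2, 2, 2, 2, 2)
vc <- c(3.1, 3.2, 3.3, 3.4, 3.5)
A <- cbind(va, vb)  # same design matrix as np.vstack([va, vb]).T
qr.solve(A, vc)     # least-squares solution via QR decomposition
lsfit(A, vc, intercept = FALSE)$coefficients  # equivalent, via lsfit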

Inverted ED95 and ED5 in drc package

My aim is to determine the distance between the site of injection of a treatment and the target of this treatment that gives a 0.95 probability of success.
The outcome variable is binary (success: 1 / failure: 0).
I used the Dixon up-and-down methodology with six distances tested: 0, 2, 4, 6, 8 and 10 mm.
Here are my data:
column 1: distances used
column 2: number of successes
column 3: total number of patients
data <- data.frame(1:6,1:6,1:6)
data[,1] <- c(0, 2, 4, 6, 8, 10)
data[,2] <- c(2, 12, 3, 2, 1, 0)
data[,3] <- c(2, 12, 15, 8, 4, 1)
names(data) <- c("Distance", "Success", "Total")
I built a model with the drc package (version 2.3-96) under R 3.1.2 on Windows Vista:
library(drc)
model <- drm(Success/Total ~ Distance, weights = Total,
             data = data, fct = LL.2(), type = "binomial")
summary(model)
plot(model, bp = .5, legend = FALSE,
     xlab = "Distance", ylab = "Probability of success",
     lwd = 2, cex = 1.2, cex.axis = 1.2, cex.lab = 1.2, log = "")
All seems to be OK, but when it comes to estimating the ED95 (effective dose 95: the distance required for a 0.95 probability of success), I think the ED95 has been swapped with the ED5 (effective dose 5: the distance required for a 0.05 probability of success):
ED(model, 95, interval="delta")
ED(model, 5, interval="delta")
ED95 : 8.0780 SE: 2.0723 CI 95 % (4.0165 ; 12.139)
ED5 : 1.58440 SE: 0.46413 CI 95 % (0.67472 ; 2.4941)
ED values in the drc package are by default calculated relative to the control level. In our case, we are looking for ED values calculated relative to the upper limit.
So we must change the reference value from "control" (the default) to "upper":
ED(model, 95, interval="delta", reference = "upper")
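Applying the same reference level to the complementary call from the question gives, as a sketch:
ED(model, 5, interval = "delta", reference = "upper")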
Many thanks to Christian Ritz
