I have just begun learning to code in R, and I tried to run a classification with C5.0. But I ran into some problems I don't understand, and I would be grateful for any help. Below is code I learned from someone else and adapted to run on my own data:
require(C50)

# convert the response to a factor once, before the loop
needdata$class <- as.factor(needdata$class)

data.resultc50 <- c()
prematrixc50 <- c()
for (i in 3863:3993) {
  # note: this refits the same model on the same training rows every pass
  trainc50 <- C5.0(class ~ ., data = needdata[1:3612, ], trials = 5,
                   control = C5.0Control(noGlobalPruning = TRUE, CF = 0.25))
  predc50 <- predict(trainc50, newdata = testdata[i, -1], trials = 5, type = "class")
  data.resultc50[i - 3862] <- sum(predc50 == testdata$class[i]) / length(predc50)
  prematrixc50[i - 3862] <- as.character(predc50)
}
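As an aside, predict() in C50 is vectorized, so the model can be fitted once and all 131 held-out rows predicted in a single call. This is only a sketch under the assumption that needdata and testdata are as shown below; it is not a drop-in replacement for the per-row bookkeeping above:

```r
require(C50)

needdata$class <- as.factor(needdata$class)

# fit once, then predict the whole held-out block in one call
trainc50 <- C5.0(class ~ ., data = needdata[1:3612, ], trials = 5,
                 control = C5.0Control(noGlobalPruning = TRUE, CF = 0.25))
pred <- predict(trainc50, newdata = testdata[3863:3993, -1], type = "class")
accuracy <- mean(pred == testdata$class[3863:3993])
```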
Below are the two objects, needdata and testdata, used in the code above, with part of their heads shown:
class Volume MA20 MA10 MA120 MA40 MA340 MA24 BIAS10
1 1 2800 8032.00 8190.9 7801.867 7902.325 7367.976 1751 7.96
2 1 2854 8071.40 8290.3 7812.225 7936.550 7373.624 1766 6.27
3 0 2501 8117.45 8389.3 7824.350 7973.250 7379.444 1811 5.49
4 1 2409 8165.40 8488.1 7835.600 8007.900 7385.294 1825 4.02
# the above is "needdata" and actually has 15 variables with 3862 obs.
class Volume MA20 MA10 MA120 MA40 MA340 MA24 BIAS10
1 1 2800 8032.00 8190.9 7801.867 7902.325 7367.976 1751 7.96
2 1 2854 8071.40 8290.3 7812.225 7936.550 7373.624 1766 6.27
3 0 2501 8117.45 8389.3 7824.350 7973.250 7379.444 1811 5.49
4 1 2409 8165.40 8488.1 7835.600 8007.900 7385.294 1825 4.02
# the above is "testdata" and has 15 variables with 4112 obs.
Both data sets contain the factor class with values 0 and 1. After running the code I got the warning below:
In predict.C5.0(trainc50, newdata = testdata[i, -1], trials = 5, ... : 'trials' should be <= 1 for this object. Predictions generated
using 1 trials
And when I look at the trainc50 object just created, I notice the number of boosting iterations is 1 due to early stopping, as shown below:
# trainc50
Call:
C5.0.formula(formula = class ~ ., data = needdata[1:3612, ],
trials = 5, control = C5.0Control(noGlobalPruning = TRUE,
CF = 0.25), earlyStopping = FALSE)
Classification Tree
Number of samples: 3612
Number of predictors: 15
Number of boosting iterations: 5 requested; 1 used due to early stopping
Non-standard options: attempt to group attributes, no global pruning
I also tried to plot the decision tree, and I got the error below:
plot(trainc50)
Error in if (!n.cat[i]) { : argument is of length zero
In addition: Warning message:
In 1:which(out == "Decision tree:") : numerical expression has 2 elements: only the first used
Does that mean my code is too broken for C5.0 to perform further boosting trials? What is wrong? Can someone please explain why I encounter early stopping, and what the error and warning messages mean? How can I fix them? I would be very thankful for any help.
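One thing worth checking (an assumption about intent, not a confirmed fix): in the C50 package, early stopping of boosting is controlled by the earlyStopping option of C5.0Control(), not by an earlyStopping argument to C5.0() itself; the call echoed in the output above passes it outside the control object, where it is silently ignored. A sketch with the option moved inside C5.0Control():

```r
require(C50)

# earlyStopping belongs inside C5.0Control(), not in the C5.0() call
trainc50 <- C5.0(class ~ ., data = needdata[1:3612, ], trials = 5,
                 control = C5.0Control(noGlobalPruning = TRUE, CF = 0.25,
                                       earlyStopping = FALSE))
```

With earlyStopping = FALSE all five requested trials should be kept, and predict(..., trials = 5) should no longer warn.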
One plotting workaround is described in
http://r-project-thanos.blogspot.tw/2014/09/plot-c50-decision-trees-in-r.html
and uses the function
C5.0.graphviz(firandomf,
"a.txt",
fontname='Arial',
col.draw='black',
col.font='blue',
col.conclusion='lightpink',
col.question='grey78',
shape.conclusion='box3d',
shape.question='diamond',
bool.substitute=c('None', 'yesno', 'truefalse', 'TF'),
prefix=FALSE,
vertical=TRUE)
And on the command line (the dot tool comes with a system Graphviz installation, e.g. from your OS package manager; pip install graphviz only installs the Python bindings):
dot -Tpng ~/plot/a.txt > ~/plot/a.png
I'm using the data from data("TravelMode", package = "AER") and trying to follow the Heiss (2002) paper.
This is what my code looked like initially:
(figure: diagram of the nested structure)
library(mlogit)
data("TravelMode", package = "AER")
TravelMode_frame <- mlogit.data(TravelMode, choice = "choice", shape = "long",
                                chid.var = "individual", alt.var = "mode")
ml_TM <- mlogit(choice ~ travel | income, data = TravelMode_frame,
                nests = list(public = c("train", "bus"), car = "car", air = "air"),
                un.nest.el = FALSE, unscaled = TRUE)
Then I want to separate the travel time variable between air and the other three modes, as in the picture below, so I wrote:
air <- idx(TravelMode_frame, 2) %in% c('air')
TravelMode_frame$travel_air <- 0
TravelMode_frame$travel_air[air] <- TravelMode_frame$travel[air]
TravelMode_frame$travel[TravelMode_frame$alt == "air"] <- 0  # numeric 0, not the string "0", or the column is coerced to character
Then my data looks like this:
individual mode choice wait vcost travel gcost income size idx travel_air
1 1 air FALSE 69 59 0 70 35 1 1:air 100
2 1 train FALSE 34 31 372 71 35 1 1:train 0
3 1 bus FALSE 35 25 417 70 35 1 1:bus 0
4 1 car TRUE 0 10 180 30 35 1 1:car 0
~~~ indexes ~~~~
chid alt
1 1 air
2 1 train
3 1 bus
4 1 car
But when I estimate it with
ml_TM <- mlogit(choice ~ travel + travel_air | income, data = TravelMode_frame,
                nests = list(public = c("train", "bus"), car = "car", air = "air"),
                un.nest.el = FALSE, unscaled = TRUE)
it says Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number = 2.32747e-23
I have no idea why this happens. Could someone please help me?
I tried cutting the variables out of the formula one by one, and it didn't help.
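In case it helps the debugging, a hypothetical collinearity check (assuming the TravelMode_frame object built above, with travel kept numeric): if the rank of the design matrix is smaller than its number of columns, some regressors are linearly dependent, which is a common cause of a singular Hessian.

```r
# build a design matrix from the same regressors and inspect its rank
X <- model.matrix(~ travel + travel_air + income, data = TravelMode_frame)
c(rank = qr(X)$rank, cols = ncol(X))  # rank < cols indicates linear dependence
```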
PS: I rolled back to the data from before I created travel_air and tried making travel time an alternative-specific variable by
ml_TM <- mlogit(choice ~ 0 | income | travel, data = TravelMode_frame,
                nests = list(public = c("train", "bus"), car = "car", air = "air"),
                un.nest.el = FALSE, unscaled = TRUE)
and it can't be computed either (Error in solve.default(crossprod(attr(x, "gradi")[, !fixed])) : system is computationally singular: reciprocal condition number = 1.39039e-20).
I think I get the idea behind this a little bit now; please tell me if I'm wrong. I think my mistakes were these.
First, I forgot to rescale income and travel time, so I need to add:
TravelMode$travel <- TravelMode$travel/60+TravelMode$wait/60
TravelMode$income <- TravelMode$income/10
About the first question, this one:
ml_TM <- mlogit(choice ~ travel + travel_air | income, data = TravelMode_frame,
                nests = list(public = c("train", "bus"), car = "car", air = "air"),
                un.nest.el = FALSE, unscaled = TRUE)
My nesting structure has degenerate nests, so the IV parameter no longer acts as a dissimilarity parameter but instead scales the variables, as in models J and L in the table in Heiss (2002). And because I tried to make it estimate two variables at once, the error may arise because the IV parameter would have to scale both variables simultaneously.
For this problem:
ml_TM <- mlogit(choice ~ 0 | income | travel, data = TravelMode_frame,
                nests = list(public = c("train", "bus"), car = "car", air = "air"),
                un.nest.el = FALSE, unscaled = TRUE)
it is like the case above, corresponding to model L in the table.
I am trying to fit an nls model to pavement data. I have different sections, and I need to fit the same model to each section (with different data for each section, of course).
Following is my code:
data <- read.csv(file.choose(), header = TRUE)

SecNum <- data[, 1]
t <- c()
PCI <- c()

for (i in 1:length(SecNum)) {
  # accumulate the age and condition values for the current section
  t <- c(t, data[i, 2])
  PCI <- c(PCI, data[i, 3])
  # fit once we reach the last row of a section (the original compared
  # SecNum[i] to SecNum[i+1], which runs past the end on the final row)
  if (i == length(SecNum) || SecNum[i] != SecNum[i + 1]) {
    fit <- nls(PCI ~ alpha - beta * exp(-theta * (t^gama)),
               start = c(alpha = min(PCI), beta = 1, theta = 1, gama = 1))
    print(fitted(fit))
    print(resid(fit))
    print(t)
    print(PCI)
    t <- c()
    PCI <- c()
  }
}
After running it I received an error like this:
Error in nls(PCI ~ alpha - beta * exp((-theta * (t^gama))), start = c(alpha = min(c(PCI, :
singular gradient
My professor told me to use nlsList instead, and said that might solve the problem.
Since I am new to R, I don't know how to do that. I would be really thankful if anyone could advise me on how to do it.
Here is a sample of my data:
SecNum t PCI AADT ESAL
1 962 1 90.46 131333 3028352
2 962 2 90.01 139682 3213995
3 962 3 86.88 137353 2205859
4 962 4 86.36 137353 2205859
5 962 5 84.56 137353 2205859
6 962 6 85.11 137353 2205859
7 963 1 91.33 91600 3726288
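For reference, a hedged sketch of what the nlsList approach could look like (nlsList lives in the nlme package; the `| SecNum` grouping fits one model per section, and the start values here are guesses based on the sample above, not tested against the full data):

```r
library(nlme)

# one nls fit per pavement section, all with the same model form
fits <- nlsList(PCI ~ alpha - beta * exp(-theta * (t^gama)) | SecNum,
                data = data,
                start = c(alpha = 90, beta = 1, theta = 1, gama = 1))

coef(fits)  # one row of (alpha, beta, theta, gama) per SecNum
```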
I created a decision tree in rattle for the built-in wine dataset.
The output is shown below
Summary of the Decision Tree model for Classification (built using 'rpart'):
library(rpart)
library(rattle)
n= 124
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 124 73 2 (0.30645161 0.41129032 0.28225806)
2) Proline>=953.5 33 0 1 (1.00000000 0.00000000 0.00000000) *
3) Proline< 953.5 91 40 2 (0.05494505 0.56043956 0.38461538)
6) Intensity< 3.825 44 0 2 (0.00000000 1.00000000 0.00000000) *
7) Intensity>=3.825 47 12 3 (0.10638298 0.14893617 0.74468085)
14) Flavanoids>=1.385 13 6 2 (0.38461538 0.53846154 0.07692308) *
15) Flavanoids< 1.385 34 0 3 (0.00000000 0.00000000 1.00000000) *
Classification tree:
rpart(formula = Class ~ ., data = crs$dataset[crs$train, c(crs$input,
crs$target)], method = "class", parms = list(split = "information"),
control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
Variables actually used in tree construction:
[1] Flavanoids Intensity Proline
Root node error: 73/124 = 0.58871
n= 124
CP nsplit rel error xerror xstd
1 0.452055 0 1.000000 1.00000 0.075061
2 0.383562 1 0.547945 0.52055 0.070325
3 0.082192 2 0.164384 0.26027 0.054946
4 0.010000 3 0.082192 0.21918 0.051137
Time taken: 0.02 secs
The rules are listed below
Tree as rules:
Rule number: 2 [Class=1 cover=33 (27%) prob=1.00]
Proline>=953.5
Rule number: 14 [Class=2 cover=13 (10%) prob=0.38]
Proline< 953.5
Intensity>=3.825
Flavanoids>=1.385
Rule number: 15 [Class=3 cover=34 (27%) prob=0.00]
Proline< 953.5
Intensity>=3.825
Flavanoids< 1.385
Rule number: 6 [Class=2 cover=44 (35%) prob=0.00]
Proline< 953.5
Intensity< 3.825
[1] 2 6 1 5 3 7 4
The output plot I got is shown below.
When I try to plot the tree, only the outline of the nodes is shown; nothing else is drawn. I tried different datasets, and all show the same result. All plots other than the decision tree work perfectly fine.
How can I resolve this? Is it related to a package problem?
Resolved it: I removed the following packages and reinstalled them.
remove.packages('rpart.plot')
remove.packages('rpart')
remove.packages('rattle')
remove.packages('RColorBrewer')
Install them again from the console (this will also install all the dependencies):
install.packages('rattle')
install.packages('rpart')
install.packages('rpart.plot')
install.packages('RColorBrewer')
Now reload the packages, and the problem is resolved:
library(rattle)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
I do not have sufficient information to answer your question, so I am not sure what is going wrong in your case. But please find below an example that produces a tree using fancyRpartPlot.
model <- rpart(Type ~ Flavanoids + Proline + Hue, data = wine)
fancyRpartPlot(model)
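If fancyRpartPlot still draws empty nodes, rpart.plot from the rpart.plot package is a lighter-weight alternative worth trying (this assumes the same wine data and Type response as above):

```r
library(rpart)
library(rpart.plot)
library(rattle)  # for the wine data

model <- rpart(Type ~ Flavanoids + Proline + Hue, data = wine)
rpart.plot(model, type = 2, extra = 104)  # class, per-class probabilities and percentages
```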
I am attempting to carry out lasso regression using the lars package but cannot seem to get the lars call to work. I have entered this code:
diabetes<-read.table("diabetes.txt", header=TRUE)
diabetes
library(lars)
diabetes.lasso = lars(diabetes$x, diabetes$y, type = "lasso")
However, I get an error message of :
Error in rep(1, n) : invalid 'times' argument.
I have also tried entering it like this:
diabetes<-read.table("diabetes.txt", header=TRUE)
library(lars)
data(diabetes)
diabetes.lasso = lars(age+sex+bmi+map+td+ldl+hdl+tch+ltg+glu, y, type = "lasso")
But then I get the error message:
'Error in lars(age+sex + bmi + map + td + ldl + hdl + tch + ltg + glu, y, type = "lasso") :
object 'age' not found'
Where am I going wrong?
EDIT: The data is as below, with another 5 columns not shown.
ldl hdl tch ltg glu
1 -0.034820763 -0.043400846 -0.002592262 0.019908421 -0.017646125
2 -0.019163340 0.074411564 -0.039493383 -0.068329744 -0.092204050
3 -0.034194466 -0.032355932 -0.002592262 0.002863771 -0.025930339
4 0.024990593 -0.036037570 0.034308859 0.022692023 -0.009361911
5 0.015596140 0.008142084 -0.002592262 -0.031991445 -0.046640874
I think some of the confusion may have to do with the fact that the diabetes data set that comes with the lars package has an unusual structure.
library(lars)
data(diabetes)
sapply(diabetes,class)
## x y x2
## "AsIs" "numeric" "AsIs"
sapply(diabetes,dim)
## $x
## [1] 442 10
##
## $y
## NULL
##
## $x2
## [1] 442 64
In other words, diabetes is a data frame containing "columns" which are themselves matrices. In this case, with(diabetes,lars(x,y,type="lasso")) or lars(diabetes$x,diabetes$y,type="lasso") work fine. (But just lars(x,y,type="lasso") won't, because R doesn't know to look for the x and y variables within the diabetes data frame.)
However, if you are reading in your own data, you'll have to separate the response variable and the predictor matrix yourself, something like this (note the comma goes before the column selector, so we drop the y column rather than subsetting rows):
X <- as.matrix(mydiabetes[, names(mydiabetes) != "y"])
mydiabetes.lasso <- lars(X, mydiabetes$y, type = "lasso")
Or you might be able to use model.matrix, dropping the intercept column it adds (lars expects a predictor matrix without a constant column):
X <- model.matrix(y ~ ., data = mydiabetes)[, -1]
lars::lars does not have a formula interface, which means you cannot use a formula to specify the columns (and furthermore it does not accept a "data=" argument). For more background on this and other "data mining" topics, you might want to get a copy of the classic text "The Elements of Statistical Learning". Try this:
# this obviously assumes require(lars) and data(diabetes) have been executed.
> diabetes.lasso = with( diabetes, lars(x, y, type = "lasso"))
> summary(diabetes.lasso)
LARS/LASSO
Call: lars(x = x, y = y, type = "lasso")
Df Rss Cp
0 1 2621009 453.7263
1 2 2510465 418.0322
2 3 1700369 143.8012
3 4 1527165 86.7411
4 5 1365734 33.6957
5 6 1324118 21.5052
6 7 1308932 18.3270
7 8 1275355 8.8775
8 9 1270233 9.1311
9 10 1269390 10.8435
10 11 1264977 11.3390
11 10 1264765 9.2668
12 11 1263983 11.0000
I have written a model that I am fitting to data by maximum likelihood via the mle2 package. However, I have a large data frame of samples, and I would like to fit the model to each replicate and then collect all of the model coefficients in a data frame.
I have tried the ddply function in the plyr package, with no success.
I get the following error message when I try:
Error in output[[var]][rng] <- df[[var]] :
incompatible types (from S4 to logical) in subassignment type fix
Any thoughts?
Here is an example of what I am doing.
This is my data frame. I have measurements in Pond 5...n on day 1...n. Each replicate consists of 143 fluxes (flux.cor), which is the variable I am modelling.
Pond Obs Date Time Temp DO pH U day month PAR
932 5 932 2011-06-16 17:31:00 17:31:00 294.05 334.3750 8.47 2 1 1 685.08
933 5 933 2011-06-16 17:41:00 17:41:00 294.05 339.0625 8.47 2 1 1 808.44
934 5 934 2011-06-16 17:51:00 17:51:00 294.02 340.6250 8.46 2 1 1 752.78
935 5 935 2011-06-16 18:01:00 18:01:00 294.00 340.6250 8.45 2 1 1 684.14
936 5 936 2011-06-16 18:11:00 18:11:00 293.94 340.9375 8.50 2 1 1 625.86
937 5 937 2011-06-16 18:21:00 18:21:00 293.88 341.5625 8.48 2 1 1 597.06
day.night Treat H pOH OH DO.cor sd.DO av.DO DO.sat
932 1 A 3.388442e-09 5.53 2.951209e-06 342.1406 2.63078 342.1406 274.0811
933 1 A 3.388442e-09 5.53 2.951209e-06 339.0625 2.63078 342.1406 274.0811
934 1 A 3.467369e-09 5.54 2.884032e-06 340.6250 2.63078 342.1406 274.2432
935 1 A 3.548134e-09 5.55 2.818383e-06 340.6250 2.63078 342.1406 274.3513
936 1 A 3.162278e-09 5.50 3.162278e-06 340.9375 2.63078 342.1406 274.6763
937 1 A 3.311311e-09 5.52 3.019952e-06 341.5625 2.63078 342.1406 275.0020
DO_flux NEP.hr flux.cor sd.flux av.flux
932 -3.078125 -3.09222602 -3.078125 2.104482 -0.1070312
933 1.562500 1.54903673 1.562500 2.104482 -0.1070312
934 0.000000 -0.01375489 0.000000 2.104482 -0.1070312
935 0.312500 0.29876654 0.312500 2.104482 -0.1070312
936 0.625000 0.61126617 0.625000 2.104482 -0.1070312
here is the my model:
# function that generates predictions of O2 flux given GPP R and gas exchange
flux.pred <- function(GPP24, PAR, R24, Temp, U, DO, DOsat){
# calculates Schmidt coefficient from water temperature
Sc <- function(Temp){
# Schmidt number polynomial; the leading term should be cubic and negative
# (the original had ^2 twice)
S <- -0.0476*Temp^3 + 3.7818*Temp^2 - 120.1*Temp + 1800.6
}
# calculates piston velocity k (m h-1) from wind speed at 10m (m s-1)
k600<-function(U){
k.600<-(2.07 + 0.215*((U)^1.7))/100
}
# calculates piston velocity k (m h-1) from wind speed at 10m (m s-1)
k<-function(Temp,U){
k<-k600(U)*((Sc(Temp)/600)^-0.5)
}
# physical gas flux (mg O2 m-2 10mins-1)
D<-function(Temp,U,DO,DOsat){
d<-(k(Temp,U)/6)*(DO-DOsat)
}
# main function to generate predictions
flux<-(GPP24/sum(YSI$PAR[YSI$PAR>40]))*(ifelse(YSI$PAR>40, YSI$PAR, 0))-(R24/144)+D(YSI$Temp,YSI$U,YSI$DO,YSI$DO.sat)
return(flux)
}
which returns predictions for the fluxes.
I then build my likelihood function:
# likelihood function
ll <- function(GPP24, PAR, R24, Temp, U, DO.cor, DO.sat){
pred = flux.pred(GPP24, PAR, R24, Temp, U, DO.cor, DO.sat)
pred = pred[-144]
obs = YSI$flux.cor[-144]
return(-sum(dnorm(obs, mean=pred, sd=sqrt(var(obs-pred)), log=TRUE)))  # log=TRUE for a proper negative log-likelihood
}
and apply it
ll.fit<-mle2(ll,start=list(GPP24=100, R24=100))
It works beautifully for one Pond on one day, but what I want to do is apply it to all ponds on all days automatically.
I tried the ddply (as stated above)
metabolism<-ddply(YSI, .(Pond,Treat,day,month), summarise,
mle = mle2(ll,start=list(GPP24=100, R24=100)))
but had no success. I also tried extracting the coefficients with a for loop, but this did not work either:
for(i in 1:length(unique(YSI$day))){
GPP<-numeric(length=length(unique(YSI$day)))
GPP[i]<-mle2(ll,start=list(GPP24=100, R24=100))
}
any help would be gratefully received.
There's at least one problem with your functions: nowhere in flux.pred or ll is there an argument that lets you specify which data to use; you hardcoded YSI. So how is any *ply function supposed to guess that it needs to swap YSI$... for the relevant subset?
Apart from that, as @hadley points out, ddply will not suit you. dlply might, or you could use the classic approach of by() or lapply(split()).
So imagine you make a function
flux.pred <- function(data, GPP24, R24){
# calculates Schmidt coefficient from water temperature
# (an inner function cannot take data$Temp as a formal argument; give it a
# plain Temp argument and pass data$Temp when you call it)
Sc <- function(Temp){
S <- 0.0476*Temp^2 ...
...
}
and a function
ll<-function(GPP24, R24, data ){
pred = (flux.pred(data, GPP24, R24 ))
pred = pred[-144] # check this
obs = data$flux.cor[-144] # check this
return(-sum(dnorm(obs, mean=pred, sd=sqrt(var(obs-pred)), log=TRUE)))  # log=TRUE for a proper negative log-likelihood
}
You should then be able to do e.g.:
dlply(data, .(Pond, Treat, day, month), .fun = function(i){
  mle2(ll, start = list(GPP24 = 100, R24 = 100), data = list(data = i))
})
Note that the data frame goes in mle2's data argument, not in start (start should contain only the parameters being optimized). mle2 makes the variables in data available when evaluating ll, while the default optimizer, optim, searches over the start parameters; see ?mle2 and ?optim for details.
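To get from the resulting list of fits back to the coefficient data frame the question asked for, a sketch (fits is a hypothetical name for the list returned by a dlply() call like the one above):

```r
library(plyr)

# one row per Pond/Treat/day/month group: the grouping labels
# plus one column per fitted parameter (GPP24, R24)
coef.table <- ldply(fits, function(fit) as.data.frame(t(coef(fit))))
head(coef.table)
```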
What I can't check is how the optimization behaves; it might even be that your function doesn't really work as you intend. Normally, for a direct call to optim, you would write ll like:
ll <- function(par, data){
GPP24 <- par[1]
R24 <- par[2]
...
}
for optim to work. But if you say it works as you wrote it, I believe you. Make sure it actually does, though; I am not convinced.
On a side note: neither by() / lapply(split()) nor dlply() is vectorization. On the contrary, all of these constructs are intrinsically loops. On why you would use them anyway, read: Is R's apply family more than syntactic sugar?