I have simulated some data to create a regression tree with 3 terminal nodes:
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
So, I want the first split to be on X1<1, and within X1<1 I want a split on X2<0.2. The values of Y in the leaves are the coefficients of the indicators.
If I run the procedure implemented in the rpart package, everything is fine in the case above.
library(rpart)
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 1627.0670 4.043696
2) X1>=0.9490461 326 373.8485 3.124825 *
3) X1< 0.9490461 674 844.8367 4.488135
6) X2>=0.2488142 327 312.7506 3.970742 *
7) X2< 0.2488142 347 362.0582 4.975708 *
It also works if I make all the coefficients negative.
But when I generate a mix of negative and positive values in the final terms (that is, in the "interaction" of the tree, where the split happens at the second level), rpart changes the order of the splits and the values in the leaves are not correct:
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
mytree<-rpart(Y~.,data=mydat)
mytree
Output:
node), split, n, deviance, yval
* denotes terminal node
1) root 1000 17811.4000 0.6136962
2) X2< 0.1974489 515 8116.5350 -2.3192910
4) X1< 1.002815 343 359.7394 -5.0305350 *
5) X1>=1.002815 172 207.4313 3.0874360 *
3) X2>=0.1974489 485 560.3419 3.7281050 *
Does anyone have an idea about this problem?
Thanks!
You need to tune the complexity parameter cp. See the code below.
# Data Generating Process
set.seed(1988)
n=1000
X1<-rnorm(n,mean=0,sd=2)
X2<-rnorm(n,mean=0,sd=2)
e<-rnorm(n)
Y=-5*I(X1<1)*I(X2<0.2)+4*I(X1<1)*I(X2>=0.2)+3*I(X1>=1)+e
mydat=as.data.frame(cbind(Y,X1,X2))
library(rpart)
mytree<-rpart(Y~.,data=mydat, cp=0.0001)
# Plot the cross-validation error vs the complexity parameter
plotcp(mytree)
# Find the optimal value of the complexity parameter cp
optcp <- mytree$cptable[which.min(mytree$cptable[,4]),1]
# Prune the tree using the optimal complexity parameter
mytree <- prune(mytree,optcp)
The pruned tree correctly represents the underlying data generating process:
library(rattle)
fancyRpartPlot(mytree)
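As an optional check (a minimal sketch, not part of the tuning itself), the fitted values of the pruned tree should land close to the coefficients used in the data generating process:
# The distinct fitted values are the leaf means; they should be close to
# the true coefficients -5, 4 and 3 (if the pruned tree has three leaves)
sort(unique(round(predict(mytree), 2)))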
I want to plot a decision tree in R with rpart and fancyRpartPlot. The code is working, but I want to show the p-value of each split. When I print the tree (last line of the code), I get stars behind the nodes, which usually indicate statistical significance - I guess this is the case here too. However, I want to access the calculated p-values and include them in the plot. I would be very grateful if anyone has an idea of how to do this. Thanks!
library(rpart)
library(rattle)
library(rpart.plot)
library(RColorBrewer)
seatbelts <- Seatbelts
seatbelts <- as.data.frame(seatbelts)
unique(seatbelts$law)
seatbelts_tree <- rpart(law ~ ., data=seatbelts)
plot(seatbelts_tree, uniform = TRUE, margin = 0.5)
text(seatbelts_tree)
prp(seatbelts_tree)
fancyRpartPlot(seatbelts_tree, type=2)
seatbelts_tree
The output of the above code already contains the answer: the * indicates a terminal node, which can be harder to spot in plain text output depending on the format.
n= 192
node), split, n, deviance, yval
* denotes terminal node
1) root 192 20.244790 0.11979170
2) drivers>=1303 178 8.544944 0.05056180
4) front>=663 158 1.974684 0.01265823
8) kms< 18147.5 144 0.000000 0.00000000 *
9) kms>=18147.5 14 1.714286 0.14285710 *
5) front< 663 20 4.550000 0.35000000
10) PetrolPrice< 0.1134217 11 0.000000 0.00000000 *
11) PetrolPrice>=0.1134217 9 1.555556 0.77777780 *
3) drivers< 1303 14 0.000000 1.00000000 *
If you want p-values you should look into conditional inference trees, e.g. ctree() in the partykit package. Here is a related question with a short explanation and further reading material.
https://stats.stackexchange.com/questions/255150/how-to-interpret-this-decision-tree/255156
The stars in the print method of rpart highlight terminal nodes, not p-values. A decision tree is a descriptive method; it is not designed to test hypotheses.
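If p-values are genuinely needed, a minimal sketch using partykit::ctree() (conditional inference trees, which select splits with significance tests) on the same seatbelts data frame as above:
library(partykit)
seatbelts_ctree <- ctree(law ~ ., data = seatbelts)
plot(seatbelts_ctree)   # inner nodes are labelled with the split p-values
seatbelts_ctree         # printed tree structure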
I would like to simulate exponential family random graphs, and I just started learning to use the statnet and ergm R packages. From the tutorial I found online, I am able to learn an ERGM model from an example dataset:
# install.packages('statnet')
# install.packages('ergm')
# install.packages('coda')
library(statnet)
set.seed(123)
data(package='ergm') # tells us the datasets in our packages
data(florentine) # loads flomarriage and flobusiness data
# Triad model
flomodel <- ergm(flomarriage ~ edges + triangle)
summary(flomodel)
Currently, I would like to use the simulate command to simulate networks with a pre-specified number of nodes from a pre-specified formula (that is not learned from any particular dataset), for example, P(y) = 1/Z exp(a * num_edges + b * num_triangles), where a and b are user-specified coefficients.
How should I go about writing such a model in statnet?
You can simulate from a given formula with simulate (or simulate.formula):
simulate(flomarriage ~ edges + triangles, coef = c(3,1))
To fix a simulation to have the same number of edges as the given graph (flomarriage in this case):
simulate(flomarriage ~ edges + triangles, coef = c(3,1), constraints = ~edges)
Not every constraint you might want to apply is available, since each requires a specific MCMC sampler, but for a list of what is available see ?ergm.constraints.
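For instance, a purely illustrative sketch using the degrees constraint, which holds every node's degree fixed during the simulation (the coefficient value is arbitrary):
# Illustrative only: simulate while holding each node's degree fixed
sim.deg <- simulate(flomarriage ~ triangles, coef = 1, constraints = ~degrees)
summary(sim.deg ~ edges + triangles)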
To fix the simulation to have an arbitrary number of nodes and edges (not based on observed data), a workaround is to create such a network first. For example, to simulate over networks with 17 nodes and 16 edges:
test.mat = matrix(0, 17, 17)
test.mat[1,] = 1 # adds 16 edges (the diagonal self-loop is dropped by as.network)
test.net = as.network(test.mat, directed = F)
test.sim = simulate(test.net ~ triangles, coef = 1, constraints = ~edges)
summary(test.sim ~ edges + triangles)
P.S. I don't recommend using the triangles term in ERGM models. The geometrically weighted terms (gwesp, gwdsp) are better substitutes and are more stable.
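For example, a hedged sketch of the same kind of simulation with gwesp in place of triangles (the decay of 0.5 and the coefficients are arbitrary illustrative choices, not recommendations):
# Simulate with a geometrically weighted term instead of triangles
sim.gw <- simulate(flomarriage ~ edges + gwesp(0.5, fixed = TRUE),
                   coef = c(-2, 0.5))
summary(sim.gw ~ edges + gwesp(0.5, fixed = TRUE))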
I have a large sparseMatrix (mat):
138493 x 17694 sparse Matrix of class "dgCMatrix", with 10000132 entries
I want to investigate Inter-rating agreement using kappa statistics but when I run Fleiss:
kappam.fleiss(mat)
I am shown the following error
Error in asMethod(object) :
Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105
Is this due to my matrix being too large?
Is there any other methods I can use to calculate kappa statistics for IRR on a matrix this large?
The best answer that I can offer is that this is not really possible due to the extreme sparsity in your matrix. The problem: With 10,000,132 entries for a 138,493 * 17694 = 2,450,495,142 cell matrix, you have mostly (99.59%) missing values. The irr package allows for these but here you are placing some extreme demands on the system, by asking it to compare ratings for users whose films do not overlap.
This is compounded by the problem that the methods in the irr package (a) require dense matrices as input, and (b) (at least kripp.alpha()) loop over columns, making them very slow.
Here is an illustration constructing a matrix similar in nature to yours (but with no pattern - in reality your situation will be better because viewers tend to rate similar sets of movies).
Note that I used Krippendorff's alpha here, since it allows for ordinal or interval ratings (as your data suggests), and normally handles missing data fine.
require(Matrix)
require(irr)
set.seed(100)
(sparseness <- 1 - 10000132 / (138493 * 17694))
## [1] 0.9959191
138493 / 17694 # ratio of users to movies
## [1] 7.827117
# Full-size problem (too big to run here):
# nusers  <- 138493  # raters
# nmovies <- 17694   # subjects (movies)
nmovies <- 100
nusers <- 783
raterMatrix <-
Matrix(sample(c(NA, seq(0, 5, by = .5)), nmovies * nusers, replace = TRUE,
prob = c(sparseness, rep((1-sparseness)/11, 11))),
nrow = nmovies, ncol = nusers)
kripp.alpha(t(as.matrix(raterMatrix)), method = "interval")
## Krippendorff's alpha
##
## Subjects = 100
## Raters = 783
## alpha = -0.0237
This worked for a matrix of that size, but when I increased it 100x (10x on each dimension), keeping the same proportions as in your reported dataset, it failed to produce an answer even after 30 minutes, so I killed the process.
What to conclude: You are not really asking the right question of this data. It's not an issue of how many users agreed, but rather what sort of dimensions exist in this data in terms of clusters of viewing and clusters of preferences. You probably want to use association rules or some dimensionality reduction method that doesn't balk at the sparsity in your dataset.
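As one concrete (and purely illustrative) starting point, a truncated SVD via the irlba package runs directly on a dgCMatrix like yours, treating unrated cells as zeros; the choice of 10 latent dimensions below is arbitrary:
# Truncated SVD of the sparse rating matrix (unrated cells treated as 0)
library(irlba)
dec <- irlba(mat, nv = 10)   # 10 latent dimensions, chosen arbitrarily
dim(dec$u)                   # user factors: 138493 x 10
dim(dec$v)                   # movie factors: 17694 x 10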
I'm new to R and the rpart package. I want to create a tree using the following sample data.
My data set is similar to this:
mydata <- read.csv(text = '
"","A","B","C","status"
"1",TRUE,TRUE,TRUE,"okay"
"2",TRUE,TRUE,FALSE,"okay"
"3",TRUE,FALSE,TRUE,"okay"
"4",TRUE,FALSE,FALSE,"notokay"
"5",FALSE,TRUE,TRUE,"notokay"
"6",FALSE,TRUE,FALSE,"notokay"
"7",FALSE,FALSE,TRUE,"okay"
"8",FALSE,FALSE,FALSE,"okay"', row.names = 1)
mydata$status <- factor(mydata$status)  # response as a factor for rpart
fit <- rpart(status ~ A + B + C, data = mydata, method = "class")
I also tried different formulas and different methods, but only the root node is ever produced, so no plot is possible.
This is what it shows:
fit
n= 8
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8 3 okay (0.3750000 0.6250000) *
How can I create the tree? I need to show the percentage of "okay" and "notokay" at each node, and I need to specify one of A, B, or C for splitting and show the statistics.
With the default settings of rpart() no splits are considered at all. The minsplit parameter is 20 by default (see ?rpart.control) which is "the minimum number of observations that must exist in a node in order for a split to be attempted." So for your 8 observations no splitting is even considered.
If you are determined to consider splitting, then you could decrease the minbucket and/or minsplit parameters. For example
fit <- rpart(status ~ A + B + C, data = mydata,
control = rpart.control(minsplit = 3))
produces a tree with three terminal nodes. A plot of it can be created by
plot(partykit::as.party(fit), tp_args = list(beside = TRUE))
and the print output from rpart is:
n= 8
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 8 3 okay (0.3750000 0.6250000)
2) A=FALSE 4 2 notokay (0.5000000 0.5000000)
4) B=TRUE 2 0 notokay (1.0000000 0.0000000) *
5) B=FALSE 2 0 okay (0.0000000 1.0000000) *
3) A=TRUE 4 1 okay (0.2500000 0.7500000) *
Whether or not this is particularly useful is a different question though...
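On the asker's wish to show the percentage of "okay" and "notokay" in each node: one option, sketched here under the assumption that the fit above is used, is rpart.plot:
# extra = 104 shows the fitted class, the per-class probabilities and the
# percentage of observations falling into each node
library(rpart.plot)
rpart.plot(fit, extra = 104)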
I'm getting stuck trying to build a model.
I want to divide the dataset freeny into 10 subsets by year.
library(RSNNS)  # provides decodeClassLabels() and splitForTrainingAndTest()
data(freeny)
options(digits=2)
year<-as.integer(rownames(freeny))
freeny<-cbind(freeny,year)
freeny = freeny[sample(1:nrow(freeny),length(1:nrow(freeny))),1:ncol(freeny)]
freenyValues= freeny[,1:5]
freenyTargets=decodeClassLabels(freeny[,6])
freeny = splitForTrainingAndTest(freenyValues,freenyTargets,ratio=0.15)
km<-kmeans(freeny$inputsTrain,10,iter.max = 100, nstart = 5)
kclust=km$cluster
library(tree)
kclust=as.factor(kclust)
mdp=cbind(freeny$inputsTrain,kclust)
mdp<-data.frame(mdp)
mdp.tr=tree(kclust~.,mdp)
But the result is that the tree has only 5 terminal nodes. It should have 10 terminal nodes because I divided the data into 10 clusters with kmeans. What's wrong?
No, it shouldn't. tree is an algorithm that tries to fit a tree given predictors and a response, and it stops if "the terminal nodes are too small or too few to be split" (manual page). Try adjusting the minsize parameter (see ?tree.control):
minsize: The smallest allowed node size: a weighted quantity. The
default is 10.
I think the following will do what is intended:
mdp.tr=tree(kclust~.,mdp, minsize= 1)
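A quick way to check the effect is to count the terminal nodes of the refitted tree; with minsize = 1 it should now be able to separate all 10 k-means clusters:
# Number of terminal nodes in the new tree (expected to be close to 10)
sum(mdp.tr$frame$var == "<leaf>")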