Machine learning in R is slow with decisiontree - r

I'm trying to predict the type of a vehicle (model) based on the vehicle identification number (VIN). The first 10 positions of the VIN says something about the type, so I use them as variables. See an example of the data below:
positie_1_tm_3 positie_4 positie_5 positie_6 positie_7 positie_8 positie_9 positie_10 MODEL
MBL B 7 5 L 7 A 6 SKODA YETI
JNF A A E 1 1 U 2 NISSAN NOTE
VWZ Z Z 5 Z Z 9 4 VOLKSWAGEN FOX
F1D Z 0 V 0 6 4 2 RENAULT MEGANE
NAK U 8 1 1 C A 5 KIA SORENTO
F1B R 1 J 0 H 4 1 RENAULT CLIO
I used this R code for it:
#make stratisfied train and test set:
library(caret)
train.index <- createDataPartition(VIN1$MODEL, p = .6, list = FALSE)
train <- VIN1[ train.index,]
overige_data <- VIN1[-train.index,]
test.index<-createDataPartition(overige_data$MODEL, p = .5, list = FALSE)
test<-overige_data[test.index,]
testset2<-overige_data[-test.index,]
#make decision three :
library(rpart)
library(rpart.plot)
library(rattle)
library(RColorBrewer)
tree<- rpart(MODEL ~., train, method="class")
But the last one, making the tree, is running for more than 2 weeks already.
The dataset is around 3 million rows, so the trainingset is around 1,8 million rows. Is it running so long because it’s too much data for rpart or is there another problem?

No, something is obviously wrong. It may take long, but not 2 weeks.
The question - how many labels (classes there are)? Decision trees tend to be slow when the number of classes is large (by large I mean more than 50).

Related

Errors using powerSim and powerCurve for a clmm in R

I'm new to clmm and run into the following problem:
I want to obtain the optimal sample size for my study with R using powerSim and powerCurve. Because my data is ordinal, I'm using a clmm. Study participants (VPN) should evaluate three sentence types (SH1,SM1, and SP1) on a 5 point likert scale (evaluation.likert). I need to account for my participants as a random factor while the sentence types and the evaluation are my fixed factors.
Here's a glimpse of my data (count of VPN goes up to 40 for each of the parameters, I just shortened it here):
VPN parameter evaluation.likert
1 1 SH1 2
2 2 SH1 4
3 3 SH1 5
4 4 SH1 3
...
5 1 SM1 4
6 2 SM1 2
7 3 SM1 2
8 4 SM1 5
...
9 1 SP1 1
10 2 SP1 1
11 3 SP1 3
12 4 SP1 5
...
Now, with some help I created the following model:
clmm(likert~parameter+(1|VPN), data=dfdata)
With this model, I'm doing the simulation:
ps1 <- powerSim(power, test=fixed("likert:parameter", "anova"), nsim=40)
Warning:
In observedPowerWarning(sim) :
This appears to be an "observed power" calculation
print(ps1)
Power for predictor 'likert:parameter', (95% confidence interval):
0.00% ( 0.00, 8.81)
Test: Type-I F-test
Based on 40 simulations, (0 warnings, 40 errors)
alpha = 0.05, nrow = NA
Time elapsed: 0 h 0 m 0 s
nb: result might be an observed power calculation
In the above example, I tried it with 40 participants but I already also ran a simulation with 2000000 participants to check if I just need a huge amount of people. The results were the same though: 0.0%.
lastResult()$errors tells me that I'm using a method which is not applicable for clmm:
not applicable method for'simulate' on object of class "clmm"
But besides the anova I'm doing here, I've also already tried z, t, f, chisq, lr, sa, kr, pb. (And instead of test=fixed, I've also already tried test=compare, test=fcompare, test=rcompare, and even test=random())
So I guess there must be something wrong with my model? Or are really none of these methods applicaple for clmms?
Many thanks in advance, your help is already very much appreciated!

Adjusted survival curve based on weigthed cox regression

I'm trying to make an adjusted survival curve based on a weighted cox regression performed on a case cohort data set in R, but unfortunately, I can't make it work. I was therefore hoping that some of you may be able to figure it out why it isn't working.
In order to illustrate the problem, I have used (and adjusted a bit) the example from the "Package 'survival'" document, which means im working with:
data("nwtco")
subcoh <- nwtco$in.subcohort
selccoh <- with(nwtco, rel==1|subcoh==1)
ccoh.data <- nwtco[selccoh,]
ccoh.data$subcohort <- subcoh[selccoh]
ccoh.data$age <- ccoh.data$age/12 # Age in years
fit.ccSP <- cch(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data,subcoh = ~subcohort, id=~seqno, cohort.size=4028, method="LinYing")
The data set is looking like this:
seqno instit histol stage study rel edrel age in.subcohort subcohort
4 4 2 1 4 3 0 6200 2.333333 TRUE TRUE
7 7 1 1 4 3 1 324 3.750000 FALSE FALSE
11 11 1 2 2 3 0 5570 2.000000 TRUE TRUE
14 14 1 1 2 3 0 5942 1.583333 TRUE TRUE
17 17 1 1 2 3 1 960 7.166667 FALSE FALSE
22 22 1 1 2 3 1 93 2.666667 FALSE FALSE
Then, I'm trying to illustrate the effect of stage in an adjusted survival curve, using the ggadjustedcurves-function from the survminer package:
library(suvminer)
ggadjustedcurves(fit.ccSP, variable = ccoh.data$stage, data = ccoh.data)
#Error in survexp(as.formula(paste("~", variable)), data = ndata, ratetable = fit) :
# Invalid rate table
But unfortunately, this is not working. Can anyone figure out why? And can this somehow be fixed or done in another way?
Essentially, I'm looking for a way to graphically illustrate the effect of a continuous variable in a weighted cox regression performed on a case cohort data set, so I would, generally, also be interested in hearing if there are other alternatives than the adjusted survival curves?
Two reasons it is throwing errors.
The ggadjcurves function is not being given a coxph.object, which it's halp page indicated was the designed first object.
The specification of the variable argument is incorrect. The correct method of specifying a column is with a length-1 character vector that matches one of the names in the formula. You gave it a vector whose value was a vector of length 1154.
This code succeeds:
fit.ccSP <- coxph(Surv(edrel, rel) ~ stage + histol + age,
data =ccoh.data)
ggadjustedcurves(fit.ccSP, variable = 'stage', data = ccoh.data)
It might not answer your desires, but it does answer the "why-error" part of your question. You might want to review the methods used by Therneau, Cynthia S Crowson, and Elizabeth J Atkinson in their paper on adjusted curves:
https://cran.r-project.org/web/packages/survival/vignettes/adjcurve.pdf

Run DBSCAN against grouped coordinates

I'm attempting to run DBSCAN against some grouped coordinates in order to get sub-clusters. I've clustered some spacial data and I'd now like to further divide these regions according to the density of points within them. I think DBSCAN is probably the best way to do this.
My issue is that I can't figure out how to run DBSCAN against each cluster seperately and then output the cluster assignment as a new column. Here's some sample data:
library(dplyr)
library(dbscan)
# Create sample data
df <- data.frame(
"ID"=1:200,
"X"=c(1.0083,1.3166,1.3072,1.1311,1.2984,1.2842,1.1856,1.3451,1.1932,1.0926,1.2464,1.3197,1.2331,1.2996,1.3482,
1.1944,1.2800,1.3051,1.4471,0.9068,1.3150,1.1846,1.0232,1.0005,1.0640,1.3177,1.1015,0.9598,1.0354,1.2203,
0.8388,0.8655,1.3387,1.0133,1.0106,1.1753,1.3200,1.0139,1.1511,1.3508,1.2747,1.3681,1.1074,1.2735,1.2245,
0.9695,1.3250,1.0537,1.2020,1.3093,0.9268,1.3244,1.2626,1.3123,1.2819,1.1063,0.8759,1.0063,1.0173,1.0187,
1.2396,1.0241,1.2619,1.2682,1.0008,1.0827,1.3639,1.3099,1.0004,0.8886,1.2359,1.1370,1.2783,1.0803,1.1918,
1.1156,1.3313,1.1205,1.0776,1.3895,1.3559,0.8518,1.1315,1.3521,1.2281,1.2589,0.9974,1.1487,1.4204,0.9998,
1.0154,1.0098,0.8851,1.0252,0.9331,1.2197,1.0084,1.2303,1.2808,1.3125,0.5500,0.6694,0.3301,0.3787,0.6492,
0.6568,0.6773,0.3769,0.6237,0.7265,0.5509,0.3579,0.7201,0.2631,0.3881,0.7596,0.3343,0.7049,0.3430,0.2951,
0.5483,0.7699,0.3806,0.6555,0.2524,0.4030,0.6329,0.5006,0.2701,0.0822,0.5442,0.5233,0.7105,0.5660,0.3962,
0.3187,0.3143,0.5673,0.3731,0.7310,0.6376,0.4864,0.8865,0.3352,0.7540,0.0690,0.7983,0.6990,0.4090,0.5658,
0.5636,0.5420,0.7223,0.6146,0.5648,0.2711,0.3422,0.7214,0.2196,0.2848,0.6496,0.7907,0.7418,0.7825,0.4550,
0.4361,0.7417,0.2661,0.8978,0.7875,0.2343,0.3853,0.6874,0.7761,0.2905,0.6092,0.5329,0.6189,0.0684,0.5726,
0.5740,0.7060,0.4609,0.3568,0.7037,0.2874,0.6200,0.7149,0.5100,0.7059,0.2520,0.3105,0.6870,0.7888,0.3674,
0.6514,0.7271,0.6679,0.3752,0.7067),
"Y"=c(-1.2547,-1.1499,-1.1803,-1.0626,-1.2877,-1.1151,-1.0958,-1.1339,-1.0808,-1.5461,-1.0775,-1.1431,-1.0499,
-1.1521,-1.1675,-1.0963,-1.1407,-1.1916,-1.1229,-1.2297,-1.1308,-1.0341,-1.3071,-1.2370,-1.5043,-1.1154,
-1.5452,-1.0349,-1.5412,-1.0348,-1.3620,-1.3776,-1.1830,-1.2552,-1.2354,-1.0838,-1.1214,-1.2396,-1.4937,
-1.0793,-1.1857,-1.0679,-1.5425,-1.1633,-1.1620,-1.0838,-1.0750,-1.3493,-1.4155,-1.1354,-1.0615,-1.1494,
-1.1620,-1.1582,-1.1800,-1.5230,-1.3019,-1.2484,-1.5490,-1.2435,-1.0487,-1.2330,-1.1234,-1.0924,-1.0702,
-1.0446,-1.1077,-1.1144,-1.2170,-1.2715,-1.1537,-1.5077,-1.1305,-1.3396,-1.2107,-1.5458,-1.1482,-1.1224,
-1.3690,-1.2058,-1.1685,-1.3400,-1.5033,-1.2152,-1.3805,-1.1439,-1.5183,-1.4288,-1.1252,-1.2330,-1.2511,
-1.5429,-1.3333,-1.1851,-1.1367,-1.3952,-1.1240,-1.2113,-1.1632,-1.1965,-0.9917,-0.7416,-0.7729,-1.1279,
-0.9323,-0.9372,-0.7013,-1.1746,-0.9191,-0.9356,-0.7873,-1.1957,-0.9838,-0.5825,-1.0738,-0.9302,-0.7713,
-0.9407,-0.7774,-0.8160,-0.9861,-1.0440,-0.9896,-0.6478,-0.8865,-1.0601,-1.0640,-0.9898,-0.5989,-0.7375,
-0.7689,-0.9799,-0.9147,-1.1048,-0.9735,-0.8591,-0.7913,-1.0085,-0.7231,-0.9688,-0.9272,-0.9395,-0.9494,
-0.7859,-1.0817,-0.7262,-0.9915,-0.9329,-1.0953,-1.0425,-1.0806,-1.0132,-0.8514,-1.0785,-1.1109,-0.8542,
-1.0849,-0.9665,-0.5940,-0.6145,-0.7830,-0.9601,-0.8996,-0.7717,-0.7447,-1.0406,-1.0067,-0.5710,-0.9839,
-1.0594,-0.7069,-1.1202,-0.9705,-1.0100,-0.6377,-1.0632,-0.9450,-0.9163,-0.7865,-1.0090,-1.1005,-1.0049,
-0.8042,-1.0781,-0.6829,-0.5962,-1.0759,-0.7918,-0.9732,-0.7353,-0.5615,-1.2002,-0.9295,-0.9944,-1.1570,
-0.9524,-0.9257,-0.9360,-1.1328,-0.7661),
"cluster"=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2))
# How do you run DBSCAN against the points within each cluster?
I first thought I'd try to use the group_by function in dplyr but DBSCAN requires a data matrix input and group_by doesn't work for matrices.
matrix <- as.matrix(df[, -1])
set.seed(1234)
db = matrix %>%
group_by(cluster) %>%
dbscan(matrix, 0.4, 4)
#Error in UseMethod("group_by_") :
# no applicable method for 'group_by_' applied to an object of class "c('matrix', 'double', 'numeric')"
I've also tried using by() but get duplicate results for each cluster grouping, which isn't right:
by(data = df, INDICES = df$cluster, FUN = function(x) {
out <- dbscan(as.matrix(df[, c(2:3)]),eps=.0215,minPts=4)
})
#df$cluster: 1
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
#--------------------------------------------------------------------------
#df$cluster: 2
#DBSCAN clustering for 200 objects.
#Parameters: eps = 0.0215, minPts = 4
#The clustering contains 10 cluster(s) and 138 noise points.
#
# 0 1 2 3 4 5 6 7 8 9 10
#138 11 12 4 5 8 2 4 8 4 4
#
#Available fields: cluster, eps, minPts
Can anyone point me in the right direction?
To be clear, dbscan::dbscan works fine on data.frame objects. You do not need to convert to matrix. It returns an object that includes a vector with the same dimension as the number of records in your input. The issue is that dplyr exposes variables to other functions as individual vectors, rather than as data.frame or matrix objects. You are free to do something like:
df %>%
group_by(cluster) %>%
mutate(
dbscan_cluster = dbscan::dbscan(
data.frame(X, Y),
eps = 0.0215,
minPts = 4
)[["cluster"]]
)
dplyr is not necessary, by also works, you just need to supply a generic function rather than one that directly references the source object directly. Your data must already be ordered by cluster.
df$dbscan_cluster <- unlist(
by(
df,
INDICES = df$cluster,
function(x) dbscan::dbscan(x[,c(2,3)], eps = 0.0215, minPts = 4)[["cluster"]]
)
)
However, you can still get garbage results if you don't have a good way to pick your epsilon. You might consider using dbscan::optics instead.

Getting percentile values from gamlss centile curves

This question is related to: Selecting Percentile curves using gamlss::lms in R
I can get centile curve from following data and code:
age = sample(5:15, 500, replace=T)
yvar = rnorm(500, age, 20)
mydata = data.frame(age, yvar)
head(mydata)
age yvar
1 12 13.12974
2 14 -18.97290
3 10 42.11045
4 12 27.89088
5 11 48.03861
6 5 24.68591
h = lms(yvar, age , data=mydata, n.cyc=30)
centiles(h,xvar=mydata$age, cent=c(90), points=FALSE)
How can I now get yvar on the curve for each of x value (5:15) which would represent 90th percentiles for data after smoothening?
I tried to read help pages and found fitted(h) and fv(h) to get fitted values for entire data. But how to get values for each age level at 90th centile curve level? Thanks for your help.
Edit: Following figure show what I need:
I tried following but it is correct since value are incorrect:
mydata$fitted = fitted(h)
aggregate(fitted~age, mydata, function(x) quantile(x,.9))
age fitted
1 5 6.459680
2 6 6.280579
3 7 6.290599
4 8 6.556999
5 9 7.048602
6 10 7.817276
7 11 8.931219
8 12 10.388048
9 13 12.138104
10 14 14.106250
11 15 16.125688
The values are very different from 90th quantile directly from data:
> aggregate(yvar~age, mydata, function(x) quantile(x,.9))
age yvar
1 5 39.22938
2 6 35.69294
3 7 25.40390
4 8 26.20388
5 9 29.07670
6 10 32.43151
7 11 24.96861
8 12 37.98292
9 13 28.28686
10 14 43.33678
11 15 44.46269
See if this makes sense. The 90th percentile of a normal distribution with mean and sd of 'smn' and 'ssd' is qnorm(.9, smn, ssd): So this seems to deliver (somewhat) sensible results, albeit not the full hack of centiles that I suggested:
plot(h$xvar, qnorm(.9, fitted(h), h$sigma.fv))
(Note the massive overplotting from only a few distinct xvars but 500 points. Ande you may want to set the ylim so that the full range can be appreciated.)
The caveat here is that you need to check the other parts of the model to see if it is really just an ordinary Normal model. In this case it seems to be:
> h$mu.formula
y ~ pb(x)
<environment: 0x10275cfb8>
> h$sigma.formula
~1
<environment: 0x10275cfb8>
> h$nu.formula
NULL
> h$tau.formula
NULL
So the model is just mean-estimate with a fixed-variance (the ~1) across the range of the xvar, and there are no complications from higher order parameters like a Box-Cox model. (And I'm unable to explain why this is not the same as the plotted centiles. For that you probably need to correspond with the package authors.)

How to organize data for a multivariate probit model?

I've conducted a psychometric test on some subjects, and I'm trying to create a multivariate probit model.
The test was conducted as follows:
To subject 1 was given a certain stimulous under 11 different conditions, 10 times for each condition. Answers (correct=1, uncorrect=0) were registered.
So for subject 1, I have the following results' table:
# Subj 1
correct
cnt 1 0
1 0 10
2 0 10
3 1 9
4 5 5
5 7 3
6 10 0
7 10 0
8 10 0
9 9 1
10 10 0
11 10 0
This means that Subj1 answered uncorrectly 10 times under condition 1 and 2, and answered 10 times correctly under condition 10 and 11. For the other conditions, the response was increasing from condition 3 to condition 9.
I hope I was clear.
I usually analyze the data using the following code:
prob.glm <- glm(resp.mat1 ~ cnt, family = binomial(link = "probit"))
Here resp.mat1 is the responses' table, while cnt is the contrast c(1,11). So I'm able to draw the sigmoid curve using the predict() function. The graph, for the subject-1 is the following.
Now suppose I've conducted the same test on 20 subjects. I have now 20 tables, organized like the first one.
What I want to do is to compare subgroups, for example: male vs. female; young vs. older and so on. But I want to keep the inter-individual variability, so simply "adding" the 20 tables will be wrong.
How can I organize the data in order to use the glm() function?
I want to be able to write a command like:
prob.glm <- glm(resp.matTOT ~ cnt + sex, family = binomial(link = "probit"))
And then graphing the curve for sex=M, and sex=F.
I tried using the rbind() function, to create a unique table, then adding columns for Subj (1 to 20), Sex, Age. But it looks me a bad solution, so any alternative solutions will be really appreciated.
Looks like you are using the wrong function for the job. Check the first example of glmer in package lme4; it comes quite close to what you want. herd should be replaced by the subject number, but make sure that you do something like
mydata$subject = as.factor(mydata$subject)
when you have numerical subject numbers.
# Stolen from lme4
library(lattice)
library(
xyplot(incidence/size ~ period|herd, cbpp, type=c('g','p','l'),
layout=c(3,5), index.cond = function(x,y)max(y))
(gm1 <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
data = cbpp, family = binomial))
There's a multivariate probit command in the mlogit library of all things. You can see an example of the data structure required here:
https://stats.stackexchange.com/questions/28776/multinomial-probit-for-varying-choice-set

Resources