Why variable importance is not reflected in variable actually used in tree construction? - r

I generated an (unpruned) classification tree on R with the following code:
fit <- rpart(train.set$line ~ CountryCode + OrderType + Bon + SupportCode + prev_AnLP + prev_TXLP + prev_ProfLP + prev_EVProfLP + prev_SplLP + Age + Sex + Unknown.Position + Inc + Can + Pre + Mol, data=train.set, control=rpart.control(minsplit=5, cp=0.001), method="class")
printcp(fit) shows:
Variables actually used in tree construction:
Age
CountryCode
SupportCode
OrderType
prev_AnLP
prev_EVProfLP
prev_ProfLP
prev_TXLP
prev_SplLP
Those are the same variables I can see at each node in the classification tree, so they are correct.
What I do not understand is the result of summary(fit):
Variable importance:
29 prev_EVProfLP
19 prev_AnLP
16 prev_TXLP
15 prev_SplLP
9 prev_ProfLP
7 CountryCode
2 OrderType
1 Pre
1 Mol
From summary(fit) results it seems that variables Pre and Mol are more important than SupportCode and Age, but in the tree Pre and Mol are not used to split the data, while SupportCode and Age are used (just before two leafs, actually... but still used!).
Why?

The importance of an attribute is based on the sum of the improvements in all nodes in which the attribute appears as a splitter (weighted by the fraction of the training data in each node split). Surrogates are also included in the importance calculations, which means that even a variable that never splits a node may be assigned a large importance score. This allows the variable importance rankings to reveal variable masking and nonlinear correlation among the attributes. Importance scores may optionally be confined to splitters; comparing the splitters-only and the full (splitters and surrogates) importance rankings is a useful diagnostic.
Also see chapter 10 of book 'The Top Ten Algorithms in Data Mining' for more information
https://www.researchgate.net/profile/Dan_Steinberg2/publication/265031802_Chapter_10_CART_Classification_and_Regression_Trees/links/567dcf8408ae051f9ae493fe/Chapter-10-CART-Classification-and-Regression-Trees.pdf.

Related

Using predict in metafor when each author has multiple rows in the data

I'm running a meta-analysis where I'm interested in the effect of X on the effect of age on habitat use (raw mean values and variances) using the metafor package.
An example of one of my models is:
mod6 <-
rma.mv(
yi = Used_value,
V = Used_variance,
slab = Citation,
mods = ~ Age + poly(Slope, degrees = 2),
random = ~ 1 | Region,
data = vel.focal,
method = "ML"
)
My justification for not using Citation as a random effect is that using only Region accounts for more of the heterogeneity than when random = list( ~ 1 | Citation/ID, ~ 1 | Region) or when Citation/ID is used by itself.
What I need for output is the prediction for each age by region, but the predict() function for the model and the associated forest plot spits out the prediction for each row, as it assumes each row in the data is a unique study. In my case it is not as I have my input values separated by age and season.
predict(mod6)
pred se ci.lb ci.ub pi.lb pi.ub
Riehle and Griffith 1993.1 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Riehle and Griffith 1993.2 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Riehle and Griffith 1993.3 9.3437 2.3588 4.7205 13.9668 0.2362 18.4511
Spina 2000.1 8.7706 2.7386 3.4030 14.1382 -0.7364 18.2776
Spina 2000.2 8.5407 2.7339 3.1824 13.8991 -0.9611 18.0426
Spina 2000.3 8.5584 2.7406 3.1868 13.9299 -0.9509 18.0676
Vondracek and Longanecker 1993.1 12.6116 2.5138 7.6847 17.5385 3.3462 21.8769
Vondracek and Longanecker 1993.2 12.6116 2.5138 7.6847 17.5385 3.3462 21.8769
Vondracek and Longanecker 1993.3 12.3817 2.5327 7.4176 17.3458 3.0965 21.6669
Vondracek and Longanecker 1993.4 12.3817 2.5327 7.4176 17.3458 3.0965 21.6669
Does anybody know a way to modify the arguments inside predict() to tell it how you want your predictions output or to tell it that there are multiple rows per slab?
You need to use the newmods argument to specify the values for Age for which you want predicted values. You will have to plug in something for the linear and quadratic terms for the Slope variable as well (e.g., holding Slope constant at its mean and hence the quadratic term will just be the mean squared). Region is not a fixed effect, so it is not relevant if you want to compute predicted values based on the fixed effects. If you want to compute BLUPs for those random effects, you can do so with ranef(). One can then combine the predictions based on the fixed effects with the BLUPs. That would be the general idea, but implementing this will require a bit of programming.

R Quantreg: Singularity with categorical survey data

For my Bachelor's thesis I am trying to apply a linear median regression model on constant sum data from a survey (see formula from A.Blass (2008)). It is an attempt to recreate the probability elicitation approach proposed by A. Blass et al (2008) - Using Elicited Choice Probabilities to Estimate Random Utility Models: Preferences for Electricity Reliability
My dependent variable is the log-odds transformation of the constant sum allocations. Calculated using the following formula:
PE_raw <- PE_raw %>% group_by(sys_RespNum, Task) %>% mutate(LogProb = c(log(Response[1]/Response[1]),
log(Response[2]/Response[1]),
log(Response[3]/Response[1])))
My independent variables are delivery costs, minimum order quantity and delivery window, each categorical variables with levels 0, 1, 2 and 3. Here, level 0 represent the none-option.
Data snapshot
I tried running the following quantile regression (using R's quantreg package):
LAD.factor <- rq(LogProb ~ factor(`Delivery costs`) + factor(`Minimum order quantity`) + factor(`Delivery window`) + factor(NoneOpt), data=PE_raw, tau=0.5)
However, I ran into the following error indicating singularity:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I ran a linear regression and applied R's alias function for further investigation. This informed me of three cases of perfect multicollinearity:
minimum order quantity 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - minimum order quantity 1 - minimum order quantity 2
delivery window 3 = delivery costs 1 + delivery costs 2 + delivery costs 3 - delivery window 1 - delivery window 2
NoneOpt = intercept - delivery costs 1 - delivery costs 2 - delivery costs 3
In hindsight these cases all make sense. When R dichotomizedthe categorical variables you get these results by construction as, delivery costs 1 + delivery costs 2 + delivery costs 3 = 1 and minimum order quantity 1 + minimum order quantity 2 + minimum order quantity 3 = 1. Rewriting gives the first formula.
It looks like a classic dummy trap. In an attempt to workaround this issue I tried to manually dichotomize the data and used the following formula:
LM.factor <- rq(LogProb ~ Delivery.costs_1 + Delivery.costs_2 + Minimum.order.quantity_1 + Minimum.order.quantity_2 + Delivery.window_1 + Delivery.window_2 + factor(NoneOpt), data=PE_dichomitzed, tau=0.5)
Instead of an error message I now got the following:
Warning message:
In rq.fit.br(x, y, tau = tau, ...) : Solution may be nonunique
When using the summary function:
> summary(LM.factor)
Error in base::backsolve(r, x, k = k, upper.tri = upper.tri, transpose = transpose, :
singular matrix in 'backsolve'. First zero in diagonal [2]
In addition: Warning message:
In summary.rq(LM.factor) : 153 non-positive fis
Is anyone familiar with this issue? I am looking for alternative solutions. Perhaps I am making mistakes using the rq() function, or the data might be misrepresented.
I am grateful for any input, thank you in advance.
Reproducible example
library(quantreg)
#### Raw dataset (PE_raw_SO) ####
# quantile regression (produces singularity error)
LAD.factor <- rq(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO,
tau = 0.5
)
# linear regression to check for singularity
LM.factor <- lm(
LogProb ~ factor(`Delivery costs`) +
factor(`Minimum order quantity`) + factor(`Delivery window`) +
factor(NoneOpt),
data = PE_raw_SO
)
alias(LM.factor)
# impose assumptions on standard errors
summary(LM.factor, se = "iid")
summary(LM.factor, se = "boot")
#### Manually created dummy variables to get rid of
#### collinearity (PE_dichotomized_SO) ####
LAD.di.factor <- rq(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO,
tau = 0.5
)
summary(LAD.di.factor) #backsolve error
# impose assumptions (unusual results)
summary(LAD.di.factor, se = "iid")
summary(LAD.di.factor, se = "boot")
# linear regression to check for singularity
LM.di.factor <- lm(
LogProb ~ Delivery.costs_1 + Delivery.costs_2 +
Minimum.order.quantity_1 + Minimum.order.quantity_2 +
Delivery.window_1 + Delivery.window_2 + factor(NoneOpt),
data = PE_dichotomized_SO
)
alias(LM.di.factor)
summary(LM.di.factor) #regular results, all significant
Link to sample data + code: GitHub
The Solution may be nonunique behaviour is not unusual when doing quantile regressions with dummy explanatory variables.
See, e.g., the quantreg FAQ:
The estimation of regression quantiles is a linear programming
problem. And the optimal solution may not be unique.
A more intuitive explanation for what is happening is given by Roger Koenker (the author of quantreg) on r-help back in 2006:
When computing the median from a sample with an even number of
distinct values there is inherently some ambiguity about its value:
any value between the middle order statistics is "a" median.
Similarly, in regression settings the optimization problem solved by
the "br" version of the simplex algorithm, modified to do general
quantile regression identifies cases where there may be non
uniqueness of this type. When there are "continuous" covariates this
is quite rare, when covariates are discrete then it is relatively
common, atleast when tau is chosen from the rationals. For univariate
quantiles R provides several methods of resolving this sort of
ambiguity by interpolation, "br" doesn't try to do this, instead
returning the first vertex solution that it comes to.
Your second warning -- "153 non-positive fis" -- is a warning related to how the local densities are calculated by rq. Occasionally, it could be possible that local densities of the quantile regression function end up being negative (which is obviously impossible). If this happens, rq automatically sets them to zero. Again, quoting from the FAQ:
This is generally harmless, leading to a somewhat conservative
(larger) estimate of the standard errors, however if the reported
number of non-positive fis is large relative to the sample size then
it is an indication of misspecification of the model.

Grouping Variables in Multilevel Linear Models

I am trying to learn hierarchical models in R and I have generated some sample data for myself. I am having trouble with the correct syntax for coding a multilevel regression problem.
I generated some data for salaries in a Business school. I made the salaries depend linearly on the number of years of employment and the total number of publications by the faculty member. The faculty are in various departments and I made the base salary(intercept) different for each department and also the yearly hike(slopes) different for each department. This way, I have the intercept (base salary) and slope(w.r.t experience in number of years) of the salary depend on the nested level (department) and slope w.r.t another explanatory variable (Publications) not depend on the nested level. What would be the correct syntax to model this in R?
here's my data
Data <-data.frame(Sl_No = c(1:40),
+ Dept = as.factor(sample(c("Mark","IT","Fin"),40,replace = TRUE)),
+ Years = round(runif(40,1,10)))
pubs <-round(Data$Years*runif(40,1,3))
Data$Pubs <- pubs
lookup_table<-data.frame(Dept = c("Mark","IT","Fin","Strat","Ops"),
+ base = c(100000,140000,150000,150000,120000),
+ slope = c(6000,5000,3000,2000,4000))
Data <- merge(Data,lookup_table,by = 'Dept')
salary <-Data$base+Data$slope*Data$Years+Data$Pubs*10000+rnorm(length(Data$Dept))*10000
Data$base<-NULL
Data$slope<-NULL
I have tried the following:
1)
multilevel_model<-lmer(Salary~1|Dept+Pubs+Years|Dept, data = Data)
Error in model.matrix.default(eval(substitute(~foo, list(foo = x[[2]]))), :
model frame and formula mismatch in model.matrix()
2)
multilevel_model<-lmer(`Salary`~ Dept + `Pubs`+`Years`|Dept , data = Data)
boundary (singular) fit: see ?isSingular
I want to see the estimates of the salary intercept and yearly hike by Dept and the estimate of the effect of publication as a standalone (pooled). Right now I am not getting the code to work at all.
I know the base salary and the yearly hike by dept and the effect of a publication (since I generated it).
Dept base Slope
Fin 150000 3000
Mark 100000 6000
Ops 120000 4000
IT 140000 5000
Strat 150000 2000
Every publication increases the salary by 10,000.
ANSWER:
Thanks to #Ben 's answer here I think the correct model is
multilevel_model<-lmer(Salary~(1|Dept)+ Pubs +(0+Years|Dept), data = Data)
This gives me the following fixed effects by running
summary(multilevel_model)
Fixed effects:
Estimate Std. Error t value
(Intercept) 131667.4 10461.0 12.59
Pubs 10235.0 550.8 18.58
Correlation of Fixed Effects:
Pubs -0.081
The Department level coefficients are as follows:
coef(multilevel_model)
$Dept
Years (Intercept) Pubs
Fin 3072.5133 148757.6 10235.02
IT 5156.6774 136710.7 10235.02
Mark 5435.8301 102858.3 10235.02
Ops 3433.1433 118287.1 10235.02
Strat 963.9366 151723.1 10235.02
These are pretty good estiamtes of the original values. Now I need to learn to assess "how good" they are. :)
(1)
multilevel_model<-lmer(`Total Salary`~ 1|Dept +
`Publications`+`Years of Exp`|Dept , data = sample_data)
I can't immediately diagnose why this gives a syntax error, but parentheses are generally recommended around random-effect terms because the | operator has high precedence in formulas. Thus the response/right-hand side (RHS) formula
~ (1|Dept) + (`Publications`+`Years of Exp`|Dept)
might work, except that it would be problematic because both terms contain the same intercept term: if you wanted to do this you'd probably need
~ (1|Dept) + (0+`Publications`+`Years of Exp`|Dept)
(2)
~ Dept + `Publications`+`Years of Exp`|Dept
It doesn't really make any sense to put the same variable (Dept) on both the left- and right-hand sides of the bar.
You should probably use
~ pubs + years_exp + (1 + years_exp|Dept)
Since in principle the effect of publication could vary across departments, the maximal model would be
~ pubs + years_exp + (1 + pubs + years_exp|Dept)
It rarely makes sense to include a random effect without its corresponding fixed effect.
Note that you may get singular fits even if you have the right model; see the ?isSingular man page.
if the 18 observations listed above represent your whole data set, it's very likely too small to fit the maximal model successfully. Rule of thumb is that you need 10-20 observations per parameter estimated, and the maximal model has (intercept + 2 fixed-effect params + (3*4)/2=6 random-effect parameters) = 9 parameters. (Since it's simulated, you can easily simulate a big data set ...)
I'd recommend renaming variables in your data frame so you don't have to fuss with backtick-protecting variable names with spaces in them ...
The GLMM FAQ has more on model specification

lme4 translate formula to code in 3-level model

I have been provided with the following formulas and need to find the correct lme4 code. I find this rather challenging and could not find a good example I could follow...perhaps you can help?
I have two patient groups: group1 and group2. Both groups came to the lab 4 times for testing (4 VISITs) and during each of these visits, their memory was tested 4 times (4 RECALLs). The memory performance should be predicted from Age, Sex and two sleep parameters, which were assessed during each VISIT.
Thus Level 1 should be the RECALL (index i), Level 2 the VISIT (index j) and Level 3 the subject level (index k).
Level 1:
MEMSCOREijk = β0jk + β1jk * RECALLijk + Rijk
Level 2:
β0jk = γ00k + γ01k * VISITjk + U0jk
β1jk = γ10k + γ11k * VISITjk + U1jk
Level 3:
γ00k = δ000 + δ001 * SLEEPPARAM + V0k
γ10k = δ100 + δ101 * SLEEPPARAM + V1k
Thanks so much for your thoughts!
Something like this should work:
lmer(memscore ~ age + sex + sleep1 + sleep2 + (1 | visit) + (1 + sleep1 + sleep2 | subject), data = mydata)
By adding sleep1 and sleep2 in (1 + sleep1 + sleep2 | subject) you will be allowing the effects of the two sleep parameters to vary by participant (random slopes), and have a random intercept (more on that next sentence).(1 | visit) will allow random intercepts for each visit (random intercepts would model data where different visits had a higher or lower mean memscore), but not random slopes; I don't think that you want random slopes for the sleep parameters by visit - were they only measured one time per visit? If so, there would be no slope variation to model, I believe.
Hope that helps! I found this book very useful:
Snijders, T. A. B., & Bosker, R. J. (2012). Multilevel analysis: an introduction to basic and advanced multilevel modeling (2nd ed.): Sage.

producing a 2 class label prediction with neuralnet package in R

I am working with the neuralnet package in R (I am more familiar with nnet).
My target variable is a 2 label class. (Phone_Sales 1/0). I have a train and test set. Also, all variable were normalized to [0,1] scale.
My nn model is:
wireless_model <- neuralnet(formula = Phone_sale ~ Topflight + Balance +
Qual_miles + cc1_miles. + cc2_miles. +
cc3_miles. + Bonus_miles + Bonus_trans +
Flight_miles_12mo + Flight_trans_12 +
Online_12 + Email + Club_member + Any_cc_miles_12mo,
data = wireless_train, hidden=1, linear.output=FALSE)
the predicted results from wireless_model$net.result are produced as floats between 0 and 1 (in fact almost all hover very close to zero). ie .07 and .21, etc instead of 1 or 0.
So obviously when I compare my train to my test- my prediction is bad b/c of the two different types of DV.
I want the predicted results to be in the form of either 1 or 0. I am sure I did not use specify a correct setting somewhere in the neuralnet package.
A guess is that I may need to set the "family" in the formula for logistic so I get on 1 or 0 output. But not sure how that works in this package.
Any help?

Resources