Treatment of categorical variables in rpart

I wonder how rpart treats categorical variables. Several references suggest that for unordered factors it searches through all combinations of levels. Even the vignette states, at the end of section 6.2:
(F)or a categorical predictor with m levels, all 2^(m−1) different possible
splits are tested.
However, given my experience with the code, I find this difficult to believe. As supporting evidence, the vignette notes that running
rpart(Reliability ~ ., data=car90)
takes a really long, long time. In my case, however, it runs in seconds, despite the data having an unordered factor variable with 30 levels.
To demonstrate the issue further, I created several variables with 52 levels each, meaning that 2^51 - 1 ≈ 2.25 × 10^15 splits would need to be checked per variable if all possibilities were explored. The following code runs in about a minute, which in my opinion proves that not all combinations are checked.
library(rpart)

NROW <- 50000
NVAR <- 20
# 20 categorical columns, each with 52 levels (a-z and A-Z)
rand_letters <- data.frame(replicate(NVAR, as.factor(c(
  letters[sample.int(26, floor(NROW / 2), replace = TRUE)],
  LETTERS[sample.int(26, ceiling(NROW / 2), replace = TRUE)]))),
  stringsAsFactors = TRUE)  # R >= 4.0 no longer converts strings to factors by default
rand_letters$target <- rbinom(n = NROW, size = 1, prob = 0.1)
system.time({
  tree_letter <- rpart(target ~ ., data = rand_letters, cp = 0.0003)
})
tree_letter
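For scale, the split counts mentioned above can be checked with a line of arithmetic. The sketch below assumes the usual count of 2^(m-1) - 1 distinct ways to divide m unordered levels into two non-empty groups; n_splits is a hypothetical helper, not an rpart function.

```r
# Number of distinct ways to split m unordered levels into two non-empty groups
n_splits <- function(m) 2^(m - 1) - 1

n_splits(30)  # a factor with 30 levels, as in the car90 case described above
n_splits(52)  # a factor with 52 levels, as in the simulated data
```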
What combinations of categorical variables are ACTUALLY checked in rpart?

I know it is an old question but I found this link that might answer some of it.
Bottom line is that rpart seems to be applying a simple algorithm:
First, sort the levels x_i by their conditional means, p_i = E(Y|X = x_i).
Then compute the Gini index for each split of that ordering into two groups of adjacent levels.
Pick the split whose two groups give the maximum of these Gini indices.
So it should not be nearly as computationally expensive.
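The steps above can be sketched in base R for a 0/1 outcome. This is only an illustration of the ordering idea, not rpart's actual (C) implementation; best_ordered_split is a hypothetical helper, and it minimizes the weighted Gini impurity of the two child nodes, which is equivalent to maximizing the impurity reduction. Only m - 1 candidate splits are scanned instead of all 2^(m-1) - 1.

```r
# Hypothetical helper: best binary split of a factor x for a 0/1 outcome y,
# scanning only the m - 1 splits that respect the ordering of the level means.
best_ordered_split <- function(x, y) {
  x <- droplevels(as.factor(x))
  n_k <- as.vector(table(x))             # observations per level
  s_k <- as.vector(tapply(y, x, sum))    # number of 1s per level
  ord <- order(s_k / n_k)                # levels sorted by p_i = E(Y | X = x_i)
  lev <- levels(x)[ord]
  k <- seq_len(length(lev) - 1)          # candidate cut points in that ordering
  n_left <- cumsum(n_k[ord])[k]
  s_left <- cumsum(s_k[ord])[k]
  n_right <- length(y) - n_left
  s_right <- sum(y) - s_left
  gini <- function(s, n) 2 * (s / n) * (1 - s / n)  # Gini impurity of a 0/1 node
  # Weighted impurity of the two children; smaller is better
  impurity <- (n_left * gini(s_left, n_left) +
               n_right * gini(s_right, n_right)) / length(y)
  best <- which.min(impurity)
  list(left = lev[seq_len(best)], right = lev[-seq_len(best)],
       impurity = impurity[best])
}

# Toy data: a 5-level factor with different success probabilities per level
set.seed(1)
x <- factor(sample(letters[1:5], 200, replace = TRUE))
y <- rbinom(200, 1, prob = c(0.1, 0.5, 0.3, 0.8, 0.2)[as.integer(x)])
best_ordered_split(x, y)
```

For a two-class outcome with the Gini criterion, the best split found this way matches the one an exhaustive search over all subsets would find, which is why the shortcut costs nothing in split quality in that setting.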
However, I personally have a case with a single categorical variable, whose categories are US states, where rpart times out when trying to use it to produce a classification tree. Creating dummy variables and running rpart with the 51 variables (one for each state) works fine.

Related

Latent class growth modelling in R/flexmix with multinomial outcome variable

How to run Latent Class Growth Modelling (LCGM) with a multinomial response variable in R (using the flexmix package)?
And how to stratify each class by a binary/categorical dependent variable?
The idea is to let gender shape the growth curve by cluster (cf. Mikolai and Lyons-Amos (2017, p. 194/3), where the stratification is done by education; they used Mplus).
I think I might have come close with the following syntax:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + gender:age)
lcgm <- flexmix::stepFlexmix(. ~ . | id,
  data = d,
  k = nr_of_classes, # would be 1:12 in real analysis
  nrep = 1, # would be 50 in real analysis to avoid local maxima
  control = list(iter.max = 500, minprior = 0),
  model = flexmix::FLXMRmultinom(lcgm_formula, varFix = TRUE, fixed = ~0))
which is close to what Wardenaar (2020, p. 10) suggests in his methodological paper for a continuous outcome:
stepFlexmix(. ~ . | ID, k = 1:4, nrep = 50, model = FLXMRglmfix(y ~ time, varFix = TRUE), data = mydata, control = list(iter.max = 500, minprior = 0))
The only difference is that FLXMRmultinom probably does not support the varFix and fixed parameters, although adding them does produce different results. The binomial equivalent of FLXMRmultinom in flexmix might be FLXMRglm (with family = "binomial") as opposed to FLXMRglmfix, so I suspect that the restrictions of the LCGM (e.g. fixed slope and intercept per class) are not specified the way they should be.
The results are otherwise sensible, but the model fails to put men and women with similar trajectories in the same classes (below are the fitted probabilities for each relationship status in each class by gender):
We should have the following matches by cluster and gender...
1<->1
2<->2
3<->3
...but instead we have
1<->3
2<->1
3<->2
That is, if for example men in class one and women in class three were forced into the same group, the members of that group would be more similar to each other than those in the current first row of the plot grid.
Here is the full MVE to reproduce the code.
I got similar results with another dataset, with a different number of classes and up to 50 iterations per class. I have tried two alternative ways to predict the probabilities, with identical results. I conclude that the problem is most likely in the model specification (stepFlexmix(..., model = FLXMRmultinom(...))) or that this is some sort of label switching issue.
If the model is specified correctly and the issue is that similar trajectories for men and women end up in different classes, is there a way to fix that, for example by restricting the parameters?
Any assistance will be highly appreciated.

This seems to be an identifiability issue that is apparently common in mixture modelling. In other words, the labels are switched: while there might not be a problem with the modelling as such, men and women end up in different groups, and that has to be dealt with one way or another.
In the new linked code, I have swapped the order manually and calculated the predictions by hand.
I will be happy to hear if someone has an alternative approach to deal with the label switching issue (like restricting parameters or switching labels algorithmically). I am also curious whether the model could or should be specified in some other way.
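One way to switch labels algorithmically, offered only as a sketch: given the fitted trajectories per class for each gender, search for the permutation of one gender's class labels that brings its trajectories closest to the other's. The helpers all_perms and match_labels and the toy matrices below are hypothetical and are not part of flexmix; the matrices stand in for fitted probabilities with one row per class.

```r
# Hypothetical helper: all permutations of 1..k as rows of a matrix
# (only practical for a small number of classes).
all_perms <- function(k) {
  if (k == 1) return(matrix(1, 1, 1))
  prev <- all_perms(k - 1)
  # Insert k at every possible position of every shorter permutation
  do.call(rbind, lapply(seq_len(k), function(pos)
    t(apply(prev, 1, append, values = k, after = pos - 1))))
}

# Hypothetical helper: returns p such that row p[i] of `other` is the best
# match for row i of `ref` (least total squared distance).
match_labels <- function(ref, other) {
  perms <- all_perms(nrow(ref))
  cost <- apply(perms, 1, function(p) sum((ref - other[p, , drop = FALSE])^2))
  perms[which.min(cost), ]
}

# Toy trajectories: rows are classes, columns are time points
men <- rbind(c(0.1, 0.2, 0.3),
             c(0.5, 0.5, 0.5),
             c(0.9, 0.8, 0.7))
women <- men[c(2, 3, 1), ] + 0.01   # same shapes, labels shuffled
match_labels(men, women)
```

In this toy example the result is 3 1 2, i.e. class 1 for men pairs with class 3 for women, class 2 with class 1 and class 3 with class 2, the same kind of mismatch as described in the question.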
A few remarks:
I believe that this is indeed performing an LCGM, as we do not specify random effects for the slopes or intercepts. I therefore assume that intercepts and slopes are fixed within classes for both sexes, which would mean that the model performs LCGM as intended. By the same token, it seems that running a GMM with a random intercept, slope or both is not possible.
Since we are calculating the predictions by hand, we need to be able to separate parameters between the sexes. Therefore I also added an interaction term gender x age^2. The calculations seem to slow down somewhat, but the estimates are similar to the original. It also makes sense conceptually to include the interaction for age^2 if we already have it for age.
varFix=T, fixed = ~0 seem to be redundant: specifying them does not change anything. The subsampling procedure (of my real data) was unaffected by the set.seed() command for some reason.
The new model specification becomes:
lcgm_formula <- as.formula(rel_stat ~ age + I(age^2) + gender + age:gender + I(age^2):gender)
lcgm <- flexmix::flexmix(. ~ . | id,
  data = d,
  k = nr_of_classes, # would be 1:12 in real analysis
  # nrep = 1, # would be 50 in real analysis to avoid local maxima (and we would use the stepFlexmix function instead)
  control = list(iter.max = 500, minprior = 0),
  model = flexmix::FLXMRmultinom(lcgm_formula))

How to run a multinomial logit regression with both individual and time fixed effects in R

Long story short:
I need to run a multinomial logit regression with both individual and time fixed effects in R.
I thought I could use the packages mlogit and survival for this purpose, but I cannot find a way to include fixed effects.
Now the long story:
I have found many questions on this topic on various Stack-related websites, but none of them provided an answer. I have also noticed a lot of confusion about what a multinomial logit regression with fixed effects is (people use different names) and about the R packages implementing it.
So I think it would be beneficial to provide some background before getting to the point.
Consider the following.
In a multiple choice question, each respondent makes one choice.
Respondents are asked the same question every year. There is no a priori assumption about the extent to which the choice at time t is affected by the choice at t-1.
Now imagine having panel data recording these choices. The data would look like this:
set.seed(123)
# number of respondents
n <- 100
# possible choices
possible_choice <- letters[1:4]
# number of years
years <- 3
# individual characteristics (one value per respondent and year)
x1 <- runif(n * years, 5.0, 70.5)
x2 <- sample(1:n^2, n * years, replace = FALSE)
# actual choices in years 1 to 3
actual_choice_year_1 <- possible_choice[sample(1:4, n, replace = TRUE, prob = rep(1/4, 4))]
actual_choice_year_2 <- possible_choice[sample(1:4, n, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))]
actual_choice_year_3 <- possible_choice[sample(1:4, n, replace = TRUE, prob = c(0.2, 0.5, 0.2, 0.1))]
# create long dataset (one row per respondent and year)
df <- data.frame(choice = c(actual_choice_year_1, actual_choice_year_2, actual_choice_year_3),
                 x1 = x1, x2 = x2,
                 individual_fixed_effect = as.character(rep(1:n, years)),
                 time_fixed_effect = as.character(rep(1:years, each = n)),
                 stringsAsFactors = FALSE)
I am new to this kind of analysis. But if I understand correctly, if I want to estimate the effects of respondents' characteristics on their choice, I may use a multinomial logit regression.
In order to take advantage of the longitudinal structure of the data, I want to include in my specification individual and time fixed effects.
To the best of my knowledge, the multinomial logit regression with fixed effects was first proposed by Chamberlain (1980, Review of Economic Studies 47: 225–238). Recently, Stata users have been provided with the routines to implement this model (femlogit).
In the vignette of the femlogit package, the author refers to the R function clogit, in the survival package.
According to the help page, clogit requires data to be rearranged in a different format:
library(mlogit)
# reshape the data; shape = "wide" describes the input (one row per choice situation)
data_mlogit <- mlogit.data(df, id.var = "individual_fixed_effect",
                           group.var = "time_fixed_effect",
                           choice = "choice",
                           shape = "wide")
Now, if I understand correctly how clogit works, fixed effects can be passed through the function strata (see this tutorial for additional details). However, it is not clear to me how to use this function, as no coefficient values are returned for the individual characteristic variables (i.e. I get only NAs).
library(survival)
fit <- clogit(formula("choice ~ alt + x1 + x2 + strata(individual_fixed_effect, time_fixed_effect)"), as.data.frame(data_mlogit))
summary(fit)
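One possible reason for the NAs, offered as an assumption rather than a diagnosis of this exact call: in the reshaped data each stratum (one individual in one year) holds one row per alternative, and x1 and x2 take the same value in every row of that stratum. A conditional logit conditions on the stratum, so a covariate with no variation inside a stratum cannot be identified. The base R sketch below rebuilds a small table of that shape by hand (the names situations, long and stratum are hypothetical) and checks the within-stratum variation.

```r
set.seed(123)
n <- 5       # respondents (kept small for illustration)
years <- 3
alts <- letters[1:4]

# One row per choice situation (respondent x year), with an individual covariate
situations <- expand.grid(id = seq_len(n), year = seq_len(years))
situations$x1 <- runif(nrow(situations), 5.0, 70.5)

# One row per (choice situation, alternative): merge() without common
# columns returns the Cartesian product
long <- merge(situations, data.frame(alt = alts))
long$stratum <- interaction(long$id, long$year, drop = TRUE)

# Range of x1 inside each stratum: zero everywhere means no within-stratum variation
within_range <- tapply(long$x1, long$stratum, function(v) diff(range(v)))
all(within_range == 0)
```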
Since I was not able to find a reason for this (there must be something that I am missing on the way these functions are estimated), I have looked for a solution using other packages in R: e.g., glmnet, VGAM, nnet, globaltest, and mlogit.
Only the latter seems to be able to deal explicitly with panel structures using an appropriate estimation strategy. For this reason, I decided to give it a try. However, I was only able to run a multinomial logit regression without fixed effects.
# state formula
formula_mlogit <- formula("choice ~ 1| x1 + x2")
# run multinomial regression
fit <- mlogit(formula_mlogit, data_mlogit)
summary(fit)
If I understand correctly how mlogit works, here's what I have done.
By using the function mlogit.data, I have created a dataset compatible with the function mlogit. Here, I have also specified the id of each individual (id.var = "individual_fixed_effect") and the group to which each individual belongs (group.var = "time_fixed_effect"). In my case, the group represents the observations registered in the same year.
My formula specifies that there are no variables that are correlated with a specific choice and randomly distributed among individuals (i.e., the variables before the |). By contrast, choices are only motivated by individual characteristics (i.e., x1 and x2).
In the help page of the function mlogit, it is specified that one can use the argument panel to use panel techniques. Setting panel = TRUE is what I am after here.
The problem is that panel can be set to TRUE only if another argument of mlogit, i.e. rpar, is not NULL.
The argument rpar is used to specify the distribution of the random variables: i.e. the variables before the |.
The problem is that, since these variables do not exist in my case, I can't use the argument rpar and then set panel = TRUE.
An interesting question related to this is here. A few suggestions were given, and one seems to go in my direction. Unfortunately, no examples that I can replicate are provided, and I do not understand how to follow this strategy to solve my problem.
Moreover, I am not particularly interested in using mlogit; any efficient way to perform this task would be fine for me (e.g., I am OK with survival or other packages).
Do you know any solution to this problem?
Two caveats for those interested in answering:
I am interested in fixed effects, not in random effects. However, if you believe there is no other way to take advantage of the longitudinal structure of my data in R (there is indeed in Stata but I don't want to use it), please feel free to share your code.
I am not interested in going Bayesian. So if possible, please do not suggest this approach.

How to filter independent variables in decision-tree in R with rpart or party package

I am a SAS user and am currently studying how to build decision trees using R packages.
I have good findings associated with each node, but now I'm facing 3 questions:
Can I start with a specific variable (top-to-bottom), say, a categorical variable like gender? (I did this in FICO Model Builder, but I don't have it anymore.)
I have a binary variable (gender: 1 = Male, 0 = Female), but the nodes split at 0.5. (I tried changing it to a factor, but that didn't work.) Also, I have a variable "AGE"; should I change its type to "xxx" instead of "numeric"?
Based on the cp values (below table), I set cp to 0.0128 to prune the tree, but only two variables are left. Can I choose to keep specific variables? (I did play with the cp values, but the result does not change.)
# tree
library(rpart)
library(party)
library(rpart.plot)
# 1
minsplit <- 60
ct <- rpart.control(xval = 10, minsplit = minsplit,
                    minbucket = minsplit / 3, cp = 0.01)
iris_tree <- rpart(Overday_E60dlq ~ .,
                   data = x, method = "class",
                   parms = list(prior = c(0.65, 0.35), split = "information"),
                   control = ct)
# plot split
plot_tris <- rpart.plot(iris_tree, branch = 1, branch.type = 1, type = 2,
                        extra = 103,
                        shadow.col = "gray", box.col = "green",
                        border.col = "blue", split.col = "red",
                        cex = 0.65, main = "Kyphosis-tree")
plot_tris
# summary
summary(iris_tree)
# =========== prune process =========
printcp(iris_tree)
# min-xerror cp:
fitcp <- prune(iris_tree,
               cp = iris_tree$cptable[which.min(iris_tree$cptable[, "xerror"]), "CP"])
# cp table
fit2 <- prune(fitcp, cp = 0.0128)
# plot fit2
rpart.plot(fit2, branch = 1, branch.type = 1, type = 2, extra = 103,
           shadow.col = "gray", box.col = "green",
           border.col = "blue", split.col = "red",
           cex = 0.65, main = "Kyphosis fit2")

I don't think that any of the more popular tree packages in R has a built-in option for specifying fixed initial splits. The partykit package (successor to the party package), however, has infrastructure that can be leveraged to put together such trees with a little bit of effort; see: How to specify split in a decision tree in R programming?
You should use factor variables for unordered categorical covariates (like gender), ordered factors for ordinal covariates, and numeric or integer for numeric covariates. Note that this may matter not only in the visual display but also in the recursive partitioning itself. With an exhaustive search algorithm like rpart/CART it is not relevant, but for unbiased inference-based algorithms like ctree or mob this may be an important difference.
Cost-complexity pruning does not allow you to keep specific covariates. It is a measure for the overall tree, not for individual variables.
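To illustrate the point about variable types, here is a minimal sketch on simulated data (the data frame d and its columns are made up for the example, not taken from the question). With gender stored as 0/1 numeric, rpart reports the split as a numeric threshold at 0.5; with a factor, the split is reported in terms of the level labels. The fitted partition should be the same in both cases.

```r
library(rpart)

set.seed(1)
# Made-up data: the outcome depends on gender and age
d <- data.frame(gender = rbinom(200, 1, 0.5),
                age = round(runif(200, 18, 80)))
d$y <- factor(ifelse(d$gender == 1 & d$age > 40, "late", "ok"))

# gender as numeric 0/1: the split is printed as a threshold on gender
fit_num <- rpart(y ~ gender + age, data = d, method = "class")

# gender as factor: the split is printed in terms of Female/Male
d$gender_f <- factor(d$gender, levels = c(0, 1), labels = c("Female", "Male"))
fit_fac <- rpart(y ~ gender_f + age, data = d, method = "class")

print(fit_num)
print(fit_fac)
```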

Speeding up the felm command in R (lfe library)

I am using felm from the lfe package and am running into serious speed issues with a large data set. By large I mean 100 million rows. My data consist of one dependent variable and five categorical variables (factors). I am running regressions with no covariates, only factors.
The felm algorithm does not converge. I also tried some of the tricks used in this short article, but it did not help. My code is as follows:
library(lfe)
my_data=read.csv("path_to//data.csv")
attach(data.frame(my_data))
lev1 = unique(my_data$fac1)
my_data$fac1 <- factor(my_data$fac1, levels = lev1)
lev2 = unique(my_data$fac2)
my_data$fac2 <- factor(my_data$fac2, levels = lev2)
lev3 = unique(my_data$fac3)
my_data$fac3 <- factor(my_data$fac3, levels = lev3)
Now I run the regression, without covariates (because I'm only interested in the residuals) and with interactions, as follows:
est <- felm(y ~ 0|fac1:fac2+fac1:fac3, my_data)
This line takes forever and does not converge. Note that the dimensions are as follows:
fac1 has about 6000 unique values
fac2 has about 100 unique values
fac3 has about 10 unique values
(and remember there are 100 million rows). I suspect there must be something wrong with how I use the factors, because I imagine that R should be able to handle such sizes (Stata's reghdfe command handles it without problems). Any suggestions are highly appreciated.
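As a rough illustration of what is being computed: with no covariates, the residuals are what remains of y after the group means of both grouping factors have been swept out. The lfe documentation describes felm as using a method of alternating projections for this; the base R sketch below shows the idea on toy data and is not lfe's implementation. demean_two_way is a hypothetical helper; for the model in the question, the two grouping factors would correspond to interaction(fac1, fac2) and interaction(fac1, fac3).

```r
# Hypothetical helper: sweep out the means of two grouping factors by
# alternating projections until the residuals stop changing.
demean_two_way <- function(y, f1, f2, tol = 1e-10, max_iter = 1000) {
  r <- y
  for (i in seq_len(max_iter)) {
    r_old <- r
    r <- r - ave(r, f1)   # remove group means of f1
    r <- r - ave(r, f2)   # remove group means of f2
    if (max(abs(r - r_old)) < tol) break
  }
  r
}

# Toy data with two crossed factors
set.seed(42)
n <- 500
g1 <- factor(sample(1:10, n, replace = TRUE))
g2 <- factor(sample(1:7, n, replace = TRUE))
y <- rnorm(n) + as.integer(g1) + 0.5 * as.integer(g2)

r <- demean_two_way(y, g1, g2)
# Should agree with the residuals of the dummy-variable regression
max(abs(r - residuals(lm(y ~ g1 + g2))))
```

How many iterations such a scheme needs depends on how the factors overlap, which is one plausible place to look when convergence is slow on real data.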

Fixing a coefficient on variable in MNL [duplicate]

In R, how can I set weights for particular variables and not observations in lm() function?
The context is as follows. I'm trying to build a personal ranking system for particular products, say, phones. I can build a linear model with price as the dependent variable and other features such as screen size, memory, OS and so on as independent variables. I can then use it to predict a phone's real cost (as opposed to its declared price), thus finding the best price/goodness ratio. This is what I have already done.
Now I want to "highlight" some features that are important to me only. For example, I may need a phone with a large memory, so I want to give it a higher weight so that the linear model is optimized for the memory variable.
The lm() function in R has a weights parameter, but these are weights for observations and not variables (correct me if this is wrong). I also tried to play around with the formula, but got only interpreter errors. Is there a way to incorporate weights for variables in lm()?
Of course, lm() is not the only option. If you know how to do it with other similar solutions (e.g. glm()), that is fine too.
UPD. After a few comments I understood that the way I was thinking about the problem was wrong. A linear model obtained by a call to lm() gives optimal coefficients for the training examples, and there's no way (and no need) to change the weights of variables; sorry for the confusion. What I'm actually looking for is a way to change coefficients in an existing linear model to manually make some parameters more important than others. Continuing the previous example, let's say we've got the following formula for price:
price = 300 + 30 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
This formula describes the best possible linear model for the dependence between price and phone parameters. However, now I want to manually change the number 30 in front of the memory variable to, say, 60, so it becomes:
price = 300 + 60 * memory + 56 * screen_size + 12 * os_android + 9 * os_win8
Of course, this formula no longer reflects the optimal relationship between price and phone parameters. Also, the dependent variable no longer shows the actual price, just some value of goodness, taking into account that memory is twice as important to me as to the average person (based on the coefficients from the first formula). But this value of goodness (or, more precisely, the ratio goodness/price) is just what I need: having this, I can find the best (in my opinion) phone at the best price.
I hope all of this makes sense. Now I have one (probably very simple) question. How can I manually set coefficients in an existing linear model obtained with lm()? That is, I'm looking for something like:
coef(model)[2] <- 60
This code doesn't work, of course, but you should get the idea. Note: it is obviously possible to just double the values in the memory column of the data frame, but I'm looking for a more elegant solution that affects the model, not the data.

The following code is a bit complicated because lm() minimizes the residual sum of squares, and with a fixed, non-optimal coefficient it is no longer minimal. Fixing one coefficient therefore goes against what lm() is trying to do, and the only way is to fix all the remaining coefficients too.
To do that, we have to know the coefficients of the unrestricted model first. All the adjustments have to be made by changing the formula of your model, e.g. we have
price ~ memory + screen_size, and of course there is a hidden intercept. Neither changing the data directly nor using I(c*memory) is a good idea. I(c*memory) is like a temporary change of the data too, and changing only one coefficient by transforming the variables would be much more difficult.
So first we change price ~ memory + screen_size to price ~ offset(c1*memory) + offset(c2*screen_size). But we haven't modified the intercept, which would now be re-estimated to minimize the residual sum of squares and could differ from the one in the original model. The final step is to remove the intercept and to add a new, fake variable as an offset, i.e. a constant vector with the same number of observations as the other variables:
price ~ offset(c1*memory) + offset(c2*screen_size) + offset(rep(c0, length(memory))) - 1
# Function to fix coefficients
setCoeffs <- function(frml, weights, len) {
  # One offset(<coefficient>*<term>) per right-hand-side term
  el <- paste0("offset(", weights[-1], "*",
               unlist(strsplit(as.character(frml)[-(1:2)], " +\\+ +")), ")")
  # The intercept becomes a constant offset of length len
  el <- c(paste0("offset(rep(", weights[1], ",", len, "))"), el)
  as.formula(paste(as.character(frml)[2], "~",
                   paste(el, collapse = " + "), " + -1"))
}
# Example data
df <- data.frame(x1 = rnorm(10), x2 = rnorm(10, sd = 5),
                 y = rnorm(10, mean = 3, sd = 10))
# Writing formula explicitly
frml <- y ~ x1 + x2
# Basic model
mod <- lm(frml, data = df)
# Prime coefficients and any modifications. Note that "weights" contains
# intercept value too
weights <- mod$coef
# Setting coefficient of x1. All the rest remain the same
weights[2] <- 3
# Final model
mod2 <- update(mod, setCoeffs(frml, weights, nrow(df)))
# It is fine that mod2 returns "No coefficients"
Also, you are probably going to use mod2 only for forecasting (I don't know where else it could be used), so that can be done in a simpler way, without setCoeffs:
# Data for forecasting with e.g. price unknown
df2 <- data.frame(x1 = rpois(10, 10), x2 = rpois(10, 5), y = NA)
mat <- model.matrix(frml, model.frame(frml, df2, na.action = NULL))
# Forecasts: each column of mat multiplied by its coefficient, then summed by row
rowSums(t(t(mat) * weights))

It looks like you are doing optimization, not model fitting (though there can be optimization within model fitting). You probably want something like the optim function, or look into linear or quadratic programming (the linprog and quadprog packages).
If you insist on using modeling tools like lm, then use offset in the formula to specify your own multiplier rather than estimating one.
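A minimal sketch of the offset approach on simulated data (the data frame and the fixed value 3 are made up for illustration). Note that this differs from the setCoeffs approach in the other answer: here only the coefficient of x1 is held fixed, while the intercept and the coefficient of x2 are re-estimated given that constraint.

```r
set.seed(1)
df <- data.frame(x1 = rnorm(50), x2 = rnorm(50, sd = 5))
df$y <- 3 + 2 * df$x1 + 0.5 * df$x2 + rnorm(50)

# Coefficient of x1 fixed at 3 via offset(); intercept and x2 are estimated
fit_fixed <- lm(y ~ x2 + offset(3 * x1), data = df)
coef(fit_fixed)

# Equivalent formulation: move the fixed part to the left-hand side
fit_check <- lm(I(y - 3 * x1) ~ x2, data = df)
coef(fit_check)
```

Predictions from fit_fixed include the offset term, so predict(fit_fixed, newdata) needs x1 in newdata even though no coefficient is reported for it.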
