What does "~0" mean in the R model matrix?

I am trying to understand the model matrix in R (model.matrix), which I want to use to convert categorical variables to dummy variables, and came across the following code:
# Option 2: use model.matrix() to convert all categorical variables in the data frame into a set of dummy variables. We must then turn the resulting data matrix back into
# a data frame for further work.
xtotal <- model.matrix(~ 0 + REMODEL, data = df)
xtotal <- as.data.frame(xtotal)
Can someone please help me understand what "~0" means here, and what the code is trying to do?

The 0 in the formula (written ~ 0 + REMODEL here, equivalently ~ REMODEL + 0) means that the model will not have an intercept, i.e., a column of 1s. In the presence of a factor variable, when an intercept is present, one of the levels is dropped so that the model matrix is full rank, which ordinary least squares regression requires. When the intercept is absent, all levels of the factor can remain.
So, this code is a way to convert a factor to a matrix where dummy variables for all the levels are present. Omitting the 0 would replace one of the dummies with a column of 1s, which may not be useful for your purpose.
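To make the difference concrete, here is a minimal sketch on a toy data frame. The column name REMODEL follows the question; the three levels are made up for illustration.

```r
# Toy data: the levels "None", "Old", "Recent" are assumptions for this example.
df <- data.frame(REMODEL = factor(c("None", "Old", "Recent", "Old")))

# Default formula (with intercept): the first level is absorbed into the
# "(Intercept)" column, so only two dummy columns remain.
with_intercept <- model.matrix(~ REMODEL, data = df)
colnames(with_intercept)
# "(Intercept)" "REMODELOld" "REMODELRecent"

# With ~ 0 + ...: no intercept column, one dummy column per level.
no_intercept <- model.matrix(~ 0 + REMODEL, data = df)
colnames(no_intercept)
# "REMODELNone" "REMODELOld" "REMODELRecent"
```

In the second matrix every row contains exactly one 1, which is the usual "one-hot" encoding.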

Related

What does a proportional matrix look like for glmnet response variable in R?

I'm trying to use glmnet to fit a GLM that has a proportional response variable (using the family="binomial").
The help file for glmnet says that the response variable:
"For family="binomial" should be either a factor with
two levels, or a two-column matrix of counts or proportions (the second column
is treated as the target class"
But I don't really understand how I would have a two column matrix. My variable is currently just a single column with values between 0 and 1. Can someone help me figure out how this needs to be formatted so that glmnet will run it properly? Also, can you explain what the target class means?
The response is a two-column matrix of counts: negative-label counts in the first column and positive-label counts in the second. For example, the code below fits a model for the proportion of Claims among Holders:
library(glmnet)
data = MASS::Insurance
# First column: holders without a claim; second column: claims (the target class)
y_counts = cbind(data$Holders - data$Claims, data$Claims)
x = model.matrix(~District+Age+Group,data=data)
fit1 = glmnet(x=x,y=y_counts,family="binomial",lambda=0.001)
If possible, you should go back to the step before the response variable was calculated and retrieve these counts. If that is not possible, you can provide a matrix of proportions, with the second column holding the success proportion, but this assumes the weight (n) is the same for all observations:
y_prop = y_counts / rowSums(y_counts)
fit2 = glmnet(x=x,y=y_prop,family="binomial",lambda=0.001)
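If you only kept the proportions but still know how many trials each one is based on, the count matrix can be rebuilt. This is a small sketch with made-up numbers; prop and n are hypothetical names, not columns from the question.

```r
# Hypothetical inputs: prop is the observed proportion of successes,
# n is the number of trials behind each proportion.
prop <- c(0.10, 0.25, 0.50)
n    <- c(20, 40, 10)

# Recover integer counts; round() guards against floating-point noise.
successes <- round(prop * n)

# Failures first, successes second: the second column is the target class.
y_counts <- cbind(failure = n - successes, success = successes)
y_counts
```

A matrix built this way has the same layout as y_counts above, so it can be passed as y in the same glmnet call.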

Making BestNormalize to recognize different factor levels for better data transformation

I'm using the bestNormalize package to transform a variable with 5 factor levels (Groups). I use the following code to transform my data and see histograms and normality test results for the transformed data (nooutliers is my dataset, totalscore is my dependent variable, and Grade is a factor with 5 factor levels):
(BNobjectall <- bestNormalize(nooutliers$totalscore))
nooutliers$transformed <- predict(BNobjectall)
ggplot(nooutliers,aes(x=transformed, fill= Grade))+geom_histogram(binwidth=3)+facet_grid(~Grade)+theme_bw()
nooutliers %>%
summarise(statistic = shapiro.test(transformed)$statistic,
p.value = shapiro.test(transformed)$p.value)
My problem is that bestNormalize does not consider factor levels and finds the best transformation method as if this variable were a single group. As a result, the transformed dependent variable values for one of my factor levels do not become normal. When I create a subset just for this factor level and apply the same code, I get the desired result. However, I don't know how I can apply this same transformation (with the same values) to the other factor levels.
Is there a way for bestNormalize to consider factor levels or to apply the same transformation with same values to different subsets?
I'm not sure I fully understand your goal, but I can offer one thought.
bestNormalize requires "training" data, so you could consider training it on a single group, then applying it to the other groups using predict:
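# Trains the transformation on Grade 1 only
(BNobjectall <- bestNormalize(nooutliers$totalscore[nooutliers$Grade == 1]))
# Applies the Grade 1 transform to all data points (newdata is needed here;
# predict() without it returns only the transformed training data)
nooutliers$transformed <- predict(BNobjectall, newdata = nooutliers$totalscore)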
If you are trying to ensure normality within each factor level, you will need to do the subsetting as you've done already, and note that the normalization transformations may differ because the best normalizing transformation is different across groups. If you are trying to keep a consistent transformation across factor levels, then my advice would be to use a data-invariant transform such as a log or square-root transformation, or use the above approach.
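As a rough illustration of the data-invariant option, the sketch below applies one fixed square-root transform to every group and then checks normality within each grade rather than on the pooled data. The data are simulated; only the names nooutliers, totalscore and Grade come from the question.

```r
set.seed(1)
# Simulated stand-in for the asker's data: 5 grades, right-skewed scores.
nooutliers <- data.frame(
  Grade      = factor(rep(1:5, each = 30)),
  totalscore = rgamma(150, shape = 4, scale = 5)
)

# One fixed transform for every group (square root chosen only as an example).
nooutliers$transformed <- sqrt(nooutliers$totalscore)

# Shapiro-Wilk p-value within each grade, on the transformed scale.
p_by_grade <- tapply(nooutliers$transformed, nooutliers$Grade,
                     function(v) shapiro.test(v)$p.value)
p_by_grade
```

Because the transform does not depend on the data, the transformed values stay comparable across grades, although no single transform is guaranteed to make every group look normal.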

Missing Categories in Validation Data

I built a classification model in R based on a training dataset with 12 categorical predictors, where each variable holds tens to hundreds of categories.
The problem is that in the dataset I use for validation, some of the variables have fewer categories than in the training data.
For example, if variable v1 has 3 categories ('a','b','c') in the training data, it may have only 2 categories ('a','b') in the validation dataset.
In tree-based methods like decision trees or random forests this is not a problem, but in logistic regression methods (I use LASSO), which require preparing a matrix of dummy variables, the number of columns in the training data matrix and the validation data matrix doesn't match. Going back to the example of variable v1, in the training data I get three dummy variables for v1, and in the validation data I get only 2.
Any idea how to solve this?
You can try to avoid this problem by setting the levels correctly. Look at the following very stupid example:
set.seed(106)
thedata <- data.frame(
y = rnorm(100),
x = factor(sample(letters[1:3],100,TRUE))
)
head(model.matrix(y~x, data = thedata))
thetrain <- thedata[1:7,]
length(unique(thetrain$x))
head(model.matrix(y~x, data = thetrain))
I make a dataset with an x and a y variable, where x is a factor with 3 levels. The training dataset only has 2 levels of x, but the model matrix is still constructed correctly. That is because R kept the level data of the original dataset:
> levels(thetrain$x)
[1] "a" "b" "c"
The problem arises when your training set is somehow constructed using, e.g., the function data.frame() or any other method that drops the levels information of the factor.
Try the following:
thetrain$x <- factor(thetrain$x) # erases the levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain))
You see in the second line that the level "b" has been dropped, and consequently the model matrix is no longer what you want. So make sure that all factors in your training dataset actually have all levels, e.g.:
thetrain$x <- factor(thetrain$x, levels = c("a","b","c"))
On a sidenote: if you build your model matrices yourself using either model.frame() or model.matrix(), the argument xlev might be of help:
thetrain$x <- factor(thetrain$x) # erases the levels
levels(thetrain$x)
head(model.matrix(y~x, data = thetrain,
xlev = list(x = c('a','b','c'))))
Note that this xlev argument is actually from model.frame, and model.matrix doesn't call model.frame in every case. So that solution is not guaranteed to always work, but it should for data frames.
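Applied to the train/validation situation in the question, the same idea can be sketched as follows. The data frames train and valid and their contents are made up; only the variable name v1 and the categories come from the question.

```r
# Hypothetical data: v1 has three categories in training, two in validation.
train <- data.frame(v1 = factor(c("a", "b", "c", "a")))
valid <- data.frame(v1 = c("a", "b", "b"), stringsAsFactors = FALSE)

# Re-create the validation factor with the training levels, so the unused
# level "c" still gets its own (all-zero) dummy column.
valid$v1 <- factor(valid$v1, levels = levels(train$v1))

x_train <- model.matrix(~ v1, data = train)
x_valid <- model.matrix(~ v1, data = valid)

# Both matrices now have the same columns in the same order.
identical(colnames(x_train), colnames(x_valid))
```

One caveat: a category that appears only in the validation data becomes NA under this recoding, so it is worth checking for new NAs afterwards.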

SVD with missing values in R

I am performing an SVD analysis in R, but my matrix has structural NA values. Is it possible to obtain an SVD in this case? Are there alternative solutions? Thanks in advance.
You might want to try the SVDmiss function in the SpatioTemporal package, which imputes the missing values and computes the SVD of the imputed matrix.
However, you might want to be wary of the nature of your data and whether missing value imputation makes sense in your case.
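To show the general idea of "impute, then decompose", here is a minimal base-R sketch of iterative low-rank imputation. The helper svd_impute is hypothetical and is not the algorithm used by SVDmiss; it only illustrates the approach, and the rank has to be chosen by the user.

```r
# Hypothetical helper: fill missing cells from a rank-`rank` SVD reconstruction,
# repeating until the filled-in values stop changing.
svd_impute <- function(X, rank = 1, n_iter = 100, tol = 1e-8) {
  miss <- is.na(X)
  if (!any(miss)) return(list(X = X, svd = svd(X)))
  # Start by filling each missing cell with its column mean.
  X_fill <- X
  X_fill[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (i in seq_len(n_iter)) {
    s <- svd(X_fill, nu = rank, nv = rank)
    # Low-rank reconstruction of the currently filled matrix.
    X_hat <- s$u %*% diag(s$d[seq_len(rank)], nrow = rank) %*% t(s$v)
    change <- max(abs(X_fill[miss] - X_hat[miss]))
    X_fill[miss] <- X_hat[miss]   # observed cells are never modified
    if (change < tol) break
  }
  list(X = X_fill, svd = svd(X_fill))
}

# Example: a rank-1 matrix with two entries removed.
X <- outer(1:4, c(1, 2, 3))
X[2, 3] <- NA
X[4, 1] <- NA
res <- svd_impute(X, rank = 1)
round(res$X, 3)
```

Whether the filled-in values are meaningful depends on why the entries are missing, which is the same caveat as above for structural NAs.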
I have tried using SVM in R with NA values without success.
Sometimes missing values are important in the analysis, so I usually transform my data as follows:
If you have lots of variables, try to reduce their number (clustering, lasso, etc.).
Transform the remaining predictors like this:
- for quantitative variables:
    - calculate deciles per predictor (leaving missing obs out)
    - calculate frequency of Y per decile (assuming Y is qualitative)
    - regroup deciles on their Y freq similarity into 2/3/4 groups (you can do this by looking at their plot too)
    - create for each group a new binary variable (X11 = 1 if X1 takes values in the interval ...)
    - calculate Y frequency for missing obs of that predictor
    - join the missing obs category to the variable that has the closest Y freq
- for qualitative variables:
    - if you have variables with lots of levels you should do clustering by Y variable
    - for variables with fewer levels, you can calculate Y freq per class
    - regroup the classes like above
    - calculate the same thing for missing obs and attach it to the most similar group of non-missing
    - recode the variable as for the numeric case
There, now you have a complete database of dummy variables and the chance to perform SVM, neural networks, LASSO, etc...
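A simplified sketch of the quantitative-variable recipe, in base R on simulated data, might look like the following. It stops before the regrouping step: missing values are kept as their own category instead of being merged into the group with the closest Y frequency, and the names x1, y, bins and dummies are made up.

```r
set.seed(42)
x1 <- rnorm(200)
x1[sample(200, 20)] <- NA            # 20 missing observations
y  <- rbinom(200, 1, 0.3)            # binary outcome

# Decile bins computed on the non-missing values only.
breaks <- quantile(x1, probs = seq(0, 1, 0.1), na.rm = TRUE)
bins   <- cut(x1, breaks = breaks, include.lowest = TRUE)

# Treat NA as a category of its own and give it a readable label.
bins <- addNA(bins)
levels(bins)[is.na(levels(bins))] <- "missing"

# Frequency of Y per bin; similar bins could then be merged by hand.
freq <- tapply(y, bins, mean)

# One dummy column per bin, including the "missing" bin.
dummies <- model.matrix(~ 0 + bins)
dim(dummies)
```

The freq vector is what you would inspect (or plot) to decide which deciles to regroup and which group the missing category should join.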

Lasso, glmnet, preprocessing of the data

I'm trying to use the glmnet package to fit a lasso (L1 penalty) on a model with a binary outcome (a logit). My predictors are all binary (about 4000 unordered 1/0 variables) except for one continuous variable.
I need to convert the predictors into a sparse matrix, since it takes forever and a day otherwise.
My question is: it seems that people use sparse.model.matrix rather than just converting their matrix into a sparse matrix. Why is that, and do I need to do this here? The outcome is a little different for the two methods.
Also, do my factors need to be coded as factors (when it comes to both the outcome and the predictors) or is it sufficient to use the sparse matrix and specify in the glmnet model that the outcome is binomial?
Here's what I'm doing so far:
library(Matrix)
library(glmnet)
# Create a random dataset: y is the outcome, x_d holds the dummies (10 here for simplicity) and x_c is the continuous variable
y <- sample(c(1:0), 200, replace = TRUE)
x_d <- matrix(data = sample(c(1:0), 2000, replace = TRUE), nrow = 200, ncol = 10)
x_c <- sample(60:90, 200, replace = TRUE)
# FIRST: scale the one continuous variable
scaled <- scale(x_c, center = TRUE, scale = TRUE)
# then bind the predictors together
x <- cbind(x_d, scaled)
# HERE'S MY MAIN QUESTION: what I currently do is:
xt <- Matrix(x, sparse = TRUE)
# then run the cross-validation...
cv_lasso_1 <- cv.glmnet(xt, y, family = "binomial", standardize = FALSE)
# which gives slightly different results from this (sparse.model.matrix needs a data frame, so the outcome goes in alongside the predictors)
xt <- sparse.model.matrix(y ~ ., data = data.frame(y = y, x))
# then run CV.
So to sum up, my 2 questions are:
1- Do I need to use sparse.model.matrix even if my factors are just binary and not ordered? [And if yes, what does it actually do differently from just converting the matrix to a sparse matrix?]
2- Do I need to code the binary variables as factors?
The reason I ask is that my dataset is huge, and it saves a lot of time to skip coding them as factors.
I don't think you need sparse.model.matrix: all that it really gives you over a regular matrix is the expansion of factor terms, and if your predictors are already binary that won't give you anything. You certainly don't need to code them as factors; I frequently use glmnet on a regular (non-model) sparse matrix containing only 1s. At the end of the day glmnet is a numerical method, so a factor will get converted to a number in the end regardless.
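For numeric 0/1 columns, one visible difference between the two routes is the intercept column that the model-matrix formula adds by default. A small sketch on made-up data (the column names d1 to d4 are hypothetical):

```r
library(Matrix)

set.seed(1)
x <- matrix(sample(0:1, 40, replace = TRUE), nrow = 10, ncol = 4)
colnames(x) <- paste0("d", 1:4)   # hypothetical column names

# Route 1: convert the numeric matrix directly.
direct <- Matrix(x, sparse = TRUE)

# Route 2: go through the model-matrix machinery.
via_mm <- sparse.model.matrix(~ ., data = as.data.frame(x))

colnames(direct)   # the four dummy columns
colnames(via_mm)   # the same columns plus a leading "(Intercept)"

# Apart from the intercept column, the entries are identical.
all(as.matrix(via_mm[, -1]) == x)
```

Writing the formula as ~ 0 + . would drop the intercept column and make the two matrices match column for column.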
