Why does sparse.model.matrix() ignore some of the columns?

Can somebody help me, please? I want to prepare data for XGBoost prediction, so I need to encode factor variables. I use sparse.model.matrix(), but there is a problem: I don't know why the function ignores some of the columns. I'll try to explain. I have a dataset with many variables, but these 3 are the important ones here:
Tsunami.Event.Validity - Factor with 6 classes: -1,0,1,2,3,4
Tsunami.Cause.Code - Factor with 6 classes: 0,1,2,3,4,5
Total.Death.Description - Factor with 5 classes: 0,1,2,3,4
But when I use sparse.model.matrix() I get a matrix with only 15 columns, not 6+6+5=17 as expected. Can somebody give me advice?
sp_matrix = sparse.model.matrix(Deadly ~ Tsunami.Event.Validity + Tsunami.Cause.Code + Total.Death.Description -1, data = datas)
str(sp_matrix)
Output:
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:2510] 0 1 2 3 4 5 6 7 8 9 ...
..@ p : int [1:16] 0 749 757 779 823 892 1495 2191 2239 2241 ...
..@ Dim : int [1:2] 749 15
..@ Dimnames:List of 2
.. ..$ : chr [1:749] "1" "2" "3" "4" ...
.. ..$ : chr [1:15] "Tsunami.Event.Validity-1" "Tsunami.Event.Validity0" "Tsunami.Event.Validity1" "Tsunami.Event.Validity2" ...
..@ x : num [1:2510] 1 1 1 1 1 1 1 1 1 1 ...
..@ factors : list()
..$ assign : int [1:15] 0 1 1 1 1 1 2 2 2 2 ...
..$ contrasts:List of 3
.. ..$ Tsunami.Event.Validity : chr "contr.treatment"
.. ..$ Tsunami.Cause.Code : chr "contr.treatment"
.. ..$ Total.Death.Description: chr "contr.treatment"

This question is a duplicate of "In R, for categorical data with N unique categories, why does sparse.model.matrix() not produce a one-hot encoding with N columns?" ... but that question was never answered.
The answers to this question explain how you could get the full model matrix you're looking for, but don't explain why you might not want to. (For what it's worth, unlike regular linear models, regression trees are robust to multicollinearity, so a full model matrix would actually work in this case; but it's worth understanding why R gives you the answer it does, and why this won't hurt your predictive accuracy ...)
This is a fundamental property of the way that linear models based (additively) on more than one categorical predictor work (and hence the way that R constructs model matrices). When you construct a model matrix based on factors f1, ..., fn with numbers of levels n1, ..., nn the number of predictor variables is 1 + sum(ni-1), not sum(ni). Let's see how this works with a slightly simpler example:
xx <- expand.grid(A=factor(1:2), B = factor(1:2), C = factor(1:2))
model.matrix(~A+B+C-1, xx)
A1 A2 B2 C2
1 1 0 0 0
2 0 1 0 0
3 1 0 1 0
4 0 1 1 0
5 1 0 0 1
6 0 1 0 1
7 1 0 1 1
8 0 1 1 1
We have a total of (1 + 3*(2-1) =) 4 parameters.
The first parameter (A1) describes the expected mean in the baseline level of all factors (A=1, B=1, C=1). The second parameter (A2) describes the expected difference between an observation with A=1 and one with A=2 (independent of the other factors). Parameters 3 and 4 (B2, C2) describe the analogous differences between B1 and B2, and between C1 and C2.
You might be thinking "but I want predictor variables for all the levels of all the factors", e.g.
library(Matrix)  # for fac2sparse()
m <- do.call(cbind, lapply(xx, \(x) t(fac2sparse(x))))
dim(m)
## [1] 8 6
This has all six expected columns, not just four. But if you examine this matrix, or call rankMatrix(m) or caret::findLinearCombos(m), you'll discover that it is multicollinear. In a typical (fixed-effect) additive linear model, you can only estimate an intercept plus the differences between levels, not values associated with every level. In a regression tree model, the multicollinearity will make your computations slightly less efficient and will make results about variable importance confusing, but it shouldn't hurt your predictions.
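To see the rank deficiency concretely, here is a quick self-contained check (a small sketch; fac2sparse() and rankMatrix() are from the Matrix package):
library(Matrix)
xx <- expand.grid(A = factor(1:2), B = factor(1:2), C = factor(1:2))
m <- do.call(cbind, lapply(xx, function(x) t(fac2sparse(x))))
dim(m)        # 8 x 6: one column per level of each factor
rankMatrix(m) # rank 4 = 1 + 3*(2-1): two columns are linear combinations of the rest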

Related

How to optimize this process?

I have somewhat of a broad question, but I will try to make my intent as clear as possible so that people can make suggestions. I am trying to optimize a process. Generally, I am feeding a function a data frame of values and generating a prediction off of operations on specific columns; basically a custom function that is being used with sapply (code below). What I'm doing is much too large to provide any meaningful example, so instead I will try to describe the inputs to the process. I know this will restrict how helpful answers can be, but I am interested in any ideas for optimizing the time it takes me to compute a prediction. Currently it takes about 10 seconds to generate one prediction (to run the sapply for one line of the data frame).
mean_rating <- function(df) {
  user  <- df$user
  movie <- df$movie
  u_row <- which(U_lookup == user)[1]   # row of dfm for this user
  m_row <- which(M_lookup == movie)[1]  # column of dfm for this movie
  knn_match  <- knn_txt[u_row, 1:100]   # this user's 100 nearest neighbours
  knn_match1 <- as.numeric(unlist(knn_match))
  dfm_test <- dfm[knn_match1, ]
  dfm_mov  <- dfm_test[, m_row]         # row number from DFM associated with the query_movie
  mean(dfm_mov)
}
test <- sapply(1:nrow(probe_test), function(x) mean_rating(probe_test[x, ]))
Inputs:
dfm is my main data matrix, users in the rows and movies in the columns. Very sparse.
> str(dfm)
Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
..@ i : int [1:99072112] 378 1137 1755 1893 2359 3156 3423 4380 5103 6762 ...
..@ j : int [1:99072112] 0 0 0 0 0 0 0 0 0 0 ...
..@ Dim : int [1:2] 480189 17770
..@ Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..@ x : num [1:99072112] 4 5 4 1 4 5 4 5 3 3 ...
..@ factors : list()
probe_test is my test set, the set I'm trying to predict for. The actual probe test contains approximately 1.4 million rows, but I am trying it on a subset first to optimize the time. It is being fed into my function.
> str(probe_test)
'data.frame': 6 obs. of 6 variables:
$ X : int 1 2 3 4 5 6
$ movie : int 1 1 1 1 1 1
$ user : int 1027056 1059319 1149588 1283744 1394012 1406595
$ Rating : int 3 3 4 3 5 4
$ Rating_Date: Factor w/ 1929 levels "2000-01-06","2000-01-08",..: 1901 1847 1911 1312 1917 1803
$ Indicator : int 1 1 1 1 1 1
U_lookup is the lookup I use to convert between user id and the row of the matrix a user is in, since we lose user ids when they are converted to a sparse matrix.
> str(U_lookup)
'data.frame': 480189 obs. of 1 variable:
$ x: int 10 100000 1000004 1000027 1000033 1000035 1000038 1000051 1000053 1000057 ...
M_lookup is the lookup I use to convert between movie id and the column of the matrix a movie is in, for similar reasons as above.
> str(M_lookup)
'data.frame': 17770 obs. of 1 variable:
$ x: int 1 10 100 1000 10000 10001 10002 10003 10004 10005 ...
knn_txt contains the 100 nearest neighbors for all the rows of dfm
> str(knn_txt)
'data.frame': 480189 obs. of 200 variables:
Thank you for any advice you can provide to me.
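One hedged suggestion (a sketch, untested on the full data): most of the 10 seconds likely goes into the two which() scans, which search the full lookup tables for every row of probe_test. Assuming U_lookup$x and M_lookup$x hold the user and movie ids (as the str() output suggests), match() can precompute all row and column indices in one vectorized pass while producing the same means:
# Precompute every index once instead of scanning per row
u_rows <- match(probe_test$user, U_lookup$x)
m_cols <- match(probe_test$movie, M_lookup$x)
test <- vapply(seq_len(nrow(probe_test)), function(i) {
  nbrs <- as.numeric(unlist(knn_txt[u_rows[i], 1:100]))  # this user's neighbours
  mean(dfm[nbrs, m_cols[i]])                             # their ratings of the query movie
}, numeric(1))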

R anesrake issue with list names: non-numeric argument error

I am using anesrake to weight some survey data, but am getting a "non-numeric argument to binary operator" error. The error only occurs after I have added the names to the list to use as targets:
gender1<-c(0.516166000986901,0.483833999013099)
age<-c(0.15828262425613,0.364861110549873,0.429947760183493,0.0469085050104993)
mylist<-list(gender1,age)
names(mylist)<-c("gender1","age")
result<-anesrake(mylist,france,caseid=france$caseid, iterate=TRUE)
Error in x + weights : non-numeric argument to binary operator
In addition: Warning message:
In anesrake(targets, france, caseid = france$caseid, iterate = TRUE) :
Targets for age do not sum to 100%. Adjusting values to total 100%
This also says that the targets for age don't sum to 100%, which they do, so I'm also not sure what that's about. If I leave out the names(mylist) bit, I get the following error instead, presumably because R doesn't know which variables to use, but not the non-numeric error:
Error in selecthighestpcts(discrep1, inputter, pctlim) :
No variables are off by more than 5 percent using the method you have chosen, either weighting is unnecessary or a smaller pre-raking limit should be chosen.
The variables in the data frame have the same names as the targets in the list, and are numeric:
> str(france)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 993 obs. of 5 variables:
$ Gender :Classes 'labelled', 'numeric' atomic [1:993] 2 2 2 2 2 2 2 2 2 2 ...
.. ..- attr(*, "label")= chr "Gender"
$ Age2 : num 2 3 2 2 2 2 2 1 2 3 ...
$ gender1: num 2 2 2 2 2 2 2 2 2 2 ...
$ caseid : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : num 2 3 2 2 2 2 2 1 2 3 ...
I have also tried converting gender1 and age to factor variables (as the numbers represent levels of each variable: gender has 2, age has 4), but with the same result. I have used anesrake successfully before, so there must be something I am missing, but I cannot see it! Any help greatly appreciated....
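Two hedged guesses, not confirmed fixes. First, the four age targets sum to 0.999999999999995 rather than exactly 1 because of decimal rounding, which would explain the adjustment warning. Second, france is a tbl_df, and anesrake predates tibbles; single-bracket indexing on a tibble returns another tibble rather than a vector, which can surface inside older packages as "non-numeric argument to binary operator". A sketch of what to try:
# Guess 1: drop the tibble and haven-style 'labelled' classes
france <- as.data.frame(france)
france$gender1 <- as.numeric(france$gender1)
france$age <- as.numeric(france$age)
# Guess 2: name the target proportions after the variable's values,
# in case anesrake matches targets to levels by name
names(mylist$gender1) <- c("1", "2")
names(mylist$age) <- c("1", "2", "3", "4")
result <- anesrake(mylist, france, caseid = france$caseid, iterate = TRUE)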

Grouping error with lmer

I have a data frame with the following structure:
> t <- read.csv("combinedData.csv")[,1:7]
> str(t)
'data.frame': 699 obs. of 7 variables:
$ Awns : int 0 0 0 0 0 0 0 0 1 0 ...
$ Funnel : Factor w/ 213 levels "MEL001","MEL002",..: 1 1 2 2 2 3 4 4 4 4 ...
$ Plant : int 1 2 1 3 8 1 1 2 3 5 ...
$ Line : Factor w/ 8 levels "a","b","c","cA",..: 2 2 1 1 1 3 1 1 1 1 ...
$ X : int 1 2 3 4 7 8 9 10 11 12 ...
$ ID : Factor w/ 699 levels "MEL_001-1b","MEL_001-2b",..: 1 2 3 4 5 6 7 8 9 10 ...
$ BobWhite_c10082_241: int 2 2 2 2 2 2 0 2 2 0 ...
I want to construct a mixed effect model. I know in my data frame that the random effect I want to include (Funnel) is a factor, but it does not work:
> lmer(t$Awns ~ (1|t$Funnel) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Funnel within model frame: try adding grouping factor to data frame explicitly if possible
In fact this happens whatever I want to include as a random effect e.g. Plant:
> lmer(t$Awns ~ (1|t$Plant) + t$BobWhite_c10082_241)
Error: couldn't evaluate grouping factor t$Plant within model frame: try adding grouping factor to data frame explicitly if possible
Why is R giving me this error? The only other explanation my google-fu turned up is that the random effect fed in wasn't a factor in the data frame. But as str() shows, t$Funnel certainly is.
It is actually not so easy to provide a convenient syntax for modeling functions and at the same time have a robust implementation. Most package authors assume that you use the data parameter, and even then scoping issues can occur. Thus, strange things can happen if you specify variables with the df$col syntax, since package authors rarely spend a lot of effort to make this work correctly and don't include many unit tests for it.
It is therefore strongly recommended to use the data parameter if the model function offers a formula interface. Strange things can happen if you don't follow that practice (also with other model functions like lm).
In your example:
lmer(Awns ~ (1|Funnel) + BobWhite_c10082_241, data = t)
This not only works, but is also more convenient to write.
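A minimal illustration of the scoping hazard (hypothetical data): the df$col formula stores the literal name "d$x", so predict() cannot match it against newdata and silently falls back to the original data.
d <- data.frame(x = 1:10, y = rnorm(10))
m1 <- lm(d$y ~ d$x)
predict(m1, newdata = data.frame(x = 1:3)) # ignores newdata: 10 fitted values, plus a warning
m2 <- lm(y ~ x, data = d)
predict(m2, newdata = data.frame(x = 1:3)) # 3 predictions, as intended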

C5.0 decision tree - c50 code called exit with value 1

I am getting the following error
c50 code called exit with value 1
I am doing this on the Titanic data available from Kaggle.
# Importing datasets
train <- read.csv("train.csv", sep=",")
# this is the structure
str(train)
Output:
'data.frame': 891 obs. of 12 variables:
$ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
$ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
$ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
$ Name : Factor w/ 891 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
$ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
$ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
$ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
$ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
$ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
$ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
$ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
$ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
Then I tried using a C5.0 decision tree
# Trying with C5.0 decision tree
library(C50)
#C5.0 models require a factor outcome otherwise error
train$Survived <- factor(train$Survived)
new_model <- C5.0(train[-2],train$Survived)
So running the above lines gives me this error
c50 code called exit with value 1
I'm not able to figure out what's going wrong. I was using similar code on a different dataset and it was working fine. Any ideas about how I can debug my code?
Thanks
For anyone interested, the data can be found here: http://www.kaggle.com/c/titanic-gettingStarted/data. I think you need to be registered in order to download it.
Regarding your problem: first off, I think you meant to write
new_model <- C5.0(train[,-2],train$Survived)
Next, notice the structure of the Cabin and Embarked columns. These two factors have an empty string as a level name (check with levels(train$Embarked)). This is the point where C50 falls over. If you modify your data such that
levels(train$Cabin)[1] = "missing"
levels(train$Embarked)[1] = "missing"
your algorithm will now run without an error.
Just in case: you can take a look at the error with
summary(new_model)
This error also occurs when there are special characters in the name of a variable. For example, one will get this error if there is a "я" character (from the Russian alphabet) in the name of a variable.
Here is what finally worked (I got this idea after reading this post):
library(C50)
test$Survived <- NA
combinedData <- rbind(train,test)
combinedData$Survived <- factor(combinedData$Survived)
# fixing empty character level names
levels(combinedData$Cabin)[1] = "missing"
levels(combinedData$Embarked)[1] = "missing"
new_train <- combinedData[1:891,]
new_test <- combinedData[892:1309,]
new_model <- C5.0(new_train[,-2],new_train$Survived)
new_model_predict <- predict(new_model,new_test)
submitC50 <- data.frame(PassengerId=new_test$PassengerId, Survived=new_model_predict)
write.csv(submitC50, file="c50dtree.csv", row.names=FALSE)
The intuition behind this is that this way both the train and test data sets will have consistent factor levels.
I had the same error, but I was using a numeric dataset without missing values.
After a long time, I discovered that my dataset had a predictor column called "outcome"; C5.0Control uses this name internally, and this was the cause of the error.
My solution was changing the column name. Another way would be to create a C5.0Control object, change the value of its label attribute, and then pass this object as a parameter to the C5.0 method (a sketch follows below).
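For example, two alternatives, run one or the other (data and column names here are illustrative; label is a genuine C5.0Control argument):
library(C50)
# Option 1: rename the clashing column, then fit as usual
names(mydata)[names(mydata) == "outcome"] <- "result"
new_model <- C5.0(x = mydata[, names(mydata) != "result"], y = mydata$result)
# Option 2: leave the column as-is and change C5.0's internal label instead
new_model <- C5.0(x = mydata[, names(mydata) != "outcome"], y = mydata$outcome,
                  control = C5.0Control(label = "target"))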
I also struggled for some hours with the same problem (return code "1") when building a model as well as when predicting.
With the hint from Marco's answer, I have written a small function to remove all factor levels equal to "" in a data frame or vector; see the code below. However, since R does not allow pass by reference to functions, you have to use the result of the function (it cannot change the original data frame):
removeBlankLevelsInDataFrame <- function(dataframe) {
  for (i in 1:ncol(dataframe)) {
    levels <- levels(dataframe[, i])
    if (!is.null(levels) && levels[1] == "") {
      levels(dataframe[, i])[1] <- "?"
    }
  }
  dataframe
}
removeBlankLevelsInVector <- function(vector) {
  levels <- levels(vector)
  if (!is.null(levels) && levels[1] == "") {
    levels(vector)[1] <- "?"
  }
  vector
}
Calling the functions may look like this:
trainX = removeBlankLevelsInDataFrame(trainX)
trainY = removeBlankLevelsInVector(trainY)
model = C50::C5.0.default(trainX,trainY)
However, it seems that C50 has a similar problem with character columns containing an empty cell, so you will probably have to extend this to also handle character attributes if you have some; a possible sketch follows.
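Here is one possible sketch of that extension (assuming that, as above, "?" is an acceptable placeholder):
# Replace empty strings in character columns, mirroring the factor-level fix
removeBlankCharsInDataFrame <- function(dataframe) {
  for (i in 1:ncol(dataframe)) {
    if (is.character(dataframe[, i])) {
      blank <- !is.na(dataframe[, i]) & dataframe[, i] == ""
      dataframe[blank, i] <- "?"
    }
  }
  dataframe
}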
I also got the same error, but it was because of some illegal characters in the factor levels of one of the columns.
I used the make.names() function to correct the factor levels:
levels(FooData$BarColumn) <- make.names(levels(FooData$BarColumn))
Then the problem was resolved.

How do I automatically specify the correct regression model when the number of variables differs between input data sets?

I have a working R program that will be used by my internal client for analysing their nutrient intake data. For each dataset that they have, they will re-run the R program.
A key part of the program is a nonlinear mixed-model analysis, using nlmer from the lme4 package, that incorporates dummy variables for age. Depending on whether they will be analysing children or adults, the number of age band dummies in the formula will differ, although the reference age band dummy will always be the youngest. I think that the number of possible age bands ranges from 4 to about 6, so it's not a large range. It is a trivial matter to count the number of age band dummies, if I need to condition on that.
What is the most efficient way for me to wrap the model-based code (the lmer that provides the starting parameter values, the function for the nlmer model, and the model specification in nlmer itself) so that the correct function and models are applied based on the number of age band dummies in the model? The other variables in the model are constant across datasets.
I've already got the program set up to automatically generate the relevant dummies and drop those that aren't used in the current analysis. The part of the program after the model is pretty well automated as well. I'm just stuck on how to automate the two lme4-based analyses and the function. These will only be run once for each dataset.
I've been wondering whether I need to write a function to contain all the lme4-related code, or whether there is an easier way. I would appreciate some pointers on how to do this. It took me one day to work out how to get the function working that I needed for the nlmer model, so I am still at a beginner level with functions.
I've searched for other R related automation questions on the site and I didn't find anything similar to what I would like to do.
Thanks in advance.
Update in response to the suggestion in the comments about using a string: that sounds like an easy way forward, except that I don't then know how to apply the string content in a function, as each dummy variable level (excluding the reference category) is used in the function for nlmer. How can I pull apart the string and use only the dummy variables that I have in a function? For example, one analysis might have AgeBand2, AgeBand3, and AgeBand4, and another analysis might have AgeBand5 as well as those 3. If this were VBA, I would just create subfunctions based on the number of age dummy variables. I have no idea how to do this efficiently in R.
Can I just wrap a while loop around the lmer, function, and nlmer parts, so I have a series of while loops?
This is the section of code I wish to automate; the number of AgeBand dummy variables differs depending on the dataset being analysed (children vs. adults). This uses the dataset that I have been testing a SAS to R translation on, but the real datasets will be very similar. It is necessary to have a nonlinear model, as this is the basis of the peer-reviewed published method that I am working off.
library(lme4)
Male.lmer <- lmer(BoxCoxXY ~ AgeBand4 + AgeBand5 + AgeBand6 + AgeBand7 +
AgeBand8 + Race1 + Race3 + Weekend + IntakeDay + (1|RespondentID),
data=Male.AddSugar,
weights=Replicates)
Male.lmer.fixef <- fixef(Male.lmer)
Male.lmer.fixef <- as.data.frame(Male.lmer.fixef)
bA <- Male.lmer.fixef[1,1]
bB <- Male.lmer.fixef[2,1]
bC <- Male.lmer.fixef[3,1]
bD <- Male.lmer.fixef[4,1]
bE <- Male.lmer.fixef[5,1]
bF <- Male.lmer.fixef[6,1]
bG <- Male.lmer.fixef[7,1]
bH <- Male.lmer.fixef[8,1]
bI <- Male.lmer.fixef[9,1]
bJ <- Male.lmer.fixef[10,1]
MD <- deriv(expression(b0 + b1*AgeBand4 + b2*AgeBand5 + b3*AgeBand6 +
b4*AgeBand7 + b5*AgeBand8 + b6*Race1 + b7*Race3 + b8*Weekend + b9*IntakeDay),
namevec=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9"),
function.arg=c("b0","b1","b2","b3", "b4", "b5", "b6", "b7", "b8", "b9",
"AgeBand4","AgeBand5","AgeBand6","AgeBand7","AgeBand8",
"Race1","Race3","Weekend","IntakeDay"))
Male.nlmer <- nlmer(BoxCoxXY ~ MD(b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,AgeBand4,AgeBand5,AgeBand6,AgeBand7,AgeBand8,
Race1,Race3,Weekend,IntakeDay)
~ b0|RespondentID,
data=Male.AddSugar,
start=c(b0=bA, b1=bB, b2=bC, b3=bD, b4=bE, b5=bF, b6=bG, b7=bH, b8=bI, b9=bJ),
weights=Replicates
)
These will be the required changes between the datasets:
the number of fixed effect coefficients that I need to assign out of the lmer will change.
in the function, the expression, namevec, and function.arg parts will change
in the nlmer call, the model statement and start parameter list will change.
I can change the lmer model statement so it takes AgeBand as a factor with levels, but I still need to pull out the values of the coefficients afterwards.
str(Male.AddSugar) gives:
'data.frame': 10287 obs. of 23 variables:
$ RespondentID: int 9966 9967 9970 9972 9974 9976 9978 9979 9982 9993 ...
$ RACE : int 2 3 2 2 3 2 2 2 2 1 ...
$ RNDW : int 26290 7237 10067 75391 1133 31298 20718 23908 7905 1091 ...
$ Replicates : num 41067 2322 17434 21723 375 ...
$ DRXTNUMF : int 27 11 13 18 17 13 13 19 11 11 ...
$ DRDDAYCD : int 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 33.45 2.53 9.58 43.34 55.66 ...
$ RIAGENDR : int 1 1 1 1 1 1 1 1 1 1 ...
$ RIDAGEYR : int 39 23 16 44 13 36 16 60 13 16 ...
$ Subgroup : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 4 3 2 4 1 4 2 5 1 2 ...
$ WKEND : int 1 1 1 0 1 0 0 1 1 1 ...
$ AmtInd : num 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeDay : num 0 0 0 0 0 0 0 0 0 0 ...
$ Weekend : int 1 1 1 0 1 0 0 1 1 1 ...
$ Race1 : num 0 0 0 0 0 0 0 0 0 1 ...
$ Race3 : num 0 1 0 0 1 0 0 0 0 0 ...
$ AgeBand4 : num 0 0 1 0 0 0 1 0 0 1 ...
$ AgeBand5 : num 0 1 0 0 0 0 0 0 0 0 ...
$ AgeBand6 : num 1 0 0 1 0 1 0 0 0 0 ...
$ AgeBand7 : num 0 0 0 0 0 0 0 1 0 0 ...
$ AgeBand8 : num 0 0 0 0 0 0 0 0 0 0 ...
$ YN : num 1 1 1 1 1 1 1 1 1 1 ...
$ BoxCoxXY : num 7.68 1.13 3.67 8.79 9.98 ...
The AgeBand data is incorrectly shown as the ordered factor Subgroup. Because I haven't used it, I haven't gone back and corrected this to a plain factor.
This assumes that you have one variable, "ageband", which is a factor with levels AgeBand2, AgeBand3, AgeBand4, and perhaps others that you want to be ignored. Since factors are generally treated by R regression functions using the lowest lexicographic value as the reference level, you would get your correct reference level chosen automagically. You pick your desired levels by creating a dataset that has only the desired levels.
agelevs <- c("AgeBand2", "AgeBand3", "AgeBand4")
dsub <- subset(inpdat, ageband %in% agelevs)
res <- nlmer(y ~ ageband + <other-parameters>, data=dsub, ...)
If you have gone to the trouble of creating separate variables, then you need to learn to use factors correctly rather than holding to inefficient habits enforced by training in SPSS or other clunky macro processors.
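If you prefer to keep the separate dummy columns, here is a hedged sketch of the string-based approach asked about in the update: derive the predictor names from the data, then build the lmer formula with reformulate() and the deriv() expression from the same vector of names (names follow the question's data; untested on the real datasets):
library(lme4)
# Dummy columns present in this dataset; drop any that are all zero
agebands <- grep("^AgeBand", names(Male.AddSugar), value = TRUE)
agebands <- agebands[colSums(Male.AddSugar[agebands]) > 0]
preds <- c(agebands, "Race1", "Race3", "Weekend", "IntakeDay")
# lmer for starting values, with the formula assembled from the names
Male.lmer <- lmer(reformulate(c(preds, "(1 | RespondentID)"), response = "BoxCoxXY"),
                  data = Male.AddSugar, weights = Replicates)
bnames <- paste0("b", seq_along(preds))          # b1, b2, ... one per predictor
start <- setNames(fixef(Male.lmer), c("b0", bnames))
# deriv() expression and argument list built from the same names
MD <- deriv(parse(text = paste("b0 +", paste(bnames, "*", preds, collapse = " + ")))[[1]],
            namevec = c("b0", bnames),
            function.arg = c("b0", bnames, preds))
# nlmer call assembled the same way
nlmer_form <- as.formula(paste0("BoxCoxXY ~ MD(",
                                paste(c("b0", bnames, preds), collapse = ", "),
                                ") ~ b0 | RespondentID"))
Male.nlmer <- nlmer(nlmer_form, data = Male.AddSugar, start = start, weights = Replicates)
This way, none of the model code has to change when the set of age bands changes between datasets.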
