what to do with non-numeric predictor in plsreg1? - r

I am using the function "plsreg1" in the "plsdepot" package to run a PLS regression.
As the package introduction states, in the argument "plsreg1(predictors, response, comps = 2, crosval = TRUE)", the predictors represents "A numeric matrix or data frame with the predictor variables (which may contain missing data)."
And the response represents "A numeric vector for the reponse variable. No missing data allowed."
My question is: what if some of my predictors are categorical variables?
Can I still use this package? How should I revise my script?
The example script from the package is below:
library(plsdepot)
data(cornell)
# matrix of correlations
round(as.dist(cor(cornell)),3)
# partial least squares regression
mypls1 <- plsreg1(cornell[, 1:7], cornell[, 8, drop = FALSE], comps = 3)
mypls1
Thank you very much.
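One possible approach, offered as a sketch rather than something the plsdepot documentation quoted above confirms: expand the categorical predictors into numeric 0/1 dummy columns with base R's model.matrix, then pass the resulting numeric matrix to plsreg1. The iris data and the choice of columns are illustrative assumptions, not the asker's data.

```r
# Illustrative data (an assumption): two numeric predictors plus one
# factor predictor, with a numeric response.
predictors <- iris[, c("Sepal.Width", "Petal.Length", "Species")]
response   <- iris$Sepal.Length

# model.matrix() expands each factor into 0/1 dummy columns;
# [, -1] drops the intercept column so only predictors remain.
X <- model.matrix(~ ., data = predictors)[, -1]
colnames(X)

# Fit only if plsdepot is installed; X is now a plain numeric matrix.
if (requireNamespace("plsdepot", quietly = TRUE)) {
  mypls <- plsdepot::plsreg1(X, response, comps = 2)
}
```

Each factor level other than the first becomes its own column. Whether dummy columns should be scaled like the continuous predictors is a modeling choice this sketch does not settle.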

Related

What does a proportional matrix look like for glmnet response variable in R?

I'm trying to use glmnet to fit a GLM that has a proportional response variable (using the family="binomial").
The help file for glmnet says that the response variable:
"For family="binomial" should be either a factor with
two levels, or a two-column matrix of counts or proportions (the second column
is treated as the target class"
But I don't really understand how I would have a two column matrix. My variable is currently just a single column with values between 0 and 1. Can someone help me figure out how this needs to be formatted so that glmnet will run it properly? Also, can you explain what the target class means?
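It is a two-column matrix of counts: the first column holds the negative-class counts and the second column holds the positive-class counts (the target class, i.e. the outcome whose probability is modeled). In the example below we fit a model for the proportion of Claims among Holders: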
library(glmnet)
data = MASS::Insurance
# column 1: holders without a claim; column 2: claims (the target class)
y_counts = cbind(data$Holders - data$Claims,data$Claims)
x = model.matrix(~District+Age+Group,data=data)
fit1 = glmnet(x=x,y=y_counts,family="binomial",lambda=0.001)
If possible, go back to the step before you calculated the response variable and retrieve these counts. If that is not possible, you can provide a two-column matrix of proportions (second column for success), but this assumes the weight, or n, is the same for all observations:
y_prop = y_counts / rowSums(y_counts)
fit2 = glmnet(x=x,y=y_prop,family="binomial",lambda=0.001)
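For the situation in the question, where the response is a single column of values between 0 and 1, a minimal sketch of building the two-column matrix follows. It assumes the denominators (Holders here) are still available. Passing them through glmnet's weights argument is my assumption about how to restore the per-row n, so it is worth checking against the glmnet documentation.

```r
ins <- MASS::Insurance
p <- ins$Claims / ins$Holders   # the kind of single 0-1 column in the question
n <- ins$Holders                # denominators, assumed to be known

# Column 1 = proportion without a claim, column 2 = proportion with a claim
# (the second column is the target class).
y_prop <- cbind(no_claim = 1 - p, claim = p)

# If n is known, the counts can be recovered from the proportions.
y_counts <- round(y_prop * n)

x <- model.matrix(~ District + Age + Group, data = ins)[, -1]
if (requireNamespace("glmnet", quietly = TRUE)) {
  # weights = n is an assumption about how to carry the row totals.
  fit <- glmnet::glmnet(x, y_prop, family = "binomial",
                        weights = n, lambda = 0.001)
}
```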

R - Machine Learning: Subset selection packages and approaches for a categorical response variable

I have a clinical data set with around 26 variables (mix of numerical and categorical) including response variable.
Categorical response variable 'RETINOPATHY' has 2 factor levels: "Yes" and "No".
Now the task is to find the best subset of feature variables out of the 26 to predict the categorical response variable, and to check the AIC (Akaike information criterion) and similar values for different subsets.
I found that the 'leaps' library is pretty handy for this task, but it works only when the response variable is numerical. "regsubsets" won't work for a categorical response, as it is based on linear regression. See below:
regsubsets(finalDataDF$RETINOPATHY~., data = finalDataDF, nbest = 5, method="exhaustive")
In the above sample, RETINOPATHY is the categorical response variable, modeled as a function of the remaining variables.
I searched a lot but couldn't find a proper explanation of which R package can produce feature subsets for predicting RETINOPATHY.
Please guide me further. Thanks in advance.
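One direction, sketched under assumptions: for a two-level response, a logistic regression fitted with glm(family = binomial) reports an AIC, and base R's step() can search subsets by AIC. This is a stepwise search, not the exhaustive search that regsubsets performs, and the built-in infert data stands in for the clinical data set.

```r
# infert ships with R; 'case' (0/1) stands in for RETINOPATHY (an assumption).
dat <- infert[, c("case", "age", "parity", "spontaneous", "induced")]
dat$case <- factor(dat$case, labels = c("No", "Yes"))

# Logistic regression on all candidate predictors.
full <- glm(case ~ ., data = dat, family = binomial)

# Backward stepwise selection by AIC (not exhaustive).
best <- step(full, direction = "backward", trace = 0)

formula(best)
AIC(full)
AIC(best)
```

Because backward selection only removes a term when doing so lowers the AIC, the selected model's AIC is never higher than that of the full model.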

XGBoost Error in R Studio ("'data' has class 'character' and length...")

I am having difficulties fitting my data to an xgboost classifier model. When I run this:
classifier = xgboost(data = as.matrix(training_set[c(4:15, 17:18,20:28)]),
label = training_set$posted_ind, nrounds = 10)
R Studio tells me:
Error in xgb.DMatrix(data, label = label, missing = missing) :
'data' has class 'character' and length 1472000.
'data' accepts either a numeric matrix or a single filename.
The training set data has both continuous and categorical data, but all categorical data has been encoded as such (and the same data fit to random forest and naive bayes models). Is there some additional step I need to complete so that I can use these data in an xgboost model?
Make sure that your "training_set" does not have any columns that are factors. If you encoded your categorical variables as numeric but then cast them as factors, you will get this error, because as.matrix() on a data frame with factor columns returns a character matrix.
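To see why factor columns trigger the error, here is a small base R sketch on a made-up data frame (not the asker's training_set): as.matrix() on a data frame that mixes numeric and factor columns gives a character matrix, while model.matrix() gives a numeric one.

```r
# Made-up data for illustration: one numeric column, one factor column.
df <- data.frame(x = c(1.5, 2.0, 3.5), grp = factor(c("a", "b", "a")))

# The factor column forces every cell to character.
m_bad <- as.matrix(df)
is.character(m_bad)

# model.matrix() turns the factor into 0/1 columns, so the result is numeric.
m_ok <- model.matrix(~ . - 1, data = df)
is.numeric(m_ok)
colnames(m_ok)
```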
I came across the same problem and found a complete solution. You have to use:
library(Matrix)
library(xgboost)
sparse_matrix <- sparse.model.matrix(label_y ~ ., data = df)[,-1]
X_train_dmat = xgb.DMatrix(sparse_matrix, label = df$label_y)
This transforms the categorical data to dummy variables. Several encoding methods exist, e.g., one-hot encoding is a common approach. The above is dummy contrast coding which is popular because it produces “full rank” encoding (also see this blog post by Max Kuhn).
The purpose is to transform each value of each categorical feature into a binary feature {0, 1}.
For example, a column Treatment will be replaced by two columns, TreatmentPlacebo, and TreatmentTreated. Each of them will be binary. Therefore, an observation which has the value Placebo in column Treatment before the transformation will have after the transformation the value 1 in the new column TreatmentPlacebo and the value 0 in the new column TreatmentTreated. The column TreatmentPlacebo will disappear during the contrast encoding, as it would be absorbed into a common constant intercept column.
Source: https://cran.r-project.org/web/packages/xgboost/vignettes/discoverYourData.html#conversion-from-categorical-to-numeric-variables
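The Treatment example from the vignette can be reproduced with base R's model.matrix to compare contrast coding with one-hot coding. This is a sketch on made-up data; sparse.model.matrix follows the same formula conventions but returns a sparse matrix.

```r
# Made-up data: a single two-level categorical feature.
df <- data.frame(Treatment = factor(c("Placebo", "Treated", "Placebo")))

# Contrast (full-rank) coding: the first level is absorbed into the intercept.
contrast <- model.matrix(~ Treatment, data = df)
colnames(contrast)

# One-hot coding: removing the intercept keeps a column for every level.
one_hot <- model.matrix(~ Treatment - 1, data = df)
colnames(one_hot)
```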
What is working for me while using tidymodels is adding a recipe step for dummy encoding:
step_dummy(all_nominal_predictors(), one_hot = TRUE)

Error in bn.fit predict function in bnlearn R

I have learned and fitted a Bayesian network with the bnlearn R package, and I wish to predict the value of its "event" node.
fl="data/discrete_kdd_10.txt"
h=TRUE
dtbl1 = read.csv(file=fl, head=h, sep=",")
net=hc(dtbl1)
fitted=bn.fit(net,dtbl1)
I want to predict the value of "event" node based on the evidence stored in another file with the same structure as the file used for learning.
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
However, predict fails with
Error in check.data(data) : variable duration must have at least two levels.
I don't understand why there should be any restriction on number of levels of variables in the evidence data.frame.
The dtbl2 data.frame contains only few rows, one for each scenario in which I want to predict the "event" value.
I know I can use cpquery, but I also wish to use the predict function for networks with mixed variables (both discrete and continuous). I haven't found out how to use evidence on a continuous variable in cpquery.
Can someone please explain what I'm doing wrong with the predict function and how should I do it right?
Thanks in advance!
The problem was that reading the evidence data.frame in
fileName="data/dcmp.txt"
dtbl2 = read.csv(file=fileName, head=h, sep=",")
predict(fitted,"event",dtbl2)
caused the categorical variables to become factors with a different number of levels (a subset of the levels in the original training set).
I used the following code to solve this issue.
# give every column of the evidence data the factor levels seen in training
for (col in names(dtbl2)) {
dtbl2[[col]] <- factor(dtbl2[[col]], levels = levels(dtbl1[[col]]))
}
By the way, the bnlearn package does fit models with mixed variables and also provides functions for prediction in them.
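The level mismatch can be shown in isolation with base R. This is a sketch on made-up data, with "duration" borrowed from the error message: a factor built from a few evidence rows only knows the values it has seen, until it is rebuilt with the training levels.

```r
# Made-up training data with three levels of 'duration'.
train <- data.frame(duration = factor(c("short", "long", "medium")))

# Evidence with a single row: the factor ends up with one level only.
new <- data.frame(duration = factor("short"))
nlevels(new$duration)

# Rebuild the factor with the training levels; the value itself is unchanged.
new$duration <- factor(new$duration, levels = levels(train$duration))
nlevels(new$duration)
as.character(new$duration)
```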

Classification column is removed after using dummyVars in caret package - R

I am playing around with the caret package and came upon this question.
I am using dummyVars to split my categorical columns into separate dummy variables. It seems that dummyVars code removes the classification column in the input data set. For example:
library(caret)
library(earth)
data(etitanic)
dummies <- dummyVars(survived ~ ., data = etitanic, levelsOnly = FALSE)
et<-as.data.frame(predict(dummies, newdata = etitanic))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "sex.female" "sex.male" "age"
[7] "sibsp" "parch"
So when I try to split the data, I get an error.
train = createDataPartition(et$survived, p=.75, list=FALSE)
Error in createDataPartition(et$survived, p = 0.75, list = FALSE) :
y must have at least 2 data points
Could anyone let me know if this is the expected behavior of caret's dummyVars? I can easily add the survived column back into the data set using, say,
et$survived<-etitanic$survived
and then train a model. But I presume that there must be a better way or else the caret package would not remove the classification column. Am I missing something here? Could someone throw more light on this please?
Thanks
As far as I know, there is no way to keep the classification column in the output (at least not as a factor, because the output is a matrix and is therefore always numeric). The purpose of the dummyVars function is to create dummy variables for the factor predictor variables. It is designed as an alternative to the base R function model.matrix that offers more choices (model.matrix does not keep the classification column either).
Also, and maybe more importantly, functions that require the classification column to be of factor class, and only of factor class, either offer a way to provide the factor as a separate argument (like the svm function from the e1071 package) or specifically require it as a separate argument (like the knn function from the FNN package). In both cases you do not need to have the factor in your data.frame. You just need to provide it as a separate vector to the function you want to use.
However, there is an alternative for cases where you do not need the classification column to be of factor type, in which case you can simply do:
library(caret)
library(earth)
data(etitanic)
etitanic2 <- etitanic
# convert the classification column to numeric
etitanic2$survived <- as.numeric(etitanic2$survived)
# use a formula without specifying the response variable
dummies <- dummyVars( ~ ., data = etitanic2, levelsOnly = FALSE)
et <- as.data.frame(predict(dummies, newdata = etitanic2))
names(et)
[1] "pclass.1st" "pclass.2nd" "pclass.3rd" "survived" "sex.female" "sex.male" "age"
[8] "sibsp" "parch"
By converting the classification column to numeric and not specifying a response variable in the formula, the survived column is kept in the output data.frame, but as a numeric column.
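The separate-vector idea from the answer can also be sketched with base R's model.matrix, which the answer names as the counterpart of dummyVars. The mtcars data and the choice of am as the class column are illustrative assumptions: the predictors are dummy-encoded on their own and the response is bound back on as a factor afterwards.

```r
# Illustrative data (an assumption): 'am' plays the role of the class column.
dat <- transform(mtcars[, c("am", "cyl", "mpg")],
                 am  = factor(am, labels = c("auto", "manual")),
                 cyl = factor(cyl))

# Dummy-encode the predictors only; the response is left out of the matrix.
X <- model.matrix(am ~ ., data = dat)[, -1]

# Bind the untouched factor response back on, so it stays a factor.
et <- data.frame(am = dat$am, X)
names(et)
is.factor(et$am)
```

With the response kept as a factor in et, it can be passed to a splitting or modeling function either as a column or as a separate vector.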
