Trying to perform LDA with LASSO - r

PenalizedLDA( x = train_x, y =train_y) returns
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
I'm trying to use linear discriminant analysis with lasso on the spambase dataset from UCI. (I've added headers to the columns and, where appropriate, rescaled the columns to the interval [0,1].)
The first time I ran the code it gave an error
Error in PenalizedLDA(x = train_x, y = train_y) :
y must be a numeric vector, with values as follows: 1, 2, ....
I solved that by passing train_y as
train_y =as.list.numeric_version(training_set[,58])
When I ran it again, I got the error
Error in sort.int(x, na.last = na.last, decreasing = decreasing, ...) :
'x' must be atomic
Here I got stuck.
library(penalizedLDA)
data = read.csv("spambase.csv",header = TRUE)
new_data = data/100
new_data[,c(55,56,57,58)] = data[,c(55,56,57,58)]
new_data[,58]= factor(new_data[,58])
# Splitting dataset into Training set and Test set
set.seed(seeds)
split = sample.split(new_data$factor, SplitRatio = 0.7)
training_set = subset(new_data, split == TRUE)
test_set = subset(new_data, split == FALSE)
#scale data
training_set[-58] = scale(training_set[,-58])
test_set[-58] = scale(test_set[,-58])
train_x =training_set[,-58]
train_y =as.list.numeric_version(training_set[,58])
#Sparse linear discriminant Analysis
classifier = PenalizedLDA( x = training_set[,-58], y =training_set[,58],K = 1,lambda = "standard")

According to the help-page of PenalizedLDA(), its parameter y = should be:
A n-vector containing the class labels. Should be coded as 1, 2, . . . , nclasses, where nclasses is the number of classes.
This means the class labels (column 58 in your case) must start at 1, not 0. Also, don't use as.list.numeric_version(): it returns a list, whereas PenalizedLDA() needs a vector. That list is why sort.int() complains that 'x' must be atomic.
data = read.csv("...")
new_data = data/100
new_data[,c(55,56,57,58)] = data[,c(55,56,57,58)]
new_data[,58] = factor(new_data[,58] + 1) # in order to start at 1 and not 0
new_data[-58] = scale(new_data[,-58])
classifier = PenalizedLDA(x = new_data[,-58], y = new_data[,58], K = 1, lambda = .1)
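As a minimal sketch of the label recoding described in this answer, the snippet below uses made-up 0/1 labels (not the real spambase column) and plain as.list() as a stand-in for the list conversion. It only illustrates why a list fails in sort() and how to obtain a numeric vector coded 1, 2; it does not call PenalizedLDA().

```r
# Hypothetical 0/1 class labels standing in for column 58 of the data
labels01 <- c(0, 1, 1, 0, 1)

# A list is not atomic, so sort() refuses it
y_list <- as.list(labels01)
sort_result <- tryCatch(sort(y_list), error = function(e) e)
is.atomic(y_list)               # FALSE
inherits(sort_result, "error")  # TRUE

# Recode to a plain numeric vector with values 1, 2, ...
y_vec <- as.numeric(factor(labels01))
is.atomic(y_vec)                # TRUE

# Same kind of check the error message describes: labels are 1, 2, ..., nclasses
labels_ok <- all(sort(unique(y_vec)) == seq_along(unique(y_vec)))
labels_ok                       # TRUE
```

Going through factor() first means the smallest label always maps to 1, whatever the original coding was.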

Related

Error in glmnet if I specify a variable to be a factor

I have a data frame in R on which I would like to run glmnet. The y variable is originally numeric but takes only the values 0 and 1. If I convert it to a factor as follows
df_ML_1976[,names] <- lapply(df_ML_1976[,names] , factor)
and then apply glmnet after dividing into training and test set:
library("dplyr")
df_ML_1976 %>%
select(where(~ any(. != 0)))
#df_ML_1976 <- subset(df_ML_1976, select = -c(X))
library("caret")
default_idx = createDataPartition(df_ML_1976$y_tr4, p = 0.75, list = FALSE)
default_trn = df_ML_1976[default_idx, ]
default_tst = df_ML_1976[-default_idx, ]
## Fitting elasticnet:
cv_5 = trainControl(method = "cv", number = 5)
def_elnet = train(
y_tr4 ~ ., data = default_trn,
method = "glmnet",
trControl = cv_5
)
def_elnet
an error occurs:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'drop': non-conformable arguments
which does not appear if I do not specify
df_ML_1976[,names] <- lapply(df_ML_1976[,names] , factor)
Why does this happen?
Thank you

Issue with length of sparse.matrix

I've been trying to train a model on this dataset using xgboost. However, when I turn it into a sparse matrix I get the following error message:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
I'm incredibly confused because the label was derived from the dataset - therefore I don't understand how it is a different length to the sparse matrix.
From what I can tell - the dataframe has 2048 rows, as does the label that was derived from it. However when I turn this into a sparse matrix - 300 rows are added.
Can anyone think of a fix to sort this out?
require(xgboost)
require(methods)
require(Matrix)
require(data.table)
require(vcd)
require(dplyr)
train = read.csv("French Ligue 1 train.csv", header = TRUE, stringsAsFactors = F)
test = read.csv("French Ligue 1 test.csv", header = TRUE, stringsAsFactors = F)
df <- data.table(train, keep.rownames = F)
sparse_matrix <- sparse.model.matrix(Response ~.-1, data = df)
output_vector = sparse_matrix[,Response] == 1
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
eta = 1, nthread = 2, nrounds = 4, objective = "binary:logistic")
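One possible cause of a row-count mismatch like this (an assumption, since the data are not shown) is that the model-matrix step builds a model frame first and, by default, drops rows containing NA, while the label is taken from the full data. The sketch below uses base model.matrix() and a made-up data frame to show the effect and one way to keep label and matrix aligned; the column names goals and venue are hypothetical.

```r
# Hypothetical data with one missing predictor value
df <- data.frame(
  Response = c(1, 0, 1, 0),
  goals    = c(2, NA, 1, 3),
  venue    = factor(c("home", "away", "home", "away"))
)

# Default na.action (na.omit) silently drops the row with NA
mm <- model.matrix(Response ~ . - 1, data = df)
label <- df$Response == 1
c(rows = nrow(mm), labels = length(label))  # the two counts differ

# Build the model frame once, then take both pieces from it
mf     <- model.frame(Response ~ ., data = df)
mm2    <- model.matrix(Response ~ . - 1, data = mf)
label2 <- model.response(mf) == 1
nrow(mm2) == length(label2)                 # TRUE
```

Comparing nrow() of the matrix with length() of the label before calling xgboost() is a cheap way to confirm or rule out this explanation.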

R: Error in model.frame.default(form = variable lengths differ (found for 'excel')

I am using a for loop because I need to predict multiple columns and store the predictions at the same time.
cols is a vector containing the names of all the columns I need to predict, and mat is a data.frame (basically my text features).
df is the main data frame, containing the text and the prediction columns.
for (colm in cols){
  label <- as.factor(df[[colm]])
  dfm <- mat
  dfm[[colm]] <- label
  #Boruta(as.factor(colm)~., data=dfm, pValue = 0.01, mcAdj = TRUE, maxRuns = 20,
  #       doTrace = 2, holdHistory = TRUE, getImp = getImpRfZ) -> Bor.rf
  #dfm <- as.data.frame(as.matrix(dfm[,getSelectedAttributes(Bor.rf)]))
  #dfm[[colm]] <- label
  #train the RF model
  modelRF.bor <- train(colm~., data=dfm, method="rf", trControl=control)
  pred.RF.bor = predict(modelRF.bor, newdata = dfm[ ,!(colnames(dfm) == st(colm))])
  print("Predictions for Column")
  print(colm)
  print(pred.RF.bor)
  table(pred.RF.bor,dfm$colm)
  acc.RF.bor = mean(pred.RF.bor==dfm$colm)
  print("Accuracy ")
  print(acc.RF.bor)
  print("Confusion Matrix")
  print(confusionMatrix(table(pred.RF.bor,dfm$colm)))
  output[,i] <- pred.RF.bor
  i = i+1
}
I am getting this error, and have checked everything in my code, and also similar questions here.
Error in model.frame.default(form = colm ~ ., data = dfm, na.action = na.fail) :
variable lengths differ (found for 'excel')
I can't share the data and all code, it's big and not needed I think.
Try looking upstream to see if you have data attached.
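Another thing worth checking: in colm ~ . the name colm is taken literally, so the formula sees the length-1 string held in colm rather than the column it names, and dfm$colm looks for a column literally called "colm". The sketch below reproduces this with lm() on a small made-up data frame (not the asker's data, and lm() rather than caret's train()) and builds the formula from the string with reformulate().

```r
# Hypothetical stand-in for the feature data plus one target column
dfm <- data.frame(
  excel  = c(1, 2, 3, 4, 5),
  word   = c(2, 1, 4, 3, 6),
  target = c(1.1, 1.9, 3.2, 3.9, 5.1)
)
colm <- "target"

# Literal `colm` in the formula refers to the length-1 string, not the column
msg <- tryCatch(
  { lm(colm ~ ., data = dfm); "no error" },
  error = function(e) conditionMessage(e)
)
msg

# Build the formula from the column name instead
f   <- reformulate(".", response = colm)  # target ~ .
fit <- lm(f, data = dfm)

# `$` does not evaluate colm either; use [[ ]] with a variable
is.null(dfm$colm)                         # TRUE
identical(dfm[[colm]], dfm$target)        # TRUE
```

The same formula object can be passed to train() in place of colm~., and dfm[[colm]] can replace dfm$colm in the accuracy and confusion-matrix lines.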

Linear Model error or data type error?

The program generates a matrix with location and treatment. It generates some data from a normal distribution. It then tries to fit a linear model predicting yield based on treatment and location. The linear model does not work. Why?
trts = paste(0:6)
locs = paste(1:6)
reps = paste(1:4)
plotsize = 4
DF = expand.grid(locs, reps, trts, plotsize, stringsAsFactors = TRUE)
colnames(DF) = c("Location","Replicate","Treatment")
vector = rnorm(1000000, mean=138.2, sd=54.89)
DF$Treatment = as.numeric(DF$Treatment)
DF$Location = as.numeric(DF$Location)
#This approach takes one set of "plotsize" values from "vector" and adds for 5 for each treatment.
DF$Yield = apply(DF, 1, function(x) (5*DF$Treatment)+mean(sample vector,plotsize)))
DF<-t(DF)
Yield<-DF$Yield
trt=as.factor(DF$Treatment)
loc=as.factor(DF$Location)
summary(fm1 <- aov(Yield ~ loc*trt))
result1<-TukeyHSD(fm1, "trtm", ordered = TRUE)
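For comparison, here is a hedged sketch of how the same simulation could be written so that it runs. It rests on several assumptions about the intent: plotsize is kept as a scalar instead of a grid column, the data frame is not transposed (t() turns it into a matrix, after which $ no longer works), the sample() call is written with parentheses, and the Tukey term is named after the factor actually in the model. The seed and the smaller population size are illustrative choices.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

trts <- paste(0:6)
locs <- paste(1:6)
reps <- paste(1:4)
plotsize <- 4

# One row per Location x Replicate x Treatment combination
DF <- expand.grid(Location = locs, Replicate = reps, Treatment = trts,
                  stringsAsFactors = TRUE)

population <- rnorm(1e5, mean = 138.2, sd = 54.89)

# Treatment effect of 5 per factor code (1..7), plus the mean of
# `plotsize` values drawn from the population for each row
trt_code <- as.numeric(DF$Treatment)
plot_mean <- vapply(seq_len(nrow(DF)),
                    function(i) mean(sample(population, plotsize)),
                    numeric(1))
DF$Yield <- 5 * trt_code + plot_mean

# Location and Treatment stay factors inside the data frame
fm1 <- aov(Yield ~ Location * Treatment, data = DF)
summary(fm1)
result1 <- TukeyHSD(fm1, "Treatment", ordered = TRUE)
```

Keeping everything in one data frame and passing data = DF to aov() avoids the separate Yield, trt and loc vectors.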

(list) object cannot be coerced to type 'double'

I just started using the SIS package in R. I used their test data set and got an error. I am quite sure there is a problem.
install.packages("SIS",dependencies=T)
library(SIS)
data(prostate.test)
I then try to use the function SIS, which has as input
SIS(x, y, family = c("gaussian","binomial","poisson","cox"),
penalty=c("SCAD","MCP","lasso"), concavity.parameter =
switch(penalty, SCAD=3.7, 3), tune = c("cv","aic","bic","ebic"),
nfolds = 10, type.measure = c("deviance","class","auc","mse",
"mae"), gamma.ebic = 1, nsis = NULL, iter = TRUE, iter.max =
ifelse(greedy==FALSE,10,floor(nrow(x)/log(nrow(x)))), varISIS =
c("vanilla","aggr","cons"), perm = FALSE, q = 1, greedy = FALSE,
greedy.size = 1, seed = 0, standardize = TRUE)
where x is the design matrix, of dimensions n * p, without an intercept. Each row is an observation vector and y the response vector of dimension n * 1. I format their test data (the last column is the response)
prostate.test->k
k[,-dim(k)[2]]->k1
k[,dim(k)[2]]->k11
SIS(k1,k11)
Then I get:
Error in storage.mode(x) = "numeric" :
(list) object cannot be coerced to type 'double'
Could somebody tell me how I can avoid that error?
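The error message suggests that x arrives as a data frame (a list of columns) where a numeric matrix is expected. The sketch below reproduces the failing coercion on a small made-up data frame standing in for prostate.test and shows the as.matrix() conversion; the SIS() call itself is left commented out and is an assumption about how the result would be used.

```r
# Hypothetical stand-in for prostate.test: predictors, response in last column
k <- data.frame(
  g1 = c(0.2, 1.5, -0.3, 0.8),
  g2 = c(1.1, -0.7, 0.4, 0.0),
  y  = c(0, 1, 0, 1)
)

k1  <- k[, -ncol(k)]  # still a data frame, i.e. a list of columns
k11 <- k[, ncol(k)]   # response as a plain vector

# The same coercion that appears in the error message fails on a data frame
failed <- tryCatch(
  { storage.mode(k1) <- "numeric"; FALSE },
  error = function(e) TRUE
)
failed                # TRUE

# Convert the predictors to a numeric matrix first
x <- as.matrix(k1)
storage.mode(x) <- "numeric"  # succeeds on a matrix
is.matrix(x) && is.numeric(x) # TRUE

# fit <- SIS(x, k11)  # not run here; assumes the SIS package is installed
```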

Resources