I'm trying to streamline my code by storing the model terms in a character vector and passing them to lda() for modeling. The following approach works fine for lm(), but not for qda() or lda(). The error message I receive is shown after the code.
intvars <- c("x*y","y*t","z*w")
intfm <- paste("clickthrough", "~", paste(intvars, collapse = " + "))
lda_model_int <- lda(intfm, data = s_train)
Error in lda.default(intfm, data = s_train) : 'x' is not a matrix
You will have to convert your string to a formula, or you can use reformulate():
intvars <- c("x*y","y*t","z*w")
intfm <- reformulate(intvars,"clickthrough")
lda_model_int <- lda(intfm, data = s_train)
If you want to keep your paste() approach, you will have to wrap the string in as.formula():
intvars <- c("x*y","y*t","z*w")
intfm <- as.formula(paste("clickthrough", "~", paste(intvars, collapse = " + ")))
lda_model_int <- lda(intfm, data = s_train)
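To see why the string fails while the formula works, it can help to compare the objects directly. The sketch below uses only base R and the variable names from the question. Nothing is fitted, so no data is needed. The remark about dispatch is my reading of the error message above (it names lda.default), not something checked against the MASS source.

```r
intvars <- c("x*y", "y*t", "z*w")

# A plain character string: judging by the error above, lda() hands this
# to lda.default(), which expects a matrix rather than a formula.
fm_string <- paste("clickthrough", "~", paste(intvars, collapse = " + "))
class(fm_string)

# Two ways to build a real formula object from the same pieces
fm_reform <- reformulate(intvars, response = "clickthrough")
fm_asform <- as.formula(fm_string)
class(fm_reform)

# Both formulas expand to the same model terms
identical(attr(terms(fm_reform), "term.labels"),
          attr(terms(fm_asform), "term.labels"))
```

As far as I can tell, lm() tolerates the string only because it coerces its formula argument internally, so converting explicitly is the safer habit for any modeling function.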
I keep getting this error message, and I am not sure what went wrong. I am trying to do a linear regression analysis.
ind_v is the independent variable and dep_v is the dependent variable. I switched from data.frame to [] indexing and it doesn't work either. Thank you so much, everyone!
I split the data 70/30 into training and test sets.
linear_regression <- function(training_dataset,
                              test_dataset,
                              dependent_variables,
                              independent_variables){
  formular_me <- paste(dependent_variables, "~", independent_variables)
  linear_model <- lm(formula = formular_me, data = training_dataset)
  ind_v_from_test_dataset <- subset(test_dataset,select=independent_variables)
  linear_model_analysis <- predict(linear_model,ind_v_from_test_dataset)
  dep_v_from_test_dataset <- test_dataset[,dependent_variables]
  RMSE_me <- round(Nrmse(actual = dep_v_from_test_dataset, predicted = linear_model_analysis),digits=2)
  MAE_me <- round(Nmae(actual = dep_v_from_test_dataset,predicted = linear_model_analysis),digits=2)
  R2_me <- round(Nr2(linear_model_analysis),digits=2)
  linear_analysis_error <- dep_v_from_test_dataset - linear_model_analysis
  linear_results<- data.frame(dep_v_from_test_dataset,ind_v_from_test_dataset,linear_analysis_error)
  linear_results<- linear_analysis_error[order(ind_v_from_test_dataset),]
  plot(linear_results[,independent_variables],
       linear_results$ind_v_from_test_dataset,
       pch=4,
       ylab="dependent variable",
       xlab="independent variables",
       main="Linear Regression Errors",
       sub=paste("MAE=",mae,"RMSE=",RMSE," R2=",r2))
  abline(linear_model,col = "blue", lwd=6)
  suppressWarnings(arrows(linear_results[,ind_v_from_test_dataset],
                          linear_results$dep_v_from_test_dataset,
                          linear_results[,independent_variables],
                          linear_results$dep_v_from_test_dataset-linear_results$error,
                          length=0.05,angle=90,code=3,col="red"))
  return(
    list(RMSE_me=RMSE,
         MAE_me=mae,
         R2_me=r2))
}
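For comparison, here is a minimal sketch of the same fit-and-evaluate flow. It is not the asker's function. The Nrmse(), Nmae() and Nr2() helpers are not shown in the question, so plain RMSE, MAE and a squared-correlation R^2 stand in for them. The name linear_regression_sketch and the use of mtcars as data are assumptions for illustration, and the plotting step is left out.

```r
# Hypothetical stand-in for the question's function; the metrics below are
# ordinary RMSE / MAE / squared correlation, not the unseen N*() helpers.
linear_regression_sketch <- function(training_dataset, test_dataset,
                                     dependent_variable,
                                     independent_variables) {
  # Build a formula object from the column names
  model_formula <- reformulate(independent_variables,
                               response = dependent_variable)
  linear_model <- lm(model_formula, data = training_dataset)

  # Predict on the held-out rows and compare with the observed response
  predicted <- predict(linear_model, newdata = test_dataset)
  actual <- test_dataset[[dependent_variable]]
  errors <- actual - predicted

  list(RMSE = round(sqrt(mean(errors^2)), 2),
       MAE  = round(mean(abs(errors)), 2),
       R2   = round(cor(actual, predicted)^2, 2))
}

# 70/30 split on mtcars, used here only as example data
set.seed(1)
train_rows <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
result <- linear_regression_sketch(mtcars[train_rows, ], mtcars[-train_rows, ],
                                   "mpg", c("wt", "hp"))
result
```

Returning the same names that were computed inside the function avoids the kind of mismatch between RMSE_me and RMSE seen in the question's return() call.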
I'm attempting to use a genetic algorithm (I'm not picky about the library; ga and genalg produce the same errors) to identify candidate columns for a linear regression model by minimizing -adj. R^2. I'm using mtcars as a toy data set and regressing on mpg.
I have the following fitness function:
mtcarsnompg <- mtcars[,2:ncol(mtcars)]
evalFunc <- function(string) {
  costfunc <- summary(lm(mtcars$mpg ~ ., data = mtcarsnompg[, which(string == 1)]))$adj.r.squared
  return(-costfunc)
}
ga("binary",fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
This causes:
Error in terms.formula(formula, data = data) :
'.' in formula and no 'data' argument
Researching this error, I decided I could work around it this way:
evalFunc = function(string) {
  child <- mtcarsnompg[, which(string == 1)]
  costfunc <- summary(lm(as.formula(paste("mtcars$mpg ~", paste(child, collapse = "+"))), data = mtcars))$adj.r.squared
  return(-costfunc)
}
ga("binary",fitness = evalFunc, nBits = ncol(mtcarsnompg), popSize = 100, maxiter = 100, seed = 1, monitor = FALSE)
but this results in:
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
I know it should work, because either version of the function evaluates fine when I call it by hand, outside of ga:
solution <- c("1","1","1","0","1","0","1","1","1","0")
evalFunc(solution)
[1] -0.8172511
I also found in "A quick tour of GA" (https://cran.r-project.org/web/packages/GA/vignettes/GA.html) that using "string" in which(string == 1) is something the GA ought to be able to handle, so I have no idea what GA's issue with my function is.
Any thoughts on a way to write this to get ga or genalg to accept the function?
It turns out I hadn't considered that a solution string of all 0s (or indeed a string of 0s with a single 1) would cause the internal paste to produce "mpg ~ ", which is not a valid regression formula.
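One way to guard against that case is sketched below. It is a sketch rather than the fix the asker actually used, and the names predictors and empty_penalty and the penalty value are assumptions chosen for illustration. It builds the formula from column names instead of column contents, so a single selected column works as well. The sign follows the question (smaller is better); I believe GA::ga maximizes its fitness function, so the sign would need flipping there.

```r
predictors <- mtcars[, -1]   # every column except mpg (column 1)
empty_penalty <- 1           # assumed "bad" score for an unusable candidate

evalFunc <- function(string) {
  # Select by name so that one selected column does not collapse to a vector
  selected <- names(predictors)[string == 1]

  # An all-zero candidate would yield "mpg ~ ", which cannot be fitted
  if (length(selected) == 0) return(empty_penalty)

  fit <- lm(reformulate(selected, response = "mpg"), data = mtcars)
  -summary(fit)$adj.r.squared
}

evalFunc(rep(0, 10))                        # all-zero candidate: penalty
evalFunc(c(1, rep(0, 9)))                   # single predictor (cyl)
evalFunc(c(1, 1, 1, 0, 1, 0, 1, 1, 1, 0))   # candidate from the question
```

Calling the function by hand on the edge cases, as in the last three lines, is a quick check before handing it to ga or genalg.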
I have defined MV1 below with a value and used it in the name of the output object. However, when I run the summary function on my model output, I get the following message:
'Error: unexpected symbol in: "assign(paste("Model", MV1, sep = '') <- model1 summary" '
MVx is a value that is defined as a numeric in my code already, and MV1 equates to "_3" in my code.
MV = MVx+1
MV1= paste("_", MV, sep="")
assign(paste("Model", MV1, sep = '') = model1 <- glm(tv1~., family=binomial(link='logit'), data=train70)
summary(Model_3) #Error occurs here
Would anyone know how to get around that?
Try this code
MV = MVx+1
MV1= paste("_", MV, sep="")
model1 <- paste('Model', MV1, sep="")
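For creating an object whose name is built at run time, assign() takes the name as a string and the value as a separate argument, and get() retrieves the object by name. A minimal sketch follows. The value of MVx, the mtcars data and the am ~ wt model are assumptions standing in for the question's MVx, train70 and tv1 ~ ., which are not shown.

```r
MVx <- 2                              # assumed, so that MV1 becomes "_3"
MV  <- MVx + 1
MV1 <- paste0("_", MV)
model_name <- paste0("Model", MV1)    # the string "Model_3"

# Stand-in model: mtcars replaces train70, am ~ wt replaces tv1 ~ .
model1 <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Name first, value second; no `=` or `<-` inside the call
assign(model_name, model1)
summary(get(model_name))

# Alternative that avoids creating variables by name: keep models in a list
models <- list()
models[[model_name]] <- model1
summary(models[["Model_3"]])
```

The list version tends to be easier to loop over when several MV values are fitted in turn.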
I am applying the rpart function to a data frame named train in which all values are integers.
There are too many features to type out, so I have built the formula programmatically.
columns_features <- (paste(colnames(train)[31:50], collapse = "+"))
formulas <- as.formula(train$left_eye_center_x ~ columns_features)
tree_pred <- rpart(formulas , data = train)
Here, I get the error message:
Error in model.frame.default(formula = formulas, data = train, na.action = function (x) : variable lengths differ (found for 'columns_features')
When I inspect formulas, it contains
train$left_eye_center_x ~ columns_features
and columns_features contains
[1] "l_1+ l_2+ l_3+ l_4+ l_5+ l_6+ l_7+ l_8+ l_9+ l_10+ l_11+ l_12+ l_13+ l_14+ l_15+ l_16+ l_17+ l_18+ l_19+ l_20"
As a check, when I enter the column names manually, it works:
formulas <- as.formula(train$left_eye_center_x ~ l_1+ l_2+ l_3+ l_4+ l_5+ l_6+ l_7+ l_8+ l_9+ l_10+ l_11+ l_12+ l_13+ l_14+ l_15+ l_16+ l_17+ l_18+ l_19+ l_20 )
tree_pred <- rpart(formulas , data = train)
Are the double quotes causing the error? What would be a solution? I have many features, so I cannot afford to enter each one manually.
From the ?as.formula examples:
## Create a formula for a model with a large number of variables:
xnam <- paste0("x", 1:25)
(fmla <- as.formula(paste("y ~ ", paste(xnam, collapse= "+"))))
Which implies that in your case the following should work:
formulas <- as.formula(paste("train$left_eye_center_x ~", paste(colnames(train)[31:50], collapse = "+")))
A workaround, instead of your approach, would be the following (NB: I have never used rpart, but I am confident that this works):
formulas <- as.formula(train$left_eye_center_x ~ .)
tree_pred <- rpart(formulas , data = train[,31:50])
If rpart does not like getting indexed data you could define a new dataframe:
train4rpart <- train[,31:50]
tree_pred <- rpart(formulas , data = train4rpart)
Actually, reading through ?rpart, you can skip the whole formula thing:
tree_pred <- rpart(train$left_eye_center_x ~ . , data = train[,31:50])
OR
tree_pred <- rpart(train$left_eye_center_x ~ . , data = train4rpart)
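The difference between the failing and working formulas can be inspected without fitting anything. The sketch below uses a made-up data frame with the question's column names (the real train is not shown, so its size and contents here are assumptions) and only base R.

```r
# Made-up stand-in for `train`: one response plus predictors l_1 .. l_20
set.seed(42)
train <- as.data.frame(matrix(sample(1:100, 50 * 21, replace = TRUE), ncol = 21))
names(train) <- c("left_eye_center_x", paste0("l_", 1:20))

columns_features <- paste(colnames(train)[2:21], collapse = "+")

# Original attempt: columns_features stays a single symbol in the formula,
# so the model would see one variable holding a length-1 string
bad <- left_eye_center_x ~ columns_features
all.vars(bad)

# Pasting the whole formula as text first, then converting, expands it
good <- as.formula(paste("left_eye_center_x ~", columns_features))
length(all.vars(good))

# reformulate() builds the same formula from the vector of names
good2 <- reformulate(colnames(train)[2:21], response = "left_eye_center_x")
identical(all.vars(good), all.vars(good2))
```

Either good or good2 should then be usable as rpart(good, data = train). Naming the response as a bare column rather than train$left_eye_center_x is a design choice that keeps the formula tied to whatever data argument is supplied.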
I have a large dataset (questionnaire results) of mostly categorical variables. I have tested for dependency between the variables using the chi-square test, and there is an incomprehensibly large number of dependencies between them. I used the chaid() function in the CHAID package to detect interactions and separate out (what I hope to be) the underlying structure of these dependencies for each variable. What typically happens is that the chi-square test reveals a large number of dependencies (say 10-20) for a variable, and the chaid function reduces this to something much more comprehensible (say 3-5). What I want to do is extract the names of those variables that were shown to be relevant in the chaid() results.
The chaid() output is in the form of a constparty object. My question is how to extract the variable names associated with the nodes in such an object.
Here is a self-contained code example:
library(evtree) # for the ContraceptiveChoice dataset
library(CHAID)
library(vcd)
library(MASS)
data("ContraceptiveChoice")
longform = formula(contraceptive_method_used ~ wifes_education +
husbands_education + wifes_religion + wife_now_working +
husbands_occupation + standard_of_living_index + media_exposure)
z = chaid(longform, data = ContraceptiveChoice)
# plot(z)
z
# This is the part I want to do programmatically
shortform = formula(contraceptive_method_used ~ wifes_education + husbands_occupation)
# The thing I want is a programmatic way to extract 'shortform' from 'z'
# Examples of use of 'shortform'
loglm(shortform, data = ContraceptiveChoice)
One possible solution:
nn <- nodeapply(z)
n.names <- names(unlist(nn[[1]]))
ext <- unlist(sapply(n.names, function(x) grep("split.varid.", x, value = TRUE)))
ext <- gsub("kids.split.varid.", "", ext)
ext <- gsub("split.varid.", "", ext)
dep.var <- as.character(terms(z)[1][[2]]) # get the dependent variable
plus = paste(ext, collapse=" + ")
mul = paste(ext, collapse=" * ")
shortform <- as.formula(paste (dep.var, plus, sep = " ~ "))
satform <- as.formula(paste (dep.var, mul, sep = " ~ "))
mosaic(shortform, data = ContraceptiveChoice)
#stp <- step(glm(satform, data=ContraceptiveChoice, family=binomial), direction="both")