Getting invalid model formula in ExtractVars when using rpart function in R

The dataset can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
Getting the following error:
formula(formula, data = data) :
invalid model formula in ExtractVars
Using the following code:
install.packages("rpart")
library("rpart")
# you'll need to change the following from windows to work on a linux box:
mydata <- read.csv(file="c:/Users/md7968/downloads/winequality-red.csv")
# grow tree
fit <- rpart(YouSweetBoy ~ "residual sugar" + "citric acid", method = "class", data = mydata)
Mind you, I've changed the delimiters in the CSV file to commas.
Perhaps it's not reading the data correctly. Forgive me, I'm new to R and not a very good programmer.

Look at names(mydata). When you read the file, read.csv() (through read.table()) turns "bad" column names into syntactically valid ones by default (check.names = TRUE). You can't (well, shouldn't) have a space in a column name, so R changes spaces to periods. Plus, you should never have quoted strings in a formula. Try
fit <- rpart(quality ~ residual.sugar + citric.acid, method = "class", data = mydata)
(I have no idea what "YouSweetBoy" was supposed to be, since it isn't in the dataset, so I changed it to "quality".)
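To see the renaming for yourself, here is a minimal sketch that reads a small CSV from a string and compares the default behaviour with check.names = FALSE. The header mimics the wine data, but the two rows of values are made up for illustration.

```r
# Illustrative CSV text; the header mimics the wine data, the values are made up.
csv_text <- "fixed acidity,residual sugar,citric acid,quality
7.4,1.9,0.00,5
7.8,2.6,0.04,6"

# Default: check.names = TRUE runs the header through make.names().
mangled <- read.csv(text = csv_text)
print(names(mangled))

# check.names = FALSE keeps the header exactly as written in the file.
kept <- read.csv(text = csv_text, check.names = FALSE)
print(names(kept))
```

Running names() on your own data frame in the same way tells you which spelling the formula has to use.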

Removing the spaces from the independent variable names and taking off the quotes made it work.
Instead of "residual sugar", use residual_sugar.

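Alternatively, wrap your variable names in backticks (``), for example
`residual sugar`
This should work, provided the column names still contain the spaces (for example, if the file was read with check.names = FALSE):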
fit <- rpart(quality ~ `residual sugar` + `citric acid`, method = "class", data = mydata)
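The difference between backticks and quotes can be checked without rpart. The sketch below uses a small made-up data frame (the name wine and its values are assumptions for illustration) and base R's model.frame(), which is the step where a formula's variables get looked up in the data.

```r
# Made-up data with spaces in the column names (values are illustrative only).
wine <- data.frame(
  quality = c(5, 6, 5, 7),
  `residual sugar` = c(1.9, 2.6, 2.3, 1.8),
  `citric acid` = c(0.00, 0.04, 0.56, 0.10),
  check.names = FALSE
)

# Backticks let the formula refer to the non-syntactic names.
f_ok <- quality ~ `residual sugar` + `citric acid`
vars_used <- all.vars(f_ok)
frame_ok <- model.frame(f_ok, data = wine)

# A quoted string is a character constant, not a variable, so building
# the model frame fails.
res_bad <- tryCatch(
  model.frame(quality ~ "residual sugar", data = wine),
  error = function(e) e
)
print(inherits(res_bad, "error"))
```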

Related

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look at the effects of the environmental variables on a different, composite, variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been through several tutorials and tried many iterations of each issue. What I have provided here is, I think, the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single-column data frame to a vector).
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
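The effect of drop = FALSE can be seen in base R alone. This is a minimal sketch on a made-up data frame (df_demo and its values are assumptions standing in for the real data):

```r
# Small made-up data frame standing in for RDAmorph_Oct6.
df_demo <- data.frame(ALL = c(0.1, 0.9, 0.5), Depth = c(10, 20, 30))

as_vector <- df_demo[, "ALL"]               # dimension dropped: numeric vector
as_frame  <- df_demo[, "ALL", drop = FALSE] # still a 3 x 1 data frame

print(is.data.frame(as_vector))
print(dim(as_frame))
```

Only the second form keeps the column name and the two-dimensional shape that a data matrix on the left-hand side needs.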
The problem is that rda expects a separate data frame for the left-hand side of the formula (ALL in your code), and does not look it up in the one given in the data = argument.
As mentioned above, you can create a new data frame with the variable needed for analysis, but here's a one-line solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin Simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and its rowid function to set the categorical variables to unique integers or, preferably, not use them at all. I also wanted to set the ID vector as site names, but I am too lazy now.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers; this doesn't work for ID (why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data = df, scale = TRUE)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and Depth show the highest variation, which is almost to be expected since only three vectors are used. The integers assigned to the categorical vectors probably have no meaning at all: the function assigns them from top to bottom as it works through the character strings. I am also not really sure which question you want to answer; once that is clear, you can organize the data frame accordingly.

Error in class(x) while creating panel data using plm function

I'm trying to fit a pooled model with the plm function on a balanced panel data set that I imported from Excel.
When I run the code I get the following error:
Error in class(x) <- setdiff(class(x), "pseries") : invalid to set the class to matrix unless the dimension attribute is of length 2 (was 0)
library(plm)
library(readxl)
library(tidyr)
library(rJava)
library(xlsx)
library(xlsxjars)
all_met<- read_excel("data.xlsx", sheet = "all_met")
attach(all_met)
Y_all_met <- cbind(methane)
X_all_met <- cbind(gdp, ecogr, trade)
pdata_all_met <- plm.data(all_met, index=c("id","time"))
pooling_all_met <- plm(Y_all_met ~ X_all_met, data=pdata_all_met, model= "pooling")
After running the code, I expected to get summary statistics of a pooled OLS regression of my data. Can someone tell me how I can fix this issue? Thanks in advance.
1st:
Avoid plm.data and use pdata.frame instead:
pdata_all_met <- pdata.frame(all_met, index=c("id","time"))
If plm.data does not give you a deprecation warning, use a newer version of the package.
2nd (and addressing the question):
If you use the data argument of plm, specify the column names in the formula rather than variables from the global environment, i.e., try this:
plm(methane ~ gdp + ecogr + trade, data=pdata_all_met, model= "pooling")
Check the structure of your data to see whether the variables used in the regression are declared as factors; you can do that by typing str(all_met).
If they are, you should declare them as double or numeric (avoid calling as.numeric() directly on a factor: it returns the internal level codes, not the original values).
Personally, I fixed that with the following specification in the import code:
library(readr)
data <- read_csv("C:/Users/Uness/Desktop/Mydata.csv",
                 col_types = cols(variable1 = col_double(),
                                  variable2 = col_double()))
View(data)
where variable1 and variable2 are the names of the variables I use; make sure you change them if you copy the code ;)
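The warning about as.numeric() on factors is easy to check in base R. A minimal sketch with made-up values:

```r
# A numeric column that was accidentally read as a factor (made-up values).
f <- factor(c("10.5", "3.2", "10.5"))

codes  <- as.numeric(f)                # internal level codes, not the data
values <- as.numeric(as.character(f))  # the numbers as originally written

print(codes)
print(values)
```

Converting through as.character() first recovers the written numbers, which is why declaring the column type at import time is the cleaner fix.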

lme within a user defined function in r

I need to use the mixed-model lme function many times in my code, but I do not know how to use it within a function. Used on its own, lme works just fine, but when used within a function it throws errors:
myfunc<- function(cc, x, y, z)
{
model <- lme(fixed = x ~1 , random = ~ 1|y/z,
data=cc,
method="REML")
}
on calling this function:
myfunc (dbcon2, birthweight, sire, dam)
I get the error :
Error in model.frame.default(formula = ~x + y + z, data = list(animal
= c("29601/9C1", : invalid type (list) for variable 'x'
I think there is a different procedure for this that I am unaware of. Any help would be greatly appreciated.
Thanks in advance
Not sure if this is what you are looking for, but as correctly pointed out by @akrun, you can use paste. I am using paste0 here (a special case of paste with no separator); both concatenate strings.
The idea is to build the formula text by concatenating the variable names. Since paste returns a character string, which cannot be used as a formula to build a model, you need to convert that string with as.formula, wrapped around the paste0 call.
To understand the above, try writing a formula like the one below using paste0:
formula <-paste0("mpg~", paste0("hp","+", "am"))
print(formula)
[1] "mpg~hp+am"
class(formula)
[1] "character" ##This should ideally be a formula rather than character
formula <- as.formula(formula) ##conversion of character string to formula
class(formula)
[1] "formula"
To work inside a model, you always need a formula object. Also try to learn about the collapse and sep options of paste; they are very handy.
I don't have your data, hence I have used the mtcars data to illustrate.
library("nlme")
myfunc <- function(cc, x, y, z)
{
  # x, y and z are column names passed as character strings
  model <- lme(fixed = as.formula(paste0(x, " ~ 1")),
               random = as.formula(paste0("~ 1 | ", y, "/", z)),
               data = cc,
               method = "REML")
  model
}
models <- myfunc(cc=mtcars, x="hp", y="mpg", z="am")
summary(models)
You can read more about paste by typing ?paste in your console.
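As an alternative to paste0 plus as.formula, base R's reformulate() builds a formula directly from character vectors. The sketch below uses lm on mtcars rather than lme so that it runs with base R only; fit_by_name is a hypothetical helper name, not something from nlme.

```r
# Hypothetical helper: fit a linear model from column names given as strings.
fit_by_name <- function(data, response, predictors) {
  # reformulate() builds response ~ predictor1 + predictor2 + ... as a formula
  f <- reformulate(predictors, response = response)
  lm(f, data = data)
}

fit <- fit_by_name(mtcars, response = "mpg", predictors = c("hp", "am"))
print(formula(fit))
print(names(coef(fit)))
```

The same pattern should carry over to the fixed part of an lme call, while a random part such as ~ 1 | y/z still needs the paste0 approach shown above.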

Panel regression error in R

I am running an unbalanced panel regression.
Dependent variable is Gross
Independent variables are DEX, GRW, Debt and Life.
Time is Year
Grouping is Country
I have successfully executed the following commands:
tino=read.delim("clipboard")
tino
summary(tino)
Dep<- with(tino, cbind(Gross, index=c("Country, Year"))
Ind<- tino[ , c('DEX', 'GRW' , 'Debt', 'Life')]
install.packages("plm")
library('plm')
pandata<-plm.data(tino)
tino
summary(pandata)
summary(Dep)
summary(Ind)
However, when I run the command below for results, I get an error.
pooling<- plm(Dep~Ind, data = pandata, model= "pooling")
gives the error below:
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data,: invalid type (list) for variable 'Ind'
Please help.
Thanks
Without access to your data, it is impossible to confirm that this will work, but I am going to try to point out several issues in your code that are likely contributing to the error.
This line is fine:
tino=read.delim("clipboard")
Here is where you start to make errors:
Dep<- with(tino, cbind(Gross, index=c("Country, Year"))
Ind<- tino[ , c('DEX', 'GRW' , 'Debt', 'Life')]
with() is typically used to create new vectors out of a data.frame. All it does is allow you to drop the $ notation for referencing variables in a data.frame and nothing else. From the read of your code, you may be thinking that with() is actually modifying the tino object, which it is not.
Further, when you want to construct a data.frame for use in a regression model, you want all of the right-hand and left-hand side variables in one data.frame or matrix rather than separating them. This is because most modelling functions operate using a "formula" and data argument, which are passed to model.frame() to preprocess the data before modelling.
This means you presumably want to do something like the following, skipping all of the above:
pandata <- pdata.frame(tino, index = c("Country", "Year"))
pooling <- plm(Gross ~ DEX + GRW + Debt + Life, data = pandata, model = "pooling")
summary(pooling)
If you have a lot of right-hand side variables, you can build the formula from a vector of column names with reformulate() instead of typing each one:
rhs <- c('DEX', 'GRW', 'Debt', 'Life')
fm <- reformulate(rhs, response = "Gross")
pooling2 <- plm(fm, data = pandata, model = "pooling")
This keeps Country and Year in the data for the index while leaving them out of the regressors.
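The two points above (with() does not modify the data frame, and a formula's variables should be columns of the data argument) can be checked with base R's lm. This is a sketch on made-up numbers; tino_demo and its values are assumptions, and lm stands in for plm so that no extra package is needed.

```r
# Made-up stand-in for tino (values are illustrative only).
tino_demo <- data.frame(
  Gross = c(10, 12, 15, 11, 18, 20),
  DEX   = c(1.0, 1.5, 2.1, 1.2, 2.8, 3.0),
  GRW   = c(0.5, 0.7, 0.6, 0.9, 1.1, 1.0)
)

# with() only evaluates an expression using the columns; tino_demo is unchanged.
cols_before <- names(tino_demo)
gross_mean <- with(tino_demo, mean(Gross))
unchanged <- identical(names(tino_demo), cols_before)

# Passing a whole data frame as one formula term fails in model.frame().
Ind <- tino_demo[, c("DEX", "GRW")]
res_bad <- tryCatch(lm(Gross ~ Ind, data = tino_demo), error = function(e) e)

# Naming the columns in the formula lets model.frame() find them in data.
fit <- lm(Gross ~ DEX + GRW, data = tino_demo)
print(names(coef(fit)))
```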

predict in caret ConfusionMatrix is removing rows

I'm fairly new to using the caret library and it's causing me some problems. Any help/advice would be appreciated. My situation is as follows:
I'm trying to run a generalized linear model on some data and, when I run it through confusionMatrix, I get 'the data and reference factors must have the same number of levels'. I know what this error means (I've run into it before), but I've double and triple checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two values in the confusionMatrix are disagreeing. I've run almost the exact same code for a different variable and it works fine.
I went through every variable and everything was balanced until I got to the predict call inside confusionMatrix. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))
I'm not sure, as I don't know your data structure, but I wonder if this is due to the way you set up modelFit using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
This specifies y = training2$hold1yes0no and x = trainPC2.
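One thing worth ruling out when predictions come back shorter than the input is missing values in the predictors. That this is the cause here is an assumption; the thread does not confirm it. A minimal base R check, on a made-up data frame standing in for trainTR2:

```r
# Made-up predictor data standing in for trainTR2.
new_data <- data.frame(x1 = c(1, NA, 3, 4, 5), x2 = c(2, 5, NA, 8, 9))

# Rows with at least one missing predictor.
incomplete <- !complete.cases(new_data)
n_incomplete <- sum(incomplete)
rows_left <- nrow(na.omit(new_data))

print(which(incomplete))
print(c(total = nrow(new_data), incomplete = n_incomplete, left = rows_left))
```

If the number of incomplete rows in your own data matches the gap between the two counts, the reference vector passed to confusionMatrix would need the same rows removed.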
