Panel regression error in R - r

I am running an unbalanced panel regression.
Independent Variable is Gross
Dependent Varibales are DEX, GRW, Debt and Life.
Time is Year
Grouping is Country
I have successfully executed the following commands:
tino=read.delim("clipboard")
tino
summary(tino)
Dep<- with(tino, cbind(Gross, index=c("Country, Year"))
Ind<- tino[ , c('DEX', 'GRW' , 'Debt', 'Life')]
install.packages("plm")
library('plm')
pandata<-plm.data(tino)
tino
summary(pandata)
summary(Dep)
summary(Ind)
However, When I run the Command below for results, I get an error.
pooling<- plm(Dep~Ind, data = pandata, model= "pooling")
gives error below
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data,: invalid type (list) for variable 'Ind'
Please help.
Thanks

Without access to your data, it is impossible to confirm that this will work, but I am going to try to point out several issues in your code that are likely contributing to the error.
This line is fine:
tino=read.delim("clipboard")
Here is where you start to make errors:
Dep<- with(tino, cbind(Gross, index=c("Country, Year"))
Ind<- tino[ , c('DEX', 'GRW' , 'Debt', 'Life')]
with() is typically used to create new vectors out of a data.frame. All it does is allow you to drop the $ notation for referencing variables in a data.frame and nothing else. From the read of your code, you may be thinking that with() is actually modifying the tino object, which it is not.
Further, when you want to construct a data.frame for use in a regression model, you want all of the right-hand and left-hand side variables in one data.frame or matrix rather than separating them. This is because most modelling functions operate using a "formula" and data argument, which are passed to model.frame() to preprocess the data before modelling.
This means you presumably want to do something like the following, skipping all of the above:
pandata <- plm.data(tino, index = c("Country", "Year"))
pooling <- plm(Gross ~ DEX + GRW + Debt + Life, data = pandata, model = "pooling")
summary(pooling)
If you have a lot of right-hand side variables, you can subset your data.frame, with something like:
pandata2 <- plm.data(tino[ , c('Gross', 'DEX', 'GRW' , 'Debt', 'Life')], index = c("Country", "Year"))
pooling2 <- plm(Gross ~ ., data = pandata2, model = "pooling")
using the . notation as a shorthand for "all other columns in the data."

Related

Insert multiple variables in the lda function from a list R

I have 6 variables for which I want to test which one is the best combination for a linear discriminant analysis lda .
I created a list with all the combinations.
I would like to loop through this list and run a lda for each combination
The lda formula wants column names to be specified with a + as follow:
lda(classification~ variable1+variable2, data=mydata)
However if I insert the value of my list in the lda function I get an error
unlist(mylist[i])
"variable1" "variable2"
Error in model.frame.default(formula = mylist ~ unlist(mylist[i]), :
variable lengths differ
reproducible example (variables are constant for illustrative purpose)
classification<-c("a","b","c","d","e","f")
variable1<-c(1,1,1,1,1,1)
variable2<-c(1,1,1,1,1,1)
variable3<-c(1,1,1,1,1,1)
variable4<-c(1,1,1,1,1,1)
variable5<-c(1,1,1,1,1,1)
variable6<-c(1,1,1,1,1,1)
mydata<-data.frame("classification","variable1","variable2","variable3","variable4","variable5","variable6")
para_combo1<-combn(mydata[2:7],1, simplify = FALSE)
para_combo2<-combn(mydata[2:7],2, simplify = FALSE)
para_combo3<-combn(mydata[2:7],3, simplify = FALSE)
para_combo4<-combn(mydata[2:7],4, simplify = FALSE)
para_combo5<-combn(mydata[2:7],5, simplify = FALSE)
para_combo6<-combn(mydata[2:7],6, simplify = FALSE)
para_combo<-c(para_combo1,para_combo2, para_combo3,
para_combo4,para_combo5, para_combo6)
#manual example
lda_table<-lda(classification~ variable1+variable2, data= mydata)
#example I would loop
lda_table<-lda(classification~ para_combo[7] , data= mydata)
I do not know how I could code my combination in the format lda requires
Apart from providing a formula, you can alternatively provide the features and the classes in the parameters x and grouping, respectively:
lda.result <- lda(x=mydata[,c(1,3)], grouping=mydata$classification)
# or simply:
lda.result <- lda(mydata[,c(1,3)], mydata$classification)
Note that the function lda in R actually does not only work with two variables, but with an arbitrary number of variables (sometimes called "multiple discriminant analysis"). There is thus no need to try out all pairs of variable combinations, but you can let lda figure it out for itself.

error with rda test in vegan r package. Variable not being read correctly

I am trying to perform a simple RDA using the vegan package to test the effects of depth, basin and sector on genetic population structure using the following data frame.
datafile.
The "ALL" variable is the genetic population assignment (structure).
In case the link to my data doesn't work well, I'll paste a snippet of my data frame here.
I read in the data this way:
RDAmorph_Oct6 <- read.csv("RDAmorph_Oct6.csv")
My problems are two-fold:
1) I can't seem to get my genetic variable to read correctly. I have tried three things to fix this.
gen=rda(ALL ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Error in eval(specdata, environment(formula), enclos = globalenv()) :
object 'ALL' not found
In addition: There were 12 warnings (use warnings() to see them)
so, I tried things like:
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
so I specified numeric
> RDAmorph_Oct6$ALL = as.numeric(RDAmorph_Oct6$ALL)
> gen=rda("ALL ~ Depth + Basin + Sector", data=RDAmorph_Oct6, na.action="na.exclude")
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
I am really baffled. I've also tried specifying each variable with dataset$variable, but this doesn't work either.
The strange thing is, I can get an rda to work if I look the effects of the environmental variables on a different, composite, variable
MC = RDAmorph_Oct6[,5:6]
H_morph_var=rda(MC ~ Depth + Basin + Sector, data=RDAmorph_Oct6, na.action="na.exclude")
Note that I did try to just extract the ALL column for the genetic rda above. This didn't work either.
Regardless, this leads to my second problem.
When I try to plot the rda I get a super weird plot. Note the five dots in three places. I have no idea where these come from.
I will have to graph the genetic rda, and I figure I'll come up with the same issue, so I thought I'd ask now.
I've been though several tutorials and tried many iterations of each issue. What I have provided here is I think the best summary. If anyone can give me some clues, I would much appreciate it.
The documentation, ?rda, says that the left-hand side of the formula specifying your model needs to be a data matrix. You can't pass it the name of a variable in the data object as the left-hand side (or at least if this was ever anticipated, doing so exposes bugs in how we parse the formula which is what leads to further errors).
What you want is a data frame containing a variable ALL for the left-hand side of the formula.
This works:
library('vegan')
df <- read.csv('~/Downloads/RDAmorph_Oct6.csv')
ALL <- df[, 'ALL', drop = FALSE]
Notice the drop = FALSE, which stops R from dropping the empty dimension (i.e. converting the single column data frame to a vector.
Then your original call works:
ord <- rda(ALL ~ Basin + Depth + Sector, data = df, na.action = 'na.exclude')
The problem is that rda expects a separate df for the first part of the formula (ALL in your code), and does not use the one in the data = argument.
As mentioned above, you can create a new df with the variable needed for analysis, but here's a oneline solution that should also work:
gen <- rda(RDAmorph_Oct6$ALL ~ Depth + Basin + Sector, data = RDAmorph_Oct6, na.action = na.exclude)
This is partly similar to Gavin simpson's answer. There is also a problem with the categorical vectors in your data frame. You can either use library(data.table) and the rowid function to set the categorical variables to unique integers. Most preferably, not use them. I also wanted to set the ID vector as site names, but I am too lazy now.
library(data.table)
RDAmorph_Oct6 <- read.csv("C:/........../RDAmorph_Oct6.csv")
#remove NAs before. I like looking at my dataframes before I analyze them.
RDAmorph_Oct6 <- na.omit(RDAmorph_Oct6)
#I removed one duplicate
RDAmorph_Oct6 <- RDAmorph_Oct6[!duplicated(RDAmorph_Oct6$ID),]
#Create vector with only ALL
ALL <- RDAmorph_Oct6$ALL
#Create data frame with only numeric vectors and remove ALL
dfn <- RDAmorph_Oct6[,-c(1,4,11,12)]
#Select all categorical vectors.
dfc <- RDAmorph_Oct6[,c(1,11,12)]
#Give the categorical vectors unique integers doesn't do this for ID (Why?).
dfc2 <- as.data.frame(apply(dfc, 2, function(x) rowid(x)))
#Bind back with numeric data frame
dfnc <- cbind.data.frame(dfn, dfc2)
#Select only what you need
df <- dfnc[c("Depth", "Basin", "Sector")]
#The rest you know
rda.out <- rda(ALL ~ ., data=df, scale=T)
plot(rda.out, scaling = 2, xlim=c(-3,2), ylim=c(-1,1))
#Also plot correlations
plot(cbind.data.frame(ALL, df))
Sector and depth have the highest variation. Almost logical, since there are only three vectors used. The assignment of integers to the categorical vector has probably no meaning at all. The function assigns from top to bottom unique integers to the following unique character string. I am also not really sure which question you want to answer. Based on this you can organize the data frame.

Programmatically detect function calls in R formulae, e.g. y ~ x + log(z), and surround them in backticks

Let me explain my goal first because while the title expresses my strategy, I don't think it is likely to be the only way to solve the problem.
I have an R function to which I pass fitted model objects, like those from lm, and the function extracts the model frame, saves that as a data frame, standardizes the variables in the new data frame, then refits the model with the standardized variables to ease the interpretation of the model's coefficients.
Example code without wrapping it in a function:
mod <- lm(mpg ~ wt, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale))
standardized_mod <- update(mod, data = new_data)
Now a summary of standardized_mod by virtue of being fitted with standardized data will give standardized coefficients.
This isn't the most efficient way of doing things, I admit, since I could do something like multiplying the estimates and SEs by each variable's standard deviation. But in the context of the function, I'm trying to be more flexible; this gets less straightforward when working with survey package objects and the like. I also use the same logic to fit models with interaction terms for simple slopes analysis. But this is besides the main point of the question, I just want to offer some explanation to avoid getting bogged down with "there's other ways to standardize coefficients" responses. I'm more interested in this general problem with formulae than the specific application.
The solution above falls apart when a function is applied to any of the variables. For example,
mod <- lm(mpg ~ log(wt), data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
standardized_mod <- update(mod, data = new_data)
This will break on update(mod, data = new_data), because lm is going to look for a column called wt to apply log to in new_data, which only has columns called mpg and log(wt).
What I would like to do is manipulate the model formula in such a way that it goes from mpg ~ log(data) to mpg ~ `log(data)`. Of course, if it was just log I was worried about, I might be able to get something really hacky going to address it. But I'd like to be able to do the same regardless of the function in the formula, like if it's poly or some such.
Here are some solutions I've considered:
Instead of update, re-fit the model with lm directly and use the . for the RHS of the formula. This would work for some cases, but has big drawbacks, too. This will ignore any interaction terms in the original formula or other arithmetic uses of the formula from the original model. It also won't fix the problem if the function was applied to the LHS of the formula in the original model.
Use some kind of convoluted regex matching to isolate terms that appear to be functions on the basis of being right before (, but as a general rule I'm fearful of using string manipulation since it may fail in confusing ways. I'm not completely ruling this route out, but I haven't wrapped my head around how to do it safely and am not sure how to match terms with functions without accidentally capturing other parts of the formula.
I've tried messing around with the terms object and trying to use that as a way to use update on the formula itself, but haven't had much luck figuring out how to edit the terms object in the right ways.
We can avoid having to re-create the formula like this. mm0 is the model matrix columns except for the intercept. scale that giving mm0_std0. Now compute the new standardized lm:
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
response <- mod$model[1]
mm0 <- model.matrix(mod)[, -1]
mm0_std <- scale(mm0)
mod_std <- lm(cbind(response, mm0_std))
If you do want the formula this will give it:
formula(mod_std)
## mpg ~ `log(wt)` + qsec + `log(wt):qsec`
## <environment: 0x000000000b1988c8>
I've thought of another potential solution as well, but I've not extensively tested it and it uses regex, which is in my understanding not the most R way of doing things.
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
We have the usual start, above.
Now I pull the variable names from the terms object.
vars <- as.character(attributes(terms(mod))$variables)
vars <- vars[-1] # gets rid of "list"
And save the full formula as a string.
char_form <- as.character(deparse(formula(mod)))
Now I iterate through the variables and use regex to surround each one in backticks. This gets around the trickier regex I was worried about with regard to detect which variables had functions applied.
for (var in vars) {
backtick_name <- paste("`", var, "`", sep = "")
char_form <- gsub(var, backtick_name, char_form, fixed = TRUE)
}
If I want to specify a variable not to standardize, like the outcome variable, I can exclude it from the vars vector programmatically. For instance, I can do this:
response <- as.character(formula(mod))[2]
vars <- vars[vars != response]
Of course, we can remove the response by dropping the first item in the list, but the above is for demonstrative purposes.
Now I can refit the model with the new data and new formula.
new_model <- update(mod, formula = as.formula(char_form), data = new_data)
In this narrow case, I don't really need to use update since I have all I need for lm. But if I was starting with a glm object or some other model, other user-supplied arguments like family are preserved.
Note: Weights and offsets can be problematic here, but it's not an intractable problem. I think the most straightforward thing to do is explicitly exclude columns named "(weights)" and "(offset)" from the model frame before scaling, then cbinding it back together afterwards. Then the user can use conditionals or some such to decide when to supply weights = `(weights)` and offset = `(offset)` arguments to update.

Getting invalid model formula in ExtractVars when using rpart function in R

The dataset can be downloaded from http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
Getting the following error:
formula(formula, data = data) :
invalid model formula in ExtractVars
Using the following code:
install.packages("rpart")
library("rpart")
# you'll need to change the following from windows to work on a linux box:
mydata <- read.csv(file="c:/Users/md7968/downloads/winequality-red.csv")
# grow tree
fit <- rpart(YouSweetBoy ~ "residual sugar" + "citric acid", method = "class", data = mydata
Mind you I've changed the delimiters in the CSV file to commas.
perhaps it's not reading the data correctly. Forgive me, I'm new to R and not a very good programmer.
Look at names(mydata). When you create a data.frame, read.table() will turn "bad" column names into good ones. You can't (well, shouldn't) have a space in a column name so R changes spaces to periods. Plus, you should never have quoted strings in a formula. Try
fit <- rpart(quality ~ residual.sugar + citric.acid, method = "class", data = mydata)
(I have no idea what "YouSweetBoy" was supposed to be since that wasn't in the dataset so i changed it to "quality").
Removing the space in independent variable names and taking off the quotes made it to work.
Instead of "residual sugar", use residual_sugar
Alternatively, wrap your variable names with ``
So
`residual sugar`
This should work:
fit <- rpart(quality ~ `residual sugar` + `citric acid`, method = "class", data = mydata)

Subsetting data breaks GLM

I have a GLM Logit regression that works correctly, but when I add a subset argument to the GLM command, I get the following error:
invalid type (list) for variable '(weights)'.
So, the following command works:
glm(formula = A ~ B + C,family = "binomial",data = Data)
But the following command yield the error:
glm(formula = A ~ B + C,family = "binomial",data = Data,subset(Data,D<10))
(I realize that it may be difficult to answer this without seeing my data, but any general help on what may be causing my problem would be greatly appreciated)
Try subset=D<10 instead (you don't need to specify Data again, it is implicitly used as the environment for the subset argument). Because you haven't named the argument, R is interpreting it as the weights argument (which is the next argument after data).

Resources