Using data from the fivethirtyeight package...
library(fivethirtyeight)
library(dplyr)   # for the pipe used below
grads <- college_recent_grads
I created a subset of the grads data containing the variables I need:
data <- grads[, c("men", "major_category", "employed",
                  "employed_fulltime_yearround", "p25th",
                  "p75th", "total")]
Then I split the data subset by major category and omitted the one NA value in the data:
majorcats <- split(data, data$major_category)
names(majorcats)
majorcats <- majorcats %>% na.omit()
Then I tried to run a regression model in a function called facts, where the user can specify x, y, and z, with z being a major category (which is why I split the data subset by major_category):
facts <- function(x, y, z){
  category <- majorcats[["z"]]
  summary(lm(y ~ x, data = category))
}
Unfortunately, when I try to pass variables from the majorcats data set into facts, such as
facts(men, p25th, Arts)
I get the error below:
Error in model.frame.default(formula = y ~ x, data = category,
drop.unused.levels = TRUE) :
invalid type (NULL) for variable 'y'
Called from: model.frame.default(formula = y ~ x, data = category,
drop.unused.levels = TRUE)
Browse[1]>
Can someone please explain what this error means, and how I might be able to fix it?
Simply pass the parameters as string literals and build the formula from the strings:
facts <- function(x, y, z){
  category <- majorcats[[z]]
  model <- as.formula(paste(y, "~", x))
  # ALTERNATIVE: model <- reformulate(x, response = y)
  summary(lm(model, data = category))
}
facts("men", "p25th", "Arts")
I want to define a function panel_fit which will perform a panel fit for a dependent variable (y) and independent variables (x). The panel regression should have a linear trend within it.
Here is my work so far on the following data:
library(plm)
library(dplyr)   # for %>%, group_by_at() and mutate()
data("EmplUK", package = "plm")
dep_var <- EmplUK['capital']
# deleting the dependent variable - meaningless in itself, but needed to demonstrate the function
df1 <- EmplUK[-6]
panel_fit <- function(y, x, inputs = list(), model_type) {
  x[, length(x) + 1] <- y
  x <- x %>%
    group_by_at(1) %>%
    mutate(Trend = row_number())
  varnames <- names(x)[3:(length(x))]
  varnames <- varnames[!(varnames == names(y))]
  form <- paste0(varnames, collapse = "+")
  model <- plm(as.formula(paste0(names(y), "~", form)), data = x, model = model_type)
  summary(model)
}
This is what I get when running:
panel_fit(dep_var, df1, model_type = 'within')
Warning messages:
1: In Ops.pseries(y, bX) :
indexes of pseries have same length but not same content: result was assigned first operand's index
Do you know why I get this warning? What should I do to solve this problem?
I want to create variables using character names in a for loop.
For instance:
cultivar <- c("uri", "keumgang", "saeal", "ahnbaek")
for(i in cultivar){
  anova_height_i <- aov(plant_height ~ treatment, data = data_i)
}
From the above, I expect to have variables
anova_height_uri <- aov(plant_height ~ treatment, data = data_uri)
...
...
anova_height_ahnbaek <- aov(plant_height ~ treatment, data = data_ahnbaek)
But I got the error message:
Error in aov(plant_height ~ treatment, data = data_i) :
object 'data_i' not found
How could I get the results that I expect?
You need to use assign and get:
cultivar <- c("uri", "keumgang", "saeal", "ahnbaek")
for(i in cultivar){
  assign(
    paste0('anova_height_', i),
    aov(plant_height ~ treatment, data = get(paste0('data_', i)))
  )
}
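As a side note, rather than creating many loose anova_height_* variables, it is often easier to keep the results in one named list; here is a sketch of that idea (still assuming the data_* data frames exist in your workspace):
anova_height <- lapply(
  setNames(cultivar, cultivar),
  function(cv) aov(plant_height ~ treatment, data = get(paste0("data_", cv)))
)
summary(anova_height[["uri"]])   # access one result by cultivar name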
I would like to use lapply() to compute several models in R, but it seems that the update() function can't handle models generated through lapply().
A minimal example:
d1 <- data.frame(y = log(1:9), x = 1:9, trt = rep(1:3, each = 3))
f <- list(y ~ 1, y ~ x, y ~ trt)
modsa <- lapply(f, function(formula) glm(formula, data = d1))
modsb <- lapply(f, glm, data = d1)
update(modsa[[1]], data = d1[1:7, ])
#> Error: object of type 'closure' is not subsettable
update(modsb[[1]], data = d1[1:7, ])
#> Error in FUN(formula = X[[i]], data = d1[1:7, ]): could not find function "FUN"
Is there a way that allows update() to deal with models generated through lapply()?
The error occurs because the call element of each glm object stores the argument name that was passed to the anonymous function, rather than the formula itself:
modsa <- lapply(f, function(x) glm(x, data = d1))
modsa[[1]]$call
glm(formula = x, data = d1)
# compare with a single instance of the model
moda1 <- glm(y ~ 1, data = d1)
moda1$call
glm(formula = y ~ 1, data = d1)
If you add the formula back in, update() correctly recreates the call:
update(modsa[[1]], data = d1[1:7, ], formula=f[[1]])
This doesn't work for the second version (modsb), but if you manually repair the call element, the update functionality is rescued:
modsb[[1]]$call <- getCall(moda1)
update(modsb[[1]], data = d1[1:7, ])
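More generally (a sketch of the same idea), you can patch the actual formula into each stored call so that the whole list becomes update-able:
# replace the symbol stored in each call with the corresponding formula
for (i in seq_along(modsa)) modsa[[i]]$call$formula <- f[[i]]
update(modsa[[1]], data = d1[1:7, ])   # now re-fits without respecifying the formula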
Esther is correct, the problem is with the call element of glm. From ?update:
‘update’ will update and (by default) re-fit a model. It does this by
extracting the call stored in the object, updating the call and (by
default) evaluating that call.
As already mentioned, one can update including the formula as well:
update(modsa[[1]], data = d1[1:7, ], formula=f[[1]])
If for some reason this is not convenient, here is how to run your lapply() so that it directly assigns the correct formula to the call element:
modsa <- lapply(f, function(formula) eval(substitute(glm(F, data = d1), list(F=formula))))
This substitutes the respective formula into the glm call and then evaluates it. With this long one-liner you can run update(modsa[[1]], data = d1[1:7, ]) with no problem.
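As a quick check (my own addition), the call stored by that one-liner now contains the literal formula, which is exactly what update() needs in order to re-evaluate it:
modsa[[1]]$call
#> glm(formula = y ~ 1, data = d1)
update(modsa[[1]], data = d1[1:7, ])   # re-fits on the subset without further arguments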
In the modelr package the function gather_predictions can be used to add predictions from multiple models to a data frame, but I'm unsure how to specify these models in the function call. The help documentation gives the following example:
df <- tibble::data_frame(
  x = sort(runif(100)),
  y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)
m1 <- lm(y ~ x, data = df)
grid <- data.frame(x = seq(0, 1, length = 10))
grid %>% add_predictions(m1)
m2 <- lm(y ~ poly(x, 2), data = df)
grid %>% spread_predictions(m1, m2)
grid %>% gather_predictions(m1, m2)
Here the models are mentioned explicitly in the function call. That works fine when we want predictions for just a few models, but what if we have a large or unknown number of models? In that case manually specifying them isn't really workable anymore.
The way the help documentation describes the arguments also seems to suggest that every model has to be passed as a separate argument:
gather_predictions and spread_predictions take multiple models. The
name will be taken from either the argument name or the name of the
model.
And passing a list of models into gather_predictions, for example, doesn't work.
Is there some easy way to input a list / large amount of models to gather_predictions?
example for 10 models in a list:
modelslist <- list()
for (N in 1:10) {
  modelslist[[N]] <- lm(y ~ poly(x, N), data = df)
}
If having the models stored some other way than a list works better, that's fine as well.
m <- grid %>% gather_predictions(lm(y ~ poly(x, 1), data = df))
for (N in 2:10) {
  m <- rbind(m, grid %>% gather_predictions(lm(y ~ poly(x, N), data = df)))
}
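An alternative that avoids growing m with rbind inside a loop (a sketch, assuming the models are kept in modelslist as above and given names) is to replicate what gather_predictions does per model and bind the results with a model id column:
library(purrr)    # for map_dfr()
library(modelr)   # for add_predictions()
names(modelslist) <- paste0("poly", seq_along(modelslist))   # these names become the model column
m <- map_dfr(modelslist, ~ add_predictions(grid, .x), .id = "model")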
There are workarounds to solve this problem. My approach was to:
1. build a list of models with specific names
2. use a tweaked version of modelr::gather_predictions() to apply all models in the list to data
# prerequisites
library(tidyverse)
set.seed(1363)
# I'll use the generic name 'data' throughout the code, so you can easily try other datasets.
# For this example I'll use your data df
data <- df
# data visualization
ggplot(data, aes(x, y)) +
  geom_point(size = 3)
(plot of your sample data)
# build a list of models
models <- vector("list", length = 5)
model_names <- vector("character", length = 5)
for (i in 1:5) {
  modelformula <- str_c("y ~ poly(x,", i, ")", sep = "")
  models[[i]] <- lm(as.formula(modelformula), data = data)
  model_names[[i]] <- str_c('model', i)  # remember we name the models here sequentially
}
# apply names to the models list
names(models) <- model_names
# this is a modified version of modelr::gather_predictions() that accepts a list of models
gather.predictions <- function(data, models, .pred = "pred", .model = "model") {
  df <- map2(models, .pred, modelr::add_predictions, data = data)
  names(df) <- names(models)
  bind_rows(df, .id = .model)
}
# the rest is the same as modelr's function...
grids <- gather.predictions(data = data, models = models, .pred = "y")
ggplot(data, aes(x, y)) +
  geom_point() +
  geom_line(data = grids, colour = "red") +
  facet_wrap(~ model)
(plot: polynomial models of degree 1 to 5 applied to your sample data)
Side note: there are good reasons why I chose strings to build the models; that is a topic for another discussion.
Using the setup:
id <- c(1,1,2,2,3,4)
t <- c(1,2,1,2,1,1)
x <- c(1,2,2,1,2,1)
y <- c(1,0,0,0,1,0)
df <- data.frame(id, t, x, y)
Preparing fitted values from an OLS regression as the starting values:
tstart <- lm(y ~ x, data = df)
Running probit:
library(pglm)
tfit <- pglm(y ~ x, data = df, index = c("id", "t"),
             model = "within", family = binomial('probit'), start = tstart$fitted.values)
This returns the error:
Error in lnl.binomial(param = start, y = y, X = X, id = id, model = model, :
object 'Li' not found
This error seems very uninformative to me. I am not referring to any object 'Li' anywhere in my calls, and I have no idea what this object is supposed to be.
The traceback makes it seem to occur in the function:
9: lnl.binomial(param = start, y = y, X = X, id = id, model = model,
link = link, rn = rn) at <text>#1
But when I try to inspect the code of the function in which the error occurs, I cannot even find a function called lnl.binomial().
Where did I go wrong?