I want to convert the following string vector:
variables <- c("temperature", "rain", "sun_days", "season")
into the following formula:
formula <- pred ~ treatment*(temperature + rain + sun_days + season)
The way I converted the variables vector into a formula style is the following:
predictors <- paste0(variables, collapse = "+")
However, it does not do the trick when I write the formula in the following way:
formula <- pred ~ treatment*(variables)
It doesn't work, I think because of the quotation marks ("") around the elements of the string vector.
Any idea?
formula <- as.formula(
paste("pred ~ treatment * (", paste(variables, collapse = "+"), ")")
)
Result:
> formula
pred ~ treatment * (temperature + rain + sun_days + season)
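A similar result can be had with base R's reformulate(), which parses the right-hand side and attaches the response for you. This is only a minimal sketch; the object names rhs, f and labels are illustrative and not part of the original question.

```r
variables <- c("temperature", "rain", "sun_days", "season")

# Collapse the predictor names into one parenthesised sum, then let
# reformulate() parse it and attach the response.
rhs <- sprintf("treatment * (%s)", paste(variables, collapse = " + "))
f <- reformulate(rhs, response = "pred")

# Term labels the formula expands into: the main effects plus each
# treatment interaction.
labels <- attr(terms(f), "term.labels")
```

Inspecting labels is a quick way to confirm that every variable picked up its interaction with treatment before the formula is handed to a model function.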
Related
In R, when using GLM, you can include all variables simply by using a . as shown in "How to succinctly write a formula with many variables from a data frame?"
For example:
y <- c(1,4,6)
d <- data.frame(y = y, x1 = c(4,-1,3), x2 = c(3,9,8), x3 = c(4,-4,-2))
mod <- lm(y ~ ., data = d)
However, I am struggling to do this with svydesign. I have many explanatory variables plus an ID and a weight variable, so first I create my survey design:
des <- svydesign(ids = ~id, weights = ~wt, data = df)
Then I try creating my binomial model using weights:
binom <- svyglm(y ~ ., design = des, family = "binomial")
But I get the error:
Error in svyglm.survey.design(y ~ ., design = des, family = "binomial") :
all variables must be in design = argument
What am I doing wrong?
You typically wouldn't want to do this, because "all the variables" would include design metadata such as weights, cluster indicators, stratum indicators, etc.
You can use colnames to extract all the variable names from a design object and then use reformulate, probably after subsetting the names, e.g. with the api example in the package:
> all_the_names <- colnames(dclus1)
> all_the_actual_variables <- all_the_names[c(2, 11:37)]
> reformulate(all_the_actual_variables,"y")
y ~ stype + pcttest + api00 + api99 + target + growth + sch.wide +
comp.imp + both + awards + meals + ell + yr.rnd + mobility +
acs.k3 + acs.46 + acs.core + pct.resp + not.hsg + hsg + some.col +
col.grad + grad.sch + avg.ed + full + emer + enroll + api.stu
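Instead of subsetting by position, the design columns can also be dropped by name. The sketch below uses only base R and a made-up data frame (the columns id, wt, y, x1 and x2 are assumptions standing in for the asker's data); the svyglm call is left as a comment because it needs the survey package and a real design object.

```r
# Hypothetical data: id and wt are design metadata, y is the response.
df <- data.frame(
  id = 1:6,
  wt = c(1.2, 0.8, 1.0, 1.5, 0.9, 1.1),
  y  = c(0, 1, 0, 1, 1, 0),
  x1 = c(4, -1, 3, 2, 0, 5),
  x2 = c(3, 9, 8, 1, 7, 2)
)

design_cols <- c("id", "wt")

# Everything that is neither design metadata nor the response.
predictors <- setdiff(names(df), c(design_cols, "y"))
f <- reformulate(predictors, response = "y")

# With the survey package loaded and a design built from df, the formula
# could then be passed on, for example:
# des   <- svydesign(ids = ~id, weights = ~wt, data = df)
# binom <- svyglm(f, design = des, family = quasibinomial())
```

Selecting by name keeps working if the column order changes, which positional indices such as c(2, 11:37) do not.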
To fix a certain coefficient in a regression to one, we can use the offset function.
I want to set all coefficients to 1.
Let's take this example:
set.seed(42)
y <- rnorm(100)
df <- data.frame("Uni" = runif(100), "Exp" = rexp(100), "Wei" = rweibull(100, 1))
lm(y ~ offset(Uni) + offset(Exp) + offset(Wei), data = df)
Call:
lm(formula = y ~ offset(Uni) + offset(Exp) + offset(Wei), data = df)
Coefficients:
(Intercept)
-2.712
This code works, but what if I have a huge amount of data, e.g. 800 variables, and I want to do this for all of them? Writing out all their names would not be very efficient. Is there a solution that lets us do this more cleverly?
I think I found one solution, if we do it this way:
set.seed(42)
# Assign everything to one data frame
df <- data.frame("Dep" = rnorm(100), "Uni" = runif(100),
"Exp" = rexp(100), "Wei" = rweibull(100, 1))
varnames <- names(df)[-1]
# Create formula for the sake of model creation
form <- paste0("offset","(",varnames, ")",collapse = "+")
form <- as.formula(paste0(names(df)[1], "~", form))
lm(form, data = df)
1) terms/update The following one-liner will produce the indicated formula.
update(formula(terms(y ~ ., data = df)), ~ offset(.))
## y ~ offset(Uni + Exp + Wei)
2) reformulate/sprintf Another approach is:
reformulate(sprintf("offset(%s)", names(df)), "y")
## y ~ offset(Uni) + offset(Exp) + offset(Wei)
3) rowSums Another approach is to simply sum each row:
lm(y ~ offset(rowSums(df)))
4) lm.fit We could use lm.fit in which case we don't need a formula:
lm.fit(cbind(y^0), y, offset = rowSums(df))
5) mean If you only need the coefficient then it is just:
mean(y - rowSums(df))
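As a quick sanity check, the formula-based fit and the closed-form mean should give the same intercept, since with every slope fixed at 1 only the intercept is estimated. This sketch reuses the question's simulated data, where y is kept separate from df; the names form, fit, intercept_lm and intercept_mean are illustrative.

```r
set.seed(42)
y <- rnorm(100)
df <- data.frame(Uni = runif(100), Exp = rexp(100), Wei = rweibull(100, 1))

# Build y ~ offset(Uni) + offset(Exp) + offset(Wei) from the column names.
form <- reformulate(sprintf("offset(%s)", names(df)), response = "y")
fit <- lm(form, data = df)

# The only free parameter is the intercept, which is the mean of the
# response after subtracting the summed offsets.
intercept_lm <- unname(coef(fit))
intercept_mean <- mean(y - rowSums(df))

all.equal(intercept_lm, intercept_mean)
```

The same pattern scales to hundreds of columns, because nothing in it depends on typing the variable names.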
After looking at many examples and a lot of trial and error, I'm still failing to combine text strings and an expression into a ggplot2 axis label that looks exactly the way I want.
What I am trying to get here is the x-axis label to be:
The ingredients:
parname <- 'FL.Red.Total'
xmean <- 123.34
xsigma <- 2580.23
To change the numbers to 10^n notation, I use this function:
sci_form10 <- function(x) {
paste(gsub("e\\+", " \xB7 10^", scientific_format()(x)))
}
The label would then be built by:
labs( x = bquote(.(gsub('\\.', '\\ ', parname)) ~ " (a.u.) (" ~ mu ~ "=" ~ .(sci_form10(xmean)) ~ ", " ~ sigma ~ " =" ~ .(sci_form10(xsigma)) ~ ")" ))
I'm hoping to replace 10^04 with 10 followed by a 4 in superscript, and to add a line break to the label, as the first image shows.
The test code:
library(ggplot2)
library(scales)
sci_form10 <- function(x) {
paste(gsub("e\\+", " * 10^", scientific_format()(x)))
}
parname <- 'FL.Red.Total'
xmean <- 123.34
xsigma <- 2580.23
ggplot(mtcars, aes(x=mpg,y=cyl)) +
geom_point() +
labs( x = bquote(.(gsub('\\.', '\\ ', parname)) ~ " (a.u.) (" ~ mu ~ "=" ~ .(sci_form10(xmean)) ~ ", " ~ sigma ~ " =" ~ .(sci_form10(xsigma)) ~ ")" ))
gives:
p.s. I also tried
sci_form10 <- function(x) {
paste(gsub(".*e\\+", "10^", scientific_format()(x)))
}
which keeps only the 10^03 part, to see if that would change the outcome of my label, but it did not.
One option is to wrap the label in atop to create the line break:
sci_form10 <- function(x) {
paste(gsub("e\\+", " \u00B7 10^", scientific_format()(x)))
}
x1 <- sci_form10(xmean)
x2 <- sci_form10(xsigma)
lst1 <- strsplit(c(x1,x2), "\\s(?=10)", perl = TRUE)
pre <- sapply(lst1, `[`, 1)
post <- sapply(lst1, `[`, 2)
xmean1 <- parse(text = paste0("'", pre[1], "'"))[[1]]
xsigma1 <- parse(text = paste0("'", pre[2], "'"))[[1]]
post1 <- parse(text = post[1])[[1]]
post2 <- parse(text = post[2])[[1]]
ggplot(mtcars, aes(x=mpg,y=cyl)) +
geom_point() +
labs( x = bquote(atop(.(gsub("\\.", "\\ ",
parname))~"(a.u.)"~phantom(), "(" ~ mu~ " = "~ .(xmean1) ~ .(post1) ~ ", " ~ sigma ~ " = " ~ .(xsigma1) ~ .(post2)~ ")")))
I have something that does most of what you wanted.
changeSciNot <- function(n) {
output <- format(n, digits=3, scientific = TRUE) # Transforms the number into scientific notation even if small
output <- sub("e", "*10^", output) # Replace e with 10^
output <- sub("\\+0?", "", output) # Remove + symbol and leading zeros on exponent, if > 1
output <- sub("-0?", "-", output) # Leaves - symbol but removes leading zeros on exponent, if < 1
output
}
# example data
parname <- "FL.Red.Total"
xmean <- 123.34
xsigma <- 2580.23
label <- bquote(atop(.(gsub("\\.", "\\ ", parname)) ~ "(a.u.)",
mu*"="*.(changeSciNot(xmean))*"," ~ sigma*"="*.(changeSciNot(xsigma))))
ggplot(mtcars, aes(x=mpg,y=cyl)) +
geom_point() +
labs(x = label)
The changeSciNot function came from this thread. I had some problems using \xB7 for the multiplication, so I left *. I also hard coded the number of digits for the format, but you can also make it into an argument. Hopefully, this will get you closer to the exact desired output.
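Another route, sketched here as an assumption rather than as either answerer's method, is to skip the string substitution and build the plotmath call directly from a mantissa and an exponent. In plotmath, %.% draws a centred dot and ^ a superscript, so the exponent needs no further parsing. The helper name sci_expr is hypothetical.

```r
# Split a number into mantissa and exponent and return an unevaluated
# plotmath call such as 1.23 %.% 10^2.
sci_expr <- function(x, digits = 3) {
  expo <- floor(log10(abs(x)))         # power of ten; x must be non-zero
  mant <- signif(x / 10^expo, digits)  # mantissa rounded to `digits`
  bquote(.(mant) %.% 10^.(expo))
}

parname <- "FL.Red.Total"
xmean <- 123.34
xsigma <- 2580.23

# atop() stacks the two lines; == draws an equals sign in plotmath.
label <- bquote(atop(
  .(gsub(".", " ", parname, fixed = TRUE)) ~ "(a.u.)",
  mu == .(sci_expr(xmean)) * "," ~ sigma == .(sci_expr(xsigma))
))
```

The resulting label can be passed to labs(x = label) in the ggplot call above. Because the exponent is computed numerically, there are no leading zeros to strip.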
I'm wondering whether there is a faster way of getting predictions from a regression model for certain values of the covariates without manually writing out the formula. For example, if I wanted a prediction of the dependent variable at the means of the covariates, I could do something like this:
r_3 <- glm(ins ~ retire + age + hstatusg + qhhinc2 + educyear + married + hisp,
family = binomial, data = dat)
meanRetire <- mean(dat$retire)
meanAge <- mean(dat$age)
meanHStatusG <- mean(dat$hstatusg)
meanQhhinc2 <- mean(dat$qhhinc2)
meanEducyear <- mean(dat$educyear)
meanMarried <- mean(dat$married)
meanHisp <- mean(dat$hisp)
ins_predict <- coef(r_3)[1] + coef(r_3)[2] * meanRetire + coef(r_3)[3] * meanAge +
coef(r_3)[4] * meanHStatusG + coef(r_3)[5] * meanQhhinc2 +
coef(r_3)[6] * meanEducyear + coef(r_3)[7] * meanMarried +
coef(r_3)[8] * meanHisp
Oh... There is a predict function:
fit <- glm(ins ~ retire + age + hstatusg + qhhinc2 + educyear + married + hisp,
family = binomial, data = dat)
newdat <- lapply(dat, mean) ## column means
lppred <- predict(fit, newdata = newdat) ## prediction of linear predictor
To get predicted response, use:
predict(fit, newdata = newdat, type = "response")
or (more efficiently from lppred):
binomial()$linkinv(lppred)
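Since dat is not available here, the following sketch runs the same idea on the built-in mtcars data, with am ~ wt + hp standing in for the asker's model (an assumption made purely for illustration). It also checks that predict agrees with the hand-computed linear predictor.

```r
# Stand-in logistic model on a built-in data set.
fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)

# One-row data frame holding the mean of each predictor.
predictors <- c("wt", "hp")
newdat <- as.data.frame(lapply(mtcars[predictors], mean))

# Linear predictor at the means, via predict() and by hand.
lp <- predict(fit, newdata = newdat)
manual <- sum(coef(fit) * c(1, newdat$wt, newdat$hp))

# Predicted probability on the response scale.
prob <- predict(fit, newdata = newdat, type = "response")

all.equal(unname(lp), manual)
all.equal(unname(prob), plogis(manual))
```

Restricting the means to the model's predictors avoids warnings from calling mean on non-numeric columns that the model never uses.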
There is a nice piece of R code for fitting and visualising alternative linear models at www.alastairsanderson.com/R/tutorials/linear-regression-with-a-factor/. How could I generalise this framework to allow for lagged predictors, e.g., by using dyn or dynlm?
Try this:
library(dyn)
library(ggplot2)
forms.ch <- c(
"y ~ x",
"y ~ class / x",
"y ~ class / Lag(x, 0:1)",
"y ~ class / Lag(x, 0:2)"
)
forms <- sapply(forms.ch, as.formula)
Lag <- function(x, k = 1) lag(x, -k)
# L is a list of zoo objects which is fit to each formula
L <- lapply(mydata, zoo, order.by = mydata$x)
models <- lapply(forms, dyn$lm, data = L)
# create zero width zoo object, width0, which is merged with fitted. fitted would
# otherwise be shorter than mydata (since we can't fit points at beginning due to
# lack of lagged points at boundary). Also we convert mydata$x to numeric,
# from integer, to avoid warnings later on.
width0 <- zoo(, as.numeric(mydata$x))
models.sum <- lapply(models, function(x)
data.frame(mydata,
fitted = coredata(merge(fitted(x), width0)),
strip = paste(format(formula(x)), "AIC:", round(AIC(x), 1)),
formula = format(formula(x))
)
)
models.long <- na.omit(do.call(rbind, models.sum))
models.long$class[ models.long$formula == forms.ch[1] ] <- NA # first model has no class
ggplot(models.long, aes(x, y, colour = class)) +
geom_line(aes(y = fitted)) +
geom_point() +
facet_wrap(~ strip)
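To make concrete what a term such as Lag(x, 0:1) contributes, here is a base-R sketch that builds the current and the one-step-lagged predictor by hand with embed() and fits an ordinary lm. It is an illustration on simulated, regularly spaced data without the class factor, not a replacement for dyn; the series and all object names are assumptions.

```r
set.seed(1)
n <- 50
x <- cumsum(rnorm(n))

# Response depends on the current and the previous value of x; the first
# observation has no lagged value, hence the NA.
y <- 2 + 0.5 * x + 0.3 * c(NA, x[-n]) + rnorm(n, sd = 0.1)

# embed(x, 2) gives one row per time t = 2..n:
# column 1 is x[t], column 2 is x[t - 1].
lagged <- embed(x, 2)
d <- data.frame(y = y[-1], x0 = lagged[, 1], x1 = lagged[, 2])

# Equivalent in spirit to y ~ Lag(x, 0:1): one coefficient per lag.
fit <- lm(y ~ x0 + x1, data = d)
coef(fit)
```

The first observation is lost because it has no lagged value, which is the same boundary effect the width0 merge works around in the code above.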