I have been trying to create a series of coplots using a nested for loop, but the loop takes too long to run (the original data set is very big). I have looked at similar questions and they suggest using the sapply function, but I am still unclear about how to convert between the two. I understand I need to create a plotting function to use (see below), but what I don't understand is how the i's and j's of the nested for loop translate into sapply arguments.
I have made some sample data; the nested for loop I have been using and the plotting function I created are below. Could someone walk me through how to convert my nested for loop into sapply arguments? I have been doing all of this in R. Many thanks.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
coplop_fun = function(data, x, y, z, na.rm = TRUE){
coplot(.data[[y]] ~ .data[[x]] | .data[[z]], data = data, rows = 1)
}
I think you might be able to use mapply here rather than sapply. mapply is similar to sapply but lets you pass two (or more) inputs instead of one; it steps through them in parallel, pairing the first elements, then the second, and so on.
y = rnorm(n = 200, mean = 10, sd = 2)
x1 = rnorm(n = 200, mean = 5, sd = 2)
x2 = rnorm(n = 200, mean = 2.5, sd = 2)
x3 = rep(letters[1:4], each = 50)
x4 = rep(LETTERS[1:8], each = 25)
dat = data.frame(y = y, x1 = x1, x2 = x2, x3 = x3, x4 = x4)
for(i in dat[, 2:3]){
for(j in dat[, 4:5]){
coplot(y ~ i | j, rows = 1, data = dat)
}
}
mapply(function(x,j){coplot(dat[["y"]]~x|j,rows =1)}, dat[,2:3],dat[,4:5])
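Because mapply() walks its inputs in parallel, the call above pairs x1 with x3 and x2 with x4, which gives two plots, whereas the nested loop gives four. A minimal sketch of one way to cover every combination is to build the grid of column names first and let mapply() step through its rows. The smaller sample data, the names combos and plotted, and the returned labels are assumptions made for illustration; note also that the apply family is not inherently faster than a for loop, so most of the run time is likely spent drawing.

```r
# Smaller sample data shaped like the question's (assumed for illustration)
set.seed(42)
dat <- data.frame(
  y  = rnorm(40, mean = 10, sd = 2),
  x1 = rnorm(40, mean = 5, sd = 2),
  x2 = rnorm(40, mean = 2.5, sd = 2),
  x3 = factor(rep(letters[1:2], each = 20)),
  x4 = factor(rep(LETTERS[1:4], each = 10))
)

# Every (x, conditioning variable) pair the nested loop would visit
combos <- expand.grid(x = c("x1", "x2"), z = c("x3", "x4"),
                      stringsAsFactors = FALSE)

# mapply() steps through combos$x and combos$z in parallel: one plot per row
plotted <- mapply(
  function(x, z) {
    coplot(as.formula(paste("y ~", x, "|", z)), data = dat, rows = 1)
    paste(x, z, sep = " | ")  # return a label so the result is easy to inspect
  },
  combos$x, combos$z
)
```

Here plotted holds one label per plot drawn, so its length should equal the number of rows in combos.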
We can use a combination of functions expand.grid, formula and apply to accept character column names into coplot.
# combinations of column names for plotting
vars <- expand.grid(y = "y", x = c("x1", "x2"), z = c("x3", "x4"))
# cycle through column name variations, construct formula for each combination
apply(vars, MARGIN = 1,
FUN = function(x) coplot(
formula = formula(paste(x[1], "~", x[2], "|", x[3])),
data = dat, rows = 1
)
)
Here's a tidyverse version of @nya's solution with expand.grid() and apply(). Each row in ds_plot_parameters represents a single plot. The equation variable is the string eventually passed to coplot().
Each equation is passed to purrr::walk(), which then calls coplot()
to produce one graph each. as.formula() converts the string to a formula.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
)
ds_plot_parameters$equation |>
purrr::walk(
\(e) coplot(as.formula(e), rows = 1, data = dat)
)
Gravy:
If you want to pass more input to the graph, then expand ds_plot_parameters to include other things like graph & axis titles.
ds_plot_parameters <-
tidyr::expand_grid(
v = c("x1", "x2"),
w = c("x3", "x4")
) |>
dplyr::mutate(
equation = paste0("y ~ ", v, " | ", w),
label_y = "Outcome (mL)",
label_x = paste0(v, " (log 10)")
)
ds_plot_parameters |>
dplyr::select(
# Make sure this order exactly matches the function signature
equation,
label_x,
label_y,
) |>
purrr::pwalk(
.f = \(equation, label_x, label_y) {
coplot(
formula = as.formula(equation),
xlab = label_x,
ylab = label_y,
rows = 1,
data = dat
)
}
)
ds_plot_parameters
# # A tibble: 4 x 5
# v w equation label_y label_x
# <chr> <chr> <chr> <chr> <chr>
# 1 x1 x3 y ~ x1 | x3 Outcome (mL) x1 (log 10)
# 2 x1 x4 y ~ x1 | x4 Outcome (mL) x1 (log 10)
# 3 x2 x3 y ~ x2 | x3 Outcome (mL) x2 (log 10)
# 4 x2 x4 y ~ x2 | x4 Outcome (mL) x2 (log 10)
I'm trying to run a regression with a constraint that forces all coefficients to be greater than zero. To do this, I am using the nls function. However, I am getting an error:
"Error in nls(formula = y ~ . - 1, data = X, start = low, lower = low, :
parameters without starting value in 'data': ."
I believe everything is correct here. I tried to set a lower and upper bound on all variables, so I am not sure what is wrong.
Attempt 1:
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = y ~ . -1 , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
Attempt 2:
Here I try to set the formula manually but then I get a new error:
"Error in qr(.swts * gr) :
dims [product 5] do not match the length of object [10]"
library(magrittr)
X <- data.frame(
x1 = seq(10),
x2 = seq(10),
x3 = seq(10),
x4 = seq(10),
x5 = seq(10),
y = seq(10)
)
n <- X %>% dplyr::select( -y ) %>% names %>% paste0( collapse = " + " )
f <- "y ~ %s -1" %>% sprintf( n ) %>% as.formula
low <- dplyr::select(X, -y) %>% names %>% lapply( function(e) 0)
up <- dplyr::select(X, -y) %>% names %>% lapply( function(e) Inf)
names(low) <- dplyr::select(X, -y) %>% names -> names(up)
fit1 <- nls(formula = f , data = X,
start = low,
lower = low,
upper = up,
algorithm = "port"
)
How can I fix this? Thanks!
1) There are several problems here:
nls does not use the same formula notation as lm: each coefficient must appear explicitly as a named parameter. This is fixed below.
The example does not have identifiable parameters, i.e. they are not unique (all of the x columns are identical), so the calculation will fail. Below we change the example.
Although starting values of 0 seem to work here, in general numeric optimization with constraints tends to work better if the starting values are in the interior of the feasible region.
Using the above we have
set.seed(123)
X <- data.frame(
x1 = rnorm(10),
x2 = rnorm(10),
x3 = rnorm(10),
x4 = rnorm(10),
x5 = rnorm(10),
y = rnorm(10)
)
fo <- y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
st <- c(b1 = 1, b2 = 1, b3 = 1, b4 = 1, b5 = 1)
nls(fo, X, start = st, lower = numeric(5), algorithm = "port")
giving:
Nonlinear regression model
model: y ~ b1 * x1 + b2 * x2 + b3 * x3 + b4 * x4 + b5 * x5
data: X
b1 b2 b3 b4 b5
0.0000 0.1222 0.0000 0.2338 0.1457
residual sum-of-squares: 6.477
Algorithm "port", convergence message: relative convergence (4)
2) The nnls (non-negative least squares) package can do this directly. We use X defined in (1).
library(nnls)
nnls(as.matrix(X[-6]), X$y)
giving the following
Nonnegative least squares model
x estimates: 0 0.1221646 0 0.2337857 0.1457373
residual sum-of-squares: 6.477
reason terminated: The solution has been computed sucessfully.
This is a partial answer: you can combine it with @G.Grothendieck's answer to answer your question about "what if you have too many variables to type out by hand".
As implied by the comment thread, the model you're trying to set up doesn't include an intercept by default. The easiest way to handle this is probably to add a column of 1s to your data frame (mydata <- data.frame(x0 = 1, mydata)).
## define variable names and parameter names
nx <- ncol(X)-1
vars <- names(X)[1:nx] ## assumes response is *last* column
pars <- gsub("x", "b", vars)
## construct formula
form <- reformulate(response = "y",
sprintf("%s*%s", pars, vars))
lwr <- setNames(rep(0, nx), pars)
upr <- setNames(rep(Inf, nx), pars)
start <- setNames(rep(1, nx), pars)
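The pieces built above can then be handed to nls(). The following is a minimal sketch, assuming the random X from the other answer (set.seed(123)) and using st rather than start for the starting values so that the name does not shadow stats::start; both choices are assumptions for illustration, not part of the original answer.

```r
# Example data as in the other answer (identifiable, unlike the question's)
set.seed(123)
X <- data.frame(x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10),
                x4 = rnorm(10), x5 = rnorm(10), y = rnorm(10))

# Variable and parameter names; assumes the response is the last column
nx   <- ncol(X) - 1
vars <- names(X)[1:nx]
pars <- gsub("x", "b", vars)

# y ~ b1 * x1 + b2 * x2 + ... built programmatically
form <- reformulate(sprintf("%s*%s", pars, vars), response = "y")

lwr <- setNames(rep(0, nx), pars)    # lower bounds: non-negative coefficients
upr <- setNames(rep(Inf, nx), pars)  # no upper bounds
st  <- setNames(rep(1, nx), pars)    # starting values inside the feasible region

# Bounds are only honoured by the "port" algorithm
fit <- nls(form, data = X, start = st, lower = lwr, upper = upr,
           algorithm = "port")
coef(fit)
```

Because the formula is generated from names(X), the same code should carry over to a data frame with many more predictors, provided the response stays in the last column.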
I have this dataframe
x <- data.frame(
matrix(
c(letters[1:3], c("x", "x", "y") ,
sample(c(rep(1,100),0), size = 1),
sample(c(rep(1,100),0), size = 1),
sample(c(rep(1,100),0), size = 1)), ncol = 3)
)
I would like to multiply the values within each group (x and y).
My attempt:
agg <- aggregate(x$X3,
by = list(x$X2),
FUN = *)
I would like to use something like sum or mean, but for multiplication.
+ is to sum as * is to prod (for product).
Your sample data follows the anti-pattern of data.frame(matrix()). A matrix can only have one data type. You mix character and numeric data in the matrix, so the matrix coerces everything to character, and you can't do math on characters. Here's proper sample data and a demonstration that the solution works. Also note that using by = x["X2"] instead of by = list(x$X2) gives a nicer column name in the result.
(x <- data.frame(
X1 = letters[1:3],
X2 = c("x", "x", "y") ,
X3 = 2:4
))
# X1 X2 X3
# 1 a x 2
# 2 b x 3
# 3 c y 4
aggregate(x$X3, by = x["X2"], FUN = prod)
# X2 x
# 1 x 6
# 2 y 4
Either use prod or use Reduce with *. Also convert X3 to numeric and use single brackets as shown to preserve the names. Alternatively, use the aggregate formula method, shown only for prod but applicable to Reduce as well.
xx <- transform(x, X3 = as.numeric(X3))
aggregate(xx["X3"], by = xx["X2"], FUN = prod)
aggregate(xx["X3"], by = xx["X2"], FUN = Reduce, f = `*`) # same
aggregate(X3 ~ X2, xx, FUN = prod)
A better example might be to use mtcars that comes with R:
aggregate(mtcars["mpg"], by = mtcars["cyl"], FUN = prod)
aggregate(mtcars["mpg"], by = mtcars["cyl"], FUN = Reduce, f = `*`) # same
aggregate(mpg ~ cyl, mtcars, FUN = prod)
I have a dataset where I'm trying to loop over a factor variable (location) and build a separate model for each level of that factor. Depending on the location, however, some factor variables have only a single level, which gives me this error:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Called from: `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]])
So depending on the location, I want to drop any single-level factors from the model. I've tried splitting the data into one dataset that doesn't have any single-level factors and another that does, but I don't know how to drop a given factor variable depending on the location.
This will give you the error:
library(data.table)
dt <- data.table(df, key = "location")
lapply(unique(dt$location), function(z) lm(y ~ x1 + x2 + x3 + x4 + x5, data = dt[J(z),]))
I'm not very comfortable with data.table, however, so any non-data.table solution would be really helpful. Thank you.
Some data:
y <- rnorm(n = 100, mean = 50, 5)
x1 <- rnorm(n = 100, mean = 10, sd = 3)
location <- factor(c(rep(1, 20), rep(2, 20), rep(3, 20), rep(4, 20), rep(5, 20)))
x2 <- rnorm(n = 100, mean = 25, sd = 3)
x3 <- factor(sample(c(0, 1), size = 100, replace = TRUE))
x4 <- factor(ifelse(location == 1, 0,
ifelse(location == 2, sample(c(0, 1), size = 20, replace = TRUE),
ifelse(location == 3, 1,
ifelse(location == 4, 0, sample(c(0, 1), size = 20, replace = TRUE))))))
x5 <- factor(ifelse(location == 1, sample(c(0, 1), size = 20, replace = TRUE),
ifelse(location == 2, 1,
ifelse(location == 3, sample(c(0, 1), size = 20, replace = TRUE),
ifelse(location == 4, sample(c(0, 1), size = 20, replace = TRUE), 0)))))
df <- data.frame(y, location, x1, x2, x3, x4, x5)
Your model's formula is conditional on whether or not there are enough levels in each independent variable to be included.
You can create a formula based on these conditions (e.g., using ifelse()) and then feed the formula to the model inside lapply().
Here is a solution:
lapply(unique(df$location), function(z) {
sub_df = dplyr::filter(df, location == z) # subset by location
form_x4 = ifelse(length(unique(sub_df$x4)) > 1, "+ x4", "")
form_x5 = ifelse(length(unique(sub_df$x5)) > 1, "+ x5", "")
form = as.formula(paste("y ~ x1 + x2 + x3", form_x4, form_x5))
return(lm(data = sub_df, formula = form))
})
The form inside the above lapply(...) combines the consistent part of the lm() formula with multiple variables that meet the conditions to be used in the formula. If a variable only has a single level, the ifelse() statement allows you to treat it as if it's not there when putting it in the formula.
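The same idea can be generalised so that the single-level check does not have to be written out per variable. The following is a minimal base-R sketch, assuming a small deterministic data set and a predictors vector that are invented here for illustration: it splits the data by location, keeps only the predictors that take more than one value in that subset, and builds the formula with reformulate().

```r
# Small illustrative data set (assumed, not the question's data)
set.seed(1)
location <- factor(rep(1:3, each = 20))
df <- data.frame(
  y        = rnorm(60, mean = 50, sd = 5),
  location = location,
  x1       = rnorm(60, mean = 10, sd = 3),
  x3       = factor(rep(c(0, 0, 1, 1), times = 15)),
  # x4 is constant within location 1, so it has to be dropped there
  x4       = factor(ifelse(location == 1, 0, rep(0:1, times = 30)))
)

predictors <- c("x1", "x3", "x4")

fits <- lapply(split(df, df$location), function(sub_df) {
  # TRUE for predictors that take more than one value in this subset
  varies <- vapply(predictors,
                   function(v) length(unique(sub_df[[v]])) > 1,
                   logical(1))
  lm(reformulate(predictors[varies], response = "y"), data = sub_df)
})

# Inspect which formula was used for each location
lapply(fits, formula)
```

The list fits is named by location level, so fits[["1"]] is the model for location 1, fitted without x4.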
set.seed(1)
library(data.table)
data=data.table(STUDENT = 1:1000,
OUTCOME = sample(20:90, 1000, replace = TRUE),
X1 = runif(1000),
X2 = runif(1000),
X3 = runif(1000))
data[, X1 := fifelse(X1 > .9, NA_real_, X1)]
data[, X2 := fifelse(X2 > .78 & X2 < .9, NA_real_, X2)]
data[, X3 := fifelse(X3 < .1, NA_real_, X3)]
Say you have data as shown and you wish to impute values for X1, X2, X3 and leave out STUDENT and OUTCOME for the imputation processing.
I can do
library(mice)
dataIMPUTE=mice(data[, c("X1", "X2", "X3")], m = 1)
but how do I combine the imputed values from dataIMPUTE with STUDENT and OUTCOME? I am afraid that I will merge them incorrectly, which is why I am asking for advice.
One possibility is to use the complete data set in the imputation, but change the predictorMatrix so that STUDENT and OUTCOME are not used in the imputation model.
First, you need to run mice with maxit = 0 to extract the predictorMatrix (without calculating the imputation). Then you can set to 0 all columns that shouldn't be included in the imputation model. All your variables are nevertheless still contained in your dataIMPUTE object:
set.seed(1)
library(data.table)
data=data.table(STUDENT = 1:1000,
OUTCOME = sample(20:90, 1000, replace = TRUE),
X1 = runif(1000),
X2 = runif(1000),
X3 = runif(1000))
index_1 <- sample(1:1000, 100)
index_2 <- sample(1:1000, 100)
index_3 <- sample(1:1000, 100)
data[index_1, X1 := NA_real_]
data[index_2, X2 := NA_real_]
data[index_3, X3 := NA_real_]
library(mice)
init <- mice(data, maxit = 0, print = FALSE)
# extract the predictor matrix
pred_mat <- init$predictorMatrix
# remove STUDENT and OUTCOME as predictors
pred_mat[, c("STUDENT", "OUTCOME")] <- 0
# do the imputation
dataIMPUTE = mice(data, pred = pred_mat, m = 1)
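Since STUDENT and OUTCOME stay in the imputed object, the filled-in data can be pulled out in one piece with mice::complete(), so no merge is needed. The following is a minimal sketch on a smaller data frame; the object names imp and completed, the 200-row size and the seed are assumptions for illustration.

```r
library(mice)

# Smaller illustrative data with missing values in X1, X2 and X3
set.seed(1)
data <- data.frame(STUDENT = 1:200,
                   OUTCOME = sample(20:90, 200, replace = TRUE),
                   X1 = runif(200), X2 = runif(200), X3 = runif(200))
data$X1[sample(200, 20)] <- NA
data$X2[sample(200, 20)] <- NA
data$X3[sample(200, 20)] <- NA

# Dry run to obtain the default predictor matrix, then exclude the two columns
init <- mice(data, maxit = 0, printFlag = FALSE)
pred_mat <- init$predictorMatrix
pred_mat[, c("STUDENT", "OUTCOME")] <- 0

imp <- mice(data, predictorMatrix = pred_mat, m = 1,
            printFlag = FALSE, seed = 1)

# First (and only) completed data set: all columns, rows in the original order
completed <- mice::complete(imp, 1)
head(completed)
```

The rows of completed line up with the rows of data, so STUDENT and OUTCOME sit next to the imputed X1, X2 and X3 without any join.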
I would like to loop over various regressions referencing different data subsets, however I'm unable to appropriately call different subsets. For example:
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
x.list <- list(dat$x1,dat$x2,dat$x3)
dat1 <- dat[-9,]
fit <- list()
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat))}
for(i in 1:length(x.list)){ fit[[i]] <- summary(lm(y ~ x.list[[i]], data = dat1))}
Is there a way to call in "dat1" such that it subsets the other variables accordingly? Thanks for any recs.
I'm not sure it makes sense to copy your covariates into a new list like that. Here's a way to loop over columns and to dynamically build formulas:
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
#x.list not used
fit <- list()
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat))}
for(i in c("x1","x2","x3")){ fit[[i]] <- summary(lm(reformulate(i,"y"), data = dat1))}
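Note that both loops above write to the same fit list, so the second loop overwrites the results of the first. A minimal sketch that keeps the fits for both data sets is to loop over a named list of data frames as well; the names datasets, full and drop9 are assumptions for illustration.

```r
set.seed(1)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10))

# Each data set to fit on, named so the results are easy to look up
datasets   <- list(full = dat, drop9 = dat[-9, ])
predictors <- c("x1", "x2", "x3")

# Outer loop over data sets, inner loop over predictors
fits <- lapply(datasets, function(d) {
  out <- lapply(predictors, function(p) {
    lm(reformulate(p, response = "y"), data = d)
  })
  setNames(out, predictors)
})

# Example: summary of y ~ x2 fitted without row 9
summary(fits$drop9$x2)
```

Because each formula refers to columns by name, the subset in drop9 applies to the response and the predictor together.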
How about this?
dat <- data.frame(y = rnorm(10), x1 = rnorm(10), x2 = rnorm(10), x3 = rnorm(10) )
dat1 <- dat[-9,]
mods <- lapply(list(y ~ x1, y ~ x2, y ~ x3), lm, data = dat1)
If you have lots of predictors, create the formulas something like this:
lapply(paste('y ~ ', 'x', 1:10, sep = ''), as.formula)
If your data was in long format, it would be similarly simple to do by using lapply on a split data.frame.
dat <- data.frame(y = rnorm(30), x = rnorm(30), f = rep(1:3, each = 10))
lapply(split(dat, dat$f), function(x) lm(y ~ x, data = x))
Sorry for being late, but have you tried applying the data.table solution, similar to yours, from:
R data.table loop subset by factor and do lm()
I have just applied the linked solution to an altered version of your data, which should illustrate how I understood your question:
library(data.table)
set.seed(1)
df <- data.frame(x1 = letters[1:3],
x2 = sample(c("a","b","c"), 30, replace = TRUE),
x3 = sample(c(20:50), 30, replace = TRUE),
y = sample(c(20:50), 30, replace = TRUE))
dt <- data.table(df,key="x1")
fits <- lapply(unique(dt$x1),
function(z)lm(y~x2+x3, data=dt[J(z),], y=T))
fit <- dt[, lm(y ~ x2 + x3)]
# Using x1 as a "by" variable you get a model per level of x1
coef_tbl <- dt[, as.list(coef(lm(y ~ x2 + x3))), by=x1]
# coefficients
sapply(fits,coef)
anova_tbl = dt[, as.list(anova(lm(y ~ x2 + x3))), by=x1]
row_names = dt[, row.names(anova(lm(y ~ x2 + x3))), by=x1]
anova_tbl[, variable := row_names$V1]
It extends your solution.