Looping over grouped data using the nls function in R

I have a dataset grouped by GaugeID and an nls model that I want to fit to each group, returning one output value per group.
library(tidyverse)
library(stats)
# sample of data (yearly), first column is gauge (grouping variable), year, then two formula inputs PETvP and ETvP
# A tibble: 20 x 4
GaugeID WATERYR PETvP ETvP
<chr> <dbl> <dbl> <dbl>
1 06892000 1981 0.854 0.754
2 06892000 1982 0.798 0.708
3 06892000 1983 1.12 0.856
4 06892000 1984 0.905 0.720
5 06892000 1985 0.721 0.618
6 06892000 1986 0.717 0.625
7 06892000 1987 0.930 0.783
8 06892000 1988 1.57 0.945
9 06892000 1989 1.15 0.739
10 06892000 1990 0.933 0.805
11 08171300 1981 0.854 0.754
12 08171300 1982 0.798 0.708
13 08171300 1983 1.12 0.856
14 08171300 1984 0.905 0.720
15 08171300 1985 0.721 0.618
16 08171300 1986 0.717 0.625
17 08171300 1987 0.930 0.783
18 08171300 1988 1.57 0.945
19 08171300 1989 1.15 0.739
20 08171300 1990 0.933 0.805
# attempted for loop
for (i in unique(yearly$GaugeID)) {
myValue = nls(ETvP[i] ~ I(1 + PETvP[i] - (1 + PETvP[i]^(w))^(1/w)), data = yearly,
start = list(w = 2), trace = TRUE)
}
I get the following error
Error in model.frame.default(formula = ~ETvP + i + PETvP, data = yearly) :
variable lengths differ (found for 'i')
I haven't found much information regarding looping with the nls function. Essentially, I am producing curves and need the value of the curve (w) to output for each gauge.
It works if I assign the formula to just one gauge (i.e., if I subset the data to the first gauge), but not when I try to use it on the entire grouped data frame.
For example, this works
# gaugeA
# A tibble: 10 x 4
GaugeID WATERYR PETvP ETvP
<chr> <dbl> <dbl> <dbl>
1 06892000 1981 0.854 0.754
2 06892000 1982 0.798 0.708
3 06892000 1983 1.12 0.856
4 06892000 1984 0.905 0.720
5 06892000 1985 0.721 0.618
6 06892000 1986 0.717 0.625
7 06892000 1987 0.930 0.783
8 06892000 1988 1.57 0.945
9 06892000 1989 1.15 0.739
10 06892000 1990 0.933 0.805
test = nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)), data = gaugeA,
start = list(w = 2), trace = TRUE)
1.574756 (4.26e+00): par = (2)
0.2649549 (1.46e+00): par = (2.875457)
0.09466832 (3.32e-01): par = (3.59986)
0.08543699 (2.53e-02): par = (3.881397)
0.08538308 (9.49e-05): par = (3.907099)
0.08538308 (1.13e-06): par = (3.907001)
> test
Nonlinear regression model
model: ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w))
data: gaugeA
w
3.907
residual sum-of-squares: 0.08538
Number of iterations to convergence: 5
Achieved convergence tolerance: 1.128e-06
Any ideas on how I can get the subset results for my entire grouped dataframe? It has over 600 different gauges in it. Thank you in advance.

Any of the following will work:
Using summarise:
df %>%
  group_by(GaugeID) %>%
  summarise(result = list(nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
                              data = cur_data(),
                              start = list(w = 2)))) %>%
  pull(result)
[[1]]
Nonlinear regression model
model: ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w))
data: cur_data()
w
3.607
residual sum-of-squares: 0.01694
Number of iterations to convergence: 5
Achieved convergence tolerance: 7.11e-08
[[2]]
Nonlinear regression model
model: ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w))
data: cur_data()
w
1.086
residual sum-of-squares: 0.1532
Number of iterations to convergence: 5
Achieved convergence tolerance: 2.685e-07
Using map:
df %>%
  group_split(GaugeID) %>%
  map(~ nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
            data = .x,
            start = list(w = 2)))
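If you only need the fitted w for each gauge rather than the full model objects, here is a minimal sketch building on the summarise() approach above (it assumes your grouped data frame is called yearly, as in the question, and that every gauge converges from the starting value w = 2):
yearly %>%
  group_by(GaugeID) %>%
  summarise(fit = list(nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
                           data = cur_data(),
                           start = list(w = 2)))) %>%
  # pull the single coefficient "w" out of each fitted model
  mutate(w = map_dbl(fit, ~ coef(.x)[["w"]])) %>%
  select(GaugeID, w)
With 600+ gauges, some fits may fail to converge; wrapping the nls() call in purrr::possibly() is one way to skip those groups instead of aborting.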

I usually prefer purrr and dplyr for looping functions over grouped data.
I can't edit the data, but maybe this works:
library(dplyr)
library(purrr)
yearly %>%
  group_by(GaugeID) %>%
  summarise(test = list(nls(ETvP ~ I(1 + PETvP - (1 + PETvP^(w))^(1/w)),
                            data = cur_data(), start = list(w = 2), trace = TRUE)))

A single model can be formulated eliminating loops. Ensure that GaugeID is a factor, subscript w by GaugeID in the formula and provide a starting value list whose w component is a vector with a starting value for each level of GaugeID.
df$GaugeID <- factor(df$GaugeID)
fo <- ETvP ~ 1 + PETvP - (1 + PETvP^(w[GaugeID]))^(1/w[GaugeID])
st <- list(w = rep(2, nlevels(df$GaugeID)))
fm <- nls(fo, df, start = st)
fm
summary(fm)
data.frame(GaugeID = levels(df$GaugeID), coef(summary(fm)), check.names = FALSE)
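Worth noting (my own observation, not part of the answer above): because each gauge gets its own w and no other parameters are shared, the point estimates from this single joint fit should match the per-gauge fits, while the standard errors differ slightly because the residual variance is pooled across all gauges.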

Related

Run linear regression model on nested dataset after grouping (multiply imputed dataset)

I want to group a nested (multiply imputed) dataset and then apply a linear regression to each dataset. I have tried a number of approaches, including the map option (2) and the for loop (3), but have had no luck at all. I want the model results to look like the results from mod in (1) below. Does anyone know what I could be doing wrong?
# get dependencies
library(mice)
library(tidyverse)
library(broom)
# impute the boys dataset from mice package
boys_imp <- mice(boys)
# 1) I want to run a model like this on my multiply imputed dataset
mod <- boys %>%
  group_by(reg) %>%
  do(tidy(
    lm(data = .,
       formula = wgt ~ bmi),
    conf.int = T))
mod
# A tibble: 12 × 8
# Groups: reg [6]
reg term estimate std.error statistic p.value conf.low conf.high
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 north (Intercept) -81.9 9.84 -8.32 2.48e-12 -101. -62.3
2 north bmi 6.84 0.500 13.7 2.53e-22 5.85 7.84
3 east (Intercept) -75.3 7.62 -9.89 3.21e-18 -90.4 -60.3
4 east bmi 6.29 0.420 15.0 4.53e-32 5.46 7.12
5 west (Intercept) -91.9 6.31 -14.6 2.48e-34 -104. -79.4
6 west bmi 7.17 0.347 20.7 3.49e-54 6.49 7.86
7 south (Intercept) -79.8 6.73 -11.9 1.83e-24 -93.1 -66.5
8 south bmi 6.47 0.373 17.3 1.63e-40 5.73 7.20
9 city (Intercept) -92.0 13.9 -6.61 6.75e- 9 -120. -64.2
10 city bmi 6.95 0.757 9.18 1.39e-13 5.44 8.46
11 NA (Intercept) -88.6 43.8 -2.02 2.92e- 1 -645. 468.
12 NA bmi 6.46 2.89 2.24 2.68e- 1 -30.2 43.1
# 2) the map way --------------------------------------------------------
mod_imp <- boys_imp %>%
  mice::complete("all") %>%
  map(group_by, reg) %>%
  map(lm, formula = wgt ~ bmi) %>%
  pool()
summary(mod_imp)
term estimate std.error statistic df p.value
1 (Intercept) -85.473428 3.5511961 -24.06891 715.1703 0
2 bmi 6.793622 0.1945322 34.92287 693.7835 0
# 3) for loop way-------------------------------------------------------
# nest the mids dataset
boys_imp2 <- boys_imp %>%
  mice::complete("all")
dat1 <- replicate(length(boys_imp2), NULL)  # preallocate same size
# run the for loop
for (i in seq_along(boys_imp2)) {
  dat1[[i]] <- boys_imp2[[i]] %>%
    group_by(reg) %>%
    do(lm(wgt ~ bmi, data = boys_imp2[[i]]))
}
Error in `do()`:
! Results 1, 2, 3, 4, 5, ... must be data frames, not lm.
Run `rlang::last_error()` to see where the error occurred.
I have found a solution to the problem. It involves grouping the data by imputation number (.imp) and the variable of interest (reg), then mapping lm onto the nested datasets, and finishing by unnesting the results:
boys_imp %>%
  mice::complete("long", include = FALSE) %>%
  group_by(.imp, reg) %>%
  nest() %>%
  mutate(lm_model = map(data, ~ lm(bmi ~ phb, data = .))) %>%
  group_by(reg) %>%
  summarise(model = list(tidy(pool(lm_model), conf.int = T))) %>%
  unnest_wider(model) %>%
  unnest(cols = c(term, estimate, std.error,
                  statistic, p.value, conf.low, conf.high))
# A tibble: 30 × 16
reg term estimate std.error statistic p.value conf.low conf.high
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 north (Intercept) 19.3 0.332 57.9 0 18.6 19.9
2 north phb.L 5.10 0.678 7.53 1.81e-10 3.75 6.46
3 north phb.Q 1.25 0.800 1.56 1.24e- 1 -0.357 2.86
4 north phb.C -0.430 0.882 -0.487 6.30e- 1 -2.25 1.39
5 north phb^4 -1.10 0.948 -1.16 2.57e- 1 -3.07 0.862
6 north phb^5 -0.156 1.08 -0.144 8.87e- 1 -2.41 2.10
7 east (Intercept) 18.7 0.244 76.8 0 18.3 19.2
8 east phb.L 4.83 0.509 9.48 4.44e-15 3.82 5.84
9 east phb.Q 1.10 0.692 1.60 1.27e- 1 -0.343 2.55
10 east phb.C -0.518 0.671 -0.772 4.49e- 1 -1.91 0.878
# … with 20 more rows, and 8 more variables: b, df, dfcom, fmi, lambda, m, riv, ubar
# ℹ Use `print(n = ...)` to see more rows

Export "epiR" Output to Tables

Good morning,
I am using the "epiR" package to assess test accuracy.
https://search.r-project.org/CRAN/refmans/epiR/html/epi.tests.html
library(epiR)
library(tidyverse)
## Generate a data set listing test results and true disease status:
dis <- c(rep(1, times = 744), rep(0, times = 842))
tes <- c(rep(1, times = 670), rep(0, times = 74),
         rep(1, times = 202), rep(0, times = 640))
dat.df02 <- data.frame(dis, tes)
tmp.df02 <- dat.df02 %>%
  mutate(dis = factor(dis, levels = c(1,0), labels = c("Dis+","Dis-"))) %>%
  mutate(tes = factor(tes, levels = c(1,0), labels = c("Test+","Test-"))) %>%
  group_by(tes, dis) %>%
  summarise(n = n())
tmp.df02
## View the data in conventional 2 by 2 table format:
pivot_wider(tmp.df02, id_cols = c(tes), names_from = dis, values_from = n)
rval.tes02 <- epi.tests(tmp.df02, method = "exact", digits = 2,
conf.level = 0.95)
summary(rval.tes02)
The resulting object is of class "epi.tests". I would like to export the summary statistics to a table (i.e., gtsummary or flextable).
As summary is a base R function, I am struggling to do this. Can anyone help? Thank you very much.
The epi.tests function has been edited so it writes the results out to a data frame (instead of a list). This will simplify export to gtsummary or flextable. epiR version 2.0.50 to be uploaded to CRAN shortly.
This was not quite as straightforward as I expected.
It appears that summary(), when applied to an object x of class epi.tests, simply prints x$detail. x$detail is a list of data frames with statistic names as row names. That last bit makes things slightly more complicated than they would otherwise have been.
A potential tidyverse solution is
library(tidyverse)
lapply(
  names(rval.tes02$detail),
  function(x) {
    as_tibble(rval.tes02$detail[[x]]) %>%
      add_column(statistic = x, .before = 1)
  }
) %>%
  bind_rows()
# A tibble: 18 × 4
statistic est lower upper
<chr> <dbl> <dbl> <dbl>
1 ap 0.550 0.525 0.574
2 tp 0.469 0.444 0.494
3 se 0.901 0.877 0.921
4 sp 0.760 0.730 0.789
5 diag.ac 0.826 0.806 0.844
6 diag.or 28.7 21.5 38.2
7 nndx 1.51 1.41 1.65
8 youden 0.661 0.607 0.710
9 pv.pos 0.768 0.739 0.796
10 pv.neg 0.896 0.872 0.918
11 lr.pos 3.75 3.32 4.24
12 lr.neg 0.131 0.105 0.163
13 p.rout 0.450 0.426 0.475
14 p.rin 0.550 0.525 0.574
15 p.tpdn 0.240 0.211 0.270
16 p.tndp 0.0995 0.0789 0.123
17 p.dntp 0.232 0.204 0.261
18 p.dptn 0.104 0.0823 0.128
This is a tibble containing the same information as summary(rval.tes02), and you should be able to pass it on to gtsummary or flextable. Unusually, the broom package doesn't have a tidy() verb for epi.tests objects.
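For example, a minimal sketch passing it on to flextable (tes_tbl is simply a name I've made up here for the tibble built above):
library(flextable)
tes_tbl <- lapply(
  names(rval.tes02$detail),
  function(x) {
    as_tibble(rval.tes02$detail[[x]]) %>%
      add_column(statistic = x, .before = 1)
  }
) %>%
  bind_rows()
# flextable() accepts any data frame / tibble directly
flextable(tes_tbl) %>%
  colformat_double(digits = 3)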

lapply instead of for loop for randomised hypothesis testing r

I have a df that looks something like this:
set.seed(42)
ID <- sample(1:30, 100, rep=T)
Trait <- sample(0:1, 100, rep=T)
Year <- sample(1992:1999, 100, rep=T)
df <- cbind(ID, Trait, Year)
df <- as.data.frame(df)
Where ID is an individual organism, trait is a presence/absence of a phenotype and Year is the year an observation was made.
I would like to model if trait is random between individuals, something like this
library(MCMCglmm)
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical")
Now, would like to shuffle the Trait column and run x permutations, to check if my observed mean and CI fall outside of what's expected from random.
I could do this with a for loop, but I'd rather use a tidyverse solution.
I've read that lapply is a better(?) alternative, but I am struggling to find a specific enough walk-through that I can follow.
I'd appreciate any advice offered here.
Cheers!
Jamie
EDIT October 10th: Cleaned up the code and, per the comment below, added the code to give you back a nicely organized tibble/data frame.
### decide how many shuffles you want and name them
### in an orderly fashion for the output
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
                  ~ df %>%
                    mutate(Trait = sample(Trait)) %>%
                    MCMCglmm(fixed = Trait ~ ID,
                             random = ~ Year,
                             data = .,
                             family = "categorical",
                             verbose = FALSE)) %>%
  purrr::map(~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
  dplyr::bind_rows(., .id = 'shuffle')
ddd
#> # A tibble: 20 x 7
#> shuffle model_term post.mean `l-95% CI` `u-95% CI` eff.samp pMCMC
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 shuffle_1 (Intercept) 112. 6.39 233. 103. 0.016
#> 2 shuffle_1 ID -6.31 -13.5 -0.297 112. 0.014
#> 3 shuffle_2 (Intercept) 24.9 -72.5 133. 778. 0.526
#> 4 shuffle_2 ID -0.327 -6.33 5.33 849. 0.858
#> 5 shuffle_3 (Intercept) 4.39 -77.3 87.4 161. 0.876
#> 6 shuffle_3 ID 1.04 -3.84 5.99 121. 0.662
#> 7 shuffle_4 (Intercept) 7.71 -79.0 107. 418. 0.902
#> 8 shuffle_4 ID 0.899 -4.40 6.57 408. 0.694
#> 9 shuffle_5 (Intercept) 30.4 -62.4 144. 732. 0.51
#> 10 shuffle_5 ID -0.644 -6.61 4.94 970. 0.866
#> 11 shuffle_6 (Intercept) -45.5 -148. 42.7 208. 0.302
#> 12 shuffle_6 ID 4.73 -0.211 11.6 89.1 0.058
#> 13 shuffle_7 (Intercept) -16.2 -133. 85.9 108. 0.696
#> 14 shuffle_7 ID 2.47 -2.42 10.3 47.8 0.304
#> 15 shuffle_8 (Intercept) 0.568 0.549 0.581 6.60 0.001
#> 16 shuffle_8 ID -0.0185 -0.0197 -0.0168 2.96 0.001
#> 17 shuffle_9 (Intercept) -6.95 -112. 92.2 452. 0.886
#> 18 shuffle_9 ID 2.07 -3.30 8.95 370. 0.476
#> 19 shuffle_10 (Intercept) 43.8 -57.0 159. 775. 0.396
#> 20 shuffle_10 ID -1.36 -7.44 5.08 901. 0.62
Your original data
set.seed(42)
ID <- sample(1:30, 100, rep=T)
Trait <- sample(0:1, 100, rep=T)
Year <- sample(1992:1999, 100, rep=T)
df <- cbind(ID, Trait, Year)
df <- as.data.frame(df)
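As a follow-up (my own sketch, not part of the original answer): once ddd is built, you could compare the ID estimate from the unshuffled data against the permutation distribution, e.g.
# fit the model on the original (unshuffled) data
obs <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df,
                family = "categorical", verbose = FALSE)
obs_est <- summary(obs)$solutions["ID", "post.mean"]
# permutation distribution of the ID estimate from the shuffles above
perm_est <- ddd %>%
  dplyr::filter(model_term == "ID") %>%
  dplyr::pull(post.mean)
# two-sided permutation p-value (the +1 avoids a p-value of exactly 0)
(sum(abs(perm_est) >= abs(obs_est)) + 1) / (length(perm_est) + 1)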

Many regressions using tidyverse and broom: Same dependent variable, different independent variables

This link shows how to answer my question in the case where we have the same independent variables, but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
But my question is, how can I apply the same approach (e.g., tidyverse and broom) to run many regressions where we have the reverse situation: same dependent variables but different independent variable. In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want, and instead, produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)
Coefficients:
(Intercept) cbind(sex, health)sex
945.049 -47.911
cbind(sex, health)health faculty
2.342 1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)
Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formulas, you can map them to lm() and then to tidy().
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
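As an aside (my own suggestion, not part of the answer above), base R's reformulate() can build the same formulas without bquote():
# build health ~ <var> + faculty for each candidate predictor
vars <- c("sex", "income", "happiness")
formulae <- lapply(vars, function(v) reformulate(c(v, "faculty"), response = "health"))
formulae
# [[1]]
# health ~ sex + faculty
# ...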
You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
ivs = combn(names(ds)[names(ds)!="income"], 3, simplify=FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds)!="income"], .x, simplify=FALSE)) %>%
# flatten()
names(ivs) = map(ivs, ~paste(.x, collapse="-"))
models = map(ivs,
~lm(as.formula(paste("income ~", paste(.x, collapse="+"))), data=ds))
map_df(models, broom::tidy, .id="model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1

Multiple LM model returning the same coefficients

Hello Stack Community,
I am trying to model wage growth across US states and territories (including DC, VI, and PR) using linear models to forecast into the future. I want to create a separate model for each state/territory; however, when I look at the coefficients of my models, they are the same for every state.
I have used a combination of plyr, dplyr, and broom thus far to create and sort my data frame (named stuben_dat) for this project.
#Wage Growth
state_data = stuben_dat %>%
  group_by(st) %>%
  do(state_wg = lm(wage_growth ~ us_wage_growth + lag_wage_growth + dum1 + dum2 + dum3,
                   data = stuben_dat, subset = yr >= (current_year - 5)))
#The dummy variables adjust for seasonality (q1 vs q2 vs q3 vs q4)
#The current_year = whatever year I last updated the program
#The current_year-5 value lets me change the look back period
#This look back period can be used to exclude recessions or outliers
Here is just a snapshot of my output, and as you can see, the beta coefficients and regression statistics are exactly the same for each state (just AK and AL are shown here). However, I want to build a different model for each state.
# A tibble: 318 x 6
# Groups: st [53]
st term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AK (Intercept) -1.75 0.294 -5.97 3.28e- 9
2 AK us_wage_growth 996. 23.6 42.2 1.82e-228
3 AK lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
4 AK dum1 -0.245 0.304 -0.806 4.21e- 1
5 AK dum2 -0.321 0.304 -1.06 2.90e- 1
6 AK dum3 0.0947 0.303 0.312 7.55e- 1
7 AL (Intercept) -1.75 0.294 -5.97 3.28e- 9
8 AL us_wage_growth 996. 23.6 42.2 1.82e-228
9 AL lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
10 AL dum1 -0.245 0.304 -0.806 4.21e- 1
# ... with 308 more rows
That is because you are passing the full, ungrouped data in your do() call. Try this instead:
state_data = stuben_dat %>%
  group_by(st) %>%
  do(state_wg = lm(wage_growth ~ us_wage_growth + lag_wage_growth +
                     dum1 + dum2 + dum3,
                   data = ., subset = (yr >= (current_year - 5))))
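If you prefer to avoid do(), which is superseded in current dplyr, here is a sketch of the same per-state fit using nest() and broom (my own variant, assuming the same stuben_dat columns and current_year value as above):
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
state_data <- stuben_dat %>%
  # filter() replaces the subset= argument used above
  filter(yr >= (current_year - 5)) %>%
  group_by(st) %>%
  nest() %>%
  mutate(state_wg = map(data, ~ lm(wage_growth ~ us_wage_growth + lag_wage_growth +
                                     dum1 + dum2 + dum3, data = .x)),
         tidied = map(state_wg, tidy)) %>%
  unnest(tidied)
This returns one row per coefficient per state, with each state's model fit only on that state's rows.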
