Perform nonlinear regression with nlsLM within a function in R

I want to add a modification factor to an existing equation to fit data. The original equation is defined through a function because the variable N_l is a vector of numbers and the function is selecting the largest outcome of the equation by going through all possible values in the vector N_l. The original function is defined as:
library(utils)
R <- function(x){
N_b <- x[1]
N_l <- x[2]
A <- x[3]
x.sqr <- x[4]
S <- x[10]
e <- x[grepl("e_\\d",names(x))]
f <- sapply(seq(N_l),function(k) max(Multi.Presence$m[k] * ((k/N_b) +
(A * combn(e,k,sum) / x.sqr))))
c(val = max(f), pos = which.max(f))
}
DATA.GIRDER1 <- cbind(DATA.GIRDER1, vars = t(apply(DATA.GIRDER1, 1, R)))
colnames(DATA.GIRDER1)[12:13] <- c("Proposed.Girder1","Lanes")
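The equation defined in the function is:
R = \max_{1 \le k \le N_l} \max_{c} \; m_k \left( \frac{k}{N_b} + \frac{A \sum_{i \in c} e_i}{\text{x.sqr}} \right)
where c ranges over all k-element subsets of the e values and m_k is Multi.Presence$m[k].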
The first 5 rows of the dataframe DATA.GIRDER1 and dataframe Multi.Presence are provided:
> dput(DATA.GIRDER1[(1:5),])
structure(list(N_b = c(5, 5, 5, 5, 5), N_l = c(4, 4, 4, 4, 4),
A = c(-12, -12, -12, -12, -12), x.sqr = c(1440, 1440,
1440, 1440, 1440), e_1 = c(21.8, 21.8, 21.8, 21.8, 21.8),
e_2 = c(9.8, 9.8, 9.8, 9.8, 9.8), e_3 = c(-2.2, -2.2, -2.2,
-2.2, -2.2), e_4 = c(-14.2, -14.2, -14.2, -14.2, -14.2),
e_5 = c(0, 0, 0, 0, 0), S = c(12, 12, 12, 12, 12),
R = c(0.59189685884369, 0.583646426252063,
0.556293941275237, 0.576160481501275, 0.597435112708129)),
row.names = c(NA, 5L), class = "data.frame")
> dput(Multi.Presence)
structure(list(N_l = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), m = c(1.2,
1, 0.85, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65)), row.names = c(NA,
-10L), class = "data.frame")
The theoretical data to fit the equation to is CSi.Girder1. At the moment, the way the function is set up, it calculates the maximum R for each row of dataframe DATA.GIRDER1.
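I want to add a regression term based on variable S in dataframe DATA.GIRDER1 to the second part of the equation, and find the parameters a and b that best fit the data in CSi.Girder1. The desired output would implement the equation below:
R = \max_{1 \le k \le N_l} \max_{c} \; m_k \left( \frac{k}{N_b} + \frac{A \sum_{i \in c} e_i}{\text{x.sqr}} \, b S^{a} \right)
where c ranges over all k-element subsets of the e values and m_k is Multi.Presence$m[k].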
To use nlsLM I need to define a function for the equation such as:
library(minpack.lm)
Prposed.Girder1 <- function(N_b, N_l,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a,b) {
R <- function(x){
N_b <- x[1]
N_l <- x[2]
A <- x[3]
x.sqr <- x[4]
e <- x[grepl("e_\\d",names(x))]
f <- sapply(seq(N_l),function(k) max(Multi.Presence$m[k] * ((k/N_b) +
(A * combn(e,k,sum) / x.sqr) * (b*S^a))))
c(val = max(f), pos = which.max(f))
}
DATA.GIRDER1 <- cbind(DATA.GIRDER1, vars = t(apply(DATA.GIRDER1, 1, R)))
colnames(DATA.GIRDER1)[12:13] <- c("Proposed.Girder1","Lanes")
return(R)
}
Girder1_nlsLM <- nlsLM(R ~ Prposed.Girder1(N_b, N_l,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a,b),
data = DATA.GIRDER1,
start = c(a = 0.01, b = 0.01))
summary(Girder1_nlsLM)
But this is not successful and I get the error:
Error in model.frame.default(formula = ~R + N_b + N_l + A + x.sqr + e_1 + :
object is not a matrix
How can I add this modification factor in terms of variable S and solve for the parameters a and b?
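One possible direction, sketched under some assumptions: the function on the right-hand side of the nlsLM formula has to return one fitted value per row, so the row-wise maximum can be moved into a helper and applied with mapply. The names row_value and Proposed.Girder1 below are hypothetical, and the single test row is copied from the dput output above. Setting a = 0 and b = 1 makes the factor b * S^a equal to 1, which recovers the original equation and gives a quick sanity check.

```r
# Multi-presence factors, copied from the question.
Multi.Presence <- data.frame(N_l = 1:10,
                             m = c(1.2, 1, 0.85, rep(0.65, 7)))

# Hypothetical helper: the modified equation for ONE row, returning one number.
row_value <- function(N_b, N_l, A, x.sqr, e, S, a, b) {
  per_k <- sapply(seq_len(N_l), function(k) {
    max(Multi.Presence$m[k] *
          (k / N_b + (A * combn(e, k, sum) / x.sqr) * (b * S^a)))
  })
  max(per_k)
}

# Hypothetical vectorised wrapper: one fitted value per row of the data.
Proposed.Girder1 <- function(N_b, N_l, A, x.sqr,
                             e_1, e_2, e_3, e_4, e_5, S, a, b) {
  mapply(function(N_b, N_l, A, x.sqr, e_1, e_2, e_3, e_4, e_5, S) {
    row_value(N_b, N_l, A, x.sqr, c(e_1, e_2, e_3, e_4, e_5), S, a, b)
  }, N_b, N_l, A, x.sqr, e_1, e_2, e_3, e_4, e_5, S)
}

# One row taken from the dput output in the question.
d <- data.frame(N_b = 5, N_l = 4, A = -12, x.sqr = 1440,
                e_1 = 21.8, e_2 = 9.8, e_3 = -2.2, e_4 = -14.2, e_5 = 0,
                S = 12)

# a = 0, b = 1 switches the modification factor off (b * S^a == 1).
fitted0 <- with(d, Proposed.Girder1(N_b, N_l, A, x.sqr,
                                    e_1, e_2, e_3, e_4, e_5, S,
                                    a = 0, b = 1))
```

With a function of this shape, a call of the form nlsLM(R ~ Proposed.Girder1(N_b, N_l, A, x.sqr, e_1, e_2, e_3, e_4, e_5, S, a, b), data = DATA.GIRDER1, start = c(a = 0.01, b = 0.01)) would have a numeric vector to compare against R; I have not run it on the full data. Note also that a and b can only be told apart if S varies across rows: in the five rows shown, S is always 12, so b * S^a is a single constant there.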

Related

Rolling average of values that satisfy multiple conditions in R

This is my first question on Stackoverflow, so please bear with me if I make any mistakes or omit necessary information.
I have a dataset consisting of a time series where I need to find the 5-day rolling average of a binary variable for each specific hour of the day. An example of my data can be created using:
library(dplyr)
library(zoo)
set.seed(69)
df <- data.frame(Hour = rep(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24), times = 10),
Reg = rep(round(runif(24*10, 0, 1))),
HumidityLevel = rep(runif(24*10, 0, 100)))
df_ranges <- data.frame(LowerRange = rep(cbind(rollapply(df$HumidityLevel, 24, min, by = 24)), each = 24)
,UpperRange = rep(cbind(rollapply(df$HumidityLevel, 24, max, by = 24)), each = 24))
df <- cbind(df, df_ranges)
I have computed the simple rolling average using the following code:
df <- df %>%
group_by(Hour) %>%
mutate(AvgReg = lag(rollapplyr(Reg, 5, mean, na.rm = T, partial = T), n = 1))
What I need to do is compute the rolling average of Reg using previous rows where HumidityLevel lies within the range for that specific day. The lower and upper boundary of the range is determined by two columns (LowerRange, UpperRange). The boundary values are dependent on the lowest and highest HumidityLevel-values for the day.
For instance, a day may have levels between 20 and 54. The rolling average for hour 1 of that specific day should then be computed by using previous Hour 1 observations with a HumidityLevel value above or equal to 20 and below or equal to 54.
I hope that my question makes sense.
This is my desired output:
desired_output <- data.frame(RowNum = c(1:10),
Hour = rep(1, times = 10),
Reg = c(1,0,0,1,0,1,0,0,0,0),
HumidityLevel = c(28.36, 65.02, 1.12, 49.61, 24.50, 98.16, 77.33, 97.03, 47.03, 85.71),
LowerBoundary = c(5.67, 7.50, 1.12, 19.32, 0.01, 6.94, 7.48, 0.71, 2.85, 1.59),
UpperBoundary = c(93.60, 89.37, 97.25, 99.63, 91.92, 98.16, 98.48, 99.98, 99.70, 98.86),
AvgReg = c("NA", 1, 0.5, 0.5, 0.5, 0.5, 0.6, 0.4, 0.4, 0.2))
Using data.table, you can filter with between and compute the average with shift + frollmean:
library(data.table)
setDT(df)[
between(HumidityLevel, LowerRange, UpperRange),
new_col := shift(
frollmean(Reg, c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))), adaptive = TRUE)
),
by = Hour
]
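For checking results, here is a base R sketch of the rule as described in the question: for each row, average Reg over the last five earlier rows whose HumidityLevel falls inside the current row's range. The helper name cond_roll_mean is hypothetical, and the input vectors are copied from the Hour 1 rows of the asker's desired_output. On those rows it reproduces the AvgReg column.

```r
# Hypothetical helper: for row i, average `reg` over the last `width` earlier
# rows whose humidity lies within [lower[i], upper[i]].
cond_roll_mean <- function(reg, humidity, lower, upper, width = 5) {
  vapply(seq_along(reg), function(i) {
    prev <- seq_len(i - 1)                      # indices of earlier rows
    keep <- prev[humidity[prev] >= lower[i] & humidity[prev] <= upper[i]]
    if (length(keep) == 0) return(NA_real_)     # nothing to average yet
    mean(reg[tail(keep, width)])
  }, numeric(1))
}

# Hour 1 rows copied from the desired output in the question.
reg      <- c(1, 0, 0, 1, 0, 1, 0, 0, 0, 0)
humidity <- c(28.36, 65.02, 1.12, 49.61, 24.50, 98.16, 77.33, 97.03, 47.03, 85.71)
lower    <- c(5.67, 7.50, 1.12, 19.32, 0.01, 6.94, 7.48, 0.71, 2.85, 1.59)
upper    <- c(93.60, 89.37, 97.25, 99.63, 91.92, 98.16, 98.48, 99.98, 99.70, 98.86)

avg_reg <- cond_roll_mean(reg, humidity, lower, upper)
```

On the full data this would be applied per Hour, for example inside group_by(Hour) %>% mutate(AvgReg = cond_roll_mean(Reg, HumidityLevel, LowerRange, UpperRange)).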

Pooled average marginal effects from survey-weighted and multiple-imputed data

I am working with survey data and their associated weights, in addition to missing data that I imputed using mice(). The model I'm eventually running contains complex interactions between variables for which I want the average marginal effect.
This task seems trivial in Stata, but I'd rather stay in R since that's what I know best. It seems easy to retrieve AMEs for each separate imputed dataset and average the estimates. However, I need to use pool() (from mice) to make sure I'm getting the correct standard errors.
Here is a reproducible example:
library(tidyverse)
library(survey)
library(mice)
library(margins)
df <- tibble(y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12), region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9))
Using margins() on a simple (non-multiple) svyglm works without a hitch. Running svyglm on each imputation using with() and pooling the results also works well.
m <- with(surv_obj, svyglm(y ~ x1 * x2))
pool(m)
However, wrapping margins() in with() returns an error: "Error in .svycheck(design) : argument "design" is missing, with no default"
with(surv_obj, margins(svyglm(y ~ x1 * x2), design = surv_obj))
If I specify the design in the svyglm call, I get "Error in UseMethod("svyglm", design) : no applicable method for 'svyglm' applied to an object of class "svyimputationList""
with(surv_obj, margins(svyglm(y ~ x1 * x2, design = surv_obj), design = surv_obj))
If I drop the survey layer, and simply try to run the margins on each imputed set and then pool, I get a warning: "Warning in get.dfcom(object, dfcom) : Infinite sample size assumed.".
m1 <- with(imputed_df, margins(lm(y ~ x1 * x2)))
pool(m1)
This worries me given that pool() may use sample size in its calculations.
Does anyone know of a way to either (a) use with(), margins() and pool() to retrieve the pooled average marginal effects, or (b) determine which elements of margins() I should pass to pool() (or pool.scalar()) to achieve the desired result?
Update following Vincent's comment
I wanted to update this post following Vincent's comment and the related marginaleffects package, which ended up fixing my issue. Hopefully, this will be helpful to others stuck on similar problems.
I implemented the code in the vignette linked in Vincent's comment, adding a few steps that allow for survey weighting and modeling. It's worth noting that svydesign() will drop any observations missing on clustering/weighting variables, so marginaleffects() can't predict values back onto the original "dat" data and will throw an error. Pooling my actual data still gives the "infinite sample size assumed" warning, which (as noted) should be fine, but I'm still looking into fixes.
library(tidyverse)
library(survey)
library(mice)
library(marginaleffects)
fit_reg <- function(dat) {
svy <- svydesign(ids = ~ 1, cluster = ~ region, weight = ~weight, data = dat)
mod <- svyglm(y ~ x1 + x2*factor(x3), design = svy)
out <- marginaleffects(mod, newdata = dat)
class(out) <- c("custom", class(out))
return(out)
}
tidy.custom <- function(x, ...) {
out <- marginaleffects:::tidy.marginaleffects(x, ...)
out$term <- paste(out$term, out$contrast)
return(out)
}
df <- tibble(y = c(0, 5, 0, 4, 0, 1, 2, 3, 1, 12), region = c(1, 1, 1, 1, 1, 3, 3, 3, 3, 3),
weight = c(7213, 2142, 1331, 4342, 9843, 1231, 1235, 2131, 7548, 2348),
x1 = c(1.14, 2.42, -0.34, 0.12, -0.9, -1.2, 0.67, 1.24, 0.25, -0.3),
x2 = c(12, NA, 10, NA, NA, 12, 11, 8, 9, 9),
x3 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))
imputed_df <- mice(df, m = 2, seed = 123)
dat_mice <- complete(imputed_df, "all")
mod_imputation <- lapply(dat_mice, fit_reg)
mod_imputation <- pool(mod_imputation)
summary(mod_imputation)
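For intuition about what the pooling step does to a single quantity such as an AME, Rubin's rules can be written out by hand. This is only a sketch with made-up numbers, not mice's implementation: rubin_pool is a hypothetical helper, and it leaves out the degrees-of-freedom adjustment, which, as far as I understand, is the part of pool() where the sample size enters.

```r
# Hypothetical helper: Rubin's rules for one scalar estimated on m imputations.
rubin_pool <- function(est, se) {
  m     <- length(est)
  qbar  <- mean(est)               # pooled point estimate
  ubar  <- mean(se^2)              # average within-imputation variance
  b     <- var(est)                # between-imputation variance
  total <- ubar + (1 + 1 / m) * b  # total variance
  c(estimate = qbar, se = sqrt(total))
}

# Made-up AMEs and standard errors from three imputed datasets.
pooled <- rubin_pool(est = c(0.42, 0.50, 0.46),
                     se  = c(0.10, 0.12, 0.11))
```

The pooled standard error is larger than the average of the per-imputation standard errors because the spread of the estimates across imputations is added on top.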

Use combn within a function for nonlinear regression with nlsLM

Given is a few rows of a dataframe DATA:
> dput(DATA[c(1,7,20,25,26,53,89),])
structure(list(Lanes = c(3, 3, 3, 3, 3, 3, 3), N_b = c(5, 5,
5, 5, 5, 5, 5), A = c(-12, -12, -15, -9, -9, -15, -9), x.sqr =
c(1440, 1440, 2250, 810, 810, 2250, 810), e_1 = c(21.8, 21.8,
29, 14.6, 14.6, 29, 14.6), e_2 = c(9.8, 9.8, 17, 2.6, 2.6, 17,
2.6), e_3 = c(-2.2, -2.2, 5, -9.4, -9.4, 5, -9.4), e_4 =
c(-14.2, -14.2, -7, 0, 0, -7, 0), e_5 = c(0, 0, -19, 0, 0, -19, 0),
S = c(12, 12, 15, 9, 9, 15, 9), CSi = c(0.59189685884369,
0.574916237257971, 0.644253184434141, 0.474070747691647,
0.492033722080107, 0.644904371480046, 0.49900365977452),
m = c(0.85, 0.85, 0.85, 0.85, 0.85, 0.85, 0.85)), row.names = c(1L,
7L, 20L, 25L, 26L, 53L, 89L), class = "data.frame")
I write the function below to use for nonlinear regression with nlsLM:
library(minpack.lm)
Prposed <- function(N_b,Lanes,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a) {
e <- data.frame(e_1,e_2,e_3,e_4,e_5)
CSi <- m * ((Lanes/N_b) + (A * combn(e,Lanes,sum) / x.sqr) * (b*S^a))
return(CSi)
}
nlsLM <- nlsLM(CSi ~ Prposed(N_b,Lanes,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a,b),
data = DATA,
start = c(a = 0.01, b = 0.01))
summary(nlsLM)
I keep getting an error, and it comes from how I am passing the columns e_1, e_2, etc. to the combn function.
UPDATE
I found another question, "Error when running nlsLM but works for nls", which uses a for loop in the original function, and that seems to work fine with the nls2 function from library(nls2). I was wondering if I could get rid of the combn term altogether by switching to a for loop instead.
This is not an actual answer since it generates a new error after fixing the combn error but this might give you some direction.
I think you are trying to run nlsLM function for each row in DATA. You need to pass each row separately in Prposed function. Also note that a and b are required in the function to perform calculation so they need to be passed as an argument of the function and I think passing them using start in nlsLM would not work.
So change your function to :
library(minpack.lm)
Prposed <- function(N_b,Lanes,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a, b) {
e <- data.frame(e_1,e_2,e_3,e_4,e_5)
CSi <- m * ((Lanes/N_b) + (A * combn(e,Lanes,sum) / x.sqr) * (b*S^a))
return(CSi)
}
Now let's run this for first row of DATA :
x <- DATA[1, ]
Prposed(x[[2]], x[[1]], x[[12]], x[[3]], x[[4]], x[[5]], x[[6]], x[[7]], x[[8]],
x[[9]],x[[10]],a = 0.01, b = 0.01)
#[1] 0.5078651 0.5087365 0.5077053 0.5096079 0.5085767 0.5094481 0.5104793
# 0.5094481 0.5103195 0.5111909
I don't know the theory so I don't know if these numbers make sense/are correct. However, when you plug this in nlsLM function it gives an error.
nlsLM(CSi~Prposed(x[[2]],x[[1]],x[[12]], x[[3]],x[[4]],x[[5]],x[[6]],x[[7]],
x[[8]], x[[9]],x[[10]],a = 0.01, b = 0.01),data = DATA)
Error in getInitial.default(func, data, mCall = as.list(match.call(func, :
no 'getInitial' method found for "function" objects
Is this because nlsLM expects a formula object but what we are passing to it is values? I am not sure.
Once you get the above step working you can plug this in an apply and run it as :
apply(DATA, 1, function(x) {
nlsLM(CSi~Prposed(x[[2]],x[[1]],x[[12]], x[[3]],x[[4]],x[[5]],x[[6]],x[[7]],
x[[8]], x[[9]],x[[10]],a = 0.01, b = 0.01), data = DATA)
})
It works and generates numbers without nlsLM function :
apply(DATA, 1, function(x) {
Prposed(x[[2]],x[[1]],x[[12]], x[[3]],x[[4]],x[[5]],x[[6]],x[[7]],x[[8]],
x[[9]],x[[10]],a = 0.01, b = 0.01)
})
# 1 7 20 25 26 53 89
# [1,] 0.5078651 0.5078651 0.5070307 0.5092470 0.5092470 0.5070307 0.5092470
# [2,] 0.5087365 0.5087365 0.5077293 0.5083395 0.5083395 0.5077293 0.5083395
# [3,] 0.5077053 0.5077053 0.5084280 0.5083395 0.5083395 0.5084280 0.5083395
# [4,] 0.5096079 0.5096079 0.5084280 0.5094980 0.5094980 0.5084280 0.5094980
# [5,] 0.5085767 0.5085767 0.5091267 0.5094980 0.5094980 0.5091267 0.5094980
# [6,] 0.5094481 0.5094481 0.5098253 0.5085905 0.5085905 0.5098253 0.5085905
# [7,] 0.5104793 0.5104793 0.5091267 0.5106565 0.5106565 0.5091267 0.5106565
# [8,] 0.5094481 0.5094481 0.5098253 0.5106565 0.5106565 0.5098253 0.5106565
# [9,] 0.5103195 0.5103195 0.5105240 0.5097490 0.5097490 0.5105240 0.5097490
#[10,] 0.5111909 0.5111909 0.5112227 0.5109075 0.5109075 0.5112227 0.5109075
I had to define the by-row operation within the original function:
Proposed <- function(N_b,Lanes,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a,b) {
e <- data.frame(e_1,e_2,e_3,e_4,e_5)
CSi <- m * ((Lanes/N_b) + (max(A * combn(seq_along(e), Lanes, FUN = function(i) rowSums(e[i]))) / x.sqr) * (b*S^a))
return(CSi)
}
nlsLM <- nlsLM(CSi ~ Proposed(N_b,Lanes,m,A,x.sqr,e_1,e_2,e_3,e_4,e_5,S,a,b),
data = DATA,
start = c(a = 0.01, b = 0.01))
summary(nlsLM)
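To see what the combn call over column indices produces, here is a small sketch using two rows (rows 1 and 20) copied from the dput output above. It also shows one thing worth checking: max() over the whole matrix gives a single number shared by all rows, whereas apply(..., 1, max) gives one maximum per row. Which of the two is intended is an assumption on my part.

```r
# Rows 1 and 20 of DATA, copied from the question.
e <- data.frame(e_1 = c(21.8, 29), e_2 = c(9.8, 17), e_3 = c(-2.2, 5),
                e_4 = c(-14.2, -7), e_5 = c(0, -19))
A     <- c(-12, -15)
Lanes <- 3

# One row per data row, one column per combination of `Lanes` columns of e.
sums <- combn(seq_along(e), Lanes, FUN = function(i) rowSums(e[i]))

# A has one entry per row, so A * sums scales row r of `sums` by A[r].
global_max  <- max(A * sums)            # a single number for all rows
per_row_max <- apply(A * sums, 1, max)  # one maximum per data row
```

If the per-row maximum is what the equation calls for, per_row_max (a vector as long as the data) can be used in place of the scalar max inside the model function.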

Is there a function in R to split vector by percentage groups?

For example I have a column:
x <- c(-0.5, 1.1, 6.0, 4.5, 0.1, -0.2)
I want to add a new column where each value is assigned to a 3-percent-wide group.
For example:
if -3 <= x < 0 => group -3
if 3 <= x < 6 => group 6
So I will have a new column:
c(-3, 3, 9, 6, 3, -3)
You can use findInterval or cut for this
x <- c(-0.5, 1.1, 6.0, 4.5, 0.1, -0.2)
brks <- seq(-3, 9, 3)
lbls <- c(-3,3,6,9)
lbls[findInterval(x, brks)]
#[1] -3 3 9 6 3 -3
Or, as mentioned by @StupidWolf, using cut:
cut(x, breaks=brks,right=FALSE, labels=lbls)
Base R solution:
# Define the value using the ranges:
num_frame$perc_group <- ifelse(num_frame$num_vec < 0 & num_frame$num_vec >= -3, -3,
ifelse(num_frame$num_vec == 0, 0,
ifelse(num_frame$num_vec > 0 & num_frame$num_vec <= 3, 3, 6)))
Data:
num_frame <- structure(list(num_vec = c(-0.5, 1.1, 6, 4.5, 0.1, -0.2)),
class = "data.frame",
row.names = c(NA, -6L))
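If the range of x is not known in advance, the same grouping can also be computed arithmetically instead of with fixed breaks. This sketch assumes the labelling rule inferred from the example output: negative intervals are labelled by their lower bound and non-negative intervals by their upper bound. The name width is hypothetical.

```r
x <- c(-0.5, 1.1, 6.0, 4.5, 0.1, -0.2)
width <- 3  # hypothetical name for the group width

# floor(x / width) is the index of the interval [k * width, (k + 1) * width).
# Negative intervals take their lower bound, the others their upper bound.
group <- ifelse(x < 0,
                width * floor(x / width),
                width * (floor(x / width) + 1))
```

For the example vector this gives the same groups as the findInterval version above.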

How to adapt the size of multiple plots?

How can I adapt the size of the following plots with regard to their length of the x-axis?
The width of the plots should refer to the length of their respective section of the x-axis. The height should be the same for all plots.
The function you want is the base graphics function layout; see help("layout").
First I will make up a dataset, since you have not posted one. I will not draw the regression lines, just the points.
Data creation code.
fun <- function(X, A) {
apply(X, 1, function(.x){
xx <- seq(.x[1], .x[2], length.out = 100)
y <- A[1]*xx + A[2] + rnorm(100, 0, 25)
list(xx, y)
})}
Coef <- matrix(c(0.24, 0.54,
0.75, 0.54,
0.33, 2.17,
0.29, 3.3,
0.29, 4.41), byrow = TRUE, ncol = 2)
X <- matrix(c(0.1, 0.49,
0.5, 2.49,
2.5, 3.9,
4.0, 5.9,
6.0, 12.0), byrow = TRUE, ncol = 2)
set.seed(1234)
res <- fun(X, Coef)
The problem.
Define a layout matrix with the plots in sequence from first to fifth, with the widths given by the X ranges.
layout_mat <- matrix(c(1, 2, 3, 4, 5), 1, 5, byrow = TRUE)
w <- apply(X, 1, diff)
l <- layout(layout_mat, widths = w)
layout.show(l)
Now make some room for the axis annotation, saving the default graphics parameters, and plot the 5 graphs.
om <- par(mar = c(3, 0.1, 0.1, 0.1),
oma = c(3, 2, 0.1, 0.1))
for(i in 1:5) plot(res[[i]][[1]], res[[i]][[2]])
par(om)
