What is this syntax, and how do I use it?

I came across this syntax from a previous question on Stack Overflow, and I am unfamiliar with it.
However, it seems to work pretty well, and I've been able to work out how to use it, but that doesn't mean I understand it.
Is this base R, or a library?
cor.test( ~ hp + qsec, mtcars)
I am referring to the use of ~ and the subsequent use of + in the call, and how they allow the specification of columns in a data frame.

The help page for cor.test lists one form of the function as
        cor.test(formula, data, subset, na.action, ...)
and in the description of the arguments it says:
        formula: a formula of the form ~ u + v
~ hp + qsec is a formula, so you can get a lot of information by looking at the help page help(formula). However, that page emphasizes formulas of the form a ~ b, which can be interpreted as something like "a as a function of b". This formula (~ a + b) has no dependent variable; it can be interpreted as something like "using the variables a and b".
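To answer the other part of the question: this is base R, not a library (formulas are part of the stats package). For cor.test, the two-variable formula call is equivalent to passing the two vectors directly:

cor.test(~ hp + qsec, data = mtcars)
# equivalent to
cor.test(mtcars$hp, mtcars$qsec)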

Related

Error in rep(" ", spaces1) : invalid 'times' argument

I'm trying to carry out covariate balancing using the ebal package. The basic code is:
W1 <- weightit(Conformidad ~ SexoCon + DurPetFiscPrisión1 +
                 Edad + HojaHistPen + NacionCon + AnteVivos +
                 TipoAbog + Reincidencia + Habitualidad + Delitos,
               data = Suspension1,
               method = "ebal", estimand = "ATT")
I then want to check the balance using the summary function:
summary(W1)
This originally worked fine but now I get the error message:
Error in rep(" ", spaces1) : invalid 'times' argument
It's the same dataset and same code, except I changed some of the covariates. But now even when I go back to the original covariates I get the same error. Any ideas would be greatly appreciated!
I'm the author of WeightIt. That looks like a bug. I'll take a look at it. Are you using the most updated version of WeightIt?
Also, summary() doesn't assess balance. To do that, you need to use cobalt::bal.tab(). summary() summarizes the distribution of the weights, which is less critical than examining balance. bal.tab() also displays the effective sample size, which is probably the most important statistic produced by summary().
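For reference, bal.tab() has a method for weightit objects, so checking balance is one call (assuming cobalt is installed):

library(cobalt)
bal.tab(W1)  # balance statistics for the weighted sample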
I encountered the same error message. It happens when the treatment variable passed to weightit() is coded as a factor or character rather than as numeric.
To make summary() work, recode the treatment as numeric 0/1.
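A minimal sketch of that recode, assuming Conformidad is a two-level factor (adapt to your data):

# convert a two-level factor treatment to numeric 0/1
Suspension1$Conformidad <- as.numeric(Suspension1$Conformidad) - 1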

Programmatically detect function calls in R formulae, e.g. y ~ x + log(z), and surround them in backticks

Let me explain my goal first because while the title expresses my strategy, I don't think it is likely to be the only way to solve the problem.
I have an R function to which I pass fitted model objects, like those from lm, and the function extracts the model frame, saves that as a data frame, standardizes the variables in the new data frame, then refits the model with the standardized variables to ease the interpretation of the model's coefficients.
Example code without wrapping it in a function:
mod <- lm(mpg ~ wt, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale))
standardized_mod <- update(mod, data = new_data)
Now, because standardized_mod is fitted with standardized data, its summary will give standardized coefficients.
This isn't the most efficient way of doing things, I admit, since I could do something like multiplying the estimates and SEs by each variable's standard deviation. But in the context of the function, I'm trying to be more flexible; this gets less straightforward when working with survey package objects and the like. I also use the same logic to fit models with interaction terms for simple slopes analysis. But this is beside the main point of the question; I just want to offer some explanation to avoid getting bogged down with "there's other ways to standardize coefficients" responses. I'm more interested in this general problem with formulae than in the specific application.
The solution above falls apart when a function is applied to any of the variables. For example,
mod <- lm(mpg ~ log(wt), data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
standardized_mod <- update(mod, data = new_data)
This will break on update(mod, data = new_data), because lm is going to look in new_data for a column called wt to apply log to, but new_data only has columns called mpg and log(wt).
What I would like to do is manipulate the model formula in such a way that it goes from mpg ~ log(wt) to mpg ~ `log(wt)`. Of course, if it was just log I was worried about, I might be able to get something really hacky going to address it. But I'd like to be able to do the same regardless of the function in the formula, like if it's poly or some such.
Here are some solutions I've considered:
Instead of update, re-fit the model with lm directly and use . for the RHS of the formula. This would work for some cases, but it has big drawbacks: it ignores any interaction terms or other arithmetic in the original formula, and it won't fix the problem if a function was applied to the LHS of the formula in the original model.
Use some kind of convoluted regex matching to isolate terms that appear to be functions by virtue of sitting right before (. As a general rule, though, I'm wary of string manipulation since it can fail in confusing ways. I'm not completely ruling this route out, but I haven't worked out how to do it safely, and I'm not sure how to match function terms without accidentally capturing other parts of the formula.
I've tried messing around with the terms object, trying to use it as a way to call update on the formula itself, but I haven't had much luck figuring out how to edit the terms object in the right ways.
We can avoid having to re-create the formula like this: mm0 is the model matrix minus the intercept column; scale that, giving mm0_std. Now compute the new standardized lm:
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
response <- mod$model[1]
mm0 <- model.matrix(mod)[, -1]
mm0_std <- scale(mm0)
mod_std <- lm(cbind(response, mm0_std))
If you do want the formula this will give it:
formula(mod_std)
## mpg ~ `log(wt)` + qsec + `log(wt):qsec`
## <environment: 0x000000000b1988c8>
I've thought of another potential solution as well, but I've not extensively tested it, and it uses regex, which as I understand it is not the most R-like way of doing things.
mod <- lm(mpg ~ log(wt) * qsec, data = mtcars)
new_data <- model.frame(mod)
new_data <- data.frame(lapply(new_data, FUN = scale), check.names = FALSE)
We have the usual start, above.
Now I pull the variable names from the terms object.
vars <- as.character(attributes(terms(mod))$variables)
vars <- vars[-1] # gets rid of "list"
And save the full formula as a string.
char_form <- as.character(deparse(formula(mod)))
Now I iterate through the variables and use regex to surround each one in backticks. This gets around the trickier regex I was worried about with regard to detecting which variables had functions applied.
for (var in vars) {
  backtick_name <- paste("`", var, "`", sep = "")
  char_form <- gsub(var, backtick_name, char_form, fixed = TRUE)
}
If I want to specify a variable not to standardize, like the outcome variable, I can exclude it from the vars vector programmatically. For instance, I can do this:
response <- as.character(formula(mod))[2]
vars <- vars[vars != response]
Of course, we can also remove the response just by dropping the first item of the vector; the above is for demonstrative purposes.
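That one-liner, for completeness (after dropping "list", the response is the first element of vars):

vars <- vars[-1]  # drops the response, leaving only the RHS variables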
Now I can refit the model with the new data and new formula.
new_model <- update(mod, formula = as.formula(char_form), data = new_data)
In this narrow case, I don't really need to use update since I have all I need for lm. But if I were starting with a glm object or some other model, update preserves other user-supplied arguments like family.
Note: Weights and offsets can be problematic here, but it's not an intractable problem. I think the most straightforward thing to do is explicitly exclude columns named "(weights)" and "(offset)" from the model frame before scaling, then cbinding it back together afterwards. Then the user can use conditionals or some such to decide when to supply weights = `(weights)` and offset = `(offset)` arguments to update.
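A minimal sketch of that exclusion, under the assumption that the model frame stores these components in columns literally named "(weights)" and "(offset)":

mf <- model.frame(mod)
special <- intersect(c("(weights)", "(offset)"), names(mf))
scaled <- data.frame(lapply(mf[setdiff(names(mf), special)], FUN = scale),
                     check.names = FALSE)
new_data <- cbind(scaled, mf[special])  # re-attach the unscaled special columns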

Subsetting data breaks GLM

I have a GLM Logit regression that works correctly, but when I add a subset argument to the GLM command, I get the following error:
invalid type (list) for variable '(weights)'.
So, the following command works:
glm(formula = A ~ B + C, family = "binomial", data = Data)
But the following command yield the error:
glm(formula = A ~ B + C, family = "binomial", data = Data, subset(Data, D < 10))
(I realize that it may be difficult to answer this without seeing my data, but any general help on what may be causing my problem would be greatly appreciated)
Try subset = D < 10 instead (you don't need to specify Data again; it is implicitly used as the environment for the subset argument). Because you haven't named the argument, R is interpreting it as the weights argument, which is the next argument after data.
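In other words, the corrected call is:

glm(A ~ B + C, family = "binomial", data = Data, subset = D < 10)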

Piece-wise linear and non-linear regression in R

I have a question which is perhaps more a statistical query than one related to R directly; however, it may be that I am just invoking an R package incorrectly, so I will post the question here. I have the following dataset:
x<-c(1e-08, 1.1e-08, 1.2e-08, 1.3e-08, 1.4e-08, 1.6e-08, 1.7e-08,
1.9e-08, 2.1e-08, 2.3e-08, 2.6e-08, 2.8e-08, 3.1e-08, 3.5e-08,
4.2e-08, 4.7e-08, 5.2e-08, 5.8e-08, 6.4e-08, 7.1e-08, 7.9e-08,
8.8e-08, 9.8e-08, 1.1e-07, 1.23e-07, 1.38e-07, 1.55e-07, 1.76e-07,
1.98e-07, 2.26e-07, 2.58e-07, 2.95e-07, 3.25e-07, 3.75e-07, 4.25e-07,
4.75e-07, 5.4e-07, 6.15e-07, 6.75e-07, 7.5e-07, 9e-07, 1.15e-06,
1.45e-06, 1.8e-06, 2.25e-06, 2.75e-06, 3.25e-06, 3.75e-06, 4.5e-06,
5.75e-06, 7e-06, 8e-06, 9.25e-06, 1.125e-05, 1.375e-05, 1.625e-05,
1.875e-05, 2.25e-05, 2.75e-05, 3.1e-05)
y2<-c(-0.169718017273307, 7.28508517630734, 71.6802510299446, 164.637259265704,
322.02901173786, 522.719633360006, 631.977073772459, 792.321270345847,
971.810607095548, 1132.27551798986, 1321.01923840546, 1445.33152600664,
1568.14204073109, 1724.30089942149, 1866.79717333592, 1960.12465709003,
2028.46548012508, 2103.16027631327, 2184.10965255236, 2297.53360080873,
2406.98288043262, 2502.95194879366, 2565.31085776325, 2542.7485752473,
2499.42610084412, 2257.31567571328, 2150.92120390084, 1998.13356362596,
1990.25434682546, 2101.21333152526, 2211.08405955931, 1335.27559108724,
381.326449703455, 430.9020598199, 291.370887491989, 219.580548355043,
238.708972427248, 175.583544448326, 106.057481792519, 59.8876372379487,
26.965143266819, 10.2965349811467, 5.07812046132922, 3.19125838983254,
0.788251933518549, 1.67980552001939, 1.97695007279929, 0.770663673279958,
0.209216903989619, 0.0117903221723813, 0.000974437796492681,
0.000668823762763647, 0.000545308757270207, 0.000490042305650751,
0.000468780182460397, 0.000322977916070751, 0.000195423690538495,
0.000175847622407421, 0.000135771259866332, 9.15607623591363e-05)
which when plotted looks like this:
I have then attempted to use the segmented package to generate three linear regressions (solid black line) in three regions (10^-8 to 10^-7, 10^-7 to 10^-6, and >10^-6), since I have a theoretical basis for finding different relationships in these different regions. Clearly, however, my attempt using the following code was unsuccessful:
library(segmented)
lin.mod <- lm(y2 ~ x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi = c(0.0000001, 0.000001))
Thus my first question: are there further parameters of the segmented fit I can tweak other than the breakpoints? So far as I understand, I have iterations set to the maximum by default here.
My second question is: could I perhaps attempt a segmentation using nls()? It looks as though the first two regions on the plot (10^-8 to 10^-7 and 10^-7 to 10^-6) are further from linear than the final section, so perhaps a polynomial function would be better here?
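Regarding the first question: segmented() does expose further tuning parameters through its control argument and seg.control(); a minimal sketch, where the it.max and n.boot values are purely illustrative:

segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi = c(0.0000001, 0.000001),
                           control = seg.control(it.max = 50, n.boot = 50))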
As an example of a result I find acceptable, I have annotated the original plot by hand.
Edit: The reason for using linear fits is the simplicity they provide; to my untrained eye, it would require a fairly complex nonlinear function to regress the dataset as a single unit. One thought that had crossed my mind was to fit a lognormal model to the data, as this may work given the skew along a log x-axis. I do not have enough competence in R to do this, however, as my knowledge only extends to fitdistr, which so far as I understand would not work here.
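A hedged sketch of that lognormal idea, regressing y2 on a scaled lognormal density with nls(); the start values below are rough guesses read off the plot, and convergence is not guaranteed:

fit <- nls(y2 ~ A * dlnorm(x, meanlog, sdlog),
           start = list(A = 1e-3, meanlog = log(1e-7), sdlog = 1))
lines(x, fitted(fit))  # overlay the fitted curve on an existing plot of x vs y2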
Any help or guidance in a relevant direction would be most appreciated.
If you are not satisfied with the segmented package, you can try the earth package, which implements the MARS algorithm. But here I find the result of the segmented model very acceptable; see the R-squared below.
lin.mod <- lm(y2 ~ x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi = c(0.0000001, 0.000001))
summary(segmented.mod)
Meaningful coefficients of the linear terms:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.163e+02  1.143e+02  -1.893   0.0637 .
x            4.743e+10  3.799e+09  12.485   <2e-16 ***
U1.x        -5.360e+10  3.824e+09 -14.017       NA
U2.x         6.175e+09  4.414e+08  13.990       NA

Residual standard error: 232.9 on 54 degrees of freedom
Multiple R-Squared: 0.9468, Adjusted R-squared: 0.9419

Convergence attained in 5 iterations with relative change 3.593324e-14
You can check the result by plotting the model:
plot(segmented.mod)
To get the coefficients of the individual segments, you can do this:
 intercept(segmented.mod)
$x
              Est.
intercept1 -216.30
intercept2 3061.00
intercept3   46.93
> slope(segmented.mod)
$x
             Est.   St.Err.  t value  CI(95%).l  CI(95%).u
slope1  4.743e+10 3.799e+09  12.4800  3.981e+10  5.504e+10
slope2 -6.177e+09 4.414e+08 -14.0000 -7.062e+09 -5.293e+09
slope3 -2.534e+06 5.396e+06  -0.4695 -1.335e+07  8.285e+06

Using function arguments in update.formula

I am writing a function that takes two variables and separately regresses each of them on a set of controls expressed as a one-sided formula. Right now I'm using the following to make the formula for one of the regressions, but it feels a bit hacked-up:
foo <- function(x, y, controls) {
  cl <- match.call()
  xn <- cl[["x"]]
  xf <- as.formula(paste(xn, deparse(controls)))
}
I'd prefer to do this using update.formula(), but of course update.formula(controls, x ~ .) and update.formula(controls, as.name(x) ~ .) don't work. What should I be doing?
Here's one approach:
right <- ~ a + b + c
left <- ~ y
left_2 <- substitute(left ~ ., list(left = left[[2]]))
update(right, left_2)
But I think you'll have to either paste text strings together or use substitute. To the best of my knowledge, there are no functions to create a two-sided formula from two one-sided formulas (or similar equivalents).
I am not sure about update.formula(), but I have used the approach you take here, pasting text and converting it via as.formula, in the past with success. My reading of help(update.formula) does not make me think you can substitute the left-hand side as you desire.
Lastly, trust the dispatching mechanism: if your object is of class formula, just call update, which is preferred over the explicit update.formula.
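A minimal sketch of that paste/as.formula route, reusing the one-sided formulas from the earlier answer:

left <- ~ y
right <- ~ a + b + c
# deparse each formula's RHS expression and glue them into one two-sided formula
as.formula(paste(deparse(left[[2]]), "~", deparse(right[[2]])))
## y ~ a + b + c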
