tuple variable in r regression model - r
I'm trying to use Time of the Day as an independent variable in my model. As time is a circular variable, I'm transforming it to (sin(pi * hour / 12), cos(pi * hour / 12)).
I googled around and I still don't know how to create a column in R with the (sin, cos) formatted vector/tuple values. I don't know if the models lm, glm, glm.nb (MASS) and glmer (lme4) can support this kind of data.
Excuse me for being a novice here. If vector-type variable should not be included in a regression model, I'll go to Cross Validated (stats) for suggestion on dealing with circular variables. Please help and share your experience, thanks!
This is an excellent question.
Internally, R uses matrices to fit models to data. Instead of thinking about your model as a function of tuples, you need to generate a block matrix; this is how splines are implemented, for example.
In your example, one column contains sin(x) and the other cos(x), and the block is tagged with a class attribute; the functions that manage this object internally are makepredictcall and predict.
Any model using the standard model.frame / model.matrix processing should be compatible with this.
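To see the block-matrix idea in isolation, here is a minimal sketch using only base R. It assumes a 24-hour period and simulated data, and the names hour, y and fit_block are made up for illustration. A plain cbind() inside the formula stands in for a classed object, so it carries no period attribute for later prediction.

```r
# Simulated data: hour of day (0-23) and a response with a daily cycle.
set.seed(1)
hour <- sample(0:23, 200, replace = TRUE)
y <- 2 + sin(pi * hour / 12) + 0.5 * cos(pi * hour / 12) + rnorm(200)

# A two-column matrix enters the formula as ONE term.
fit_block <- lm(y ~ cbind(s = sin(pi * hour / 12), c = cos(pi * hour / 12)))

ncol(model.matrix(fit_block))                  # intercept plus 2 columns
length(attr(terms(fit_block), "term.labels"))  # a single model term
anova(fit_block)                               # the term is tested with 2 Df
```

The design matrix gains two columns, but the model still treats them as one term, which is the behaviour the question asks for.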
sincos <- function(x, period=168/2/pi) {
  structure(cbind(`_sin`=sin(x/period),
                  `_cos`=cos(x/period)),
            class="sincos",
            period=period)
}
Here we set the class and period as attributes.
makepredictcall.sincos <- function(var, call){
  if (as.character(call)[1L] != "sincos")
    return(call)
  call["period"] <- attr(var, "period")
  call
}
Set the period as needed on the call.
predict.sincos <- function(object, newx, ...)
{
  if(missing(newx))
    return(object)
  sincos(newx, period=attr(object, "period"))
}
Invoke our function using the period from the fit model.
Here is a short example using lm:
#FAKE DATA EXAMPLE
N <- 1000
hr <- sample(168, N, replace = TRUE)
Y = 5 + sinpi(hr * 2/168) + cospi(hr * 2/168) + rnorm(N)
lm(Y~sinpi(hr*2/168)+cospi(hr*2/168))
#>
#> Call:
#> lm(formula = Y ~ sinpi(hr * 2/168) + cospi(hr * 2/168))
#>
#> Coefficients:
#>       (Intercept)  sinpi(hr * 2/168)  cospi(hr * 2/168)
#>            5.0078             1.0243             0.9637
Our custom function matches exactly:
lm(Y~sincos(hr))
#>
#> Call:
#> lm(formula = Y ~ sincos(hr))
#>
#> Coefficients:
#>    (Intercept)  sincos(hr)_sin  sincos(hr)_cos
#>         5.0078          1.0243          0.9637
Other functions will also be able to tell that the two columns are a single term in the model:
anova(lm(Y~sincos(hr)))
#> Analysis of Variance Table
#>
#> Response: Y
#>             Df  Sum Sq Mean Sq F value    Pr(>F)
#> sincos(hr)   2 1051.05  525.53  553.24 < 2.2e-16 ***
#> Residuals  997  947.06    0.95
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
I'll add this to the stackoverflow package in the next few days, should anyone else find this helpful.
Related
How to incorporate a random effect into a nonlinear mixed effect (nlme) model?
I'd like to build a nonlinear mixed effect model that describes the relationship between two variables, "x" and "y", which vary randomly by a third variable "r" using an exponential rise to a maximum as described by the equation: y = theta(1-exp(-beta*x)). I've been able to create the nonlinear model for x and y using nls(), but I have not been successful in incorporating a random effect into nlme(). When I build the model using nlme() I end up with an error message: "Error in eval(predvars, data, env) : object 'theta' not found". This error is unexpected to me since the nls() model ran without issue using the same dataframe. To build the dataset: x = c(33,35,16,8,31,31,31,23,7,7,7,7,11,11,3,3,6,6,32,32,1,17,17,17,25,40,40,6,6,29,29,13,23,23,44,44,43,43,13,4,6,15,17,22,28,8,11,22,32,6,12,20,27,15,29,29,29,29,29,12,12,16,16,12,12,2,49,49,14,14,14,37,2.87,4.86,7.90,11.95,16.90,16.90,18.90,18.89,22.00,24.08,27.14,30.25,31.22,32.26,7,14,19,31,36,7,14,19,31,36,7,16,16,16,16,16,16,32,32,32,32,32,32,11,11,11,13,13,13,13,13,13,13,13,13,13,13,13,9,9) y = c(39.61,32.66,27.06,21.74,22.18,38.19,35.02,23.13,9.70,14.20,13.40,15.30,18.80,19.00,3.80,4.90,15.00,14.20,24.90,16.56,1.76,29.29,28.49,18.64,27.10,9.47,14.14,10.27,8.44,26.15,25.43,22.00,19.00,13.00,73.19,67.76,32.34,36.86,8.00,1.57,8.33,16.20,14.69,18.95,20.52,4.92,8.28,15.27,18.37,6.60,10.98,12.56,19.04,5.49,21.00,12.90,17.30,11.40,12.20,15.63,15.22,33.80,17.78,19.33,3.86,8.57,30.40,13.39,11.93,4.55,6.18,12.70,2.71,7.23,5.61,22.74,15.71,16.95,18.31,20.78,17.64,20.00,19.52,24.86,30.06,24.92,4.17,11.02,10.08,14.94,25.98,0.00,3.67,3.67,6.69,11.90,5.06,13.21,10.33,0.00,0.00,6.47,8.38,28.57,25.26,28.67,27.92,33.69,29.61,6.11,7.13,6.93,4.81,15.34,4.90,14.94,8.88,10.24,8.80,10.46,10.48,9.19,9.67,9.40,24.98,50.79) r = 
c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","D","E","E","E","F","G","G","H","H","H","H","I","I","I","J","J","J","J","K","L","L","L","L","L","L","L","L","L","L","L","L","L","L","M","N","N","N","N","N","O","P","P","P","P","P","Q","R","R","S","S","S","T","U","U","U","U","U","U","U","U","U","U","U","U","U","U","V","V","V","V","V","V","V","V","V","V","W","X","X","X","X","Y","Y","Z","Z","Z","Z","Z","Z","AA","AA","AA","AB","AB","AB","AB","AB","AB","AB","AB","AB","AB","AB","AB","AC","AC") df = data.frame(x,y,r) To build the nonlinear model without "r" as a random effect. nls_test = nls(y~theta*(1-exp(-beta*x)), data = df, start = list(beta = 0.2, theta = 38), trace = TRUE) In my model, the only fixed effect is x and the only random effect is r. I've tried building an nlme() model that reflects this, based on the nlme package documentation (https://cran.r-project.org/web/packages/nlme/nlme.pdf),more specifically these lines of code found on page 186 of the documentation linked above. The nlme() object that I've tried to create with my data is as follows: nlme_test = nlme(y ~ theta*(1-exp(-beta*x)), fixed = x~1, random = r~1, data = df, start = c(theta = 38, beta = 0.2)) And results in the following error. Error in eval(predvars, data, env) : object 'theta' not found From what I gather, this is related to 'theta' not being included in the dataframe ("df") used to build the nlme object, but it is unclear to me why this occurs as most examples that I have found for this error are related to the use of the predict() function and missing column or disagreement between column names. Also, since the nls() model (nls_Test) worked fine using the same start = c(theta = 38, beta = 0.2) and without a 'theta' or 'beta' data column in df, I'm a bit confused as to why I'm receiving this error about column name error. Does anyone have suggestions or references to help me incorporate the random effect into my nlme model? Thanks!
Expanding on my (now deleted, because incomplete) comment, I assume this is what you want to do. Please confirm carefully by reading the help page about nlme (i.e. ?nlme::nlme).

nlme_test <- nlme(y ~ theta*(1-exp(-beta*x)),
                  fixed = theta + beta ~ 1,
                  random = theta + beta ~ 1,
                  groups = ~ r,
                  data = df,
                  start = c(theta = 38, beta = 0.2))

The fixed and random arguments should not name the variables in your model formula but the regression parameters. This way the function knows which parts of the model are variables (to be found in data) and which parts are parameters. Also, you missed the groups argument in order to specify how the data is clustered.

Output:

summary(nlme_test)
## Nonlinear mixed-effects model fit by maximum likelihood
##   Model: y ~ theta * (1 - exp(-beta * x))
##   Data: df
##        AIC      BIC    logLik
##   887.6224 904.6401 -437.8112
##
## Random effects:
##  Formula: list(theta ~ 1, beta ~ 1)
##  Level: r
##  Structure: General positive-definite, Log-Cholesky parametrization
##          StdDev       Corr
## theta    1.145839e+01 theta
## beta     1.061366e-05 0.01
## Residual 6.215030e+00
##
## Fixed effects: theta + beta ~ 1
##           Value Std.Error DF  t-value p-value
## theta 21.532188 2.8853414 96 7.462614   0e+00
## beta   0.104404 0.0251567 96 4.150144   1e-04
##  Correlation:
##      theta
## beta -0.548
##
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max
## -2.89510795 -0.51882772 -0.09466037  0.34471808  3.66855121
##
## Number of Observations: 126
## Number of Groups: 29
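The same rule (fixed and random name parameters, not data columns) can be tried on a dataset that ships with R. The following is a sketch adapted from the example on the ?nlme help page, not the asker's model. It uses the built-in Loblolly data and the self-starting SSasymp model, and the object name fm is arbitrary.

```r
library(nlme)

# Loblolly is a groupedData object (height ~ age | Seed), so the grouping
# factor is taken from the data and no `groups` argument is needed here.
fm <- nlme(height ~ SSasymp(age, Asym, R0, lrc),
           data = Loblolly,
           fixed = Asym + R0 + lrc ~ 1,   # parameters, not data columns
           random = Asym ~ 1,             # only Asym varies by seed source
           start = c(Asym = 103, R0 = -8.5, lrc = -3.3))

fixef(fm)        # population-level estimates of the three parameters
head(ranef(fm))  # per-group deviations in Asym
```

With an ordinary data.frame, as in the question, the grouping has to be supplied explicitly, either through groups or with a random formula of the form parameter ~ 1 | group.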
How can I print the results of a summary and predict function by running a single code chunk using dplyr?
I am trying to fit several linear models using tidyverse in R. I am interested in printing out the results of the model fit using summary as well as a custom function designed to return statistical parameters not returned by summary like AIC values, and then apply this model to predict values in a set of known data (a test dataset). Here is an example of what I am doing using the mtcars dataset.

library(tidyverse);library(magrittr)
mtcars%>%
  filter(gear=="4")%$%
  lm(hp~mpg)%>%
  summary()
mtcars%>%
  filter(gear=="4")%$%
  lm(hp~mpg)%>%
  AIC()
mtcars%>%
  filter(gear=="4")%$%
  lm(hp~mpg)%>%
  predict(newdata=data.frame(mpg=19))

I am often doing a lot of filtering of my data before calling lm (due to missing data that are not missing for all models, using mutate calls, using summarise, or filtering based on a categorical variable of interest), and fitting many different model permutations. However, I end up having to call the same code multiple times in order to obtain the summary statistics. Normally I would just save the lm models as an object but in this case I am interested in just running a preliminary test to see what the results look like to see if that version is worth saving, and I don't want large numbers of lm objects cluttering up my global environment. However it seems once a pipe is called after lm it is not possible to call the temporary lm object again.

Is there any tidy way to retain a fitted lm object and fork it in the same string of code such that I can print the results of a summary, predict, and AIC function in a single call?
A magrittr pipeline allows a code block in braces, where . is the value coming from the chain. So

mtcars %>%
  filter(gear == "4") %$%
  lm(hp ~ mpg) %>%
  {list(
    summary(.),
    AIC(.),
    predict(., newdata = data.frame(mpg = 19))
  )}

will work. You could also use the %T>% (tee) pipe, but you'll need to explicitly print the values in the chain if you want to see them:

mtcars %>%
  filter(gear == "4") %$%
  lm(hp ~ mpg) %T>%
  {print(summary(.))} %T>%
  {print(AIC(.))} %>%
  predict(newdata = data.frame(mpg = 19))
One option is to make a custom function that produces the desired outputs together. Then you can feed whatever data you like in as a single line.

library(tidyverse)

## function to produce all desired outputs in one object
f <- function(train_data = mtcars, x = "mpg", y = "hp", test_data = data.frame(mpg = 19)) {
  formula <- as.formula(paste0(y, "~", x))
  mod <- lm(formula, data = train_data)
  list(
    summary = summary(mod),
    AIC = AIC(mod),
    prediction = predict(mod, test_data)
  )
}

f()
#> $summary
#>
#> Call:
#> lm(formula = formula, data = train_data)
#>
#> Residuals:
#>    Min     1Q Median     3Q    Max
#> -59.26 -28.93 -13.45  25.65 143.36
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)   324.08      27.43  11.813 8.25e-13 ***
#> mpg            -8.83       1.31  -6.742 1.79e-07 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 43.95 on 30 degrees of freedom
#> Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892
#> F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07
#>
#>
#> $AIC
#> [1] 336.8553
#>
#> $prediction
#>        1
#> 156.3174

Created on 2022-07-21 by the reprex package (v2.0.1)
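For comparison, a base R variant of the same idea, offered only as a sketch: local() is a different technique from the pipe-based and function-based answers above, and the name res is arbitrary. The model is fitted inside a temporary environment, so only the collected results are kept.

```r
# Fit inside local() so the model object never reaches the global environment.
res <- local({
  mod <- lm(hp ~ mpg, data = subset(mtcars, gear == 4))
  list(
    summary    = summary(mod),
    AIC        = AIC(mod),
    prediction = predict(mod, newdata = data.frame(mpg = 19))
  )
})

res$AIC
res$prediction
```

Printing res, or res$summary on its own, shows the same information as the three separate pipelines in the question.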
including non linearity in fixed effects model in plm
I am trying to build a fixed effects regression with the plm package in R. I am using country-level panel data with year and country fixed effects. My problem concerns 2 explanatory variables. One is an interaction term of two variables and one is a squared term of one of the variables. The model is basically:

y = x1 + x1^2 + x3 + x1*x3 + ... + xn, with the variables all being in log form

It is central to the model to include the squared term, but when I run the regression it always gets excluded because of "singularities", as x1 and x1^2 are obviously correlated. Meaning the regression works and I get estimates for my variables, just not for x1^2 and x1*x2. How do I circumvent this?

library(plm)
fe_reg <- plm(log(y) ~ log(x1)+log(x2)+log(x2^2)+log(x1*x2)+dummy,
              data = df,
              index = c("country", "year"),
              model = "within",
              effect = "twoways")
summary(fe_reg)

# I have tried defining the interaction and squared terms as vectors, which helped with the
# interaction term but not the squared term.
df1.pd <- df1 %>% mutate_at(c('x1'), ~(scale(.) %>% as.vector))
df1.pd <- df1 %>% mutate_at(c('x2'), ~(scale(.) %>% as.vector))

I am pretty new to R, so apologies if this is not a very well structured question.
You just found two properties of the logarithm function:

log(x^2) = 2 * log(x)
log(x*y) = log(x) + log(y)

Then, obviously, log(x) is collinear with 2*log(x) and one of the two collinear variables is dropped from the estimation. Same for log(x*y) and log(x) + log(y).

So, the model you want to estimate is not estimable by linear regression methods. You might want to take different data transformations than log into account or the original variables.

See also the reproducible example below, where I just used log(x^2) = 2*log(x). Linear dependence can be detected, e.g., via function detect.lindep from package plm (see also below). Dropping of coefficients from estimation also hints at collinear columns in the model estimation matrix. At times, linear dependence appears only after data transformations involved in the estimation functions (for an example involving the within transformation, see the Examples section of the help page ?detect.lindep).

library(plm)
data("Grunfeld")
pGrun <- pdata.frame(Grunfeld)
pGrun$lvalue  <- log(pGrun$value)   # log(x)
pGrun$lvalue2 <- log(pGrun$value^2) # log(x^2) == 2 * log(x)
mod <- plm(inv ~ lvalue + lvalue2 + capital, data = pGrun, model = "within")
summary(mod)
#> Oneway (individual) effect Within Model
#>
#> Call:
#> plm(formula = inv ~ lvalue + lvalue2 + capital, data = pGrun,
#>     model = "within")
#>
#> Balanced Panel: n = 10, T = 20, N = 200
#>
#> Residuals:
#>       Min.    1st Qu.     Median    3rd Qu.       Max.
#> -186.62916  -20.56311   -0.17669   20.66673  300.87714
#>
#> Coefficients: (1 dropped because of singularities)
#>          Estimate Std. Error t-value Pr(>|t|)
#> lvalue  30.979345  17.592730  1.7609  0.07988 .
#> capital  0.360764   0.020078 17.9678  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Total Sum of Squares:    2244400
#> Residual Sum of Squares: 751290
#> R-Squared:      0.66525
#> Adj. R-Squared: 0.64567
#> F-statistic: 186.81 on 2 and 188 DF, p-value: < 2.22e-16

detect.lindep(mod) # run on the model
#> [1] "Suspicious column number(s): 1, 2"
#> [1] "Suspicious column name(s): lvalue, lvalue2"

detect.lindep(pGrun) # run on the data
#> [1] "Suspicious column number(s): 6, 7"
#> [1] "Suspicious column name(s): lvalue, lvalue2"
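Both identities can be checked with plain lm() on simulated data. The sketch below makes one assumption the thread does not confirm: that the intended model is quadratic in the logged variables, i.e. uses (log x1)^2 and log(x1)*log(x2) rather than log(x1^2) and log(x1*x2). All variable names are made up.

```r
set.seed(42)
x1 <- runif(50, 1, 10)
x2 <- runif(50, 1, 10)
log_y <- 0.5 * log(x1) + 0.3 * log(x2) + rnorm(50, sd = 0.1)

# log(x1^2) = 2*log(x1) and log(x1*x2) = log(x1) + log(x2): both are aliased.
fit_aliased <- lm(log_y ~ log(x1) + log(x2) + log(x1^2) + log(x1 * x2))
coef(fit_aliased)   # the two redundant terms come back as NA

# Squaring and multiplying AFTER taking logs gives genuinely new columns.
fit_translog <- lm(log_y ~ log(x1) + log(x2) + I(log(x1)^2) + I(log(x1) * log(x2)))
coef(fit_translog)
```

Whether the second specification is appropriate is a modelling question; the sketch only shows that it is estimable where the first is not.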
Precision in summary output of lm R
I am doing some exercises using package r-exams, in which I print a summary from an lm object and ask students things like, "what is the estimated value of the intercept". The idea is that the student copies the value from the summary output and uses that value as the answer. The issue here is that I use the values from the coef() function as the correct answers, but this is not a good idea since the precision of these values is quite different from the precision of the values shown in the summary output. Here is an example:

set.seed(123)
library(tidyverse)
## DATA GENERATION
xbreaks<-c(runif(1,4,4.8),runif(1,6,6.9),runif(1,7.8,8.5),runif(1,9,10))
ybreaks<-c(runif(1,500,1000),runif(1,1800,4000),runif(1,200,800))
b11<-(ybreaks[2]-ybreaks[1])/(xbreaks[2]-xbreaks[1])
b10<-ybreaks[1]-b11*xbreaks[1]
b31<-(ybreaks[3]-ybreaks[2])/(xbreaks[4]-xbreaks[3])
b30<-ybreaks[2]-b31*xbreaks[3]
points_df<-data.frame(x=xbreaks,y=ybreaks[c(1,2,2,3)])
n<-rpois(3,120)
x1<-runif(n[1],xbreaks[1],xbreaks[2])
x2<-runif(n[2],xbreaks[2],xbreaks[3])
x3<-runif(n[3],xbreaks[3],xbreaks[4])
y<-c(b10+b11*x1+rnorm(n[1],0,200), ybreaks[2]+rnorm(n[2],0,200), b30+b31*x3+rnorm(n[3],0,200))
z0_aw<-data.frame(ph=c(x1,x2,x3),UFC=y,case=factor(c(rep(1,n[1]),rep(2,n[2]),rep(3,n[3]))))
mean_x<-z0_aw$ph%>% mean %>% round(2)
caserng<-sample(1:4,1)
modrng<-sample(1:2,1)
if(caserng!=4){
  z0_aw<-z0_aw[z0_aw$case == caserng,]
}
if(modrng==1){
  m0<-lm(UFC~ph,data=z0_aw)
}else{
  cl <- call("lm", formula = UFC ~ I(ph - mean_x), data = as.name("z0_aw"))
  cl$formula[[3]][[2]][[3]] <- mean_x
  m0<-eval(cl)
}
summary(m0)
#>
#> Call:
#> lm(formula = UFC ~ I(ph - 7.2), data = z0_aw)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -555.53 -121.98    5.46  115.38  457.08
#>
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)
#> (Intercept)  2726.86      57.33   47.57   <2e-16 ***
#> I(ph - 7.2)  -840.05      31.46  -26.70   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 182.7 on 116 degrees of freedom
#> Multiple R-squared:  0.8601, Adjusted R-squared:  0.8589
#> F-statistic: 713.1 on 1 and 116 DF,  p-value: < 2.2e-16
coef(m0)
#> (Intercept) I(ph - 7.2)
#>   2726.8605   -840.0515

Created on 2021-05-14 by the reprex package (v2.0.0)

Suppose that extol: 0.0001 is set in r-exams, and the student is asked to give the estimated value of the intercept. The student's answer will be marked wrong, since they will answer 2726.86 but the correct answer from coef is 2726.8605.

As can be seen, the output of summary uses 2 decimals, whereas the coef() values have considerably more precision. I want to know how many decimals summary is using in order to apply the same format to the values produced by coef(). This will ensure that the answer provided by the student is the same as the summary output. I just want to do this:

answers<-coef(m0) %>% format(digits=dsum) %>% as.numeric()

where dsum is the number of digits also used by the summary output.

Note: retaining a precision of 4 decimals is needed since I also ask students about the R-squared value provided in the same summary output, so it is not a good idea to set extol: 0.01, for example. Also, the problems are generated at random and the magnitude of the estimated coefficients changes; I have noted that this is directly related to the precision used in the summary output.
Some useful information for such questions in R/exams:

The extol can also be a vector so that you can set different tolerances for coefficients and R-squared etc.

When asking about the R-squared, though, I typically ask for it "in percent". Then the same tolerance may be suitable as for the coefficients.

I would recommend to control the size of the coefficients suitably so that digits and extol can be set accordingly.

Personally, I typically store the exsolution at a higher precision than I request from the students. For example, exsolution could be 12.345678 while I only set extol to 0.01. This makes sure that when the correct answer is rounded to two decimal places it is inside the correct interval determined by exsolution and extol.

Details on formatting of the coefficients in the summary:

It is not obvious where exactly the formatting happens: The summary() method for lm objects returns an object of class summary.lm which has its own print() method which in turn calls printCoefmat(). The latter is the function that does the actual formatting.

When setting the digits in these functions, this controls the number of significant digits and not the number of decimal places. This is particularly important when the coefficients become relatively large (say, in the thousands or more).

The coefficients are not formatted individually but jointly with the corresponding standard errors. The details depend on the digits, the size of both coefficients and standard errors, and whether any coefficients are aliased or exactly zero etc.

Without aliased/zero coefficients the formatting from summary(m0) can be replicated using format_coef(m0) as defined below. That's essentially the boiled-down code from printCoefmat().

format_coef <- function(object, digits = max(3L, getOption("digits") - 2L)) {
  coef_se <- summary(object)$coefficients[, 1L:2L]
  digmin <- 1L + floor(log10(range(abs(coef_se))))
  format(round(coef_se, max(1L, digits - digmin)), digits = digits)[, 1L]
}
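The tolerance arithmetic can be made concrete with the numbers from the question. This is only a sketch: is_accepted is a hypothetical helper that mimics an "exsolution plus or minus extol" check and is not taken from the R/exams source.

```r
# Hypothetical helper: TRUE when an answer falls within exsolution +/- extol.
is_accepted <- function(answer, exsolution, extol) {
  abs(answer - exsolution) <= extol
}

exsolution <- 2726.8605   # full-precision value from coef(m0) in the question
student    <- 2726.86     # value the student reads off the summary output

is_accepted(student, exsolution, extol = 0.0001)  # FALSE: off by about 0.0005
is_accepted(student, exsolution, extol = 0.01)    # TRUE

# digits counts significant digits, not decimal places:
signif(exsolution, 5)   # keeps 5 significant digits
round(exsolution, 2)    # keeps 2 decimal places
```

Storing the full-precision solution and widening the tolerance accepts the rounded value without reproducing the summary formatting exactly.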
Different results, using same data and method(?), when using WordMat and R
I am interested to reproduce results calculated by the GNU plugin to MS Word WordMat in R, but I can't get them to arrive at similar results (I am not looking for identical, but simply similar). I have some y and x values and a power function, y = bx^a

Using the following data,

x <- c(15,31,37,44,51,59)
y <- c(126,71,61,53,47,42)

I get a = -0.8051 and b = 1117.7472 in WordMat, but a = -0.8026 and B = 1108.2533 in R, slightly different values. Am I using the nls function in some wrong way or is there a better (more transparent) way to calculate it in R?

Data and R code,

# x <- c(15,31,37,44,51,59)
# y <- c(126,71,61,53,47,42)
df <- data.frame(x,y)
moD <- nls(y~a*x^b, df, start = list(a = 1,b=1))
summary(moD)

Formula: y ~ a * x^b

Parameters:
    Estimate Std. Error t value Pr(>|t|)
a  1.108e+03  1.298e+01   85.35 1.13e-07 ***
b -8.026e-01  3.626e-03 -221.36 2.50e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3296 on 4 degrees of freedom

Number of iterations to convergence: 19
Achieved convergence tolerance: 5.813e-06
It looks like WordMat is estimating the parameters of y=b*x^a by doing the log-log regression rather than by solving the nonlinear least-squares problem:

> x <- c(15,31,37,44,51,59)
> y <- c(126,71,61,53,47,42)
>
> (m1 <- lm(log(y)~log(x)))

Call:
lm(formula = log(y) ~ log(x))

Coefficients:
(Intercept)       log(x)
     7.0191      -0.8051

> exp(coef(m1)[1])
(Intercept)
   1117.747

To explain what's going on here a little bit more: if y=b*x^a, taking the log on both sides gives log(y)=log(b)+a*log(x), which has the form of a linear regression (lm() in R). However, log-transforming also affects the variance of the errors (which are implicitly included on the right-hand side of the equation), meaning that you're actually solving a different problem. Which is correct depends on exactly how you state the problem. This question on CrossValidated gives more details.
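To put the two estimators side by side, here is a small sketch on the data from the question, written in the y = b*x^a parameterisation. Using the log-log estimates as nls starting values is a choice made here for convenience, not something the thread prescribes, and the object names are arbitrary.

```r
x <- c(15, 31, 37, 44, 51, 59)
y <- c(126, 71, 61, 53, 47, 42)

# Log-log linear regression: minimizes squared errors in log(y).
fit_lm <- lm(log(y) ~ log(x))
a_lm <- unname(coef(fit_lm)[2])
b_lm <- unname(exp(coef(fit_lm)[1]))

# Nonlinear least squares: minimizes squared errors in y itself.
# The log-log estimates are used as starting values.
fit_nls <- nls(y ~ b * x^a, start = list(a = a_lm, b = b_lm))

rbind(loglog = c(a = a_lm, b = b_lm), nls = coef(fit_nls))

# Residual sum of squares on the original scale for each fit.
rss_lm  <- sum((y - b_lm * x^a_lm)^2)
rss_nls <- sum(resid(fit_nls)^2)
c(loglog = rss_lm, nls = rss_nls)
```

On the original scale the nls fit cannot have a larger residual sum of squares than the back-transformed log-log fit, since that is the quantity nls minimizes. On the log scale the ordering reverses.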