How would I fit this linear trend function in R?

I want to fit this linear trend function to my data:
Y_t = a + b*t + X_t
This is based on time series data.
I believe writing lm(y ~ time) will return the equivalent of Y_t = a + X_t, but I'm confused about how to include the b*t term in this linear trend function in R.

You can simply include the time index as an explanatory variable:
library(data.table)
# build a toy data set with a time index year = 1..200
d <- data.table(id = 1)
d <- d[, .(year = 1:200), by = id]
d[, x1 := runif(200)]
# add an error term
d[, e := rnorm(200, 23, 7)]
# add the dependent variable, with a deterministic trend 0.5 * year
d[, y := 3.5*x1 + 0.5*year + e]
# the time index enters the regression as just another regressor
m <- lm(y ~ x1 + year, d)
summary(m)
Call:
lm(formula = y ~ x1 + year, data = d)
Residuals:
Min 1Q Median 3Q Max
-19.2008 -4.4356 0.3986 5.2283 16.6819
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.064776 1.519766 13.203 <2e-16 ***
x1 3.114048 1.914318 1.627 0.105
year 0.523195 0.009187 56.947 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.469 on 197 degrees of freedom
Multiple R-squared: 0.943, Adjusted R-squared: 0.9424
F-statistic: 1628 on 2 and 197 DF, p-value: < 2.2e-16
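For the trend-only model in the question, a minimal sketch (assuming y holds your series, and using a plain 1..n index as the time variable):
# assumption: y is the observed series; t is a simple 1, 2, ..., n time index
t <- seq_along(y)
trend <- lm(y ~ t)
coef(trend)  # "(Intercept)" estimates a, "t" estimates b; residuals act as X_t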

Related

R polynomial regression, or group values and test between groups + outcome interpretation

I am trying to model the scar acquisition rate of a wild population of animals over time, and I have calculated yearly rates.
As the plot suggests, it seems to me that rates rise through the middle of the period and then fall again. I have tried to fit a polynomial LM with the code
model1 <- lm(Rate~poly(year, 2, raw = TRUE),data=yearlyratesub)
summary(model1)
model1
I have plotted it using:
library(ggplot2)
g <- ggplot(yearlyratesub, aes(year, Rate)) +
  geom_point(shape = 1) +
  geom_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE))
g
The model output was:
Call:
lm(formula = Rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.126332 -0.037683 -0.002602 0.053222 0.083503
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.796e+03 3.566e+03 -2.467 0.0297 *
poly(year, 2, raw = TRUE)1 8.747e+00 3.545e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.174e-03 8.813e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.0666 on 12 degrees of freedom
Multiple R-squared: 0.3369, Adjusted R-squared: 0.2264
F-statistic: 3.048 on 2 and 12 DF, p-value: 0.08503
How can I interpret that now? The overall model p-value is not significant, but the intercept and the individual slopes are?
Should I try a fit other than x², or group the values and test between groups, e.g. with an ANOVA? I know the LM has a low fit, but I guess that's because I have few values, and maybe x² is not the right form...?
I would be happy about input regarding model and outcome interpretation.
Grouping
Since the data was not provided (next time please provide a complete reproducible question including all inputs), we used the data in the Note at the end. We see that the model is highly significant if we group the points using the indicated breakpoints.
# split the years at the breakpoints 2007.5 and 2014.5, giving groups 1, 2, 3
g <- factor(findInterval(yearlyratesub$year, c(2007.5, 2014.5)) + 1); g
## [1] 1 1 1 1 2 2 2 2 2 2 2 3 3 3 3
## Levels: 1 2 3
fm <- lm(rate ~ g, yearlyratesub)
summary(fm)
giving
Call:
lm(formula = rate ~ g, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.018491 0.006091 0.029684 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.110854 0.019694 5.629 0.000111 ***
g2 0.127783 0.024687 5.176 0.000231 ***
g3 -0.006714 0.027851 -0.241 0.813574
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03939 on 12 degrees of freedom
Multiple R-squared: 0.7755, Adjusted R-squared: 0.738
F-statistic: 20.72 on 2 and 12 DF, p-value: 0.0001281
We could consider combining the outer two groups.
g2 <- factor(g == 2)
fm2 <- lm(rate ~ g2, yearlyratesub)
summary(fm2)
giving:
Call:
lm(formula = rate ~ g2, data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.064618 -0.016813 0.007096 0.031363 0.046831
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.10750 0.01341 8.015 2.19e-06 ***
g2TRUE 0.13114 0.01963 6.680 1.52e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03793 on 13 degrees of freedom
Multiple R-squared: 0.7744, Adjusted R-squared: 0.757
F-statistic: 44.62 on 1 and 13 DF, p-value: 1.517e-05
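Whether merging the outer groups loses anything can be checked directly with an F-test between the two nested models (fm2 is fm with groups 1 and 3 merged); a quick sketch:
# fm2 is nested within fm, so anova() gives the F-test for the extra level
anova(fm2, fm)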
Sinusoid
Looking at the graph, the points seem to turn up at the left and right edges, suggesting a sinusoidal fit: a + b * cos(c * year). With nls's plinear algorithm the linear coefficients a and b are profiled out by least squares, so only c needs a starting value.
fm3 <- nls(rate ~ cbind(a = 1, b = cos(c * year)),
yearlyratesub, start = list(c = 0.5), algorithm = "plinear")
summary(fm3)
giving
Formula: rate ~ cbind(a = 1, b = cos(c * year))
Parameters:
Estimate Std. Error t value Pr(>|t|)
c 0.4999618 0.0001449 3449.654 < 2e-16 ***
.lin.a 0.1787200 0.0150659 11.863 5.5e-08 ***
.lin.b 0.0753754 0.0205818 3.662 0.00325 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.05688 on 12 degrees of freedom
Number of iterations to convergence: 2
Achieved convergence tolerance: 5.241e-08
Comparison
Plotting the fits and looking at their residual sums of squares and AIC, we have
plot(yearlyratesub)
# fm0 from Note at end, fm and fm2 are grouping models, fm3 is sinusoidal
L <- list(fm0 = fm0, fm = fm, fm2 = fm2, fm3 = fm3)
for(i in seq_along(L)) {
lines(fitted(L[[i]]) ~ year, yearlyratesub, col = i, lwd = 2)
}
legend("topright", names(L), col = seq_along(L), lwd = 2)
giving the following, where a lower residual sum of squares and a lower AIC (which takes the number of parameters into account) are better. fm fits most closely by residual sum of squares, with fm2 fitting almost as well; however, once the number of parameters is accounted for via AIC, fm2 has the lowest value and is the most favored by that criterion.
cbind(RSS = sapply(L, deviance), AIC = sapply(L, AIC))
## RSS AIC
## fm0 0.05488031 -33.59161
## fm 0.01861659 -49.80813
## fm2 0.01870674 -51.73567
## fm3 0.04024237 -38.24512
Note
yearlyratesub <-
structure(list(year = c(2004, 2005, 2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013, 2014, 2015, 2017, 2018, 2019), rate = c(0.14099813521287,
0.0949946651016247, 0.0904788394070601, 0.11694517831575, 0.26786193592875,
0.256346628540479, 0.222029818828298, 0.180116679856725, 0.285467976459104,
0.174019208113095, 0.28461698734932, 0.0574827955982996, 0.103378448084776,
0.114593695172686, 0.141105952837639)), row.names = c(NA, -15L
), class = "data.frame")
fm0 <- lm(rate ~ poly(year, 2, raw = TRUE), yearlyratesub)
summary(fm0)
giving
Call:
lm(formula = rate ~ poly(year, 2, raw = TRUE), data = yearlyratesub)
Residuals:
Min 1Q Median 3Q Max
-0.128335 -0.038289 -0.002715 0.054090 0.084792
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.930e+03 3.621e+03 -2.466 0.0297 *
poly(year, 2, raw = TRUE)1 8.880e+00 3.600e+00 2.467 0.0297 *
poly(year, 2, raw = TRUE)2 -2.207e-03 8.949e-04 -2.467 0.0297 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06763 on 12 degrees of freedom
Multiple R-squared: 0.3381, Adjusted R-squared: 0.2278
F-statistic: 3.065 on 2 and 12 DF, p-value: 0.0841
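Incidentally, on the interpretation puzzle in the question (each coefficient individually significant, overall F-test not): with raw = TRUE the year and year² columns are nearly collinear over 2004-2019, and individual t-tests and the joint F-test can disagree in that situation. A sketch using orthogonal polynomials, which fit the same curve with decorrelated columns:
# the raw polynomial columns are almost perfectly correlated for these years
cor(yearlyratesub$year, yearlyratesub$year^2)
# poly() without raw = TRUE uses orthogonal polynomials: same fitted curve,
# but the linear and quadratic terms can be tested independently
fm_orth <- lm(rate ~ poly(year, 2), yearlyratesub)
summary(fm_orth)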

Linear regression on dynamic groups in R

I have a data.table data_dt on which I want to run a linear regression, letting the user choose the number of columns in groups G1 and G2 via the variable n_col. The following code works, but it is slow because of the extra time spent creating matrices. To improve its performance, is there a way to remove Steps 1, 2, and 3 altogether by tweaking the formula in the lm call and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create the response (dependent) variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of explanatory variables
G1 <- as.matrix(data_dt[, .SD, .SDcols = c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of explanatory variables
G2 <- as.matrix(data_dt[, .SD, .SDcols = c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results:
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
This may be easier if you just create the formula with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
response = 'SPI'), data = data_dt)
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
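To see what reformulate builds before fitting, the formula can be printed on its own; with the column names from the summary above it comes out as:
f <- reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
                 response = "SPI")
f
## SPI ~ LMI + MPI + ALT + LPP25 + LPP40 + LPP60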

Extract root of dummy variable in model fit summary

In the following example, gender is encoded as dummy variables corresponding to the categories.
fit <- lm(mass ~ height + gender, data=dplyr::starwars)
summary(fit)
# Call:
# lm(formula = mass ~ height + gender, data = dplyr::starwars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -41.908 -6.536 -1.585 1.302 55.481
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -46.69901 12.67896 -3.683 0.000557 ***
# height 0.59177 0.06784 8.723 1.1e-11 ***
# genderhermaphrodite 1301.13951 17.37871 74.870 < 2e-16 ***
# gendermale 22.39565 5.82763 3.843 0.000338 ***
# gendernone 68.34530 17.49287 3.907 0.000276 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.57 on 51 degrees of freedom
# (31 observations deleted due to missingness)
# Multiple R-squared: 0.9915, Adjusted R-squared: 0.9909
# F-statistic: 1496 on 4 and 51 DF, p-value: < 2.2e-16
Is there a way to extract the root of the dummy variable name? For example, for gendernone, gendermale and genderhermaphrodite, the root would be gender, corresponding to the original column name in the dplyr::starwars data.
Get the variable names from the formula and check which one partially matches each input:
input <- c("gendermale", "height")
# all variables appearing in the model formula: mass, height, gender
v <- all.vars(formula(fit))
# pmatch() partially matches each formula variable against the coefficient
# name, so "gender" matches "gendermale" and "height" matches exactly
v[sapply(input, function(x) which(pmatch(v, x) == 1))]
## [1] "gender" "height"

linear model having 4 predictors

I am trying to fit a linear model having 4 predictors. The problem I am facing is that my code doesn't estimate one parameter. Whichever variable I put last in my lm formula is the one that doesn't get estimated. My code is:
AllData <- read.csv("AllBandReflectance.csv",header = T)
Swir2ref <- AllData$band7
x1 <- AllData$X1
x2 <- AllData$X2
y1 <- AllData$Y1
y2 <- AllData$Y2
linear.model <- lm(Swir2ref ~ x1 + y1 + x2 + y2, data = AllData)
summary(linear.model)
Call:
lm(formula = Swir2ref ~ x1 + y1 + x2 + y2, data = AllData)
Residuals:
Min 1Q Median 3Q Max
-0.027277 -0.008793 -0.000689 0.010085 0.035097
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.595593 0.002006 296.964 <2e-16 ***
x1 0.002175 0.003462 0.628 0.532
y1 0.001498 0.003638 0.412 0.682
x2 0.022671 0.018786 1.207 0.232
y2 NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01437 on 67 degrees of freedom
Multiple R-squared: 0.02876, Adjusted R-squared: -0.01473
F-statistic: 0.6613 on 3 and 67 DF, p-value: 0.5787
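The note "(1 not defined because of singularities)" means lm() found y2 to be an exact linear combination of the other columns and dropped it; whichever collinear variable comes last in the formula is the one discarded. A sketch for diagnosing the dependency:
# alias() reports the exact linear dependency behind the dropped coefficient
alias(linear.model)
# pairwise correlations can also reveal near-duplicate predictors
cor(AllData[, c("X1", "Y1", "X2", "Y2")])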

How to get coefficient for intercept in felm

I run the following code in R and do not get a coefficient for the intercept. How can I get the coefficient for the intercept?
library(lfe)
# create covariates
x <- rnorm(4000)
x2 <- rnorm(length(x))
# create individual and firm identifiers
id <- factor(sample(500, length(x), replace = TRUE))
firm <- factor(sample(300, length(x), replace = TRUE))
# effects
id.eff <- rlnorm(nlevels(id))
firm.eff <- rexp(nlevels(firm))
# left-hand side
y <- 50000 + x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x))
# estimate and print result
est <- felm(y ~ x + x2 | id + firm)
summary(est)
which gives me
Call:
felm(formula = y ~ x + x2 | id + firm)
Residuals:
     Min       1Q   Median       3Q      Max
 -3.3129  -0.6147  -0.0009   0.6131   3.2878
Coefficients:
   Estimate Std. Error t value Pr(>|t|)
x   1.00276    0.01834   54.66   <2e-16 ***
x2  0.26190    0.01802   14.54   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.02 on 3199 degrees of freedom
Multiple R-squared (full model): 0.8778   Adjusted R-squared: 0.8472
Multiple R-squared (proj model): 0.4988   Adjusted R-squared: 0.3735
F-statistic (full model): 28.72 on 800 and 3199 DF, p-value: < 2.2e-16
F-statistic (proj model): 1592 on 2 and 3199 DF, p-value: < 2.2e-16
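With id and firm projected out there is no separately identified global intercept; it is absorbed into the fixed effects. A sketch (assuming est from above) of recovering those effects with lfe's getfe():
# recover the absorbed id and firm effects; the overall level that would
# otherwise be reported as the intercept is folded into these estimates
fe <- getfe(est)
head(fe)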
