Changing the contrasts in regression models with R - r

I have a question about estimating a regression model in R. I have the following data (example):
Year XY
2002 5
2003 2
2004 4
2005 8
2006 3
2007 5
2008 10
The regression model I want to estimate is:
XY = B0 + Y2005 + Y2006 + Y2007 + Y2008 + e
where Y2005, Y2006, Y2007, and Y2008 are yearly indicator variables that take the value 1 in the corresponding year (2005, 2006, 2007, or 2008) and 0 otherwise.
I need to compare the value of XY in each of 2005, 2006, 2007, and 2008 to the mean value of XY over 2002-2004.
I hope you can help me to figure out this issue and thank you in advance for your help.

DF <- read.table(text = "Year XY
2002 5
2003 2
2004 4
2005 8
2006 3
2007 5
2008 10", header = TRUE)
DF$facYear <- DF$Year
DF$facYear[DF$facYear < 2005] <- "baseline"
DF$facYear <- factor(DF$facYear)
#make sure that baseline is used as intercept:
DF$facYear <- relevel(DF$facYear, "baseline")
fit <- lm(XY ~ facYear, data = DF)
summary(fit)
#Coefficients:
#            Estimate Std. Error t value Pr(>|t|)
#(Intercept)   3.6667     0.8819   4.158   0.0533 .
#facYear2005   4.3333     1.7638   2.457   0.1333
#facYear2006  -0.6667     1.7638  -0.378   0.7418
#facYear2007   1.3333     1.7638   0.756   0.5286
#facYear2008   6.3333     1.7638   3.591   0.0696 .
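As a quick check on the fit above, the intercept should equal the mean of XY over 2002-2004, and each year coefficient should equal that year's XY minus the baseline mean. A minimal sketch that rebuilds the same data; the `ifelse()` recoding and the name `base_mean` are illustrative choices, not part of the original answer:

```r
DF <- data.frame(Year = 2002:2008, XY = c(5, 2, 4, 8, 3, 5, 10))

# Collapse 2002-2004 into one reference level, listed first so it becomes the baseline
DF$facYear <- factor(ifelse(DF$Year < 2005, "baseline", DF$Year),
                     levels = c("baseline", 2005:2008))
fit <- lm(XY ~ facYear, data = DF)

# Mean of XY over the baseline period
base_mean <- mean(DF$XY[DF$Year < 2005])

# The intercept reproduces the baseline mean; each coefficient is a difference from it
all.equal(unname(coef(fit)["(Intercept)"]), base_mean)
all.equal(unname(coef(fit)["facYear2005"]), 8 - base_mean)
```

Both comparisons should return TRUE, which is what makes the coefficients readable as "year versus 2002-2004 average".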

Related

Calculating the change in % of data by year

I am trying to calculate the percentage change by year in the following dataset. Does anyone know if this is possible?
I have the differences but am unsure how to turn them into percentages.
C diff(economy_df_by_year$gdp_per_capita)
df
year gdp
1998 8142.
1999 8248.
2000 8211.
2001 7926.
2002 8366.
2003 10122.
2004 11493.
2005 12443.
2006 13275.
2007 15284.
Assuming that gdp is the total value, you could do something like this:
library(tidyverse)
tribble(
~year, ~gdp,
1998, 8142,
1999, 8248,
2000, 8211,
2001, 7926,
2002, 8366,
2003, 10122,
2004, 11493,
2005, 12443,
2006, 13275,
2007, 15284
) -> df
df |>
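# note: this divides by the current year's gdp; use lag(gdp) as the denominator
# for the change relative to the previous year
mutate(pdiff = 100*(gdp - lag(gdp))/gdp)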
#> # A tibble: 10 × 3
#>     year   gdp  pdiff
#>    <dbl> <dbl>  <dbl>
#>  1  1998  8142 NA
#>  2  1999  8248  1.29
#>  3  2000  8211 -0.451
#>  4  2001  7926 -3.60
#>  5  2002  8366  5.26
#>  6  2003 10122 17.3
#>  7  2004 11493 11.9
#>  8  2005 12443  7.63
#>  9  2006 13275  6.27
#> 10  2007 15284 13.1
This relies on the tidyverse framework.
If gdp is already the difference, you will also need the total to get a percentage, if that is what you mean by change in percentage by year.
df$change <- NA
df$change[2:10] <- (df$gdp[2:10] - df$gdp[1:9]) / df$gdp[1:9]
This assigns the yearly GDP growth (as a fraction; multiply by 100 for percent) to each row except the first, where it remains NA.
df$diff <- c(0,diff(df$gdp))
df$percentDiff <- 100*(c(0,(diff(df$gdp)))/(df$gdp - df$diff))
This is another possibility.
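The answers above differ in the denominator: the tidyverse version divides by the current year's gdp, while the base R versions divide by the previous year's. A small base R sketch that puts both side by side so the difference is visible; the names `prev`, `growth_vs_prev`, and `growth_vs_current` are made up for illustration:

```r
gdp <- c(8142, 8248, 8211, 7926, 8366, 10122, 11493, 12443, 13275, 15284)

# Previous year's value, NA for the first year
prev <- c(NA, head(gdp, -1))

growth_vs_prev    <- 100 * (gdp - prev) / prev  # change relative to the previous year
growth_vs_current <- 100 * (gdp - prev) / gdp   # change relative to the current year

round(cbind(year = 1998:2007, growth_vs_prev, growth_vs_current), 2)
```

Growth rates are usually quoted relative to the previous year, so `growth_vs_prev` is probably the column most readers expect.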

Find average change in timeseries

I have an annual mean timeseries dataset for 15 years, and I am trying to find the average change/increase/decrease in this timeseries.
The timeseries I have is spatial (average values for each grid-cell/pixel, years repeat).
How can I do this in R via dplyr?
Sample data
year = c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008)
Tmean = c(24, 24.5, 25.8,25, 24.8, 25, 23.5, 23.8, 24.8, 25, 25.2, 25.8, 25.3, 25.6, 25.2, 25)
Code
library(tidyverse)
df = data.frame(year, Tmean)
change = df$year %>%
# Sort by year
arrange(year) %>%
mutate(Diff_change = Tmean - lag(Tmean), # Difference in Tmean between years
Rate_percent = (Diff_change / year)/Tmean * 100) # Percent change # **returns inf values**
Average_change = mean(change$Rate_percent, na.rm = TRUE)
To find the average: mean(). To find the differences or changes: diff()
So, to find the average change:
> avg_change <- mean(diff(Tmean))
> print(avg_change)
[1] 0.06666667
If you need that as a percentage, express each difference (this year minus last year) relative to last year's value, like so:
> pct_change <- Tmean[2:length(Tmean)] / Tmean[1:(length(Tmean)-1)] - 1
> avg_pct_change <- mean(pct_change) * 100
> print(avg_pct_change)
[1] 0.3101632
We can put those vectors into a data frame to use with dplyr (...if that's how you want to do it; this is straightforward with base R as well).
library(dplyr)
df <- data.frame(year, Tmean)
change <- df %>%
arrange(year) %>%
# Years repeat in the updated data, so the change is taken row to row;
# dividing by (year - lag(year)) would divide by 0 within a year
mutate(Diff_change = Tmean - lag(Tmean), # Difference in Tmean from the previous row
Rate_percent = Diff_change/lag(Tmean) * 100) # Percent change
Average_change = mean(change$Rate_percent, na.rm = TRUE)
Results (with updated question data)
> change
   year Tmean Diff_change Rate_percent
1  2005  24.0          NA           NA
2  2005  24.5         0.5    2.0833333
3  2005  25.8         1.3    5.3061224
4  2005  25.0        -0.8   -3.1007752
5  2006  24.8        -0.2   -0.8000000
6  2006  25.0         0.2    0.8064516
7  2006  23.5        -1.5   -6.0000000
8  2006  23.8         0.3    1.2765957
9  2007  24.8         1.0    4.2016807
10 2007  25.0         0.2    0.8064516
11 2007  25.2         0.2    0.8000000
12 2007  25.8         0.6    2.3809524
13 2008  25.3        -0.5   -1.9379845
14 2008  25.6         0.3    1.1857708
15 2008  25.2        -0.4   -1.5625000
16 2008  25.0        -0.2   -0.7936508
> Average_change
[1] 0.3101632
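Because the years repeat (one row per grid cell), the row-to-row differences above mix changes between cells with changes between years. If the goal is the average change from one year to the next, one option is to average within each year first and then difference the yearly means. This is a sketch in base R rather than dplyr, and it assumes that a plain mean over cells is the right yearly summary:

```r
year  <- c(2005, 2005, 2005, 2005, 2006, 2006, 2006, 2006,
           2007, 2007, 2007, 2007, 2008, 2008, 2008, 2008)
Tmean <- c(24, 24.5, 25.8, 25, 24.8, 25, 23.5, 23.8,
           24.8, 25, 25.2, 25.8, 25.3, 25.6, 25.2, 25)

# One value per year: the mean over all grid cells in that year
yearly <- aggregate(Tmean ~ year, data = data.frame(year, Tmean), FUN = mean)

# Year-over-year change in the yearly means, absolute and in percent
abs_change <- diff(yearly$Tmean)
pct_change <- 100 * abs_change / head(yearly$Tmean, -1)

mean(abs_change)  # average change per year
mean(pct_change)  # average percent change per year
```

The same grouping could be done with `group_by(year)` and `summarise()` in dplyr before applying `lag()`.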

create new variable based on year for time series [duplicate]

gvkey = id
data = dataframe
data$t <- NA;
data[data$year = 2005, "t"] <- 1
data[data$year = 2006, "t"] <- 2
data[data$year = 2007, "t"] <- 3
data[data$year = 2008, "t"] <- 4
data[data$year = 2009, "t"] <- 5
data[data$year = 2010, "t"] <- 6
I want to create variable "t":
gvkey year t
1004 2005 1
1004 2006 2
1004 2007 3
1004 2008 4
1004 2009 5
1004 2010 6
1013 2005 1
1013 2006 2
1013 2007 3
1013 2008 4
1013 2009 5
1013 2010 6
.....
Somehow my code does not work. Do you have any idea why?
Is there a more efficient way to run this code?
I am new to R and would really appreciate your help.
Maybe you can try
data$t <- data$year - min(data$year) + 1
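As for why the original attempt fails: `=` is not the comparison operator in R, so `data$year = 2005` inside the brackets is not a test for equality; `==` is needed. A short sketch on a made-up sample in the shape shown in the question (the sample data and the `t_rank` column are illustrative only):

```r
# Small sample in the shape shown in the question (two firms, 2005-2010)
data <- data.frame(gvkey = rep(c(1004, 1013), each = 6),
                   year  = rep(2005:2010, times = 2))

# Row-by-row assignment works once the comparison uses `==`
data$t <- NA
data[data$year == 2005, "t"] <- 1

# Shorter: offset from the earliest year (assumes consecutive years)
data$t <- data$year - min(data$year) + 1

# Alternative that also handles gaps: position of each year among the distinct years
data$t_rank <- match(data$year, sort(unique(data$year)))

data
```

With consecutive years the two columns agree; they differ only if some years are missing from the panel.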

Only out-of-sample forecast plot using auto.arima and xreg

This is my first post, so sorry if this is clunky or not formatted well.
period texas_u3 national_u3
1976 5.758333333 7.716666667
1977 5.333333333 7.066666667
1978 4.825 6.066666667
1979 4.308333333 5.833333333
1980 5.141666667 7.141666667
1981 5.291666667 7.6
1982 6.875 9.708333333
1983 7.916666667 9.616666667
1984 6.125 7.525
1985 7.033333333 7.191666667
1986 8.75 6.991666667
1987 8.441666667 6.191666667
1988 7.358333333 5.491666667
1989 6.658333333 5.266666667
1990 6.333333333 5.616666667
1991 6.908333333 6.816666667
1992 7.633333333 7.508333333
1993 7.158333333 6.9
1994 6.491666667 6.083333333
1995 6.066666667 5.608333333
1996 5.708333333 5.416666667
1997 5.308333333 4.95
1998 4.883333333 4.508333333
1999 4.666666667 4.216666667
2000 4.291666667 3.991666667
2001 4.941666667 4.733333333
2002 6.341666667 5.775
2003 6.683333333 5.991666667
2004 5.941666667 5.533333333
2005 5.408333333 5.066666667
2006 4.891666667 4.616666667
2007 4.291666667 4.616666667
2008 4.808333333 5.775
2009 7.558333333 9.266666667
2010 8.15 9.616666667
2011 7.758333333 8.95
2012 6.725 8.066666667
2013 6.283333333 7.375
2014 5.1 6.166666667
2015 4.45 5.291666667
2016 4.633333333 4.866666667
2017 4.258333333 4.35
2018 3.858333333 3.9
2019 ____ 3.5114
2020 ____ 3.477
2021 ____ 3.7921
2022 ____ 4.0433
2023 ____ 4.1339
2024 ____ 4.2269
2025 ____ 4.2738
How can one use auto.arima in R with an external regressor to make a forecast but plot only the out-of-sample values? I believe the forecast values are correct, but the years do not match up. I have annual data for 1976-2018 and want to forecast the dependent variable (column 2) through 2025, yet the plot shows a "forecast" for 2019-2068. Weirdly enough, the figures match up well with the sample data (the "forecast" for 2019 seems to be the model prediction for 1980 and so on, all the way through 2068 matching 2025).
I would like to be able to eliminate that and have it so "2062-2068" results are instead 2019-2025. I'll try and include a picture of the plot so it might be easier to visualize my plight.
Below is the R script:
#Download the CSV file; the dependent variable is in the second column, xreg in the third, and years in the first. All columns have headers.
library(forecast)
library(DataCombine)
library(tseries)
library(MASS)
library(TSA)
ts(TXB102[,2], frequency = 1, start = c(1976, 1),end = c(2018, 1)) -> TXB102ts
ts(TXB102[,3], frequency = 1, start = c(1976, 1), end = c(2018,1)) -> TXB102xregtest
ts(TXB102[,3], frequency = 1, start = c(1976, 1), end = c(2025,1)) -> TXB102xreg
as.vector(t(TXB102ts)) -> y
as.vector(t(TXB102xregtest)) -> xregtest
as.vector(t(TXB102xreg)) -> xreg
y <- ts(y,frequency = 1, start = c(1976,1),end = c(2018,1))
xregtest <- ts(xregtest, frequency = 1, start = c(1976,1), end=c(2018,1))
xreg <- ts(xreg, frequency = 1, start = c(1976,1), end=c(2025,1))
summary(y)
plot(y)
ndiffs(y)
ARIMA <- auto.arima(y, trace = TRUE, stepwise = FALSE, approximation = FALSE, xreg=xregtest)
ARIMA
forecast(ARIMA,xreg=xreg)
plot(forecast(ARIMA,xreg=xreg))
The following is a plot of what I get after running the script.
TL;DR: How do I get the real out-of-sample forecast to plot for 2019-2025, as opposed to the in-sample model fit it is passing along as 2019-2068?
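A likely cause, offered as a reading of the script rather than a confirmed diagnosis: the regressor passed at forecast time covers 1976-2025 (50 rows), and the forecast horizon follows the number of regressor rows, which would give 50 steps (2019-2068). Passing only the future rows (2019-2025) should give 7 steps. The sketch below illustrates the idea with base R's `arima()` and `predict()` on simulated data instead of `auto.arima()`; the series, the fixed AR(1) order, and all variable names are assumptions for illustration.

```r
set.seed(1)

# Simulated stand-ins: regressor known for 1976-2025, response observed for 1976-2018
x_all <- ts(5 + cumsum(rnorm(50, sd = 0.5)), start = 1976, frequency = 1)
x_in  <- window(x_all, end = 2018)    # regressor rows matching the response
x_out <- window(x_all, start = 2019)  # future regressor rows only (2019-2025)
y     <- ts(1 + 0.8 * as.numeric(x_in) + rnorm(43, sd = 0.3),
            start = 1976, frequency = 1)

# Regression with AR(1) errors (order fixed by hand here, not selected automatically)
fit <- arima(y, order = c(1, 0, 0), xreg = x_in)

# The horizon equals the number of future regressor rows
fc <- predict(fit, n.ahead = length(x_out), newxreg = x_out)

time(fc$pred)  # years covered by the forecast
```

Calling `plot(fc$pred)` would then show only the out-of-sample years. With the forecast package, the analogous change would presumably be to pass only the 2019-2025 rows, for example `forecast(ARIMA, xreg = window(xreg, start = 2019))`.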

Operation conditional on time index for longitudinal data in r

I have some data organized in the longitudinal format, i.e.
id <- c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4)
time=rep(c(1990, 1995, 2000,2005), 4)
w = runif(16, min=0, max=1)
u = runif(16, min=0, max=0.5)
dat <- cbind(id,time,w,u)
dat
id time w u
1 1990 0.6550168 0.2114829
1 1995 0.9669285 0.2253474
1 2000 0.8879138 0.2733263
1 2005 0.1079913 0.4452164
2 1990 0.1483843 0.1949214
2 1995 0.7599596 0.1632965
2 2000 0.7119100 0.3600129
2 2005 0.4164409 0.2456366
3 1990 0.7881798 0.3233312
3 1995 0.8627986 0.1180433
3 2000 0.3253139 0.3491878
3 2005 0.2560138 0.3193816
4 1990 0.2062351 0.3485047
4 1995 0.4145230 0.1413814
4 2000 0.3053510 0.1782681
4 2005 0.7419894 0.3738163
I need to compute B as follows [formula not preserved],
where t and s refer to time. I tried a for loop using two indexes i and j, but I got no output. Then I tried differently, such as
B.small = list()
for (r in c(1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010)){
B.small = (set$w[set$year ==r, ]%*% t(set$w)%*%set$u[set$year ==r, ]*set$u)/n
}
B = sum(B.Small)/n
Error in set$u[set$year == r, ] : incorrect number of dimensions
B must be a scalar. I guess there must be an alternative that also avoids a loop.
Maybe I didn't understand your formula enough and for sure there could be a better way to do it, but I'd try:
#split the matrix by year
splitDT<-split(as.data.frame(dat[,3:4]),dat[,2])
#build any combinations of indices of year
indices<-expand.grid(1:length(splitDT),1:length(splitDT))
#evaluate the mean of each combination and put in a matrix
res<-matrix(mapply(function(x,y) mean(splitDT[[x]]$u*splitDT[[y]]$u*splitDT[[x]]$w*splitDT[[y]]$w),
indices[,1],indices[,2]),
ncol=length(splitDT))
#get the result
sum(res)/ncol(res)
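The same quantity can be computed without `mapply()` by arranging the product w*u as a time-by-id matrix and using a cross-product. This sketch reproduces what the code above computes, which is not necessarily the formula in the question; it assumes the rows are sorted by id and then by time, as in the example data, and the names `M`, `n_id`, and `n_time` are illustrative:

```r
set.seed(42)
id   <- rep(1:4, each = 4)
time <- rep(c(1990, 1995, 2000, 2005), 4)
w    <- runif(16, min = 0, max = 1)
u    <- runif(16, min = 0, max = 0.5)

n_id   <- length(unique(id))
n_time <- length(unique(time))

# Rows = time periods, columns = ids (relies on the data being sorted by id, then time)
M <- matrix(w * u, nrow = n_time, ncol = n_id)

# Entry (t, s) of tcrossprod(M) / n_id is the mean over ids of w_t * u_t * w_s * u_s
res <- tcrossprod(M) / n_id

B <- sum(res) / n_time
B
```

Since `res` is a single matrix product, this avoids both the explicit loop and the list of split data frames.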
