Loop instead of iterative calculations in R

In the example below, I have quarterly data for variable X from 2021Q1 to 2022Q3, and forecasted growth rates of X (growth_x) from 2022Q4 to 2025Q4. I want to use growth_x to calculate X from 2022Q4 to 2025Q4 iteratively. I am currently calculating it manually, as shown below, and the result is still missing 2025Q4. Is it possible to write a function to do this? I am fairly new to writing loops. Any help will be greatly appreciated. Thank you in advance.
library(readxl)
library(dplyr)
library(lubridate)
# Quarterly Data
data <- data.frame(c("2021Q1","2021Q2","2021Q3","2021Q4",
"2022Q1","2022Q2","2022Q3","2022Q4",
"2023Q1","2023Q2","2023Q3","2023Q4",
"2024Q1","2024Q2","2024Q3","2024Q4",
"2025Q1","2025Q2","2025Q3","2025Q4"),
# Variable X - Actuals up to 2022Q3
c(804,511,479,462,
427,330,440,NA,
NA,NA,NA,NA,
NA,NA,NA,NA,
NA,NA,NA,NA),
# Forecasted Growth rates of X from 2022Q4
c(NA,NA,NA,NA,
NA,NA,NA,0.24,
0.49,0.65,0.25,0.71,
0.63,0.33,0.53,0.83,
0.87,0.19,0.99,0.16))
# Renaming the columns
data<-data%>%rename(yrqtr=1,x=2,growth_x=3)
# Creating Date Variable
data<-data%>%mutate(year=substr(yrqtr,1,4),
qtr=substr(yrqtr,5,6),
mon=ifelse(qtr=="Q1",3,
ifelse(qtr=="Q2",6,
ifelse(qtr=="Q3",9,12))),
date=make_date(year,mon,1))
# Computing Growth Rate from 2022Q3 to 2023Q3
Growth_2023_3<-data%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2022-09-01",forecast_x,x))%>%select(-forecast_x)
# Computing Growth Rate from 2023Q3 to 2024Q3
Growth_2024_3<-Growth_2023_3%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2023-09-01",forecast_x,x))%>%select(-forecast_x)
# Computing Growth Rate from 2024Q3 to 2025Q3
Growth_2025_3<-Growth_2024_3%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2024-09-01",forecast_x,x))%>%select(-forecast_x)

Does this do what you want?
n_years <- length(unique(data$year))
for (i in unique(data$year)[2:n_years]) {
  # For every quarter after Q3 of year i, recompute x from the value four quarters earlier
  data <- data %>%
    mutate(forecast_x = (1 + growth_x) * lag(x, 4),
           x = ifelse(date > as.Date(paste0(i, "-09-01")), forecast_x, x))
}
As an aside, column names can be assigned at the time the data frame is created. For example:
# Quarterly Data
data <- data.frame(yrqtr = c("2021Q1","2021Q2"),
x = c(804,511),
growth_x = c(0.24,0.49))
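A row-by-row loop in base R is another way to express the same recursion. The following is only a minimal sketch: the helper name fill_forecast and its lag argument are hypothetical, and it assumes the rows are already in chronological order with NA marking the quarters to fill.

```r
# Actuals up to 2022Q3, then NA for the quarters to be forecast
x <- c(804, 511, 479, 462, 427, 330, 440, rep(NA, 13))
growth_x <- c(rep(NA, 7), 0.24, 0.49, 0.65, 0.25, 0.71, 0.63,
              0.33, 0.53, 0.83, 0.87, 0.19, 0.99, 0.16)

# Hypothetical helper: fill each missing value from the value `lag` rows earlier
fill_forecast <- function(x, growth, lag = 4) {
  for (i in which(is.na(x))) {          # indices are visited in ascending order
    if (i > lag) {
      x[i] <- (1 + growth[i]) * x[i - lag]
    }
  }
  x
}

x_filled <- fill_forecast(x, growth_x)
x_filled
```

Because each missing quarter is filled before any later quarter that depends on it, a single pass is enough and no per-year step is needed.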

If you want to avoid using a loop, you can use purrr::reduce().
library(tidyverse)
library(lubridate)
sol <- reduce(
.x = unique(data$year), # iterate over years
.init = data,
\(lhs, rhs) lhs %>%
mutate(x = ifelse(year == rhs & is.na(x), (1+growth_x)*lag(x,4), x))
)
sol
#> yrqtr x growth_x year qtr mon date
#> 1 2021Q1 804.0000 NA 2021 Q1 3 2021-03-01
#> 2 2021Q2 511.0000 NA 2021 Q2 6 2021-06-01
#> 3 2021Q3 479.0000 NA 2021 Q3 9 2021-09-01
#> 4 2021Q4 462.0000 NA 2021 Q4 12 2021-12-01
#> 5 2022Q1 427.0000 NA 2022 Q1 3 2022-03-01
#> 6 2022Q2 330.0000 NA 2022 Q2 6 2022-06-01
#> 7 2022Q3 440.0000 NA 2022 Q3 9 2022-09-01
#> 8 2022Q4 572.8800 0.24 2022 Q4 12 2022-12-01
#> 9 2023Q1 636.2300 0.49 2023 Q1 3 2023-03-01
#> 10 2023Q2 544.5000 0.65 2023 Q2 6 2023-06-01
#> 11 2023Q3 550.0000 0.25 2023 Q3 9 2023-09-01
#> 12 2023Q4 979.6248 0.71 2023 Q4 12 2023-12-01
#> 13 2024Q1 1037.0549 0.63 2024 Q1 3 2024-03-01
#> 14 2024Q2 724.1850 0.33 2024 Q2 6 2024-06-01
#> 15 2024Q3 841.5000 0.53 2024 Q3 9 2024-09-01
#> 16 2024Q4 1792.7134 0.83 2024 Q4 12 2024-12-01
#> 17 2025Q1 1939.2927 0.87 2025 Q1 3 2025-03-01
#> 18 2025Q2 861.7802 0.19 2025 Q2 6 2025-06-01
#> 19 2025Q3 1674.5850 0.99 2025 Q3 9 2025-09-01
#> 20 2025Q4 2079.5475 0.16 2025 Q4 12 2025-12-01

Related

How best to parse fields in R?

Below is the sample data, in the form it comes from the Current Population Survey. There are 115 columns in the original; below is just a subset. At the moment, I simply append a new row each month and leave it as is. However, there has been a new request that it be made longer and parsed a bit.
For some context, the first character is the race: x = all, b = black, w = white, and h = Hispanic. The second character is the gender: a = all, m = male, and f = female. The third part, which does not appear in all columns, is the age: 2024 for ages 20-24, 3039 for ages 30-39, and so on. Each column name ends in one of laborforce, unemp, or unemprate.
stfips <- c(32,32,32,32,32,32,32,32)
areatype <- c(01,01,01,01,01,01,01,01)
periodyear <- c(2021,2021,2021,2021,2021,2021,2021,2021)
period <- c(01,02,03,04,05,06,07,08)
xalaborforce <- c(1210.9,1215.3,1200.6,1201.6,1202.8,1209.3,1199.2,1198.9)
xaunemp <- c(55.7,55.2,65.2,321.2,77.8,88.5,92.4,102.6)
xaunemprate <- c(2.3,2.5,2.7,2.9,3.2,6.5,6.0,12.5)
walaborforce <- c(1000.0,999.2,1000.5,1001.5,998.7,994.5,999.2,1002.8)
waunemp <- c(50.2,49.5,51.6,251.2,59.9,80.9,89.8,77.8)
waunemprate <- c(3.4,3.6,3.8,4.0,4.2,4.5,4.1,2.6)
balaborforce <- c (5.5,5.7,5.2,6.8,9.2,2.5,3.5,4.5)
ba2024laborforce <- c(1.2,1.4,1.2,1.3,1.6,1.7,1.4,1.5)
ba2024unemp <- c(.2,.3,.2,.3,.4,.5,.02,.19)
ba2024unemprate <- c(2.1,2.2,3.2,3.2,3.3,3.4,1.2,2.5)
test2 <- data.frame (stfips,areatype,periodyear, period, xalaborforce,xaunemp,xaunemprate,walaborforce, waunemp,waunemprate,balaborforce,ba2024laborforce,ba2024unemp,ba2024unemprate)
Desired result
stfips areatype periodyear period race gender age  laborforce unemp unemprate
32     01       2021       01     x    a      all  1210.9     55.7  2.3
32     01       2021       02     x    a      all  1215.3     55.2  2.5
..... (the other six rows for race = x and gender = a)
32     01       2021       01     w    a      all  1000.0     50.2  3.4
32     01       2021       02     w    a      all  999.2      49.5  3.6
..... (the other six rows for race = w and gender = a)
32     01       2021       01     b    a      2024 1.2        .2    2.1
Edit -- added handling for columns with age prefix. Mostly there, but would be nice to have a concise way to add the - to make 2024 into 20-24....
test2 %>%
pivot_longer(xalaborforce:ba2024laborforce) %>%
separate(name, c("race", "gender", "stat"), sep = c(1,2)) %>%
mutate(age = coalesce(parse_number(stat) %>% as.character, "all"),
stat = str_remove_all(stat, "[0-9]")) %>%
pivot_wider(names_from = stat, values_from = value)
# A tibble: 32 × 10
stfips areatype periodyear period race gender age laborforce unemp unemprate
<dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 32 1 2021 1 x a all 1211. 55.7 2.3
2 32 1 2021 1 w a all 1000 50.2 3.4
3 32 1 2021 1 b a all 5.5 NA NA
4 32 1 2021 1 b a 2024 1.2 NA NA
5 32 1 2021 2 x a all 1215. 55.2 2.5
6 32 1 2021 2 w a all 999. 49.5 3.6
7 32 1 2021 2 b a all 5.7 NA NA
8 32 1 2021 2 b a 2024 1.4 NA NA
9 32 1 2021 3 x a all 1201. 65.2 2.7
10 32 1 2021 3 w a all 1000. 51.6 3.8
# … with 22 more rows
# ℹ Use `print(n = ...)` to see more rows
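One concise way to turn 2024 into 20-24 is a regular-expression substitution on the age column. This is a small sketch in base R; the vector name age_label is hypothetical, and it assumes every age code is exactly four digits, with "all" left untouched.

```r
# Example age codes as they would appear after the mutate() step above
age <- c("all", "2024", "3039", "all")

# Insert a hyphen between the first two and last two digits;
# values that are not exactly four digits (such as "all") pass through unchanged
age_label <- sub("^([0-9]{2})([0-9]{2})$", "\\1-\\2", age)
age_label
```

The same sub() call could be placed inside the mutate() step, applied to the age column after coalesce().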

Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that records the number of flu cases seen each week over several years. The data is organized as such:
What I want to do is create a new column that holds the average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new column to give the average count over all rows with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, with four years of data times 52 weeks, that is a lot of conditions to spell out. Is there a way to code this elegantly in dplyr? The problem I keep running into is that I want to use values in the count column from other rows, selected by their Week.Number and Flu.Year relative to the current row's Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information or detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
Using $-indexing within dplyr verbs is bad form and, in some cases, an error.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
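To see why $-indexing inside a dplyr verb is risky, the sketch below compares the two on a tiny made-up tibble (the names toy, g, v, ok and wrong are all hypothetical). Inside a grouped mutate(), the bare column name refers to the current group's values, whereas toy$v always refers to the whole column.

```r
library(dplyr)

# Tiny illustrative data: group "a" has mean 2, group "b" has mean 10
toy <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))

out <- toy %>%
  group_by(g) %>%
  mutate(
    ok    = mean(v),      # per-group mean: uses only the rows of the current group
    wrong = mean(toy$v)   # ignores the grouping: mean of the entire column
  ) %>%
  ungroup()

out
```

Here ok differs by group while wrong is the same overall mean on every row, which is usually not what was intended.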
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
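The same "average of earlier years for this week" can be sketched in base R with ave(). This is only an illustration on the small example above; the helper prior_mean and the column prior_avg are hypothetical names, and it assumes one row per year and week.

```r
dat <- data.frame(
  Flu.Year    = rep(2016:2018, each = 3),
  Week.Number = rep(1:3, times = 3),
  count       = 1:9
)

# Within each week the rows must be in year order for the running mean to make sense
dat <- dat[order(dat$Week.Number, dat$Flu.Year), ]

# Running mean shifted down by one: each year sees only the years before it
prior_mean <- function(v) c(NA, head(cumsum(v) / seq_along(v), -1))

dat$prior_avg <- ave(as.numeric(dat$count), dat$Week.Number, FUN = prior_mean)

# Restore the original year/week ordering for display
dat <- dat[order(dat$Flu.Year, dat$Week.Number), ]
dat
```

The first year of each week has no earlier years, so its value is NA, matching the lag() in the dplyr version.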
We can use aggregate from base R:
aggregate(count ~ Flu.Year, dat, FUN = mean)

Forecast time series with multiple predictors return error

I have 3 quarterly time-series data: beer, temp, income, and all those data start from 2010 Q1 and end at 2018 Q3.
Here is my data:
> beer
Qtr1 Qtr2 Qtr3 Qtr4
2010 3.301 2.826 2.712 3.934
2011 3.192 2.975 2.865 3.789
2012 2.728 2.840 2.633 3.837
2013 3.090 2.779 2.594 3.960
2014 2.771 2.860 2.676 3.831
2015 2.986 2.558 2.810 3.743
2016 3.054 2.764 2.985 3.807
2017 3.046 2.880 2.689 4.005
2018 3.013 2.800 2.937
> temp
Qtr1 Qtr2 Qtr3 Qtr4
2010 16.766667 11.433333 9.400000 14.533333
2011 17.033333 11.966667 8.633333 13.900000
2012 15.800000 10.600000 9.700000 13.766667
2013 17.033333 11.333333 10.200000 14.866667
2014 16.266667 11.900000 9.266667 13.900000
2015 17.300000 11.400000 8.733333 13.966667
2016 18.033333 12.400000 9.300000 14.100000
2017 16.533333 11.100000 9.733333 15.300000
2018 18.400000 11.033333 9.700000
> income
Qtr1 Qtr2 Qtr3 Qtr4
2010 48.064 47.755 47.878 47.707
2011 48.226 49.063 49.322 49.518
2012 49.714 49.390 49.683 50.386
2013 50.405 51.476 52.527 53.456
2014 54.309 54.308 54.811 54.723
2015 55.254 55.913 56.472 56.316
2016 58.013 58.312 58.744 59.806
2017 59.881 60.683 61.164 61.887
2018 61.969 62.507 63.054
I tried to forecast 2 years of beer values using trend and seasonal dummy predictors, but R always gives me a dimension error.
> forecast(tslm(beer~temp+income+trend+season), h = 8)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
variable lengths differ (found for 'trend')
In addition: Warning message:
'newdata' had 8 rows but variables found have 35 rows
Using a data.frame runs, but it always produces warning messages:
> df = data.frame(beer,temp,income)
> forecast(tslm(beer~temp+income+trend+season, data = df), h = 8, newdata = df)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2018 Q4 2.991699 2.3132374 3.670161 1.9328516 4.050546
2019 Q1 1.752979 1.0424701 2.463488 0.6441168 2.861841
2019 Q2 1.667426 0.9738984 2.360954 0.5850656 2.749787
2019 Q3 1.875770 1.0662253 2.685315 0.6123465 3.139194
2019 Q4 2.884266 2.1308413 3.637691 1.7084267 4.060105
2020 Q1 1.729527 1.0011085 2.457945 0.5927141 2.866339
2020 Q2 1.599902 0.8838936 2.315910 0.4824569 2.717347
2020 Q3 1.837085 1.0376823 2.636488 0.5894896 3.084681
2020 Q4 2.800613 2.0470872 3.554139 1.6246159 3.976610
2021 Q1 1.566346 0.7583452 2.374347 0.3053320 2.827360
2021 Q2 1.537637 0.7593199 2.315954 0.3229493 2.752324
2021 Q3 1.758491 0.9202322 2.596749 0.4502548 3.066726
2021 Q4 2.766178 1.9445748 3.587782 1.4839351 4.048421
2022 Q1 1.600676 0.8060401 2.395313 0.3605199 2.840833
2022 Q2 1.610888 0.8665356 2.355241 0.4492074 2.772569
2022 Q3 1.870518 1.0513857 2.689650 0.5921317 3.148904
2022 Q4 2.855698 2.1234601 3.587935 1.7129243 3.998471
2023 Q1 1.675867 0.9187581 2.432976 0.4942778 2.857457
2023 Q2 1.590225 0.8580061 2.322445 0.4474806 2.732970
2023 Q3 1.783578 0.9603794 2.606776 0.4988456 3.068310
2023 Q4 2.829362 2.0411286 3.617595 1.5991983 4.059525
2024 Q1 1.629442 0.8509889 2.407896 0.4145418 2.844343
2024 Q2 1.546023 0.7994307 2.292615 0.3808469 2.711199
2024 Q3 1.759382 0.9209619 2.597803 0.4508937 3.067871
2024 Q4 2.906656 2.1369607 3.676351 1.7054240 4.107887
2025 Q1 1.694576 0.9426298 2.446521 0.5210444 2.868107
2025 Q2 1.585464 0.8512783 2.319649 0.4396504 2.731277
2025 Q3 1.858994 1.0774412 2.640548 0.6392561 3.078733
2025 Q4 2.836440 2.0876545 3.585226 1.6678407 4.005040
2026 Q1 1.664587 0.9073179 2.421857 0.4827478 2.846427
2026 Q2 1.628942 0.9118032 2.346081 0.5097325 2.748152
2026 Q3 1.911943 1.1070396 2.716846 0.6557631 3.168123
2026 Q4 2.916889 2.1414033 3.692375 1.7066199 4.127158
2027 Q1 1.649728 0.8839868 2.415469 0.4546670 2.844789
2027 Q2 1.619649 0.8980352 2.341262 0.4934558 2.745842
Warning messages:
1: In forecast.lm(tslm(beer ~ temp + income + trend + season, data = df), :
Could not find required variable temp in newdata. Specify newdata as a named data.frame
2: In forecast.lm(tslm(beer ~ temp + income + trend + season, data = df), :
Could not find required variable income in newdata. Specify newdata as a named data.frame
I tried renaming the columns in the data frame. This time it runs without warnings, but the plot doesn't look right:
> names(df)[2] = "temp"
> names(df)[3] = "income"
> autoplot(forecast(tslm(beer~temp+income+trend+season, data = df), h = 8, newdata = df))
But when I exclude the predictor temp and income, it works well
> forecast(tslm(beer~trend+season), h = 8)
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2018 Q4 3.854655 3.654084 4.055226 3.542067 4.167244
2019 Q1 3.010562 2.809372 3.211751 2.697010 3.324113
2019 Q2 2.799562 2.598372 3.000751 2.486010 3.113113
2019 Q3 2.757228 2.556039 2.958417 2.443676 3.070780
2019 Q4 3.852745 3.648494 4.056997 3.534421 4.171070
2020 Q1 3.008652 2.803430 3.213874 2.688815 3.328489
2020 Q2 2.797652 2.592430 3.002874 2.477815 3.117489
2020 Q3 2.755318 2.550096 2.960540 2.435481 3.075155
I want to forecast 2 years of beer values with temp, income, trend, and seasonal dummies as predictors. I have tried everything I know.
Please help.
Thanks in advance.
There are a couple of problems here. The first is that you are providing historical temp and income data in the newdata argument, when they should be future values for these variables. The second issue is that the forecast package is not particularly good at finding the relevant variables in newdata and is getting confused here. Workarounds are possible, but I suggest you use the newer fable package instead of forecast which makes this sort of thing much easier.
library(tidyverse)
library(lubridate)
library(tsibble)
library(fable)
df <- tsibble(
quarter = seq(yearquarter("2010 Q1"), to=yearquarter("2018 Q3"), by = 1),
beer = c(
3.301, 2.826, 2.712, 3.934, 3.192, 2.975, 2.865, 3.789,
2.728, 2.840, 2.633, 3.837, 3.090, 2.779, 2.594, 3.960,
2.771, 2.860, 2.676, 3.831, 2.986, 2.558, 2.810, 3.743,
3.054, 2.764, 2.985, 3.807, 3.046, 2.880, 2.689, 4.005,
3.013, 2.800, 2.937
),
temp = c(
16.766667, 11.433333, 9.400000, 14.533333, 17.033333, 11.966667, 8.633333, 13.900000,
15.800000, 10.600000, 9.700000, 13.766667, 17.033333, 11.333333, 10.200000, 14.866667,
16.266667, 11.900000, 9.266667, 13.900000, 17.300000, 11.400000, 8.733333, 13.966667,
18.033333, 12.400000, 9.300000, 14.100000, 16.533333, 11.100000, 9.733333, 15.300000,
18.400000, 11.033333, 9.700000
),
income = c(
48.064, 47.755, 47.878, 47.707, 48.226, 49.063, 49.322, 49.518,
49.714, 49.390, 49.683, 50.386, 50.405, 51.476, 52.527, 53.456,
54.309, 54.308, 54.811, 54.723, 55.254, 55.913, 56.472, 56.316,
58.013, 58.312, 58.744, 59.806, 59.881, 60.683, 61.164, 61.887,
61.969, 62.507, 63.054
),
index = quarter
)
df
#> # A tsibble: 35 x 4 [1Q]
#> quarter beer temp income
#> <qtr> <dbl> <dbl> <dbl>
#> 1 2010 Q1 3.30 16.8 48.1
#> 2 2010 Q2 2.83 11.4 47.8
#> 3 2010 Q3 2.71 9.4 47.9
#> 4 2010 Q4 3.93 14.5 47.7
#> 5 2011 Q1 3.19 17.0 48.2
#> 6 2011 Q2 2.98 12.0 49.1
#> 7 2011 Q3 2.86 8.63 49.3
#> 8 2011 Q4 3.79 13.9 49.5
#> 9 2012 Q1 2.73 15.8 49.7
#> 10 2012 Q2 2.84 10.6 49.4
#> # … with 25 more rows
train <- df %>% filter(year(quarter) <= 2016)
test <- df %>% filter(year(quarter) > 2016)
fc <- train %>%
model(TSLM(beer ~ temp + income + trend() + season())) %>%
forecast(new_data = test)
Created on 2020-04-29 by the reprex package (v0.3.0)
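The key point, that the new data must hold future values of temp and income with one row per forecast period, can also be sketched with plain lm() and predict(). This is not the forecast or fable API; the names hist and future are hypothetical, and the future regressor values are placeholder assumptions only (temp repeats the last observed value for the same quarter, income is held at its last value).

```r
beer <- c(3.301, 2.826, 2.712, 3.934, 3.192, 2.975, 2.865, 3.789,
          2.728, 2.840, 2.633, 3.837, 3.090, 2.779, 2.594, 3.960,
          2.771, 2.860, 2.676, 3.831, 2.986, 2.558, 2.810, 3.743,
          3.054, 2.764, 2.985, 3.807, 3.046, 2.880, 2.689, 4.005,
          3.013, 2.800, 2.937)
temp <- c(16.766667, 11.433333, 9.400000, 14.533333, 17.033333, 11.966667,
          8.633333, 13.900000, 15.800000, 10.600000, 9.700000, 13.766667,
          17.033333, 11.333333, 10.200000, 14.866667, 16.266667, 11.900000,
          9.266667, 13.900000, 17.300000, 11.400000, 8.733333, 13.966667,
          18.033333, 12.400000, 9.300000, 14.100000, 16.533333, 11.100000,
          9.733333, 15.300000, 18.400000, 11.033333, 9.700000)
income <- c(48.064, 47.755, 47.878, 47.707, 48.226, 49.063, 49.322, 49.518,
            49.714, 49.390, 49.683, 50.386, 50.405, 51.476, 52.527, 53.456,
            54.309, 54.308, 54.811, 54.723, 55.254, 55.913, 56.472, 56.316,
            58.013, 58.312, 58.744, 59.806, 59.881, 60.683, 61.164, 61.887,
            61.969, 62.507, 63.054)

n <- length(beer)   # 35 quarters, 2010 Q1 to 2018 Q3
h <- 8              # forecast horizon: 2018 Q4 to 2020 Q3

# Historical data with explicit trend and quarter-of-year dummies
hist <- data.frame(
  beer = beer, temp = temp, income = income,
  trend  = seq_len(n),
  season = factor((seq_len(n) - 1) %% 4 + 1, levels = 1:4)
)
fit <- lm(beer ~ temp + income + trend + season, data = hist)

# Future rows: h rows, one per forecast quarter, with FUTURE regressor values
idx <- n + seq_len(h)
future <- data.frame(
  temp   = temp[n - 4 + (seq_len(h) - 1) %% 4 + 1],  # assumption: same quarter, last observed year
  income = rep(income[n], h),                        # assumption: income stays at its last value
  trend  = idx,
  season = factor((idx - 1) %% 4 + 1, levels = 1:4)
)

pred <- predict(fit, newdata = future, interval = "prediction", level = 0.80)
pred
```

Passing the 35 historical rows as new data instead of 8 future rows is what produces forecasts that run far past the intended two years.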

How to subtract each Country's value by year

I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
Don't have a clear path to accomplish what I want but it would look like
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
Assuming the three datasets are present in your environment with the names data2015, data2016 and data2017, we can add a Year column with the respective year and keep the columns listed in the keepcols vector. Then arrange the data by Country and Year, group_by Country, keep only those countries which are present in all 3 years, and subtract the previous row's values using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The first row for each Country is always NA; every other row holds that year's value minus the previous year's value.
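The steps above (sort, keep complete countries, difference within country) can be checked on a small made-up data set. This is a base R sketch using ave(); the data frame toy and its values are invented for illustration and are not from the happiness reports.

```r
# Made-up ranks: country C is missing 2017, and B's rows are out of order
toy <- data.frame(
  Country        = c("A", "A", "A", "B", "B", "B", "C", "C"),
  Year           = c(2015, 2016, 2017, 2017, 2015, 2016, 2015, 2016),
  Happiness.Rank = c(10, 12, 9, 5, 3, 4, 7, 8)
)

# Sort so that consecutive rows within a country are consecutive years
toy <- toy[order(toy$Country, toy$Year), ]

# Keep only countries observed in all three years
toy <- toy[ave(toy$Year, toy$Country, FUN = length) == 3, ]

# Year-over-year change within each country; the first year has nothing to compare with
toy$Rank.Diff <- ave(toy$Happiness.Rank, toy$Country,
                     FUN = function(v) c(NA, diff(v)))
toy
```

Country C is dropped because it has only two years, and each remaining country starts with NA followed by the change from the prior year.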

Converting ts object to data.frame

I want to transform my ts object to data.frame object. My MWE is given below:
Code
set.seed(12345)
dat <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
library(reshape2)
df <- data.frame(date=as.Date(index(dat)), Y = melt(dat)$value)
Output
date Y
1 1975-05-14 86.04519
2 1975-05-14 93.78866
3 1975-05-14 88.04912
4 1975-05-15 94.30623
5 1975-05-15 72.82405
6 1975-05-15 58.31859
7 1975-05-15 66.25477
8 1975-05-16 75.46122
9 1975-05-16 86.38526
10 1975-05-16 99.48685
I have lost my quarters in date columns. How can I figure out the problem?
How about
data.frame(Y=as.matrix(dat), date=time(dat))
This returns
Y date
1 86.04519 1959.25
2 93.78866 1959.50
3 88.04912 1959.75
4 94.30623 1960.00
5 72.82405 1960.25
6 58.31859 1960.50
7 66.25477 1960.75
8 75.46122 1961.00
9 86.38526 1961.25
10 99.48685 1961.50
yearmon (from zoo) allows creating Date objects.
> dat <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
> data.frame(Y=as.matrix(dat), date=as.Date(as.yearmon(time(dat))))
Y date
1 51.72677 1959-04-01
2 57.61867 1959-07-01
3 86.78425 1959-10-01
4 50.05683 1960-01-01
5 69.56017 1960-04-01
6 73.12473 1960-07-01
7 69.40720 1960-10-01
8 70.12426 1961-01-01
9 58.94818 1961-04-01
10 97.58294 1961-07-01
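If adding a package is not an option, the year and quarter can be recovered in base R from time() and cycle(). This is a minimal sketch that assumes a quarterly series (frequency = 4); the column names date and yrqtr are hypothetical, and each quarter is mapped to the first day of its first month.

```r
set.seed(12345)
dat <- ts(runif(10, min = 50, max = 100), frequency = 4, start = c(1959, 2))

# Calendar year (small offset guards against floating-point error) and quarter 1-4
yr  <- as.integer(floor(time(dat) + 1e-8))
qtr <- as.integer(cycle(dat))

df <- data.frame(
  date  = as.Date(sprintf("%d-%02d-01", yr, (qtr - 1L) * 3L + 1L)),
  yrqtr = paste0(yr, " Q", qtr),
  Y     = as.numeric(dat)
)
df
```

The date column keeps the quarters as proper Date values, and yrqtr gives a readable label alongside it.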
The package timetk has several conversion functions. In your case:
dat <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
timetk::tk_tbl(dat)
# A tibble: 10 x 2
index value
<S3: yearqtr> <dbl>
1 1959 Q2 86.04519
2 1959 Q3 93.78866
3 1959 Q4 88.04912
4 1960 Q1 94.30623
5 1960 Q2 72.82405
6 1960 Q3 58.31859
7 1960 Q4 66.25477
8 1961 Q1 75.46122
9 1961 Q2 86.38526
10 1961 Q3 99.48685
Converting via an xts object seems to be both reliable and well documented. The code below works, and the new date column is of class yearqtr.
library(xts)
datx <- as.xts(dat)
df <- data.frame(date=index(datx), coredata(datx))
Checking class of date:
class(df$date)
[1] "yearqtr"
And result:
print(df)
date coredata.datx.
1 1959 Q2 86.04519
2 1959 Q3 93.78866
3 1959 Q4 88.04912
4 1960 Q1 94.30623
5 1960 Q2 72.82405
6 1960 Q3 58.31859
7 1960 Q4 66.25477
8 1961 Q1 75.46122
9 1961 Q2 86.38526
10 1961 Q3 99.48685
Package 'ggpp' provides function try_data_frame() (implemented using packages 'xts', 'zoo' and 'lubridate') that does the conversion in a single step. (This function is used in package 'ggpp' to implement a ggplot() method for time series, and returns the time index converted into a class that packages 'ggplot2' and 'scales' can use: Date or POSIXct.)
set.seed(12345)
dat.ts <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
library(ggpp)
#> Loading required package: ggplot2
#>
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#>
#> annotate
dat.df <- try_data_frame(dat.ts)
str(dat.df)
#> 'data.frame': 10 obs. of 2 variables:
#> $ time : Date, format: "1959-05-01" "1959-08-01" ...
#> $ dat.ts: num 86 93.8 88 94.3 72.8 ...
dat.df
#> time dat.ts
#> 1 1959-05-01 86.04519
#> 2 1959-08-01 93.78866
#> 3 1959-11-01 88.04912
#> 4 1960-02-01 94.30623
#> 5 1960-05-01 72.82405
#> 6 1960-08-01 58.31859
#> 7 1960-11-01 66.25477
#> 8 1961-02-01 75.46122
#> 9 1961-05-01 86.38526
#> 10 1961-08-01 99.48685
Created on 2022-09-03 with reprex v2.0.2
See help(try_data_frame) for details on how to set the names of columns or alter the way in which dates or times are handled.
