I want to transform my ts object to data.frame object. My MWE is given below:
Code
set.seed(12345)
dat <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
library(reshape2)
df <- data.frame(date=as.Date(index(dat)), Y = melt(dat)$value)
Output
date Y
1 1975-05-14 86.04519
2 1975-05-14 93.78866
3 1975-05-14 88.04912
4 1975-05-15 94.30623
5 1975-05-15 72.82405
6 1975-05-15 58.31859
7 1975-05-15 66.25477
8 1975-05-16 75.46122
9 1975-05-16 86.38526
10 1975-05-16 99.48685
I have lost the quarters in the date column. How can I fix this?
How about
data.frame(Y=as.matrix(dat), date=time(dat))
This returns
Y date
1 86.04519 1959.25
2 93.78866 1959.50
3 88.04912 1959.75
4 94.30623 1960.00
5 72.82405 1960.25
6 58.31859 1960.50
7 66.25477 1960.75
8 75.46122 1961.00
9 86.38526 1961.25
10 99.48685 1961.50
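If separate year and quarter columns (or a Date) are preferable to the decimal time index, base R can derive them from the same ts object. The following is a minimal sketch rather than part of the answer above: the column names year, quarter and date are illustrative, and mapping each quarter to its first day is an assumption about the convention you want.

```r
set.seed(12345)
dat <- ts(data = runif(n = 10, min = 50, max = 100), frequency = 4, start = c(1959, 2))

# cycle() gives the position within the year (1-4 for quarterly data);
# the integer part of time() gives the year.
yr  <- as.integer(floor(as.numeric(time(dat)) + 1e-8))
qtr <- as.integer(cycle(dat))

df <- data.frame(
  year    = yr,
  quarter = qtr,
  # first day of each quarter: Q1 -> Jan, Q2 -> Apr, Q3 -> Jul, Q4 -> Oct
  date    = as.Date(sprintf("%d-%02d-01", yr, (qtr - 1L) * 3L + 1L)),
  Y       = as.numeric(dat)
)
head(df, 3)
```

This needs no extra packages; the small offset added before floor() only guards against floating-point noise in the time index.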
The yearmon class (from the zoo package) can be converted to Date objects.
> library(zoo)
> dat <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
> data.frame(Y=as.matrix(dat), date=as.Date(as.yearmon(time(dat))))
Y date
1 51.72677 1959-04-01
2 57.61867 1959-07-01
3 86.78425 1959-10-01
4 50.05683 1960-01-01
5 69.56017 1960-04-01
6 73.12473 1960-07-01
7 69.40720 1960-10-01
8 70.12426 1961-01-01
9 58.94818 1961-04-01
10 97.58294 1961-07-01
The package timetk has several conversion functions. In your case:
dat <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
timetk::tk_tbl(dat)
# A tibble: 10 x 2
index value
<S3: yearqtr> <dbl>
1 1959 Q2 86.04519
2 1959 Q3 93.78866
3 1959 Q4 88.04912
4 1960 Q1 94.30623
5 1960 Q2 72.82405
6 1960 Q3 58.31859
7 1960 Q4 66.25477
8 1961 Q1 75.46122
9 1961 Q2 86.38526
10 1961 Q3 99.48685
Converting via an xts object seems to be both reliable and well documented. The code below works, and the new date column has class yearqtr.
library(xts)
datx <- as.xts(dat)
df <- data.frame(date=index(datx), coredata(datx))
Checking class of date:
class(df$date)
[1] "yearqtr"
And result:
print(df)
date coredata.datx.
1 1959 Q2 86.04519
2 1959 Q3 93.78866
3 1959 Q4 88.04912
4 1960 Q1 94.30623
5 1960 Q2 72.82405
6 1960 Q3 58.31859
7 1960 Q4 66.25477
8 1961 Q1 75.46122
9 1961 Q2 86.38526
10 1961 Q3 99.48685
Package 'ggpp' provides function try_data_frame() (implemented using packages 'xts', 'zoo' and 'lubridate') that does the conversion in a single step. (This function is used in package 'ggpp' to implement a ggplot() method for time series, and returns the time index converted into a class that packages 'ggplot2' and 'scales' can use: Date or POSIXct.)
set.seed(12345)
dat.ts <- ts(data=runif(n=10, min=50, max=100), frequency = 4, start = c(1959, 2))
library(ggpp)
#> Loading required package: ggplot2
#>
#> Attaching package: 'ggpp'
#> The following object is masked from 'package:ggplot2':
#>
#> annotate
dat.df <- try_data_frame(dat.ts)
str(dat.df)
#> 'data.frame': 10 obs. of 2 variables:
#> $ time : Date, format: "1959-05-01" "1959-08-01" ...
#> $ dat.ts: num 86 93.8 88 94.3 72.8 ...
dat.df
#> time dat.ts
#> 1 1959-05-01 86.04519
#> 2 1959-08-01 93.78866
#> 3 1959-11-01 88.04912
#> 4 1960-02-01 94.30623
#> 5 1960-05-01 72.82405
#> 6 1960-08-01 58.31859
#> 7 1960-11-01 66.25477
#> 8 1961-02-01 75.46122
#> 9 1961-05-01 86.38526
#> 10 1961-08-01 99.48685
Created on 2022-09-03 with reprex v2.0.2
See help(try_data_frame) for the details on how to set the names of columns or alter the way in which dates or times are handled.
Related
In my example below I have quarterly data from 2021Q1 to 2022Q3 for variable X. I have forecasted growth rates of variable X (growth_x) from 2022Q4 to 2025Q4. I want to use the growth_x variable to calculate variable X from 2022Q4 to 2025Q4 iteratively. I am manually calculating it below and am still missing 2025Q4. Is it possible to write a function to do it? I am fairly new to writing loops. Any help will be greatly appreciated. Thank you in advance.
library(readxl)
library(dplyr)
library(lubridate)
# Quarterly Data
data <- data.frame(c("2021Q1","2021Q2","2021Q3","2021Q4",
"2022Q1","2022Q2","2022Q3","2022Q4",
"2023Q1","2023Q2","2023Q3","2023Q4",
"2024Q1","2024Q2","2024Q3","2024Q4",
"2025Q1","2025Q2","2025Q3","2025Q4"),
# Variable X - Actuals upto 2022Q3
c(804,511,479,462,
427,330,440,NA,
NA,NA,NA,NA,
NA,NA,NA,NA,
NA,NA,NA,NA),
# Forecasted Growth rates of X from 2022Q4
c(NA,NA,NA,NA,
NA,NA,NA,0.24,
0.49,0.65,0.25,0.71,
0.63,0.33,0.53,0.83,
0.87,0.19,0.99,0.16))
# Renaming the columns
data<-data%>%rename(yrqtr=1,x=2,growth_x=3)
# Creating Date Variable
data<-data%>%mutate(year=substr(yrqtr,1,4),
qtr=substr(yrqtr,5,6),
mon=ifelse(qtr=="Q1",3,
ifelse(qtr=="Q2",6,
ifelse(qtr=="Q3",9,12))),
date=make_date(year,mon,1))
# Computing Growth Rate from 2022Q3 to 2023Q3
Growth_2023_3<-data%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2022-09-01",forecast_x,x))%>%select(-forecast_x)
# Computing Growth Rate from 2023Q3 to 2024Q3
Growth_2024_3<-Growth_2023_3%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2023-09-01",forecast_x,x))%>%select(-forecast_x)
# Computing Growth Rate from 2024Q3 to 2025Q3
Growth_2025_3<-Growth_2024_3%>%mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date>"2024-09-01",forecast_x,x))%>%select(-forecast_x)
Does this do what you want?
n_years <- length(unique(data$year))
for(i in unique(data$year)[2:n_years]){
# Each pass fills the next four quarters from the values one year earlier
data <- data %>%
mutate(forecast_x=(1+growth_x)*lag(x,4),
x=ifelse(date > as.Date(paste0(i,"-09-01")),forecast_x,x))
}
As an aside, column names can be assigned at the time the data frame is created. For example:
# Quarterly Data
data <- data.frame(yrqtr = c("2021Q1","2021Q2"),
x = c(804,511),
growth_x = c(0.24,0.49))
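The year-by-year loop above can also be written as a plain row-by-row recurrence, which may be easier to follow if you are new to loops. This is a sketch under the assumption that the rows are sorted by quarter with no gaps, so that "four rows back" always means "same quarter, previous year"; the vectors x and growth_x are copied from the question.

```r
# Actuals up to 2022Q3, then NA for the quarters to be forecast
x <- c(804, 511, 479, 462,
       427, 330, 440, NA,
       NA, NA, NA, NA,
       NA, NA, NA, NA,
       NA, NA, NA, NA)

# Forecast growth rates, available from 2022Q4 onwards
growth_x <- c(NA, NA, NA, NA,
              NA, NA, NA, 0.24,
              0.49, 0.65, 0.25, 0.71,
              0.63, 0.33, 0.53, 0.83,
              0.87, 0.19, 0.99, 0.16)

# Fill each missing quarter from the (possibly just computed) value
# of the same quarter one year earlier.
for (i in seq_along(x)) {
  if (is.na(x[i]) && i > 4) {
    x[i] <- (1 + growth_x[i]) * x[i - 4]
  }
}
x
```

Because each value is written back into x before later rows are visited, a single pass is enough and 2025Q4 is filled in as well.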
If you want to avoid using a loop, you can use purrr::reduce().
library(tidyverse)
library(lubridate)
sol <- reduce(
.x = unique(data$year), # iterate over years
.init = data,
\(lhs, rhs) lhs %>%
mutate(x = ifelse(year == rhs & is.na(x), (1+growth_x)*lag(x,4), x))
)
sol
#> yrqtr x growth_x year qtr mon date
#> 1 2021Q1 804.0000 NA 2021 Q1 3 2021-03-01
#> 2 2021Q2 511.0000 NA 2021 Q2 6 2021-06-01
#> 3 2021Q3 479.0000 NA 2021 Q3 9 2021-09-01
#> 4 2021Q4 462.0000 NA 2021 Q4 12 2021-12-01
#> 5 2022Q1 427.0000 NA 2022 Q1 3 2022-03-01
#> 6 2022Q2 330.0000 NA 2022 Q2 6 2022-06-01
#> 7 2022Q3 440.0000 NA 2022 Q3 9 2022-09-01
#> 8 2022Q4 572.8800 0.24 2022 Q4 12 2022-12-01
#> 9 2023Q1 636.2300 0.49 2023 Q1 3 2023-03-01
#> 10 2023Q2 544.5000 0.65 2023 Q2 6 2023-06-01
#> 11 2023Q3 550.0000 0.25 2023 Q3 9 2023-09-01
#> 12 2023Q4 979.6248 0.71 2023 Q4 12 2023-12-01
#> 13 2024Q1 1037.0549 0.63 2024 Q1 3 2024-03-01
#> 14 2024Q2 724.1850 0.33 2024 Q2 6 2024-06-01
#> 15 2024Q3 841.5000 0.53 2024 Q3 9 2024-09-01
#> 16 2024Q4 1792.7134 0.83 2024 Q4 12 2024-12-01
#> 17 2025Q1 1939.2927 0.87 2025 Q1 3 2025-03-01
#> 18 2025Q2 861.7802 0.19 2025 Q2 6 2025-06-01
#> 19 2025Q3 1674.5850 0.99 2025 Q3 9 2025-09-01
#> 20 2025Q4 2079.5475 0.16 2025 Q4 12 2025-12-01
I have a data set with a list of event dates and a list of sample dates. Events and samples are grouped by unit. For each sample date, I want to count the number of events that came before that sample date and the number of different months in which those events occurred, grouped by unit. A couple of complications: sometimes the event date happens after the sample date in the same year. Sometimes there are sample dates but no event in a particular year.
Example data (my actual dataset has ~6000 observations):
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
Output data should look something like this:
year unit numevent nummonth
1996 a 1 1
1997 a 1 1
1998 a 3 3
1999 a 3 3
1996 b 1 1
1997 b 2 1
1998 b 3 1
1999 b 4 1
Note that in 1997 in unit a, the event is not counted because it happened after the sample date.
For smaller datasets, I have manually subset the data by each sample date and counted events/unique months (and then merged the datasets back together), but I can't do that with ~6000 observations.
numevent.1996<-ddply(data[data$eventdate<'1996-08-01',], .(unit),
summarize, numevent=length(eventdate), nummth=length(unique(eventmonth)), year=1996)
This might work:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data<-read.table(header=T, text="
unit eventdate eventmonth sampledate year
a 1996-06-01 06 1996-08-01 1996
a 1997-09-03 09 1997-08-02 1997
a 1998-05-15 05 1998-08-03 1998
a NA NA 1999-08-02 1999
b 1996-05-31 05 1996-08-01 1996
b 1997-05-31 05 1997-08-02 1997
b 1998-05-15 05 1998-08-03 1998
b 1999-05-16 05 1999-08-02 1999")
data <- data %>%
mutate(eventdate = lubridate::ymd(eventdate),
sampledate = lubridate::ymd(sampledate))
data %>%
group_by(unit, year, eventmonth) %>%
summarise(numevent = sum(sampledate >= eventdate)) %>%
group_by(unit, year) %>%
summarise(nummonth = sum(numevent > 0),
numevent = sum(numevent))
#> `summarise()` has grouped output by 'unit', 'year'. You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by 'unit'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 × 4
#> # Groups: unit [2]
#> unit year nummonth numevent
#> <chr> <int> <int> <int>
#> 1 a 1996 1 1
#> 2 a 1997 0 0
#> 3 a 1998 1 1
#> 4 a 1999 NA NA
#> 5 b 1996 1 1
#> 6 b 1997 1 1
#> 7 b 1998 1 1
#> 8 b 1999 1 1
Created on 2023-01-08 by the reprex package (v2.0.1)
Note: I don't think the data you've included actually produce the output you proposed. The proposed output appears to contain 18 events that meet the condition, but the sample data provided have only 8 rows.
Try this?
data %>%
group_by(unit) %>%
mutate(
numevent = sapply(sampledate, function(z) sum(eventdate < z, na.rm = TRUE)),
nummonth = sapply(sampledate, function(z) length(unique(na.omit(eventmonth[eventdate < z]))))
) %>%
ungroup()
# # A tibble: 8 × 7
# unit eventdate eventmonth sampledate year numevent nummonth
# <chr> <date> <int> <date> <int> <int> <int>
# 1 a 1996-06-01 6 1996-08-01 1996 1 1
# 2 a 1997-09-03 9 1997-08-02 1997 1 1
# 3 a 1998-05-15 5 1998-08-03 1998 3 3
# 4 a NA NA 1999-08-02 1999 3 3
# 5 b 1996-05-31 5 1996-08-01 1996 1 1
# 6 b 1997-05-31 5 1997-08-02 1997 2 1
# 7 b 1998-05-15 5 1998-08-03 1998 3 1
# 8 b 1999-05-16 5 1999-08-02 1999 4 1
Data
data <- structure(list(unit = c("a", "a", "a", "a", "b", "b", "b", "b"), eventdate = structure(c(9648, 10107, 10361, NA, 9647, 10012, 10361, 10727), class = "Date"), eventmonth = c(6L, 9L, 5L, NA, 5L, 5L, 5L, 5L), sampledate = structure(c(9709, 10075, 10441, 10805, 9709, 10075, 10441, 10805), class = "Date"), year = c(1996L, 1997L, 1998L, 1999L, 1996L, 1997L, 1998L, 1999L)), class = "data.frame", row.names = c(NA, -8L))
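For comparison, the same cumulative count can be sketched in base R without dplyr. The key point is that every sample date is compared against all event dates of its unit, not only the event in its own row. The helper name count_before is hypothetical, and the data frame is rebuilt here from the question's table so the snippet runs on its own.

```r
data <- data.frame(
  unit       = rep(c("a", "b"), each = 4),
  eventdate  = as.Date(c("1996-06-01", "1997-09-03", "1998-05-15", NA,
                         "1996-05-31", "1997-05-31", "1998-05-15", "1999-05-16")),
  sampledate = as.Date(rep(c("1996-08-01", "1997-08-02", "1998-08-03", "1999-08-02"), 2)),
  year       = rep(1996:1999, 2)
)
data$eventmonth <- as.integer(format(data$eventdate, "%m"))

# For one unit: compare each sample date with every event date of that unit
count_before <- function(g) {
  g$numevent <- vapply(g$sampledate, function(s) {
    sum(g$eventdate < s, na.rm = TRUE)
  }, integer(1))
  g$nummonth <- vapply(g$sampledate, function(s) {
    length(unique(na.omit(g$eventmonth[g$eventdate < s])))
  }, integer(1))
  g
}

res <- do.call(rbind, lapply(split(data, data$unit), count_before))
res[, c("year", "unit", "numevent", "nummonth")]
```

Each unit costs a number of comparisons proportional to the square of its row count, which should be fine for roughly 6000 rows split over several units.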
I have a data series of daily snow depth values over a 60 year period. I would like to see the number of days with a snow depth higher than 30 cm for each season, for example from July 1980 to June 1981. What does the code for this have to look like? I know how I could calculate the daily values higher than 30 cm per season individually, but not how a code could calculate all seasons.
I have uploaded my dataframe on wetransfer: Dataframe
Thank you so much for your help in advance.
Pernilla
Something like this would work
library(dplyr)
library(lubridate)
df<-read.csv('BayrischerWald_Brennes_SH_daily_merged.txt', sep=';')
df_season <-df %>%
mutate(season=(Day %>% ymd() - days(181)) %>% floor_date("year") %>% year())
df_group_by_season <- df_season %>%
filter(!is.na(SHincm)) %>%
group_by(season) %>%
summarize(days_above_30=sum(SHincm>30)) %>%
ungroup()
df_group_by_season
#> # A tibble: 61 × 2
#> season days_above_30
#> <dbl> <int>
#> 1 1961 1
#> 2 1962 0
#> 3 1963 0
#> 4 1964 0
#> 5 1965 0
#> 6 1966 0
#> 7 1967 129
#> 8 1968 60
#> 9 1969 107
#> 10 1970 43
#> # … with 51 more rows
Created on 2022-01-15 by the reprex package (v2.0.1)
Here is an approach using the aggregate() function. After reading the data, convert the Day column to class Date and drop the rows with missing dates:
snow <- read.table("BayrischerWald_Brennes_SH_daily_merged.txt", header=TRUE, sep=";")
snow$Day <- as.Date(snow$Day)
str(snow)
# 'data.frame': 51606 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
snow <- snow[!is.na(snow$Day), ]
str(snow)
# 'data.frame': 21886 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
Notice more than half of your data has missing values for the date. Now we need to divide the data by ski season:
brks <- as.Date(paste(1961:2022, "07-01", sep="-"))
lbls <- paste(1961:2021, 1962:2022, sep="/")
snow$Season <- cut(snow$Day, breaks=brks, labels=lbls)
Now we use aggregate() to get the number of days with more than 30 cm of snow:
days30cm <- aggregate(SHincm~Season, snow, subset=snow$SHincm > 30, length)
colnames(days30cm)[2] <- "Over30cm"
head(days30cm, 10)
# Season Over30cm
# 1 1961/1962 1
# 2 1967/1968 129
# 3 1968/1969 60
# 4 1969/1970 107
# 5 1970/1971 43
# 6 1972/1973 101
# 7 1973/1974 119
# 8 1974/1975 188
# 9 1975/1976 126
# 10 1976/1977 112
In addition, you can get other statistics such as the maximum snow of the season or the total cm of snow:
maxsnow <- aggregate(SHincm~Season, snow, max)
totalsnow <- aggregate(SHincm~Season, snow, sum)
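The season assignment can also be derived from the month alone, which avoids both a fixed list of breaks and a fixed day offset (a 181-day shift can place 30 June of a leap year in the wrong season). The sketch below uses made-up data, since the original file is not available here: every day in December, January and February is given a depth of 40 cm and every other day 0 cm. The column names Day and SHincm follow the answers above.

```r
# Synthetic daily series (an assumption for illustration only)
days <- seq(as.Date("1980-01-01"), as.Date("1982-12-31"), by = "day")
snow <- data.frame(
  Day    = days,
  SHincm = ifelse(format(days, "%m") %in% c("12", "01", "02"), 40, 0)
)

# A season runs from 1 July to 30 June: months 7-12 belong to the season
# starting that year, months 1-6 to the season that started the year before.
mon   <- as.integer(format(snow$Day, "%m"))
yr    <- as.integer(format(snow$Day, "%Y"))
start <- ifelse(mon >= 7, yr, yr - 1L)
snow$Season <- paste(start, start + 1L, sep = "/")

# Number of days per season with more than 30 cm of snow
days_over_30 <- tapply(snow$SHincm > 30, snow$Season, sum)
days_over_30
```

Unlike aggregate() with a subset, tapply() on the logical vector keeps seasons in which no day exceeded the threshold and reports them as 0.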
I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that gives the average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new column to give the average count over all rows with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, four years of data times 52 weeks is a lot of conditions to spell out. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to call values in the count column based on Week.Number and Flu.Year values in other rows, conditioned on the current value of Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )
It's bad form, and in some cases an error, to use $-indexing within dplyr verbs.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it straight-up.
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
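The same "average of this week in all earlier years" can be sketched in base R with ave(), using the toy data from the result above. The helper name prev_mean and the column name prev_avg are illustrative; the idea is a running mean within each week number, shifted down by one year so the current year is excluded.

```r
dat <- data.frame(
  Flu.Year    = rep(2016:2018, each = 3),
  Week.Number = rep(1:3, 3),
  count       = as.numeric(1:9)
)

# Rows must be in year order within each week for the running mean to make sense
dat <- dat[order(dat$Flu.Year, dat$Week.Number), ]

# Running mean of all earlier values: NA for the first year of each week
prev_mean <- function(v) c(NA, head(cumsum(v) / seq_along(v), -1))

dat$prev_avg <- ave(dat$count, dat$Week.Number, FUN = prev_mean)
dat
```

With a single grouping variable (Week.Number) the function sees one value per year, so the k-th entry is the mean of the k-1 earlier years.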
We can use aggregate from base R
aggregate(count ~ Flu.Year, dat, FUN = mean)
I have data for each Country's happiness (https://www.kaggle.com/unsdsn/world-happiness), and I made data for each year of the reports. Now, I don't know how to get the values for each year subtracted from each other e.g. how did happiness rank change from 2015 to 2017/2016 to 2017? I'd like to make a new df of differences for each.
I was able to bind the tables for columns in common and started to work on removing Countries that don't have data for all 3 years. I'm not sure if I'm going down a complicated path.
keepcols <- c("Country","Happiness.Rank","Economy..GDP.per.Capita.","Family","Health..Life.Expectancy.","Freedom","Trust..Government.Corruption.","Generosity","Dystopia.Residual","Year")
mydata2015 = read.csv("C:\\Users\\mmcgown\\Downloads\\2015.csv")
mydata2015$Year <- "2015"
data2015 <- subset(mydata2015, select = keepcols )
mydata2016 = read.csv("C:\\Users\\mmcgown\\Downloads\\2016.csv")
mydata2016$Year <- "2016"
data2016 <- subset(mydata2016, select = keepcols )
mydata2017 = read.csv("C:\\Users\\mmcgown\\Downloads\\2017.csv")
mydata2017$Year <- "2017"
data2017 <- subset(mydata2017, select = keepcols )
df <- rbind(data2015,data2016,data2017)
head(df, n=10)
tail(df, n=10)
df15 <- df[df['Year']=='2015',]
df16 <- df[df['Year']=='2016',]
df17 <- df[df['Year']=='2017',]
nocon <- rbind(setdiff(unique(df16['Country']),unique(df17['Country'])),setdiff(unique(df15['Country']),unique(df16['Country'])))
Don't have a clear path to accomplish what I want but it would look like
df16_to_17
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2017] - Yemen[Happiness Rank in 2016])
USA (USA[Happiness Rank in 2017] - USA[Happiness Rank in 2016])
(other countries)
df15_to_16
Country Happiness.Rank ...(other columns)
Yemen (Yemen[Happiness Rank in 2016] - Yemen[Happiness Rank in 2015])
USA (USA[Happiness Rank in 2016] - USA[Happiness Rank in 2015])
(other countries)
It's very straightforward with dplyr, and involves grouping by country and then finding the differences between consecutive values with base R's diff. Just make sure to use df and not df15, etc.:
library(dplyr)
rank_diff_df <- df %>%
group_by(Country) %>%
mutate(Rank.Diff = c(NA, diff(Happiness.Rank)))
The above assumes that the data are arranged by year, which they are in your case because of the way you combined the dataframes. If not, you'll need to call arrange(Year) before the call to mutate. Filtering out countries with missing year data isn't necessary, but can be done after group_by() with filter(n() == 3).
If you would like to view the differences it would make sense to drop some variables and rearrange the data:
rank_diff_df %>%
select(Year, Country, Happiness.Rank, Rank.Diff) %>%
arrange(Country)
Which returns:
# A tibble: 470 x 4
# Groups: Country [166]
Year Country Happiness.Rank Rank.Diff
<chr> <fct> <int> <int>
1 2015 Afghanistan 153 NA
2 2016 Afghanistan 154 1
3 2017 Afghanistan 141 -13
4 2015 Albania 95 NA
5 2016 Albania 109 14
6 2017 Albania 109 0
7 2015 Algeria 68 NA
8 2016 Algeria 38 -30
9 2017 Algeria 53 15
10 2015 Angola 137 NA
# … with 460 more rows
The above data frame will work well with ggplot2 if you are planning on plotting the results.
If you don't feel comfortable with dplyr you can use base R's merge to combine the dataframes, and then create a new dataframe with the differences as columns:
df_wide <- merge(merge(df15, df16, by = "Country"), df17, by = "Country")
rank_diff_df <- data.frame(Country = df_wide$Country,
Y2015.2016 = df_wide$Happiness.Rank.y -
df_wide$Happiness.Rank.x,
Y2016.2017 = df_wide$Happiness.Rank -
df_wide$Happiness.Rank.y
)
Which returns:
head(rank_diff_df, 10)
Country Y2015.2016 Y2016.2017
1 Afghanistan 1 -13
2 Albania 14 0
3 Algeria -30 15
4 Angola 4 -1
5 Argentina -4 -2
6 Armenia -6 0
7 Australia -1 1
8 Austria -1 1
9 Azerbaijan 1 4
10 Bahrain -7 -1
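To see the grouped difference at work without the Kaggle files, here is a small base R sketch on a hand-made table. The ranks for "A" and "B" reuse the Afghanistan and Albania values shown above; country "C" is invented to show how a country with a missing year can be dropped before differencing.

```r
df <- data.frame(
  Country        = c("A", "A", "A", "B", "B", "B", "C", "C"),
  Year           = c(2015L, 2016L, 2017L, 2015L, 2016L, 2017L, 2015L, 2017L),
  Happiness.Rank = c(153, 154, 141, 95, 109, 109, 10, 12)
)

# Keep only countries observed in all three years
n_years  <- ave(df$Year, df$Country, FUN = length)
complete <- df[n_years == 3, ]

# Sort so that diff() compares consecutive years within a country
complete <- complete[order(complete$Country, complete$Year), ]

# Year-over-year change in rank; the first year of each country has no predecessor
complete$Rank.Diff <- ave(complete$Happiness.Rank, complete$Country,
                          FUN = function(v) c(NA, diff(v)))
complete
```

A negative Rank.Diff means the country moved up the ranking, since a lower rank number is better.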
Assuming the three datasets are present in your environment with the names data2015, data2016 and data2017, we can add a Year column with the respective year and keep the columns which are present in the keepcols vector. Then arrange the data by Country and Year, group_by Country, keep only those countries which are present in all 3 years, and subtract the values in the previous rows using lag or diff.
library(dplyr)
data2015$Year <- 2015
data2016$Year <- 2016
data2017$Year <- 2017
df <- bind_rows(data2015, data2016, data2017)
data <- df[keepcols]
data %>%
arrange(Country, Year) %>%
group_by(Country) %>%
filter(n() == 3) %>%
mutate_at(-1, ~. - lag(.)) #OR
#mutate_at(-1, ~c(NA, diff(.)))
# A tibble: 438 x 10
# Groups: Country [146]
# Country Happiness.Rank Economy..GDP.pe… Family Health..Life.Ex… Freedom
# <chr> <int> <dbl> <dbl> <dbl> <dbl>
# 1 Afghan… NA NA NA NA NA
# 2 Afghan… 1 0.0624 -0.192 -0.130 -0.0698
# 3 Afghan… -13 0.0192 0.471 0.00731 -0.0581
# 4 Albania NA NA NA NA NA
# 5 Albania 14 0.0766 -0.303 -0.0832 -0.0387
# 6 Albania 0 0.0409 0.302 0.00109 0.0628
# 7 Algeria NA NA NA NA NA
# 8 Algeria -30 0.113 -0.245 0.00038 -0.0757
# 9 Algeria 15 0.0392 0.313 -0.000455 0.0233
#10 Angola NA NA NA NA NA
# … with 428 more rows, and 4 more variables: Trust..Government.Corruption. <dbl>,
# Generosity <dbl>, Dystopia.Residual <dbl>, Year <dbl>
The value in the first row for each Country will always be NA; every other value has the previous row's value subtracted from it.