Merge two datasets but one of them is year_month and the other is year_month_week - r

I have been practicing data merging in R. Here are two simple data frames, df1 and df2.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month=c(202205,202206,202207,202204,202205,202206,202207),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
year_month_week=c(2022052,2022053,2022061,2022043,2022051,2022052),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 202205 in year_month column means May 2022.
For df2, 2022052 in year_month_week column means 2nd week of May, 2022.
I want to merge df1 and df2 with respect to year_month_week. So every row of df2 is kept, and a value from df1 may be copied into more than one row.
For example, 202205 in year_month covers both 2022052 and 2022053. df2 has no points column, so 65 is copied into both rows. My expected output looks like this:
df<-data.frame(id=c(1,1,1,2,2,2),
year_month_week=c(2022052,2022053,2022061,2022043,2022051,2022052),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3),
points=c(65,65,58,21,25,25))

Create a temporary year_month column in df2 by taking the first six characters of year_month_week, then do a left join on df1 by year_month and id before removing the temporary column.
Using tidyverse, we could do this as follows:
library(tidyverse)
df2 %>%
mutate(year_month = as.numeric(substr(year_month_week, 1, 6))) %>%
left_join(df1, by = c('year_month', 'id')) %>%
select(-year_month)
#> id year_month_week temperature points
#> 1 1 2022052 36.1 65
#> 2 1 2022053 36.3 65
#> 3 1 2022061 36.6 58
#> 4 2 2022043 34.3 21
#> 5 2 2022051 34.9 25
#> 6 2 2022052 35.3 25
Or in base R using merge:
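# Temporary join key: the first six digits (YYYYMM) of year_month_week
df2$year_month <- as.numeric(substr(df2$year_month_week, 1, 6))
# all.x = TRUE keeps every row of df2 (left join); [-1] drops the key column
merge(df2, df1, by = c('year_month', 'id'), all.x = TRUE)[-1]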
#> id year_month_week temperature points
#> 1 2 2022043 34.3 21
#> 2 1 2022052 36.1 65
#> 3 1 2022053 36.3 65
#> 4 2 2022051 34.9 25
#> 5 2 2022052 35.3 25
#> 6 1 2022061 36.6 58
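As a further sketch, not taken from the answer above: the same lookup can be done in base R without a temporary column by building a combined key and using match(). The names key1 and key2 are made up for this illustration, and the approach assumes each (id, year_month) pair appears at most once in df1.

```r
df1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2),
                  year_month = c(202205, 202206, 202207, 202204, 202205, 202206, 202207),
                  points = c(65, 58, 47, 21, 25, 27, 43))
df2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                  year_month_week = c(2022052, 2022053, 2022061, 2022043, 2022051, 2022052),
                  temperature = c(36.1, 36.3, 36.6, 34.3, 34.9, 35.3))

# Integer division by 10 drops the trailing week digit, leaving YYYYMM
key2 <- paste(df2$id, df2$year_month_week %/% 10)  # hypothetical helper name
key1 <- paste(df1$id, df1$year_month)              # hypothetical helper name

# match() gives, for each df2 row, the position of its key in df1 (NA if absent)
df2$points <- df1$points[match(key2, key1)]
df2
```

Unlike merge(), this keeps the rows of df2 in their original order, and unmatched rows simply get NA.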

Related

How can I calculate mean values for each day of a year from a time series data set in R?

I have a data set containing climatic data taken hourly from 01-01-2007 to 31-12-2021.
I want to calculate the mean value for a given variable (e.g. temperature) for each day of the year (1:365).
My dataset looks something like this:
dia prec_h tc_h um_h v_d vm_h
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2007-01-01 0.2 22.9 89 42 3
2 2007-01-01 0.4 22.8 93 47 1.9
3 2007-01-01 0 22.7 94 37 1.3
4 2007-01-01 0 22.6 94 38 1.6
5 2007-01-01 0 22.7 95 46 2.3
[...]
131496 2021-12-31 0.0 24.7 87 47 2.6
("[...]" stands for the omitted sequence of data from 2007 - 2014.)
I first calculated daily mean temperature for each of my entry dates as follows:
md$dia<-as.Date(md$dia,format = "%d/%m/%Y")
m_tc<-aggregate(tc_h ~ dia, md, mean)
This returned a data frame with the mean temperature for each date in the analyzed years.
Now I want to calculate the mean temperature for each day of the year from this data frame, i.e., the mean temperature for January 1st through December 31st across all years.
Thus, I need to end up with a data frame with 365 rows, but I don't know how to do this calculation. Can anyone help me out?
Also, there is a complication: I have 4 leap years in my data frame. Any recommendations on how to deal with them?
Thanks in advance.
First simulate a data set with the relevant columns and number of rows, then aggregate by date, giving m_tc.
As for the question, create an auxiliary variable mdia by formatting the dates column as month-day only, then compute the means grouped by mdia. The result is a data.frame with 366 rows and 2 columns: one row per calendar day, including February 29 from the leap years.
set.seed(2022)
# number of rows in the question
n <- 131496L
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
md <- data.frame(
dia = sort(sample(dia, n, TRUE)),
tc_h = round(runif(n, 0, 40), 1)
)
m_tc <- aggregate(tc_h ~ dia, md, mean)
mdia <- format(m_tc$dia, "%m-%d")
final <- aggregate(tc_h ~ mdia, m_tc, mean)
str(final)
#> 'data.frame': 366 obs. of 2 variables:
#> $ mdia: chr "01-01" "01-02" "01-03" "01-04" ...
#> $ tc_h: num 20.2 20.4 20.2 19.6 20.7 ...
head(final, n = 10L)
#> mdia tc_h
#> 1 01-01 20.20741
#> 2 01-02 20.44143
#> 3 01-03 20.20979
#> 4 01-04 19.63611
#> 5 01-05 20.69064
#> 6 01-06 18.89658
#> 7 01-07 20.15992
#> 8 01-08 19.53639
#> 9 01-09 19.52999
#> 10 01-10 19.71914
Created on 2022-10-18 with reprex v2.0.2
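On the leap-year part of the question, one possible approach (an assumption on my part, not something the answer above prescribes) is to drop February 29 before averaging, which leaves exactly 365 rows. The sketch below builds a simulated stand-in for m_tc with one value per date; no_leap and final365 are names invented for this illustration.

```r
# Simulated stand-in for m_tc: one mean temperature per date
set.seed(2022)
dia <- seq(as.Date("2007-01-01"), as.Date("2021-12-31"), by = "1 day")
m_tc <- data.frame(dia = dia, tc_h = round(runif(length(dia), 0, 40), 1))

# Month-day label, as in the answer above
m_tc$mdia <- format(m_tc$dia, "%m-%d")

# Drop February 29 so every remaining calendar day has one value per year
no_leap <- m_tc[m_tc$mdia != "02-29", ]
final365 <- aggregate(tc_h ~ mdia, no_leap, mean)
nrow(final365)
```

Grouping by the month-day label rather than by day-of-year number (format "%j") keeps March 1 aligned across years, since day numbers after February shift by one in leap years.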
Alternatively, the per-date means can be computed with dplyr, using the pipe (%>%) that dplyr re-exports from magrittr:
library(dplyr); library(magrittr)
tcmean<-md %>% group_by(dia) %>% summarise(m_tc=mean(tc_h))

Aggregate week and date in R by some specific rules

I'm not used to using R. I already asked a question on stack overflow and got a great answer.
I'm sorry to post a similar question, but I tried many times and got the output that I didn't expect.
This time, I want to do slightly different from my previous question.
Merge two data with respect to date and week using R
I have two data frames. One has a year_month_week column and the other has a date column.
df1<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2<-data.frame(id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
For df1, 2022051 means the 1st week of May 2022. Likewise, 2022052 means the 2nd week of May 2022. For df2, 20220503 means May 3rd, 2022.
What I want to do now is merge df1 and df2 with respect to year_month_week. In this case, 20220503 and 20220506 both fall in the 1st week of May 2022. If more than one date falls in the same year_month_week, I will include only the first of them.
Now, here's the different part. Even if no date falls inside a year_month_week, that row should be kept with NA. So my expected output has the same number of rows as df1, which contains the year_month_week column. My expected output is as follows:
df<-data.frame(id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43),
temperature=c(36.1,36.6,NA,34.3,34.9,NA,NA))
First we can derive a year_month_week key from the dates in df2, then join the two tables:
library(dplyr);library(lubridate)
df2$dt = ymd(df2$date)
df2$wk = day(df2$dt) %/% 7 + 1
df2$year_month_week = as.numeric(paste0(format(df2$dt, "%Y%m"), df2$wk))
df1 %>%
left_join(df2 %>% group_by(year_month_week) %>% slice(1) %>%
select(year_month_week, temperature))
Result
Joining, by = "year_month_week"
id year_month_week points temperature
1 1 2022051 65 36.1
2 1 2022052 58 36.6
3 1 2022053 47 NA
4 2 2022041 21 34.3
5 2 2022042 25 34.9
6 2 2022043 27 NA
7 2 2022044 43 NA
You can build on a previous answer by taking its function for counting the week of the month and using it to generate a join key in df2.
df1 <- data.frame(
id=c(1,1,1,2,2,2,2),
year_month_week=c(2022051,2022052,2022053,2022041,2022042,2022043,2022044),
points=c(65,58,47,21,25,27,43))
df2 <- data.frame(
id=c(1,1,1,2,2,2),
date=c(20220503,20220506,20220512,20220401,20220408,20220409),
temperature=c(36.1,36.3,36.6,34.3,34.9,35.3))
# Take the function from the previous StackOverflow question
monthweeks.Date <- function(x) {
ceiling(as.numeric(format(x, "%d")) / 7)
}
# Create a year_month_week variable to join on
df2 <-
df2 %>%
mutate(
date = lubridate::parse_date_time(
x = date,
orders = "%Y%m%d"),
year_month_week = paste0(
# format() zero-pads the month, so October to December also produce valid keys
format(date, "%Y%m"),
monthweeks.Date(date)),
year_month_week = as.double(year_month_week))
# Remove duplicate year_month_weeks
df2 <-
df2 %>%
arrange(year_month_week) %>%
distinct(year_month_week, .keep_all = T)
# Join dataframes
df1 <-
left_join(
df1,
df2,
by = "year_month_week")
Produces this result
id.x year_month_week points id.y date temperature
1 1 2022051 65 1 2022-05-03 36.1
2 1 2022052 58 1 2022-05-12 36.6
3 1 2022053 47 NA <NA> NA
4 2 2022041 21 2 2022-04-01 34.3
5 2 2022042 25 2 2022-04-08 34.9
6 2 2022043 27 NA <NA> NA
7 2 2022044 43 NA <NA> NA
Edit: forgot to mention that you need tidyverse loaded
library(tidyverse)
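One thing worth checking before choosing between the two answers: they use different week-of-month rules, day %/% 7 + 1 in the first and ceiling(day / 7) in the second. The short comparison below is only an illustration (the variable names are made up here); which rule is right depends on how weeks are defined in df1, which the question does not state.

```r
day_of_month <- 1:31

week_floor <- day_of_month %/% 7 + 1        # rule used with lubridate::day()
week_ceiling <- ceiling(day_of_month / 7)   # rule used in monthweeks.Date()

# Days of the month on which the two rules disagree
day_of_month[week_floor != week_ceiling]
#> [1]  7 14 21 28

# The days that occur in the question's df2 get the same week under both rules
sample_days <- c(3, 6, 12, 1, 8, 9)
all(sample_days %/% 7 + 1 == ceiling(sample_days / 7))
#> [1] TRUE
```

So both answers agree on the sample data, but a date such as the 7th would land in week 2 under the first rule and week 1 under the second.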

R - sum a set number of rows from same column in a different data frame

I have the following data frame:
df <- data.frame( year = c(1985,1986,1987,1988,1989,1990,
1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,
2001,2002,2003,2004,2005,2006,2007,2008,2009, 2010,
2011,2012, 2013,2014,2015,2016,2017,2018,2019,2020),
value = c(0,5,10,2,6,7,3,4,5,9,10,6,8,7,3,5,2,10,9,6,5,10,4,7,8,10,
4,6,8,9,2,3,7,6,2,1))
I want to create a second data frame (df2) that consists of 20-year intervals from the previous data frame, i.e.
df2 <- data.frame(year=c("1985-2005", "1986-2006","1987-2007", "1988-2008","1989-2009",
"1990-2010", "1991-2011","1992-2002", "1993-2003","1994-2004",
"1995-2005", "1996-2006","1997-2007", "1998-2008", "1999-2009",
"2000-2020"))
The value for each row of df2 should be the sum of value in df over that interval
(i.e., for year "1985-2005" in df2, the value is the sum of values from 1985 until 2005 in df - Excel snapshot attached with final values)
How can I perform this calculation? Also, is there a way to generate the year intervals in df2 automatically, without having to type them?
A possible solution:
library(tidyverse)
df <- data.frame( year = c(1985,1986,1987,1988,1989,1990,
1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,
2001,2002,2003,2004,2005,2006,2007,2008,2009, 2010,
2011,2012, 2013,2014,2015,2016,2017,2018,2019,2020),
value = c(0,5,10,2,6,7,3,4,5,9,10,6,8,7,3,5,2,10,9,6,5,10,4,7,8,10,
4,6,8,9,2,3,7,6,2,1))
df2 <- data.frame(year=c("1985-2005", "1986-2006","1987-2007", "1988-2008","1989-2009",
"1990-2010", "1991-2011","1992-2002", "1993-2003","1994-2004",
"1995-2005", "1996-2006","1997-2007", "1998-2008", "1999-2009",
"2000-2020"))
df2 %>%
separate(year, into = c("y1", "y2"), sep="-", convert = T, remove = F) %>%
rowwise %>%
mutate(value = sum(df$value[df$year >= y1 & df$year <= y2])) %>%
select(-y1, -y2) %>% ungroup
#> # A tibble: 16 × 2
#> year value
#> <chr> <dbl>
#> 1 1985-2005 122
#> 2 1986-2006 132
#> 3 1987-2007 131
#> 4 1988-2008 128
#> 5 1989-2009 134
#> 6 1990-2010 138
#> 7 1991-2011 135
#> 8 1992-2002 69
#> 9 1993-2003 74
#> 10 1994-2004 75
#> 11 1995-2005 71
#> 12 1996-2006 71
#> 13 1997-2007 69
#> 14 1998-2008 68
#> 15 1999-2009 69
#> 16 2000-2020 124
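The question also asks about generating the intervals without typing them. A minimal base R sketch follows, assuming every interval runs from a start year to start year + 20 inclusive; span, starts and ends are names introduced here. Because the df2 typed in the question contains some shorter spans (for example "1992-2002"), the rows for those entries will differ from the output above.

```r
df <- data.frame(
  year = 1985:2020,
  value = c(0, 5, 10, 2, 6, 7, 3, 4, 5, 9, 10, 6, 8, 7, 3, 5, 2, 10, 9, 6, 5, 10, 4, 7, 8, 10,
            4, 6, 8, 9, 2, 3, 7, 6, 2, 1))

span <- 20                                          # assumed interval length in years
starts <- seq(min(df$year), max(df$year) - span)    # 1985, 1986, ..., 2000
ends <- starts + span

df2 <- data.frame(
  year = paste(starts, ends, sep = "-"),
  # Sum value over every year that falls inside the i-th interval (inclusive)
  value = vapply(seq_along(starts), function(i) {
    sum(df$value[df$year >= starts[i] & df$year <= ends[i]])
  }, numeric(1))
)
head(df2, 3)
```

Changing span is enough to produce intervals of a different length.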

Subsetting data set to only retain the mean

Please see attached image of dataset.
What are the different ways to only retain a single value for each 'Month'? I've got a bunch of data points and would only need to retain, say, the mean value.
Many thanks
A different way of using the aggregate() function.
> aggregate(Temp ~ Month, data=airquality, FUN = mean)
Month Temp
1 5 65.54839
2 6 79.10000
3 7 83.90323
4 8 83.96774
5 9 76.90000
library(tidyverse)
library(lubridate)
#example data from airquality:
aq<-as_tibble(airquality)
aq$mydate<-lubridate::ymd(paste0(2018, "-", aq$Month, "-", aq$Day))
> aq
# A tibble: 153 x 7
Ozone Solar.R Wind Temp Month Day mydate
<int> <int> <dbl> <int> <int> <int> <date>
1 41 190 7.40 67 5 1 2018-05-01
2 36 118 8.00 72 5 2 2018-05-02
3 12 149 12.6 74 5 3 2018-05-03
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE))
summarize() can compute several summaries in one call:
aq %>%
group_by("Month" = month(mydate)) %>%
summarize("Mean_Temp" = mean(Temp, na.rm=TRUE),
"Num" = n(),
"SD" = sd(Temp, na.rm=TRUE))
# A tibble: 5 x 4
Month Mean_Temp Num SD
<dbl> <dbl> <int> <dbl>
1 5.00 65.5 31 6.85
2 6.00 79.1 30 6.60
3 7.00 83.9 31 4.32
4 8.00 84.0 31 6.59
5 9.00 76.9 30 8.36
Lubridate Cheatsheet
A data.table answer:
# load libraries
library(data.table)
library(lubridate)
setDT(dt)
dt[, .(meanValue = mean(value, na.rm =TRUE)), by = .(monthDate = floor_date(dates, "month"))]
Where dt has at least columns value and dates.
We can group by the index of the dataset (index() comes from the zoo package) and pass that to aggregate to get the mean:
aggregate(dat, index(dat), FUN = mean)
NB: Here, we assumed that the dataset is in xts or zoo format. If the dataset has a Month column, then use
aggregate(dat, list(dat$Month), FUN = mean)
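To make the Month-column variant concrete, here is a small sketch on the built-in airquality data, restricted to the Temp column. The object name monthly is invented for this example, and na.rm = TRUE is passed through to mean() as a precaution for columns that contain missing values.

```r
# Mean temperature per month; the named list gives the grouping column its name
monthly <- aggregate(airquality["Temp"],
                     by = list(Month = airquality$Month),
                     FUN = mean, na.rm = TRUE)
monthly
```

This returns one row per month with columns Month and Temp, the same figures as the formula interface shown earlier.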

Iteration for time series data, using purrr

I have a bunch of time series data stacked on top of one another in a data frame; one series for each region in a country. I'd like to apply the seas() function (from the seasonal package) to each series, iteratively, to make the series seasonally adjusted. To do this, I first have to convert the series to a ts class. I'm struggling to do all this using purrr.
Here's a minimum worked example:
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
quarter = rep(1:20, 10),
var = sample(5:200, 200, replace = T))
For each region (indexed by a number) I'd like to perform the following operations. Here's the first region as an example:
tem1 <- df %>% filter(region==1)
tem2 <- ts(data = tem1$var, frequency = 4, start=c(1990,1))
tem3 <- seas(tem2)
tem4 <- as.data.frame(tem3$data)
I'd then like to stack the output (ie. the multiple tem4 data frames, one for each region), along with the region and quarter identifiers.
So, the start of the output for region 1 would be this:
final seasonaladj trend irregular region quarter
1 27 27 96.95 -67.97279 1 1
2 126 126 96.95 27.87381 1 2
3 124 124 96.95 27.10823 1 3
4 127 127 96.95 30.55075 1 4
5 173 173 96.95 75.01355 1 5
6 130 130 96.95 32.10672 1 6
The data for region 2 would be below this etc.
I started with the following but without luck so far. Basically, I'm struggling to get the time series into the tibble:
seas.adjusted <- df %>%
group_by(region) %>%
mutate(data.ts = map(.x = data$var,
.f = as.ts,
start = 1990,
freq = 4))
I don't know much about the seasonal adjustment part, so there may be things I missed, but I can help with moving your calculations into a map-friendly function.
After grouping by region, you can nest the data so there's a nested data frame for each region. Then you can run essentially the same code as you had, but inside a function in map. Unnesting the resulting column gives you a long-shaped data frame of adjustments.
Like I said, I don't have the expertise to know whether those last two columns having NAs is expected or not.
Edit: Based on @wibeasley's question about retaining the quarter column, I'm adding a mutate that adds a column of the quarters listed in the nested data frame.
library(seasonal)
library(tidyverse)
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
quarter = rep(1:20, 10),
var = sample(5:200, 200, replace = T))
df %>%
group_by(region) %>%
nest() %>%
mutate(data.ts = map(data, function(x) {
tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)
as.data.frame(tem3$data) %>%
mutate(quarter = x$quarter)
})) %>%
unnest(data.ts)
#> # A tibble: 200 x 8
#> region final seasonaladj trend irregular quarter seasonal adjustfac
#> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
#> 1 1 27 27 97.0 -68.0 1 NA NA
#> 2 1 126 126 97.0 27.9 2 NA NA
#> 3 1 124 124 97.0 27.1 3 NA NA
#> 4 1 127 127 97.0 30.6 4 NA NA
#> 5 1 173 173 97.0 75.0 5 NA NA
#> 6 1 130 130 97.0 32.1 6 NA NA
#> 7 1 6 6 97.0 -89.0 7 NA NA
#> 8 1 50 50 97.0 -46.5 8 NA NA
#> 9 1 135 135 97.0 36.7 9 NA NA
#> 10 1 105 105 97.0 8.81 10 NA NA
#> # ... with 190 more rows
I also gave a bit more thought to doing this without nesting, and instead tried doing it with a split. Passing that list of data frames into imap_dfr let me take each split piece of the data frame and its name (in this case, the value of region), then return everything rbinded back together into one data frame. I sometimes shy away from nested data just because I have trouble seeing what's going on, so this is an alternative that is maybe more transparent.
df %>%
split(.$region) %>%
imap_dfr(function(x, reg) {
tem2 <- ts(x$var, frequency = 4, start = c(1990, 1))
tem3 <- seas(tem2)
as.data.frame(tem3$data) %>%
mutate(region = reg, quarter = x$quarter)
}) %>%
select(region, quarter, everything()) %>%
head()
#> region quarter final seasonaladj trend irregular seasonal adjustfac
#> 1 1 1 27 27 96.95 -67.97274 NA NA
#> 2 1 2 126 126 96.95 27.87378 NA NA
#> 3 1 3 124 124 96.95 27.10823 NA NA
#> 4 1 4 127 127 96.95 30.55077 NA NA
#> 5 1 5 173 173 96.95 75.01353 NA NA
#> 6 1 6 130 130 96.95 32.10669 NA NA
Created on 2018-08-12 by the reprex package (v0.2.0).
I put all the action inside of f(), and then called it with purrr::map_df(). The re-inclusion of quarter is a hack.
f <- function( .region ) {
d <- df %>%
dplyr::filter(region == .region)
y <- d %>%
dplyr::pull(var) %>%
ts(frequency = 4, start=c(1990,1)) %>%
seas()
y$data %>%
as.data.frame() %>%
# dplyr::select(-seasonal, -adjustfac) %>%
dplyr::mutate(
quarter = d$quarter
)
}
purrr::map_df(1:10, f, .id = "region")
results:
region final seasonaladj trend irregular quarter seasonal adjustfac
1 1 27.00000 27.00000 96.95000 -6.797279e+01 1 NA NA
2 1 126.00000 126.00000 96.95000 2.787381e+01 2 NA NA
3 1 124.00000 124.00000 96.95000 2.710823e+01 3 NA NA
4 1 127.00000 127.00000 96.95000 3.055075e+01 4 NA NA
5 1 173.00000 173.00000 96.95000 7.501355e+01 5 NA NA
6 1 130.00000 130.00000 96.95000 3.210672e+01 6 NA NA
7 1 6.00000 6.00000 96.95000 -8.899356e+01 7 NA NA
8 1 50.00000 50.00000 96.95000 -4.647254e+01 8 NA NA
9 1 135.00000 135.00000 96.95000 3.671077e+01 9 NA NA
10 1 105.00000 105.00000 96.95000 8.806955e+00 10 NA NA
...
96 5 55.01724 55.01724 60.25848 9.130207e-01 16 1.9084928 1.9084928
97 5 60.21549 60.21549 59.43828 1.013076e+00 17 1.0462424 1.0462424
98 5 58.30626 58.30626 58.87065 9.904130e-01 18 0.1715082 0.1715082
99 5 61.68175 61.68175 58.07827 1.062045e+00 19 1.0537962 1.0537962
100 5 59.30138 59.30138 56.70798 1.045733e+00 20 2.5294523 2.5294523
...
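For readers who want to try the split-apply-combine pattern without the seasonal package installed, here is a base R sketch. Note the substitution: it uses stats::decompose() (classical additive decomposition) as a stand-in for seas(), so the numbers will not match the X-13 output above. adjust_one is a hypothetical helper name.

```r
set.seed(1234)
df <- data.frame(region = rep(1:10, each = 20),
                 quarter = rep(1:20, 10),
                 var = sample(5:200, 200, replace = TRUE))

# Stand-in for seas(): classical decomposition from the stats package
adjust_one <- function(x) {
  series <- ts(x$var, frequency = 4, start = c(1990, 1))
  parts <- decompose(series)
  data.frame(region = x$region,
             quarter = x$quarter,
             original = as.numeric(series),
             seasonal = as.numeric(parts$seasonal),
             trend = as.numeric(parts$trend),   # NA at both ends of each series
             seasonaladj = as.numeric(series - parts$seasonal))
}

# One data frame per region, processed separately, then stacked again
adjusted <- do.call(rbind, lapply(split(df, df$region), adjust_one))
rownames(adjusted) <- NULL
head(adjusted)
```

Because region and quarter are copied from each split piece inside the helper, the identifiers stay attached to the right rows when the pieces are stacked.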

Resources