matching yearly time points to preceding 365 days of data in R - r

I am trying to merge two datasets. The survey dataset consists of biodiversity surveys from different regions conducted every 1-5 years in a certain month (the month is constant within, but not between, regions). The temperature dataset consists of daily temperature readings in each survey region.
For multiple surveys that have different start months and temporal extents, I want to pair each survey*year combination with the twelve months of temperature data preceding it. In other words, I want to pair a May 1983 survey with the 12 months (or 365 days -- I don't care which) of daily temperature records preceding it, ending April 30, 1983. Meanwhile, another survey elsewhere conducted in August 1983 needs to be paired with the 365 days of temperature data ending July 31, 1983.
There are (at least) two ways to do this -- one would be joining the survey data to the (longer) temperature data and then somehow subsetting or identifying which dates fall in the 12 months preceding the survey-date. Another is to start with the survey data and try to pair the temperature data to each row with a matrix-column -- I tried doing this with time-series tools from tsibble and tsModel but couldn't get it to "lag" the right values when grouped by region.
I was able to create an identifier to join the datasets such that each date in the temperature data is matched with the subsequent survey in time. However, not all of those are within 365 days (e.g., in the dataset created below, the date 1983-06-03 is matched with the ref_year aleutian_islands-5-1986 because the survey only happens every 3-5 years).
Here are some examples of the behavior I want for a single region (from the example dataset below), although I'm open to solutions that achieve the same thing but don't look exactly like this:
For this row, the value in the new column that I want to generate (ref_match) should be NA; the date is more than 365 days before ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 1982-06-09 1982 6 6-1982 aleutian_islands-5-1983 0
For this row, ref_match should be aleutian_islands-5-2014 since the date is within 12 months of ref_year.
region date year month month_year ref_year temperature
<chr> <date> <dbl> <dbl> <chr> <chr> <dbl>
1 aleutian_islands 2013-07-22 2013 7 7-2013 aleutian_islands-5-2014 0.998
The following script will generate a dataset temp_dat with columns like those in the snippets above from which I hope to generate the ref_match column.
# load packages
library(tidyverse)
library(lubridate)
set.seed=10
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <- rbind(ai_dat, ebs_dat) %>%
mutate(month_year = paste0(startmonth,"-",year)) %>%
select(region, month_year) %>%
distinct() %>%
mutate(region_month_year = paste0(region,"-",month_year))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <- expand.grid(month=seq(1, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(region_month_year = paste0(region,"-",month,"-",year)) %>% # create unique identifier
mutate(ref_year = ifelse(region_month_year %in% surv_dat$region_month_year, region_month_year, NA),
month_year = paste0(month,"-",year)) %>%
select(region, month_year, ref_year) %>%
distinct() %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
month_year = paste0(month,"-",year)) %>%
left_join(surv_dat_exploded, by=c('region','month_year')) %>%
filter(!is.na(ref_year))# get rid of dates that are after any ref_year

Sounds like you want a non-equi join. This is easily done with data.table and is very fast. Here's an example that lightly modifies your MWE:
library(data.table)
# make survey dfs
ai_dat = data.table(year = c(1983, 1986, 1991, 1994, 1997),
region = "aleutian_islands", "startmonth" = 5)
ebs_dat = data.table(year = seq(1983, 1999, 1),
region = "eastern_bering_sea", "startmonth" = 5)
# bind together and create date (and cutoffdate) vars
surv_dat = rbind(ai_dat, ebs_dat)
surv_dat[, startdate := as.IDate(paste(year, startmonth, '01', sep = '-'))
][, cutoffdate := startdate - 365L]
# make temperature df
temp_dat = CJ(date=seq(as.IDate("1982-01-01"), as.IDate("1999-12-31"), "days"),
region=c('aleutian_islands','eastern_bering_sea'))
# add temperature var
temp_dat$temp = rnorm(nrow(temp_dat))
# create duplicate date variable (will make post-join processing easier)
temp_dat[, matchdate := date]
# Optional: Set keys for better join performance
setkey(surv_dat, region, startdate)
setkey(temp_dat, region, matchdate)
# Where the magic happens: Non-equi join
surv_dat = temp_dat[surv_dat, on = .(region == region,
matchdate <= startdate,
matchdate >= cutoffdate)]
# Optional: get rid of unneeded columns
surv_dat[, c('matchdate', 'matchdate.1') := NULL][]
#> date region temp year startmonth
#> 1: 1982-05-01 aleutian_islands 0.3680810 1983 5
#> 2: 1982-05-02 aleutian_islands 0.8349334 1983 5
#> 3: 1982-05-03 aleutian_islands -1.3622227 1983 5
#> 4: 1982-05-04 aleutian_islands 1.4327587 1983 5
#> 5: 1982-05-05 aleutian_islands 0.5068226 1983 5
#> ---
#> 8048: 1999-04-27 eastern_bering_sea -1.2924594 1999 5
#> 8049: 1999-04-28 eastern_bering_sea 0.7519078 1999 5
#> 8050: 1999-04-29 eastern_bering_sea -1.0185174 1999 5
#> 8051: 1999-04-30 eastern_bering_sea -1.4322252 1999 5
#> 8052: 1999-05-01 eastern_bering_sea -1.0412836 1999 5
Created on 2021-05-20 by the reprex package (v2.0.0)

Try this solution.
I basically used your reference column to generate a ref_date and estimate the difference in days between the observation and reference. Then, I used a simple ifelse to test if the dates fall within the 365 days range and then copy them to the temp_valid column.
# load packages
library(tidyverse)
library(lubridate)
set.seed=10
# make survey dfs
ai_dat <- data.frame("year" = c(1983, 1986, 1991, 1994, 1997), "region" = "aleutian_islands", "startmonth" = 5)
ebs_dat <- data.frame("year" = seq(1983, 1999, 1), "region" = "eastern_bering_sea", "startmonth" = 5)
# join and create what will become ref_year column
surv_dat <-
rbind(ai_dat, ebs_dat) %>%
mutate(year_month = paste0(year,"-",startmonth),
region_year_month = paste0(region,"-",year,"-",startmonth))
# expand out to all possible month*year combinations for joining with temperature
surv_dat_exploded <-
expand.grid(month=seq(01, 12, 1), year=seq(1982, 2000, 1), region=c('aleutian_islands','eastern_bering_sea')) %>% # get a factorial combo of every possible month*year; have to start in 1982 even though we can't use surveys before 1983 because we need to match to temperature data from 1982
mutate(year_month = paste0(year,"-",month)) %>%
mutate(region_year_month = paste0(region,"-",year,"-",month)) %>%
mutate(ref_year = ifelse(region_year_month %in% surv_dat$region_year_month, region_year_month,NA)) %>%
group_by(region) %>%
fill(ref_year, .direction="up") %>% # fill in each region with the survey to which env data from each month*year should correspond
ungroup()
# make temperature dataset and join in survey ref_year column
temp_dat <- data.frame(expand.grid(date=seq(ymd("1982-01-01"), ymd("1999-12-31"), "days"), region=c('aleutian_islands','eastern_bering_sea'))) %>%
mutate(temperature = rnorm(nrow(.), 10, 5), # fill in with fake data
year = year(date),
month = month(date),
year_month = paste0(year,"-",month))
final_df <-
left_join(temp_dat, surv_dat_exploded, by=c('region','year_month')) %>%
#split ref_column in ref_year and ref_region
separate(ref_year, c("ref_region","ref_year"), "-", extra="merge") %>%
#convert ref_year into date
mutate_at("ref_year", as.Date, format= "%Y-%M") %>%
#round it down to be in the first day of the month (not needed if the day matters)
mutate_at("ref_year", floor_date, "month" ) %>%
#difference between observed and the reference
mutate(diff_days = date - ref_year) %>%
# ifelse statement for capturing values of interest
mutate(temp_valid = ifelse(between(diff_days, -365, 0),temperature,NA))

Related

How to assign values to a new column based on a range of dates from that overlap years in R?

I have a growth rate, calculated from individual measurements 4 times a year, that I am trying to assign to a different time frame called Year2 (August 1st of year 1 to July 31st of year 2, see attached photo).
My Dataframe:
ID
Date
Year
Year2
Lag
Lapse
Growth
Daily_growth
1
2009-07-30
2009
2009
NA
NA
35.004
NA
1
2009-10-29
2009
2010
2009-07-30
91 days
31.585
0.347
1
2010-01-27
2010
2010
2009-10-29
90 days
63.769
0.709
1
2010-04-27
2010
2010
2010-01-27
90 days
28.329
0.315
1
2010-07-29
2010
2010
2010-04-27
93 days
32.068
0.345
1
2010-11-02
2010
2011
2010-07-29
96 days
128.1617320
1.335
I took the growth rate as follows:
Growth_df <- Growth_df%>%
group_by(ID) %>% # Individuals we measured
mutate(Lag = lag(Date), #Last date measured
Lapse = round(difftime(Date, Lag, units = "days")), #days between Dates monitored
Daily_growth = as.numeric(Growth) / as.numeric(Lapse))
What I am trying to do is assign the daily growth rate between each measurement, matching to the Year2 timeframe:
Growth_df <- Growth_df %>%
mutate(Year = as.numeric(Year),
Year2_growth = ifelse(Year == Year2, Daily_growth*Lapse, 0)) %>%
group_by(Year2) %>%
mutate(Year2_growth = sum(Year2_growth, na.rm = TRUE))
My problem is that I do not know how to get the dates in between the years (something in place of the 0 in the ifelse statement). I need some sort of way that would calculate how many days would be left from the new start date (August 1st) to the most recent measurement, then multiply it by the growth rate, as well as cut the end early (July 31st)
I tried making a second dataframe with nothing by years and days then assigning the growth rate when comparing the two dataframes but I have been stuck on the same issue: partitioning the time frame.
I am sure there's a much much muuuuch more efficient way to deal with this, but this is the way I sorted out:
Make my timeframes
Create a function for the ranges I wanted
Make a dataframe with for both the start and the end ranges
Join them together
Marvel in my lack of r skills.
Start_dates <- seq(ymd('2008-08-01'),ymd('2021-08-1'), by = '12 months')
End_dates <- seq(ymd('2009-07-31'),ymd('2022-07-31'), by = '12 months')
Year2_dates <- data.frame(Start_dates, End_dates)
Year2_dates <- Year2_dates %>%
mutate(Year2 = format(as.Date(Start_dates, format="%d/%m/%Y"),"%Y"),
Year2 = as.numeric(Year2) + 1)
Vegetation <- Vegetation %>%
left_join(Year2_dates)
Range_finder <- function(x,y){
as.numeric(difftime(x, y, unit = "days"))
}
Range_start <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==1) %>%
filter(Year != Year2) #had to get rid of first year samples as they were the top row but didn't have a change in year
Range_start <- Range_start %>%
mutate(Number_days_start = Range_finder(Date, Start_dates),
Border_range = Number_days_start * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Range_end <- Vegetation %>%
group_by(Year2, ID) %>%
filter(row_number()==n(),
Year2 != 2022)
Range_end <- Range_end %>%
mutate(Number_days_end = Range_finder(End_dates, Date),
Border_range = Number_days_end * Daily_veg) %>%
ungroup() %>%
select(ID, Year2, Date, Border_range)
Ranges <- full_join(Range_start, Range_end)
Vegetation <- Vegetation %>%
left_join(Ranges)

How to summarize based on multiple columns in R?

I want to summarize the dataset based on "year", "months", and "subdist_id" columns. For each subdist_id, I want to get average values of "Rainfall" for the months 11,12,1,2 but for different years. For example, for subdist_id 81, the mean Rainfall value of 2004 will be the mean Rainfall of months 11, 12 of 2004, and months 1,2 of 2005.
I am getting no clue how to do it, although I searched online rigorously.
Expanding on #Bloxx's answer and incorporating my comment:
# Set up example data frame:
df = data.frame(year=c(rep.int(2004,2),rep.int(2005,4)),
month=((0:5%%4)-2)%%12+1,
Rainfall=seq(.5,by=0.15,length.out=6))
Now use mutate to create year2 variable:
df %>% mutate(year2 = year - (month<3)*1) # or similar depending on the problem specs
And now apply the groupby/summarise action:
df %>% mutate(year2 = year - (month<3)*1) %>%
group_by(year2) %>%
summarise(Rainfall = mean(Rainfall))
Lets assume your dataset is called df. Is this what you are looking for?
df %>% group_by(subdist_id, year) %>% summarise(Rainfall = mean(Rainfall))
I think you can simply do this:
df %>% filter(months %in% c(1,2,11,12)) %>%
group_by(subdist_id, year=if_else(months %in% c(1,2),year-1,year)) %>%
summarize(meanRain = mean(Rainfall))
Output:
subdist_id year meanRain
<dbl> <dbl> <dbl>
1 81 2004 0.611
2 81 2005 0.228
Input:
df = data.frame(
subdist_id = 81,
year=c(2004,2004, 2005, 2005, 2005, 2005),
months=c(11,12,1,2,11,12),
Rainfall = c(.251,.333,.731,1.13,.111,.346)
)

Returning data frame entries based on max date and/or max values (COVID-19 data example)

df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
I found this code which finds countries that reported the highest number of deaths and recovered by region were found in this code.
yesterday <- function() Sys.Date() - 1L
yesterday()
# [1] "if it doesn't work yesterday()-1 do it"
library(tidyverse)
death_df <- df %>%
filter(as.Date(day) == yesterday()) %>%
group_by(region) %>%
filter(death == max(death)) %>%
select(Date = day,
countryName,
region,
death,
recovered)
recovered_df <- df %>%
filter(as.Date(day) == yesterday()) %>%
group_by(region) %>%
filter(recovered == max(recovered)) %>%
select(Date = day,
countryName,
region,
death,
recovered)
full_df <- bind_rows(death_df, recovered_df)
However, I need to find the countries that report the most death and recovered to the world in general.
Here is the output I am looking for:
date countryName death recovered
2020/05/06 united State **19580** 500
2020/05/06 İran 11500 **98567**
Note that these values are not real.
The data set is updated daily. However, it may not have been updated for 1 -2 days. let's pay attention to this.
The code below will select the record with the max daily death and max daily recovered for the max date in the data.
## call the dplyr library
library(dplyr)
## read the data into R
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv', stringsAsFactors = FALSE)
## determine the max date contained within the data
max.date <- df[which.max(as.Date(df$day)),"day"]
## copy the data to preserve original
df1 <- df
## filter the data to only entries from the max day
df1 <- filter(df1, as.Date(date, "%Y/%m/%d") == as.Date(max.date))
## determine the entry with the most deaths
max.deaths <- df1[which.max(df1$death),]
## format the number of deaths as given in the example
max.deaths$death <- paste0("**",max.deaths$death,"**")
## determine the entry with the most recovered
max.recovered <- df1[which.max(df1$recovered),]
## format the number recovered to match the format of the example
max.recovered$recovered <- paste0("**",max.recovered$recovered,"**")
## create a data frame containing our max death and max recovered entries
max.records <- rbind(max.deaths, max.recovered)
## attach a column with the max date which corresponds to the date of the entries selected
max.records$date <- max.date
## organize the data as shown in the example
max.records <- select(max.records, c("day","countryName","death","recovered"))
And this code will calculate aggregate (or total) deaths as totalDeaths and aggregate recovered as totalRecovered for each country. It then returns the record with the max totalDeath and max totalRecovered with the max date within the data.
## call the dplyr library
library(dplyr)
## read the data into R
df <- read.csv ('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv', stringsAsFactors = FALSE)
## determine the max date contained within the data
max.date <- df[which.max(as.Date(df$day)),"day"]
## copy the data to preserve the original
df1 <- df
## group the data by countries
df1 <- group_by(df1, countryName)
## sum the death and recovered of each country
df1 <- summarise(df1, totalDeaths = sum(death), totalRecovered = sum(recovered))
## ungroup your data to avoid errors
df1 <- ungroup(df1)
## determine country with most total deaths reported
max.deaths <- df1[which.max(df1$totalDeaths),]
## format death numbers to match example
max.deaths$totalDeaths <- paste0("**",max.deaths$totalDeaths,"**")
## determine country with most total recovered reported
max.recovered <- df1[which.max(df1$totalRecovered),]
## format recovered numbers to match example
max.recovered$totalRecovered <- paste0("**",max.recovered$totalRecovered,"**")
## create a data frame containing our max entries
max.records <- rbind(max.deaths, max.recovered)
## attach a column with the max date which corresponds to the most current date the data reports
max.records$date <- max.date
## organize the data as shown in the example
max.records <- select(max.records, c("day","countryName","death","recovered"))
Note: both methods rely upon the dplyr R package. dplyr can be installed by running install.packages(dplyr) in R or RStudio.
I hope this helps!
This is an approach to calculate the information you request for every day using dplyr.
library(dplyr)
result <- df %>% group_by(day) %>%
filter(death == max(death) | recovered == max(recovered)) %>%
mutate(death = case_when(death == max(death) ~ paste0("**",death[death == max(death)],"**"),
TRUE ~ as.character(death)),
recovered = case_when(recovered == max(recovered) ~ paste0("**",recovered[recovered == max(recovered)],"**"),
TRUE ~ as.character(recovered)))
result %>%
filter(day == "2020/04/06")
# A tibble: 2 x 9
# Groups: day [1]
day countryCode countryName region lat lon confirmed recovered death
<chr> <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr>
1 2020/04/06 CN China Asia 35 105 81708 **77029** 3331
2 2020/04/06 IT Italy Europe 42.8 12.8 132547 22837 **16523**

Using Mutate to rank specific columns

I'm a relative newbie to dplyr. I have a data.frame organized with each store name and source (made up of the results for 2018) making up the observations. The variables are total revenue, quantity, customer experience score, and a few others.
I'd like to rank each category in the data.frame and create new observations. All variables would be ranked in descending order, but customer experience and one additional column would be ranked in ascending order. The source I'd like to call this would be called "ranks".
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
table <- cbind(store, source, revenue, quantity, satisfaction)
I was able to get what I needed using mutate, but I had to manually name each new column. I'm sure there is a more efficient way to rank these values out there!
Here is what I originally did:
table <- table %>%
mutate(revenue_rank = rank(-revenue), quantity_rank = rank(-quantity), satisfaction_rank = rank(satisfaction))
In general, if you're having to do something repeatedly in a data frame, such as calculating ranks, you probably want to reshape to long data. Also note that what you got from cbind is a matrix, not data frame--probably not what you want, since this means numeric variables actually come through as characters. Instead of cbind, use data.frame or data_frame (for a tibble).
What I did here is gathered into a long data frame, grouped by the measures (revenue, quantity, or satisfaction), then gave ranks based on the value, keeping in mind that you wanted different orders for satisfaction and the other measures.
library(tidyverse)
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
df <- data_frame(store, source, revenue, quantity, satisfaction)
df %>%
gather(key = measure, value = value, revenue:satisfaction) %>%
group_by(measure) %>%
mutate(rank = ifelse(measure == "satisfaction", rank(value), rank(-value))) %>%
ungroup() %>%
select(-value) %>%
mutate(measure = paste(measure, "rank", sep = "_")) %>%
spread(key = measure, value = rank)
#> # A tibble: 3 x 5
#> store source quantity_rank revenue_rank satisfaction_rank
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Boston 2018 3 3 3
#> 2 Chicago 2018 2 1 1
#> 3 NYC 2018 1 2 2
Created on 2018-05-04 by the reprex package (v0.2.0).

Calculating change of values between same day in different years

I need to calculate so called MAT (Movie Anual Total), means the % change in sales value between same day in two different year:
ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
...
A 200 1 1 2017
Does anybody have an idea about how to deal with it?
I want to get this:
ID Sales Day Month Yeas **MAT**
With the way your data is set up, you're actually quite close. What you want to do now is group your data by month and day, order each group by year, and then take the successive differences (assuming you want the MAT for sequential years)
library(lubridate)
library(dplyr)
X <-
data.frame(date = seq(as.Date("2014-01-01"),
as.Date("2017-12-31"),
by = 1)) %>%
mutate(day = day(date),
month = month(date),
year = year(date),
sales = rnorm(nrow(.), mean = 100, sd = 5))
X %>%
group_by(month, day) %>%
arrange(month, day, year) %>%
mutate(mat = c(NA, diff(sales))) %>%
ungroup()
If you are wanting to be able to generically take a difference between any two years, this will need some refinements.
Here is a solution with base R. Mainly it is a self-join:
d$prev.Year <- d$Year-1
dd <- merge(d,d, by.x=c("prev.Year", "Month", "Day"), by.y=c("Year", "Month", "Day"))
dd$MAT <- with(dd, (Sales.x-Sales.y)/Sales.y)
If you have different values in ID you eventually want:
dd <- merge(d,d, by.x=c("ID", "prev.Year", "Month", "Day"), by.y=c("ID", "Year", "Month", "Day"))
data:
d <- read.table(header=TRUE, text=
"ID Sales Day Month Year
A 500 31 12 2015
A 100 1 1 2016
A 200 2 1 2016
A 200 1 1 2017")

Resources