Sliding windows: compare series with all series before/after - r

I'm fairly new to rolling windows. I'd like to compute a function, say a correlation, between one window of the data and all windows of the same size before/after it. Assume no gaps. I'd like to use a tidyverse-esque approach such as tsibble and/or Davis Vaughan's slider.
df <- structure(list(sales = c(2, 4, 6, 2, 8, 10, 9, 3, 5, 2), index = structure(c(1567123200, 1567209600, 1567296000, 1567382400, 1567468800, 1567555200, 1567641600, 1567728000, 1567814400, 1567900800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))
Suppose I want to calculate the Pearson correlation between the first 3 days of the series and every later 3-day window:

We could create a grouping index with gl for every 3 rows after removing the first 3 rows, drop any incomplete trailing block, then compute cor between the first 3 values and each block of 'sales':
library(dplyr)
n <- 3
df %>%
slice(-seq_len(n)) %>%
group_by(grp = as.integer(gl(n(), n, n()))) %>%
filter(n() == n) %>%
summarise(cor = cor(df$sales[seq_len(n)], sales))
Output:
# A tibble: 2 x 2
# grp cor
# <int> <dbl>
#1 1 0.961
#2 2 -0.655
data
df <- data.frame(sales = c(2, 4, 6, 2, 8, 10, 9, 3, 5, 2),
index = seq(as.Date("2019-08-30"), length.out = 10, by = '1 day'))
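
If overlapping windows are wanted rather than non-overlapping blocks, a base-R sketch along these lines may help. It is not the slider or tsibble approach the question asks about; `ref`, `starts` and `window_cors` are illustrative names, and the window is assumed to slide one row at a time.

```r
# Sales values from the question
sales <- c(2, 4, 6, 2, 8, 10, 9, 3, 5, 2)
n <- 3

# Reference window: the first n observations
ref <- sales[seq_len(n)]

# Start positions of every later window that does not overlap the reference
starts <- seq(n + 1, length(sales) - n + 1)

# Pearson correlation of the reference window with each later window
window_cors <- data.frame(
  start = starts,
  cor = vapply(starts, function(s) cor(ref, sales[s:(s + n - 1)]), numeric(1))
)
window_cors
```

The rows with `start` 4 and 7 are the same two blocks that the grouped approach compares; the remaining rows are the overlapping windows in between.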

Related

How to find sum of a column given the date and month is the same

I am wondering how I can find the sum of a column, (in this case it's the AgeGroup_20_to_24 column) for a month and year. Here's the sample data:
https://i.stack.imgur.com/E23Th.png
I essentially want to find the total amount of cases per month/year.
For an example: 01/2020 = total sum cases of the AgeGroup
02/2020 = total sum cases of the AgeGroup
I tried doing this, however I get this:
https://i.stack.imgur.com/1eH0O.png
xAge20To24 <- covid%>%
mutate(dates=mdy(Date), year = year(dates), month = month(dates))%>%
mutate(total = sum(AgeGroup_20_to_24))%>%
select(Date, year, month, AgeGroup_20_to_24)%>%
group_by(year)
View(xAge20To24)
Any help will be appreciated.
structure(list(Date = c("3/9/2020", "3/10/2020", "3/11/2020",
"3/12/2020", "3/13/2020", "3/14/2020"), AgeGroup_0_to_19 = c(1,
0, 2, 0, 0, 2), AgeGroup_20_to_24 = c(1, 0, 2, 0, 2, 1), AgeGroup_25_to_29 = c(1,
0, 1, 2, 2, 2), AgeGroup_30_to_34 = c(0, 0, 2, 3, 4, 3), AgeGroup_35_to_39 = c(3,
1, 2, 1, 2, 1), AgeGroup_40_to_44 = c(1, 2, 1, 3, 3, 1), AgeGroup_45_to_49 = c(1,
0, 0, 2, 0, 1), AgeGroup_50_to_54 = c(2, 1, 1, 1, 0, 1), AgeGroup_55_to_59 = c(1,
0, 1, 1, 1, 2), AgeGroup_60_to_64 = c(0, 2, 2, 1, 1, 3), AgeGroup_70_plus = c(2,
0, 2, 0, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
I'm not sure if your question and your data match up. You're asking for by-month summaries of data, but your data only includes March entries. I've provided two examples of summarizing your data below, one that uses the entire date and one that uses by-day summaries since we can't use month. If your full data set has more months included, you can just swap the day for month instead. First, a quick summary of just the dates can be done with this code:
#### Load Library ####
library(tidyverse)
library(lubridate)
#### Pivot and Summarise Data ####
covid %>%
pivot_longer(cols = c(everything(),
-Date),
names_to = "AgeGroup",
values_to = "Cases") %>%
group_by(Date) %>%
summarise(Sum_Cases = sum(Cases))
This pivots your data into long format, groups by the entire date, then summarizes the cases, which gives you this by-date sum of data:
# A tibble: 6 × 2
Date Sum_Cases
<chr> <dbl>
1 3/10/2020 6
2 3/11/2020 16
3 3/12/2020 14
4 3/13/2020 15
5 3/14/2020 17
6 3/9/2020 13
Using the same pivot_longer principle, you can mutate the data to date format like you already did, pivot to longer format, then group by day, thereafter summarizing the cases:
#### Theoretical Example ####
covid %>%
mutate(Date=mdy(Date),
Year = year(Date),
Month = month(Date),
Day = day(Date)) %>%
pivot_longer(cols = c(everything(),
-Date,-Year,-Month,-Day),
names_to = "AgeGroup",
values_to = "Cases") %>%
group_by(Day) %>% # use by day instead of month
summarise(Sum_Cases = sum(Cases))
Which you can see below. Here we can see the 14th had the most cases:
# A tibble: 6 × 2
Day Sum_Cases
<int> <dbl>
1 9 13
2 10 6
3 11 16
4 12 14
5 13 15
6 14 17
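
For the single column the question asks about, a by-month total could be sketched as follows. The rows in `covid_demo` are made up purely for illustration (only the column names follow the question), and `monthly` is an illustrative name. The main difference from the attempt in the question is that `group_by()` comes before `summarise()`, so the sum is taken per year/month rather than over the whole column.

```r
library(dplyr)
library(lubridate)

# Hypothetical rows spanning two months; not the asker's real data
covid_demo <- tibble(
  Date = c("1/30/2020", "1/31/2020", "2/1/2020", "2/2/2020"),
  AgeGroup_20_to_24 = c(1, 2, 4, 3)
)

monthly <- covid_demo %>%
  mutate(dates = mdy(Date),          # parse month/day/year strings
         year  = year(dates),
         month = month(dates)) %>%
  group_by(year, month) %>%          # one group per calendar month
  summarise(total = sum(AgeGroup_20_to_24), .groups = "drop")
monthly
```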

How to mutate columns in R based on ordering of subset of these columns?

To begin with, let's suppose we have a dataset like this:
data <- data.frame(
id = 1:5,
time = c(0.1, 0.2, 0.1, 0.1, 0.2),
obj_a_size = c(1, 3, 8, 4, 2),
obj_a_cuteness = c(3, 6, 4, 1, 2),
obj_b_size = c(5, 4, 4, 2, 5),
obj_b_cuteness = c(6, 2, 10, 9, 6),
obj_c_size = c(3, 6, 7, 1, 6),
obj_c_cuteness = c(10, 1, 6, 8, 8)
)
It has columns concerning the whole experiment (like time) and object-specific columns (like X_size and X_cuteness). These objects are ordered randomly, though, so I'd like to mutate these columns to order the objects by size for each experiment separately. I expect the result to look like this:
data <- data.frame(
id = 1:5,
time = c(0.1, 0.2, 0.1, 0.1, 0.2),
obj_max_size = c(5, 6, 8, 4, 6),
obj_max_cuteness = c(6, 1, 4, 1, 8),
obj_2nd_size = c(3, 4, 7, 2, 5),
obj_2nd_cuteness = c(10, 2, 6, 9, 6),
obj_min_size = c(1, 3, 4, 1, 2),
obj_min_cuteness = c(3, 6, 10, 8, 2)
)
Notice that cuteness isn't ordered descending or ascending, but I want cuteness to be considered part of an object and set obj_max_cuteness = obj_2_cuteness wherever obj_max_size = obj_2_size, and so on.
Number of objects is known in advance (there are four of them), columns are known as well, and there are four columns describing each object. There is no missing data. I'm willing to use any package, if necessary. Also, original dataset is about 500k by 30, so bonus points for quick or memory-friendly code.
EDIT: Some noticed that the description is not very clear. What I'm after is a bit object-oriented thing: in the case above each object within experiment could be described as such (X in obj_X_ means that it belongs to experiment no. X):
obj_1_a = {"size": 1, "cuteness": 3}
obj_1_b = {"size": 5, "cuteness": 6}
obj_1_c = {"size": 3, "cuteness": 10}
obj_2_a = {"size": 3, "cuteness": 6}
...
I want to reorder them by size so that (in the resulting data frame):
obj_1_max = {"size": 5, "cuteness": 6}
obj_1_2nd = {"size": 3, "cuteness": 10}
obj_1_min = {"size": 1, "cuteness": 3}
obj_2_max = {"size": 6, "cuteness": 1}
...
Is this what you are after?
The min and max value calculations are straightforward. To find the 2nd max you need to do a bit more work. My interpretation of the 2nd values is that it is the 2nd value of the sorted and unique values. My output differs from yours but that may be due to a different interpretation of what you mean by the 2nd value. My reading: you are looking for the first value down from the max value; from the groups of 3 columns (size, cuteness).
library(dplyr)
library(tidyr) # pivot_longer() comes from tidyr
data <- data.frame(
id = 1:5,
time = c(0.1, 0.2, 0.1, 0.1, 0.2),
obj_a_size = c(1, 3, 8, 4, 2),
obj_a_cuteness = c(3, 6, 4, 1, 2),
obj_b_size = c(5, 4, 4, 2, 5),
obj_b_cuteness = c(6, 2, 10, 9, 6),
obj_c_size = c(3, 6, 7, 1, 6),
obj_c_cuteness = c(10, 1, 6, 8, 8)
)
obj_max_size <- data %>%
pivot_longer(cols = contains('size')) %>%
group_by(id) %>%
summarise(obj_max_size = max(value)) %>%
ungroup() %>%
select(obj_max_size)
obj_min_size <- data %>%
pivot_longer(cols = contains('size')) %>%
group_by(id) %>%
summarise(obj_min_size = min(value)) %>%
ungroup() %>%
select(obj_min_size)
obj_2nd_size <- data %>%
pivot_longer(cols = contains('size')) %>%
group_by(id) %>%
distinct(value) %>%
arrange(desc(value)) %>%
slice(2) %>%
ungroup() %>%
select(obj_2nd_size = value)
obj_max_cuteness <- data %>%
pivot_longer(cols = contains('cuteness')) %>%
group_by(id) %>%
summarise(obj_max_cuteness = max(value)) %>%
ungroup() %>%
select(obj_max_cuteness)
obj_min_cuteness <- data %>%
pivot_longer(cols = contains('cuteness')) %>%
group_by(id) %>%
summarise(obj_min_cuteness = min(value)) %>%
ungroup() %>%
select(obj_min_cuteness)
obj_2nd_cuteness <- data %>%
pivot_longer(cols = contains('cuteness')) %>%
group_by(id) %>%
distinct(value) %>%
arrange(desc(value)) %>%
slice(2) %>%
ungroup() %>%
select(obj_2nd_cuteness = value)
output <- bind_cols(id = data$id, obj_max_size, obj_min_size, obj_2nd_size, obj_max_cuteness, obj_min_cuteness, obj_2nd_cuteness)
With output looking like this:
> output
# A tibble: 5 x 7
id obj_max_size obj_min_size obj_2nd_size obj_max_cuteness obj_min_cuteness obj_2nd_cuteness
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 1 3 10 3 6
2 2 6 3 4 6 1 2
3 3 8 4 7 10 4 6
4 4 4 1 2 9 1 8
5 5 6 2 5 8 2 6
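
If each object's cuteness should travel with its size, as the question's EDIT describes, one possible sketch is to reshape to one row per object, sort by size within each id, and reshape back. This is an alternative reading, not the method above; `rank` and `reordered` are illustrative names, exactly three objects per row are assumed, and ties in size keep their original column order. It has not been tuned for a 500k-row table.

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  id = 1:5,
  time = c(0.1, 0.2, 0.1, 0.1, 0.2),
  obj_a_size = c(1, 3, 8, 4, 2),
  obj_a_cuteness = c(3, 6, 4, 1, 2),
  obj_b_size = c(5, 4, 4, 2, 5),
  obj_b_cuteness = c(6, 2, 10, 9, 6),
  obj_c_size = c(3, 6, 7, 1, 6),
  obj_c_cuteness = c(10, 1, 6, 8, 8)
)

reordered <- data %>%
  # One row per (id, object), with size and cuteness as columns
  pivot_longer(starts_with("obj_"),
               names_to = c("obj", ".value"),
               names_pattern = "obj_(.*)_(size|cuteness)") %>%
  group_by(id) %>%
  # Largest object first within each experiment
  arrange(desc(size), .by_group = TRUE) %>%
  mutate(rank = c("max", "2nd", "min")) %>%
  ungroup() %>%
  select(-obj) %>%
  # Back to one row per experiment, named obj_<rank>_<attribute>
  pivot_wider(names_from = rank,
              values_from = c(size, cuteness),
              names_glue = "obj_{rank}_{.value}")
reordered
```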

how to add together dataframes within a list but only for matching dates

I have a list of data frames that I want to consolidate into one data frame. I am looking to solve two problems:
How to add together the columns
How to only include dates common to all the data frames within the list
This is what I have:
library(tidyverse)
library(lubridate)
df1 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04",
"2019-02-05")),
x = c(1, 2, 3, 4, 5),
y = c(2, 3, 4, 5, 6),
z = c(3, 4, 5, 6, 7)
)
df2 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-04", "2019-02-05")),
x = c(1, 2, 3, 4),
y = c(2, 3, 4, 5),
z = c(3, 4, 5, 6)
)
df3 <- data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04")),
x = c(1, 2, 3, 4),
y = c(2, 3, 4, 5),
z = c(3, 4, 5, 6)
)
dfl <- list(df1, df2, df3)
This is the output I am looking for:
data.frame(
date = ymd(c("2019-02-01", "2019-02-02", "2019-02-04")),
x = c(3, 6, 11),
y = c(6, 9, 14),
z = c(9, 12, 17)
)
I have tried inner_join and tried looping through the list but it got too complicated and I still didn't manage to land on the answer.
Is there a cleaner way to get to the final answer?
How about this?
bind_rows(dfl) %>%
group_by(date) %>%
mutate(n = 1) %>%
summarise_all(sum) %>%
filter(n == length(dfl)) %>%
select(-n)
## A tibble: 3 x 4
# date x y z
# <date> <dbl> <dbl> <dbl>
#1 2019-02-01 3 6 9
#2 2019-02-02 6 9 12
#3 2019-02-04 11 14 17
This assumes that there are no duplicate dates in a single data.frame of dfl.
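
A variant that finds the common dates explicitly, instead of counting rows per date, might look like the sketch below. `common_dates` and `result` are illustrative names, and dates are compared as character strings to sidestep class handling in `intersect()`. Note that a date duplicated within one data frame would still be summed twice here.

```r
library(dplyr)

df1 <- data.frame(
  date = as.Date(c("2019-02-01", "2019-02-02", "2019-02-03",
                   "2019-02-04", "2019-02-05")),
  x = c(1, 2, 3, 4, 5), y = c(2, 3, 4, 5, 6), z = c(3, 4, 5, 6, 7)
)
df2 <- data.frame(
  date = as.Date(c("2019-02-01", "2019-02-02", "2019-02-04", "2019-02-05")),
  x = c(1, 2, 3, 4), y = c(2, 3, 4, 5), z = c(3, 4, 5, 6)
)
df3 <- data.frame(
  date = as.Date(c("2019-02-01", "2019-02-02", "2019-02-03", "2019-02-04")),
  x = c(1, 2, 3, 4), y = c(2, 3, 4, 5), z = c(3, 4, 5, 6)
)
dfl <- list(df1, df2, df3)

# Dates (as strings) that appear in every data frame of the list
common_dates <- Reduce(intersect, lapply(dfl, function(d) format(d$date)))

result <- bind_rows(dfl) %>%
  filter(format(date) %in% common_dates) %>%
  group_by(date) %>%
  summarise(across(everything(), sum), .groups = "drop")
result
```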

Summary of N recent values

I am trying to get summary statistics (sum and max here) over the N most recent values.
Starting data:
dt = data.table(id = c('a','a','a','a','b','b','b','b'),
week = c(1,2,3,4,1,2,3,4),
value = c(2, 3, 1, 0, 5, 7,3,2))
Desired result:
dt = data.table(id = c('a','a','a','a','b','b','b','b'),
week = c(1,2,3,4,1,2,3,4),
value = c(2, 3, 1, 0, 5, 7,3,2),
sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))
For each row, by id, I would like the sum and max of the 2 (N = 2) most recent values before that row. The 4th (sum_recent2week) and 5th (max_recent2week) columns are my desired columns.
You can use rollsum and rollmax from the zoo package.
dt[, `:=`(sum_recent2week =
shift(rollsum(value, 2, align = 'left', fill = NA), 2),
max_recent2week =
shift(rollmax(value, 2, align = 'left', fill = NA), 2))
, id]
For the sum, if you're using data.table version >= 1.12, you can use data.table::frollmean: multiplying the rolling mean by the window size gives the rolling sum. The default for frollmean is fill = NA, so no need to specify that in this case.
dt[, `:=`(sum_recent2week =
shift(frollmean(value, 2, align = 'left')*2, 2),
max_recent2week =
shift(rollmax(value, 2, align = 'left', fill = NA), 2))
, id]
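
A data.table-only sketch of the same idea, assuming your data.table version provides `frollsum`: a right-aligned rolling sum shifted by one row gives the sum of the two previous values, and for a window of two the max is just the parallel maximum of the two lags. This is an alternative formulation, not the code above.

```r
library(data.table)

dt <- data.table(id = c('a','a','a','a','b','b','b','b'),
                 week = c(1, 2, 3, 4, 1, 2, 3, 4),
                 value = c(2, 3, 1, 0, 5, 7, 3, 2))

dt[, `:=`(
  # frollsum() is right-aligned by default; shifting by 1 excludes the current row
  sum_recent2week = shift(frollsum(value, 2), 1),
  # Max of the two previous values; NA until two previous rows exist
  max_recent2week = pmax(shift(value, 1), shift(value, 2))
), by = id]
dt
```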
I'm sure it can be done in a much more elegant way, but here is one tidyverse possibility:
dt %>%
group_by(id) %>%
mutate(sum_recent2week = lag(value + lead(value), n = 2),
max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
rowid_to_column() %>%
select(-week, -value) %>%
top_n(-2) %>%
right_join(dt %>%
rowid_to_column(), by = c("rowid" = "rowid",
"id" = "id")) %>%
select(-rowid)
id sum_recent2week max_recent2week week value
<chr> <dbl> <dbl> <dbl> <dbl>
1 a NA NA 1. 2.
2 a NA NA 2. 3.
3 a 5. 3. 3. 1.
4 a 4. 3. 4. 0.
5 b NA NA 1. 5.
6 b NA NA 2. 7.
7 b 12. 7. 3. 3.
8 b 10. 7. 4. 2.
First, it is computing the "sum_recent2week" and "max_recent2week" per group. Second, it selects the last two rows per group. Finally, it merges it with the original data.
Or if you want to compute it for all rows, not just for the last two rows per group:
dt %>%
group_by(id) %>%
mutate(sum_recent2week = lag(value + lead(value), n = 2),
max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))

R: new column based whether categorical levels of another column are the same or different from each other

I am having trouble creating a new column whose content depends on whether the levels of a factor in another column are the same or different within groups defined by 2 further columns.
Basically, I have a bunch of cows with different IDs that can have different parities. The quarter is the udder quarter affected by the disease, and I would like to create a new column with a result based on whether the quarters are the same, different, or occur only once. Any help would be appreciated. Code for an abbreviated data frame is below. The new column is the one I would like to achieve.
AnimalID <- c(10,10,10,10,12,12,12,12,14)
Parity <- c(8,8,9,9,4,4,4,4,2)
Udder_quarter <- c("LH","LH","RH","RH","LH","RH","LF","RF","RF")
new_column <- c("same quarter","same quarter","different quarter","different quarter","different quarter","different quarter","different quarter","different quarter","one quarter")
quarters<- data.frame(AnimalID,Parity,Udder_quarter,new_column)
structure(list(HerdAnimalID = c(100165, 100165, 100327, 100327,
100450, 100450), Parity = c(6, 6, 5, 5, 3, 3), no_parities = c(1,
1, 1, 1, 1, 1), case = c("1pathogen_lact", "1pathogen_lact",
"1pathogen_lact", "1pathogen_lact", "1pathogen_lact", "1pathogen_lact"
), FARM = c(1, 1, 1, 1, 1, 1), `CASE NO` = c("101", "101", "638",
"638", "593", "593"), MASTDATE = structure(c(1085529600, 1087689600,
1097884800, 1101254400, 1106092800, 1106784000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), QRT = c("LF", "LF", "RH", "LF", "LH",
"LH"), MastitisDiagnosis = c("Corynebacterium spp", "Corynebacterium spp",
"S. uberis", "S. uberis", "Bacillus spp", "Bacillus spp"), PrevCalvDate =
structure(c(1075334400,
1075334400, 1096156800, 1096156800, 1091145600, 1091145600), class =
c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = c("HerdAnimalID", "Parity",
"no_parities", "case", "FARM", "CASE NO", "MASTDATE", "QRT",
"MastitisDiagnosis", "PrevCalvDate"), row.names = c(NA, -6L), class =
c("tbl_df",
"tbl", "data.frame"))
library(dplyr)
quarters %>%
group_by(AnimalID) %>%
mutate(new_column = ifelse(n()==1, 'one quarter', NA)) %>%
group_by(Parity, .add = TRUE) %>% # `add` was renamed to `.add` in dplyr 1.0.0
mutate(new_column=ifelse(length(unique(Udder_quarter))==1 & is.na(new_column),
"same quarter",
ifelse(length(unique(Udder_quarter))>1,
"different quarter",
new_column))) %>%
data.frame()
Output is:
AnimalID Parity Udder_quarter new_column
1 10 8 LH same quarter
2 10 8 LH same quarter
3 10 9 RH same quarter
4 10 9 RH same quarter
5 12 4 LH different quarter
6 12 4 RH different quarter
7 12 4 LF different quarter
8 12 4 RF different quarter
9 14 2 RF one quarter
Sample data:
quarters <- structure(list(AnimalID = c(10, 10, 10, 10, 12, 12, 12, 12, 14
), Parity = c(8, 8, 9, 9, 4, 4, 4, 4, 2), Udder_quarter = structure(c(2L,
2L, 4L, 4L, 2L, 4L, 1L, 3L, 3L), .Label = c("LF", "LH", "RF",
"RH"), class = "factor")), .Names = c("AnimalID", "Parity", "Udder_quarter"
), row.names = c(NA, -9L), class = "data.frame")
I would use ave to do that:
f <- function(x) {
if (length(x)==1) return("one")
else if (all(x == x[1])) return("same")
else return("different")
}
with(quarters, ave(as.character(Udder_quarter), interaction(AnimalID, Parity), FUN = f))
# [1] "same" "same" "same" "same" "different"
# [6] "different" "different" "different" "one"
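
The same three-way rule can also be sketched with dplyr's `case_when()`, grouping by animal and parity together. Here `result` is an illustrative name, and "one quarter" is assigned when an animal/parity combination has a single row, which matches the `ave` version rather than the per-animal check in the dplyr answer.

```r
library(dplyr)

quarters <- data.frame(
  AnimalID = c(10, 10, 10, 10, 12, 12, 12, 12, 14),
  Parity = c(8, 8, 9, 9, 4, 4, 4, 4, 2),
  Udder_quarter = c("LH", "LH", "RH", "RH", "LH", "RH", "LF", "RF", "RF")
)

result <- quarters %>%
  group_by(AnimalID, Parity) %>%
  mutate(new_column = case_when(
    n() == 1                       ~ "one quarter",       # single record in the group
    n_distinct(Udder_quarter) == 1 ~ "same quarter",      # all records share a quarter
    TRUE                           ~ "different quarter"  # more than one quarter seen
  )) %>%
  ungroup()
result
```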

Resources