sum up values of compressed time series over time - r

I'll describe my problem via the code below. I have a 'compressed' time series in the data frame have: each row holds the start and end date of a period, plus a value that applies on every day of that period. I want to expand it into one row per day, as in the data frame want, and ultimately get the data frame ultimately_want, which sums the values for each date. Maybe I don't need want at all and can get straight to ultimately_want somehow? Thanks.
library(dplyr)
start_date <- as.Date(c("2004-08-02", "2004-08-03"))
end_date <- as.Date(c("2004-08-04", "2004-08-05"))
value <- c(5, 6)
have <- data.frame(start_date, end_date, value)
have
date <- as.Date(c("2004-08-02", "2004-08-03", "2004-08-04", "2004-08-03", "2004-08-04", "2004-08-05"))
value <- c(5, 5, 5, 6, 6, 6)
want <- data.frame(date, value)
want
ultimately_want <- want %>%
group_by(date) %>%
summarise(total = sum(value))
ultimately_want

Here is a data.table approach,
library(data.table)
setDT(have)[, .(value = value, date = seq(start_date, end_date, by = "day")),
by = 1:nrow(have)][,.(total = sum(value)), date][]
# date total
#1: 2004-08-02 5
#2: 2004-08-03 11
#3: 2004-08-04 11
#4: 2004-08-05 6
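For comparison, here is a minimal base R sketch of the same idea (expand each period into one row per day, then sum by date). It is not part of the answer above. The intermediate name expanded is just an illustrative choice, and the have data frame is rebuilt from the question so the snippet runs on its own.

```r
# Rebuild the 'compressed' input from the question
have <- data.frame(
  start_date = as.Date(c("2004-08-02", "2004-08-03")),
  end_date   = as.Date(c("2004-08-04", "2004-08-05")),
  value      = c(5, 6)
)

# Expand every row into one row per day of its period
expanded <- do.call(rbind, lapply(seq_len(nrow(have)), function(i) {
  data.frame(
    date  = seq(have$start_date[i], have$end_date[i], by = "day"),
    value = have$value[i]
  )
}))

# Sum the values that fall on the same date
ultimately_want <- aggregate(value ~ date, data = expanded, FUN = sum)
names(ultimately_want)[2] <- "total"
ultimately_want
```

The intermediate table here plays the role of want, so it is still built, just never by hand.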

Related

Calculate cumulative sum after a set period of time

I have a data frame with COVID data and I'm trying to make a column calculating the number of recovered people based off of the number of positive tests.
My data has a location, a date, and the number of tests administered/positive results/negative results each day. Here's a few lines using one location as an example (the real data has several months worth of dates):
loc date tests pos neg active
spot1 2020-04-10 1 1 0 5
spot1 2020-04-11 2 1 1 6
spot1 2020-04-12 0 0 0 6
spot1 2020-04-13 11 1 10 7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 active cases from 2020-04-10 are not active anymore, so I want a recovered column with 5. For each date I want the newly nonactive cases to be added.
My first thought was to try it in a loop:
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(loc) %>%
mutate(rec = for (i in 1:nrow(df)) {
#getting number of new cases
x <- df$pos[i]
#add 14 days to the date
d <- df$date + 14
df$rec <- sum(x)
})
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers, but they bear very little meaningful relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(loc) %>%
mutate(rec = map_dbl(date, ~sum(pos[(date <= . + 14) & date >= .])))
Which resulted in the same number printed on the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows two things:
you have a row for every date, even on days with 0 tests (12 April)
the active column already looks like a cumulative sum
Therefore I think you can simply use the lag function with n = 14.
Example code:
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
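If the active column cannot be relied on, the same shift can be applied to a running total of pos instead. The following is a base R sketch under two assumptions: there is exactly one row per location per day (no gaps), and every positive case counts as recovered 14 days later. The toy data and the helper shift_by are made up for illustration.

```r
# Toy data: two locations, 20 consecutive days, one positive test per day
df <- data.frame(
  loc  = rep(c("spot1", "spot2"), each = 20),
  date = rep(seq(as.Date("2020-04-10"), by = "day", length.out = 20), times = 2),
  pos  = 1
)
df <- df[order(df$loc, df$date), ]

# Hypothetical helper: shift a vector n positions later, padding with 0
shift_by <- function(x, n) {
  keep <- max(length(x) - n, 0)
  c(rep(0, length(x) - keep), head(x, keep))
}

# Recovered so far = cumulative positives as of 14 days earlier, per location
df$rec <- ave(df$pos, df$loc, FUN = function(p) shift_by(cumsum(p), 14))
df[df$loc == "spot1", ][13:17, ]
```

With this toy data the first nonzero rec value appears on 2020-04-24, 14 days after the first positive test.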
You could use aggregate to sum the specific column and apply
cut to set a 14-day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'), by = 1),
test = 1:30,
positive = 1:30,
active = 1:30)
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
cut(date, "14 days") positive
1 2020-04-01 105
2 2020-04-15 301
3 2020-04-29 59
my solution:
library(dplyr)
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <-
data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
date = date_seq,
pos = pos)
head(mydf)
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
max.Date <- as.Date(max)
sum(tbl %>%
filter(date >= max.Date - (daysBack - 1) &
date <= max.Date & loc == myloc) %>%
select(pos))
}
result <-
mydf %>%
group_by(date, loc) %>%
mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
mutate(date = as_date(date),
cut = cut(date, '14 days')) %>% # close mutate() before piping on
group_by(loc) %>%
arrange(cut) %>%
mutate(cum_pos = accumulate(pos, `+`)) # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate, where a for loop won't work because it returns nothing that mutate can use as a column. Instead of map_dbl you should check out purrr::accumulate. There are also specialized functions for this in base R, such as cumsum and cummin, though I find their behavior less predictable than purrr's.
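To make the running-total idea concrete without any packages, here is a small base R sketch. Reduce with accumulate = TRUE is used as a stand-in for purrr::accumulate (a substitution, not the answer's code), and the pos and loc vectors are invented for illustration.

```r
# Invented example data: daily positives for two locations
pos <- c(1, 1, 0, 1, 2, 0, 3)
loc <- c("spot1", "spot1", "spot1", "spot1", "spot2", "spot2", "spot2")

# Running total per location with cumsum
cum_pos <- ave(pos, loc, FUN = cumsum)

# Same running total written as an explicit accumulation
cum_pos_reduce <- ave(pos, loc, FUN = function(x) Reduce(`+`, x, accumulate = TRUE))

cum_pos
```

On plain numeric input like this, both versions give the same result.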

Last 3 month lags in r

The data is :
library(dplyr)
library(zoo) # provides rollapply
Category <- c(rep("A",4))
Month <- c(1,2,3,4)
Sales <- c(10,15,20,25)
df <- data.frame(Category,Month,Sales)
df <- df %>% filter(Category=='A') %>%
group_by(Month) %>%
summarise(Sales=sum(Sales,na.rm=TRUE)) %>%
mutate(lag_1 = dplyr::lag(Sales, 1),
lag_2 = dplyr::lag(Sales, 2),
lag_3 = dplyr::lag(Sales, 3),
lag_3_mean = rollapply(Sales,3,mean,align='right',fill=NA))
Present Output
I want the lag_3_mean to be the mean of last 3 months, not including the present month. For example, in Month 4 lag_3_mean = Average(Sales value in month 3,2,1).
The expected output should be:
Use a width of list(-(1:3)) to get offsets of -1, -2, -3.
rollapplyr(Sales, list(-(1:3)), mean, fill = NA)
Note that this recent question is very similar: "Variable frameshift rolling average for multiple variables".
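To see what those offsets mean, here is a base R sketch that computes the same quantity by hand for the Sales vector from the question: the mean of the three previous months, excluding the current one. It is only a cross-check of the idea, not a replacement for rollapplyr.

```r
Sales <- c(10, 15, 20, 25)

# For month i, average months i-3, i-2 and i-1; NA when fewer than 3 exist
lag_3_mean <- sapply(seq_along(Sales), function(i) {
  if (i > 3) mean(Sales[(i - 3):(i - 1)]) else NA_real_
})
lag_3_mean
```

For month 4 this averages 10, 15 and 20, which is the value the question asks for.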

Sequence a group of dates in R

I wish to generate some Tidy data.
26 companies are observed every day for 10 days.
Each day a value is recorded.
The first day is: 2020/1/1
How do I create a list of dates so that the first 26 rows of the date column of the data frame are "2020/1/1" (Year, Month, Day), the next 26 rows are "2020/1/2", etc.?
Here is the data frame without the date column:
library(tidyverse)
set.seed(33)
date_chunk <- rep(as.Date("2020/1/1"), 26)
# Tidy data. 10 sequential days starting 2020/1/1/
df <- tibble(
company = rep(letters, 10),
value = sample(0:5, 260, replace = TRUE),
color = "grey"
)
You can try this
rep(seq(as.Date("2020-01-01"),as.Date("2020-01-10"),1),each=26)
This returns a vector of dates from 2020-01-01 to 2020-01-10, with each date repeated 26 times.
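As a quick check of how that vector lines up with the rows, here is a base R sketch that attaches it to a cut-down version of the question's data. A plain data.frame is used instead of a tibble, and the value and color columns are omitted. The names n_companies and n_days are illustrative.

```r
n_companies <- 26
n_days <- 10

# One date per day, each repeated once per company
date <- rep(seq(as.Date("2020-01-01"), by = "day", length.out = n_days),
            each = n_companies)

df <- data.frame(company = rep(letters, times = n_days), date = date)

head(df, 3)
table(df$date)
```

Each date should appear 26 times and each company 10 times.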
For each company we can add row_number() to first date_chunk to get an incremental sequence of dates.
library(dplyr)
df %>%
group_by(company) %>%
mutate(date = first(date_chunk) + row_number() - 1)

summarizing data.table - creating multiple columns subset by date in R

I have data about ID and the corresponding amount over multiple years. Something like this:
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:10000, 15))
Date <- c("2016-01-22","2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02",
"2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22",
"2016-10-22","2017-10-25", "2018-10-22")
Now, I want to analyse every year of every ID. Specifically, I am interested in the amount. For one, I want to know the overall amount for every year. Then, I also want to know the overall amount for the first 11 months, first 10 months, first 9 months and first 8 months of every year. For this purpose I have calculated the cumulative sum for every ID per year as follows:
library(data.table)
# data.table() keeps amount numeric; cbind() would coerce every column to character
myData <- data.table(ID, amount, Date)
# create cumsum per ID per year
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData <- myData[order(ID, Date)]
myData[, CumSum := cumsum(amount), by = .(ID, year(Date))]
How can I summarise the data.table such that I get columns amount9month, amount10month, amount11month for every ID in every year?
Between cumsum, by and dcast this is almost straightforward. The most difficult bit is dealing with the months that have no data. Hence this solution isn't as brief as it might have been, but it does things the "data.table way" and avoids slow operations like looping through rows.
# Just sort the formatting out first
myData[, Date:=as.Date(Date)]
myData[, `:=`(amount = as.numeric(amount),
year = year(Date),
month = month(Date))]
bycols <- c('ID', 'year', 'month')
# Summarise all transactions for the same ID in the same month
summary <- myData[, .(amt = sum(amount)), by=bycols]
# Create a skeleton table with all possible combinations of ID, year and month, to fill in any gaps.
skeleton <- myData[, CJ(ID, year, month = 1:12, unique = TRUE)]
# Join the skeleton to the actual data, to recreate the data but with no gaps in
result.long <- summary[skeleton, on=bycols, allow.cartesian=TRUE]
result.long[, amt.cum:=cumsum(fcoalesce(amt, 0)), by=c('ID', 'year')]
# Cast the data into wide format to have one column per month
result.wide <- dcast(result.long, ID + year ~ paste0('amount',month,'month'), value.var='amt.cum')
NB. If you don't have fcoalesce, update your data.table package.
In which format do you want it? You can get the requested result easily in two different formats:
# Prepare the data
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- c(sample(1:1, 15, replace = TRUE))
Date <- c("2016-01-22","2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02", "2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22", "2017-10-22", "2018-08-22", "2016-10-22","2017-10-25", "2018-10-22")
myData <- data.frame(ID, amount, Date)
# Add year column
myData$Date <- as.Date(myData$Date, format = "%Y-%m-%d")
myData$year <- format(myData$Date,"%Y")
Please note that I changed the amounts for testing purposes. Now two solutions.
# Format 1
by(myData$amount, list(myData$ID, myData$year), cumsum, simplify = TRUE)
# Format 2
aggregate(myData$amount, list(ID = myData$ID, Date = myData$year), cumsum)
However, you might want the result as a new column in the data frame. You can do that with a loop:
# Format: New column
myData <- myData[order(myData$year, myData$ID),] # sort by year and ID
myData$cumsum <- rep(0, nrow(myData))
for (r in 1:nrow(myData)) {
if (r > 1 && myData$year[r-1] == myData$year[r] && myData$ID[r-1] == myData$ID[r])
myData$cumsum[r] <- myData$cumsum[r-1] + myData$amount[r]
else
myData$cumsum[r] <- myData$amount[r]
}
I do not know a smooth solution in base R. Maybe someone from the "dplyr faction" has a neat trick up their sleeve?
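For what it's worth, base R's ave can compute a grouped running total without an explicit loop. The sketch below rebuilds the test data from this answer (all amounts set to 1) and sorts by ID and Date first, on the assumption that the running total should follow date order within each ID and year.

```r
ID <- c(rep("A", 5), rep("B", 7), rep("C", 3))
amount <- rep(1, 15)
Date <- as.Date(c("2016-01-22", "2016-07-25", "2016-09-22", "2017-10-22", "2017-01-02",
                  "2016-08-22", "2016-09-22", "2016-10-22", "2017-08-22", "2017-09-22",
                  "2017-10-22", "2018-08-22", "2016-10-22", "2017-10-25", "2018-10-22"))
myData <- data.frame(ID, amount, Date)
myData$year <- format(myData$Date, "%Y")

# Sort so that the running total follows time within each ID
myData <- myData[order(myData$ID, myData$Date), ]

# Cumulative sum of amount within every ID/year combination
myData$cumsum <- ave(myData$amount, myData$ID, myData$year, FUN = cumsum)
myData
```

Because ave returns a vector aligned with the input rows, the result can be assigned straight to a new column.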

How to check if an id comes into data on a particular date that it stays until an exit date

I have a data set that looks something like the one below. Basically, I want to check that if a particular id is present at the beginning of the year (in this case Jan 1, 2003), it is present every day until the end of the year (Dec 31, 2003). The check should then start over at the beginning of the next year, since people might change from year to year but should not change within a year. If an id is not present on a certain day, I would like to know which day and which id.
I first started with a for loop and checked every two days but this is super inefficient since my data set spans roughly 50 years and will grow later on with new data.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
Edit: The above chunk has all the dates in it, but if I delete, for example, id = 1 on the second day, the code should tell me it is missing, so the count shouldn't be the same. I added the piece to delete id = 1 on the second day below.
df <- df[-4,]
The code below will make the same data set but delete id = 1 for jan 2, 2003 and jan 3, 2003. I am trying to get something that returns the id that is missing and the date.
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
df <- df[-4,]
df <- df[-6,]
This code chunk counts the number of times a person appears in each year. If the answer is 365 (or 366 in leap years), the person was there every day of the year.
library(dplyr)
library(tidyr)
dates <- rep(seq(as.Date("2003/01/01"), as.Date("2004/12/31"), "days"),each = 3)
id <- rep(1:3,times = length(unique(dates)))
df <- data.frame( dates = dates,id = id)
dfx <- df %>%
mutate(yrs = lubridate::year(dates)) %>%
group_by(id, dates) %>%
filter(row_number()==1) %>%
group_by(id, yrs) %>%
tally
#remove values
dfa <- df[c(-4,-6),]
Then, in order to find the dates of the missing values, add an indicator column to the data set and fill in the missing dates by id. After this, the val column will have missing values. Filter the data to get the dates where an id went missing.
dfx <- dfa %>%
mutate(val = 1) %>%
complete(nesting(id),
dates = seq(min(dates),max(dates),by = "day")) %>%
filter(is.na(val))
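The same "build every expected id/date pair, then see which ones are absent" idea can be sketched in base R, restarting the check for each year as the question asks. This is an illustration on a shortened 10-day range, not the answer's code. It assumes an id is expected on every date that appears in the data for a year in which that id occurs, so a date missing for all ids at once would go unnoticed unless the dates are generated with seq instead.

```r
# Shortened example: 3 ids over 10 days, with two rows removed
dates <- rep(seq(as.Date("2003-01-01"), as.Date("2003-01-10"), by = "day"), each = 3)
df <- data.frame(dates = dates, id = rep(1:3, times = 10))
dfa <- df[-c(4, 6), ]            # drops id 1 and id 3 on 2003-01-02
dfa$yr <- format(dfa$dates, "%Y")

# Every id seen in a year, paired with every date seen in that year
expected <- merge(unique(dfa[c("id", "yr")]),
                  unique(dfa[c("dates", "yr")]),
                  by = "yr")

# Keep the expected pairs that do not occur in the data
present <- paste(dfa$id, dfa$dates)
missing_rows <- expected[!paste(expected$id, expected$dates) %in% present,
                         c("id", "dates")]
missing_rows
```

Here missing_rows should list ids 1 and 3 on 2003-01-02, the two rows that were removed.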
