I have a panel data set of daily revenue (and other variables) by ID, where days with 0 revenue go unreported. I want to fill in these blanks with 0 for my analysis, meaning that for each ID's time series I need to make sure there is an observation for every day. Each series can begin or end on a date distinct from the other series. I have been attempting to use the "padr" package, but I keep getting an "unused argument" error with the following sample code:
library(padr)
library(dplyr)
library(lubridate) # needed for today()
#unbalanced panel data
ID <- c(1,1,1,1,
        2,2,2,2,2,2,
        3,3,3,3,3,3,3,
        4,4,4)
DT <- today() + c(1,3,4,5,        #ID = 1
                  3,4,7,8,9,10,   #ID = 2
                  2,5,6,7,8,9,10, #ID = 3
                  8,10,11)        #ID = 4
#The end date denotes the max date for each ID
EndDT <- today() + c(5,5,5,5,              #ID = 1
                     13,13,13,13,13,13,    #ID = 2
                     10,10,10,10,10,10,10, #ID = 3
                     15,15,15)             #ID = 4
#random variables v1 and v2 to represent revenue and other variables
set.seed(1)
v1 <- rnorm(20,mean = 10000, sd = 5)
v2 <- rnorm(20,mean = 5000, sd = 1.5)
df <- as.data.frame(cbind(ID,DT,EndDT,v1,v2))
#format to simpler date
df$DT <- as.Date(DT, origin="1970-01-01")
df$EndDT <- as.Date(EndDT, origin="1970-01-01")
df_padded <- arrange(df, ID, DT) %>%
  pad(by = 'DT', group = 'ID', end_val = 'EndDT') %>%
  fill_by_value(v1, v2, value = 0)
My error message:
Error in pad(., by = "DT", group = "ID", end_val = "EndDT") :
unused argument (group = "ID")
Answers not involving padr are also highly welcome.
After tussling with padr for a while, I decided to write my own function. It works for the example set, though I quickly ran into issues with real data. Either way, I figured it may be of use to someone else, so here it is:
date.pad <- function(df, date.var, group, replace.vars, new.val = 0){
  require("dplyr")
  require("lazyeval")
  require("lubridate")
  tempdf1 <- arrange_(df, group, date.var)
  finaldf <- tempdf1[0, ]
  unique.id <- unique(tempdf1[, group])
  nonreplaced.vars <- setdiff(colnames(tempdf1), replace.vars)
  nonreplaced.vars <- nonreplaced.vars[!nonreplaced.vars == date.var]
  for(i in seq_along(unique.id)){
    # build the filter criterion with NSE; note unique.id[i] rather than the
    # bare loop index i, so groups whose values are not 1..n still match
    filter_criteria <- interp(~y == x, .values = list(y = as.name(group), x = unique.id[i]))
    tempdf2 <- filter_(tempdf1, filter_criteria)
    min.date <- min(tempdf2[[date.var]])
    max.date <- max(tempdf2[[date.var]])
    all.days <- as.Date(seq(min.date, max.date, by = "days"), origin = "1970-01-01")
    distinct.days <- unique(tempdf2[, date.var])
    # the days missing from this group's series
    app.days <- as.Date(setdiff(all.days, distinct.days), origin = "1970-01-01")
    tempdf3 <- tempdf2[0, ]
    for(n in seq_along(app.days)){
      tempdf3[n, date.var] <- app.days[n]
    }
    # carry the non-replaced variables over from the group's first row
    for(j in seq_along(nonreplaced.vars)){
      tempdf3[1:nrow(tempdf3), nonreplaced.vars[j]] <- tempdf2[1, nonreplaced.vars[j]]
    }
    finaldf <- bind_rows(finaldf, tempdf3)
  }
  finaldf[replace.vars] <- new.val
  finaldf <- bind_rows(finaldf, df) %>% arrange_(group, date.var)
  return(finaldf)
}
for.exmpl <- date.pad(df=df, date.var="DT", group="ID", replace.vars=c("v1","v2"), new.val=0)
for.exmpl
Here is a new answer I've devised that is far more broadly applicable and uses far less code:
library(tidyverse)
temp <- group_by(df1, ID) %>%
  complete(DT = seq.Date(min(DT), max(EndDT), by = "day")) %>%
  fill(EndDT, sometext) %>%
  arrange(ID, DT)
temp[is.na(temp)] <- 0
View(temp)
Which results in:
# A tibble: 33 x 6
# Groups: ID [4]
ID DT EndDT v1 v2 sometext
<dbl> <date> <date> <dbl> <dbl> <chr>
1 1. 2018-05-04 2018-05-08 9997. 5001. textvar
2 1. 2018-05-05 2018-05-08 0. 0. textvar
3 1. 2018-05-06 2018-05-08 10001. 5001. textvar
4 1. 2018-05-07 2018-05-08 9996. 5000. textvar
5 1. 2018-05-08 2018-05-08 10008. 4997. textvar
6 2. 2018-05-06 2018-05-16 10002. 5001. textvar
7 2. 2018-05-07 2018-05-16 9996. 5000. textvar
8 2. 2018-05-08 2018-05-16 0. 0. textvar
9 2. 2018-05-09 2018-05-16 0. 0. textvar
10 2. 2018-05-10 2018-05-16 10002. 5000. textvar
# ... with 23 more rows
(Please ignore the "sometext" variable and the df1 name: df1 is just the question's df with an extra text column I created while testing my function above.)
Your code does not run because you pass a character string to the end_val argument. It should be a Date, and you can only specify a single date for all the groups.
In order to do what you want with padr, you should combine the DT and EndDT columns. This way for each ID its final date is present in the DT column:
df %>%
  group_by(ID) %>%
  summarise(DT = max(EndDT)) %>%
  mutate(v1 = NA, v2 = NA) %>%
  bind_rows(df %>% select(-EndDT), .) %>%
  group_by(ID, DT) %>%
  filter(row_number() == 1) %>%
  group_by(ID) %>%
  pad()
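If you then want the question's zeros rather than NAs, padr's fill_by_value (as in the original attempt) can be appended to that chain; a minimal sketch, assuming the padded result above is saved as df_padded:
df_padded %>%
  fill_by_value(v1, v2, value = 0)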
I have a data frame with COVID data and I'm trying to make a column calculating the number of recovered people based on the number of positive tests.
My data has a location, a date, and the number of tests administered / positive results / negative results each day. Here are a few lines using one location as an example (the real data has several months' worth of dates):
loc date tests pos neg active
spot1 2020-04-10 1 1 0 5
spot1 2020-04-11 2 1 1 6
spot1 2020-04-12 0 0 0 6
spot1 2020-04-13 11 1 10 7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 cases active on 2020-04-10 are no longer active, so I want a recovered column showing 5. For each date, I want the newly inactive cases added to that count.
My first thought was to try it in a loop:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = for (i in 1:nrow(df)) {
    #getting number of new cases
    x <- df$pos[i]
    #add 14 days to the date
    d <- df$date + 14
    df$rec <- sum(x)
  })
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers, but they bear very little relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = map_dbl(date, ~sum(pos[(date <= . + 14) & date >= .])))
That resulted in the same number repeated down the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows that:
- the dates are continuous, even on days with 0 tests (12 April)
- the active column already looks like a cumulative sum
Therefore I think you can simply use the lag function with n = 14. Example code:
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
You could use aggregate to sum the relevant column, applying cut to set a 14-day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
                 date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'), by = 1),
                 test = 1:30,
                 positive = 1:30,
                 active = 1:30)
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
cut(date, "14 days") positive
1 2020-04-01 105
2 2020-04-15 301
3 2020-04-29 59
My solution:
library(dplyr)
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <- data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
                   date = date_seq,
                   pos = pos)
head(mydf)
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
  max.Date <- as.Date(max)
  sum(tbl %>%
        filter(date >= max.Date - (daysBack - 1) &
                 date <= max.Date & loc == myloc) %>%
        select(pos))
}
result <- mydf %>%
  group_by(date, loc) %>%
  mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
  mutate(date = as_date(date),
         cut = cut(date, '14 days')) %>%
  group_by(loc) %>%
  arrange(cut) %>%
  mutate(cum_pos = accumulate(pos, `+`)) # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate: that won't work. Instead of map_dbl you should check out purrr::accumulate. There are specialized functions for this in base R, such as cumsum and cummin, but purrr's accumulate is more general and, in my experience, more predictable.
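For instance, a minimal sketch contrasting the two (the plain running sums agree; accumulate also handles recurrences that cumsum cannot express):
library(purrr)
x <- c(1, 3, 0, 2)
cumsum(x)                     # 1 4 4 6
accumulate(x, `+`)            # 1 4 4 6, the same running sum
accumulate(x, ~ .x / 2 + .y)  # a decayed running total, beyond cumsum's reach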
I have a question about filtering on dates in R. I found, e.g., this link:
dplyr filter on Date, which answers the question of how to filter a specific date range with dplyr. I would like to select a dynamic range instead, e.g. calculate the number of critical jobs in a specific window such as the last seven days, starting from the current date in the dataset. The code I have in mind would look something like this:
my.data %>%
  group_by(category) %>%
  filter(date > date - days(7) & date <= date) %>%
  mutate(ncrit = sum(critical == 'yes'))
This is not working properly. Is there a way to get this running with dplyr?
Edit:
Apologies for the unclear post. To complete the post, first the idea: imagine computers running jobs. If a computer has failed to complete jobs in the past x days, it is more likely to fail on the current job as well. A dummy dataset includes the computer category (e.g. A/B), the date, and failure (yes/no).
Using the dataset from Rui Barradas, I would like to add the following column, "number of critical jobs in past 3 days" (in this case x = 3), with dplyr:
head(my.data, 7)
category date critical number of critical jobs in past 3 days
1 A 2018-08-14 yes NA
2 A 2018-08-15 no NA
3 A 2018-08-16 yes NA
4 A 2018-08-17 no 2
5 A 2018-08-18 yes 1
6 A 2018-08-19 no 2
7 A 2018-08-20 yes 1
Data (Rui Barradas):
set.seed(3635)
my.data <- data.frame(category = rep(c('A', 'B'), each = 10),
                      date = rep(seq(Sys.Date() - 9, Sys.Date(), by = 'days'), 2),
                      critical = sample(c('no', 'yes'), 20, TRUE))
Without an example dataset it's not very easy to say, but given your description of the problem I believe the following is on the right track.
The code uses function rollapplyr from package zoo, inspired not by the accepted but by the second answer to this question.
library(zoo)
library(dplyr)
sumCrit <- function(DF, crit = "yes", window = 3){
  DF %>%
    group_by(category) %>%
    # count critical days over the previous `window` days (offsets -1 .. -window)
    mutate(ncrit = rollapplyr(critical == crit, list(-seq(window)), sum, fill = NA))
}
result <- sumCrit(my.data)
head(result, 7)
## A tibble: 7 x 4
## Groups: category [1]
# category date critical ncrit
# <fct> <date> <fct> <int>
#1 A 2018-08-14 yes NA
#2 A 2018-08-15 no NA
#3 A 2018-08-16 yes NA
#4 A 2018-08-17 no 2
#5 A 2018-08-18 yes 1
#6 A 2018-08-19 no 2
#7 A 2018-08-20 yes 1
Data.
This is a made-up dataset meant to test the code above.
set.seed(3635) # Make the results reproducible
my.data <- data.frame(category = rep(c("A", "B"), each = 10),
date = rep(seq(Sys.Date() - 9, Sys.Date(), by = "days"), 2),
critical = sample(c("no", "yes"), 20, TRUE))
Data generation
DATE1 <- as.Date("2018-08-23")
DATE2 <- as.Date("2018-07-23")
# creating a date range with the start and end date:
dates <- seq(DATE2, DATE1, by = "days")
dt <- data.frame(category = sample(1:6, 32, replace = TRUE), deadline = dates)
Filter the dates
library("tidyverse")
dt %>%
  group_by(category) %>%
  filter(deadline %in% seq(Sys.Date() - 7, Sys.Date(), by = "days"))
Using the dataset that Rui Barradas created, here is a lubridate formulation using intervals:
set.seed(3635) # Make the results reproducible
my.data <- data.frame(category = rep(c("A", "B"), each = 10),
                      date = rep(seq(Sys.Date() - 9, Sys.Date(), by = "days"), 2),
                      critical = sample(c("no", "yes"), 20, TRUE))
library(lubridate) #use lubridate to create intervals
INT_check<-interval(Sys.Date()-7,Sys.Date()) # 7 days from today
my.data %>%
  filter(date %within% INT_check) %>%
  group_by(category) %>%
  summarise(ncrit = sum(critical == 'yes'))
You can also specify INT_check as:
INT_check<-interval("2018-08-16","2018-08-18") # if you want to use absolute dates
INT_check<-interval("2018-08-16",Sys.Date()) # if you want to specify just absolute start date
I have created a reproducible example to illustrate the problem I am having with non-standard evaluation in R (dplyr). I would like to use dynamic variable names in the scenario below:
# Given a data frame of patient data, I need to find records containing date logic errors.
# My datasets are enormous but here is a tiny example
patientData <- data.frame(
  patientID = 1:20,
  birth_d = seq(as.Date("2010-01-01"), by = 90, length.out = 20),
  treat_d = seq(as.Date("2011-01-01"), by = 90, length.out = 20),
  death_d = seq(as.Date("2012-01-01"), by = 90, length.out = 20)
)
# To create some random records that will be in error (death_d before birth_d, birth_d after treat_d, etc):
patientData$birth_d[5] <- as.Date("2017-01-01")
patientData$death_d[7] <- as.Date("2001-01-01")
patientData$treat_d[10] <- as.Date("2018-01-01")
patientData$birth_d[12] <- as.Date("2018-05-05")
# To determine which records have birth_d after death_d I could do the following:
badRecords <- patientData %>% filter(death_d < birth_d)
OR
badRecords <- patientData %>% mutate(dateDiff = death_d - birth_d) %>% filter(dateDiff < 0)
# But in my large application (with lots and lots of date variables)
# I want to be able to use the date field names as *variables* and, using one date pair at a time,
# determine which records have dates out of sequence. For example,
firstDateName <- "birth_d"
secondDateName <- "death_d"
# I would like to do this, but it doesn't work
badRecords <- patientData %>% filter(!!firstDateName > !!secondDateName)
# This doesn't work...
badRecords <- patientData %>% mutate(dateDiff = !!secondDateName - !!firstDateName) %>% filter(dateDiff < 0)
# Neither does this... it creates a dateDiff data frame.. with 20 duplicate records
badRecords <- patientData %>% mutate(dateDiff = .[secondDateName] - .[firstDateName]) %>% filter(dateDiff < 0)
1) rlang: Use sym like this:
library(dplyr)
library(rlang)
firstDateName <- sym("birth_d")
secondDateName <- sym("death_d")
badRecords <- patientData %>% filter(!!firstDateName > !!secondDateName)
giving:
> badRecords
patientID birth_d treat_d death_d
1 5 2017-01-01 2011-12-27 2012-12-26
2 7 2011-06-25 2012-06-24 2001-01-01
3 12 2018-05-05 2013-09-17 2014-09-17
2) Base R: Alternatively, in base R:
firstDateName <- "birth_d"
secondDateName <- "death_d"
is.bad <- patientData[[firstDateName]] > patientData[[secondDateName]]
badRecords <- patientData[is.bad, ]
2a) subset: Another base solution would be to replace the last two lines above with:
subset(patientData, get(firstDateName) > get(secondDateName))
Here is one option with parse_expr from rlang
library(rlang)
library(dplyr)
patientData %>%
  filter(!! parse_expr(paste(firstDateName, ">", secondDateName)))
# patientID birth_d treat_d death_d
#1 5 2017-01-01 2011-12-27 2012-12-26
#2 7 2011-06-25 2012-06-24 2001-01-01
#3 12 2018-05-05 2013-09-17 2014-09-17
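Since the question mentions many date pairs, either approach can be wrapped in a small helper; here is a sketch using sym (find_bad_records is a hypothetical name):
library(dplyr)
library(rlang)
# hypothetical helper: rows where the first date falls after the second
find_bad_records <- function(df, first_name, second_name) {
  df %>% filter(!!sym(first_name) > !!sym(second_name))
}
find_bad_records(patientData, "birth_d", "death_d")
find_bad_records(patientData, "treat_d", "death_d")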
I'm trying to group ids with date in this dataset, but I want to summarise based on one of the features outside of the group.
library(dplyr)
library(lubridate)
set.seed(100)
df <- data.frame(ids = sample(c('436247', '2465347', '346654645'), 10000, replace = TRUE),
                 date = sample(seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by = 1), 10000, replace = TRUE))
new_df <- df %>%
  group_by(ids, date) %>%
  summarise(events = length(ids[date >= date - 30 & date <= date]))
I'm trying to take this data frame and answer the question: "for each of the ids, and each date, how many other records within that id are within the past 30 days of that date?" Unfortunately, when I group_by both ids and date, it only looks within the grouped date. I've created the solution below, but I'm not sure whether there is a better one with dplyr.
groupby_function <- function(df, spec_date){
  result <- df %>%
    group_by(ids) %>%
    summarise(events = length(ids[date >= spec_date - 30 & date <= spec_date])) %>%
    mutate(date = spec_date)
  return(result)
}
date_vector <- seq.Date(ymd('2018-03-01'), ymd('2018-05-01'), by=1)
list_results <- lapply(date_vector, groupby_function, df=df)
x <- do.call(rbind, list_results)
"for each of the ids, and each date, how many other records within that id, are within the past 30 days of that date"
For that, a "join by" condition makes sense, but isn't yet included in dplyr. Until it is, you could use data.table inside your dplyr chain:
# enumerate id-date combos of interest
grid_df = expand.grid(
  id = unique(df$ids),
  d = seq(min(df$date), max(df$date), by = "day")
)
# helper function
library(data.table)
count_matches = function(DF, targetDF, ...){
  onexpr = substitute(list(...))
  data.table(targetDF)[DF, on = eval(onexpr), .N, by = .EACHI]$N
}
# use a non-equi join to count matching rows
res = grid_df %>%
  mutate(d_dn = d - 30) %>%
  mutate(n = count_matches(., df, ids = id, date >= d_dn, date <= d)) %>%
  as_tibble()
# A tibble: 186 x 4
id d d_dn n
<fctr> <date> <date> <int>
1 436247 2018-03-01 2018-01-30 72
2 2465347 2018-03-01 2018-01-30 69
3 346654645 2018-03-01 2018-01-30 51
4 436247 2018-03-02 2018-01-31 123
5 2465347 2018-03-02 2018-01-31 120
6 346654645 2018-03-02 2018-01-31 100
7 436247 2018-03-03 2018-02-01 170
8 2465347 2018-03-03 2018-02-01 166
9 346654645 2018-03-03 2018-02-01 154
10 436247 2018-03-04 2018-02-02 228
# ... with 176 more rows
Writing either ids = id or ids == id should work fine for equality conditions, I think.
If you're interested, the syntax is x[i, on=, j, by=.EACHI] where x and i are tables. For each row of i, we look up rows of x based on the on= criteria (left-hand side refers to columns in x; right-hand to columns in i); then we do j for each ("by each row of i" so by=.EACHI). In this case, j = .N means that we count matched rows of x, returned as a column of counts N.
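A toy illustration of that pattern (x and i here are hypothetical tables; the on= condition mixes an equality with a non-equi comparison):
library(data.table)
x <- data.table(id = c(1, 1, 2), v = c(10, 20, 30))
i <- data.table(id = c(1, 2), cutoff = c(15, 15))
# for each row of i, count the rows of x with the same id and v <= cutoff
x[i, on = .(id, v <= cutoff), .N, by = .EACHI]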
You can look at the "ungrouped" data by just going back to the original data frame (calling df$date or df$ids). So I think what you are after is:
test_df <- df %>%
  group_by(ids, date) %>%
  summarise(events = length(df$ids[df$date >= date[1] - 30 & df$date <= date[1] & df$ids == ids[1]]))
Also, I ran your proposed function, but I did not see any difference in the result from your original group_by solution, so I don't think that is what you want.
If a 'non dplyr' solution is acceptable, this gives you what you want.
df$diff <- as.vector(
  sapply(unique(df$ids), function(x)
    sapply(df$date[df$ids == x], function(y)
      sum(abs(y - df$date[df$ids == x]) >= 30)
    )
  )
)
Alternatively, in dplyr, you can get a result like the above using:
f <- function(x) {
  sapply(x, function(y) sum(abs(y - x) >= 30))
}
df$diff <- unlist(
  df %>%
    group_by(ids) %>%
    do(diff = f(.$date)) %>%
    .$diff
)
Here's an answer, but it assumes there is a continuous sequence of dates in each id.
df %>%
  group_by(ids, date) %>%
  count() %>%
  arrange(ids, date) %>%
  group_by(ids) %>%
  mutate(
    events = cumsum(n) - cumsum(lag(n, 30, 0))
  )
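If the dates are not continuous, you can satisfy that assumption first; a sketch using tidyr::complete to insert the missing days with n = 0:
library(tidyverse)
df %>%
  count(ids, date) %>%
  group_by(ids) %>%
  complete(date = seq.Date(min(date), max(date), by = "day"),
           fill = list(n = 0)) %>%
  arrange(ids, date) %>%
  mutate(events = cumsum(n) - cumsum(lag(n, 30, 0)))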
I need to aggregate multiple months of data in a data frame in R, e.g. a data frame with dates spanning 2017 and 2018:
date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616
What I want is to sum based on six-month period and category:
period category sum
1 2017_secondPeriod A 25.00972
2 2018_firstPeriod A 25.59850
3 2017_secondPeriod B 24.96924
4 2018_firstPeriod B 24.79649
5 2017_secondPeriod c 20.17096
6 2018_firstPeriod c 27.01794
What I did:
1. select the second 6 months of 2017, and likewise the first 6 months of 2018
2. add a new column to each subset to indicate the period
3. combine the 2 subsets again
4. aggregate
as follows:
library(lubridate)
df <- data.frame(
  date = today() + days(1:300),
  category = c("A", "B", "c"),
  amt = runif(300)
)
df2017_secondHalf <- subset(df, month(df$date) %in% c(7,8,9,10,11,12) & year(df$date) == 2017)
df2018_firstHalf <- subset(df, month(df$date) %in% c(1,2,3,4,5,6) & year(df$date) == 2018)
sum1 <- aggregate(df2017_secondHalf$amt, by = list(category = df2017_secondHalf$category), FUN = sum)
sum2 <- aggregate(df2018_firstHalf$amt, by = list(category = df2018_firstHalf$category), FUN = sum)
df2017_secondHalf$period <- '2017_secondPeriod'
df2018_firstHalf$period <- '2018_firstPeriod'
df_periods <- rbind(df2017_secondHalf, df2018_firstHalf)
aggregate(x = df_periods$amt, by = df_periods[c("period", "category")], FUN = sum)
I can't figure out how to aggregate over multiple months, e.g. 3 months or 6 months. Any suggestions? Thanks in advance.
With lubridate and tidyverse (dplyr & magrittr)
First, let's create groups with Semesters, Quarter, and "Trimonthly".
library(tidyverse)
library(lubridate)
df <- df %>% mutate(Semester = semester(date, with_year = TRUE),
                    Quarter = quarter(date, with_year = TRUE),
                    Trimonthly = round_date(date, unit = "3 months"))
Lubridate's semester() breaks dates into semesters, labelling each 1 (Jan-Jun) or 2 (Jul-Dec); quarter() does the same with quarters.
I add a third, the more basic round_date function, where you specify the time frame as a size and unit; it rounds each date to the nearest such boundary (floor_date would give you the start of the frame instead). I deliberately name it "Trimonthly" so you can see how it compares to quarter().
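For instance, a quick illustration of what the three groupings return for a single date:
d <- as.Date("2017-08-05")
semester(d, with_year = TRUE)    # 2017.2, the second semester of 2017
quarter(d, with_year = TRUE)     # 2017.3, the third quarter of 2017
round_date(d, unit = "3 months") # "2017-07-01", the nearest 3-month boundary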
Pivot.Semester <- df %>%
  group_by(Semester, category) %>%
  summarise(Semester.sum = sum(amt))
Pivot.Quarter <- df %>%
  group_by(Quarter, category) %>%
  summarise(Quarter.sum = sum(amt))
Pivot.Trimonthly <- df %>%
  group_by(Trimonthly, category) %>%
  summarise(Trimonthly.sum = sum(amt))
Pivot.Semester
Pivot.Quarter
Pivot.Trimonthly
Optional: if you want to join the summarised data back onto the original df:
df <- df %>% left_join(Pivot.Semester, by = c("category", "Semester")) %>%
  left_join(Pivot.Quarter, by = c("category", "Quarter")) %>%
  left_join(Pivot.Trimonthly, by = c("category", "Trimonthly"))
df
Here is a 3 line solution that uses no packages. Let k be the number of months in a period: for half-year periods k is 6, for quarter-year periods k would be 3, etc. Replace the 02 in the sprintf format with 1 if you want one-digit suffixes (but not for monthly periods, since those must be two digits). Further modify the sprintf format if you want it to exactly match the question.
k <- 6
period <- with(as.POSIXlt(DF$date), sprintf("%d-%02d", year + 1900, (mon %/% k) + 1))
aggregate(amt ~ category + period, DF, sum)
giving:
category period amt
1 A 2017-02 0.7084425
2 B 2017-02 0.5682296
3 c 2017-02 0.8156812
At the expense of using one package we can simplify the quarterly and monthly calculations by replacing the formula for period with one of these:
library(zoo)
# quarterly
period <- as.yearqtr(DF$date)
# monthly
period <- as.yearmon(DF$date)
Note: The input in reproducible form is:
Lines <- "date category amt
1 2017-08-05 A 0.1900707
2 2017-08-06 B 0.2661277
3 2017-08-07 c 0.4763196
4 2017-08-08 A 0.5183718
5 2017-08-09 B 0.3021019
6 2017-08-10 c 0.3393616"
DF <- read.table(text = Lines)
DF$date <- as.Date(DF$date)