I am trying to sum an amount if a column of dates (z) falls within a specified range. Unfortunately my loop doesn't seem to work and I get a NULL output.
Sd <- as.Date('2017-01-01',tz = "GMT")
EndDate <- as.Date('2017-01-20',tz = "GMT")
Ed <- EndDate + 30
LTV1 <- while (Sd < Ed) {
  Sd <- Sd + 1
  LTV1 <- LTV %>% group_by(InstallationDate) %>% filter(z < Sd) %>%
    summarize(Amount = sum(USAmount))
}
as.data.frame(LTV1)
Apologies everyone. I am quite new to this. Here is a reproducible example:
Sample <- as.data.frame(seq(as.Date("2017/01/01"), by = "day", length.out = 15))
Sample$Amount <- c(10,5,3,4,8,65,89,47,74,95,85,63,32,45,32)
colnames(Sample)[1] <- "date"
Sd <- as.Date('2017-01-01',tz = "GMT")
EndDate <- as.Date('2017-01-05',tz = "GMT")
Ed <- EndDate + 3
Sample1 <- while (Sd < Ed) {
  Sd <- Sd + 1
  Sample1 <- Sample %>% group_by(date) %>% filter(date < Sd) %>%
    summarize(Amount = sum(Amount))
}
as.data.frame(Sample1)
Desired output:
Dates:  Day 1  Day 2  Day 3  ...
Amount: 25     54     89     ...
I think this is what you want:
LTV1 = LTV %>%
arrange(InstallationDate) %>%
group_by(InstallationDate) %>%
summarize(daily_amount = sum(USAmount)) %>%
ungroup() %>%
mutate(cumulative_amount = cumsum(daily_amount))
I believe the cumulative_amount column is what you are trying to create, though it's hard to tell and impossible to test since you haven't reproducibly shared your input data. I also can't tell whether the arrange, group_by, and summarize steps are needed: if your data is already in order by date, arrange isn't needed, and if your data has only one row per day, the grouping and summarizing aren't needed either.
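As a minimal sketch, assuming the reproducible Sample data from the question (with date and Amount standing in for InstallationDate and USAmount), the same approach looks like this:
library(dplyr)
Sample <- data.frame(
  date = seq(as.Date("2017/01/01"), by = "day", length.out = 15),
  Amount = c(10, 5, 3, 4, 8, 65, 89, 47, 74, 95, 85, 63, 32, 45, 32)
)
Sample %>%
  arrange(date) %>%                           # not needed if already sorted by date
  mutate(cumulative_amount = cumsum(Amount))  # running total up to each date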
You assign the result of while to LTV1 (why?), and a while loop in R always returns NULL, which is exactly the null output you are seeing.
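A quick way to convince yourself of this:
x <- while (FALSE) {}   # body never runs
print(x)                # NULL
y <- while (TRUE) break # even a loop that executes returns NULL
print(y)                # NULL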
I have a data frame with COVID data and I'm trying to make a column calculating the number of recovered people based off of the number of positive tests.
My data has a location, a date, and the number of tests administered/positive results/negative results each day. Here's a few lines using one location as an example (the real data has several months worth of dates):
loc date tests pos neg active
spot1 2020-04-10 1 1 0 5
spot1 2020-04-11 2 1 1 6
spot1 2020-04-12 0 0 0 6
spot1 2020-04-13 11 1 10 7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 cases that were active on 2020-04-10 are no longer active, so I want a recovered column with 5 on that date. For each date I want the newly non-active cases to be added.
My first thought was to try it in a loop:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = for (i in 1:nrow(df)) {
    # getting number of new cases
    x <- df$pos[i]
    # add 14 days to the date
    d <- df$date + 14
    df$rec <- sum(x)
  })
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers, but they bear very little meaningful relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = map_dbl(date, ~ sum(pos[(date <= . + 14) & date >= .])))
That resulted in the same number repeated down the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows that:
- you have continuous dates even on days with 0 tests (12 April)
- the active column already looks like a cumulative sum
Therefore I think you can simply use the lag function with n = 14.
Example code:
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
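A minimal sketch on made-up data, just to show the shape of the result (the toy active counts are hypothetical):
library(dplyr)
toy <- data.frame(
  loc = "spot1",
  date = seq(as.Date("2020-04-10"), by = "day", length.out = 16),
  active = 1:16  # pretend cumulative active counts
)
toy %>%
  group_by(loc) %>%
  mutate(recovered = lag(active, 14)) %>%  # value from 14 days earlier; NA for the first 14 rows
  ungroup()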
You could use aggregate to sum the specific column and then apply cut in order to set a 14-day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
                 date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'), by = 1),
                 test = 1:30,
                 positive = 1:30,
                 active = 1:30)
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
cut(date, "14 days") positive
1 2020-04-01 105
2 2020-04-15 301
3 2020-04-29 59
My solution:
library(dplyr)
set.seed(1)  # fix the RNG so the random counts below are reproducible
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <- data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
                   date = date_seq,
                   pos = pos)
head(mydf)
# sum the positives for one location over the 14 days ending at max.Date
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
  max.Date <- as.Date(max)
  sum(tbl %>%
        filter(date >= max.Date - (daysBack - 1) &
                 date <= max.Date & loc == myloc) %>%
        select(pos))
}
result <- mydf %>%
  group_by(date, loc) %>%
  mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
  mutate(date = as_date(date),
         cut = cut(date, '14 days')) %>%
  group_by(loc) %>%
  arrange(cut) %>%
  mutate(cum_pos = accumulate(pos, `+`))  # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate; that won't work. Instead of map_dbl you should check out purrr::accumulate. There are specialized functions for this in R's base library, such as cumsum and cummin, but their behavior is a lot less predictable than purrr's.
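For a quick feel for accumulate, a minimal sketch on a toy vector (not the OP's data):
library(purrr)
pos <- c(1, 1, 0, 1, 2)
accumulate(pos, `+`)  # 1 2 2 3 5 -- running total
cumsum(pos)           # same result; the base R equivalent for plain addition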
I think I have been unconsciously roaming around this question for some time.
Let me try to explain it clearly. Suppose I have a data.table such as this one:
library(tidyverse)
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
.[,c("pce","psavert","uempmed","unemploy"):=NULL]
> MWE
date pop
1: 1967-07-01 198712.0
2: 1967-08-01 198911.0
3: 1967-09-01 199113.0
4: 1967-10-01 199311.0
5: 1967-11-01 199498.0
---
570: 2014-12-01 319746.2
571: 2015-01-01 319928.6
572: 2015-02-01 320074.5
573: 2015-03-01 320230.8
574: 2015-04-01 320402.3
There are several things I want to do that involve creating new variables whose value for a given row i depends on values at other rows.
For instance, if I want the evolution of population for one month to another, what I want is something like :
MWE2 <- MWE[, MoM :=pop[date=i$date]/pop[date=(i$date-month(1))]
The same if I want a Year on year evolution :
MWE3 <- MWE[, YoY :=pop[date=i$date]/pop[date=(i$date-year(1))]
And if, for instance, I want to create the difference between the current value and the average population for the same month over the 5 previous years, it would be something like:
MWE4 <- MWE[, 5YD :=pop- mean(pop[month(date)=month(i$date) & (year(date$i)-6 <= year(date) < year(date$i))])]
So is there a way to do something close to what I envision? Do not hesitate to tell me if I have been unclear.
Back to basics: a crude solution, but it is straightforward.
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics)
MWE <- MWE[, c("pce","psavert","uempmed","unemploy") := NULL]
data <- data.frame(MWE)

monthly_diffs <- double()
yearly_diffs <- double()
current_year_months <- 0
current_year_sum <- 0
past_year_sum <- 0
past_year_months <- 0

for (i in 1:(nrow(data) - 1)) {
  # since your data appears to be on a monthly basis, we can do the difference directly
  monthly_diffs[i] <- data$pop[i + 1] / data$pop[i]
  # for the years there is a need to accumulate first
  if (current_year_months == 0) {
    current_year <- year(MWE$date[i])
  }
  if (year(MWE$date[i]) == current_year) {
    # if we are within one year, we accumulate
    current_year_months <- current_year_months + 1
    current_year_sum <- current_year_sum + MWE$pop[i]
  } else {
    # otherwise we compute the averages, the ratio, and save it
    if (past_year_months != 0) {
      # I will assume you want the average difference over the year,
      # as I find that a reasonable approach; any other operation would be similar.
      current_year_average <- current_year_sum / current_year_months
      past_year_average <- past_year_sum / past_year_months
      yearly_diffs <- c(yearly_diffs, current_year_average / past_year_average)
    }
    past_year_months <- current_year_months
    past_year_sum <- current_year_sum
    current_year_months <- 0
    current_year_sum <- 0
    current_year <- year(MWE$date[i])
  }
}
The code chunk below will get you your data sets. You can mostly use dplyr's lag function for the MoM and YoY variables. MWE2 is a way to get the month-over-month ratio, MWE3 gets the YoY ratio, and the last three data sets build the 5-year lag and take the difference in population. The 5-year code can be changed to whatever year offset you need.
library(tidyverse)
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
  .[, c("pce","psavert","uempmed","unemploy") := NULL]

MWE2 <- MWE %>%
  arrange(date) %>%
  mutate(lagpop = lag(pop),
         MoM = pop/lagpop)

MWE3 <- MWE %>%
  mutate(yeard = year(date)) %>%
  group_by(yeard) %>%
  summarise(avgpop = mean(pop, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(yeard) %>%
  mutate(lagpop = lag(avgpop),
         YoY = avgpop/lagpop)

MWEa <- mutate(MWE, date5 = date + (365*5) + 2,  # (365*5)+2 approximates 5 calendar years incl. leap days
               yeard = year(date5),
               monthd = month(date5))
MWE <- mutate(MWE,
              yeard = year(date),
              monthd = month(date))
MWE4 <- left_join(MWE, MWEa, by = c("yeard","monthd")) %>%
  mutate(diff5yr = pop.x - pop.y)
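Since the question is phrased in data.table, a data.table-native sketch of the MoM and YoY ratios using shift (assuming one row per month, as in ggplot2::economics):
library(data.table)
MWE <- as.data.table(ggplot2::economics)[, .(date, pop)]
MWE[, MoM := pop / shift(pop, 1)]   # ratio to the previous row (previous month)
MWE[, YoY := pop / shift(pop, 12)]  # ratio to the row 12 months back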
I have a dataframe that I want to group and then obtain the median of the diff (lagged difference) in consistent units. It is very similar to the example below. As you can see by running the code, I have a problem because diff has a units attribute that is not taken into account by my summarise function.
library(tidyverse)
# Initialise random data
t = Sys.time()
rnd <- sample(1:10000,10,replace=F)
add <- rnd[order(rnd)]
# Create 2 dataframes
time1 <- data.frame(datetime = t + add)
time2 <- data.frame(datetime = t + add * 1000)
# Bind dataframes together
mydata <- bind_rows(time1, time2, .id = "group")
# Trying to get a summary table
mydata %>% group_by(group) %>% summarise(elapsed = median(diff(datetime[order(datetime)])))
# These are the values that I should get in my summary table
median(diff(time1$datetime))
median(diff(time2$datetime))
What about using difftime and setting the units?
mydata %>%
  group_by(group) %>%
  summarise(elapsed = median(difftime(datetime, lag(datetime), units = "mins"), na.rm = TRUE))
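As a quick standalone check of the units argument of base R's difftime:
t1 <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
t2 <- as.POSIXct("2020-01-01 00:30:00", tz = "UTC")
difftime(t2, t1, units = "mins")  # Time difference of 30 mins
difftime(t2, t1, units = "secs")  # Time difference of 1800 secs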
Here's one option, which will show all results in seconds. Use dminutes(1) or dhours(1) or ddays(1) if more appropriate.
mydata %>%
  group_by(group) %>%
  summarise(elapsed = median(diff(datetime[order(datetime)])) / lubridate::dseconds(1))
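To see how the division by a duration normalizes the units, a minimal sketch:
library(lubridate)
x <- as.difftime(c(90, 150), units = "secs")
median(x) / dseconds(1)  # 120 (seconds)
median(x) / dminutes(1)  # 2   (minutes)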
I've measured N2O flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for each given day. I know how to do this using only the measures from the given day; however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
Generate a reproducible example:
datetime <- seq(ymd_hm('2015-04-07 11:20'),ymd('2015-04-13'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate)
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
  mutate(lagDateGood = lag(DateGood),
         leadDateGood = lead(DateGood)) %>%
  filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve; this part is not correct:
out2 <- out %>%
  group_by(day) %>%
  mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
  summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements from the end of the previous day and the start of the following day when calculating the AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example, the last measurement on day 8 would be used in estimating the AUC for both day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
# Now I need to calculate the area under the curve for each day
n <- nrow(out)
extract <- function(ix) out[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)

calcTotalFlux <- function(df) {
  if (nrow(df) < 10) { # make sure the day has at least 10 measures
    NA
  } else {
    day_midnight <- floor_date(df$datetime[2], "day")
    df %>%
      mutate(time = datetime - day_midnight) %>%
      summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
  }
}

do.call("rbind", lapply(res, calcTotalFlux))
TotalFlux
7 NA
8 585230.2
9 579017.3
10 NA
11 563689.7
12 NA
Here's another way, more in line with the suggestions of @Alex Brown.
# Another way
last <- out %>%
  group_by(day) %>%
  filter(datetime == max(datetime)) %>%
  ungroup() %>%
  mutate(day = day + 1)

first <- out %>%
  group_by(day) %>%
  filter(datetime == min(datetime)) %>%
  ungroup() %>%
  mutate(day = day - 1)

d <- rbind(out, last, first) %>%
  group_by(day) %>%
  arrange(datetime)

n_measures_per_day <- d %>%
  summarize(n = n())

d <- left_join(d, n_measures_per_day) %>%
  filter(n > 4)

TotalFluxDF <- d %>%
  mutate(timeAtMidnight = floor_date(datetime[3], "day"),
         time = datetime - timeAtMidnight) %>%
  summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))

TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7