I have a data frame with COVID data and I'm trying to add a column that calculates the number of recovered people based on the number of positive tests.
My data has a location, a date, and the number of tests administered/positive results/negative results each day. Here are a few lines using one location as an example (the real data has several months' worth of dates):
loc    date        tests  pos  neg  active
spot1  2020-04-10      1    1    0       5
spot1  2020-04-11      2    1    1       6
spot1  2020-04-12      0    0    0       6
spot1  2020-04-13     11    1   10       7
I want to make a new column that cumulatively counts each positive test in each location 14 days after it is recorded. On 2020-04-24, the 5 cases that were active on 2020-04-10 are no longer active, so I want a recovered column showing 5. For each date, I want the newly non-active cases added to the running total.
My first thought was to try it in a loop:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = for (i in 1:nrow(df)) {
    # getting number of new cases
    x <- df$pos[i]
    # add 14 days to the date
    d <- df$date + 14
    df$rec <- sum(x)
  })
As you can see, I'm not the best at writing for loops. That gives me a bunch of numbers, but they bear very little meaningful relationship to the data.
Also tried it with map_dbl:
df1 <- df %>%
  mutate(date = as.Date(date)) %>%
  group_by(loc) %>%
  mutate(rec = map_dbl(date, ~ sum(pos[(date <= . + 14) & date >= .])))
This resulted in the same number repeated down the entire rec column.
Any suggestions? (Sorry for the lengthy explanation, just want to make sure this all makes sense)
Your sample data shows that:
- you have continuous dates even on days with 0 tests (12 April)
- the active column already looks like a cumulative sum
Therefore I think you can simply use the lag function with n = 14. Example code:
df %>% group_by(loc) %>% mutate(recovered = lag(active, 14)) %>% ungroup()
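Note that lag() leaves NA in the first 14 rows of each location; if you'd rather show 0 there, a small variant (the coalesce() wrapper is my addition, not part of the answer above):
df %>%
  group_by(loc) %>%
  mutate(recovered = coalesce(lag(active, 14), 0)) %>%  # fill leading NAs with 0
  ungroup()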
You could use aggregate to sum the specific column and then apply cut in order to set a 14-day time frame for each sum:
df <- data.frame(loc = rep("spot1", 30),
                 date = seq(as.Date('2020-04-01'), as.Date('2020-04-30'), by = 1),
                 test = 1:30,
                 positive = 1:30,
                 active = 1:30)
output <- aggregate(positive ~ cut(date, "14 days"), df, sum)
output
Console output:
  cut(date, "14 days") positive
1           2020-04-01      105
2           2020-04-15      301
3           2020-04-29       59
My solution:
library(dplyr)
date_seq <- seq(as.Date("2020/04/01"), by = "day", length.out = 30)
pos <- rpois(n = 60, lambda = 10)
mydf <-
data.frame(loc = c(rep('loc1', 30), rep('loc2', 30)),
date = date_seq,
pos = pos)
head(mydf)
# Sum positives for one location over the trailing `daysBack` days ending at `max`
getPosSum <- function(max, tbl, myloc, daysBack = 14) {
  max.Date <- as.Date(max)
  sum(tbl %>%
        filter(date >= max.Date - (daysBack - 1) &
                 date <= max.Date & loc == myloc) %>%
        select(pos))
}
result <- mydf %>%
  group_by(date, loc) %>%
  mutate(rec = getPosSum(max = date, tbl = mydf, myloc = loc))
library(tidyverse)
library(lubridate)
data %>%
  mutate(date = as_date(date),
         cut = cut(date, '14 days')) %>%
  group_by(loc) %>%
  arrange(cut) %>%
  mutate(cum_pos = accumulate(pos, `+`))  # accumulate(pos, sum) should also work
As a general rule of thumb, avoid loops, especially within mutate: that won't work. Instead of map_dbl you should check out purrr::accumulate. There are specialized functions for this in base R, such as cumsum and cummin, but purrr::accumulate generalizes to any binary function.
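A minimal sketch of that equivalence on a toy pos vector:
library(purrr)
pos <- c(1, 1, 0, 1)
cumsum(pos)           # 1 2 2 3
accumulate(pos, `+`)  # 1 2 2 3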
I would like to know if there is a way in R to return the number of times a series of data exceeds a certain value for a number of consecutive days.
e.g. How many times in a year was x greater than 10 for at least 30 consecutive days?
I know you can find how many instances x was greater than a certain value over the whole year, but I'm not sure how to test for instances that are consecutive.
Where Data is a data.frame with Date, Year, and Value columns with daily data from 2010-2020:
Data %>%
  group_by(Year) %>%
  filter(Value >= 10) %>%
  summarize(exceedances = n())
Here is an example of daily data from 2018-2021 with random values from 0-25:
library(tidyverse)
library(lubridate)
value = sample(0:25, 1461, replace = TRUE)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
                 year = year(date),
                 value = value)
dat %>%
  group_by(year) %>%
  filter(value >= 10) %>%
  summarize(exceedances = n())
The output:
# A tibble: 4 x 2
year exceedances
<dbl> <int>
1 2018 216
2 2019 247
3 2020 229
4 2021 217
Desired output (n of >= 30 consecutive exceedances is a guess):
# A tibble: 4 x 2
year n_exceedances_30_consec
<dbl> <int>
1 2018 1
2 2019 0
3 2020 2
4 2021 0
The trick with this is that if there are 40 consecutive exceedances, I need that to show as 1 instance only, not as one instance for every day whose previous 30 days were all >= 10.
You could use slider::slide_dbl. It's maybe a complex way of doing it, but you could try:
library(tidyverse)
library(lubridate)
library(slider)
value = sample(0:25, 1461, replace = TRUE)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
                 year = year(date),
                 value = value)
consecutive_days <- 10
dat |>
  mutate(greater = value >= 10) |>
  # rolling count of exceedance days over the trailing window of `consecutive_days` rows
  mutate(consecutive = slide_dbl(greater, sum, .before = consecutive_days - 1)) |>
  # keep only days whose entire trailing window was above the threshold
  filter(consecutive >= consecutive_days) |>
  filter(greater) |>
  group_by(year) |>
  summarize(exceedances = n())
I'm obviously freestyling here as there's no data provided. Hopefully this fits your needs!
Edit - I'm editing in light of the data you sent. This should now work correctly.
This is if you're more of a base R user:
date_index <- sapply(dat$date, \(x) {
  # trailing window of `consecutive_days` days ending at x
  d <- dat[dat$date >= x - (consecutive_days - 1) &
             dat$date <= x, ]
  d <- d[d$value >= 10, ]
  nrow(d) >= consecutive_days
})
aggregate(x = dat[date_index, ]$value,
          by = list(year = dat[date_index, ]$year),
          FUN = length)
I'm changing your data slightly as I wasn't getting any runs long enough.
set.seed(6)
date = seq(as.Date("2018-01-01"), as.Date("2021-12-31"), by = "1 day")
dat = data.frame(date = date,
                 year = format(date, "%Y"),
                 value = sample(5:50, length(date), replace = TRUE))
year_runs <- sapply(unique(dat$year), function(y) {
  runs <- rle(dat$value[dat$year == y] >= 10)
  sum(runs$lengths >= 30 & runs$values)
})
year_runs
2018 2019 2020 2021
1 2 2 2
The secret here is the base function rle (run-length encoding). This runs separately by year, so a run that spans the new year will be truncated.
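If you haven't met it before, a quick look at what rle() actually returns:
rle(c(TRUE, TRUE, FALSE, TRUE))
# Run Length Encoding
#   lengths: int [1:3] 2 1 1
#   values : logi [1:3] TRUE FALSE TRUE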
In response to your edit saying you want to count a run in the year it ends: you can skip the sapply and use rle on the whole dataset, remove the runs that are too short, then invert it back and find where each run ends.
runs <- rle(dat$value >= 10)
runs$values[runs$lengths < 30] <- FALSE
run_ends <- which(diff(inverse.rle(runs)) == -1) + 1
table(dat$year[run_ends])
2018 2019 2020 2021
1 2 2 2
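One caveat, which is my observation rather than part of the answer above: a run that is still going on the final day of the data has no TRUE-to-FALSE transition for diff() to find, so it gets dropped. Padding the inverted vector with a trailing FALSE covers that case (and here I index the last exceedance day itself rather than the day after, so each run is attributed to the year of its final exceedance):
v <- c(inverse.rle(runs), FALSE)  # pad so a run reaching the last day still ends
run_ends <- which(diff(v) == -1)  # index of the last exceedance day in each long run
table(dat$year[run_ends])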
I have a problem that sounds easy; however, I could not find a solution in R. I would like to shift values according to the year of release: the first column holds the release year, and the remaining columns are the years in which devices broke (the values are counts of broken devices).
This is a solution in Python:
def f(x):
    shifted = np.argmin(x.index.astype(int) < x.name[0])
    return x.shift(-shifted)

df = df.set_index(['Delivery Year', 'Freq']).apply(f, axis=1)
df.columns = [f'Year.{i + 1}' for i in range(len(df.columns))]
df = df.reset_index()
df
I would like to have it in R too.
# TEST
data <- data.frame(
  `Delivery Year` = c('1976', '1977', '1978', '1979'),
  `Freq` = c(120, 100, 80, 60),
  `Year.1976` = c(10, NA, NA, NA),
  `Year.1977` = c(5, 3, NA, NA),
  `Year.1978` = c(10, NA, 8, NA),
  `Year.1979` = c(13, 10, 5, 14)
)
data
# DESIRED
data <- data.frame(
  `Delivery Year` = c('1976', '1977', '1978', '1979'),
  `Freq` = c(120, 100, 80, 60),
  `Year.1` = c(10, 3, 8, 14),
  `Year.2` = c(5, NA, 5, NA),
  `Year.3` = c(10, 10, NA, NA),
  `Year.4` = c(13, NA, NA, NA)
)
data
In addition, would it also be possible to transform the number of broken devices into a percentage of the Freq column?
Thank you
Using the tidyverse:
data %>%
  pivot_longer(!c(Delivery.Year, Freq)) %>%
  separate(name, c("Lab", "Year")) %>%
  select(-Lab) %>%
  mutate_all(as.numeric) %>%
  filter(Year >= Delivery.Year) %>%
  group_by(Delivery.Year, Freq) %>%
  mutate(ind = paste0("Year.", row_number()),
         per = value / Freq) %>%
  ungroup() %>%
  pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind,
              values_from = c(value, per))
I pivoted into long form to begin with and separated the original column names (Year.1976, Year.1977, etc.) to get just the years, dropping the "Year" piece. Then I converted all columns to numeric to allow mathematical operations such as filtering for Year >= Delivery.Year. I then created a column with the names you requested (Year.1, Year.2, etc.) and calculated the percent. Finally, I used pivot_wider to get it into the format you requested. One thing to note: I was unclear whether you wanted both the original values and the percent, or just the percent. If you only want the percent, then values_from = per should do it for you.
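For reference, that percent-only variant just changes the final pivot_wider step:
pivot_wider(id_cols = c(Delivery.Year, Freq), names_from = ind, values_from = per)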
library(dplyr)
f <- function(df) {
  years <- paste0("Year.", sort(as.vector(na.omit(as.integer(stringr::str_extract(colnames(df), "\\d+"))))))
  df1 <- df %>% select(all_of(years))
  df2 <- df %>% select(-all_of(years))
  val <- c()
  firstyear <- years[1]
  # take the first non-NA value in each row as the first-year value
  for (k in 1:nrow(df1)) {
    vec <- as.numeric(as.vector(df1[k, ]))
    val[k] <- (as.numeric(suppressWarnings(na.omit(vec))))[1]
  }
  df1[firstyear] <- val
  colnames(df1) <- paste0("Year.", seq_len(ncol(df1)))
  df <- cbind(df2, df1)
  print(df)
}
> f(data)
Delivery.Year Freq Year.1 Year.2 Year.3 Year.4
1 1976 120 10 5 10 13
2 1977 100 3 3 NA 10
3 1978 80 8 NA 8 5
4 1979 60 14 NA NA 14
I think I have been unconsciously roaming around this question for some time.
Let me try to explain it clearly. Suppose I have a data.table such as this one:
library(tidyverse)
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
.[,c("pce","psavert","uempmed","unemploy"):=NULL]
> MWE
date pop
1: 1967-07-01 198712.0
2: 1967-08-01 198911.0
3: 1967-09-01 199113.0
4: 1967-10-01 199311.0
5: 1967-11-01 199498.0
---
570: 2014-12-01 319746.2
571: 2015-01-01 319928.6
572: 2015-02-01 320074.5
573: 2015-03-01 320230.8
574: 2015-04-01 320402.3
There are several things I want to do that involve creating new variables whose value for a given row i depends on values from other rows.
For instance, if I want the evolution of the population from one month to the next, what I want is something like:
MWE2 <- MWE[, MoM := pop[date == i$date] / pop[date == (i$date - months(1))]]
The same if I want a year-on-year evolution:
MWE3 <- MWE[, YoY := pop[date == i$date] / pop[date == (i$date - years(1))]]
And if, for instance, I want the difference between the current value and the average population for the same month over the 5 previous years, it would be something like:
MWE4 <- MWE[, `5YD` := pop - mean(pop[month(date) == month(i$date) & (year(i$date) - 6 <= year(date) & year(date) < year(i$date))])]
So, is there a way to do something close to what I envision?
Do not hesitate to tell me if I have been unclear.
Back to basics.
A crude solution, but it is straightforward.
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics)
MWE <- MWE[,c("pce","psavert","uempmed","unemploy"):=NULL]
data = data.frame(MWE)
monthly_diffs = double()
yearly_diffs = double()
current_year_months = 0
current_year_sum = 0
past_year_sum = 0
past_year_months = 0
for (i in 1:(nrow(data) - 1)) {
  # since your data appears to be on a monthly basis, we can do the difference directly
  monthly_diffs[i] <- data$pop[i + 1] / data$pop[i]
  # for the years there is a need to accumulate it first
  if (current_year_months == 0) {
    current_year <- year(MWE$date[i])
  }
  if (year(MWE$date[i]) == current_year) {
    # if we are within one year, we accumulate
    current_year_months <- current_year_months + 1
    current_year_sum <- current_year_sum + MWE$pop[i]
  } else {
    # otherwise we compute the averages, the ratio, and save it
    if (past_year_months != 0) {
      # I will assume you want the average difference over the year,
      # as I find that a reasonable approach for a yearly series.
      # Any other operation would be similar.
      current_year_average <- current_year_sum / current_year_months
      past_year_average <- past_year_sum / past_year_months
      yearly_diffs = c(yearly_diffs, current_year_average / past_year_average)
    }
    past_year_months = current_year_months
    past_year_sum = current_year_sum
    current_year_months = 0
    current_year_sum = 0
    current_year = year(MWE$date[i])
  }
}
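A small usage note (my addition): the loop leaves its results in plain vectors, so once it finishes you can inspect them directly:
head(monthly_diffs)  # ratio of each month's pop to the previous month's
yearly_diffs         # ratio of each year's average pop to the previous year's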
The code chunks below will get you your data sets. You can mostly use dplyr's lag function for the MoM and YoY variables: MWE2 gets the month-over-month ratio, MWE3 gets the YoY ratio, and the last three data sets take the 5-year lag and compute the difference in population. The 5-year code can be changed to whatever offset you need.
library(tidyverse)
library(data.table)
library(lubridate)
MWE <- as.data.table(ggplot2::economics) %>%
.[,c("pce","psavert","uempmed","unemploy"):=NULL]
MWE2 <- MWE %>%
  arrange(date) %>%
  mutate(lagpop = lag(pop),
         MoM = pop / lagpop)
MWE3 <- MWE %>%
  mutate(yeard = year(date)) %>%
  group_by(yeard) %>%
  summarise(avgpop = mean(pop, na.rm = TRUE)) %>%
  ungroup %>%
  arrange(yeard) %>%
  mutate(lagpop = lag(avgpop),
         YoY = avgpop / lagpop)
MWEa <- mutate(MWE, date5 = date + (365 * 5) + 2,  # shift forward 5 years (+2 leap days)
               yeard = year(date5),
               monthd = month(date5))
MWE <- mutate(MWE,
              yeard = year(date),
              monthd = month(date))
MWE4 <- left_join(MWE, MWEa, by = c("yeard", "monthd")) %>%
  mutate(diff5yr = pop.x - pop.y)
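Since the question was framed in data.table, here is a data.table-native sketch of the same MoM/YoY idea (my addition; it assumes one row per month with no gaps, which holds for ggplot2::economics) using shift():
library(data.table)
MWE <- as.data.table(ggplot2::economics)[, .(date, pop)]
setorder(MWE, date)                 # make sure rows are in date order
MWE[, MoM := pop / shift(pop, 1)]   # previous month's pop
MWE[, YoY := pop / shift(pop, 12)]  # same month one year back (12 monthly rows)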
I am trying to sum an amount if a column of dates (z) falls within a specified range. Unfortunately my loop doesn't seem to work and I get a NULL output.
Sd <- as.Date('2017-01-01', tz = "GMT")
EndDate <- as.Date('2017-01-20', tz = "GMT")
Ed <- EndDate + 30
LTV1 <- while (Sd < Ed) {
  Sd <- Sd + 1
  LTV1 <- LTV %>% group_by(InstallationDate) %>% filter(z < Sd) %>%
    summarize(Amount = sum(USAmount))
}
as.data.frame(LTV1)
Apologies everyone. I am quite new to this. Here is a reproducible example:
Sample <- as.data.frame(seq(as.Date("2017/01/01"), by = "day", length.out = 15))
Sample$Amount <- c(10, 5, 3, 4, 8, 65, 89, 47, 74, 95, 85, 63, 32, 45, 32)
colnames(Sample)[1] <- "date"
Sd <- as.Date('2017-01-01', tz = "GMT")
EndDate <- as.Date('2017-01-05', tz = "GMT")
Ed <- EndDate + 3
Sample1 <- while (Sd < Ed) {
  Sd <- Sd + 1
  Sample1 <- Sample %>% group_by(date) %>% filter(date < Sd) %>%
    summarize(Amount = sum(Amount))
}
as.data.frame(Sample1)
Desired Output will be:
Dates: Day 1 Day 2 Day 3 .......................
Amount: 25 54 89 .......................
I think this is what you want:
LTV1 = LTV %>%
  arrange(InstallationDate) %>%
  group_by(InstallationDate) %>%
  summarize(daily_amount = sum(USAmount)) %>%
  ungroup() %>%
  mutate(cumulative_amount = cumsum(daily_amount))
I believe the cumulative_amount column is what you are trying to create, though it's hard to tell, and impossible to test, since you haven't reproducibly shared your input data. I also can't tell whether the arrange, group_by, and summarize are needed: if your data is already in date order, arrange isn't needed, and if your data has only one row per day, the grouping and summarizing aren't needed either.
You assign the result of while to LTV1 (why?), but a while loop in R always returns NULL.
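A one-line demonstration of that pitfall:
x <- while (FALSE) {}
is.null(x)
# [1] TRUE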
I've measured N2O flux from soil at multiple time points in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for each given day. I know how to do this using only the measures from the given day; however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
# Generate a reproducible example
datetime <- seq(ymd_hm('2015-04-07 11:20'), ymd('2015-04-13'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime),
                  Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate)
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
  mutate(lagDateGood = lag(DateGood),
         leadDateGood = lead(DateGood)) %>%
  filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve. This attempt is not correct:
out2 <- out %>%
  group_by(day) %>%
  mutate(hourOfday = hour(datetime) + minute(datetime) / 60) %>%
  summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements from the end of the previous day and the start of the following day when calculating the AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example, the last measurement on day 8 would be used in estimating the AUC for both day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
# Now I need to calculate the area under the curve for each day
n <- nrow(out)
# grab each day's rows plus one row on either side (previous day's last, next day's first)
extract <- function(ix) out[seq(max(1, min(ix) - 1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)
calcTotalFlux <- function(df) {
  if (nrow(df) < 10) {  # make sure the day has at least 10 measures
    NA
  } else {
    day_midnight <- floor_date(df$datetime[2], "day")
    df %>%
      mutate(time = datetime - day_midnight) %>%
      summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
  }
}
do.call("rbind", lapply(res, calcTotalFlux))
TotalFlux
7 NA
8 585230.2
9 579017.3
10 NA
11 563689.7
12 NA
Here's another way, more in line with the suggestions of @Alex Brown.
# Another way
last <- out %>%
  group_by(day) %>%
  filter(datetime == max(datetime)) %>%
  ungroup() %>%
  mutate(day = day + 1)
first <- out %>%
  group_by(day) %>%
  filter(datetime == min(datetime)) %>%
  ungroup() %>%
  mutate(day = day - 1)
d <- rbind(out, last, first) %>%
  group_by(day) %>%
  arrange(datetime)
n_measures_per_day <- d %>%
  summarize(n = n())
d <- left_join(d, n_measures_per_day) %>%
  filter(n > 4)
TotalFluxDF <- d %>%
  mutate(timeAtMidnight = floor_date(datetime[3], "day"),
         time = datetime - timeAtMidnight) %>%
  summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
TotalFluxDF
TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7