Using dplyr mutate function to create new variable conditionally based on current row

I am working on creating conditional averages for a large data set that involves # of flu cases seen during the week for several years. The data is organized as such:
What I want to do is create a new column that gives the average number of cases for that same week in previous years. For instance, for the row where Week.Number is 1 and Flu.Year is 2017, I would like the new column to give the average count over all rows with Week.Number==1 & Flu.Year<2017. Normally, I would use the case_when() function to conditionally tabulate something like this. For instance, when calculating the average weekly volume I used this code:
mutate(average = case_when(
Flu.Year==2016 ~ mean(chcc$count[chcc$Flu.Year==2016]),
Flu.Year==2017 ~ mean(chcc$count[chcc$Flu.Year==2017]),
Flu.Year==2018 ~ mean(chcc$count[chcc$Flu.Year==2018]),
Flu.Year==2019 ~ mean(chcc$count[chcc$Flu.Year==2019]),
),
However, with four years of data * 52 weeks, that is a lot of conditions to spell out. Is there a way to elegantly code this in dplyr? The problem I keep running into is that I want to use values in the count column from other rows, selected by their Week.Number and Flu.Year relative to the current row's Week.Number and Flu.Year, and I am not sure how to accomplish that. Please let me know if there is further information / detail I can provide.
Thanks,
Steven
dat <- tibble( Flu.Year = rep(2016:2019,each = 52), Week.Number = rep(1:52,4), count = sample(1000, size=52*4, replace=TRUE) )

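Using $-indexing within dplyr verbs is bad form and, in some cases, an error: chcc$count always refers to the whole column, so it ignores any grouping or filtering done earlier in the pipe.
I think a better way to get that average field is to group_by(Flu.Year) and calculate it directly.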
library(dplyr)
set.seed(42)
dat <- tibble(
Flu.Year = sample(2016:2020, size=100, replace=TRUE),
count = sample(1000, size=100, replace=TRUE)
)
dat %>%
group_by(Flu.Year) %>%
mutate(average = mean(count)) %>%
# just to show a quick summary
slice(1:3) %>%
ungroup()
# # A tibble: 15 x 3
# Flu.Year count average
# <int> <int> <dbl>
# 1 2016 734 578.
# 2 2016 356 578.
# 3 2016 411 578.
# 4 2017 217 436.
# 5 2017 453 436.
# 6 2017 920 436.
# 7 2018 963 558
# 8 2018 609 558
# 9 2018 536 558
# 10 2019 943 543.
# 11 2019 740 543.
# 12 2019 536 543.
# 13 2020 627 494.
# 14 2020 218 494.
# 15 2020 389 494.
An alternative approach is to generate a summary table (just one row per year) and join it back in to the original data.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count))
# # A tibble: 5 x 2
# Flu.Year average
# <int> <dbl>
# 1 2016 578.
# 2 2017 436.
# 3 2018 558
# 4 2019 543.
# 5 2020 494.
dat %>%
group_by(Flu.Year) %>%
summarize(average = mean(count)) %>%
full_join(dat, by = "Flu.Year")
# # A tibble: 100 x 3
# Flu.Year average count
# <int> <dbl> <int>
# 1 2016 578. 734
# 2 2016 578. 356
# 3 2016 578. 411
# 4 2016 578. 720
# 5 2016 578. 851
# 6 2016 578. 822
# 7 2016 578. 465
# 8 2016 578. 679
# 9 2016 578. 30
# 10 2016 578. 180
# # ... with 90 more rows
The result, after chat:
tibble( Flu.Year = rep(2016:2018,each = 3), Week.Number = rep(1:3,3), count = 1:9 ) %>%
arrange(Flu.Year, Week.Number) %>%
group_by(Week.Number) %>%
mutate(year_week.average = lag(cumsum(count) / seq_along(count)))
# # A tibble: 9 x 4
# # Groups: Week.Number [3]
# Flu.Year Week.Number count year_week.average
# <int> <int> <int> <dbl>
# 1 2016 1 1 NA
# 2 2016 2 2 NA
# 3 2016 3 3 NA
# 4 2017 1 4 1
# 5 2017 2 5 2
# 6 2017 3 6 3
# 7 2018 1 7 2.5
# 8 2018 2 8 3.5
# 9 2018 3 9 4.5
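To see why lag(cumsum(count) / seq_along(count)) gives the average of the same week in earlier years, here is a small check against a brute-force calculation. This is only a sketch; the names toy, via_lag, and brute are made up for this example and are not from the original answer.

```r
library(dplyr)

# Toy data: 3 years x 3 weeks, same values as the example above
toy <- tibble(
  Flu.Year    = rep(2016:2018, each = 3),
  Week.Number = rep(1:3, 3),
  count       = 1:9
)

# Running mean within each week, shifted down one year by lag()
via_lag <- toy %>%
  arrange(Flu.Year, Week.Number) %>%
  group_by(Week.Number) %>%
  mutate(prior_avg = lag(cumsum(count) / seq_along(count))) %>%
  ungroup()

# Brute force: for each row, average the counts of the same week in earlier years
brute <- mapply(
  function(yr, wk) {
    prev <- toy$count[toy$Week.Number == wk & toy$Flu.Year < yr]
    if (length(prev) == 0) NA_real_ else mean(prev)
  },
  via_lag$Flu.Year, via_lag$Week.Number
)

# TRUE if both approaches agree
all.equal(via_lag$prior_avg, brute)
```

The cumulative sum divided by the running row number is the mean of all years up to and including the current one; lag() then shifts it down one year so the current year is excluded. This relies on the rows being sorted by Flu.Year within each week.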

We can use aggregate from base R
aggregate(count ~ Flu.Year, dat, FUN = mean)

Related

How to create a loop code from big dataframe in R?

I have a data series of daily snow depth values over a 60 year period. I would like to see the number of days with a snow depth higher than 30 cm for each season, for example from July 1980 to June 1981. What does the code for this have to look like? I know how I could calculate the daily values higher than 30 cm per season individually, but not how a code could calculate all seasons.
I have uploaded my dataframe on wetransfer: Dataframe
Thank you so much for your help in advance.
Pernilla
Something like this would work
library(dplyr)
library(lubridate)
df<-read.csv('BayrischerWald_Brennes_SH_daily_merged.txt', sep=';')
df_season <-df %>%
mutate(season=(Day %>% ymd() - days(181)) %>% floor_date("year") %>% year())
df_group_by_season <- df_season %>%
filter(!is.na(SHincm)) %>%
group_by(season) %>%
summarize(days_above_30=sum(SHincm>30)) %>%
ungroup()
df_group_by_season
#> # A tibble: 61 × 2
#> season days_above_30
#> <dbl> <int>
#> 1 1961 1
#> 2 1962 0
#> 3 1963 0
#> 4 1964 0
#> 5 1965 0
#> 6 1966 0
#> 7 1967 129
#> 8 1968 60
#> 9 1969 107
#> 10 1970 43
#> # … with 51 more rows
Created on 2022-01-15 by the reprex package (v2.0.1)
Here is an approach using the aggregate() function. After reading the data, convert the Date field to a date object and get rid of the rows with missing values for the date:
snow <- read.table("BayrischerWald_Brennes_SH_daily_merged.txt", header=TRUE, sep=";")
snow$Day <- as.Date(snow$Day)
str(snow)
# 'data.frame': 51606 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
snow <- snow[!is.na(snow$Day), ]
str(snow)
# 'data.frame': 21886 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
Notice more than half of your data has missing values for the date. Now we need to divide the data by ski season:
brks <- as.Date(paste(1961:2022, "07-01", sep="-"))
lbls <- paste(1961:2021, 1962:2022, sep="/")
snow$Season <- cut(snow$Day, breaks=brks, labels=lbls)
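As a quick illustration of how cut() assigns dates to seasons, here is a minimal sketch on four made-up dates around a July 1 boundary. The dates and the shortened break range are assumptions for this example only.

```r
# Hypothetical dates on either side of two season boundaries
days <- as.Date(c("1980-06-30", "1980-07-01", "1981-06-30", "1981-07-01"))

# Breaks on July 1 of each year; n breaks give n - 1 labelled intervals
brks <- as.Date(paste(1979:1982, "07-01", sep = "-"))
lbls <- paste(1979:1981, 1980:1982, sep = "/")

# For Date input, intervals are closed on the left by default,
# so July 1 itself belongs to the season that starts on that day
season <- cut(days, breaks = brks, labels = lbls)
as.character(season)
```

Dates outside the range of the breaks would come back as NA, which is why the breaks in the answer run one year past the last season.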
Now we use aggregate() to get the number of days with more than 30 cm of snow:
days30cm <- aggregate(SHincm~Season, snow, subset=snow$SHincm > 30, length)
colnames(days30cm)[2] <- "Over30cm"
head(days30cm, 10)
# Season Over30cm
# 1 1961/1962 1
# 2 1967/1968 129
# 3 1968/1969 60
# 4 1969/1970 107
# 5 1970/1971 43
# 6 1972/1973 101
# 7 1973/1974 119
# 8 1974/1975 188
# 9 1975/1976 126
# 10 1976/1977 112
In addition, you can get other statistics such as the maximum snow of the season or the total cm of snow:
maxsnow <- aggregate(SHincm~Season, snow, max)
totalsnow <- aggregate(SHincm~Season, snow, sum)

How do I go about filtering my data by the upper 50th percentile for a separate dependent variable?

I need to split my data so that when I use the facet_wrap I have the top 50 percentile for each year.
Here is a sample of my data:
# A tibble: 10,519 x 3
Species Abundance Year
<chr> <dbl> <chr>
1 Astropecten irregularis 2 2009
2 Asterias rubens 14 2009
3 Echinus esculentus 1 2009
4 Pagurus prideaux 1 2009
5 Raja clavata 1 2009
6 Astropecten irregularis 4 2009
7 Asterias rubens 47 2009
8 Henricia sp. 2 2009
9 Ophiura ophiura 8 2009
10 Solaster endeca 1 2009
# ... with 10,509 more rows
My current strategy is this:
Data <- All_years %>%
group_by(Species, Year) %>%
summarise(Abundance = sum(Abundance, na.rm = TRUE)) %>%
filter(quantile(Abundance, 0.50)<Abundance) %>%
filter(Abundance > 50)
The issue is that this gives me the top 50 percentile for the whole set while I would like it to give me the top 50 for each year so I can then display it with a facet_wrap in ggplot.
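One possible fix, offered as a sketch and not taken from the original thread: after summarise() the data is no longer grouped by Year, so the quantile is not computed per year. Regrouping by Year before the filter gives a per-year cutoff. The All_years tibble below is a made-up stand-in for the real data.

```r
library(dplyr)

# Hypothetical stand-in for All_years
All_years <- tibble(
  Species   = rep(c("A", "B", "C", "D"), times = 2),
  Abundance = c(10, 20, 30, 40, 400, 300, 200, 100),
  Year      = rep(c("2009", "2010"), each = 4)
)

top_half <- All_years %>%
  group_by(Species, Year) %>%
  summarise(Abundance = sum(Abundance, na.rm = TRUE), .groups = "drop") %>%
  group_by(Year) %>%                                 # regroup so the cutoff is per year
  filter(Abundance > quantile(Abundance, 0.50)) %>%  # keep species above that year's median
  ungroup()

top_half
```

The result keeps one row per Species and Year, so it can be passed to ggplot with facet_wrap(~ Year) as before.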

R Panel data: Create new variable based on ifelse() statement and previous row

My question refers to the following (simplified) panel data, for which I would like to create some sort of xrd_stock.
#Setup data
library(tidyverse)
firm_id <- c(rep(1, 5), rep(2, 3), rep(3, 4))
firm_name <- c(rep("Cosco", 5), rep("Apple", 3), rep("BP", 4))
fyear <- c(seq(2000, 2004, 1), seq(2003, 2005, 1), seq(2005, 2008, 1))
xrd <- c(49,93,121,84,37,197,36,154,104,116,6,21)
df <- data.frame(firm_id, firm_name, fyear, xrd)
#Define variables
growth = 0.08
depr = 0.15
For a new variable called xrd_stock I'd like to apply the following mechanics:
each firm_id should be handled separately: group_by(firm_id)
where fyear is at minimum, calculate xrd_stock as: xrd/(growth + depr)
otherwise, calculate xrd_stock as: xrd + (1-depr) * [xrd_stock from previous row]
With the following code, I already succeeded with step 1. and 2. and parts of step 3.
df2 <- df %>%
ungroup() %>%
group_by(firm_id) %>%
arrange(firm_id, fyear, decreasing = TRUE) %>% #Ensure that data is arranged w/ in asc(fyear) order; not required in this specific example as df is already in correct order
mutate(xrd_stock = ifelse(fyear == min(fyear), xrd/(growth + depr), xrd + (1-depr)*lag(xrd_stock)))
Difficulties occur in the else part of the function, such that R returns:
Error: Problem with `mutate()` input `xrd_stock`.
x object 'xrd_stock' not found
i Input `xrd_stock` is `ifelse(...)`.
i The error occured in group 1: firm_id = 1.
Run `rlang::last_error()` to see where the error occurred.
From this error message, I understand that R cannot refer to the just created xrd_stock in the previous row (logical when considering/assuming that R is not strictly working from top to bottom); however, when simply putting a 9 in the else part, my above code runs without any errors.
Can anyone help me with this problem so that the results eventually look as shown below? I am more than happy to answer additional questions if required. Thank you very much in advance to everyone who looks at my question :-)
Target results (Excel-calculated):
id name fyear xrd xrd_stock Calculation for xrd_stock
1 Cosco 2000 49 213 =49/(0.08+0.15)
1 Cosco 2001 93 274 =93+(1-0.15)*213
1 Cosco 2002 121 354 …
1 Cosco 2003 84 385 …
1 Cosco 2004 37 364 …
2 Apple 2003 197 857 =197/(0.08+0.15)
2 Apple 2004 36 764 =36+(1-0.15)*857
2 Apple 2005 154 803 …
3 BP 2005 104 452 …
3 BP 2006 116 500 …
3 BP 2007 6 431 …
3 BP 2008 21 388 …
Arrange the data by fyear so that the minimum year is always the first row within each firm; you can then use purrr::accumulate() to carry the previous xrd_stock forward.
library(dplyr)
df %>%
arrange(firm_id, fyear) %>%
group_by(firm_id) %>%
mutate(xrd_stock = purrr::accumulate(xrd[-1], ~.y + (1-depr) * .x,
.init = first(xrd)/(growth + depr)))
# firm_id firm_name fyear xrd xrd_stock
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 1 Cosco 2000 49 213.
# 2 1 Cosco 2001 93 274.
# 3 1 Cosco 2002 121 354.
# 4 1 Cosco 2003 84 385.
# 5 1 Cosco 2004 37 364.
# 6 2 Apple 2003 197 857.
# 7 2 Apple 2004 36 764.
# 8 2 Apple 2005 154 803.
# 9 3 BP 2005 104 452.
#10 3 BP 2006 116 500.
#11 3 BP 2007 6 431.
#12 3 BP 2008 21 388.
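For readers unfamiliar with accumulate(), the recursion it performs can be written out for a single firm. The sketch below uses Cosco's xrd values from the question and compares an explicit loop with base R's Reduce(accumulate = TRUE), which behaves like purrr::accumulate() here. The variable names stock and stock_reduce are made up for this example.

```r
growth <- 0.08
depr   <- 0.15
xrd    <- c(49, 93, 121, 84, 37)   # Cosco's xrd values from the question

# Explicit loop: the first year is seeded, later years depend on the previous stock
stock <- numeric(length(xrd))
stock[1] <- xrd[1] / (growth + depr)
for (i in seq_along(xrd)[-1]) {
  stock[i] <- xrd[i] + (1 - depr) * stock[i - 1]
}

# Same recursion with Reduce(): prev is the running stock, cur the next xrd value
stock_reduce <- Reduce(
  function(prev, cur) cur + (1 - depr) * prev,
  xrd[-1],
  xrd[1] / (growth + depr),   # initial value, like .init in accumulate()
  accumulate = TRUE
)

round(stock)
all.equal(stock, stock_reduce)
```

The rounded values match the Excel-calculated targets for Cosco in the question (213, 274, 354, 385, 364).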

Subsetting data by date range across years in R

I have a long term sightings data set of identified individuals (~16,000 records from 1979- 2019) and I would like to subset the same date range (YYYY-09-01 to YYYY(+1)-08-31) across years in R. I have successfully done so for each "year" (and obtained the unique IDs) using:
library(dplyr)
library(lubridate)
year79 <-data%>%
select(ID, Sex, AgeClass, Age, Date, Month, Year)%>%
filter(Date>= as.Date("1978-09-01") & Date<= as.Date("1979-08-31")) %>%
filter(!duplicated(ID))
year80 <-data%>%
select(ID, Sex, AgeClass, Age, Date, Month, Year)%>%
filter(Date>= as.Date("1979-09-01") & Date<= as.Date("1980-08-31")) %>%
filter(!duplicated(ID))
I would like to clean up the code and ideally not need to specify each range (just have it iterate through). I am new to R and stuck on how to do this. Any suggestions?
FYI "Month" and "Year" are included for producing a table via melt and cast later on.
example data:
ID Year Month Day Date AgeClass Age Sex
1 1034 1979 4 17 1979-04-17 U 3 F
2 1127 1979 5 3 1979-05-03 A 13 F
3 1222 1979 5 3 1979-05-03 U 0 F
4 1303 1979 6 16 1979-06-16 U 0 F
5 1153 1980 4 16 1980-04-16 C 0 F
6 1014 1980 4 16 1980-04-16 U 6 F
ID Year Month Day Date AgeClass Age Sex
16428 2503 2019 5 8 2019-05-08 U NA F
16429 3760 2019 5 8 2019-05-08 A 12 F
16430 4080 2019 5 9 2019-05-09 A 9 F
16431 4095 2019 5 9 2019-05-09 A 9 U
16432 1204 2019 5 11 2019-05-11 A 37 F
16433 1204 2019 5 11 2019-05-11 A NA F
#> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Every year has 122 days from Sept 1 to Dec 31 inclusive, so you could add a variable marking the "fiscal year" for each row:
set.seed(42)
library(dplyr)
library(lubridate)  # provides year()
my_data <- tibble(ID = 1:6,
Date = as.Date("1978-09-01") + c(-1, 0, 1, 364, 365, 366))
my_data
# There are 122 days from each Aug 31 (last of the FY) to the end of the CY.
# lubridate::ymd(19781231) - lubridate::ymd(19780831)
my_data %>%
mutate(FY = year(Date + 122))
## A tibble: 6 x 3
# ID Date FY
# <int> <date> <dbl>
#1 1 1978-08-31 1978
#2 2 1978-09-01 1979
#3 3 1978-09-02 1979
#4 4 1979-08-31 1979
#5 5 1979-09-01 1980
#6 6 1979-09-02 1980
You could keep the data in one table and do subsequent analysis using group_by(FY), or use %>% split(.$FY) to put each FY into its own element of a list. From my limited experience, I think it's generally an anti-pattern to create separate data frames for annual subsets of your data, as that makes your code harder to maintain, troubleshoot, and modify.
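To connect this back to the question's goal of unique IDs per Sept-Aug year, here is a minimal sketch that keeps everything in one table. The sightings data is made up, and using distinct() to stand in for the question's filter(!duplicated(ID)) is an assumption about what is wanted.

```r
library(dplyr)
library(lubridate)

# Hypothetical sightings; ID 1 is seen twice within the same Sept-Aug year
sightings <- tibble(
  ID   = c(1, 1, 2, 1, 3),
  Date = as.Date(c("1978-09-15", "1979-03-02", "1979-08-31",
                   "1979-09-01", "1980-01-10"))
)

first_per_fy <- sightings %>%
  mutate(FY = year(Date + 122)) %>%    # Sept 1 + 122 days lands on Jan 1 of the next year
  distinct(FY, ID, .keep_all = TRUE)   # first record of each ID within each FY

# Number of unique individuals per FY
count(first_per_fy, FY)
```

Because the 122 days only span September through December, the shift is unaffected by leap years.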

Clean way to calculate both group and overall statistics

I would like to calculate the median not only for different groups of my data, but also the median over all groups and store the result in a single data.frame. While accomplishing each of these tasks separately is easy, I have not found a clean way to do both at the same time.
Right now, what I'm doing is calculate both statistics separately; then join the results; then tidy the data if necessary. Here's an example of what this may look like if I wanted to know the average delay per day and per month (the code uses mean(), but the same applies to median()):
library(dplyr)
library(hflights)
data(hflights)
# Calculate both statistics separately
per_day <- hflights %>%
group_by(Year, Month, DayofMonth) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Daily")
per_month <- hflights %>%
group_by(Year, Month) %>%
summarise(Delay = mean(ArrDelay, na.rm = TRUE)) %>%
mutate(Interval = "Monthly", DayofMonth = NA)
# Join into a single data.frame
my_summary <- full_join(per_day, per_month,
by = c("Year", "Month", "DayofMonth", "Interval", "Delay"))
my_summary
# Source: local data frame [377 x 5]
# Groups: Year, Month
#
# Year Month DayofMonth Delay Interval
# 1 2011 1 1 10.067642 Daily
# 2 2011 1 2 10.509745 Daily
# 3 2011 1 3 6.038627 Daily
# 4 2011 1 4 7.970740 Daily
# 5 2011 1 5 4.172650 Daily
# 6 2011 1 6 6.069909 Daily
# 7 2011 1 7 3.907295 Daily
# 8 2011 1 8 3.070140 Daily
# 9 2011 1 9 17.254325 Daily
# 10 2011 1 10 11.040388 Daily
# .. ... ... ... ... ...
Are there better ways to do this?
(Note that in many cases one could easily progressively roll up summaries as pointed out in the Introduction to dplyr. However, this doesn't work for statistics like median, mean etc.)
As a one-off table. This is fairly straightforward in data.table:
require(data.table)
setDT(hflights)[,{
mo_del <- mean(ArrDelay,na.rm=TRUE)
.SD[,.(DailyDelay = mean(ArrDelay,na.rm=TRUE),MonthlyDelay = mo_del),by=DayofMonth]
},by=.(Year,Month)]
# Year Month DayofMonth DailyDelay MonthlyDelay
# 1: 2011 1 1 10.0676417 4.926065
# 2: 2011 1 2 10.5097451 4.926065
# 3: 2011 1 3 6.0386266 4.926065
# 4: 2011 1 4 7.9707401 4.926065
# 5: 2011 1 5 4.1726496 4.926065
# ---
# 361: 2011 12 14 1.0293610 5.013244
# 362: 2011 12 17 -0.1049822 5.013244
# 363: 2011 12 24 -4.1457490 5.013244
# 364: 2011 12 25 -2.2976827 5.013244
# 365: 2011 12 31 46.4846491 5.013244
How it works. The basic syntax is DT[i,j,by].
With by=.(Year,Month), all operations in j are done per "by group."
We can nest another "by group" using the data.table of the current Subset of Data, .SD.
To return columns in j we use .(colname1=col1,colname2=col2,...).
Creating new variables. Alternately, we could create new variables in hflights using := in j.
hflights[,DailyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month,DayofMonth)]
hflights[,MonthlyDelay := mean(ArrDelay,na.rm=TRUE),.(Year,Month)]
Then we can view the summary table:
hflights[,.GRP,.(Year,Month,DayofMonth,DailyDelay,MonthlyDelay)]
# Year Month DayofMonth DailyDelay MonthlyDelay .GRP
# 1: 2011 1 1 10.0676417 4.926065 1
# 2: 2011 1 2 10.5097451 4.926065 2
# 3: 2011 1 3 6.0386266 4.926065 3
# 4: 2011 1 4 7.9707401 4.926065 4
# 5: 2011 1 5 4.1726496 4.926065 5
# ---
# 361: 2011 12 14 1.0293610 5.013244 361
# 362: 2011 12 17 -0.1049822 5.013244 362
# 363: 2011 12 24 -4.1457490 5.013244 363
# 364: 2011 12 25 -2.2976827 5.013244 364
# 365: 2011 12 31 46.4846491 5.013244 365
(Something needed to be put in j here, so I used the "by group" code, .GRP.)
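For comparison, a similar daily-plus-monthly table can be sketched in dplyr by adding the monthly value with a grouped mutate() and then summarising by day. This is not part of the original answer; the flights tibble is a made-up stand-in for hflights so the example runs without that package.

```r
library(dplyr)

# Hypothetical stand-in for hflights
flights <- tibble(
  Year       = 2011,
  Month      = c(1, 1, 1, 2, 2),
  DayofMonth = c(1, 1, 2, 1, 1),
  ArrDelay   = c(10, 20, 60, NA, 4)
)

both <- flights %>%
  group_by(Year, Month) %>%
  mutate(MonthlyDelay = mean(ArrDelay, na.rm = TRUE)) %>%   # computed from raw rows
  group_by(Year, Month, DayofMonth, MonthlyDelay) %>%       # carry the monthly value along
  summarise(DailyDelay = mean(ArrDelay, na.rm = TRUE), .groups = "drop")

both
```

Because the monthly value is computed from the raw rows and not from the daily summaries, the same pattern works for median() or any other statistic that cannot be rolled up.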
