Conditional Sum of a column in R with NAs - r

I have a dataset with 4 columns which looks like this:
City       Year   Week         Average
Guelph     2020   2020-04-12   28.3
Hamilton   2020   2020-04-12   10.7
Waterloo   2020   2020-04-12   50.1
Guelph     2020   2020-04-20   3.5
Hamilton   2020   2020-04-20   42.9
I would like to sum the Average column over rows in the same week. In other words, I want to create a new dataset with three columns (Year, Week, Average) where each week appears in only one row (e.g. instead of having 2020-04-12 three times, it appears once) and the corresponding cell in the Average column is the sum over all rows for that week. Something like this:
Year   Week         Average
2020   2020-04-12   89.1
2020   2020-04-20   46.4
where 89.1 is the sum of the first three rows, which fall in the same week, and 46.4 is the sum of the last two rows of the initial table, which correspond to the same week (2020-04-20).
The code I am using for that looks like this:
data_set <- data_set %>%
  select(Year, Week, Average) %>%
  group_by(Year, Week) %>%
  summarize(Average = sum(Average))
but for some weeks I get back NA, while for others I get the correct sum. The data are all numeric, and the initial dataset has some NA values in the Average column.
Thanks in advance

You can accomplish this by passing na.rm = TRUE to sum(). Also, since you group_by(Year, Week) and compute a summary statistic on the Average variable within summarise(), there isn't much to gain from using select() in this case.
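The NA behavior of sum() can be seen in isolation (using values from the question's data):

```r
# Any NA in the input makes the whole sum NA
sum(c(28.3, NA, 50.1))               # NA
# na.rm = TRUE drops the NAs before summing
sum(c(28.3, NA, 50.1), na.rm = TRUE) # 78.4
```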
df <- structure(list(City = c("Guelph", "Hamilton", "Waterloo", "Guelph",
"Hamilton"), Year = c(2020L, 2020L, 2020L, 2020L, 2020L), Week = c("2020-04-12",
"2020-04-12", "2020-04-12", "2020-04-20", "2020-04-20"), Average = c(28.3,
10.7, 50.1, 3.5, 42.9)), class = "data.frame", row.names = c(NA,
-5L))
library(dplyr)
df %>%
  mutate(Week = as.Date(Week)) %>%
  group_by(Year, Week) %>%
  summarise(Average = sum(Average, na.rm = TRUE))
#> # A tibble: 2 x 3
#> # Groups: Year [1]
#> Year Week Average
#> <int> <date> <dbl>
#> 1 2020 2020-04-12 89.1
#> 2 2020 2020-04-20 46.4
Created on 2021-03-10 by the reprex package (v0.3.0)


Sum of column based on a condition in R

I would like to print out the total amount for each date, so that my new dataframe will have date and total amount columns.
My data frame looks like this
permitnum   amount
6/1/2022    na
ascas       30.00
olic        40.41
6/2/2022    na
avrey       17.32
fev         32.18
grey        12.20
Any advice on how to go about this would be appreciated.
Here is another tidyverse option: convert the date strings to Date (and then reformat them), fill the date downward so it can be used for grouping, then get the sum for each date.
library(tidyverse)
df %>%
  mutate(permitnum = format(as.Date(permitnum, "%m/%d/%Y"), "%m/%d/%Y")) %>%
  fill(permitnum, .direction = "down") %>%
  group_by(permitnum) %>%
  summarise(total_amount = sum(as.numeric(amount), na.rm = TRUE))
Output
permitnum total_amount
<chr> <dbl>
1 06/01/2022 70.4
2 06/02/2022 61.7
Data
df <- structure(list(permitnum = c("6/1/2022", "ascas", "olic", "6/2/2022",
"avrey", "fev", "grey"), amount = c("na", "30.00", "40.41", "na",
"17.32", "32.18", "12.20")), class = "data.frame", row.names = c(NA,
-7L))
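The pipeline works because as.Date() with the "%m/%d/%Y" format turns the non-date permitnum values into NA, and tidyr::fill() then carries the last date downward. A minimal illustration of the fill step:

```r
library(tidyr)

d <- data.frame(x = c("6/1/2022", NA, NA, "6/2/2022", NA))
fill(d, x, .direction = "down")$x
#> [1] "6/1/2022" "6/1/2022" "6/1/2022" "6/2/2022" "6/2/2022"
```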
Here is another option: split the data at each row whose permitnum contains a date (marked by a digit), then summarise the amount within each chunk alongside its date.
library(tidyverse)
dat <- read_table("permitnum amount
6/1/2022 na
ascas 30.00
olic 40.41
6/2/2022 na
avrey 17.32
fev 32.18
grey 12.20")
dat |>
  group_split(id = cumsum(grepl("\\d", permitnum))) |>
  map_dfr(\(x) {
    date <- x$permitnum[[1]]
    x |>
      slice(-1) |>
      summarise(date = date,
                total_amount = sum(as.numeric(amount)))
  })
#> # A tibble: 2 x 2
#> date total_amount
#> <chr> <dbl>
#> 1 6/1/2022 70.4
#> 2 6/2/2022 61.7

Create a date-of-"X" column in R when I have an age-in-days-at-"X" column and a birth date column

I'm having some trouble finding out how to do a specific thing in R.
In my dataset, I have a column with the date of birth of participants. I also have a column giving me the age in days at which a disease was diagnosed.
What I want to do is to create a new column showing the date of diagnosis. I'm guessing it's a pretty easy thing to do since I have all the information needed, basically it's birth date + X number of days = Date of diagnosis, but I'm unable to figure out how to do it.
All of my searches give me information on the opposite, going from date to age. So if you're able to help me, it would be much appreciated!
library(tidyverse)
library(lubridate)
df <- tibble(
  birth = sample(seq(as.Date("1950-01-01"), today(), by = "day"),
                 10, replace = TRUE),
  age = sample(3650:15000, 10, replace = TRUE)
)

df %>%
  mutate(diagnosis_date = birth %m+% days(age))
#> # A tibble: 10 x 3
#> birth age diagnosis_date
#> <date> <int> <date>
#> 1 1955-01-16 6684 1973-05-05
#> 2 1958-11-03 6322 1976-02-24
#> 3 2007-02-23 4312 2018-12-14
#> 4 2002-07-11 8681 2026-04-17
#> 5 2021-12-28 11892 2054-07-20
#> 6 2017-07-31 3872 2028-03-07
#> 7 1995-06-30 14549 2035-04-30
#> 8 1955-09-02 12633 1990-04-04
#> 9 1958-10-10 4534 1971-03-10
#> 10 1980-12-05 6893 1999-10-20
Created on 2022-06-30 by the reprex package (v2.0.1)
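Since age here is a whole number of days, plain base-R date arithmetic gives the same result; lubridate's %m+% only behaves differently for month and year steps, where it clamps to the end of shorter months:

```r
# First row above: birth 1955-01-16 plus 6684 days
as.Date("1955-01-16") + 6684
#> [1] "1973-05-05"
```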

R Calculate change in Weekly values Year on Year (with additional complication)

I have a data set of daily values spanning Dec 1, 2018 to April 1, 2020.
The columns are "date" and "value", as shown here:
date <- c("2018-12-01", "2018-12-02", "2018-12-03",
          ...
          "2020-03-30", "2020-03-31", "2020-04-01")
value <- c(1592, 1825, 1769, 1909, 2022, ..., 2287, 2169, 2366, 2001, 2087, 2099, 2258)
df <- data.frame(date, value)
What I would like to do is sum the values by week and then calculate the year-over-year change for each week.
I know that I can sum by week using the following function:
Data_week <- df %>%
  group_by(category, week = cut(date, "week")) %>%
  mutate(summed = sum(value))
My questions are twofold:
1) How do I sum by week and then manipulate the dataframe so that I can calculate the week-over-week change (e.g. week of Dec 1, 2019 vs week of Dec 1, 2018)?
2) How can I do the above using a "customized" week? Let's say I want to define a week as moving 7 days back from the latest date I have data for, e.g. the latest week would be the one starting on March 26th (April 1st minus 7 days).
We can use lag from dplyr to help and also some convenience functions from lubridate.
library(dplyr)
library(lubridate)
df %>%
  mutate(year = year(date)) %>%
  group_by(week = week(date), year) %>%
  summarize(summed = sum(value)) %>%
  arrange(year, week) %>%
  ungroup() %>%
  mutate(change = summed - lag(summed))
# week year summed change
# <dbl> <dbl> <dbl> <dbl>
# 1 48 2018 3638. NA
# 2 49 2018 15316. 11678.
# 3 50 2018 13283. -2033.
# 4 51 2018 15166. 1883.
# 5 52 2018 12885. -2281.
# 6 53 2018 1982. -10903.
# 7 1 2019 14177. 12195.
# 8 2 2019 14969. 791.
# 9 3 2019 14554. -415.
#10 4 2019 12850. -1704.
#11 5 2019 1907. -10943.
If you would like to define "weeks" in different ways, there are also isoweek and epiweek. See this answer for a great explanation of your options.
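As a quick illustration of how the definitions differ around a year boundary:

```r
library(lubridate)

d <- as.Date("2021-01-01")  # a Friday
week(d)     # 1  -- day-of-year split into 7-day blocks
isoweek(d)  # 53 -- ISO 8601: weeks start Monday; Jan 1 still falls in 2020's last week
epiweek(d)  # epidemiological weeks, which start on Sunday
```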
Data
set.seed(1)
df <- data.frame(date = seq.Date(from = as.Date("2018-12-01"), to = as.Date("2019-01-29"), "days"), value = runif(60,1500,2500))

Rolling 7 Day Sum grouped by date and unique ID

I am using workload data to compute 3 metrics - Daily, 7-Day Rolling (sum of the last 7 days), and 28-Day Rolling Average (sum of the last 28 days / 4).
I have been able to compute the Daily metric, but I am having trouble with the 7-Day Rolling and 28-Day Rolling Average. I have 17 unique IDs for each date (dates range from 2018-08-09 to 2018-12-15).
library(dplyr)
library(tidyr)
library(tidyverse)
library(zoo)
Post_Practice <- read.csv("post.csv", stringsAsFactors = FALSE)
Post_Data <- Post_Practice[, 1:3]
DailyLoad <- Post_Data %>%
  group_by(Date, Name) %>%
  transmute(Daily = sum(DayLoad)) %>%
  distinct(Date, Name, .keep_all = TRUE) %>%
  mutate(`7-day` = rollapply(Daily, 7, sum, na.rm = TRUE, partial = TRUE))
Input:
Date Name DayLoad
2018-08-09 Athlete 1 273.92000
2018-08-09 Athlete 2 351.16000
2018-08-09 Athlete 3 307.97000
2018-08-09 Athlete 1 434.20000
2018-08-09 Athlete 2 605.92000
2018-08-09 Athlete 3 432.87000
The input looks like this all the way to 2018-12-15. Some dates have multiple entries (like above) and some have only one.
This code produces the 7-day column, but it shows the same number as Daily, i.e.:
Date Name Daily 7-day
<chr> <chr> <dbl> <dbl>
1 2018-08-09 Athlete 1 708. 708.
2 2018-08-09 Athlete 2 957. 957.
3 2018-08-09 Athlete 3 741. 741.
The goal is to have final table (ie 7 days later) look like this:
Date Name Daily 7-day
<chr> <chr> <dbl> <dbl>
1 2018-08-15 Athlete 1 413. 3693.
2 2018-08-15 Athlete 2 502. 4348.
3 2018-08-15 Athlete 3 490. 4007.
Where Daily is the sum for that specific date and 7-day is the sum of the last 7 dates for that specific unique ID.
The help file for rollsum says:
The default methods of rollmean and rollsum do not handle inputs that
contain NAs.
Use rollapplyr(x, width, sum, na.rm = TRUE) to exclude NAs in the input from the sum. Note the r at the end of rollapplyr to specify right alignment.
Also note the partial=TRUE argument can be used if you want partial sums at the beginning rather than NAs.
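A minimal illustration of the partial, right-aligned window, followed by a sketch of how it could slot into the pipeline (the toy tibble and column names just mirror the question; note the grouping is by Name only, so the window rolls across dates within each athlete rather than within a single day):

```r
library(dplyr)
library(zoo)

# Right-aligned partial windows: element i sums up to 3 trailing values
rollapplyr(c(1, 2, NA, 4), 3, sum, na.rm = TRUE, partial = TRUE)
#> [1] 1 3 3 6

# Sketch: rolling sum per athlete over a toy 3-day window
daily <- tibble(
  Date  = rep(as.Date("2018-08-09") + 0:2, each = 2),
  Name  = rep(c("Athlete 1", "Athlete 2"), times = 3),
  Daily = c(708, 957, 413, 502, 490, 600)
)

daily %>%
  group_by(Name) %>%                 # roll within each athlete, not each day
  arrange(Date, .by_group = TRUE) %>%
  mutate(`3-day` = rollapplyr(Daily, 3, sum, na.rm = TRUE, partial = TRUE)) %>%
  ungroup()
```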

How to calculate/count the number of extreme precipitation events (above a "threshold") from daily rainfall data in each month per year basis

I am working on daily rainfall data and trying to evaluate the extreme events in the time series above a certain threshold value in each month per year, i.e. the number of times the rainfall exceeded a certain threshold in each month of each year.
The rainfall timeseries data is from St Lucia and has two columns:
"YEARMODA" - defining the time (format- YYYYMMDD)
"PREP" - rainfall in mm (numeric)
StLucia <- read_excel("C:/Users/hp/Desktop/StLuciaProject.xlsx")
The dataframe I'm working on, i.e. "Precip1", has two columns:
Time (format YYYY-MM-DD)
Precipitation (numeric value)
The code is provided below:
library("imputeTS")
StLucia$YEARMODA <- as.Date(as.character(StLucia$YEARMODA), format = "%Y%m%d")
data1 <- na_ma(StLucia$PREP, k=4, weighting = "exponential")
Precip1 <- data.frame(Time= StLucia$YEARMODA, Precipitation= data1, check.rows = TRUE)
I found out the threshold value based on the 95th percentile and 99th percentile using function quantile().
I now want to count the number of "extreme events" of rainfall above this threshold in each month on per year basis.
Please help me out on this. I would be highly obliged by your help. Thank You!
If you are open to a tidyverse method, here is an example with the economics dataset that is built into ggplot2. We can use ntile to assign a percentile group to each observation. Then we group_by the year, and get a count of the values that are in the desired percentiles. Because this is monthly data the counts are pretty low, but it's easily translated to daily data.
library(tidyverse)
thresholds <- economics %>%
  mutate(
    pctile = ntile(unemploy, 100),
    year = year(date)
  ) %>%
  group_by(year) %>%
  summarise(
    q95 = sum(pctile >= 95L),
    q99 = sum(pctile >= 99L)
  )
arrange(thresholds, desc(q95))
#> # A tibble: 49 x 3
#> year q95 q99
#> <dbl> <int> <int>
#> 1 2010 12 6
#> 2 2011 12 0
#> 3 2009 10 5
#> 4 1967 0 0
#> 5 1968 0 0
#> 6 1969 0 0
#> 7 1970 0 0
#> 8 1971 0 0
#> 9 1972 0 0
#> 10 1973 0 0
#> # ... with 39 more rows
Created on 2018-06-04 by the reprex package (v0.2.0).
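ntile() itself just assigns each value to one of n equal-sized, rank-based buckets, which is what makes the percentile threshold above work:

```r
library(dplyr)

x <- c(5, 1, 9, 3, 7)
ntile(x, 5)  # one bucket per rank here: 3 1 5 2 4
```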
