How to add percentile (/quantile) values to a column in dataframe


My data set has flow rate measurements of a river for every day of the year from 2009 to 2021. This is split up into seasons: Winter (December, Jan, Feb), Spring (March, April, May), Summer (June, July, August) and Autumn (September, October, November).
This is a sample of my data set:
> (chitt_brook_wylye_2)
# A tibble: 4,437 x 7
river year season month date flow_rate quality
<chr> <dbl> <chr> <chr> <dttm> <dbl> <chr>
1 chittern_brook 2009 Winter December 2009-12-01 00:00:00 0.059 Good
2 chittern_brook 2009 Winter December 2009-12-02 00:00:00 0.061 Good
3 chittern_brook 2009 Winter December 2009-12-03 00:00:00 0.064 Good
4 chittern_brook 2009 Winter December 2009-12-04 00:00:00 0.068 Good
5 chittern_brook 2009 Winter December 2009-12-05 00:00:00 0.076 Good
6 chittern_brook 2009 Winter December 2009-12-06 00:00:00 0.138 Good
7 chittern_brook 2009 Winter December 2009-12-07 00:00:00 0.592 Good
8 chittern_brook 2009 Winter December 2009-12-08 00:00:00 1.04 Good
9 chittern_brook 2009 Winter December 2009-12-09 00:00:00 1.46 Good
10 chittern_brook 2009 Winter December 2009-12-10 00:00:00 1.7 Good
# ... with 4,427 more rows
I want to find the 95th percentile, 5th percentile, median and mean of flow_rate for each season of every year, with each statistic in its own column of a new dataframe.
For example:
> (df)
# A tibble: 49 x 2
season_label flow_rate_mean Q95 Q5 flow_rate_median
<chr> <dbl>
1 Winter 2009 0.453 3 2 4
2 Spring 2010 0.519 6 3 4
3 Summer 2010 0.0627 4 3 6
4 Autumn 2010 0.0415 6 2 6
5 Winter 2010 0.0622 8 3 3
6 Spring 2011 0.188 10 3 2
7 Summer 2011 0.0499 2 3 2
8 Autumn 2011 0.0383 2 2 1
9 Winter 2011 0.0461 5 2 7
10 Spring 2012 0.0925 3 2 8
# ... with 39 more rows
I currently have this code, which creates the above dataframe with just the first two columns, but I would like it to also include the 95th percentile, 5th percentile and median. Is this feasible in one step, or will I need to compute them separately and then combine them into one dataframe?
df <- chitt_brook_wylye_2 %>%
dplyr::mutate(month = as.numeric(format(date,"%m")),
year = as.numeric(format(date,"%Y")),
season_id = (12*year + month) %/% 3) %>%
dplyr::group_by(season_id) %>%
dplyr::mutate(season_label = paste(season, min(year))) %>%
dplyr::group_by(season_id,season_label) %>%
dplyr::summarise(flow_rate = mean(flow_rate))
Reproducible example and code:
date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01","2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01","2010-10-01","2010-11-01","2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)
library(dplyr) # provides %>% and the verbs used below
df <- data.frame(date,season,var) %>% # creating the dataframe
dplyr::mutate(month = as.numeric(format(date,"%m")),
year = as.numeric(format(date,"%Y")),
season_id = (12*year + month) %/% 3) %>% # generating an identifier for every season that exists in the data
dplyr::group_by(season_id) %>% # grouping by the id
dplyr::mutate(season_label = paste(min(year),season)) %>%
dplyr::group_by(season_id,season_label) %>% # season_label is kept in the grouping so it survives the summarise
dplyr::summarise(var = mean(var)) # computing the mean
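One possible approach, sketched on the reproducible example above: summarise() accepts several expressions at once, so the quantiles and the median can be computed in the same call as the mean. The output column names (var_mean, Q95, Q5, var_median) and the object name season_stats are illustrative choices, not part of the original code.

```r
library(dplyr)

date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01",
                  "2010-04-01","2010-05-01","2010-06-01","2010-07-01",
                  "2010-08-01","2010-09-01","2010-10-01","2010-11-01",
                  "2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring",
            "Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)

season_stats <- data.frame(date, season, var) %>%
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y")),
         # December shares an id with the following January and February
         season_id = (12 * year + month) %/% 3) %>%
  group_by(season_id) %>%
  mutate(season_label = paste(season, min(year))) %>%
  group_by(season_id, season_label) %>%
  summarise(var_mean   = mean(var),
            Q95        = quantile(var, 0.95, names = FALSE),
            Q5         = quantile(var, 0.05, names = FALSE),
            var_median = median(var),
            .groups    = "drop")  # return an ungrouped result

season_stats
```

For the real data, var would be replaced by flow_rate. If the column contains missing values, na.rm = TRUE would need to be passed to mean(), quantile() and median().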

Related

How to create a time series in R for the .data format?

I'm having some difficulty reading these datasets into R and creating time series objects from them:
SOI: https://psl.noaa.gov/data/correlation/soi.data
ONI: https://psl.noaa.gov/data/correlation/oni.data
Looking at the data, the first column holds the years and the remaining columns hold the months (Jan to Dec).
I expect to have something like this for SOI in R:
YearMonth SOI
<mth> <dbl>
Jan 1948 -99.99
Feb 1948 -99.99
... ...
Sep 2021 -1.3
Oct 2021 -99.99
Nov 2021 -99.99
Dec 2021 -99.99
And in the same way for ONI:
YearMonth ONI
<mth> <dbl>
Jan 1950 -1.53
Feb 1950 -1.34
... ...
Aug 2021 -0.46
Sep 2021 -99.90
Oct 2021 -99.90
Nov 2021 -99.90
Dec 2021 -99.90
I believe that the arrangement of this dataset may be the source of my difficulty, as I am unable to pivot the data correctly.
I know there are good R users here who can point me to the best practice.

This function seems to work for the two links shared. If there is a more standard way to get the data, prefer that, since it is likely to be more reliable.
library(dplyr)
library(tidyr)
read_data <- function(link) {
read.table(link, skip = 1, fill = TRUE) %>%
slice(-(grep('-99.9', V1):n())) %>%
mutate(across(everything(), as.numeric)) %>%
pivot_longer(cols = -V1) %>%
mutate(name = month.abb[match(name, unique(name))]) %>%
unite(YearMonth, V1, name, sep = ' ')
}
d1 <- read_data('https://psl.noaa.gov/data/correlation/soi.data')
d1
# A tibble: 888 x 2
# YearMonth value
# <chr> <dbl>
# 1 1948 Jan -100.
# 2 1948 Feb -100.
# 3 1948 Mar -100.
# 4 1948 Apr -100.
# 5 1948 May -100.
# 6 1948 Jun -100.
# 7 1948 Jul -100.
# 8 1948 Aug -100.
# 9 1948 Sep -100.
#10 1948 Oct -100.
# … with 878 more rows
For the second link -
d2 <- read_data('https://psl.noaa.gov/data/correlation/oni.data')
d2
# A tibble: 864 x 2
# YearMonth value
# <chr> <dbl>
# 1 1950 Jan -1.53
# 2 1950 Feb -1.34
# 3 1950 Mar -1.16
# 4 1950 Apr -1.18
# 5 1950 May -1.07
# 6 1950 Jun -0.85
# 7 1950 Jul -0.54
# 8 1950 Aug -0.42
# 9 1950 Sep -0.39
#10 1950 Oct -0.44
# … with 854 more rows
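The output above still carries the missing-value codes (-99.99 for SOI, -99.90 for ONI) and a character YearMonth column. A minimal sketch of a follow-up step is shown here. It assumes that any value at or below -99 is a missing-value code, and the helper name tidy_index and the small stand-in tibble are hypothetical, used so the example runs without a download.

```r
library(dplyr)

# Hypothetical stand-in for the output of read_data()
d <- tibble(
  YearMonth = c("1948 Jan", "1948 Feb", "2021 Sep", "2021 Oct"),
  value     = c(-99.99, -99.99, -1.3, -99.99)
)

# Assumption: values at or below `missing_below` are missing-value codes
tidy_index <- function(d, index_name, missing_below = -99) {
  d %>%
    mutate(
      year  = as.integer(substr(YearMonth, 1, 4)),
      month = match(substr(YearMonth, 6, 8), month.abb),
      # first day of the month stands in for the whole month
      YearMonth = as.Date(sprintf("%d-%02d-01", year, month)),
      value = ifelse(value <= missing_below, NA_real_, value)
    ) %>%
    transmute(YearMonth, !!index_name := value)
}

soi <- tidy_index(d, "SOI")
soi
```

If the tsibble package is available, applying tsibble::yearmonth() to the resulting Date column should give the <mth> type shown in the question.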

Trying to use gg_lag() but apparently have more than one time series

I'm trying to find lags using gg_lag(), but I keep getting the same error about my tsibble:
# A tsibble: 255 x 6 [7D]
# Key: Demand [163]
Week Demand Date Month year Quarter
<dbl> <dbl> <date> <mth> <chr> <qtr>
1 1 48 2018-01-01 2018 Jan 2018 2018 Q1
2 2 101 2018-01-08 2018 Jan 2018 2018 Q1
3 3 129 2018-01-15 2018 Jan 2018 2018 Q1
4 4 113 2018-01-22 2018 Jan 2018 2018 Q1
5 5 116 2018-01-29 2018 Jan 2018 2018 Q1
6 6 123 2018-02-05 2018 Feb 2018 2018 Q1
7 7 137 2018-02-12 2018 Feb 2018 2018 Q1
8 8 136 2018-02-19 2018 Feb 2018 2018 Q1
9 9 151 2018-02-26 2018 Feb 2018 2018 Q1
10 10 87 2018-03-05 2018 Mar 2018 2018 Q1
# ... with 245 more rows
Printer_Q %>% gg_lag(Demand, geom='point')
Error: The data provided to contains more than one time series. Please filter a single time series to use gg_lag()
I tried filtering my data with:
Printer_Q <- Demandts %>%
select(-Week, -year, -Month, -Quarter)
...so that I am left with Demand and Date but it still says I have more than one time series? What am I doing wrong?

The Demand column should not be a key variable. A key variable is a categorical variable used to distinguish multiple time series in a single tsibble. It appears you just have one time series here, so you don't need a key variable.
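As a rough illustration of that point, the sketch below rebuilds the tsibble with an index only. The small demand_raw tibble is a hypothetical stand-in for the first rows of the question's data, and the intermediate object name keyed is illustrative.

```r
library(dplyr)
library(tsibble)

# Hypothetical stand-in for the first rows shown in the question
demand_raw <- tibble(
  Date   = as.Date("2018-01-01") + 7 * 0:9,
  Demand = c(48, 101, 129, 113, 116, 123, 137, 136, 151, 87)
)

# Reproduce the problem: with Demand as the key, every distinct
# demand value is treated as a separate time series
keyed <- as_tsibble(demand_raw, key = Demand, index = Date)

# Drop the tsibble structure, then rebuild it with an index and no key
Printer_Q <- keyed %>%
  as_tibble() %>%
  as_tsibble(index = Date)

n_keys(keyed)      # number of series when Demand is the key
n_keys(Printer_Q)  # number of series after removing the key

# With a single series, the call from the question should be usable:
# Printer_Q %>% feasts::gg_lag(Demand, geom = "point")
```

Selecting columns away, as tried in the question, does not change which variable is declared as the key, which would explain why the error persisted.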

dplyr sample_n by one variable through another one

I have a data frame with a "grouping" variable season and another variable year which is repeated for each month.
df <- data.frame(month = as.character(sapply(month.name,function(x)rep(x,4))),
season = c(rep("winter",8),rep("spring",12),rep("summer",12),rep("autumn",12),rep("winter",4)),
year = rep(2021:2024,12))
I would like to use dplyr::sample_n or something similar to choose 2 months in the data frame for each season and keep the same months for all the years, for example:
month season year
1 January winter 2021
2 January winter 2022
3 January winter 2023
4 January winter 2024
5 February winter 2021
6 February winter 2022
7 February winter 2023
8 February winter 2024
9 March spring 2021
10 March spring 2022
11 March spring 2023
12 March spring 2024
13 May spring 2021
14 May spring 2022
15 May spring 2023
16 May spring 2024
17 June summer 2021
18 June summer 2022
19 June summer 2023
20 June summer 2024
21 July summer 2021
22 July summer 2022
23 July summer 2023
24 July summer 2024
25 October autumn 2021
26 October autumn 2022
27 October autumn 2023
28 October autumn 2024
29 November autumn 2021
30 November autumn 2022
31 November autumn 2023
32 November autumn 2024
I cannot use df %>% group_by(season,year) %>% sample_n(2), since it chooses different months for each year.
Thanks!

We can randomly sample 2 values from month and filter them by group.
library(dplyr)
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),2))
# month season year
# <chr> <chr> <int>
# 1 January winter 2021
# 2 January winter 2022
# 3 January winter 2023
# 4 January winter 2024
# 5 February winter 2021
# 6 February winter 2022
# 7 February winter 2023
# 8 February winter 2024
# 9 March spring 2021
#10 March spring 2022
# … with 22 more rows
If some groups have fewer than 2 unique values, we can sample the minimum of 2 and the number of unique values in the group.
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),min(2, n_distinct(month))))
Using the same logic with base R, we can use ave:
df[as.logical(with(df, ave(month, season,
FUN = function(x) x %in% sample(unique(x),2)))), ]
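A variation on the same idea, sketched here as an assumption rather than as part of the answer above: draw the months once per season into a small lookup table, then keep the matching rows with semi_join(). This makes the chosen months easy to inspect and, with a seed, reproducible. The seed value and the object names picked and sampled are arbitrary.

```r
library(dplyr)

df <- data.frame(
  month  = rep(month.name, each = 4),
  season = c(rep("winter", 8), rep("spring", 12), rep("summer", 12),
             rep("autumn", 12), rep("winter", 4)),
  year   = rep(2021:2024, 12)
)

set.seed(42)  # arbitrary seed, only so the draw can be repeated

# One row per (season, month), then 2 months drawn within each season
picked <- df %>%
  distinct(season, month) %>%
  group_by(season) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Keep every year of the months that were drawn
sampled <- semi_join(df, picked, by = c("season", "month"))

# Each kept month should appear once per year
count(sampled, season, month)
```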

An option using slice
library(dplyr)
df %>%
group_by(season) %>%
slice(which(!is.na(match(month, sample(unique(month), 2)))))
# A tibble: 32 x 3
# Groups: season [4]
# month season year
# <fct> <fct> <int>
# 1 October autumn 2021
# 2 October autumn 2022
# 3 October autumn 2023
# 4 October autumn 2024
# 5 November autumn 2021
# 6 November autumn 2022
# 7 November autumn 2023
# 8 November autumn 2024
# 9 April spring 2021
#10 April spring 2022
# … with 22 more rows
Or using base R
by(df, df$season, FUN = function(x) subset(x, month %in% sample(unique(month), 2 )))

Group data by group of days within months in R

I am trying to summarise this daily time series of rainfall by 10-day periods within each month and calculate the accumulated rainfall.
library(tidyverse)
(dat <- tibble(
date = seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by=1),
rainfall = rgamma(length(date), shape=2, scale=2)))
Therefore, the length of the third period will vary through the year: for instance, in January the third period has 11 days, in February 9 days, and so on. This is my attempt:
library(lubridate)
dat %>%
group_by(decade=floor_date(date, "10 days")) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
This is the resulting output:
# A tibble: 43 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 36.1 10
4 2016-01-31 1.87 1
5 2016-02-01 50.6 10
6 2016-02-11 32.1 10
7 2016-02-21 22.1 9
8 2016-03-01 45.9 10
9 2016-03-11 30.0 10
10 2016-03-21 42.4 10
# ... with 33 more rows
Can someone help me add the leftover days to the third period so that I always get 3 periods within each month? This would be the desired output (note row 3):
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 48.5 10
2 2016-01-11 39.9 10
3 2016-01-21 37.97 11
4 2016-02-01 50.6 10
5 2016-02-11 32.1 10
6 2016-02-21 22.1 9

One way to do this is to use if_else to apply floor_date with different arguments depending on the day of date. If day(date) is below 30, floor to "10 days" as usual; if it is 30 or 31, floor to "20 days" so that the date is rounded down to day 21:
dat %>%
group_by(decade=if_else(day(date) >= 30,
floor_date(date, "20 days"),
floor_date(date, "10 days"))) %>%
summarize(acum_rainfall=sum(rainfall),
days = n())
# A tibble: 36 x 3
decade acum_rainfall days
<date> <dbl> <int>
1 2016-01-01 38.8 10
2 2016-01-11 38.4 10
3 2016-01-21 43.4 11
4 2016-02-01 34.4 10
5 2016-02-11 34.8 10
6 2016-02-21 25.3 9
7 2016-03-01 39.6 10
8 2016-03-11 53.9 10
9 2016-03-21 38.1 11
10 2016-04-01 36.6 10
# … with 26 more rows
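An alternative sketch that avoids multi-day rounding units altogether: compute the period number directly from the day of the month and cap it at the third period. The helper columns day and period, the object name dekads and the seed are illustrative, and the rainfall values are random, so the sums will differ from those printed above.

```r
library(dplyr)

set.seed(1)  # arbitrary seed for the simulated rainfall
dat <- tibble(
  date = seq(as.Date("2016-01-01"), as.Date("2016-12-31"), by = 1),
  rainfall = rgamma(length(date), shape = 2, scale = 2)
)

dekads <- dat %>%
  mutate(
    day = as.integer(format(date, "%d")),
    # 0 for days 1-10, 1 for days 11-20, 2 for day 21 to month end
    period = pmin((day - 1L) %/% 10L, 2L),
    # label each period by its first day: the 1st, 11th or 21st
    decade = as.Date(format(date, "%Y-%m-01")) + period * 10L
  ) %>%
  group_by(decade) %>%
  summarise(acum_rainfall = sum(rainfall), days = n())

dekads
```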

Populate Missing dates by month and Administrative Levels

I have a dataset that roughly looks like this:
COUNTRY ADMIN1 month YEAR nevent
<chr> <chr> <dttm> <dbl> <dbl>
1 Algeria Adrar 2003-07-01 00:00:00 2003 1.00
2 Algeria Adrar 2004-10-01 00:00:00 2004 1.00
3 Algeria Adrar 2005-04-01 00:00:00 2005 1.00
4 Algeria Adrar 2005-06-01 00:00:00 2005 1.00
5 Algeria Adrar 2008-12-01 00:00:00 2008 1.00
6 Algeria Adrar 2009-02-01 00:00:00 2009 1.00
7 Algeria Adrar 2009-04-01 00:00:00 2009 2.00
8 Algeria Adrar 2009-11-01 00:00:00 2009 1.00
9 Algeria Adrar 2010-07-01 00:00:00 2010 1.00
10 Algeria Adrar 2012-02-01 00:00:00 2012 1.00
My units are monthly observations at the first administrative level (ADMIN1).
My main problem is that this dataset only contains months, admin levels and countries in Africa for which the number of events is > 0. How can I add observations with 0 observed events?
In other words I would need to add:
- every month when no event happened (nevent = 0)
- every admin1 where no events happened (ever and in specific months)
- every country where no events happened (ever and in specific months)
I noticed that the following exists:
library(tidyverse)
complete(dat, id, date)
But how can I adapt it to my particular case?
Thanks
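One possible adaptation, sketched under assumptions: tidyr::complete() can combine the observed COUNTRY/ADMIN1 pairs (via nesting()) with a full monthly sequence and fill the new rows with 0. The toy data below is hypothetical, the month column is a Date rather than the dttm shown above, and YEAR is recomputed afterwards because it would otherwise be NA in the added rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as the question's dataset
dat <- tibble(
  COUNTRY = c("Algeria", "Algeria", "Algeria"),
  ADMIN1  = c("Adrar", "Adrar", "Alger"),
  month   = as.Date(c("2003-07-01", "2003-09-01", "2003-08-01")),
  nevent  = c(1, 2, 1)
)

filled <- dat %>%
  complete(
    nesting(COUNTRY, ADMIN1),  # only country/admin pairs present in the data
    month = seq(min(month), max(month), by = "month"),
    fill = list(nevent = 0)    # rows created by complete() get 0 events
  ) %>%
  mutate(YEAR = as.numeric(format(month, "%Y")))

filled
```

Countries or admin units that never appear in the data cannot be created this way; they would have to come from an external list, for example by joining against a full table of administrative units before calling complete().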
