dplyr sample_n by one variable through another one - r

I have a data frame with a "grouping" variable season and another variable year which is repeated for each month.
df <- data.frame(month = as.character(sapply(month.name,function(x)rep(x,4))),
season = c(rep("winter",8),rep("spring",12),rep("summer",12),rep("autumn",12),rep("winter",4)),
year = rep(2021:2024,12))
I would like to use dplyr::sample_n or something similar to choose 2 months in the data frame for each season and keep the same months for all the years, for example:
month season year
1 January winter 2021
2 January winter 2022
3 January winter 2023
4 January winter 2024
5 February winter 2021
6 February winter 2022
7 February winter 2023
8 February winter 2024
9 March spring 2021
10 March spring 2022
11 March spring 2023
12 March spring 2024
13 May spring 2021
14 May spring 2022
15 May spring 2023
16 May spring 2024
17 June summer 2021
18 June summer 2022
19 June summer 2023
20 June summer 2024
21 July summer 2021
22 July summer 2022
23 July summer 2023
24 July summer 2024
25 October autumn 2021
26 October autumn 2022
27 October autumn 2023
28 October autumn 2024
29 November autumn 2021
30 November autumn 2022
31 November autumn 2023
32 November autumn 2024
I cannot use df %>% group_by(season, year) %>% sample_n(2), since it chooses different months for each year.
Thanks!

We can randomly sample 2 values from month and filter them by group.
library(dplyr)
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),2))
# month season year
# <chr> <chr> <int>
# 1 January winter 2021
# 2 January winter 2022
# 3 January winter 2023
# 4 January winter 2024
# 5 February winter 2021
# 6 February winter 2022
# 7 February winter 2023
# 8 February winter 2024
# 9 March spring 2021
#10 March spring 2022
# … with 22 more rows
If some groups have fewer than 2 unique months, we can sample min(2, n_distinct(month)) values instead, i.e. the smaller of 2 and the number of unique months in the group.
df %>%
group_by(season) %>%
filter(month %in% sample(unique(month),min(2, n_distinct(month))))
Using the same logic with base R, we can use ave
df[as.logical(with(df, ave(month, season,
FUN = function(x) x %in% sample(unique(x),2)))), ]
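As a rough check of the idea above, here is a minimal base R sketch that draws the months once per season and then keeps all years for those months. It rebuilds the question's data with rep(month.name, each = 4), which should be equivalent to the sapply() call; the seed value and the names picked and out are only illustrative. It relies on each month belonging to exactly one season, as in the example data.

```r
# Rebuild the example data from the question.
df <- data.frame(
  month  = rep(month.name, each = 4),
  season = c(rep("winter", 8), rep("spring", 12), rep("summer", 12),
             rep("autumn", 12), rep("winter", 4)),
  year   = rep(2021:2024, 12)
)

set.seed(42)  # arbitrary seed, only so the draw is repeatable

# Draw 2 distinct months per season, once, independently of year.
picked <- unlist(lapply(split(df$month, df$season),
                        function(m) sample(unique(m), 2)))

# Keep every row (i.e. every year) of the picked months.
out <- df[df$month %in% picked, ]

# Check: 2 months per season, and each picked month appears in all 4 years.
tapply(out$month, out$season, function(m) length(unique(m)))
table(out$month)
```

Because the sampling happens on unique(month) within each season, the year column never influences which months are chosen.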

An option using slice
library(dplyr)
df %>%
group_by(season) %>%
slice(which(!is.na(match(month, sample(unique(month), 2)))))
# A tibble: 32 x 3
# Groups: season [4]
# month season year
# <fct> <fct> <int>
# 1 October autumn 2021
# 2 October autumn 2022
# 3 October autumn 2023
# 4 October autumn 2024
# 5 November autumn 2021
# 6 November autumn 2022
# 7 November autumn 2023
# 8 November autumn 2024
# 9 April spring 2021
#10 April spring 2022
# … with 22 more rows
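Or using base R (by() returns one data frame per season; wrap the call in do.call(rbind, ...) if a single data frame is needed)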
by(df, df$season, FUN = function(x) subset(x, month %in% sample(unique(month), 2 )))

Related

Web scraping data from a Chart or Graph in R

Good Morning,
I am hoping someone can help. The task is straightforward but seems a little difficult to execute.
On this website: https://reiwa.com.au/rent/
There is a chart labelled: Property trends
I am trying to extract the two time series from this chart.
I have used rvest etc but I have had no luck at all. I am really hoping someone has the skills to solve this one because it has me lost.
Thank you all in advance.
A little inspection with Chrome devtools led me to this:
res <- httr::GET("https://reiwa.com.au/api/insights/trends/Residential")
json <- jsonlite::fromJSON(httr::content(res, "text"))
head(json$Result$SaleTrends)
#> CalendarYear CalendarMonth DateLabel MedianPrice DisplayPrice ChartOrder
#> 1 2020 December December 2020 490000 $490k 12
#> 2 2021 January January 2021 495000 $495k 13
#> 3 2021 February February 2021 500000 $500k 14
#> 4 2021 March March 2021 505000 $505k 15
#> 5 2021 April April 2021 510000 $510k 16
#> 6 2021 May May 2021 515000 $515k 17
head(json$Result$LeaseTrends)
#> CalendarYear CalendarMonth DateLabel MedianPrice DisplayPrice ChartOrder
#> 1 2020 December December 2020 410 $410pw 12
#> 2 2021 January January 2021 420 $420pw 13
#> 3 2021 February February 2021 420 $420pw 14
#> 4 2021 March March 2021 430 $430pw 15
#> 5 2021 April April 2021 440 $440pw 16
#> 6 2021 May May 2021 450 $450pw 17
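To turn either table into a dated time series, something like the following may work. This is only a sketch: the three rows are typed in by hand from the SaleTrends output printed above so that it runs without a network call, and it assumes CalendarYear is numeric and CalendarMonth holds English month names, as the printout suggests.

```r
# First three rows of json$Result$SaleTrends as printed above.
sale <- data.frame(
  CalendarYear  = c(2020, 2021, 2021),
  CalendarMonth = c("December", "January", "February"),
  MedianPrice   = c(490000, 495000, 500000)
)

# Build a Date for the first day of each month; match() against month.name
# avoids locale-dependent month parsing.
sale$Date <- as.Date(sprintf("%d-%02d-01",
                             as.integer(sale$CalendarYear),
                             match(sale$CalendarMonth, month.name)))

# Keep the series in chronological order.
sale <- sale[order(sale$Date), c("Date", "MedianPrice")]
sale
```

The same steps should apply to json$Result$LeaseTrends, since both tables are printed with the same columns.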

How can I create a new column that only extracts either the year or month from a mm/dd/yy hh:mm string?

I have a date/time string variable that looks like this:
> dput(df$starttime)
c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", "1/07/20 17:19",
"11/8/21 10:14", NA, NA, "10/26/21 7:19", "3/14/22 9:48", "5/12/22 13:29")
I basically want to create one column that has only the year (2020, 2021, 2022) and another with the month and year (e.g., "Jan 2022").
1) Base R Assuming that you want separate month and year numeric columns, define a function which converts a string in the format shown in the question to a year or month number and then invoke it twice. No packages are used.
toNum <- function(x, fmt) format(as.Date(x, "%m/%d/%y"), fmt) |>
type.convert(as.is = TRUE)
transform(df, year = toNum(starttime, "%Y"), month = toNum(starttime, "%m"))
giving
starttime year month
1 12/16/20 7:24 2020 12
2 6/21/21 13:20 2021 6
3 1/22/20 9:03 2020 1
4 1/07/20 17:19 2020 1
5 11/8/21 10:14 2021 11
6 <NA> NA NA
7 <NA> NA NA
8 10/26/21 7:19 2021 10
9 3/14/22 9:48 2022 3
10 5/12/22 13:29 2022 5
2) yearmon Assuming that you want a yearmon class column, we can use the following. yearmon represents year and month internally as year + fraction, where fraction is 0 for Jan, 1/12 for Feb, ..., 11/12 for Dec, so it sorts appropriately and adding 1/12, say, gives the next month. Note that if ym is yearmon then as.integer(ym) is the year and cycle(ym) is the month number (1, 2, ..., 12).
library(zoo)
transform(df, yearmon = as.yearmon(starttime, "%m/%d/%y"))
giving:
starttime yearmon
1 12/16/20 7:24 Dec 2020
2 6/21/21 13:20 Jun 2021
3 1/22/20 9:03 Jan 2020
4 1/07/20 17:19 Jan 2020
5 11/8/21 10:14 Nov 2021
6 <NA> <NA>
7 <NA> <NA>
8 10/26/21 7:19 Oct 2021
9 3/14/22 9:48 Mar 2022
10 5/12/22 13:29 May 2022
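The internal representation described above can be inspected on a single value. A small sketch, assuming zoo is installed and using the first starttime from the question:

```r
library(zoo)

# Parse one of the question's strings; the time-of-day part is ignored.
ym <- as.yearmon("12/16/20 7:24", "%m/%d/%y")

as.numeric(ym)   # internal value: year + (month - 1)/12
as.integer(ym)   # year
cycle(ym)        # month number
ym + 1/12        # the following month
```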
Note
If you want to sort by starttime then use
ct <- as.POSIXct(df$starttime, format = "%m/%d/%y %H:%M")
df[order(ct),, drop = FALSE ]
If you want a chronologically sortable output, you could use the tsibble::yearmonth type:
tsibble::yearmonth(lubridate::mdy_hm(c("12/16/20 7:24", "6/21/21 13:20", "1/22/20 9:03", "1/07/20 17:19",
"11/8/21 10:14", NA, NA, "10/26/21 7:19", "3/14/22 9:48", "5/12/22 13:29")))
result
<yearmonth[10]>
[1] "2020 Dec" "2021 Jun" "2020 Jan" "2020 Jan" "2021 Nov" NA NA
[8] "2021 Oct" "2022 Mar" "2022 May"
An option is to convert to datetime class POSIXct with mdy_hm (from lubridate), then format to extract the month (%b) and 4 digit year (%Y), filter out the NA elements and arrange based on the converted datetime column
library(dplyr)
library(lubridate)
df %>%
mutate(starttime = mdy_hm(starttime),
yearmonth = format(starttime, "%b %Y")) %>%
filter(complete.cases(yearmonth)) %>%
arrange(starttime)
-output
# A tibble: 8 × 2
starttime yearmonth
<dttm> <chr>
1 2020-01-07 17:19:00 Jan 2020
2 2020-01-22 09:03:00 Jan 2020
3 2020-12-16 07:24:00 Dec 2020
4 2021-06-21 13:20:00 Jun 2021
5 2021-10-26 07:19:00 Oct 2021
6 2021-11-08 10:14:00 Nov 2021
7 2022-03-14 09:48:00 Mar 2022
8 2022-05-12 13:29:00 May 2022
Try this with lubridate
library(lubridate)
data.frame(df,
Year = format(mdy_hm(df$starttime), "%Y"),
MonthYear = format(mdy_hm(df$starttime), "%b %Y"))
starttime Year MonthYear
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
5 11/8/21 10:14 2021 Nov 2021
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
8 10/26/21 7:19 2021 Oct 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
It uses mdy_hm in conjunction with format to get the desired Year %Y and %b %Y abbreviated month and year part of the date.
Ordered rows:
df_new <- data.frame(df,
Year = format(mdy_hm(df$starttime), "%Y"),
MonthYear = format(mdy_hm(df$starttime), "%b %Y"))
df_new[order(my(df_new$MonthYear)),]
starttime Year MonthYear
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
8 10/26/21 7:19 2021 Oct 2021
5 11/8/21 10:14 2021 Nov 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
Without NAs
na.omit(df_new[order(my(df_new$MonthYear)),])
starttime Year MonthYear
3 1/22/20 9:03 2020 Jan 2020
4 1/07/20 17:19 2020 Jan 2020
1 12/16/20 7:24 2020 Dec 2020
2 6/21/21 13:20 2021 Jun 2021
8 10/26/21 7:19 2021 Oct 2021
5 11/8/21 10:14 2021 Nov 2021
9 3/14/22 9:48 2022 Mar 2022
10 5/12/22 13:29 2022 May 2022
Data
df <- structure(list(starttime = c("12/16/20 7:24", "6/21/21 13:20",
"1/22/20 9:03", "1/07/20 17:19", "11/8/21 10:14", NA, NA, "10/26/21 7:19",
"3/14/22 9:48", "5/12/22 13:29")), class = "data.frame", row.names = c(NA,
-10L))

How to add percentile (/quantile) values to a column in dataframe

My data set has flow rate measurements of a river for every day of the year from 2009 to 2021. This is split up into seasons: Winter (December, Jan, Feb), Spring (March, April, May), Summer (June, July, August) and Autumn (September, October, November).
This is a sample of my data set:
> (chitt_brook_wylye_2)
# A tibble: 4,437 x 7
river year season month date flow_rate quality
<chr> <dbl> <chr> <chr> <dttm> <dbl> <chr>
1 chittern_brook 2009 Winter December 2009-12-01 00:00:00 0.059 Good
2 chittern_brook 2009 Winter December 2009-12-02 00:00:00 0.061 Good
3 chittern_brook 2009 Winter December 2009-12-03 00:00:00 0.064 Good
4 chittern_brook 2009 Winter December 2009-12-04 00:00:00 0.068 Good
5 chittern_brook 2009 Winter December 2009-12-05 00:00:00 0.076 Good
6 chittern_brook 2009 Winter December 2009-12-06 00:00:00 0.138 Good
7 chittern_brook 2009 Winter December 2009-12-07 00:00:00 0.592 Good
8 chittern_brook 2009 Winter December 2009-12-08 00:00:00 1.04 Good
9 chittern_brook 2009 Winter December 2009-12-09 00:00:00 1.46 Good
10 chittern_brook 2009 Winter December 2009-12-10 00:00:00 1.7 Good
# ... with 4,427 more rows
I want to find the 95th percentile, 5th percentile, median and the mean of each season of every year and have the values for 95th 5th, median and mean in separate columns in a new dataframe.
For example:
> (df)
# A tibble: 49 x 2
season_label flow_rate_mean Q95 Q5 flow_rate_median
<chr> <dbl>
1 Winter 2009 0.453 3 2 4
2 Spring 2010 0.519 6 3 4
3 Summer 2010 0.0627 4 3 6
4 Autumn 2010 0.0415 6 2 6
5 Winter 2010 0.0622 8 3 3
6 Spring 2011 0.188 10 3 2
7 Summer 2011 0.0499 2 3 2
8 Autumn 2011 0.0383 2 2 1
9 Winter 2011 0.0461 5 2 7
10 Spring 2012 0.0925 3 2 8
# ... with 39 more rows
I currently have this code which creates the above dataframe with just the first two columns but I would like it to also include 95th percentile, 5th percentile and median. Is this feasible or will I need to do it separately and then combine it into one dataframe?
df <- chitt_brook_wylye_2 %>%
dplyr::mutate(month = as.numeric(format(date,"%m")),
year = as.numeric(format(date,"%Y")),
season_id = (12*year + month) %/% 3) %>%
dplyr::group_by(season_id) %>%
dplyr::mutate(season_label = paste(season, min(year))) %>%
dplyr::group_by(season_id,season_label) %>%
dplyr::summarise(flow_rate = mean(flow_rate))
Reproducible example and code:
date <- as.Date(c("2009-12-01","2010-01-01","2010-02-01","2010-03-01","2010-04-01","2010-05-01","2010-06-01","2010-07-01","2010-08-01","2010-09-01","2010-10-01","2010-11-01","2010-12-01"))
season <- c("Winter","Winter","Winter","Spring","Spring","Spring","Summer","Summer","Summer","Autumn","Autumn","Autumn","Winter")
var <- c(1,2,3,5,5,5,7,7,7,9,9,9,10)
df <- data.frame(date,season,var) %>% # creating the dataframe
dplyr::mutate(month = as.numeric(format(date,"%m")),
year = as.numeric(format(date,"%Y")),
season_id = (12*year + month) %/% 3) %>% #generating an identifiant for every season that exists in the data
dplyr::group_by(season_id) %>% # Grouping by the id
dplyr::mutate(season_label = paste(min(year),season)) %>%
dplyr::group_by(season_id,season_label) %>% ## season_label to keep the newly created label after the arriving summarise
dplyr::summarise(var = mean(var)) # Computing the mean
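It should be possible to get all four statistics in one pass, since summarise() accepts several expressions at once. Below is a minimal sketch on the reproducible example, not a tested solution for the full data set: the column names var_mean, var_median, p95 and p05 are made up here, and p95/p05 are the plain 95th and 5th percentiles from quantile() with its default interpolation. If Q95 is meant in the hydrological sense (flow exceeded 95% of the time), the two would swap.

```r
library(dplyr)

date <- as.Date(c("2009-12-01", "2010-01-01", "2010-02-01", "2010-03-01",
                  "2010-04-01", "2010-05-01", "2010-06-01", "2010-07-01",
                  "2010-08-01", "2010-09-01", "2010-10-01", "2010-11-01",
                  "2010-12-01"))
season <- c("Winter", "Winter", "Winter", "Spring", "Spring", "Spring",
            "Summer", "Summer", "Summer", "Autumn", "Autumn", "Autumn",
            "Winter")
var <- c(1, 2, 3, 5, 5, 5, 7, 7, 7, 9, 9, 9, 10)

res <- data.frame(date, season, var) %>%
  mutate(month = as.numeric(format(date, "%m")),
         year = as.numeric(format(date, "%Y")),
         season_id = (12 * year + month) %/% 3) %>%   # one id per season-year
  group_by(season_id) %>%
  mutate(season_label = paste(season, min(year))) %>%
  group_by(season_id, season_label) %>%
  summarise(var_mean   = mean(var),
            var_median = median(var),
            p95 = quantile(var, 0.95, names = FALSE),  # 95th percentile
            p05 = quantile(var, 0.05, names = FALSE),  # 5th percentile
            .groups = "drop")
res
```

For the real data, replacing var with flow_rate and the data.frame() call with chitt_brook_wylye_2 should give one row per season and year with all four columns.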

Trying to use gg_lag() but apparently have more than one time series

I'm trying to find lag using gg_lag but I keep getting the same error regarding my tsibble
# A tsibble: 255 x 6 [7D]
# Key: Demand [163]
Week Demand Date Month year Quarter
<dbl> <dbl> <date> <mth> <chr> <qtr>
1 1 48 2018-01-01 2018 Jan 2018 2018 Q1
2 2 101 2018-01-08 2018 Jan 2018 2018 Q1
3 3 129 2018-01-15 2018 Jan 2018 2018 Q1
4 4 113 2018-01-22 2018 Jan 2018 2018 Q1
5 5 116 2018-01-29 2018 Jan 2018 2018 Q1
6 6 123 2018-02-05 2018 Feb 2018 2018 Q1
7 7 137 2018-02-12 2018 Feb 2018 2018 Q1
8 8 136 2018-02-19 2018 Feb 2018 2018 Q1
9 9 151 2018-02-26 2018 Feb 2018 2018 Q1
10 10 87 2018-03-05 2018 Mar 2018 2018 Q1
# ... with 245 more rows
Printer_Q %>% gg_lag(Demand, geom='point')
Error: The data provided to contains more than one time series. Please filter a single time series to use gg_lag()
I tried filtering my data with:
Printer_Q <- Demandts %>%
select(-Week, -year, -Month, -Quarter)
...so that I am left with Demand and Date but it still says I have more than one time series? What am I doing wrong?
The Demand column should not be a key variable. A key variable is a categorical variable used to distinguish multiple time series in a single tsibble. It appears you just have one time series here, so you don't need a key variable.
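A sketch of what that could look like, assuming the tsibble package is installed. The names demand_df and demand_ts are hypothetical, and the ten rows are copied from the printout in the question; the point is only that the tsibble is built with an index and no key.

```r
library(tsibble)

# First ten weeks of the series as printed in the question.
demand_df <- data.frame(
  Date   = as.Date("2018-01-01") + 7 * (0:9),
  Demand = c(48, 101, 129, 113, 116, 123, 137, 136, 151, 87)
)

# Declare only the index; with no key the tsibble holds a single series.
demand_ts <- as_tsibble(demand_df, index = Date)

key_vars(demand_ts)  # no key variables
n_keys(demand_ts)    # number of distinct series

# With feasts loaded, the lag plot from the question should then work:
# demand_ts %>% feasts::gg_lag(Demand, geom = "point")
```

Dropping columns with select() does not remove a key, which is why the filtered tsibble in the question was still treated as many series.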

sorting of month in matrix in R

I have a matrix in this format:
year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274
I need to sort months on the basis of their occurrence i.e jan, feb, mar... when I sort it gets sorted on the basis of first alphabet. I used this:
mat <- mat[order(mat[,1], decreasing = TRUE), ]
and it looks like this :
row.names April August December February January July June March May November October September
1 2015 59535 0 0 24258 22785 0 31356 40274 84211 0 0 0
2 2014 466 10982 35881 17 0 2981 1279 289 879 8911 8565 4000
Can we sort months on the basis of occurrence in R ?
Suppose DF is the data frame from which you derived your matrix. We provide such a data frame in reproducible form at the end. Ensure that month and year are factors with appropriate levels. Note that month.name is a builtin variable in R that is used here to ensure that the month levels are appropriately sorted and we have assumed year is a numeric column. Then use levelplot like this:
DF2 <- transform(DF,
month = factor(as.character(month), levels = month.name),
year = factor(year)
)
library(lattice)
levelplot(Freq ~ year * month, DF2)
Note: Here is DF in reproducible form:
Lines <- " year month Freq
1 2014 April 466
2 2015 April 59535
3 2014 August 10982
4 2015 August 0
5 2014 December 35881
6 2015 December 0
7 2014 February 17
8 2015 February 24258
9 2014 January 0
10 2015 January 22785
11 2014 July 2981
12 2015 July 0
13 2014 June 1279
14 2015 June 31356
15 2014 March 289
16 2015 March 40274 "
DF <- read.table(text = Lines, header = TRUE)
Assuming you want to sort based on time (a dummy day 1 is added so the string can be parsed as a date):
time = as.POSIXct(strptime(paste(1, mat$month, mat$year), format = "%d %B %Y"))
mat = mat[order(time), ]
Or if you don't care about the year:
time = as.POSIXct(strptime(paste(1, mat$month, 2000), format = "%d %B %Y"))
mat = mat[order(time), ]
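Another way to get calendar order without parsing dates is to look each month up in the built-in month.name vector. A minimal sketch on a few rows taken from the question's data (the object names sorted and wide are only illustrative):

```r
# Six rows from the question's table.
DF <- data.frame(
  year  = c(2014, 2015, 2014, 2015, 2014, 2015),
  month = c("April", "April", "August", "August", "January", "January"),
  Freq  = c(466, 59535, 10982, 0, 0, 22785)
)

# Long form: order rows by year, then by position of the month in month.name.
sorted <- DF[order(DF$year, match(DF$month, month.name)), ]
sorted

# Wide form: make month a factor with calendar-ordered levels so the
# cross-tabulated columns come out January, April, August.
DF$month <- factor(DF$month, levels = month.name[month.name %in% DF$month])
wide <- xtabs(Freq ~ year + month, data = DF)
colnames(wide)
```

Since match() and factor levels compare against the English names in month.name, this does not depend on the session locale.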
