Plotting average monthly counts per decade on a plot - r

I have a data set that has monthly "flows" over 68 years. I am trying to make a comparison of flow distributions by decade by making a plot that has a seasonal distribution on the x-axis and displays a mean value for each decade on the plot.

Using your sample data, and the tidyverse packages, the following code will calculate the average per decade and month:
library(tidyverse)
x <- "Year Jan Feb Mar Apr May Jun Jul Aug Sep
1948 29550 47330 64940 61140 20320 17540 37850 29250 17100
1949 45700 53200 37870 36310 39200 23040 31170 23640 19720
1950 16050 17950 27040 21610 15510 16090 12010 11360 14390
1951 14280 13210 16260 24280 13570 9547 9921 8129 7304
1952 19030 29250 58860 31780 19940 16930 9268 9862 9708
1953 24340 28020 31830 29700 44980 15630 22660 14190 13430
1954 34660 23260 24390 21500 13250 10860 10700 8188 6092
1955 14050 19430 12780 19330 12210 7892 12450 10920 6850
1956 7262 20800 27680 24110 13560 8594 10150 7721 10540
1957 14470 13350 22720 39860 23980 12630 10230 7008 8567"
d <- read_table(x) %>%
  mutate(
    decade = (Year %/% 10) * 10 # add column for decade
  ) %>%
  select(-Year) %>% # remove the year
  pivot_longer( # convert to a 'tidy' (long) format
    cols = Jan:Sep,
    names_to = "month",
    values_to = "count"
  ) %>%
  mutate(
    month = factor(month, levels = month.abb, ordered = TRUE) # make sure months are ordered
  ) %>%
  group_by(decade, month) %>%
  summarise(
    mean = mean(count)
  )
If you print that dataframe, you get:
> d
# A tibble: 18 x 3
# Groups: decade [2]
decade month mean
<dbl> <ord> <dbl>
1 1940 Jan 37625
2 1940 Feb 50265
3 1940 Mar 51405
4 1940 Apr 48725
5 1940 May 29760
6 1940 Jun 20290
7 1940 Jul 34510
8 1940 Aug 26445
9 1940 Sep 18410
10 1950 Jan 18018.
11 1950 Feb 20659.
12 1950 Mar 27695
13 1950 Apr 26521.
14 1950 May 19625
15 1950 Jun 12272.
16 1950 Jul 12174.
17 1950 Aug 9672.
18 1950 Sep 9610.
If you need it back in wide format:
d2 <- d %>%
pivot_wider(
id_cols = decade,
names_from = month,
values_from = mean
)
> d2
# A tibble: 2 x 10
# Groups: decade [2]
decade Jan Feb Mar Apr May Jun Jul Aug Sep
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1940 37625 50265 51405 48725 29760 20290 34510 26445 18410
2 1950 18018. 20659. 27695 26521. 19625 12272. 12174. 9672. 9610.
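Since the question asks for a plot, here is a minimal sketch of one way to draw a long-format summary like `d` with ggplot2, one line per decade. It uses a small hypothetical stand-in, `d_demo`, with the same three columns (values copied from the first three months of the printed output, rounded as printed). The aesthetics and axis labels are illustrative choices, not part of the answer above.

```r
library(ggplot2)

# Hypothetical stand-in for the summarised data: one row per decade and month
d_demo <- data.frame(
  decade = rep(c(1940, 1950), each = 3),
  month  = factor(rep(c("Jan", "Feb", "Mar"), times = 2), levels = month.abb),
  mean   = c(37625, 50265, 51405, 18018, 20659, 27695)
)

# One line per decade; month is a factor, so the x-axis keeps calendar order
p <- ggplot(d_demo, aes(x = month, y = mean,
                        colour = factor(decade), group = decade)) +
  geom_line() +
  geom_point() +
  labs(x = "Month", y = "Mean flow", colour = "Decade")
p
```

With the real data, passing `d` in place of `d_demo` should work the same way, assuming the column names match.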

(Edit: changed from line graph to dodged bar plot, to better align with OP code.)
Here's an approach using dplyr, tidyr, and ggplot2 from tidyverse.
library(tidyverse)
M %>%
  group_by(Decade = floor(Year/10)*10) %>%
  summarize_at(vars(Jan:Sep), mean) %>%
  # This uses tidyr::pivot_longer to reshape the data longer, which gives us the
  # ability to map decade to color.
  pivot_longer(-Decade, names_to = "Month", values_to = "Avg") %>%
  # This step makes Month a factor whose levels follow order of appearance,
  # which is necessary to avoid the months showing up in alphabetical order.
  mutate(Month = fct_inorder(Month)) %>%
  # Alternatively, we could have ordered the levels by calendar position:
  # mutate(Month_order = match(Month, month.abb)) %>%
  # mutate(Month = fct_reorder(Month, Month_order)) %>%
  ggplot(aes(Month, Avg, fill = as.factor(Decade))) +
  geom_col(position = position_dodge()) +
  scale_fill_discrete(name = "Decade")
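To see why the factor step matters, here is a small base R sketch of the difference between the default (alphabetical) level order and levels taken in order of appearance. The vector `months` is a made-up example.

```r
months <- c("Jan", "Feb", "Mar", "Apr")

# Default: factor levels are sorted alphabetically
alphabetical <- factor(months)
levels(alphabetical)
# "Apr" "Feb" "Jan" "Mar"

# Levels in order of first appearance, which is what forcats::fct_inorder() produces
in_order <- factor(months, levels = unique(months))
levels(in_order)
# "Jan" "Feb" "Mar" "Apr"
```

ggplot2 places discrete axis values in level order, so the second form keeps the months in calendar order on the x-axis.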

Related

How to plot monthly data having in the x-axis months and Years R studio

I have a dataframe where column 1 is Months, column 2 is Years, and column 3 is precipitation values.
I want to plot the precipitation values for EACH month and EACH year.
My data run from January 1961 to February 2019.
How can I plot that?
Here is my data:
If I use this:
plot(YearAn,PPMensual,type="l",col="red",xlab="años", ylab="PP media anual")
I get this:
Which is wrong because it puts all the monthly values in every single year! What I'm looking for is an x-axis that looks like "JAN-1961, FEB-1961, ... until FEB-2019".
It can be done easily using ggplot/tidyverse packages.
First, let's load the packages (ggplot2 is part of the tidyverse) and create some sample data:
library(tidyverse)
set.seed(123)
df <- data.frame(month = rep(c(1:12), 2),
year = rep(c("1961", "1962"),
each = 12),
ppmensual = rnorm(24, 5, 2))
Now we can plot the data (df):
df %>%
ggplot(aes(month, ppmensual,
group = year,
color = year)) +
geom_line()
Using lubridate and ggplot2 but with no grouping:
Setup
library(lubridate) # for make_date()
library(ggplot2) # for the graphic
library(dplyr) # for tibble() and %>%
df <- tibble(month = rep(month.name, 40),
year = rep(c(1961:2000), each = 12),
PP = runif(12*40) * runif(12*40) * 10) # PP data is random here
print(df, n = 20)
month year PP
<chr> <int> <dbl>
1 January 1961 5.42
2 February 1961 0.855
3 March 1961 5.89
4 April 1961 1.37
5 May 1961 0.0894
6 June 1961 2.63
7 July 1961 1.89
8 August 1961 0.148
9 September 1961 0.142
10 October 1961 3.49
11 November 1961 1.92
12 December 1961 1.51
13 January 1962 5.60
14 February 1962 1.69
15 March 1962 1.14
16 April 1962 1.81
17 May 1962 8.11
18 June 1962 0.879
19 July 1962 4.85
20 August 1962 6.96
# … with 460 more rows
Graph
df %>%
  # match() converts month names to 1..12; factor(month) would sort them alphabetically
  ggplot(aes(x = make_date(year, match(month, month.name)), y = PP)) +
  geom_line() +
  xlab("años")
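The key step is turning a month name plus a year into a real date, so the x-axis is a single continuous timeline. Below is a minimal base R sketch of that conversion, using a hypothetical `df_demo` whose column names mirror the example above.

```r
# Hypothetical example data: 12 months for each of two years
df_demo <- data.frame(
  month = rep(month.name, times = 2),
  year  = rep(c(1961L, 1962L), each = 12)
)

# match() maps "January".."December" to 1..12, independent of alphabetical order
month_num <- match(df_demo$month, month.name)

# Build the first day of each month as a Date
df_demo$date <- as.Date(sprintf("%d-%02d-01", df_demo$year, month_num))

range(df_demo$date)
```

With a Date column on the x-axis, ggplot2's `scale_x_date(date_labels = "%b-%Y")` is one way to get labels in the "JAN-1961" style the question asks for. The exact label text depends on the locale.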

ratio calculation and sort the calculated rates

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
stringsAsFactors = FALSE)
df8 <- read.csv('https://raw.githubusercontent.com/hirenvadher954/Worldometers-Scraping/master/countries.csv',
stringsAsFactors = FALSE)
install.packages("tidyverse")
library(tidyverse)
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
group_by(countryName) %>%
unique() %>%
summarize(population = sum(population, na.rm = TRUE),
confirmed = sum(confirmed, na.rm = TRUE),
recovered = sum(recovered, na.rm = TRUE),
death = sum(death, na.rm = TRUE),
death_prop = paste0(as.character(death), "/", as.character(population))
)
This code is meant to calculate the ratio of deaths to population (death/population).
I want to find the 10 countries with the highest death/population rate.
The confirmed and recovered columns are not needed.
10 x 6
countryName population confirmed recovered death death_prop
<chr> <dbl> <int> <int> <int> <chr>
1 Afghanistan 4749258212 141652 16505 3796 3796/4749258212
2 Albania 351091234 37233 22518 1501 1501/351091234
3 Algeria 5349827368 206413 88323 20812 20812/5349827368
4 Andorra 9411324 38518 18054 2015 2015/9411324
5 Angola 4009685184 1620 435 115 115/4009685184
6 Anguilla 1814018 161 92 0 0/1814018
7 Antigua and Barbuda 11947338 1230 514 128 128/11947338
8 Argentina 5513884428 232975 66155 10740 10740/5513884428
9 Armenia 361515646 121702 46955 1626 1626/361515646
10 Aruba 13025452 5194 3135 91 91/13025452
The output above is only an example; its values are not correct.
The data are cumulative: each day's row already includes every earlier day's counts, so summing over rows counts the same cases many times. Take only the max of each column per country, then calculate death_prop.
library(dplyr)
library(stringr) # for str_remove_all()
df %>%
left_join(df8, by = c("countryName" = "country_name")) %>%
mutate(population = as.numeric(str_remove_all(population, ","))) %>%
group_by(countryName) %>%
summarise_at(vars(population:death), max, na.rm = TRUE) %>%
mutate(death_prop = death/population * 100) %>%
arrange(desc(death_prop))
# A tibble: 215 x 5
# countryName population year death death_prop
# <chr> <dbl> <dbl> <int> <dbl>
# 1 San Marino 33860 2019 42 0.124
# 2 Belgium 11589623 2020 9312 0.0803
# 3 Andorra 77142 2019 51 0.0661
# 4 Spain 46754778 2020 28752 0.0615
# 5 Italy 60461826 2020 32877 0.0544
# 6 United Kingdom 67886011 2020 36914 0.0544
# 7 France 65273511 2020 28432 0.0436
# 8 Sweden 10099265 2020 4029 0.0399
# 9 Sint Maarten 42388 2019 15 0.0354
#10 Netherlands 17134872 2020 5830 0.0340
# … with 205 more rows
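The difference between summing and taking the maximum of a cumulative series can be seen on a toy example. This is a base R sketch with made-up numbers; `daily`, `summed`, and `latest` are hypothetical names.

```r
# Made-up cumulative death counts: three daily reports for two countries
daily <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  death   = c(1, 3, 6, 0, 2, 2)
)

# Summing a cumulative series counts earlier days again and again
summed <- aggregate(death ~ country, data = daily, FUN = sum)
summed$death
# 10 4

# The maximum is the latest running total, which is the figure we want
latest <- aggregate(death ~ country, data = daily, FUN = max)
latest$death
# 6 2
```

The same effect likely explains the inflated population figures in the question's output: the population is repeated on every daily row, so `sum()` multiplies it by the number of rows.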

generate seasonal plot, but with fiscal year start/end dates

Hello! Is there a way to index a chart to start and end at specific points
(which may be out of numeric order)?
I have data that begins October 1st and ends September 30th the following year. The series repeats over multiple past years, and I want to build a daily seasonality chart. The challenge is that the x-axis does not run from low to high; it runs 10-11-12-1-2-3-4-5-6-7-8-9.
Question 1:
Can you order the index by month 10-11-12-1-2-3-4-5-6-7-8-9?
while, being compatible with %m-%d formatting, as the real problem is in
daily format, but for the sake of brevity, I am only using months.
the result should look something like this...sorry i had to use excel...
Question 2:
Can we remove the connected chart lines, or will the solution to 1, naturally fix
question 2? examples in the attempts below.
Question 3:
Can the final formatting of the solution allow to take a moving average, or other
mutations of the initial data? The table in attempt #2 would allow taking the average of each month across years. Since July 17 is 6 and July 18 is 12, we would plot a 9 in the chart, etc., for the entire plot.
Question 4:
Is there an xts equivalent to solve this problem?
THANK YOU, THANK YOU, THANK YOU!
library(ggplot2)
library(plotly)
library(tidyr)
library(reshape2)
Date <- seq(as.Date("2016-10-1"), as.Date("2018-09-01"), by="month")
values <- c(2,3,4,3,4,5,6,4,5,6,7,8,9,10,8,9,10,11,12,13,11,12,13,14)
YearEnd <-c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,
2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018,2018)
df <- data.frame(Date,values,YearEnd)
## PLOT THE TIMESERIES
plot_ly(df, x = ~Date, y = ~values, type = "scatter", mode = "lines")
## PLOT THE DATA BY MONTH: attempt 1
df$Month <- format(df$Date, format="%m")
df2 <- df %>%
select(values, Month, YearEnd)
plot_ly(df2, x = ~Month, y = ~values, type = "scatter", mode = "lines",
connectgaps = FALSE)
## Plot starts on the 10th month, which is good, but the index is
## in standard order, not 10-11-12-1-2-3-4-5-6-7-8-9
## It also still connects the gaps, bad.
## CREATE A PIVOTTABLE: attempt 2
table <- spread(df2,YearEnd, values)
df3 <- melt(table , id.vars = 'Month', variable.name = 'series')
plot_ly(df3, x = ~Month, y = ~values, type = "scatter", mode = "lines",
connectgaps = FALSE)
## now the data are in the right order, but the index is still wrong
## I also do not understand how plotly is ordering it correctly, as 2
## is not the starting point in January.
You just need to set the desired levels for Month inside factor():
library(magrittr)
library(tidyverse)
library(lubridate)
library(plotly)
Date <- seq(as.Date("2016-10-1"), as.Date("2018-09-01"), by = "month")
values <- c(2, 3, 4, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10, 8, 9, 10, 11, 12, 13, 11, 12, 13, 14)
YearEnd <- c(
2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017,
2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018
)
df <- data.frame(Date, values, YearEnd)
# to fiscal year order
df %<>%
mutate(
Month = month(Date),
YearEnd = factor(YearEnd)) %>%
mutate(Month = factor(Month,
levels = c(10:12, 1:9),
labels = c(month.abb[10:12], month.abb[1:9])))
df
#> Date values YearEnd Month
#> 1 2016-10-01 2 2017 Oct
#> 2 2016-11-01 3 2017 Nov
#> 3 2016-12-01 4 2017 Dec
#> 4 2017-01-01 3 2017 Jan
#> 5 2017-02-01 4 2017 Feb
#> 6 2017-03-01 5 2017 Mar
#> 7 2017-04-01 6 2017 Apr
#> 8 2017-05-01 4 2017 May
#> 9 2017-06-01 5 2017 Jun
#> 10 2017-07-01 6 2017 Jul
#> 11 2017-08-01 7 2017 Aug
#> 12 2017-09-01 8 2017 Sep
...
p1 <- ggplot(df, aes(
x = Month, y = values,
color = YearEnd,
group = YearEnd)) +
geom_line() +
theme_classic(base_size = 12)
ggplotly(p1)
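Regarding Question 3 (averaging each month across years), a fiscal-ordered factor also makes the summary come out in fiscal order. Here is a minimal base R sketch using the question's own `Date` and `values`; `m`, `Month`, and `avg` are names introduced here for illustration.

```r
Date <- seq(as.Date("2016-10-01"), as.Date("2018-09-01"), by = "month")
values <- c(2, 3, 4, 3, 4, 5, 6, 4, 5, 6, 7, 8, 9, 10, 8, 9, 10, 11, 12, 13, 11, 12, 13, 14)

# Month number, then a factor whose levels start at October
m <- as.integer(format(Date, "%m"))
Month <- factor(m, levels = c(10:12, 1:9), labels = month.abb[c(10:12, 1:9)])

# Mean per month across both fiscal years, returned in level (fiscal) order
avg <- tapply(values, Month, mean)
avg[["Jul"]]
# 9
```

The July value matches the expectation in the question (6 in 2017 and 12 in 2018 average to 9), and `names(avg)` starts at "Oct".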
Edit: to plot by Julian day, we use a similar method to the 3rd one from this answer
# Generate random data
set.seed(2018)
date = seq(from = as.Date("2016-10-01"), to = as.Date("2018-09-30"),
by = "days")
values = c(rnorm(length(date)/2, 8, 1.5), rnorm(length(date)/2, 16, 2))
dat <- data.frame(date, values)
df <- dat %>%
as_tibble() %>% # tbl_df() is deprecated in current dplyr
mutate(jday = factor(yday(date)),
Month = month(date),
Year = year(date),
# only create label for the 1st day of the month
myLabel = case_when(day(date) == 1L ~ format(date, "%b-%d"),
TRUE ~ NA_character_)) %>%
# create fiscal year column
mutate(fcyear = case_when(Month > 9 ~ as.factor(Year + 1),
TRUE ~ as.factor(Year))) %>%
mutate(Month = factor(Month,
levels = c(10:12, 1:9),
labels = c(month.abb[10:12], month.abb[1:9])))
df
#> # A tibble: 730 x 7
#> date values jday Month Year myLabel fcyear
#> <date> <dbl> <fct> <fct> <dbl> <chr> <fct>
#> 1 2016-10-01 7.37 275 Oct 2016 Oct-01 2017
#> 2 2016-10-02 5.68 276 Oct 2016 <NA> 2017
#> 3 2016-10-03 7.90 277 Oct 2016 <NA> 2017
#> 4 2016-10-04 8.41 278 Oct 2016 <NA> 2017
#> 5 2016-10-05 10.6 279 Oct 2016 <NA> 2017
#> 6 2016-10-06 7.60 280 Oct 2016 <NA> 2017
#> 7 2016-10-07 11.1 281 Oct 2016 <NA> 2017
#> 8 2016-10-08 9.30 282 Oct 2016 <NA> 2017
#> 9 2016-10-09 7.08 283 Oct 2016 <NA> 2017
#> 10 2016-10-10 8.96 284 Oct 2016 <NA> 2017
#> # ... with 720 more rows
# Create a row number for plotting to make sure ggplot plot in
# the exact order of a fiscal year
df1 <- df %>%
group_by(fcyear) %>%
mutate(order = row_number()) %>%
ungroup()
df1
#> # A tibble: 730 x 8
#> date values jday Month Year myLabel fcyear order
#> <date> <dbl> <fct> <fct> <dbl> <chr> <fct> <int>
#> 1 2016-10-01 7.37 275 Oct 2016 Oct-01 2017 1
#> 2 2016-10-02 5.68 276 Oct 2016 <NA> 2017 2
#> 3 2016-10-03 7.90 277 Oct 2016 <NA> 2017 3
#> 4 2016-10-04 8.41 278 Oct 2016 <NA> 2017 4
#> 5 2016-10-05 10.6 279 Oct 2016 <NA> 2017 5
#> 6 2016-10-06 7.60 280 Oct 2016 <NA> 2017 6
#> 7 2016-10-07 11.1 281 Oct 2016 <NA> 2017 7
#> 8 2016-10-08 9.30 282 Oct 2016 <NA> 2017 8
#> 9 2016-10-09 7.08 283 Oct 2016 <NA> 2017 9
#> 10 2016-10-10 8.96 284 Oct 2016 <NA> 2017 10
#> # ... with 720 more rows
# plot with `order` as x-axis
p2 <- ggplot(df1,
aes(x = order, y = values,
color = fcyear,
group = fcyear)) +
geom_line() +
theme_classic(base_size = 12) +
xlab(NULL)
# now replace `order` label with `myLabel` created above
x_break <- df1$order[!is.na(df1$myLabel)][1:12]
x_label <- df1$myLabel[x_break]
x_label
#> [1] "Oct-01" "Nov-01" "Dec-01" "Jan-01" "Feb-01" "Mar-01" "Apr-01"
#> [8] "May-01" "Jun-01" "Jul-01" "Aug-01" "Sep-01"
p3 <- p2 +
scale_x_continuous(
breaks = x_break,
labels = x_label) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer("Fiscal Year", palette = "Dark2") +
xlab(NULL)
p3
ggplotly(p3)
Created on 2018-09-09 by the reprex package (v0.2.0.9000).
Consider this an appendix to Tung's excellent answer. Here I've made it explicit how to alter the code for different start and end months of financial (or production) years, which vary by country and industry, using the parameter EndMonth. I've also added an annual average, which seems like an obvious thing to want as well (though outside the OP's request).
library(tidyverse)
library(lubridate)
## Generate random data
set.seed(2018)
date = seq(from = as.Date("2016-06-01"), to = as.Date("2016-06-01")+729,
by = "days") # about 2 years, but even number of days
values = c(rnorm(length(date)/2, 8, 1.5), rnorm(length(date)/2, 16, 2))
dat <- data.frame(date, values)
EndMonth <- 5 #i.e. if last month of financial year is May, use 5 for 5th month of calendar year
df <- dat %>%
as_tibble() %>% # tbl_df() is deprecated in current dplyr
mutate(jday = factor(yday(date)),
Month = month(date),
Year = year(date),
# only create label for the 1st day of the month
myLabel = case_when(day(date) == 1L ~ format(date, "%b%e"),
TRUE ~ NA_character_)) %>%
# create fiscal year column
mutate(fcyear = case_when(Month > EndMonth ~ as.factor(Year + 1),
TRUE ~ as.factor(Year))) %>%
mutate(Month = factor(Month,
levels = c((EndMonth+1):12, 1:(EndMonth)),
labels = c(month.abb[(EndMonth+1):12], month.abb[1:EndMonth])))
df
#make 2 (or n) year average
df_mean <- df %>%
group_by(jday) %>%
mutate(values = mean(values, na.rm=TRUE)) %>%
filter(fcyear %in% c("2017")) %>% #note hard code for first fcyear in dataset
mutate(fcyear = "Average")
#Add average to data frame
df <- bind_rows(df, df_mean)
# Create a row number for plotting to make sure ggplot plot in
# the exact order of a fiscal year
df1 <- df %>%
group_by(fcyear) %>%
mutate(order = row_number()) %>%
ungroup()
df1
# plot with `order` as x-axis
p2 <- ggplot(df1,
aes(x = order, y = values,
color = fcyear,
group = fcyear)) +
geom_line() +
theme_classic(base_size = 12) +
xlab(NULL)
p2
# now replace `order` label with `myLabel` created above
x_break <- df1$order[!is.na(df1$myLabel)][1:12]
x_label <- df1$myLabel[x_break]
x_label
p3 <- p2 +
scale_x_continuous(
breaks = x_break,
labels = x_label) +
theme(axis.text.x = element_text(angle = 90)) +
scale_color_brewer("Fiscal Year", palette = "Dark2") +
xlab(NULL)
p3
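The fiscal-year rule used in both answers (months after the end month belong to the next fiscal year) can be isolated in a small helper. This is a base R sketch; `fiscal_year()` and its `end_month` argument are hypothetical names that play the role of EndMonth above.

```r
# Hypothetical helper: fiscal year label for a date, given the last month of the fiscal year
fiscal_year <- function(date, end_month = 9) {
  y <- as.integer(format(date, "%Y"))
  m <- as.integer(format(date, "%m"))
  # Months after the end month roll over into the following fiscal year
  ifelse(m > end_month, y + 1L, y)
}

fiscal_year(as.Date(c("2016-10-01", "2017-09-30", "2017-10-01")))
# 2017 2017 2018

fiscal_year(as.Date(c("2016-05-31", "2016-06-01")), end_month = 5)
# 2016 2017
```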

computing onset date of snowmelt in R [closed]

I have daily temperature in this format starting from 1950 to 2017
Data
I need to compute the snowmelt onset date, which is defined as the first day when the daily temperature is above 0 C, following the last five-day period between March and May when the daily temperature is below 0 C. My code so far:
df1<-read.csv("temp.csv")
require(dplyr)
# applying the condition to check each temperature value
df1$boolean<- ifelse(df1$temp<0.0 , 1, 0)
#computing the total sum < 0 and the start and end date
snow<-df1 %>%
mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
filter(boolean == 1 & n() > 1) %>%
summarize("Start Date"=min(as.character(date)),
"End Date"=max(as.character(date)),
"Length of Run"=n()) %>%
ungroup() %>%
select(-matches("group"))
colnames(snow)[3] <- 'length'
# keep runs with length >= 5
obs<-subset(snow,length >=5)
The code above gives me a partial solution (with further manual editing I can get a result that matches my definition). I am only interested in one onset date for each year, and I need guidance on how to edit this code to compute the onset date based on the definition above.
I have a number of locations, so manual editing would not be an ideal solution.
Your help would be appreciated.
We have assumed in (1) that the melt day must occur in Mar, Apr or May and in (2) that only the 5 subzero days occur in Mar, Apr, May but the melt day could occur in June, say.
1) Define df2 which is df1 plus additional columns: month, year and code where code is 0 if the date is not in Mar, Apr, May and is otherwise 1 if temp < 0 and 2 if temp >= 0.
Now using df2 run rollapplyr on code returning TRUE if the most recent 6 dates have codes 1, 1, 1, 1, 1, 2 and otherwise FALSE. Take the TRUE rows and only keep the last in each year. Right join that to a data frame of all years in order to generate NAs in the output for any missing years.
library(zoo)
df2 <- df1 %>%
mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")),
year = as.numeric(format(Date, "%Y")),
code = (month %in% 3:5) * ((temp < 0) + 2 * (temp >= 0)),
OK = rollapplyr(code, 6, identical, c(1, 1, 1, 1, 1, 2), fill = FALSE))
df2 %>%
filter(OK) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
right_join(unique(df2["year"]), by = "year") %>%
select(year, Date)
giving:
year Date
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 <NA>
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-05-13
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
2) In (1) we assumed that the melt onset day must be in Mar, Apr or May but here we assume that only the subzero days lie in that range and the melt onset day may extend further out.
Calculations are the same as in (1) except that the codes are now such that 1 indicates a subzero temperature in Mar, Apr or May, 2 indicates any temp above zero any time (not just in Mar, Apr and May) and 0 is anything else. We collapse the codes into a character string (one character per date) and use a regular expression on it to look for a substring of 5 ones followed by anything until we get to the next 2. We process the rest as in (1) except now we don't need the join since there will always be a melt onset day. Without the join we can represent this now as a single pipeline.
df1 %>%
mutate(Date = as.Date(Date), month = as.numeric(format(Date, "%m")),
year = as.numeric(format(Date, "%Y")),
code = (month %in% 3:5) * (temp < 0) + 2 * (temp >= 0),
OK = { g <- gregexpr("1{5}.*?2", paste(code, collapse = ""))[[1]]
seq_along(code) %in% (g + attr(g, "match.length") - 1) }) %>%
filter(OK) %>%
filter(!duplicated(year, fromLast = TRUE)) %>%
select(year, Date)
giving:
year Date
1 1950 1950-05-24
2 1951 1951-06-01
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1959 1959-06-02
11 1960 1960-05-26
12 1961 1961-05-16
13 1962 1962-05-19
14 1963 1963-06-01
15 1964 1964-05-27
16 1965 1965-05-20
17 1966 1966-05-26
18 1967 1967-05-26
19 1968 1968-05-27
20 1969 1969-05-30
21 1970 1970-05-21
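The string-matching idea in (2) can be tried on a toy vector. The sketch below uses made-up codes and adds `perl = TRUE`, which is my own choice to make the lazy quantifier `.*?` explicit; the answer above calls `gregexpr()` without it.

```r
# Toy codes: 1 = sub-zero day in Mar-May, 2 = day at or above zero, 0 = anything else
code <- c(0, 1, 1, 1, 1, 1, 0, 0, 2, 1, 2)
s <- paste(code, collapse = "") # one character per day

# Five 1s, then as little as possible, up to the next 2
g <- gregexpr("1{5}.*?2", s, perl = TRUE)[[1]]

# The last character of each match is the candidate onset day
onset_idx <- as.integer(g + attr(g, "match.length") - 1)
onset_idx
# 9
```

The match starts at day 2 and ends at day 9, the first day coded 2 after the five-day sub-zero run.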
A straightforward solution in tidyverse.
library(tidyverse)
library(lubridate)
readxl::read_excel("temp.xlsx") -> df1
df1 %>%
mutate(year = year(Date),
month = month(Date)) %>%
group_by(year) %>%
mutate(
below_0 = as.numeric(temp < 0),
streak5 = cumsum(below_0) - cumsum(lag(below_0, 5, 0)),
onset = month %in% c(3, 4, 5) & lag(streak5) == 5 & below_0 == 0) %>%
filter(onset) %>%
summarise(Date = last(Date))
Gives
# A tibble: 20 x 2
year Date
<dbl> <dttm>
1 1950 1950-05-24
2 1951 1951-05-21
3 1952 1952-05-28
4 1953 1953-05-15
5 1954 1954-05-28
6 1955 1955-05-14
7 1956 1956-05-27
8 1957 1957-05-17
9 1958 1958-05-21
10 1960 1960-05-26
11 1961 1961-05-16
12 1962 1962-05-19
13 1963 1963-05-13
14 1964 1964-05-27
15 1965 1965-05-20
16 1966 1966-05-26
17 1967 1967-05-26
18 1968 1968-05-27
19 1969 1969-05-30
20 1970 1970-05-21
I hope the code more or less explains itself: streak5 is the number of sub-zero days in the trailing five-day window (the current day and the four before it), onset implements the criteria given in the question, and summarise picks the last such date in each year.
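The cumsum trick behind streak5 is easier to follow on a short vector. Here is a base R sketch with made-up data; `lagged` and `prev_streak` emulate what `lag()` does in the pipeline above.

```r
# Made-up indicator: 1 = day below 0, 0 = day at or above 0
below_0 <- c(1, 1, 1, 1, 1, 0, 1, 0)

# Shift the series by five days, padding with 0 (like lag(below_0, 5, 0))
lagged <- c(rep(0, 5), head(below_0, -5))

# Difference of running totals = number of sub-zero days in the trailing 5-day window
streak5 <- cumsum(below_0) - cumsum(lagged)
streak5
# 1 2 3 4 5 4 4 3

# Onset: yesterday closed a full 5-day sub-zero window and today is not below 0
prev_streak <- c(NA, head(streak5, -1))
onset <- !is.na(prev_streak) & prev_streak == 5 & below_0 == 0
which(onset)
# 6
```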
rle() to the rescue!
library(broom)
library(tidyverse)
temp <- read_csv("temp.csv")
Best read the pipe below first before reading this helper function.
For each year we:
take a run-length encoding of above/below 0
the first one that's TRUE (<0) and has 5+ consecutive days is our candidate
take the next index
if that's too much (no days that fit the criteria) return NA
else return that date
thus:
mk_runs <- function(xdf) {
  r <- rle(xdf$below_0) # take the T/F RLE
  pos <- which(r$values & r$lengths >= 5)[1] # find the first run meeting the criteria
  idx <- sum(r$lengths[1:pos]) + 1 # sum the lengths up to this point and add 1 to reach the first day not below 0
  if (idx > nrow(xdf)) { # if past our date range return NA
    tibble(year = xdf$year[1], date = as.Date(NA))
  } else {
    xdf[idx, c("year", "date")]
  }
}
We need to get the data into shape:
separate(temp, Date, c("month", "day", "year")) %>%
mutate_all(as.numeric) %>%
mutate(year = ifelse(year >=50, 1900+year, 2000+year)) %>%
mutate(date = as.Date(sprintf("%04d-%02d-%02d", year, month, day))) %>%
mutate(month = lubridate::month(date)) %>%
mutate(below_0 = temp < 0) %>%
filter(month >= 3 & month <=5) %>%
group_by(year) %>% # year groups
arrange(date) %>% # in order
do(mk_runs(.)) %>% # see above function
print(n=21)
## # A tibble: 21 x 2
## # Groups: year [21]
## year date
## <dbl> <date>
## 1 1950 1950-04-30
## 2 1951 1951-05-21
## 3 1952 1952-05-28
## 4 1953 1953-05-15
## 5 1954 1954-05-28
## 6 1955 1955-05-14
## 7 1956 1956-05-02
## 8 1957 1957-05-07
## 9 1958 1958-04-27
## 10 1959 NA
## 11 1960 1960-04-24
## 12 1961 1961-05-16
## 13 1962 1962-05-19
## 14 1963 1963-05-13
## 15 1964 1964-05-20
## 16 1965 1965-05-20
## 17 1966 1966-05-07
## 18 1967 1967-04-27
## 19 1968 1968-05-10
## 20 1969 1969-05-22
## 21 1970 1970-05-21
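To make the run-length logic concrete, here is a base R sketch on a made-up logical vector, following the same steps as the helper function (first qualifying run, then the index right after it).

```r
# Made-up series: TRUE = day below 0
below_0 <- c(TRUE, TRUE, FALSE, rep(TRUE, 5), FALSE, FALSE)

r <- rle(below_0)
r$values  # TRUE FALSE TRUE FALSE
r$lengths # 2 1 5 2

# First run that is below 0 and at least 5 days long
pos <- which(r$values & r$lengths >= 5)[1]

# Days covered up to the end of that run, plus 1, gives the first day after it
idx <- sum(r$lengths[seq_len(pos)]) + 1
idx
# 9
```

Day 9 is the first day not below 0 after the five-day run, so it is the candidate onset day in this toy series.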
Here is another attempt. In my first step, I created two new columns (year and month). Then I filtered the data to keep March through May. Next, I created index numbers for rows with a temperature higher than 0 Celsius; this is done per year. Since five consecutive days must precede a day with a temperature above zero, index numbers less than or equal to 5 need to be ignored. This is handled by the inner if_else() in the true branch of the outer if_else().
In my second step, I chose to use a package called SOfun, developed by the author of splitstackshape. You can download this package from GitHub. getMyRows() does three things: 1) it identifies which rows should be considered, as specified by pattern; 2) it gets a certain range of rows around each row marked in 1); and 3) it creates a list. Here range = -5:0 means that I am choosing the five rows preceding a target row, plus the target row itself.
In my third step, I subsetted mylist with two logical conditions. !is.na(x$ind[6]) checks that the 6th element of ind is not NA, and all(x$temp[1:5] < 0) checks that the 1st-5th elements of temp (temperature) are all below zero. Filter() keeps the list elements that satisfy both conditions. Then I extracted the 6th row from each data frame, since that is the target row. I bound the list, grouped the data by year, and chose the first observation for each year using slice().
library(devtools)
install_github("mrdwab/overflow-mrdwab")
install_github("mrdwab/SOfun")
library(overflow)
library(SOfun)
library(readxl)
library(dplyr)
# Part 1
mydf <- read_excel("temp.xlsx") %>%
mutate(year = as.numeric(format(Date, "%Y")),
month = as.numeric(format(Date, "%m"))) %>%
filter(between(month, 3, 5)) %>%
group_by(year) %>%
mutate(ind = if_else(temp > 0,
{ind <- row_number()
if_else(ind <= 5, NA_integer_, ind)},
NA_integer_)) %>%
ungroup
# Part 2
mylist <- getMyRows(mydf,
pattern = which(complete.cases(mydf$ind)),
range = -5:0, isNumeric = TRUE)
# Part 3
Filter(function(x) !is.na(x$ind[6]) & all(x$temp[1:5] < 0), mylist) %>%
lapply(function(x) x[6, ]) %>%
bind_rows %>%
group_by(year) %>%
slice(1) %>%
select(Date)
year Date
<dbl> <dttm>
1 1950 1950-04-30 00:00:00
2 1951 1951-05-21 00:00:00
3 1952 1952-05-28 00:00:00
4 1953 1953-05-15 00:00:00
5 1954 1954-05-28 00:00:00
6 1955 1955-05-14 00:00:00
7 1956 1956-05-02 00:00:00
8 1957 1957-05-07 00:00:00
9 1958 1958-04-27 00:00:00
10 1960 1960-04-24 00:00:00
11 1961 1961-05-16 00:00:00
12 1962 1962-05-19 00:00:00
13 1963 1963-05-13 00:00:00
14 1964 1964-05-20 00:00:00
15 1965 1965-05-20 00:00:00
16 1966 1966-05-07 00:00:00
17 1967 1967-04-27 00:00:00
18 1968 1968-05-10 00:00:00
19 1969 1969-05-22 00:00:00
20 1970 1970-05-21 00:00:00

Use dplyr/tidyr to turn rows into columns in R data frame

I have a data frame like this:
year <-c(floor(runif(100,min=2015, max=2017)))
month <- c(floor(runif(100, min=1, max=13)))
inch <- c(floor(runif(100, min=0, max=10)))
mm <- c(floor(runif(100, min=0, max=100)))
df = data.frame(year, month, inch, mm);
year month inch mm
2016 11 0 10
2015 9 3 34
2016 6 3 33
2015 8 0 77
I only care about the columns year, month, and mm.
I need to re-arrange the data frame so that the first column is the name of the month and the rest of the columns is the value of mm.
Months 2015 2016
Jan # #
Feb
Mar
Apr
May
Jun
Jul
Aug
Sep
Oct
Nov
Dec
So two things needs to happen.
(1) The month needs to become a string of the first three letters of the month.
(2) I need to group by year, and then put the mm values in a column under that year.
So far I have this code, but I can't figure it out:
df %>%
select(-inch) %>%
group_by(month) %>%
summarize(mm = mm) %>%
ungroup()
To convert month numbers to names, you can index into month.abb. Then summarize by year and month, and spread to wide format:
library(dplyr)
library(tidyr)
df %>%
group_by(year, month = month.abb[month]) %>%
summarise(mm = mean(mm)) %>% # use mean as an example, could also be sum or other
# intended aggregation methods
spread(year, mm) %>%
arrange(match(month, month.abb)) # rearrange month in chronological order
# A tibble: 12 x 3
# month `2015` `2016`
# <chr> <dbl> <dbl>
# 1 Jan 65.50000 28.14286
# 2 Feb 54.40000 30.00000
# 3 Mar 23.50000 95.00000
# 4 Apr 7.00000 43.60000
# 5 May 45.33333 44.50000
# 6 Jun 70.33333 63.16667
# 7 Jul 72.83333 52.00000
# 8 Aug 53.66667 66.50000
# 9 Sep 51.00000 64.40000
#10 Oct 74.00000 39.66667
#11 Nov 66.20000 58.71429
#12 Dec 38.25000 51.50000
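For readers on a recent tidyr, the same reshape can be sketched with pivot_wider(), which supersedes spread(). This is an alternative sketch, not the answer's method; `df_demo` is a tiny made-up data frame, and the first column is named Months to match the layout requested in the question.

```r
library(dplyr)
library(tidyr)

# Made-up data with the same columns the question cares about
df_demo <- data.frame(
  year  = c(2015, 2015, 2016, 2016, 2016),
  month = c(1, 1, 1, 2, 2),
  mm    = c(10, 20, 30, 40, 60)
)

wide <- df_demo %>%
  # Month names as a factor so rows sort in calendar order
  mutate(Months = factor(month.abb[month], levels = month.abb)) %>%
  group_by(Months, year) %>%
  summarise(mm = mean(mm), .groups = "drop") %>% # mean as an example aggregation
  pivot_wider(names_from = year, values_from = mm) %>%
  arrange(Months)
wide
```

Month and year combinations with no data (February 2015 here) come out as NA in the wide table.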
