Computation of yearwise breakpoints in R

I have daily rainfall data, which I have converted to yearwise cumulative values using the following code:
library(tidyverse); library(segmented); library(seas); library(strucchange)
## get mscdata from "seas" packages
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
## generate cumulative sum of rain by year
d2 <- dat %>% group_by(year) %>% mutate(rain_cs = cumsum(rain)) %>% ungroup
Then I want to compute yearwise breakpoints using strucchange. I was able to do it for a single year like this:
y <- subset(d2,year=="1992")$rain_cs
breakpoints(y ~ 1, breaks = 3)$breakpoints
I have used breaks = 3 to get 3 breakpoints. How can I apply this to every year to estimate the breakpoints year by year?

You can group_by year and use summarise; since dplyr 1.0.0, summarise can return multiple rows per group (from dplyr 1.1.0, reframe() is the recommended verb for this):
library(dplyr)
library(strucchange)
d2 %>%
group_by(year) %>%
summarise(breakpoints = breakpoints(rain_cs~1, breaks = 3)$breakpoints)
# year breakpoints
# <int> <dbl>
# 1 1975 73
# 2 1975 237
# 3 1975 301
# 4 1976 83
# 5 1976 166
# 6 1976 297
# 7 1977 98
# 8 1977 239
# 9 1977 311
#10 1978 102
# … with 80 more rows
To get data as 3 columns instead, we can store the output in a list and use unnest_wider.
d2 %>%
group_by(year) %>%
summarise(breakpoints = list(breakpoints(rain_cs~1,breaks = 3)$breakpoints)) %>%
tidyr::unnest_wider(breakpoints) %>%
tibble::column_to_rownames('year')
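The same split-apply-combine idea can be sketched in base R. This is only an illustration on made-up data: toy, yearwise_breaks and quartile_days are hypothetical names, and quartile_days is a simple stand-in for the strucchange call, not a structural-break estimator.

```r
# Synthetic stand-in for d2: two years of daily cumulative rainfall (made-up values)
set.seed(1)
toy <- data.frame(year = rep(c(1975L, 1976L), each = 365))
toy$rain <- rexp(nrow(toy))
toy$rain_cs <- ave(toy$rain, toy$year, FUN = cumsum)

# Apply a breakpoint finder to each year's series; one row per year in the result
yearwise_breaks <- function(data, finder) {
  per_year <- lapply(split(data$rain_cs, data$year), finder)
  do.call(rbind, per_year)
}

# Hypothetical stand-in finder: first index where the series reaches
# 25%, 50% and 75% of the yearly total
quartile_days <- function(y) {
  sapply(c(0.25, 0.5, 0.75), function(p) which(y >= p * max(y))[1])
}

res <- yearwise_breaks(toy, quartile_days)
res

# With strucchange installed, the real finder could be passed instead:
# yearwise_breaks(d2, function(y) strucchange::breakpoints(y ~ 1, breaks = 3)$breakpoints)
```

The result is a matrix with the years as row names and one column per breakpoint, which is the same shape the unnest_wider version produces.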

Related

Create a table out of a tibble

I have the following data frame with 45 million observations:
year month variable
1992 1 0
1992 1 1
1992 1 1
1992 2 0
1992 2 1
1992 2 0
My goal is to count the frequency of the variable for each month of a year.
I was already able to generate these sums with cps_data as my dataframe and SKILL_1 as my variable.
cps_data %>%
group_by(YEAR, MONTH) %>%
summarise_at(vars(SKILL_1),
list(name = sum))
As expected, I obtained 349 rows as a tibble. Now I am struggling to create a new table from these values; it should look similar to the tibble. How can I do that? Is there even a way? I have already tried reading in an Excel file with a date range from 01/1992 to 01/2021, in order to obtain exactly 349 rows, and then merging it with the rows of the tibble, but it did not work.
# A tibble: 349 x 3
# Groups: YEAR [30]
YEAR MONTH name
<dbl> <int+lbl> <dbl>
1 1992 1 [January] 499
2 1992 2 [February] 482
3 1992 3 [March] 485
4 1992 4 [April] 457
5 1992 5 [May] 434
6 1992 6 [June] 470
7 1992 7 [July] 450
8 1992 8 [August] 438
9 1992 9 [September] 442
10 1992 10 [October] 427
# ... with 339 more rows
many thanks in advance!!
library(zoo)
createmonthyear <- function(start_date,end_date){
ym <- seq(as.yearmon(start_date), as.yearmon(end_date), 1/12)
data.frame(start = pmax(start_date, as.Date(ym)),
end = pmin(end_date, as.Date(ym, frac = 1)),
month = month.name[cycle(ym)],
year = as.integer(ym),
stringsAsFactors = FALSE)}
Once you create the function, you can specify the start and end date you want:
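# the dates must be Date objects; an unquoted 1991-01-01 would be evaluated as arithmetic
left_table <- createmonthyear(as.Date("1991-01-01"), as.Date("2021-01-01"))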
then left join the output with what you have
library(dplyr)
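right_table <- cps_data %>%
  group_by(YEAR, MONTH) %>%
  summarise_at(vars(SKILL_1),
               list(name = sum)) %>%
  ungroup() %>%
  # assumption: MONTH is a labelled integer (1-12); convert it to a month name to match left_table
  mutate(MONTH = month.name[as.integer(MONTH)])
# left_table has lower-case year/month, right_table has YEAR/MONTH
results <- left_join(left_table, right_table, by = c("year" = "YEAR", "month" = "MONTH"))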
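The underlying idea, joining the monthly sums onto a complete calendar so that every month gets a row, can also be sketched in base R. The data below is made up, and sums, calendar and full are hypothetical names used only for this illustration.

```r
# Toy monthly sums with some months missing (values are made up)
sums <- data.frame(YEAR = c(1992, 1992, 1992),
                   MONTH = c(1, 2, 4),
                   name = c(499, 482, 457))

# Complete calendar of year-month pairs for the period of interest
ym <- seq(as.Date("1992-01-01"), as.Date("1992-06-01"), by = "month")
calendar <- data.frame(YEAR = as.numeric(format(ym, "%Y")),
                       MONTH = as.numeric(format(ym, "%m")))

# Left join: keep every calendar row, with NA where no sum exists
full <- merge(calendar, sums, by = c("YEAR", "MONTH"), all.x = TRUE)
full <- full[order(full$YEAR, full$MONTH), ]
full
```

Months without observations show up as NA in the name column rather than being dropped.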

Applying yearwise segmented regression in R

I have daily rainfall data, which I have converted to yearwise cumulative values using the following code:
library(seas)
library(data.table)
library(ggplot2)
#Loading data
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
dat$julian.date <- as.numeric(format(dat$date, "%j"))
DT <- data.table(dat)
DT[, Cum.Sum := cumsum(rain), by=list(year)]
df <- cbind.data.frame(day=dat$julian.date,cumulative=DT$Cum.Sum)
Then I want to apply segmented regression year by year to get yearwise breakpoints. I was able to do it for a single year like this:
library("segmented")
x <- subset(dat,year=="1984")$julian.date
y <- subset(DT,year=="1984")$Cum.Sum
fit.lm<-lm(y~x)
segmented(fit.lm, seg.Z = ~ x, npsi=3)
I have used npsi = 3 to get 3 breakpoints. How can I dynamically apply the segmented regression to every year and get the estimated breakpoints?
Here's a short script that defines a custom function so that you can run the regression for each year.
## using tidyverse processes instead of mixing and matching with other data manipulation packages
library(tidyverse); library(segmented); library(seas)
## get mscdata from "seas" packages
data(mscdata)
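dat <- (mksub(mscdata, id=1108447))
## day of the year, used as the regressor below
dat$julian.date <- as.numeric(format(dat$date, "%j"))
## generate cumulative sum of rain by year
d2 <- dat %>% group_by(year) %>% mutate(rain_cs = cumsum(rain)) %>% ungroup
## write a custom function; the argument is named yr because filter(year == year)
## would compare the column with itself and keep every row
segmentedlm <- function(data, yr){
  subset.df <- data %>% filter(year == yr)
  fit.lm <- lm(rain_cs ~ julian.date, subset.df)
  segmented(fit.lm, seg.Z = ~ julian.date, npsi=3)
}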
# run the customised function for 1975 data
segmentedlm(d2, "1975") %>% plot(., main="1975")
segmentedlm(d2, "1984") %>% plot(., main = "1984")
To output the summary of segmented linear models of multiple years into a text file:
sink("output.txt")
lapply(c("1975", "1984"), function(x) segmentedlm(d2, x))
sink()
To cover every year, pass unique(d2$year) to lapply instead of c("1975", "1984").
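One thing to watch in helper functions of this kind: in data-masked expressions such as dplyr's filter() or base subset(), a column name takes precedence over a function argument with the same name. A minimal base-R sketch on made-up data (toy, pick_shadowed and pick_year are hypothetical names) shows the effect:

```r
# Made-up data with two years
toy <- data.frame(year = rep(c(1975, 1984), each = 3),
                  rain_cs = c(1, 3, 6, 2, 4, 9))

# Argument named like the column: `year == year` compares the column with itself
pick_shadowed <- function(data, year) subset(data, year == year)

# Distinct argument name: the comparison uses the value passed in
pick_year <- function(data, yr) subset(data, year == yr)

nrow(pick_shadowed(toy, 1975))  # every row is kept
nrow(pick_year(toy, 1975))      # only the 1975 rows are kept
```

If a per-year function returns the same fit for every year, this kind of name clash is a likely cause.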
You can store the lm object in a list and apply segmented for each year.
library(tidyverse)
data <- DT %>%
group_by(year) %>%
summarise(fit.lm = list(lm(Cum.Sum~julian.date)),
julian.date1 = list(julian.date)) %>%
mutate(out = map2(fit.lm, julian.date1, function(x, julian.date)
data.frame(segmented::segmented(x,
seg.Z = ~julian.date, npsi=3)$psi))) %>%
unnest_wider(out) %>%
unnest(cols = c(Initial, Est., St.Err)) %>%
dplyr::select(-fit.lm, -julian.date1)
# A tibble: 90 x 4
# year Initial Est. St.Err
# <int> <dbl> <dbl> <dbl>
# 1 1975 84.8 68.3 1.44
# 2 1975 168. 167. 9.31
# 3 1975 282. 281. 0.917
# 4 1976 84.8 68.3 1.44
# 5 1976 168. 167. 9.33
# 6 1976 282. 281. 0.913
# 7 1977 84.8 68.3 1.44
# 8 1977 168. 167. 9.32
# 9 1977 282. 281. 0.913
#10 1978 84.8 68.3 1.44
# … with 80 more rows
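When models are fitted per group, it can be worth confirming that each fit really used only its own year's rows and that the estimates differ between years. A minimal base-R sketch on made-up data, with lm standing in for segmented (toy, fits and coefs are hypothetical names):

```r
set.seed(42)
toy <- data.frame(year = rep(c(1975L, 1976L), each = 100),
                  julian.date = rep(1:100, times = 2))
# Different slopes per year, so the per-year fits should differ
toy$Cum.Sum <- ifelse(toy$year == 1975L, 2, 5) * toy$julian.date + rnorm(200)

# Fit each year on its own rows by passing the subset explicitly via `data`
fits <- lapply(split(toy, toy$year), function(d) lm(Cum.Sum ~ julian.date, data = d))

# One row of coefficients per year
coefs <- t(sapply(fits, coef))
coefs

# Each fit should have used exactly that year's rows
sapply(fits, nobs)
```

Passing the per-year subset through the data argument keeps the model frame tied to that subset, which makes later refits on the stored model easier to reason about.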

Aggregation of the region's values in the dataset

df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
I have loaded the dataset above.
Can we find the day with the fewest deaths in the Asia region?
The important point is that deaths must first be summed over all countries in the Asia region; the days are then sorted by that total to find the day.
Expected output:
date region death
2020/02/17 asia 6300 (asia region sum)
The values in this example output are made up, not real.
Since these are cumulative cases and deaths, we need to difference the data.
library(dplyr)
df %>%
mutate(day = as.Date(day)) %>%
filter(region=="Asia") %>%
group_by(day) %>%
summarise(deaths=sum(death)) %>%
mutate(d=c(first(deaths),diff(deaths))) %>%
arrange(d)
# A tibble: 107 x 3
day deaths d
<date> <int> <int>
1 2020-01-23 18 1 # <- this day saw only 1 death in the whole of Asia
2 2020-01-29 133 2
3 2020-02-21 2249 3
4 2020-02-12 1118 5
5 2020-01-24 26 8
6 2020-02-23 2465 10
7 2020-01-26 56 14
8 2020-01-25 42 16
9 2020-01-22 17 17
10 2020-01-27 82 26
# ... with 97 more rows
So the second day of records saw the fewest deaths recorded (so far).
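The differencing step can be checked on a small example. The sketch below uses made-up cumulative counts for two hypothetical countries and base R only; toy and regional are illustrative names.

```r
# Made-up cumulative death counts for two countries in one region
toy <- data.frame(
  day = rep(as.Date("2020-01-22") + 0:3, times = 2),
  country = rep(c("A", "B"), each = 4),
  death = c(10, 11, 15, 18, 7, 7, 12, 20)
)

# Regional cumulative total per day
regional <- aggregate(death ~ day, data = toy, FUN = sum)

# Daily new deaths: the first value as-is, then day-over-day differences
regional$new <- c(regional$death[1], diff(regional$death))

# Day with the fewest new deaths
regional[which.min(regional$new), ]
```

Taking the minimum of the cumulative column instead would always pick the earliest day, which is why the differencing comes first.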
Using the dplyr package for data treatment:
df <- read.csv('https://raw.githubusercontent.com/ulklc/covid19-timeseries/master/countryReport/raw/rawReport.csv',
               stringsAsFactors = FALSE)
library(dplyr)
df_sum <- df %>% group_by(region,day) %>% # grouping by region and day
summarise(death=sum(death)) %>% # summing following the groups
filter(region=="Asia",death==min(death)) # keeping only minimum of Asia
This returns the day with the lowest cumulative total for Asia:
> df_sum
# A tibble: 1 x 3
# Groups: region [1]
region day death
<fct> <fct> <int>
1 Asia 2020/01/22 17

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sept etc, for each well by year. Note in some years there will be some months missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
dcast(site + year + Well ~ month, value.var = "value") %>%
mutate(Diff = February - August)
Thanks,
Simon
I built a sample data set and used dplyr to create a solution. It is best practice to include code that generates a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
merge(sites, wells) %>%
dplyr::rename(sites = x, wells = y) %>% # this line and the prior could be replaced on your system with initial_tibble %>% dplyr::select(sites, wells) %>% unique()
merge(months) %>%
dplyr::rename(months = y) %>%
dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
full_sites_wells_months_set %>%
dplyr::sample_frac(data_availability) %>%
dplyr::mutate(values = runif(dplyr::n())) # generate one random groundwater value per sampled row
# generate final result by joining full expected set of sites, wells, and months to actual data, then group by sites and wells and perform lag subtraction
final_tibble <-
full_sites_wells_months_set %>%
dplyr::left_join(initial_tibble, by = c("sites", "wells", "months")) %>%
dplyr::group_by(sites, wells) %>%
dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
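An alternative that handles missing months directly is to join the table to a copy of itself shifted by six months, so that each month is paired with the month six later in the same year. This is a base-R sketch on made-up values, not the approach used above; toy, later and pairs are hypothetical names, and the difference is computed as earlier minus later, as in the question's February - August.

```r
# Made-up monthly values for one well in one year; March is missing
toy <- data.frame(Well = 222, year = 1995,
                  month = month.name[c(1, 2, 4:12)],
                  value = c(8.4, 8.53, 8.92, 9.59, 9.59, 9.70,
                            9.66, 9.46, 9.49, 9.31, 9.2))
toy$m <- match(toy$month, month.name)   # month number 1..12

# Copy shifted by six months, so that July lines up with January
later <- transform(toy, m = m - 6)
later <- later[, c("Well", "year", "m", "value")]
names(later)[4] <- "value_later"

# Pair each month with the value six months later (NA if that month is absent)
pairs <- merge(toy, later, by = c("Well", "year", "m"), all.x = TRUE)
pairs$diff <- pairs$value - pairs$value_later
pairs[order(pairs$m), c("Well", "year", "month", "diff")]
```

Because the pairing is done on the month number rather than on row position, a missing month only removes its own pair instead of shifting every later comparison.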

loop to run model on subset dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
A dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow=672, ncol=4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)
for (i in 1:length(YEARsel)) {
for (j in 1:length(MONTHsel)) {
temp2 <- df_weather[df_weather$year==YEARsel[i] & df_weather$month==MONTHsel[j],]
mn <- nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
data = temp2, control=nlc,
start=list(alpha=-6.07043, phi = -10))
cr <- as.vector(coef(mn))
nv <-length(coef(mn))
table[i,1:nv] <- cr
table[i,nv+1]<- YEARsel[i]
table[i,nv+2]<- MONTHsel[j]
}
}
I've tried several options (i.e. without using nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new data frame (correct me if that's wrong):
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>%
  nest(data = -c(month, year)) %>%
  mutate(model = map(data, ~nls(mean.temp~mean.temp.year.month+alpha*sin(day*pi*2/30+phi),
                                data = .x,
                                control = nls.control(), # default settings; swap in your own nlc object if you have one
                                start=list(alpha=-6.07043, phi = -10))),
         coeff = map(model, ~as.list(coef(.x)))) %>%
  unnest_wider(coeff) %>%
  select(month, year, alpha, phi) %>%
  arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data by group (in this case year and month), then we map the model over each group, then we extract the coefficients for each group, and lastly we unnest the coefficients into one column per parameter.
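As for why the original loop only kept month 12: the results are written to table[i, ], and i counts years only, so each month overwrites the previous month's row for that year. A minimal sketch of one way to index the rows, using a running counter and a placeholder in place of the nls coefficients (out, row and est are hypothetical names):

```r
YEARsel <- 1961:1962
MONTHsel <- 1:3

# One row per year-month combination
out <- matrix(NA_real_, nrow = length(YEARsel) * length(MONTHsel), ncol = 3)

row <- 0
for (i in seq_along(YEARsel)) {
  for (j in seq_along(MONTHsel)) {
    row <- row + 1                          # advance once per (year, month) pair
    est <- YEARsel[i] + MONTHsel[j] / 100   # placeholder for coef(nls(...))
    out[row, ] <- c(est, YEARsel[i], MONTHsel[j])
  }
}
out
```

The same row index could also be computed as (i - 1) * length(MONTHsel) + j, assuming every year has the same set of months.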
