Applying yearwise segmented regression in R

I have daily rainfall data which I have converted to year-wise cumulative values using the following code:
library(seas)
library(data.table)
library(ggplot2)
#Loading data
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
dat$julian.date <- as.numeric(format(dat$date, "%j"))
DT <- data.table(dat)
DT[, Cum.Sum := cumsum(rain), by=list(year)]
df <- cbind.data.frame(day=dat$julian.date,cumulative=DT$Cum.Sum)
Then I want to apply segmented regression year-wise to get year-wise breakpoints. I was able to do it for a single year like
library("segmented")
x <- subset(dat,year=="1984")$julian.date
y <- subset(DT,year=="1984")$Cum.Sum
fit.lm<-lm(y~x)
segmented(fit.lm, seg.Z = ~ x, npsi=3)
I have used npsi = 3 to get 3 breakpoints. Now how can I dynamically apply segmented regression year-wise and obtain the estimated breakpoints?

Here's a short script that builds a customised function so that you can run the different year-wise regressions.
## using tidyverse processes instead of mixing and matching with other data manipulation packages
library(tidyverse); library(segmented); library(seas)
## get mscdata from "seas" packages
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
## generate cumulative sum of rain by year
d2 <- dat %>% group_by(year) %>% mutate(rain_cs = cumsum(rain)) %>% ungroup
## write a custom function
segmentedlm <- function(data, yr){   # argument deliberately not named "year", so filter() compares the column with the supplied value rather than with itself
  subset.df <- data %>% filter(year == yr)
  fit.lm <- lm(rain_cs ~ julian.date, data = subset.df)
  segmented(fit.lm, seg.Z = ~ julian.date, npsi = 3)
}
# run the customised function for 1975 data
segmentedlm(d2, "1975") %>% plot(., main="1975")
segmentedlm(d2, "1984") %>% plot(., main = "1984")
To output the summary of segmented linear models of multiple years into a text file:
sink("output.txt")
lapply(c("1975", "1984"), function(x) segmentedlm(d2, x))
sink()
You can change the argument for lapply to input all the years.
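For example, to run the function over every year present in the data (a sketch; all_years and fits are just illustrative names):
all_years <- unique(d2$year)
fits <- lapply(all_years, function(x) segmentedlm(d2, x))
names(fits) <- all_years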

You can store the lm object in a list and apply segmented for each year.
library(tidyverse)
data <- DT %>%
  group_by(year) %>%
  summarise(fit.lm = list(lm(Cum.Sum ~ julian.date)),
            julian.date1 = list(julian.date)) %>%
  mutate(out = map2(fit.lm, julian.date1, function(x, julian.date)
    data.frame(segmented::segmented(x,
                                    seg.Z = ~julian.date, npsi = 3)$psi))) %>%
  unnest_wider(out) %>%
  unnest(cols = c(Initial, Est., St.Err)) %>%
  dplyr::select(-fit.lm, -julian.date1)
# A tibble: 90 x 4
# year Initial Est. St.Err
# <int> <dbl> <dbl> <dbl>
# 1 1975 84.8 68.3 1.44
# 2 1975 168. 167. 9.31
# 3 1975 282. 281. 0.917
# 4 1976 84.8 68.3 1.44
# 5 1976 168. 167. 9.33
# 6 1976 282. 281. 0.913
# 7 1977 84.8 68.3 1.44
# 8 1977 168. 167. 9.32
# 9 1977 282. 281. 0.913
#10 1978 84.8 68.3 1.44
# … with 80 more rows
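If you only need the estimates for one particular year, you can filter the resulting tibble afterwards, for example:
data %>% filter(year == 1984)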

Related

How to create a cumulative variable that groups by PERMNO and arranges by date in R

I have a dataframe with variables from COMPUSTAT containing data on various accounting items, including SG&A expenses from different companies.
I want to create a new variable in the dataframe which accumulates the SG&A expenses for each company in chronological order. I use PERMNO codes as the unique ID for each company.
I have tried this code, however it does not seem to work:
crsp.comp2$cxsgaq <- crsp.comp2 %>%
  group_by(permno) %>%
  arrange(date) %>%
  mutate_at(vars(xsgaq), cumsum(xsgaq))
(xsgaq is the COMPUSTAT variable for SG&A expenses)
Thank you very much for your help
Your example code is attempting to write the entire dataframe crsp.comp2 into the single variable crsp.comp2$cxsgaq.
Also, mutate_at() expects a function (e.g. cumsum), not an already-evaluated expression like cumsum(xsgaq); in your situation, just use the standard mutate() function and assign the cxsgaq variable there.
crsp.comp2 <- crsp.comp2 %>%
  group_by(permno) %>%
  arrange(date) %>%
  mutate(cxsgaq = cumsum(xsgaq))
Reproducible example with iris dataset:
library(tidyverse)
iris %>%
  group_by(Species) %>%
  arrange(Sepal.Length) %>%
  mutate(C.Sepal.Width = cumsum(Sepal.Width))
Building on the answer from @m-viking, if using the WRDS PostgreSQL server, you would simply use window_order() (from dbplyr) in place of arrange(). (I use the Compustat firm identifier gvkey in place of permno so that this code works, but the idea is the same.)
library(dplyr, warn.conflicts = FALSE)
library(DBI)
pg <- dbConnect(RPostgres::Postgres(),
                bigint = "integer", sslmode = 'allow')

fundq <- tbl(pg, sql("SELECT * FROM comp.fundq"))

comp2 <-
  fundq %>%
  filter(indfmt == "INDL", datafmt == "STD",
         consol == "C", popsrc == "D")

comp2 <-
  comp2 %>%
  group_by(gvkey) %>%
  dbplyr::window_order(datadate) %>%
  mutate(cxsgaq = cumsum(xsgaq))

comp2 %>%
  filter(!is.na(xsgaq)) %>%
  select(gvkey, datadate, xsgaq, cxsgaq)
#> # Source: lazy query [?? x 4]
#> # Database: postgres [iangow@wrds-pgdata.wharton.upenn.edu:9737/wrds]
#> # Groups: gvkey
#> # Ordered by: datadate
#> gvkey datadate xsgaq cxsgaq
#> <chr> <date> <dbl> <dbl>
#> 1 001000 1966-12-31 0.679 0.679
#> 2 001000 1967-12-31 1.02 1.70
#> 3 001000 1968-12-31 5.86 7.55
#> 4 001000 1969-12-31 7.18 14.7
#> 5 001000 1970-12-31 8.25 23.0
#> 6 001000 1971-12-31 7.96 30.9
#> 7 001000 1972-12-31 7.55 38.5
#> 8 001000 1973-12-31 8.53 47.0
#> 9 001000 1974-12-31 8.86 55.9
#> 10 001000 1975-12-31 9.59 65.5
#> # … with more rows
Created on 2021-04-05 by the reprex package (v1.0.0)

Arrange data by variables of a data.frame in R?

I have written down the following script to get the data in longer format. How can I get the data.frame arranged by variable and not by Date? That means first I should get the data for variable A for all the dates, followed by variable X.
library(lubridate)
library(tidyverse)
set.seed(123)
DF <- data.frame(Date = seq(as.Date("1979-01-01"), to = as.Date("1979-12-31"), by = "day"),
                 A = runif(365, 1, 10), X = runif(365, 5, 15)) %>%
  pivot_longer(-Date, names_to = "Variables", values_to = "Values")
Maybe I have not understood right, but you can arrange your data according to the Variables column through the arrange() function.
library(tidyverse)
DF <- DF %>%
arrange(Variables)
Resulting in this:
# A tibble: 730 x 3
Date Variables Values
<date> <chr> <dbl>
1 1979-01-01 A 3.59
2 1979-01-02 A 8.09
3 1979-01-03 A 4.68
4 1979-01-04 A 8.95
5 1979-01-05 A 9.46
6 1979-01-06 A 1.41
7 1979-01-07 A 5.75
8 1979-01-08 A 9.03
9 1979-01-09 A 5.96
10 1979-01-10 A 5.11
# ... with 720 more rows
In base R, we can use
DF1 <- DF[order(DF$Variables),]
Am I missing something? This is it.
arrange(DF, Variables, Date) %>% select(Variables, everything())

Weekly average of daily panel data in R

I have a large panel dataset of roughly 4 million daily observations (Overview of my Dataset).
The variable symbol depicts the 952 different stocks contained in the data set and the other variables are some stock-related daily measures. I want to calculate the weekly averages of the variables rv, rskew, rkurt and rsj for each of the 952 stocks included in symbol.
I tried to group the dataset with group_by(symbol), but then I did not manage to aggregate the daily observations in the right way.
I am not very experienced with R and would highly appreciate some help here.
This is simple with the lubridate and dplyr packages:
library(dplyr)
library(lubridate)
set.seed(123)
df <- data.frame(date = seq.Date(ymd('2020-07-01'), ymd('2020-07-31'), by = 'day'),
                 symbol = 'a',
                 x = runif(31),
                 y = runif(31),
                 z = runif(31))

df <- df %>%
  mutate(year = year(date),
         week = week(date)) %>%
  group_by(year, week, symbol) %>%
  summarise(x = mean(x),
            y = mean(y),
            z = mean(z))
> df
# A tibble: 5 x 6
# Groups: year, week [5]
year week symbol x y z
<dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 2020 27 a 0.555 0.552 0.620
2 2020 28 a 0.652 0.292 0.461
3 2020 29 a 0.495 0.350 0.398
4 2020 30 a 0.690 0.495 0.609
5 2020 31 a 0.466 0.378 0.376
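One caveat: lubridate's week() just counts blocks of seven days from January 1st, so these weeks do not align with calendar weekdays. If you would rather average over calendar weeks, a sketch using floor_date() on the original daily data (df_daily below is an assumed name for the data frame before it was overwritten by the summary above):
## df_daily stands for the original daily data frame
df_daily %>%
  group_by(symbol, week = floor_date(date, unit = "week")) %>%
  summarise(x = mean(x),
            y = mean(y),
            z = mean(z))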

Computation of yearwise breakpoints in R

I have daily rainfall data which I have converted to year-wise cumulative values using the following code:
library(tidyverse); library(segmented); library(seas); library(strucchange)
## get mscdata from "seas" packages
data(mscdata)
dat <- (mksub(mscdata, id=1108447))
## generate cumulative sum of rain by year
d2 <- dat %>% group_by(year) %>% mutate(rain_cs = cumsum(rain)) %>% ungroup
Then I want to compute year-wise breakpoints using strucchange. I was able to do it for a single year like
y <- subset(d2,year=="1992")$rain_cs
breakpoints(y ~ 1, breaks = 3)$breakpoints
I have used breaks = 3 to get 3 breakpoints. Now how can I dynamically apply it year-wise to estimate the breakpoints?
You can group_by year and use summarise in dplyr 1.0.0, which allows summarise to return multiple rows per group:
library(dplyr)
library(strucchange)
d2 %>%
  group_by(year) %>%
  summarise(breakpoints = breakpoints(rain_cs ~ 1, breaks = 3)$breakpoints)
# year breakpoints
# <int> <dbl>
# 1 1975 73
# 2 1975 237
# 3 1975 301
# 4 1976 83
# 5 1976 166
# 6 1976 297
# 7 1977 98
# 8 1977 239
# 9 1977 311
#10 1978 102
# … with 80 more rows
To get data as 3 columns instead, we can store the output in a list and use unnest_wider.
d2 %>%
  group_by(year) %>%
  summarise(breakpoints = list(breakpoints(rain_cs ~ 1, breaks = 3)$breakpoints)) %>%
  tidyr::unnest_wider(breakpoints) %>%
  tibble::column_to_rownames('year')
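Since breakpoints() returns row indices within each year's series, you can map them back to the corresponding dates by indexing the date column inside the same summarise (a sketch, assuming the rows within each year are in date order):
d2 %>%
  group_by(year) %>%
  summarise(bp_date = date[breakpoints(rain_cs ~ 1, breaks = 3)$breakpoints])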

loop to run model on subset dataframe

I am not very experienced with loops so I am not sure where I went wrong here...
I have a dataframe that looks like:
month year day mean.temp mean.temp.year.month
1 1961 1 4.85 4.090323
1 1961 2 4.90 4.090323
1 1961 3 2.95 4.090323
1 1961 4 3.40 4.090323
1 1961 5 2.90 4.090323
dataset showing 3 months for 2 years can be found here:
https://drive.google.com/file/d/1w7NVeoEh8b7cAkU3cu1sXx6yCh75Inqg/view?usp=sharing
and I want to subset this dataframe by year and month so that I can run one nls model per year and month. Since my dataset contains 56 years (and each year has 12 months), that will give 672 models. Then I want to store the parameter estimates in a separate table.
I've created this code, but I can't work out why it is only giving me the parameter estimates for month 12 (all 56 years, but just month 12):
table <- matrix(99999, nrow = 672, ncol = 4)
YEARMONTHsel <- unique(df_weather[c("year", "month")])
YEARsel <- unique(df_weather$year)
MONTHsel <- unique(df_weather$month)

for (i in 1:length(YEARsel)) {
  for (j in 1:length(MONTHsel)) {
    temp2 <- df_weather[df_weather$year == YEARsel[i] & df_weather$month == MONTHsel[j], ]
    mn <- nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
              data = temp2, control = nlc,
              start = list(alpha = -6.07043, phi = -10))
    cr <- as.vector(coef(mn))
    nv <- length(coef(mn))
    table[i, 1:nv] <- cr
    table[i, nv + 1] <- YEARsel[i]
    table[i, nv + 2] <- MONTHsel[j]
  }
}
I've tried several options (i.e. without using nested loop) but I'm not getting anywhere.
Any help would be greatly appreciated! Thanks.
Based on your loop, it looks like you want to run the regression grouped by year and month and then extract the coefficients into a new dataframe (correct me if that's wrong). Incidentally, the reason you only see month 12 is that the results are written to table[i, ], so within each year every month overwrites the previous one.
library(readxl)
library(tidyverse)
df <- read_excel("~/Downloads/df_weather.xlsx")
df %>%
  nest(-month, -year) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, control = nlc,
                                 start = list(alpha = -6.07043, phi = -10))),
         coeff = map(model, ~ coefficients(.x))) %>%
  unnest(coeff %>% map(broom::tidy)) %>%
  spread(names, x) %>%
  arrange(year)
#> # A tibble: 6 x 4
#> month year alpha phi
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1961 0.561 -10.8
#> 2 2 1961 -1.50 -10.5
#> 3 3 1961 -2.06 -9.77
#> 4 1 1962 -3.35 -5.48
#> 5 2 1962 -2.27 -9.97
#> 6 3 1962 0.959 -10.8
First we nest the data based on your groups (in this case year and month), then we map the model for each group, then we map the coefficients for each group, lastly we unnest the coefficients and spread the data from long to wide.
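As a side note, nest(-month, -year) and spread() use superseded tidyr syntax; with current tidyr/purrr versions a roughly equivalent sketch (same model and start values, default nls control) would be:
df %>%
  nest(data = -c(year, month)) %>%
  mutate(model = map(data, ~ nls(mean.temp ~ mean.temp.year.month + alpha * sin(day * pi * 2 / 30 + phi),
                                 data = .x, start = list(alpha = -6.07043, phi = -10))),
         coefs = map(model, ~ as.list(coef(.x)))) %>%
  unnest_wider(coefs) %>%
  select(year, month, alpha, phi) %>%
  arrange(year, month)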
