Say I want to run regressions per group, using the last 5 years of data as input for each regression. Then, for each subsequent year, I would like to "shift" the regression's input window forward by one year (i.e., 4 quarterly observations).
From those regressions I want to extract both the R2 and the fitted values/residuals, which I then need in subsequent regressions that follow a similar scheme.
I have some code working using loops, but it is neither elegant nor efficient for large datasets. I assume there must be a nice dplyr way of resolving this issue.
# libraries #
library(dplyr)
library(broom)
# reproducible data #
df <- tibble(ID = as.factor(rep(c(1, 2), each = 40)),
             YEAR = rep(rep(2001:2010, each = 4), 2),
             QTR = rep(1:4, 20),
             DV = rnorm(80),
             IV = DV * rnorm(80))
# output vector #
output <- tibble(ID = NA,
                 YEAR = NA,
                 R2 = NA)
# loop #
k <- 1
for (i in levels(df$ID)) {
  n_row <- df %>%
    arrange(ID) %>%
    filter(ID == i) %>%
    nrow()
  for (j in seq(1, n_row - 19, by = 4)) {
    output[k, 1] <- i
    output[k, 2] <- df %>%
      filter(ID == i) %>%
      slice(j + 19) %>%
      select(YEAR) %>%
      unlist()
    output[k, 3] <- df %>%
      filter(ID == i) %>%
      slice(j:(j + 19)) %>%
      do(model = lm(DV ~ IV, data = .)) %>%
      glance(model) %>%
      select(r.squared) %>%
      ungroup()
    k <- k + 1
  }
}
Define a function which returns the year and R-squared given a subset of rows of df (without the ID column), and then use rollapply from zoo with it.
library(dplyr)
library(zoo)
R2 <- function(x) {
  x <- as.data.frame(x)
  c(YEAR = tail(x$YEAR, 1), R2 = summary(lm(DV ~ IV, x))$r.squared)
}

df %>%
  group_by(ID) %>%
  do(data.frame(rollapply(.[-1], 20, by = 4, R2, by.column = FALSE))) %>%
  ungroup
giving:
# A tibble: 12 x 3
ID YEAR R2
<fct> <dbl> <dbl>
1 1 2005 0.0133
2 1 2006 0.130
3 1 2007 0.0476
4 1 2008 0.0116
5 1 2009 0.00337
6 1 2010 0.00570
7 2 2005 0.0481
8 2 2006 0.00527
9 2 2007 0.0158
10 2 2008 0.0303
11 2 2009 0.235
12 2 2010 0.116
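The question also asks for the fitted values and residuals, which R2 above does not return. Here is a minimal sketch of one way to collect them per window with broom::augment (the window_fits name and the choice to label each window by its last YEAR are my assumptions, not part of the answer above):

library(dplyr)
library(purrr)
library(broom)

window_fits <- df %>%
  group_by(ID) %>%
  group_modify(~ {
    starts <- seq(1, nrow(.x) - 19, by = 4)      # same 20-row windows as above
    map_dfr(starts, function(j) {
      w <- .x[j:(j + 19), ]                      # 20 quarters = 5 years
      augment(lm(DV ~ IV, data = w)) %>%         # adds .fitted and .resid
        mutate(YEAR = w$YEAR[20])                # label window by its last year
    })
  }) %>%
  ungroup()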
I would like to fill a column (KE) with values from other columns (K2007/K2008/K2009) based on a condition on Year. For example, if Year is 2007, KE should be 1.
Year <- c(2007,2008,2009)
K2007 <- c(1,2,3)
K2008 <- c(4,5,6)
K2009 <- c(7,8,9)
KE <- c(1,5,9)
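(The answers below treat these vectors as columns of a single data frame; a minimal construction, with the name dat assumed:)

library(tibble)
dat <- tibble(Year, K2007, K2008, K2009)  # columns named after the vectors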
Thanks in advance,
--- 2022/06/24 update ---
I'm sorry, but the data have a further constraint: I want to assign values to KE conditional on two more column values in addition to Year.
Year <- c(2007,2008,2009)
X <- c(10, 20, 30)
Y <- c(40, 50, 60)
K2007 <- c(1,2,3)
K2008 <- c(4,5,6)
K2009 <- c(7,8,9)
KE <- c(1,5,9)
In this case, when Year = 2007, X = 10 and Y = 40, KE will be 1.
I successfully got the results for the original question with the code below.
test <- ddd %>%
  mutate(KE = case_when(
    Year == 2007 ~ ke_2007,
    Year == 2008 ~ ke_2008,
    Year == 2009 ~ ke_2009,
    TRUE ~ NA_real_))
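Extending the same case_when pattern to also check X and Y might look like this (a sketch only; the pairing of X and Y values with each year is assumed from the example above):

test <- ddd %>%
  mutate(KE = case_when(
    Year == 2007 & X == 10 & Y == 40 ~ ke_2007,
    Year == 2008 & X == 20 & Y == 50 ~ ke_2008,
    Year == 2009 & X == 30 & Y == 60 ~ ke_2009,
    TRUE ~ NA_real_))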
Thanks,
If you don't need to keep the original columns, you could pivot the data to a tidier long format, remove the "K" from the original column names, and keep only the rows in which Year matches the old column name:
library(tidyr)
library(dplyr)
dat |>
  pivot_longer(K2007:K2009, values_to = "KE") |>
  mutate(name = sub('.', '', name) |> as.double()) |>
  filter(Year == name) |>
  select(-name)
#> # A tibble: 3 x 2
#> Year KE
#> <dbl> <dbl>
#> 1 2007 1
#> 2 2008 5
#> 3 2009 9
Created on 2022-06-23 by the reprex package (v2.0.1)
Try this
df <- data.frame(K2007, K2008, K2009)
KE <- sapply(seq_along(Year),
             \(x) df[x, grep(Year[x], names(df))])
KE
#[1] 1 5 9
library(tidyverse)
tbl <- tibble(
  year = c(2007, 2008, 2009),
  k2007 = c(1, 2, 3),
  k2008 = c(4, 5, 6),
  k2009 = c(7, 8, 9))

tbl %>%
  pivot_longer(-year, values_to = 'ke') %>%
  filter(name == str_c('k', year)) %>%
  select(-name) %>%
  left_join(tbl, ., 'year')
# A tibble: 3 x 5
year k2007 k2008 k2009 ke
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2007 1 4 7 1
2 2008 2 5 8 5
3 2009 3 6 9 9
I need to obtain the mean of the TOTAL column over the last 30 years, month over month.
The dataset is available here:
library(dplyr)
ENSO <-read.table("http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/detrend.nino34.ascii.txt", header = TRUE)
glimpse(ENSO)
For example, for Sep-2021 I need to calculate:
$$\frac{TOTAL_{\text{Sep-2021}} + TOTAL_{\text{Sep-2020}} + TOTAL_{\text{Sep-2019}} + \cdots + TOTAL_{\text{Sep-1991}}}{30}$$
I tried to use dplyr::mutate, but I think slider or zoo may be helpful inside a condition, because the time series begins in Jan-1950 and the earliest months obviously have no 30 prior values to average.
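For reference, the rolling variant hinted at here could be sketched with slider like this (the TOTAL_30y column name is an assumption; .complete = TRUE leaves NA until a month has 30 yearly values):

library(dplyr)
library(slider)

ENSO %>%
  group_by(MON) %>%
  arrange(YR, .by_group = TRUE) %>%
  mutate(TOTAL_30y = slide_dbl(TOTAL, mean, .before = 29, .complete = TRUE)) %>%
  ungroup()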
You can either fix the current year to 2021 or take the highest year in the table.
Then you only need to filter out years lower than this current year minus 30.
If you want to sugar-coat this, you can even use a custom name for your column.
Here is the code:
current_y <- max(ENSO$YR)
col_name <- paste0("total_mean_", current_y - 30, "_to_", current_y)
ENSO %>%
  filter(YR > current_y - 30) %>%
  group_by(MON) %>%
  summarise(!!col_name := mean(TOTAL))
# # A tibble: 12 x 2
# MON total_mean_1991_to_2021
# <int> <dbl>
# 1 1 26.5
# 2 2 26.7
# 3 3 27.3
# 4 4 27.8
# 5 5 27.9
# 6 6 27.7
# 7 7 27.3
# 8 8 26.8
# 9 9 26.7
#10 10 26.7
#11 11 26.7
#12 12 26.5
Here is a data.table approach.
Functionally:
1 - split the table into a list of tables by month
2 - calculate the rolling mean over the last 30 entries per table (i.e., 30 years, since each monthly table holds one row per year)
3 - rowbind the monthly tables back into a single table
library(data.table)
# Make ENSO a data.table, key by year and month
setDT(ENSO, key = c("YR", "MON"))
# Split by MON
L <- split(ENSO, by = "MON")
# Loop over L, create monthly mean over the last 30 entries
L <- lapply(L, function(x) {
  x[, MON30_avg := frollmean(TOTAL, n = 30)]
})
# Rowbind List together again
final <- rbindlist(L, use.names = TRUE, fill = TRUE)
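As a side note, the split/lapply/rbindlist steps can likely be collapsed into a single grouped call (a sketch, producing the same MON30_avg column as above):

setDT(ENSO, key = c("YR", "MON"))
ENSO[, MON30_avg := frollmean(TOTAL, n = 30), by = MON]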
Thank you Bloxx! I can make use of the new variable like this:
library(dplyr)
library(fpp3)
ENSO <- read.table("http://www.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/detrend.nino34.ascii.txt", header = TRUE) %>%
  mutate(
    Dates = paste(YR, "-", MON),
    Dates = yearmonth(Dates),
    Month_Year = paste(month.name[month(Dates)], "/", year(Dates)),
    diff_total = difference(TOTAL),
    ANOM = round(TOTAL - ClimAdjust, digits = 2),
    # TMA = 3-month moving average: mean(ANOM_{t-1}, ANOM_t, ANOM_{t+1})
    TMA = round(slide_dbl(ANOM, mean, .before = 1, .after = 1), digits = 2),
    # `Climatic Condition` = El Niño if the 5 last consecutive TMA > 0.5; La Niña if the 5 last consecutive TMA < -0.5
    `Climatic Condition` =
      lag(case_when(
        rollapplyr(TMA < -0.5, 5, all, fill = FALSE) ~ "La Niña",
        rollapplyr(TMA > 0.5, 5, all, fill = FALSE) ~ "El Niño")),
    `3 months` =
      case_when(
        month(Dates) == 1 ~ "DJF",
        month(Dates) == 2 ~ "JFM",
        month(Dates) == 3 ~ "FMA",
        month(Dates) == 4 ~ "MAM",
        month(Dates) == 5 ~ "AMJ",
        month(Dates) == 6 ~ "MJJ",
        month(Dates) == 7 ~ "JJA",
        month(Dates) == 8 ~ "JAS",
        month(Dates) == 9 ~ "ASO",
        month(Dates) == 10 ~ "SON",
        month(Dates) == 11 ~ "OND",
        month(Dates) == 12 ~ "NDJ")
  ) %>%
  as_tsibble(index = Dates)
ENSO <- ENSO %>% # To reorder the dataframe
  select(
    Dates,
    Month_Year,
    YR,
    MON,
    TOTAL,
    ClimAdjust,
    ANOM,
    TMA,
    `3 months`,
    `Climatic Condition`,
    diff_total
  )
ClimAdj <- ENSO %>%
  group_by(MON) %>%
  summarise(ClimAdj = mean(TOTAL))

ENSO <- left_join(ENSO,
                  ClimAdj %>% select(Dates, ClimAdj),
                  by = c("Dates" = "Dates"))

ENSO <- ENSO %>%
  select(-MON.y) %>%
  rename(MON = "MON.x")
ENSO <- ENSO %>%
  select(
    Dates,
    Month_Year,
    YR,
    MON,
    TOTAL,
    # ClimAdjust,
    ClimAdj,
    ANOM,
    TMA,
    `3 months`,
    `Climatic Condition`,
    diff_total
  )
glimpse(ENSO)
Here is the updated code. First arrange by year and month, then slice the last 360 months (30 years), then group by month and calculate the mean:
ENSO %>% arrange(YR, MON) %>% slice_tail(n = 360) %>% group_by(MON) %>% summarise(mean(TOTAL))
Hope this is what you want: each month then has its mean over the last 30 years.
Say you have the following time series dataset in r:
n <- 3
set.seed(1)
df <- data.frame(Day = rep("Mon", n),
                 Time = 1:n,
                 Temper = round(rnorm(n, 4, 2), 0))
print(df)
Day Time Temper
Mon 1 3
Mon 2 4
Mon 3 2
Now say you wish to add midpoints between adjacent existing values as extra rows in the data frame. For example, say you wish to add the exact midpoint between every pair of adjacent values, to produce the following new dataframe:
Day Time Temper
Mon 1 3
Mon 1.5 3.5
Mon 2 4
Mon 2.5 3
Mon 3 2
What would be some efficient R code that could accomplish this on much larger datasets?
It would be great if this code could also populate the data frame with values other than the exact midpoints, for example the 'one-third' points:
Day Time Temper
Mon 1 3
Mon 1.33 3.33
Mon 2 4
Mon 2.33 3.33
Mon 3 2
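For what it's worth, base R's approx() does linear interpolation at arbitrary positions, so both variants reduce to choosing the output grid. Here is a sketch (the insert_points helper and its frac argument are made up; it assumes a single Day value, so wrap it in a grouped call for several days):

# Insert one interpolated point at fraction `frac` of each Time interval.
insert_points <- function(df, frac = 1/2) {
  xout <- sort(c(df$Time, head(df$Time, -1) + frac * diff(df$Time)))
  data.frame(Day = df$Day[1],
             Time = xout,
             Temper = approx(df$Time, df$Temper, xout = xout)$y)
}

insert_points(df, 1/2)  # midpoints
insert_points(df, 1/3)  # one-third points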
Another solution:
library(tidyverse)
df %>%
  slice(rep(1:n(), each = 2)) %>%
  mutate_at(c("Time", "Temper"), function(x) {
    replace(x, seq(2, n(), 2),
            (x + (1 / 3) * (lead(x) - lag(x)))[seq(2, n(), 2)])
  }) %>%
  mutate_at(c("Time", "Temper"), round, 2) %>%
  slice(-n())
#> Day Time Temper
#> 1 Mon 1.00 3.00
#> 2 Mon 1.33 3.33
#> 3 Mon 2.00 4.00
#> 4 Mon 2.33 3.33
#> 5 Mon 3.00 2.00
Here is an idea using dplyr and purrr. We first rbind rows of NA after the existing rows and then fill those NAs:
library(tidyverse)
df %>%
  group_by(Day) %>%
  map_dfr(rbind, NA) %>%
  fill(Day) %>%
  mutate_at(vars(c(2, 3)), funs(replace(., is.na(.), (1 / 2) * (lag(.) + lead(.))[is.na(.)]))) %>%
  na.omit()
which gives,
# A tibble: 5 x 3
Day Time Temper
<int> <dbl> <dbl>
1 1 1 3
2 1 1.5 3.5
3 1 2 4
4 1 2.5 3
5 1 3 2
You may want to benchmark those three solutions (cf. the microbenchmark package), depending on the number of factor and numeric variables you have; a benchmark sketch follows the three blocks below.
Using base R
n <- 3
set.seed(1)
dframe <- data.frame(Day = rep("Mon", n),
                     Time = 1:n,
                     Temper = round(rnorm(n, 4, 2), 0))
# --- convert factor to numeric
mframe <- as.data.frame(sapply(dframe, as.numeric))
# --- function to use on variables
pfun <- function(x, coef = 1/4){
  # e.g. x <- mframe$Time; coef <- .25
  newp <- x[1:(length(x) - 1)] + diff(x, lag = 1) * coef
  res <- c(rbind(x[1:(length(x) - 1)], newp), x[length(x)])
  return(res)
}
# --- base R way
# pfun( mframe$Time )
# sapply(mframe, pfun, .5)
apply(mframe, 2, pfun)
dframe_final <- as.data.frame ( apply(mframe, 2, pfun) )
# str(dframe_final)
# --- get Day's or other factors back
for(col in names(dframe)[sapply(dframe, is.factor)]){
  dframe_final[[col]] <- factor(dframe_final[[col]])
  levels(dframe_final[[col]]) <- levels(dframe[[col]])
}
dplyr
# --- dplyr way
library(dplyr)
library(purrr)
lfactors <- dframe %>%
  map_if(is.factor, levels)
dframe2 <- dframe %>%
  as_tibble %>%
  map_dfr(as.numeric) %>%
  map_dfr(pfun) %>%
  mutate_at(.vars = names(dframe)[sapply(dframe, is.factor)], .funs = factor)
# --- get Day's or other factors back
for(col in names(dframe)[sapply(dframe, is.factor)]){
  dframe2[[col]] <- factor(dframe2[[col]])
  levels(dframe2[[col]]) <- levels(dframe[[col]])
}
data.table
# --- data.table way
library(data.table)
dframe3 <- data.table(dframe)
dframe3 <- dframe3[ , lapply(.SD, as.numeric)]
dframe3 <- dframe3[ , lapply(.SD, pfun)]
# --- get Day's or other factors back
for(col in names(dframe)[sapply(dframe, is.factor)]){
  dframe3[, (col) := factor(get(col))]
  levels(dframe3[[col]]) <- levels(dframe[[col]])
}
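And the benchmark suggested at the top of this answer, as a sketch (base_way, dplyr_way, and dt_way are placeholder names for the three blocks above wrapped into functions):

library(microbenchmark)
microbenchmark(
  base  = base_way(dframe),
  dplyr = dplyr_way(dframe),
  dt    = dt_way(dframe),
  times = 100
)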
I am simulating events from the following data table using the map function and then filtering out zero-value events.
However, I would like to filter within the map function, thereby reducing the size of the event table that gets created.
The following simulates event counts from the Poisson distribution for a given mean (it includes freq = 0 rows, but to manage memory I don't want these):
library(tidyverse)
set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)
event <- data %>%
  mutate(freq = map(exp, ~rpois(n, .x))) %>%
  mutate(freq = map(freq, ~ data.frame(freq = .x, sim = seq_along(.x)))) %>%
  unnest()
I can then filter with event %>% filter(freq != 0). How can I slot this into the map function please? This will make the memory footprint a lot more manageable for my code. Thank you!
An option would be discard:
library(tidyverse)
data %>%
  mutate(freq = map(exp, ~rpois(n, .x) %>%
                      discard(. == 0) %>%
                      tibble(freq = ., sim = seq_along(.)))) %>%
  unnest
If 'sim' should be based on the original sequence, then create a tibble of the 'rpois' output and the sequence of its elements, and do the filter within map:
data %>%
  mutate(freq = map(exp, ~ rpois(n, .x) %>%
                      tibble(freq = ., sim = seq_along(.)) %>%
                      filter(freq != 0))) %>%
  unnest
Or, using mutate in between:
data %>%
  mutate(freq = map(exp, ~ tibble(freq = rpois(n, .x)) %>%
                      mutate(sim = row_number()) %>%
                      filter(freq != 0))) %>%
  unnest
Here is one idea: there is no need to create a data.frame. Create list columns with freq and sim, and then unnest them.
library(tidyverse)
set.seed(1); n <- 10
data <- tibble(locid = seq(5), exp = 2)
event <- data %>%
  mutate(freq = map(exp, ~rpois(n, .x)),
         sim = map(freq, ~which(.x > 0)),
         freq = map(freq, ~.x[.x > 0])) %>%
  unnest()
event
# # A tibble: 45 x 4
# locid exp freq sim
# <int> <dbl> <int> <int>
# 1 1 2 1 1
# 2 1 2 1 2
# 3 1 2 2 3
# 4 1 2 4 4
# 5 1 2 1 5
# 6 1 2 4 6
# 7 1 2 4 7
# 8 1 2 2 8
# 9 1 2 2 9
# 10 2 2 1 1
# # ... with 35 more rows
I have the following problem I am trying to solve. Here is some example data:
library(tidyverse)
library(lubridate)
date <- data.frame(date=seq(ymd('2018-01-01'),ymd('2018-02-28'), by = '1 day'))
group <- data.frame(group=c("A","B"))
subgroup <- data.frame(subgroup=c("C","D"))
DF <- merge(merge(date,group,by=NULL),subgroup,by=NULL)
DF$group_value <- apply(DF, 1, function(x) sample(8:12,1))
DF$subgroup_value <- apply(DF, 1, function(x) sample(1:5,1))
DF <- DF %>%
  arrange(date, group, subgroup)
I now want to calculate the following:
for every given day t, group, and subgroup combination, calculate the
number of days until the (backward) cumsum of subgroup_value is equal
to or greater than the group_value of day t.
I know how to do that by using some for loops and some dplyr functionality, but it is just terribly slow:
for(i in seq(1, nrow(date), 1)) {
  for(j in seq(1, nrow(group), 1)) {
    for(k in seq(1, nrow(subgroup), 1)) {
      tmp <- DF %>%
        filter(date <= date[i] & group == group[j] & subgroup == subgroup[k]) %>%
        arrange(desc(date))
      tmp$helper <- 1
      tmp <- tmp %>%
        mutate(
          cs_helper = cumsum(helper),
          cs_subgroup_value = cumsum(subgroup_value),
          nr_days = case_when(
            cs_subgroup_value >= group_value ~ cs_helper,
            TRUE ~ NA_real_)
        )
      # this is the final result for date[i], group[j], subgroup[k]
      value <- min(tmp[, "nr_days"], na.rm = TRUE)
    }
  }
}
Example
head(DF,10)
date group subgroup group_value subgroup_value result
1 2018-01-01 A C 12 2 NA
2 2018-01-02 A C 11 4 NA
3 2018-01-03 A C 11 4 NA
4 2018-01-04 A C 9 5 2
5 2018-01-05 A C 12 5 3
6 2018-01-06 A C 10 3 3
7 2018-01-07 A C 12 5 3
8 2018-01-08 A C 8 1 3
9 2018-01-09 A C 12 4 4
10 2018-01-10 A C 9 1 4
So for row 10, I need to sum the last 4 values of subgroup_value for the cumulative sum to be greater than or equal to 9.
I am sure this code can be highly optimized with some vectorized version, but I struggle to find a good starting point for that (as you can see from the code above, I am a newbie in R).
My question is: how would you approach this problem in order to vectorize it for speed?
Thanks!
Stephan
Here's an attempt: take a copy of each group/subgroup's data frame and cross-join it to the data. This is then filtered to keep only days at or before each day, which lets us calculate, for each day, all of the backward cumulative sums:
DF %>%
  group_by(group, subgroup) %>%
  mutate(day = row_number(), J = TRUE) %>%
  nest() %>%
  arrange(group, subgroup) %>%
  mutate(data = map(data, function(d) {
    inner_join(d, transmute(d, x = day, v = subgroup_value, J), by = "J") %>%
      filter(day >= x) %>%
      mutate(x = day - x + 1) %>%
      arrange(day, x) %>%
      group_by(date, group_value, day, subgroup_value) %>%
      mutate(vv = cumsum(v),
             vv = ifelse(vv >= group_value, vv, NA),
             xx = ifelse(!is.na(vv), x, NA)) %>%
      group_by(date, group_value, day, subgroup_value) %>%
      summarise(x = min(xx, na.rm = TRUE), v = min(vv, na.rm = TRUE))
  })) %>%
  unnest()
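A more compact (though still quadratic per group) alternative for the same computation can be sketched with a grouped sapply; the result column name below matches the example in the question:

library(dplyr)

DF %>%
  group_by(group, subgroup) %>%
  mutate(result = sapply(seq_along(subgroup_value), function(i) {
    cs <- cumsum(rev(subgroup_value[1:i]))  # backward cumsum from day i
    hit <- which(cs >= group_value[i])      # first day count reaching group_value
    if (length(hit)) hit[1] else NA_integer_
  })) %>%
  ungroup()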