I have a data set that looks like the following:
ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3
The first row of each ID starts with 0. I want to fill the NA values with sequential values by group. If there are values before and after the NA values, I need to fill the NA values with a sequence counting up to the first value after the NA values. If there are no values after the NA values, I need to fill the NA values with a sequence counting up from the last value before the NA value. The output should look like following:
ID Count
1 0
1 1
1 1
1 2
1 3
1 4
1 5
1 6
1 7
2 0
2 1
2 2
2 3
This is a little complicated, but I think this does what you want. I left all my helper columns in so you can see what's happening, but the non-needed columns can all be dropped at the end.
library(dplyr)
library(vctrs)
df %>%
group_by(ID, na_group = cumsum(!is.na(Count))) %>%
mutate(n_til_non_na = ifelse(is.na(Count), rev(row_number()), 0L)) %>%
group_by(ID) %>%
mutate(
fill_down = vec_fill_missing(Count, direction = "down"),
fill_up = vec_fill_missing(Count, direction = "up"),
result = case_when(
is.na(fill_up) ~ fill_down + cumsum(is.na(fill_up)),
is.na(Count) ~ fill_up - n_til_non_na,
TRUE ~ Count
)
) %>%
ungroup()
# # A tibble: 13 × 7
# ID Count na_group n_til_non_na fill_down fill_up result
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 1 0 0 0 0
# 2 1 1 2 0 1 1 1
# 3 1 NA 2 1 1 2 1
# 4 1 2 3 0 2 2 2
# 5 1 NA 3 5 2 NA 3
# 6 1 NA 3 4 2 NA 4
# 7 1 NA 3 3 2 NA 5
# 8 1 NA 3 2 2 NA 6
# 9 1 NA 3 1 2 NA 7
# 10 2 0 4 0 0 0 0
# 11 2 NA 4 2 0 3 1
# 12 2 NA 4 1 0 3 2
# 13 2 3 5 0 3 3 3
Using this sample data:
df = read.table(text = 'ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3', header = T)
You can use purrr::accumulate(), first backwards, then forward. While going backwards, replace each missing value with the previous value - 1 to count down; then while moving forwards, replace remaining missing values with the previous value + 1 to count up.
library(dplyr)
library(purrr)
dat %>%
group_by(ID) %>%
mutate(
Count = accumulate(
Count,
\(x, y) ifelse(is.na(x), y - 1, x),
.dir = "backward"
),
Count = accumulate(
Count,
\(x, y) ifelse(is.na(y), x + 1, y)
)
) %>%
ungroup()
# A tibble: 13 × 2
ID Count
<dbl> <dbl>
1 1 0
2 1 1
3 1 1
4 1 2
5 1 3
6 1 4
7 1 5
8 1 6
9 1 7
10 2 0
11 2 1
12 2 2
13 2 3
I am trying to calculated the difference between the first and the last value for the each Group for the each columne (Value1, Value2, Value3 etc).
Group Dates Value1 Value2 Value3
1 2000-01-01 NA 0 5
1 2000-02-01 1 0 10
1 2000-03-01 2 1 0
2 2000-04-01 4 1 NA
2 2000-05-01 1 2 NA
2 2000-06-01 2 2 40
For example: diff_Value1=-1 because the first value for the Group 1 is 1 and last value is 2.
I am using below code. How can I extend for 30 more columns (e.g. Value1...-Value 30)?
Do I need to use loop inside mutate function?
df <- data.frame(Group=c(rep(1,3), rep(2,3)),
Dates=seq(as.Date("2000/1/1"), by = "month", length.out = 6),
Value1=c(NA, 1:2,4,1:2),
Value2=c(0,0,1,1,2,2),
Value3=c(5,10,0,NA,NA,40)
)
df %>%
group_by(Group) %>%
dplyr::mutate(
Value1_diff = dplyr::first(na.omit(Value1))-dplyr::last(na.omit(Value1)),
Value2_diff = dplyr::first(na.omit(Value2))-dplyr::last(na.omit(Value2)),
Value3_diff = dplyr::first(na.omit(Value3))-dplyr::last(na.omit(Value3))
)
Group Dates Value1 Value2 Value3 Value1_diff Value2_diff Value3_diff
1 2000-01-01 NA 0 5 -1 -1 5
1 2000-02-01 1 0 10 -1 -1 5
1 2000-03-01 2 1 0 -1 -1 5
2 2000-04-01 4 1 NA 2 -1 0
2 2000-05-01 1 2 NA 2 -1 0
2 2000-06-01 2 2 40 2 -1 0
We may use across to loop over multiple columns
library(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(across(starts_with('Value'),
~ first(na.omit(.)) - last(na.omit(.)), .names = "{.col}_diff")) %>%
ungroup
-output
df
# A tibble: 6 × 8
Group Dates Value1 Value2 Value3 Value1_diff Value2_diff Value3_diff
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2000-01-01 NA 0 5 -1 -1 5
2 1 2000-02-01 1 0 10 -1 -1 5
3 1 2000-03-01 2 1 0 -1 -1 5
4 2 2000-04-01 4 1 NA 2 -1 0
5 2 2000-05-01 1 2 NA 2 -1 0
6 2 2000-06-01 2 2 40 2 -1 0
I have a data frame that contains a grouping variable (ID), a date and an event column with numeric values, in which 0 represent no event and >0 represents an event. An example data frame can be generated with the following code:
df <- data.frame(ID = c (1, 1, 1, 1, 2, 2, 2),
date = as.Date(c("2014-08-03", "2014-08-04", "2014-08-07", "2014-08-10", "2015-07-01", "2015-07-03", "2015-08-01")),
event = c(1, 0, 3, 0, 0, 4, 0))
df
> df
ID date event
1 1 2014-08-03 1
2 1 2014-08-04 0
3 1 2014-08-07 3
4 1 2014-08-10 0
5 2 2015-07-01 0
6 2 2015-07-03 4
7 2 2015-08-01 0
Now, I want to calculate the time that has passed since any last event (>0) has occured. In the particular case that the first entry/entries for any ID contains no event, "NA" should be generated. My desired output would look like this:
> df
ID date event tae
1 1 2014-08-03 1 0
2 1 2014-08-04 0 1
3 1 2014-08-07 3 0
4 1 2014-08-10 0 3
5 2 2015-07-01 0 NA
6 2 2015-07-03 4 0
7 2 2015-08-01 0 29
I have tried several different approaches. The closest I got was this:
library(dplyr)
df %>%
mutate(tmpG = cumsum(c(FALSE, as.logical(diff(event))))) %>%
group_by(ID) %>%
mutate(tmp = c(0, diff(date)) * !event) %>%
group_by(tmpG) %>%
mutate(tae = cumsum(tmp)) %>%
ungroup() %>%
select(-c(tmp, tmpG))
# A tibble: 7 x 4
ID date event tae
<dbl> <date> <dbl> <dbl>
1 1 2014-08-03 1 0
2 1 2014-08-04 0 1
3 1 2014-08-07 3 0
4 1 2014-08-10 0 3
5 2 2015-07-01 0 3
6 2 2015-07-03 4 0
7 2 2015-08-01 0 29
Any suggestions on how to get that code running (or any other alternative) would be greatly appreciated.
Here is another tidyverse approach, that uses fill to carry forward the most recent event.
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(last_event = if_else(event > 0, date, NA_real_)) %>%
fill(last_event) %>%
mutate(tae = as.numeric(date - last_event))
Output
ID date event last_event tae
<dbl> <date> <dbl> <date> <dbl>
1 1 2014-08-03 1 2014-08-03 0
2 1 2014-08-04 0 2014-08-03 1
3 1 2014-08-07 3 2014-08-07 0
4 1 2014-08-10 0 2014-08-07 3
5 2 2015-07-01 0 NA NA
6 2 2015-07-03 4 2015-07-03 0
7 2 2015-08-01 0 2015-07-03 29
df %>%
group_by(ID) %>%
mutate(tae = as.double(if_else(event==0, date-lag(date), 0)))
Output:
ID date event tae
<dbl> <date> <dbl> <dbl>
1 1 2014-08-03 1 0
2 1 2014-08-04 0 1
3 1 2014-08-07 3 0
4 1 2014-08-10 0 3
5 2 2015-07-01 0 NA
6 2 2015-07-03 4 0
7 2 2015-08-01 0 29
I have a time-series panel dataset that is structured in the following way: There are multiple funds that each own multiple stocks and we have a value column for the stock. As you can see the panel is not balanced. My actual dataset is very large with each fund having at least 500 stocks and different quarters being represented with some having missing quarter values.
df <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
"2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)
> df
fund_id stock_id year_q value
1 1 1 2011-03 1
2 1 1 2011-06 2
3 1 1 2011-09 1
4 1 1 2011-12 3
5 1 1 2012-03 4
6 1 1 2012-06 2
7 1 2 2011-12 1
8 1 2 2012-03 2
9 1 2 2012-06 3
10 1 2 2012-09 4
11 1 2 2012-12 2
12 1 2 2013-03 1
13 1 2 2013-06 3
14 2 1 2014-09 1
15 2 1 2015-03 1
16 2 3 2013-03 3
17 2 3 2013-06 2
18 2 3 2013-09 3
19 2 3 2013-12 1
I would like to calculate for each fund, the percentage of stocks held in the current quarter that were ever held in the previous one to 3 quarters. So basically for every fund and every date, I would like to have 3 columns with past 1 Q, past 2Q and past 3Q which show what percentage of stocks held on that date were also present in each of those past quarters.
Here is what the result should look like:
result <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
"2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
past_1Q = c("NA",1,1,0.5,1,1,0.5,1,1,1,"NA",1,1,1,"NA","NA","NA","NA","NA"),
past_2Q = c("NA","NA",1,0.5,0.5,1,0.5,0.5,1,1,"NA","NA",1,1,"NA","NA","NA","NA","NA"),
past_3Q = c("NA","NA","NA",0.5,0.5,0.5,0.5,0.5,0.5,1,"NA","NA","NA",1,"NA","NA","NA","NA","NA")
)
> result
fund_id year_q past_1Q past_2Q past_3Q
1 1 2011-03 NA NA NA
2 1 2011-06 1 NA NA
3 1 2011-09 1 1 NA
4 1 2011-12 0.5 0.5 0.5
5 1 2012-03 1 0.5 0.5
6 1 2012-06 1 1 0.5
7 1 2012-09 0.5 0.5 0.5
8 1 2012-12 1 0.5 0.5
9 1 2013-03 1 1 0.5
10 1 2013-06 1 1 1
11 2 2013-03 NA NA NA
12 2 2013-06 1 NA NA
13 2 2013-09 1 1 NA
14 2 2013-12 1 1 1
15 2 2014-03 NA NA NA
16 2 2014-06 NA NA NA
17 2 2014-09 NA NA NA
18 2 2014-12 NA NA NA
19 2 2015-03 NA NA NA
I tried to do this using rollapply but can't get the correct results.
I understand that this might not be the best sample data but in my real data each fund usually have more than 500 stocks and I expect the percentage of matching stocks from one period to the past periods to be something around 0.95 on average.
This is what I have to get the first two result columns (credits to #r2evans):
result <- df %>%
group_by(fund_id) %>%
mutate(miny = min(year_q), maxy = max(year_q)) %>%
distinct(fund_id, miny, maxy) %>%
group_by(fund_id) %>%
mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m"))) %>%
tidyr::unnest(year_q) %>%
full_join(df, by = c("fund_id", "year_q")) %>%
distinct(fund_id, year_q) %>%
arrange(fund_id, year_q)
library(tidyverse)
df %>%
mutate(year_q = as.Date(paste0(year_q, '-01'))) %>%
group_by(fund_id, year_q) %>%
summarise(stock_id = list(unique(stock_id))) %>%
complete(year_q = seq(min(year_q), max(year_q), by = "3 months")) %>%
reduce(.init = ., 1:3, ~ mutate(.x, "past_{.y}Q" := map(1:n(), \(N) unlist(stock_id[pmax(N-.y, 0)])))) %>%
mutate(across(contains("past"), \(past) map2_dbl(stock_id, past, ~ mean(.x %in% .y)) %>% replace_na(0))) %>%
ungroup()
# A tibble: 19 × 6
fund_id year_q stock_id past_1Q past_2Q past_3Q
<dbl> <date> <list> <dbl> <dbl> <dbl>
1 1 2011-03-01 <dbl [1]> 0 0 0
2 1 2011-06-01 <dbl [1]> 1 0 0
3 1 2011-09-01 <dbl [1]> 1 1 0
4 1 2011-12-01 <dbl [2]> 0.5 0.5 0.5
5 1 2012-03-01 <dbl [2]> 1 0.5 0.5
6 1 2012-06-01 <dbl [2]> 1 1 0.5
7 1 2012-09-01 <dbl [1]> 1 1 1
8 1 2012-12-01 <dbl [1]> 1 1 1
9 1 2013-03-01 <dbl [1]> 1 1 1
10 1 2013-06-01 <dbl [1]> 1 1 1
11 2 2013-03-01 <dbl [1]> 0 0 0
12 2 2013-06-01 <dbl [1]> 1 0 0
13 2 2013-09-01 <dbl [1]> 1 1 0
14 2 2013-12-01 <dbl [1]> 1 1 1
15 2 2014-03-01 <NULL> 0 0 0
16 2 2014-06-01 <NULL> 0 0 0
17 2 2014-09-01 <dbl [1]> 0 0 0
18 2 2014-12-01 <NULL> 0 0 0
19 2 2015-03-01 <dbl [1]> 0 1 0
Given your example, I think the code below gets you there. You might need to switch to data.table if you have a lot of records. Note that I use df1, not df. df is a function in R. I used padr::pad to fill in the missing quarters within a fund. So it will only fill in quarters if there is data from at least 1 stock in the fund. It will not add quarters that are in fund 2 to fund 1 as these have nothing to do with fund 1.
edit: Added a group by in the lag function to correctly lag over the stock_id as the arrange puts the NA values for the stock_ids, instead of the desired order.
df1 %>%
mutate(year_q = ymd(paste0(year_q, "-01"))) %>%
group_by(fund_id) %>%
padr::pad(interval = "3 months") %>%
arrange(fund_id, stock_id, year_q) %>%
mutate(past_1Q = if_else(stock_id == lag(stock_id, order_by = year_q, default = 0), 1, 0),
past_2Q = if_else(stock_id == lag(stock_id, n = 2, order_by = year_q, default = 0), 1, 0),
past_3Q = if_else(stock_id == lag(stock_id, n = 3, order_by = year_q, default = 0), 1, 0)) %>%
group_by(year_q, .add = TRUE) %>%
# add number of stocks in the fund in this quarter
mutate(n_stocks = n()) %>%
summarise(past_1Q = sum(past_1Q, na.rm = T) / mean(n_stocks, na.rm = T),
past_2Q = sum(past_2Q, na.rm = T) / mean(n_stocks, na.rm = T),
past_3Q = sum(past_3Q, na.rm = T) / mean(n_stocks, na.rm = T))
# A tibble: 19 × 5
# Groups: fund_id [2]
fund_id year_q past_1Q past_2Q past_3Q
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2011-03-01 0 0 0
2 1 2011-06-01 1 0 0
3 1 2011-09-01 1 1 0
4 1 2011-12-01 0.5 0.5 0.5
5 1 2012-03-01 0 1 0.5
6 1 2012-06-01 0 1 0
7 1 2012-09-01 1 0 1
8 1 2012-12-01 1 1 0
9 1 2013-03-01 1 1 1
10 1 2013-06-01 1 1 1
11 2 2013-03-01 0 0 0
12 2 2013-06-01 1 0 0
13 2 2013-09-01 1 1 0
14 2 2013-12-01 1 1 1
15 2 2014-03-01 0 0 0
16 2 2014-06-01 0 0 0
17 2 2014-09-01 0 0 0
18 2 2014-12-01 0 0 0
19 2 2015-03-01 0 1 0
I think your results table is incorrect. Looking at 2012-12, there is only one stock live in fund 1. Based on this calculation the outcome should be 100%, not 50%. As 100% of the stocks in the fund now, where also in the fund last quarter, etc. etc.