I am trying to calculate the difference between the first and the last value within each Group, for each column (Value1, Value2, Value3, etc.).
Group Dates Value1 Value2 Value3
1 2000-01-01 NA 0 5
1 2000-02-01 1 0 10
1 2000-03-01 2 1 0
2 2000-04-01 4 1 NA
2 2000-05-01 1 2 NA
2 2000-06-01 2 2 40
For example, Value1_diff = -1 for Group 1 because the first non-NA value is 1 and the last value is 2.
I am using the code below. How can I extend it to 30 more columns (e.g. Value1 to Value30)?
Do I need to use a loop inside the mutate function?
df <- data.frame(Group=c(rep(1,3), rep(2,3)),
Dates=seq(as.Date("2000/1/1"), by = "month", length.out = 6),
Value1=c(NA, 1:2,4,1:2),
Value2=c(0,0,1,1,2,2),
Value3=c(5,10,0,NA,NA,40)
)
df %>%
group_by(Group) %>%
dplyr::mutate(
Value1_diff = dplyr::first(na.omit(Value1))-dplyr::last(na.omit(Value1)),
Value2_diff = dplyr::first(na.omit(Value2))-dplyr::last(na.omit(Value2)),
Value3_diff = dplyr::first(na.omit(Value3))-dplyr::last(na.omit(Value3))
)
Group Dates Value1 Value2 Value3 Value1_diff Value2_diff Value3_diff
1 2000-01-01 NA 0 5 -1 -1 5
1 2000-02-01 1 0 10 -1 -1 5
1 2000-03-01 2 1 0 -1 -1 5
2 2000-04-01 4 1 NA 2 -1 0
2 2000-05-01 1 2 NA 2 -1 0
2 2000-06-01 2 2 40 2 -1 0
We may use across to loop over multiple columns:
library(dplyr)
df <- df %>%
group_by(Group) %>%
mutate(across(starts_with('Value'),
~ first(na.omit(.)) - last(na.omit(.)), .names = "{.col}_diff")) %>%
ungroup()
Output:
df
# A tibble: 6 × 8
Group Dates Value1 Value2 Value3 Value1_diff Value2_diff Value3_diff
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2000-01-01 NA 0 5 -1 -1 5
2 1 2000-02-01 1 0 10 -1 -1 5
3 1 2000-03-01 2 1 0 -1 -1 5
4 2 2000-04-01 4 1 NA 2 -1 0
5 2 2000-05-01 1 2 NA 2 -1 0
6 2 2000-06-01 2 2 40 2 -1 0
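If one row per group is enough, the same across() idea can go inside summarise() instead of mutate(). The following is a minimal sketch rather than part of the answer above; the helper name first_minus_last is made up for illustration, and it assumes every group has at least one non-NA value in each column.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  Group  = c(rep(1, 3), rep(2, 3)),
  Dates  = seq(as.Date("2000/1/1"), by = "month", length.out = 6),
  Value1 = c(NA, 1:2, 4, 1:2),
  Value2 = c(0, 0, 1, 1, 2, 2),
  Value3 = c(5, 10, 0, NA, NA, 40)
)

# Hypothetical helper: first non-NA value minus last non-NA value.
# Assumes x contains at least one non-NA value.
first_minus_last <- function(x) {
  x <- x[!is.na(x)]
  x[1] - x[length(x)]
}

# One row per Group, one *_diff column per Value column
diffs <- df %>%
  group_by(Group) %>%
  summarise(across(starts_with("Value"), first_minus_last,
                   .names = "{.col}_diff"))

print(diffs)
```

Because starts_with("Value") selects the columns by name, the same call should cover Value1 to Value30 without listing them.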
I have a time-series panel dataset that is structured in the following way: there are multiple funds that each own multiple stocks, and there is a value column for each stock. As you can see, the panel is not balanced. My actual dataset is very large: each fund holds at least 500 stocks, different funds cover different quarters, and some quarters are missing.
df <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2011-12","2012-03","2012-06","2012-09",
"2012-12","2013-03","2013-06","2014-09","2015-03","2013-03","2013-06","2013-09","2013-12"),
value = c(1,2,1,3,4,2,1,2,3,4,2,1,3,1,1,3,2,3,1)
)
> df
fund_id stock_id year_q value
1 1 1 2011-03 1
2 1 1 2011-06 2
3 1 1 2011-09 1
4 1 1 2011-12 3
5 1 1 2012-03 4
6 1 1 2012-06 2
7 1 2 2011-12 1
8 1 2 2012-03 2
9 1 2 2012-06 3
10 1 2 2012-09 4
11 1 2 2012-12 2
12 1 2 2013-03 1
13 1 2 2013-06 3
14 2 1 2014-09 1
15 2 1 2015-03 1
16 2 3 2013-03 3
17 2 3 2013-06 2
18 2 3 2013-09 3
19 2 3 2013-12 1
I would like to calculate, for each fund, the percentage of stocks held in the current quarter that were also held in each of the previous one to three quarters. So for every fund and every date, I would like to have three columns (past_1Q, past_2Q and past_3Q) that show what percentage of the stocks held on that date were also present in each of those past quarters.
Here is what the result should look like:
result <- data.frame(
fund_id = c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2),
year_q = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06","2012-09","2012-12","2013-03","2013-06",
"2013-03","2013-06","2013-09","2013-12","2014-03","2014-06","2014-09","2014-12","2015-03"),
past_1Q = c("NA",1,1,0.5,1,1,0.5,1,1,1,"NA",1,1,1,"NA","NA","NA","NA","NA"),
past_2Q = c("NA","NA",1,0.5,0.5,1,0.5,0.5,1,1,"NA","NA",1,1,"NA","NA","NA","NA","NA"),
past_3Q = c("NA","NA","NA",0.5,0.5,0.5,0.5,0.5,0.5,1,"NA","NA","NA",1,"NA","NA","NA","NA","NA")
)
> result
fund_id year_q past_1Q past_2Q past_3Q
1 1 2011-03 NA NA NA
2 1 2011-06 1 NA NA
3 1 2011-09 1 1 NA
4 1 2011-12 0.5 0.5 0.5
5 1 2012-03 1 0.5 0.5
6 1 2012-06 1 1 0.5
7 1 2012-09 0.5 0.5 0.5
8 1 2012-12 1 0.5 0.5
9 1 2013-03 1 1 0.5
10 1 2013-06 1 1 1
11 2 2013-03 NA NA NA
12 2 2013-06 1 NA NA
13 2 2013-09 1 1 NA
14 2 2013-12 1 1 1
15 2 2014-03 NA NA NA
16 2 2014-06 NA NA NA
17 2 2014-09 NA NA NA
18 2 2014-12 NA NA NA
19 2 2015-03 NA NA NA
I tried to do this using rollapply but can't get the correct results.
I understand that this might not be the best sample data, but in my real data each fund usually has more than 500 stocks, and I expect the percentage of matching stocks from one period to the past periods to be around 0.95 on average.
This is what I have to get the first two result columns (credit to @r2evans):
result <- df %>%
group_by(fund_id) %>%
mutate(miny = min(year_q), maxy = max(year_q)) %>%
distinct(fund_id, miny, maxy) %>%
group_by(fund_id) %>%
mutate(across(c(miny, maxy), ~ as.Date(paste0(., "-01")))) %>%
transmute(year_q = purrr::map2(miny, maxy, ~ format(seq(.x, .y, by = "3 months"), format = "%Y-%m"))) %>%
tidyr::unnest(year_q) %>%
full_join(df, by = c("fund_id", "year_q")) %>%
distinct(fund_id, year_q) %>%
arrange(fund_id, year_q)
library(tidyverse)
df %>%
mutate(year_q = as.Date(paste0(year_q, '-01'))) %>%
group_by(fund_id, year_q) %>%
summarise(stock_id = list(unique(stock_id))) %>%
complete(year_q = seq(min(year_q), max(year_q), by = "3 months")) %>%
reduce(.init = ., 1:3, ~ mutate(.x, "past_{.y}Q" := map(1:n(), \(N) unlist(stock_id[pmax(N-.y, 0)])))) %>%
mutate(across(contains("past"), \(past) map2_dbl(stock_id, past, ~ mean(.x %in% .y)) %>% replace_na(0))) %>%
ungroup()
# A tibble: 19 × 6
fund_id year_q stock_id past_1Q past_2Q past_3Q
<dbl> <date> <list> <dbl> <dbl> <dbl>
1 1 2011-03-01 <dbl [1]> 0 0 0
2 1 2011-06-01 <dbl [1]> 1 0 0
3 1 2011-09-01 <dbl [1]> 1 1 0
4 1 2011-12-01 <dbl [2]> 0.5 0.5 0.5
5 1 2012-03-01 <dbl [2]> 1 0.5 0.5
6 1 2012-06-01 <dbl [2]> 1 1 0.5
7 1 2012-09-01 <dbl [1]> 1 1 1
8 1 2012-12-01 <dbl [1]> 1 1 1
9 1 2013-03-01 <dbl [1]> 1 1 1
10 1 2013-06-01 <dbl [1]> 1 1 1
11 2 2013-03-01 <dbl [1]> 0 0 0
12 2 2013-06-01 <dbl [1]> 1 0 0
13 2 2013-09-01 <dbl [1]> 1 1 0
14 2 2013-12-01 <dbl [1]> 1 1 1
15 2 2014-03-01 <NULL> 0 0 0
16 2 2014-06-01 <NULL> 0 0 0
17 2 2014-09-01 <dbl [1]> 0 0 0
18 2 2014-12-01 <NULL> 0 0 0
19 2 2015-03-01 <dbl [1]> 0 1 0
Given your example, I think the code below gets you there. You might need to switch to data.table if you have a lot of records. Note that I use df1, not df, because df is already the name of a function in R. I used padr::pad to fill in the missing quarters within a fund, so it only fills in quarters when there is data from at least one stock in the fund. It will not add quarters that appear in fund 2 to fund 1, as these have nothing to do with fund 1.
Edit: added order_by = year_q to the lag calls so that the lag over stock_id follows quarter order, because arrange places the padded rows with NA stock_id out of the desired order.
library(dplyr)
library(lubridate) # for ymd()
df1 %>%
mutate(year_q = ymd(paste0(year_q, "-01"))) %>%
group_by(fund_id) %>%
padr::pad(interval = "3 months") %>%
arrange(fund_id, stock_id, year_q) %>%
mutate(past_1Q = if_else(stock_id == lag(stock_id, order_by = year_q, default = 0), 1, 0),
past_2Q = if_else(stock_id == lag(stock_id, n = 2, order_by = year_q, default = 0), 1, 0),
past_3Q = if_else(stock_id == lag(stock_id, n = 3, order_by = year_q, default = 0), 1, 0)) %>%
group_by(year_q, .add = TRUE) %>%
# add number of stocks in the fund in this quarter
mutate(n_stocks = n()) %>%
summarise(past_1Q = sum(past_1Q, na.rm = TRUE) / mean(n_stocks, na.rm = TRUE),
past_2Q = sum(past_2Q, na.rm = TRUE) / mean(n_stocks, na.rm = TRUE),
past_3Q = sum(past_3Q, na.rm = TRUE) / mean(n_stocks, na.rm = TRUE))
# A tibble: 19 × 5
# Groups: fund_id [2]
fund_id year_q past_1Q past_2Q past_3Q
<dbl> <date> <dbl> <dbl> <dbl>
1 1 2011-03-01 0 0 0
2 1 2011-06-01 1 0 0
3 1 2011-09-01 1 1 0
4 1 2011-12-01 0.5 0.5 0.5
5 1 2012-03-01 0 1 0.5
6 1 2012-06-01 0 1 0
7 1 2012-09-01 1 0 1
8 1 2012-12-01 1 1 0
9 1 2013-03-01 1 1 1
10 1 2013-06-01 1 1 1
11 2 2013-03-01 0 0 0
12 2 2013-06-01 1 0 0
13 2 2013-09-01 1 1 0
14 2 2013-12-01 1 1 1
15 2 2014-03-01 0 0 0
16 2 2014-06-01 0 0 0
17 2 2014-09-01 0 0 0
18 2 2014-12-01 0 0 0
19 2 2015-03-01 0 1 0
I think your results table is incorrect. Looking at 2012-12, there is only one stock live in fund 1. Based on this calculation the outcome should be 100%, not 50%, because 100% of the stocks in the fund now were also in the fund last quarter, and so on.
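The 2012-12 case can be checked by hand with a few lines of base R. This is only a sketch for verifying individual cells, not a replacement for the pipeline above; the helper held() is a hypothetical name, and df1 is the question's sample data without the value column.

```r
df1 <- data.frame(
  fund_id  = c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2),
  stock_id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,1,1,3,3,3,3),
  year_q   = c("2011-03","2011-06","2011-09","2011-12","2012-03","2012-06",
               "2011-12","2012-03","2012-06","2012-09","2012-12","2013-03",
               "2013-06","2014-09","2015-03","2013-03","2013-06","2013-09",
               "2013-12")
)

# Hypothetical helper: the stocks a fund holds in a given quarter
held <- function(fund, quarter) {
  unique(df1$stock_id[df1$fund_id == fund & df1$year_q == quarter])
}

now  <- held(1, "2012-12")  # stocks held by fund 1 in 2012-12
prev <- held(1, "2012-09")  # stocks held by fund 1 one quarter earlier

# Share of the current holdings that were also held in the previous quarter
share_1q <- mean(now %in% prev)

print(now)
print(share_1q)
```

In the sample data, fund 1 holds only stock 2 in both quarters, so the share comes out as 1.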
I have a dataframe as follows:
ID Col1   RespID Col3 Col4 Year Month Day
1  blue   729Ad  3.2  A    2021 April 2
2  orange 295gS  6.5  A    2021 April 1
3  red    729Ad  8.4  B    2021 April 20
4  yellow 592Jd  2.9  A    2021 March 12
5  green  937sa  3.5  B    2021 May   13
I would like to calculate a new column, Col5, whose value is 1 if the row has a Col4 value of A and there is another row somewhere in the dataset with the same RespID but a Col4 value of B. Otherwise its value is 0. Then I will drop all rows with a Col4 value of B, keeping just those with A. I'd also like to account for the date fields (Year, Month, Day) so that the match is restricted to, say, a 30-day timeframe: a 1 is assigned only if 'B' appears within 30 days of when 'A' appears in the dataset (if 'B' appears within 60 days, there is no 1). Additionally, I'd like to keep everything as data.frames.
Here is what the desired output table would look like prior to dropping rows with Col4 value of B:
ID Col1   RespID Col3 Col4 Col5
1  blue   729Ad  3.2  A    1
2  orange 295gS  6.5  A    0
3  red    729Ad  8.4  B    0
4  yellow 592Jd  2.9  A    0
5  green  937sa  3.5  B    0
I have found Ronak's solution in this thread (Calculated Column Based on Rows in Tidymodels Recipe) to be useful, but I would like to modify it to account for the date range.
A lot of things to unpack here.
I think you're tripping up over your own feet by trying to do too many things at once. I've broken down the code into four distinct steps to make the thought process easy to follow. Obviously, for use in a production environment it should be rewritten more efficiently.
1. Generate some data
library(tidyverse)
set.seed(42)
df <- tibble(
id = c(1:10),
resp_id = c(1701, seq(2286, 2289), 1701, seq(2290, 2293)),
grouping = sample(c("A", "B"), size = 10, replace = TRUE),
date = seq.Date(as.Date("2363-10-04"), as.Date("2363-11-17"), length.out = 10)
)
Resulting data:
# A tibble: 10 × 4
id resp_id grouping date
<int> <dbl> <chr> <date>
1 1 1701 A 2363-10-04
2 2 2286 A 2363-10-08
3 3 2287 A 2363-10-13
4 4 2288 A 2363-10-18
5 5 2289 B 2363-10-23
6 6 1701 B 2363-10-28
7 7 2290 B 2363-11-02
8 8 2291 B 2363-11-07
9 9 2292 A 2363-11-12
10 10 2293 B 2363-11-17
2. Check grouping
df <- df %>%
mutate(
is_a = ifelse(grouping == "A", 1, 0),
is_b = ifelse(grouping == "B", 1, 0)
)
We have the grouping now as easy-to-use dummy variables:
> df
# A tibble: 10 × 6
id resp_id grouping date is_a is_b
<int> <dbl> <chr> <date> <dbl> <dbl>
1 1 1701 A 2363-10-04 1 0
2 2 2286 A 2363-10-08 1 0
3 3 2287 A 2363-10-13 1 0
4 4 2288 A 2363-10-18 1 0
5 5 2289 B 2363-10-23 0 1
6 6 1701 B 2363-10-28 0 1
7 7 2290 B 2363-11-02 0 1
8 8 2291 B 2363-11-07 0 1
9 9 2292 A 2363-11-12 1 0
10 10 2293 B 2363-11-17 0 1
3. Check completeness
df <- df %>%
group_by(
resp_id
) %>%
mutate(
# Check if the grouping has both "A" and "B" values
is_complete = ifelse(
sum(is_a) > 0 & sum(is_b) > 0,
1,
0
)
) %>%
ungroup()
We see that there is only one resp_id value that is complete: 1701.
> df
# A tibble: 10 × 7
id resp_id grouping date is_a is_b is_complete
<int> <dbl> <chr> <date> <dbl> <dbl> <dbl>
1 1 1701 A 2363-10-04 1 0 1
2 2 2286 A 2363-10-08 1 0 0
3 3 2287 A 2363-10-13 1 0 0
4 4 2288 A 2363-10-18 1 0 0
5 5 2289 B 2363-10-23 0 1 0
6 6 1701 B 2363-10-28 0 1 1
7 7 2290 B 2363-11-02 0 1 0
8 8 2291 B 2363-11-07 0 1 0
9 9 2292 A 2363-11-12 1 0 0
10 10 2293 B 2363-11-17 0 1 0
4. Assign target value
df <- df %>%
group_by(
resp_id
) %>%
mutate(
# Check if the "A" part of a complete grouping has another value within 30 days
is_within_timeframe = ifelse(
is_complete == 1 & is_a == 1 & max(date) - min(date) <= 30,
1,
0
)
) %>%
ungroup()
We see that our one complete set does in fact have a B value that falls within 30 days of the A observation. (Caveat: this only works if there are always exactly one or two observations per resp_id group!) Column is_within_timeframe corresponds to your Col5:
> df
# A tibble: 10 × 8
id resp_id grouping date is_a is_b is_complete is_within_timeframe
<int> <dbl> <chr> <date> <dbl> <dbl> <dbl> <dbl>
1 1 1701 A 2363-10-04 1 0 1 1
2 2 2286 A 2363-10-08 1 0 0 0
3 3 2287 A 2363-10-13 1 0 0 0
4 4 2288 A 2363-10-18 1 0 0 0
5 5 2289 B 2363-10-23 0 1 0 0
6 6 1701 B 2363-10-28 0 1 1 0
7 7 2290 B 2363-11-02 0 1 0 0
8 8 2291 B 2363-11-07 0 1 0 0
9 9 2292 A 2363-11-12 1 0 0 0
10 10 2293 B 2363-11-17 0 1 0 0
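If a resp_id can have more than two observations, the max(date) - min(date) shortcut no longer applies. One possible generalisation, sketched below under assumptions, compares each A row with every B date in the same resp_id group. The sample data and the column name col5 are made up for illustration, and "within 30 days" is taken to mean 30 days in either direction.

```r
library(dplyr)

# Hypothetical data: resp_id 1701 has two A rows and one B row
df <- tibble(
  resp_id  = c(1701, 1701, 1701, 2286, 2286),
  grouping = c("A", "B", "A", "A", "B"),
  date     = as.Date(c("2021-04-02", "2021-04-20", "2021-08-01",
                       "2021-03-12", "2021-05-13"))
)

result <- df %>%
  group_by(resp_id) %>%
  mutate(
    day = as.numeric(date),  # days since 1970-01-01, for plain arithmetic
    # 1 only for A rows that have at least one B row of the same resp_id
    # no more than 30 days away (before or after)
    col5 = as.integer(
      grouping == "A" &
        vapply(day,
               function(d) any(abs(day[grouping == "B"] - d) <= 30),
               logical(1))
    )
  ) %>%
  ungroup() %>%
  select(-day)

print(result)
```

Dropping the B rows afterwards would then be a single filter(grouping == "A") step.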
I'm trying to apply ifelse for the entire group. I know this sounds unclear, so let me provide a reproducible example. Consider the following data frame.
library(dplyr)
library(lubridate) # for as_date()
id = rep(c(1:3), each = 5)
date = rep(seq.Date(as_date("2010-01-01"), as_date("2010-01-05"), by = "day"), 3)
value = c(1:4, NA, 3:5, NA, 5, NA, 1:4)
df <- data.frame(id, date, value)
Suppose I want to create a column "missing" which takes value 1 for the entire group (not only the corresponding row) if "value" column is NA for date = 2010-01-05.
df %>% group_by(id) %>% mutate(missing = ifelse(value %in% NA & date == "2010-01-05", 1, 0))
I tried piping group_by(id) before the ifelse command like above hoping that ifelse value will be populated by group, but it's not working. Indeed it produces the same result as the code below which does not have group_by pipe.
df %>% mutate(missing = ifelse(value %in% NA & date == "2010-01-05", 1, 0))
At the end of the day, I want my data to look like
df2
id date value missing
1 1 2010-01-01 1 1
2 1 2010-01-02 2 1
3 1 2010-01-03 3 1
4 1 2010-01-04 4 1
5 1 2010-01-05 NA 1
6 2 2010-01-01 3 0
7 2 2010-01-02 4 0
8 2 2010-01-03 5 0
9 2 2010-01-04 NA 0
10 2 2010-01-05 5 0
11 3 2010-01-01 NA 0
12 3 2010-01-02 1 0
13 3 2010-01-03 2 0
14 3 2010-01-04 3 0
15 3 2010-01-05 4 0
Is there a way I can do this by somehow tweaking ifelse?
You can do,
library(dplyr)
df %>%
group_by(id) %>%
mutate(res = as.integer(is.na(value[date == "2010-01-05"])))
which gives,
id date value res
<int> <date> <dbl> <int>
1 1 2010-01-01 1 1
2 1 2010-01-02 2 1
3 1 2010-01-03 3 1
4 1 2010-01-04 4 1
5 1 2010-01-05 NA 1
6 2 2010-01-01 3 0
7 2 2010-01-02 4 0
8 2 2010-01-03 5 0
9 2 2010-01-04 NA 0
10 2 2010-01-05 5 0
11 3 2010-01-01 NA 0
12 3 2010-01-02 1 0
13 3 2010-01-03 2 0
14 3 2010-01-04 3 0
15 3 2010-01-05 4 0
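A variant that stays closer to the original ifelse attempt is to wrap the row-wise condition in any(), so the test collapses to a single value per group and is recycled across the group's rows. This is a sketch of that idea, not the answer's code; it also returns 0 for a group that has no 2010-01-05 row at all.

```r
library(dplyr)

# Same example data as in the question, built with base R dates
df <- data.frame(
  id    = rep(1:3, each = 5),
  date  = rep(seq(as.Date("2010-01-01"), as.Date("2010-01-05"), by = "day"), 3),
  value = c(1:4, NA, 3:5, NA, 5, NA, 1:4)
)

res <- df %>%
  group_by(id) %>%
  # any() turns the per-row test into one TRUE/FALSE per group
  mutate(missing = as.integer(
    any(is.na(value) & date == as.Date("2010-01-05"))
  )) %>%
  ungroup()

print(res)
```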
I have a data frame like the following:
Date Flare Painmed_Use
1 2015-12-01 0 0
2 2015-12-02 0 0
3 2015-12-03 0 0
4 2015-12-04 0 0
5 2015-12-05 0 0
6 2015-12-06 0 1
7 2015-12-07 1 4
8 2015-12-08 1 3
9 2015-12-09 1 1
10 2015-12-10 1 0
11 2015-12-11 0 0
12 2015-12-12 0 0
13 2015-12-13 1 2
14 2015-12-14 1 3
15 2015-12-15 1 1
16 2015-12-16 0 0
I'm trying to find the length of each flare as well as the total med use during each flare using dplyr. My current solution (inspired by Use rle to group by runs when using dplyr),
df %>%
group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}, Flare) %>%
summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n())
gives the following output:
yy Flare Painmed_UseCum FlareLength
<int> <int> <dbl> <int>
1 1 0 1 6
2 2 1 8 4
3 3 0 0 2
4 4 1 6 3
5 5 0 0 1
This is almost exactly what I need. However, I can't figure out how to preserve other columns, the critical one being the date that corresponds to the last row of a particular flare. So, the output I'm seeking is the same as above but with the addition of the Dates, like so:
Date yy Flare Painmed_UseCum FlareLength
<int> <int> <dbl> <int>
1 2015-12-06 1 0 1 6
2 2015-12-10 2 1 8 4
3 2015-12-12 3 0 0 2
4 2015-12-15 4 1 6 3
5 2015-12-16 5 0 0 1
Note: In some ways this is a follow up from a previous question of mine (R code to get max count of time series data by group) but my attempt to keep that question simpler, though perhaps useful to others, ended up necessitating this further question.
You could either include Date in summarise:
library(dplyr)
df %>%
group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths),yy$lengths)}) %>%
summarize(Painmed_UseCum = sum(Painmed_Use),FlareLength = n(), Date = max(Date))
#     yy Painmed_UseCum FlareLength Date
#1     1              1           6 2015-12-06
#2     2              8           4 2015-12-10
#3     3              0           2 2015-12-12
#4     4              6           3 2015-12-15
#5     5              0           1 2015-12-16
Or, if there are more columns to preserve, a better approach is to use mutate and then select the last row in each group:
df %>%
group_by(yy = {yy = rle(Flare); rep(seq_along(yy$lengths), yy$lengths)}) %>%
mutate(Painmed_UseCum = sum(Painmed_Use),FlareLength = n()) %>%
slice(n())
To create the groups, we can replace rle with rleid from data.table, which is simpler:
group_by(yy = data.table::rleid(Flare))
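To see what the run ids look like, here is a small base R sketch using the Flare column from the question. It builds the ids once with rle(), as in the grouped code above, and once with cumsum() as a dependency-free alternative; the variable names are made up for illustration.

```r
# Flare column from the question's data
Flare <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0)

# Run ids via rle(): repeat each run's index by that run's length
runs   <- rle(Flare)
id_rle <- rep(seq_along(runs$lengths), runs$lengths)

# Same ids without rle(): start a new run whenever the value changes
id_cumsum <- cumsum(c(TRUE, diff(Flare) != 0))

print(id_rle)
print(id_cumsum)
```

Both vectors label the five runs 1 to 5, with lengths 6, 4, 2, 3 and 1, which matches the FlareLength column in the question's output.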