Multiple variable summary with dplyr in R

I have the following DF
head(sample_data)
  article value date
1       A 21920 2015
2       I   615 2017
3       B  1414 2018
4       D   102 2018
5       I  1096 2015
6       A  2577 2021
Full dataset
dput(sample_data)
structure(list(article = c("A", "I", "B", "D", "I", "A", "C",
"C", "D", "H", "B", "I", "A", "G", "E", "G", "D", "A", "D", "B",
"A", "C", "D", "F", "G", "D", "G", "C", "E", "E", "G", "G", "A",
"A", "E", "H", "B", "E", "E", "B", "B", "A", "H", "A", "B", "G",
"D", "C", "E", "A"), value = c(21920, 615, 1414, 102, 1096, 2577,
840, 311, 804, 695, 3863, 279, 7324, 299, 311, 133, 759, 5386,
5396, 11051, 14708, 856, 1749, 2212, 318, 3478, 415, 781, 227,
248, 122, 185, 1344, 15442, 248, 433, 5068, 38, 165, 369, 805,
18944, 264, 11716, 4274, 442, 2530, 827, 164, 18506), date = c("2015",
"2017", "2018", "2018", "2015", "2021", "2016", "2021", "2017",
"2021", "2019", "2015", "2019", "2016", "2015", "2019", "2018",
"2020", "2017", "2015", "2015", "2016", "2015", "2015", "2021",
"2015", "2019", "2016", "2016", "2015", "2019", "2020", "2019",
"2016", "2016", "2015", "2015", "2021", "2021", "2020", "2020",
"2015", "2016", "2017", "2019", "2016", "2015", "2016", "2019",
"2016")), row.names = c(NA, -50L), class = "data.frame")
I'm trying to use dplyr to get something along the lines of this:
sample_data %>%
  group_by(article, date) %>%
  summarise(weight = sum(value))
`summarise()` has grouped output by 'article'. You can override using the `.groups` argument.
# A tibble: 29 x 3
# Groups: article [9]
article date weight
<chr> <chr> <dbl>
1 A 2015 55572
2 A 2016 33948
3 A 2017 11716
4 A 2019 8668
5 A 2020 5386
6 A 2021 2577
7 B 2015 16119
8 B 2018 1414
9 B 2019 8137
10 B 2020 1174
# ... with 19 more rows
However, I want to add another column with a proportion of each article's weight of the total (sum of A:I) per year. The sum of all article proportions should then amount to 1 for each year.
I tried the code below. I suspect this happens because I refer to the raw value column inside summarise(), so every individual value is returned, one row per occurrence. How can I summarise this so it looks like the output above, with the added column?
sample_data %>%
  group_by(article, date) %>%
  summarise(weight = sum(value), prop = value/weight)
`summarise()` has grouped output by 'article', 'date'. You can override using the `.groups` argument.
# A tibble: 50 x 4
# Groups: article, date [29]
article date weight prop
<chr> <chr> <dbl> <dbl>
1 A 2015 55572 0.394
2 A 2015 55572 0.265
3 A 2015 55572 0.341
4 A 2016 33948 0.455
5 A 2016 33948 0.545
6 A 2017 11716 1
7 A 2019 8668 0.845
8 A 2019 8668 0.155
9 A 2020 5386 1
10 A 2021 2577 1
# ... with 40 more rows

After the initial summarize, you have one entry for each article per year. You then wish to know what the contribution of each article was to each year's total, so you need to group_by again using just the year, and finally mutate to get the proportion for each article.
library(dplyr)
sample_data %>%
group_by(article, date) %>%
summarise(weight = sum(value), .groups = "keep") %>%
group_by(date) %>%
mutate(prop = weight / sum(weight))
#> # A tibble: 29 x 4
#> # Groups: date [7]
#> article date weight prop
#> <chr> <chr> <dbl> <dbl>
#> 1 A 2015 55572 0.661
#> 2 A 2016 33948 0.876
#> 3 A 2017 11716 0.632
#> 4 A 2019 8668 0.491
#> 5 A 2020 5386 0.799
#> 6 A 2021 2577 0.628
#> 7 B 2015 16119 0.192
#> 8 B 2018 1414 0.622
#> 9 B 2019 8137 0.461
#> 10 B 2020 1174 0.174
#> # ... with 19 more rows
Created on 2022-02-19 by the reprex package (v2.0.1)
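If you are on dplyr 1.1.0 or newer, a sketch of the same logic using per-operation grouping via the .by argument (this assumes your dplyr version; it is not part of the answer above):
library(dplyr) # .by requires dplyr >= 1.1.0
sample_data %>%
  summarise(weight = sum(value), .by = c(article, date)) %>% # one row per article-year
  mutate(prop = weight / sum(weight), .by = date)            # share of each year's total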

Another option is to do the group-by sum within the first summarise:
library(dplyr)
library(tibble)
library(tidyr)
sample_data %>%
group_by(date) %>%
summarise(out = enframe(tapply(value, article, sum)/sum(value),
name = 'article', value = 'prop'), .groups = 'drop') %>%
unpack(out)
# A tibble: 29 × 3
date article prop
<chr> <chr> <dbl>
1 2015 A 0.661
2 2015 B 0.192
3 2015 D 0.0923
4 2015 E 0.00665
5 2015 F 0.0263
6 2015 H 0.00515
7 2015 I 0.0164
8 2016 A 0.876
9 2016 C 0.0853
10 2016 E 0.0123
# … with 19 more rows
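Whichever variant you use, a quick sanity check, assuming the result is stored in res: the proportions should sum to 1 within each year.
res %>%
  group_by(date) %>%
  summarise(total = sum(prop)) # every total should be exactly 1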

Related

Convert long data to wide data with multiple columns and multiple values

I have a data frame of 300K rows with two main columns of interest (NAME and SUBJECT). I need to convert this data into a wide format and, in addition, if there are records for a particular subject with multiple dates, I need to place them next to each other.
I tried using tidyr::pivot_wider but I'm not able to get it to work.
Sample data:
DF <- data.frame(
NAME = c("ABC", "ABC", "DEF", "ABC", "ABC", "ABC", "DEF", "ABC", "DEF", "ABC", "DEF", "DEF", "DEF", "DEF", "DEF", "DEF", "ABC"),
SUBJECT = c("MATHS", "LANGUAGE 1", "LANGUAGE 1", "LANGUAGE 2","LANGUAGE 2","LANGUAGE 2","LANGUAGE 2", "SCIENCE", "SCIENCE", "HISTORY", "PE", "ENVIRONMENT", "COMPUTERS", "COMPUTERS", "COMPUTERS", "BIOLOGY", "SANSKRIT"),
YEAR = c("2010", "2011", "2012", "2013", "2014", "2015", "2013", "2015", "2016", "2016", "2017", "2015", "2016", "2017", "2018", "2015", "2013"),
MARKS = c("45", "48", "47", "44", "48", "46", "42", "42", "43", "37", "42", "43", "42", "41", "44", "41", "44"),
MAXIMUM = c("46", rep("50", 5), "45", "50", rep("45", 9))
)
> DF
NAME SUBJECT YEAR MARKS MAXIMUM
1 ABC MATHS 2010 45 46
2 ABC LANGUAGE 1 2011 48 50
3 DEF LANGUAGE 1 2012 47 50
4 ABC LANGUAGE 2 2013 44 50
5 ABC LANGUAGE 2 2014 48 50
6 ABC LANGUAGE 2 2015 46 50
7 DEF LANGUAGE 2 2013 42 45
8 ABC SCIENCE 2015 42 50
9 DEF SCIENCE 2016 43 45
10 ABC HISTORY 2016 37 45
11 DEF PE 2017 42 45
12 DEF ENVIRONMENT 2015 43 45
13 DEF COMPUTERS 2016 42 45
14 DEF COMPUTERS 2017 41 45
15 DEF COMPUTERS 2018 44 45
16 DEF BIOLOGY 2015 41 45
17 ABC SANSKRIT 2013 44 45
My expected output is like this: (It is a bit long)
Bit tricky with pivoting twice, but here you go:
library(tidyverse)
DF %>%
group_by(NAME, SUBJECT) %>%
mutate(ind = row_number()) %>%
ungroup() %>%
pivot_longer(c("YEAR", "MARKS", "MAXIMUM")) %>%
mutate(name = paste0(name, ind)) %>%
select(-ind) %>%
pivot_wider(names_from = c("SUBJECT", "name"), values_from = "value")
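To see why the per-group row index ind is needed, here is a diagnostic sketch (assuming DF as defined above): without it, pivot_wider() would hit duplicate NAME/SUBJECT combinations and fall back to list-columns with a warning.
library(dplyr)
# Combinations that occur more than once, e.g. DEF + COMPUTERS over three years
DF %>%
  count(NAME, SUBJECT) %>%
  filter(n > 1)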

dplyr filter out groups in which the max value (per group) is below the top-3 max-values (per group)

So I have this dataframe:
structure(list(id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8,
8, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9), year = c("2017", "2018",
"2019", "2020", "2021", "2022", "2023", "2024", "2025", "2026",
"2017", "2018", "2019", "2020", "2021", "2022", "2023", "2024",
"2025", "2026", "2017", "2018", "2019", "2020", "2021", "2022",
"2023", "2024", "2025", "2026", "2017", "2018", "2019", "2020",
"2021", "2022", "2023", "2024", "2025", "2026", "2017", "2018",
"2019", "2020", "2021", "2022", "2023", "2024", "2025", "2026",
"2017", "2018", "2019", "2020", "2021", "2022", "2023", "2024",
"2025", "2026", "2017", "2018", "2019", "2020", "2021", "2022",
"2023", "2024", "2025", "2026", "2017", "2018", "2019", "2020",
"2021", "2022", "2023", "2024", "2025", "2026", "2017", "2018",
"2019", "2020", "2021", "2022", "2023", "2024", "2025", "2026"
), volume = c(0.0013, 0.0013, 0.0012579, 0.0011895, 0.0011421,
0.0010842, 0.0010211, 0.0010158, 0.00099474, 0.00092632, 0.07878,
0.078791, 0.077295, 0.076638, 0.075538, 0.074468, 0.074776, 0.074051,
0.071706, 0.068056, 0.023269, 0.023011, 0.022374, 0.021962, 0.021408,
0.020949, 0.020811, 0.020354, 0.019309, 0.018042, 0.0004, 0.0004,
0.00038421, 0.00035263, 0.00033158, 0.00032105, 0.00026842, 0.00028421,
0.00026842, 0.00024211, 0.0002, 0.0001, 0.00011579, 0, 0, 0,
0, 0, 0, 0, 0.028422, 0.028361, 0.027768, 0.027501, 0.027029,
0.02651, 0.026588, 0.026209, 0.025094, 0.023391, 0.0001, 0.0001,
0, 0, 0, 0, 0, 0, 0, 0, 0.0047, 0.0047158, 0.0048368, 0.0048316,
0.0049263, 0.0049737, 0.0049947, 0.0051684, 0.0052526, 0.0051842,
0.0106, 0.010389, 0.010279, 0.010005, 0.0098421, 0.0096368, 0.0094053,
0.0093368, 0.0092526, 0.0089316)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -90L))
Which looks like this:
# A tibble: 6 × 3
id year volume
<dbl> <chr> <dbl>
1 1 2017 0.0013
2 1 2018 0.0013
3 1 2019 0.00126
4 1 2020 0.00119
5 1 2021 0.00114
6 1 2022 0.00108
id has 9 distinct values, each with 10 rows. Now I would like to find the maximum value of the volume column per group and then filter out the groups (or just add an extra column, like inTop3) to highlight the IDs whose maximum volume is among the top 3.
This could mean that the largest 3 values are within the group with ID = 2. But I really only want to compare the maximum value of each group with the maximum value of each other group.
Getting the maximum value per group is trivial:
df %>%
group_by(id) %>%
mutate(
m = max(volume)
)
But then I am a little lost as to how to go on. In particular, I wonder how I could create a boolean column that indicates whether a group is in the top 3 or not.
Another possible solution:
library(dplyr)
df %>%
group_by(id) %>%
summarise(m = max(volume)) %>%
slice_max(m, n = 3)
#> # A tibble: 3 × 2
#> id m
#> <dbl> <dbl>
#> 1 2 0.0788
#> 2 6 0.0284
#> 3 3 0.0233
To get the entire group for each of the 3 max-values:
library(tidyverse)
df %>%
group_by(id) %>%
summarise(m = max(volume)) %>%
slice_max(m, n = 3) %>%
group_split(id) %>%
map(~ inner_join(df, .x, by = "id"))
#> [[1]]
#> # A tibble: 10 × 4
#> id year volume m
#> <dbl> <chr> <dbl> <dbl>
#> 1 2 2017 0.0788 0.0788
#> 2 2 2018 0.0788 0.0788
#> 3 2 2019 0.0773 0.0788
#> 4 2 2020 0.0766 0.0788
#> 5 2 2021 0.0755 0.0788
#> 6 2 2022 0.0745 0.0788
#> 7 2 2023 0.0748 0.0788
#> 8 2 2024 0.0741 0.0788
#> 9 2 2025 0.0717 0.0788
#> 10 2 2026 0.0681 0.0788
#>
#> [[2]]
#> # A tibble: 10 × 4
#> id year volume m
#> <dbl> <chr> <dbl> <dbl>
#> 1 3 2017 0.0233 0.0233
#> 2 3 2018 0.0230 0.0233
#> 3 3 2019 0.0224 0.0233
#> 4 3 2020 0.0220 0.0233
#> 5 3 2021 0.0214 0.0233
#> 6 3 2022 0.0209 0.0233
#> 7 3 2023 0.0208 0.0233
#> 8 3 2024 0.0204 0.0233
#> 9 3 2025 0.0193 0.0233
#> 10 3 2026 0.0180 0.0233
#>
#> [[3]]
#> # A tibble: 10 × 4
#> id year volume m
#> <dbl> <chr> <dbl> <dbl>
#> 1 6 2017 0.0284 0.0284
#> 2 6 2018 0.0284 0.0284
#> 3 6 2019 0.0278 0.0284
#> 4 6 2020 0.0275 0.0284
#> 5 6 2021 0.0270 0.0284
#> 6 6 2022 0.0265 0.0284
#> 7 6 2023 0.0266 0.0284
#> 8 6 2024 0.0262 0.0284
#> 9 6 2025 0.0251 0.0284
#> 10 6 2026 0.0234 0.0284
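A possibly simpler variant of the same idea (a sketch, not taken from the answer above): semi_join() keeps every row of df whose id appears in the top-3 summary, without splitting into a list.
library(dplyr)
top_ids <- df %>%
  group_by(id) %>%
  summarise(m = max(volume)) %>%
  slice_max(m, n = 3)
df %>%
  semi_join(top_ids, by = "id") # all 30 rows belonging to ids 2, 6 and 3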
You may use dplyr::top_n
df %>%
group_by(id) %>%
arrange(id, desc(volume)) %>%
top_n(3)
id year volume
<dbl> <chr> <dbl>
1 1 2017 0.0013
2 1 2018 0.0013
3 1 2019 0.00126
4 2 2018 0.0788
5 2 2017 0.0788
6 2 2019 0.0773
7 3 2017 0.0233
8 3 2018 0.0230
9 3 2019 0.0224
10 4 2017 0.0004
# … with 24 more rows
To flag each row as top3 or nottop3:
df %>%
group_by(id) %>%
arrange(id, desc(volume)) %>%
mutate(top3 = ifelse(row_number() %in% c(1,2,3), "top3", "nottop3"))
id year volume top3
<dbl> <chr> <dbl> <chr>
1 1 2017 0.0013 top3
2 1 2018 0.0013 top3
3 1 2019 0.00126 top3
4 1 2020 0.00119 nottop3
5 1 2021 0.00114 nottop3
6 1 2022 0.00108 nottop3
7 1 2023 0.00102 nottop3
8 1 2024 0.00102 nottop3
9 1 2025 0.000995 nottop3
10 1 2026 0.000926 nottop3
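Note that top_n() is superseded in current dplyr; a sketch of the equivalent with slice_max(), under the same grouping:
library(dplyr)
df %>%
  group_by(id) %>%
  slice_max(volume, n = 3) # the 3 largest volume rows within each id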
For me, this worked closest to what I wanted:
df %>%
group_by(id) %>%
summarise(m = max(volume)) %>%
arrange(desc(m)) %>%
mutate(top3 = row_number() <= 3) %>%
inner_join(., df, by = c("id")) -> top3
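If you want the flag on every row of the original data rather than on the summary, a sketch using dense_rank() on the per-group maximum (this assumes the group maxima are distinct):
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(m = max(volume)) %>%
  ungroup() %>%
  mutate(top3 = dense_rank(desc(m)) <= 3) # TRUE for every row of ids 2, 6 and 3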

Summarising movements of individuals spread over several rows

I am a newly self-taught user of R and require assistance.
I am working with a dataset that has captured location of residence, and whether the locality is metropolitan, regional or rural, over 7 years (2015-2021) for a subset of a population. Each individual has a unique ID and each year is on a new row (i.e. each ID has 7 rows). I am trying to figure out how many individuals have remained in the same location, how many have moved, and where they moved to.
I am really struggling to figure out what I need to do to get the required outputs, but I assume there is a way to get a summary table with the number of individuals who haven't moved (and where they are located) and the number of individuals who have moved (and where they moved to).
Your assistance would be greatly appreciated.
Dummy dataset:
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
)
Cheers in advance.
You can use the lead function to add columns containing each person's location in the following year. Using mutate with across, you can apply lead to two columns simultaneously. You can then make row-wise comparisons to look for moves before summarising.
# Group by individual before applying the lead function.
# Apply the lead function to the two listed columns, adding "nextyear" as a suffix.
# Add a logical column which returns TRUE if any change of residence or locality is detected.
# Summarise the data by individual, retaining the location with the max year.
library(dplyr)
library(tidyr)
stack %>%
  unite(col = "Location", c(Residence, Locality), sep = "-") %>%
  group_by(ID) %>%
  mutate(across(c("Year", "Location"), list(nextyear = lead)),
         Move = Location != Location_nextyear) %>%
  filter(!is.na(Year_nextyear)) %>%
  mutate(nb.of.moves = sum(Move, na.rm = TRUE)) %>%
  slice_max(Year) %>%
  select(ID, last.location = Location_nextyear, nb.of.moves)
# A tibble: 3 x 3
# Groups: ID [3]
ID last.location nb.of.moves
<chr> <chr> <int>
1 a Sydney-Metro 0
2 b Sydney-Metro 3
3 c Dubbo-Regional 0
Here is another tidyverse option, using cumsum. We can take the cumulative sum to count how many times each person moves (if they do). Then we can slice the last row and get the count of each location. The change column indicates how many times they moved. However, it's unclear what you want the final product to look like.
library(tidyverse)
stack %>%
group_by(ID) %>%
mutate(
change = cumsum(case_when(
paste0(Residence, Locality) != lag(paste0(Residence, Locality)) ~ TRUE,
TRUE ~ FALSE
))
) %>%
slice(n()) %>%
ungroup %>%
count(Residence, Locality, change)
Output
Residence Locality change n
<chr> <chr> <int> <int>
1 Dubbo Regional 0 1
2 Sydney Metro 0 1
3 Sydney Metro 3 1
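One small hardening of the paste0() comparison (a suggestion, not part of the answer): without a separator, distinct Residence/Locality pairs could in principle collide once concatenated, so paste() with an explicit sep is safer.
stack %>%
  group_by(ID) %>%
  mutate(change = cumsum(coalesce(
    paste(Residence, Locality, sep = "|") !=
      lag(paste(Residence, Locality, sep = "|")),
    FALSE
  )))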
Using data.table.
library(data.table)
setDT(stack) # convert to data.table
setorder(stack, ID, Year) # assure rows are in correct order
stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
## ID lengths values
## 1: a 7 Sydney, Metro
## 2: b 1 Sydney, Metro
## 3: b 4 Orange, Regional
## 4: b 1 Broken Hill, Rural
## 5: b 1 Sydney, Metro
## 6: c 7 Dubbo, Regional
So a stayed in Sydney for 7 years, b stayed in Sydney for 1 year then moved to Orange for 4 years, then moved to Broken Hill for 1 year, then moved back to Sydney for 1 year.
To determine how many times each person moved:
result <- stack[, rle(paste(Residence, Locality, sep=', ')), by=.(ID)]
result[, .(N=.N-1), by=.(ID)]
## ID N
## 1: a 0
## 2: b 3
## 3: c 0
So a and c did not move at all, and b moved 3 times.
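For comparison, a dplyr sketch of the same move count (assuming stack as above): a move is any row whose Residence or Locality differs from the previous year's.
library(dplyr)
stack %>%
  arrange(ID, Year) %>%
  group_by(ID) %>%
  summarise(n_moves = sum(Residence != lag(Residence) |
                            Locality != lag(Locality), na.rm = TRUE))
# a 0, b 3, c 0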
Similar to what @Dealec did, I used the lag function from dplyr instead.
library(tidyverse)
library(janitor)
#>
#> Attaching package: 'janitor'
#> The following objects are masked from 'package:stats':
#>
#> chisq.test, fisher.test
stack <- tribble(
~ID, ~Year, ~Residence, ~Locality,
#--/--/--/----
"a", "2015", "Sydney", "Metro",
"a", "2016", "Sydney", "Metro",
"a", "2017", "Sydney", "Metro",
"a", "2018", "Sydney", "Metro",
"a", "2019", "Sydney", "Metro",
"a", "2020", "Sydney", "Metro",
"a", "2021", "Sydney", "Metro",
"b", "2015", "Sydney", "Metro",
"b", "2016", "Orange", "Regional",
"b", "2017", "Orange", "Regional",
"b", "2018", "Orange", "Regional",
"b", "2019", "Orange", "Regional",
"b", "2020", "Broken Hill", "Rural",
"b", "2021", "Sydney", "Metro",
"c", "2015", "Dubbo", "Regional",
"c", "2016", "Dubbo", "Regional",
"c", "2017", "Dubbo", "Regional",
"c", "2018", "Dubbo", "Regional",
"c", "2019", "Dubbo", "Regional",
"c", "2020", "Dubbo", "Regional",
"c", "2021", "Dubbo", "Regional",
) %>%
clean_names()
results <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ location
)
) %>%
ungroup() %>%
select(-row)
results
#> # A tibble: 21 x 6
#> id year residence locality location movement
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 a 2015 Sydney Metro Sydney_Metro <NA>
#> 2 a 2016 Sydney Metro Sydney_Metro no_movement
#> 3 a 2017 Sydney Metro Sydney_Metro no_movement
#> 4 a 2018 Sydney Metro Sydney_Metro no_movement
#> 5 a 2019 Sydney Metro Sydney_Metro no_movement
#> 6 a 2020 Sydney Metro Sydney_Metro no_movement
#> 7 a 2021 Sydney Metro Sydney_Metro no_movement
#> 8 b 2015 Sydney Metro Sydney_Metro <NA>
#> 9 b 2016 Orange Regional Orange_Regional Orange_Regional
#> 10 b 2017 Orange Regional Orange_Regional no_movement
#> # ... with 11 more rows
results %>%
count(year, movement) %>%
pivot_wider(names_from = movement,
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year na no_movement orange_regional broken_hill_rural sydney_metro
#> <chr> <int> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA NA
#> 2 2016 NA 2 1 NA NA
#> 3 2017 NA 3 NA NA NA
#> 4 2018 NA 3 NA NA NA
#> 5 2019 NA 3 NA NA NA
#> 6 2020 NA 2 NA 1 NA
#> 7 2021 NA 2 NA NA 1
#tracking movement from a location
from_location <- stack %>%
mutate(location = paste(residence, locality, sep = "_")) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(
row = row_number(),
movement_from = case_when(
row == 1 ~ NA_character_,
location == lag(location, n = 1) ~ "no_movement",
TRUE ~ lag(location, n = 1)
)
) %>%
ungroup() %>%
select(-row)
from_location %>%
count(year, movement_from) %>%
pivot_wider(names_from = movement_from,
names_prefix = "from_",
values_from = n) %>%
clean_names()
#> # A tibble: 7 x 6
#> year from_na from_no_movement from_sydney_metro from_orange_regional
#> <chr> <int> <int> <int> <int>
#> 1 2015 3 NA NA NA
#> 2 2016 NA 2 1 NA
#> 3 2017 NA 3 NA NA
#> 4 2018 NA 3 NA NA
#> 5 2019 NA 3 NA NA
#> 6 2020 NA 2 NA 1
#> 7 2021 NA 2 NA NA
#> # ... with 1 more variable: from_broken_hill_rural <int>
Created on 2022-04-28 by the reprex package (v2.0.1)

Rolling regression based on column values (or date) in R

I have the following table:
# A tibble: 40 x 5
# Groups: stock [1]
stock date mkt_cap week returns
<chr> <date> <dbl> <int> <dbl>
1 A 2019-03-04 10522834. NA NA
2 A 2019-03-05 11659707. NA 9.70
3 A 2019-03-06 11464531. NA -2.25
4 A 2019-03-07 12217241. NA 5.80
5 A 2019-03-08 11619351. 1 -5.57
6 A 2019-03-09 11578687. NA -0.899
7 A 2019-03-10 11658368. NA 0.141
8 A 2019-03-11 12722921. NA 8.20
9 A 2019-03-12 15429934. NA 18.8
10 A 2019-03-13 16801600. NA 7.98
11 A 2019-03-14 17898334. NA 5.79
12 A 2019-03-15 18492686. 2 2.74
13 A 2019-03-16 20686683. NA 10.7
14 A 2019-03-17 22299970. NA 6.98
15 A 2019-03-18 22924182. NA 2.24
16 A 2019-03-19 24174351. NA 4.79
17 A 2019-03-20 24661467. NA 1.48
18 A 2019-03-21 23351810. NA -5.97
19 A 2019-03-22 27826601. 3 17.0
20 A 2019-03-23 30670482. NA 9.22
21 A 2019-03-24 32802772. NA 6.21
22 A 2019-03-25 31778387. NA -3.68
23 A 2019-03-26 33237006. NA 3.99
24 A 2019-03-27 34971479. NA 4.59
25 A 2019-03-28 36774005. NA 4.53
26 A 2019-03-29 37594815. 4 1.71
27 A 2019-03-30 38321816. NA 1.42
28 A 2019-03-31 35167070. NA -9.08
29 A 2019-04-01 35625396. NA 0.808
30 A 2019-04-02 35764747. NA -0.0940
31 A 2019-04-03 28316242. NA -23.8
32 A 2019-04-04 26124803. NA -8.53
33 A 2019-04-05 30390295. 5 14.6
34 A 2019-04-06 28256485. NA -7.76
35 A 2019-04-07 29807837. NA 4.87
36 A 2019-04-08 30970364. NA 3.36
37 A 2019-04-09 30470093. NA -2.10
38 A 2019-04-10 30860276. NA 0.806
39 A 2019-04-11 27946472. NA -10.4
40 A 2019-04-12 27662766. 6 -1.48
Over this table, I want to run a rolling regression where each regression uses the past month of data. I want to run these rolling regressions over the weeks, that is, for week == 1, week == 2, etc., each time using the past month of data. The regression should be lm(returns ~ mkt_cap). I have tried a number of things using the slide_period() function, but they did not work out for me. For example, I tried to run
tbl.data %>% group_by(stock, week) %>% slide_period(date, date, "month", ~.x, .before = 1). There are some gaps in my data, so I would prefer a solution that works from the actual dates.
Could someone help me out? Kind regards.
I would use a tidyverse rowwise approach.
What is not clear to me is how the models should be created by week while looking back over the last month. In the approach below I calculate max_date per week and go back one month from it.
# setup
library(tidyverse)
library(lubridate)
dat <- tribble(~stock, ~date, ~mkt_cap, ~week, ~returns,
"A", "2019-03-04", 10522834., NA, NA,
"A", "2019-03-05", 11659707., NA, 9.70,
"A", "2019-03-06", 11464531., NA, -2.25,
"A", "2019-03-07", 12217241., NA, 5.80,
"A", "2019-03-08", 11619351., 1, -5.57,
"A", "2019-03-09", 11578687., NA, -0.899,
"A", "2019-03-10", 11658368., NA, 0.141,
"A", "2019-03-11", 12722921., NA, 8.20,
"A", "2019-03-12", 15429934., NA, 18.8,
"A", "2019-03-13", 16801600., NA, 7.98,
"A", "2019-03-14", 17898334., NA, 5.79,
"A", "2019-03-15", 18492686., 2, 2.74,
"A", "2019-03-16", 20686683., NA, 10.7,
"A", "2019-03-17", 22299970., NA, 6.98,
"A", "2019-03-18", 22924182., NA, 2.24,
"A", "2019-03-19", 24174351., NA, 4.79,
"A", "2019-03-20", 24661467., NA, 1.48,
"A", "2019-03-21", 23351810., NA, -5.97,
"A", "2019-03-22", 27826601., 3, 17.0,
"A", "2019-03-23", 30670482., NA, 9.22,
"A", "2019-03-24", 32802772., NA, 6.21,
"A", "2019-03-25", 31778387., NA, -3.68,
"A", "2019-03-26", 33237006., NA, 3.99,
"A", "2019-03-27", 34971479., NA, 4.59,
"A", "2019-03-28", 36774005., NA, 4.53,
"A", "2019-03-29", 37594815., 4, 1.71,
"A", "2019-03-30", 38321816., NA, 1.42,
"A", "2019-03-31", 35167070., NA, -9.08,
"A", "2019-04-01", 35625396., NA, 0.808,
"A", "2019-04-02", 35764747., NA, -0.0940,
"A", "2019-04-03", 28316242., NA, -23.8,
"A", "2019-04-04", 26124803., NA, -8.53,
"A", "2019-04-05", 30390295., 5, 14.6,
"A", "2019-04-06", 28256485., NA, -7.76,
"A", "2019-04-07", 29807837., NA, 4.87,
"A", "2019-04-08", 30970364., NA, 3.36,
"A", "2019-04-09", 30470093., NA, -2.10,
"A", "2019-04-10", 30860276., NA, 0.806,
"A", "2019-04-11", 27946472., NA, -10.4,
"A", "2019-04-12", 27662766., 6, -1.48) %>%
mutate(date = as.Date(date)) %>%
fill(week, .direction = "up")
# summarised data.frame by week with min and max date
dat2 <- dat %>%
group_by(week) %>%
summarise(max_date = max(date),
min_date = max_date %m-% months(1))
#> `summarise()` ungrouping output (override with `.groups` argument)
# create the models
dat3 <- dat2 %>%
rowwise() %>%
mutate(mod = list(lm(returns ~ mkt_cap,
data = filter(dat,
date <= .env$max_date,
date >= .env$min_date))))
# get the relevant informationen per week
dat3 %>%
mutate(res = list(broom::tidy(mod)),
broom::glance(mod)) %>%
select(week,
res,
adj.r.squared,
mod_p.value = p.value,
nobs) %>%
unnest(res) %>%
filter(term != "(Intercept)")
#> # A tibble: 6 x 9
#> week term estimate std.error statistic p.value adj.r.squared mod_p.value
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 mkt_~ 1.01e-5 1.34e-5 0.756 0.529 -0.167 0.529
#> 2 2 mkt_~ 9.26e-7 7.45e-7 1.24 0.245 0.0520 0.245
#> 3 3 mkt_~ 2.56e-7 2.97e-7 0.864 0.400 -0.0152 0.400
#> 4 4 mkt_~ 2.00e-8 1.42e-7 0.141 0.889 -0.0426 0.889
#> 5 5 mkt_~ -1.18e-7 1.61e-7 -0.736 0.467 -0.0150 0.467
#> 6 6 mkt_~ -3.23e-7 2.37e-7 -1.37 0.182 0.0271 0.182
#> # ... with 1 more variable: nobs <int>
Created on 2021-04-27 by the reprex package (v0.3.0)
Update
This approach can be easily expanded when working with more than one stock:
# lets append the same data and change stock to "B":
dat <- dat %>%
bind_rows({mutate(., stock = "B")})
# summarised data.frame by week and group with min and max date
dat2 <- dat %>%
group_by(stock, week) %>%
summarise(max_date = max(date),
min_date = max_date %m-% months(1))
#> `summarise()` has grouped output by 'stock'. You can override using the `.groups` argument.
# create the models, and this time also filter on .env$stock
dat3 <- dat2 %>%
rowwise() %>%
mutate(mod = list(lm(returns ~ mkt_cap,
data = filter(dat,
stock == .env$stock,
date <= .env$max_date,
date >= .env$min_date))))
# get the relevant informationen per week (this stays the same!)
dat3 %>%
mutate(res = list(broom::tidy(mod)),
broom::glance(mod)) %>%
select(week,
res,
adj.r.squared,
mod_p.value = p.value,
nobs) %>%
unnest(res) %>%
filter(term != "(Intercept)")
#> Adding missing grouping variables: `stock`
#> # A tibble: 12 x 10
#> # Groups: stock [2]
#> stock week term estimate std.error statistic p.value adj.r.squared
#> <chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 mkt_cap 0.0000101 0.0000134 0.756 0.529 -0.167
#> 2 A 2 mkt_cap 0.000000926 0.000000745 1.24 0.245 0.0520
#> 3 A 3 mkt_cap 0.000000256 0.000000297 0.864 0.400 -0.0152
#> 4 A 4 mkt_cap 0.0000000200 0.000000142 0.141 0.889 -0.0426
#> 5 A 5 mkt_cap -0.000000118 0.000000161 -0.736 0.467 -0.0150
#> 6 A 6 mkt_cap -0.000000323 0.000000237 -1.37 0.182 0.0271
#> 7 B 1 mkt_cap 0.0000101 0.0000134 0.756 0.529 -0.167
#> 8 B 2 mkt_cap 0.000000926 0.000000745 1.24 0.245 0.0520
#> 9 B 3 mkt_cap 0.000000256 0.000000297 0.864 0.400 -0.0152
#> 10 B 4 mkt_cap 0.0000000200 0.000000142 0.141 0.889 -0.0426
#> 11 B 5 mkt_cap -0.000000118 0.000000161 -0.736 0.467 -0.0150
#> 12 B 6 mkt_cap -0.000000323 0.000000237 -1.37 0.182 0.0271
#> # … with 2 more variables: mod_p.value <dbl>, nobs <int>
Created on 2021-04-27 by the reprex package (v0.3.0)
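A note on the .env$ pronoun used in the filters above: inside filter(dat, ...), a bare stock would be looked up among dat's columns first, so stock == stock would compare the column with itself and keep every row; .env$stock forces the lookup in the calling environment, i.e. the current rowwise value. A minimal illustration with toy data (not from the answer):
library(dplyr)
tb <- tibble(x = 1:3)
x <- 2
filter(tb, x == x)      # column vs column: keeps all 3 rows
filter(tb, x == .env$x) # column vs environment variable: keeps only x == 2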
An ugly Base R solution (assuming you just want the predicted values returned):
# Allocate some memory such that each stock in data.frame
# can become an element in a list: df_list => empty list:
df_list <- vector("list", length(unique(df$stock)))
# Split the data.frame into the list: df_list => list of data.frames:
df_list <- with(df, split(df, stock))
# Number of weeks to consider in rolling regression in this case 4,
# approximating a month: n_weeks => integer scalar:
n_weeks <- 4
# For each stock in the list: nested lists => stdout(console)
lapply(df_list, function(x){
# Clean the week vector, filling NAs with values:
# week => integer vector
x$week <- with(x, rev(na.omit(rev(week))[cumsum(!is.na(rev(week)))]))
# Impute the first return value if it is missing:
x$returns[1] <- with(x,
ifelse(is.na(returns[1]), returns[which.min(!(is.na(returns)))],
returns[1]
)
)
# Interpolate the return using the previous value:
# returns => numeric vector
x$returns <- with(x, na.omit(returns)[cumsum(!is.na(returns))])
# For each week:
y <- lapply(unique(x$week), function(z){
# Calculate the range for the regression:
rng <- if(z - n_weeks <= 0){
seq_len(z)
}else{
seq(from = (z - n_weeks), to = z, by = 1)
}
# Subset the data: sbst => data.frame
sbst <- x[x$week %in% rng,]
# Calculate the regression:
predict(lm(returns ~ mkt_cap, data = sbst))
}
)
# Return the list of regressions:
y
}
)
Data:
df <- structure(list(stock = c("A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A"), date = structure(17959:17998, class = c("IDate",
"Date")), mkt_cap = c(10522834, 11659707, 11464531, 12217241,
11619351, 11578687, 11658368, 12722921, 15429934, 16801600, 17898334,
18492686, 20686683, 22299970, 22924182, 24174351, 24661467, 23351810,
27826601, 30670482, 32802772, 31778387, 33237006, 34971479, 36774005,
37594815, 38321816, 35167070, 35625396, 35764747, 28316242, 26124803,
30390295, 28256485, 29807837, 30970364, 30470093, 30860276, 27946472,
27662766), week = c(NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA,
2L, NA, NA, NA, NA, NA, NA, 3L, NA, NA, NA, NA, NA, NA, 4L, NA,
NA, NA, NA, NA, NA, 5L, NA, NA, NA, NA, NA, NA, 6L), returns = c(NA,
9.7, -2.25, 5.8, -5.57, -0.899, 0.141, 8.2, 18.8, 7.98, 5.79,
2.74, 10.7, 6.98, 2.24, 4.79, 1.48, -5.97, 17, 9.22, 6.21, -3.68,
3.99, 4.59, 4.53, 1.71, 1.42, -9.08, 0.808, -0.094, -23.8, -8.53,
14.6, -7.76, 4.87, 3.36, -2.1, 0.806, -10.4, -1.48)), class = "data.frame", row.names = c(NA,
-40L))
Does slide_index() from the slider package do what you want?
library(tidyverse)
library(slider)
library(broom)
set.seed(1001)
## more or less the slider help page for slide_index()
df <- data.frame(
y = rnorm(100),
x = rnorm(100),
i = as.Date("2019-08-15") + c(0, 2, 4, 6:102) # <- irregular
)
head(df)
#> y x i
#> 1 2.1886481 0.07862339 2019-08-15
#> 2 -0.1775473 -0.98708727 2019-08-17
#> 3 -0.1852753 -1.17523226 2019-08-19
#> 4 -2.5065362 1.68140888 2019-08-21
#> 5 -0.5573113 0.75623228 2019-08-22
#> 6 -0.1435595 0.30309733 2019-08-23
# Rolling regression over an 11-day window (current day + 10 days back).
# Note: the model must be fit on the window slice .x, not on the full df,
# or every window would return the same full-data fit.
out <- df %>%
  mutate(model = slide_index(df, i, ~ lm(y ~ x, data = .x),
                             .before = 10, .complete = TRUE)) %>%
  as_tibble()
as_tibble()
out %>%
filter(!(map_lgl(model, ~ is_empty(.x)))) %>%
mutate(results = map(model, tidy)) %>%
unnest(cols = c(results))
#> # A tibble: 186 x 9
#> y x i model term estimate std.error statistic p.value
#> <dbl> <dbl> <date> <list> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.623 0.741 2019-08-25 <lm> (Intercept) -0.000347 0.115 -0.00302 0.998
#> 2 -0.623 0.741 2019-08-25 <lm> x -0.0825 0.144 -0.575 0.567
#> 3 -0.907 0.495 2019-08-26 <lm> (Intercept) -0.000347 0.115 -0.00302 0.998
#> 4 -0.907 0.495 2019-08-26 <lm> x -0.0825 0.144 -0.575 0.567
#> 5 -1.59 -1.13 2019-08-27 <lm> (Intercept) -0.000347 0.115 -0.00302 0.998
#> 6 -1.59 -1.13 2019-08-27 <lm> x -0.0825 0.144 -0.575 0.567
#> 7 0.303 -1.16 2019-08-28 <lm> (Intercept) -0.000347 0.115 -0.00302 0.998
#> 8 0.303 -1.16 2019-08-28 <lm> x -0.0825 0.144 -0.575 0.567
#> 9 1.63 -0.713 2019-08-29 <lm> (Intercept) -0.000347 0.115 -0.00302 0.998
#> 10 1.63 -0.713 2019-08-29 <lm> x -0.0825 0.144 -0.575 0.567
#> # … with 176 more rows
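To run this per stock with a true one-month lookback on an irregular date index, slide_index() also accepts a lubridate period for .before. A rough sketch under those assumptions (a table tbl with columns stock, date, returns, mkt_cap; dplyr >= 1.1 for pick()):
library(dplyr)
library(slider)
library(lubridate)
tbl %>%
  group_by(stock) %>%
  mutate(model = slide_index(pick(returns, mkt_cap), date,
                             ~ lm(returns ~ mkt_cap, data = .x),
                             .before = days(30), # roughly one month back
                             .complete = TRUE)) %>%
  ungroup()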

Mutate and case_when issue with dplyr

I have the following data, and I want to make a new column using mutate which, where colour == 'g', takes the level on the 'g' row minus the level on the 'r' row.
Then likewise with type: where type == 'type1', take the corresponding level minus the level on the 'type2' row.
library(dplyr)
d <- tibble(
date = c("2018", "2018", "2018", "2019", "2019", "2019", "2020", "2020", "2020", "2020"),
colour = c("none","g", "r", "none","g", "r", "none", "none", "none", "none"),
type = c("type1", "none", "none", "type2", "none", "none", "none", "none", "none", "none"),
level= c(78, 99, 45, 67, 87, 78, 89, 87, 67, 76))
Just to be clear, this is what I want the data to look like:
d2 <- tibble(
date = c("2018", "2018", "2018", "2019", "2019", "2019", "2020", "2020", "2020", "2020"),
colour = c("none","g", "r", "none","g", "r", "none", "none", "none", "none"),
type = c("type1", "none", "none", "type2", "none", "none", "none", "none", "none", "none"),
level= c(78, 99, 45, 67, 87, 78, 89, 87, 67, 76),
color_gap = c("NULL", 44, "NULL", "NULL", 9, "NULL", "NULL", "NULL", "NULL", "NULL"),
type_gap = c(11, "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL", "NULL"))
I started to use mutate and case_when and got to the code below. However, I'm stuck on the final calculation part. How do I say that I want to take the colour 'g' level minus the colour 'r' level?
d %>%
  mutate(color_gap = case_when(colour == "g" ~ level)) %>%
  mutate(type_gap = case_when(type == "type1" ~ level)) -> d2
Anyone know how to complete this?
Thanks
This subtracts the first r level from the first g level, second r level from second g level, etc. Same for type1 and type2. This has no checks at all. It doesn't check whether there is a matching r for each g, whether they are in the expected order, whether they are in the same date-group, etc. It assumes the data is already perfectly formatted as expected, so be careful using this on real data.
d %>%
mutate(color_gap = replace(rep(NA, n()), colour == 'g',
level[colour == 'g'] - level[colour == 'r']),
type_gap = replace(rep(NA, n()), type == 'type1',
level[type == 'type1'] - level[type == 'type2']))
# # A tibble: 10 x 6
# date colour type level color_gap type_gap
# <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2018 none type1 78 NA 11
# 2 2018 g none 99 54 NA
# 3 2018 r none 45 NA NA
# 4 2019 none type2 67 NA NA
# 5 2019 g none 87 9 NA
# 6 2019 r none 78 NA NA
# 7 2020 none none 89 NA NA
# 8 2020 none none 87 NA NA
# 9 2020 none none 67 NA NA
# 10 2020 none none 76 NA NA
You could do this with group_by and mutate.
I assumed that there is only one row per date satisfying each condition.
d %>%
  mutate(color_gap = case_when(colour == "g" ~ level)) %>%
  mutate(type_gap = case_when(type == "type1" ~ level)) %>%
  group_by(date) %>%
  mutate(diff = max(color_gap, na.rm = TRUE) - max(type_gap, na.rm = TRUE))
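One caveat with this approach, assuming the data above: in date groups where color_gap or type_gap is entirely NA (2019 for type_gap, and all of 2020), max(..., na.rm = TRUE) returns -Inf with a warning. A guarded sketch:
library(dplyr)
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
d %>%
  mutate(color_gap = case_when(colour == "g" ~ level),
         type_gap = case_when(type == "type1" ~ level)) %>%
  group_by(date) %>%
  mutate(diff = safe_max(color_gap) - safe_max(type_gap)) %>%
  ungroup()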
