Rolling weighted sum across table with NA in R - r

I am trying to get rolling weighted sums across a table, and have a method involving matrix multiplication, but it breaks when some of the data is missing.
So if I use
library(tidyverse)
mydata <- tibble(Country = c("Australia", "Canada"),
"1980" = c(1000, 2000),
"1981" = c(1100, 2100),
"1982" = c(1300, 2300),
"1983" = c(1200, 2400),
"1984" = c(1400, 2200),
"1985" = c(1500, 2500))
weights <- c(3, 4, 6)
n0 <- ncol(mydata) - length(weights)
matweights <- matrix(rep(c(rep(0, n0), weights), n0)[-(1:n0)], ncol=n0)
tibble(cbind(mydata[, 1], as.matrix(mydata[, -1]) %*% matweights))
I get what I want with
# A tibble: 2 x 5
Country `1` `2` `3` `4`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Australia 15200 15700 17100 18200
2 Canada 28200 29900 29700 31000
where for example in the top right 18200 is 3*1200 + 4*1400 + 6*1500
But if for example one of the values is missing, say mydata[2, 3] <- NA then I would get
# A tibble: 2 x 5
Country `1` `2` `3` `4`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Australia 15200 15700 17100 18200
2 Canada NA NA NA NA
when I want
# A tibble: 2 x 5
Country `1` `2` `3` `4`
<chr> <dbl> <dbl> <dbl> <dbl>
1 Australia 15200 15700 17100 18200
2 Canada NA NA 29700 31000
The problem with my matrix approach is 0 * NA giving NA when I want it to be 0. I know there are solutions using some kind of apply approach but I suspect that may be slower with a large table.

I quite like the slider package for sliding-window functions: it is very flexible and has a purrr-like syntax. Here, slide_index_dbl() slides a function over a vector while using another variable (the year) as an index that decides which observations fall inside each window.
First, reshape to long form and group by country; the rolling sum is then a single call within mutate(). .before specifies how many years back to include, and .complete = TRUE returns NA for partial windows instead of evaluating the function on them.
library(tidyverse)
out1 <- mydata %>%
gather(year, value, -Country, convert = TRUE) %>%
group_by(Country) %>%
mutate(
value_3y = slider::slide_index_dbl(
value, .i = year,
.f = ~sum(.x * weights),
.before = 2, .complete = TRUE
)
)
out1
#> # A tibble: 12 x 4
#> # Groups: Country [2]
#> Country year value value_3y
#> <chr> <int> <dbl> <dbl>
#> 1 Australia 1980 1000 NA
#> 2 Canada 1980 2000 NA
#> 3 Australia 1981 1100 NA
#> 4 Canada 1981 2100 NA
#> 5 Australia 1982 1300 15200
#> 6 Canada 1982 2300 28200
#> 7 Australia 1983 1200 15700
#> 8 Canada 1983 2400 29900
#> 9 Australia 1984 1400 17100
#> 10 Canada 1984 2200 29700
#> 11 Australia 1985 1500 18200
#> 12 Canada 1985 2500 31000
To reshape to wide form:
out1 %>%
select(-value) %>%
drop_na() %>% # omit to keep partial/empty years
spread(year, value_3y)
#> # A tibble: 2 x 5
#> # Groups: Country [2]
#> Country `1982` `1983` `1984` `1985`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Australia 15200 15700 17100 18200
#> 2 Canada 28200 29900 29700 31000
If the data contains NAs, the same code works unchanged, and only the windows that include a missing value come out as NA:
mydata[2, 3] <- NA
out2 <- mydata %>%
gather(year, value, -Country, convert = TRUE) %>%
group_by(Country) %>%
mutate(
value_3y = slider::slide_index_dbl(
value, .i = year,
.f = ~sum(.x * weights),
.before = 2, .complete = TRUE
)
)
out2
#> # A tibble: 12 x 4
#> # Groups: Country [2]
#> Country year value value_3y
#> <chr> <int> <dbl> <dbl>
#> 1 Australia 1980 1000 NA
#> 2 Canada 1980 2000 NA
#> 3 Australia 1981 1100 NA
#> 4 Canada 1981 NA NA
#> 5 Australia 1982 1300 15200
#> 6 Canada 1982 2300 NA
#> 7 Australia 1983 1200 15700
#> 8 Canada 1983 2400 NA
#> 9 Australia 1984 1400 17100
#> 10 Canada 1984 2200 29700
#> 11 Australia 1985 1500 18200
#> 12 Canada 1985 2500 31000
out2 %>%
select(-value) %>%
drop_na() %>%
spread(year, value_3y)
#> # A tibble: 2 x 5
#> # Groups: Country [2]
#> Country `1982` `1983` `1984` `1985`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Australia 15200 15700 17100 18200
#> 2 Canada NA NA 29700 31000

Using rollapply we have the following matrix:
library(zoo)
t(rollapply(t(mydata[, -1]), 3, function(x) sum(x * weights)))
## [,1] [,2] [,3] [,4]
## [1,] 15200 15700 17100 18200
## [2,] NA NA 29700 31000

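Linear filtering option (the weights are reversed because, with sides = 1, stats::filter() applies the first filter coefficient to the most recent value):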
t(apply(mydata[-1], 1, stats::filter, filter=rev(weights), sides=1))
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] NA NA 15200 15700 17100 18200
#[2,] NA NA NA NA 29700 31000
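The matrix-multiplication idea from the question can probably be kept as well. The following is only a rough base R sketch: it fills the NAs with 0 before multiplying and then blanks out every window that contained a missing value. The helper make_band() and the matrix mydata_mat are illustrative names assumed here; they do not come from the question or from any package.

```r
# Data from the question as a plain numeric matrix (Canada 1981 set to NA)
mydata_mat <- rbind(
  Australia = c(1000, 1100, 1300, 1200, 1400, 1500),
  Canada    = c(2000,   NA, 2300, 2400, 2200, 2500)
)
weights <- c(3, 4, 6)

# Hypothetical helper: column j holds `w` in rows j..(j + length(w) - 1)
make_band <- function(w, n_col) {
  n_win <- n_col - length(w) + 1          # number of complete windows
  sapply(seq_len(n_win), function(j) {
    col <- numeric(n_col)
    col[j:(j + length(w) - 1)] <- w
    col
  })
}

band_w   <- make_band(weights, ncol(mydata_mat))                  # weights
band_one <- make_band(rep(1, length(weights)), ncol(mydata_mat))  # window membership

# Replace NA by 0 so that 0 * NA can no longer poison unrelated windows
filled <- mydata_mat
filled[is.na(filled)] <- 0
sums <- filled %*% band_w

# Count the missing values in each window and mask those windows
n_missing <- (is.na(mydata_mat) + 0) %*% band_one
sums[n_missing > 0] <- NA
sums
```

On this data the first row keeps all four sums, while the second row is NA only for the two windows that include 1981, which is the layout the question asks for. Both steps are plain matrix products, so no per-row apply is involved.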

Related

How to find the annual evolution rate for each firm in my data table?

So I have a data table of 5000 firms, each firm is assigned a numerical value ("id") which is 1 for the first firm, 2 for the second ...
Here is my table with only the profit variable :
| id | year | profit |
|:----|:----|:----|
| 1 | 2001 | -0.4 |
| 1 | 2002 | -0.89 |
| 2 | 2001 | 1.89 |
| 2 | 2002 | 2.79 |
Each firm is expressed twice, one line specifies the data in 2001 and the second in 2002 (the "id" value being the same on both lines because it is the same firm one year apart).
How to calculate the annual rate of change of each firm ("id") between 2001 and 2002 ?
I'm really new to R and I don't see where to start? Separate the 2001 and 2002 data?
I did this :
years <- sort(unique(group$year))
years
And I also found this on the internet but with no success :
library(dplyr)
res <-
group %>%
arrange(id,year) %>%
group_by(id) %>%
mutate(evol_rate = ("group$year$2002" / lag("group$year$2001") - 1) * 100) %>%
ungroup()
Thank you very much
From what you've written, I take it that you want to calculate the rate of change (ROC) of the profit values between 2001 and 2002:
ROC = (current_value / previous_value - 1) * 100
To accomplish this, I suggest tidyr::pivot_wider() which reshapes your dataframe from long to wide format (see: https://r4ds.had.co.nz/tidy-data.html#pivoting).
Code:
require(tidyr)
require(dplyr)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250) # one 2001 and one 2002 row per id (500 rows in total)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
head(df, 10)
#> id year value
#> 1 1 2001 856
#> 2 1 2002 1850
#> 3 2 2001 1687
#> 4 2 2002 1902
#> 5 3 2001 1728
#> 6 3 2002 1773
#> 7 4 2001 691
#> 8 4 2002 1691
#> 9 5 2001 1368
#> 10 5 2002 893
df_wide <- df %>%
pivot_wider(names_from = year,
names_prefix = "profit_",
values_from = value,
values_fn = mean)
res <- df_wide %>%
mutate(evol_rate = (profit_2002/profit_2001-1)*100) %>%
round(2)
head(res, 10)
#> # A tibble: 10 x 4
#> id profit_2001 profit_2002 evol_rate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 856 1850 116.
#> 2 2 1687 1902 12.7
#> 3 3 1728 1773 2.6
#> 4 4 691 1691 145.
#> 5 5 1368 893 -34.7
#> 6 6 883 516 -41.6
#> 7 7 1280 1649 28.8
#> 8 8 1579 1383 -12.4
#> 9 9 1907 1626 -14.7
#> 10 10 1227 1134 -7.58
If you want to do it without reshaping your data into a wide format you can use
library(tidyverse)
id <- sort(rep(seq(1,250, 1), 2))
year <- rep(seq(2001, 2002, 1), 250) # one 2001 and one 2002 row per id (500 rows in total)
value <- sample(500:2000, 500)
df <- data.frame(id, year, value)
df %>% head(n = 10)
#> id year value
#> 1 1 2001 1173
#> 2 1 2002 1648
#> 3 2 2001 1560
#> 4 2 2002 1091
#> 5 3 2001 1736
#> 6 3 2002 667
#> 7 4 2001 1840
#> 8 4 2002 1202
#> 9 5 2001 1597
#> 10 5 2002 1797
new_df <- df %>%
group_by(id) %>%
mutate(ROC = ((value / lag(value) - 1) * 100))
new_df %>% head(n = 10)
#> # A tibble: 10 × 4
#> # Groups: id [5]
#> id year value ROC
#> <dbl> <dbl> <int> <dbl>
#> 1 1 2001 1173 NA
#> 2 1 2002 1648 40.5
#> 3 2 2001 1560 NA
#> 4 2 2002 1091 -30.1
#> 5 3 2001 1736 NA
#> 6 3 2002 667 -61.6
#> 7 4 2001 1840 NA
#> 8 4 2002 1202 -34.7
#> 9 5 2001 1597 NA
#> 10 5 2002 1797 12.5
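This groups the data by id and then uses lag() to compare each year's value with the previous year's, so the first year of every firm has no predecessor and gets NA. The rows must be sorted by year within each id, as they are here.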
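For comparison, the same formula can be applied to the four rows shown in the question using base R only. This is a minimal sketch; the data frame name group is taken from the question, the column name evol_rate from the attempted code, and the use of ave() is just one possible choice.

```r
# The four rows shown in the question
group <- data.frame(
  id     = c(1, 1, 2, 2),
  year   = c(2001, 2002, 2001, 2002),
  profit = c(-0.4, -0.89, 1.89, 2.79)
)

# Sort so that 2001 precedes 2002 within each firm
group <- group[order(group$id, group$year), ]

# Per firm: NA for the first year, then (current / previous - 1) * 100
group$evol_rate <- ave(
  group$profit, group$id,
  FUN = function(x) c(NA, (x[-1] / x[-length(x)] - 1) * 100)
)
group
```

Firm 1 illustrates a caveat of the formula: its profit falls from -0.4 to -0.89, yet the computed rate is positive (122.5), because the base value is negative. Rates for firms with a negative or zero starting profit therefore need to be read with care.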

How do I group data with the same date into one row in R program

How can I arrange the data in a triangle format, as shown below, in R?
I have done this manually, but my data set is huge. Is there any code that can do it?
dev
origin 1 2 3 4
2005 1500 2420 2720 3020
2006 1150 1840 2070 NA
2007 1650 2640 NA NA
2008 1740 NA NA NA
Here's a tidyverse solution.
First, create the data, as you haven't done so.
library(tidyverse)
d <- tibble(
Year=c(rep(2005, 4), rep(2006, 3), rep(2007, 2), 2008),
Amount=c(1500, 2420, 2720, 3020, 1150, 1840, 2070, 1650, 2640, 1740)
)
Now create an index within year. We will need this later.
d %>%
group_by(Year) %>%
mutate(Index=1:n())
# A tibble: 10 × 3
# Groups: Year [4]
Year Amount Index
<dbl> <dbl> <int>
1 2005 1500 1
2 2005 2420 2
3 2005 2720 3
4 2005 3020 4
5 2006 1150 1
6 2006 1840 2
7 2006 2070 3
8 2007 1650 1
9 2007 2640 2
10 2008 1740 1
Now pivot_wider into the required format.
d %>%
group_by(Year) %>%
mutate(Index=1:n()) %>%
pivot_wider(
names_from=Index,
values_from=Amount
)
# A tibble: 4 × 5
# Groups: Year [4]
Year `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2005 1500 2420 2720 3020
2 2006 1150 1840 2070 NA
3 2007 1650 2640 NA NA
4 2008 1740 NA NA NA
Finally fix the column name and remove the grouping.
d %>%
group_by(Year) %>%
mutate(Index=1:n()) %>%
pivot_wider(
names_from=Index,
values_from=Amount
) %>%
rename(origin=Year) %>%
ungroup()
# A tibble: 4 × 5
origin `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2005 1500 2420 2720 3020
2 2006 1150 1840 2070 NA
3 2007 1650 2640 NA NA
4 2008 1740 NA NA NA
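A base R variant of the same idea, shown only as a sketch: number the rows within each year with ave() and let tapply() spread them into a matrix, which leaves NA in the cells that have no data. The column name dev is borrowed from the layout in the question; everything else mirrors the data created above.

```r
d <- data.frame(
  Year   = c(rep(2005, 4), rep(2006, 3), rep(2007, 2), 2008),
  Amount = c(1500, 2420, 2720, 3020, 1150, 1840, 2070, 1650, 2640, 1740)
)

# Development index: 1, 2, 3, ... within each origin year
d$dev <- ave(d$Amount, d$Year, FUN = seq_along)

# One cell per (origin, dev) pair; combinations without data stay NA
tri <- tapply(d$Amount, list(origin = d$Year, dev = d$dev), FUN = sum)
tri
```

The result is a 4 x 4 numeric matrix with the origin years as row names, which may be more convenient than a tibble if the triangle is passed on to matrix-based code.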

Oversample within group

I would like to oversample such that I have balance on my binary dependent variable within each group in my data set.
So my data looks like this:
library(dplyr)
library(purrr)
library(tidyr)
set.seed(123)
# example data
(data <- tibble(
country = c("France", "France", "France",
"UK", "UK", "UK", "UK", "UK", "UK"),
YES = c(0, 0, 1,
0, 0, 0, 0, 1, 1),
X = rnorm(9, 0 ,1)
))
# A tibble: 9 x 3
country YES X
<chr> <dbl> <dbl>
1 France 0 -1.12
2 France 0 -0.200
3 France 1 0.781
4 UK 0 0.100
5 UK 0 0.0997
6 UK 0 -0.380
7 UK 0 -0.0160
8 UK 1 -0.0265
9 UK 1 0.860
I am trying to achieve balance on YES within France and the UK by oversampling. In France I would like to have 4 observations and in the UK 8, so that one random sample could look like this:
# A tibble: 12 x 3
country YES X
<chr> <dbl> <dbl>
1 France 0 -1.12
2 France 0 -0.200
3 France 1 0.781
3 France 1 0.781
4 UK 0 0.100
5 UK 0 0.0997
6 UK 0 -0.380
7 UK 0 -0.0160
8 UK 1 -0.0265
9 UK 1 0.860
8 UK 1 -0.0265
8 UK 1 -0.0265
My approach was this:
# oversample 1's within each country
(n_data <- data %>%
group_by(country) %>%
nest(.key = "original") %>%
mutate(os = map(original, ~ group_by(., YES))) %>%
mutate(os = map(os, ~ slice_sample(., replace = TRUE, prop = 1))))
# A tibble: 2 x 3
# Groups: country [2]
country original os
<chr> <list> <list>
1 France <tibble [3 x 2]> <tibble [3 x 2]>
2 UK <tibble [6 x 2]> <tibble [6 x 2]>
Warning message:
`.key` is deprecated
So in OS the dimensions should be 4 x 2 and 8 x 2. Does anyone know how to do this?
This seems overcomplicated, but each individual step seems clear and robust:
data %>%
count(country, YES) %>%
group_by(country) %>%
## Figure out how many additional rows are needed
mutate(
goal_rows = max(n),
extra_rows = goal_rows - n
) %>%
select(country, YES, extra_rows) %>%
## Keep only the country/YES combinations that need extra rows
filter(extra_rows > 0) %>%
## Join back to original data
left_join(data, by = c("country", "YES")) %>%
group_by(country) %>%
## Randomly keep the appropriate number of rows
mutate(rand = rank(runif(n()))) %>%
filter(rand <= extra_rows) %>%
select(-extra_rows, -rand) %>%
## Combine oversampled rows with original data
bind_rows(data) %>%
arrange(country, YES)
# # A tibble: 12 x 3
# # Groups: country [2]
# country YES X
# <chr> <dbl> <dbl>
# 1 France 0 1.88
# 2 France 0 -0.0793
# 3 France 1 0.812
# 4 France 1 0.812
# 5 UK 0 -1.66
# 6 UK 0 -0.797
# 7 UK 0 0.639
# 8 UK 0 -0.141
# 9 UK 1 -0.207
# 10 UK 1 1.30
# 11 UK 1 -0.207
# 12 UK 1 1.30
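One limitation of the pipeline above is that the extra rows are picked without replacement, so a class that needs more extra rows than it has observations would stay short. A possible alternative, sketched here in base R with a hypothetical helper oversample_group(), draws the extra rows with replacement inside each country:

```r
set.seed(123)
data <- data.frame(
  country = c("France", "France", "France",
              "UK", "UK", "UK", "UK", "UK", "UK"),
  YES = c(0, 0, 1,
          0, 0, 0, 0, 1, 1),
  X = rnorm(9, 0, 1)
)

# Hypothetical helper: bring every YES class of one country up to the
# size of that country's largest class by resampling with replacement
oversample_group <- function(g) {
  target <- max(table(g$YES))
  pieces <- lapply(split(g, g$YES), function(cls) {
    extra <- target - nrow(cls)
    idx <- c(seq_len(nrow(cls)),
             sample.int(nrow(cls), extra, replace = TRUE))
    cls[idx, , drop = FALSE]
  })
  do.call(rbind, pieces)
}

balanced <- do.call(rbind, lapply(split(data, data$country), oversample_group))
rownames(balanced) <- NULL
table(balanced$country, balanced$YES)
```

All original rows are kept and only the minority class is topped up, so France ends with 4 rows and the UK with 8. Which minority rows get duplicated depends on the seed.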

How to quote and unquote a variable into a function and iterate over a dataframe

I'm trying to take a function and iterate over a data frame of values. The goal here is to summarize the airport delays by groups of 10.
How do you take the value of what is passed into a function as a name? The column origin (EWR, LGA, JFK) should be saved as a column, and it still needs to be passed into the group by function.
library(tidyverse)
library(nycflights13)
head(flights)
#> # A tibble: 6 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
ntile_summary <- function(data, by, var) {
by <- enquo(by)
var <- enquo(var)
data %>%
mutate(pcts = ntile(!!by, n = 10),
col_nm = !!by)
group_by(pcts, col_nm) %>%
summarize(avg = mean(!!var, na.ram = TRUE))
}
params <- expand_grid(
flights %>% count(origin) %>% select(origin),
flights %>% count(day) %>% head(2) %>% select(day)
)
ntile_summary(flights, day, arr_delay)
#> Error in group_by(pcts, col_nm): object 'pcts' not found
purrr::walk(params, ~ntile_summary(flights, !origin, arr_delay))
#> Error in !origin: invalid argument type
Created on 2020-03-15 by the reprex package (v0.3.0)
After the mutate() call the pipe (%>%) is missing, so the chain is broken and group_by() never receives the data:
ntile_summary <- function(data, by, var) {
by <- enquo(by)
var <- enquo(var)
data %>%
mutate(pcts = ntile(!!by, n = 10),
col_nm = !!by) %>%
group_by(pcts, col_nm) %>%
summarize(avg = mean(!!var, na.ram = TRUE))
}
ntile_summary(flights, day, arr_delay)
# A tibble: 40 x 3
# Groups: pcts [10]
# pcts col_nm avg
# <int> <int> <dbl>
# 1 1 1 NA
# 2 1 2 NA
# 3 1 3 NA
# 4 1 4 -4.44
# 5 2 4 NA
# 6 2 5 NA
# 7 2 6 NA
# 8 2 7 NA
# 9 3 7 NA
#10 3 8 NA
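# … with 30 more rows
Note that the NA averages come from the misspelled argument na.ram, which mean() silently ignores; with na.rm = TRUE the missing delays would be dropped before averaging.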
We could also make use of the curly-curly operator ({{ }}) instead of enquo() + !!
ntile_summary <- function(data, by, var) {
data %>%
mutate(col_nm = {{by}}, pcts = ntile({{by}}, n = 10)) %>%
group_by(pcts, col_nm) %>%
summarize(avg = mean({{var}}, na.ram = TRUE))
}
ntile_summary(flights, day, arr_delay)
# A tibble: 40 x 3
# Groups: pcts [10]
# pcts col_nm avg
# <int> <int> <dbl>
# 1 1 1 NA
# 2 1 2 NA
# 3 1 3 NA
# 4 1 4 -4.44
# 5 2 4 NA
# 6 2 5 NA
# 7 2 6 NA
# 8 2 7 NA
# 9 3 7 NA
#10 3 8 NA
# … with 30 more rows
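As a small self-contained check of the curly-curly version with the argument spelled na.rm, here is a sketch on made-up data. The tibble toy, its values and the extra argument n are assumptions for illustration and stand in for nycflights13::flights.

```r
library(dplyr)

# Hypothetical toy data standing in for flights
toy <- tibble(
  day       = rep(1:4, each = 3),
  arr_delay = c(10, NA, 20, 5, 5, NA, 0, 30, 60, NA, 8, 12)
)

ntile_summary <- function(data, by, var, n = 2) {
  data %>%
    mutate(col_nm = {{ by }}, pcts = ntile({{ by }}, n = n)) %>%
    group_by(pcts, col_nm) %>%
    # na.rm (not na.ram) drops the missing values before averaging
    summarize(avg = mean({{ var }}, na.rm = TRUE), .groups = "drop")
}

res <- ntile_summary(toy, day, arr_delay)
res
```

With the argument spelled correctly, every group gets a numeric average instead of NA. The .groups argument assumes dplyr 1.0.0 or later.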

dplyr: keep empty levels of factor but not empty levels of a combination of factors that don't appear in data

When grouping and summarising with dplyr, what is the correct way to keep empty levels of each grouping factor but not keep empty combinations from multiple grouping factors?
As an example, consider data recorded at different times at multiple sites. I might filter and then calculate something for each year in each site. I'd like to have the default value of the summary on an empty vector if the filter removes a year completely. So site "a" has 10 years and site "b" has 1 year so I'd always like 11 rows in the summary.
If I use .drop = TRUE in group_by I lose years:
library(dplyr)
library(zoo)
library(lubridate)
set.seed(1)
df <- data.frame(site = factor(c(rep("a", 120), rep("b", 12))),
date = c(seq.Date(as.Date("2000/1/1"), by = "month", length.out = 120), seq.Date(as.Date("2000/1/1"), by = "month", length.out = 12)),
value = rnorm(132, 50, 10))
df$year <- factor(lubridate::year(df$date))
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = TRUE) %>%
summarise(f = first(date))
#> # A tibble: 6 x 3
#> # Groups: site [1]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2004 2004-08-01
#> 3 a 2005 2005-01-01
#> 4 a 2007 2007-11-01
#> 5 a 2008 2008-10-01
#> 6 a 2009 2009-02-01
and with .drop = FALSE I gain all the extra years for site "b" which were not in the original data:
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date))
#> # A tibble: 20 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
#> 12 b 2001 NA
#> 13 b 2002 NA
#> 14 b 2003 NA
#> 15 b 2004 NA
#> 16 b 2005 NA
#> 17 b 2006 NA
#> 18 b 2007 NA
#> 19 b 2008 NA
#> 20 b 2009 NA
The best way I could think of was to calculate counts, then merge then filter then drop the count variable, but that's pretty messy.
I know the .drop was only recently added to dplyr, which is very useful for one factor, but is there yet a clean way to do this for multiple factors?
df %>%
filter(value > 65) %>%
group_by(site, year, .drop = FALSE) %>%
summarise(f = first(date)) %>%
left_join(df %>% count(site, year, .drop = FALSE), by = c("site", "year")) %>%
filter(n > 0) %>%
select(-n)
#> # A tibble: 11 x 3
#> # Groups: site [2]
#> site year f
#> <fct> <fct> <date>
#> 1 a 2000 2000-04-01
#> 2 a 2001 NA
#> 3 a 2002 NA
#> 4 a 2003 NA
#> 5 a 2004 2004-08-01
#> 6 a 2005 2005-01-01
#> 7 a 2006 NA
#> 8 a 2007 2007-11-01
#> 9 a 2008 2008-10-01
#> 10 a 2009 2009-02-01
#> 11 b 2000 NA
Not sure if this is what you are after.
If you replace the dates of rows whose value is not above 65 with NA, instead of filtering those rows out, every site/year group is kept and you can proceed as usual.
df %>%
mutate(date = replace(date, value <= 65, NA)) %>%
group_by(site, year) %>%
summarise(f = first(date[!is.na(date)]))
# A tibble: 11 x 3
# Groups: site [2]
site year f
<fct> <fct> <date>
1 a 2000 NA
2 a 2001 NA
3 a 2002 2002-03-01
4 a 2003 NA
5 a 2004 NA
6 a 2005 NA
7 a 2006 2006-02-01
8 a 2007 NA
9 a 2008 2008-07-01
10 a 2009 2009-02-01
11 b 2000 2000-08-01
