What I want is to create var3 using lag() (from the dplyr package), but the lag should be consistent with the year and the ID: the lagged value must belong to the corresponding ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            var3 = var1 - dplyr::lag(var2))
Attempt #2:
data %>%
  group_by(YEAR, ID) %>%
  summarise(var1 = ...,
            var2 = ...,
            gr = sprintf(YEAR, ID),
            var3 = var1 - dplyr::lag(var2, order_by = gr))
Minimum example:
MyData <-
  data.frame(YEAR = rep(seq(2010, 2014), 5),
             ID = rep(1:5, each = 5),
             var1 = rnorm(n = 25, mean = 10, sd = 3),
             var2 = rnorm(n = 25, mean = 1, sd = 1))

MyData %>%
  group_by(YEAR, ID) %>%
  summarise(var3 = var1 - dplyr::lag(var2))
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  print(n = 99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(Note: I've swapped your summarise() for a mutate() here, since summarise() would collapse each group to a single row, which isn't what you want for a row-wise lag.)
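One caveat: lag() works on row order, so this assumes the rows are already sorted by YEAR within each ID. A minimal defensive sketch of the same idea, ordering explicitly first:

library(dplyr)

# Same logic as above, but sort the unbalanced panel first so that
# lag(var2) always refers to the previous year's value within each ID.
MyData %>%
  arrange(ID, YEAR) %>%
  group_by(ID) %>%
  mutate(var3 = var1 - dplyr::lag(var2)) %>%
  ungroup()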
I'm trying to calculate the year-wise standard error for the variable acrePrice. I'm running the function stderr (I also tried sd(acrePrice)/count(n)). Both of these return an error.
Here's the relevant code:
library(alr4)
library(tidyverse)
MinnLand %>% group_by(year) %>% summarize(sd(acrePrice)/count(n))
MinnLand %>% group_by(year) %>% summarize(stderr(acrePrice))
Why is there a problem? The mean and SDs are easily calculated.
The issue with the first call is count(), which expects a data.frame; inside summarize() it should be n() instead:
library(dplyr)
MinnLand %>%
  group_by(year) %>%
  summarize(SE = sd(acrePrice)/n(), .groups = 'drop')
-output
# A tibble: 10 x 2
# year SE
# <dbl> <dbl>
# 1 2002 2.25
# 2 2003 0.840
# 3 2004 0.742
# 4 2005 0.862
# 5 2006 0.849
# 6 2007 0.765
# 7 2008 0.708
# 8 2009 1.23
# 9 2010 0.986
#10 2011 1.95
According to ?stderr
stdin(), stdout() and stderr() are standard connections corresponding to input, output and error on the console respectively (and not necessarily to file streams).
We can use std.error from plotrix
library(plotrix)
MinnLand %>%
group_by(year) %>%
summarize(SE = std.error(acrePrice))
-output
# A tibble: 10 x 2
# year SE
# <dbl> <dbl>
# 1 2002 53.4
# 2 2003 38.6
# 3 2004 37.0
# 4 2005 41.5
# 5 2006 39.7
# 6 2007 36.3
# 7 2008 34.9
# 8 2009 47.1
# 9 2010 42.1
#10 2011 63.6
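As an aside, the conventional standard-error formula is sd / sqrt(n) rather than sd / n, so the plotrix result can also be reproduced in plain dplyr; a small sketch:

library(dplyr)
MinnLand %>%
  group_by(year) %>%
  summarize(SE = sd(acrePrice) / sqrt(n()), .groups = 'drop')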
The following code was executed:
tb <- tibble(
  year <- rep(2001:2020,10)
)

tb %<>% arrange(year) %>%
  mutate(
    id <- rep(1:10,20),
    r1 <- rnorm(200,0,1),
    r2 <- rnorm(200,1,1),
    r3 <- rnorm(200,2,1)
  )
Then the error message popped up:
Error: arrange() failed at implicit mutate() step.
x Could not create a temporary column for ..1.
ℹ ..1 is year.
Can anyone shed light on what the reason is?
It looks like a variable assignment issue: replace <- with = and %<>% with %>%. Here is a possible solution:
#Data
tb <- tibble(
  year = rep(2001:2020,10)
)

#Code
tb %>%
  arrange(year) %>%
  mutate(
    id = rep(1:10,20),
    r1 = rnorm(200,0,1),
    r2 = rnorm(200,1,1),
    r3 = rnorm(200,2,1)
  )
Output:
# A tibble: 200 x 5
year id r1 r2 r3
<int> <int> <dbl> <dbl> <dbl>
1 2001 1 1.10 1.62 2.92
2 2001 2 0.144 1.18 1.08
3 2001 3 -0.118 2.32 3.15
4 2001 4 -0.912 0.701 1.36
5 2001 5 -1.44 -0.648 1.11
6 2001 6 -0.797 1.95 -0.333
7 2001 7 1.25 -0.113 1.85
8 2001 8 0.772 1.62 2.32
9 2001 9 -0.220 1.51 1.29
10 2001 10 -0.425 1.37 3.24
# ... with 190 more rows
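As far as I can tell, the underlying reason for the error is that <- inside tibble() or mutate() does not create a column named year: the unnamed argument gets an auto-generated name from the deparsed expression (year is only assigned as a side effect in the calling environment), so the later arrange(year) has no such column to work with. A quick way to see this:

library(tibble)
# The column name is the deparsed expression, not "year"
names(tibble(year <- rep(2001:2020, 10)))
# With `=`, the column really is named "year"
names(tibble(year = rep(2001:2020, 10)))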
I'm extracting "wide" data which I intend to tidy with tidyr::pivot_longer().
library(tidyverse)
df1 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df2 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df3 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
df4 <-
data.frame(
M = words[1:10],
N = rnorm(10, 3, 3),
O = rnorm(10, 3, 3),
P = rnorm(10, 3, 3)
)
lst <- list(df1, df2, df3, df4)
colname <-
c("ticker", "2017", "2018", "2019")
header <- list("Leverage", "Gearing", "Capex.to.sales", "FCFex")
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = header)
Using values_to = header gives me this error:
Error in `[[<-.data.frame`(`*tmp*`, ".value", value = list("Leverage",  :
  replacement has 4 rows, data has 3
Instead, I had to use the default values_to = "value", and subsequently use this code to rename my columns:
lst <- lst %>%
lapply(setNames, colname) %>%
lapply(pivot_longer, -ticker, names_to = "Period", values_to = "value")
lst <- map(seq_along(lst), function(i){
x <- lst[[i]]
colnames(x)[3] <- header[[i]]
x
})
My output is shown below (columns renamed), but I was wondering if there is a way to feed a vector into values_to instead of using map (as it makes for better piping)? Or is there a more efficient way of going about this?
> lst
[[1]]
# A tibble: 30 x 3
ticker Period Leverage
<fct> <chr> <dbl>
1 a 2017 6.01
2 a 2018 4.82
3 a 2019 1.58
4 able 2017 8.64
5 able 2018 6.70
6 able 2019 0.831
7 about 2017 -0.187
8 about 2018 0.549
9 about 2019 0.829
10 absolute 2017 1.26
# ... with 20 more rows
[[2]]
# A tibble: 30 x 3
ticker Period Gearing
<fct> <chr> <dbl>
1 a 2017 2.37
2 a 2018 3.58
3 a 2019 5.63
4 able 2017 0.311
5 able 2018 0.708
6 able 2019 -0.0651
7 about 2017 2.89
8 about 2018 6.25
9 about 2019 10.1
10 absolute 2017 6.48
# ... with 20 more rows
[[3]]
# A tibble: 30 x 3
ticker Period Capex.to.sales
<fct> <chr> <dbl>
1 a 2017 5.22
2 a 2018 1.88
3 a 2019 0.746
4 able 2017 -3.90
5 able 2018 3.06
6 able 2019 1.91
7 about 2017 1.35
8 about 2018 4.12
9 about 2019 11.1
10 absolute 2017 1.76
# ... with 20 more rows
[[4]]
# A tibble: 30 x 3
ticker Period FCFex
<fct> <chr> <dbl>
1 a 2017 1.76
2 a 2018 2.85
3 a 2019 1.86
4 able 2017 -3.38
5 able 2018 -3.02
6 able 2019 -1.52
7 about 2017 6.46
8 about 2018 5.39
9 about 2019 0.810
10 absolute 2017 8.08
# ... with 20 more rows
For the second part of my question, I intend to use bind_cols() to combine all four dataframes into one, but the two common columns are being duplicated (as seen below).
How do I tell R to just bind the rightmost column that was renamed i.e. exclude the first two columns for the last three dataframes? Thank you.
Metrics <- bind_cols(lst)
> Metrics
# A tibble: 30 x 12
ticker Period Leverage ticker1 Period1 Gearing ticker2 Period2
<fct> <chr> <dbl> <fct> <chr> <dbl> <fct> <chr>
1 a 2017 6.01 a 2017 2.37 a 2017
2 a 2018 4.82 a 2018 3.58 a 2018
3 a 2019 1.58 a 2019 5.63 a 2019
4 able 2017 8.64 able 2017 0.311 able 2017
5 able 2018 6.70 able 2018 0.708 able 2018
6 able 2019 0.831 able 2019 -0.0651 able 2019
7 about 2017 -0.187 about 2017 2.89 about 2017
8 about 2018 0.549 about 2018 6.25 about 2018
9 about 2019 0.829 about 2019 10.1 about 2019
10 absol~ 2017 1.26 absolu~ 2017 6.48 absolu~ 2017
# ... with 20 more rows, and 4 more variables: Capex.to.sales <dbl>,
# ticker3 <fct>, Period3 <chr>, FCFex <dbl>
You could do it with purrr:
library(purrr)
lst <- map(lst, setNames, colname)
map2_dfc(lst, header, ~ pivot_longer(
.x, -ticker, names_to = "Period", values_to = .y)) %>%
select(c(1:3, 6, 9, 12))
Output:
ticker Period Leverage Gearing Capex.to.sales FCFex
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 a 2017 6.20 3.43 7.87 7.52
2 a 2018 1.63 3.30 0.126 1.52
3 a 2019 2.32 1.49 -0.286 6.95
4 able 2017 6.38 3.42 7.34 2.60
5 able 2018 0.763 1.68 -0.648 -2.85
6 able 2019 5.56 2.35 -0.572 3.21
7 about 2017 -0.762 1.49 3.12 2.43
8 about 2018 9.07 -1.22 0.821 4.00
9 about 2019 1.37 8.27 -0.700 -1.05
10 absolute 2017 1.39 2.49 0.390 2.40
# … with 20 more rows
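If you would rather not pick the surviving columns by position, an alternative sketch (using the same lst, colname and header as above) is to reshape each frame and then join on the two key columns instead of column-binding:

library(purrr)
library(dplyr)
library(tidyr)

lst %>%
  map(setNames, colname) %>%
  map2(header, ~ pivot_longer(.x, -ticker, names_to = "Period", values_to = .y)) %>%
  reduce(left_join, by = c("ticker", "Period"))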
My dataframe looks something like the first four columns of the following:
ID Obs Seconds Mean Ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 1815522 1 1 NA 1/10.6
2 1815522 2 26 NA 26/10.6
3 1815522 3 4.68 10.6 4.68/10.6
4 1815522 4 0 10.2 0/10.6
5 1815522 5 1.5 2.06 1.5/10.6
6 1815522 6 2.22 1.24 2.22/10.6
7 1815676 1 12 NA 12/9.67
8 1815676 2 6 NA 6/9.67
9 1815676 3 11 9.67 11/9.67
10 1815676 4 1 6 1/9.67
11 1815676 5 30 14 30/9.67
12 1815676 6 29 20 29/9.67
13 1815676 7 23 27.3 23/9.67
14 1815676 8 51 34.3 51/9.67
I am trying to add a fifth column "Ratio", containing the ratio of each row's value for Seconds, and the ID-group's first not-NA value of Mean. How do I do that?
I've tried several things:
temp %>%
  group_by(ID) %>%
  mutate(Ratio = case_when(all(is.na(Mean)) ~ NA_real_,
                           !all(is.na(Mean)) ~ Seconds / first(Mean[!is.na(Mean)])))
This gives me the following error:
Error in mutate_impl(.data, dots) :
Column `Ratio` must be length 2 (the group size) or one, not 0
I also tried
temp %>%
group_by(ID) %>%
mutate(Ratio = ifelse(!all(is.na(Mean)), Seconds/(first(Mean[!is.na(Mean)])), NA_real_))
But in this case, it will create a column that looks like this:
Ratio
<dbl>
1 0.0947
2 0.0947
3 0.0947
4 0.0947
5 0.0947
6 0.0947
7 1.24
8 1.24
9 1.24
10 1.24
11 1.24
12 1.24
13 1.24
14 1.24
I really don't know what else to try. Please help! :)
One idea is to use fill() with .direction = 'up': since you are interested in the first non-NA value, fill the NAs upwards and then simply divide by the first value of each group. There is no need for case_when() to catch the all-NA groups, since those give NA by default, i.e.
library(tidyverse)
df %>%
group_by(ID) %>%
fill(Mean, .direction = 'up') %>%
mutate(ratio = Seconds / first(Mean))
which gives,
# A tibble: 14 x 5
# Groups: ID [2]
ID Obs Seconds Mean ratio
<int> <int> <dbl> <dbl> <dbl>
1 1815522 1 1 10.6 0.0943
2 1815522 2 26 10.6 2.45
3 1815522 3 4.68 10.6 0.442
4 1815522 4 0 10.2 0
5 1815522 5 1.5 2.06 0.142
6 1815522 6 2.22 1.24 0.209
7 1815676 1 12 9.67 1.24
8 1815676 2 6 9.67 0.620
9 1815676 3 11 9.67 1.14
10 1815676 4 1 6 0.103
11 1815676 5 30 14 3.10
12 1815676 6 29 20 3.00
13 1815676 7 23 27.3 2.38
14 1815676 8 51 34.3 5.27
Try this:
library(tidyverse)
df %>%
group_by(ID) %>%
mutate(
isNA = mean(is.na(Mean)),
Ratio = if_else(isNA == 1, NA_real_, Seconds / first(Mean[!is.na(Mean)]))
)
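For what it's worth, the reason the original ifelse() attempt repeated a single value per group is that ifelse() returns a result shaped like its test: !all(is.na(Mean)) has length 1, so mutate() recycled that one value across the whole group. If the all-NA case needs no special branching, a minimal sketch is simply (first() of an empty vector should fall back to NA):

library(dplyr)
temp %>%
  group_by(ID) %>%
  mutate(Ratio = Seconds / first(Mean[!is.na(Mean)])) %>%
  ungroup()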
Sample data
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35), year = rep(1980:2014,times = 1000),month.id = sample(c(1:4,8:10,12),35*1000,replace = T))
The data frame has 1000 locations × 35 years of data for a variable called month.id, which is basically the month of the year. For each year, I want to calculate the percentage occurrence of each month. For example, for 1980:
month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1 2 3 4 8 9 10 12
106 132 116 122 114 130 141 139
To calculate the percent occurrence of months:
table(month.vec$month.id)/length(month.vec$month.id) * 100
1 2 3 4 8 9 10 12
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9
I want to have a table something like this:
year month percent
1980 1 10.6
1980 2 13.2
1980 3 11.6
1980 4 12.2
1980 5 NA
1980 6 NA
1980 7 NA
1980 8 11.4
1980 9 13
1980 10 14.1
1980 11 NA
1980 12 13.9
Since months 5, 6, 7 and 11 are missing, I just want to add additional rows with NAs for those months. If possible, I would like a dplyr solution, something like this:
library(dplyr)
df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)
Solution using dplyr and tidyr
# To get month as integer use (or add as.integer to mutate):
# df$month.id <- as.integer(df$month.id)
library(dplyr)
library(tidyr)
df %>%
group_by(year, month.id) %>%
# Count occurrences per year & month
summarise(n = n()) %>%
# Get percent per month (year number is calculated with sum(n))
mutate(percent = n / sum(n) * 100) %>%
# Fill in missing months
complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
select(year, month.id, percent)
year month.id percent
<int> <dbl> <dbl>
1 1980 1.00 10.6
2 1980 2.00 13.2
3 1980 3.00 11.6
4 1980 4.00 12.2
5 1980 5.00 0
6 1980 6.00 0
7 1980 7.00 0
8 1980 8.00 11.4
9 1980 9.00 13.0
10 1980 10.0 14.1
11 1980 11.0 0
12 1980 12.0 13.9
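Since the desired output above shows NA rather than 0 for the missing months, note that complete() fills with NA by default; the same pipeline without the fill argument gives exactly that:

df %>%
  group_by(year, month.id) %>%
  summarise(n = n()) %>%
  mutate(percent = n / sum(n) * 100) %>%
  complete(year, month.id = 1:12) %>%   # missing months stay NA
  select(year, month.id, percent)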
A base R solution:
tab <- table(month.vec$year, factor(month.vec$month.id, levels = 1:12))/length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)
which gives:
> dfnew
Var1 Var2 Freq
1 1980 1 10.6
2 1980 2 13.2
3 1980 3 11.6
4 1980 4 12.2
5 1980 5 0.0
6 1980 6 0.0
7 1980 7 0.0
8 1980 8 11.4
9 1980 9 13.0
10 1980 10 14.1
11 1980 11 0.0
12 1980 12 13.9
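The same base-R idea extends to every year at once with prop.table() over the rows of the full year-by-month table, for example:

tab_all <- table(df$year, factor(df$month.id, levels = 1:12))
dfnew_all <- as.data.frame(prop.table(tab_all, margin = 1) * 100)
names(dfnew_all) <- c("year", "month", "percent")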
Or with data.table:
library(data.table)
setDT(month.vec)[, .N, by = .(year, month.id)
][.(year = 1980, month.id = 1:12), on = .(year, month.id)
][, N := 100 * N/sum(N, na.rm = TRUE)][]
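A sketch of the same data.table approach generalised to all years, assuming CJ() to build the full year × month grid and a per-year denominator:

library(data.table)
setDT(df)[, .N, by = .(year, month.id)
  ][CJ(year = unique(df$year), month.id = 1:12), on = .(year, month.id)
  ][, percent := 100 * N / sum(N, na.rm = TRUE), by = year][]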