library(tidyverse)
mean_by <- function(data,by,conti){
data %>% group_by({{by}}) %>% summarise(mean=mean({{conti}})) %>%
print() %>%
ggplot(aes(x={{by}},y=mean))+geom_col()
}
map(mtcars %>% select_if(is.numeric),~mean_by(mtcars,cyl,.))
# Not quite the same
mean_by(mtcars,cyl,carb)
I was toying around with the curly-curly operator in R (just learned about it!), and when iterating with map it seemed like the grouping isn't working very well. I can't get my head around the problem. What am I doing wrong?
Btw, when trying the explicit pmap way, I couldn't find a clever way to pass the cyl variable:
pmap(mtcars %>% select_if(is.numeric),mean_by,..1=mtcars,..2=cyl,..3=.)
Error in pmap():
i In index: 1.
Caused by error in withCallingHandlers():
! object 'cyl' not found
Run rlang::last_error() to see where the error occurred.
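map() iterates over the columns returned by select_if(), so `.` holds each column's values, not its name. Because that vector is not part of the grouped data, mean() sees the whole column in every group and returns the same overall mean each time. mean_by needs the column name instead. Looping over names() gives strings, so it is better to convert them to symbols (ensym) and unquote them with !!: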
library(dplyr)
library(purrr)
library(ggplot2) # needed for ggplot(), aes() and geom_col() below
mean_by <- function(data,by,conti){
by_sym <- rlang::ensym(by)
conti <- rlang::ensym(conti)
data %>% group_by(!! by_sym) %>%
summarise(mean=mean(!!conti)) %>%
print() %>%
ggplot(aes(x= !!by_sym,y=mean))+geom_col()
}
map(mtcars %>%
select_if(is.numeric) %>%
names,~mean_by(mtcars,cyl, !!.x))
-output (graphs removed)
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 4
2 6 6
3 8 8
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 82.6
2 6 122.
3 8 209.
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 4.07
2 6 3.59
3 8 3.23
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 2.29
2 6 3.12
3 8 4.00
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 19.1
2 6 18.0
3 8 16.8
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 0.909
2 6 0.571
3 8 0
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 0.727
2 6 0.429
3 8 0.143
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 4.09
2 6 3.86
3 8 3.29
# A tibble: 3 × 2
cyl mean
<dbl> <dbl>
1 4 1.55
2 6 3.43
3 8 3.5
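As an alternative to the ensym/!! approach above, here is a minimal sketch that takes both columns as strings and uses the `.data` pronoun. The helper name `mean_by_name` and the stand-in vector `vals` are made up for illustration and are not from the thread. The first part reproduces the original symptom without map; the plot is left out to keep the sketch short.

```r
library(dplyr)
library(purrr)

# Why the original call misbehaves: `vals` stands in for the `.` that map()
# supplies, i.e. a full column of values rather than a column name.
vals <- mtcars$mpg
overall <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean = mean(vals), .groups = "drop")
# `vals` is not subset by group, so every row of `overall` has the same mean

# Hypothetical helper (not from the thread): both columns are given as strings
mean_by_name <- function(data, by, conti) {
  data %>%
    group_by(across(all_of(by))) %>%
    summarise(mean = mean(.data[[conti]]), .groups = "drop")
}

# Loop over the names of the numeric columns, keeping them as list names
num_cols <- mtcars %>% select(where(is.numeric)) %>% names()
results <- num_cols %>%
  set_names() %>%
  map(~ mean_by_name(mtcars, "cyl", .x))

results$mpg
```

Passing strings avoids the quoting questions entirely, at the cost of not being able to write bare column names in the call.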
I've not seen the tilde syntax with map, but if you change that it seems to work.
map(mtcars %>% select_if(is.numeric), mean_by, data=mtcars, by=cyl)
Side note, you don't need that print() statement in mean_by.
mean_by <- function(data,by,conti){
data %>% group_by({{by}}) %>% summarise(mean=mean({{conti}})) %>%
ggplot(aes(x={{by}},y=mean))+geom_col()
}
Related
I want to group_by multiple columns without intersection.
I am looking for the output below without having to replicate the code for both variables.
library(dplyr)
> mtcars %>%
+ group_by(cyl) %>%
+ summarise(mean(disp))
# A tibble: 3 × 2
cyl `mean(disp)`
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
>
> mtcars %>%
+ group_by(am) %>%
+ summarise(mean(disp))
# A tibble: 2 × 2
am `mean(disp)`
<dbl> <dbl>
1 0 290.
2 1 144.
I am not looking for the code below since this gives the intersection between the variables:
> mtcars %>%
+ group_by(cyl, am) %>%
+ summarise(mean(disp))
# A tibble: 6 × 3
# Groups: cyl [3]
cyl am `mean(disp)`
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Thanks a lot!
An alternative would be a custom function:
my_func <- function(df, group){
df %>%
group_by({{group}}) %>%
summarise(mean_disp = mean(disp))
}
my_func(mtcars, cyl)
my_func(mtcars, am)
cyl mean_disp
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
> my_func(mtcars, am)
# A tibble: 2 × 2
am mean_disp
<dbl> <dbl>
1 0 290.
2 1 144.
Something like this?
library(tidyverse)
c("cyl", "am") %>%
map(~ mtcars %>%
group_by(!!sym(.x)) %>%
summarise(result = mean(disp)))
[[1]]
# A tibble: 3 x 2
cyl result
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
[[2]]
# A tibble: 2 x 2
am result
<dbl> <dbl>
1 0 290.
2 1 144.
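If a single table is more convenient than a list, one possible sketch is to reshape first and then group once. This assumes tidyr is available; the column names `variable`, `level` and `mean_disp` are my own choices for illustration.

```r
library(dplyr)
library(tidyr)

res <- mtcars %>%
  # stack the two grouping columns: one row per (car, grouping variable)
  pivot_longer(c(cyl, am), names_to = "variable", values_to = "level") %>%
  group_by(variable, level) %>%
  summarise(mean_disp = mean(disp), .groups = "drop")

res
```

Each car contributes once per grouping variable, so the means are the same as in the two separate group_by calls, with no cyl/am intersection.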
I have panel data in R, and I want to run a rolling-window linear regression by group. My dates are numbered from 1 to 618. Each number represents one date, but I have more than one observation for each date.
I want a rolling window of 20 dates. That is, I want to output all coefficients of lm(y~x1+x2+x3+x4+x5+x6) for the period 1:20, then roll the window and run another regression for 2:21, 3:22, and so on for all my observations, so the last coefficients are for the period 599:618 (there are 618 dates, so I can't do it manually).
My problem is that when I select a window of 20 observations, I only get the first 20 rows, for example:
1, 1, 1, 1, 1, 1, 1, ..., 1
and those first 20 rows may all belong to the first date (1), because there is more than one observation per date. So I want to select observations by date group: a window will contain more than 20 observations, but I want to roll by date (date 1 to date 20), regardless of the number of observations.
After that, I need to estimate with the Newey-West method, so the final code should include something like the line below and output all coefficients and t-statistics.
neweywest <- coeftest(LMOBJECT, vcov. = NeweyWest, lag=12)
I hope it has been understood well.
You can create multiple linear models for a given interval of dates like this:
library(tidyverse)
# example data
set.seed(1337)
n_dates <- 10
data <- tibble(
date = runif(100, min = 1, max = n_dates) %>% floor(),
x1 = runif(100)**2,
x2 = runif(100) * 2,
x3 = runif(100) + 2,
y = x1 + 2 * x2 + runif(100)
) %>%
arrange(date)
data
#> # A tibble: 100 × 5
#> date x1 x2 x3 y
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.754 0.700 2.21 2.79
#> 2 1 0.0230 1.97 2.70 4.89
#> 3 1 0.388 0.500 2.21 1.54
#> 4 1 0.225 0.135 2.87 0.849
#> 5 1 0.00000810 0.139 2.22 1.12
#> 6 1 0.255 0.893 2.21 2.25
#> 7 1 0.402 1.37 2.06 3.51
#> 8 1 0.00275 0.363 2.68 0.984
#> 9 2 0.238 1.68 2.53 3.98
#> 10 2 0.0309 1.47 2.05 3.69
#> # … with 90 more rows
# number of rows per day
data %>% count(date)
#> # A tibble: 9 × 2
#> date n
#> <dbl> <int>
#> 1 1 8
#> 2 2 10
#> 3 3 15
#> 4 4 8
#> 5 5 10
#> 6 6 10
#> 7 7 12
#> 8 8 7
#> 9 9 20
# size of rolling window in days
window_size <- 3
models <- tibble(
from = seq(n_dates),
to = from + window_size - 1
) %>%
mutate(
data = from %>% map2(to, ~ data %>% filter(date >= .x & date <= .y)),
model = data %>% map(possibly(~ lm(y ~ x1 + x2 + x3, data = .x), NA))
)
models
#> # A tibble: 10 × 4
#> from to data model
#> <int> <dbl> <list> <list>
#> 1 1 3 <tibble [33 × 5]> <lm>
#> 2 2 4 <tibble [33 × 5]> <lm>
#> 3 3 5 <tibble [33 × 5]> <lm>
#> 4 4 6 <tibble [28 × 5]> <lm>
#> 5 5 7 <tibble [32 × 5]> <lm>
#> 6 6 8 <tibble [29 × 5]> <lm>
#> 7 7 9 <tibble [39 × 5]> <lm>
#> 8 8 10 <tibble [27 × 5]> <lm>
#> 9 9 11 <tibble [20 × 5]> <lm>
#> 10 10 12 <tibble [0 × 5]> <lgl [1]>
models %>%
filter(!is.na(model)) %>%
transmute(
from, to,
coeff = model %>% map(coefficients),
r2 = model %>% map_dbl(~ .x %>% summary() %>% pluck("r.squared"))
) %>%
unnest_wider(coeff)
# A tibble: 9 x 7
# from to `(Intercept)` x1 x2 x3 r2
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 0.601 0.883 2.07 -0.0788 0.970
#2 2 4 0.766 0.965 2.01 -0.141 0.965
#3 3 5 0.879 0.954 1.94 -0.165 0.953
Another way of subsetting groups is to use nest:
# get all observations from the 3rd to the 5th date (days 3 to 5 here)
data %>% arrange(date) %>% nest(data = -date) %>% slice(3:5) %>% unnest(data)
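The question also asks for Newey-West t-statistics, which the code above does not cover. A minimal sketch of how that could be added per window follows, assuming the lmtest and sandwich packages are installed. The toy data, the window size of 3 and lag = 1 are placeholders of mine (the question uses 20 dates and lag = 12).

```r
library(dplyr)
library(purrr)
library(lmtest)    # coeftest()
library(sandwich)  # NeweyWest()

set.seed(1)
# Hypothetical toy panel: 6 dates with 5 observations each
toy <- tibble(
  date = rep(1:6, each = 5),
  x1 = runif(30),
  y = 2 * x1 + rnorm(30)
)

window_size <- 3
starts <- seq_len(max(toy$date) - window_size + 1)

nw <- starts %>%
  map(function(from) {
    to <- from + window_size - 1
    # the window is defined by date, not by row count
    win <- filter(toy, date >= from, date <= to)
    fit <- lm(y ~ x1, data = win)
    # coefficient table with Newey-West standard errors
    ct <- coeftest(fit, vcov. = NeweyWest(fit, lag = 1, prewhite = FALSE))
    tibble(
      from = from,
      to = to,
      term = rownames(ct),
      estimate = ct[, "Estimate"],
      t_value = ct[, "t value"]
    )
  }) %>%
  bind_rows()

nw
```

The result has one row per window and coefficient, so it can be filtered or reshaped like any other tibble.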
I am trying to determine the difference between the set of ids in subsequent pairs of dataframes. The dataframes are derived from an original dataframe split by a grouping variable representing the time period. The results should show the rows of the new ids that occur in the current time period compared to the previous one.
I can accomplish this with a list of dataframes:
library(tidyverse)
set.seed(999)
examp <- tibble(
id = c(replicate(4, sample.int(20, 9))),
year = rep(1:4, each = 9),
val = runif(36)
)
examp %>%
split(.$year) %>%
# note my default, I compare the first year to itself
map2(lag(., default = .[1]), anti_join, by = "id")
$`1`
# A tibble: 0 x 3
# ... with 3 variables: id <int>, year <int>, val <dbl>
$`2`
# A tibble: 3 x 3
id year val
<int> <int> <dbl>
1 5 2 0.450
2 11 2 0.943
3 2 2 0.571
$`3`
# A tibble: 6 x 3
id year val
<int> <int> <dbl>
1 19 3 0.870
2 12 3 0.403
3 9 3 0.331
4 20 3 0.315
5 16 3 0.455
6 17 3 0.699
$`4`
# A tibble: 5 x 3
id year val
<int> <int> <dbl>
1 4 4 0.190
2 11 4 0.0804
3 2 4 0.247
4 1 4 0.619
5 18 4 0.434
But I could not get the same to work using mutate in a nested dataframe:
examp %>%
nest_by(year) %>%
mutate(new = anti_join(data, lag(data), by = "id"))
# A tibble: 4 x 3
# Rowwise: year
year data new$id $val
<int> <list<tibble[,2]>> <int> <dbl>
1 1 [9 x 2] 3 0.0601
2 2 [9 x 2] 1 0.495
3 3 [9 x 2] 17 0.699
4 4 [9 x 2] 18 0.434
Here I could not figure out how to specify the default and the output is unexpected. I expected "new" to be a list-column of dataframes corresponding with those above, which I could then unnest.
I am interested in learning more about working with nested dataframes and any help understanding how to get this to work would be much appreciated. Additionally, if there is another (simple) solution to this general problem, I would be happy to learn about it.
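The result of anti_join is a data frame, so it should be wrapped in a list to be stored in a list-column. The lag also has to be computed before rowwise, because inside a rowwise mutate lag only sees the current row: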
library(dplyr)
out <- examp %>%
nest_by(year) %>%
ungroup %>%
mutate(newdat = lag(data, default = data[1])) %>%
rowwise %>%
mutate(new = list(anti_join(data, newdat, by = 'id')))
-output
out$new
[[1]]
# A tibble: 0 x 2
# … with 2 variables: id <int>, val <dbl>
[[2]]
# A tibble: 3 x 2
id val
<int> <dbl>
1 5 0.450
2 11 0.943
3 2 0.571
[[3]]
# A tibble: 6 x 2
id val
<int> <dbl>
1 19 0.870
2 12 0.403
3 9 0.331
4 20 0.315
5 16 0.455
6 17 0.699
[[4]]
# A tibble: 5 x 2
id val
<int> <dbl>
1 4 0.190
2 11 0.0804
3 2 0.247
4 1 0.619
5 18 0.434
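For the "another (simple) solution" part of the question, here is a sketch that avoids nesting altogether: shift every (id, year) pair forward by one year and anti_join on both columns. The names `prev` and `new_ids` are mine. Unlike the list-based version, it returns one flat tibble in which the first year simply has no rows.

```r
library(dplyr)

set.seed(999)
examp <- tibble(
  id = c(replicate(4, sample.int(20, 9))),
  year = rep(1:4, each = 9),
  val = runif(36)
)

# Shift every (id, year) pair forward one year, so a match on (id, year)
# means "this id was also present in the previous year".
prev <- examp %>% transmute(id, year = year + 1L)

new_ids <- examp %>%
  filter(year > min(year)) %>%            # the first year has no previous year
  anti_join(prev, by = c("id", "year"))   # keep ids that were absent the year before

new_ids
```

If a list per year is still wanted, `split(new_ids, new_ids$year)` gives one.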
The following line produces this output:
diamonds %>% group_by(cut) %>% summarise(across(x:z, mean), .groups = 'drop')
# A tibble: 5 x 4
cut x y z
<ord> <dbl> <dbl> <dbl>
1 Fair 6.25 6.18 3.98
2 Good 5.84 5.85 3.64
3 Very Good 5.74 5.77 3.56
4 Premium 5.97 5.94 3.65
5 Ideal 5.51 5.52 3.40
I'd like to have the numbers rounded, which I can achieve like so:
diamonds %>% group_by(cut) %>% summarise(across(x:z, mean), .groups = 'drop') %>% mutate(across(x:z, round))
# A tibble: 5 x 4
cut x y z
<ord> <dbl> <dbl> <dbl>
1 Fair 6 6 4
2 Good 6 6 4
3 Very Good 6 6 4
4 Premium 6 6 4
5 Ideal 6 6 3
I had to summarize and then mutate. My question is, is there some way to have handled the rounding within my summarise call?
You can supply custom functions as well as built-ins to across:
diamonds %>%
group_by(cut) %>%
summarise(across(x:z, function(x) round(mean(x))), .groups = 'drop')
# A tibble: 5 x 4
cut x y z
* <ord> <dbl> <dbl> <dbl>
1 Fair 6 6 4
2 Good 6 6 4
3 Very Good 6 6 4
4 Premium 6 6 4
5 Ideal 6 6 3
You can use an anonymous function
diamonds %>%
group_by(cut) %>% summarise(across(x:z, function(x) round(mean(x))), .groups="drop")
# A tibble: 5 x 4
cut x y z
* <ord> <dbl> <dbl> <dbl>
1 Fair 6 6 4
2 Good 6 6 4
3 Very Good 6 6 4
4 Premium 6 6 4
5 Ideal 6 6 3
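The same idea can also be written with purrr-style lambda shorthand, which makes it easy to pass extra arguments such as the number of digits. A small sketch, assuming ggplot2 is installed for the diamonds data; rounding to one decimal is my own choice.

```r
library(dplyr)

res <- ggplot2::diamonds %>%
  group_by(cut) %>%
  # .x is the current column within the current group
  summarise(across(x:z, ~ round(mean(.x), 1)), .groups = "drop")

res
```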
Consider the case below for an experiment where group is different treatments, init are the initial values for each sample, change is expected change after treatment and sd_change is standard deviation of the change.
library(tidyverse)
set.seed(001)
data1 <- tibble(group = rep(c("a", "b"), each = 4),
init = rpois(8, 10)) %>%
group_by(group, init) %>%
expand(change = seq(2, 6, 2)) %>%
mutate(sd_change = 2)
as_tibble(data1)
> data1
# A tibble: 24 x 4
# Groups: group, init [8]
group init change sd_change
<chr> <int> <dbl> <dbl>
1 a 7 2 2
2 a 7 4 2
3 a 7 6 2
4 a 8 2 2
5 a 8 4 2
6 a 8 6 2
7 a 10 2 2
8 a 10 4 2
9 a 10 6 2
10 a 11 2 2
# ... with 14 more rows
I generate final values and obtain mean and variance for each group and change as below
data2a <- data1 %>%
rowwise %>%
mutate(final = rnorm(1, change, sd_change) + init) %>%
ungroup
data2a %>%
group_by(group, change) %>%
summarise(mu_start = mean(init), mu_end = mean(final),
v_start = var(init), v_end = var(final))
# A tibble: 6 x 6
# Groups: group [2]
group change mu_start mu_end v_start v_end
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 2 9 10.9 3.33 13.9
2 a 4 9 14.7 3.33 4.90
3 a 6 9 15.5 3.33 10.2
4 b 2 11.5 13.2 4.33 3.69
5 b 4 11.5 14.8 4.33 17.8
6 b 6 11.5 17.7 4.33 9.77
I want to repeat the above procedure R times by generating one final random value. I can do this with a for loop but I'm learning purrr and I'm stuck when summarising. See one version below:
# function to generate final values where R = 3
f <- function(n=3, x, y, z){
out <- rnorm(n, x, y)
out <- out + z
}
data2b <- data1 %>%
mutate(final = pmap(list(z = init,
x = change,
y = sd_change),
f)) %>%
ungroup
as_tibble(data2b)
# A tibble: 24 x 5
group init change sd_change final
<chr> <int> <dbl> <dbl> <list>
1 a 7 2 2 <dbl [3]>
2 a 7 4 2 <dbl [3]>
3 a 7 6 2 <dbl [3]>
4 a 8 2 2 <dbl [3]>
5 a 8 4 2 <dbl [3]>
6 a 8 6 2 <dbl [3]>
7 a 10 2 2 <dbl [3]>
8 a 10 4 2 <dbl [3]>
9 a 10 6 2 <dbl [3]>
10 a 11 2 2 <dbl [3]>
# ... with 14 more rows
Now I want to summarise to get mu_end, which should be a list of length R = 3 in this example. The following gives an error:
data2b %>%
split(.$group, .$change) %>%
mutate(mu_end = map(final, mean),
v_end = map(final, var))
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "list"
The output should be like this
# A tibble: 6 x 4
# Groups: group [2]
group change mu_end v_end
<chr> <dbl> <dbl> <dbl>
1 a 2 10.9 13.9
2 a 4 14.7 4.90
3 a 6 15.5 10.2
4 b 2 13.2 3.69
5 b 4 14.8 17.8
6 b 6 17.7 9.77
but each row of mu_end and v_end should be a list of length R.
Any help?
We can either use group_split and then map over the list of tibbles, using mutate with map_dbl to compute the mean and var of each element of the list column 'final' (map_df binds the pieces back into one tibble):
data2b %>%
group_split(group, change) %>%
map_df(~ .x %>%
mutate(mu_end = map_dbl(final, mean),
v_end = map_dbl(final, var)))
Or without splitting
data2b %>%
group_by(group, change) %>%
mutate(mu_end = map_dbl(final, mean), v_end = map_dbl(final, var))
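Note that both versions above average over the R draws within each row. If the goal is instead one mean and variance per draw across the samples of each group and change (my reading of "a list of length R"), a possible sketch is to unnest the draws with an index first. The data below is a hypothetical stand-in for data2b, and the names `sim`, `unit`, `draw`, `per_draw` and `nested` are mine.

```r
library(dplyr)
library(purrr)
library(tidyr)

set.seed(1)
R <- 3

# Hypothetical stand-in for data2b: 2 groups x 4 units x 3 change levels
sim <- expand_grid(group = c("a", "b"), unit = 1:4, change = c(2, 4, 6)) %>%
  mutate(
    init = 10 + unit,
    sd_change = 2,
    # R final values per row, as in f() from the question
    final = pmap(list(init, change, sd_change),
                 function(init, change, sd) rnorm(R, change, sd) + init)
  )

per_draw <- sim %>%
  mutate(draw = map(final, seq_along)) %>%   # index the draws 1..R in each row
  unnest(c(final, draw)) %>%
  group_by(group, change, draw) %>%
  summarise(mu_end = mean(final), v_end = var(final), .groups = "drop")

# Collapse back to one row per group and change, with list-columns of length R
nested <- per_draw %>%
  group_by(group, change) %>%
  summarise(mu_end = list(mu_end), v_end = list(v_end), .groups = "drop")

nested
```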