Arrange values within a specific group - r

I'm trying to arrange values in decreasing order within a specific group in a nested data frame. My input data looks like this. I have two grouping variables (group1 and group2) and three value columns (id, value2, value3).
library(tidyverse)
set.seed(1234)
df <- tibble(group1 = c(rep(LETTERS[1:3], 4)),
group2 = c(rep(0, 6), rep(2, 6)),
value2 = rnorm(12, 20, sd = 10),
value3 = rnorm(12, 20, sd = 50)) %>%
group_by(group1) %>%
mutate(id = c(1:4)) %>%
ungroup()
I decided to group them by group1 and group2 and then nest():
df_nested <- df %>%
group_by(group1, group2) %>%
nest()
# A tibble: 6 x 3
# Groups: group1, group2 [6]
group1 group2 data
<chr> <dbl> <list>
1 A 0 <tibble [2 x 3]>
2 B 0 <tibble [2 x 3]>
3 C 0 <tibble [2 x 3]>
4 A 2 <tibble [2 x 3]>
5 B 2 <tibble [2 x 3]>
6 C 2 <tibble [2 x 3]>
Perfect. Now I need to sort by id only the data for which group2 is equal to 2. However, I'm receiving the following error:
df_nested %>%
mutate(data = map2_df(.x = data, .y = group2,
~ifelse(.y == 2, arrange(-.x$id),
.x)))
Error: Argument 1 must have names

You could do:
library(dplyr)
library(purrr)
df_nested$data <- map2(df_nested$data, df_nested$group2,~if(.y == 2)
arrange(.x, -.x$id) else .x)
So data where group2 is not equal to 2 is not sorted:
df_nested$data[[1]]
# A tibble: 2 x 3
# value2 value3 id
# <dbl> <dbl> <int>
#1 13.1 -89.0 1
#2 9.76 -3.29 2
and data where group2 is 2 is sorted:
df_nested$data[[4]]
# A tibble: 2 x 3
#value2 value3 id
# <dbl> <dbl> <int>
#1 15.0 -28.4 4
#2 31.0 -22.8 3
If you want to combine them into a single data frame, do:
map2_df(df_nested$data, df_nested$group2,~if(.y == 2) arrange(.x, -.x$id) else .x)
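For what it's worth, the error in the question seems to come from two things: ifelse() is vectorised and is not meant to return whole data frames, and arrange() needs the data frame as its first argument. Below is a minimal sketch that keeps everything in one pipeline. It rebuilds the question's df so the snippet runs on its own, and df_sorted is just an illustrative name.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Rebuild the example data from the question
set.seed(1234)
df <- tibble(group1 = rep(LETTERS[1:3], 4),
             group2 = c(rep(0, 6), rep(2, 6)),
             value2 = rnorm(12, 20, sd = 10),
             value3 = rnorm(12, 20, sd = 50)) %>%
  group_by(group1) %>%
  mutate(id = 1:4) %>%
  ungroup()

# Nest by both grouping variables, then sort only the nested
# tibbles whose group2 equals 2; the others are returned unchanged
df_sorted <- df %>%
  nest(data = -c(group1, group2)) %>%
  mutate(data = map2(data, group2,
                     ~ if (.y == 2) arrange(.x, desc(id)) else .x))
```

Calling unnest(df_sorted, data) afterwards should give back a flat tibble with the group2 == 2 rows in decreasing id order.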

I would suggest creating an additional variable id_ that equals the original id when group2 == 2 and is NA otherwise. Sorting by it then has no effect on the rows where group2 != 2.
df %>%
mutate(id_ = if_else(group2 == 2, id, NA_integer_)) %>%
arrange(group1, group2, -id_)
#> # A tibble: 12 x 6
#> group1 group2 value2 value3 id id_
#> <chr> <dbl> <dbl> <dbl> <int> <int>
#> 1 A 0 17.6 50.2 1 NA
#> 2 A 0 33.8 -14.4 2 NA
#> 3 A 2 23.1 22.6 4 4
#> 4 A 2 13.7 50.2 3 3
#> 5 B 0 15.4 49.9 1 NA
#> 6 B 0 16.2 63.7 2 NA
#> 7 B 2 41.7 -2.90 4 4
#> 8 B 2 16.6 46.7 3 3
#> 9 C 0 19.9 -64.3 1 NA
#> 10 C 0 19.9 59.7 2 NA
#> 11 C 2 34.1 48.5 4 4
#> 12 C 2 32.3 23.1 3 3
Then if needed we can group and nest the result.

Related

Rolling Window Regression by group in R (with dates)

I have panel data in R and want to run a rolling-window linear regression by group. My dates are numbered from 1 to 618; each number represents one date, and there is more than one observation per date.
I want a rolling window of 20 dates. That is, I want to output all coefficients of lm(y~x1+x2+x3+x4+x5+x6) for the period 1:20, then roll the window and run another regression for 2:21, 3:22, and so on over all my observations, so that the last coefficients are for the period 599:618 (with 618 dates I can't do this manually).
My problem is that when I select a window of 20 observations, I only get the first 20 rows, for example:
1
1
1
1
1
1
1 .... 1
and the first 20 observations may all belong to the first date (1), because there is more than one observation per date. So I want to select the observations by date rather than by row: each window will hold more than 20 observations, but it should roll by date (date 1 to date 20), regardless of the number of observations.
After that, I need to estimate with the Newey-West method, so the final code should include something like the following and output all coefficients and t-statistics:
neweywest <- coeftest(LMOBJECT, vcov. = NeweyWest, lag=12)
I hope this is clear.
You can create multiple linear models for a given interval of dates like this:
library(tidyverse)
# example data
set.seed(1337)
n_dates <- 10
data <- tibble(
date = runif(100, min = 1, max = n_dates) %>% floor(),
x1 = runif(100)**2,
x2 = runif(100) * 2,
x3 = runif(100) + 2,
y = x1 + 2 * x2 + runif(100)
) %>%
arrange(date)
data
#> # A tibble: 100 × 5
#> date x1 x2 x3 y
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0.754 0.700 2.21 2.79
#> 2 1 0.0230 1.97 2.70 4.89
#> 3 1 0.388 0.500 2.21 1.54
#> 4 1 0.225 0.135 2.87 0.849
#> 5 1 0.00000810 0.139 2.22 1.12
#> 6 1 0.255 0.893 2.21 2.25
#> 7 1 0.402 1.37 2.06 3.51
#> 8 1 0.00275 0.363 2.68 0.984
#> 9 2 0.238 1.68 2.53 3.98
#> 10 2 0.0309 1.47 2.05 3.69
#> # … with 90 more rows
# number of rows per day
data %>% count(date)
#> # A tibble: 9 × 2
#> date n
#> <dbl> <int>
#> 1 1 8
#> 2 2 10
#> 3 3 15
#> 4 4 8
#> 5 5 10
#> 6 6 10
#> 7 7 12
#> 8 8 7
#> 9 9 20
# size of rolling window in days
window_size <- 3
models <- tibble(
from = seq(n_dates),
to = from + window_size - 1
) %>%
mutate(
data = from %>% map2(to, ~ data %>% filter(date >= .x & date <= .y)),
model = data %>% map(possibly(~ lm(y ~ x1 + x2 + x3, data = .x), NA))
)
models
#> # A tibble: 10 × 4
#> from to data model
#> <int> <dbl> <list> <list>
#> 1 1 3 <tibble [33 × 5]> <lm>
#> 2 2 4 <tibble [33 × 5]> <lm>
#> 3 3 5 <tibble [33 × 5]> <lm>
#> 4 4 6 <tibble [28 × 5]> <lm>
#> 5 5 7 <tibble [32 × 5]> <lm>
#> 6 6 8 <tibble [29 × 5]> <lm>
#> 7 7 9 <tibble [39 × 5]> <lm>
#> 8 8 10 <tibble [27 × 5]> <lm>
#> 9 9 11 <tibble [20 × 5]> <lm>
#> 10 10 12 <tibble [0 × 5]> <lgl [1]>
models %>%
filter(!is.na(model)) %>%
transmute(
from, to,
coeff = model %>% map(coefficients),
r2 = model %>% map_dbl(~ .x %>% summary() %>% pluck("r.squared"))
) %>%
unnest_wider(coeff)
# A tibble: 9 x 7
# from to `(Intercept)` x1 x2 x3 r2
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 3 0.601 0.883 2.07 -0.0788 0.970
#2 2 4 0.766 0.965 2.01 -0.141 0.965
#3 3 5 0.879 0.954 1.94 -0.165 0.953
Another way of subsetting groups is to use nest:
# get all observations from the 3rd to the 5th date (days 3 to 5 here)
data %>% arrange(date) %>% nest(data = -date) %>% slice(3:5) %>% unnest(data)
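The question also asks for Newey-West t-statistics, which the code above does not cover. Here is a rough sketch of one way to add them, assuming the lmtest and sandwich packages are installed. The data frame dat, the result name nw_results and the lag of 2 are illustrative placeholders only (the question uses lag = 12), and whether Newey-West is appropriate for pooled panel rows is a separate modelling question.

```r
library(lmtest)    # coeftest()
library(sandwich)  # NeweyWest()

# Toy panel: several observations per date (placeholder data)
set.seed(1337)
dat <- data.frame(date = sort(sample(1:9, 100, replace = TRUE)),
                  x1 = runif(100), x2 = runif(100))
dat$y <- dat$x1 + 2 * dat$x2 + runif(100)

window_size <- 3  # window length in dates, not in rows
starts <- 1:(max(dat$date) - window_size + 1)

nw_results <- lapply(starts, function(from) {
  to <- from + window_size - 1
  # select by date, so every observation in the window is used
  win <- dat[dat$date >= from & dat$date <= to, ]
  fit <- lm(y ~ x1 + x2, data = win)
  # coefficient table with Newey-West standard errors (lag is a placeholder)
  ct <- unclass(coeftest(fit, vcov. = NeweyWest(fit, lag = 2, prewhite = FALSE)))
  data.frame(from = from, to = to,
             term = rownames(ct),
             estimate = ct[, "Estimate"],
             t_value = ct[, "t value"],
             row.names = NULL)
})
nw_results <- do.call(rbind, nw_results)
```

Each window contributes one row per coefficient, so nw_results can be filtered or reshaped by term afterwards.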

dplyr `slice_max` interpolation not working

Given a data.frame:
library(tidyverse)
set.seed(0)
df <- tibble(A = 1:10, B = rnorm(10), C = rbinom(10,2,0.6))
var <- "B"
I'd like to filter the data frame by the highest values of the variable named in var. Logically, I'd do either:
df %>%
slice_max({{ var }}, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
df %>%
slice_max(!! var, n = 5)
#> # A tibble: 1 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 1 1.26 1
But neither interpolation is working... what am I missing here?
Expected output would be the same as:
df %>%
slice_max(B, n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I think you need to use the newer .data pronoun:
df %>%
slice_max(.data[[var]] , n = 5)
#> # A tibble: 5 × 3
#> A B C
#> <int> <dbl> <int>
#> 1 10 2.40 0
#> 2 3 1.33 2
#> 3 4 1.27 1
#> 4 1 1.26 1
#> 5 5 0.415 2
I am puzzled as to why your approach returns only the first row, though!
We may convert to a symbol with sym and evaluate (!!):
library(dplyr)
df %>%
slice_max(!! rlang::sym(var), n = 5)
-output
# A tibble: 5 × 3
A B C
<int> <dbl> <int>
1 10 2.40 0
2 3 1.33 2
3 4 1.27 1
4 1 1.26 1
5 5 0.415 2
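On the puzzle of the single row: judging by the output shown in the question, both {{ var }} (used outside a function) and !! var simply inject the string "B", so slice_max ranks by a length-one constant rather than by column B. If the column name has to arrive as a string, wrapping the .data pronoun in a small helper is one option. This is only a sketch, and top_n_by is a made-up name, not a dplyr function.

```r
library(dplyr)

set.seed(0)
df <- tibble(A = 1:10, B = rnorm(10), C = rbinom(10, 2, 0.6))

# Hypothetical helper: `var` is a column name given as a string
top_n_by <- function(data, var, n = 5) {
  slice_max(data, .data[[var]], n = n)
}

res <- top_n_by(df, "B")
```

Here res should match the result of slice_max(df, B, n = 5).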

Calculating R squared from multiple columns

I'm very new to R and have been trying to figure out how to calculate R^2 from a few columns within a large data set of approx 300+ columns.
Example:
rcalc <- data.frame('x1' = c(694, 702, 701), 'x2'=c(652, 659, 655),
'x3'=c(614, 612, 613), 'y1'= c(17.97, 17.95, 17.96), 'y2' = c(12.03, 12.0,
12.1), 'y3' = c(0.09, 0.1, 0.1))
From here I am stuck.
The formula in excel I can do, and looks like this:
RSQ(X1:X3, Y1:Y3) or RSQ(694:652:614, 17.97:12.03:0.09)
So R^2 needs to be calculated for each row. I was able to use the 'lm' command, but only for 1 row:
I had to take the value from each column of x (x1:x3) and stack them into 1 column, then take each value from each column of y (y1:y3) and stack them into 1 column. Then I performed the following:
rsqrd = lm(x~y, data=rcalc)
summary(rsqrd)$r.squared
This worked but again, only for 1 row. I'm not sure how to do this for thousands of rows. I hope this wasn't too confusing. Any help is greatly appreciated.
Troubleshooting:
with pivot_longer:
row col obs value
1 c 300_0 DUT Ip2_comp 784.9775
1 c 300_12 DUT Ip2_comp 864.4234
1 c 300_18 DUT Ip2_comp 919.3384
1 c 300_0 REF O2 0.09
1 c 300_12 REF O2 11.95
1 c 300_18 REF O2 17.98
2 c 300_0 DUT Ip2_comp 781.5785
2 c 300_12 DUT Ip2_comp 865.5541
2 c 300_18 DUT Ip2_comp 921.0646
2 c 300_0 REF O2 0.09
With Pivot_wider:
row obs c
1 300_0 DUT Ip2_comp 784.9775
1 300_12 DUT Ip2_comp 864.4234
1 300_18 DUT Ip2_comp 919.3384
1 300_0 REF O2 0.09
1 300_12 REF O2 11.95
1 300_18 REF O2 17.98
2 300_0 DUT Ip2_comp 781.5785
2 300_12 DUT Ip2_comp 865.5541
2 300_18 DUT Ip2_comp 921.0646
I'm sure this could be done more concisely, but here's one approach using tidyverse functions. First, I do some reshaping to add a row number and make it into a longer shape, with columns for row, observation # (1-3), x, and y.
Then I "nest" all the data except row number so that I can run a separate regression on each row's data, and then extract r squared (and a variety of other stats) from each regression.
library(tidyverse)
rcalc %>% # your data
# reshape to get matched columns for all x and for all y values
mutate(row = row_number()) %>%
pivot_longer(-row, names_to = c("col", "obs"), names_sep = 1) %>% # split column name into two fields after first character
pivot_wider(names_from = col, values_from = value) %>%
# nest data, regression, unnest
nest(data = -row) %>%
mutate(model = map(data, function(df) lm(y ~ x, data = df)),
tidied = map(model, broom::glance)) %>%
unnest(tidied)
Result
# A tibble: 3 x 15
row data model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<int> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 <tibble [3 × 3]> <lm> 0.952 0.905 2.81 20.0 0.140 1 -5.71 17.4 14.7 7.91 1 3
2 2 <tibble [3 × 3]> <lm> 0.973 0.946 2.10 36.3 0.105 1 -4.84 15.7 13.0 4.43 1 3
3 3 <tibble [3 × 3]> <lm> 0.951 0.903 2.84 19.6 0.141 1 -5.74 17.5 14.8 8.05 1 3
Edit: for troubleshooting, I am adding here the results I see at each stage:
after the pivot_longer step:
# A tibble: 18 x 4
row col obs value
<int> <chr> <chr> <dbl>
1 1 x 1 694
2 1 x 2 652
3 1 x 3 614
4 1 y 1 18.0
5 1 y 2 12.0
6 1 y 3 0.09
7 2 x 1 702
8 2 x 2 659
9 2 x 3 612
10 2 y 1 18.0
11 2 y 2 12
12 2 y 3 0.1
13 3 x 1 701
14 3 x 2 655
15 3 x 3 613
16 3 y 1 18.0
17 3 y 2 12.1
18 3 y 3 0.1
after the pivot_wider step:
# A tibble: 9 x 4
row obs x y
<int> <chr> <dbl> <dbl>
1 1 1 694 18.0
2 1 2 652 12.0
3 1 3 614 0.09
4 2 1 702 18.0
5 2 2 659 12
6 2 3 612 0.1
7 3 1 701 18.0
8 3 2 655 12.1
9 3 3 613 0.1
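If only R^2 is needed, a lighter alternative may be worth considering: for a regression with a single predictor, R^2 equals the squared Pearson correlation, which is also what Excel's RSQ computes. The sketch below uses base R only. The vectors x_cols and y_cols are assumptions about which columns pair up and would need adapting to the real 300+ column data set.

```r
rcalc <- data.frame(x1 = c(694, 702, 701), x2 = c(652, 659, 655),
                    x3 = c(614, 612, 613), y1 = c(17.97, 17.95, 17.96),
                    y2 = c(12.03, 12.0, 12.1), y3 = c(0.09, 0.1, 0.1))

# Assumed pairing: x1 goes with y1, x2 with y2, x3 with y3
x_cols <- c("x1", "x2", "x3")
y_cols <- c("y1", "y2", "y3")

# Squared correlation between the x values and y values of each row
r_squared <- vapply(seq_len(nrow(rcalc)), function(i) {
  x <- unlist(rcalc[i, x_cols])
  y <- unlist(rcalc[i, y_cols])
  cor(x, y)^2
}, numeric(1))

rcalc$r_squared <- r_squared
```

The values agree with the r.squared column in the result above (0.952, 0.973 and 0.951 after rounding).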

Join current and previous dataframes in nested tibble using lag() and mutate() to produce a new list-column

I am trying to determine the difference between the set of ids in subsequent pairs of dataframes. The dataframes are derived from an original dataframe split by a grouping variable representing the time period. The results should show the rows of the new ids that occur in the current time period compared to the previous one.
I can accomplish this with a list of dataframes:
library(tidyverse)
set.seed(999)
examp <- tibble(
id = c(replicate(4, sample.int(20, 9))),
year = rep(1:4, each = 9),
val = runif(36)
)
examp %>%
split(.$year) %>%
# note my default, I compare the first year to itself
map2(lag(., default = .[1]), anti_join, by = "id")
$`1`
# A tibble: 0 x 3
# ... with 3 variables: id <int>, year <int>, val <dbl>
$`2`
# A tibble: 3 x 3
id year val
<int> <int> <dbl>
1 5 2 0.450
2 11 2 0.943
3 2 2 0.571
$`3`
# A tibble: 6 x 3
id year val
<int> <int> <dbl>
1 19 3 0.870
2 12 3 0.403
3 9 3 0.331
4 20 3 0.315
5 16 3 0.455
6 17 3 0.699
$`4`
# A tibble: 5 x 3
id year val
<int> <int> <dbl>
1 4 4 0.190
2 11 4 0.0804
3 2 4 0.247
4 1 4 0.619
5 18 4 0.434
But I could not get the same to work using mutate in a nested dataframe:
examp %>%
nest_by(year) %>%
mutate(new = anti_join(data, lag(data), by = "id"))
# A tibble: 4 x 3
# Rowwise: year
year data new$id $val
<int> <list<tibble[,2]>> <int> <dbl>
1 1 [9 x 2] 3 0.0601
2 2 [9 x 2] 1 0.495
3 3 [9 x 2] 17 0.699
4 4 [9 x 2] 18 0.434
Here I could not figure out how to specify the default and the output is unexpected. I expected "new" to be a list-column of dataframes corresponding with those above, which I could then unnest.
I am interested in learning more about working with nested dataframes and any help understanding how to get this to work would be much appreciated. Additionally, if there is another (simple) solution to this general problem, I would be happy to learn about it.
The result should be wrapped in a list:
library(dplyr)
out <- examp %>%
nest_by(year) %>%
ungroup %>%
mutate(newdat = lag(data, default = data[1])) %>%
rowwise %>%
mutate(new = list(anti_join(data, newdat, by = 'id')))
-output
out$new
[[1]]
# A tibble: 0 x 2
# … with 2 variables: id <int>, val <dbl>
[[2]]
# A tibble: 3 x 2
id val
<int> <dbl>
1 5 0.450
2 11 0.943
3 2 0.571
[[3]]
# A tibble: 6 x 2
id val
<int> <dbl>
1 19 0.870
2 12 0.403
3 9 0.331
4 20 0.315
5 16 0.455
6 17 0.699
[[4]]
# A tibble: 5 x 2
id val
<int> <dbl>
1 4 0.190
2 11 0.0804
3 2 0.247
4 1 0.619
5 18 0.434
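As a possible variation, the same idea can be written without rowwise by pairing each nested tibble with the previous one through purrr::map2. This is a sketch under the assumption that tidyr::nest is acceptable in place of nest_by. The names prev and out2 are illustrative.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(999)
examp <- tibble(
  id = c(replicate(4, sample.int(20, 9))),
  year = rep(1:4, each = 9),
  val = runif(36)
)

out2 <- examp %>%
  nest(data = -year) %>%
  mutate(
    # previous year's tibble; the first year is compared with itself
    prev = lag(data, default = data[1]),
    # rows whose id did not occur in the previous year
    new = map2(data, prev, ~ anti_join(.x, .y, by = "id"))
  )

# flat table of the new ids per year
new_ids <- out2 %>%
  select(year, new) %>%
  unnest(new)
```

Because map2 already returns a list, no explicit list() wrapper is needed here.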

mutate using values in a nest for each group using map

Consider the case below for an experiment where group holds the different treatments, init is the initial value for each sample, change is the expected change after treatment, and sd_change is the standard deviation of the change.
library(tidyverse)
set.seed(001)
data1 <- tibble(group = rep(c("a", "b"), each = 4),
init = rpois(8, 10)) %>%
group_by(group, init) %>%
expand(change = seq(2, 6, 2)) %>%
mutate(sd_change = 2)
data1
# A tibble: 24 x 4
# Groups: group, init [8]
group init change sd_change
<chr> <int> <dbl> <dbl>
1 a 7 2 2
2 a 7 4 2
3 a 7 6 2
4 a 8 2 2
5 a 8 4 2
6 a 8 6 2
7 a 10 2 2
8 a 10 4 2
9 a 10 6 2
10 a 11 2 2
# ... with 14 more rows
I generate final values and obtain the mean and variance for each group and change as below:
data2a <- data1 %>%
rowwise %>%
mutate(final = rnorm(1, change, sd_change) + init) %>%
ungroup
data2a %>%
group_by(group, change) %>%
summarise(mu_start = mean(init), mu_end = mean(final),
v_start = var(init), v_end = var(final))
# A tibble: 6 x 6
# Groups: group [2]
group change mu_start mu_end v_start v_end
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 2 9 10.9 3.33 13.9
2 a 4 9 14.7 3.33 4.90
3 a 6 9 15.5 3.33 10.2
4 b 2 11.5 13.2 4.33 3.69
5 b 4 11.5 14.8 4.33 17.8
6 b 6 11.5 17.7 4.33 9.77
I want to repeat the above procedure R times, generating one final random value each time. I can do this with a for loop, but I'm learning purrr and I'm stuck when summarising. See one version below:
# function to generate final values where R = 3
f <- function(n = 3, x, y, z) {
  # n draws from a normal with mean x and sd y, shifted by the initial value z
  rnorm(n, x, y) + z
}
data2b <- data1 %>%
mutate(final = pmap(list(z = init,
x = change,
y = sd_change),
f)) %>%
ungroup
as_tibble(data2b)
# A tibble: 24 x 5
group init change sd_change final
<chr> <int> <dbl> <dbl> <list>
1 a 7 2 2 <dbl [3]>
2 a 7 4 2 <dbl [3]>
3 a 7 6 2 <dbl [3]>
4 a 8 2 2 <dbl [3]>
5 a 8 4 2 <dbl [3]>
6 a 8 6 2 <dbl [3]>
7 a 10 2 2 <dbl [3]>
8 a 10 4 2 <dbl [3]>
9 a 10 6 2 <dbl [3]>
10 a 11 2 2 <dbl [3]>
# ... with 14 more rows
I then want to summarise to get mu_end, which should be a list of length R = 3 in this example. The following gives an error:
data2b %>%
split(.$group, .$change) %>%
mutate(mu_end = map(final, mean),
v_end = map(final, var))
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "list"
The output should be like this
# A tibble: 6 x 4
# Groups: group [2]
group change mu_end v_end
<chr> <dbl> <dbl> <dbl>
1 a 2 10.9 13.9
2 a 4 14.7 4.90
3 a 6 15.5 10.2
4 b 2 13.2 3.69
5 b 4 14.8 17.8
6 b 6 17.7 9.77
but each row of mu_end and v_end should be a list of length R.
Any help?
We can either use group_split and then map over the list of tibbles, using mutate to compute the mean and var of the list column 'final' by looping with map:
data2b %>%
group_split(group, change) %>%
map_df(~ .x %>%
mutate(mu_end = map_dbl(final, mean),
v_end = map_dbl(final, var)))
Or without splitting:
data2b %>%
group_by(group, change) %>%
mutate(mu_end = map_dbl(final, mean), v_end = map_dbl(final, var))
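If I read the question correctly, the aim is one mean and variance per replicate for each group and change, collected into list-columns of length R. Note that the map_dbl calls above average over the R draws within each row instead. Below is a sketch of the per-replicate version. The draw index column and the names sims, summary_by_draw and nested_summary are introduced here for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
R <- 3  # number of replicates

data1 <- tibble(group = rep(c("a", "b"), each = 4),
                init = rpois(8, 10)) %>%
  group_by(group, init) %>%
  expand(change = seq(2, 6, 2)) %>%
  ungroup() %>%
  mutate(sd_change = 2)

# R simulated final values per row, unnested with a replicate index
sims <- data1 %>%
  mutate(final = pmap(list(change, sd_change, init),
                      function(m, s, z) rnorm(R, m, s) + z),
         draw = list(seq_len(R))) %>%
  unnest(c(final, draw))

# mean and variance across samples, separately for each replicate
summary_by_draw <- sims %>%
  group_by(group, change, draw) %>%
  summarise(mu_end = mean(final), v_end = var(final), .groups = "drop")

# collapse the replicates into list-columns of length R
nested_summary <- summary_by_draw %>%
  group_by(group, change) %>%
  summarise(mu_end = list(mu_end), v_end = list(v_end), .groups = "drop")
```

nested_summary then has one row per group and change, with mu_end and v_end each holding R values.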
