Nesting several groups of columns inside a data frame - r

The concept of nesting several columns into a single list-column is very powerful. However, I am not sure whether it is possible at all to nest more than one set of columns into several list-columns within the same pipeline using the nest function in {tidyr}. For instance, assume I have the following data frame:
df <- as.data.frame(replicate(6, runif(10) * 100))
colnames(df) <- c(
paste0("a", 1:2), # a1, a2
paste0("b", 1:4) # b1, b2, b3, b4
)
df
a1 a2 b1 b2 b3 b4
1 20.807348 69.339482 91.837151 99.76813 3.394350 33.780049
2 64.667733 20.676381 80.523369 38.42774 85.635208 60.111491
3 55.352501 55.699571 4.812923 38.65333 98.869203 80.345576
4 45.194094 16.511696 83.834651 51.48698 7.191081 16.697210
5 66.401642 89.041055 26.965636 67.90061 90.622428 59.552935
6 35.750100 55.997766 49.768556 68.45900 67.523080 58.993232
7 21.392823 5.335281 56.348328 35.68331 51.029617 66.290035
8 8.851236 19.486580 14.199370 22.49754 14.617592 18.236406
9 70.475652 6.229997 43.169364 12.63378 21.415589 2.163004
10 47.837613 37.641530 38.001288 71.15896 71.000568 2.135611
I would like to nest the "a" columns into a list-column AND nest the "b" columns into a second list-column because I would like to perform different computations on them.
Nesting the "a" columns works:
library(tidyr)
nest(df, a1, a2, .key = "a")
b1 b2 b3 b4 a
1 91.837151 99.76813 3.394350 33.780049 20.80735, 69.33948
2 80.523369 38.42774 85.635208 60.111491 64.66773, 20.67638
3 4.812923 38.65333 98.869203 80.345576 55.35250, 55.69957
4 83.834651 51.48698 7.191081 16.697210 45.19409, 16.51170
5 26.965636 67.90061 90.622428 59.552935 66.40164, 89.04105
6 49.768556 68.45900 67.523080 58.993232 35.75010, 55.99777
7 56.348328 35.68331 51.029617 66.290035 21.392823, 5.335281
8 14.199370 22.49754 14.617592 18.236406 8.851236, 19.486580
9 43.169364 12.63378 21.415589 2.163004 70.475652, 6.229997
10 38.001288 71.15896 71.000568 2.135611 47.83761, 37.64153
But it is impossible to nest the "b" columns AFTER the "a" columns have been nested:
nest(df, a1, a2, .key = "a") %>%
nest(b1, b2, b3, b4, .key = "b")
Error in grouped_df_impl(data, unname(vars), drop) :
Column `a` can't be used as a grouping variable because it's a list
which makes sense by reading the error message.
My work-around is to:
1. nest the "a" columns
2. perform the required computations on the "a" list-column
3. unnest the "a" list-column
4. nest the "b" columns
5. perform the required computations on the "b" list-column
6. unnest the "b" list-column
Is there a more straight-forward way to achieve this? Your help is much appreciated.

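We can use map from {purrr} to do this: select each group of columns, nest it, and bind the results (the nest call below uses the older tidyr interface with .key):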
library(tidyverse)
out <- list('a', 'b') %>%
map(~ df %>%
select(matches(.x)) %>%
nest(names(.), .key = !! rlang::sym(.x))) %>%
bind_cols
out
# A tibble: 1 x 2
# a b
# <list> <list>
#1 <data.frame [10 × 2]> <data.frame [10 × 4]>
out %>%
unnest
# A tibble: 10 x 6
# a1 a2 b1 b2 b3 b4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 20.8 69.3 91.8 99.8 3.39 33.8
# 2 64.7 20.7 80.5 38.4 85.6 60.1
# 3 55.4 55.7 4.81 38.7 98.9 80.3
# 4 45.2 16.5 83.8 51.5 7.19 16.7
# 5 66.4 89.0 27.0 67.9 90.6 59.6
# 6 35.8 56.0 49.8 68.5 67.5 59.0
# 7 21.4 5.34 56.3 35.7 51.0 66.3
# 8 8.85 19.5 14.2 22.5 14.6 18.2
# 9 70.5 6.23 43.2 12.6 21.4 2.16
#10 47.8 37.6 38.0 71.2 71.0 2.14
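For readers on tidyr 1.0.0 or later, nest() accepts several `name = columns` pairs in a single call, so both list-columns can be created at once. The following is a minimal sketch under that assumption; the seed and the object names `nested` and `restored` are illustrative only and are not from the original post.

```r
library(tidyr)

set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- as.data.frame(replicate(6, runif(10) * 100))
colnames(df) <- c(paste0("a", 1:2), paste0("b", 1:4))

# One nest() call creates both list-columns (tidyr >= 1.0.0 syntax)
nested <- nest(df, a = c(a1, a2), b = c(b1, b2, b3, b4))

# Transform only the "a" block, then restore the flat layout
nested$a <- lapply(nested$a, function(d) d * 4)
restored <- unnest(nested, cols = c(a, b))
```

Because no columns are left outside the nesting, `nested` has a single row holding one 10 x 2 and one 10 x 4 data frame.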
We can then run separate computations on the 'a' and 'b' list-columns, for example multiplying every value in 'a' by 4:
out %>%
mutate(a = map(a, `*`, 4)) %>%
unnest
# A tibble: 10 x 6
# a1 a2 b1 b2 b3 b4
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 83.2 277. 91.8 99.8 3.39 33.8
# 2 259. 82.7 80.5 38.4 85.6 60.1
# 3 221. 223. 4.81 38.7 98.9 80.3
# 4 181. 66.0 83.8 51.5 7.19 16.7
# 5 266. 356. 27.0 67.9 90.6 59.6
# 6 143. 224. 49.8 68.5 67.5 59.0
# 7 85.6 21.3 56.3 35.7 51.0 66.3
# 8 35.4 77.9 14.2 22.5 14.6 18.2
# 9 282. 24.9 43.2 12.6 21.4 2.16
#10 191. 151. 38.0 71.2 71.0 2.14
Having said that, it is also possible to select columns of interest with mutate_at instead of doing nest/unnest
df %>%
mutate_at(vars(matches('^a\\d+')), funs(.*4))
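funs() was deprecated in dplyr 0.8.0, and across() (dplyr 1.0.0 and later) supersedes the `_at` variants. A hedged equivalent of the call above, assuming a recent dplyr; the seed and the name `scaled` are illustrative:

```r
library(dplyr)

set.seed(1)  # arbitrary seed for reproducibility
df <- as.data.frame(replicate(6, runif(10) * 100))
colnames(df) <- c(paste0("a", 1:2), paste0("b", 1:4))

# Multiply only the columns named a<digits> by 4; "b" columns are untouched
scaled <- df %>%
  mutate(across(matches("^a\\d+$"), ~ .x * 4))
```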


Add new variable with arithmetic conditions

The randomly generated data frame contains ID, Dates, and Earnings. I changed the data frame format so that each column represents a date and its values correspond to the earnings.
I want to create a new variable named "Date_over100" that gives the date on which one's cumulative earnings first exceeded 100. Below is reproducible code that generates the data frame. I assume conditional statements or loops would be used to achieve this. I would appreciate any help. Thanks in advance!
ID <- c(1:10)
Date <- sample(seq(as.Date('2021/01/01'), as.Date('2021/01/11'), by="day", replace=T), 10)
Earning <- round(runif(10,30,50),digits = 2)
df <- data.frame(ID,Date,Earning,check.names = F)
df1 <- df%>%
arrange(Date)%>%
pivot_wider(names_from = Date, values_from = Earning)
df1 <- as.data.frame(df1)
df1[is.na(df1)] <- round(runif(sum(is.na(df1)),min=30,max=50),digits = 2)
I go back to long format for the calculation, then join to the wide data:
library(dplyr)
library(tidyr)
df1 %>% pivot_longer(cols = -ID, names_to = "date") %>%
group_by(ID) %>%
# use the `date` column created by pivot_longer (character, hence as.Date)
summarize(Date_over_100 = as.Date(date[which.max(cumsum(value) > 100)])) %>%
right_join(df1, by = "ID")
# # A tibble: 10 × 12
# ID Date_over_100 `2021-01-04` `2021-01-01` `2021-01-08` `2021-01-11` `2021-01-02` `2021-01-09`
# <int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2021-01-08 45.0 46.2 40.1 47.4 47.5 48.8
# 2 2 2021-01-08 36.7 30.3 36.2 47.5 41.4 41.7
# 3 3 2021-01-08 49.5 46.0 45.0 43.9 45.4 37.1
# 4 4 2021-01-08 31.0 48.7 47.3 40.4 40.8 35.5
# 5 5 2021-01-08 48.2 35.2 32.1 44.2 35.4 49.7
# 6 6 2021-01-08 40.8 37.6 31.8 40.3 38.3 42.5
# 7 7 2021-01-08 37.9 42.9 36.8 46.0 39.8 33.6
# 8 8 2021-01-08 47.7 47.8 39.7 46.4 43.8 46.5
# 9 9 2021-01-08 32.9 42.0 41.8 32.8 33.9 35.5
# 10 10 2021-01-08 34.5 40.1 42.7 35.9 44.8 31.8
# # … with 4 more variables: 2021-01-10 <dbl>, 2021-01-03 <dbl>, 2021-01-07 <dbl>, 2021-01-05 <dbl>
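Two details of this approach are worth checking on real data: the cumulative sum follows the order of the date columns, and which.max() returns 1 when no value exceeds the threshold. The sketch below, on a small hypothetical data frame (`wide` and its values are made up for illustration), sorts by date first and yields NA when the threshold is never reached.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per date, values are earnings
wide <- data.frame(
  ID = 1:3,
  `2021-01-01` = c(60, 30, 10),
  `2021-01-02` = c(50, 35, 10),
  `2021-01-03` = c(10, 36, 10),
  check.names = FALSE
)

result <- wide %>%
  pivot_longer(-ID, names_to = "date", values_to = "value") %>%
  mutate(date = as.Date(date)) %>%
  arrange(ID, date) %>%                      # accumulate in chronological order
  group_by(ID) %>%
  # first date with cumulative earnings above 100, or NA if there is none
  summarize(Date_over_100 = date[cumsum(value) > 100][1])
```

Here ID 3 never reaches 100, so its `Date_over_100` is NA rather than the first date.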

R dplyr summarize_at: numeric vector of column positions results in "Can't convert a character NA to a symbol" - Summary stats output with t-test

I wish to summarize a set of data in a data frame using dplyr.
Concerning the "vars" argument, the documentation reads:
A list of columns generated by vars(), a character vector of column names, a numeric vector of column positions, or NULL.
I have the following behavior depending on the type of "vars" argument:
summarize_at(vars(D8,D9,D10), mean, na.rm=TRUE) # works
summarize_at(c("D8","D9","D10"), mean, na.rm=TRUE) # works
summarize_at(c(12,13,14), mean, na.rm=TRUE) # Using column indexes for D8, D9 and D10, respectively
# ! Can't convert a character `NA` to a symbol.
summarize_at(c(12:14), mean, na.rm=TRUE) # Same error as c(12,13,14)
Why am I getting that error?
POST EDIT: Adding data and actual code
Data:
# A tibble: 12 x 5
TTMENT DOSE D8 D9 D10
<chr> <dbl> <dbl> <dbl> <dbl>
1 Group_1 0 40.3 41.1 41.5
2 Group_1 0 37.4 36.9 37.1
3 Group_1 0 44.8 44.1 44.4
4 Group_2 450 39.6 39.6 39.4
5 Group_2 450 40.6 41.2 40.8
6 Group_2 450 41.1 42.1 41.2
7 Group_3 500 38.5 39.2 39.9
8 Group_3 500 41.6 41.6 41.5
9 Group_3 500 41.8 41.8 42.4
10 Group_4 700 43.6 42 42.4
11 Group_4 700 43.1 42.7 42.7
12 Group_4 700 41.6 40.8 41.9
Error triggering code:
group_by(TTMENT, DOSE) %>%
#summarize_at(c("D8","D9","D10"), mean, na.rm=TRUE)
#summarize_at(vars(D8,D9,D10), mean, na.rm=TRUE)
summarize_at(c(3,4,5), mean, na.rm=TRUE)
Full error:
Error in FUN():
! Can't convert a character NA to a symbol.
Backtrace:
stack %>% group_by(TTMENT, DOSE) %>% ...
dplyr::summarize_at(., c(3, 4, 5), mean, na.rm = TRUE)
dplyr:::manip_at(...)
dplyr:::tbl_at_syms(.tbl, .vars, .include_group_vars = .include_group_vars)
rlang::syms(vars)
rlang:::map(x, sym)
base::lapply(.x, .f, ...)
rlang FUN(X[[i]], ...) Error in FUN(X[[i]], ...) :
I actually want an output showing mean, SD and SE presented in 3 rows per group (rather than in columns); and if possible an asterisk next to the mean in case of significant t-test between each group and the reference group (Group 1). Something like that:
Group Statistic D8 D9 D10
Group_1 Mean XX XX XX
Group_1 SD XX XX XX
Group_1 SE XX XX XX
Group_2 Mean XX* XX XX*
Group_2 SD XX XX XX
Group_2 SE XX XX XX
Group_3 etc.
Any ideas on how to achieve this?
Just posting an answer as I found an explanation (newbie topic though...)
Apparently, after group_by the grouping columns are excluded when column positions are counted, so the positions refer to the remaining (non-grouping) columns only. Therefore, given the data frame:
# A tibble: 12 x 5
TTMENT DOSE D8 D9 D10
<chr> <dbl> <dbl> <dbl> <dbl>
1 Group_1 0 40.3 41.1 41.5
2 Group_1 0 37.4 36.9 37.1
3 Group_1 0 44.8 44.1 44.4
4 Group_2 450 39.6 39.6 39.4
5 Group_2 450 40.6 41.2 40.8
6 Group_2 450 41.1 42.1 41.2
7 Group_3 500 38.5 39.2 39.9
8 Group_3 500 41.6 41.6 41.5
9 Group_3 500 41.8 41.8 42.4
10 Group_4 700 43.6 42 42.4
11 Group_4 700 43.1 42.7 42.7
12 Group_4 700 41.6 40.8 41.9
The following code fails as it assumes column indexes 3, 4 and 5 for columns D8, D9 and D10 respectively:
results <- stack %>%
group_by(TTMENT, DOSE) %>%
summarize_at(c(3:5), mean, na.rm=TRUE)
Results in error:
Error in `FUN()`:
! Can't convert a character `NA` to a symbol.
In contrast, the following code gives the expected result, as it assumes column indexes 1, 2 and 3 for columns D8, D9 and D10 respectively. That is, TTMENT and DOSE are ignored when counting the indexes:
results <- stack %>%
group_by(TTMENT, DOSE) %>%
summarize_at(c(1:3), mean, na.rm=TRUE)
Result:
# A tibble: 4 x 5
# Groups: TTMENT [4]
TTMENT DOSE D8 D9 D10
<chr> <dbl> <dbl> <dbl> <dbl>
1 Group_1 0 40.8 40.7 41
2 Group_2 450 40.4 41.0 40.5
3 Group_3 500 40.6 40.9 41.3
4 Group_4 700 42.8 41.8 42.3
Thanks to @jpiversen, whose comment helped me understand what was going on.
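Selecting the columns by name avoids the position ambiguity altogether. As for the layout asked for in the question (mean, SD and SE as rows per group), the following is one possible sketch using across() and a reshape, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later. The helper `se` and the name `stats_long` are hypothetical, and the t-test asterisks are not covered here.

```r
library(dplyr)
library(tidyr)

# Data copied from the question
stack <- data.frame(
  TTMENT = rep(paste0("Group_", 1:4), each = 3),
  DOSE   = rep(c(0, 450, 500, 700), each = 3),
  D8  = c(40.3, 37.4, 44.8, 39.6, 40.6, 41.1, 38.5, 41.6, 41.8, 43.6, 43.1, 41.6),
  D9  = c(41.1, 36.9, 44.1, 39.6, 41.2, 42.1, 39.2, 41.6, 41.8, 42.0, 42.7, 40.8),
  D10 = c(41.5, 37.1, 44.4, 39.4, 40.8, 41.2, 39.9, 41.5, 42.4, 42.4, 42.7, 41.9)
)

# Hypothetical helper: standard error of the mean, ignoring NAs
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))

stats_long <- stack %>%
  group_by(TTMENT, DOSE) %>%
  summarise(
    across(c(D8, D9, D10),                    # columns chosen by name
           list(Mean = ~ mean(.x, na.rm = TRUE),
                SD   = ~ sd(.x, na.rm = TRUE),
                SE   = se),
           .names = "{.col}__{.fn}"),
    .groups = "drop"
  ) %>%
  # split "D8__Mean" into variable and statistic, then put variables in columns
  pivot_longer(-c(TTMENT, DOSE),
               names_to = c("variable", "Statistic"),
               names_sep = "__") %>%
  pivot_wider(names_from = variable, values_from = value)
```

The result has three rows per group (Mean, SD, SE) and one column per measured variable.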

mutate arithmetic using dplyr::lag

I want to automate calculating the difference in means from a grouped mean_se table. But using lag() in a mutate function produces NA's.
iris %>% group_by(Species) %>%
group_modify(~ mean_se(.x$Sepal.Length)) %>% mutate(difference = y-lag(y))
What I would like is a difference column that says NA, 0.93, 0.65
A harder case would be to specify the particular category against which to calculate the operation, for example filter(marital == "No answer") so that the mean differences in each raceXmarital status are calculated against the values of "No answer" in the marital column (34, 64 and 56)
gss_cat %>% group_by(race, marital) %>%
group_modify(~ mean_se(.x$age))
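The grouping is still present after the mutate unless we call ungroup, so lag is computed within each one-row group. mutate has no equivalent of the .groups = 'drop' option in summarise. Note that mean_se comes from {ggplot2}.
library(dplyr)
library(ggplot2) # provides mean_se()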
iris %>%
group_by(Species) %>%
group_modify(~ mean_se(.x$Sepal.Length)) %>%
ungroup %>%
mutate(difference = y-lag(y))
-output
# A tibble: 3 x 5
Species y ymin ymax difference
<fct> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.01 4.96 5.06 NA
2 versicolor 5.94 5.86 6.01 0.93
3 virginica 6.59 6.50 6.68 0.652
Essentially, when each group has a single row (as here, where group_modify returns one row per group), lag returns NA, because its default fill value is NA:
lag(5)
[1] NA
and any value subtracted from NA returns NA
6 - NA
[1] NA
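The effect of the grouping can be seen side by side in a small sketch. The tibble below hard-codes the three species means of Sepal.Length from iris so that it runs without ggplot2; the names `means`, `grouped` and `ungrouped` are illustrative.

```r
library(dplyr)

# Per-species means of Sepal.Length, entered by hand for illustration
means <- tibble(
  Species = c("setosa", "versicolor", "virginica"),
  y = c(5.006, 5.936, 6.588)
)

# Grouped: each group holds one row, so lag(y) is NA in every group
grouped <- means %>%
  group_by(Species) %>%
  mutate(difference = y - dplyr::lag(y))

# Ungrouped: lag(y) looks at the previous row of the whole table
ungrouped <- means %>%
  mutate(difference = y - dplyr::lag(y))
```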
For the second case, the data is grouped by two columns, therefore, we can change the grouping to 'race' and do the subtraction by subsetting
library(forcats)
data(gss_cat)
gss_cat %>%
group_by(race, marital) %>%
group_modify(~ mean_se(.x$age)) %>%
group_by(race) %>%
mutate(diff = y - y[marital == 'No answer']) %>%
ungroup
-output
# A tibble: 18 x 6
race marital y ymin ymax diff
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Other No answer 34 28 40 0
2 Other Never married 30.2 29.8 30.6 -3.79
3 Other Separated 42.5 41.3 43.8 8.54
4 Other Divorced 45.5 44.7 46.3 11.5
5 Other Widowed 64.5 62.7 66.3 30.5
6 Other Married 42.2 41.8 42.7 8.24
7 Black No answer 64 NA NA 0
8 Black Never married 34.5 34.2 34.8 -29.5
9 Black Separated 46.2 45.2 47.1 -17.8
10 Black Divorced 51.0 50.4 51.5 -13.0
11 Black Widowed 67.5 66.7 68.4 3.53
12 Black Married 46.4 46.0 46.9 -17.6
13 White No answer 56 50.1 61.9 0
14 White Never married 34.4 34.2 34.7 -21.6
15 White Separated 45.6 44.9 46.2 -10.4
16 White Divorced 51.6 51.3 51.8 -4.44
17 White Widowed 72.8 72.5 73.1 16.8
18 White Married 49.7 49.5 49.8 -6.32

How to get returns from a DF

So I'm using the quantmod library to calculate historical returns, but while I can get the past prices, how can I calculate the returns and add it on to the dataframe???
My code looks like this
tickers <- c('KO', 'AAPL')
getSymbols(tickers, from = '2020-07-01', to = '2021-07-01')
history <- cbind(KO$KO.Close,AAPL$AAPL.Close)
First, here is a way to import and structure the data more cleanly.
Import
library(quantmod)
library(tidyverse)
tickers <- c('KO', 'AAPL')
df <-
map_df(
.x = tickers,
.f = function(x){
getSymbols(x, from = '2020-07-01', to = '2021-07-01',auto.assign = FALSE) %>%
as_tibble() %>%
set_names(c("open","high","low","close","volume","adjusted")) %>%
mutate(symbol = x)
}
)
# A tibble: 504 x 7
open high low close volume adjusted symbol
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 44.9 45.5 44.7 44.8 14316500 43.1 KO
2 45.3 45.4 44.8 44.9 15111900 43.2 KO
3 45.1 45.3 44.6 45.2 15146000 43.5 KO
4 45 45.5 44.8 45.2 13043600 43.5 KO
5 45.1 45.2 44.5 45.1 13851200 43.3 KO
6 45.0 45.0 43.8 43.9 16087100 42.2 KO
7 43.9 45.2 43.9 45.2 15627800 43.4 KO
8 45.5 45.7 45.0 45.2 16705300 43.5 KO
9 44.9 45.9 44.7 45.9 17080100 44.1 KO
10 46.3 47.2 46.2 46.4 23738000 44.6 KO
Return
I do not know whether this is the right formula for the return, but you can change it later inside mutate.
df %>%
group_by(symbol) %>%
mutate(return = 100*((open/lag(open))-1))
# A tibble: 504 x 8
# Groups: symbol [2]
open high low close volume adjusted symbol return
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
1 44.9 45.5 44.7 44.8 14316500 43.1 KO NA
2 45.3 45.4 44.8 44.9 15111900 43.2 KO 0.801
3 45.1 45.3 44.6 45.2 15146000 43.5 KO -0.331
4 45 45.5 44.8 45.2 13043600 43.5 KO -0.310
5 45.1 45.2 44.5 45.1 13851200 43.3 KO 0.311
6 45.0 45.0 43.8 43.9 16087100 42.2 KO -0.199
7 43.9 45.2 43.9 45.2 15627800 43.4 KO -2.60
8 45.5 45.7 45.0 45.2 16705300 43.5 KO 3.76
9 44.9 45.9 44.7 45.9 17080100 44.1 KO -1.36
10 46.3 47.2 46.2 46.4 23738000 44.6 KO 3.10
# ... with 494 more rows
Assuming the return you're looking for is today's value divided by yesterday's value, minus 1, and using the tidyverse:
library(tidyverse)
library(timetk)
tickers <- c('KO', 'AAPL')
quantmod::getSymbols(tickers, from = '2020-07-01', to = '2021-07-01')
# Convert to a tibble to keep the dates
equity1 <- tk_tbl(KO) %>%
select(date = index, 5)
equity2 <- tk_tbl(AAPL) %>%
select(date = index, 5)
# Combine the series using a join, in case dates don't line up exactly.
history <- full_join(equity1, equity2, by = "date")
# Make data long, group by equity, do the calculation, turn back into wide data:
return <- history %>%
pivot_longer(-date) %>%
group_by(name) %>%
mutate(return = value/lag(value)-1) %>%
ungroup() %>%
pivot_wider(names_from = name, values_from = c(value, return))
# A tibble: 252 x 5
date value_KO.Close value_AAPL.Close return_KO.Close return_AAPL.Close
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-07-01 44.8 91.0 NA NA
2 2020-07-02 44.9 91.0 0.00134 0
3 2020-07-06 45.2 93.5 0.00780 0.0268
4 2020-07-07 45.2 93.2 -0.000442 -0.00310
5 2020-07-08 45.1 95.3 -0.00310 0.0233
6 2020-07-09 43.9 95.8 -0.0257 0.00430
7 2020-07-10 45.2 95.9 0.0282 0.00175
8 2020-07-13 45.2 95.5 0.00221 -0.00461
9 2020-07-14 45.9 97.1 0.0137 0.0165
10 2020-07-15 46.4 97.7 0.0116 0.00688
# ... with 242 more rows
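The return calculation itself can be checked offline on made-up numbers. The sketch below uses hypothetical closing prices (not real KO or AAPL data) and computes both the simple return used above and, as an assumed alternative that the question may or may not want, the log return.

```r
library(dplyr)

# Hypothetical closing prices, three days per symbol
prices <- tibble(
  symbol = rep(c("KO", "AAPL"), each = 3),
  close  = c(100, 110, 99, 50, 55, 60.5)
)

returns <- prices %>%
  group_by(symbol) %>%                       # lag must not cross symbols
  mutate(
    simple_return = close / dplyr::lag(close) - 1,
    log_return    = log(close / dplyr::lag(close))
  ) %>%
  ungroup()
```

The first row of each symbol is NA because there is no previous price to compare against.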

How to create several new group-based variables most efficiently?

Let's use the following example:
set.seed(2409)
N=5
T=10
id<- rep(LETTERS[1:N],each=T)
time<-rep(1:T, times=N)
var1<-runif(N*T,0,100)
var2<-runif(N*T,0,100)
var3<-runif(N*T,0,100)
var4<-runif(N*T,0,100)
var5<-runif(N*T,0,100)
df<-data.frame(id,time,var1,var2,var3,var4,var5); rm(N,T,id,time,var1,var2,var3,var4,var5)
I now try to execute a function for several of these variables (not the whole series of variables!) and create new variables accordingly.
I already have a suitable code for creating log variables. For this I would use the following code:
cols <- c("var1",
"var3",
"var5")
log <- log(df[cols])
colnames(log) <- paste(colnames(log), "log", sep = "_")
df <- cbind(df,log); rm(log, cols)
This would give me my additional log variables. But now I also want to create lagged and z-transformed variables. These functions refer to the individual IDs. So I wrote the following code that of course works, but is extremely long and inefficient in my real dataset where I apply the function to 38 variables each:
library(Hmisc)
library(dplyr)
df<-df %>%
group_by(id) %>%
mutate(var1_1=Lag(var1, shift=1),
var3_1=Lag(var3, shift=1),
var5_1=Lag(var5, shift=1),
var1_2=Lag(var1, shift=2),
var3_2=Lag(var3, shift=2),
var5_2=Lag(var5, shift=2),
var1_z=scale(var1),
var3_z=scale(var3),
var5_z=scale(var5)
)
I am very sure that there is also a way to make this more efficient. It would be desirable if I could define the original variable once and execute different functions and create new variables as a result.
Thank you very much!
You can use mutate_at with funs. This will apply the three functions in funs to each of the three variables in vars, creating 9 new columns.
library(dplyr)
df %>%
group_by(id) %>%
mutate_at(vars(var1, var3, var5),
funs(lag1 = lag(.), lag2 = lag(., 2), scale))
# # A tibble: 50 x 16
# # Groups: id [5]
# id time var1 var2 var3 var4 var5 var1_lag1 var3_lag1 var5_lag1
# <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 38.8 25.7 29.2 91.1 35.3 NA NA NA
# 2 A 2 87.1 22.3 8.27 31.5 93.7 38.8 29.2 35.3
# 3 A 3 61.7 38.8 0.887 63.0 50.4 87.1 8.27 93.7
# 4 A 4 0.692 60.1 71.5 74.0 41.6 61.7 0.887 50.4
# 5 A 5 60.1 13.3 90.4 80.6 47.5 0.692 71.5 41.6
# 6 A 6 46.4 3.67 36.7 86.9 67.5 60.1 90.4 47.5
# 7 A 7 80.4 72.1 82.2 25.5 70.3 46.4 36.7 67.5
# 8 A 8 48.8 25.7 93.4 19.8 81.2 80.4 82.2 70.3
# 9 A 9 48.2 31.5 82.1 47.2 49.2 48.8 93.4 81.2
# 10 A 10 21.8 32.6 76.5 19.7 41.1 48.2 82.1 49.2
# # ... with 40 more rows, and 6 more variables: var1_lag2 <dbl>, var3_lag2 <dbl>,
# # var5_lag2 <dbl>, var1_scale <dbl>, var3_scale <dbl>, var5_scale <dbl>
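Since funs() and mutate_at() have been deprecated or superseded in newer dplyr releases, the same idea can be written with across(). This is a sketch assuming dplyr 1.0.0 or later; the suffix `z` and the name `out` are illustrative choices, and only three of the question's variables are generated to keep it short.

```r
library(dplyr)

set.seed(2409)
df <- data.frame(
  id   = rep(LETTERS[1:5], each = 10),
  time = rep(1:10, times = 5),
  var1 = runif(50, 0, 100),
  var3 = runif(50, 0, 100),
  var5 = runif(50, 0, 100)
)

out <- df %>%
  group_by(id) %>%
  mutate(across(
    c(var1, var3, var5),                        # define the variables once
    list(lag1 = ~ dplyr::lag(.x, 1),
         lag2 = ~ dplyr::lag(.x, 2),
         z    = ~ as.numeric(scale(.x)))        # scale() returns a matrix
  )) %>%
  ungroup()
```

With the default naming scheme the new columns are called `var1_lag1`, `var1_lag2`, `var1_z`, and so on.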
Here is an option with data.table
library(data.table)
nm1 <- c('var1', 'var3', 'var5')
nm2 <- paste0(nm1, rep(c('_lag1', '_lag2'), each = 3))
nm3 <- paste0(nm1, '_scale')
# shift() is called once per lag so the order of the results matches nm2
setDT(df)[, c(nm2, nm3) := c(shift(.SD, n = 1), shift(.SD, n = 2),
lapply(.SD, function(x) as.vector(scale(x)))), by = id, .SDcols = nm1]
