I'm working with datasets (from smartphone experience sampling) where I very frequently have to perform grouped operations (such as finding the variability of a measure within each person, or within each day within each person). Typical code might look like the snippet below, which calculates within-day variability for some variables, then takes the mean of the within-day variability per person and joins it back to the original data.
output <- group_by(mydata, id, day) %>%
  mutate_at(vars(angr, sad, guil, anx, hap), funs(sd(., na.rm = TRUE))) %>%
  ungroup() %>%
  group_by(id) %>%
  summarize_at(vars(angr, sad, guil, anx, hap), funs(var_day_mean = mean(., na.rm = TRUE))) %>%
  left_join(mydata, .)
What I want is to save this as a function so that, instead of typing out angr, sad, guil, anx, hap many times over, I can call this code (and slight variations on it saved as different functions) on a character vector of variable names. So the desired functionality is:
vars <- c('angr', 'sad', 'guil', 'anx', 'hap')
output <- myfunc(vars)
Where myfunc performs the piped operations above.
I'm aware that there is a vignette on non-standard evaluation in dplyr, but it's very limited and doesn't cover mutate or most of what I need to do for this use case, so I'd appreciate any insight.
Reproducible example - what I desire is essentially that the below code work, but currently the dplyr pipe cannot take vars as a character vector the way I have input it.
Edit: I was mistaken - the below code does work, and dplyr can function in this way (and can also take character vectors to group_by, making this easy to program with). I leave the code below as a (working) reference.
data <- data.frame('ID' = rep(1:10, each = 10),
                   'day' = rep(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 10),
                   'anx' = rnorm(100), 'sad' = rnorm(100), 'hap' = rnorm(100))
vars = c('anx', 'sad', 'hap')
out <- group_by(data, ID, day) %>%
  mutate_at(vars, funs(sd(., na.rm = TRUE)))
With mutate_at you can simply supply the column names as a character vector:
mtcars %>% mutate_at(c("mpg", "hp"), funs(mean))
This should do the trick.
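Building on that, here is a minimal sketch of the requested myfunc (the function and argument names are hypothetical). It uses the _at verbs and group_by_at, which accept character vectors directly, run against the reproducible data above:

library(dplyr)

myfunc <- function(data, vars, groups = c('ID', 'day')) {
  data %>%
    group_by_at(groups) %>%
    mutate_at(vars, ~ sd(., na.rm = TRUE)) %>%
    ungroup()
}

output <- myfunc(data, c('anx', 'sad', 'hap'))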
Related
I have a dataframe with millions of rows and tens of columns to which I need to apply a rowwise operation. My solution below works using dplyr, but I hope a switch to data.table will speed things up. Any help converting the code below to a data.table version would be appreciated.
library(tidyverse)
library(trend)
df = structure(list(id = 1:2, var = c(3L, 9L), col1_x = c("[(1,2,3)]",
"[(100,90,80,70,60,50,40,30,20)]"), col2_x = c("[(2,4,6)]", "[(100,50,25,12,6,3,1,1,1)]"
)), class = "data.frame", row.names = c(NA, -2L))
df = df %>%
  mutate(across(ends_with("x"), ~ gsub("[][()]", "", .)))
x_cols = df %>%
  select(ends_with("x")) %>%
  names()
df = df %>%
  rowwise() %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ","))))$estimates[[1]]))) %>%
  ungroup()
While what #Ritchie Sacramento wrote is absolutely true, here's the information you asked for.
First, I want to start with set and :=. When you see the keyword set (which can just be part of a function name) or the := operator, you've told data.table not to make a copy of the data. Without any assignment (that pesky = or <-), you've changed the data.table in place. This is one of the key ways this package avoids wasting memory.
Keep in mind that the environment pane in RStudio updates when it registers an assignment operator (= or <-) creating something new. Since a modify-in-place involves no assignment, the environment pane may show stale information. You can use the refresh icon (top right of the pane), or you can print the object to the console to check. As soon as you assign anything that the pane recognizes, everything in the pane is updated.
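For instance, here is a minimal illustration of modification by reference: no assignment anywhere, yet dt changes.

library(data.table)
dt <- data.table(x = 1:3)
dt[, y := x * 2]  # adds column y in place; no <- or = used
dt                # printing shows the new column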
Change a data frame to a data.table. (Notice the keyword: set!) Both of these do the same thing. However, the second copies everything in memory and builds it again. (Reusing the same name does not prevent the copy.)
setDT(df)
df <- data.table(df)
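If you want to see the difference, data.table::address() shows an object's memory address. A quick sketch with a throwaway frame dd (the addresses themselves will vary on your machine):

library(data.table)
dd <- data.frame(a = 1:3)
address(dd)            # some address, e.g. "0x55..."
setDT(dd)              # converted in place ...
address(dd)            # ... same address: no copy
dd2 <- data.table(dd)  # a full copy
address(dd2)           # a different address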
I'm not going to start with your first code blurb. I'm starting with the name extraction.
You wrote:
x_cols = df %>%
  select(ends_with("x")) %>%
  names()
# [1] "col1_x" "col2_x"
There are many ways to get this information. This is what I did. Note that this doesn't really have anything to do with data.table. I just used base R here. You could use a data frame the same way.
xcols <- names(df)[endsWith(names(df), 'x')]
# [1] "col1_x" "col2_x"
I'm going to use this object, xcols in the remaining examples. (Why keep reiterating the same declaration?)
You wrote the following to remove the brackets and parentheses.
df = df %>%
  mutate(across(ends_with("x"), ~ gsub("[][()]", "", .)))
# id var col1_x col2_x
# 1 1 3 1,2,3 2,4,6
# 2 2 9 100,90,80,70,60,50,40,30,20 100,50,25,12,6,3,1,1,1
There are several ways you could do this, whether in a data frame or a data.table. Here are a couple of methods you can use with data.table. These do the exact same thing as each other and your code.
Note the :=, which means the table changed.
In the first example, I used .SD and .SDcols. These are data.table's column-selection tools. You use .SD in place of column names when you want to work with more than one column, and .SDcols tells data.table which columns you're using. Wrapping the left-hand side in parentheses, (xcols), where xcols is my variable holding the column names, tells data.table to replace the data in those columns.
The difference between these two is how I used lapply, which doesn't have anything to do with data.table. If you need more info on this function, you can ask me, or you can look through the many Q & As out there already.
df[,
   (xcols) := lapply(.SD, function(k) gsub("[][()]", "", k)),
   .SDcols = xcols]

df[,
   (xcols) := lapply(.SD, gsub, pattern = "[][()]",
                     replacement = ""),
   .SDcols = xcols]
Your last request was based on this code.
df %>%
  rowwise() %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ","))))$estimates[[1]]))) %>%
  ungroup()
Since you used var to delineate when to apply this, I've used the by argument (as in dplyr's group_by). In terms of the other requirements, you'll see .SD and lapply again.
df[,
   (xcols) := lapply(.SD,
                     function(k) {
                       ifelse(var <= 4, 0,
                              sens.slope(as.numeric(strsplit(k, ",")[[1]]))$estimates[[1]])
                     }),
   by = var, .SDcols = xcols]
If you think about how these differ, you may find that, in a lot of ways, they aren't all that different. For example, in this last translation, you may see a similar approach in dplyr that I used.
df %>%
  group_by(var) %>%
  mutate(across(all_of(x_cols),
                ~ ifelse(var <= 4, 0,
                         sens.slope(as.numeric(unlist(strsplit(., ","))))$estimates[[1]])))
I have a large data frame (6 million rows, 20 columns) where data in one column corresponds to data in another column. I created a key that I now want to use to fix rows that have the wrong value. As a small example:
key = data.frame(animal = c('dog', 'cat', 'bird'),
sound = c('bark', 'meow', 'chirp'))
The data frame looks like this (minus the other columns of data):
df = data.frame(id = c(1, 2, 3, 4),
animal = c('dog', 'cat', 'bird', 'cat'),
sound = c('meow', 'bark', 'chirp', 'chirp'))
I swear I have done this before but can't remember my solution. Any ideas?
Using dplyr. If you want to fix sound according to animal,
library(dplyr)
df <- df %>%
  mutate(sound = sapply(animal, function(x) key %>% filter(animal == x) %>% pull(sound)))
should do the trick. If you want to fix animal according to sound:
df <- df %>%
  mutate(animal = sapply(sound, function(x) key %>% filter(sound == x) %>% pull(animal)))
I'm not sure about relative efficiency, but it's simpler to replace the partially incorrect column completely. It may not even cost you very much time (since you have to look up values anyway to determine that an animal/sound pair is mismatched).
library(tidyverse)
df %>% select(-sound) %>% full_join(key, by = "animal")
For 6 million rows, you may be better off using data.table. Converting df and key to data tables (as.data.table()) takes some up-front computational time but may speed up subsequent operations; you can use tidyverse operations on data.table objects without any further modification, but native data.table operations might be faster:
library(data.table)
dft <- as.data.table(df)
k <- as.data.table(key)
merge(dft[, -"sound"], k, by = "animal")
I haven't bothered to do any benchmarking (would need much larger examples to be able to measure any differences).
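If you do want to measure it, here is a rough sketch using the bench package (an assumption on my part; any timing tool works). Scale n toward your 6 million rows for meaningful numbers:

library(bench)
n <- 1e6
big <- data.frame(id = seq_len(n),
                  animal = sample(key$animal, n, replace = TRUE))
bigt <- as.data.table(big)
k <- as.data.table(key)
bench::mark(
  dplyr      = big %>% full_join(key, by = "animal"),
  data.table = merge(bigt, k, by = "animal"),
  check = FALSE  # the two results differ in row order and class
)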
In our data aggregation pipeline we have a bunch of conditioning variables that are used to define groups. To improve code readability and maintainability, we use a preconfigured list of symbols with tidy evaluation as per this illustrative snippet:
# this is the list of our condition variables
condition_vars <- rlang::exprs(var1, var2, var3)
# split the data into groups
data %>% group_by(!!!condition_vars) %>% summarize(...)
This works great but I can't figure out what would be the elegant way to use this in <tidy-select> context, e.g. for something like nest. The problem is that new nest() wants something like nest(data = c(var1, var2, var3)) and not nest(var1, var2, var3), so nest(!!!condition_vars) will give me a warning.
The best I could come up with is
df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
vars <- exprs(y, z)
nest(df, data = !!call2("c", !!!vars))
but surely there is a better way...
You can do nest(df, data = c(!!!vars)).
But nowadays, if the expressions are simple column names, I would store them in a character vector. You can supply the character vectors with all_of() in selection contexts. In action verbs like mutate() or group_by(), use across() to create a selection context where you can use all_of() (and other features like starts_with()).
cols <- c("cyl", "am")
mtcars %>% group_by(across(all_of(cols)))
mtcars %>% nest(data = all_of(cols))
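For example, with the df from the question, storing the selection as strings works in both contexts (a small runnable sketch):

library(dplyr)
library(tidyr)

df <- tibble(x = c(1, 1, 1, 2, 2, 3), y = 1:6, z = 6:1)
vars <- c("y", "z")  # plain strings instead of rlang::exprs()

df %>% group_by(across(all_of(vars))) %>% summarize(n = n(), .groups = "drop")
df %>% nest(data = all_of(vars))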
Let's say I want to calculate the past-7-days ratio between dep_delay and arr_delay for flights in nycflights13. I tried the following, but as soon as I put any function from zoo in the pipeline it seems to completely ungroup the data.
library(tidyverse)
library(nycflights13)
library(zoo)
delay_rate <- flights %>%
  group_by(year, month, day) %>%
  summarize(delay_rate =
              (rollsumr(flights$dep_delay, k = 7, fill = NA)) /
              (rollsumr(flights$arr_delay, k = 7, fill = NA)))
There are several problems:
By writing flights$, the code is telling it to override the grouping and use the original ungrouped vector. Remove flights$.
summarize is used when one row per group is desired, but here it appears we want a result with the same number of rows as the input, so use mutate rather than summarize.
There are unneeded parentheses here; while they are not wrong, they make the code harder to read. Extra parentheses are a good idea when an expression is potentially ambiguous or relies on precedence rules the reader may have to look up, but that is not the situation here.
ungroup at the end so we are not left with a grouped data frame.
dplyr clobbers lag and filter from the stats package (attached by default), so it will conflict with many other packages. Always exclude these in the library statement. This does not affect the code here, since neither of those is used, but as a precaution I always do that.
It seems unnecessary to load all of the tidyverse when the code only uses dplyr and its dependencies.
library(dplyr, exclude = c("lag", "filter"))
library(nycflights13)
library(zoo)
delay_rate <- flights %>%
group_by(year, month, day) %>%
mutate(delay_rate = rollsumr(dep_delay, k = 7, fill = NA) /
rollsumr(arr_delay, k = 7, fill = NA)) %>%
ungroup
I'm trying to calculate rolling correlations with a five year window based on daily stock data. My dataframe test consists of 20 columns, with "logRet3" being located in column #17 and "logMarRet3" in #18. I want to calculate the correlation of these two return measures.
What makes it difficult is the fact that I want the rolling correlation to be grouped by my share indicator "PERMNO" in column #1. By that I mean that the rolling correlation "restarts" whenever the time-series data of a particular stock ends.
Through research I came up with the following code, using the dplyr, zoo and magrittr packages:
test <- test %>%
  group_by(PERMNO) %>%
  mutate(CorSecMar = zoo::rollapply(test, width = 1255,
                                    function(x) cor(x[, logRet3], x[, logMarRet3]),
                                    fill = NA, align = "right"))
However, when I run this code, I get the following error:
Error in x[,logMarRet3]: Incorrect number of dimensions
Being a newbie, I tried adjusting the code by deleting the comma:
test <- test %>%
  group_by(PERMNO) %>%
  mutate(CorSecMar = zoo::rollapply(test, width = 1255,
                                    function(x) cor(x[logRet3], x[logMarRet3]),
                                    fill = NA, align = "right"))
resulting in the following error (translated to English):
Error in x[logMarRet3]: Only zeros are allowed to be mixed with negative indices
Any help on how to fix these errors or alternative ways of calculating the rolling correlation by group would be greatly appreciated.
EDIT: Thanks to G. Grothendieck for pointing out some flaws in my question. I'm referring to his answer for reproducible input and will keep that in mind for further posts.
There are several problems:
rollapply applies to each column separately unless by.column = FALSE is used.
using test within group_by will not cause test to be subsetted. It will refer to the entire dataset. Use individual column names instead.
the column names in the code in the question must have quotes around them; otherwise the code is saying that there are variables of those names which contain the column names.
when posting to SO you need to reduce your problem to a complete reproducible example and post that. I have done it this time for you in the Note at the end.
With reference to the Note, use this code:
library(dplyr)
library(zoo)
mycor <- function(x) cor(x[, 1], x[, 2])
DF %>%
  group_by(stock) %>%
  mutate(Cor = rollapplyr(cbind(a, b), 4, mycor, by.column = FALSE, fill = NA)) %>%
  ungroup
or this code, which uses only zoo (mycor is from above):
library(zoo)
n <- nrow(DF)
roll <- function(i) rollapplyr(DF[i, c("a", "b")], 4, mycor, by.column = FALSE, fill = NA)
transform(DF, Cor = ave(1:n, stock, FUN = roll))
Note
The input in reproducible form is:
DF <- data.frame(stock = rep(LETTERS[1:2], each = 6), a = 1:6, b = (1:6)^3)