Calculating (something similar to) moving averages with grouped data in R? - r

Let's say I want to calculate the past-7-days ratio between dep_delay and arr_delay for flights in nycflights13. I tried the following, but as soon as I put any function from zoo in the pipeline it seems to completely ungroup the data.
library(tidyverse)
library(nycflights13)
library(zoo)
delay_rate <- flights %>%
group_by(year, month, day) %>%
summarize(delay_rate =
(rollsumr(flights$dep_delay, k = 7, fill = NA)) /
(rollsumr(flights$arr_delay, k = 7, fill = NA)
))

There are several problems:
Writing flights$ bypasses the grouping: it refers to the full column of the original, ungrouped data frame. Remove flights$.
summarize is used when one row per group is desired, but here we appear to want a result with the same number of rows as the input, so use mutate rather than summarize.
There are unneeded parentheses. They are not wrong, but they make the code harder to read. Extra parentheses are a good idea when an expression is potentially ambiguous or relies on precedence rules the reader may have to look up, but that is not the situation here.
Call ungroup at the end so we are not left with a grouped data frame.
dplyr masks lag and filter from the stats package that ships with base R, so it can conflict with many other packages that rely on them. I always exclude these two in the library statement as a precaution. This does not affect the code here since neither function is used.
It seems unnecessary to load all of the tidyverse when the code only uses dplyr and its dependencies.
library(dplyr, exclude = c("lag", "filter"))
library(nycflights13)
library(zoo)
delay_rate <- flights %>%
  group_by(year, month, day) %>%
  mutate(delay_rate = rollsumr(dep_delay, k = 7, fill = NA) /
           rollsumr(arr_delay, k = 7, fill = NA)) %>%
  ungroup()
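To see why the flights$ prefix defeats the grouping, here is a minimal sketch on made-up data (the toy data frame and its columns g and x are hypothetical, not part of nycflights13). A bare column name is evaluated once per group, while toy$x always refers to the whole column.

```r
library(dplyr)

# Hypothetical toy data: two groups of three values each.
toy <- data.frame(g = c("a", "a", "a", "b", "b", "b"),
                  x = c(1, 2, 3, 10, 20, 30))

res <- toy %>%
  group_by(g) %>%
  mutate(sum_in_group = sum(x),          # bare name: evaluated within each group
         sum_overall  = sum(toy$x)) %>%  # toy$x: the full column, grouping ignored
  ungroup()

res$sum_in_group  # 6 6 6 60 60 60
res$sum_overall   # 66 66 66 66 66 66
```

Note that with group_by(year, month, day), a rolling function runs over the rows within each day rather than across days, which may or may not be what "past 7 days" is meant to cover.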

Related

Lead and lag issue using dplyr

I have a data frame with 365 rows, one for each day of the calendar year. I am trying to shift the county name columns up by one row. The data frame doesn't contain any missing values.
I tried using the following code to shift it, but the resulting table has values that are all NA.
covid_shift <- covid_pivot %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
Does anyone know what might be the issue?
Since covid_pivot is grouped by date, and each of these groups has one row, the lead and lag functions return NA.
Try:
covid_shift <- covid_pivot %>%
ungroup() %>%
mutate(Maricopa = lag(Maricopa), Cook = lag(Cook), Harris = lag(Harris))
You might also consider using across():
covid_pivot %>%
ungroup() %>%
mutate(across(-date, ~lag(.x)))
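The effect of the grouping can be reproduced on a small stand-in for covid_pivot (the toy values below are hypothetical). The sketch also shows lead(), since lag() moves values down one row while lead() moves them up, which may be closer to what the question describes.

```r
library(dplyr)

# Hypothetical stand-in for covid_pivot: one row per date.
toy <- data.frame(date = as.Date("2020-01-01") + 0:3,
                  Cook = c(5, 7, 9, 11))

# Grouped by date: every group has a single row, so there is no previous row.
grouped <- toy %>%
  group_by(date) %>%
  mutate(Cook_lag = lag(Cook))

# Ungrouped: lag() and lead() see the whole column.
ungrouped <- toy %>%
  mutate(Cook_lag = lag(Cook), Cook_lead = lead(Cook))

grouped$Cook_lag     # NA NA NA NA
ungrouped$Cook_lag   # NA 5 7 9   (values move down one row)
ungrouped$Cook_lead  # 7 9 11 NA  (values move up one row)
```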

I need to use a loop in R but don't know where to start

I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 equal time categories (20% each) so that I can look at their reaction time in more detail.
I will just do this by hand but it will take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples but I'm afraid I don't have the skill. So by hand I would do it like I have below where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==1) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
Subj2 <- rtTrialsYA_s1 %>%
select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
filter(Subject==2) %>%
summarise(
testStart =
min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime)
) %>%
mutate(
testDuration = testEnd - testStart,
timeBin =
testDuration/5
)
I'm not positive that I understand your code, but this function can be called for any Subject value and then return the output:
myfunction <- function(subjectNumber) {
  Subj <- rtTrialsYA_s1 %>%
    select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
    filter(Subject == subjectNumber) %>%
    summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
    mutate(testDuration = testEnd - testStart) %>%
    mutate(timeBin = testDuration / 5)
  return(Subj)
}
Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
I think you're missing one piece and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)
df <- rtTrialsYA_s1 %>%
group_by(Subject) %>%
summarise(
testStart = min(RetRating.OnsetTime),
testEnd = max(RetRating.RTTime),
testDuration = testEnd - testStart,
timeBin = testDuration/5,
.groups = "drop"
)
There is no need to do separate mutate calls in your code, btw. Also, you can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you are defining, there is no real need to do a select statement before, either.
// update
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
pivot_longer(-Subject)
The code above takes all columns except Subject and pivots them into two columns: name, containing the former column name, and value, containing the former value.
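As a rough check of both steps, here is a sketch on a made-up stand-in for rtTrialsYA_s1 (the toy rows and times are hypothetical; only the column names come from the question).

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for rtTrialsYA_s1: two subjects with different row counts.
toy <- data.frame(
  Subject             = c(1, 1, 1, 2, 2),
  RetRating.OnsetTime = c(100, 400, 900, 50, 300),
  RetRating.RTTime    = c(350, 800, 1100, 250, 550)
)

# One row per subject; later columns can use the ones defined just before them.
per_subject <- toy %>%
  group_by(Subject) %>%
  summarise(
    testStart    = min(RetRating.OnsetTime),
    testEnd      = max(RetRating.RTTime),
    testDuration = testEnd - testStart,
    timeBin      = testDuration / 5,
    .groups = "drop"
  )

# Long format: one row per subject and measure.
per_subject_long <- pivot_longer(per_subject, -Subject)

per_subject$timeBin      # 200 100
names(per_subject_long)  # "Subject" "name" "value"
nrow(per_subject_long)   # 8 (2 subjects x 4 measures)
```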

How can you run a piece of code on different subsets in R with dplyr

I have a dataset that I have to modify.
I want to run a different piece of code for different subsets in the dplyr pipeline.
The data handles glycaemia values in patients in ICU.
It looks like this:
dfavg <- df %>%
group_by(patientid) %>%
mutate(icuoutcome = ifelse(row_number() != n(), 0, icuoutcome)) %>%
mutate(strata = days) %>%
mutate(survtime = max(days)-days) %>%
group_by(days, .add = TRUE) %>%
mutate(gly_mean = mean(glycaemia))
However, I need this code only for the first 5 days a patient is in the ICU.
After that, I need a different piece of code to run for days 6 to 15.
I tried using filter(days<=5) but then I lose all other data.
How can I make the code above run for days 1-5 and another piece of code run for days 6-15, all in the same dataset or the same pipeline?
I also thought of unfiltering, but I don't think that's possible, and of using group_by with a condition (like days<=5).
Thank you in advance
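We can split the data on a condition on 'days' and apply a list of corresponding functions to the resulting list. Here funslist is a list of two functions that you define yourself, ordered to match the output of split(): the function for the days <= 5 subset (FALSE) first, then the one for the days > 5 subset (TRUE).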
library(dplyr)
library(purrr)
split(df, df$days > 5) %>%
map2(funslist, ~ .y(.x))
Using a small reproducible example
data(mtcars)
split(mtcars$mpg, mtcars$vs) %>%
map2_dbl(list(mean, max), ~ .y(.x))
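A possible way to fill in funslist and put the pieces back into one data frame is sketched below. The toy data, the gly_stat column and the two functions are hypothetical placeholders for whatever code should run on each range of days.

```r
library(dplyr)
library(purrr)

# Hypothetical ICU-style data: one patient, glycaemia on days 1 to 8.
toy <- data.frame(patientid = 1,
                  days      = 1:8,
                  glycaemia = c(5, 6, 7, 8, 9, 10, 11, 12))

# One function per subset, in the order split() returns them:
# "FALSE" (days <= 5) first, then "TRUE" (days > 5).
funslist <- list(
  function(d) mutate(d, gly_stat = mean(glycaemia)),  # placeholder for the days 1-5 code
  function(d) mutate(d, gly_stat = max(glycaemia))    # placeholder for the later-days code
)

result <- split(toy, toy$days > 5) %>%
  map2(funslist, ~ .y(.x)) %>%  # apply each function to its own subset
  bind_rows()                   # recombine so no rows are lost

result$gly_stat  # 7 7 7 7 7 12 12 12
```

Because the subsets are bound back together, all rows survive, unlike with filter(days <= 5).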

Programmatically choosing which variables to put into dplyr pipe

I'm working with datasets (from smartphone experience sampling) where I very frequently have to perform grouped operations (such as finding the variability of a measure within each person, or within each day within each person, etc). Typical code might look like the code below, which calculates within-day variability for some variables, then takes the mean of the within-day variability and joins it to the original data.
output <- group_by(mydata, id, day) %>%
mutate_at(vars(angr, sad, guil, anx, hap), funs(sd(., na.rm = TRUE))) %>%
ungroup() %>%
group_by(id) %>%
summarize_at(vars(angr, sad, guil, anx, hap), funs('var_day_mean' = mean(., na.rm = TRUE))) %>%
join(mydata, .)
What I want to do is be able to save this as a function so that instead of having to type out angr, sad, guil, anx, hap many times over, I can call this code (and slight variations on it saved as different functions) on a vector of variable names in a string. So the desired functionality is:
vars <- c('angr', 'sad', 'guil', 'anx', 'hap')
output <- myfunc(vars)
Where myfunc performs the piped operations above.
I'm aware that there is a vignette for non standard evaluation using dplyr but it's very limited and doesn't cover mutate or most of what I need to do with this use case, so would appreciate any insight.
Reproducible example - what I desire is essentially that the below code work, but currently the dplyr pipe cannot take vars as a character vector the way I have input it.
Edit: I was mistaken - the below code does work, and dplyr can function in this way (and can also take character vectors to group_by, making this easy to program with). I leave the code below as a (working) reference.
data <- data.frame('ID' = rep(1:10, each = 10),
'day' = rep(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2), 10),
'anx' = rnorm(100), 'sad' = rnorm(100), 'hap' = rnorm(100))
vars = c('anx', 'sad', 'hap')
out <- group_by(data, ID, day) %>%
mutate_at(vars, funs(sd(., na.rm = TRUE)))
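With mutate_at you can simply supply the names of the columns as a character vector:
mtcars %>% mutate_at(c("mpg", "hp"), funs(mean))
This should do the trick. Note that funs() is deprecated as of dplyr 0.8.0; passing the function directly, as in mutate_at(c("mpg", "hp"), mean), gives the same result.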
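In dplyr 1.0.0 and later, across() supersedes the _at verbs, and all_of() selects columns from a character vector. A minimal sketch of the requested function under that assumption follows; the name within_day_sd and the fixed toy values are hypothetical, chosen so the result is predictable.

```r
library(dplyr)

# Same shape as the question's example, with fixed (hypothetical) values.
data <- data.frame(ID  = rep(1:2, each = 4),
                   day = rep(c(1, 1, 2, 2), 2),
                   anx = c(1, 3, 2, 2, 0, 4, 5, 5),
                   sad = c(2, 2, 1, 5, 3, 3, 0, 2))
cols <- c("anx", "sad")

# Replace each listed column by its within-person, within-day standard deviation.
within_day_sd <- function(df, cols) {
  df %>%
    group_by(ID, day) %>%
    mutate(across(all_of(cols), ~ sd(.x, na.rm = TRUE))) %>%
    ungroup()
}

out <- within_day_sd(data, cols)
out$anx[1:2]  # both sqrt(2), the sd of c(1, 3)
out$anx[3:4]  # both 0, the sd of c(2, 2)
```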

Understanding %>% operator [duplicate]

Sorry for a noob question, but I'm new to R and I need help explaining this.
I see instruction that: x %>% f(y) -> f(x , y)
This "Then" Symbol %>%, I don't get it. Can someone give me a dummy guide for this? I'm having a really hard time understanding this. I do understand it's something like this:
x <- 3, y <- 3x (Or I could be wrong here too). Can anyone help me out here and explain it to me in a really-really simple way? Thanks in advance!
P.S. Here was the example used, and I've no idea what it is
library(dplyr)
hourly_delay <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(date, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
Would you rather write:
library(magrittr)
sum(multiply_by(add(1:10, 10), 2))
or
1:10 %>% add(10) %>% multiply_by(2) %>% sum()
?
The intent is much clearer in the second example, and the two are fundamentally the same. It's usually easiest to think of the first expression (1:10) as defining some data object you're working on, with %>% then applying a set of operations sequentially to that data. Reading it out loud,
Take the data 1:10, then add 10, then multiply by 2, then sum
the way we describe the operation in English is almost identical to how we write it with the pipe operator, which is what makes it such a nice tool.
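The rule from the question, x %>% f(y) being the same as f(x, y), can be checked directly. This is a small sketch with made-up numbers; it only assumes magrittr is installed.

```r
library(magrittr)

x <- c(3.14159, 2.71828)

# The pipe inserts the left-hand side as the first argument of the call.
piped  <- x %>% round(2)  # evaluated as round(x, 2)
direct <- round(x, 2)
identical(piped, direct)  # TRUE

# The nested and the chained versions from above give the same value.
nested  <- sum(multiply_by(add(1:10, 10), 2))
chained <- 1:10 %>% add(10) %>% multiply_by(2) %>% sum()
identical(nested, chained)  # TRUE
chained                     # 310
```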
With your example:
library(dplyr)
flights %>%
filter(!is.na(dep_delay)) %>%
group_by(date, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
we are saying
Take the data set flights, then
Filter it so we only keep rows where `dep_delay` is not NA, then
Group the data frame by the `date`, `hour` columns, then
Summarize the data set (across grouping variables), with `delay`
as the mean of `dep_delay`, and `n` as the number of elements
in that 'group' (`n()`), then
Filter it so we only keep rows where there were more than 10
elements per group.
