Sorry for a noob question, but I'm new to R and I need help understanding this.
I see the instruction that x %>% f(y) is equivalent to f(x, y).
I don't get this "then" symbol %>%. Can someone give me a dummy's guide for it? I'm having a really hard time understanding it. I do understand it's something like this:
x <- 3, y <- 3x (or I could be wrong here too). Can anyone help me out and explain it to me in a really, really simple way? Thanks in advance!
P.S. Here is the example that was used, and I have no idea what it does:
library(dplyr)
hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(date, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)
Would you rather write:
library(magrittr)
sum(multiply_by(add(1:10, 10), 2))
or
1:10 %>% add(10) %>% multiply_by(2) %>% sum()
?
The intent is made a lot clearer in the second example, and the two are fundamentally the same. It's usually easiest to think of the first expression (1:10) as defining some data object you're working on, with %>% then applying a sequence of operations to that object. Reading it out loud,
Take the data 1:10, then add 10, then multiply by 2, then sum
the way we describe the operation in English is almost identical to how we write it with the pipe operator, which is what makes it such a nice tool.
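As a minimal sketch of the equivalence (the values here are made up for illustration):

library(magrittr)

x <- c(1, 4, 9)
sqrt(x)       # 1 2 3
x %>% sqrt()  # identical: the left-hand side becomes the first argument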
With your example:
library(dplyr)
flights %>%
  filter(!is.na(dep_delay)) %>%
  group_by(date, hour) %>%
  summarise(delay = mean(dep_delay), n = n()) %>%
  filter(n > 10)
we are saying:

Take the data set flights, then
Filter it so we only keep rows where `dep_delay` is not NA, then
Group the data frame by the `date` and `hour` columns, then
Summarize the data set within each group, with `delay` as the mean of `dep_delay` and `n` as the number of rows in that group (`n()`), then
Filter it so we only keep rows where there were more than 10 rows per group.
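For comparison, here is the same pipeline written as nested calls. The nested version must be read inside-out, while the piped version reads top to bottom in the order the operations happen:

hourly_delay <- filter(
  summarise(
    group_by(
      filter(flights, !is.na(dep_delay)),
      date, hour
    ),
    delay = mean(dep_delay),
    n = n()
  ),
  n > 10
)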
I am trying to find the quickest and most effective way to produce a table using a for loop (or map in purrr) in R.
I have 15,881 values which I am trying to loop over, for this example assume the values are the numbers 1 to 15,881 incremented by 1, which is this variable:
values <- c(1:15881)
I am then trying to filter an existing dataframe where a column matches a value, and then perform some data cleaning process. The output of this is a single dataframe. For clarity, my process is the following:
Assume in this situation that I have chosen a single value from the values object e.g. value = values[1]
So then for a single value I have the following:
df <- df_to_filter %>%
  filter(code == value) %>%
  group_by(code, country) %>%
  group_split() %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country))
The above code works perfectly fine when I run it for a single value. The output is a desired dataframe. This process takes around 0.7 seconds for a single value.
However, I am trying to append the results of this output to an empty dataframe for each and every value found in the variable values.
So far I have tried the following:
For Loop approach
# empty dataframe to append values to
empty_df <- tibble()

for (value in values){
  df <- df_to_filter %>%
    filter(code == value) %>%
    group_by(code, country) %>%
    group_split() %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
  empty_df <- bind_rows(empty_df, df)
}
However, the above is extremely slow. A quick calculation suggests it would take around 185 minutes ((0.7 seconds per table × 15,881 tables) / 60 seconds per minute ≈ 185.3 minutes), which is a huge amount of time just to build a dataframe.
Is there a quicker way to speed up the above process than a for loop? I can't think of any way to improve the fundamentals of the code: it does the job well, and 0.7 seconds to produce a single table seems fast to me, but 15,881 tables is obviously going to take a long time.
I tried using the purrr package along with data.table but the furthest I got was this:
combine_dfs <- function(value){
  df <- df_to_filter %>%
    filter(code == value) %>%
    group_by(code, country) %>%
    group_split() %>%
    purrr::map_dfr(some_other_function) %>%
    filter(!is.na(country))
  df <- data.table(df)
  rbindlist(list(df, empty_df))
}
Then running it with map_df:
map_df(values, ~combine_dfs(.))
However, even the above is extremely slow and seems to take around the same time!
Any help is appreciated!
Row-binding dataframes in a loop is inefficient, irrespective of which library you use.
You have not provided any example data, but I think for your case this should work the same:
library(dplyr)

df_to_filter %>%
  group_split(code, country) %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country)) -> result

result
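As a general pattern, if you ever do need to assemble pieces in a loop, collect them in a list and bind once at the end rather than growing a dataframe one iteration at a time. A sketch, where make_piece is a hypothetical placeholder for whatever produces each piece:

# make_piece is a placeholder; build all pieces first,
# then bind once instead of 15,881 times
pieces <- lapply(values, make_piece)
result <- dplyr::bind_rows(pieces)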
You really need to provide a reproducible example first; otherwise we can't give a complete solution, and we have nothing to compare the result against. That said, filtering all the values at once with %in% and then splitting by group should remove the per-value loop entirely:
library(data.table)
library(dplyr)

setDT(df_to_filter)[code %in% values] %>%
  group_split(code, country) %>%
  purrr::map_dfr(some_other_function) %>%
  filter(!is.na(country))
Let's say I want to calculate the past-7-days ratio between dep_delay and arr_delay for flights in nycflights13. I tried the following, but as soon as I put any function from zoo in the pipeline it seems to completely ungroup the data.
library(tidyverse)
library(nycflights13)
library(zoo)
delay_rate <- flights %>%
  group_by(year, month, day) %>%
  summarize(delay_rate =
    (rollsumr(flights$dep_delay, k = 7, fill = NA)) /
    (rollsumr(flights$arr_delay, k = 7, fill = NA))
  )
There are several problems:

1. By writing flights$ the code is telling it to override the grouping and use the original ungrouped vector. Remove flights$.
2. summarize is used when one row per group is desired, but here it appears we want a result with the same number of rows as the input, so use mutate rather than summarize.
3. There are unneeded parentheses; while they are not wrong, they make the code harder to read. Extra parentheses are a good idea when an expression is potentially ambiguous, or relies on precedence rules the reader may have to look up, but that is not the situation here.
4. ungroup at the end so we are not left with a grouped data frame.
5. dplyr clobbers lag and filter in base R, so it will conflict with many other packages. Always exclude these in the library statement. This does not affect the code here, since neither is used, but I always do it as a precaution.
6. It seems unnecessary to load all of the tidyverse when the code only uses dplyr and its dependencies.
library(dplyr, exclude = c("lag", "filter"))
library(nycflights13)
library(zoo)

delay_rate <- flights %>%
  group_by(year, month, day) %>%
  mutate(delay_rate = rollsumr(dep_delay, k = 7, fill = NA) /
                      rollsumr(arr_delay, k = 7, fill = NA)) %>%
  ungroup
I have a calculation that I have to perform for 23 people (each person has a varying number of rows, so it is difficult to do in Excel). What I'd like to do is take the total time each person took to complete a test and divide it into 5 time categories (20% each) so that I can look at their reaction times in more detail.
I could just do this by hand, but it would take quite a while because they have 8 sets of data each. I'm hoping someone can show me the best way to use a loop or automate this process even just a little. I have tried to understand the examples I've found, but I'm afraid I don't have the skill. By hand, I would do it like I have below, where I just filter by each subject.
I started by selecting the relevant columns, then filtered by subject so that I could calculate the time they started and the time they finished, and used that to create a variable (testDuration) that could be used to create the 20% proportions of RTs that I'm after. I have no idea how to get the individual subjects' test start, end, duration and timeBin sizes to appear in one column. Any help very gratefully received.
Subj1 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 1) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )

Subj2 <- rtTrialsYA_s1 %>%
  select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
  filter(Subject == 2) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime)
  ) %>%
  mutate(
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5
  )
I'm not positive that I understand your code, but this function can be called for any Subject value and will return the output:
myfunction <- function(subjectNumber){
  Subj <- rtTrialsYA_s1 %>%
    select(Subject, RetRating.OnsetTime, RetRating.RT, RetRating.RTTime) %>%
    filter(Subject == subjectNumber) %>%
    summarise(testStart = min(RetRating.OnsetTime), testEnd = max(RetRating.RTTime)) %>%
    mutate(testDuration = testEnd - testStart) %>%
    mutate(timeBin = testDuration / 5)
  return(Subj)
}

Subj1 <- myfunction(1)
Subj2 <- myfunction(2)
To loop through this, I'll need to know what your data and the desired output look like.
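That said, if the subjects are simply numbered 1 to 23 (an assumption on my part; your IDs may differ), looping the function is a one-liner:

library(dplyr)

# call myfunction for each subject and stack the results into one frame
all_subjects <- lapply(1:23, myfunction) %>%
  bind_rows()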
I think you're missing one piece, and that is simply dplyr::group_by.
You can use it as follows to break your dataset into groups, each containing the observations belonging to only one subject, and then summarise on those groups with whatever it is you want to analyze.
library(dplyr)

df <- rtTrialsYA_s1 %>%
  group_by(Subject) %>%
  summarise(
    testStart = min(RetRating.OnsetTime),
    testEnd = max(RetRating.RTTime),
    testDuration = testEnd - testStart,
    timeBin = testDuration / 5,
    .groups = "drop"
  )
There is no need for the separate mutate calls in your code, by the way. You can continue to do column calculations right within summarise, as long as the result vectors have the same length as your aggregated columns.
And since summarise retains only the grouping columns and whatever you define, there is no real need for the select statement beforehand, either.
Update:
You say you need all your calculated columns to appear within one single column. For that you can use tidyr::pivot_longer. Using the df we calculated above:
library(tidyr)
df_long <- df %>%
  pivot_longer(-Subject)
The above takes all columns except Subject and pivots them into two columns: name, containing the former column name, and value, containing the former value.
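For illustration, a peek at the reshaped frame (name and value are tidyr's default column names):

head(df_long, 4)
# columns: Subject, name, value
# each subject's four calculated measures (testStart, testEnd,
# testDuration, timeBin) now sit in the single `value` column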
I'm working with a data frame and looking to calculate the mean age at which players debut in baseball.
I can get the answer; however, I am a bit confused why I get different outputs when doing the same thing two ways.
Firstly, when I run the below code I get the correct mean:
mean(as.numeric(players$debut_age)/365, na.rm=TRUE)
But when I reorganize this as a pipe, it instead only prints the vector of days in debut_age:
players$debut_age %>% as.numeric()/365 %>% mean(na.rm=TRUE)
I'm sure there is something simple I'm missing, but I would like to know why these don't produce the same result.
This happens because %>% binds more tightly than /, so the pipeline parses as (players$debut_age %>% as.numeric()) / (365 %>% mean(na.rm = TRUE)), i.e. the whole vector divided by 365, and mean() never touches your data. One fix is magrittr's divide_by, which keeps the division inside the pipe:
library(dplyr)
players$debut_age %>%
  as.numeric() %>%
  magrittr::divide_by(365) %>%
  mean(na.rm = TRUE)
Or place the as.numeric() call and the division inside a {} block, using the dot placeholder for the piped value:
players$debut_age %>%
  {as.numeric(.) / 365} %>%
  mean(na.rm = TRUE)
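Or, since the issue is just precedence, parenthesize the division before piping:

(as.numeric(players$debut_age) / 365) %>%
  mean(na.rm = TRUE)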
Upfront apology if this has been asked; I have been searching all day and have not found an answer I can apply to my problem.
I am trying to solve this issue using dplyr (and co.) because my previous method (for loops) was too inefficient. I have a dataset of event times, at sites, that are in groups. I want to summarize the number (and proportion) of events that occur in a moving window along a sequence.
# Example data
set.seed(1)
sites = rep(letters[1:10],10)
groups = c('red','blue','green','yellow')
times = round(runif(length(sites),1,100))
timePeriod = seq(1,100)
# Example dataframe
df = data.frame(site = sites,
                group = rep(groups, length(sites)/length(groups)),
                time = times)
This is my attempt to summarize the number of sites from each group that contain a time (event) within a given moving window of time.
The goal is to move through each element of the vector timePeriod and summarize how many events in each group occurred at timePeriod[i] +/- half-window. Ultimately storing them in, e.g., a dataframe with a column for each group, and a row for each time step, is ideal.
df %>%
  filter(time > timePeriod[i] - 25 & time < timePeriod[i] + 25) %>%
  group_by(group) %>%
  summarise(count = n())
How can I do this without looping through my sequence of time and storing the summary table for each group individually? Thanks!
Combining lapply and dplyr, you can do the following, which is close to what you had worked out so far:
lapply(timePeriod, function(i){
  df %>%
    filter(time > (i - 25) & time < (i + 25)) %>%
    group_by(group) %>%
    summarise(count = n()) %>%
    mutate(step = i)
}) %>%
  bind_rows()
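If you want one column per group and one row per time step, as described in the question, you can reshape the combined result. A sketch, assuming the chain above has been assigned to result (a name introduced here for illustration) and that tidyr is available:

library(tidyr)

# one row per step, one column per group; steps with no events get 0
result %>%
  pivot_wider(names_from = group, values_from = count, values_fill = 0)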