I'm working with a data frame and want to calculate the mean age at which baseball players make their debut.
I can get the answer, but I'm confused about why I get different outputs from what look like two ways of doing the same thing.
Firstly, when I run the below code I get the correct mean:
mean(as.numeric(players$debut_age)/365, na.rm=TRUE)
But when I reorganize this as a pipe, it instead only prints the vector of days in debut_age:
players$debut_age %>% as.numeric()/365 %>% mean(na.rm=TRUE)
I'm sure there is something simple I'm missing, but I would like to know why these don't produce the same result.
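The pipe binds more tightly than /, so your line is parsed as (players$debut_age %>% as.numeric()) / (365 %>% mean(na.rm=TRUE)). Only 365 is piped into mean(), and the result is the whole vector divided by 365. To keep the division inside the pipeline, we can use divide_by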
library(dplyr)
players$debut_age %>%
as.numeric() %>%
magrittr::divide_by(365) %>%
mean(na.rm = TRUE)
Or wrap as.numeric and the division in a {} block, using . to refer to the piped value
players$debut_age %>%
{as.numeric(.)/365} %>%
mean(na.rm=TRUE)
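A quick check on made-up data shows how R groups the original expression, given that %>% has higher precedence than /. This is only a sketch: debut_days is a hypothetical stand-in for players$debut_age.

```r
library(magrittr)

# Hypothetical stand-in for players$debut_age: ages in days, with one NA
debut_days <- c(7300, 8030, NA, 9125)

# %>% binds tighter than /, so this is parsed as
# (debut_days %>% as.numeric()) / (365 %>% mean(na.rm = TRUE))
piped <- debut_days %>% as.numeric() / 365 %>% mean(na.rm = TRUE)

# 365 %>% mean(na.rm = TRUE) is just 365, so the whole vector comes back in years
same_as_plain_division <- identical(piped, as.numeric(debut_days) / 365)

# Parenthesising the division restores the intended order
fixed <- (debut_days %>% as.numeric() / 365) %>% mean(na.rm = TRUE)
same_as_nested_call <- identical(fixed, mean(as.numeric(debut_days) / 365, na.rm = TRUE))
```

Both same_as_plain_division and same_as_nested_call should be TRUE, which matches the behaviour described in the question.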
I am trying to find the quickest and most effective way to build a table using a for loop (or map in purrr) in R.
I have 15,881 values which I am trying to loop over. For this example, assume the values are the integers from 1 to 15,881, stored in this variable:
values <- c(1:15881)
I am then trying to filter an existing dataframe where a column matches a value and then perform some data cleaning; the output of this is a single dataframe. For clarity, my process is the following:
Assume in this situation that I have chosen a single value from the values object, e.g. value = values[1]
So then for a single value I have the following:
df <- df_to_filter %>%
filter(code == value) %>%
group_by(code, country) %>%
group_split() %>%
purrr::map_dfr(some_other_function) %>%
filter(!is.na(country))
The above code works perfectly fine when I run it for a single value. The output is a desired dataframe. This process takes around 0.7 seconds for a single value.
However, I am trying to append the results of this output to an empty dataframe for each and every single value found in the variable values
So far I have tried the following:
For Loop approach
# empty dataframe to append values to
empty_df <- tibble()
for (value in values){
df <- df_to_filter %>%
filter(code == value) %>%
group_by(code, country) %>%
group_split() %>%
purrr::map_dfr(some_other_function) %>%
filter(!is.na(country))
empty_df <- bind_rows(empty_df, df)
}
However, the above is extremely slow. A quick calculation suggests it would take around 185 minutes (0.7 seconds per table x 15,881 tables / 60 seconds per minute = roughly 185.3 minutes), which is a huge amount of time to process a single dataframe.
Is there a faster way to do this than a for loop? I can't think of any way to improve the fundamentals of the above code, as it does the job well and 0.7 seconds to produce a single table seems fast to me, but 15,881 tables is obviously going to take a long time.
I tried using the purrr package along with data.table but the furthest I got was this:
combine_dfs <- function(value){
df <- df_to_filter %>%
filter(code == value) %>%
group_by(code, country) %>%
group_split() %>%
purrr::map_dfr(some_other_function) %>%
filter(!is.na(country))
df <- data.table(df)
rbindlist(list(df, empty_df))
}
Then running with map_df is this:
map_df(values, ~combine_dfs(.))
However, even the above is extremely slow and seems to take around the same time!
Any help is appreciated!
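Row-binding data frames inside a loop is inefficient irrespective of which library you use, because the accumulated result is copied on every iteration.
You have not provided any example data, but I think this should give the same result for your case, assuming values covers every code in df_to_filter (otherwise add filter(code %in% values) as the first step).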
library(dplyr)
df_to_filter %>%
group_split(code, country) %>%
purrr::map_dfr(some_other_function) %>%
filter(!is.na(country)) -> result
result
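As a rough illustration of the "split once, bind once" idea, here is a minimal sketch on toy data. The contents of df_to_filter and the body of some_other_function are made up for the example, since the question does not show them.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data standing in for df_to_filter
df_to_filter <- tibble(
  code    = c(1, 1, 2, 2, 3),
  country = c("SE", "SE", "NO", NA, "DK"),
  x       = c(10, 20, 30, 40, 50)
)
values <- 1:3

# Hypothetical placeholder for some_other_function: one summary row per group
some_other_function <- function(d) {
  summarise(d, code = first(code), country = first(country), total = sum(x))
}

# Build every piece first, without growing a data frame inside a loop
pieces <- df_to_filter %>%
  filter(code %in% values) %>%
  group_split(code, country) %>%
  map(some_other_function)

# Bind all pieces in a single call, then drop rows with a missing country
result <- bind_rows(pieces) %>%
  filter(!is.na(country))
```

The data is filtered and split a single time, and bind_rows() runs once on the full list instead of once per value.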
You really need to provide a reproducible example first. Otherwise we can't offer a complete solution and have nothing to compare the result against.
library(data.table)
library(dplyr)
# by= is dropped here: without a j expression it has no effect on the subset
setDT(df_to_filter)[code %in% values] %>%
group_split(code, country) %>%
purrr::map_dfr(some_other_function) %>%
filter(!is.na(country))
I obviously get an error with the code below, but I was hoping to summarise the same column by mean and median, and also count how many points are in each polygon, all within the same pipe. Any help would be great.
Nin_Sep_points_sf_joined <-
st_join(merged_ten_seven_shp, Nin_Sep_sf_3011) %>%
filter(!is.na(Employment_diff)) %>%
group_by(Kod) %>%
summarise(Count=mean(as.numeric(as.character(price)))), summarise(Count_tot=n()), summarise(Count=median(as.numeric(as.character(price))))
You can supply multiple arguments to summarise, separated by commas:
library(dplyr)
Nin_Sep_points_sf_joined <-
st_join(merged_ten_seven_shp, Nin_Sep_sf_3011) %>%
filter(!is.na(Employment_diff)) %>%
group_by(Kod) %>%
# each summary gets its own name so the median does not replace the mean
summarise(Mean_price=mean(as.numeric(as.character(price))),
Count_tot=n(),
Median_price=median(as.numeric(as.character(price))))
Note that you can even refer to the results of previous arguments in the next argument. So you could calculate SD based on Count_tot.
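A minimal sketch of referring to an earlier summary inside the same summarise() call. The toy data and the Price_sd and Price_se names are assumptions for illustration, not part of the original data set.

```r
library(dplyr)

# Hypothetical toy data: price stored as character, as in the question
toy <- tibble(
  Kod   = c("a", "a", "a", "b", "b"),
  price = c("10", "20", "30", "5", "15")
)

stats <- toy %>%
  group_by(Kod) %>%
  summarise(
    Count_tot = n(),
    Price_sd  = sd(as.numeric(price)),
    # Count_tot and Price_sd were defined just above and can be reused here
    Price_se  = Price_sd / sqrt(Count_tot)
  )
```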
I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write df %>%, but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
pull(height) %>%
mean(na.rm=TRUE) %>%
print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- 1:30
weight <- 1:30
df <- data.frame(height, weight)
The pipe operator works across most of the tidyverse (not just magrittr); pull, for instance, comes from dplyr. na.rm=TRUE is needed whenever the data may contain NA values: summary functions such as mean, sd, min and max return NA if any input is NA, unless they are told to drop the missing values.
df %>% pull(height) %>% mean(na.rm=TRUE) %>% print()
Unless your data is nested, you may not even need pull:
df %>% summarise(mean = mean(height, na.rm=TRUE))
Also, using summarise you can store the results in another data frame rather than just printing them, and pull them back out whenever you want. (The name height_summary avoids masking the base summary() function.)
df %>% summarise(meanHt = mean(height, na.rm=TRUE), sdHt = sd(height, na.rm=TRUE)) -> height_summary
height_summary[1]
height_summary[2]
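To see what na.rm changes, here is a tiny base R sketch with made-up heights (the values are hypothetical, chosen only to include one NA):

```r
# Hypothetical heights with one missing value
height <- c(170, NA, 180)

mean_with_na    <- mean(height)                # NA, because one input is missing
mean_without_na <- mean(height, na.rm = TRUE)  # 175, the NA is dropped first
max_without_na  <- max(height, na.rm = TRUE)   # 180
```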
I have a dataframe so when I try to calculate the mean of column A I just write
mean(df$A)
and it works fine.
But when I try to calculate mean of only part of the data frame I get an error saying it isn't a number or logical value
df$A %>% filter(A=="some value") %>% mean(df$A)
The type of A is double. I also tried to convert it to numeric using
df$A <- as.numeric(as.character(df$A))
but it didn't work.
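It would be best to provide an example of your column A.
However, just from looking at your question, the problem is in your magrittr/dplyr syntax: filter() expects a data frame, not the vector df$A, so pipe df itself and refer to A by name.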
base syntax:
mean(df$A[df$A == 'some value'])
dplyr with pipes:
df %>% filter(A==2) %>% summarise(., average = mean(A))
Careful with syntax and pipes, more info here.
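Try df %>% filter(A == some_value) %>% summarise(mean(A)), where some_value is the number you are filtering on.
Note that the mean will simply equal that value, because after the filter every remaining A is equal to it.
Also, mean() works fine with objects of class double, so the as.numeric conversion is not needed.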
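A small sketch of both points, on made-up data. Column B is a hypothetical extra column added so that the filter and the averaged column can differ.

```r
library(dplyr)

# Hypothetical data; column B is an assumption added for illustration
df <- tibble(A = c(1.5, 2, 3, 4), B = c("x", "y", "y", "z"))

# Pipe the data frame, not df$A: filter() needs a data frame to work on.
# Filtering on A itself makes the mean trivially equal to the filter value.
mean_filtered_on_A <- df %>% filter(A == 2) %>% summarise(average = mean(A))

# Filtering on a different column gives a less trivial mean
mean_filtered_on_B <- df %>% filter(B == "y") %>% summarise(average = mean(A))
```

Here mean_filtered_on_A$average is 2 and mean_filtered_on_B$average is 2.5.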
Sorry for a noob question, but I'm new to R and I need help explaining this.
I see instruction that: x %>% f(y) -> f(x , y)
This "Then" Symbol %>%, I don't get it. Can someone give me a dummy guide for this? I'm having a really hard time understanding this. I do understand it's something like this:
x <- 3, y <- 3x (Or I could be wrong here too). Can anyone help me out here and explain it to me in a really-really simple way? Thanks in advance!
P.S. Here was the example used, and I've no idea what it is
library(dplyr)
hourly_delay <- flights %>%
filter(!is.na(dep_delay)) %>%
group_by(date, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
Would you rather write:
library(magrittr)
sum(multiply_by(add(1:10, 10), 2))
or
1:10 %>% add(10) %>% multiply_by(2) %>% sum()
?
The intent is much clearer in the second example, and the two are fundamentally the same computation. It's usually easiest to think of the first expression (1:10) as defining some data object you're working on, with %>% then applying a set of operations sequentially to that data set. Reading it out loud,
Take the data 1:10, then add 10, then multiply by 2, then sum
the way we describe the operation in English is almost identical to how we write it with the pipe operator, which is what makes it such a nice tool.
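The rule from the question, x %>% f(y) becoming f(x, y), can be checked directly. This is a minimal sketch; the variable names are only for illustration.

```r
library(magrittr)

# x %>% f(y) is rewritten to f(x, y): the left-hand side becomes the first argument
rounded_piped  <- 3.14159 %>% round(2)
rounded_nested <- round(3.14159, 2)

# The nested call and the piped chain from above compute the same thing
nested <- sum(multiply_by(add(1:10, 10), 2))
piped  <- 1:10 %>% add(10) %>% multiply_by(2) %>% sum()

same_rounding <- identical(rounded_piped, rounded_nested)
same_sum      <- identical(nested, piped)
```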
With your example:
library(dplyr)
flights %>%
filter(!is.na(dep_delay)) %>%
group_by(date, hour) %>%
summarise(delay = mean(dep_delay), n = n()) %>%
filter(n > 10)
we are saying
Take the data set flights, then
Filter it so we only keep rows where `dep_delay` is not NA, then
Group the data frame by the `date`, `hour` columns, then
Summarize the data set (across grouping variables), with `delay`
as the mean of `dep_delay`, and `n` as the number of elements
in that 'group' (`n()`), then
Filter it so we only keep rows where there were more than 10
elements per group.
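The same steps can also be written without the pipe, with one intermediate result per "then". This is a sketch on a hypothetical miniature flights table, and the threshold is lowered from 10 to 1 only because the toy data is so small.

```r
library(dplyr)

# Hypothetical miniature stand-in for the flights data
flights <- tibble(
  date      = c("2013-01-01", "2013-01-01", "2013-01-01", "2013-01-02"),
  hour      = c(5, 5, 6, 5),
  dep_delay = c(10, 20, NA, 7)
)

# Same steps as the pipeline, one intermediate result per "then"
step1 <- filter(flights, !is.na(dep_delay))                   # drop missing delays
step2 <- group_by(step1, date, hour)                          # group by date and hour
step3 <- summarise(step2, delay = mean(dep_delay), n = n())   # one row per group

# Threshold lowered from 10 to 1 only because the toy data is tiny
hourly_delay <- filter(step3, n > 1)
```

Each step's output is the first argument of the next call, which is exactly what %>% does without the intermediate names.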