Using summarise alters the result of my computation - r

I have the following code, which does not pipe into summarise:
library(tidyverse)
library(nycflights13)
depArrDelay <- flights %>%
  filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
  group_by(dep_delay, arr_delay)
Now, running
cor(depArrDelay$dep_delay, depArrDelay$arr_delay)
yields 0.9148028, which is the correct value for my calculation.
Next I add %>% summarise(...), as shown below:
depArrDelay <- flights %>%
  filter_at(vars(c("dep_delay", "arr_delay", "distance")), all_vars(!is.na(.))) %>%
  group_by(dep_delay, arr_delay) %>%
  summarise(count=n())
Now cor(depArrDelay$dep_delay, depArrDelay$arr_delay) yields 0.9260394.
So the correlation has changed. Why is this happening? As far as I know, summarise should only throw away the columns that are not mentioned, and not alter any values. Have I missed something, and how can I stop summarise from altering the correlation?

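As already mentioned in the comments, summarise reduces the number of rows: it collapses each group to a single row, so after group_by(dep_delay, arr_delay) every distinct (dep_delay, arr_delay) pair appears only once, and cor is computed over the distinct pairs instead of over all flights. If you need the count without changing the number of rows, you can use add_count.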
library(nycflights13)
library(dplyr)
temp <- flights %>%
  filter_at(vars(c(dep_delay, arr_delay, distance)), all_vars(!is.na(.))) %>%
  add_count(dep_delay, arr_delay)
If you then check the correlation, you get the same value as before.
cor(temp$dep_delay, temp$arr_delay)
#[1] 0.9148027589
If the data has many columns and you need only a few of them for your analysis, you can pick the relevant ones with select.
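To see the effect on something small enough to check by hand, here is a minimal sketch on a made-up tibble. The names toy, x, y and pairs are hypothetical and not from the question, and the .groups argument assumes dplyr 1.0.0 or later. It shows that collapsing to distinct pairs changes the correlation, and that repeating each pair count times gives back the original value.

```r
library(dplyr)

# Toy data: the pair (1, 2) occurs three times, every other pair once.
toy <- tibble(
  x = c(1, 1, 1, 2, 3, 5),
  y = c(2, 2, 2, 1, 4, 3)
)

# Correlation over all six rows.
cor_all <- cor(toy$x, toy$y)

# summarise() keeps one row per distinct (x, y) pair: four rows remain.
pairs <- toy %>%
  group_by(x, y) %>%
  summarise(count = n(), .groups = "drop")

# Correlation over the distinct pairs only; duplicates no longer weigh in.
cor_unique <- cor(pairs$x, pairs$y)

# Repeating each pair `count` times restores the original correlation.
cor_weighted <- cor(rep(pairs$x, pairs$count), rep(pairs$y, pairs$count))

print(c(all = cor_all, unique = cor_unique, weighted = cor_weighted))
```

So if you want to keep the summarised table, you would have to carry the counts along as weights; add_count avoids that by leaving the rows untouched.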

Related

How to add a total distance column in 'flights' dataset? DPLYR, Group_by, Ungroup

I am working with 'flights' dataset from 'nycflights13' package in R.
I want to add a column that holds the total distance covered by each 'carrier' in 2013. I have computed the total distance per carrier and stored it in a new variable.
There are 16 carriers, so how do I bind 16 numbers to a data frame with many more rows?
carrier <- flights %>%
  group_by(carrier) %>%
  select(distance) %>%
  summarize(TotalDistance = sum(distance)) %>%
  arrange(desc(TotalDistance))
How can I add these per-carrier totals as a new column in the flights dataset?
Thank you for your time and effort here.
PS. I tried running a for loop, but it doesn't work. I am new to programming.
Use mutate instead:
flights %>%
  group_by(carrier) %>%
  mutate(TotalDistance = sum(distance)) %>%
  ungroup() -> carrier
We can also use left_join.
library(nycflights13)
data("flights")
library(dplyr)
flights %>%
  left_join(flights %>%
              group_by(carrier) %>%
              select(distance) %>%
              summarize(TotalDistance = sum(distance)) %>%
              arrange(desc(TotalDistance)),
            by = 'carrier') -> carrier
This will work even if you don't use arrange at the end.
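A small sketch comparing the two approaches on made-up data may help. The trips tibble and its values are hypothetical stand-ins for flights, so this only illustrates the pattern. Both routes should attach the same per-carrier total to every row.

```r
library(dplyr)

# Hypothetical stand-in for flights: three carriers, six trips.
trips <- tibble(
  carrier  = c("AA", "AA", "UA", "UA", "UA", "DL"),
  distance = c(100, 250, 300, 50, 150, 400)
)

# Route 1: grouped mutate() keeps every row and repeats the group total.
via_mutate <- trips %>%
  group_by(carrier) %>%
  mutate(TotalDistance = sum(distance)) %>%
  ungroup()

# Route 2: summarise to one row per carrier, then join the totals back.
totals <- trips %>%
  group_by(carrier) %>%
  summarise(TotalDistance = sum(distance))

via_join <- trips %>%
  left_join(totals, by = "carrier")

print(via_mutate)
print(via_join)
```

The mutate version is shorter; the join version is handy when the per-carrier table is also needed on its own.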

How to remove outliers in only one column after grouping by another column in R

I want to remove outliers from a variable MEASURE after grouping by TYPE. I tried the following code but it didn't work. I've searched, but I've only come across ways to remove outliers from a whole data frame or from a single column, not within groups.
df2 <- df %>%
  group_by(TYPE) %>%
  mutate(MEASURE_WITHOUT_OUTLIERS = remove_outliers(MEASURE))
You can use boxplot.stats to get the outlier values in each group and use filter to drop the matching rows.
library(dplyr)
df2 <- df %>%
  group_by(TYPE) %>%
  filter(!MEASURE %in% boxplot.stats(MEASURE)$out) %>%
  ungroup()
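Here is a minimal sketch of that on made-up data. The measurements tibble and its values are hypothetical; only the TYPE and MEASURE column names come from the question. The value 100 is an outlier within type A but an ordinary value within type B, so only the type A row should be dropped.

```r
library(dplyr)

# Hypothetical data: 100 is extreme for type A but typical for type B.
measurements <- tibble(
  TYPE    = rep(c("A", "B"), each = 5),
  MEASURE = c(1, 2, 3, 4, 100,
              90, 95, 100, 105, 110)
)

# boxplot.stats() is evaluated once per group, so the 1.5 * IQR rule
# is applied to each TYPE separately.
cleaned <- measurements %>%
  group_by(TYPE) %>%
  filter(!MEASURE %in% boxplot.stats(MEASURE)$out) %>%
  ungroup()

print(cleaned)
```

Note that boxplot.stats uses the boxplot convention (points beyond 1.5 times the hinge spread); if your remove_outliers function used a different rule, the rows removed may differ.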

Select last row within each group with dplyr is slow

I have the following R code. Essentially, I am asking R to arrange the dataset based on postcode and paon, then group them by id, and finally keep only the last row within each group. However, R requires more than 3 hours to do this.
I am not sure what I am doing wrong with my code since there is no for loop here.
epc2 is a vector with 324,368 rows.
epc3 <- epc2 %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  do(tail(., 1))
Thank you for any and all of your help.
How about:
mtcars %>%
  arrange(cyl) %>%
  group_by(cyl) %>%
  slice(n())
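Applied to the shape of the original problem, that could look like the sketch below. The epc_toy data is made up; only the column names id, postcode and paon are taken from the question. The likely reason do(tail(., 1)) is slow is that it runs an R function call on a separate data frame for every group, whereas slice works on row indices. slice_tail(n = 1) is an equivalent spelling, assuming dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical miniature of epc2.
epc_toy <- tibble(
  id       = c(1, 1, 2, 2, 2, 3),
  postcode = c("B", "A", "C", "A", "B", "A"),
  paon     = c(1, 2, 1, 3, 2, 5)
)

# Sort first, then keep the last row of each id group.
last_rows <- epc_toy %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  slice(n()) %>%
  ungroup()

# Same result with slice_tail(), assuming dplyr >= 1.0.0.
last_rows_tail <- epc_toy %>%
  arrange(postcode, paon) %>%
  group_by(id) %>%
  slice_tail(n = 1) %>%
  ungroup()

print(last_rows)
```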

Arrange a grouped_df by group variable not working

I have a data.frame that contains client names, years, and several revenue numbers from each year.
df <- data.frame(client = rep(c("Client A", "Client B", "Client C"), 3),
                 year = rep(c(2014, 2013, 2012), each = 3),
                 rev = rep(c(10, 20, 30), 3)
)
I want to end up with a data.frame that aggregates the revenue by client and year. I then want to sort the data.frame by year then by descending revenue.
library(dplyr)
df1 <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))
However, when using the code above the arrange() function doesn't change the order of the grouped data.frame at all. When I run the below code and coerce to a normal data.frame it works.
library(dplyr)
df1 <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  data.frame() %>%
  arrange(year, desc(tot))
Am I missing something or will I need to do this every time when trying to arrange a grouped_df by a grouped variable?
R Version: 3.1.1
dplyr package version: 0.3.0.2
EDIT 11/13/2017:
As noted by lucacerone, beginning with dplyr 0.5, arrange once again ignores groups when sorting. So my original code now works in the way I initially expected it would.
arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
Try switching the order of your group_by statement:
df %>%
  group_by(year, client) %>%
  summarise(tot = sum(rev)) %>%
  arrange(year, desc(tot))
I think arrange is ordering within groups; after summarize, the last group is dropped, so this means in your first example it's arranging rows within the client group. Switching the order to group_by(year, client) seems to fix it because the client group gets dropped after summarize.
Alternatively, there is the ungroup() function
df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  ungroup() %>%
  arrange(year, desc(tot))
Edit, @lucacerone: since dplyr 0.5 this does not work anymore:
Breaking changes: arrange() once again ignores grouping, reverting back to the behaviour of dplyr 0.3 and earlier. This makes arrange() inconsistent with other dplyr verbs, but I think this behaviour is generally more useful. Regardless, it’s not going to change again, as more changes will just cause more confusion.
Recent versions of dplyr (at least from dplyr_0.7.4) allow you to arrange within groups. You just have to set .by_group = TRUE in the arrange() call. More information is available in the arrange() documentation.
In your example, try:
library(dplyr)
df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev)) %>%
  arrange(desc(tot), .by_group = TRUE)
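To make the difference concrete, here is a small sketch on the data frame from the question. It assumes dplyr 1.0.0 or later (for the .groups argument, used here only to state explicitly that the result stays grouped by client). The names totals, global_order and within_groups are just for illustration.

```r
library(dplyr)

df <- data.frame(client = rep(c("Client A", "Client B", "Client C"), 3),
                 year = rep(c(2014, 2013, 2012), each = 3),
                 rev = rep(c(10, 20, 30), 3))

# After summarise() the result is still grouped by client.
totals <- df %>%
  group_by(client, year) %>%
  summarise(tot = sum(rev), .groups = "drop_last")

# Default since dplyr 0.5: grouping is ignored, rows are sorted globally
# by year and then by descending revenue.
global_order <- totals %>%
  arrange(year, desc(tot))

# With .by_group = TRUE the grouping variable (client) is sorted first,
# then year and descending revenue within each client.
within_groups <- totals %>%
  arrange(year, desc(tot), .by_group = TRUE)

print(global_order)
print(within_groups)
```

For the ordering asked for in the question (by year, then by descending revenue), the default behaviour without .by_group is the one that matches.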

What does n=n( ) mean in R?

The other day I was reading the following lines in R and I don't understand what the %>% and summarise(n=n()) and summarise(total=n()) meant. I understand the group_by and ungroup methods though.
Can someone help out? There isn't any documentation for this either.
library(dplyr)
net.multiplicity <- group_by(net, nodeid, epoch) %>%
  summarise(n=n()) %>%
  ungroup() %>%
  group_by(n) %>%
  summarise(total=n())
This is from the dplyr package. n() returns the number of rows (think number of observations) in the current group, so n=n() creates a column named n that holds each group's row count in the summarised data.
The %>% is read as "and then" and is a way of listing your functions sequentially rather than nesting them. So that command says: do the grouping, and then summarise the result of the grouping by the number of rows in each group, and then ungroup that result, and then group the ungrouped data based on n, and then summarise that by the total number of rows in each of the new groups.
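A tiny worked example may make both points easier to follow. The net data below is made up (only the column names nodeid and epoch come from the question), and the .groups argument assumes dplyr 1.0.0 or later. The piped and nested versions should produce the same table.

```r
library(dplyr)

# Hypothetical data: four distinct (nodeid, epoch) pairs,
# two of which occur twice and two of which occur once.
net <- tibble(
  nodeid = c(1, 1, 1, 2, 2, 3),
  epoch  = c(1, 1, 2, 1, 1, 1)
)

# Piped form: read each %>% as "and then".
piped <- net %>%
  group_by(nodeid, epoch) %>%
  summarise(n = n(), .groups = "drop") %>%  # rows per (nodeid, epoch) pair
  group_by(n) %>%
  summarise(total = n())                    # how many pairs share each n

# The same computation written as nested calls, read inside out.
nested <- summarise(
  group_by(
    summarise(group_by(net, nodeid, epoch), n = n(), .groups = "drop"),
    n
  ),
  total = n()
)

print(piped)
```

Here two pairs occur once and two pairs occur twice, so the result has the rows n = 1, total = 2 and n = 2, total = 2.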
