Keep desired columns when using summarise - r

I want to get the top 10 destinations, and also how many flights were made to these destinations. I am using summarise, and my problem is that summarise throws away all columns that are not mentioned in the summarise(..) call. I need to keep the column origin for later use.
library(tidyverse)
library(nycflights13)
flights %>%
  group_by(dest) %>%
  summarise(allFlights = n()) %>%
  arrange(desc(allFlights)) %>%
  head(10)
Here is the result from the code above:
# A tibble: 10 x 2
dest allFlights
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
I think this is correct, but I am missing another column that shows the origin.
I was thinking about doing a join to get the origin, but this doesn't make sense, as joining on this result set might not yield the correct flights.
I found this post: How to summarise all columns using group_by and summarise? but it was not helpful to me, as summarise is unable to find the columns I mention that are not part of its call.

When you sum the flights by destination, you are summing the total number of flights arriving in the destination city, which have many different origin cities. So it would not make sense for there to be a single value in the origin column here.
If you want, you could replace group_by(dest) with group_by(origin,dest). That would give you the top 10 pairs of origin-destination cities, which is a different output than in your question, but would retain the origin and destination columns for further analysis.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(origin, dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
output
# A tibble: 10 x 3
# Groups: origin [3]
origin dest n
<chr> <chr> <int>
1 JFK LAX 11262
2 LGA ATL 10263
3 LGA ORD 8857
4 JFK SFO 8204
5 LGA CLT 6168
6 EWR ORD 6100
7 JFK BOS 5898
8 LGA MIA 5781
9 JFK MCO 5464
10 EWR BOS 5327
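If the goal is instead to keep origin for every flight that goes to one of the top destinations, one possible approach is to rank the destinations separately and then filter the original rows with semi_join(). The following is only a minimal sketch on a hypothetical toy tibble (toy_flights, top_dest, flights_to_top and breakdown are names made up for illustration), using the top 2 destinations rather than 10:

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights13::flights
toy_flights <- tibble(
  origin = c("JFK", "LGA", "EWR", "JFK", "LGA", "EWR", "JFK"),
  dest   = c("LAX", "ORD", "ORD", "LAX", "ATL", "BOS", "ORD")
)

# Step 1: rank destinations by total flights (top 2 here; n = 10 on the real data)
top_dest <- toy_flights %>%
  count(dest, name = "allFlights") %>%
  slice_max(allFlights, n = 2, with_ties = FALSE)

# Step 2: keep only flights to those destinations; origin and all other columns survive
flights_to_top <- toy_flights %>%
  semi_join(top_dest, by = "dest")

# Optional: flights per origin within each top destination, next to the destination total
breakdown <- flights_to_top %>%
  count(dest, origin, name = "n_from_origin") %>%
  left_join(top_dest, by = "dest")
```

Because semi_join() only filters rows and never adds or duplicates them, the destination totals stay correct while every original column remains available.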

Related

In R, how to combine dplyr transformations for nycflights13

I am trying to find the five shortest minimum distances, called min_dist, by origin/destination in the nycflights13 package in R Studio. The result should be a tibble with 5 rows and 3 columns(origin, dest, and min_dist).
I am a beginner and this is what I have so far:
Q3 <- flights %>%
arrange(flights, distance)
group_by(origin) %>%
summarise(min_dist = origin/dest)
I am getting the error: Error in group_by(origin) : object 'origin' not found. Any hints on what to do? A lot of the other questions are similar to this so I want to figure out how to do these. Thank you
This may be done by selecting the columns of interest, getting the distinct rows, and applying slice_min with n = 5:
library(dplyr)
flights %>%
select(origin, dest, min_distance = distance)%>%
distinct %>%
slice_min(n = 5, order_by = min_distance, with_ties = FALSE)
Output:
# A tibble: 5 × 3
origin dest min_distance
<chr> <chr> <dbl>
1 EWR LGA 17
2 EWR PHL 80
3 JFK PHL 94
4 LGA PHL 96
5 EWR BDL 116
We could also use top_n with a negative n, which keeps the smallest values of the last column (distance here):
library(nycflights13)
library(dplyr)
flights %>%
select(origin, dest, distance) %>%
distinct() %>%
top_n(-5) %>%
arrange(distance)
origin dest distance
<chr> <chr> <dbl>
1 EWR LGA 17
2 EWR PHL 80
3 JFK PHL 94
4 LGA PHL 96
5 EWR BDL 116
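Both answers rely on distinct() to reduce the data to one row per origin/destination/distance combination. If a route could appear with more than one distance, an explicit minimum per group may be safer. The sketch below assumes a hypothetical toy tibble (toy_flights and shortest are illustrative names) and takes the 3 shortest routes instead of 5; it uses slice_min(), which superseded top_n() in dplyr 1.0.0:

```r
library(dplyr)

# Hypothetical toy data; the real data would be nycflights13::flights
toy_flights <- tibble(
  origin   = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA"),
  dest     = c("PHL", "PHL", "PHL", "BOS", "PHL", "BOS"),
  distance = c(80, 80, 94, 187, 96, 184)
)

shortest <- toy_flights %>%
  group_by(origin, dest) %>%
  # one row per route, holding that route's minimum distance
  summarise(min_dist = min(distance), .groups = "drop") %>%
  # keep the 3 routes with the smallest minimum distance (n = 5 on the real data)
  slice_min(min_dist, n = 3, with_ties = FALSE)
```

The result has exactly the columns origin, dest and min_dist, matching the shape requested in the question.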

How to calculate the number of flights with a specific condition

I'm using the nycflights13::flights dataframe and want to calculate the number of flights each airplane has flown before its first delay of more than 1 hour. How can I do this? I've tried with group_by and filter, but I haven't been able to. Is there a method to count the rows until a condition is met (e.g. until the first dep_delay > 60)?
Thanks.
library(dplyr)
library(nycflights13)
data("flights")
There may be more elegant ways, but this code counts the total number of flights made by each plane (omitting cancelled flights) and joins this with flights that were not cancelled, grouping on the unique plane identifier (tailnum), sorting on departure date/time, assigning the row_number less 1, filtering on delays>60, and taking the first row.
select(
filter(flights, !is.na(dep_time)) %>%
count(tailnum, name="flights") %>% left_join(
filter(flights, !is.na(dep_time)) %>%
group_by(tailnum) %>%
arrange(month, day, dep_time) %>%
mutate(not_delayed=row_number() -1) %>%
filter(dep_delay>60) %>% slice(1)),
tailnum, flights, not_delayed)
# A tibble: 4,037 x 3
tailnum flights not_delayed
<chr> <int> <dbl>
1 D942DN 4 0
2 N0EGMQ 354 53
3 N10156 146 9
4 N102UW 48 25
5 N103US 46 NA
6 N104UW 47 3
7 N10575 272 0
8 N105UW 45 22
9 N107US 41 20
10 N108UW 60 36
# ... with 4,027 more rows
The plane with tailnum N103US has made 46 flights, of which none have been delayed by more than 1 hour. So the number of flights it has made before its first 1 hour delay is undefined or NA.
I got the answer:
flights %>%
#Eliminate the NAs
filter(!is.na(dep_time)) %>%
#Sort by date and time
arrange(time_hour) %>%
group_by(tailnum) %>%
#cumulative number of flights delayed more than one hour
mutate(acum_delay = cumsum(dep_delay > 60)) %>%
#count the number of flights
summarise(before_1hdelay = sum(acum_delay < 1))
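To see why the cumulative sum counts the flights before the first long delay, a small worked example may help. The tibble below is hypothetical (toy and result are illustrative names) and its rows are assumed to be in chronological order already:

```r
library(dplyr)

# Hypothetical toy data, assumed sorted by departure time within each plane
toy <- tibble(
  tailnum   = c("A", "A", "A", "A", "B", "B", "B"),
  dep_delay = c(5, 10, 75, 0, 3, 12, 20)
)

result <- toy %>%
  group_by(tailnum) %>%
  # running count of delays over 60 minutes: stays 0 until the first one occurs
  mutate(acum_delay = cumsum(dep_delay > 60)) %>%
  # rows where the running count is still 0 are the flights before the first long delay
  summarise(before_1hdelay = sum(acum_delay < 1))
```

For plane A the running count is 0, 0, 1, 1, so two flights precede the first long delay. Plane B never has such a delay, so this approach returns its total number of flights (3) rather than NA, which differs from the join-based answer above.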

Retrieve discarded column after using summarise

I am selecting the top 10 destinations of flights, and how many flights went there. To achieve this I needed to use summarise, which throws away everything that I didn't mention in the group_by(..).
Later I need the column origin, but I no longer can retrieve this column, as it is discarded along with other columns. To keep the origin it seems that I would need to mention it in my group_by(..) but I don't want this, as my result would then be incorrect. How can I get the origin of these top 10 flights?
library(tidyverse)
library(nycflights13)
(newFlights<- flights %>%
group_by("Destination" = dest) %>%
summarise("AllFlights" = n()) %>%
arrange(desc(AllFlights)) %>% top_n(10))
You want to include origin in the call to group_by():
newFlights <- as.data.frame(flights %>%
group_by(origin, dest)%>%
summarize("AllFlights" = n()) %>%
arrange(desc(AllFlights)) %>%
top_n(10)
)
head(newFlights, 10)
Giving you:
origin dest AllFlights
1 JFK LAX 11262
2 LGA ATL 10263
3 LGA ORD 8857
4 JFK SFO 8204
5 LGA CLT 6168
6 EWR ORD 6100
7 JFK BOS 5898
8 LGA MIA 5781
9 JFK MCO 5464
10 EWR BOS 5327
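One thing to be aware of with this pattern: after summarising over two grouping variables, the result is still grouped by the first one, so a following top_n() or slice_max() is applied within each origin rather than overall. A minimal sketch of the difference, on a hypothetical toy tibble (toy_flights, still_grouped, per_origin, pairs and top_pairs are illustrative names) and with n = 2 instead of 10:

```r
library(dplyr)

# Hypothetical toy data standing in for nycflights13::flights
toy_flights <- tibble(
  origin = c("JFK", "JFK", "JFK", "LGA", "LGA", "EWR"),
  dest   = c("LAX", "LAX", "LAX", "ATL", "ATL", "ORD")
)

# Dropping only the last grouping level (what summarise does by default here)
# leaves the result grouped by origin
still_grouped <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(AllFlights = n(), .groups = "drop_last")

# Applied per origin: up to 2 rows for each of the 3 origins
per_origin <- still_grouped %>%
  slice_max(AllFlights, n = 2, with_ties = FALSE)

# Dropping all groups first gives the overall top 2 origin/destination pairs
pairs <- toy_flights %>%
  group_by(origin, dest) %>%
  summarise(AllFlights = n(), .groups = "drop")

top_pairs <- pairs %>%
  slice_max(AllFlights, n = 2, with_ties = FALSE)
```

In the answer above, the later head(newFlights, 10) on the sorted result hides this effect, but ungrouping explicitly makes the intent clearer.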

How to subset and order rows in one data.table call?

I'm learning data.table and am trying to subset and re-order grouped data while calculating averages, all in one data.table statement.
I got the flight data from here while following along in this tutorial.
From the tutorial,
How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
by = .(origin, dest, month)]
ans
# origin dest month V1 V2
# 1: JFK LAX 1 6.590361 14.2289157
# 2: LGA PBI 1 -7.758621 0.3103448
# 3: EWR LAX 1 1.366667 7.5000000
# 4: JFK MIA 1 15.720670 18.7430168
# 5: JFK SEA 1 14.357143 30.7500000
# ---
# 196: LGA MIA 10 -6.251799 -1.4208633
# 197: JFK MIA 10 -1.880184 6.6774194
# 198: EWR PHX 10 -3.032258 -4.2903226
# 199: JFK MCO 10 -10.048387 -1.6129032
# 200: JFK DCA 10 16.483871 15.5161290
This makes sense to me. However, I want to improve on this by organizing the flights which departed from the same origin together. Following the schema of...
data.table(subset, select/compute, group by)
I came up with this:
flights[(carrier=="AA")[order(origin, dest)],
.(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)),
by=.(origin, dest, month)]
But my call to order doesn't seem to have done anything. Why?
I can achieve this in dplyr with:
flights %>%
filter(carrier=="AA") %>%
group_by(origin, dest, month) %>%
summarize(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)) %>%
as.data.table()
I'm curious how to do this in data.table and why my approach did not work.
With data.table, you can use keyby when you want to sort a result after a by operation. From help("data.table", package = "data.table"):
keyby
Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.
You could then use keyby instead of by in your code
library(data.table)
# get the data from the web
flights <- fread("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv")
# keyby groups like by, then runs setkey() on the by columns of the result, which sorts it
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
keyby = .(origin, dest, month)]
ans
#> origin dest month V1 V2
#> 1: EWR DFW 1 6.427673 10.0125786
#> 2: EWR DFW 2 10.536765 11.3455882
#> 3: EWR DFW 3 12.865031 8.0797546
#> 4: EWR DFW 4 17.792683 12.9207317
#> 5: EWR DFW 5 18.487805 18.6829268
#> ---
#> 196: LGA PBI 1 -7.758621 0.3103448
#> 197: LGA PBI 2 -7.865385 2.4038462
#> 198: LGA PBI 3 -5.754098 3.0327869
#> 199: LGA PBI 4 -13.966667 -4.7333333
#> 200: LGA PBI 5 -10.357143 -6.8571429
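As for why the original attempt did not sort anything: (carrier == "AA")[order(origin, dest)] reorders the logical vector, not the rows, so i still receives a plain logical filter (and one that no longer lines up with the right rows). An alternative to keyby is to chain a second [ ] call that orders the grouped result. The sketch below uses a hypothetical toy table (toy, unsorted, chained and keyed are illustrative names) rather than the downloaded flight data:

```r
library(data.table)

# Hypothetical toy data standing in for the flights14 table
toy <- data.table(
  carrier   = c("AA", "AA", "UA", "AA", "AA"),
  origin    = c("LGA", "EWR", "EWR", "LGA", "EWR"),
  dest      = c("PBI", "DFW", "ORD", "PBI", "DFW"),
  arr_delay = c(-8, 6, 3, -6, 10)
)

# by: groups come back in order of first appearance (LGA before EWR here)
unsorted <- toy[carrier == "AA",
                .(Arrival_Delay = mean(arr_delay)),
                by = .(origin, dest)]

# Chaining: order the grouped result in a second [ ] call
chained <- unsorted[order(origin, dest)]

# keyby: same values, sorted, and additionally keyed on origin and dest
keyed <- toy[carrier == "AA",
             .(Arrival_Delay = mean(arr_delay)),
             keyby = .(origin, dest)]
```

Chaining is handy when the desired sort order differs from the grouping columns, for example ordering by the computed mean; keyby is shorter when sorting by the groups themselves is enough.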

Grouping within group in R, plyr/dplyr

I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that have ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Perhaps this helps:
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
I tried doing this with base R and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
First, split the data by team into separate groups.
For each group, take the minimum year, which is the first year the team appears.
Take the length of the unique ids, which is the number of distinct players on that team.
Split each group into subgroups by id and take the maximum number of rows, which gives the longest duration played by any player on that team.
From the same subgroups, take the name of the id with the most rows, which identifies the player who played for that team the longest.
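The same steps can be sketched in dplyr, which may be faster than nested split() calls. This is only an illustration on a hypothetical toy tibble (toy, per_team, first_year and result are made-up names); like the base R code, it counts rows per player, so with multiple stints per year one would first reduce the data to distinct team/id/year combinations:

```r
library(dplyr)

# Hypothetical toy data with the same columns as baseball[, 1:4] minus stint
toy <- tibble(
  id   = c("a", "a", "a", "b", "b", "c", "c"),
  year = c(1871, 1872, 1873, 1872, 1873, 1900, 1901),
  team = c("X", "X", "X", "X", "X", "Y", "Y")
)

# Rows per player within each team, then the per-team maximum and who holds it
per_team <- toy %>%
  count(team, id, name = "seasons") %>%
  group_by(team) %>%
  summarise(
    num_distinct_players = n(),                   # one row per player after count()
    longest_duration     = max(seasons),
    longest_player       = id[which.max(seasons)]
  )

# First year each team appears
first_year <- toy %>%
  group_by(team) %>%
  summarise(first_year = min(year))

result <- left_join(first_year, per_team, by = "team")
```

Note that which.max() returns only the first player when several are tied for the longest duration.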
