How to subset and order rows in one data.table call? - r

I'm learning data.table and am trying to subset and re-order grouped data while calculating averages, all in one data.table statement.
I got the flight data from here while following along in this tutorial.
From the tutorial,
How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
by = .(origin, dest, month)]
ans
# origin dest month V1 V2
# 1: JFK LAX 1 6.590361 14.2289157
# 2: LGA PBI 1 -7.758621 0.3103448
# 3: EWR LAX 1 1.366667 7.5000000
# 4: JFK MIA 1 15.720670 18.7430168
# 5: JFK SEA 1 14.357143 30.7500000
# ---
# 196: LGA MIA 10 -6.251799 -1.4208633
# 197: JFK MIA 10 -1.880184 6.6774194
# 198: EWR PHX 10 -3.032258 -4.2903226
# 199: JFK MCO 10 -10.048387 -1.6129032
# 200: JFK DCA 10 16.483871 15.5161290
This makes sense to me. However, I want to improve on this by organizing the flights which departed from the same origin together. Following the schema of...
data.table(subset, select/compute, group by)
I came up with this:
flights[(carrier=="AA")[order(origin, dest)],
.(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)),
by=.(origin, dest, month)]
But my call to order doesn't seem to have done anything. Why?
I can achieve this in dplyr with:
flights %>%
filter(carrier=="AA") %>%
group_by(origin, dest, month) %>%
summarize(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)) %>%
as.data.table()
I'm curious how to do this in data.table and why my approach did not work.

With data.table, you can use keyby when you want to sort a result after a by operation. From help("data.table", package = "data.table")
keyby
Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.
You could then use keyby instead of by in your code
library(data.table)
# get the data from the web
flights <- fread("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv")
# keyby groups like by, then runs setkey() on the by columns, which sorts the result
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
keyby = .(origin, dest, month)]
ans
#> origin dest month V1 V2
#> 1: EWR DFW 1 6.427673 10.0125786
#> 2: EWR DFW 2 10.536765 11.3455882
#> 3: EWR DFW 3 12.865031 8.0797546
#> 4: EWR DFW 4 17.792683 12.9207317
#> 5: EWR DFW 5 18.487805 18.6829268
#> ---
#> 196: LGA PBI 1 -7.758621 0.3103448
#> 197: LGA PBI 2 -7.865385 2.4038462
#> 198: LGA PBI 3 -5.754098 3.0327869
#> 199: LGA PBI 4 -13.966667 -4.7333333
#> 200: LGA PBI 5 -10.357143 -6.8571429
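As for why the original attempt had no visible effect: (carrier == "AA")[order(origin, dest)] does not sort any rows. It only permutes the logical vector, which is then still used to pick rows in their original positions, so the wrong rows are selected and nothing gets sorted. The sketch below uses a small made-up table (toy and its values are hypothetical, not the real flights data) to compare keyby with the other common idiom, chaining a second [] that orders the grouped result.

```r
library(data.table)

# Hypothetical toy data standing in for the flights table
toy <- data.table(
  carrier   = c("AA", "UA", "AA", "AA", "AA", "UA"),
  origin    = c("LGA", "JFK", "EWR", "JFK", "EWR", "EWR"),
  dest      = c("PBI", "LAX", "DFW", "LAX", "DFW", "LAX"),
  month     = rep(1L, 6),
  arr_delay = c(-8, 5, 6, 7, 10, 3),
  dep_delay = c(0, 2, 10, 14, 12, 1)
)

# Option 1: keyby sorts the result by all grouping columns
ans_keyby <- toy[carrier == "AA",
                 .(Arrival_Delay = mean(arr_delay), Depart_Delay = mean(dep_delay)),
                 keyby = .(origin, dest, month)]

# Option 2: group with by, then order the result in a chained []
ans_chain <- toy[carrier == "AA",
                 .(Arrival_Delay = mean(arr_delay), Depart_Delay = mean(dep_delay)),
                 by = .(origin, dest, month)][order(origin, dest)]
```

The chained form is handy when the sort order should differ from the grouping columns; keyby additionally sets a key on the result.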

Related

Group and add variable of type stock and another type in a single step?

I want to group by district, summing the 'incoming' values across quarters, and get the value of 'stock' in the last quarter (3), all in just one step. 'stock' cannot be summed across quarters.
My example dataframe:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
df
district quarter incoming stock
1 ARA 1 4044 19547
2 ARA 2 2992 3160
3 ARA 3 2556 1533
4 BJI 1 1639 5355
5 BJI 2 9547 6146
6 BJI 3 1191 355
7 CMC 1 2038 5816
8 CMC 2 1942 1119
9 CMC 3 225 333
The actual dataframe has ~45,000 rows and 41 variables, of which 8 are of type stock.
The result should be:
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
I know how to get to the result, but only in three steps, which I think is inefficient and error-prone given the data.
My approach:
basea <- df %>%
group_by(district) %>%
filter(quarter==3) %>% #take only the last quarter
summarise(across(stock, sum))
baseb <- df %>%
group_by(district) %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
Does anyone have any suggestions to perform the procedure in one (or at least two) steps?
Grateful,
Modus
This assumes the dataset only has 3 quarters per district, not 4, and that the rows are in quarter order. If that's not the case, use nth(stock, 3) instead of last(stock).
library(tidyverse)
df %>%
group_by(district) %>%
summarise(stock = last(stock),
incoming = sum(incoming))
# A tibble: 3 × 3
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
Here is a data.table approach:
library(data.table)
setDT(df)[, .(incoming = sum(incoming), stock = stock[.N]), by = .(district)]
district incoming stock
1: ARA 9592 1533
2: BJI 12377 355
3: CMC 4205 333
Here's a refactor that removes some of the duplicated code. This also seems like a prime use case for a custom function, which can be QC'd and maintained more easily:
library(dplyr)
df <- data.frame ("district"= rep(c("ARA", "BJI", "CMC"), each=3),
"quarter"=rep(1:3,3),
"incoming"= c(4044, 2992, 2556, 1639, 9547, 1191,2038,1942,225),
"stock"= c(19547,3160, 1533,5355,6146,355,5816,1119,333)
)
aggregate_stocks <- function(df, n_quarter) {
base <- df %>%
group_by(district)
basea <- base %>%
filter(quarter == n_quarter) %>%
summarise(across(stock, sum))
baseb <- base %>%
summarise(across(incoming, sum))
final <- full_join(basea, baseb, by = "district")
return(final)
}
aggregate_stocks(df, 3)
#> # A tibble: 3 × 3
#> district stock incoming
#> <chr> <dbl> <dbl>
#> 1 ARA 1533 9592
#> 2 BJI 355 12377
#> 3 CMC 333 4205
Here is the same solution as @Tom Hoel's, but indexing with [] instead of using a function to subset:
library(dplyr)
df %>%
group_by(district) %>%
summarise(stock = stock[3],
incoming = sum(incoming))
district stock incoming
<chr> <dbl> <dbl>
1 ARA 1533 9592
2 BJI 355 12377
3 CMC 333 4205
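The positional approaches above (last(), stock[.N], stock[3]) rely on the rows of each district being in quarter order. A minimal sketch of a variant that selects the stock value by the quarter column instead, assuming each district has exactly one row for its latest quarter; the shuffled copy is only there to illustrate that row order no longer matters:

```r
library(dplyr)

df <- data.frame(
  district = rep(c("ARA", "BJI", "CMC"), each = 3),
  quarter  = rep(1:3, 3),
  incoming = c(4044, 2992, 2556, 1639, 9547, 1191, 2038, 1942, 225),
  stock    = c(19547, 3160, 1533, 5355, 6146, 355, 5816, 1119, 333)
)

# Shuffle the rows so positional indexing would no longer be reliable
set.seed(1)
shuffled <- df[sample(nrow(df)), ]

result <- shuffled %>%
  group_by(district) %>%
  summarise(stock    = stock[quarter == max(quarter)],  # value at the latest quarter
            incoming = sum(incoming))                   # flow variable: sum over quarters
```

With several stock-type columns, the same idea should carry over to across(), for example across(c(stock, other_stock), ~ .x[quarter == max(quarter)]), where other_stock is a hypothetical column name.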

In R, how to combine dplyr transformations for nycflights13

I am trying to find the five shortest minimum distances, called min_dist, by origin/destination in the nycflights13 package in R Studio. The result should be a tibble with 5 rows and 3 columns(origin, dest, and min_dist).
I am a beginner and this is what I have so far:
Q3 <- flights %>%
arrange(flights, distance)
group_by(origin) %>%
summarise(min_dist = origin/dest)
I am getting the error: Error in group_by(origin) : object 'origin' not found. Any hints on what to do? A lot of the other questions are similar to this so I want to figure out how to do these. Thank you
This may be done by selecting the columns of interest, getting the distinct rows, and applying slice_min with n = 5:
library(nycflights13)
library(dplyr)
flights %>%
select(origin, dest, min_distance = distance)%>%
distinct %>%
slice_min(n = 5, order_by = min_distance, with_ties = FALSE)
-output
# A tibble: 5 × 3
origin dest min_distance
<chr> <chr> <dbl>
1 EWR LGA 17
2 EWR PHL 80
3 JFK PHL 94
4 LGA PHL 96
5 EWR BDL 116
We could use top_n with a negative sign (note that top_n() is superseded by slice_min()/slice_max() in dplyr 1.0.0 and later):
library(nycflights13)
library(dplyr)
flights %>%
select(origin, dest, distance) %>%
distinct() %>%
top_n(-5) %>%
arrange(distance)
origin dest distance
<chr> <chr> <dbl>
1 EWR LGA 17
2 EWR PHL 80
3 JFK PHL 94
4 LGA PHL 96
5 EWR BDL 116
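On the error itself: in the original attempt there is no %>% after arrange(flights, distance), so group_by(origin) is evaluated as a separate statement with no data frame, which is why origin is not found. Below is a minimal sketch of a pipeline that follows the wording of the question more literally (group, take the minimum, keep the five smallest). It runs on a small made-up tibble (toy_flights and its rows are illustrative, not the full nycflights13 data).

```r
library(dplyr)

# Hypothetical toy data with the same column names as nycflights13::flights
toy_flights <- tibble(
  origin   = c("EWR", "EWR", "JFK", "JFK", "LGA", "LGA", "EWR"),
  dest     = c("PHL", "PHL", "PHL", "BOS", "PHL", "BOS", "BDL"),
  distance = c(80, 80, 94, 187, 96, 184, 116)
)

result <- toy_flights %>%
  group_by(origin, dest) %>%                                  # one group per route
  summarise(min_dist = min(distance), .groups = "drop") %>%   # shortest distance per route
  slice_min(min_dist, n = 5, with_ties = FALSE)               # five smallest, ascending
```

Every step is connected with %>%, so each verb receives the data frame produced by the previous one.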

How can I merge two data sets with a custom function that applies a rule to non-common columns?

I am trying to merge two data frames of different sizes, but I am running into difficulties because of the panel structure of the data.
Consider the example below where 'toy.left' is a panel of three variables: a coordinate ('coord'), and a name ('name') assigned to that coordinate in a particular month ('month'). Next, consider 'toy.right,' which comprises four variables: a name ('name'), a coordinate ('coord'), the start of that name's tenure for the assignment to that coordinate ('tenure.start'), and the end of that tenure ('tenure.end').
library(tibble)
toy.left <- tribble(~month, ~coord, ~name,
"2000-01-01", 1301, "Alpha",
"2000-03-01", 1301, "Beta",
"2000-06-01", 1302, "Charlie",
"2000-09-01", 1303, "Delta",
"2000-12-01", 1303, "Epsilon")
toy.right <- tribble(~name, ~coord, ~tenure.start, ~tenure.end,
"Alpha", 1301, "2000-02-01", "2000-04-01",
"Beta", 1301, "1999-11-01", "2000-04-01",
"Charlie", 1302, "2000-04-01", "2000-07-01",
"Delta", 1303, "2000-08-01", "2000-10-01",
"Epsilon", 1303, "2000-11-01", "2001-01-01",
"Delta", 1303, "2002-01-01", "2004-01-01")
I would like to merge these two data sets, but there are rules that make it difficult with the join functions in dplyr. For example, I cannot simply use inner_join() and merge by 'name' and 'coord' because this violates the panel structure of the data. If I do this, the tenure of an individual does not overlap with the month of the observation (first, see Rows 1 & 2, which should be inverted; second, see Rows 4 & 5, where the join duplicates the month observation but should only include Row 4).
toy.left %>%
inner_join(toy.right, by = c("name", "coord"))
*Output*
month coord name tenure.start tenure.end
2000-01-01 1301 Alpha 2000-02-01 2000-04-01
2000-03-01 1301 Beta 1999-11-01 2000-04-01
2000-06-01 1302 Charlie 2000-04-01 2000-07-01
2000-09-01 1303 Delta 2000-08-01 2000-10-01
2000-09-01 1303 Delta 2002-01-01 2004-01-01
2000-12-01 1303 Epsilon 2000-11-01 2001-01-01
To solve this problem, I could merge the data by 'name,' 'coord,' and 'month,' but I would need to condition merging by 'month' on whether or not the date falls between 'tenure.start' and 'tenure.end.' After searching around, I could not find a way to apply a custom rule to a join in dplyr.
I understand that a custom function or loop might be the best way to approach this, but I am not sure where to begin. Additionally, the original data set has more than 1.5 million observations, which may generate further issues.
I welcome your suggestions!
(All this after converting month and tenure.* to Date-class.)
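The conversion itself might look like the sketch below. It uses plain data frames with only the first two rows of each table from the question, so that it runs with base R alone; the same assignments should work on the tibbles.

```r
# First two rows of each table from the question, as plain data frames
toy.left <- data.frame(
  month = c("2000-01-01", "2000-03-01"),
  coord = c(1301, 1301),
  name  = c("Alpha", "Beta")
)
toy.right <- data.frame(
  name         = c("Alpha", "Beta"),
  coord        = c(1301, 1301),
  tenure.start = c("2000-02-01", "1999-11-01"),
  tenure.end   = c("2000-04-01", "2000-04-01")
)

# Convert the character columns to Date class
toy.left$month <- as.Date(toy.left$month)
date_cols <- c("tenure.start", "tenure.end")
toy.right[date_cols] <- lapply(toy.right[date_cols], as.Date)

# Rows are aligned by name here, so the range check can be done element-wise
in_tenure <- toy.left$month >= toy.right$tenure.start &
             toy.left$month <= toy.right$tenure.end
```

Here in_tenure is FALSE for Alpha and TRUE for Beta, which is the rule the joins below apply row by row.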
fuzzyjoin
fuzzyjoin::fuzzy_inner_join(
toy.left, toy.right,
by=c("name", "coord", month="tenure.start", month="tenure.end"),
match_fun=list(`==`, `==`, `>=`, `<=`))
# # A tibble: 4 x 7
# month coord.x name.x name.y coord.y tenure.start tenure.end
# <date> <dbl> <chr> <chr> <dbl> <date> <date>
# 1 2000-03-01 1301 Beta Beta 1301 1999-11-01 2000-04-01
# 2 2000-06-01 1302 Charlie Charlie 1302 2000-04-01 2000-07-01
# 3 2000-09-01 1303 Delta Delta 1303 2000-08-01 2000-10-01
# 4 2000-12-01 1303 Epsilon Epsilon 1303 2000-11-01 2001-01-01
sqldf
sqldf::sqldf(
"select tl.name, tl.coord, tl.month, tr.[tenure.start], tr.[tenure.end]
from [toy.left] tl
inner join [toy.right] tr on tl.name=tr.name and tl.coord=tr.coord
and tl.month between tr.[tenure.start] and tr.[tenure.end]")
# name coord month tenure.start tenure.end
# 1 Beta 1301 2000-03-01 1999-11-01 2000-04-01
# 2 Charlie 1302 2000-06-01 2000-04-01 2000-07-01
# 3 Delta 1303 2000-09-01 2000-08-01 2000-10-01
# 4 Epsilon 1303 2000-12-01 2000-11-01 2001-01-01
(I use [tenure.start] with the bracket-notation to differentiate between the table identifier tl and the column name tenure.start, where in SQL dots in the column names usually indicate schema.tablename.columnname-like nomenclature.)
data.table
In data.table, X[Y, on = ...] keeps every row of Y (here toy.right) whether or not it has a match, so by default it is not an inner join. To identify the unmatched rows that an inner join would drop, I'll add a column to toy.left:
library(data.table)
setDT(toy.left)
setDT(toy.right)
toy.left[, val := 2]
toy.left[toy.right, on = .(name, coord, month >= tenure.start, month <= tenure.end)][ !is.na(val),]
# month coord name val month.1
# <Date> <num> <char> <num> <Date>
# 1: 1999-11-01 1301 Beta 2 2000-04-01
# 2: 2000-04-01 1302 Charlie 2 2000-07-01
# 3: 2000-08-01 1303 Delta 2 2000-10-01
# 4: 2000-11-01 1303 Epsilon 2 2001-01-01
data.table has its way of renaming columns, so be aware of it. When I'm not certain I know how the naming is going to end up, I often copy the columns around so that it's always clear ... but part of the reason I'm doing that is laziness in learning exactly how it determines the resulting names.
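One way to sidestep the naming question is to choose the output columns explicitly in j, using the x. and i. prefixes to refer to the columns of toy.left and toy.right respectively, and to pass nomatch = NULL so that unmatched rows are dropped (an inner join). This is a minimal sketch that rebuilds the example data as data.tables with Date columns; it is an alternative to the val column trick, not what the answer above ran.

```r
library(data.table)

toy.left <- data.table(
  month = as.Date(c("2000-01-01", "2000-03-01", "2000-06-01",
                    "2000-09-01", "2000-12-01")),
  coord = c(1301, 1301, 1302, 1303, 1303),
  name  = c("Alpha", "Beta", "Charlie", "Delta", "Epsilon")
)
toy.right <- data.table(
  name         = c("Alpha", "Beta", "Charlie", "Delta", "Epsilon", "Delta"),
  coord        = c(1301, 1301, 1302, 1303, 1303, 1303),
  tenure.start = as.Date(c("2000-02-01", "1999-11-01", "2000-04-01",
                           "2000-08-01", "2000-11-01", "2002-01-01")),
  tenure.end   = as.Date(c("2000-04-01", "2000-04-01", "2000-07-01",
                           "2000-10-01", "2001-01-01", "2004-01-01"))
)

# Non-equi join: month must fall within [tenure.start, tenure.end].
# nomatch = NULL drops rows of toy.right without a match (inner join).
# x.month is toy.left's month; i.tenure.* are toy.right's original columns.
result <- toy.left[toy.right,
                   on = .(name, coord, month >= tenure.start, month <= tenure.end),
                   nomatch = NULL,
                   .(name, coord, month = x.month,
                     tenure.start = i.tenure.start, tenure.end = i.tenure.end)]
```

The result should have the four rows for Beta, Charlie, Delta, and Epsilon, with month holding the observation month rather than a tenure boundary.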

Keep desired columns when using summarise

I want to get the top 10 destinations, and also how many flights were made to these destinations. I am using summarise, and my problem is that summarise throws away all columns that are not mentioned in the summarise(..). I need to keep the column origin for later use.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
Here is the result from the code above
# A tibble: 10 x 2
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
I think this is correct. But all I am missing, is another column that prints the origin
I was thinking about doing a join to get the origin, but this doesn't make sense, as doing the join on this result set might not yield the correct flights.
I found this post: How to summarise all columns using group_by and summarise? but it was not helpful to me, as summarise is unable to find the columns I mention, that are not in its function.
When you sum the flights by destination, you are summing the total number of flights arriving in the destination city, which have many different origin cities. So it would not make sense for there to be a single value in the origin column here.
If you want, you could replace group_by(dest) with group_by(origin,dest). That would give you the top 10 pairs of origin-destination cities, which is a different output than in your question, but would retain the origin and destination columns for further analysis.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(origin, dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
output
# A tibble: 10 x 3
# Groups: origin [3]
origin dest n
<chr> <chr> <int>
1 JFK LAX 11262
2 LGA ATL 10263
3 LGA ORD 8857
4 JFK SFO 8204
5 LGA CLT 6168
6 EWR ORD 6100
7 JFK BOS 5898
8 LGA MIA 5781
9 JFK MCO 5464
10 EWR BOS 5327
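If the goal is to keep working with the flights to the top destinations, with origin and every other column intact, another option is to compute the top destinations separately and filter the original rows against them. A minimal sketch on made-up data (toy, its rows, and the cut-off of 2 instead of 10 are illustrative assumptions):

```r
library(dplyr)

# Hypothetical toy data with the same column names as nycflights13::flights
toy <- tibble(
  origin = c("JFK", "LGA", "EWR", "JFK", "LGA", "EWR", "JFK"),
  dest   = c("LAX", "ORD", "ORD", "LAX", "ATL", "BOS", "LAX")
)

# Top destinations by number of flights (top 2 here, top 10 in the question)
top_dest <- toy %>%
  count(dest, sort = TRUE, name = "allFlights") %>%
  slice_head(n = 2)

# Keep only flights to those destinations; all original columns survive
kept <- toy %>%
  semi_join(top_dest, by = "dest")

# Per-origin breakdown within the top destinations
by_origin <- kept %>%
  count(dest, origin)
```

semi_join() filters rows without adding columns, so it avoids the duplication concern raised about joining on the summarised result.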

How to group_by values and get the count for multiple attributes in dataframe using R

I have a dataframe of the below format. I am producing sample data, but I have thousands of record of similar format:
ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
JFK MCO 1 0 47
JFK LAX
JFK MCO 1 2 30
LOG DFW 12 20 3
LOG DFW
I need to group by origin and destination and calculate the number of occurrences (count) of each delay using dplyr. The values in the delay columns are in minutes. I need to count only values greater than 0, increasing the count by 1 for each such value. There are null values in certain rows, and I need to ignore them as well.
The output should look like below:
ORIGIN DEST CARR_DELAY_COUNT WEATHER_DELAY_COUNT NAS_DELAY_COUNT
JFK MCO 2 1 2
LOG DFW 1 1 1
I am using below dplyr function:
flight.df %>%
group_by(ORIGIN,DEST) %>%
summarize(carr_delay=sum(CARRIER_DELAY,na.rm=TRUE),
weather_delay=sum(WEATHER_DELAY,na.rm=TRUE),
nas_delay=sum(NAS_DELAY,na.rm=TRUE)) %>%
group_by() %>%
{.} -> delays.df
The above function will generate sum of delay values grouping by each category of delay for a particular source and destination.
How can I add another function here to get the count of each delay in addition to the sum?
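You can use summarize_each after a group_by using the dplyr package. You'll have to rename the columns though. (Note that summarize_each() and funs() are deprecated in current dplyr versions in favor of across().)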
library(dplyr)
df %>% group_by(ORIGIN, DEST) %>% summarize_each(funs(Count = sum(.>0, na.rm=T)))
Source: local data frame [3 x 5]
Groups: ORIGIN [?]
ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
(fctr) (fctr) (int) (int) (int)
1 JFK LAX 0 0 0
2 JFK MCO 2 1 2
3 LOG DFW 1 1 1
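In current dplyr, the same count can be sketched with across(), which also takes care of the renaming via .names. The data frame below is reconstructed from the sample in the question, with blank cells entered as NA; the "_COUNT" suffix is an assumption about the desired column names.

```r
library(dplyr)

# Sample data from the question; blank cells become NA
df <- data.frame(
  ORIGIN        = c("JFK", "JFK", "JFK", "LOG", "LOG"),
  DEST          = c("MCO", "LAX", "MCO", "DFW", "DFW"),
  CARRIER_DELAY = c(1, NA, 1, 12, NA),
  WEATHER_DELAY = c(0, NA, 2, 20, NA),
  NAS_DELAY     = c(47, NA, 30, 3, NA)
)

counts <- df %>%
  group_by(ORIGIN, DEST) %>%
  summarise(
    # count values strictly greater than 0, ignoring NA, for every *_DELAY column
    across(ends_with("_DELAY"), ~ sum(.x > 0, na.rm = TRUE),
           .names = "{.col}_COUNT"),
    .groups = "drop"
  )
```

As in the output above, the JFK/LAX pair is kept with counts of 0; add a filter() afterwards if routes without any delay should be dropped.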
It is also straightforward to calculate this using the base R function, aggregate.
aggregate(cbind("CARRIER_DELAY"=CARRIER_DELAY,
"WEATHER_DELAY"=WEATHER_DELAY,
"NAS_DELAY"=NAS_DELAY) ~ ORIGIN + DEST,
data=df, FUN=function(x) sum(x > 0, na.rm=TRUE))
which returns
ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
1 LOG DFW 1 1 1
2 JFK MCO 2 1 2
I use cbind to group the summary variables together and to also give names to the output.
We can use data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) sum(x > 0, na.rm=TRUE)) , .(ORIGIN, DEST)]
# ORIGIN DEST CARRIER_DELAY WEATHER_DELAY NAS_DELAY
#1: JFK MCO 2 1 2
#2: JFK LAX 0 0 0
#3: LOG DFW 1 1 1
NOTE: This straightforward method gives the same correct output as the accepted answer.
