I'm using the nycflights13::flights dataframe and want to calculate the number of flights an airplane have flown before its first more than 1 hour delay. How can I do this? I've tried with a group_by and filter, but I haven't been able to. Is there a method to count the rows till a condition (e.g. till the first dep_delay >60)?
Thanks.
library(dplyr)
library(nycflights13)
data("flights")
There may be more elegant ways, but this code counts the total number of flights made by each plane (omitting cancelled flights) and joins this with flights that were not cancelled, grouping on the unique plane identifier (tailnum), sorting on departure date/time, assigning the row_number less 1, filtering on delays>60, and taking the first row.
select(
filter(flights, !is.na(dep_time)) %>%
count(tailnum, name="flights") %>% left_join(
filter(flights, !is.na(dep_time)) %>%
group_by(tailnum) %>%
arrange(month, day, dep_time) %>%
mutate(not_delayed=row_number() -1) %>%
filter(dep_delay>60) %>% slice(1)),
tailnum, flights, not_delayed)
# A tibble: 4,037 x 3
tailnum flights not_delayed
<chr> <int> <dbl>
1 D942DN 4 0
2 N0EGMQ 354 53
3 N10156 146 9
4 N102UW 48 25
5 N103US 46 NA
6 N104UW 47 3
7 N10575 272 0
8 N105UW 45 22
9 N107US 41 20
10 N108UW 60 36
# ... with 4,027 more rows
The plane with tailnum N103US has made 46 flights, of which none have been delayed by more than 1 hour. So the number of flights it has made before its first 1 hour delay is undefined or NA.
I got the answer:
flights %>%
#Eliminate the NAs
filter(!is.na(dep_time)) %>%
#Sort by date and time
arrange(time_hour) %>%
group_by(tailnum) %>%
#cumulative number of flights delayed more than one hour
mutate(acum_delay = cumsum(dep_delay > 60)) %>%
#count the number of flights
summarise(before_1hdelay = sum(acum_delay < 1))
Related
I want to get the top 10 destinations, and also how many flights were made to these destinations. I am using summarise, and my problem is that summarise throws away all columns that a not mentioned in the summarise(..). I need to keep the column origin for later use.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
Here is the result from the code above
# A tibble: 10 x 2
dest allFlights
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
I think this is correct. But all I am missing, is another column that prints the origin
I was thinking about doing some a join to get the origin, but this doesn't make sense, as doing the join on this result set might not yield the correct flights.
I found this post: How to summarise all columns using group_by and summarise? but it was not helpful to me, as summarise is unable to find the columns I mention, that are not in its function.
When you sum the flights by destination, you are summing the total number of flights arriving in the destination city, which have many different origin cities. So it would not make sense for there to be a single value in the origin column here.
If you want, you could replace group_by(dest) with group_by(origin,dest). That would give you the top 10 pairs of origin-destination cities, which is a different output than in your question, but would retain the origin and destination columns for further analysis.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(origin, dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
output
# A tibble: 10 x 3
# Groups: origin [3]
origin dest n
<chr> <chr> <int>
1 JFK LAX 11262
2 LGA ATL 10263
3 LGA ORD 8857
4 JFK SFO 8204
5 LGA CLT 6168
6 EWR ORD 6100
7 JFK BOS 5898
8 LGA MIA 5781
9 JFK MCO 5464
10 EWR BOS 5327
library(tidyverse)
library(nycflights13)
nycflights13::flights
If the following expression gives flights per day from the dataset:
daily <- dplyr::group_by( flights, year, month, day)
(per_day <- dplyr::summarize( daily, flights = n()))
I wanted something similar for cancelled flights:
canx <- dplyr::filter( flights, is.na(dep_time) & is.na(arr_time))
canx2 <- canx %>% dplyr::group_by( year, month, day)
My goal was to have the same length of data frame as for all summarised flights.
I can get number of flights cancelled per day:
(canx_day <- dplyr::summarize( canx2, flights = n()))
but obviously this is a slightly shorter data frame, so I cannot run e.g.:
canx_day$propcanx <- per_day$flights/canx_day$flights
Even if I introduce NAs I can replace them.
So my question is, should I not be using filter, or are there arguments to filter I should be applying?
Many thanks
You should not be using filter. As others suggest, this is easy with a canceled column, so our first step will be to create that column. Then you can easily get whatever you want with a single summarize. For example:
flights %>%
mutate(canceled = as.integer(is.na(dep_time) & is.na(arr_time))) %>%
group_by(year, month, day) %>%
summarize(n_scheduled = n(),
n_not_canceled = sum(!canceled),
n_canceled = sum(canceled),
prop_canceled = mean(canceled))
# # A tibble: 365 x 7
# # Groups: year, month [?]
# year month day n_scheduled n_not_canceled n_canceled prop_canceled
# <int> <int> <int> <int> <int> <int> <dbl>
# 1 2013 1 1 842 838 4 0.004750594
# 2 2013 1 2 943 935 8 0.008483563
# 3 2013 1 3 914 904 10 0.010940919
# 4 2013 1 4 915 909 6 0.006557377
# 5 2013 1 5 720 717 3 0.004166667
# 6 2013 1 6 832 831 1 0.001201923
# 7 2013 1 7 933 930 3 0.003215434
# 8 2013 1 8 899 895 4 0.004449388
# ...
This gives you flights and canceled flight per day by flight, year, month, day
nycflights13::flights %>%
group_by(flight, year, month, day) %>%
summarize(per_day = n(),
canx = sum(ifelse(is.na(arr_time), 1, 0)))
There is a simple way to calculate number of flights canceled per day. Lets assume that Cancelled column is TRUE for the cancelled flight. If so then way to calculate daily canceled flights will be:
flights %>%
group_by(year, month, day) %>%
summarize( canx_day = sum(Cancelled))
canx_day will contain canceled flights for a day.
library(dplyr)
library(forcats)
Using the simple dataframe and code below, I want to create a table with total rows and sub-rows. For example, the first row would be "Region1" from the NEW column and 70 from the TotNumber column, then below that would be three rows for "Town1", "Town2", and "Town3", and their associated numbers from the Number column, and the same for "Region2" and "Region3". I attached a pic of the desired table...
I'm also looking for a solution using dplyr and Tidyverse.
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number)
DF<-DF%>%mutate_at(vars(Town),funs(as.factor))
To create Region variable...
DF<-DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")))%>%
group_by(NEW)%>%
summarise(TotNumber=sum(Number))
Modifying your last pipes and adding some addition steps:
library(dplyr)
library(forcats)
DF%>%mutate(NEW=fct_collapse(Town,
Region1=c("Town1","Town2","Town3"),
Region2=c("Town4","Town5","Town6"),
Region3=c("Town7","Town8","Town9","Town10")),
NEW = as.character(NEW)) %>%
group_by(NEW) %>%
mutate(TotNumber=sum(Number)) %>%
ungroup() %>%
split(.$NEW) %>%
lapply(function(x) rbind(setNames(x[1,3:4], names(x)[1:2]), x[1:2])) %>%
do.call(rbind, .)
Results:
# A tibble: 13 × 2
Town Number
* <chr> <dbl>
1 Region1 70
2 Town1 10
3 Town2 30
4 Town3 30
5 Region2 96
6 Town4 10
7 Town5 56
8 Town6 30
9 Region3 133
10 Town7 40
11 Town8 50
12 Town9 33
13 Town10 10
Data:
Number<-c(10,30,30,10,56,30,40,50,33,10)
Town<-c("Town1","Town2","Town3","Town4","Town5","Town6","Town7","Town8","Town9","Town10")
DF<-data_frame(Town,Number) %>%
mutate_at(vars(Town),funs(as.factor))
I'm working on the baseball data set:
data(baseball, package="plyr")
library(dplyr)
baseball[,1:4] %>% head
id year stint team
4 ansonca01 1871 1 RC1
44 forceda01 1871 1 WS3
68 mathebo01 1871 1 FW1
99 startjo01 1871 1 NY2
102 suttoez01 1871 1 CL1
106 whitede01 1871 1 CL1
First I want to group the data set by team in order to find the first year each team appears, and the number of distinct players that has ever played for each team:
baseball[,1:4] %>% group_by(team) %>%
summarise("first_year"=min(year), "num_distinct_players"=n_distinct(id))
# A tibble: 132 × 3
team first_year num_distinct_players
<chr> <int> <int>
1 ALT 1884 1
2 ANA 1997 29
3 ARI 1998 43
4 ATL 1966 133
5 BAL 1954 158
Now I want to add a column showing the maximum number of years any player (id) has played for the team in question. To do this, I need to somehow group by player within the existing group (team), and select the maximum number of rows. How do I do this?
Perhaps this helps
baseball %>%
select(1:4) %>%
group_by(id, team) %>%
dplyr::mutate(nyear = n_distinct(year)) %>%
group_by(team) %>%
dplyr::summarise(first_year = min(year),
num_distinct_players = n_distinct(id),
maxYear = max(nyear))
I tried doing this with base R and came up with this. It's fairly slow.
df = data.frame(t(sapply(split(baseball, baseball$team), function(x)
cbind( min(x$year),
length(unique(x$id)),
max(sapply(split(x,x$id), function(y)
nrow(y))),
names(which.max(sapply(split(x,x$id), function(y)
nrow(y)))) ))))
colnames(df) = c("Year", "Unique Players", "Longest played duration",
"Longest Playing Player")
First, split by team into different groups
For each group, obtain the minimum year as first year when the team appears
Get length of unique ids which is the number of players in that team
Split each group into subgroup by id and obtain the maximum number of rows that will give the maximum duration played by a player in that team
For each subgroup, get names of the id with maximum rows which gives the name of the player that played for the longest time in that team
In R (or other language), I want to transform an upper data frame to lower one.
How can I do that?
Thank you beforehand.
year month income expense
2016 07 50 15
2016 08 30 75
month income_expense
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
Well, it seems that you are trying to do multiple operations in the same question: combine dates columns, melt your data, some colnames transformations and sorting
This will give your expected output:
library(tidyr); library(reshape2); library(dplyr)
df %>% unite("date", c(year, month)) %>%
mutate(expense=-expense) %>% melt(value.name="income_expense") %>%
select(-variable) %>% arrange(date)
#### date income_expense
#### 1 2016_07 50
#### 2 2016_07 -15
#### 3 2016_08 30
#### 4 2016_08 -75
I'm using three different libraries here, for better readability of the code. It might be possible to do it with base R, though.
Here's a solution using only two packages, dplyr and tidyr
First, your dataset:
df <- dplyr::data_frame(
year =2016,
month = c("07", "08"),
income = c(50,30),
expense = c(15, 75)
)
The mutate() function in dplyr creates/edits individual variables. The gather() function in tidyr will bring multiple variables/columns together in the way that you specify.
df <- df %>%
dplyr::mutate(
month = paste0(year, "-", month)
) %>%
tidyr::gather(
key = direction, #your name for the new column containing classification 'key'
value = income_expense, #your name for the new column containing values
income:expense #which columns you're acting on
) %>%
dplyr::mutate(income_expense =
ifelse(direction=='expense', -income_expense, income_expense)
)
The output has all the information you'd need (but we will clean it up in the last step)
> df
# A tibble: 4 × 4
year month direction income_expense
<dbl> <chr> <chr> <dbl>
1 2016 2016-07 income 50
2 2016 2016-08 income 30
3 2016 2016-07 expense -15
4 2016 2016-08 expense -75
Finally, we select() to drop columns we don't want, and then arrange it so that df shows the rows in the same order as you described in the question.
df <- df %>%
dplyr::select(-year, -direction) %>%
dplyr::arrange(month)
> df
# A tibble: 4 × 2
month income_expense
<chr> <dbl>
1 2016-07 50
2 2016-07 -15
3 2016-08 30
4 2016-08 -75
NB: I guess that I'm using three libraries, including magrittr for the pipe operator %>%. But, since the pipe operator is the best thing ever, I often forget to count magrittr.