Retrieve discarded column after using summarise - r

I am selecting the top 10 flight destinations and counting how many flights went to each. To do this I used summarise, which throws away every column that I didn't mention in group_by(..).
Later I need the origin column, but I can no longer retrieve it, because it was discarded along with the other columns. To keep origin it seems that I would need to mention it in my group_by(..), but I don't want that, as my result would then be incorrect. How can I get the origin of these top 10 flights?
library(tidyverse)
library(nycflights13)
(newFlights<- flights %>%
group_by("Destination" = dest) %>%
summarise("AllFlights" = n()) %>%
arrange(desc(AllFlights)) %>% top_n(10))

You want to include origin in the call to group_by(). See documentation:
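newFlights <- flights %>%
group_by(origin, dest) %>%
# .groups = "drop" ungroups the result, so the ranking below is overall rather than per origin
summarise("AllFlights" = n(), .groups = "drop") %>%
arrange(desc(AllFlights)) %>%
as.data.frame()
head(newFlights, 10)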
Giving you:
   origin dest AllFlights
1     JFK  LAX      11262
2     LGA  ATL      10263
3     LGA  ORD       8857
4     JFK  SFO       8204
5     LGA  CLT       6168
6     EWR  ORD       6100
7     JFK  BOS       5898
8     LGA  MIA       5781
9     JFK  MCO       5464
10    EWR  BOS       5327

Related

In R, how to combine dplyr transformations for nycflights13

I am trying to find the five shortest minimum distances, called min_dist, by origin/destination in the nycflights13 package in RStudio. The result should be a tibble with 5 rows and 3 columns (origin, dest, and min_dist).
I am a beginner and this is what I have so far:
Q3 <- flights %>%
arrange(flights, distance)
group_by(origin) %>%
summarise(min_dist = origin/dest)
I am getting the error: Error in group_by(origin) : object 'origin' not found. Any hints on what to do? A lot of the other questions are similar to this so I want to figure out how to do these. Thank you
This can be done by selecting the columns of interest, getting the distinct rows, and applying slice_min with n = 5:
library(dplyr)
flights %>%
select(origin, dest, min_distance = distance)%>%
distinct %>%
slice_min(n = 5, order_by = min_distance, with_ties = FALSE)
-output
# A tibble: 5 × 3
  origin dest  min_distance
  <chr>  <chr>        <dbl>
1 EWR    LGA             17
2 EWR    PHL             80
3 JFK    PHL             94
4 LGA    PHL             96
5 EWR    BDL            116
We could also use top_n with a negative n, which picks the 5 smallest values of the last column (here distance):
library(nycflights13)
library(dplyr)
flights %>%
select(origin, dest, distance) %>%
distinct() %>%
top_n(-5) %>%
arrange(distance)
  origin dest  distance
  <chr>  <chr>    <dbl>
1 EWR    LGA         17
2 EWR    PHL         80
3 JFK    PHL         94
4 LGA    PHL         96
5 EWR    BDL        116
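As a side note, the error in the question comes from the missing %>% after arrange(flights, distance): without it, group_by(origin) starts a new expression with no data frame in front of it, so origin is not found. If the goal is literally a minimum per origin/dest pair, a group_by() plus summarise() version may read more directly. The following is only a sketch on a small made-up table (toy_flights and its values are hypothetical stand-ins for flights), assuming dplyr 1.0.0 or later for slice_min() and .groups:

```r
library(dplyr)

# Hypothetical stand-in for nycflights13::flights, with only the columns needed
toy_flights <- tibble::tibble(
  origin   = c("EWR", "EWR", "JFK", "LGA", "EWR", "JFK"),
  dest     = c("PHL", "PHL", "PHL", "PHL", "BDL", "LAX"),
  distance = c(80, 80, 94, 96, 116, 2475)
)

Q3 <- toy_flights %>%
  group_by(origin, dest) %>%
  # one row per origin/dest pair, holding that pair's shortest distance
  summarise(min_dist = min(distance), .groups = "drop") %>%
  # keep the 3 smallest here; the question asks for 5 on the real data
  slice_min(min_dist, n = 3, with_ties = FALSE)

print(Q3)
```

With the real data, replacing toy_flights with flights and n = 3 with n = 5 should give a tibble with the columns origin, dest and min_dist that the question asks for.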

How do I fix the following error message in R: "Error in data.frame arguments imply differing number of rows"?

I'm very new to coding in R, and was assigned a web scraping task. When I attempt to create a data frame (thanks, YouTube!), I keep getting the error message in the title above. Do you spot any obvious errors in my code below, or do you have any suggestions for fixing the issue? Thank you!
link = "http://www.hockeycentral.co.uk/nhl/records/alltimegoal.php"
page = read_html(link)
rank = page %>% html_nodes('#example :nth-child(1)') %>% html_text()
player = page %>% html_nodes('.text-left:nth-child(2)') %>% html_text()
teams = page %>% html_nodes('.text-left:nth-child(3)') %>% html_text()
goals = page %>% html_nodes(':nth-child(4)') %>% html_text()
games = page %>% html_nodes(':nth-child(5)') %>% html_text()
assists = page %>% html_nodes(':nth-child(6)') %>% html_text()
points = page %>% html_nodes(':nth-child(7)') %>% html_text()
PPG = page %>% html_nodes(':nth-child(8)') %>% html_text()
SHG = page %>% html_nodes(':nth-child(9)') %>% html_text()
nhlcareer = data.frame(rank, player, teams, goals, games, assists, points, PPG, SHG, stringsAsFactors = FALSE)
You could use html_table:
library(rvest)
link = "http://www.hockeycentral.co.uk/nhl/records/alltimegoal.php"
page = read_html(link)
table <- page %>% html_table()
table <- table[[1]]
head(table)
  Rank Player        Team(s)                                 Goals Games Assists Points PPG SHG
1    1 Wayne Gretzky EDM, LAK, STL, NYR                        894  1487    1963   2857 204  73
2    2 Gordie Howe   DET, HFD.                                 801  1767    1049   1850 211  24
3    3 Jaromir Jagr  Pit, WSH, NYR, PHI, DAL, BOS, NJD, FLA.   766  1733    1155   1921 217  11
4    4 Brett Hull    CGY, STL, DAL, DET, PHX                   741  1269     650   1391 265  20
5    5 Marcel Dionne DET, LAK, NYR                             731  1348    1040   1771 234  19
6    6 Phil Esposito CHI, BOS, NYR                             717  1282     873   1590 246  23
Extracting the columns one by one, as you did, returns vectors of different lengths. For example:
> length(player)
[1] 201
> length(rank)
[1] 204
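Because data.frame() needs all of its columns to have the same number of rows, and these lengths cannot be recycled to match, R returns the error message instead of putting the columns together.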

Summing row values using mutate_at

I am trying to sum row values by specific columns using mutate_at and sum function. The dataset is given below:
Country/Region  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20
Chili                 0        0        0        3        1        2
Chili                 1        0        1        4        2        1
China                23       26      123       12       56       70
China                45       25       56       23       16       18
I am using the following code, but instead of summing all the column values, I am getting zeroes.
tb <- confirmed_raw %>% group_by(`Country/Region`) %>%
filter(`Country/Region` != "Cruise Ship") %>%
select(-`Province/State`, -Lat, -Long) %>%
mutate_at(vars(-group_cols()), ~sum)
The output which I want is:
Country/Region  1/22/20  1/23/20  1/24/20  1/25/20  1/26/20  1/27/20
Chili                 2        0        1        7        3        3
China                68       51      179       35       72       88
But instead of the above, all the date columns come out as 0. How can I solve this?
Can you try summarise_all instead of mutate_at(vars(-group_cols()), ~sum)?
tb %>% group_by(`Country/Region`) %>% summarise_all(sum)
PS: I guess you have a few typos here, such as tb[1,1], which should return 1, not 2. Also, the example code does not correspond entirely to the data (there is no Cruise Ship or Province/State in it). Still, ignoring those, I found this works to generate the expected output.
For completeness, another option:
tb %>% group_by(`Country/Region`) %>% mutate_all(sum) %>% distinct(`Country/Region`,.keep_all = TRUE)
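With dplyr 1.0.0 or later, the same result can also be written with across(), which supersedes the scoped verbs such as summarise_all() and mutate_at(). A minimal sketch, assuming the data looks like the sample in the question (the name confirmed is made up here, and the table is typed in by hand):

```r
library(dplyr)

# Hand-typed copy of the sample data from the question
confirmed <- tibble::tribble(
  ~`Country/Region`, ~`1/22/20`, ~`1/23/20`, ~`1/24/20`, ~`1/25/20`, ~`1/26/20`, ~`1/27/20`,
  "Chili",            0,  0,   0,  3,  1,  2,
  "Chili",            1,  0,   1,  4,  2,  1,
  "China",           23, 26, 123, 12, 56, 70,
  "China",           45, 25,  56, 23, 16, 18
)

totals <- confirmed %>%
  group_by(`Country/Region`) %>%
  # everything() skips the grouping column, so only the date columns are summed
  summarise(across(everything(), sum))

print(totals)
```

On this sample the 1/22/20 total for Chili is 1 (0 + 1), in line with the PS above.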

Keep desired columns when using summarise

I want to get the top 10 destinations, and also how many flights were made to these destinations. I am using summarise, and my problem is that summarise throws away all columns that are not mentioned in group_by(..) or summarise(..). I need to keep the column origin for later use.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
Here is the result from the code above
# A tibble: 10 x 2
   dest      n
   <chr> <int>
 1 ORD   17283
 2 ATL   17215
 3 LAX   16174
 4 BOS   15508
 5 MCO   14082
 6 CLT   14064
 7 SFO   13331
 8 FLL   12055
 9 MIA   11728
10 DCA    9705
I think this is correct. All I am missing is another column that shows the origin.
I was thinking about doing a join to get the origin, but this doesn't make sense, as doing the join on this result set might not yield the correct flights.
I found this post: "How to summarise all columns using group_by and summarise?", but it was not helpful to me, as summarise is unable to find the columns I mention that are not in its function.
When you sum the flights by destination, you are summing the total number of flights arriving in the destination city, which have many different origin cities. So it would not make sense for there to be a single value in the origin column here.
If you want, you could replace group_by(dest) with group_by(origin,dest). That would give you the top 10 pairs of origin-destination cities, which is a different output than in your question, but would retain the origin and destination columns for further analysis.
library(tidyverse)
library(nycflights13)
flights %>%
group_by(origin, dest) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>% head(10)
output
# A tibble: 10 x 3
# Groups:   origin [3]
   origin dest      n
   <chr>  <chr> <int>
 1 JFK    LAX   11262
 2 LGA    ATL   10263
 3 LGA    ORD    8857
 4 JFK    SFO    8204
 5 LGA    CLT    6168
 6 EWR    ORD    6100
 7 JFK    BOS    5898
 8 LGA    MIA    5781
 9 JFK    MCO    5464
10 EWR    BOS    5327
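If the goal is to keep the ranking by destination and still see where those flights came from, one possibility is to rank the destinations first and then join that ranking back onto per-origin counts. This is only a sketch on a tiny made-up table (toy_flights, top_dest and by_origin are hypothetical names, and it keeps the top 2 destinations instead of the top 10):

```r
library(dplyr)

# Hypothetical stand-in for nycflights13::flights
toy_flights <- tibble::tibble(
  origin = c("JFK", "LGA", "JFK", "EWR", "LGA", "JFK"),
  dest   = c("LAX", "ORD", "LAX", "ORD", "ATL", "ORD")
)

# Step 1: rank destinations by their total number of flights
top_dest <- toy_flights %>%
  count(dest, name = "all_flights") %>%
  slice_max(all_flights, n = 2, with_ties = FALSE)

# Step 2: count flights per origin/dest pair and keep only the top destinations
by_origin <- toy_flights %>%
  count(origin, dest, name = "flights_from_origin") %>%
  inner_join(top_dest, by = "dest") %>%
  arrange(desc(all_flights), desc(flights_from_origin))

print(by_origin)
```

Each top destination then appears once per origin, with both the per-origin count and the overall total, so the destination ranking itself is unchanged.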

How to subset and order rows in one data.table call?

I'm learning data.table and am trying to subset and re-order grouped data while calculating averages, all in one data.table statement.
I got the flight data from here while following along in this tutorial.
From the tutorial,
How can we get the average arrival and departure delay for each orig,dest pair for each month for carrier code “AA”?
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
by = .(origin, dest, month)]
ans
#      origin dest month         V1         V2
#   1:    JFK  LAX     1   6.590361 14.2289157
#   2:    LGA  PBI     1  -7.758621  0.3103448
#   3:    EWR  LAX     1   1.366667  7.5000000
#   4:    JFK  MIA     1  15.720670 18.7430168
#   5:    JFK  SEA     1  14.357143 30.7500000
#  ---
# 196:    LGA  MIA    10  -6.251799 -1.4208633
# 197:    JFK  MIA    10  -1.880184  6.6774194
# 198:    EWR  PHX    10  -3.032258 -4.2903226
# 199:    JFK  MCO    10 -10.048387 -1.6129032
# 200:    JFK  DCA    10  16.483871 15.5161290
This makes sense to me. However, I want to improve on this by organizing the flights which departed from the same origin together. Following the schema of...
data.table(subset, select/compute, group by)
I came up with this:
flights[(carrier=="AA")[order(origin, dest)],
.(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)),
by=.(origin, dest, month)]
But my call to order doesn't seem to have done anything. Why?
I can achieve this in dplyr with:
flights %>%
filter(carrier=="AA") %>%
group_by(origin, dest, month) %>%
summarize(Arrival_Delay=mean(arr_delay), Depart_Delay=mean(dep_delay)) %>%
as.data.table()
I'm curious how to do this in data.table and why my approach did not work.
With data.table, you can use keyby when you want the result sorted after a by operation. From help("data.table", package = "data.table"):
keyby
Same as by, but with an additional setkey() run on the by columns of the result, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted.
You could then use keyby instead of by in your code:
library(data.table)
# get the data from the web
flights <- fread("https://github.com/arunsrinivasan/flights/wiki/NYCflights14/flights14.csv")
# keyby groups like by, then runs setkey() on the grouping columns, which sorts the result
ans <- flights[carrier == "AA",
.(mean(arr_delay), mean(dep_delay)),
keyby = .(origin, dest, month)]
ans
#>      origin dest month         V1         V2
#>   1:    EWR  DFW     1   6.427673 10.0125786
#>   2:    EWR  DFW     2  10.536765 11.3455882
#>   3:    EWR  DFW     3  12.865031  8.0797546
#>   4:    EWR  DFW     4  17.792683 12.9207317
#>   5:    EWR  DFW     5  18.487805 18.6829268
#>  ---
#> 196:    LGA  PBI     1  -7.758621  0.3103448
#> 197:    LGA  PBI     2  -7.865385  2.4038462
#> 198:    LGA  PBI     3  -5.754098  3.0327869
#> 199:    LGA  PBI     4 -13.966667 -4.7333333
#> 200:    LGA  PBI     5 -10.357143 -6.8571429
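To see the difference on something small: by returns the groups in the order they first appear in the data, keyby sorts them, and chaining [order(...)] after the grouping step also sorts the result without setting a key. A sketch on a made-up table (toy and its values are hypothetical, not taken from flights14.csv):

```r
library(data.table)

# Hypothetical miniature of the flights table
toy <- data.table(
  carrier   = c("AA", "AA", "AA", "UA", "AA"),
  origin    = c("LGA", "JFK", "EWR", "JFK", "JFK"),
  dest      = c("PBI", "LAX", "DFW", "LAX", "LAX"),
  arr_delay = c(-8, 6, 4, 1, 10)
)

# by: groups come back in order of first appearance (LGA, JFK, EWR)
first_seen <- toy[carrier == "AA",
                  .(Arrival_Delay = mean(arr_delay)),
                  by = .(origin, dest)]

# keyby: same grouping, result sorted by origin, dest and keyed on them
sorted <- toy[carrier == "AA",
              .(Arrival_Delay = mean(arr_delay)),
              keyby = .(origin, dest)]

# chaining: group first, then order the result in a second [ ] call
chained <- toy[carrier == "AA",
               .(Arrival_Delay = mean(arr_delay)),
               by = .(origin, dest)][order(origin, dest)]

print(first_seen)
print(sorted)
```

In both sorted variants the ordering is applied to the grouped result rather than inside the subset argument, which is the part the attempt in the question was missing.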
