How to cross-reference tibbles in R?

library(nycflights13)
library(tidyverse)
My task is:
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
I have generated a tibble with the average flight time between each pair of airports:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
Now I need to sweep through flights and extract all rows whose air_time falls outside, say, the interval (mean_time / 2, mean_time * 2). How do I do that?

Assuming you have stored the tibble with the average flight times, join it to the flights table:
flights_suspicious <- left_join(flights, average_flight_times, by = c("origin", "dest")) %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)
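A possible follow-up (my addition, not part of the original answer): since the exercise asks you to look at each destination, you could then count the flagged flights per destination:
flights_suspicious %>%
  count(dest, sort = TRUE)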

You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.
library(nycflights13)
library(tidyverse)
data("flights")

# get mean air time per origin-dest pair
mean_time <- flights %>%
  group_by(origin, dest) %>%
  summarise(mean_time = mean(air_time, na.rm = TRUE))

# join mean time back onto the original data, then flag outliers
df <- left_join(flights, mean_time)

flag_flights <- df %>%
  filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
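A variant not shown in either answer above (a sketch of the same logic): a grouped mutate() computes the per-route mean inline, so no separate summarise/join step is needed.
library(nycflights13)
library(tidyverse)

flights %>%
  group_by(origin, dest) %>%
  mutate(mean_time = mean(air_time, na.rm = TRUE)) %>%   # per-route mean, kept on every row
  ungroup() %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)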

Related

Error finding object when removing outliers in pipe in R

Okay. I have looked everywhere, read documentation, watched videos, and asked people for help, and I can't seem to get this figured out. I need to remove the outliers in one variable of a data set using object assignment and the quartile method, but I have to do it in the pipe. When I run the code, the object cannot be found. Here is the code:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(lm.beta))
Q1 <- flights %>%
dep_delay_upper <- quantile(dep_delay$y, 0.997, na.rm = TRUE) %>%
dep_delay_lower <- quantile(dep_delay$y, 0.003, na.rm = TRUE) %>%
dep_delay_out <- which(dep_delay$y > dep_delay_upper | dep_delay$y < dep_delay_lower) %>%
dep_delay_noout <- dep_delay[-dep_delay_out,]
With magrittr's pipe, you can reuse the piped object with a . like so.
The first way gets only the values of dep_delay:
flights$dep_delay %>%
  .[which(. < quantile(., 0.997, na.rm = TRUE) & . > quantile(., 0.003, na.rm = TRUE))]
And the second way filters the entire flights dataframe:
flights %>%
  .[which(.$dep_delay < quantile(.$dep_delay, 0.997, na.rm = TRUE) &
            .$dep_delay > quantile(.$dep_delay, 0.003, na.rm = TRUE)), ]
# # A tibble: 326,164 × 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_…¹ arr_d…² carrier flight tailnum origin dest air_t…³ dista…⁴ hour minute time_hour
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00
# 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00
# # … with 326,154 more rows, and abbreviated variable names ¹​sched_arr_time, ²​arr_delay, ³​air_time, ⁴​distance
# # ℹ Use `print(n = ...)` to see more rows
Or alternatively with dplyr:
flights %>%
  filter(dep_delay < quantile(dep_delay, 0.997, na.rm = TRUE) &
           dep_delay > quantile(dep_delay, 0.003, na.rm = TRUE))
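A further variant (my sketch, not from the original answer): dplyr::between() expresses the same band in one call; note that between() is inclusive at both ends, whereas the inequalities above are strict.
library(dplyr)
library(nycflights13)

flights %>%
  filter(between(dep_delay,
                 quantile(dep_delay, 0.003, na.rm = TRUE),   # lower bound
                 quantile(dep_delay, 0.997, na.rm = TRUE)))  # upper bound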

How to sum all variables that aren't characters/factors using group_by? [duplicate]

This question already has answers here:
Group by multiple columns and sum other multiple columns
(7 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 1 year ago.
I am new to R. I have some data from local elections in Mexico and I want to determine how many votes each party had in each municipality.
Here is an example of the data (the political parties are all variables from PRI onwards; NOM_MUN is the name of the municipality):
head(Campeche)
# A tibble: 6 x 14
CABECERA_DISTRITAL CIRCUNSCRIPCION NOMBRE_ESTADO NOM_MUN PRI PAN MORENA PRD PVEM PT MC
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 153 137 43 5 6 9 7
2 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 109 113 52 15 9 4 5
3 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 169 154 33 14 12 5 6
4 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 1414 1474 415 154 73 62 53
5 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 199 238 88 25 17 11 12
6 SAN FRANCISCO DE CAMPECHE 3 CAMPECHE CAMPECHE 176 197 60 15 7 13 11
# … with 3 more variables: NVA_ALIANZA <dbl>, PH <dbl>, ES <dbl>
tail(Campeche)
CABECERA_DISTRITAL CIRCUNSCRIPCION NOMBRE_ESTADO NOM_MUN PRI PAN MORENA PRD PVEM PT MC
<chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SABANCUY 3 CAMPECHE CARMEN 83 74 21 7 0 3 1
2 SABANCUY 3 CAMPECHE CARMEN 68 47 28 5 3 4 1
3 SABANCUY 3 CAMPECHE CARMEN 56 72 16 1 0 1 1
4 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 90 147 3 2 4 1 3
5 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 141 161 39 30 4 9 15
6 SEYBAPLAYA 3 CAMPECHE CHAMPOTON 84 77 1 6 0 0 3
# … with 3 more variables: NVA_ALIANZA <dbl>, PH <dbl>, ES <dbl>
The data is disaggregated by electoral section, and there is more than one electoral section per municipality; what I am looking for is the total votes for each political party by municipality.
This is what I was doing, but I believe there is a faster way to do the same thing that can be replicated for different municipalities with different parties (a more compact alternative is sketched after the output below).
results_Campeche <- Campeche %>%
  group_by(NOM_MUN) %>%
  summarize(PRI = sum(PRI), PAN = sum(PAN), PRD = sum(PRD), MORENA = sum(MORENA),
            PVEM = sum(PVEM), PT = sum(PT), MC = sum(MC), NVA_ALIANZA = sum(NVA_ALIANZA),
            PH = sum(PH), ES = sum(ES), .groups = "drop")
head(results_Campeche)
NOM_MUN PRI PAN PRD MORENA PVEM PT MC NVA_ALIANZA PH ES
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CALAKMUL 4861 5427 290 198 70 109 84 236 9 53
2 CALKINI 9035 1326 319 11714 684 194 282 4537 41 262
3 CAMPECHE 39386 32574 4394 11639 2211 2033 1451 4656 1995 4681
4 CANDELARIA 6060 11982 98 209 38 73 135 73 21 21
5 CARMEN 25252 38239 2505 9314 1164 708 712 1124 742 838
6 CHAMPOTON 16415 8500 3212 5387 457 636 1122 1034 203 340
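The linked duplicates cover this; a compact sketch with across() (my example, requires dplyr >= 1.0.0). The PRI:ES range follows the asker's note that the party columns run from PRI onwards:
library(dplyr)

results_Campeche <- Campeche %>%
  group_by(NOM_MUN) %>%
  summarize(across(PRI:ES, sum), .groups = "drop")

# Summing every numeric column also works, but here it would include
# CIRCUNSCRIPCION, which is numeric yet not a party, so the range version is safer:
# summarize(across(where(is.numeric), sum), .groups = "drop")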

Getting an error when trying to use "filter" function on several parameters

library(tidyverse)
library(nycflights13)
I want to select only the flights that have values in the given columns, i.e. I want to drop the flights with NAs in dep_delay, arr_delay and distance.
I am getting an error saying: Error: Result must have length 1, not 3
The error is caused by this line: filter(!is.na(c("dep_delay", "arr_delay", "distance")))
flights %>%
  group_by(dep_delay, arr_delay, distance) %>%
  filter(!is.na(c("dep_delay", "arr_delay", "distance"))) %>%
  summarise()
I also tried filter(!is.na("dep_delay", "arr_delay", "distance")) (i.e. removing the c(...)), but that did not work either.
If there are multiple columns, use filter_at (assuming we want to drop a row if any of these columns is NA):
library(dplyr)
flights %>%
  filter_at(vars(c("dep_delay", "arr_delay", "distance")),
            all_vars(!is.na(.)))
# A tibble: 327,346 x 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO
#10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD
# … with 327,336 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
In the development version of dplyr (released as dplyr 1.0.0), we can use across() with filter():
flights %>%
  filter(across(c(dep_delay, arr_delay, distance), ~ !is.na(.)))
If the condition is to have at least one non-NA among those columns, replace all_vars() with any_vars():
flights %>%
  filter_at(vars(c("dep_delay", "arr_delay", "distance")),
            any_vars(!is.na(.)))
NOTE: the group_by() step can come after the filter() step, since the same columns are used in both.
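In current dplyr (1.0.4 and later), if_all() and if_any() are the recommended way to express this, since filter_at() is superseded and using across() inside filter() is deprecated. A sketch, assuming a recent dplyr:
library(dplyr)
library(nycflights13)

# keep rows where none of the three columns is NA (all_vars equivalent)
flights %>%
  filter(if_all(c(dep_delay, arr_delay, distance), ~ !is.na(.)))

# keep rows where at least one of the three columns is non-NA (any_vars equivalent)
flights %>%
  filter(if_any(c(dep_delay, arr_delay, distance), ~ !is.na(.)))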

dplyr group_by summarise inconsistent number of rows

I have been following a tutorial on DataCamp. When I run the following line of code, it produces a different value for drows each time:
hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(rows = n(), drows = n_distinct(rows))
First time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 86
2 AirTran BKG 14 6
3 Alaska SEA 32 18
4 American DFW 186 74
5 American MIA 129 57
6 American_Eagle DFW 234 101
7 American_Eagle LAX 74 34
8 American_Eagle ORD 133 56
9 Atlantic_Southeast ATL 64 28
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
Second time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 125
2 AirTran BKG 14 13
3 Alaska SEA 32 29
4 American DFW 186 118
5 American MIA 129 76
6 American_Eagle DFW 234 143
7 American_Eagle LAX 74 47
8 American_Eagle ORD 133 85
9 Atlantic_Southeast ATL 64 44
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
Third time:
Source: local data frame [234 x 4]
Groups: UniqueCarrier [?]
UniqueCarrier Dest rows drows
<chr> <chr> <int> <int>
1 AirTran ATL 211 88
2 AirTran BKG 14 7
3 Alaska SEA 32 16
4 American DFW 186 79
5 American MIA 129 61
6 American_Eagle DFW 234 95
7 American_Eagle LAX 74 31
8 American_Eagle ORD 133 67
9 Atlantic_Southeast ATL 64 31
10 Atlantic_Southeast CVG 1 1
# ... with 224 more rows
My question is why does this value constantly change? What is it doing?
Apparently this is normal behaviour; see this issue: https://github.com/tidyverse/dplyr/issues/2222.
As explained there, values in list columns are compared by reference, so n_distinct() treats them as different unless they really point to the same object.
So the internal storage of the data frame changes the result. Hadley's comment in that issue suggests it might be a bug (in the sense of unwanted behaviour), or it might be expected behaviour that needs to be documented better.

Selecting a subset of a sqlite database with dplyr

I'm trying to pull down a subset of rows in a sqlite database using dplyr. Since slice doesn't work with tbl_sql objects, I'm using the window function row_number. But I get the following error:
Source: sqlite 3.8.6
[/Library/Frameworks/R.framework/Versions/3.2/Resources/library/dplyr/db/nycflights13.sqlite]
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such function: ROW_NUMBER
dplyr version 0.4.3.9000, RSQLite version 1.0.0. Reproducible example:
library(dplyr)
library(nycflights13)
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
filter(flights_sqlite, row_number(month) == 1L) %>% collect()
There is probably a more efficient and faster way, but head() seems to do the job.
To extract the first n rows, for instance the first 10 records:
head(flights_sqlite, 10) %>% collect()
Output:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
3 2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
4 2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
5 2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 5 54
6 2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 5 54
7 2013 1 1 555 -5 913 19 B6 N516JB 507 EWR FLL 158 1065 5 55
8 2013 1 1 557 -3 709 -14 EV N829AS 5708 LGA IAD 53 229 5 57
9 2013 1 1 557 -3 838 -8 B6 N593JB 79 JFK MCO 140 944 5 57
10 2013 1 1 558 -2 753 8 AA N3ALAA 301 LGA ORD 138 733 5 58
To extract a percentage of the rows, for instance the first 10%:
head(flights_sqlite, nrow(flights_sqlite)*0.1) %>% collect()
To subset specific rows, for instance rows 578 and 579:
head(flights_sqlite, nrow(flights_sqlite))[578:579, ] %>% collect()
Output:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
578 2013 1 1 1701 -9 2026 11 AA N3FUAA 695 JFK AUS 247 1521 17 1
579 2013 1 1 1701 1 1856 16 UA N418UA 689 LGA ORD 144 733 17 1
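Two cautions about the last two snippets (my notes, not part of the original answer): on a lazy database tbl, nrow() typically returns NA, and [ row-subsetting is generally not translated to SQL, so with current dplyr/dbplyr you would count the rows on the database side first, or collect() before row-indexing. A sketch, assuming DBI, RSQLite and dbplyr are installed:
library(dplyr)
library(nycflights13)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
flights_sqlite <- copy_to(con, nycflights13::flights, "flights")

# count rows in the database, then take the first 10% with head() (translated to LIMIT)
n_rows <- flights_sqlite %>% tally() %>% pull(n)
first_tenth <- flights_sqlite %>% head(round(n_rows * 0.1)) %>% collect()

# arbitrary row ranges generally require collecting first (or a window function)
rows_578_579 <- flights_sqlite %>% collect() %>% slice(578:579)
Also, SQLite 3.25+ supports window functions, so with a current RSQLite build the original row_number() approach may simply work.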
