I installed the nycflights13 package and the tidyverse package
Here is the first 6 rows of the flight data set
head(flights)
# A tibble: 6 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK
4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK
5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA
6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR
with 19 variables: year <int>, month <int>, day <int>, dep_time <int>, sched_dep_time <int>, dep_delay <dbl>,
arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
I am trying to do two things.
1: add a new column at the end of this data set using the mutate()
2: In this new column I want to compute the total proportion of the delay for each destination. Meaning I will have to calculate the total minutes of arr_delay
Here is the code I am running. I used the select function to create my data table. I created the new column using the mutate function but I am having trouble writing a line of code to calculate the proportion for each row or destination (there are 336,776 rows)
new.table<-select(flights, dest, month, day, dep_time, carrier, flight, arr_delay)
mutate(new.table, arr_delay_prop=sum(arr_delay)/(1:336776))
I understand I need to use the sum function to get the total sum of the arrival delay but I am not sure how to take a proportion for each row against the total arr_delay.
Do you want this?
mutate(new.table, arr_delay_prop=arr_delay/sum(arr_delay, na.rm = T))
# A tibble: 336,776 x 8
dest month day dep_time carrier flight arr_delay arr_delay_prop
<chr> <int> <int> <int> <chr> <int> <dbl> <dbl>
1 IAH 1 1 517 UA 1545 11 0.00000487
2 IAH 1 1 533 UA 1714 20 0.00000886
3 MIA 1 1 542 AA 1141 33 0.0000146
4 BQN 1 1 544 B6 725 -18 -0.00000797
5 ATL 1 1 554 DL 461 -25 -0.0000111
6 ORD 1 1 554 UA 1696 12 0.00000532
7 FLL 1 1 555 B6 507 19 0.00000842
8 IAD 1 1 557 EV 5708 -14 -0.00000620
9 MCO 1 1 557 B6 79 -8 -0.00000354
10 ORD 1 1 558 AA 301 8 0.00000354
# ... with 336,766 more rows
Related
Okay. I have looked everywhere and read documentation, watched videos, talked to people for help, etc... and cant seem to get this figured out. I need to remove the outliers in one variable of a data set using object assignment and the quartile method, but I have to do it in the pipe. When I run the code, the object cannot be found. Here is the code:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(lm.beta))
Q1 <- flights %>%
dep_delay_upper <- quantile(dep_delay$y, 0.997, na.rm = TRUE) %>%
dep_delay_lower <- quantile(dep_delay$y, 0.003, na.rm = TRUE) %>%
dep_delay_out <- which(dep_delay$y > dep_delay_upper | dep_delay$y < dep_delay_lower) %>%
dep_delay_noout <- dep_delay[-dep_delay_out,]
Here is a screenshot with my error in the terminal:
enter image description here
With magrittr's pipe, you can reuse the piped object with a . as so.
The first way gets only the values of dep_delay:
flights$dep_delay %>%
.[which(. < quantile(., 0.997, na.rm = TRUE) & . > quantile(., 0.003, na.rm = TRUE))]
And the second way filters the entire flights dataframe:
flights %>%
.[which(.$dep_delay < quantile(.$dep_delay, 0.997, na.rm = TRUE) &
.$dep_delay > quantile(.$dep_delay, 0.003, na.rm = TRUE)),]
# # A tibble: 326,164 × 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_…¹ arr_d…² carrier flight tailnum origin dest air_t…³ dista…⁴ hour minute time_hour
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00
# 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00
# # … with 326,154 more rows, and abbreviated variable names ¹sched_arr_time, ²arr_delay, ³air_time, ⁴distance
# # ℹ Use `print(n = ...)` to see more rows
Or alternatively with dplyr:
flights %>%
filter(dep_delay < quantile(dep_delay, 0.997, na.rm = TRUE) &
dep_delay > quantile(dep_delay, 0.003, na.rm = TRUE))
library(tidyverse)
library(nycflights13)
I want to only select the flights that have values in given columns. So I don't care about the flights that have nulls in the columns dep_delay, arr_delay and distance
I am getting an error saying: Error: Result must have length 1, not 3
This error is caused by this: filter(!is.na(c("dep_delay", "arr_delay", "distance")))
flights %>%
group_by(dep_delay, arr_delay, distance) %>%
filter(!is.na(c("dep_delay", "arr_delay", "distance"))) %>%
summarise()
I also tried doing filter(!is.na("dep_delay", "arr_delay", "distance")) (removing the c(...)
If there are multiple columns, use filter_at (assuming that we are removing rows if there are any NAs in a row for each of the columnss
library(dplyr)
flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")),
all_vars(!is.na(.)))
# A tibble: 327,346 x 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO
#10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD
# … with 327,336 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
In the devel version, we can use across with filter
flights %>%
filter(across(c(dep_delay, arr_delay, distance), ~ !is.na(.)))
If the condition is to have at least one non-NA among those columns, replace the all_vars with any_vars
flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")),
any_vars(!is.na(.)))
NOTE: the group_by step can be after the filter step as we are using the same columns
I am using ggplot2 in R to produce the plot from the code:
final_flights <- augment(flights_model, flights_tbl) %>% collect()'
final_flights
# A tibble: 327,346 x 22
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ... with 327,336 more rows, and 14 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
# PC1 <dbl>, PC2 <dbl>, PC3 <dbl>
I have tried already:
ggplot(final_flights, aes(PC1, PC2)) +
geom_point(aes(colour=air_time))
ggplot(final_flights, aes(PC1, PC2, PC3))+
geom_point(aes(colour=air_time, distance, dep_time))
ml_predict(kmeans_model) %>%
collect() %>%
ggplot(aes(air_time, distance, dep_time)) +
geom_point(aes(air_time, distance, dep_time, col = factor(prediction+1)),
size=2, alpha=0.5)+
geom_point(data=kmeans_model$k, aes(air_time, distance, dep_time),
pch='x', size=12)+
scale_color_discrete(name="Predicted cluster")
> Warning: Ignoring unknown aesthetics: Warning: Ignoring unknown
> aesthetics: Error: Column 3 must be named.
I want to produce the ggplot model with two principal components, which variable explains the clustering in the data
I have built a matrix whose names are those of a regressor subset that i want to insert in a regression model formula in R.
For example:
data$age is the response variable
X is the design matrix whose column names are, for example, data$education and data$wage.
The problem is that the column names of X are not fixed (i.e. i don't know which are them in advance), so i tried to code this:
best_model <- lm(data$age ~ paste(colnames(x[, GA#solution == 1]), sep = "+"))
But it doesn't work.
Rather than writing formula by yourself, using pipe(%>%) and dplyr::select() appropriately might be helpful. (Here, change your matrix to data frame.)
library(tidyverse)
mpg
#> # A tibble: 234 x 11
#> manufacturer model displ year cyl trans drv cty hwy fl class
#> <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
#> 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
#> 3 audi a4 2 2008 4 manu… f 20 31 p comp…
#> 4 audi a4 2 2008 4 auto… f 21 30 p comp…
#> 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
#> 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
#> 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
#> 8 audi a4 q… 1.8 1999 4 manu… 4 18 26 p comp…
#> 9 audi a4 q… 1.8 1999 4 auto… 4 16 25 p comp…
#> 10 audi a4 q… 2 2008 4 manu… 4 20 28 p comp…
#> # ... with 224 more rows
Select
dplyr::select() subsets column.
mpg %>%
select(hwy, manufacturer, displ, cyl, cty) %>% # subsetting
lm(hwy ~ ., data = .)
#>
#> Call:
#> lm(formula = hwy ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) manufacturerchevrolet manufacturerdodge
#> 2.65526 -1.08632 -2.55442
#> manufacturerford manufacturerhonda manufacturerhyundai
#> -2.29897 -2.98863 -0.94980
#> manufacturerjeep manufacturerland rover manufacturerlincoln
#> -3.36654 -1.87179 -1.10739
#> manufacturermercury manufacturernissan manufacturerpontiac
#> -2.64828 -2.44447 0.75427
#> manufacturersubaru manufacturertoyota manufacturervolkswagen
#> -3.04204 -2.73963 -1.62987
#> displ cyl cty
#> -0.03763 0.06134 1.33805
Denote that -col.name exclude that column. %>% enables formula to use . notation.
Tidyselect
Lots of data sets group their columns using underscore.
nycflights13::flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
#> # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
For instance, both dep_delay and arr_delay are about delay time. Select helpers such as starts_with(), ends_with(), and contains() can handle this kind of columns.
nycflights13::flights %>%
select(starts_with("sched"),
ends_with("delay"),
distance)
#> # A tibble: 336,776 x 5
#> sched_dep_time sched_arr_time dep_delay arr_delay distance
#> <int> <int> <dbl> <dbl> <dbl>
#> 1 515 819 2 11 1400
#> 2 529 830 4 20 1416
#> 3 540 850 2 33 1089
#> 4 545 1022 -1 -18 1576
#> 5 600 837 -6 -25 762
#> 6 558 728 -4 12 719
#> 7 600 854 -5 19 1065
#> 8 600 723 -3 -14 229
#> 9 600 846 -3 -8 944
#> 10 600 745 -2 8 733
#> # ... with 336,766 more rows
After that, just %>% lm().
nycflights13::flights %>%
select(starts_with("sched"),
ends_with("delay"),
distance) %>%
lm(dep_delay ~ ., data = .)
#>
#> Call:
#> lm(formula = dep_delay ~ ., data = .)
#>
#> Coefficients:
#> (Intercept) sched_dep_time sched_arr_time arr_delay
#> -0.151408 0.002737 0.000951 0.816684
#> distance
#> 0.001859
library(nycflights13)
library(tidyverse)
My task is
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
I have generated a tibble with the average flight times between every two airports:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
Now I need to sweep through flights and extract all rows, whose air_time is outside say (mean_time/2, mean_time*2). How do I do that?
Assuming you have stored the tibble with the average flight times, join it to the flights table:
flights_suspicious <- left_join(flights, average_flight_times, by=c("origin","dest")) %>%
filter(air_time < mean_time / 2 | air_time > mean_time * 2)
You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.
library(nycflights13)
library(tidyverse)
data("flights")
#get mean time
mean_time <- flights %>%
group_by(origin, dest) %>%
summarise(mean_time = mean(air_time, na.rm = TRUE))
#join mean time to original data
df <- left_join(flights, mean_time)
flag_flights <- df %>%
filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>