Error finding object when removing outliers in pipe in R

Okay. I have looked everywhere, read documentation, watched videos, and asked people for help, and I still can't figure this out. I need to remove the outliers in one variable of a data set using object assignment and the quartile method, but I have to do it in the pipe. When I run the code, the object cannot be found. Here is the code:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(nycflights13))
suppressPackageStartupMessages(library(lm.beta))
Q1 <- flights %>%
dep_delay_upper <- quantile(dep_delay$y, 0.997, na.rm = TRUE) %>%
dep_delay_lower <- quantile(dep_delay$y, 0.003, na.rm = TRUE) %>%
dep_delay_out <- which(dep_delay$y > dep_delay_upper | dep_delay$y < dep_delay_lower) %>%
dep_delay_noout <- dep_delay[-dep_delay_out,]
[Screenshot of the error in the terminal]

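dep_delay is a column of flights, not an object in your workspace, so dep_delay$y cannot be found on its own. Assignments made with <- also cannot be chained as pipe steps, because the pipe passes its left-hand side into a function call. With magrittr's pipe, you can refer to the piped object with a . like so.
The first way gets only the values of dep_delay: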
flights$dep_delay %>%
.[which(. < quantile(., 0.997, na.rm = TRUE) & . > quantile(., 0.003, na.rm = TRUE))]
And the second way filters the entire flights dataframe:
flights %>%
.[which(.$dep_delay < quantile(.$dep_delay, 0.997, na.rm = TRUE) &
.$dep_delay > quantile(.$dep_delay, 0.003, na.rm = TRUE)),]
# # A tibble: 326,164 × 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_…¹ arr_d…² carrier flight tailnum origin dest air_t…³ dista…⁴ hour minute time_hour
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00
# 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00
# # … with 326,154 more rows, and abbreviated variable names ¹​sched_arr_time, ²​arr_delay, ³​air_time, ⁴​distance
# # ℹ Use `print(n = ...)` to see more rows
Or alternatively with dplyr:
flights %>%
filter(dep_delay < quantile(dep_delay, 0.997, na.rm = TRUE) &
dep_delay > quantile(dep_delay, 0.003, na.rm = TRUE))
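Since the question asks for object assignment, one option is to compute the quantile bounds as ordinary objects before the pipe and then use them inside filter. Below is a minimal sketch on made-up data. The delays tibble, its values, and the 10%/90% cutoffs are assumptions chosen to keep the toy example easy to check; with flights you would keep 0.003 and 0.997 as above.

```r
library(dplyr)

# Toy data standing in for flights$dep_delay (hypothetical values)
delays <- tibble(dep_delay = c(-500, -3, 0, 2, 5, 8, NA, 12, 900))

# Compute both bounds once, as a regular object outside the pipe
bounds <- quantile(delays$dep_delay, c(0.1, 0.9), na.rm = TRUE)

# Keep only rows strictly between the bounds; filter() also drops the NA row
delays_noout <- delays %>%
  filter(dep_delay > bounds[[1]], dep_delay < bounds[[2]])

delays_noout
```

Computing the bounds once also avoids calling quantile twice inside the filter.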

Related

Something is wrong with using pivot_wider and pivot_longer to gather data (solved it myself)

I previously used this method successfully to gather mean and sd results here. Then I tried the same method to gather my gene count DEG data with "logFC", "cil", "cir" and "ajustP_value", but the result came out wrong.
Just like this:
data_1<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_1) <- c(paste0("Gene_", 1:25))
rownames(data_1)<-NULL
head(data_1)
A<-paste0(1:48,"_logFC")
data_logFC<-data.frame(A=A,data_1)
#
data_2<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_2) <- c(paste0("Gene_", 1:25))
rownames(data_2)<-NULL
B_L<-paste0(1:48,"_CI.L")
data_CIL<-data.frame(A=B_L,data_2)
data_CIL[1:48,1:6]
#
data_3<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_3) <- c(paste0("Gene_", 1:25))
rownames(data_3)<-NULL
C_R<-paste0(1:48,"_CI.R")
data_CIR<-data.frame(A=C_R,data_3)
data_CIR[1:48,1:6]
#
data_4<-data.frame(matrix(sample(1:1200,1200,replace = T),48,25))
names(data_4) <- c(paste0("Gene_", 1:25))
rownames(data_4)<-NULL
D<-paste0(1:48,"_adj.P.Val")
data_ajustP<-data.frame(A=D,data_4)
data_ajustP[1:48,1:6]
# combine data_logFC data_CIL data_CIR data_ajustP
data <- bind_rows(list(
logFC = data_logFC,
CIL = data_CIL,
CIR =data_CIR,
AJSTP=data_ajustP
), .id = "stat")
data[1:10,1:6]
data_DEG<- data %>%
pivot_longer(-c(stat,A), names_to = "Gene", values_to = "value") %>%pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,100)
str(data_DEG$CIL)
> head(data_DEG,100)
# A tibble: 100 x 6
A Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1_logFC Gene_1 504 NA NA NA
2 1_logFC Gene_2 100 NA NA NA
3 1_logFC Gene_3 689 NA NA NA
4 1_logFC Gene_4 779 NA NA NA
5 1_logFC Gene_5 397 NA NA NA
6 1_logFC Gene_6 1152 NA NA NA
7 1_logFC Gene_7 780 NA NA NA
8 1_logFC Gene_8 155 NA NA NA
9 1_logFC Gene_9 142 NA NA NA
10 1_logFC Gene_10 1150 NA NA NA
# … with 90 more rows
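Why are there so many NAs?
Can somebody help me? Many thanks.
EDIT:
I mixed up the real sample grouping of my data. The A column carries a different suffix for each statistic (for example 1_logFC versus 1_CI.L), so the rows had no common index to reshape on.
Here is the correct method: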
data[1:10,1:6]
data<-separate(data,A,c("Name","stat2"),"_")
data<-data[,-3]
data_DEG<- data %>%
pivot_longer(-c(stat,Name), names_to = "Gene", values_to = "value") %>%pivot_wider(names_from = "stat", values_from = "value")
head(data_DEG,10)
tail(data_DEG,10)
> head(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 1 Gene_1 504 1116 774 278
2 1 Gene_2 100 936 448 887
3 1 Gene_3 689 189 718 933
4 1 Gene_4 779 943 690 19
5 1 Gene_5 397 976 40 135
6 1 Gene_6 1152 304 343 647
7 1 Gene_7 780 1076 796 1024
8 1 Gene_8 155 645 469 180
9 1 Gene_9 142 256 889 1047
10 1 Gene_10 1150 976 1194 670
> tail(data_DEG,10)
# A tibble: 10 x 6
Name Gene logFC CIL CIR AJSTP
<chr> <chr> <int> <int> <int> <int>
1 48 Gene_16 448 633 1080 1122
2 48 Gene_17 73 772 14 388
3 48 Gene_18 652 999 699 912
4 48 Gene_19 600 1163 512 241
5 48 Gene_20 428 1119 1142 348
6 48 Gene_21 66 553 240 82
7 48 Gene_22 753 1119 630 117
8 48 Gene_23 1017 305 1120 447
9 48 Gene_24 432 1175 447 670
10 48 Gene_25 482 394 371 696
It's a perfect result!!
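To see why the first attempt produced NAs, here is a minimal sketch with two hand-made rows (the long data frame and its values are made up for illustration). pivot_wider treats every column not named in names_from or values_from as an identifier, so two rows whose A values differ can never be merged into one.

```r
library(tidyr)

# Two statistics for the same sample and gene, with suffixed ids (toy values)
long <- data.frame(
  stat  = c("logFC", "CIL"),
  A     = c("1_logFC", "1_CI.L"),
  Gene  = c("Gene_1", "Gene_1"),
  value = c(504, 1116)
)

# A differs between the rows, so each keeps its own row and the gaps become NA
wide_bad <- pivot_wider(long, names_from = stat, values_from = value)

# Strip everything from the first underscore on so both rows share the id "1"
long$A <- sub("_.*$", "", long$A)
wide_ok <- pivot_wider(long, names_from = stat, values_from = value)

wide_bad
wide_ok
```

The sub() call plays the same role here as the separate() step in the fix above.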

Getting an error when trying to use "filter" function on several parameters

library(tidyverse)
library(nycflights13)
I want to select only the flights that have values in given columns, so I don't care about flights that have NAs in the columns dep_delay, arr_delay and distance.
I am getting an error saying: Error: Result must have length 1, not 3
This error is caused by this: filter(!is.na(c("dep_delay", "arr_delay", "distance")))
flights %>%
group_by(dep_delay, arr_delay, distance) %>%
filter(!is.na(c("dep_delay", "arr_delay", "distance"))) %>%
summarise()
I also tried filter(!is.na("dep_delay", "arr_delay", "distance")) (removing the c(...)).
If there are multiple columns, use filter_at (assuming that we want to remove a row if any of those columns is NA in that row):
library(dplyr)
flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")),
all_vars(!is.na(.)))
# A tibble: 327,346 x 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO
#10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD
# … with 327,336 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
In dplyr 1.0.0 and later (the devel version when this was written), we can use across with filter:
flights %>%
filter(across(c(dep_delay, arr_delay, distance), ~ !is.na(.)))
If the condition is to have at least one non-NA among those columns, replace all_vars with any_vars:
flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")),
any_vars(!is.na(.)))
NOTE: the group_by step can be after the filter step as we are using the same columns
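The original call fails because the quoted names are plain character strings, so is.na is testing three strings rather than the three columns. In newer dplyr releases (1.0.4 and later, as far as I know) if_all and if_any express the same all/any conditions directly inside filter. A minimal sketch on made-up data follows; the toy tibble and its values are assumptions for illustration.

```r
library(dplyr)  # if_all()/if_any() need dplyr >= 1.0.4

# Toy stand-in for flights with some missing values (hypothetical data)
toy <- tibble(
  dep_delay = c(2, NA, 5, NA),
  arr_delay = c(11, 20, NA, NA),
  distance  = c(1400, 1416, 1089, NA)
)

# Quoted names are just strings, and none of the strings is NA
is.na(c("dep_delay", "arr_delay", "distance"))

# Keep rows where all three columns are non-missing
complete_rows <- toy %>%
  filter(if_all(c(dep_delay, arr_delay, distance), ~ !is.na(.x)))

# Keep rows where at least one of the three columns is non-missing
any_value_rows <- toy %>%
  filter(if_any(c(dep_delay, arr_delay, distance), ~ !is.na(.x)))
```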

Unable to perform a filter

Everything has been working fine until a few minutes ago. Below is a sample of the data set.
silence<-suppressWarnings
silence(library(tidyverse))
silence(library(nycflights13))
UAdaily<-flights %>%
filter(carrier=="UA")
To which I get the following error:
Error in filter.default(., carrier == "UA") : argument "x" is missing, with no default
This is happening in the data set I'm supposed to work with too. What is possibly wrong?
Here is a traceback call
10: filter.default(., carrier == "UA")
9: filter(., carrier == "UA")
8: function_list[[k]](value)
7: withVisible(function_list[[k]](value))
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: flights %>% filter(carrier == "UA")
The problem was that another filter function, filter {signal}, was masking the dplyr one. Solved it by using dplyr::filter.
Try this:
library(dplyr)
library(nycflights13)
data(flights)
flights %>%
filter(carrier =="UA")
# A tibble: 58,665 x 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211
# 3 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463
# 4 2013 1 1 558 600 -2 924 917 7 UA 194 N29129
# 5 2013 1 1 558 600 -2 923 937 -14 UA 1124 N53441
# 6 2013 1 1 559 600 -1 854 902 -8 UA 1187 N76515
# 7 2013 1 1 607 607 0 858 915 -17 UA 1077 N53442
# 8 2013 1 1 611 600 11 945 931 14 UA 303 N532UA
[…]
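A minimal sketch of the namespaced call on made-up data (the toy tibble is an assumption for illustration). Writing dplyr::filter makes the call independent of which packages were attached afterwards, and environment() shows which package a bare filter currently resolves to.

```r
library(dplyr)

# Toy stand-in for flights (hypothetical values)
toy <- tibble(
  carrier = c("UA", "AA", "UA"),
  flight  = c(1545, 1141, 1714)
)

# The explicit namespace always picks dplyr's filter(), whatever is attached
ua_only <- toy %>%
  dplyr::filter(carrier == "UA")

# Reports the package that a bare filter() call would use right now
environmentName(environment(filter))
```

If the last line reports a package other than dplyr, a bare filter call is being masked in the same way as in the question.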

How to cross-reference tibbles in R?

library(nycflights13)
library(tidyverse)
My task is
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
I have generated a tibble with the average flight times between every two airports:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
Now I need to sweep through flights and extract all rows whose air_time is outside, say, (mean_time/2, mean_time*2). How do I do that?
Assuming you have stored the tibble with the average flight times, join it to the flights table:
flights_suspicious <- left_join(flights, average_flight_times, by=c("origin","dest")) %>%
filter(air_time < mean_time / 2 | air_time > mean_time * 2)
You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.
library(nycflights13)
library(tidyverse)
data("flights")
#get mean time
mean_time <- flights %>%
group_by(origin, dest) %>%
summarise(mean_time = mean(air_time, na.rm = TRUE))
#join mean time to original data
df <- left_join(flights, mean_time)
flag_flights <- df %>%
filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
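As an alternative to joining, the route mean can be attached to every row with a grouped mutate and filtered in the same pipe. A minimal sketch on made-up data, where the toy tibble and its values are assumptions for illustration:

```r
library(dplyr)

# Toy stand-in for flights: two routes, one suspiciously short air_time
toy <- tibble(
  origin   = c("EWR", "EWR", "EWR", "EWR", "JFK", "JFK"),
  dest     = c("BOS", "BOS", "BOS", "BOS", "MIA", "MIA"),
  air_time = c(40, 42, 38, 10, 160, 150)
)

flagged <- toy %>%
  group_by(origin, dest) %>%
  # mutate() keeps every row and repeats the route mean alongside it
  mutate(mean_time = mean(air_time, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)

flagged
```

Here the EWR to BOS mean is 32.5, so only the 10-minute flight falls below half the mean.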

Selecting a subset of a sqlite database with dplyr

I'm trying to pull down a subset of rows in a sqlite database using dplyr. Since slice doesn't work with tbl_sql objects, I'm using the window function row_number. But I get the following error:
Source: sqlite 3.8.6
[/Library/Frameworks/R.framework/Versions/3.2/Resources/library/dplyr/db/nycflights13.sqlite]
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such function: ROW_NUMBER
dplyr version 0.4.3.9000, RSQLite version 1.0.0. Reproducible example:
library(dplyr)
library(nycflights13)
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
filter(flights_sqlite, row_number(month) == 1L) %>% collect()
Probably there's a more efficient and faster way, but head seems to do the job.
To extract first n rows, for instance first 10 records:
head(flights_sqlite, 10) %>% collect()
Output:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
3 2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
4 2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
5 2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 5 54
6 2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 5 54
7 2013 1 1 555 -5 913 19 B6 N516JB 507 EWR FLL 158 1065 5 55
8 2013 1 1 557 -3 709 -14 EV N829AS 5708 LGA IAD 53 229 5 57
9 2013 1 1 557 -3 838 -8 B6 N593JB 79 JFK MCO 140 944 5 57
10 2013 1 1 558 -2 753 8 AA N3ALAA 301 LGA ORD 138 733 5 58
To extract a percentage of the first rows, for instance the first 10%:
head(flights_sqlite, nrow(flights_sqlite)*0.1) %>% collect()
To subset specific rows, for instance rows 578 and 579:
head(flights_sqlite, nrow(flights_sqlite))[578:579, ] %>% collect()
Output:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
578 2013 1 1 1701 -9 2026 11 AA N3FUAA 695 JFK AUS 247 1521 17 1
579 2013 1 1 1701 1 1856 16 UA N418UA 689 LGA ORD 144 733 17 1
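For context, SQLite only gained window functions such as ROW_NUMBER in version 3.25.0, which would explain the error with 3.8.6. The head approach can be tried without the nycflights13 database using an in-memory table. This is a minimal sketch that assumes the DBI, RSQLite and dbplyr packages are installed; the toy table and its columns are made up for illustration.

```r
library(dplyr)  # the database backend also needs dbplyr installed
library(DBI)

# In-memory SQLite database with a small made-up table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "toy", data.frame(id = 1:20, month = rep(1:2, 10)))
toy_db <- tbl(con, "toy")

# arrange() + head() are translated to ORDER BY ... LIMIT, so no window
# function is needed; collect() pulls the result into R
first_five <- toy_db %>%
  arrange(id) %>%
  head(5) %>%
  collect()

dbDisconnect(con)

first_five
```

Ordering before head matters because a database table has no guaranteed row order.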

Resources