I'm trying to pull down a subset of rows in a sqlite database using dplyr. Since slice doesn't work with tbl_sql objects, I'm using the window function row_number. But I get the following error:
Source: sqlite 3.8.6
Error in sqliteSendQuery(con, statement, bind.data) :
error in statement: no such function: ROW_NUMBER
dplyr version, RSQLite version 1.0.0. Reproducible example:
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
filter(flights_sqlite, row_number(month) == 1L) %>% collect()
Probably there's a more efficient and faster way, but head seems to do the job.
To extract first n rows, for instance first 10 records:
head(flights_sqlite, 10) %>% collect()
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
3 2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
4 2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
5 2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 5 54
6 2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 5 54
7 2013 1 1 555 -5 913 19 B6 N516JB 507 EWR FLL 158 1065 5 55
8 2013 1 1 557 -3 709 -14 EV N829AS 5708 LGA IAD 53 229 5 57
9 2013 1 1 557 -3 838 -8 B6 N593JB 79 JFK MCO 140 944 5 57
10 2013 1 1 558 -2 753 8 AA N3ALAA 301 LGA ORD 138 733 5 58
To extract a percentage of the first rows, for instance the first 10%:
head(flights_sqlite, nrow(flights_sqlite)*0.1) %>% collect()
To subset any specific number of rows. For instance rows 578 and 579:
head(flights_sqlite, nrow(flights_sqlite))[578:579, ] %>% collect()
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
578 2013 1 1 1701 -9 2026 11 AA N3FUAA 695 JFK AUS 247 1521 17 1
579 2013 1 1 1701 1 1856 16 UA N418UA 689 LGA ORD 144 733 17 1
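As a hedged alternative to indexing after head(), the row range can be requested from the database itself with LIMIT and OFFSET. The sketch below uses a small in-memory table named demo (a hypothetical stand-in for flights) and assumes the DBI and RSQLite packages are installed. Note that SQLite only gained window functions such as ROW_NUMBER in version 3.25.0, which is consistent with the error reported for 3.8.6.

```r
library(DBI)

# Hypothetical in-memory table standing in for the flights table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "demo", data.frame(id = 1:1000, value = (1:1000) * 2))

# Rows 578 and 579 in id order: skip the first 577 rows, then take 2.
# An explicit ORDER BY is used because SQL does not guarantee row order otherwise.
rows <- dbGetQuery(con, "SELECT * FROM demo ORDER BY id LIMIT 2 OFFSET 577")
print(rows)

dbDisconnect(con)
```

Only the two requested rows are transferred to R, which may matter for large tables.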
I have a dataframe containing location data of different animals. Each animal has a unique id and each observation has a time stamp and some further metrics of the location observation. See a subset of the data below. The subset contains the first two observations of each id.
> sub
id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
2 111 3 -79.2975 25.6996 414 51 77 2019-04-01 22:08:50
3 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
4 222 3 -79.2970 25.7001 229 78 72 2019-01-07 20:36:27
5 333 B -80.8211 24.8441 11625 6980 37 2018-12-17 20:45:05
6 333 3 -80.8137 24.8263 155 100 69 2018-12-17 21:00:43
7 444 3 -80.4535 25.0848 501 33 104 2019-10-20 19:44:16
8 444 1 -80.8086 24.8364 6356 126 87 2020-01-18 20:32:28
9 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17
10 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35
11 666 2 -77.7221 24.4902 1129 75 66 2020-07-12 21:09:02
12 666 2 -77.7097 24.4905 314 248 164 2020-07-12 21:11:37
13 777 3 -77.7133 24.4820 406 58 110 2020-06-20 11:18:18
14 777 3 -77.7218 24.4844 170 93 107 2020-06-20 11:51:06
15 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
16 888 3 -79.2975 25.6996 550 34 79 2017-11-25 19:10:45
However, I need to do some data housekeeping, i.e. I need to include the day/time and location each animal was released. And after that I need to filter out observations for each animal that occurred pre-release of the corresponding animal.
I have an additional dataframe that contains the necessary release metadata:
> stack
id release lat lon
1 888 2017-11-27 14:53 25.69201 -79.31534
2 333 2019-01-31 16:09 25.68896 -79.31326
3 222 2019-02-02 15:55 25.70051 -79.31393
4 111 2019-04-02 10:43 25.68534 -79.31341
5 444 2020-03-13 15:04 24.42892 -77.69518
6 666 2020-10-27 09:40 24.58290 -77.69561
7 555 2020-01-21 14:38 24.43333 -77.69637
8 777 2020-06-25 08:54 24.42712 -77.76427
So my question is: how can I add the release information (time and lat/lon) to the dataframe for each id (the columns a, b, and c can be NA)? And how can I then filter out the observations that occurred before each animal's release time? I have been looking into possibilities using dplyr but have not yet been able to resolve my issue.
You've not provided an easy way of obtaining your data (dput() is by far the best), and you have issues with your date-time values (release uses Y-M-D H:M whereas date uses Y-M-D H:M:S), so for clarity I've included the code to obtain the data frames I use at the end of this post.
First, the solution:
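library(dplyr)
library(lubridate)

sub %>%
  left_join(stack, by="id") %>%
  # Parse both timestamps; release lacks seconds, so append ":00" first
  mutate(
    date=ymd_hms(date),
    release=ymd_hms(paste0(release, ":00")),
  ) %>%
  filter(date >= release)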
id lc lon.x lat.x a b c date release lat.y lon.y
1 555 3 -77.7211 24.4887 665 45 68 2020-07-12 21:09:17 2020-01-21 14:38:00 24.43333 -77.69637
2 555 3 -77.7163 24.4897 285 129 130 2020-07-12 21:10:35 2020-01-21 14:38:00 24.43333 -77.69637
As I indicated in comments.
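If the lat.x / lat.y names produced by the join are hard to read, one option is the suffix argument of left_join(). The following is a minimal sketch on toy data; obs, rel and all values in them are hypothetical stand-ins for sub and stack, and the timestamps are assumed to be parsed already.

```r
library(dplyr)

# Toy stand-ins for the observation and release tables (hypothetical values)
obs <- data.frame(
  id   = c(1, 1, 2),
  lon  = c(-79.3, -79.2, -80.8),
  lat  = c(25.7, 25.6, 24.8),
  date = as.POSIXct(c("2019-04-01 22:08:50", "2019-04-03 09:00:00",
                      "2018-12-17 20:45:05"), tz = "UTC")
)
rel <- data.frame(
  id      = c(1, 2),
  release = as.POSIXct(c("2019-04-02 10:43:00", "2019-01-31 16:09:00"), tz = "UTC"),
  lat     = c(25.685, 25.689),
  lon     = c(-79.313, -79.313)
)

# Keep the observation columns unsuffixed and tag the release coordinates
post_release <- obs %>%
  left_join(rel, by = "id", suffix = c("", "_release")) %>%
  filter(date >= release)

print(post_release)
```

Here the observation coordinates stay lon and lat, while the release coordinates become lat_release and lon_release.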
To obtain the data
sub <- read.table(textConnection("id lc lon lat a b c date
1 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
2 111 3 -79.2975 25.6996 414 51 77 '2019-04-01 22:08:50'
3 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
4 222 3 -79.2970 25.7001 229 78 72 '2019-01-07 20:36:27'
5 333 B -80.8211 24.8441 11625 6980 37 '2018-12-17 20:45:05'
6 333 3 -80.8137 24.8263 155 100 69 '2018-12-17 21:00:43'
7 444 3 -80.4535 25.0848 501 33 104 '2019-10-20 19:44:16'
8 444 1 -80.8086 24.8364 6356 126 87 '2020-01-18 20:32:28'
9 555 3 -77.7211 24.4887 665 45 68 '2020-07-12 21:09:17'
10 555 3 -77.7163 24.4897 285 129 130 '2020-07-12 21:10:35'
11 666 2 -77.7221 24.4902 1129 75 66 '2020-07-12 21:09:02'
12 666 2 -77.7097 24.4905 314 248 164 '2020-07-12 21:11:37'
13 777 3 -77.7133 24.4820 406 58 110 '2020-06-20 11:18:18'
14 777 3 -77.7218 24.4844 170 93 107 '2020-06-20 11:51:06'
15 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'
16 888 3 -79.2975 25.6996 550 34 79 '2017-11-25 19:10:45'"), header=TRUE)
stack <- read.table(textConnection("id release lat lon
1 888 '2017-11-27 14:53' 25.69201 -79.31534
2 333 '2019-01-31 16:09' 25.68896 -79.31326
3 222 '2019-02-02 15:55' 25.70051 -79.31393
4 111 '2019-04-02 10:43' 25.68534 -79.31341
5 444 '2020-03-13 15:04' 24.42892 -77.69518
6 666 '2020-10-27 09:40' 24.58290 -77.69561
7 555 '2020-01-21 14:38' 24.43333 -77.69637
8 777 '2020-06-25 08:54' 24.42712 -77.76427"), header=TRUE)
Okay. I have looked everywhere, read documentation, watched videos, talked to people for help, etc., and can't seem to get this figured out. I need to remove the outliers in one variable of a data set using object assignment and the quartile method, but I have to do it in the pipe. When I run the code, the object cannot be found. Here is the code:
Q1 <- flights %>%
dep_delay_upper <- quantile(dep_delay$y, 0.997, na.rm = TRUE) %>%
dep_delay_lower <- quantile(dep_delay$y, 0.003, na.rm = TRUE) %>%
dep_delay_out <- which(dep_delay$y > dep_delay_upper | dep_delay$y < dep_delay_lower) %>%
dep_delay_noout <- dep_delay[-dep_delay_out,]
Here is a screenshot with my error in the terminal: [screenshot not available]
With magrittr's pipe, you can reuse the piped object by referring to it as . like so.
The first way gets only the values of dep_delay:
flights$dep_delay %>%
.[which(. < quantile(., 0.997, na.rm = TRUE) & . > quantile(., 0.003, na.rm = TRUE))]
And the second way filters the entire flights dataframe:
flights %>%
.[which(.$dep_delay < quantile(.$dep_delay, 0.997, na.rm = TRUE) &
.$dep_delay > quantile(.$dep_delay, 0.003, na.rm = TRUE)),]
# # A tibble: 326,164 × 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_…¹ arr_d…² carrier flight tailnum origin dest air_t…³ dista…⁴ hour minute time_hour
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dttm>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL 158 1065 6 0 2013-01-01 06:00:00
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD 53 229 6 0 2013-01-01 06:00:00
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO 140 944 6 0 2013-01-01 06:00:00
# 10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD 138 733 6 0 2013-01-01 06:00:00
# # … with 326,154 more rows, and abbreviated variable names ¹sched_arr_time, ²arr_delay, ³air_time, ⁴distance
# # ℹ Use `print(n = ...)` to see more rows
Or alternatively with dplyr:
flights %>%
filter(dep_delay < quantile(dep_delay, 0.997, na.rm = TRUE) &
dep_delay > quantile(dep_delay, 0.003, na.rm = TRUE))
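Since the question mentions the quartile method, here is a hedged sketch of the common 1.5 × IQR rule written as a single filter() call. This is a different cutoff from the 0.3% / 99.7% quantiles used above, and the delays data frame is a hypothetical toy example rather than the flights data.

```r
library(dplyr)

# Toy data (hypothetical): one obvious outlier and one missing value
delays <- data.frame(dep_delay = c(-5, -2, 0, 1, 3, 4, 6, 8, 500, NA))

# Keep values within [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR];
# filter() also drops rows where the condition is NA
trimmed <- delays %>%
  filter(
    dep_delay >= quantile(dep_delay, 0.25, na.rm = TRUE) - 1.5 * IQR(dep_delay, na.rm = TRUE),
    dep_delay <= quantile(dep_delay, 0.75, na.rm = TRUE) + 1.5 * IQR(dep_delay, na.rm = TRUE)
  )

print(trimmed)
```

No intermediate objects are assigned, so everything stays inside the pipe.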
I want to only select the flights that have values in given columns. So I don't care about the flights that have nulls in the columns dep_delay, arr_delay and distance
I am getting an error saying: Error: Result must have length 1, not 3
This error is caused by this: filter(!is.na(c("dep_delay", "arr_delay", "distance")))
flights %>%
group_by(dep_delay, arr_delay, distance) %>%
filter(!is.na(c("dep_delay", "arr_delay", "distance"))) %>%
I also tried doing filter(!is.na("dep_delay", "arr_delay", "distance")) (removing the c(...)).
If there are multiple columns, use filter_at (assuming that we are removing a row if it has an NA in any of those columns):
flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")),
all_vars(!is.na(.)))
# A tibble: 327,346 x 19
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest
# <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr>
# 1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH
# 2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH
# 3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA
# 4 2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN
# 5 2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL
# 6 2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD
# 7 2013 1 1 555 600 -5 913 854 19 B6 507 N516JB EWR FLL
# 8 2013 1 1 557 600 -3 709 723 -14 EV 5708 N829AS LGA IAD
# 9 2013 1 1 557 600 -3 838 846 -8 B6 79 N593JB JFK MCO
#10 2013 1 1 558 600 -2 753 745 8 AA 301 N3ALAA LGA ORD
# … with 327,336 more rows, and 5 more variables: air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
# time_hour <dttm>
In the devel version, we can use across with filter
flights %>%
filter(across(c(dep_delay, arr_delay, distance), ~ !is.na(.)))
If the condition is to have at least one non-NA among those columns, replace the all_vars with any_vars
flights %>%
filter_at(vars(c("dep_delay", "arr_delay", "distance")),
any_vars(!is.na(.)))
NOTE: the group_by step can be after the filter step as we are using the same columns
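In more recent dplyr releases (1.0.4 and later), if_all() and if_any() express the same two conditions inside filter(). A minimal sketch on a toy data frame (toy and its values are hypothetical):

```r
library(dplyr)

# Toy data (hypothetical) with missing values in different columns
toy <- data.frame(
  dep_delay = c(1, NA, 3, NA),
  arr_delay = c(2, 5, NA, NA),
  distance  = c(100, 200, 300, NA)
)

# Keep rows where all three columns are non-missing
complete_rows <- toy %>%
  filter(if_all(c(dep_delay, arr_delay, distance), ~ !is.na(.x)))

# Keep rows where at least one of the three columns is non-missing
partial_rows <- toy %>%
  filter(if_any(c(dep_delay, arr_delay, distance), ~ !is.na(.x)))

print(complete_rows)
print(partial_rows)
```

if_all() plays the role of all_vars() and if_any() the role of any_vars().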
I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood what you were doing, except for pen_rate. It seems that pen_rate is calculated by dividing cum_chart by total. I downloaded the original data set and did the following. I called the data set foo. Where you created four separate columns (open_chart, closed_chart, open_pub, and closed_pub), I combined Charter and ClosedDate. I checked whether ClosedDate is NA and converted the logical output to numbers (1 = open, 0 = closed). This is how I created the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub). I guess this requires less typing. Since the dates are stored as character, I extracted the year using substr(). If you have a date object, you need to do something else. Once you have the year, you group the data by it and calculate how many schools exist for each type of school using count(). This part is the equivalent of your aggregate() code. Then I converted the output to wide format with spread() and did the rest of the calculation as you demonstrated in your code. The final output looks different from what you have in your question, but my outcome was identical to the one I obtained by running your code. I hope this helps.
# Get the necessary data
library(readxl)
library(dplyr)
library(tidyr)

foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
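For the case where OpenDate is already a Date (as in the dput() sample in the question), the year can be taken with format() instead of substr(). The following is a minimal sketch under that assumption; the schools data frame is hypothetical toy data, and pivot_wider() is used here in place of spread(), which is a swap on my part rather than part of the answer above.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical), with OpenDate and ClosedDate stored as Date
schools <- data.frame(
  OpenDate   = as.Date(c("2003-09-01", "2003-09-02", "2004-08-15", "2004-08-20")),
  ClosedDate = as.Date(c(NA, "2011-06-30", NA, NA)),
  Charter    = c("Y", "Y", "N", "Y")
)

counts <- schools %>%
  mutate(
    year  = as.integer(format(OpenDate, "%Y")),                        # year from a Date
    group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_")   # e.g. "Y_1" = open charter
  ) %>%
  count(year, group) %>%
  pivot_wider(names_from = group, values_from = n, values_fill = 0)

print(counts)
```

The net and cumulative columns can then be added with the same mutate() step as above, provided all four group columns are present in the data.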
My task is
Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error).
I have generated a tibble with the average flight times between every two airports:
# A tibble: 224 x 3
# Groups: origin [?]
origin dest mean_time
<chr> <chr> <dbl>
1 EWR ALB 31.78708
2 EWR ANC 413.12500
3 EWR ATL 111.99385
4 EWR AUS 211.24765
5 EWR AVL 89.79681
6 EWR BDL 25.46602
7 EWR BNA 114.50915
8 EWR BOS 40.31275
9 EWR BQN 196.17288
10 EWR BTV 46.25734
# ... with 214 more rows
Now I need to sweep through flights and extract all rows whose air_time is outside, say, (mean_time/2, mean_time*2). How do I do that?
Assuming you have stored the tibble with the average flight times, join it to the flights table:
flights_suspicious <- left_join(flights, average_flight_times, by=c("origin","dest")) %>%
filter(air_time < mean_time / 2 | air_time > mean_time * 2)
You would first join that average flight time data frame onto your original flights data and then apply the filter. Something like this should work.
#get mean time
mean_time <- flights %>%
group_by(origin, dest) %>%
summarise(mean_time = mean(air_time, na.rm = TRUE))
#join mean time to original data
df <- left_join(flights, mean_time)
flag_flights <- df %>%
filter(air_time <= (mean_time / 2) | air_time >= (mean_time * 2))
> flag_flights
# A tibble: 29 x 20
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2013 1 16 635 608 27 916 725 111 UA 541 N837UA EWR BOS 81 200 6 8
2 2013 1 21 1851 1900 -9 2034 2012 22 US 2140 N956UW LGA BOS 76 184 19 0
3 2013 1 28 1917 1825 52 2118 1935 103 US 1860 N755US LGA PHL 75 96 18 25
4 2013 10 7 1059 1105 -6 1306 1215 51 MQ 3230 N524MQ JFK DCA 96 213 11 5
5 2013 10 10 950 959 -9 1155 1115 40 EV 5711 N829AS JFK IAD 97 228 9 59
6 2013 2 17 841 840 1 1044 1003 41 9E 3422 N913XJ JFK BOS 86 187 8 40
7 2013 3 8 1136 1001 95 1409 1116 173 UA 1240 N17730 EWR BOS 82 200 10 1
8 2013 3 8 1246 1245 1 1552 1350 122 AA 1850 N3FEAA JFK BOS 80 187 12 45
9 2013 3 12 1607 1500 67 1803 1608 115 US 2132 N946UW LGA BOS 77 184 15 0
10 2013 3 12 1612 1557 15 1808 1720 48 UA 1116 N37252 EWR BOS 81 200 15 57
# ... with 19 more rows, and 2 more variables: time_hour <dttm>, mean_time <dbl>
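The join can also be avoided by computing the route mean with a grouped mutate() before filtering. This is a hedged sketch on toy data; toy_flights and its values are hypothetical, not taken from nycflights13.

```r
library(dplyr)

# Toy data (hypothetical): one route with a single suspiciously short flight
toy_flights <- data.frame(
  origin   = rep("EWR", 5),
  dest     = rep("BOS", 5),
  air_time = c(40, 42, 38, 41, 10)
)

suspicious <- toy_flights %>%
  group_by(origin, dest) %>%
  mutate(mean_time = mean(air_time, na.rm = TRUE)) %>%  # per-route mean on every row
  ungroup() %>%
  filter(air_time < mean_time / 2 | air_time > mean_time * 2)

print(suspicious)
```

Note that the outlier itself pulls the route mean down, so a median might be a more robust reference value.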