I'm currently learning how to use SQLite, and would like to find the top 3 most popular pickup locations by hour. I have millions of rows of data, with the columns of interest being lpep_pickup_datetime (pickup time) and PULocationID (pickup location).
Here is a sample of the data:
+----------------------+--------------+-----------------+
| lpep_pickup_datetime | PULocationID | passenger_count |
+----------------------+--------------+-----------------+
| 1/1/2017 0:01 | 42 | 1 |
| 1/1/2017 0:03 | 75 | 1 |
| 1/1/2017 0:04 | 82 | 5 |
| 1/1/2017 0:01 | 255 | 1 |
| 1/1/2017 0:00 | 166 | 1 |
| 1/1/2017 0:00 | 179 | 1 |
| 1/1/2017 0:02 | 74 | 1 |
| 1/1/2017 0:15 | 112 | 1 |
| 1/1/2017 0:06 | 36 | 1 |
| 1/1/2017 0:14 | 127 | 5 |
| 1/1/2017 0:01 | 41 | 1 |
| 1/1/2017 0:31 | 97 | 1 |
| 1/1/2017 0:01 | 255 | 5 |
| 1/1/2017 0:00 | 70 | 1 |
| 1/1/2017 0:03 | 255 | 1 |
| 1/1/2017 0:03 | 82 | 1 |
| 1/1/2017 0:00 | 36 | 1 |
| 1/1/2017 0:01 | 7 | 1 |
+----------------------+--------------+-----------------+
I'm trying this in SQLiteStudio 3.2.1. Do I need to use a full MySQL suite in order to be able to use the proper functions?
SELECT
PULocationID, count(PULocationID)
FROM GreenCabs2017
GROUP BY PULocationID
ORDER BY count(PULocationID) DESC
LIMIT 3
The query I've tried only returns the top 3 pickup locations across the entire dataset, not by hour of day. How can I group by hour of day? Other solutions on StackExchange reference date_time and date_format functions that won't execute when I try them in SQLite. What query would work in SQLite?
Ideally would have something like the below:
+-------------+--------------+-----------------+
| Time of Day | PULocationID | PULocationCount |
+-------------+--------------+-----------------+
| 0:00 | 74 | 677 |
| 0:00 | 65 | 333 |
| 0:00 | 55 | 220 |
+-------------+--------------+-----------------+
This would be the output for top 3 pickup locations from midnight to 1:00 AM. This time range would have to apply across all the dates, i.e. 1/1 to 1/31 and not just 1/1 like the sample I provided.
UPDATE:
Changed the format of the timestamps to be YYYY-MM-DD HH:MM:SS format, so I can use the datetime functions now.
Was able to run a query which I think may bring me much closer to what I'm looking for:
SELECT lpep_pickup_datetime, PULocationID, count(PULocationID)
FROM GreenCabs2017
WHERE STRFTIME('%Y', lpep_pickup_datetime) = '2017' AND
STRFTIME('%H', lpep_pickup_datetime) <= '01' AND
STRFTIME('%H', lpep_pickup_datetime) >= '00'
GROUP BY PULocationID
ORDER BY count(PULocationID) DESC
LIMIT 3
That gave an output of
+----------------------+--------------+---------------------+
| lpep_pickup_datetime | PULocationID | count(PULocationID) |
+----------------------+--------------+---------------------+
| 1/31/2017 1:13 | 255 | 7845 |
| 1/31/2017 1:04 | 7 | 4596 |
| 1/31/2017 1:07 | 82 | 3892 |
+----------------------+--------------+---------------------+
But the lpep_pickup_datetime column still indicates that this would be in between 1:00 AM and 2:00 AM and not 12:00 AM and 1:00 AM? Removing the "=" sign in the query results in no results being returned. And I would prefer to not do this for every hour in the day - would there be a way to have an output by hour through one query?
The timestamp string format your data is using, m/d/YYYY H:MM, is not very good. It can't be used by sqlite date and time functions, can't be meaningfully ordered for sorting, and in general is very hard to work with in sqlite. Remember, sqlite does not have dedicated date or time types, just strings or numbers, so the format you're using has to obey the rules of those types. So your first step is to, by whatever means, fix those timestamps. The following assumes you changed them to YYYY-mm-dd HH:MM strings like 2017-01-01 00:01, or another compatible format. It also assumes you're using a fairly recent sqlite release, as it uses window functions which were added in 3.25.
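As a rough sketch of that first step, one way to rewrite the timestamps in place is to register a small Python function with the database connection and call it from an UPDATE. The helper name `to_iso` and the two sample rows are made up for illustration, and the input is assumed to be exactly m/d/YYYY H:MM:

```python
import sqlite3
from datetime import datetime

def to_iso(ts):
    """Convert 'm/d/YYYY H:MM' (e.g. '1/1/2017 0:01') to 'YYYY-MM-DD HH:MM'."""
    return datetime.strptime(ts, "%m/%d/%Y %H:%M").strftime("%Y-%m-%d %H:%M")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE GreenCabs2017 (lpep_pickup_datetime TEXT, PULocationID INTEGER)"
)
# Hypothetical sample rows in the original, hard-to-use format.
conn.executemany(
    "INSERT INTO GreenCabs2017 VALUES (?, ?)",
    [("1/1/2017 0:01", 42), ("1/31/2017 1:13", 255)],
)

# Expose the Python helper to SQL, then rewrite every timestamp in place.
conn.create_function("to_iso", 1, to_iso)
conn.execute("UPDATE GreenCabs2017 SET lpep_pickup_datetime = to_iso(lpep_pickup_datetime)")

# strftime() can now parse the column and extract the hour.
rows = conn.execute(
    "SELECT lpep_pickup_datetime, strftime('%H', lpep_pickup_datetime) "
    "FROM GreenCabs2017 ORDER BY 1"
).fetchall()
print(rows)
```

With millions of rows it is probably faster to fix the format once at import time than to run a per-row Python callback, but the idea is the same.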
(Edit: You appear to be using NYC taxi data from here, which has timestamps in a good format already, and is suitable for easy importing into sqlite. That makes it trivial to fix.)
Given all that, this query:
WITH ranked AS
(SELECT hour, PULocationID, pickups
, row_number() OVER (PARTITION BY hour ORDER BY pickups DESC) AS rn
FROM (SELECT strftime('%H:00', lpep_pickup_datetime) AS hour
, PULocationID
, count(*) AS pickups
FROM GreenCabs2017
GROUP BY strftime('%H:00', lpep_pickup_datetime), PULocationID))
SELECT * FROM ranked
WHERE rn <= 3
ORDER BY hour, rn
will give, for NYC Green Cab data for January 2017
hour        PULocationID  pickups     rn
----------  ------------  ----------  ----------
00:00       255           4224        1
00:00       7             2518        2
00:00       82            2135        3
01:00       255           3621        1
01:00       7             2078        2
01:00       256           1870        3
02:00       255           3261        1
02:00       256           1798        2
02:00       7             1676        3
03:00       255           2854        1
03:00       256           1589        2
03:00       7             1475        3
and so on.
Basically, it counts the number of times each location appears in each hour, and for each hour, assigns each location a row number based on sorting by that number. Then only the first three rows of each hour are returned in the final outer select. You can also use rank() or dense_rank() instead of row_number(), which will potentially return more than 3 rows per hour in case of ties but also more accurately reflect the most popular locations in those cases.
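To make the difference between row_number() and rank() concrete, here is a small sketch using Python's sqlite3 module. The `counts` table and its numbers are invented for illustration (two locations tie for third place), and it assumes the SQLite library bundled with Python is 3.25 or newer:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated counts for a single hour; 82 and 41 tie for third.
conn.execute("CREATE TABLE counts (hour TEXT, PULocationID INTEGER, pickups INTEGER)")
conn.executemany(
    "INSERT INTO counts VALUES (?, ?, ?)",
    [("00:00", 255, 50), ("00:00", 7, 40), ("00:00", 82, 30),
     ("00:00", 41, 30), ("00:00", 74, 10)],
)

def top3(func):
    """Return the location ids whose per-hour rank is <= 3 under the given window function."""
    sql = f"""
        SELECT PULocationID FROM (
            SELECT PULocationID, pickups,
                   {func}() OVER (PARTITION BY hour ORDER BY pickups DESC) AS rn
            FROM counts)
        WHERE rn <= 3
        ORDER BY pickups DESC, PULocationID
    """
    return [r[0] for r in conn.execute(sql)]

print(top3("row_number"))  # exactly 3 rows; which tied location survives is arbitrary
print(top3("rank"))        # 4 rows: both tied locations are kept
```

If a stable result is needed with row_number(), adding a tie-breaker such as PULocationID to the window's ORDER BY makes the choice deterministic.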
(This query benefits a lot from having an index on the group by expression:
CREATE INDEX greencabs2017_idx_hour_loc ON GreenCabs2017(strftime('%H:00', lpep_pickup_datetime), PULocationID);
)
Test table created from the sqlite3 shell via:
sqlite> .mode csv
sqlite> .import '|curl -s https://s3.amazonaws.com/nyctlc/trip+data/green_tripdata_2017-01.csv | sed 2d' GreenCabs2017
Related
I want to merge two files using a unique ID and timestamp, and also get measurements for the next n intervals.
The first file has over 15,000 unique IDs. Each ID has measurements taken at 15 minute intervals from Jan 1, 00:00 to Dec 31, 23:45. The database is quite big (35 GB) with over 500 million rows. The file looks something like this.
First file
| ID | Time | Measurement|
|:----:|:---------------:|:------:|
| 1 |2012-12-31 22:45| 61 |
| 1 |2012-12-31 23:00| 60 |
| 1 |2012-12-31 23:15| 61 |
| 1 |2012-12-31 23:30| 59 |
| 1 |2012-12-31 23:45| 59 |
| 2 |2012-01-01 0:00| 60 |
| 2 |2012-01-01 0:15| 61 |
| 2 |2012-01-01 0:30| 60 |
| 2 |2012-01-01 0:45| 62 |
The second file has unique IDs and a timestamp. The IDs in this file are a subset of the IDs in the first file. The file is relatively small (~50 MB) compared to the first file.
Second file
| ID | Time |
|:----:|:---------------:|
| 1 |2012-12-31 22:48|
| 1 |2012-12-31 23:48|
| 2 |2012-01-01 0:16|
I want to merge the two files such that the measurements are extracted for the current interval and the next n intervals. I also want to be able to specify n and run the code dynamically.
The merged file should look like this for n = 3. Note that, for example in the second row, the measurements for the next intervals should not be derived from another ID.
After merge
| ID | Time | Measurement 1| Measurement 2| Measurement 3|
|:----:|:---------------:|:----:|:---------------:|:----:|
| 1 | 2012-12-31 22:48| 61| 60| 61 |
| 1 | 2012-12-31 23:48| 59| 59| 59 |
| 2 | 2012-01-01 0:16| 61| 60| 62 |
I want to plot country against a date range of exposures to COVID-19, as a learning exercise in RStudio.
I've been trying to read the CSV and store as a dataframe, then plot via ggplot, but I think that I'm doing this incorrectly, since this is a date range.
How could I approach this to plot the infected countries to the dates, which increase daily in the header?
| Province/State | 1/21/2020 22:00 | 1/22/2020 12:00 | 1/23/2020 12:00 | 1/24/2020 0:00 |...
|----------------|-----------------|-----------------|-----------------|----------------|
| Anhui | 1 | 1 | 2 | 5 |...
| Beijing | 1 | 1 | 3 | 4 |...
| Chongqing | 2 | 4 | 5 | 6 |...
These cases are not accurate, just generated through MD to provide a table of data.
Thank you!
I have one input table with dates and a fixed number of events. From it, I need to create the list of events with their dates of occurrence, and the list of combined events with their dates of occurrence.
Example:
initial table:
CREATE TABLE events (
date DATE PRIMARY KEY,
e1 INTEGER,
e2 INTEGER,
e3 INTEGER
);
date | e1 | e2 | e3 |
--------------------------
2017-02-04 | 2 | 1 | 26 |
2017-02-05 | 14 | 2 | 1 |
2017-02-06 | 1 | 3 | 2 |
Output 1
eventN | total | date1 | date2 |...| date'N'
--------------------------------------------------------
01 | 3 | 2017-02-04 | 2017-02-05 |...| 2017-02-06
02 | 2 | 2017-02-05 | 2017-02-06 |...| (null)
...
26 | 1 | 2017-02-04 | (null) |...| (null)
Output 2
CombineEventN | total | date1 | ... | date'N'
-----------------------------------------------------
0102 | 2 | 2017-02-05 | ... | 2017-02-06
0103 | 1 | 2017-02-06 | ... | (null)
....
2526 | 1 | 2017-02-04 | ... | (null)
....
Limitations:
this has to be done in SQLite.
there is no limit for the dates (i.e. 'n' unique dates).
the events are a fixed list of (around) 50 ids
the output will be tables, one for each type of combination.
the author's SQL skills.
After some discussion with a teacher, he pointed out that my model was wrong from the start.
I changed the table to:
CREATE TABLE event (id integer NOT NULL PRIMARY KEY AUTOINCREMENT, date date NOT NULL, event smallint unsigned NOT NULL);
*'id' is unnecessary
something like:
Id date event
31 2016-10-05 1
44 2016-10-07 1
32 2016-10-05 2
I could use a query like:
select A.event as nA, B.event as nB, C.event as nC, date
from event as A, event as B, event as C
where A.date = B.date
and B.date=C.date
and nC<>nA
and nC<>nB
and obtain the values needed.
nA nB nC date
1 2 3 2016-10-05
1 2 3 2016-10-07
1 2 4 2016-10-07
...
Although the format is not exactly what I had imagined, the results work fine.
And I don't need to create any more columns for the rest of the project, just have to do the right queries.
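As a side note, the self-join as written also returns permutations of the same triple, and rows where nA equals nB. A hedged variant, assuming each event id appears at most once per date, is to require the ids to be strictly increasing so every combination is listed once. The sample rows below are made up, and the schema is simplified from the one above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, date TEXT NOT NULL, event INTEGER NOT NULL)"
)
# Hypothetical data: three events on one date, four on another.
conn.executemany(
    "INSERT INTO event (date, event) VALUES (?, ?)",
    [("2016-10-05", 1), ("2016-10-05", 2), ("2016-10-05", 3),
     ("2016-10-07", 1), ("2016-10-07", 2), ("2016-10-07", 3), ("2016-10-07", 4)],
)

# A.event < B.event < C.event keeps exactly one ordering of each triple per date.
triples = conn.execute("""
    SELECT A.event AS nA, B.event AS nB, C.event AS nC, A.date
    FROM event AS A
    JOIN event AS B ON B.date = A.date AND B.event > A.event
    JOIN event AS C ON C.date = B.date AND C.event > B.event
    ORDER BY A.date, nA, nB, nC
""").fetchall()

for row in triples:
    print(row)
```

The same pattern with two joined copies instead of three would give the pairwise combinations.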
BR
I am trying to calculate the difference (in minutes) between two records in one table, where both records are selected by the same kind of WHERE condition.
_id | venue_id | act_time | status |
1 | 1 | 13:30 | 0 |
2 | 1 | 15:40 | 1 |
3 | 2 | 13:03 | 0 |
4 | 2 | 16:06 | 1 |
When I execute a query like this:
SELECT _id, venue_id, status, (julianday(act_time IN (SELECT act_time FROM reports WHERE venue_id='1' AND status='1')) - julianday(act_time))*1440 AS duration FROM reports WHERE venue_id='1' AND status='0'
the result shows the wrong calculation.
What is the correct query for this problem?
For example, the duration between 13:30 and 15:40 (at venue_id='1') should be 130 minutes.
Thank you.
The IN operator checks whether the value on the left side is contained in the set of values on the right side, and returns a boolean result (0 or 1).
You just want to use the act_time value directly; drop act_time IN.
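A minimal sketch of the corrected query, run through Python's sqlite3 module against the sample rows from the question. The column types are an assumption, since the question does not show the table definition:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are assumed; the question only shows the data.
conn.execute(
    "CREATE TABLE reports (_id INTEGER PRIMARY KEY, venue_id INTEGER, act_time TEXT, status INTEGER)"
)
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?, ?)",
    [(1, 1, "13:30", 0), (2, 1, "15:40", 1), (3, 2, "13:03", 0), (4, 2, "16:06", 1)],
)

# What the original expression fed to julianday(): a boolean 0/1, not a time.
in_result = conn.execute(
    "SELECT '15:40' IN (SELECT act_time FROM reports WHERE venue_id = 1 AND status = 1)"
).fetchone()[0]

# With "act_time IN" dropped, the scalar subquery yields the end time itself.
row = conn.execute("""
    SELECT _id, venue_id, status,
           (julianday((SELECT act_time FROM reports
                       WHERE venue_id = 1 AND status = 1))
            - julianday(act_time)) * 1440 AS duration
    FROM reports
    WHERE venue_id = 1 AND status = 0
""").fetchone()

# julianday() works in floating point, so round to whole minutes.
duration_minutes = round(row[3])
print(in_result, duration_minutes)
```

The scalar subquery only uses the first row it returns, so this assumes there is a single status = 1 record per venue.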
I'm trying to create a report for my asp.net application which will show the quantity of each item, in combination with its unit, that was ordered for each day of the week. The days of the week are columns.
To be more specific:
I have two tables. One is the Orders table, with order id, customer name, date, etc.
The second table is OrderItems, which has order id as a foreign key, order item id, item name, unit (e.g. each, box, case), quantity, and price.
When a user picks a date range for the report, for example from 3/2/12 to 4/2/12, on my asp page, the report will group order items by week and will look as follows:
**week (1) starting from sunday of such date to saturday of such date**
item | unit | Sun | Mon | Tues | Wedn | Thur | Fri | Sat | Total Price for week
item1 | bag | 3 | 0 | 12 | 8 | 45 | 1 | 4 | $1234
item4 | box | 2 | 4 | 5 | 0 | 5 | 2 | 6 | $1234
**week (2) starting from sunday of such date to saturday of such date**
item | unit | Sun | Mon | Tues | Wedn | Thur | Fri | Sat | Total Price for week
item1 | bag | 3 | 0 | 12 | 8 | 45 | 1 | 4 | $1234
item4 | box | 2 | 4 | 5 | 0 | 5 | 2 | 6 | $2354
**week (2) starting from sunday of such date to saturday of such date**
item | unit | Sun | Mon | Tues | Wedn | Thur | Fri | Sat | Total Price for week
item1 | bag | 3 | 0 | 12 | 8 | 45 | 1 | 4 | $1234
item4 | box | 2 | 4 | 5 | 0 | 5 | 2 | 6 | $2354
I wish I had something to show that I have already started, but Crystal isn't my strong point and I don't even know where to start tackling this one. I do know how to pass parameters and a datatable that I have pre-filtered myself before passing it to the report, for example filtering items by date range and customer or order id.
Any help would be much appreciated.
Create a formula for each day of the week that totals the order.
ie Sunday quantity:
if dayOfWeek(dateField) = 1 // the function returns a number; 1 = Sunday by default
then order.quantity
else 0
Add each day formula to the detail section of the report and then summarize it for each group level. To group it by week, just group by the date field, then set the grouping option to by week. Suppress the detail, and you'll have what you are looking for.
I don't remember the exact name of the dayOfWeek function, but it's something like that.
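The logic behind those per-day formulas can be sketched outside Crystal. This is plain Python rather than Crystal syntax, and the row layout (order date, item, unit, quantity) and function names are assumptions made for illustration:

```python
from collections import defaultdict
from datetime import date, timedelta

DAYS = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def week_start(d):
    """Return the Sunday on or before d (date.weekday() is Monday=0 ... Sunday=6)."""
    return d - timedelta(days=(d.weekday() + 1) % 7)

def weekly_pivot(rows):
    """Sum quantities into one Sun..Sat list per (week start, item, unit) group."""
    pivot = defaultdict(lambda: [0] * 7)
    for order_date, item, unit, qty in rows:
        col = (order_date.weekday() + 1) % 7  # Sunday -> column 0
        pivot[(week_start(order_date), item, unit)][col] += qty
    return dict(pivot)

# Hypothetical order items spanning two weeks.
rows = [
    (date(2012, 3, 4), "item1", "bag", 3),   # Sunday
    (date(2012, 3, 6), "item1", "bag", 12),  # Tuesday
    (date(2012, 3, 6), "item1", "bag", 2),   # Tuesday, second order
    (date(2012, 3, 11), "item1", "bag", 5),  # Sunday of the following week
]
result = weekly_pivot(rows)
for (start, item, unit), qtys in sorted(result.items()):
    print(start, item, unit, dict(zip(DAYS, qtys)))
```

Each day column plays the role of one "if day then quantity else 0" formula, and the dictionary key plays the role of the week group in the report.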