Percentage & GROUP BY - sqlite

I'm currently working with a collisions dataset which provides all cases that occur in a given day. My first instinct was to get the totals for a given day, where the output looked something like:
collision_date  SUM(severe_injury_count)  SUM(injured_victims)
2001-02-20      19                        785
2001-02-20      12                        697
2001-02-20      28                        823
2001-02-20      29                        871
The above example is the output of the below query:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims)
FROM collisions c
GROUP BY collision_date
LIMIT 50,100;
I wanted to calculate the percentage severe_injury_count/injured_victims. I thought it would be straightforward, so I ran this query (it includes a few variations of the calculation, added once I noticed it wasn't giving me what I intended):
SELECT
collision_date,
SUM(severe_injury_count/injured_victims) AS chance_being_sever_injured,
SUM(severe_injury_count),
SUM(injured_victims),
(severe_injury_count/injured_victims)*100,
(SUM(severe_injury_count)/SUM(injured_victims))*100
FROM collisions c
GROUP BY collision_date;
But the output does not do the calculation as I expected, giving me results like:
collision_date  chance_being_sever_injured  SUM(severe_injury_count)  SUM(injured_victims)  (severe_injury_count/injured_victims)*100  (SUM(severe_injury_count)/SUM(injured_victims))*100
2001-02-20      13                          19                        785                   NULL                                       0
2001-02-20      5                           12                        697                   NULL                                       0
2001-02-20      17                          28                        823                   0                                          0
2001-02-20      18                          29                        871                   NULL                                       0
I checked the variable types and they are all integers and not strings, so I would have expected to have the actual percentages calculated.
Given the output results, I believe that I'm missing something fundamental when doing this type of operation.
I also tried using FORMAT(), but the output was all zeros as well:
FORMAT((SUM(severe_injury_count)/SUM(injured_victims))*100,2)
Any insight would be much appreciated.
Thank you for your time and feedback.
Implementing suggestions, hence extending initial post:
I tried the following as well:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(severe_injury_count)/SUM(injured_victims) AS DECIMAL)
FROM collisions c
GROUP BY collision_date
LIMIT 50,100;
I also tried to exclude possible NULLs with:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(severe_injury_count)/SUM(injured_victims) AS DECIMAL)
FROM collisions c WHERE severe_injury_count IS NOT NULL OR injured_victims IS NOT NULL
GROUP BY collision_date
LIMIT 50,100;
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(severe_injury_count)/SUM(injured_victims) AS DECIMAL)
FROM collisions c WHERE severe_injury_count > 0 OR injured_victims > 0
GROUP BY collision_date
LIMIT 50,100;
All the above alternatives give me 0 as values for the "percentage" column I'm trying to calculate.
Also attempted to coerce the type for a given column as suggested by #easleyfixed like so:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(CAST(severe_injury_count AS INT))/SUM(CAST(injured_victims AS INT)) AS DECIMAL)
FROM collisions c WHERE severe_injury_count > 0 OR injured_victims > 0
GROUP BY collision_date;
Expanding on #nnichols & #easleyfixed suggestions
To better illustrate the data, running:
SELECT collision_date,COUNT(*)
FROM collisions c
GROUP BY collision_date;
Gives me (represents the number of records for a given date):
collision_date  COUNT(*)
2001-01-01      1000
2001-01-02      1330
2001-01-03      1329
2001-01-04      1346
2001-01-05      1457
etc             etc
I therefore expanded the query to try and include what I'm trying to assess.
SELECT collision_date,COUNT(*),SUM(severe_injury_count),SUM(injured_victims),
SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
FROM collisions c
GROUP BY collision_date;
Outputs:
collision_date  COUNT(*)  SUM(severe_injury_count)  SUM(injured_victims)  SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
2001-01-01      1000      37                        676                   0
2001-01-02      1330      30                        797                   0
2001-01-03      1329      28                        793                   0
2001-01-04      1346      23                        758                   0
2001-01-05      1457      30                        836                   0
etc             etc       etc                       etc                   etc
I double-checked the column types: the count columns are INT, but collision_date is actually stored as "TEXT".
For Sh*t and giggles I did:
SELECT CAST(collision_date AS DATE),COUNT(*),SUM(severe_injury_count),SUM(injured_victims),
SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
FROM collisions c
GROUP BY collision_date;
CAST(collision_date AS DATE)  COUNT(*)  SUM(severe_injury_count)  SUM(injured_victims)  SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
2,001                         1000      37                        676                   0
2,001                         1330      30                        797                   0
2,001                         1329      28                        793                   0
2,001                         1346      23                        758                   0
2,001                         1457      30                        836                   0
etc                           etc       etc                       etc                   etc
I also attempted to coerce NULLs into 0, as suggested:
SELECT collision_date ,COUNT(*),SUM(IFNULL(severe_injury_count,0)),SUM(IFNULL(injured_victims,0)),
SUM(IFNULL(severe_injury_count,0))/SUM(IFNULL(injured_victims,0)) AS chance_being_sever_injured
FROM collisions c
GROUP BY collision_date;
Outputs:
collision_date  COUNT(*)  SUM(severe_injury_count)  SUM(injured_victims)  SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
2001-01-01      1000      37                        676                   0
2001-01-02      1330      30                        797                   0
2001-01-03      1329      28                        793                   0
2001-01-04      1346      23                        758                   0
2001-01-05      1457      30                        836                   0
etc             etc       etc                       etc                   etc
I'm truly baffled...

MySQL and SQLite are definitely not the same thing! I have updated the tag on your question.
As the SQLite docs put it, integer division yields an integer result, truncated toward zero. Both of your sums are integers, so every ratio below 1 comes out as 0.
You need to cast to REAL or FLOAT for the division to work on SQLite:
SELECT
collision_date,
SUM(severe_injury_count),
SUM(injured_victims),
ROUND(CAST(SUM(severe_injury_count) AS REAL) / CAST(SUM(injured_victims) AS REAL) * 100, 2)
FROM collisions
GROUP BY collision_date
The NULLs observed in one of your tests were the result of division by 0 (zero): SQLite returns NULL for a division by zero instead of raising an error.
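To see the difference without touching the real dataset, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory table. The three rows are made up purely for illustration (they are not from the collisions data); the column names follow the question.

```python
import sqlite3

# In-memory database with a tiny, made-up stand-in for the collisions table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collisions ("
    "collision_date TEXT, severe_injury_count INT, injured_victims INT)"
)
conn.executemany(
    "INSERT INTO collisions VALUES (?, ?, ?)",
    [
        ("2001-01-01", 1, 10),
        ("2001-01-01", 2, 20),
        ("2001-01-02", 0, 0),  # a day with no injured victims at all
    ],
)

rows = conn.execute(
    """
    SELECT collision_date,
           -- integer / integer: truncated toward zero before the * 100
           SUM(severe_injury_count) / SUM(injured_victims) * 100 AS int_pct,
           -- casting one operand to REAL forces floating-point division
           ROUND(CAST(SUM(severe_injury_count) AS REAL)
                 / SUM(injured_victims) * 100, 2) AS real_pct
    FROM collisions
    GROUP BY collision_date
    ORDER BY collision_date
    """
).fetchall()

for row in rows:
    print(row)
```

On the first date the integer version gives 0 while the REAL version gives 10.0; on the second date both are NULL (None in Python) because the divisor is zero. Casting only one operand is enough, since SQLite then performs the whole division in floating point.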


Combining "previous row" of same table and JOIN from different table in Sqlite

I have the following table
CREATE TABLE "shots" (
"player" INTEGER,
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"shot" INTEGER,
"text" TEXT,
"distance" REAL,
"x" TEXT,
"y" TEXT,
"z" TEXT
);
With a sample of the data:
28237 470 2015 717 1 1 1 Shot 1 302 yds to left fairway, 257 yds to hole 10874 11451.596 10623.774 78.251
28237 470 2015 717 1 1 2 Shot 2 234 yds to right fairway, 71 ft to hole 8437 12150.454 10700.381 86.035
28237 470 2015 717 1 1 3 Shot 3 70 ft to green, 4 ft to hole 838 12215.728 10725.134 88.408
28237 470 2015 717 1 1 4 Shot 4 in the hole 46 12215.1 10729.1 88.371
28237 470 2015 717 1 2 1 Shot 1 199 yds to green, 29 ft to hole 7162 12776.03 10398.086 91.017
28237 470 2015 717 1 2 2 Shot 2 putt 26 ft 7 in., 2 ft 4 in. to hole 319 12749.444 10398.854 90.998
28237 470 2015 717 1 2 3 Shot 3 in the hole 28 12747.3 10397.6 91.027
28237 470 2015 717 1 3 1 Shot 1 296 yds to left intermediate, 204 yds to hole 10651 12596.857 9448.27 94.296
28237 470 2015 717 1 3 2 Shot 2 208 yds to green, 15 ft to hole 7478 12571.0 8825.648 94.673
28237 470 2015 717 1 3 3 Shot 3 putt 17 ft 6 in., 2 ft 5 in. to hole 210 12561.831 8840.539 94.362
I want to get for each shot the previous location (x, y, z). I wrote the below query.
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z, prev.x, prev.y, prev.z
FROM shots cur
INNER JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
(prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot - 1)
This query basically takes forever. How can I rewrite it to make it faster?
In addition, I need to make an adjustment for the first shot on a hole (shot = 1). This shot is made from tee_x, tee_y and tee_z. These values are available in the table holes:
CREATE TABLE "holes" (
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"tee_x" TEXT,
"tee_y" TEXT,
"tee_z" TEXT
);
With data:
470 2015 717 1 1 11450 10625 78.25
470 2015 717 1 2 12750 10400 91
470 2015 717 1 3 2565 8840.5 95
Thanks
First, you need a composite index to speed up the operation:
CREATE INDEX idx_shots ON shots (player, tournament, year, course, round, hole, shot);
With that index, your query should run faster:
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z,
prev.x AS prev_x, prev.y AS prev_y, prev.z AS prev_z
FROM shots cur LEFT JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
(prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot + 1);
The changes I made:
the join should be a LEFT join so that all rows are included and not only the ones that have a previous row
-1 should be +1 because the previous row's shot is 1 less than the current row's shot
added aliases for the previous row's x, y and z
But, if your version of SQLite is 3.25.0+ it would be better to use window function LAG() instead of a self join:
SELECT *,
LAG(x) OVER w AS prev_x,
LAG(y) OVER w AS prev_y,
LAG(z) OVER w AS prev_z
FROM shots
WINDOW w AS (PARTITION BY player, tournament, year, course, round, hole ORDER BY shot);
See the demo (I include the query plan for both queries where you can see the use of the composite index).
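The answer above does not cover the tee adjustment for the first shot on a hole. One possible way to handle it, sketched here with Python's sqlite3 module, is to LEFT JOIN holes and fall back to the tee coordinates with COALESCE whenever LAG() has no previous row. This is an assumption about what the asker wants, not part of the original answer; the tables are trimmed to the columns needed, loaded with a few rows from the samples in the question, and the sketch needs an SQLite build of 3.25.0 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE shots (player INTEGER, tournament TEXT, year INTEGER,
                        course INTEGER, round INTEGER, hole INTEGER,
                        shot INTEGER, x TEXT, y TEXT, z TEXT);
    CREATE TABLE holes (tournament TEXT, year INTEGER, course INTEGER,
                        round INTEGER, hole INTEGER,
                        tee_x TEXT, tee_y TEXT, tee_z TEXT);
    INSERT INTO shots VALUES
      (28237, '470', 2015, 717, 1, 1, 1, '11451.596', '10623.774', '78.251'),
      (28237, '470', 2015, 717, 1, 1, 2, '12150.454', '10700.381', '86.035'),
      (28237, '470', 2015, 717, 1, 2, 1, '12776.03',  '10398.086', '91.017');
    INSERT INTO holes VALUES
      ('470', 2015, 717, 1, 1, '11450', '10625', '78.25'),
      ('470', 2015, 717, 1, 2, '12750', '10400', '91');
    """
)

rows = conn.execute(
    """
    SELECT s.hole, s.shot,
           -- LAG() is NULL for the first shot of a hole, so use the tee there
           COALESCE(LAG(s.x) OVER w, h.tee_x) AS prev_x,
           COALESCE(LAG(s.y) OVER w, h.tee_y) AS prev_y,
           COALESCE(LAG(s.z) OVER w, h.tee_z) AS prev_z
    FROM shots s
    LEFT JOIN holes h
      ON h.tournament = s.tournament AND h.year = s.year
     AND h.course = s.course AND h.round = s.round AND h.hole = s.hole
    WINDOW w AS (PARTITION BY s.player, s.tournament, s.year,
                              s.course, s.round, s.hole
                 ORDER BY s.shot)
    ORDER BY s.round, s.hole, s.shot
    """
).fetchall()

for row in rows:
    print(row)
```

Note that COALESCE would also substitute the tee position if a later shot's predecessor had a NULL coordinate; if that can happen in the real data, a CASE WHEN s.shot = 1 test would be the stricter choice.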

SQLite Ordering Whole Numbers

I am aware that ORDER BY in SQLite sorts numbers in ascending order unless DESC is added at the end. But I realized that it only seems to sort by the leading digits.
i.e
INT
14
78
357
2999
57
888
ORDER BY INT
Gives
14
2999
357
57
78
888
Is it possible to use ORDER BY so that the whole numbers are in ascending order?
As such
14
57
78
357
888
2999
select (INT * 1) as "int_number" from mytable order by 1
or as someone points out in the link:
select INT from mytable order by cast(INT as Integer)
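The ordering in the question is what a text comparison produces, so a likely explanation (an assumption, since the schema is not shown) is that the column is stored as TEXT. A small sketch with Python's sqlite3 module reproduces both orderings; the table and column names follow the answer, and the column is quoted only to make it obvious that it is an identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the values were stored as text, which is what makes
# ORDER BY compare them character by character.
conn.execute('CREATE TABLE mytable ("INT" TEXT)')
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [(v,) for v in ["14", "78", "357", "2999", "57", "888"]],
)

# Plain ORDER BY on a TEXT column: lexicographic order.
as_text = [r[0] for r in conn.execute(
    'SELECT "INT" FROM mytable ORDER BY "INT"')]

# Casting in the ORDER BY clause: numeric order.
as_number = [r[0] for r in conn.execute(
    'SELECT "INT" FROM mytable ORDER BY CAST("INT" AS INTEGER)')]

print(as_text)
print(as_number)
```

If the values really are numbers, declaring the column as INTEGER when the table is created avoids the cast altogether.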

adding and subtracting values in multiple data frames of different lengths - flow analysis

Thank you jakub and Hack-R!
Yes, these are my actual data. The data I am starting from are the following:
[A] #first, longer dataset
CODE_t2 VALUE_t2
111 3641
112 1691
121 1271
122 185
123 522
124 0
131 0
132 0
133 0
141 626
142 170
211 0
212 0
213 0
221 0
222 0
223 0
231 95
241 0
242 0
243 0
244 0
311 129
312 1214
313 0
321 0
322 0
323 565
324 0
331 0
332 0
333 0
334 0
335 0
411 0
412 0
421 0
422 0
423 0
511 6
512 0
521 0
522 0
523 87
In the above table, we can see the 44 land use CODES (which I inappropriately named "class" in my first entry) for a certain city. Some values are just 0, meaning that there are no land uses of that type in that city.
Starting from this table, which displays all the land use types for t2 and their corresponding values ("VALUE_t2") I have to reconstruct the previous amount of land uses ("VALUE_t1") per each type.
To do so, I have to add and subtract the value per each land use (if not 0) by using the "change land use table" from t2 to t1, which is the following:
[B] #second, shorter dataset
CODE_t2 CODE_t1 VALUE_CHANGE1
121 112 2
121 133 12
121 323 0
121 511 3
121 523 2
123 523 4
133 123 3
133 523 4
141 231 12
141 511 37
So, in order to get VALUE_t1 from VALUE_t2, I have, for instance, to subtract 2 + 12 + 0 + 3 + 2 hectares (first 5 values of the second, shorter table) from the value of land use type/code 121 of the first, longer table (1271 ha), and add 2 hectares to land type 112, 12 hectares to land type 133, 3 hectares to land type 511 and 2 hectares to land type 523. And I have to do that for all the land use types different than 0, and later also from t1 to t0.
What I have to do is a sort of loop that would both add and subtract, per each land use type/code, the values from VALUE_t2 to VALUE_t1, and from VALUE_t1 to VALUE_t0.
Once I estimated VALUE_t1 and VALUE_t0, I will put the values in a simple table showing the relative variation (here the values are not real):
CODE VALUE_t0 VALUE_t2 % VAR t2-t0
code1 50 100 ((100-50)/50)*100
code2 70 80 ((80-70)/70)*100
code3 45 34 ((34-45)/45)*100
What I could do so far is:
land_code <- names(A)[-1]
land_code
A$VALUE_t1 <- for(code in land_code{
cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
}
If I use the loop I get an error, while if I take it away:
A$VALUE_t1 <- cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
it works but I don't really get what I want to get... so far I was working on how to get a new column which would contain the new "add & subtract" values, but haven't succeeded yet. So I worked on how to get a new column which would at least match the land use types first, to then include the "add and subtract" formula.
Another problem is that, by using "match", I get a shorter A$VALUE_t1 table (13 rows instead of 44), while I would like to keep all the land use types in dataset A, because I will have then to match it with the table including VALUES_t0 (which I haven't shown here).
Sorry that I cannot do better than this at the moment... and I hope to have explained better what I have to do. I am extremely grateful for any help you can provide to me.
thanks a lot

Parsing out all repeat and consecutive numbers in R

Suppose I have a dataframe like this:
1360 C 0 403
1361 A 0 403
1362 G 0 403
1402 0 A 444
2019 T 0 1060
2020 T 0 1060
2021 G 0 1060
2022 T 0 1060
2057 T 0 1085
2062 0 A 1093
2062 0 C 1094
2062 0 C 1095
Desired Output
1402 0 A 444
2057 0 0 1085
I was trying to parse out all the rows with repeated or consecutive numbers in column 1. So I want only the rows whose number is neither a repeat of nor consecutive to another number in the dataset. Any help will be much appreciated.
You can use diff to find the difference between adjacent elements in a vector. Assuming the vector is sorted, diff will return zero for repeat numbers and one for consecutive numbers.
keep1 <- diff(df[,1]) > 1
This will include values that are after a jump, but at the start of a new sequence, so we need to check the lag1 value, and pad the logical vector to make it as long as the original.
keep <- c(keep1, TRUE) & c(TRUE, keep1)
df[keep,]
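For readers who want to trace the mask logic step by step, here is the same idea sketched in plain Python rather than R, using the first column of the example data. It is only an illustration of the technique; the variable names are made up.

```python
# First column of the example data frame (already sorted).
values = [1360, 1361, 1362, 1402, 2019, 2020, 2021, 2022,
          2057, 2062, 2062, 2062]

# Equivalent of diff(): difference between each element and the next one.
gaps = [b - a for a, b in zip(values, values[1:])]

# True where the step to the next value is a real jump
# (0 means a repeat, 1 means a consecutive number).
jump = [g > 1 for g in gaps]

# A value is isolated only if there is a jump both before and after it.
# Padding with True treats the two ends as having a jump on the outside.
keep = [before and after
        for before, after in zip([True] + jump, jump + [True])]

isolated = [v for v, k in zip(values, keep) if k]
print(isolated)
```

With this input the two surviving values are 1402 and 2057, matching the first column of the desired output.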

sqlite selection help needed

I have the following bill table
building name amount payments receiptno
1234 name a 123 0 0
1234 name a 12 10 39
1234 name a 125 125 40
1235 name a 133 10 41
1235 name b 125 125 50
1234 name c 100 90 0
I want to select the rows where amount minus payments is greater than zero and display the max value of receiptno for each name,
so for building 1234 I want to get only the following:
name a 39
name c 0
How can I do this?
Translating your description into SQL results in this:
SELECT building,
name,
MAX(receiptno)
FROM BillTable
WHERE amount - payments > 0
GROUP BY building,
name
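The query above returns a row for every building. If only building 1234 is wanted, as in the expected output, one option is to add a filter on building; the sketch below tries that against the sample rows using Python's sqlite3 module. The column types are assumed, since the question does not give the schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column types are an assumption; the question only shows the data.
conn.execute(
    "CREATE TABLE BillTable ("
    "building INT, name TEXT, amount INT, payments INT, receiptno INT)"
)
conn.executemany(
    "INSERT INTO BillTable VALUES (?, ?, ?, ?, ?)",
    [
        (1234, "name a", 123, 0, 0),
        (1234, "name a", 12, 10, 39),
        (1234, "name a", 125, 125, 40),
        (1235, "name a", 133, 10, 41),
        (1235, "name b", 125, 125, 50),
        (1234, "name c", 100, 90, 0),
    ],
)

rows = conn.execute(
    """
    SELECT name, MAX(receiptno)
    FROM BillTable
    WHERE amount - payments > 0   -- only bills with an outstanding balance
      AND building = 1234         -- restrict to the building of interest
    GROUP BY name
    ORDER BY name
    """
).fetchall()

print(rows)
```

The fully paid bill (receipt 40) is dropped by the WHERE clause before grouping, which is why the maximum for "name a" is 39 rather than 40.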
