Percentage & GROUP BY - sqlite
I'm currently working with a collisions dataset which provides all cases that occur in a given day. My first instinct was to get the totals for a given day, where the output looked something like:
collision_date | SUM(severe_injury_count) | SUM(injured_victims)
2001-02-20 | 19 | 785
2001-02-20 | 12 | 697
2001-02-20 | 28 | 823
2001-02-20 | 29 | 871
The above example is the output of the below query:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims)
FROM collisions c
GROUP BY collision_date
LIMIT 50,100;
I wanted to calculate severe_injury_count/injured_victims as a percentage. I thought it would be straightforward, so I tried this query (it includes a few variations of the percentage calculation, which I added once I noticed it wasn't giving me what I intended):
SELECT
collision_date,
SUM(severe_injury_count/injured_victims) AS chance_being_sever_injured,
SUM(severe_injury_count),
SUM(injured_victims),
(severe_injury_count/injured_victims)*100,
(SUM(severe_injury_count)/SUM(injured_victims))*100
FROM collisions c
GROUP BY collision_date;
But the output does not do the calculation as I expected, giving me results like:
collision_date | chance_being_sever_injured | SUM(severe_injury_count) | SUM(injured_victims) | (severe_injury_count/injured_victims)*100 | (SUM(severe_injury_count)/SUM(injured_victims))*100
2001-02-20 | 13 | 19 | 785 | NULL | 0
2001-02-20 | 5 | 12 | 697 | NULL | 0
2001-02-20 | 17 | 28 | 823 | 0 | 0
2001-02-20 | 18 | 29 | 871 | NULL | 0
I checked the variable types and they are all integers and not strings, so I would have expected to have the actual percentages calculated.
Given the output results, I believe that I'm missing something fundamental when doing this type of operation.
I also tried using FORMAT(), but the outputs were all zeros as well...
FORMAT((SUM(severe_injury_count)/SUM(injured_victims))*100,2)
Any insight would be much appreciated.
Thank you for your time and feedback.
Implementing suggestions, hence extending initial post:
I tried the following as well:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(severe_injury_count)/SUM(injured_victims) AS DECIMAL)
FROM collisions c
GROUP BY collision_date
LIMIT 50,100;
I also tried to exclude possible NULLs with:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(severe_injury_count)/SUM(injured_victims) AS DECIMAL)
FROM collisions c WHERE severe_injury_count IS NOT NULL OR injured_victims IS NOT NULL
GROUP BY collision_date
LIMIT 50,100;
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(severe_injury_count)/SUM(injured_victims) AS DECIMAL)
FROM collisions c WHERE severe_injury_count > 0 OR injured_victims > 0
GROUP BY collision_date
LIMIT 50,100;
All the above alternatives give me 0 as values for the "percentage" column I'm trying to calculate.
I also attempted to coerce the type of each column, as suggested by @easleyfixed, like so:
SELECT collision_date, SUM(severe_injury_count),SUM(injured_victims),CAST(SUM(CAST(severe_injury_count AS INT))/SUM(CAST(injured_victims AS INT)) AS DECIMAL)
FROM collisions c WHERE severe_injury_count > 0 OR injured_victims > 0
GROUP BY collision_date;
Expanding on @nnichols & @easleyfixed suggestions
To better illustrate the data, running:
SELECT collision_date,COUNT(*)
FROM collisions c
GROUP BY collision_date;
Gives me (represents the number of records for a given date):
collision_date | COUNT(*)
2001-01-01 | 1000
2001-01-02 | 1330
2001-01-03 | 1329
2001-01-04 | 1346
2001-01-05 | 1457
etc | etc
I therefore expanded the query to try and include what I'm trying to assess.
SELECT collision_date,COUNT(*),SUM(severe_injury_count),SUM(injured_victims),
SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
FROM collisions c
GROUP BY collision_date;
Outputs:
collision_date | COUNT(*) | SUM(severe_injury_count) | SUM(injured_victims) | SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
2001-01-01 | 1000 | 37 | 676 | 0
2001-01-02 | 1330 | 30 | 797 | 0
2001-01-03 | 1329 | 28 | 793 | 0
2001-01-04 | 1346 | 23 | 758 | 0
2001-01-05 | 1457 | 30 | 836 | 0
etc | etc | etc | etc | etc
I double-checked the column types in the database: the count columns are INT, but collision_date is actually set as "TEXT".
For Sh*t and giggles I did:
SELECT CAST(collision_date AS DATE),COUNT(*),SUM(severe_injury_count),SUM(injured_victims),
SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
FROM collisions c
GROUP BY collision_date;
CAST(collision_date AS DATE) | COUNT(*) | SUM(severe_injury_count) | SUM(injured_victims) | SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
2,001 | 1000 | 37 | 676 | 0
2,001 | 1330 | 30 | 797 | 0
2,001 | 1329 | 28 | 793 | 0
2,001 | 1346 | 23 | 758 | 0
2,001 | 1457 | 30 | 836 | 0
etc | etc | etc | etc | etc
I also attempted to coerce NULLs into 0, as was also suggested.
SELECT collision_date ,COUNT(*),SUM(IFNULL(severe_injury_count,0)),SUM(IFNULL(injured_victims,0)),
SUM(IFNULL(severe_injury_count,0))/SUM(IFNULL(injured_victims,0)) AS chance_being_sever_injured
FROM collisions c
GROUP BY collision_date;
Outputs:
collision_date | COUNT(*) | SUM(severe_injury_count) | SUM(injured_victims) | SUM(severe_injury_count)/SUM(injured_victims) AS chance_being_sever_injured
2001-01-01 | 1000 | 37 | 676 | 0
2001-01-02 | 1330 | 30 | 797 | 0
2001-01-03 | 1329 | 28 | 793 | 0
2001-01-04 | 1346 | 23 | 758 | 0
2001-01-05 | 1457 | 30 | 836 | 0
etc | etc | etc | etc | etc
I'm truly baffled...
MySQL and SQLite are definitely not the same thing! I have updated the tag on your question.
Integer divide yields an integer result, truncated toward zero (see the docs).
You need to cast to REAL or FLOAT for the division to work on SQLite:
SELECT
collision_date,
SUM(severe_injury_count),
SUM(injured_victims),
ROUND(CAST(SUM(severe_injury_count) AS REAL) / CAST(SUM(injured_victims) AS REAL) * 100, 2)
FROM collisions
GROUP BY collision_date
The NULLs observed in one of your tests were the result of division by 0 (zero), which SQLite evaluates to NULL rather than raising an error.
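To see the difference on a small scale, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory table. The table and column names follow the question, but the three rows are made-up values for illustration, not real data from the collisions dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE collisions ("
    "collision_date TEXT, severe_injury_count INT, injured_victims INT)"
)
# Made-up rows: 1 severe injury out of 4 victims on the first day,
# and a day with no victims at all (to trigger division by zero).
conn.executemany(
    "INSERT INTO collisions VALUES (?, ?, ?)",
    [
        ("2001-01-01", 1, 2),
        ("2001-01-01", 0, 2),
        ("2001-01-02", 0, 0),
    ],
)

rows = conn.execute(
    """
    SELECT
        collision_date,
        -- INT / INT is truncated toward zero before the * 100
        SUM(severe_injury_count) / SUM(injured_victims) * 100 AS int_pct,
        -- casting one operand to REAL is enough to get a REAL division
        ROUND(CAST(SUM(severe_injury_count) AS REAL)
              / SUM(injured_victims) * 100, 2) AS real_pct
    FROM collisions
    GROUP BY collision_date
    ORDER BY collision_date
    """
).fetchall()

for row in rows:
    print(row)
```

On the first date the integer version comes back as 0 while the REAL version gives 25.0; on the second date both are NULL (None in Python) because the divisor is 0.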
Related
Combining "previous row" of same table and JOIN from different table in SQLite
I have the following table
CREATE TABLE "shots" (
"player" INTEGER,
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"shot" INTEGER,
"text" TEXT,
"distance" REAL,
"x" TEXT,
"y" TEXT,
"z" TEXT
);
With a sample of the data:
28237 | 470 | 2015 | 717 | 1 | 1 | 1 | Shot 1 302 yds to left fairway, 257 yds to hole | 10874 | 11451.596 | 10623.774 | 78.251
28237 | 470 | 2015 | 717 | 1 | 1 | 2 | Shot 2 234 yds to right fairway, 71 ft to hole | 8437 | 12150.454 | 10700.381 | 86.035
28237 | 470 | 2015 | 717 | 1 | 1 | 3 | Shot 3 70 ft to green, 4 ft to hole | 838 | 12215.728 | 10725.134 | 88.408
28237 | 470 | 2015 | 717 | 1 | 1 | 4 | Shot 4 in the hole | 46 | 12215.1 | 10729.1 | 88.371
28237 | 470 | 2015 | 717 | 1 | 2 | 1 | Shot 1 199 yds to green, 29 ft to hole | 7162 | 12776.03 | 10398.086 | 91.017
28237 | 470 | 2015 | 717 | 1 | 2 | 2 | Shot 2 putt 26 ft 7 in., 2 ft 4 in. to hole | 319 | 12749.444 | 10398.854 | 90.998
28237 | 470 | 2015 | 717 | 1 | 2 | 3 | Shot 3 in the hole | 28 | 12747.3 | 10397.6 | 91.027
28237 | 470 | 2015 | 717 | 1 | 3 | 1 | Shot 1 296 yds to left intermediate, 204 yds to hole | 10651 | 12596.857 | 9448.27 | 94.296
28237 | 470 | 2015 | 717 | 1 | 3 | 2 | Shot 2 208 yds to green, 15 ft to hole | 7478 | 12571.0 | 8825.648 | 94.673
28237 | 470 | 2015 | 717 | 1 | 3 | 3 | Shot 3 putt 17 ft 6 in., 2 ft 5 in. to hole | 210 | 12561.831 | 8840.539 | 94.362
I want to get for each shot the previous location (x, y, z). I wrote the below query.
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z, prev.x, prev.y, prev.z
FROM shots cur
INNER JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
   (prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot - 1)
This query takes forever basically. How can I rewrite it to make it faster?
In addition, I need to make an adjustment for the first shot on a hole (shot = 1). This shot is made from tee_x, tee_y and tee_z. These values are available in table holes
CREATE TABLE "holes" (
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"tee_x" TEXT,
"tee_y" TEXT,
"tee_z" TEXT
);
With data:
470 | 2015 | 717 | 1 | 1 | 11450 | 10625 | 78.25
470 | 2015 | 717 | 1 | 2 | 12750 | 10400 | 91
470 | 2015 | 717 | 1 | 3 | 2565 | 8840.5 | 95
Thanks
First, you need a composite index to speed up the operation:
CREATE INDEX idx_shots ON shots (player, tournament, year, course, round, hole, shot);
With that index, your query should run faster:
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot,
       cur.x, cur.y, cur.z,
       prev.x AS prev_x, prev.y AS prev_y, prev.z AS prev_z
FROM shots cur LEFT JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
   (prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot + 1);
The changes I made:
- the join should be a LEFT join so that all rows are included and not only the ones that have a previous row
- -1 should be +1 because the previous row's shot is 1 less than the current row's shot
- added aliases for the previous row's x, y and z
But, if your version of SQLite is 3.25.0+ it would be better to use the window function LAG() instead of a self join:
SELECT *,
       LAG(x) OVER w AS prev_x,
       LAG(y) OVER w AS prev_y,
       LAG(z) OVER w AS prev_z
FROM shots
WINDOW w AS (PARTITION BY player, tournament, year, course, round, hole ORDER BY shot);
See the demo (I include the query plan for both queries where you can see the use of the composite index).
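As a rough check that the two approaches agree, here is a small sketch with Python's sqlite3 module. It assumes a cut-down, hypothetical version of the shots table (only player, hole, shot and x, with a handful of the sample x values), so the join condition is shorter than in the real query. The LAG() query only runs when the bundled SQLite is 3.25.0 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: the real table has more key columns.
conn.execute(
    "CREATE TABLE shots (player INTEGER, hole INTEGER, shot INTEGER, x TEXT)"
)
conn.executemany(
    "INSERT INTO shots VALUES (?, ?, ?, ?)",
    [
        (28237, 1, 1, "11451.596"),
        (28237, 1, 2, "12150.454"),
        (28237, 1, 3, "12215.728"),
        (28237, 2, 1, "12776.03"),
        (28237, 2, 2, "12749.444"),
    ],
)
conn.execute("CREATE INDEX idx_shots ON shots (player, hole, shot)")

# Self join: the previous row is the one whose shot number is 1 less.
self_join = conn.execute(
    """
    SELECT cur.hole, cur.shot, cur.x, prev.x AS prev_x
    FROM shots cur
    LEFT JOIN shots prev
      ON cur.player = prev.player
     AND cur.hole = prev.hole
     AND cur.shot = prev.shot + 1
    ORDER BY cur.hole, cur.shot
    """
).fetchall()

# Window function version, guarded because LAG() needs SQLite 3.25.0+.
lag_rows = []
if sqlite3.sqlite_version_info >= (3, 25, 0):
    lag_rows = conn.execute(
        """
        SELECT hole, shot, x, LAG(x) OVER w AS prev_x
        FROM shots
        WINDOW w AS (PARTITION BY player, hole ORDER BY shot)
        ORDER BY hole, shot
        """
    ).fetchall()

for row in self_join:
    print(row)
```

In both versions the first shot on each hole gets NULL for prev_x, which is where the tee coordinates from the holes table would still have to be filled in.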
SQLite Ordering Whole Numbers
I am fairly aware that ORDER BY in SQLite puts the numbers in ascending order unless DESC is at the end. But I realized that it only worked on the leading digits, i.e. for a column INT containing
14, 78, 357, 2999, 57, 888
ORDER BY INT gives
14, 2999, 357, 57, 78, 888
Is it possible to use ORDER BY so that the whole numbers are in ascending order? As such:
14, 57, 78, 357, 888, 2999
select (INT * 1) as "int_number" from mytable order by 1
or, as someone points out in the link:
select INT from mytable order by cast(INT as Integer)
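The order shown in the question is what you get when the values are stored as text and compared character by character. A minimal sketch with Python's sqlite3 module, assuming a hypothetical TEXT column named n in mytable (the real column name and type are not shown in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: the numbers are stored as text, which would explain the order seen.
conn.execute("CREATE TABLE mytable (n TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?)",
    [(s,) for s in ["14", "78", "357", "2999", "57", "888"]],
)

# Plain ORDER BY compares the strings, so "2999" sorts before "357".
as_text = [r[0] for r in conn.execute("SELECT n FROM mytable ORDER BY n")]

# Casting to INTEGER in the ORDER BY clause sorts by numeric value.
as_int = [
    r[0]
    for r in conn.execute("SELECT n FROM mytable ORDER BY CAST(n AS INTEGER)")
]

print(as_text)
print(as_int)
```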
adding and subtracting values in multiple data frames of different lengths - flow analysis
Thank you jakub and Hack-R! Yes, these are my actual data. The data I am starting from are the following:
[A] #first, longer dataset
CODE_t2 | VALUE_t2
111 | 3641
112 | 1691
121 | 1271
122 | 185
123 | 522
124 | 0
131 | 0
132 | 0
133 | 0
141 | 626
142 | 170
211 | 0
212 | 0
213 | 0
221 | 0
222 | 0
223 | 0
231 | 95
241 | 0
242 | 0
243 | 0
244 | 0
311 | 129
312 | 1214
313 | 0
321 | 0
322 | 0
323 | 565
324 | 0
331 | 0
332 | 0
333 | 0
334 | 0
335 | 0
411 | 0
412 | 0
421 | 0
422 | 0
423 | 0
511 | 6
512 | 0
521 | 0
522 | 0
523 | 87
In the above table, we can see the 44 land use CODES (which I inappropriately named "class" in my first entry) for a certain city. Some values are just 0, meaning that there are no land uses of that type in that city.
Starting from this table, which displays all the land use types for t2 and their corresponding values ("VALUE_t2"), I have to reconstruct the previous amount of land uses ("VALUE_t1") per each type. To do so, I have to add and subtract the value per each land use (if not 0) by using the "change land use table" from t2 to t1, which is the following:
[B] #second, shorter dataset
CODE_t2 | CODE_t1 | VALUE_CHANGE1
121 | 112 | 2
121 | 133 | 12
121 | 323 | 0
121 | 511 | 3
121 | 523 | 2
123 | 523 | 4
133 | 123 | 3
133 | 523 | 4
141 | 231 | 12
141 | 511 | 37
So, in order to get VALUE_t1 from VALUE_t2, I have, for instance, to subtract 2 + 12 + 0 + 3 + 2 hectares (first 5 values of the second, shorter table) from the value of land use type/code 121 of the first, longer table (1271 ha), and add 2 hectares to land type 112, 12 hectares to land type 133, 3 hectares to land type 511 and 2 hectares to land type 523. And I have to do that for all the land use types different than 0, and later also from t1 to t0.
What I have to do is a sort of loop that would both add and subtract, per each land use type/code, the values from VALUE_t2 to VALUE_t1, and from VALUE_t1 to VALUE_t0.
Once I have estimated VALUE_t1 and VALUE_t0, I will put the values in a simple table showing the relative variation (here the values are not real):
CODE | VALUE_t0 | VALUE_t2 | % VAR t2-t0
code1 | 50 | 100 | ((100-50)/50)*100
code2 | 70 | 80 | ((80-70)/70)*100
code3 | 45 | 34 | ((34-45)/45)*100
What I could do so far is:
land_code <- names(A)[-1]
land_code
A$VALUE_t1 <- for(code in land_code{
cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
}
If I use the loop I get an error, while if I take it away:
A$VALUE_t1 <- cbind(A[1], A[land_code] - B[match(A$CODE_t2, B$CODE_t2), land_code])
it works but I don't really get what I want to get... So far I was working on how to get a new column which would contain the new "add & subtract" values, but haven't succeeded yet. So I worked on how to get a new column which would at least match the land use types first, to then include the "add and subtract" formula.
Another problem is that, by using "match", I get a shorter A$VALUE_t1 table (13 rows instead of 44), while I would like to keep all the land use types in dataset A, because I will then have to match it with the table including VALUES_t0 (which I haven't shown here).
Sorry that I cannot do better than this at the moment... and I hope to have explained better what I have to do. I am extremely grateful for any help you can provide to me. Thanks a lot
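One possible reading of the add-and-subtract step, sketched in Python rather than R: walk through table B once, subtracting each change from its CODE_t2 and adding it to its CODE_t1. This is only an illustration of the arithmetic described above, not a tested solution to the full problem; to keep it short it uses just the codes from table A that table B touches, and the names value_t2, changes, value_t1 and pct_var are made up for the example.

```python
# Subset of table A: VALUE_t2 for the land use codes that appear in table B.
value_t2 = {
    112: 1691, 121: 1271, 123: 522, 133: 0, 141: 626,
    231: 95, 323: 565, 511: 6, 523: 87,
}

# Table B: (CODE_t2, CODE_t1, VALUE_CHANGE1)
changes = [
    (121, 112, 2), (121, 133, 12), (121, 323, 0), (121, 511, 3), (121, 523, 2),
    (123, 523, 4), (133, 123, 3), (133, 523, 4), (141, 231, 12), (141, 511, 37),
]

# Start from the t2 values so every code is kept, even if it has no change.
value_t1 = dict(value_t2)
for code_t2, code_t1, change in changes:
    value_t1[code_t2] -= change  # hectares that belong to code_t2 at t2 ...
    value_t1[code_t1] += change  # ... belonged to code_t1 at t1


def pct_var(old, new):
    """Relative variation in percent, ((new - old) / old) * 100."""
    return (new - old) / old * 100


for code in sorted(value_t1):
    print(code, value_t2[code], value_t1[code])
```

For code 121 this gives 1271 - (2 + 12 + 0 + 3 + 2) = 1252, and the total area over all codes stays the same, since every hectare subtracted from one code is added to another. The same loop could be applied again with a t1-to-t0 change table.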
Parsing out all repeat and consecutive numbers in R
Suppose I have a dataframe like this:
1360 C 0 403
1361 A 0 403
1362 G 0 403
1402 0 A 444
2019 T 0 1060
2020 T 0 1060
2021 G 0 1060
2022 T 0 1060
2057 T 0 1085
2062 0 A 1093
2062 0 C 1094
2062 0 C 1095
Desired Output
1402 0 A 444
2057 0 0 1085
I was trying to parse out all the rows with repeats or consecutive numbers in column 1. So, I want only the rows with the numbers which were not a repeat number or a consecutive number in the dataset. Any help will be much appreciated.
You can use diff to find the difference between adjacent elements in a vector. Assuming the vector is sorted, diff will return zero for repeat numbers and one for consecutive numbers.
keep1 <- diff(df[,1]) > 1
This will include values that are after a jump, but at the start of a new sequence, so we need to check the lag1 value, and pad the logical vector to make it as long as the original.
keep <- c(keep1, TRUE) & c(TRUE, keep1)
df[keep,]
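To trace the same logic step by step on the first column of the sample data, here is a sketch in plain Python (not R); the variable names mirror the R code where possible, and after_ok, before_ok and kept are made up for the example.

```python
first_col = [1360, 1361, 1362, 1402, 2019, 2020, 2021, 2022, 2057, 2062, 2062, 2062]

# Same idea as diff(df[,1]) > 1: True where the gap to the next value is more than 1.
keep1 = [b - a > 1 for a, b in zip(first_col, first_col[1:])]

# Pad on each side so both lists are as long as the original column.
after_ok = keep1 + [True]    # the gap to the next row is > 1 (last row has no next row)
before_ok = [True] + keep1   # the gap from the previous row is > 1 (first row has none)

# A row is kept only if it is isolated on both sides.
keep = [a and b for a, b in zip(after_ok, before_ok)]
kept = [value for value, k in zip(first_col, keep) if k]

print(kept)
```

Only 1402 and 2057 survive, which matches the first column of the desired output.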
sqlite selection help needed
I have the following bill table
building | name | amount | payments | receiptno
1234 | name a | 123 | 0 | 0
1234 | name a | 12 | 10 | 39
1234 | name a | 125 | 125 | 40
1235 | name a | 133 | 10 | 41
1235 | name b | 125 | 125 | 50
1234 | name c | 100 | 90 | 0
I want to select rows where amount minus payments is greater than zero and display the max value of receiptno, so I want to select only the following from building 1234:
name a | 39
name c | 0
How can I do this?
Translating your description into SQL results in this:
SELECT building, name, MAX(receiptno)
FROM BillTable
WHERE amount - payments > 0
GROUP BY building, name
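Running that query against the sample rows can be sketched with Python's sqlite3 module as follows. The column types are an assumption (the question does not show the table definition), and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; the question only shows the data.
conn.execute(
    "CREATE TABLE BillTable ("
    "building INTEGER, name TEXT, amount INTEGER, payments INTEGER, receiptno INTEGER)"
)
conn.executemany(
    "INSERT INTO BillTable VALUES (?, ?, ?, ?, ?)",
    [
        (1234, "name a", 123, 0, 0),
        (1234, "name a", 12, 10, 39),
        (1234, "name a", 125, 125, 40),
        (1235, "name a", 133, 10, 41),
        (1235, "name b", 125, 125, 50),
        (1234, "name c", 100, 90, 0),
    ],
)

result = conn.execute(
    """
    SELECT building, name, MAX(receiptno)
    FROM BillTable
    WHERE amount - payments > 0
    GROUP BY building, name
    ORDER BY building, name
    """
).fetchall()

for row in result:
    print(row)
```

Besides the two rows for building 1234, this also returns a row for building 1235 ("name a", 41). If only building 1234 is wanted, adding AND building = 1234 to the WHERE clause should narrow it down.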