COUNT and SUM with INNER JOIN

I have two tables that are linked on 4 fields.
DP_STOP has those 4 fields plus a customer ID.
DP_ORDER has those 4 fields plus orders.
The customer (LOCATION_ID) is on 4 routes as stops on different dates in the DP_STOP table.
The customer had a total of 7 orders for those 4 routes.
When I try to COUNT the number of stops while SUMming the orders, the count is the number of orders instead of the number of stops.
DP_STOP table
REGION | ROUTE_ID | ROUTE_DATE | INTERNAL_STOP | LOCATION_ID
11600-A| 202 | 2018-11-01 | 9 | 00001
11600-A| 202 | 2018-11-08 | 9 | 00001
11600-A| 202 | 2018-11-15 | 9 | 00001
11600-A| 202 | 2018-11-22 | 9 | 00001
DP_ORDER table
REGION | ROUTE_ID | ROUTE_DATE | INTERNAL_STOP | ORDER_NUMBER | PLANNED_SIZE1
11600-A| 202 | 2018-11-01 | 9 | 1A | 5
11600-A| 202 | 2018-11-08 | 9 | 2B | 5
11600-A| 202 | 2018-11-08 | 9 | 2C | 5
11600-A| 202 | 2018-11-15 | 9 | 3A | 5
11600-A| 202 | 2018-11-15 | 9 | 3B | 5
11600-A| 202 | 2018-11-22 | 9 | 4A | 5
11600-A| 202 | 2018-11-22 | 9 | 4B | 5
This is the query I am using:
SELECT
    COUNT(L.LOCATION_ID) AS DELIVERIES,
    L.LOCATION_ID AS CUSTOMER_ID,
    SUM(O.PLANNED_SIZE1) AS CASES
FROM TSDBA.DP_STOP L
INNER JOIN TSDBA.DP_ORDER O
    ON (O.REGION = L.REGION)
    AND (O.ROUTE_DATE = L.ROUTE_DATE)
    AND (O.ROUTE_ID = L.ROUTE_ID)
    AND (O.INTERNAL_STOP = L.INTERNAL_STOP)
WHERE L.ROUTE_DATE BETWEEN '2018-11-01' AND '2018-11-28'
    AND L.REGION = '11600-A'
GROUP BY L.LOCATION_ID
My results from the query are:
DELIVERIES | CUSTOMER_ID | CASES
7 | 00001 | 35
I want it to be:
DELIVERIES | CUSTOMER_ID | CASES
4 | 00001 | 35

You should add DISTINCT on an expression that uniquely identifies a stop, i.e. COUNT(DISTINCT L.ROUTE_DATE) in your case. The join matches every stop row to all of its order rows, so a plain COUNT counts the joined rows (one per order).
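To see the fan-out and the fix concretely, here is a minimal SQLite reproduction. The schema is a hypothetical simplification of the question's tables, joining on ROUTE_DATE alone since the other three key fields are constant in the sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE dp_stop (route_date TEXT, location_id TEXT)")
cur.execute("CREATE TABLE dp_order (route_date TEXT, planned_size1 INT)")
# 4 stops for customer 00001, 7 orders across those dates
cur.executemany("INSERT INTO dp_stop VALUES (?, ?)",
                [("2018-11-01", "00001"), ("2018-11-08", "00001"),
                 ("2018-11-15", "00001"), ("2018-11-22", "00001")])
cur.executemany("INSERT INTO dp_order VALUES (?, ?)",
                [("2018-11-01", 5), ("2018-11-08", 5), ("2018-11-08", 5),
                 ("2018-11-15", 5), ("2018-11-15", 5),
                 ("2018-11-22", 5), ("2018-11-22", 5)])

# COUNT(DISTINCT ...) collapses the join fan-out back to one count per stop
row = cur.execute("""
    SELECT COUNT(DISTINCT s.route_date) AS deliveries,
           s.location_id,
           SUM(o.planned_size1) AS cases
    FROM dp_stop s
    JOIN dp_order o ON o.route_date = s.route_date
    GROUP BY s.location_id
""").fetchone()
print(row)  # (4, '00001', 35)
```

Without the DISTINCT, the first column comes back as 7 (the number of joined rows), which is exactly the symptom in the question.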


Skipping number of rows after a certain value in R

I have data that looks like the table below. I would like to skip 2 rows after the max index of certain types (3 and 4). For example, I have two 4s in my table, but I only need to remove the 2 rows after the second 4. Same for 3: I only need to remove the 2 rows after the third 3.
-----------------
| grade | type |
-----------------
| 93 | 2 |
-----------------
| 90 | 2 |
-----------------
| 54 | 2 |
-----------------
| 36 | 4 |
-----------------
| 31 | 4 |
-----------------
| 94 | 1 |
-----------------
| 57 | 1 |
-----------------
| 16 | 3 |
-----------------
| 11 | 3 |
-----------------
| 12 | 3 |
-----------------
| 99 | 1 |
-----------------
| 99 | 1 |
-----------------
| 9 | 3 |
-----------------
| 10 | 3 |
-----------------
| 97 | 1 |
-----------------
| 96 | 1 |
-----------------
The desired output would be:
-----------------
| grade | type |
-----------------
| 93 | 2 |
-----------------
| 90 | 2 |
-----------------
| 54 | 2 |
-----------------
| 36 | 4 |
-----------------
| 31 | 4 |
-----------------
| 16 | 3 |
-----------------
| 11 | 3 |
-----------------
| 12 | 3 |
-----------------
| 9 | 3 |
-----------------
| 10 | 3 |
-----------------
Here is the code of my example:
data <- data.frame(grade = c(93,90,54,36,31,94,57,16,11,12,99,99,9,10,97,96), type = c(2,2,2,4,4,1,1,3,3,3,1,1,3,3,1,1))
Could anyone give me some hints on how to approach this in R? Thanks a bunch in advance for your help and your time!
data[-c(max(which(data$type==3))+1:2,max(which(data$type==4))+1:2),]
# grade type
# 1 93 2
# 2 90 2
# 3 54 2
# 4 36 4
# 5 31 4
# 8 16 3
# 9 11 3
# 10 12 3
# 11 99 1
# 12 99 1
# 13 9 3
# 14 10 3
Using some indexing:
data[-(nrow(data) - match(c(3,4), rev(data$type)) + 1 + rep(1:2, each=2)),]
# grade type
#1 93 2
#2 90 2
#3 54 2
#4 36 4
#5 31 4
#8 16 3
#9 11 3
#10 12 3
#11 99 1
#12 99 1
#13 9 3
#14 10 3
Or more generically:
vals <- c(3,4)
data[-(nrow(data) - match(vals, rev(data$type)) + 1 + rep(1:2, each=length(vals))),]
The logic is to match each value against the reversed column (which finds its last occurrence), flip that position back into an original row index, add 1 and 2 to get the two following row indexes, and then drop those rows.
Similar to Ric, but I find it a bit easier to read (way more verbose, though):
library(dplyr)
idx = data %>%
  mutate(id = row_number()) %>%
  filter(type %in% 3:4) %>%
  group_by(type) %>%
  filter(id == max(id)) %>%
  pull(id)
data[-c(idx + 1, idx + 2),]
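For readers coming from Python, the same "drop the two rows after the last occurrence of each type" logic (which is what the answers above implement) can be sketched with the standard library alone:

```python
# The grade/type vectors from the question's example data
grades = [93, 90, 54, 36, 31, 94, 57, 16, 11, 12, 99, 99, 9, 10, 97, 96]
types  = [2, 2, 2, 4, 4, 1, 1, 3, 3, 3, 1, 1, 3, 3, 1, 1]

# For each target type, find the 0-based index of its last occurrence,
# then mark the two following rows for removal.
drop = set()
for t in (3, 4):
    last = max(i for i, v in enumerate(types) if v == t)
    drop.update({last + 1, last + 2})

kept = [(g, t) for i, (g, t) in enumerate(zip(grades, types)) if i not in drop]
print([g for g, _ in kept])  # [93, 90, 54, 36, 31, 16, 11, 12, 99, 99, 9, 10]
```

This mirrors `max(which(...)) + 1:2` in the R answers, just with 0-based indexing.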

R data.table group-by join in PySpark 1.6

I have the following datatables (R code):
accounts <- fread("ACC_ID | DATE | RATIO | VALUE
1 | 2017-12-31 | 2.00 | 8
2 | 2017-12-31 | 2.00 | 12
3 | 2017-12-31 | 6.00 | 20
4 | 2017-12-31 | 1.00 | 5 ", sep='|')
timeline <- fread(" DATE
2017-12-31
2018-12-31
2019-12-31
2020-12-31", sep="|")
In R, I know I can join on DATE, by ACC_ID, RATIO and VALUE:
accounts[, .SD[timeline, on='DATE'], by=c('ACC_ID', 'RATIO', 'VALUE')]
This way, I can "project" ACC_ID, RATIO and VALUE values over timeline dates, getting the following data table:
ACC_ID | RATIO | VALUE | DATE
1 | 2 | 8 |2017-12-31
2 | 2 | 12 |2017-12-31
3 | 6 | 20 |2017-12-31
4 | 1 | 5 |2017-12-31
1 | 2 | 8 |2018-12-31
2 | 2 | 12 |2018-12-31
3 | 6 | 20 |2018-12-31
4 | 1 | 5 |2018-12-31
1 | 2 | 8 |2019-12-31
2 | 2 | 12 |2019-12-31
3 | 6 | 20 |2019-12-31
4 | 1 | 5 |2019-12-31
1 | 2 | 8 |2020-12-31
2 | 2 | 12 |2020-12-31
3 | 6 | 20 |2020-12-31
4 | 1 | 5 |2020-12-31
I've been trying hard to find something similar in PySpark, but I haven't been able to. What would be the appropriate way to solve this?
Thanks very much for your time. I greatly appreciate any help you can give me, this one is important for me.
It looks like you're trying to do a cross join?
spark.sql('''
select ACC_ID, RATIO, VALUE, timeline.DATE
from accounts, timeline
''')
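The cross join here is just the Cartesian product of accounts and timeline; a quick plain-Python sketch (stdlib only, outside Spark) of the same expansion:

```python
# Sample data from the question: (ACC_ID, RATIO, VALUE) rows and timeline dates
accounts = [(1, 2.0, 8), (2, 2.0, 12), (3, 6.0, 20), (4, 1.0, 5)]
timeline = ["2017-12-31", "2018-12-31", "2019-12-31", "2020-12-31"]

# Cartesian product: every account row is repeated for every date,
# date-first to match the ordering of the data.table output above.
projected = [(acc_id, ratio, value, date)
             for date in timeline
             for (acc_id, ratio, value) in accounts]

print(len(projected))  # 16 rows: 4 accounts x 4 dates
print(projected[0])    # (1, 2.0, 8, '2017-12-31')
```

In Spark the same effect comes from a join with no join condition, which is why the comma-separated FROM clause above works.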

Subtract column values using coalesce

I want to subtract values in the "place" column for each record returned by a "race", "bib", "split" GROUP BY, so that a "diff" column appears like so.
Desired Output:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 0
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 0
10 | 17 | 2 | 12 | -4
10 | 17 | 3 | 15 | -3
I'm new to using the COALESCE function, and the closest I have come to the desired output is the following:
select a.race,a.bib,a.split, a.place,
coalesce(a.place -
(select b.place from ranking b where b.split < a.split), a.place) as diff
from ranking a
group by race,bib, split
which produces:
race | bib | split | place | diff
----------------------------------
10 | 514 | 1 | 5 | 5
10 | 514 | 2 | 3 | 2
10 | 514 | 3 | 2 | 1
10 | 17 | 1 | 8 | 8
10 | 17 | 2 | 12 | 11
10 | 17 | 3 | 15 | 14
Thanks for looking!
To compute the difference, you have to look up the value in the row that has the same race and bib values, and the next-smaller split value:
SELECT race, bib, split, place,
coalesce((SELECT r2.place
FROM ranking AS r2
WHERE r2.race = ranking.race
AND r2.bib = ranking.bib
AND r2.split < ranking.split
ORDER BY r2.split DESC
LIMIT 1
) - place,
0) AS diff
FROM ranking;
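Since the correlated subquery is standard SQL, the query can be checked against the sample data with, e.g., SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE ranking (race INT, bib INT, split INT, place INT)")
cur.executemany("INSERT INTO ranking VALUES (?, ?, ?, ?)",
                [(10, 514, 1, 5), (10, 514, 2, 3), (10, 514, 3, 2),
                 (10, 17, 1, 8), (10, 17, 2, 12), (10, 17, 3, 15)])

# diff = place at the next-smaller split minus the current place,
# defaulting to 0 for the first split of each race/bib
rows = cur.execute("""
    SELECT race, bib, split, place,
           coalesce((SELECT r2.place
                     FROM ranking AS r2
                     WHERE r2.race = ranking.race
                       AND r2.bib = ranking.bib
                       AND r2.split < ranking.split
                     ORDER BY r2.split DESC
                     LIMIT 1
                    ) - place,
                    0) AS diff
    FROM ranking
""").fetchall()
print(rows)  # last element of each row is the diff column
```

This reproduces the desired output: diffs 0, 2, 1 for bib 514 and 0, -4, -3 for bib 17.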

Oracle: calculate the sum of one column per subgroup and keep the other columns in the query result

I have table1 like this.
id | name | sum1 | sum2 | bonus
———|——————|—————————|——————|——————
9 | X | 225 | 0,68 | 3
10 | X | 30 | 0,85 | 3
11 | X | 3384,73 | 0,8 | 3
15 | Y | 2800 | 2 | 3
16 | Y | 500 | 0 | 0
17 | Y | 2077,49 | 0,8 | 3
18 | Y | 26736,96| 0,7 | 8
19 | Z | 209,9 | 1,5 | 3
20 | Z | 700 | 1 | 3
21 | Z | 6550 | 0 | 0
I want to sum the bonus column for each "name" subgroup and get the following table2 as the query result:
id | name | sum1 | sum2 | bonus
————————|——————| ————————|——————|——————
9 | X | 225 | 0,68 | 3
10 | X | 30 | 0,85 | 3
11 | X | 3384,73 | 0,8 | 3
totalX | null | null | null | 9
15 | Y | 2800 | 2 | 3
16 | Y | 500 | 0 | 0
17 | Y | 2077,49 | 0,8 | 3
18 | Y | 26736,96| 0,7 | 8
totalY | null | null | null | 14
19 | Z | 209,9 | 1,5 | 3
20 | Z | 700 | 1 | 3
21 | Z | 6550 | 0 | 0
totalZ | null | null | null | 6
I did try "over partition by"
SELECT table1.*, sum(bonus) over (PARTITION by name) as bonus_total FROM table1
That gave me an extra column with the bonus sum for each subgroup, but it is not exactly what I want to get:
id | name | sum1 | sum2 | bonus| bonus_total
————————|——————| ————————|——————|——————|————————————
9 | X | 225 | 0,68 | 3| 9
10 | X | 30 | 0,85 | 3| 9
11 | X | 3384,73 | 0,8 | 3| 9
15 | Y | 2800 | 2 | 3| 14
16 | Y | 500 | 0 | 0| 14
17 | Y | 2077,49 | 0,8 | 3| 14
18 | Y | 26736,96| 0,7 | 8| 14
19 | Z | 209,9 | 1,5 | 3| 6
20 | Z | 700 | 1 | 3| 6
21 | Z | 6550 | 0 | 0| 6
You can do this with a partial GROUP BY ROLLUP plus some conditional expressions:
with table1 as (select 9 id, 'X' name, 225 sum1, 0.68 sum2, 3 bonus from dual union all
select 10 id, 'X' name, 30 sum1, 0.85 sum2, 3 bonus from dual union all
select 11 id, 'X' name, 3384.73 sum1, 0.8 sum2, 3 bonus from dual union all
select 15 id, 'Y' name, 2800 sum1, 2 sum2, 3 bonus from dual union all
select 16 id, 'Y' name, 500 sum1, 0 sum2, 0 bonus from dual union all
select 17 id, 'Y' name, 2077.49 sum1, 0.8 sum2, 3 bonus from dual union all
select 18 id, 'Y' name, 26736.96 sum1, 0.7 sum2, 8 bonus from dual union all
select 19 id, 'Z' name, 209.9 sum1, 1.5 sum2, 3 bonus from dual union all
select 20 id, 'Z' name, 700 sum1, 1 sum2, 3 bonus from dual union all
select 21 id, 'Z' name, 6550 sum1, 0 sum2, 0 bonus from dual)
select case when id is null then 'total'||name else to_char(id) end id,
case when id is not null then name end name,
case when id is not null then sum(sum1) end sum1,
case when id is not null then sum(sum2) end sum2,
sum(bonus) bonus
from table1 t1
group by name, rollup (id)
order by t1.name, t1.id;
ID NAME SUM1 SUM2 BONUS
-------- ---- ---------- ---------- ----------
9 X 225 .68 3
10 X 30 .85 3
11 X 3384.73 .8 3
totalX 9
15 Y 2800 2 3
16 Y 500 0 0
17 Y 2077.49 .8 3
18 Y 26736.96 .7 8
totalY 14
19 Z 209.9 1.5 3
20 Z 700 1 3
21 Z 6550 0 0
totalZ 6
The CASE expressions are required purely to get the formatting you asked for. I had to wrap SUM around the sum1 and sum2 columns to make them legal in the grouped query; on the rollup total rows we turn them into NULLs for the output.
Also, I am assuming that the id column is set to disallow null values.
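To see what the partial ROLLUP produces (detail rows followed by one subtotal row per name), here is a plain-Python sketch over just the id/name/bonus columns, with the `'total'||name` label mimicked by string concatenation:

```python
from itertools import groupby

# (id, name, bonus) rows from the question, already sorted by name
rows = [(9, "X", 3), (10, "X", 3), (11, "X", 3),
        (15, "Y", 3), (16, "Y", 0), (17, "Y", 3), (18, "Y", 8),
        (19, "Z", 3), (20, "Z", 3), (21, "Z", 0)]

out = []
for name, group in groupby(rows, key=lambda r: r[1]):
    group = list(group)
    out.extend(group)                                 # detail rows, as-is
    out.append(("total" + name, None,                 # subtotal row per name
                sum(r[2] for r in group)))

print([r for r in out if r[1] is None])
# [('totalX', None, 9), ('totalY', None, 14), ('totalZ', None, 6)]
```

The subtotals 9, 14 and 6 match the totalX/totalY/totalZ rows in the Oracle output above.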

order grouping variable in R

I have a data frame like this:
ID   | familysize | age | gender
-----+------------+-----+-------
1001 | 4 | 26 | 1
1001 | 4 | 38 | 2
1001 | 4 | 30 | 2
1001 | 4 | 7 | 1
1002 | 3 | 25 | 2
1002 | 3 | 39 | 1
1002 | 3 | 10 | 2
1003 | 5 | 60 | 1
1003 | 5 | 50 | 2
1003 | 5 | 26 | 2
1003 | 5 | 23 | 1
1003 | 5 | 20 | 1
1004 | ....
I want to order this data frame by the age of the people within each ID, so I use this command:
library(plyr)
b2 <- ddply(b, "ID", function(x) head(x[order(x$age, decreasing = TRUE), ]))
but when I use this command I lose some of the observations. What should I do to order this data frame?
b2 <- b[order(b$ID, -b$age), ]
should do the trick.
The arrange function in plyr does a great job here: order by ID, then by age in descending order.
arrange(b, ID, desc(age))
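For comparison outside R, the same two-key sort (ID ascending, then age descending) in plain Python, using the first two families from the sample:

```python
# (ID, familysize, age, gender) rows for the first two families
rows = [(1001, 4, 26, 1), (1001, 4, 38, 2), (1001, 4, 30, 2), (1001, 4, 7, 1),
        (1002, 3, 25, 2), (1002, 3, 39, 1), (1002, 3, 10, 2)]

# Same keys as order(b$ID, -b$age): ID ascending, age negated for descending
ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
print(ordered[:4])
# [(1001, 4, 38, 2), (1001, 4, 30, 2), (1001, 4, 26, 1), (1001, 4, 7, 1)]
```

Unlike the head()-based attempt in the question, a plain sort never drops rows; it only reorders them.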
