Remove duplicates based on multiple values in R or POWER BI


I have a data set, each line representing a "service visit" for customers. A customer might have between 0 and 5 service calls. If there isn't a service call for someone, the columns associated with a service call would all be empty.
+--------------+-------------------+-------------------+------------------------+---------------------+
| Project Name | Customer Name | Service Call.Name | Service Call Date Time | Service Call Status |
+--------------+-------------------+-------------------+------------------------+---------------------+
| OO-99999 | A | SC-001762 | 3/21/2022 7:00:00 PM | Completed |
| OO-99999 | A | SC-002323 | null | Completed |
| OO-99999 | A | SC-002357 | 10/3/2022 7:00:00 PM | 2nd Visit Scheduled |
| OO-88888 | B | SC-001260 | 2/1/2022 8:00:00 PM | Completed |
| OO-88888 | B | SC-002938 | 8/25/2022 7:00:00 PM | Scheduled |
| OO-55555 | C | SC-000957 | 12/27/2021 8:00:00 PM | Completed |
| OO-55555 | C | SC-001418 | 2/7/2022 4:30:00 PM | Completed |
| OO-55555 | C | SC-003007 | null | null |
| OO-66666 | D | SC-001626 | null | No Longer Required |
| OO-66666 | D | SC-002329 | 6/9/2022 7:00:00 PM | Completed |
| OO-66666 | D | SC-002538 | null | Completed |
| OO-66666 | D | SC-002932 | null | Call Reviewed |
| OO-66666 | D | SC-003350 | 9/29/2022 7:00:00 PM | Scheduled |
| OO-11111 | F | null | null | null |
+--------------+-------------------+-------------------+------------------------+---------------------+
My goal is to filter out duplicates. I only want one row per customer, but I want to keep a specific row. A duplicate only appears if someone has multiple service calls.
If someone has a service call (Service Call.Name not equal to null), and one of those has a service call status of something OTHER than "Completed" or "Not required", I want to keep that row. So for Customer A, I want the third row since the service call status is not "Completed" or "Not required".
If someone has multiple service calls and they are all "Completed" or "Not required", I don't care which one I keep, as long as I only keep one.
If someone has one service call or no service call, there will be no duplicate of that person, so I want to keep that row.
EDIT
There were cases of duplicates I didn't realize I had; I've edited the data to show them.
For someone with more than one open service call, like customer E, I only want to keep one of them. If both have a date, I want the later of the two. If one has a date and the other doesn't, I want the one with a date. If neither has a date, I don't care which is kept, but I only want one.
I am working in Power BI, but I have access to R and think that might be easier.

Here is a solution. It builds two logical indices: one created with %in%, marking which rows to keep by status, and one created with duplicated, marking which rows to keep within each customer name.
dat <- read.table(text = '+--------------+---------------+-------------------+------------------------+---------------------+
| Project Name | Customer Name | Service Call.Name | Service Call Date Time | Service Call Status |
+--------------+---------------+-------------------+------------------------+---------------------+
| OO-99999 | A | SC-001762 | 3/21/2022 7:00:00 PM | Completed |
| OO-99999 | A | SC-002323 | null | Completed |
| OO-99999 | A | SC-002357 | 10/3/2022 7:00:00 PM | 2nd Visit Scheduled |
| OO-88888 | B | SC-001260 | 2/1/2022 8:00:00 PM | Completed |
| OO-88888 | B | SC-002938 | 8/25/2022 7:00:00 PM | Scheduled |
| OO-55555 | C | SC-000957 | 12/27/2021 8:00:00 PM | Completed |
| OO-55555 | C | SC-001418 | 2/7/2022 4:30:00 PM | Completed |
| OO-55555 | C | SC-003007 | null | null |
| OO-11111 | D | null | null | null |
+--------------+---------------+-------------------+------------------------+---------------------+
', header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE, check.names = FALSE)
dat <- dat[-c(1, ncol(dat))]
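not_wanted <- c("Completed", "Not required")
# TRUE for rows whose status is one we would rather drop
i <- dat[['Service Call Status']] %in% not_wanted
# per customer: if every row is unwanted, keep the first one anyway;
# the function returns TRUE for the rows to keep
i <- ave(i, dat[['Customer Name']], FUN = \(k) {
  if(all(k)) k[1] <- FALSE
  !k
})
result <- dat[i,]
# within each customer, flag rows that repeat a status already seen;
# ave() returns character here, hence as.logical() below
j <- ave(result[['Service Call Status']], result[['Customer Name']], FUN = duplicated)
result <- result[!as.logical(j), ]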
result
#> Project Name Customer Name Service Call.Name Service Call Date Time Service Call Status
#> 3 OO-99999 A SC-002357 10/3/2022 7:00:00 PM 2nd Visit Scheduled
#> 5 OO-88888 B SC-002938 8/25/2022 7:00:00 PM Scheduled
#> 8 OO-55555 C SC-003007 null null
#> 9 OO-11111 D null null null
Created on 2022-10-26 with reprex v2.0.2
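The answer above does not cover the date rules from the EDIT. As a rough illustration only (not the answerer's method), here is a minimal Python sketch of the full selection rule using the question's sample rows. It assumes that a null status on an existing service call counts as open, as the R code does, and that "No Longer Required" from the data is treated the same as "Not required". The names CLOSED, priority and pick_one_per_customer are made up for this sketch.

```python
from datetime import datetime

# Assumption: these statuses all count as "closed".
CLOSED = {"Completed", "Not required", "No Longer Required"}

# (project, customer, service call, date time, status); None stands for null.
rows = [
    ("OO-99999", "A", "SC-001762", "3/21/2022 7:00:00 PM", "Completed"),
    ("OO-99999", "A", "SC-002323", None, "Completed"),
    ("OO-99999", "A", "SC-002357", "10/3/2022 7:00:00 PM", "2nd Visit Scheduled"),
    ("OO-88888", "B", "SC-001260", "2/1/2022 8:00:00 PM", "Completed"),
    ("OO-88888", "B", "SC-002938", "8/25/2022 7:00:00 PM", "Scheduled"),
    ("OO-55555", "C", "SC-000957", "12/27/2021 8:00:00 PM", "Completed"),
    ("OO-55555", "C", "SC-001418", "2/7/2022 4:30:00 PM", "Completed"),
    ("OO-55555", "C", "SC-003007", None, None),
    ("OO-66666", "D", "SC-001626", None, "No Longer Required"),
    ("OO-66666", "D", "SC-002329", "6/9/2022 7:00:00 PM", "Completed"),
    ("OO-66666", "D", "SC-002538", None, "Completed"),
    ("OO-66666", "D", "SC-002932", None, "Call Reviewed"),
    ("OO-66666", "D", "SC-003350", "9/29/2022 7:00:00 PM", "Scheduled"),
    ("OO-11111", "F", None, None, None),
]


def priority(row):
    """Sort key: open calls first, then rows with a date, then the latest date."""
    _, _, call, when, status = row
    is_open = call is not None and status not in CLOSED
    date = datetime.strptime(when, "%m/%d/%Y %I:%M:%S %p") if when else None
    return (is_open, date is not None, date or datetime.min)


def pick_one_per_customer(rows):
    """Keep exactly one row per customer: the one with the highest priority."""
    best = {}
    for row in rows:
        customer = row[1]
        if customer not in best or priority(row) > priority(best[customer]):
            best[customer] = row
    return list(best.values())


for row in pick_one_per_customer(rows):
    print(row)
```

Ties (for example two closed calls with no date) resolve to whichever row comes first, which matches the "I don't care which one" cases in the question.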

Related

Kusto/KQL: summarize by time bucket AND count(string) column

I have a table of http responses including timestamp, service name and the http response code I want to query using KQL/Kusto.
My goal is to have a table that tells me "How many http responses of a certain type (2xx, 4xx etc) did a particular service have within the last 5 minutes over time"
I want to summarize the rows by a time bucket of 5min and the ResponseType (basically the response code class) as well - but I can't seem to make it work. When I add count(ResponseType) to the summarize clause, it returns the error message Function 'count' cannot be invoked in current context.
My KQL looks like this
InsightsMetrics
| extend Tags = parse_json(Tags)
| extend Responsecode = tostring(Tags.["code"])
| extend ResponseType = strcat(substring(Responsecode, 0, 1), "XX")
| extend Service = tostring(Tags.["service"])
| where TimeGenerated >= now(-4h)
| where Namespace == "prometheus"
| where Name contains "traefik_service_requests_total"
| project TimeGenerated, Responsecode, Service, ResponseType
| summarize by bin(TimeGenerated, 5m), ResponseType
which returns data like this:
| TimeGenerated | ResponseType | Service |
|---------------------|--------------|----------------------------------------------------------|
| 2020-10-01 10:25:00 | 3XX | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |
| 2020-10-01 10:30:00 | 2XX | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |
| 2020-10-01 10:30:00 | 2XX | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |
| 2020-10-01 10:30:00 | 4XX | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |
When I want something like this instead
| TimeGenerated | ResponseType | count(ResponseType) | Service |
|---------------------|--------------|---------------------|----------------------------------------------------------|
| 2020-10-01 10:25:00 | 3XX | 1 | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |
| 2020-10-01 10:30:00 | 2XX | 2 | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |
| 2020-10-01 10:30:00 | 4XX | 1 | prod-service-internal-50f0bab542c7d81ed22e#kubernetescrd |

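In KQL the count() aggregation takes no arguments; it counts the rows in each group. So all you have to do is add it to the summarize clause and include Service in the by list, that is, replace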
| summarize by bin(TimeGenerated, 5m), ResponseType
with
| summarize count() by bin(TimeGenerated, 5m), ResponseType, Service
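To make the grouping concrete, here is a small Python sketch of what summarize count() by bin(TimeGenerated, 5m), ResponseType, Service computes. The timestamps, response codes and the service name "svc" are made-up sample values, and bin_5m is a hypothetical helper that mimics bin() for a 5-minute width.

```python
from collections import Counter
from datetime import datetime, timedelta


def bin_5m(ts):
    """Round a timestamp down to the start of its 5-minute bucket."""
    return ts - timedelta(minutes=ts.minute % 5,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


# (TimeGenerated, Responsecode, Service) -- hypothetical sample rows
responses = [
    (datetime(2020, 10, 1, 10, 27, 10), "302", "svc"),
    (datetime(2020, 10, 1, 10, 31, 0), "200", "svc"),
    (datetime(2020, 10, 1, 10, 33, 45), "204", "svc"),
    (datetime(2020, 10, 1, 10, 34, 59), "404", "svc"),
]

# One counter entry per (time bucket, response class, service) group.
counts = Counter(
    (bin_5m(ts), code[0] + "XX", service)
    for ts, code, service in responses
)

for (bucket, response_type, service), n in sorted(counts.items()):
    print(bucket, response_type, n, service)
```

Every column that should appear in the output has to be part of the group key, which is why Service is added to the by list.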

Default value for LAG function in MariaDB

I'm trying to build a view which allows me to track the difference between paid values at two consecutive month_ids. When a figure is missing however, that would be because it's the first entry and therefore has a paid amount of 0. At present, I'm using the below to represent the previous figure since the [,default] argument has not been implemented in MariaDB.
CASE WHEN (
NOT(policy_agent_month.policy_agent_month_id IS NOT NULL
AND LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id ) IS NULL)) THEN
LAG(days_paid, 1) OVER ( PARTITION BY claim_id ORDER BY month_id)
ELSE
0
END
The problem I have with this is that I have about 30 variables which this function needs to be applied over and it makes my code unreadable and very clunky. Is there a better solution?

Why use WITH?
SELECT province, tot_pop,
tot_pop - COALESCE(
(LAG(tot_pop) OVER (ORDER BY tot_pop ASC)),
0) AS delta
FROM provinces
ORDER BY tot_pop asc;
+---------------------------+----------+---------+
| province | tot_pop | delta |
+---------------------------+----------+---------+
| Nunavut | 14585 | 14585 |
| Yukon | 21304 | 6719 |
| Northwest Territories | 24571 | 3267 |
| Prince Edward Island | 63071 | 38500 |
| Newfoundland and Labrador | 100761 | 37690 |
| New Brunswick | 332715 | 231954 |
| Nova Scotia | 471284 | 138569 |
| Saskatchewan | 622467 | 151183 |
| Manitoba | 772672 | 150205 |
| Alberta | 2481213 | 1708541 |
| British Columbia | 3287519 | 806306 |
| Quebec | 5321098 | 2033579 |
| Ontario | 10071458 | 4750360 |
+---------------------------+----------+---------+
13 rows in set (0.00 sec)
However, it is not cheap (at least in MySQL 8.0). The table has only 13 rows, yet the handler counters are several times higher:
FLUSH STATUS;
SELECT ...
SHOW SESSION STATUS LIKE 'Handler%';
MySQL 8.0:
+----------------------------+-------+
| Variable_name | Value |
+----------------------------+-------+
| Handler_read_rnd | 89 |
| Handler_read_rnd_next | 52 |
| Handler_write | 26 |
(and others)
MariaDB 10.3:
| Handler_read_rnd | 77 |
| Handler_read_rnd_next | 42 |
| Handler_tmp_write | 13 |
| Handler_update | 13 |

You can use a CTE (Common Table Expression) in MariaDB 10.2+ to pre-compute frequently used expressions and name them for later use:
with
x as ( -- first we compute the CTE that we name "x"
select
*,
coalesce(
LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id),
123456
) as prev_month -- this expression gets the name "prev_month"
from my_table -- or a simple/complex join here
)
select -- now the main query
prev_month
from x
... -- rest of your query here where "prev_month" is computed.
In the main query prev_month has the lag value, or the default value 123456 when it's null.
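For a quick way to try the COALESCE(LAG(...), default) pattern without a MariaDB server, here is a sketch that runs the same shape of query through Python's sqlite3 module. It assumes the bundled SQLite is 3.25 or newer (window function support); the payments table and its rows are invented for illustration and are not from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (claim_id INTEGER, month_id INTEGER, days_paid INTEGER)"
)
# Hypothetical data: claim 1 has two months, claim 2 has only a first entry.
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [(1, 1, 10), (1, 2, 15), (2, 1, 7)],
)

query = """
WITH x AS (
    SELECT claim_id, month_id, days_paid,
           -- LAG yields NULL on the first row of each claim; COALESCE turns it into 0
           COALESCE(
               LAG(days_paid, 1) OVER (PARTITION BY claim_id ORDER BY month_id),
               0
           ) AS prev_days_paid
    FROM payments
)
SELECT claim_id, month_id, days_paid - prev_days_paid AS delta
FROM x
ORDER BY claim_id, month_id
"""

deltas = conn.execute(query).fetchall()
print(deltas)
```

The first month of each claim keeps its full days_paid as the delta, since the missing previous value is replaced by 0.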

Optimizing query that looks at a specific time window each day

This is a followup to my previous question
Optimizing query to get entire row where one field is the maximum for a group
I'll change the names from what I used there to make them a little more memorable, but these don't represent my actual use-case (so don't estimate the number of records from them).
I have a table with a schema like this:
OrderTime DATETIME(6),
Customer VARCHAR(50),
DrinkPrice DECIMAL,
Bartender VARCHAR(50),
TimeToPrepareDrink TIME(6),
...
I'd like to extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day. So for instance I'd want results like
Date | Customer | OrderTime | MaxPrice | Bartender | ...
-------+----------+-------------+------------+-----------+-----
1/1/18 | Alice | 1/1/18 3:45 | 13.15 | Jane | ...
1/1/18 | Bob | 1/1/18 5:12 | 9.08 | Jane | ...
1/1/18 | Carol | 1/1/18 4:45 | 20.00 | Tarzan | ...
1/2/18 | Alice | 1/2/18 3:45 | 13.15 | Jane | ...
1/2/18 | Bob | 1/2/18 5:57 | 6.00 | Tarzan | ...
1/2/18 | Carol | 1/2/18 3:13 | 6.00 | Tarzan | ...
...
The table has an index on OrderTime, and contains tens of billions of records. (My customers are heavy drinkers).
Thanks to the previous question I'm able to extract this for a specific day pretty easily. I can do something like:
SELECT * FROM orders b
INNER JOIN (
SELECT Customer, MAX(DrinkPrice) as MaxPrice
FROM orders
WHERE OrderTime >= '2018-01-01 15:00'
AND OrderTime <= '2018-01-01 18:00'
GROUP BY Customer
) AS a
ON a.Customer = b.Customer
AND a.MaxPrice = b.DrinkPrice
WHERE b.OrderTime >= '2018-01-01 15:00'
AND b.OrderTime <= '2018-01-01 18:00';
This query runs in less than a second. The explain plan looks like this:
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| id| select_type | table | type | possible_keys | key | ref | Extra |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
| 1 | PRIMARY | b | range | OrderTime | OrderTime | NULL | Using index condition |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | b.Customer,b.Price | |
| 2 | DERIVED | orders | range | OrderTime | OrderTime | NULL | Using index condition; Using temporary; Using filesort |
+---+-------------+------------+-------+---------------+------------+--------------------+--------------------------------------------------------+
I can also get the information about the relevant rows for my query:
SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0;
This query also runs in less than a second. Here's how its explain plan looks:
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union2,3> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------+------------------------------------------------+
The problem now is retrieving the remaining fields from the table. I tried adapting the trick from before, like so:
SELECT * FROM
orders a
INNER JOIN
(SELECT Date, Customer, MAX(DrinkPrice) AS MaxPrice
FROM
orders
INNER JOIN
(SELECT '2018-01-01' AS Date
UNION
SELECT '2018-01-02' AS Date) dates
WHERE OrderTime >= TIMESTAMP(Date, '15:00:00')
AND OrderTime <= TIMESTAMP(Date, '18:00:00')
GROUP BY Date, Customer
HAVING MaxPrice > 0) b
ON a.OrderTime >= TIMESTAMP(b.Date, '15:00:00')
AND a.OrderTime <= TIMESTAMP(b.Date, '18:00:00')
AND a.Customer = b.Customer;
However, for reasons I don't understand, the database chooses to execute this in a way that takes forever. Explain plan:
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| id | select_type | table | type | possible_keys | key | ref | Extra |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
| 1 | PRIMARY | a | ALL | OrderTime | NULL | NULL | |
| 1 | PRIMARY | <derived2> | ref | key0 | key0 | a.Customer | Using where |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | Using temporary; Using filesort |
| 2 | DERIVED | orders | ALL | OrderTime | NULL | NULL | Range checked for each record (index map: 0x1) |
| 3 | DERIVED | NULL | NULL | NULL | NULL | NULL | No tables used |
| 4 | UNION | NULL | NULL | NULL | NULL | NULL | No tables used |
| NULL | UNION RESULT | <union3,4> | ALL | NULL | NULL | NULL | |
+------+--------------+------------+------+---------------+------+------------+------------------------------------------------+
Questions:
What is going on here?
How can I fix it?

To extract the rows from the table representing each customer's most expensive drink order during happy hour (3 PM - 6 PM) each day I would use row_number() over() within a case expression evaluating the hour of day, like this:
CREATE TABLE mytable(
Date DATE
,Customer VARCHAR(10)
,OrderTime DATETIME
,MaxPrice NUMERIC(12,2)
,Bartender VARCHAR(11)
);
Note: the OrderTime values were changed from the question's sample data.
INSERT INTO mytable(Date,Customer,OrderTime,MaxPrice,Bartender)
VALUES
('1/1/18','Alice','1/1/18 13:45',13.15,'Jane')
, ('1/1/18','Bob' ,'1/1/18 15:12', 9.08,'Jane')
, ('1/2/18','Alice','1/2/18 13:45',13.15,'Jane')
, ('1/2/18','Bob' ,'1/2/18 15:57', 6.00,'Tarzan')
, ('1/2/18','Carol','1/2/18 13:13', 6.00,'Tarzan')
;
The suggested query is this:
select
*
from (
select
*
, case when hour(OrderTime) between 15 and 18 then
row_number() over(partition by `Date`, customer
order by MaxPrice DESC)
else null
end rn
from mytable
) d
where rn = 1
;
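Note that row_number() numbers every row in each partition, including orders outside happy hour, so a more expensive order placed outside the window would take rn = 1 and hide that customer's happy-hour row; filtering on the time window before numbering avoids this. Also, hour(OrderTime) between 15 and 18 matches orders up to 18:59. The result gives access to all columns you include in the derived table: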
Date | Customer | OrderTime | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | -:
0001-01-18 | Bob | 0001-01-18 15:12:00 | 9.08 | Jane | 1
0001-02-18 | Bob | 0001-02-18 15:57:00 | 6.00 | Tarzan | 1
To help display how this works, running the derived table subquery:
select
*
, case when hour(OrderTime) between 15 and 18 then
row_number() over(partition by `Date`, customer order by MaxPrice DESC)
else null
end rn
from mytable
;
produces this interim resultset:
Date | Customer | OrderTime | MaxPrice | Bartender | rn
:--------- | :------- | :------------------ | -------: | :-------- | ---:
0001-01-18 | Alice | 0001-01-18 13:45:00 | 13.15 | Jane | null
0001-01-18 | Bob | 0001-01-18 15:12:00 | 9.08 | Jane | 1
0001-02-18 | Alice | 0001-02-18 13:45:00 | 13.15 | Jane | null
0001-02-18 | Bob | 0001-02-18 15:57:00 | 6.00 | Tarzan | 1
0001-02-18 | Carol | 0001-02-18 13:13:00 | 6.00 | Tarzan | null
db<>fiddle here

The task seems to be a "groupwise-max" problem. Here's one approach, involving only 2 'queries' (the inner one is called a "derived table").
SELECT x.OrderDate, x.Customer, b.OrderTime,
x.MaxPrice, b.Bartender
FROM
(
SELECT DATE(OrderTime) AS OrderDate,
Customer,
Max(Price) AS MaxPrice
FROM tbl
WHERE TIME(OrderTime) BETWEEN '15:00' AND '18:00'
GROUP BY OrderDate, Customer
) AS x
JOIN tbl AS b
ON DATE(b.OrderTime) = x.OrderDate
AND b.Customer = x.Customer
AND b.Price = x.MaxPrice
WHERE TIME(b.OrderTime) BETWEEN '15:00' AND '18:00'
ORDER BY x.OrderDate, x.Customer
Desirable index:
INDEX(Customer, Price)
(There's no good reason to be using MyISAM.)
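As a plain illustration of the groupwise-max logic (filter to the time window first, then keep the most expensive order per day and customer), here is a short Python sketch. The orders list is invented sample data and best_happy_hour_orders is a hypothetical helper, not part of any library.

```python
from datetime import datetime, time

HAPPY_START, HAPPY_END = time(15, 0), time(18, 0)

# (OrderTime, Customer, Price, Bartender) -- hypothetical sample rows
orders = [
    (datetime(2018, 1, 1, 13, 0), "Alice", 20.00, "Tarzan"),  # outside happy hour
    (datetime(2018, 1, 1, 15, 45), "Alice", 13.15, "Jane"),
    (datetime(2018, 1, 1, 16, 10), "Alice", 8.00, "Tarzan"),
    (datetime(2018, 1, 1, 17, 12), "Bob", 9.08, "Jane"),
]


def best_happy_hour_orders(orders):
    """Return {(date, customer): order} holding the priciest happy-hour order."""
    best = {}
    for order in orders:
        order_time, customer, price, _ = order
        # Filter first, so orders outside the window can never win.
        if not (HAPPY_START <= order_time.time() <= HAPPY_END):
            continue
        key = (order_time.date(), customer)
        if key not in best or price > best[key][2]:
            best[key] = order
    return best


for key, order in sorted(best_happy_hour_orders(orders).items()):
    print(key, order)
```

Because the whole order tuple is stored, the remaining columns (Bartender and so on) come along with the maximum, which is what the join back to the table achieves in SQL.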
Billions of new rows per day
This adds new wrinkles. That's upwards of a terabyte of additional disk space needed each and every day?
Is it possible to summarize the data? The goal here is to add summary info as the new data comes in, and never have to re-scan the billions of old data. This may also let you remove all the secondary indexes on the Fact table.
Normalization will help shrink the table size, hence speeding up the queries. Bartender and Customer are prime candidates for such -- perhaps a SMALLINT UNSIGNED (2 bytes; 65K values) for the former and MEDIUMINT UNSIGNED (3 bytes, 16M) for the latter. That would probably shrink by 50% the 5 columns you currently show. You may get a 2x speedup on many operations after normalizing.
Normalization is best done by 'staging' the data -- Load the data into a temporary table, normalize within it, summarize it, then copy into the main Fact table.
See http://mysql.rjweb.org/doc.php/summarytables
and http://mysql.rjweb.org/doc.php/staging_table
Before getting back to the question of optimizing the one query, we need to see the schema, the data flow, whether things can be normalized, whether summary tables can be effective, etc. I would hope to have the 'answer' for the query to be mostly digested in a summary table. Sometimes this leads to a 10x speedup.
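The summary-table idea can be sketched in a few lines of Python: each incoming batch is folded into a running per-day, per-customer maximum, so old rows are never re-scanned. The fold_batch function and the sample batches are hypothetical; a real setup would do this in SQL against a staging table.

```python
from datetime import date


def fold_batch(summary, batch):
    """Update the running max price per (day, customer) with one new batch."""
    for order_date, customer, price in batch:
        key = (order_date, customer)
        if key not in summary or price > summary[key]:
            summary[key] = price
    return summary


summary = {}
# First batch of freshly loaded rows (made-up values).
fold_batch(summary, [(date(2018, 1, 1), "Alice", 13.15),
                     (date(2018, 1, 1), "Bob", 9.08)])
# A later batch only touches the keys it contains.
fold_batch(summary, [(date(2018, 1, 1), "Alice", 20.00),
                     (date(2018, 1, 2), "Bob", 6.00)])

print(summary)
```

Queries for the happy-hour maximum would then read the small summary structure rather than the full fact table.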

Select single row per unique field value with SQL Developer

I have thousands of rows of data, a segment of which looks like:
+-------------+-----------+-------+
| Customer ID | Company | Sales |
+-------------+-----------+-------+
| 45678293 | Sears | 45 |
| 01928573 | Walmart | 6 |
| 29385068 | Fortinoes | 2 |
| 49582015 | Walmart | 1 |
| 49582015 | Joe's | 1 |
| 19285740 | Target | 56 |
| 39506783 | Target | 4 |
| 39506783 | H&M | 4 |
+-------------+-----------+-------+
In every case where a customer ID occurs more than once, the value in 'Sales' is the same but the value in 'Company' is different (this is true throughout the entire table). I need each value in 'Customer ID' to appear only once, so I need a single row for each customer ID.
In other words, I'd like for the above table to look like:
+-------------+-----------+-------+
| Customer ID | Company | Sales |
+-------------+-----------+-------+
| 45678293 | Sears | 45 |
| 01928573 | Walmart | 6 |
| 29385068 | Fortinoes | 2 |
| 49582015 | Walmart | 1 |
| 19285740 | Target | 56 |
| 39506783 | Target | 4 |
+-------------+-----------+-------+
If anyone knows how I can go about doing this, I'd much appreciate some help.
Thanks!

It would have been helpful if you had included the SQL that generates that data, but it might go something like this:
SELECT customer_id, MAX(Company) AS company, COUNT(*) AS sales_count FROM Customers <your joins and where clause> GROUP BY customer_id
This assumes that a customer can have many companies and that the sales data is in a different table. Note that MAX(Company) returns the alphabetically last company name for each customer, not the most frequent one.
Hope this helps.
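To see what a GROUP BY on the customer ID does with the question's sample rows, here is a sketch using Python's built-in sqlite3 module rather than Oracle. It assumes all three columns live in one table (named customers here for illustration) and uses MAX(sales), which is safe only because the question states that Sales is identical across duplicate IDs.

```python
import sqlite3

# Sample rows from the question; IDs are kept as text to preserve leading zeros.
rows = [
    ("45678293", "Sears", 45),
    ("01928573", "Walmart", 6),
    ("29385068", "Fortinoes", 2),
    ("49582015", "Walmart", 1),
    ("49582015", "Joe's", 1),
    ("19285740", "Target", 56),
    ("39506783", "Target", 4),
    ("39506783", "H&M", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id TEXT, company TEXT, sales INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)

# One output row per customer_id; MAX(company) picks the alphabetically last name.
deduped = conn.execute(
    """
    SELECT customer_id, MAX(company) AS company, MAX(sales) AS sales
    FROM customers
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

for row in deduped:
    print(row)
```

With this data the alphabetically last company happens to be the one shown in the desired output (Walmart and Target), but that is a coincidence of the sample.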

Calculating Duration in org mode table

I'm trying to figure out how to use org-mode to calculate the duration between two time points. I figured out how to do it for two separate dates, but when I add in the time component it gives the answer as a fraction of a day, and I'd rather have the answer in
XX days, xx hours, xx minutes
| Start | End | Duration |
|------------------------+------------------------+----------|
| <2013-07-16 Tue 15:15> | <2013-07-17 Wed 11:15> | 0.833333 |
| | | 0 |
#+TBLFM: $3=(date(<$2>)-date(<$1>))

You may use the T flag to use the form HH:MM[:SS]. Example:
| Start | End | Days | HH:MM:SS |
|------------------------+------------------------+----------+----------|
| <2013-07-15 Mon 10:15> | <2013-07-17 Wed 11:15> | 2.041667 | 49:00:00 |
| | | 0 | 00:00:00 |
#+TBLFM: $3=date(<$2>)-date(<$1>)::$4=60*60*24*$3;T
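The T flag gives HH:MM:SS rather than the "XX days, xx hours, xx minutes" form asked for. The arithmetic for that form is simple; here is a Python sketch of it, outside org-mode, with format_duration being a made-up helper name. It only illustrates the conversion and is not an org table formula.

```python
from datetime import datetime


def format_duration(start, end):
    """Split the gap between two timestamps into whole days, hours and minutes."""
    total_minutes = int((end - start).total_seconds() // 60)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days} days, {hours} hours, {minutes} minutes"


# The row from the question: 0.833333 days is 20 hours.
print(format_duration(datetime(2013, 7, 16, 15, 15), datetime(2013, 7, 17, 11, 15)))
# The row from the answer: 2.041667 days is 2 days and 1 hour.
print(format_duration(datetime(2013, 7, 15, 10, 15), datetime(2013, 7, 17, 11, 15)))
```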
