How to find the count and the percentage of that count over the total in Teradata

Name   Date      Key    sce
Apple  1/1/2022  1111     1
Apple  1/1/2022  123    -11
Apple  1/1/2022  3435   -11
Mango  1/1/2022  14124    1
Mango  1/1/2022  1314   -11
Mango  1/2/2022  424      1
Mango  1/2/2022  5136   -11
Mango  1/2/2022  156      1
Mango  1/2/2022  1111   -11
I'm looking for an output like the one below, i.e. for each name and each date, the count of all -11's and the percentage of -11's over the total row count.
For example, for Apple on 1/1/2022 there are two -11's in column sce, so the count should be 2 and the percentage should be (2/3) * 100. Can you please help?
Name   date      count (of -11's)  percentage (of -11's over total)
Apple  1/1/2022  2                 66.7
Mango  1/1/2022  1                 50.0
Mango  1/2/2022  2                 50.0

You can use a CASE expression to limit which rows are counted by a condition. The 100.0 multiplier forces decimal arithmetic, so the division is not truncated the way integer division would be:
SELECT name,
       date,
       COUNT(CASE WHEN sce = -11 THEN sce END) AS "count",
       100.0 * COUNT(CASE WHEN sce = -11 THEN sce END) / COUNT(*) AS "percentage"
FROM yourtable
GROUP BY name, date

Related

calculate difference between Rows in R by setting specific target date

I am new to R. My df is as follows, and I would like to set my benchmark comparison date as 2010-02-01 and compare the results against the row with this date.
Here is my data frame; I want to be able to generate the Diff column with R:
DATE        FRUIT   LOCATION  VALUE  DIFF
2010-01-01  Apple   USA       2      -2
2010-02-01  Apple   USA       4      0
2020-11-01  Apple   USA       100    96
2020-12-01  Apple   USA       54     50
2010-01-01  Apple   China     0      -4
2010-02-01  Apple   China     4      0
2020-11-01  Apple   China     40     36
2020-12-01  Apple   China     44     40
2010-01-01  Banana  USA       1      -1
2010-02-01  Banana  USA       2      0
2020-11-01  Banana  USA       12     10
2020-12-01  Banana  USA       13     11
2010-01-01  Banana  China     0      -100
2010-02-01  Banana  China     100    0
2020-11-01  Banana  China     130    30
2020-12-01  Banana  China     145    45
Thank you!
Using dplyr you can do:
library(dplyr)

compare_date <- as.Date('2010-02-01')

df %>%
  mutate(Date = as.Date(Date)) %>%    # make sure Date is a Date, not character
  group_by(Fruit, Location) %>%       # one benchmark row per Fruit/Location group
  mutate(Diff = Value - Value[match(compare_date, Date)]) -> result

result
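For reference, a base R sketch of the same logic (it assumes the same Date, Fruit, Location and Value column names as above):
compare_date <- as.Date('2010-02-01')
df$Date <- as.Date(df$Date)

# Row index of the benchmark row within each Fruit/Location group;
# ave() recycles the length-1 result across the whole group
grp <- interaction(df$Fruit, df$Location, drop = TRUE)
base_idx <- ave(seq_len(nrow(df)), grp,
                FUN = function(i) i[match(compare_date, df$Date[i])])
df$Diff <- df$Value - df$Value[base_idx]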

Backtrack values in R for a logic

My request is slightly complicated.
Below is what my data looks like.
S.no Date City Sales diff Indicator
1 1 1/1/2017 New York 2795 0 0
2 2 1/31/2017 New York 4248 1453 0
3 3 3/2/2017 New York 1330 -2918 1
4 4 4/1/2017 New York 3535 2205 0
5 5 5/1/2017 New York 4330 795 0
6 6 5/31/2017 New York 3360 -970 1
7 7 6/30/2017 New York 2238 -1122 1
8 8 1/1/2017 Paris 1451 0 0
9 9 1/31/2017 Paris 2339 888 0
10 10 3/2/2017 Paris 2029 -310 1
11 11 4/1/2017 Paris 1850 -179 1
12 12 5/1/2017 Paris 2800 950 1
13 13 5/31/2017 Paris 1986 -814 0
14 14 6/30/2017 Paris 3776 1790 0
15 15 1/1/2017 London 1646 0 0
16 16 1/31/2017 London 3575 1929 0
17 17 3/2/2017 London 1161 -2414 1
18 18 4/1/2017 London 1766 605 0
19 19 5/1/2017 London 2799 1033 0
20 20 5/31/2017 London 2761 -38 1
21 21 6/30/2017 London 1048 -1713 1
diff is the current month's Sales minus the previous month's Sales within each group, and Indicator flags whether diff is negative or positive.
I want to compute a logic for each group starting from the last row and moving to the first row, i.e. in reverse order.
Scanning in reverse order, I want to capture the value of Sales wherever Indicator is 1, then compare that captured Sales value to a threshold value (2000) in the next steps.
There are two cases for the comparison (captured Sales vs. threshold):
a. If the captured Sales value at the first Indicator = 1 row (scanning from the last row to the first) is less than 2000, store the captured values in a new dataset for each group.
b. If the captured Sales value at that Indicator = 1 row is greater than 2000, skip it, move to the next Indicator = 1 row, and repeat steps a) and b).
I want to collect the result in a new dataset that has a single row per City giving the "Sales value" for the aforementioned logic, along with the Date.
I simply want to understand how I can implement this logic in R. Will the rle function help?
Result:
S.no Date City Value(Sales)
3. 3/2/2017 New York 1330
11. 4/1/2017 Paris 1850
21. 6/30/2017 London 1048
Thanks,
J
If we assume that your data is already arranged in ascending order, you can do the following with base R:
threshold <- 2000
my_new_df <- my_df[my_df$Indicator == 1 & my_df$Sales < threshold, ]
my_new_df
# S.no Date City Sales diff Indicator
# 3 3 2017-03-02 New York 1330 -2918 1
# 11 11 2017-04-01 Paris 1850 -179 1
# 17 17 2017-03-02 London 1161 -2414 1
# 21 21 2017-06-30 London 1048 -1713 1
Now we have all rows where the Indicator is equal to one and the Sales value is less than our threshold. But London has two rows and we only want the last one:
my_new_df <- my_new_df[!duplicated(my_new_df$City, fromLast = TRUE),
                       c("S.no", "Date", "City", "Sales")]
my_new_df
# S.no Date City Sales
# 3 3 2017-03-02 New York 1330
# 11 11 2017-04-01 Paris 1850
# 21 21 2017-06-30 London 1048
With the fromLast argument of duplicated, we start at the last row when checking whether the City has already appeared in the data set.
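A sketch of the same filter-then-keep-last logic with dplyr, assuming the same my_df, threshold and column names (slice_tail() needs dplyr >= 1.0):
library(dplyr)

my_df %>%
  filter(Indicator == 1, Sales < threshold) %>%  # rows below the threshold
  group_by(City) %>%
  slice_tail(n = 1) %>%                          # keep the last qualifying row per City
  ungroup() %>%
  select(S.no, Date, City, Sales)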

Calculate number of distinct instances occurring in a given time period

I have dummy data:
library(data.table)
DT <- structure(list(id = c(1, 1, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 7,
7), policy_num = c(41551662L, 50966414L, 43077202L, 46927463L,
57130236L, 57050065L, 26196559L, 33545119L, 52304024L, 73953064L,
50340507L, 50491162L, 76577511L, 108067534L), product = c("apple",
"apple", "pear", "apple", "apple", "apple", "plum", "apple",
"pear", "apple", "apple", "apple", "pear", "pear"), start_date =
structure(c(13607, 15434, 14276, 15294, 15660, 15660, 10547, 15117, 15483,
16351, 15429, 15421, 16474, 17205), class = "Date"), end_date = structure(c(15068,
16164, 17563, 15660, 15660, 16390, 13834, 16234, 17674, 17447,
15794, 15786, 17205, 17570), class = "Date")), .Names = c("id",
"policy_num", "product", "start_date", "end_date"), row.names = c(NA,
-14L), class = c("data.table", "data.frame"))
setDT(DT)  # rebuild data.table internals (dput() output can't carry the .internal.selfref pointer)
id policy_num product start_date end_date
1 41551662 apple 2007-04-04 2011-04-04
1 50966414 apple 2012-04-04 2014-04-04
2 43077202 pear 2009-02-01 2018-02-01
3 46927463 apple 2011-11-16 2012-11-16
3 57130236 apple 2012-11-16 2012-11-16
3 57050065 apple 2012-11-16 2014-11-16
4 26196559 plum 1998-11-17 2007-11-17
5 33545119 apple 2011-05-23 2014-06-13
5 52304024 pear 2012-05-23 2018-05-23
5 73953064 apple 2014-10-08 2017-10-08
6 50340507 apple 2012-03-30 2013-03-30
7 50491162 apple 2012-03-22 2013-03-22
7 76577511 pear 2015-02-08 2017-02-08
7 108067534 pear 2017-02-08 2018-02-08
Based on it, I'd like to calculate the following variables (grouped by user_id):
1) Number of currently held products (no_prod_now) - the number of distinct products whose end_date > the currently evaluated start_date; simply, the number of products held by the user_id at the time of start_date
2) Number of currently held active policies (no_policies_now) - as above, but applied to policy_num
3) Number of policies opened within the 3 months prior to the current start_date (policies_open_3mo)
4) policies_closed_3mo - as above, but the number of policies closed in the past 3 months
The desired output would look like this:
id  policy_num  product  start_date  end_date    no_prod_now  no_policies_now  policies_closed_3mo  policies_open_3mo
1   41551662    apple    2007-04-04  2011-04-04  1            1                0                    0
1   50966414    apple    2012-04-04  2014-04-04  1            1                0                    0
2   43077202    pear     2009-02-01  2018-02-01  1            1                0                    0
3   46927463    apple    2011-11-16  2012-11-16  1            1                0                    0
3   57130236    apple    2012-11-16  2012-11-16  1            1                1                    0
3   57050065    apple    2012-11-16  2014-11-16  1            1                2                    1
4   26196559    plum     1998-11-17  2007-11-17  1            1                0                    0
5   33545119    apple    2011-05-23  2014-06-13  1            1                0                    0
5   52304024    pear     2012-05-23  2018-05-23  2            2                0                    1
5   73953064    apple    2014-10-08  2017-10-08  2            2                0                    0
6   50340507    apple    2012-03-30  2013-03-30  1            1                0                    0
7   50491162    apple    2012-03-22  2013-03-22  1            1                0                    0
7   76577511    pear     2015-02-08  2017-02-08  1            1                0                    0
7   108067534   pear     2017-02-08  2018-02-08  1            1                1                    0
I'm looking for a solution implemented ideally in data.table, as I'm going to apply it to big data volumes, but base R or dplyr solutions, which I could always convert to data.table, would also be valuable. Thanks!
This is quite tricky but can be solved with a number of non-equi self-joins.
Edit: It has turned out that update on join doesn't work together with non-equi self-joins as I had expected (see here). So, I had to revise the code completely to avoid updates in place.
Instead, the four additional columns are created by three separate non-equi self-joins and are combined for the final result.
library(data.table)
library(lubridate)
result <-
  # create helper column for previous three-months periods.
  # lubridate's month arithmetic avoids NAs at the end of a month, e.g., February
  DT[, start_date_3mo := start_date %m-% period(month = 3L)][
    # start "cbind()" with original columns
    , c(.SD,
        # count number of products and policies held at time of start_date
        DT[DT, on = c("id", "start_date<=start_date", "end_date>start_date"),
           .(no_prod_now = uniqueN(product), no_pols_now = uniqueN(policy_num)),
           by = .EACHI][, c("no_prod_now", "no_pols_now")],
        # policies closed within previous 3 months of start_date
        DT[DT, on = c("id", "end_date>=start_date_3mo", "end_date<=start_date"),
           .(pols_closed_3mo = .N), by = .EACHI][, "pols_closed_3mo"],
        # additional policies opened within previous 3 months of start_date
        DT[DT, on = c("id", "start_date>=start_date_3mo", "start_date<=start_date"),
           .(pols_opened_3mo = .N - 1L), by = .EACHI][, "pols_opened_3mo"])][
    # omit helper column
    , -"start_date_3mo"]
result
id policy_num product start_date end_date no_prod_now no_pols_now pols_closed_3mo pols_opened_3mo
1: 1 41551662 apple 2007-04-04 2011-04-04 1 1 0 0
2: 1 50966414 apple 2012-04-04 2014-04-04 1 1 0 0
3: 2 43077202 pear 2009-02-01 2018-02-01 1 1 0 0
4: 3 46927463 apple 2011-11-16 2012-11-16 1 1 0 0
5: 3 57130236 apple 2012-11-16 2012-11-16 1 1 2 1
6: 3 57050065 apple 2012-11-16 2014-11-16 1 1 2 1
7: 4 26196559 plum 1998-11-17 2007-11-17 1 1 0 0
8: 5 33545119 apple 2011-05-23 2014-06-13 1 1 0 0
9: 5 52304024 pear 2012-05-23 2018-05-23 2 2 0 0
10: 5 73953064 apple 2014-10-08 2017-10-08 2 2 0 0
11: 6 50340507 apple 2012-03-30 2013-03-30 1 1 0 0
12: 7 50491162 apple 2012-03-22 2013-03-22 1 1 0 0
13: 7 76577511 pear 2015-02-08 2017-02-08 1 1 0 0
14: 7 108067534 pear 2017-02-08 2018-02-08 1 1 1 0
Note that there are discrepancies between the OP's expected result and the result here for policies opened within the 3 months before start_date. For id == 3, two policies start on 2012-11-16, so each of those rows has one additional policy to count. For id == 5, all start_date values differ by more than 3 months, so there shouldn't be an overlap.
Also, rows 5 and 6 both show a value of 2 for policies closed within the 3 months before start_date, because id == 3 has two policies ending on 2012-11-16.
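A quick sanity check of the id == 3 case, as a sketch (it reuses DT and lubridate's %m-% from above):
# policies of id == 3 whose end_date falls within the 3 months up to 2012-11-16
d <- as.Date("2012-11-16")
DT[id == 3 & end_date >= d %m-% months(3) & end_date <= d]
# two rows end exactly on 2012-11-16, hence pols_closed_3mo == 2 in rows 5 and 6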

R: calculate number of distinct categories in the specified time frame

here's some dummy data:
user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple
I'd like to calculate, for each user_id, the number of distinct categories in the specified time period (e.g. in the past 7 or 14 days), including the current order.
The solution would look like this:
user_id date category distinct_7 distinct_14
27 2016-01-01 apple 1 1
27 2016-01-03 apple 1 1
27 2016-01-05 pear 2 2
27 2016-01-07 plum 3 3
27 2016-01-10 apple 3 3
27 2016-01-14 pear 3 3
27 2016-01-16 plum 3 3
11 2016-01-01 apple 1 1
11 2016-01-03 pear 2 2
11 2016-01-05 pear 2 2
11 2016-01-07 pear 2 2
11 2016-01-10 apple 2 2
11 2016-01-14 apple 2 2
11 2016-01-16 apple 1 2
I posted similar questions here or here, however none of them referred to counting cumulative unique values for the specified time period. Thanks a lot for your help!
I recommend using the runner package. You can apply any R function on running windows with the runner function. The code below obtains the desired output, which is past 7 days + current and past 14 days + current (windows of 8 and 15 days including the current one):
df <- read.table(
text = " user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple", header = TRUE, colClasses = c("integer", "Date", "character"))
library(dplyr)
library(runner)

df %>%
  group_by(user_id) %>%
  mutate(distinct_7 = runner(category, k = 7 + 1, idx = date,
                             f = function(x) length(unique(x))),
         distinct_14 = runner(category, k = 14 + 1, idx = date,
                              f = function(x) length(unique(x))))
More information can be found in the package and function documentation.
Here are two data.table solutions, one with two nested lapply calls and the other using non-equi joins.
The first one is a rather clumsy data.table solution, but it reproduces the expected answer and would work for an arbitrary number of time frames. (@alistaire's concise tidyverse solution suggested in his comment could be modified as well.)
It uses two nested lapply calls. The first one loops over the time frames, the second one over the dates. The temporary result is joined with the original data and then reshaped from long to wide format, so that we end up with a separate column for each of the time frames.
library(data.table)

tmp <- rbindlist(
  lapply(c(7L, 14L),
         function(ldays) rbindlist(
           lapply(unique(dt$date),
                  function(ldate) {
                    dt[between(date, ldate - ldays, ldate),
                       .(distinct = sprintf("distinct_%02i", ldays),
                         date = ldate,
                         N = uniqueN(category)),
                       by = .(user_id)]
                  })
         )
  )
)
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
# date user_id category distinct_07 distinct_14
# 1: 2016-01-01 27 apple 1 1
# 2: 2016-01-03 27 apple 1 1
# 3: 2016-01-05 27 pear 2 2
# 4: 2016-01-07 27 plum 3 3
# 5: 2016-01-10 27 apple 3 3
# 6: 2016-01-14 27 pear 3 3
# 7: 2016-01-16 27 plum 3 3
# 8: 2016-01-01 11 apple 1 1
# 9: 2016-01-03 11 pear 2 2
#10: 2016-01-05 11 pear 2 2
#11: 2016-01-07 11 pear 2 2
#12: 2016-01-10 11 apple 2 2
#13: 2016-01-14 11 apple 2 2
#14: 2016-01-16 11 apple 1 2
Here is a variant following a suggestion by @Frank which uses data.table's non-equi joins instead of the second lapply:
tmp <- rbindlist(
  lapply(c(7L, 14L),
         function(ldays) {
           dt[.(user_id = user_id, dago = date - ldays, d = date),
              on = .(user_id, date >= dago, date <= d),
              .(distinct = sprintf("distinct_%02i", ldays),
                N = uniqueN(category)),
              by = .EACHI]
         }
  )
)[, date := NULL]
#
dcast(tmp[dt, on=c("user_id", "date")],
... ~ distinct, value.var = "N")[order(-user_id, date, category)]
Data:
dt <- fread("user_id date category
27 2016-01-01 apple
27 2016-01-03 apple
27 2016-01-05 pear
27 2016-01-07 plum
27 2016-01-10 apple
27 2016-01-14 pear
27 2016-01-16 plum
11 2016-01-01 apple
11 2016-01-03 pear
11 2016-01-05 pear
11 2016-01-07 pear
11 2016-01-10 apple
11 2016-01-14 apple
11 2016-01-16 apple")
dt[, date := as.IDate(date)]
BTW: The wording in the past 7, 14 days is somewhat misleading, as the time periods actually consist of 8 and 15 days, respectively.
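If a strict 7-day window (the current day plus 6 days back) is wanted instead, the join boundaries can be tightened; a sketch for the non-equi join variant above:
tmp7 <- dt[.(user_id = user_id, dago = date - 6L, d = date),
           on = .(user_id, date >= dago, date <= d),
           .(distinct = "distinct_07", N = uniqueN(category)),
           by = .EACHI][, date := NULL]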
In the tidyverse, you can use map_int to iterate over a set of values and simplify to an integer, à la sapply or vapply. Count distinct occurrences with n_distinct (like length(unique(...))) on a subset built with comparisons or the between helper, with the lower bound set by subtracting the appropriate number of days from the current one, and you're set.
library(tidyverse)

df %>% group_by(user_id) %>%
  mutate(distinct_7 = map_int(date, ~n_distinct(category[between(date, .x - 7, .x)])),
         distinct_14 = map_int(date, ~n_distinct(category[between(date, .x - 14, .x)])))
## Source: local data frame [14 x 5]
## Groups: user_id [2]
##
## user_id date category distinct_7 distinct_14
## <int> <date> <fctr> <int> <int>
## 1 27 2016-01-01 apple 1 1
## 2 27 2016-01-03 apple 1 1
## 3 27 2016-01-05 pear 2 2
## 4 27 2016-01-07 plum 3 3
## 5 27 2016-01-10 apple 3 3
## 6 27 2016-01-14 pear 3 3
## 7 27 2016-01-16 plum 3 3
## 8 11 2016-01-01 apple 1 1
## 9 11 2016-01-03 pear 2 2
## 10 11 2016-01-05 pear 2 2
## 11 11 2016-01-07 pear 2 2
## 12 11 2016-01-10 apple 2 2
## 13 11 2016-01-14 apple 2 2
## 14 11 2016-01-16 apple 1 2

duplicate data and dates

I am a bit of a newbie to R, and I have two questions. I have a dataframe, say FruitsNew:
Fruit
1 Apples
2 Oranges
3 Bananas
Q1) I want to duplicate the data and add monthly dates starting from 31-May-2000, for example:
Fruit date
1 Apples 2000-05-31
2 Oranges 2000-05-31
3 Bananas 2000-05-31
4 Apples 2000-06-30
5 Oranges 2000-06-30
6 Bananas 2000-06-30
and so on...
Q2) After I obtain the above, I merge it with a Sales dataset which is only available yearly at the end of May, so it looks like this:
Fruit date sales
1 Apples 2000-05-31 1000
2 Oranges 2000-05-31
3 Bananas 2000-05-31 500
4 Apples 2000-06-30
5 Oranges 2000-06-30
6 Bananas 2000-06-30
...
7 Apples 2001-05-31 2000
8 Oranges 2001-05-31 200
9 Bananas 2001-05-31 600
Oranges don't have sales, but I would like to fill them with 0 for all the monthly dates between 05/31/2000 and the next available sales data, which occurs on 05/31/2001.
The other fruits should carry the same sales number between 05/31/2000 and 05/31/2001, and so on.
The above is just an example, but the idea is: if a value is missing, fill in the previously available sales number for that date; if there is no previously available value, fill in 0.
Something like this:
Fruit date sales
1 Apples 2000-05-31 1000
2 Oranges 2000-05-31 0
3 Bananas 2000-05-31 500
4 Apples 2000-06-30 1000
5 Oranges 2000-06-30 0
6 Bananas 2000-06-30 500
7 Apples 2001-05-31 2000
8 Oranges 2001-05-31 200
9 Bananas 2001-05-31 600
Let's assume your first dataframe is named core and your yearly sales dataframe is named year.sale; the merged result will be merg.yr:
merg.yr <- merge(core, year.sale, by.x=1:2, by.y=1:2, all.x=TRUE)
merg.yr[is.na(merg.yr)] <- 0
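Note that this fills every missing value with 0, while the question asks to carry the last available sales number forward. A sketch of that variant using zoo::na.locf (assuming lower-case fruit, date and sales column names in merg.yr):
library(zoo)

merg.yr <- merg.yr[order(merg.yr$fruit, merg.yr$date), ]
# carry the last observed sales value forward within each fruit
merg.yr$sales <- ave(merg.yr$sales, merg.yr$fruit,
                     FUN = function(x) na.locf(x, na.rm = FALSE))
# any leading NAs (no earlier observation) become 0
merg.yr$sales[is.na(merg.yr$sales)] <- 0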
To build the core df I came up with a method that creates the dates at the first of each month, then subtracts 1 from each to get the last date of the prior month. I then repeated each one three times and let the data.frame function fill in the fruits:
core <- data.frame(fruit = c('Apples', 'Oranges', 'Bananas'),
                   date = rep(as.Date(seq(ISOdate(2000, 6, 1),
                                          ISOdate(2001, 6, 1), by = 'month')) - 1,
                              each = 3))
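An equivalent sketch that builds the fruit/month-end grid in one step with seq on Date objects and expand.grid (row order differs, with fruit varying fastest):
# month-ends from 2000-05-31 to 2001-05-31: first-of-month sequence minus one day
month_ends <- seq(as.Date("2000-06-01"), as.Date("2001-06-01"), by = "month") - 1
core <- expand.grid(fruit = c("Apples", "Oranges", "Bananas"),
                    date = month_ends, stringsAsFactors = FALSE)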
