Select the IDs in a given series in Teradata SQL

How can I select the ids that follow the series where, for every id,
condition1 is 11, 13, 15 and the corresponding condition2 is null, 14, 16?
The data is as follows
id condition1 condition2
101 11 ?
101 13 14
101 15 16
102 11 ?
102 13 14
102 15 16
102 17 18
103 13 14
103 15 16
104 11 ?
104 13 14
104 15 16
104 13 14
104 15 16
105 11 ?
105 13 14
expected output
id condition1 condition2
101 11 ?
101 13 14
101 15 16
103 13 14
103 15 16
104 11 ?
104 13 14
104 15 16
104 13 14
104 15 16
Thank you in advance.

Edit to match your new logic.
If there are no duplicate rows:
select *
from tab
qualify
    sum(case when condition1 = 11 and condition2 is null
               then 0   -- ignore it
             when condition1 = 13 and condition2 = 14
               or condition1 = 15 and condition2 = 16
               then 1   -- increase for specific rows
             else -1    -- decrease for any other row
        end)
    over (partition by id) = 2
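As a sanity check outside the database, the scoring rule in the CASE expression can be sketched in plain Python (illustrative only, mirroring the sample rows; None stands in for the NULL shown as "?"). Note how id 104's duplicate rows inflate its sum past 2, which is why a duplicate-aware variant is needed:

```python
# Score each row exactly as the CASE expression does, then keep ids
# whose per-id sum equals 2.
rows = [
    (101, 11, None), (101, 13, 14), (101, 15, 16),
    (102, 11, None), (102, 13, 14), (102, 15, 16), (102, 17, 18),
    (103, 13, 14), (103, 15, 16),
    (104, 11, None), (104, 13, 14), (104, 15, 16),
    (104, 13, 14), (104, 15, 16),
    (105, 11, None), (105, 13, 14),
]

def score(c1, c2):
    if c1 == 11 and c2 is None:
        return 0            # ignore it
    if (c1, c2) in ((13, 14), (15, 16)):
        return 1            # increase for specific rows
    return -1               # decrease for any other row

totals = {}
for id_, c1, c2 in rows:
    totals[id_] = totals.get(id_, 0) + score(c1, c2)

kept = sorted(i for i, t in totals.items() if t == 2)
print(kept)      # [101, 103] -- 104 sums to 4 because of its duplicates
```

With duplicate-free data, the sum-equals-2 test keeps exactly the ids whose series is complete: 102 is rejected by its 17/18 row, and 105 by its missing 15/16 row.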
Unfortunately, Windowed Aggregates don't support DISTINCT, but there's a workaround: check the 1st row only.
select *
from
 (
   select t.*
        , row_number()  -- check 1st row only
          over (partition by id, condition1, condition2
                order by id) as rn
   from tab as t
 ) as dt
qualify
    sum(case when rn = 1  -- check 1st row only
              and condition1 = 11 and condition2 is null
               then 0   -- ignore it
             when rn = 1
              and (condition1 = 13 and condition2 = 14
                or condition1 = 15 and condition2 = 16)
               then 1   -- increase for specific rows
             else -1    -- decrease for any other row
        end)
    over (partition by id) = 2

Related

SQL query to pick the IDs according to the details below

I need to determine the last event in the data where condition1 is 13 and condition2 is 14. However, it should not pick IDs which have already passed through condition1 = 15 and condition2 = 16 and whose last event is then again 13 and 14; i.e., in the data below, it should not pick ids 102 and 103.
The data is as follows
id datetime date condition1 condition2
101 01-08-2021 13:00:41 01-08-2021 11 12
101 06-08-2021 08:08:21 05-08-2021 13 14
101 07-08-2021 21:05:32 07-08-2021 15 16
102 05-08-2021 14:08:32 05-08-2021 11 12
102 08-08-2021 06:13:13 08-08-2021 13 14
102 10-08-2021 13:09:55 10-08-2021 15 16
102 11-08-2021 18:00:00 11-08-2021 13 14
103 26-08-2021 14:04:22 26-08-2021 11 12
103 28-08-2021 12:09:08 28-08-2021 13 14
103 31-08-2021 17:45:00 31-08-2021 15 16
103 02-09-2021 07:00:04 02-09-2021 17 18
103 05-09-2021 09:00:04 05-09-2021 13 14
104 21-08-2021 11:11:12 21-08-2021 11 12
104 25-08-2021 10:09:35 25-08-2021 13 14
104 31-08-2021 08:35:40 31-08-2021 15 16
105 23-08-2021 09:05:54 23-08-2021 11 12
105 24-08-2021 10:00:22 24-08-2021 13 14
Expected output
id datetime date condition1 condition2
105 24-08-2021 10:00:22 24-08-2021 13 14
I would use a subquery to exclude ids that have condition1 = 15 and condition2 = 16 (note: Teradata doesn't support LIMIT; use SELECT TOP 1 there instead):
SELECT * FROM mytable
WHERE condition1 = 13
  AND condition2 = 14
  AND id NOT IN (
      SELECT id FROM mytable
      WHERE condition1 = 15 AND condition2 = 16
  )
ORDER BY datetime DESC
LIMIT 1;
select *
from vt
qualify
    row_number()  -- last event in the data
    over (partition by id
          order by datetime desc) = 1
and condition1 = 13  -- when condition1 is 13
and condition2 = 14  -- and condition2 is 14
-- do not pick IDs which have already passed through condition1 = 15 and condition2 = 16
and count(case when condition1 = 15 and condition2 = 16 then 1 end)
    over (partition by id) = 0
;
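The same selection rule, sketched in plain Python for illustration (events copied from the sample data, date column omitted): pick an id only if its chronologically last event is 13/14 and it never passed through 15/16.

```python
from datetime import datetime

# (id, datetime, condition1, condition2) -- the sample events
events = [
    (101, "2021-08-01 13:00:41", 11, 12),
    (101, "2021-08-06 08:08:21", 13, 14),
    (101, "2021-08-07 21:05:32", 15, 16),
    (102, "2021-08-05 14:08:32", 11, 12),
    (102, "2021-08-08 06:13:13", 13, 14),
    (102, "2021-08-10 13:09:55", 15, 16),
    (102, "2021-08-11 18:00:00", 13, 14),
    (103, "2021-08-26 14:04:22", 11, 12),
    (103, "2021-08-28 12:09:08", 13, 14),
    (103, "2021-08-31 17:45:00", 15, 16),
    (103, "2021-09-02 07:00:04", 17, 18),
    (103, "2021-09-05 09:00:04", 13, 14),
    (104, "2021-08-21 11:11:12", 11, 12),
    (104, "2021-08-25 10:09:35", 13, 14),
    (104, "2021-08-31 08:35:40", 15, 16),
    (105, "2021-08-23 09:05:54", 11, 12),
    (105, "2021-08-24 10:00:22", 13, 14),
]

by_id = {}
for id_, ts, c1, c2 in events:
    by_id.setdefault(id_, []).append((datetime.fromisoformat(ts), c1, c2))

picked = []
for id_, evs in sorted(by_id.items()):
    evs.sort()                                  # chronological order
    _, last_c1, last_c2 = evs[-1]
    seen_15_16 = any(c1 == 15 and c2 == 16 for _, c1, c2 in evs)
    if last_c1 == 13 and last_c2 == 14 and not seen_15_16:
        picked.append(id_)
print(picked)   # [105]
```

102 and 103 end on 13/14 but are excluded by their earlier 15/16 events, exactly as the COUNT(...) OVER check does.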

SAS/SQL group by and keeping all rows

I have a table like this, observing the behavior of some accounts in time, here two accounts with acc_ids 1 and 22:
acc_id date mob
1 Dec 13 -1
1 Jan 14 0
1 Feb 14 1
1 Mar 14 2
22 Mar 14 10
22 Apr 14 11
22 May 14 12
I would like to create a column orig_date that would be equal to date if mob=0, and to the minimum date within the acc_id group if there is no mob=0 for that acc_id.
Therefore the expected output is:
acc_id date mob orig_date
1 Dec 13 -1 Jan 14
1 Jan 14 0 Jan 14
1 Feb 14 1 Jan 14
1 Mar 14 2 Jan 14
22 Mar 14 10 Mar 14
22 Apr 14 11 Mar 14
22 May 14 12 Mar 14
The second account does not have mob=0 observation, therefore orig_date is set to min(date) by group.
Is there a way to achieve this in SAS, preferably in one PROC SQL step?
Seems pretty simple. Just calculate the min date in two ways and use coalesce() to pick the one you want.
First let's turn your printout into an actual dataset.
data have ;
  input acc_id date :anydtdte. mob ;
  format date date9.;
  cards;
1 Dec13 -1
1 Jan14 0
1 Feb14 1
1 Mar14 2
22 Mar14 10
22 Apr14 11
22 May14 12
;
To find the DATE when MOB=0, use a CASE expression. PROC SQL will automatically remerge the MIN() aggregate results calculated at the ACC_ID level back onto all of the detail rows.
proc sql ;
  create table want as
  select *
       , coalesce( min(case when mob=0 then date else . end)
                 , min(date)
                 ) as orig_date format=date9.
  from have
  group by acc_id
  order by acc_id, date
  ;
quit;
Result:
Obs acc_id date mob orig_date
1 1 01DEC2013 -1 01JAN2014
2 1 01JAN2014 0 01JAN2014
3 1 01FEB2014 1 01JAN2014
4 1 01MAR2014 2 01JAN2014
5 22 01MAR2014 10 01MAR2014
6 22 01APR2014 11 01MAR2014
7 22 01MAY2014 12 01MAR2014
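The coalesce-of-two-aggregates idea is easy to mirror in plain Python (illustrative ISO date strings, so lexicographic min() is also chronological): prefer the date where mob == 0, fall back to the group minimum.

```python
# (acc_id, date, mob) -- the sample rows as ISO date strings
rows = [
    (1, "2013-12-01", -1), (1, "2014-01-01", 0),
    (1, "2014-02-01", 1), (1, "2014-03-01", 2),
    (22, "2014-03-01", 10), (22, "2014-04-01", 11), (22, "2014-05-01", 12),
]

# First aggregate: the date where mob == 0, per group (if any)
mob0_date = {acc: date for acc, date, mob in rows if mob == 0}

# Second aggregate: the minimum date per group
min_date = {}
for acc, date, _ in rows:
    min_date[acc] = min(min_date.get(acc, date), date)

# coalesce(mob0 date, min date), remerged onto every detail row
want = [(acc, date, mob, mob0_date.get(acc, min_date[acc]))
        for acc, date, mob in rows]
print(want[0])   # (1, '2013-12-01', -1, '2014-01-01')
```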
Here is a data step approach
data have;
input acc_id date $ mob;
datalines;
1 Dec13 -1
1 Jan14 0
1 Feb14 1
1 Mar14 2
22 Mar14 10
22 Apr14 11
22 May14 12
;
data want;
  do until (last.acc_id);
    set have;
    by acc_id;
    if first.acc_id then orig_date=date;
    if mob=0 then orig_date=date;
  end;
  do until (last.acc_id);
    set have;
    by acc_id;
    output;
  end;
run;

How to remove rows with NULL / Zero (0) in R

CampActID AccountID LocationName LocationID
<int> <chr> <chr> <int>
1 12 3 Mark + Brandy 3
2 12 15 NULL 0
3 12 102 Spuntino 100
4 12 126 NULL 0
5 12 128 Intersport Concept Store 312
6 12 15 NULL 0
7 12 48 Aspeli Dame 46
8 12 75 Albert Bistro 73
9 12 126 NULL 0
10 12 128 Intersport Concept Store 312
We can try
library(dplyr)
df1 %>%
  filter(LocationName != "NULL" & LocationID != 0)
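Note that "NULL" here is a literal character string, not a missing value. The same filter in a plain-Python sketch (hypothetical rows mirroring the printout):

```python
# Keep rows whose name is not the literal string "NULL"
# and whose location id is non-zero.
rows = [
    ("Mark + Brandy", 3),
    ("NULL", 0),
    ("Spuntino", 100),
    ("NULL", 0),
]
kept = [(name, loc) for name, loc in rows if name != "NULL" and loc != 0]
print(kept)   # [('Mark + Brandy', 3), ('Spuntino', 100)]
```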

Create a new variable based on whether other variables contain a specific value in R

I have several time-series variables and I want to create two new dummy variables.
Variable one: if the other variables contain a specific value, then variable one equals 1.
Variable two: if the other variables contain the specific value in consecutive positions, then variable two equals 1.
My data looks like
ID score_2011 score_2012 score_2013 score_2014 score_2015
1 12 15 96 96 16
2 12 15 15 15 16
3 12 96 20 15 16
4 12 15 18 15 16
5 12 15 96 15 16
I want to get the new variables like the following
IF score_2011~2015 contain 96 then with_96=1
IF score_2011~2015 contain continuous 96 then back_to_back_96=1
I want the result to look like..
ID score_2011 score_2012 score_2013 score_2014 score_2015 with_96 back_to_back_96
1 12 15 96 96 16 1 1
2 12 15 15 15 16 0 0
3 12 96 20 15 16 1 0
4 12 15 18 15 16 0 0
5 12 15 96 15 16 1 0
Thanks in advance
One option is to loop through the rows: check whether any value is 96 ('x1'), run-length encode each row's comparison with 96, check whether any run of TRUE values is longer than 1 ('x2'), concatenate both, then transpose and assign the result to two new columns in the output.
df1[c("with_96", "back_to_back_96")] <- t(apply(df1[-1], 1, FUN = function(x) {
  x1 <- as.integer(any(x == 96))
  rl <- rle(x == 96)
  x2 <- any(rl$lengths[rl$values] > 1)
  c(x1, x2)
}))
df1
# ID score_2011 score_2012 score_2013 score_2014 score_2015 with_96 back_to_back_96
#1 1 12 15 96 96 16 1 1
#2 2 12 15 15 15 16 0 0
#3 3 12 96 20 15 16 1 0
#4 4 12 15 18 15 16 0 0
#5 5 12 15 96 15 16 1 0
Or another option is using rowSums
df1["with_96"] <- +(!!rowSums(df1[-1] == 96))
# wrap in +( > 0) so the adjacent-pair count is coerced to a 0/1 flag
df1["back_to_back_96"] <- +(rowSums((df1[-c(1, ncol(df1))] == 96) +
                                    (df1[-c(1, 2)] == 96) > 1) > 0)
You can do some fanciness with data.table if you are so inclined. Working on a long-format (melted) dataset can make the logic of some of these comparisons a bit simpler.
library(data.table)
setDT(dat)
melt(dat, id = "ID")[, .(with96 = any(value == 96),
                         b2b96 = any(diff(which(value == 96)) == 1)),
                     by = ID]
# ID with96 b2b96
#1: 1 TRUE TRUE
#2: 2 FALSE FALSE
#3: 3 TRUE FALSE
#4: 4 FALSE FALSE
#5: 5 TRUE FALSE
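All of these variants compute the same two flags; a plain-Python sketch of the logic, using an adjacent-pair comparison in place of rle():

```python
def flags(scores, target=96):
    # with_96: target appears anywhere; back_to_back: in adjacent positions
    hits = [s == target for s in scores]
    with_t = int(any(hits))
    b2b = int(any(a and b for a, b in zip(hits, hits[1:])))
    return with_t, b2b

data = {
    1: [12, 15, 96, 96, 16],
    2: [12, 15, 15, 15, 16],
    3: [12, 96, 20, 15, 16],
    4: [12, 15, 18, 15, 16],
    5: [12, 15, 96, 15, 16],
}
result = {i: flags(v) for i, v in data.items()}
print(result)   # {1: (1, 1), 2: (0, 0), 3: (1, 0), 4: (0, 0), 5: (1, 0)}
```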

Rank function to rank multiple variables in R

I am trying to rank multiple numeric variables (around 700+) in the data and am not sure exactly how to do this, as I am still pretty new to R.
I do not want to overwrite the variables with their ranked values, so I need to create a new rank variable for each numeric variable.
From reading the posts, I believe assign and transform along with rank may be able to solve this. I tried implementing it as below (sample data and code) and am struggling to get it to work.
The output dataset, in addition to the variables xcount, xvisit, and ysales, needs to be populated
with the variables xcount_rank, xvisit_rank, and ysales_rank containing the ranked values.
input <- read.table(header=F, text="101 2 5 6
102 3 4 7
103 9 12 15")
colnames(input) <- c("id","xcount","xvisit","ysales")
input1 <- input[,2:4] #need to rank the numeric variables besides id
for (i in 1:3)
{
transform(input1,
assign(paste(input1[,i],"rank",sep="_")) =
FUN = rank(-input1[,i], ties.method = "first"))
}
input[paste(names(input)[2:4], "rank", sep = "_")] <-
  lapply(input[2:4], cut, breaks = 10)
The problem with this approach is that it creates the rank values as (101, 230], (230, 450], etc., whereas I would like the rank variables to be populated as 1, 2, etc. up to 10 categories as per the splits I did. Is there any way to achieve this?
input[5:7] <- lapply(input[5:7], rank, ties.method = "first")
The approach I tried from the solutions provided below is:
input <- read.table(header=F, text="101 20 5 6
102 2 4 7
103 9 12 15
104 100 8 7
105 450 12 65
109 25 28 145
112 854 56 93")
colnames(input) <- c("id","xcount","xvisit","ysales")
input[paste(names(input)[2:4], "rank", sep = "_")] <-
  lapply(input[2:4], cut, breaks = 3)
Current output I get is:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 (1.15,286] (3.95,21.3] (5.86,52.3]
2 102 2 4 7 (1.15,286] (3.95,21.3] (5.86,52.3]
3 103 9 12 15 (1.15,286] (3.95,21.3] (5.86,52.3]
4 104 100 8 7 (1.15,286] (3.95,21.3] (5.86,52.3]
5 105 450 12 65 (286,570] (3.95,21.3] (52.3,98.7]
6 109 25 28 145 (1.15,286] (21.3,38.7] (98.7,145]
7 112 854 56 93 (570,855] (38.7,56.1] (52.3,98.7]
Desired output:
id xcount xvisit ysales xcount_rank xvisit_rank ysales_rank
1 101 20 5 6 1 1 1
2 102 2 4 7 1 1 1
3 103 9 12 15 1 1 1
4 104 100 8 7 1 1 1
5 105 450 12 65 2 1 2
6 109 25 28 145 1 2 3
I would like to see which group each record falls under when ranking the interval values.
Using dplyr
library(dplyr)
nm1 <- paste("rank", names(input)[2:4], sep = "_")
input[nm1] <- mutate_each(input[2:4], funs(rank(., ties.method = "first")))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 2 5 6 1 2 1
#2 102 3 4 7 2 1 2
#3 103 9 12 15 3 3 3
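For reference, ties.method = "first" breaks ties by order of appearance; a minimal plain-Python sketch of that ranking rule:

```python
def rank_first(values):
    # Sort positions by (value, original position) so equal values
    # keep their original order, then invert the permutation into ranks.
    order = sorted(range(len(values)), key=lambda i: (values[i], i))
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

print(rank_first([5, 4, 12]))   # [2, 1, 3] -- the xvisit column above
```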
Update
Based on the new input and using cut
input[nm1] <- mutate_each(input[2:4], funs(cut(., breaks=3, labels=FALSE)))
input
# id xcount xvisit ysales rank_xcount rank_xvisit rank_ysales
#1 101 20 5 6 1 1 1
#2 102 2 4 7 1 1 1
#3 103 9 12 15 1 1 1
#4 104 100 8 7 1 1 1
#5 105 450 12 65 2 1 2
#6 109 25 28 145 1 2 3
#7 112 854 56 93 3 3 2
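cut(., breaks = 3, labels = FALSE) assigns equal-width bins over each column's (slightly padded) range; a rough plain-Python approximation of that binning, with R's range padding simplified to clamping the top value:

```python
def cut_width(values, breaks=3):
    # Equal-width bins over [min, max]; the maximum is clamped into the
    # top bin (R instead pads the range slightly on both sides).
    lo, hi = min(values), max(values)
    width = (hi - lo) / breaks
    return [min(int((v - lo) / width) + 1, breaks) for v in values]

print(cut_width([20, 2, 9, 100, 450, 25, 854]))   # [1, 1, 1, 1, 2, 1, 3]
```

Applied to the xcount column above, this reproduces the rank_xcount labels shown in the output.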