SAS/SQL group by and keeping all rows - proc sql

I have a table like this, observing the behavior of some accounts over time; here, two accounts with acc_id 1 and 22:
acc_id  date    mob
1       Dec 13   -1
1       Jan 14    0
1       Feb 14    1
1       Mar 14    2
22      Mar 14   10
22      Apr 14   11
22      May 14   12
I would like to create a column orig_date that equals date where mob=0, and the minimum date within the acc_id group if there is no mob=0 row for that acc_id.
Therefore the expected output is:
acc_id  date    mob  orig_date
1       Dec 13   -1  Jan 14
1       Jan 14    0  Jan 14
1       Feb 14    1  Jan 14
1       Mar 14    2  Jan 14
22      Mar 14   10  Mar 14
22      Apr 14   11  Mar 14
22      May 14   12  Mar 14
The second account has no mob=0 observation, so its orig_date is set to min(date) within the group.
Is there a way to achieve this in SAS, preferably in a single PROC SQL step?

Seems pretty simple. Just calculate the min date in two ways and use coalesce() to pick the one you want.
First let's turn your printout into an actual dataset.
data have;
  input acc_id date :anydtdte. mob;
  format date date9.;
cards;
1 Dec13 -1
1 Jan14 0
1 Feb14 1
1 Mar14 2
22 Mar14 10
22 Apr14 11
22 May14 12
;
To find the DATE when MOB=0, use a CASE expression. PROC SQL will automatically remerge the MIN() aggregates calculated at the ACC_ID level back onto all of the detail rows.
proc sql;
  create table want as
  select *
       , coalesce( min(case when mob=0 then date else . end)
                 , min(date)
                 ) as orig_date format=date9.
  from have
  group by acc_id
  order by acc_id, date
  ;
quit;
Result:
Obs  acc_id  date       mob  orig_date
 1   1       01DEC2013   -1  01JAN2014
 2   1       01JAN2014    0  01JAN2014
 3   1       01FEB2014    1  01JAN2014
 4   1       01MAR2014    2  01JAN2014
 5   22      01MAR2014   10  01MAR2014
 6   22      01APR2014   11  01MAR2014
 7   22      01MAY2014   12  01MAR2014

Here is a DATA step approach:
data have;
  input acc_id date $ mob; /* note: date is read as a character variable here */
datalines;
1 Dec13 -1
1 Jan14 0
1 Feb14 1
1 Mar14 2
22 Mar14 10
22 Apr14 11
22 May14 12
;
data want;
  /* First pass: scan the BY group to find orig_date (assumes data sorted by acc_id and date) */
  do until (last.acc_id);
    set have;
    by acc_id;
    if first.acc_id then orig_date = date; /* default to the first (minimum) date */
    if mob = 0 then orig_date = date;      /* override when a mob=0 row exists */
  end;
  /* Second pass: re-read the same BY group and output every row with orig_date attached */
  do until (last.acc_id);
    set have;
    by acc_id;
    output;
  end;
run;

Related

Accumulated data in pivot mode

Currently I accumulate columns via row_cumsum:
test
| project Boenheter, Ar, Maned, ManedTLA
| extend _date = make_datetime(toint(Ar), Maned, 1)
| extend key1 = Ar, __auto0 = datetime_part('Month', startofmonth(_date))
| summarize value0 = sum(Boenheter) by key1, __auto0, ManedTLA
| order by __auto0 asc, key1 asc
| serialize value0 = row_cumsum(value0, __auto0 != prev(__auto0))
| extend __p = pack(tostring(ManedTLA), value0)
| summarize __p = make_bag(__p) by key1
| evaluate bag_unpack(__p)
| order by key1 asc
But I want to accumulate across rows instead, in this way:
Feb = Jan + Feb, Mar = Jan + Feb + Mar, etc., so Feb = 304 and Mar = 624 (taking the year 2012 as an example), and so on.
Does Kusto have a way to do this accumulation across rows instead of columns (row_cumsum)?
Help please.
Use row_cumsum, with a restart on year change, before pivoting:
// Generation of a data sample. No part of the solution.
let t = materialize(range i from 1 to 200 step 1 | extend dt = ago(365d*10*rand()));
// The solution starts here.
t
| summarize count() by year = getyear(dt), month = format_datetime(dt,'MM')
| order by year asc, month asc
| extend cumsum = row_cumsum(count_, year != prev(year))
| evaluate pivot(month, any(cumsum), year)
The output is a pivoted table with one row per year (2012-2022) and one column per month (01-12); each cell holds the running cumulative count within that year. (The exact values are omitted here, since they derive from the random sample generated above.)

Split data when time intervals exceed a defined value

I have a data frame of GPS locations with a column of seconds. How can I create a new column that splits the data based on time gaps? i.e. for this data frame:
df <- data.frame(secs=c(1,2,3,4,5,6,7,10,11,12,13,14,20,21,22,23,24,28,29,31))
I would like to cut the data frame wherever there is a time gap of 3 or more seconds between locations, and create a new column entitled 'bouts' that gives a running tally of the sections, producing a data frame like this:
id secs bouts
 1    1     1
 2    2     1
 3    3     1
 4    4     1
 5    5     1
 6    6     1
 7    7     1
 8   10     2
 9   11     2
10   12     2
11   13     2
12   14     2
13   20     3
14   21     3
15   22     3
16   23     3
17   24     3
18   28     4
19   29     4
20   31     4
Use cumsum and diff:
df$bouts <- cumsum(c(1, diff(df$secs) >= 3))
Remember that logical values get coerced to numeric values 0/1 automatically and that diff output is always one element shorter than its input.
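To see the intermediate steps on the example data (results shown as comments):
diff(df$secs)
# 1 1 1 1 1 1 3 1 1 1 1 6 1 1 1 1 4 1 2
diff(df$secs) >= 3   # TRUE wherever a gap of 3+ seconds starts a new bout
# FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
cumsum(c(1, diff(df$secs) >= 3))   # prepend 1 so the first row opens bout 1
# 1 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4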

Querying time lag in many-to-many table relationship

I am trying to figure out how to count the number of records in Table B (clinic visits) that occurred in the 6 months before any given event in Table A (survey administration). I am seeking advice about merging or relating these tables in R, and then querying them based on a date column.
Table A contains data on surveys that were administered to each study participant roughly every 6 months (though not on the same administration date for each participant). It contains Participant ID and Survey Date, with 4-5 unique dates per participant ID:
PartID  SurveyDate
12      12/1/12
12      6/8/12
12      12/15/11
12      5/29/11
13      12/15/12
13      6/20/12
13      12/7/11
13      6/15/11
14      11/28/12
14      6/1/12
14      1/1/12
14      6/30/11
Additionally, I have a table of clinic visits for each participant and their result (binary) for a certain disease test. Clinic visits occur throughout the year and may happen 0, 1, or many times between survey administrations. At each clinic visit, a test is done and the result is recorded as 1 if positive, 0 if negative.
   Part_ID  Clinic_date  Test_result
1  12       12/1/12      0
2  12       11/30/12     1
3  12       7/1/12       0
4  12       4/1/12       1
5  12       11/15/11     0
6  12       6/15/11      1
7  12       6/5/11       0
8  12       4/1/11       1
9  12       10/15/10     0
10 12       10/13/10     1
11 12       7/15/10      0
12 13       11/30/12     1
13 13       7/1/12       1
14 13       4/1/12       0
15 13       11/15/11     0
16 13       6/15/11      1
17 13       6/5/11       1
18 13       4/1/11       0
19 13       10/15/10     0
20 13       10/13/10     1
21 13       7/15/10      1
22 14       11/30/12     0
23 14       7/1/12       0
24 14       4/1/12       1
25 14       11/15/11     0
26 14       6/15/11      1
27 14       6/5/11       0
I would like to add a column to the survey administration table (Table A) showing the number of positive clinic tests (1 in the Test_result column, so I could use a sum) for that participant in the 6 months prior to the survey being given. Any advice would be much appreciated!
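One possible approach in R, sketched under assumptions: the tables live in data frames named surveys and visits (hypothetical names), dates are in m/d/y format, and "6 months" is approximated as 180 days.
# Parse the date columns first (assumed m/d/y format).
surveys$SurveyDate <- as.Date(surveys$SurveyDate, format = "%m/%d/%y")
visits$Clinic_date <- as.Date(visits$Clinic_date, format = "%m/%d/%y")
# For each survey row, count that participant's positive tests
# in the 180 days before (and excluding) the survey date.
surveys$pos_6mo <- mapply(function(id, d) {
  v <- visits[visits$Part_ID == id &
              visits$Clinic_date >= d - 180 &
              visits$Clinic_date < d, ]
  sum(v$Test_result)
}, surveys$PartID, surveys$SurveyDate)
This loops over survey rows, so for large tables a non-equi join (e.g. in data.table) would scale better, but the logic is the same.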

Difference in Timestamp

I want to calculate the time difference between two events. The first five columns give the date-time of the incident; the remaining five columns give the date-time of death.
dat <- read.table(header=TRUE, text="
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32")
dat
YEAR MONTH DAY HOUR MINUTE D.YEAR D.MONTH D.DAY D.HOUR D.MINUTE
2013 1 6 0 55 2013 1 6 0 56
2013 2 3 21 24 2013 2 4 23 14
2013 1 6 11 45 2013 1 6 12 29
2013 3 6 12 25 2013 3 6 23 55
2013 4 6 18 28 2013 5 3 11 18
2013 4 8 14 31 2013 4 8 14 32
I want to calculate the difference in minutes. The following code does not work; the timestamps should look like 2013-04-06 04:08.
library(lubridate)
dat$tstamp1 <- mdy(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE,sep = "-"))
dat$tstamp2 <- mdy(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"))
dat$diff <- dat$tstamp2 -dat$tstamp2 ### want the difference in minutes
To parse a date-time string in the "-"-separated format you're building, you need to supply a custom format and pass it to parse_date_time. For example:
parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
                "%Y-%m-%d-%H-%M")
Your new code would therefore look like:
library(lubridate)
dat$tstamp1 <- parse_date_time(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"),
                               "%Y-%m-%d-%H-%M")
dat$tstamp2 <- parse_date_time(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"),
                               "%Y-%m-%d-%H-%M")
Then the following will get you the time difference in minutes (difftime with an explicit units argument avoids relying on the automatic unit chosen by the - operator):
dat$diff <- as.numeric(difftime(dat$tstamp2, dat$tstamp1, units = "mins"))
You can also try this, using base R's strptime (no extra package needed):
dat$tstamp1 <- strptime(paste(dat$YEAR, dat$MONTH, dat$DAY, dat$HOUR, dat$MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$tstamp2 <- strptime(paste(dat$D.YEAR, dat$D.MONTH, dat$D.DAY, dat$D.HOUR, dat$D.MINUTE, sep = "-"), "%Y-%m-%d-%H-%M")
dat$diff <- as.numeric(difftime(as.POSIXct(dat$tstamp2), as.POSIXct(dat$tstamp1), units = "mins"))
Using strptime is faster and a bit safer against unexpected data.

combine similar consecutive observations into one observation in R

I have a data set like this
date ID key value
05   1   3     2
05   1   3     5
05   1   3     1
05   1   5     2
05   1   7     3
05   1   7     3
05   1   3     4
05   2   9     8
I need the output to look like this:
date ID key value
05   1   3     8
05   1   5     2
05   1   7     6
05   1   3     4
05   2   9     8
As you can see, when consecutive rows have the same date, ID, and key, I want to combine those observations and sum their values. This should happen only when the events are consecutive.
Is it possible to do this in R? If so, can anyone tell me how?
Thanks.
Use rle to find runs of consecutive identical values:
# your data
df <- read.table(text="date ID key value
05 1 3 2
05 1 3 5
05 1 3 1
05 1 5 2
05 1 7 3
05 1 7 3
05 1 3 4
05 2 9 8", header=T)
# get consecutive runs - add a grouping variable
r <- with(df, rle(paste(date, ID, key)))
df$grps <- rep(seq_along(r$lengths), r$lengths)
# aggregate values within each run
a <- aggregate(value ~ date + ID + key + grps, data = df, sum)
# restore the original run order and drop the helper column
a <- a[order(a$grps), ]
a$grps <- NULL
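For reference, the same grouping variable can be built without rle, using the cumsum-of-changes idiom from the time-gap answer above (a sketch over the same df):
# Flag rows whose date/ID/key combination differs from the previous row,
# then cumsum the flags to get a run id: 1 1 1 2 3 3 4 5
key <- with(df, paste(date, ID, key))
df$grps <- cumsum(c(TRUE, key[-1] != key[-length(key)]))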
