divide counts in one column where a condition is met - count

I am trying to determine the on-time delivery rate of orders.
The column of interest is the on-time delivery column, which contains either 0 (not on time) or 1 (on time). How can I calculate the on-time rate for each person in SQL? Basically, I want the count of 0s divided by the total count (0s and 1s) for each person, and the same thing for on time: the count of 1s divided by the total count (0s and 1s).
Here's a data example:
Week Delivery on time Person
1 0 sARAH
1 0 sARAH
1 1 sARAH
2 1 vIC
2 0 Vic

You may aggregate by person, and then take the average of the on time statistic:
SELECT Person, AVG(1.0*DeliveryOnTime) AS OnTime,
AVG(1.0 - DeliveryOnTime) AS NotOnTime
FROM yourTable
GROUP BY Person;
Demo
The demo given is for SQL Server, and the above syntax might have to change slightly depending on your actual database, which you did not specify.
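As a quick check of the averaging idea, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. The table name `deliveries`, the column names, and the normalized spelling of the person names are assumptions made for illustration; they are not from the question.

```python
import sqlite3

# In-memory database with the sample rows from the question.
# Table/column names are illustrative; names are normalized to one spelling.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deliveries (week INTEGER, on_time INTEGER, person TEXT)")
conn.executemany(
    "INSERT INTO deliveries VALUES (?, ?, ?)",
    [(1, 0, "Sarah"), (1, 0, "Sarah"), (1, 1, "Sarah"), (2, 1, "Vic"), (2, 0, "Vic")],
)

# The average of a 0/1 flag is the share of 1s; 1 - flag gives the share of 0s.
rates = conn.execute(
    """
    SELECT person,
           AVG(1.0 * on_time) AS on_time_rate,
           AVG(1.0 - on_time) AS not_on_time_rate
    FROM deliveries
    GROUP BY person
    ORDER BY person
    """
).fetchall()

for person, on_time_rate, not_on_time_rate in rates:
    print(person, round(on_time_rate, 3), round(not_on_time_rate, 3))
```

Multiplying by 1.0 forces floating-point arithmetic, which matters in databases where the average of an integer column is itself truncated to an integer.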


Using "shift" function in R to subtract one row from another by group

I have a data.table that looks like this:
dt
id month balance
1: 1 4 100
2: 1 5 50
3: 2 4 200
4: 2 5 135
5: 3 4 100
6: 3 5 100
7: 4 5 300
"id" is the client's ID, "month" indicates what month it is, and "balance" indicates the account balance of a client. In a sense, this is longitudinal data where, say, element (2,3) indicates that Client #1 has an account balance of 50 at the end of month 5.
I want to generate a column that gives me the difference in a client's balance between months 4 and 5, so that I know the transactions carried out from one month to the next.
This new variable should let me know that Client 1 drew 50, Client 2 drew 65 and Client 3 didn't do anything in aggregate terms between April and May. Client 4 is a new client that joined in May.
I thought of the following code:
dt$transactions <- dt$balance - shift(dt$balance, 1, "up")
However, it does not work properly because it's telling me that Client 4 made a 200 dollar deposit (but Client 4 is new!). Therefore, I want to be able to introduce the argument "by=id" to this somehow.
I know the solution lies in using the following notation:
dt[, transactions := balance - shift(balance, ??? ), by=id]
I just need to figure out how to make the aforementioned code work properly.
Thanks in advance.
Given that I only have two observations (at most), the following code gives me an elegant solution:
dt[, transaction := balance - first(balance), by = id]
This prevents any NAs from entering the variable transaction.
However, if I had more observations per id, I would do the following:
dt[,transaction := balance - shift(balance,1), by = id]
Big thanks to @Ryan and @Onyambu for helping.
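To make the grouped-lag logic explicit, here is a rough Python sketch of what `balance - shift(balance, 1)` with `by = id` computes: each row's balance minus the previous balance of the same client, with no value for a client's first row. The function name `lagged_diff` is made up for this illustration, and the sketch assumes rows can be sorted by id and month.

```python
from itertools import groupby

# (id, month, balance) rows from the question
rows = [
    (1, 4, 100), (1, 5, 50),
    (2, 4, 200), (2, 5, 135),
    (3, 4, 100), (3, 5, 100),
    (4, 5, 300),
]

def lagged_diff(rows):
    """Return (id, month, balance, diff), where diff is the balance minus the
    previous balance of the same id, or None for the first row of each id."""
    result = []
    # Sorting by (id, month) puts each client's rows together in time order.
    for _, group in groupby(sorted(rows), key=lambda r: r[0]):
        previous = None  # reset the lag at the start of every group
        for client_id, month, balance in group:
            diff = None if previous is None else balance - previous
            result.append((client_id, month, balance, diff))
            previous = balance
    return result

for row in lagged_diff(rows):
    print(row)
```

Resetting `previous` for each group is what keeps Client 4's first balance from being compared with Client 3's last one.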

How to find unique duplicates in SQL

Hi all, I want to find the unique duplicates and their number of occurrences.
Table 1
Id name Amount(in and out) Day
1 ram 100 Sunday
2 ram -100 Sunday
3 ram 100 Monday
4 ram -100 Monday
5 ram 100 Wednesday
6 ram 100 Wednesday
Ram got 100 from the company on Sunday (id = 1, amount = 100), and on the same day he gave the money back to the company (id = 2, amount = -100).
Similarly for id = 3 and id = 4.
But id = 5 and id = 6 are duplicates, as the amount is not reversed and they occurred on the same day.
I want to display:
Count name Amount
2 ram 100
Count is the number of occurrences of the duplicate values.
I have tried many approaches, but none worked. Please help me. Thanks.
Note: duplicated means two sequential positive (or negative) values on any one day.
You can try the query below (assuming the table is named tbl).
Logic: if an entry is reversed, it will have one positive and one negative value (100 and -100 in your case), so their sum will be zero; hence we ignore them in the HAVING clause.
select count(id), name, amount from tbl group by name, day, amount having count(id) > 1 and sum(amount) > 0
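The query can be tried on the sample data with a small Python sketch using an in-memory SQLite database (SQLite is only a stand-in here; the question does not say which database is used). One thing this makes visible: because amount is part of the GROUP BY, the 100 and -100 rows fall into separate groups, so for this sample it is the count condition that drops the reversed pairs, while the sum condition keeps repeated negative amounts out of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER, name TEXT, amount INTEGER, day TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, "ram", 100, "Sunday"),
        (2, "ram", -100, "Sunday"),
        (3, "ram", 100, "Monday"),
        (4, "ram", -100, "Monday"),
        (5, "ram", 100, "Wednesday"),
        (6, "ram", 100, "Wednesday"),
    ],
)

# Same query as in the answer: one group per (name, day, amount),
# keeping groups that occur more than once with a positive total.
duplicates = conn.execute(
    """
    SELECT COUNT(id), name, amount
    FROM tbl
    GROUP BY name, day, amount
    HAVING COUNT(id) > 1 AND SUM(amount) > 0
    """
).fetchall()

print(duplicates)
```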

Use Oracle Partition and Over By clause to retrieve section numbers

My Table comprises 4 Columns (Patient, Sample, Analysis and Component). I am trying to write a query that will look at the combination of Patient, Analysis and Component for each record and assign a "Section Number".
The numbering should re-start for every patient.
See expected output below. Patient 1010 has 3 samples but all have same Analysis-component. Hence they all have the same section (1).
Now, counting restarts for Patient 2020. This patient has 2 samples but both have a different Analysis-Component combination. Hence they are placed in separate sections 1 and 2.
Patient Sample Analysis Component Section Number
_______ ______ ________ _________ ______________
1010 720000140249 CALC Calcium 1
1010 720000140288 CALC Calcium 1
1010 720000140288 CALC Calcium 1
2020 720000190504 ALB Albumin 1
2020 720000160504 ALB Albumin Pct 2
3030 720000134568 CALC Calcium 1
3030 720000123404 ALB Albumin 2
3030 720000160765 ALB Albumin Pct 3
I have written the following query, but all it does is group samples with the same Component into one section. It does not consider the Patient or Analysis at all.
Your help is much appreciated (as always!)
select
x.patient, x.sample_number, x.analysis, x.component,
a.myRowCount
from
X_PREV_PAT_RESULTS x inner join (
select distinct
x1.COMPONENT
, ROW_NUMBER() OVER (ORDER BY x1.COMPONENT) myRowCount
from X_PREV_PAT_RESULTS x1
group by x1.patient ) A on x.COMPONENT = A.COMPONENT
order by a.myRowCount, x.patient;
My guess is that you want
dense_rank() over (partition by patient
order by analysis desc, component) myRowCount
What happens with rows after a tie? For example, if patient 1010 also gets an ALB analysis, would that row have a myRowCount of 2 or 4? rank would return 4; dense_rank would return 2.
How are you determining the order of rows for a particular patient? It appears that you're going in reverse alphabetical order for analysis and then alphabetically for component, but that seems like a pretty unusual ordering.
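To illustrate what a dense rank per patient does, here is a small Python sketch that mimics `dense_rank() over (partition by patient order by analysis desc, component)` on the sample rows. It is only an emulation for illustration, not Oracle code, and the function name `section_numbers` is made up.

```python
from itertools import groupby

# (patient, sample, analysis, component) rows from the question
records = [
    ("1010", "720000140249", "CALC", "Calcium"),
    ("1010", "720000140288", "CALC", "Calcium"),
    ("1010", "720000140288", "CALC", "Calcium"),
    ("2020", "720000190504", "ALB", "Albumin"),
    ("2020", "720000160504", "ALB", "Albumin Pct"),
    ("3030", "720000134568", "CALC", "Calcium"),
    ("3030", "720000123404", "ALB", "Albumin"),
    ("3030", "720000160765", "ALB", "Albumin Pct"),
]

def section_numbers(records):
    """Append a dense rank of (analysis desc, component asc) within each patient."""
    out = []
    by_patient = sorted(records, key=lambda r: r[0])  # stable: keeps row order per patient
    for _, group in groupby(by_patient, key=lambda r: r[0]):
        rows = list(group)
        # Distinct (analysis, component) pairs: sort by component ascending first,
        # then stably by analysis descending, so component breaks ties.
        keys = sorted({(r[2], r[3]) for r in rows}, key=lambda k: k[1])
        keys.sort(key=lambda k: k[0], reverse=True)
        # Dense rank: consecutive numbers with no gaps, restarting per patient.
        rank = {k: i + 1 for i, k in enumerate(keys)}
        out.extend(r + (rank[(r[2], r[3])],) for r in rows)
    return out

for row in section_numbers(records):
    print(row)
```

Ranking the distinct pairs rather than the rows is what gives tied rows the same number without leaving gaps afterwards.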
select x.patient, x.sample_number, x.analysis, x.component,
dense_rank() over(partition by x.patient order by x.analysis, x.component)
from X_PREV_PAT_RESULTS x
where exists (select 1 from X_PREV_PAT_RESULTS x1 where x1.COMPONENT = x.COMPONENT);

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What shall I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today? Should it be the tenure up until today, or should it be NA?
First, I disagree with the previous answer. For a subscription that is still active today, the tenure should be recorded neither as the tenure up until today nor as NA. What exactly do we know about those subscriptions? We know they have lasted at least until today. In other words, we don't know the exact value of tenure_in_months for those observations, only that it is longer than the tenure accumulated up to today.
This situation is known as right censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, event=status, type="interval2")
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of computational details can be found: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as censor date.
NA won't work with the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about the survival.
SQL code to get the time till event (use in SELECT part of query)
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis. It does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates it's right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
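The preparation step described in these answers (tenure up to the collection date plus a status flag) can be sketched in Python as follows. The collection date used here is a made-up placeholder, since the question does not say when the data was pulled, and the tenure is measured in days as suggested above.

```python
from datetime import date

# Hypothetical date on which the data was collected (not given in the question).
COLLECTION_DATE = date(2013, 10, 1)

# (id, start_date, end_date); None marks a subscription that is still active.
subscriptions = [
    (1, date(2013, 6, 1), date(2013, 8, 25)),
    (2, date(2013, 6, 1), None),
    (3, date(2013, 8, 1), date(2013, 9, 12)),
]

def tenure_and_status(start, end, collected=COLLECTION_DATE):
    """Return (tenure_in_days, status) with status 1 = cancelled, 0 = right-censored."""
    if end is None:
        # Still active: the tenure is only known to be at least this long.
        return (collected - start).days, 0
    return (end - start).days, 1

survival_rows = [
    (sub_id,) + tenure_and_status(start, end) for sub_id, start, end in subscriptions
]

for row in survival_rows:
    print(row)
```

The resulting tenure and status columns correspond to the `time` and `event` arguments of a right-censored `Surv` object.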

Moving Between States in a Markov Model - How to Tell R?

I have been struggling with this problem for quite a while and any help would be much appreciated.
I am trying to write a function to calculate a transition matrix from observed data for a markov model.
The initial data I am using to build the function looks something like this:
Season Team State
1 1 Manchester United 1
2 1 Chelsea 1
3 1 Manchester City 1
.
.
.
99 5 Charlton Athletic 4
100 5 Watford 4
with 5 seasons and 4 states.
I know how I am going to calculate the transition matrix, but in order to do this I need to count the number of teams that move from state i to state j for each season.
I need code that will do something like this,
a<-function(x,i,j){
if("team x is in state i in season 1 and state j in season 2") 1 else 0
}
sum(a)
and then I could do this for each team and pair of states and repeat for all 5 seasons. However, I am having a hard time getting my head around how to tell R the thing in quotation marks. Sorry if there is a really obvious answer but I am a rubbish programmer.
Thanks so much for reading!
This function tells you whether a team made the transition from state1 to state2 between season1 and season2:
a <- function(team, state1, state2, data, season1, season2) {
  # all rows belonging to the given team
  team.rows <- data[data$Team == team, ]
  # 1 for each row where the team is in state1 during season1, otherwise 0
  in.season1.in.state1 <- ifelse(team.rows$Season == season1 & team.rows$State == state1, 1, 0)
  # 1 for each row where the team is in state2 during season2, otherwise 0
  in.season2.in.state2 <- ifelse(team.rows$Season == season2 & team.rows$State == state2, 1, 0)
  return(sum(in.season1.in.state1) * sum(in.season2.in.state2))
}
The first statement selects all rows of a particular team.
The second statement determines, for each of those rows, whether the team is in state1 in season1.
The third statement determines, for each of those rows, whether the team is in state2 in season2.
The return statement returns 1 if the team was in state1 in season1 and in state2 in season2, and 0 otherwise. This only works if there are no duplicate rows; with duplicates it might return a value greater than 1.
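For the overall goal of counting how many teams move from state i to state j between consecutive seasons, a single pass over the data may be simpler than calling a function per team and state pair. Below is a minimal Python sketch of that counting step. The states assigned to the teams are made up for illustration, and the function name `transition_counts` is hypothetical.

```python
from collections import Counter

# (season, team, state) observations; the states here are invented sample values.
observations = [
    (1, "Manchester United", 1), (1, "Chelsea", 1), (1, "Manchester City", 2),
    (2, "Manchester United", 1), (2, "Chelsea", 2), (2, "Manchester City", 2),
    (3, "Manchester United", 2), (3, "Chelsea", 2),
]

def transition_counts(observations):
    """Count moves (state in season s, state in season s + 1) over all teams."""
    state_of = {(season, team): state for season, team, state in observations}
    counts = Counter()
    for (season, team), state in state_of.items():
        next_state = state_of.get((season + 1, team))
        if next_state is not None:  # skip teams with no record in the next season
            counts[(state, next_state)] += 1
    return counts

counts = transition_counts(observations)
for (i, j), n in sorted(counts.items()):
    print(f"state {i} -> state {j}: {n}")
```

Dividing each count by the total number of moves leaving its starting state would then give the estimated transition probabilities.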
