Need formula on how to calculate the work completion days on average - math

A client gives 100 tasks to an employee.
The employee completes 50 tasks in 1 day,
20 tasks in 2 days,
15 tasks in 3 days,
4 tasks in 4 days,
5 tasks in 6 days,
6 tasks in 10 days.
Now I want to know, on average, how many days the employee takes to complete 1 task.
I need a formula for this.

Assuming tasks are not completed in parallel (i.e. days are mutually exclusive with respect to completing/working on tasks), average days per task = total days / total tasks = 26 / 100 = 0.26.
This is where the solution should terminate. However, I provide a number of checks/alternative approaches which serve to demonstrate (unequivocally) the veracity of the above formula.
check 1
The same can be derived using the weighted-average calculation:
check 2
Intuitively, if each task takes ~0.26 days to complete, and there are 100 tasks, then the total duration (days) ~= 26: summing column B gives just that:
check 3
If still unconvinced, you can calculate the average days per task for each category/type (i.e. for those that take 1,2,3,.., 10 days to complete):
Then expand these out using sequence / other method:
Again, this yields 0.26, which should confirm (unequivocally) the veracity of the simple/direct ratio.
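The direct ratio and the weighted-average check can be reproduced in a few lines of R. This is only a sketch using the numbers from the question; the variable names (`tasks`, `days`, `per_task`) are illustrative, and it keeps the same no-parallel-work assumption.

```r
# Task counts and the days spent on each batch, taken from the question
tasks <- c(50, 20, 15, 4, 5, 6)
days  <- c(1, 2, 3, 4, 6, 10)

# Direct ratio: total days divided by total tasks
avg_days_per_task <- sum(days) / sum(tasks)

# Check: average days per task within each batch, weighted by batch size
per_task     <- days / tasks
weighted_avg <- weighted.mean(per_task, w = tasks)

print(avg_days_per_task)
print(weighted_avg)
```

Both values agree because weighting each batch's days-per-task by its task count cancels the division again.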


Measure similarity of objects over a period of time

I've got a dataset that has monthly metrics for different stores. Each store has three monthly metrics (total sales, customers and transaction count). My task is to find, over a year, the store that most closely matches a specific test store (e.g. Store 77).
So over the year, the test store and the most similar store need to have similar performance. My question is how do I go about finding the most similar store? I've currently used Euclidean distance but would like to know if there's a better way to go about it.
Thanks in advance
Is correlation a better way to measure similarity in this case compared to distance? I'm fairly new to data so if there's any resources where I can learn more about this stuff it would be much appreciated!!
In general, deciding the similarity of items is domain-specific, i.e. it depends on the problem you are trying to solve. Therefore, there is no one-size-fits-all solution. Nevertheless, there is a basic procedure you can follow when trying to solve this kind of problem.
Case 1 - only distance matters:
If you want to find the most similar items (stores in our case) using a distance measure, it's a good tactic to firstly scale your features in some way.
Example (min-max normalization):
[Table not preserved: Total sales and Total sales (normalized) per store]
After you apply normalization to all attributes, you can calculate the Euclidean distance or any other metric that you think fits your data.
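As a rough sketch of this first case, the R snippet below min-max scales three attributes and ranks stores by Euclidean distance to the test store. The store numbers other than 77 and all the values are made up for illustration, and the data is aggregated to one row per store; with monthly data each month of each metric would become its own column.

```r
# Hypothetical yearly totals per store (illustrative values only)
stores <- data.frame(
  store        = c(77, 12, 45, 90),
  sales        = c(3000, 2900, 5200, 1000),
  customers    = c(500, 520, 700, 150),
  transactions = c(600, 610, 950, 180)
)

# Min-max normalization: rescale each attribute to the range [0, 1]
min_max <- function(v) (v - min(v)) / (max(v) - min(v))
scaled  <- as.data.frame(lapply(stores[, -1], min_max))

# Euclidean distance from every store to the test store (Store 77)
test_row <- unlist(scaled[stores$store == 77, ])
dists    <- sqrt(rowSums(sweep(as.matrix(scaled), 2, test_row)^2))
names(dists) <- stores$store

# Closest store, excluding the test store itself
closest <- names(sort(dists[names(dists) != "77"]))[1]
print(round(dists, 3))
print(closest)
```

Without the scaling step, the sales column would dominate the distance simply because its values are larger.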
Some resources:
Similarity measures
Feature scaling
Case 2 - Trend matters:
Now, say that you want to find the similarity over the whole year. If the definition of similarity for your problem is just the instance of the stores at the end of the year, then distance will do the job.
But if you want to find similar trends of increase/decrease of the attributes of two stores, then distance measures conceal this information. You would have to use correlation metrics or any other more sophisticated technique than just a distance.
Simple example:
To keep it simple, let's say we are interested in 3-months analysis and that we use only sales attribute (unscaled):
[Table not preserved: total sales per store for January, February and March]
At the end of March, in terms of distance Store 1 and Store 2 are identical, both having 60 total sales.
But as far as the month-over-month increase is concerned, Store 2 and Store 3 are our match. In February they both had twice the sales of January, and in March they had 1.67 and 1.6 times the sales of February, respectively.
Bottom line: It really depends on what you want to quantify.
Well-known correlation metrics:
Pearson correlation coefficient
Spearman correlation coefficient
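The three-month example can be sketched in R as follows. The original table is not available, so the sales figures below are made-up values chosen only to be consistent with the numbers quoted in the text (60 sales in March for Stores 1 and 2, and growth ratios of 2 then 1.67 and 1.6 for Stores 2 and 3).

```r
# Illustrative monthly sales (January, February, March); hypothetical values
sales <- data.frame(
  store1 = c(50, 55, 60),
  store2 = c(18, 36, 60),
  store3 = c(25, 50, 80)
)

# Month-over-month growth ratio for each store
growth <- as.data.frame(lapply(sales, function(s) s[-1] / s[-length(s)]))

# Distance view: gap to Store 2 in March sales
march_gap <- abs(unlist(sales[3, c("store1", "store3")]) - sales$store2[3])

# Trend view: Euclidean distance between growth-ratio vectors
growth_gap <- sapply(growth[c("store1", "store3")],
                     function(g) sqrt(sum((g - growth$store2)^2)))
trend_match <- names(which.min(growth_gap))

print(march_gap)
print(growth_gap)
print(trend_match)
```

With a full year of monthly values, `cor(a, b, method = "pearson")` or `method = "spearman"` on two stores' series is the usual way to compute the coefficients listed above; three points are too few for a correlation to be informative.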

R: optimal sorting/allocation/distribution of items

I'm hoping someone may be able to help with a problem I have - trying to solve using R.
Individuals can submit requests for items. The minimum number of requests per person is one. There is a recommended maximum of five, but people can submit more in exceptional circumstances. Each item can only be allocated to one individual.
Each item has a 'desirability'/quality score ranging from 10 (high quality) down to 0 (low quality). The idea is to allocate items, in line with requests, such that as many high quality items as possible are allocated. It is less important that individuals have an equitable spread of requests met.
Everyone has to have at least one request met. Next priority is to look at whether we can get anyone who is over the recommended limit within it by allocating requests to others. After that the priority is to look at where the item would rank in each individual's request list based on quality score, and allocate to the person where it would rank highest (eg, if it would be first in someone's list and third in another's, give it to the former).
Effectively I'd need a sorting algorithm of some kind that:
1. Identifies where an item has been requested more than once.
2. Checks all the requests of everyone making said request.
3. If that request is the only one a person has made, gives it to them (if this scenario applies to more than one person, it should be flagged in some way).
4. If all requesters have made more than one request, checks whether any have made more than five requests; if they have, it can be taken off them.
5. If all are within the recommended limit, sees where the request would rank (based on quality score) and gives it to the person in whose list it would rank highest.
The process needs to check that the above steps aren't happening to people so many times that it leaves them without any items, so it effectively has to check one item at a time.
Does anyone have any ideas about how to approach this? I can think of all kinds of ways I could arrange the data to make it easy to identify and see where this needs to happen, but not to automate the process itself. Thanks in advance for any help.
The data (at least the bits needed for this process) looks like the below:
Item ID Person ID Item Score
1 AAG 9
1 AAK 8
2 AAAX 8
2 AN 8
2 AAAK 8
3 Z 8
3 K 8
4 AAC 7
4 AR 5
5 W 10
5 V 9
6 AAAM 7
6 AAAL 7
8 AB 9
8 D 9
10 A 3
10 AY 3
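One possible starting point, sketched in base R, is to compute the three quantities the rules depend on (requesters per item, requests per person, and the rank of each item within a person's list) and then apply the rules to each contested item. The data, the `decide` helper, and the column names below are all hypothetical. This is a single pass only: it does not update the counts after each allocation, so it does not yet guarantee that nobody ends up with nothing.

```r
# Hypothetical requests: one row per (item, person) pair
requests <- data.frame(
  item   = c(1, 1, 2, 2, 3, 4),
  person = c("P1", "P2", "P1", "P3", "P2", "P1"),
  score  = c(9, 9, 8, 8, 5, 10),
  stringsAsFactors = FALSE
)
n <- nrow(requests)

# How many people asked for each item, and how many items each person asked for
requests$n_requesters <- ave(rep(1, n), requests$item,   FUN = sum)
requests$n_requests   <- ave(rep(1, n), requests$person, FUN = sum)

# Rank of each item within the person's own list (1 = their best item)
requests$rank_in_list <- ave(-requests$score, requests$person,
                             FUN = function(s) rank(s, ties.method = "min"))

# Apply the rules to one contested item; NA means "flag for manual review"
decide <- function(d) {
  sole <- d$person[d$n_requests == 1]
  if (length(sole) == 1) return(sole)          # only request this person made
  if (length(sole) > 1) return(NA_character_)  # several sole requesters: flag
  within <- d[d$n_requests <= 5, ]             # drop people over the limit
  if (nrow(within) == 0) within <- d
  best <- within[within$rank_in_list == min(within$rank_in_list), ]
  if (nrow(best) == 1) best$person else NA_character_
}

contested <- requests[requests$n_requesters > 1, ]
winners   <- sapply(split(contested, contested$item), decide)
print(winners)
```

To respect the one-item-at-a-time requirement, the same helper could be called inside a loop that removes the losing requests and recomputes `n_requests` after every decision.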

Efficient chunking timeseries in R

I am working with a number of large timeseries currency pair pricing data in R. Files tend to be 100-300MB in size, and I will generally be working with 3 files at a time. I am looking for a (much) more efficient way of considering the TIME column of these data.
My data begins looking like :
1 USD/JPY 2012-01-02 00:00:00.307 77.023 77.055
2 USD/JPY 2012-01-02 00:00:00.493 77.030 77.049
3 USD/JPY 2012-01-02 00:00:05.003 77.030 77.050
4 USD/JPY 2012-01-02 00:00:05.005 77.023 77.056
5 USD/JPY 2012-01-02 00:00:05.006 77.024 77.056
6 USD/JPY 2012-01-02 00:00:06.008 77.023 77.056
... ... ... ...
R has no problem understanding the TIME column. For instance, subtracting the first timestamp from the second gives the output
Time difference of 0.1860001 secs
Data are already organized into files by month. Unfortunately this is also much too large. I would like to break down pricing data by 'trading week'
Forex trading occurs in continuous multi-day stretches, usually from Monday to Friday. Some trading holidays will suspend trading, and there will not be data on these days. The nature of trading scheduling is such that, if
... is greater than 12 hours, time t is the last time index for that week in USDJPY.
I have not found an acceptable way to break the data into trading weeks, by indices, or otherwise. All my attempts end up hanging. The USDJPY file contains ~1,900,000 rows.
One approach I tried :
for(i in 1:(length(USDJPY$TIME)-1)){
takes far too long (I quit before it could finish)
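I would think data.table should speed things up here quite a bit:
library(data.table) #1.9.5
# Gap to the previous row in hours; units are fixed so the 12-hour test below is meaningful
data[, DIFF := as.numeric(difftime(TIME, shift(TIME,n=1,type="lag"), units="hours"))]
data[is.na(DIFF), DIFF := 0] # the first row has no predecessor
Week number calc (increment when the difference is greater than 12 hours):
data[, Week.num := cumsum(DIFF>12)]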
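The same gap-based week numbering can be sketched in base R without any packages. The timestamps below are made up (one tick every six hours instead of real tick data), and `TIME`, `gap_hours`, and `week_num` are illustrative names rather than anything from the original files.

```r
# Made-up timestamps: two trading weeks separated by a weekend gap
week1 <- seq(as.POSIXct("2012-01-02 00:00:00", tz = "UTC"),
             as.POSIXct("2012-01-06 22:00:00", tz = "UTC"), by = "6 hours")
week2 <- seq(as.POSIXct("2012-01-08 22:00:00", tz = "UTC"),
             as.POSIXct("2012-01-13 22:00:00", tz = "UTC"), by = "6 hours")
TIME <- c(week1, week2)

# Gap to the previous tick in hours (0 for the first tick)
gap_hours <- c(0, as.numeric(diff(TIME), units = "hours"))

# A new trading week starts whenever the gap exceeds 12 hours
week_num <- cumsum(gap_hours > 12)

# Row indices for each trading week
chunks <- split(seq_along(TIME), week_num)
print(sapply(chunks, length))
```

Both steps are vectorized, so there is no per-row loop; `chunks` can then be used to subset the full price table one week at a time.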

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character, representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as an integer, which is fine for my purposes. I want to create the caseload variable by calculating the number of cases completed by the judge within the 30-day window of the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
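for(i in 1:length(x$idnumber)){
e<-x$compdate[i]; f<-e-30 # window bounds (assumed: the lines defining e and f are missing)
a<-x[x$judge==x$judge[i] & !is.na(x$compdate),] # all dated cases for this judge
b<-a[a$compdate<=e & a$compdate>=f,] # cases inside the window
x$caseload[i]<-length(b$idnumber) # assumed final step, following the description above
}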
It is working, but it is taking extremely long to complete. How can I optimize this or do it more easily? Sorry, I'm very new to R and to programming; I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
Now, you can make a rolling window function, and run it on each judge.
# Sort
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
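The pieces above can be assembled on a tiny data set as a sketch. The six rows are made up, and using `ave()` to apply `rolling.window` to each judge is an assumption about the final step, since that line is not shown.

```r
# Made-up cases: id, completion date (integer day count), judge
x <- data.frame(
  idnumber = 1:6,
  compdate = c(125L, 200L, 100L, 140L, 105L, 110L),
  judge    = c("A", "B", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Sort by judge, then by completion date (findInterval needs sorted input)
x <- x[order(x$judge, x$compdate), ]

# Rolling window function from the answer: for each row, the row's position
# minus the number of earlier-or-equal dates that are at least `window` days old
rolling.window <- function(y, window = 30) seq_along(y) - findInterval(y - window, y)

# Apply the function within each judge
x$caseload <- ave(x$compdate, x$judge, FUN = rolling.window)
print(x)
```

Note that this version counts a case dated exactly 30 days earlier as outside the window, and that cases sharing a date receive increasing counts in row order rather than a common value.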
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
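The two steps above can be sketched in base R for a single judge. The dates are made up, and the variable names (`per_day`, `nrun`, `prev`) are illustrative; for many judges the same logic would be applied per group.

```r
# Made-up completion dates (integer day counts) for one judge
compdate <- c(100L, 100L, 110L, 125L, 140L)

# Count cases per day, then take the cumulative sum of those counts
per_day <- table(compdate)
days    <- as.integer(names(per_day))
nrun    <- cumsum(as.integer(per_day))

# Running total on the latest day that is at least 31 days earlier (0 if none)
prev     <- findInterval(days - 31L, days)
nrun_old <- c(0L, nrun)[prev + 1L]

# Cases completed in the 30 days up to and including each day
caseload_by_day <- nrun - nrun_old

# Map the per-day result back onto the individual cases
caseload <- caseload_by_day[match(compdate, days)]
print(data.frame(compdate, caseload))
```

Because the count is computed once per day, cases that share a completion date automatically receive the same caseload.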
It's probably fastest to use a data.table. This is my attempt, using @nograpes's simulated data. Comments start with #.
DT <- data.table(x)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
# see how far to look back
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
# compute cumsum(today) - cumsum(more than 30 days ago)
On my laptop, this takes under a minute. Run this command to see the output for one judge:

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
In order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
What shall I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today? Should it be the tenure up until today, or should it be NA?
First, I disagree with the previous answer. For a subscription that is still active today, tenure_in_months should be neither the tenure up until today nor NA. What do we know exactly about those subscriptions? We know they have lasted up until today; that is, although we don't know exactly how long their tenure will be, we know it is longer than their tenure up to today.
This situation is known as right-censoring in survival analysis.
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R Surv object from:
Surv(time=tenure_in_months, event=status, type="right")
to (with type="interval2" the status is inferred from t1 and t2, so no event argument is passed):
Surv(t1, t2, type="interval2")
See the Surv documentation for more syntax details. Its description of interval-censored data reads:
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring date.
NA won't work with the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about survival.
SQL code to get the time till event (use in the SELECT part of the query):
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
I would use the difference in days for my analysis; it does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates it's right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
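Putting the right-censored encoding together, a minimal sketch in R might look like the following. The `censor_date` value is a made-up stand-in for the real data-collection date, tenure is measured in days as suggested above, and the model fit is guarded in case the survival package is not installed.

```r
# The three subscriptions from the question
subscriptions <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)

# Assumed data-collection date (hypothetical; replace with the real one)
censor_date <- as.Date("2013-10-01")

# status: 1 = cancelled (event observed), 0 = still active (right-censored)
subscriptions$status <- as.integer(!is.na(subscriptions$end_date))

# Tenure runs to the end date, or to the collection date if still active
last_seen <- subscriptions$end_date
last_seen[is.na(last_seen)] <- censor_date
subscriptions$tenure_days <- as.numeric(last_seen - subscriptions$start_date)
print(subscriptions)

# Fit the survival curve if the survival package is available
if (requireNamespace("survival", quietly = TRUE)) {
  library(survival)
  fit <- survfit(Surv(tenure_days, status) ~ 1, data = subscriptions)
  print(summary(fit))
}
```

The censored row keeps a real tenure value rather than NA, so it still contributes to the number at risk until the collection date.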
