I have quite an interesting task at work: I need to find out how much time a user spent doing something, and all I have is the timestamps of their saves. I know for a fact that the user saves after each small portion of work, so the saves are not far apart.
The obvious solution would be to estimate how much time one small item could possibly take and then walk through the sorted timestamps: if the difference between the current one and the previous one is more than that, the user had a coffee break; if it's less, we just add the difference to the total. Simple example code to illustrate that:
DateTime? prev_timestamp = null;
var total_time = TimeSpan.Zero;
foreach (var timestamp in timestamps)
{
    if (prev_timestamp != null)
    {
        var diff = timestamp - prev_timestamp.Value;
        if (diff < threshold)        // small gap: the user kept working
        {
            total_time += diff;
        }
        // larger gap: treated as a break and not counted
    }
    prev_timestamp = timestamp;
}
The problem is, while I know roughly how much time is spent on one small portion, I don't want to depend on it. What if some user is just that much slower than my prediction? I don't want them to be left without a paycheck. So I was wondering: could there be some clever math solution to this problem that works without knowing what time interval is acceptable?
PS. Sorry for the misunderstanding: of course no one would pay people based on these numbers, and even if they did, they would understand that it is just an approximation. But I'd like to find a solution that produces numbers as close to real life as possible.
You could get the median TimeSpan and then discard those TimeSpans which are off by, say, more than 50%.
But IMHO this algorithm should only be used to get an estimate of hours spent per project, not for payroll.
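For illustration, a minimal sketch of that median idea in R; the vector name timestamps and the 50% cut-off are just placeholders, not anything from the question:

gaps <- diff(sort(timestamps))                 # intervals between consecutive saves, in seconds
med <- median(gaps)
worked <- gaps[abs(gaps - med) <= 0.5 * med]   # discard gaps more than 50% away from the median
total_time <- sum(worked)                      # estimated time actually spent working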
You need to look either at the standard deviation for the group of all users, or at the variance of the intervals for a single user, or better, at a combination of the two for your sample set.
Grab all periods and look at the average? If some are far outside the average span you could discard them or use an adjusted value for them in the average.
I agree with Groo that using something based only on the 'save' timestamp is NOT what you should do - it will NEVER provide you with the actual time spent on the tasks.
The clever math you seek is called "standard deviation".
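A hedged sketch of that idea, again in R and again assuming a sorted numeric vector timestamps of save times in seconds; the mean + 2*sd cut-off is an arbitrary choice for illustration, not something prescribed by the answers above:

gaps <- diff(sort(timestamps))                 # intervals between consecutive saves
threshold <- mean(gaps) + 2 * sd(gaps)         # derive the cut-off from the data itself
total_time <- sum(gaps[gaps < threshold])      # only count gaps that look like normal work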
Initially I was using the poissrnd command to generate Poisson-distributed numbers, but I had no idea how to make them 'arrive' in my code. So I decided to generate the inter-arrival times instead. I do that as below.
t = exprnd(1/0.1);                       % first arrival time (mean inter-arrival time = 1/0.1 = 10)
for i = 1:5
    t(end+1) = t(end) + exprnd(1/0.1);   % next arrival = previous arrival + an exponential gap
end
% t is now like 31.3654 47.1014 72.0024 77.5162 102.3227 104.5794
% Even if this way of producing arrival times is wrong, my question remains the same
That's all fine to look at, but how can I actually use these times in my code to say that, yes, arrival number 1 occurs at time 31.3654, then arrival number 2 at time 47.1014, etc.? Ultimately I have to take an arrival, do some action, then receive another call. I cannot use a loop to step through such varying numbers (even if I use ceil()).
So what do people mean when they say they generated arrivals using a Poisson distribution? How do they actually make use of the Poisson distribution? After all, the code can't tell whether a number came from a Poisson distribution or from rand. I am tired of trying to think of an answer to this question. Please suggest something.
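Not an answer in MATLAB, but a minimal sketch in R of how inter-arrival times are typically consumed in a discrete-event loop: the simulation clock jumps straight to the next arrival time, and the per-arrival action happens at each jump (handle_arrival is a hypothetical placeholder):

set.seed(1)
arrival_times <- cumsum(rexp(6, rate = 0.1))   # cumulative sum of exponential gaps

handle_arrival <- function(i, t) {
  cat(sprintf("arrival %d at time %.4f\n", i, t))   # do the real per-arrival work here
}

clock <- 0
for (i in seq_along(arrival_times)) {
  clock <- arrival_times[i]      # advance the clock to the next event, no ceil() needed
  handle_arrival(i, clock)
}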
PREFACE: This is a question about using linear modelling to understand an electricity generation system, but you don't actually need to know much about either to understand it. I'm pretty sure this is really a question about R.
I am building a linear model to optimise the hourly dispatch of electric generators in a country (called "Lebanon", but the data I am using is a little fictitious). I have a model which optimises the hourly generation satisfactorily; the code looks like this:
lp.newobjfun.norelax <- lpSolve::lp(dir = "min", objfun.lebanon.postwalk1, constraintmatrix.lebanon.postwalk.allgenerators, directions.lebanon.postwalk3, rhs.lebanon.postwalk4)
The above works fine. Of course, doing it for a single day is a bit useless, so instead I want to run it iteratively for every day of a year. The code below is supposed to do that, but the returned value (the objective function's value) is always 0. Any ideas what I am doing wrong?
for (i in 1:365)
{
  rhs.lebanon.postwalk4[1:24] <- as.numeric(supplylebanon2010wholeyear[i, ])
  lp.newobjfun.norelax <- lpSolve::lp(dir = "min", objfun.lebanon.postwalk1,
                                      constraintmatrix.lebanon.postwalk.allgenerators,
                                      directions.lebanon.postwalk3, rhs.lebanon.postwalk4)
  print(lp.newobjfun.norelax$solution)
}
Just to be clear: in the second version, the right-hand side of the first 24 constraints is modified to reflect how the hourly supply of electricity changes on each day of the year.
Thanks in advance!
Okay, never mind, I've figured this out: there's a unit conversion from kWh to MWh which I hadn't taken care of.
Sorry for any bother!
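For completeness, a minimal sketch of the corrected loop; the division by 1000 (kWh to MWh) is my assumption about where the conversion belongs, based on the comment above:

for (i in 1:365)
{
  # convert the day's hourly supply from kWh to MWh before using it as the RHS
  rhs.lebanon.postwalk4[1:24] <- as.numeric(supplylebanon2010wholeyear[i, ]) / 1000
  lp.newobjfun.norelax <- lpSolve::lp(dir = "min", objfun.lebanon.postwalk1,
                                      constraintmatrix.lebanon.postwalk.allgenerators,
                                      directions.lebanon.postwalk3, rhs.lebanon.postwalk4)
  print(lp.newobjfun.norelax$objval)   # objective value for day i
}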
I have a string as YYYY-MM-DD:HH:mm:SS:sss (e.g. 2017-10-11:04:36:26.376). Now I want to convert it into epoch time. What would be a programmatic approach to this?
I am programming in C++ and am able to extract the information into variables.
It turns out there is a formula, but it's fairly ugly. I originally implemented something similar in BASIC 2.0 in 1982 (when each byte counted), and later converted it to Perl:
sub datestar {
    $_ = shift;                         # expects "YYYYMMDD", e.g. "20171011"
    /^(....)(..)(..)/;                  # $1 = year, $2 = month, $3 = day
    $fy = ($1 - ($2 < 3));              # treat Jan/Feb as belonging to the previous year
    # day number: whole years with leap-year corrections, plus months, plus day of month
    $jd = $fy*365 + int($fy/4) - int($fy/100) + int($fy/400)
          + int(((($2 - 3 + 12*($2 < 3))*30.6) + .5) + $3);
    return(86400*($jd - 719469));       # 719469 is the day number of 1970-01-01 in this scheme
}
Note that this takes something like "20171011", not "2017-10-11", and doesn't convert hours/minutes/seconds (which are easy to convert).
As always, double-check code before use, and use it as a template to write your own code if you really want to.
However, you would be infinitely better off using your programming language's existing functions to do this.
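To make that concrete, this is what the existing-functions route looks like in R (the question is about C++, where strptime or std::get_time plus timegm play the same role); treating the timestamp as UTC is my assumption:

s <- "2017-10-11:04:36:26.376"
t <- as.POSIXct(s, format = "%Y-%m-%d:%H:%M:%OS", tz = "UTC")   # %OS keeps the milliseconds
as.numeric(t)   # seconds since 1970-01-01 00:00:00 UTC, fractional part included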
As others have said, the formula is complex and would make the whole code a mess. To avoid that, I am calculating the number of days from the input date to 01-01-2000. Since I know the epoch time of 01-01-2000, I can get the total epoch time by counting the number of days in between (taking leap years into account).
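A sketch of that day-counting idea, in R only to keep it short (the same arithmetic ports directly to C++), assuming the timestamp is UTC:

is_leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | (y %% 400 == 0)

days_since_2000 <- function(y, m, d) {
  mdays <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  yrs   <- if (y > 2000) 2000:(y - 1) else integer(0)
  days  <- sum(365 + is_leap(yrs))                      # whole years since 2000
  if (m > 1) days <- days + sum(mdays[1:(m - 1)]) + (m > 2 && is_leap(y))
  days + (d - 1)                                        # whole days in the current month
}

epoch_2000 <- 946684800   # epoch time of 2000-01-01 00:00:00 UTC
to_epoch <- function(y, m, d, hh, mm, ss)
  epoch_2000 + days_since_2000(y, m, d) * 86400 + hh * 3600 + mm * 60 + ss

to_epoch(2017, 10, 11, 4, 36, 26.376)   # 1507696586.376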
I have a problem moving beyond basic programming towards something more sophisticated. Could you help me adjust this code?
There are two vectors with dates and times: one is when activities happen, and the other is when triggers appear. The aim is to find, for each trigger, the nearest activity date/time after the trigger happens. The final result is the average of all the differences.
I have this code. It works, but it's very slow on large datasets.
time_activities <- as.POSIXct(c("2008-09-14 22:15:14", "2008-09-15 09:05:14",
                                "2008-09-16 14:05:14", "2008-09-17 12:05:14"),
                              format = "%Y-%m-%d %H:%M:%S")
time_triggers <- as.POSIXct(c("2008-09-15 06:05:14", "2008-09-17 12:05:13"),
                            format = "%Y-%m-%d %H:%M:%S")

result <- numeric(length(time_triggers))   # one difference (in minutes) per trigger

for (j in 1:length(time_triggers))
{
  for (i in 1:length(time_activities))
  {
    # first activity that happens after the trigger
    if (time_triggers[j] < time_activities[i])
    {
      result[j] <- ceiling(difftime(time_activities[i], time_triggers[j], units = 'mins'))
      break
    }
  }
}

print(mean(as.numeric(result)))
Can I somehow get rid of the loops and do everything with vectors? Maybe you can give me a hint about which function I could use to compare the dates all at once?
delay <- sapply(time_triggers, function(x)
  max(subset(difftime(x, time_activities, units = 'mins'),
             difftime(x, time_activities, units = 'mins') < 0)))
mean(delay[is.finite(delay)])
This should do the trick. As always, the apply family of functions is a good replacement for a for loop.
This gives the average number of minutes that an activity occurred after a trigger.
If you want to see what the activity delay was after each trigger (rather than just the mean of all the triggers), you can just remove the mean() at the beginning. The values will then correspond to each value in time_triggers.
UPDATE:
I updated the code to ignore Inf values as requested. Sadly, this means the code should be 2 lines rather than 1. If you really want, you can make this all one line, but then you will be doing the majority of the computation twice (not very efficient).
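If the sapply version is still too slow on a really large dataset, one fully vectorised alternative (not part of the answer above, just a sketch) is findInterval, which finds for every trigger the first activity that comes after it:

act <- sort(time_activities)
idx <- findInterval(as.numeric(time_triggers), as.numeric(act)) + 1   # first activity after each trigger
ok  <- idx <= length(act)                                             # drop triggers with no later activity
delay <- difftime(act[idx[ok]], time_triggers[ok], units = "mins")
mean(as.numeric(delay))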
I'm trying to graph data using statsd and graphite. I have a simple counter which I increment by 1, and when I graph the values for the counter over the day, I see strange values like 0.09 as the peak in my graph (see http://i.stack.imgur.com/o4gmz.png).
This graph should be showing 2 logins, but instead it's showing 0.09. If I change the time scale from 1 day to the last 15 minutes, then it correctly shows the two logins (see http://i.stack.imgur.com/23vDJ.png)
I've set up my finest retention to be in 10s increments in storage-schemas.conf:
retentions = 10s:7d,1m:21d,24h:5y
I've set up my storage-aggregation.conf file to sum counts:
[sum]
pattern = \.count$
xFilesFactor = 0
aggregationMethod = sum
(And, before you ask, yes; this is a .count).
If I try my URL with &rawData=true then in either case I see some Nones, some 0.0s, and a pair of 1.0s separated by some 0.0s. I never see these fractional values that somehow show up on the graph. So... Is this a bug? Am I doing something wrong?
There's also the consolidateBy function, which tells Graphite what to do when there aren't enough pixels to draw everything accurately. By default it uses the 'average' function, hence the strange results at larger time ranges. Here is an excerpt from the documentation:
When a graph is drawn where the width of the graph in pixels is smaller than the number of datapoints to be graphed, Graphite consolidates the values to prevent line overlap. The consolidateBy() function changes the consolidation function from the default of 'average' to one of 'sum', 'max', or 'min'. This is especially useful in sales graphs, where fractional values make no sense and a 'sum' of consolidated values is appropriate.
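For example, forcing a sum instead of an average in the render target looks like this (the metric name is just a placeholder):

&target=consolidateBy(your.counter.count, 'sum')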
Another function that could be useful is hitcount. Here is a short excerpt from the docs on why it's useful:
This function is like summarize(), except that it compensates automatically for different time scales (so that a similar graph results from using either fine-grained or coarse-grained records) and handles rarely-occurring events gracefully.
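For example, to get the number of hits per day regardless of how far you zoom out (again with a placeholder metric name):

&target=hitcount(your.counter.count, '1day')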
I spent some time scratching my head over why I was getting fractions for my counter at time ranges longer than a couple of hours, when my aggregation rule is max. It's pretty confusing, especially at the beginning when you're playing with single counters to see if everything works. Checking rawData is quite a good sanity check for debugging ;)