I have a table that has sequence numbers. It's a very big table, 16 million rows give or take. The table has a key, and it has events that happen to that key. Every time the key changes, the seq_num restarts, in theory.
In the original table there was a timestamp associated with each event. To get the duration of each event, I created a lag column and subtracted it from the timestamp of the current event, giving us the duration. This duration is called time_in_mins in the table below.
The new table has a number of properties:
Each key in this case is a car wash, with each event being assigned a category, so on line 3 the car was submitted to a drying procedure for 45 minutes.
The second line, which contains 23 minutes, isn't actually 23 minutes for the wash; it took the machine 23 minutes to power up.
For key 144 the record for powering up the machine is missing. This seems to be prevalent in the data set.
key  Event     time_in_mins  seq_num
1    Start     0             1
1    Wash      23            2
1    Dry       45            3
1    Wash      56            4
1    Wash      78            5
1    Boil      20            6
1    ShutDown  11            7
2    Start     0             1
2    Wash      11            2
2    Dry       12            3
-------------------------------------------
144  Wash      0             1
144  Wash      11            2
144  Dry       12            3
I would like to move the time_in_mins from the seq_num 2 record to the seq_num 1 record when that previous record is an Event of type Start, so that when we aggregate this later the minutes are properly assigned to starting up.
I could try to update the table by creating a new column, again with another lag, this time over time_in_mins, but this seems quite expensive.
Does anyone know of a clever way of doing this?
Edit 14/10/2016
The final output for the customer is like the below, albeit slightly out of order:
key  event     total_minutes
1    Start     23
1    Boil      20
1    Dry       45
1    Wash      134
1    ShutDown  11
2    Start     11
2    Dry       12
2    Wash      0
Thanks for your help
This will switch the 1st and 2nd values based on your description, resulting in a single STAT step in Explain:
SELECT key, seq_num, event,
       CASE
          -- seq_num 1 is a Start: take the time recorded on the seq_num 2 row
          WHEN seq_num = 1
           AND event = 'Start'
          THEN Min(CASE WHEN seq_num = 2 THEN time_in_mins END)
               Over (PARTITION BY key)
          -- seq_num 2 already handed its time to the Start row, so zero it out
          WHEN seq_num = 2
           AND Min(CASE WHEN seq_num = 1 THEN event END)
               Over (PARTITION BY key) = 'Start'
          THEN 0
          ELSE time_in_mins
       END AS new_time_in_mins
FROM tab
Now you can do the sum.
But it might be possible to include the logic in your previous step when you create the Volatile Table; can you add this Select there, too?
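For the sum itself, here is a minimal sketch, assuming the SELECT above has been materialised, for example into a volatile table called adjusted (a name made up for this example), together with its new_time_in_mins column:

-- "adjusted" is assumed to hold the output of the SELECT above
SELECT key, event,
       Sum(new_time_in_mins) AS total_minutes
FROM adjusted
GROUP BY key, event;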
I want to create a new variable called recency - how recent the customer's transaction is - which is useful for RFM analysis. The definition is as follows: we observe the transaction log of each customer weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable equals the number of the week if she made a transaction in that week; otherwise recency equals the previous recency value. To make it clearer, I have also created a demo data.table.
library(data.table)

demo <- data.table(cust = rep(c(1:3), 3))
demo[, week := seq(1, 3, 1), by = cust]
demo[, trans := c(1, 1, 1, 0, 1, 0, 1, 1, 0)]
demo[, rec := c(1, 1, 1, 1, 2, 1, 3, 3, 1)]
I need to calculate the "rec" variable, which I entered manually in the demo data.table. Please also consider that I can handle it with a loop, but that takes a lot of time. Therefore, I would be grateful if you could help me with a data.table way. Thanks in advance.
This works for the example:
demo[, v := cummax(week * trans), by = cust]

   cust week trans rec v
1:    1    1     1   1 1
2:    2    1     1   1 1
3:    3    1     1   1 1
4:    1    2     0   1 1
5:    2    2     1   2 2
6:    3    2     0   1 1
7:    1    3     1   3 3
8:    2    3     1   3 3
9:    3    3     0   1 1
We observe the transaction log of each customer weekly and assign a dummy variable called "trans" if the customer made a transaction. The recency variable equals the number of the week if she made a transaction in that week; otherwise recency equals the previous recency value.
This means taking the cumulative max week, ignoring weeks where there is no transaction. Since weeks are positive numbers, we can treat the no-transaction weeks as zero.
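A minimal self-contained sketch, rebuilding the demo table from the question and checking that v reproduces the manually entered rec:

library(data.table)

# Rebuild the demo table from the question
demo <- data.table(cust = rep(1:3, 3))
demo[, week := seq_len(.N), by = cust]
demo[, trans := c(1, 1, 1, 0, 1, 0, 1, 1, 0)]
demo[, rec := c(1, 1, 1, 1, 2, 1, 3, 3, 1)]

# Cumulative max of week, with no-transaction weeks treated as 0
demo[, v := cummax(week * trans), by = cust]

# Compare against the manually entered recency column
demo[, all(v == rec)]   # TRUE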
So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of the event, as well as data regarding the user specifically. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID  Date      Revenue  Time(s)  UserID  X  Y  Z
1        1/1/2017  $10      120      1       3  2  2
1        1/1/2017  $15      150      2       2  1  2
2        2/1/2017  $50      60       1       1  5  1
2        2/1/2017  $45      100      4       3  5  2
3        3/1/2017  $25      75       1       2  3  1
3        3/1/2017  $20      210      2       5  5  1
3        3/1/2017  $25      120      3       1  0  4
3        3/1/2017  $15      100      4       3  1  1
4        4/1/2017  $75      25       4       0  2  1
My goal is to build a model that can, given a specific user's performance history (in the example, attributes X, Y and Z), predict the revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information while presenting the moving average of each user's attributes up until the previous event. An example of the thought process: a user up until an event shows averages of 2, 3.5, and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of that event were $25 and 75; now I will use this as an input for my training.
Once again for clarity, here is an example of the output I would expect from applying such logic to the original table:
EventID  Date      Revenue  Time(s)  UserID  X  Y    Z
1        1/1/2017  $10      120      1       0  0    0
1        1/1/2017  $15      150      2       0  0    0
2        2/1/2017  $50      60       1       3  2    2
2        2/1/2017  $45      100      4       0  0    0
3        3/1/2017  $25      75       1       2  3.5  1.5
3        3/1/2017  $20      210      2       2  1    2
3        3/1/2017  $25      120      3       0  0    0
3        3/1/2017  $15      100      4       3  5    2
4        4/1/2017  $75      25       4       3  3    1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of his first appearance. In lines 5 and 9, the third appearances of users 1 and 4 start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of his attributes, and then shifting only the data in the attribute columns down one row. My questions are:
Is there a way to perform such a shift filtered by UserID when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table by always placing a result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-liner title, so I'm open to suggestions from any R experts who might think of a better way of describing it.
We can achieve your desired output using the dplyr package.
library(dplyr)
tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  mutate_at(c("X", "Y", "Z"), lag) %>%
  mutate_at(c("X", "Y", "Z"), funs(ifelse(is.na(.), 0, .))) %>%
  arrange(EventID, UserID) %>%
  ungroup()
We arrange the data, group it, and then apply the desired transformations (the dplyr functions cummean, lag, and replacing NA with 0 using an ifelse).
Once this is done, we rearrange the data to its original state, and ungroup it.
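For reference, the question's sample table can be recreated like this (tablinka is just the name the answer assumes, and Date is kept as plain text since its format is ambiguous):

# Hand-typed copy of the question's sample table, under the answer's assumed name
tablinka <- data.frame(
  EventID = c(1, 1, 2, 2, 3, 3, 3, 3, 4),
  Date    = c("1/1/2017", "1/1/2017", "2/1/2017", "2/1/2017",
              "3/1/2017", "3/1/2017", "3/1/2017", "3/1/2017", "4/1/2017"),
  Revenue = c(10, 15, 50, 45, 25, 20, 25, 15, 75),
  Time    = c(120, 150, 60, 100, 75, 210, 120, 100, 25),
  UserID  = c(1, 2, 1, 4, 1, 2, 3, 4, 4),
  X       = c(3, 2, 1, 3, 2, 5, 1, 3, 0),
  Y       = c(2, 1, 5, 5, 3, 5, 0, 1, 2),
  Z       = c(2, 2, 1, 2, 1, 1, 4, 1, 1)
)

Running the pipeline above on this data frame reproduces the expected output table, with zeros in X, Y and Z for each user's first appearance.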
I want to compute the time interval needed for pressing each key of the keyboard while writing a message.
In order to get the data, I use a program that produces a .csv after writing a text. This .csv has three columns: the first one contains the key pressed or released, the second column says whether the key has been pressed (0) or released (1), and the last column registers the time of each event.
The idea then is to compute the time interval needed for each key, from when it is pressed until it is released.
In the following very simple example we can see that key 16777248 was pressed at time 5.0067901611328125e-06 and released at time 0.21875882148742676; therefore, the time interval for this key is 0.21875882148742676 - 5.0067901611328125e-06. The time interval for key 72 should be 0.1861410140991211 - 0.08675289154052734.
16777248 0 5.0067901611328125e-06
72 0 0.08675289154052734
72 1 0.1861410140991211
16777248 1 0.21875882148742676
At the moment I have written code in R that, first of all, reads the .csv table. Then it searches for the first 1 in the second column and takes the corresponding key name. Next, it searches for the previous occurrence of that key with a 0. It computes the time interval, saves this value in a vector and then deletes these two rows from the matrix. It should repeat this until there are no more rows.
data.csv <- read.table("example.csv", header = F, sep = ",", dec = ".")
myTable <- data.csv

keySearched = 0
timeInterval = c(rep(0, length(myTable[, 1])))

L = (length(myTable[, 1]))
for (i in 1:L) {
  if (myTable[i, 2] == 1) {
    keySearched <- myTable[i, 1]
    for (j in 1:(i - 1)) {
      if (myTable[j, 1] == keySearched) {
        timeInterval[j] <- (myTable[i, 3] - myTable[j, 3])
        myTable <- myTable[-c(j, i), ]
      }
    }
  }
}
The problem is that sometimes the value myTable[x,y] is NA because the corresponding row has been deleted. In each iteration two rows are deleted (the one with the pressed key and the corresponding released key).
At this point I get the following error:
Error in if (myTable[j, 1] == keySearched) { :
missing value where TRUE/FALSE needed
How could I solve this problem?
You could try doing it like this:
key = c(3,6,3,8,8,3,6,3)
pressed = c(0,0,1,0,1,0,1,1)
time = c(12,14,16,17,19,22,34,35)
a = data.frame(key,time,pressed)
> a
  key time pressed
1   3   12       0
2   6   14       0
3   3   16       1
4   8   17       0
5   8   19       1
6   3   22       0
7   6   34       1
8   3   35       1
First order your data frame (or matrix if you prefer) by the key number and then the time. This should group pressed and released keys together. Then you calculate the time difference between the same keys using diff. And finally, set to NA those diffs that don't make sense.
a = a[order(a$key,a$time),]
a$lapse = c(0,diff(a$time))
a$lapse[seq(1,nrow(a),2)] = NA
> a
  key time pressed lapse
1   3   12       0    NA
3   3   16       1     4
6   3   22       0    NA
8   3   35       1    13
2   6   14       0    NA
7   6   34       1    20
4   8   17       0    NA
5   8   19       1     2
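Applied to the original three-column .csv from the question, the same order-then-diff idea might look like the sketch below; the column names are assumptions, since the file has no header:

# Read the keypress log; there is no header row, so we name the columns ourselves
# (the names "key", "pressed" and "time" are only illustrative)
keys <- read.table("example.csv", header = FALSE, sep = ",", dec = ".")
names(keys) <- c("key", "pressed", "time")

# Order so that, for each key, the press row is followed by its release row,
# then take the difference of consecutive times
keys <- keys[order(keys$key, keys$time), ]
keys$lapse <- c(NA, diff(keys$time))

# Keep the lapse only on release rows (pressed == 1); press rows get NA
keys$lapse[keys$pressed == 0] <- NA
keys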
We are looking at the delay of a server that can only take care of one customer at a time. Let's say we have two data frames: agg_data and ind_data.
> agg_data
  minute service_minute
1      0              1
2     60              3
3    120              2
4    180              3
5    240              2
6    300              4
agg_data provides the service time between two successive customers for every hour. For instance, between minutes 60 and 120 (the second hour from the beginning), we can serve a new customer every 3 minutes, so we can serve at most 20 customers in that hour.
ind_data provides the arrival minute of each customer:
  Arrival
1      51
2      63
3     120
4     121
5     125
6     129
I need to generate the departure minutes for the customers, which are affected by the service_minute in agg_data.
The desired output looks like:
  Arrival Dep
1      51  52
2      63  66
3     120 122
4     121 124
5     125 127
6     129 131
Here is my current code, which is correct but very inefficient:
ind_data$Dep = rep(0, nrow(ind_data))
# After the service time, the first customer can leave the system with no delay
# Service time is taken as that of the hour when the customer arrives
ind_data$Dep[1] = ind_data$Arrival[1] + agg_data[max(which(agg_data$minute<=ind_data$Arrival[1])),'service_minute']
# For customers after the first one,
# if they arrive when there is no delay (arrival time > departure time of the previous customer),
# then the service time is that of the hour when they arrive and
# departure time is arrival time + service time;
# if they arrive when there is delay (arrival time < departure time of the previous customer),
# then the service time is that of the hour when the previous customer leaves the system and
# the departure time is the departure time of the previous customer + service time.
for (i in 2:nrow(ind_data)) {
  ind_data$Dep[i] = max(
    ind_data$Dep[i-1] + agg_data[max(which(agg_data$minute <= ind_data$Dep[i-1])), 'service_minute'],
    ind_data$Arrival[i] + agg_data[max(which(agg_data$minute <= ind_data$Arrival[i])), 'service_minute']
  )
}
I think it is the step where we search for the right service time in agg_data that takes long. Is there a more efficient algorithm?
Thank you.
This should be fairly efficient. It's a very simple lookup problem with an obvious vectorized solution:
out <- data.frame(Arrival = ind_data$Arrival,
                  Dep = ind_data$Arrival +
                        agg_data$service_minute[ # need an index to choose min
                          findInterval(ind_data$Arrival, agg_data$minute)])
> out
  Arrival Dep
1      51  52
2      63  66
3     120 122
4     121 123
5     125 127
6     129 131
I trust my code more than your example. I think there are obvious errors in it.
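If the chain effect in the question (one customer's departure delaying the next) does need to be kept, a middle-ground sketch is to keep the loop but replace the max(which(...)) lookups with findInterval; the names arr_service, prev_service and Dep below are only illustrative:

# Rebuild the question's two data frames
agg_data <- data.frame(minute = c(0, 60, 120, 180, 240, 300),
                       service_minute = c(1, 3, 2, 3, 2, 4))
ind_data <- data.frame(Arrival = c(51, 63, 120, 121, 125, 129))

# Vectorized lookup of the service time in force at each arrival minute
arr_service <- agg_data$service_minute[findInterval(ind_data$Arrival, agg_data$minute)]

# Short loop only for the chained dependency on the previous departure
Dep <- numeric(nrow(ind_data))
Dep[1] <- ind_data$Arrival[1] + arr_service[1]
for (i in 2:nrow(ind_data)) {
  prev_service <- agg_data$service_minute[findInterval(Dep[i - 1], agg_data$minute)]
  Dep[i] <- max(Dep[i - 1] + prev_service,
                ind_data$Arrival[i] + arr_service[i])
}
ind_data$Dep <- Dep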
I have a table that lists items like below. It basically has Operation Numbers (OP_NO) that tell where a product is in the process. These OP numbers can be either Released or Completed. They follow a process, as in 10 must happen before 20, 20 must happen before 30, etc. However, in reality users do not update all steps, so we end up with some items marked complete out of order while the earlier steps are not, as shown below (OP 30 is completed but OP 10 and 20 are not).
I basically want to produce a listing showing the furthest point of completion for each ORDER_ID. I figured I could do this by querying for STATUS = 'Completed' and sorting by OP_NO descending. However, I can't figure out how to produce only one result for each ORDER_ID. For example, in ORDER_ID 345, steps 10 and 20 are completed; I would only want to return that step 20 is where it currently is. I figured I could do this with 'WHERE ROWNUM <= 1' but haven't had much luck. Could any experts weigh in?
Thanks!
ORDER_ID  ORDER_SEC  ORDER_RELEASE  OP_NO  STATUS     Description
123       2          3              10     Released   Op10
123       2          3              20     Released   Op20
123       2          3              30     Completed  Op30
123       2          3              40     Released   Op40
345       1          8              10     Completed  Op10
345       1          8              20     Completed  Op20
345       1          8              30     Released   Op30
345       1          8              40     Released   Op40
If I understand correctly what you want, the below should do what you need. Just replace test_table with your table name.
select *
from test_table tst
where status = 'Completed'
  and op_no = (select max(op_no)
               from test_table tst1
               where tst.order_id = tst1.order_id
                 and status = 'Completed');
Given your sample data this produced the below results.
Order_Id  Order_Sec  Order_Release  op_no  Status     Description
123       2          3              30     Completed  Op30
345       1          8              20     Completed  Op20
Cheers
Shaun Peterson
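Since the question mentions trying ROWNUM, an analytic-function variant is another option; this is only a sketch, assuming Oracle-style ROW_NUMBER() support and the same test_table name as above:

-- Rank the completed steps per order, highest op_no first, and keep the top one
select order_id, order_sec, order_release, op_no, status, description
from (select t.*,
             row_number() over (partition by order_id
                                order by op_no desc) as rn
      from test_table t
      where status = 'Completed') ranked
where rn = 1;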