Select consecutive date entries - r

I have updated the question, as a) I did not articulate the question clearly on the first attempt, and b) my exact need has also shifted somewhat.
I especially want to thank Hemmo for his great help so far, and my apologies for not articulating my question clearly enough to him. His code (which addressed an earlier version of the problem) is shown in the answer section.
At a high level, I am looking for code that identifies and differentiates the blocks of consecutive free time of different individuals. More specifically, the code would ideally:
Check whether an activity is labelled as "Free".
Check whether the adjacent weeks (one week earlier, one week later) of time spent by the same person were also labelled as "Free".
Give the entire block of consecutive weeks of that person that are labelled "Free" an indicator in the desired outcome column. Note that the length of the time periods (e.g. 1, 4, or 8 consecutive weeks) will vary.
Finally, because I need to analyse the characteristics of these clusters further, different blocks should receive different indicators (e.g. Paul's March block would have value 1, his May block value 2, and Kim's block in March would have value 3).
Hopefully this becomes clearer when you look at the example dataframe (see the desired final column).
Any help is much appreciated; code for the test dataframe is below.
Many thanks in advance,
W
Example (note that the last column should be generated by the code; it is included purely as an illustration):
Week Name Activity Hours Desired_Outcome
1 01/01/2013 Paul Free 40 1
2 08/01/2013 Paul Free 10 1
3 08/01/2013 Paul Project A 30 0
4 15/01/2013 Paul Project B 30 0
5 15/01/2013 Paul Project A 10 0
6 22/01/2013 Paul Free 40 2
7 29/01/2013 Paul Project B 40 0
8 05/02/2013 Paul Free 40 3
9 12/02/2013 Paul Free 10 3
10 19/02/2013 Paul Free 30 3
11 01/01/2013 Kim Project E 40 0
12 08/01/2013 Kim Free 40 4
13 15/01/2013 Kim Free 40 4
14 22/01/2013 Kim Project E 40 0
15 29/01/2013 Kim Free 40 5
Code for dataframe:
Name <- c(rep("Paul", 10), rep("Kim", 5))
Week <- c("01/01/2013","08/01/2013","08/01/2013","15/01/2013","15/01/2013",
          "22/01/2013","29/01/2013","05/02/2013","12/02/2013","19/02/2013",
          "01/01/2013","08/01/2013","15/01/2013","22/01/2013","29/01/2013")
Activity <- c("Free","Free","Project A","Project B","Project A","Free","Project B",
              "Free","Free","Free","Project E","Free","Free","Project E","Free")
Hours <- c(40,10,30,30,10,40,40,40,10,30,40,40,40,40,40)
Desired_Outcome <- c(1,1,0,0,0,2,0,3,3,3,0,4,4,0,5)
# data.frame() keeps Hours and Desired_Outcome numeric; cbind() would coerce everything to character
df <- data.frame(Week, Name, Activity, Hours, Desired_Outcome)
df

EDIT: This was messy already as the question was edited several times, so I removed old answers.
checkFree <- function(df){
  df$Week <- as.Date(df$Week, format = "%d/%m/%Y")
  df$outcome <- numeric(nrow(df))
  if(df$Activity[1] == "Free"){  # check the first row
    counter <- 1
    df$outcome[1] <- counter
  } else counter <- 0
  for(i in 2:nrow(df)){
    if(df$Activity[i] == "Free"){
      LastWeek <- (df$Week >= (df$Week[i] - 7) &
                   df$Week < df$Week[i])
      # start a new block if the previous week contained no "Free" activity
      if(all(df$Activity[LastWeek] != "Free"))
        counter <- counter + 1
      df$outcome[i] <- counter
    }
  }
  df
}

splitdf <- split(df, df$Name)
df <- unsplit(lapply(splitdf, checkFree), df$Name)

uniqs <- unique(df$Name)  # for renumbering blocks across persons
for(i in 2:length(uniqs))
  df$outcome[df$Name == uniqs[i] & df$outcome > 0] <-
    max(df$outcome[df$Name == uniqs[i-1]]) +
    df$outcome[df$Name == uniqs[i] & df$outcome > 0]
df
That should do it, although the above code is probably far from optimal.
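For reference, here is a more compact base R sketch of the same block-numbering idea, starting again from the freshly built df. It is only a sketch: each run of consecutive "Free" weeks per person gets its own id, although the numbering order may differ from the illustrative Desired_Outcome column.
df$Week    <- as.Date(df$Week, format = "%d/%m/%Y")
df$outcome <- 0
isFree  <- df$Activity == "Free"
freeIdx <- which(isFree)[order(df$Name[isFree], df$Week[isFree])]
wk <- df$Week[freeIdx]
nm <- as.character(df$Name[freeIdx])
# a new block starts at a change of person or a gap of more than one week
newBlock <- c(TRUE, nm[-1] != nm[-length(nm)] | diff(wk) > 7)
df$outcome[freeIdx] <- cumsum(newBlock)
df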

Using the comment by user1885116 to Hemmo's answer as a guide to what is desired, here is a somewhat simpler approach:
N <- 1
x <- with(df, df[Activity=='Free',])
y <- with(x, diff(Week)) <= N*7
df$outcome <- 0
df[rownames(x[c(y, FALSE) | c(FALSE, y),]),]$outcome <- 1
df
## Week Activity Hours Desired_Outcome outcome
## 1 2013-01-01 Project A 40 0 0
## 2 2013-01-08 Project A 10 0 0
## 3 2013-01-08 Free 30 1 1
## 4 2013-01-15 Project B 30 0 0
## 5 2013-01-15 Free 10 1 1
## 6 2013-01-22 Project B 40 0 0
## 7 2013-01-29 Free 40 0 0
## 8 2013-02-05 Project C 40 0 0

Related

Updating a table with the rolling average of previous rows in R?

So I have a table where every row represents a given user in a specific event. Each row contains two types of information: the outcomes of the event, as well as data regarding the user specifically. Multiple users can take part in the same event.
For clarity, here is a simplified example of such a table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 3 2 2
1 1/1/2017 $15 150 2 2 1 2
2 2/1/2017 $50 60 1 1 5 1
2 2/1/2017 $45 100 4 3 5 2
3 3/1/2017 $25 75 1 2 3 1
3 3/1/2017 $20 210 2 5 5 1
3 3/1/2017 $25 120 3 1 0 4
3 3/1/2017 $15 100 4 3 1 1
4 4/1/2017 $75 25 4 0 2 1
My goal is to build a model that can, given a specific user's performance history (in the example attributes X, Y and Z), predict a given revenue and time for an event.
What I am after now is a way to format my data in order to train and test such a model. More specifically, I want to transform the table so that each row keeps the event-specific information while presenting the moving average of each user's attributes up to the previous event. An example of the thought process: up until an event, a user shows averages of 2, 3.5, and 1.5 in attributes X, Y and Z respectively, and the revenue and time outcomes of that event were $25 and 75; I will use this as an input for my training.
Once again for clarity, here is an example of the output I would expect from applying this logic to the original table:
EventID Date Revenue Time(s) UserID X Y Z
1 1/1/2017 $10 120 1 0 0 0
1 1/1/2017 $15 150 2 0 0 0
2 2/1/2017 $50 60 1 3 2 2
2 2/1/2017 $45 100 4 0 0 0
3 3/1/2017 $25 75 1 2 3.5 1.5
3 3/1/2017 $20 210 2 2 1 2
3 3/1/2017 $25 120 3 0 0 0
3 3/1/2017 $15 100 4 3 5 2
4 4/1/2017 $75 25 4 3 3 1.5
Notice that in each user's first appearance all attributes are 0, since we still know nothing about them. Also, in a user's second appearance, all we know is the result of the first appearance. In lines 5 and 9, users 1 and 4's third appearances start to show the rolling mean of their previous performances.
If I were dealing with only a single user, I would tackle this problem by simply calculating the moving average of his attributes, and then shifting only the data in the attribute columns down one row. My questions are:
Is there a way to perform such a shift filtered by UserID when dealing with a table with multiple users?
Or is there a better way in R to calculate the rolling mean directly from the original table by always placing a result in each user's next appearance?
It can be assumed that all rows are already sorted by date. Any other tips or references related to this problem are also welcome.
Also, it wasn't obvious how to summarize my question in a one-line title, so I'm open to suggestions from any R experts who might think of a better way of describing it.
We can achieve your desired output using the dplyr package.
library(dplyr)

tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate_at(c("X", "Y", "Z"), cummean) %>%
  mutate_at(c("X", "Y", "Z"), lag) %>%
  mutate_at(c("X", "Y", "Z"), funs(ifelse(is.na(.), 0, .))) %>%
  arrange(EventID, UserID) %>%
  ungroup()
We arrange the data, group it, and then apply the desired transformations (the dplyr functions cummean, lag, and replacing NA with 0 using an ifelse).
Once this is done, we rearrange the data to its original state, and ungroup it.
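In newer versions of dplyr (1.0 or later), where mutate_at() and funs() are superseded, the same pipeline can be sketched with across(); the lag's default argument replaces the first-appearance NA with 0 directly (assuming the data frame is called tablinka as above):
library(dplyr)

tablinka %>%
  arrange(UserID, EventID) %>%
  group_by(UserID) %>%
  mutate(across(c(X, Y, Z), ~ lag(cummean(.x), default = 0))) %>%
  arrange(EventID, UserID) %>%
  ungroup()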

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in 'possible' called "between" and a column in the 'truth' data frame called "match". For every value from 'possible' that falls between, I'd like a 1, otherwise a 0. For all of the rows in 'truth' that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead, I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars = c("start", "stop"), value.name = "times")[
  possible, on = "times", roll = TRUE
][, .(id = i.id, truthid = id, times, status = factor(variable, labels = c("in", "out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
# generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
# need the %between% operator
library(data.table)

# initialize vectors - 0 (or FALSE) by default
truth.match      <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))

# iterate through the 'possible' dataframe
for (i in 1:nrow(possible)){
  # get a boolean vector showing whether any of the 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  # if any are true then update the match and between vectors
  if(any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}

# i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
# similarly, betweenAny
possible$betweenAny <- possible.between
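If the row-by-row loop becomes slow on larger data, the same pairwise test can be vectorised with outer(). This is only a sketch, assuming the generated 'truth' keeps its interval bounds in columns 2 and 3 and 'possible' has a Times column; widen the bounds (e.g. by 2) to allow the 2-second tolerance mentioned in the question:
# logical matrix: one row per candidate time, one column per truth interval
hits <- outer(possible$Times, truth[[2]], ">=") & outer(possible$Times, truth[[3]], "<=")
possible$betweenAny <- as.integer(rowSums(hits) > 0)
truth$anyMatch      <- as.integer(colSums(hits) > 0)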

making a table with multiple columns in r

I'm obviously a novice at writing R code.
I have tried multiple solutions to my problem from Stack Overflow, but I'm still stuck.
My dataset is carcinoid: patients with a small bowel cancer, with multiple variables.
I would like to know how different variables are distributed:
carcinoid$met_any - metastatic disease, 1 = yes, 2 = no (computed variable)
carcinoid$liver_mets_y_n - liver metastases, 1 = yes, 2 = no
carcinoid$regional_lymph_nodes_y_n - regional lymph nodes, 1 = yes, 2 = no
carcinoid$peritoneal_carcinosis_y_n - peritoneal carcinosis, 1 = yes, 2 = no
I have tried this solution, which is close to my desired result:
ddply(carcinoid, .(carcinoid$met_any), summarize,
      livermetastases = sum(carcinoid$liver_mets_y_n == "1"),
      regionalmets    = sum(carcinoid$regional_lymph_nodes_y_n == "1"),
      pc              = sum(carcinoid$peritoneal_carcinosis_y_n == "1"))
with the result being:
carcinoid$met_any livermetastases regionalmets pc
1 1 21 46 7
2 2 21 46 7
Now, I expected the row with 2 (= no metastases) to be empty. I would also like the rows in the carcinoid$met_any column to give the number of patients.
If someone could help me it would be very much appreciated!
John
Edit
My dataset (although the actual column numbers are 1, 43, 28, 31, 33):
1 = yes, 2 = no
case_nr met_any liver_mets_y_n regional_lymph_nodes_y_n pc
1 1 1 1 2
2 1 2 1 2
3 2 2 2 2
4 1 2 1 1
5 1 2 1 1
Desired output - I want to count the number of 1s and 2s; if it works, all the 1s should end up in the met_any = 1 row:
nr liver_mets regional_lymph_nodes pc
met_any=1 4 1 4 2
met_any=2 1 4 1 3
EDIT
Although I probably was very unclear in my question, with your help I could make the table I needed!
setDT(carcinoid)[,lapply(.SD,table),.SDcols=c(43,28,31,33,17)]
gives
met_any lymph_nod liver_met paraortal extrahep
1: 50 46 21 6 15
2: 111 115 140 151 146
I am very grateful! #mtoto provided the solution.
John
Based on your example data, this data.table approach works:
library(data.table)
setDT(df)[,lapply(.SD,table),.SDcols=c(2:5)]
# met_any liver_mets_y_n regional_lymph_nodes_y_n pc
# 1: 4 1 4 2
# 2: 1 4 1 3
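For comparison, the same per-column counts can be sketched in base R with sapply(), assuming every selected column contains both a 1 and a 2 so that the result simplifies to a matrix; the counts should match the data.table output above:
sapply(df[2:5], table)
#   met_any liver_mets_y_n regional_lymph_nodes_y_n pc
# 1       4              1                        4  2
# 2       1              4                        1  3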

select records according to the difference between records R

I hope someone can suggest something for this "problem", because I really don't know how to proceed...
Well, my data are like this:
data <- data.frame(site = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                   time = c(100, 180, 245, 5, 55, 130, 70, 120, 160))
where time is in minutes.
I want to select, for each site, only the records for which the difference is more than 60, so the output should be like this:
out<-data[c(1:4,6,7,9),]
What I have tried so far: to get the differences, I use this:
difference<-stack(tapply(data$time,data$site,diff))
but then I have no idea how to pick out the records which satisfy my condition...
If there is already a similar question, I apologize; I did search for a while.
To make things clear, as the definition of difference was probably ambiguous: I need to select all the records (for each site) which are separated by at least 60 minutes, not only those that are strictly consecutive in time.
Specifically,
> out
  site time
1    A  100  # included because the difference between records 2 and 1 is > 60
2    A  180  # included because the difference between records 3 and 2 is > 60
3    A  245  # included because it is more than 60 minutes after record 2
4    B    5  # included because the difference between records 6 and 4 is > 60
6    B  130  # included because it is more than 60 minutes after record 4
7    C   70  # included because the difference between records 9 and 7 is > 60
9    C  160  # included because it is more than 60 minutes after record 7
Maybe, to solve the "problem", it could be useful to consider the results of the difference, something like this:
> difference
  values ind
1     80   A  # include records 1 and 2
2     65   A  # include records 2 and 3
3     50   B  # include only record 4
4     75   B  # include record 6 because it is (50+75) > 60 min after record 4
5     50   C  # include only record 7
6     40   C  # include record 9 because it is (50+40) > 60 min after record 7
Thanks for the help.
data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x){
c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
# 9 C 160
# calculate the differences within each site
data$diff <- unlist(by(data$time, data$site, function(x) c(NA, diff(x))))
# subset the data
data[is.na(data$diff) | data$diff > 60, ]
Using plyr:
library(plyr)
ddply(data, .(site), function(x) x[c(TRUE, diff(x$time) > 60), ])
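For completeness, the same cumulative-gap logic can also be sketched with dplyr; within each site, filter() keeps the first record and every later record lying more than 60 minutes after it, mirroring the ave() solution above:
library(dplyr)
data %>%
  group_by(site) %>%
  filter(c(TRUE, cumsum(diff(time)) > 60)) %>%
  ungroup()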

Update dataframe column efficiently using some hashmap method in R

I am new to R and can't figure out what I might be doing wrong in the code below and how I could speed it up.
I have a dataset and would like to add a column containing an average value calculated from two columns of data. Please take a look at the code below (warning: it may take some time to read my question, but the code runs fine in R).
First, let me define a dataset df (again, I apologize for the long description of the code):
> df<-data.frame(prediction=sample(c(0,1),10,TRUE),subject=sample(c("car","dog","man","tree","book"),10,TRUE))
> df
prediction subject
1 0 man
2 1 dog
3 0 man
4 1 tree
5 1 car
6 1 tree
7 1 dog
8 0 tree
9 1 tree
10 1 tree
Next I add the new column called subjectRate to df:
df$subjectRate <- with(df,ave(prediction,subject))
> df
prediction subject subjectRate
1 0 man 0.0
2 1 dog 1.0
3 0 man 0.0
4 1 tree 0.8
5 1 car 1.0
6 1 tree 0.8
7 1 dog 1.0
8 0 tree 0.8
9 1 tree 0.8
10 1 tree 0.8
From the new table definition I generate a rateMap, so that new data can automatically be filled in with a subjectRate column initialized with the previously obtained averages:
rateMap <- df[!duplicated(df[, c("subjectRate")]), c("subject","subjectRate")]
> rateMap
subject subjectRate
1 man 0.0
2 dog 1.0
4 tree 0.8
Now I define a new dataset with a combination of the old subjects in df and new subjects:
> dfNew<-data.frame(prediction=sample(c(0,1),15,TRUE),subject=sample(c("car","dog","man","cat","book","computer"),15,TRUE))
> dfNew
prediction subject
1 1 man
2 0 cat
3 1 computer
4 0 dog
5 0 book
6 1 cat
7 1 car
8 0 book
9 0 computer
10 1 dog
11 0 cat
12 0 book
13 1 dog
14 1 man
15 1 dog
My question: how do I create the third column efficiently? Currently I am running the code below, where I look up the subject rate in the map and insert the value if found, or 0.5 if not.
> all_facts<-levels(factor(rateMap$subject))
> dfNew$subjectRate <- sapply(dfNew$subject,function(t) ifelse(t %in% all_facts,rateMap[as.character(rateMap$subject) == as.character(t),][1,"subjectRate"],0.5))
> dfNew
prediction subject subjectRate
1 1 man 0.0
2 0 cat 0.5
3 1 computer 0.5
4 0 dog 1.0
5 0 book 0.5
6 1 cat 0.5
7 1 car 0.5
8 0 book 0.5
9 0 computer 0.5
10 1 dog 1.0
11 0 cat 0.5
12 0 book 0.5
13 1 dog 1.0
14 1 man 0.0
15 1 dog 1.0
But with a real dataset (more than 200,000 rows) and multiple columns similar to subject for which to compute the average, the code takes a very long time to run. Can somebody suggest a better way to do what I am trying to achieve? Maybe some kind of merge, but I am out of ideas.
Thank you.
I suspect (but am not sure, since I haven't tested it) that this will be faster:
dfNew$subjectRate <- rateMap$subjectRate[match(dfNew$subject,rateMap$subject)]
since it mostly uses just indexing and match. It's certainly a bit simpler, I think. This will fill in the "new" values with NAs rather than 0.5, which can then be filled in however you like with
dfNew$subjectRate[is.na(dfNew$subjectRate)] <- newValue
If the ave piece is particularly slow, the standard recommendation these days is to use the data.table package:
require(data.table)
dft <- as.data.table(df)
setkeyv(dft, "subject")
dft[, subjectRate := mean(prediction), by = subject]
and this will probably attract a few comments suggesting ways to eke out a bit more speed from that data.table aggregation in the last line. Indeed, merging or joining using pure data.tables may be even slicker (and fast), so you might want to investigate that option as well. (See the very bottom of ?data.table for a bunch of examples.)
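A rough sketch of that join idea, assuming rateMap has one row per subject and that subjects absent from the map should default to 0.5:
library(data.table)
dfNewDT   <- as.data.table(dfNew)
rateMapDT <- as.data.table(rateMap)
# join on subject as character to avoid factor-level mismatches
dfNewDT[,   subject := as.character(subject)]
rateMapDT[, subject := as.character(subject)]
# update join: pull subjectRate from rateMapDT where the subject matches
dfNewDT[rateMapDT, subjectRate := i.subjectRate, on = "subject"]
dfNewDT[is.na(subjectRate), subjectRate := 0.5]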
