I have a data set of different patient IDs, clinic visit dates, and attendance (see the example data below, separated by patient ID for clarity).
I am interested in sequentially counting treatment episodes, which are defined as attending >= 4 visits in the starting month, followed by >= 1 visit in every month afterwards. If a patient attends < 1 visit in a month after starting (i.e., after completing their initial >= 4 visits in the starting month), that treatment episode is considered ended. A new treatment episode subsequently starts the next time the patient attends >= 4 visits in a given month, and that same episode continues as long as the patient attends >= 1 visit per month thereafter. When patients either do not meet or break this pattern, I'd like to assign 0.
Example data (note: I've excluded each day's date to keep the example from being excessively long, and reproduced the same dates in the desired output to give a clearer picture):
Patient ID  Visit Date  Attendance
1           01/01/2023  Yes
1           01/02/2023  Yes
1           01/03/2023  Yes
1           01/04/2023  Yes
1           02/01/2023  Yes
1           03/01/2023  Yes
1           04/01/2023  No
1           05/01/2023  Yes
1           06/01/2023  No
1           07/01/2023  Yes
1           07/02/2023  Yes
1           07/03/2023  Yes
1           07/04/2023  Yes
1           08/01/2023  Yes
Patient ID  Visit Date  Attendance
2           01/01/2023  Yes
2           02/01/2023  Yes
2           03/01/2023  Yes
2           03/02/2023  Yes
2           03/03/2023  Yes
2           03/04/2023  Yes
2           04/01/2023  Yes
2           05/01/2023  Yes
2           07/01/2023  Yes
Desired data:
Patient ID  Visit Date  Attendance  Tx Episode
1           01/01/2023  Yes         1
1           01/02/2023  Yes         1
1           01/03/2023  Yes         1
1           01/04/2023  Yes         1
1           02/01/2023  Yes         1
1           03/01/2023  Yes         1
1           04/01/2023  No          0
1           05/01/2023  Yes         0
1           06/01/2023  No          0
1           07/01/2023  Yes         2
1           07/02/2023  Yes         2
1           07/03/2023  Yes         2
1           07/04/2023  Yes         2
1           08/01/2023  Yes         2
Patient ID  Visit Date  Attendance  Tx Episode
2           01/01/2023  Yes         0
2           02/01/2023  Yes         0
2           03/01/2023  Yes         1
2           03/02/2023  Yes         1
2           03/03/2023  Yes         1
2           03/04/2023  Yes         1
2           04/01/2023  Yes         1
2           05/01/2023  Yes         1
2           07/01/2023  Yes         0
I am somewhat new to programming in R and initially attempted to use ifelse(), but wasn't able to come up with logicals that worked. I've also attempted to write loops, which have failed to run.
Any help would be greatly appreciated and I'm happy to provide more detail if the above isn't clear.
Thanks in advance for your time/effort!
This seems fairly complex, and I'm not sure of the entire logic, but I thought this might help. It uses the lubridate library, but otherwise base R functions. A helper function elapsed_months was borrowed from here.
First, an empty list enc_list is created that will store results for the final data.frame.
We construct two loops: the first to analyze data for each Patient_ID, and the second to evaluate encounters for that given patient.
Note that I subset to rows where Attendance is "Yes": if a visit was not attended, we would not want to include that row in the evaluation. This is an assumption on my part.
A table of months for Visit_Date is made so that we know which months have >= 4 encounters.
enc_active is a simple flag indicating, row by row, whether we are dealing with an active treatment episode. enc_num is the treatment-episode number, which is incremented when new treatment episodes are discovered.
Going row by row through the encounter data, we first check whether we are in an active episode. If we are, we check whether the number of elapsed months since the previous visit is 0 (same month) or 1 (consecutive month). If true, we record that episode number. If not, the treatment episode is over.
If we are not in an active episode, we check whether the visit falls in a month with 4+ encounters, and if it does, start a new active treatment episode. Note that where this is not true, it records 0 for Tx_Episode and the flag stays reset.
The final results are stored back in the list, which is bound together with rbind (row bind) at the end.
The merge combines the results with the original data.frame, which is needed since the rows with Attendance of "No" were removed early on. Since the merge leaves Tx_Episode missing (NA) for those "No" rows, we replace the NAs with 0.
Some example data was adapted from your comment. Please let me know if you have questions; happy to do a StackOverflow chat to go over it as well. I do have an interest in this kind of data from my own experience.
library(lubridate)

# Number of whole months between two dates
elapsed_months <- function(end_date, start_date) {
  ed <- as.POSIXlt(end_date)
  sd <- as.POSIXlt(start_date)
  12 * (ed$year - sd$year) + (ed$mon - sd$mon)
}

enc_list <- list()

for (id in unique(df$Patient_ID)) {
  enc_data <- df[df$Patient_ID == id & df$Attendance == "Yes", ]
  # Visits per calendar month, to find months with >= 4 encounters
  enc_month <- table(cut(enc_data$Visit_Date, 'month'))
  enc_active <- FALSE
  enc_num <- 0
  for (i in 1:nrow(enc_data)) {
    if (enc_active) {
      if (elapsed_months(enc_data$Visit_Date[i], enc_data$Visit_Date[i - 1]) <= 1) {
        enc_data[i, "Tx_Episode"] <- enc_num
      } else {
        enc_active <- FALSE
        enc_data[i, "Tx_Episode"] <- 0
      }
    } else {
      if (enc_month[as.character(floor_date(enc_data$Visit_Date[i], unit = "month"))] >= 4) {
        enc_active <- TRUE
        enc_num <- enc_num + 1
        enc_data[i, "Tx_Episode"] <- enc_num
      } else {
        enc_data[i, "Tx_Episode"] <- 0
      }
    }
  }
  enc_list[[id]] <- enc_data
}

df_final <- merge(
  do.call('rbind', enc_list),
  df,
  all.y = TRUE
)
df_final$Tx_Episode[is.na(df_final$Tx_Episode)] <- 0
Output
Patient_ID Visit_Date Attendance Tx_Episode
1 1 2023-01-01 Yes 1
2 1 2023-01-02 Yes 1
3 1 2023-01-03 Yes 1
4 1 2023-01-04 Yes 1
5 1 2023-02-01 Yes 1
6 1 2023-03-01 Yes 1
7 1 2023-04-01 No 0
8 1 2023-05-01 Yes 0
9 1 2023-06-01 No 0
10 1 2023-07-01 Yes 2
11 1 2023-07-02 Yes 2
12 1 2023-07-03 Yes 2
13 1 2023-07-04 Yes 2
14 1 2023-08-01 Yes 2
15 2 2023-01-01 Yes 0
16 2 2023-02-01 Yes 0
17 2 2023-03-01 Yes 1
18 2 2023-03-02 Yes 1
19 2 2023-03-03 Yes 1
20 2 2023-03-04 Yes 1
21 2 2023-04-01 Yes 1
22 2 2023-04-02 Yes 1
23 2 2023-04-03 Yes 1
24 2 2023-04-04 Yes 1
25 2 2023-06-12 Yes 0
Related
I am very new to R, so I apologise if this looks simple to someone.
I am trying to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile or controlfile.
casefile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
controlfile:
GENE CASE_COUNT_HET CASE_COUNT_CH CASE_COUNT_HOM CASE_TOTAL_AC
ENSG00000124209 1 0 0 1
ENSG00000064703 1 1 0 9
ENSG00000171408 1 0 0 1
ENSG00000110514 1 1 1 12
ENSG00000247077 1 1 1 7
ENSG00000174776 1 1 0 2
ENSG00000076864 1 0 1 13
ENSG00000086015 1 0 1 25
I have this script:
#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))
parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")
args <- parser$parse_args()
case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="\t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="\t")
names(control.dat)[1]<-"GENE"
dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0
dat$P_DOM<-0
dat$P_REC<-0
for(i in 1:nrow(dat)){
  # Dominant model
  case_count <- dat[i,]$CASE_COUNT_HET + dat[i,]$CASE_COUNT_HOM
  control_count <- dat[i,]$CONTROL_COUNT_HET + dat[i,]$CONTROL_COUNT_HOM
  if(case_count > args$casesize){
    case_count <- args$casesize
  } else if(case_count < 0){
    case_count <- 0
  }
  if(control_count > args$controlsize){
    control_count <- args$controlsize
  } else if(control_count < 0){
    control_count <- 0
  }
  mat <- cbind(c(case_count, (args$casesize - case_count)),
               c(control_count, (args$controlsize - control_count)))
  dat[i,]$P_DOM <- fisher.test(mat, alternative="greater")$p.value
and the problem starts here:
case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM
case_count and control_count come out as NULL; however, the corresponding columns in both input files are NOT empty.
I tried running the script above with absolute numbers (1000 and 2000) assigned to the variables case_count and control_count, and the script worked without issues.
The main purpose of the code:
https://github.com/mhguo1/TRAPD
Run burden testing

This script will run the actual burden testing. It performs a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in cases as compared to controls for each gene. It will perform this burden testing under a dominant and a recessive model.

It requires R; the script was tested using R v3.1, but any version of R should work. The script should be run as:
Rscript burden.R --casefile casecounts.txt --casesize 100 --controlfile controlcounts.txt --controlsize 60000 --output burden.out.txt
The script has 5 required options:
--casefile: Path to the counts file for the cases, as generated in Step 2A
--casesize: Number of cases that were tested in Step 2A
--controlfile: Path to the counts file for the controls, as generated in Step 2B
--controlsize: Number of controls that were tested in Step 2B. If using ExAC or gnomAD, please refer to the respective documentation for total sample size
--output: Output file path/name

Output: A tab delimited file with 10 columns:
#GENE: Gene name
CASE_COUNT_HET: Number of cases carrying heterozygous qualifying variants in a given gene
CASE_COUNT_CH: Number of cases carrying potentially compound heterozygous qualifying variants in a given gene
CASE_COUNT_HOM: Number of cases carrying homozygous qualifying variants in a given gene
CASE_TOTAL_AC: Total AC for a given gene
CONTROL_COUNT_HET: Approximate number of controls carrying heterozygous qualifying variants in a given gene
CONTROL_COUNT_HOM: Number of controls carrying homozygous qualifying variants in a given gene
CONTROL_TOTAL_AC: Total AC for a given gene
P_DOM: p-value under the dominant model
P_REC: p-value under the recessive model
I am trying to run a genetic variant burden test with VCF files and external gnomAD controls. I found this repo suitable and am now trying to fix bugs in it.
As a newbie in R statistics, I would be happy for any suggestions. Thank you!
If you want all rows from both files, you can use a full join with by = "GENE" and whatever suffixes you wish:
library(dplyr)
z <- full_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
6 ENSG00000174776 NA NA NA NA
7 ENSG00000076864 NA NA NA NA
8 ENSG00000086015 NA NA NA NA
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
6 1 1 0 2
7 1 0 1 13
8 1 0 1 25
If you want only the GENEs that appear in both files, use inner_join:
z <- inner_join(case_file, control_file, by = "GENE", suffix = c(".CASE", ".CONTROL"))
GENE CASE_COUNT_HET.CASE CASE_COUNT_CH.CASE CASE_COUNT_HOM.CASE CASE_TOTAL_AC.CASE
1 ENSG00000124209 1 0 0 1
2 ENSG00000064703 1 1 0 9
3 ENSG00000171408 1 0 0 1
4 ENSG00000110514 1 1 1 12
5 ENSG00000247077 1 1 1 7
CASE_COUNT_HET.CONTROL CASE_COUNT_CH.CONTROL CASE_COUNT_HOM.CONTROL CASE_TOTAL_AC.CONTROL
1 1 0 0 1
2 1 1 0 9
3 1 0 0 1
4 1 1 1 12
5 1 1 1 7
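Note also the likely cause of the NULLs in the original script: because both files share identical column names, merge() distinguishes them only by appending suffixes (.x and .y by default), so no column named CONTROL_COUNT_HET ever exists, and dat$CONTROL_COUNT_HET silently returns NULL. One possible fix (an assumption on my part about the intended naming) is to rename the control file's columns before merging:

```r
# Control-file columns as shown in the question (same headers as the case file)
control.dat <- data.frame(
  GENE = c("ENSG00000124209", "ENSG00000064703"),
  CASE_COUNT_HET = c(1, 1),
  CASE_COUNT_CH  = c(0, 1),
  CASE_COUNT_HOM = c(0, 0),
  CASE_TOTAL_AC  = c(1, 9),
  stringsAsFactors = FALSE
)

# Rename CASE_ -> CONTROL_ so the merged data frame has the
# CONTROL_COUNT_* columns the burden script expects
names(control.dat) <- sub("^CASE_", "CONTROL_", names(control.dat))
names(control.dat)
# "GENE" "CONTROL_COUNT_HET" "CONTROL_COUNT_CH" "CONTROL_COUNT_HOM" "CONTROL_TOTAL_AC"
```

After this rename, the merge in the script no longer needs suffixes and the dominant-model sums resolve to real columns.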
So my data looks like this
id first middle last Age
1 Carol Jenny Smith 15
2 Sarah Carol Roberts 20
3 Josh David Richardson 22
I have a function that creates a new column which flags, for each row, whether the name was found in the columns I specified (the 2nd-4th, or 'first':'last', columns). The function outputs the result below:
funname <- function(df, cols, value, newcolumn) {
  df$newcolumn <- as.integer(rowSums(df[cols] == value) > 0)
}
id first middle last Age Carol
1 Carol Jenny Smith 15 1
2 Sarah Carol Roberts 20 1
3 Josh David Richardson 22 0
But my real data is more complicated and I want to create at least 20 new, different columns (ex: Carol, Robert, Jenny, Anna, Richard, Daniel, Eric...)
So how can I incorporate multiple new columns into the existing function?
I can only think of adding arguments like function(df, cols, value, newcolumn1, newcolumn2, newcolumn3, ...), but this would be impractical if I later want, say, a hundred columns. Any help? Thank you in advance! :)
EDIT:
function(df, cols, value, newcol) {
  df$newcol <- as.integer(rowSums(df[cols] == value) > 0)
  df
}
I read the comments below, but let me change my question:
How would I map this function so that I can generate multiple columns with the new names that I want to assign?
I think this is just one giant table operation if you get your data converted to two long vectors, one representing row number and the other the value:
tab <- as.data.frame.matrix(table(row(dat[2:4]), unlist(dat[2:4])))
cbind(dat, tab)
# id first middle last Age Carol David Jenny Josh Richardson Roberts Sarah Smith
#1 1 Carol Jenny Smith 15 1 0 1 0 0 0 0 1
#2 2 Sarah Carol Roberts 20 1 0 0 0 0 1 1 0
#3 3 Josh David Richardson 22 0 1 0 1 1 0 0 0
This method would also allow you to map the new output columns to variations of the names if required:
tab <- as.data.frame.matrix(table(row(dat[2:4]), unlist(dat[2:4])))
dat[paste0(colnames(tab),"_n")] <- tab
dat
# id first middle last Age Carol_n David_n Jenny_n Josh_n Richardson_n Roberts_n Sarah_n Smith_n
#1 1 Carol Jenny Smith 15 1 0 1 0 0 0 0 1
#2 2 Sarah Carol Roberts 20 1 0 0 0 0 1 1 0
#3 3 Josh David Richardson 22 0 1 0 1 1 0 0 0
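If you only need indicator columns for a specific set of names rather than every name in the data, you could also loop the single-column logic over a character vector. A minimal sketch, where wanted is a hypothetical vector of the names you care about:

```r
# Example data as in the question
dat <- data.frame(
  id = 1:3,
  first  = c("Carol", "Sarah", "Josh"),
  middle = c("Jenny", "Carol", "David"),
  last   = c("Smith", "Roberts", "Richardson"),
  Age    = c(15, 20, 22),
  stringsAsFactors = FALSE
)

wanted <- c("Carol", "Jenny", "Daniel")  # any names you want columns for

# One 0/1 indicator column per wanted name: 1 if it appears in first:last
dat[wanted] <- lapply(wanted, function(v)
  as.integer(rowSums(dat[c("first", "middle", "last")] == v) > 0))
```

This creates a column per element of wanted even for names (like Daniel here) that never occur, which the table() approach above would drop.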
I am new to R and data analysis. I have a database similar to the one below, just a lot bigger, and I am trying to find a general way to count, for each country, how many actions there are and how many subquestions with value 1, value 2 and so on there are. For each action there are multiple questions, subquestions and subsubquestions, but I would love to find a way to count:
1: how many actions there are per country, excluding subquestions
2: how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn.
id country questionn subquestion value actionn
06 NIE 1 1 1 1
05 NIG 1 1 1 1
07 TAN 1 1 1 1
08 BEN 1 1 1 1
03 TOG 1 1 2 1
45 MOZ 1 1 2 1
40 ZIM 1 1 1 1
56 COD 1 1 1 1
87 BFA 1 1 1 1
09 IVC 1 1 2 1
08 SOA 1 1 2 1
02 MAL 1 1 2 1
78 MAI 1 1 2 1
35 GUB 1 1 2 1
87 RWA 1 1 2 1
41 ETH 1 1 1 1
06 NIE 1 2 2 1
05 NIG 1 2 1 1
87 BFA 1 2 1 2
I have tried to create subsets of the data frame and count everything for each country once at a time but it is going to take forever and I was wondering if there was a general way to do it.
For the first question I have done this
df1<-df %>% group_by (country) %>% summarise (countries=county)
unique(df1)
count(df1)
For the second question I was thinking of individually selecting and counting each row which has questionn=1, subquestion=1, value=1 and actionn=1, then selecting and counting how many per country have questionn=1, subquestion=2, value=1, actionn=1, etc. Value refers to whether the answer to the question is 1=yes or 2=no.
I would be grateful for any help, thank you soo much :)
For the first question you can try to do something like this:
df %>%
filter(subquestion != 2) %>%
group_by(country) %>%
summarise(num_actions = n())
This will return the number of actions per country, removing rows that have 2 in the subquestion column. Note that n() in the summarise function counts the number of observations in each group (in this case, each country).
I am not sure I fully understand the second question, but my suggestion would be to make a new label for the particular observation you want to know (how many subquestions 1 or 2 with value 1 there are for each country, actionn and questionn):
df %>%
mutate(country_question_code = paste(country, actionn, questionn, sep = "_")) %>%
group_by(country_question_code) %>%
summarize(num_subquestion = n())
For question 1, a possible solution (assuming country names are not unique and actionn can be 0, 1, 2 or more).
For just the total count:
df %>% group_by(country) %>%
  summarise(Count_actions = sum(actionn, na.rm = TRUE)) # ignores all other columns
If you want to count how many times a country appears, use n() in place of sum(actionn, na.rm = TRUE). (This may not be desired, but sometimes the simple solution is the best: just count the frequency of country.)
Or df %>% group_by(country, actionn) %>% summarise(count_actions = n()) will give a country-wise count for each type (say 1, 2 or more actions).
data.table version: dt[, .N, by = .(country, actionn)]
For question 2: group by the "for each" variables in your question after filtering the data as required. Here, filter subquestion 1 or 2 with value 1, for each country, questionn and actionn:
df %>% filter(subquestion <= 2 & value == 1) %>%
  group_by(country, questionn, actionn) %>%
  summarise(counts_desired = n(), sums_desired = sum(actionn, na.rm = TRUE))
Hope this works; I am also learning and applying this to similar data.
(I have not tested it and have made certain assumptions about your data being numerical and clean. Also written on mobile while traveling! Cheers!!)
I've got R code that works and does what I want, but it takes a huge amount of time to run. Here is an explanation of what the code does, and the code itself.
I've got a vector of 200000 lines containing street addresses (strings): data.
Example :
> data[150000,]
address
"15 rue andre lalande residence marguerite yourcenar 91000 evry france"
And I have a 131x2 matrix of string elements, which are 5-grams (parts of words) and the ids of the bags of n-grams (example of a 5-gram bag: ["stack", "tacko", "ackov", "ckove", "overf", ...]): list_ngrams
Example of list_ngrams :
idSac ngram
1 4 stree
2 4 tree_
3 4 _stre
4 4 treet
5 5 avenu
6 5 _aven
7 5 venue
8 5 enue_
I also have a 200000x31 numerical matrix initialized with 0s: idv_x_bags
In total I have 131 5-grams and 31 bags of 5-grams.
I want to loop over the string addresses and check whether each one contains any of the n-grams in my list. If it does, I put a 1 in the corresponding column, which represents the id of the bag that contains the 5-gram.
Example :
In this address, "15 rue andre lalande residence marguerite yourcenar 91000 evry france", the word "residence" matches the bag ["resid","eside","dence",...] whose id is 5. So I am going to put 1 in the column called 5. The corresponding line of the idv_x_bags matrix will therefore look like the following:
> idv_x_bags[150000,]
4 5 6 8 10 12 13 15 17 18 22 26 29 34 35 36 42 43 45 46 47 48 52 55 81 82 108 114 119 122 123
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here is the code that does this:
idv_x_bags <- matrix(0, nrow = nrow(data), ncol = 31)
colnames(idv_x_bags) <- as.vector(sqldf("select distinct idSac from list_ngrams order by idSac"))$idSac
for (i in 1:nrow(idv_x_bags)) {
  for (ngram in list_ngrams$ngram) {
    if (grepl(ngram, data[i, ])) {
      idSac <- sqldf(sprintf("select idSac from list_ngrams where ngram='%s'", ngram))[[1]]
      idv_x_bags[i, as.character(idSac)] <- 1
    }
  }
}
The code does exactly what I aim to do, but it takes about 18 hours, which is huge. I tried to recode it in C++ using the Rcpp library, but I encountered many problems. I also tried to recode it using apply, but couldn't get it to work.
Here is what I did :
apply(cbind(data, 1:nrow(data)), 1, function(x) {
  apply(list_ngrams, 1, function(y) {
    # note: needs <<- to modify the matrix outside the function
    if (grepl(y[2], x[1])) idv_x_bags[x[2], str_trim(as.character(y[1]))] <<- 1
  })
})
I need some help recoding my loop using apply or some other method that runs faster than the current one. Thank you very much.
Check this one out and run the simple example step by step to see how it works.
My n-grams don't make much sense, but it will work with actual n-grams as well.
library(dplyr)
library(reshape2)
# your example dataset
dt_sen = data.frame(sen = c("this is a good thing", "this is bad"), stringsAsFactors = F)
dt_ngr = data.frame(id_ngr = c(2,2,2,3,3,3),
ngr = c("th","go","tt","drf","ytu","bad"), stringsAsFactors = F)
# sentence dataset
dt_sen
sen
1 this is a good thing
2 this is bad
#ngrams dataset
dt_ngr
id_ngr ngr
1 2 th
2 2 go
3 2 tt
4 3 drf
5 3 ytu
6 3 bad
# create table of matches
expand.grid(unique(dt_sen$sen), unique(dt_ngr$id_ngr)) %>%
data.frame() %>%
rename(sen = Var1,
id_ngr = Var2) %>%
left_join(dt_ngr, by = "id_ngr") %>%
group_by(sen, id_ngr,ngr) %>%
do(data.frame(match = grepl(.$ngr,.$sen))) %>%
group_by(sen,id_ngr) %>%
summarise(sum_success = sum(match)) %>%
mutate(match = ifelse(sum_success > 0,1,0)) -> dt_full
dt_full
Source: local data frame [4 x 4]
Groups: sen
sen id_ngr sum_success match
1 this is a good thing 2 2 1
2 this is a good thing 3 0 0
3 this is bad 2 1 1
4 this is bad 3 1 1
# reshape table
dt_full %>% dcast(., sen~id_ngr, value.var = "match")
sen 2 3
1 this is a good thing 1 0
2 this is bad 1 1
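On the performance side of the original question, a base-R alternative may also help: grepl() is vectorized over the address vector, so you can loop over the 131 n-grams instead of the 200,000 addresses. A minimal sketch under assumed object names (addresses, ngrams), not the questioner's exact data:

```r
# Toy stand-ins for the questioner's data
addresses <- c("15 rue andre lalande residence marguerite yourcenar 91000 evry france",
               "10 avenue foo 75000 paris france")
ngrams <- data.frame(idSac = c(5, 7),
                     ngram = c("resid", "avenu"),
                     stringsAsFactors = FALSE)

bags <- sort(unique(ngrams$idSac))
idv_x_bags <- matrix(0, nrow = length(addresses), ncol = length(bags),
                     dimnames = list(NULL, bags))

# One grepl() call per n-gram, vectorized over all addresses at once;
# fixed = TRUE does plain substring matching, avoiding regex overhead
for (j in seq_len(nrow(ngrams))) {
  hit <- grepl(ngrams$ngram[j], addresses, fixed = TRUE)
  idv_x_bags[hit, as.character(ngrams$idSac[j])] <- 1
}
```

This turns 200000 x 131 scalar grepl calls (plus the per-hit sqldf lookups) into 131 vectorized ones, which should reduce the runtime dramatically.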
I have updated the question, as (a) I articulated the question unclearly on the first attempt, and (b) my exact need has also shifted somewhat.
I want to especially thank Hemmo for great help so far, and apologies for not articulating my question clearly enough to him. His code (which addressed the earlier version of the problem) is shown in the answer section.
At a high level, I am looking for code that helps to identify and differentiate the different blocks of consecutive free time of different individuals. More specifically, the code would ideally:
Check whether an activity is labelled as "Free"
Check whether consecutive weeks (week earlier, week later) of time spent by the same person were also labelled as "Free".
Give the entire block of consecutive weeks of that person that are labelled "Free" an indicator in the desired outcome column. Note that the length of the time periods (e.g. 1 consecutive week, 4 consecutive weeks, 8 consecutive weeks) will vary.
Finally, due to a need for further analysis on the characteristics of these clusters, different blocks should receive different indicators (e.g. the March block of Paul would have value 1, the May block value 2, and Kim's block in March would have value 3).
Hopefully this becomes clearer when one looks at the example dataframe (see the desired final column).
Any help much appreciated, code for the test dataframe per below.
Many thanks in advance,
W
Example (note that the last column should be generated by the code, purely included as illustration):
Week Name Activity Hours Desired_Outcome
1 01/01/2013 Paul Free 40 1
2 08/01/2013 Paul Free 10 1
3 08/01/2013 Paul Project A 30 0
4 15/01/2013 Paul Project B 30 0
5 15/01/2013 Paul Project A 10 0
6 22/01/2013 Paul Free 40 2
7 29/01/2013 Paul Project B 40 0
8 05/02/2013 Paul Free 40 3
9 12/02/2013 Paul Free 10 3
10 19/02/2013 Paul Free 30 3
11 01/01/2013 Kim Project E 40 0
12 08/01/2013 Kim Free 40 4
13 15/01/2013 Kim Free 40 4
14 22/01/2013 Kim Project E 40 0
15 29/01/2013 Kim Free 40 5
Code for dataframe:
Name=c(rep("Paul",10),rep("Kim",5))
Week=c("01/01/2013","08/01/2013","08/01/2013","15/01/2013","15/01/2013","22/01/2013","29/01/2013","05/02/2013","12/02/2013","19/02/2013","01/01/2013","08/01/2013","15/01/2013","22/01/2013","29/01/2013")
Activity=c("Free","Free","Project A","Project B","Project A","Free","Project B","Free","Free","Free","Project E","Free","Free","Project E","Free")
Hours=c(40,10,30,30,10,40,40,40,10,30,40,40,40,40,40)
Desired_Outcome=c(1,1,0,0,0,2,0,3,3,3,0,4,4,0,5)
df <- data.frame(Week, Name, Activity, Hours, Desired_Outcome, stringsAsFactors = FALSE) # data.frame() keeps Hours numeric; cbind() would coerce everything to character
df
EDIT: This was messy already as the question was edited several times, so I removed old answers.
checkFree <- function(df) {
  df$Week <- as.Date(df$Week, format = "%d/%m/%Y")
  df$outcome <- numeric(nrow(df))
  if (df$Activity[1] == "Free") { # check the first row
    counter <- 1
    df$outcome[1] <- counter
  } else counter <- 0
  for (i in 2:nrow(df)) {
    if (df$Activity[i] == "Free") {
      LastWeek <- (df$Week >= (df$Week[i] - 7) &
                     df$Week < df$Week[i])
      if (all(df$Activity[LastWeek] != "Free"))
        counter <- counter + 1
      df$outcome[i] <- counter
    }
  }
  df
}

splitdf <- split(df, Name)
df <- unsplit(lapply(splitdf, checkFree), Name)

uniqs <- unique(df$Name) # for renumbering across persons
for (i in 2:length(uniqs))
  df$outcome[df$Name == uniqs[i] & df$outcome > 0] <-
    max(df$outcome[df$Name == uniqs[i - 1]]) +
    df$outcome[df$Name == uniqs[i] & df$outcome > 0]
df
That should do it, although the above code is probably far from optimal.
Using the comment by user1885116 on Hemmo's answer as a guide to what is desired, here is a somewhat simpler approach:
N <- 1
x <- with(df, df[Activity=='Free',])
y <- with(x, diff(Week)) <= N*7
df$outcome <- 0
df[rownames(x[c(y, FALSE) | c(FALSE, y),]),]$outcome <- 1
df
## Week Activity Hours Desired_Outcome outcome
## 1 2013-01-01 Project A 40 0 0
## 2 2013-01-08 Project A 10 0 0
## 3 2013-01-08 Free 30 1 1
## 4 2013-01-15 Project B 30 0 0
## 5 2013-01-15 Free 10 1 1
## 6 2013-01-22 Project B 40 0 0
## 7 2013-01-29 Free 40 0 0
## 8 2013-02-05 Project C 40 0 0
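Since this simpler approach only flags free weeks with 0/1, here is a hedged base-R sketch of the full numbering rule (my reading of it: a new block starts whenever the gap to the person's previous "Free" week exceeds 7 days, with ids increasing globally across people), using the question's example data:

```r
# Example data from the question
Name <- c(rep("Paul", 10), rep("Kim", 5))
Week <- c("01/01/2013","08/01/2013","08/01/2013","15/01/2013","15/01/2013",
          "22/01/2013","29/01/2013","05/02/2013","12/02/2013","19/02/2013",
          "01/01/2013","08/01/2013","15/01/2013","22/01/2013","29/01/2013")
Activity <- c("Free","Free","Project A","Project B","Project A","Free","Project B",
              "Free","Free","Free","Project E","Free","Free","Project E","Free")
Hours <- c(40,10,30,30,10,40,40,40,10,30,40,40,40,40,40)
df <- data.frame(Week, Name, Activity, Hours, stringsAsFactors = FALSE)

# Assign a distinct id to every block of consecutive "Free" weeks per person,
# numbering blocks globally in order of appearance
assign_blocks <- function(df) {
  df$Week <- as.Date(df$Week, format = "%d/%m/%Y")
  df$Block <- 0
  counter <- 0
  for (nm in unique(df$Name)) {
    rows <- which(df$Name == nm & df$Activity == "Free")
    wk <- df$Week[rows]
    # a new block starts when the gap to the previous Free week exceeds 7 days
    new_block <- c(TRUE, diff(wk) > 7)
    df$Block[rows] <- counter + cumsum(new_block)
    counter <- max(df$Block)
  }
  df
}

df <- assign_blocks(df)
df$Block
# 1 1 0 0 0 2 0 3 3 3 0 4 4 0 5
```

This reproduces the Desired_Outcome column from the question; two "Free" rows in the same week land in the same block because their date gap is 0.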