subset dataframe - r

I have a dataframe with counts of geese at several different sites. The aim was to make monthly counts of geese in all 8 months between September-April at each site in consecutive winter periods. A winter period is defined as the 8 months between September-April.
If the method had been carried out as planned, this is what the data would look like:
df <- data.frame(site=c(rep('site 1', 16), rep('site 2', 16), rep('site 3', 16)),
date=dmy(rep(c('01/09/2007', '02/10/2007', '02/11/2007',
'02/12/2007', '02/01/2008', '02/02/2008', '02/03/2008',
'02/04/2008', '01/09/2008', '02/10/2008', '02/11/2008',
'02/12/2008', '02/01/2009', '02/02/2009', '02/03/2009',
'02/04/2009'),3)),
count=sample(1:100, 48))
It has ended up with a situation where some sites have all 8 counts in some September-April periods, but not in others. In addition, some sites never achieved 8 counts in any September-April period. These toy data look like my actual data:
df <- df[-c(11:16, 36:48),]
I need to remove rows from the dataframe which do not form part of 8 consecutive counts in a September-April period. Using the toy data, this is the dataframe I need:
df <- df[-c(9:10, 27:29), ]
I've tried various commands using ddply() from the plyr package, but without success. Is there a solution to this problem?

One way I can think of is to subtract four months from each date, so that every September-April period falls within a single calendar year, and then group by that year. To subtract the months, I suggest you use the mondate package. See here for an excellent answer on the problems you can face when subtracting months and how to overcome them.
require(plyr) # provides ddply()
require(mondate) # month arithmetic on dates
df$grp <- mondate(df$date) - 4
df$year <- year(df$grp)
df$month <- month(df$date)
ddply(df, .(site, year), function(x) {
if (all(c(1:4, 9:12) %in% x$month)) {
return(x)
} else {
return(NULL)
}
})
# site date count grp year month
# 1 site 1 2007-09-01 87 2007-05-02 2007 9
# 2 site 1 2007-10-02 44 2007-06-02 2007 10
# 3 site 1 2007-11-02 50 2007-07-03 2007 11
# 4 site 1 2007-12-02 65 2007-08-02 2007 12
# 5 site 1 2008-01-02 12 2007-09-02 2007 1
# 6 site 1 2008-02-02 2 2007-10-03 2007 2
# 7 site 1 2008-03-02 100 2007-11-02 2007 3
# 8 site 1 2008-04-02 29 2007-12-03 2007 4
# 9 site 2 2007-09-01 3 2007-05-02 2007 9
# 10 site 2 2007-10-02 22 2007-06-02 2007 10
# 11 site 2 2007-11-02 56 2007-07-03 2007 11
# 12 site 2 2007-12-02 5 2007-08-02 2007 12
# 13 site 2 2008-01-02 40 2007-09-02 2007 1
# 14 site 2 2008-02-02 15 2007-10-03 2007 2
# 15 site 2 2008-03-02 10 2007-11-02 2007 3
# 16 site 2 2008-04-02 20 2007-12-03 2007 4
# 17 site 2 2008-09-01 93 2008-05-02 2008 9
# 18 site 2 2008-10-02 13 2008-06-02 2008 10
# 19 site 2 2008-11-02 58 2008-07-03 2008 11
# 20 site 2 2008-12-02 64 2008-08-02 2008 12
# 21 site 2 2009-01-02 92 2008-09-02 2008 1
# 22 site 2 2009-02-02 69 2008-10-03 2008 2
# 23 site 2 2009-03-02 89 2008-11-02 2008 3
# 24 site 2 2009-04-02 27 2008-12-03 2008 4
An alternative solution using data.table:
require(data.table)
require(mondate)
dt <- data.table(df)
dt[, `:=`(year=year(mondate(date)-4), month=month(date))]
dt.out <- dt[, .SD[rep(all(c(1:4,9:12) %in% month), .N)],
by=list(site,year)][, c("year", "month") := NULL]
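The same idea can be sketched in base R without mondate. This is only an illustration, not the answerer's method: it rebuilds the toy data with fixed counts, labels each row with the calendar year in which its winter started (the column name winter is made up here), and keeps the site and winter combinations in which all eight required months are present.

```r
# Toy data mirroring the question: two winters at three sites, then gaps.
dates <- as.Date(c("2007-09-01", "2007-10-02", "2007-11-02", "2007-12-02",
                   "2008-01-02", "2008-02-02", "2008-03-02", "2008-04-02",
                   "2008-09-01", "2008-10-02", "2008-11-02", "2008-12-02",
                   "2009-01-02", "2009-02-02", "2009-03-02", "2009-04-02"))
df <- data.frame(site  = rep(c("site 1", "site 2", "site 3"), each = 16),
                 date  = rep(dates, 3),
                 count = seq_len(48))
df <- df[-c(11:16, 36:48), ]

# Label each row with the calendar year in which its winter started
# (January-April belong to the winter that began the previous September).
mon <- as.integer(format(df$date, "%m"))
yr  <- as.integer(format(df$date, "%Y"))
df$winter <- ifelse(mon >= 9, yr, yr - 1L)

# For every site and winter, count how many of the 8 required months occur.
n_months <- ave(mon, df$site, df$winter,
                FUN = function(m) sum(c(9:12, 1:4) %in% m))

# Keep only rows that belong to a complete winter.
complete <- df[n_months == 8, ]
```

With the toy data this should keep one winter for site 1 and both winters for site 2, and drop site 3 entirely.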

Related

Calculating rolling average over time by two conditions [duplicate]

I have a longitudinal follow-up of blood pressure recordings.
The value at a certain point is less predictive than is the moving average (rolling mean), which is why I'd like to calculate it. The data looks like
test <- read.table(header=TRUE, text = "
ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT
1 20 2000 NA 3
1 21 2001 129 2
1 22 2002 145 3
1 22 2002 130 2
2 23 2003 NA NA
2 30 2010 150 2
2 31 2011 110 3
4 50 2005 140 3
4 50 2005 130 3
4 50 2005 NA 3
4 51 2006 312 2
5 27 2010 140 4
5 28 2011 170 4
5 29 2012 160 NA
7 40 2007 120 NA
")
I'd like to calculate a new variable, called BLOOD_PRESSURE_UPDATED. This variable should be the moving average for BLOOD_PRESSURE and have the following characteristics:
A moving average is the mean of the current value and the previous value.
For the first observation, BLOOD_PRESSURE_UPDATED is just the current BLOOD_PRESSURE. If that is missing, BLOOD_PRESSURE_UPDATED should be the overall mean.
Missing values should be filled in with the nearest previous value.
I've tried the following:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = rollmean(x=BLOOD_PRESSURE, 2)) %>%
ungroup()
I have also tried rollapply and rollmeanr without success.
How about this?
library(dplyr)
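test2 <- test %>%
  arrange(ID, YEAR_VISIT) %>%
  group_by(ID) %>% # keep the lag within each subject
  mutate(prev = lag(BLOOD_PRESSURE),
         movave = (BLOOD_PRESSURE + prev) / 2) %>% # mean of current and previous value
  ungroup()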
Another solution uses the rollapply function from the zoo package (which I prefer):
library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>%
mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))
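To see what a right-aligned window of width 2 produces, here is a small sketch that uses stats::filter from base R instead of zoo. The vector x is made up for illustration. Note that stats::filter must be called with its namespace once dplyr is loaded, because dplyr masks filter, and that it does not skip missing values.

```r
# Hypothetical input: four consecutive readings with no missing values.
x <- c(129, 145, 130, 140)

# sides = 1 uses only the current and past values, i.e. a right-aligned
# window; each result is 0.5 * x[i] + 0.5 * x[i - 1].
ma2 <- as.numeric(stats::filter(x, rep(1 / 2, 2), sides = 1))
ma2
```

The first element is NA because it has no predecessor, which matches the effect of fill=NA in the rollapply call.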
slider is a newer alternative that plays nicely with the tidyverse.
Something like this should do the trick:
test2 <- test %>%
group_by(ID) %>%
arrange(ID, YEAR_VISIT) %>%
mutate(BLOOD_PRESSURE_UPDATED = slider::slide_dbl(BLOOD_PRESSURE, mean, .before = 1, .after = 0)) %>%
ungroup()
If you are not committed to dplyr, this should work:
get.mav <- function(bp,n=2){
require(zoo)
if(is.na(bp[1])) bp[1] <- mean(bp,na.rm=TRUE)
bp <- na.locf(bp,na.rm=FALSE)
if(length(bp)<n) return(bp)
c(bp[1:(n-1)],rollapply(bp,width=n,mean,align="right"))
}
test <- with(test,test[order(ID,YEAR_VISIT),])
test$BLOOD_PRESSURE_UPDATED <-
unlist(aggregate(BLOOD_PRESSURE~ID,test,get.mav,na.action=NULL,n=2)$BLOOD_PRESSURE)
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1 1 20 2000 NA 3 134.6667
# 2 1 21 2001 129 2 131.8333
# 3 1 22 2002 145 3 137.0000
# 4 1 22 2002 130 2 137.5000
# 5 2 23 2003 NA NA 130.0000
# 6 2 30 2010 150 2 140.0000
# 7 2 31 2011 110 3 130.0000
# ...
This also works for moving-average windows wider than 2.
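For the width-2 case, the three rules from the question can also be sketched without zoo. The helper name mav2 is hypothetical, and two assumptions are made: the "overall mean" is taken within each ID (as get.mav does), and every ID has at least one non-missing reading.

```r
# Hypothetical helper: width-2 moving average following the question's rules.
mav2 <- function(bp) {
  # A missing first value is replaced by the mean of the observed values.
  if (is.na(bp[1])) bp[1] <- mean(bp, na.rm = TRUE)
  # Fill remaining gaps with the nearest previous value: idx holds, for each
  # position, the index of the last non-missing value seen so far.
  idx <- cummax(seq_along(bp) * !is.na(bp))
  bp <- bp[idx]
  # Average each value with its predecessor; the first value is kept as is.
  prev <- c(bp[1], bp[-length(bp)])
  (bp + prev) / 2
}

# A few rows in the shape of the question's data, already sorted by ID and year.
test <- data.frame(ID = c(1, 1, 1, 1, 4, 4, 4, 4, 7),
                   BLOOD_PRESSURE = c(NA, 129, 145, 130, 140, 130, NA, 312, 120))

# ave() applies the helper within each ID and keeps the original row order.
test$BLOOD_PRESSURE_UPDATED <- ave(as.numeric(test$BLOOD_PRESSURE), test$ID,
                                   FUN = mav2)
```

For ID 1 this gives the same values as the output shown above; a group with a single reading, such as ID 7, simply keeps that reading.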
And here's a data.table solution, which is likely to be much faster if your dataset is large.
library(data.table)
setDT(test) # converts test to a data.table in place
setkey(test,ID,YEAR_VISIT)
test[,BLOOD_PRESSURE_UPDATED:=as.numeric(get.mav(BLOOD_PRESSURE,2)),by=ID]
test
# ID AGE YEAR_VISIT BLOOD_PRESSURE TREATMENT BLOOD_PRESSURE_UPDATED
# 1: 1 20 2000 NA 3 134.6667
# 2: 1 21 2001 129 2 131.8333
# 3: 1 22 2002 145 3 137.0000
# 4: 1 22 2002 130 2 137.5000
# 5: 2 23 2003 NA NA 130.0000
# 6: 2 30 2010 150 2 140.0000
# 7: 2 31 2011 110 3 130.0000
# ...
Try this:
library(dplyr)
library(zoo)
test2<-arrange(test,ID,YEAR_VISIT) %>% group_by(ID)%>%
mutate(ma2=rollapply(BLOOD_PRESSURE,2,mean,align='right',fill=NA))

Reshape wide format, to multi-column long format

I want to reshape a wide format dataset that has multiple tests which are measured at 3 time points:
ID Test Year Fall Spring Winter
1 1 2008 15 16 19
1 1 2009 12 13 27
1 2 2008 22 22 24
1 2 2009 10 14 20
2 1 2008 12 13 25
2 1 2009 16 14 21
2 2 2008 13 11 29
2 2 2009 23 20 26
3 1 2008 11 12 22
3 1 2009 13 11 27
3 2 2008 17 12 23
3 2 2009 14 9 31
into a data set that separates the tests by column but converts the measurement time into long format, for each of the new columns like this:
ID Year Time Test1 Test2
1 2008 Fall 15 22
1 2008 Spring 16 22
1 2008 Winter 19 24
1 2009 Fall 12 10
1 2009 Spring 13 14
1 2009 Winter 27 20
2 2008 Fall 12 13
2 2008 Spring 13 11
2 2008 Winter 25 29
2 2009 Fall 16 23
2 2009 Spring 14 20
2 2009 Winter 21 26
3 2008 Fall 11 17
3 2008 Spring 12 12
3 2008 Winter 22 23
3 2009 Fall 13 14
3 2009 Spring 11 9
3 2009 Winter 27 31
I have tried to use reshape and melt without success. Existing posts only address transforming to a single-column outcome.
Using reshape2:
library(reshape2)
# Thanks to Ista for helping with direct naming using "variable.name"
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
df.m <- transform(df.m, Test = paste0("Test", Test))
dcast(df.m, ID + Year + Time ~ Test, value.var = "value")
Update: Using data.table melt/cast from versions >= 1.9.0:
From version 1.9.0 onwards, data.table imports the reshape2 package and implements fast melt and dcast methods in C for data.tables. A comparison of speed on bigger data is shown below.
For more info regarding NEWS, go here.
require(data.table) ## ver. >=1.9.0
require(reshape2)
dt <- as.data.table(df, key=c("ID", "Test", "Year"))
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
dt.m[, Test := paste0("Test", Test)]
dcast.data.table(dt.m, ID + Year + Time ~ Test, value.var = "value")
At the moment, you'll have to write dcast.data.table explicitly, as dcast is not an S3 generic in reshape2 yet.
Benchmarking on bigger data:
# generate data:
set.seed(45L)
DT <- data.table(ID = sample(1e2, 1e7, TRUE),
Test = sample(1e3, 1e7, TRUE),
Year = sample(2008:2014, 1e7,TRUE),
Fall = sample(50, 1e7, TRUE),
Spring = sample(50, 1e7,TRUE),
Winter = sample(50, 1e7, TRUE))
DF <- as.data.frame(DT)
reshape2 timings:
reshape2_melt <- function(df) {
df.m <- melt(df, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(df.m <- reshape2_melt(DF))
# user system elapsed
# 43.319 4.909 48.932
df.m <- transform(df.m, Test = paste0("Test", Test))
reshape2_cast <- function(df) {
dcast(df, ID + Year + Time ~ Test, value.var = "value") # use the argument, not the global df.m
}
# min. of three consecutive runs
system.time(reshape2_cast(df.m))
# user system elapsed
# 57.728 9.712 69.573
data.table timings:
DT_melt <- function(dt) {
dt.m <- melt(dt, id.var = c("ID", "Test", "Year"), variable.name = "Time")
}
# min. of three consecutive runs
system.time(dt.m <- DT_melt(DT))
# user system elapsed
# 0.276 0.001 0.279
dt.m[, Test := paste0("Test", Test)]
DT_cast <- function(dt) {
dcast.data.table(dt, ID + Year + Time ~ Test, value.var = "value") # use the argument, not the global dt.m
}
# min. of three consecutive runs
system.time(DT_cast(dt.m))
# user system elapsed
# 12.732 0.825 14.006
In these timings, melt.data.table is ~175x faster than reshape2:::melt, and dcast.data.table is ~5x faster than reshape2:::dcast.
Sticking with base R, this is another good candidate for the "stack + reshape" routine. Assuming our dataset is called "mydf":
mydf.temp <- data.frame(mydf[1:3], stack(mydf[4:6]))
mydf2 <- reshape(mydf.temp, direction = "wide",
idvar=c("ID", "Year", "ind"),
timevar="Test")
names(mydf2) <- c("ID", "Year", "Time", "Test1", "Test2")
mydf2
# ID Year Time Test1 Test2
# 1 1 2008 Fall 15 22
# 2 1 2009 Fall 12 10
# 5 2 2008 Fall 12 13
# 6 2 2009 Fall 16 23
# 9 3 2008 Fall 11 17
# 10 3 2009 Fall 13 14
# 13 1 2008 Spring 16 22
# 14 1 2009 Spring 13 14
# 17 2 2008 Spring 13 11
# 18 2 2009 Spring 14 20
# 21 3 2008 Spring 12 12
# 22 3 2009 Spring 11 9
# 25 1 2008 Winter 19 24
# 26 1 2009 Winter 27 20
# 29 2 2008 Winter 25 29
# 30 2 2009 Winter 21 26
# 33 3 2008 Winter 22 23
# 34 3 2009 Winter 27 31
An alternative method using the base reshape function is below. It requires calling reshape twice, so there might be a simpler way.
Assuming your dataset is called df1:
tmp <- reshape(df1,idvar=c("ID","Year"),timevar="Test",direction="wide")
result <- reshape(
tmp,
idvar=c("ID","Year"),
varying=list(3:5,6:8),
v.names=c("Test1","Test2"),
times=c("Fall","Spring","Winter"),
direction="long"
)
Which gives:
> result
ID Year time Test1 Test2
1.2008.Fall 1 2008 Fall 15 22
1.2009.Fall 1 2009 Fall 12 10
2.2008.Fall 2 2008 Fall 12 13
2.2009.Fall 2 2009 Fall 16 23
3.2008.Fall 3 2008 Fall 11 17
3.2009.Fall 3 2009 Fall 13 14
1.2008.Spring 1 2008 Spring 16 22
1.2009.Spring 1 2009 Spring 13 14
2.2008.Spring 2 2008 Spring 13 11
2.2009.Spring 2 2009 Spring 14 20
3.2008.Spring 3 2008 Spring 12 12
3.2009.Spring 3 2009 Spring 11 9
1.2008.Winter 1 2008 Winter 19 24
1.2009.Winter 1 2009 Winter 27 20
2.2008.Winter 2 2008 Winter 25 29
2.2009.Winter 2 2009 Winter 21 26
3.2008.Winter 3 2008 Winter 22 23
3.2009.Winter 3 2009 Winter 27 31
tidyverse/tidyr solution:
library(dplyr)
library(tidyr)
df %>%
gather("Time", "Value", Fall, Spring, Winter) %>%
spread(Test, Value, sep = "")
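In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. A minimal sketch of the same two steps with the newer functions follows, using only the first four rows of the question's table; the intermediate names long and out are made up for illustration.

```r
library(tidyr)

# First four rows of the question's wide table.
df <- data.frame(ID     = c(1, 1, 1, 1),
                 Test   = c(1, 1, 2, 2),
                 Year   = c(2008, 2009, 2008, 2009),
                 Fall   = c(15, 12, 22, 10),
                 Spring = c(16, 13, 22, 14),
                 Winter = c(19, 27, 24, 20))

# Step 1: stack the three season columns into Time/Value pairs.
long <- pivot_longer(df, cols = c(Fall, Spring, Winter),
                     names_to = "Time", values_to = "Value")

# Step 2: spread the tests into their own columns, named Test1 and Test2.
out <- pivot_wider(long, names_from = Test, values_from = Value,
                   names_prefix = "Test")
```

The names_prefix argument plays the role of sep = "" in the spread call above, so that the new columns are called Test1 and Test2 rather than 1 and 2.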

An increasing counter for occurrence of new values in R

I am trying to make a counter which increases for each new change in another vector. E.g. I have several individuals that are observed over several weeks, and I want to know how many weeks they are observed. So I'll end up with a table like this:
Id year Week Weeks observed
1 2006 10 1
1 2006 10 1
1 2006 11 2
1 2006 11 2
1 2006 12 3
1 2006 13 4
1 2007 1 5
1 2007 2 6
1 2007 3 7
1 2007 4 8
1 2007 5 9
1 2007 6 10
2 2006 10 1
2 2006 10 1
2 2006 11 2
2 2006 11 2
2 2006 12 3
2 2006 13 4
2 2007 1 5
2 2007 2 6
2 2007 3 7
2 2007 4 8
2 2007 5 9
2 2007 6 10
Assuming you have your data in a data.frame called dat, you could use tapply and convert Phase to a factor then strip it of its levels to use the underlying integer values:
dat$newcounter <- unlist(tapply(dat$Phase, dat$Id,
function(x) unclass(as.factor(x))))
Obligatory data.table answer:
library(data.table)
dt<-as.data.table(dat)
dt[, newcounter := unclass(as.factor(Phase)), by = Id]
EDIT
To account for the newly phrased question, here is a possibility using data.table.
dt <- as.data.table(dat[, -4]) # Create data.table
setkeyv(dt, c("Id", "year", "Week")) # Create key for data.table
dt2 <- unique(dt, by = key(dt)) # Get only unique rows by key (stated explicitly; newer data.table versions default to all columns)
dt3 <- dt2[, Weeks.observed := seq_len(.N), by = "Id"] # Create new variable
dt[dt3] # Merge data.tables back together
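The same counter can be sketched in base R. This is an illustration on a shortened copy of the question's table, with a made-up output column weeks_observed. It assumes the week number is below 100, so that year and week can be combined into one numeric key, and it numbers weeks in order of first appearance within each Id.

```r
# Shortened copy of the question's table: two individuals, seven rows each.
dat <- data.frame(Id   = rep(1:2, each = 7),
                  year = rep(c(2006, 2006, 2006, 2006, 2006, 2006, 2007), 2),
                  Week = rep(c(10, 10, 11, 11, 12, 13, 1), 2))

# Combine year and week into a single key, e.g. 2006 and 10 become 200610.
key <- dat$year * 100 + dat$Week

# Within each Id, number the distinct keys in order of first appearance.
dat$weeks_observed <- ave(key, dat$Id,
                          FUN = function(k) match(k, unique(k)))
```

Repeated rows for the same week receive the same counter value, and the counter restarts at 1 for each Id.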
