I'm trying to subset my data so that it only keeps rows up to the first occurrence of a value. I'm working with panel data that traces the careers of workers, and I want to subset it so that it only shows each person's records up to the point where they first became Boss.
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
So I would want the data to look like:
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
2 1990 Jane Manager 0
2 1991 Jane Boss 1
This seems like basic censoring but for the sake of my analysis this is crucial..! Any help would be appreciated.

Here's a dplyr solution that uses two useful window functions, lag() and cumall():
library(dplyr)

df <- read.table(header = TRUE, text = "
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
", stringsAsFactors = FALSE)
# Use mutate to see the values of the new variables
df %>%
  group_by(id) %>%
  mutate(last_job = lag(job, default = ""), keep = cumall(last_job != "Boss"))
# Use filter to see the results
df %>%
  group_by(id) %>%
  filter(cumall(lag(job, default = "") != "Boss"))
We use lag() to figure out what job each person had in the previous year, and then use cumall() to keep all rows up to the first instance of "Boss". If the data wasn't already sorted by year, you could use lag(job, order_by = year) to make sure lag() used the value of year, rather than the row order, to determine which was "last" year.
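For readers without dplyr, the same idea can be sketched in base R. This is only an illustration, assuming the column names from the question; the rows are shuffled on purpose and sorted explicitly first, because a cumulative test such as cumall() runs along row order. The names seen_boss_before and result are made up for this sketch.

```r
# Same rows as in the question, deliberately out of order
df <- read.table(header = TRUE, text = "
id year name job job2
2 1993 Jane Boss 1
1 1992 Bon Manager 0
1 1994 Bon Manager 0
2 1990 Jane Manager 0
1 1990 Bon Manager 0
2 1992 Jane Manager 0
1 1993 Bon Boss 1
2 1991 Jane Boss 1
1 1991 Bon Manager 0
", stringsAsFactors = FALSE)

# Sort within person by year so that "previous row" means "previous year"
df <- df[order(df$id, df$year), ]

# For each id: TRUE once any EARLIER row was "Boss" (shift by one, then accumulate)
seen_boss_before <- ave(df$job == "Boss", df$id,
                        FUN = function(b) cumsum(c(0, head(b, -1))) > 0)

# Keep every row up to and including the first "Boss" row
result <- df[!seen_boss_before, ]
print(result)
```

On the question's data this keeps four rows for Bon (1990 to 1993) and two for Jane (1990 and 1991).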

Base solution:
do.call(rbind,
  by(dat,dat$name,function(x) {
    if ("Boss" %in% x$job) x[1:min(which(x$job=="Boss")),]
  })
)
# id year name job job2
#Bon.1 1 1990 Bon Manager 0
#Bon.2 1 1991 Bon Manager 0
#Bon.3 1 1992 Bon Manager 0
#Bon.4 1 1993 Bon Boss 1
#Jane.6 2 1990 Jane Manager 0
#Jane.7 2 1991 Jane Boss 1
An alternative base solution:
dat$keep <- with(dat,
  ave(job=="Boss",name,FUN=function(x) if(1 %in% x) cumsum(x) else 2)
)
with(dat, dat[keep==0 | (job=="Boss" & keep==1),] )
# id year name job job2 keep
#1 1 1990 Bon Manager 0 0
#2 1 1991 Bon Manager 0 0
#3 1 1992 Bon Manager 0 0
#4 1 1993 Bon Boss 1 1
#6 2 1990 Jane Manager 0 0
#7 2 1991 Jane Boss 1 1
And a data.table solution:
library(data.table)
dat <- as.data.table(dat)
dat[,if("Boss" %in% job) .SD[1:min(which(job=="Boss"))],by=name]
# name id year job job2
#1: Bon 1 1990 Manager 0
#2: Bon 1 1991 Manager 0
#3: Bon 1 1992 Manager 0
#4: Bon 1 1993 Boss 1
#5: Jane 2 1990 Manager 0
#6: Jane 2 1991 Boss 1

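The sqldf library can do the job: first find each person's earliest Boss year, then join back to keep the rows up to that year.
library(sqldf)
miny <- sqldf("select id, min(year) as year from df where job='Boss' group by id")
sqldf("select df.* from df join miny on (df.id=miny.id and df.year<=miny.year)")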
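The two-step logic of the SQL approach (find each id's first Boss year, then join and filter) can also be sketched without sqldf. This is a rough base R equivalent, assuming the question's data; first_boss, boss_year, merged and result are names invented for the sketch.

```r
df <- read.table(header = TRUE, text = "
id year name job job2
1 1990 Bon Manager 0
1 1991 Bon Manager 0
1 1992 Bon Manager 0
1 1993 Bon Boss 1
1 1994 Bon Manager 0
2 1990 Jane Manager 0
2 1991 Jane Boss 1
2 1992 Jane Manager 0
2 1993 Jane Boss 1
", stringsAsFactors = FALSE)

# Step 1: earliest year in which each id held the job "Boss"
first_boss <- aggregate(year ~ id, data = df[df$job == "Boss", ], FUN = min)
names(first_boss)[2] <- "boss_year"

# Step 2: inner join on id, then keep rows up to and including that year
merged <- merge(df, first_boss, by = "id")
result <- merged[merged$year <= merged$boss_year, names(df)]
result <- result[order(result$id, result$year), ]
print(result)
```

As with the SQL inner join, ids that never became Boss have no match in first_boss and drop out of the result.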

If your data is stored in a data frame called df:
library(plyr)
ddply(.data=df, .variables=c("name"), .fun=function(x) {
  i <- which(x$job == "Boss")[1]
  if (!is.na(i)) x[1:i, ] # omit lifelong managers
})
# id year name job job2
# 1 1 1990 Bon Manager 0
# 2 1 1991 Bon Manager 0
# 3 1 1992 Bon Manager 0
# 4 1 1993 Bon Boss 1
# 5 2 1990 Jane Manager 0
# 6 2 1991 Jane Boss 1


How to find observations whose dummy variable changes from 1 to 0 (and not vice versa) in a data frame in R

I have a survey composed of n individuals; each individual appears more than once in the survey (panel). I have a variable pens, a dummy that takes the value 1 if the individual invests in a complementary pension plan. For example:
df <- data.frame(year=c(2002,2002,2004,2004,2006,2008), id=c(1,2,1,2,3,3), y.b=c(1950,1943,1950,1943,1966,1966), sex=c("F", "M", "F", "M", "M", "M"), income=c(100000,55000,88000,66000,12000,24000), pens=c(0,1,1,0,1,1))
year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1
where id is the individual, y.b is year of birth, pens is the dummy variable regarding complementary pension.
I want to know whether there are individuals who invested in a complementary pension plan in year t but no longer held it in year t+2 (the survey is conducted every two years). This would tell me how many people had a complementary pension plan but cashed it out before retirement or gave it up (for example, for economic reasons).
I tried with this command:
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
and it does identify the individuals whose pens variable changed over time (the command checks whether a variable is constant over time). As a result, I find individuals whose pens variable changed from 0 (no complementary pension) in year t to 1 in year t+2 and vice versa, but I am only interested in individuals whose pens variable was 1 (had a complementary pension) in year t and 0 in year t+2.
If I run this command on df, x is 0 for both id 1 and id 2 (their pens variable isn't constant), but I need a way to flag just id 2 (whose pens variable changed from 1 to 0).
df$x <- (ave(df$pens, df$id, FUN = function(x)length(unique(x)))==1)*1
year id pens x
1 2002 1 0 0
2 2002 2 1 0
3 2004 1 1 0
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
(for the sake of simplicity I omitted the other variables)
So the desired output is:
year id pens x
1 2002 1 0 1
2 2002 2 1 0
3 2004 1 1 1
4 2004 2 0 0
5 2006 3 1 1
6 2008 3 1 1
only id 2 has x=0 since the pens variable changed from 1 to 0.
Thanks in advance
This assigns 1 to the ids for which pens declines at some point and 0 otherwise (d.d is the data frame defined in the Note at the end):
transform(d.d, x = ave(pens, id, FUN = function(x) any(diff(x) < 0)))
year id y.b sex income pens x
1 2002 1 1950 F 100000 0 0
2 2002 2 1943 M 55000 1 1
3 2004 1 1950 F 88000 1 0
4 2004 2 1943 M 66000 0 1
5 2006 3 1966 M 12000 1 0
6 2008 3 1966 M 24000 1 0
This should work even if there are more than 2 rows per id, but if we knew there were always exactly 2 rows per id we could omit the any(), simplifying it to:
transform(d.d, x = ave(pens, id, FUN = diff) < 0)
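To illustrate why any() matters with more than two waves, here is a small sketch on a made-up three-wave panel (the data, and the names panel, dropped and dropped_ids, are hypothetical and not from the question). With three rows per id, diff() returns two values per group, so the shortcut without any() no longer lines up with the rows. The sketch also sorts by id and year first, since diff() compares consecutive rows.

```r
# Hypothetical panel: id 2 drops the pension in 2004 and re-enrols in 2006
panel <- data.frame(
  year = rep(c(2002, 2004, 2006), times = 3),
  id   = rep(1:3, each = 3),
  pens = c(0, 1, 1,   1, 0, 1,   1, 1, 1)
)

# diff() works on consecutive rows, so order by person and survey wave
panel <- panel[order(panel$id, panel$year), ]

# 1 for every row of an id that ever goes from 1 to 0, otherwise 0
panel$dropped <- ave(panel$pens, panel$id,
                     FUN = function(p) any(diff(p) < 0))

dropped_ids <- unique(panel$id[panel$dropped == 1])
print(dropped_ids)
```

Only id 2 is flagged here: id 1 moves from 0 to 1 and id 3 never changes.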
Note: The input in reproducible form is:
Lines <- "year id y.b sex income pens
2002 1 1950 F 100000 0
2002 2 1943 M 55000 1
2004 1 1950 F 88000 1
2004 2 1943 M 66000 0
2006 3 1966 M 12000 1
2008 3 1966 M 24000 1"
d.d <- read.table(text = Lines, header = TRUE, check.names = FALSE)

Delete rows for duplicate variable in R

I have panel data with duplicate years, but I want to delete the row where job value is smaller:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3
I would want the following:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 1
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 1
2 Tom 1997 3
Would there be a way to do this?
You have several options, for instance with plyr, dplyr, or base R:
# plyr
ddply(tab, .(id, name, year), summarise, job=min(job))
# dplyr
tabg <- group_by(tab, id, name, year)
summarise(tabg, job=min(job))
# base R function
aggregate(tab[,"job", drop=FALSE], tab[,3:1], min)
You can use ddply for this:
x <- read.table(textConnection("id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3"),header=T)
ddply(x,c("id","name","year"),summarise, job=max(job))
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
4 1 Jane 1997 400
5 2 Tom 1990 400
6 2 Tom 1992 500
7 2 Tom 1993 700
8 2 Tom 1997 900
Note that I have obtained what you asked for in the description. Your example output contradicts this. If you do want your example output, use min instead of max.
If your data is in a data frame df, you can use data.table:
library(data.table)
dt <- as.data.table(df)
dt[, .SD[which.min(job)], by = list(id, name, year)]
You could use base R with the function order, as suggested by James:
> tab[order(tab$job),][! duplicated(tab[order(tab$job), c('id', 'year')], fromLast=T), ]
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
5 1 Jane 1997 400
7 2 Tom 1990 400
8 2 Tom 1992 500
9 2 Tom 1993 700
11 2 Tom 1997 900
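If the example output in the question is what is wanted (keep the smaller job value for each duplicated year), the same order/duplicated idea can be turned around. A minimal sketch, assuming the question's data; sorted and result are names chosen for the illustration.

```r
tab <- read.table(header = TRUE, text = "
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3
", stringsAsFactors = FALSE)

# Sort so that, within each id/year, the smallest job comes first
sorted <- tab[order(tab$id, tab$year, tab$job), ]

# duplicated() marks every repeat after the first, so the smallest job survives
result <- sorted[!duplicated(sorted[, c("id", "year")]), ]
print(result)
```

Sorting with order(tab$id, tab$year, -tab$job) instead would keep the larger value, matching the written description rather than the example output.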

Create Indicator Data Frame Based on Interval Ranges

I am trying to create a "long" data frame of indicator ("dummy") variables out of a very peculiar type of "wide" data frame in R that has interval ranges of years defining my data.
What I have looks like this:
name year.start year.end
1 A 1990 1994
2 B 1994 1995
3 C 1993 1993
Update: I have changed the value of year.start for A to 1990 from the initial example of 1993 to address some of the answers below which rely on unique values instead of intervals.
What I would like is a long data frame that would look like this, with an entry for each of the possible years in the original data frame, eg, 1990 through 1995 where 1 = present and 0 = absent.
name year indicator
A 1990 1
A 1991 1
A 1992 1
A 1993 1
A 1994 1
A 1995 0
B 1990 0
B 1991 0
B 1992 0
B 1993 0
B 1994 1
B 1995 1
C 1990 0
C 1991 0
C 1992 0
C 1993 1
C 1994 0
C 1995 0
Try as I might, I don't see how I can do this with Hadley Wickham's reshape2 package.
Someone else might have a suggestion for reshape2, but here is a base R solution:
years <- factor(unlist(f[-1]), levels=seq(min(f[-1]), max(f[-1]), by=1))
result <- data.frame(table(years, rep(f[[1]], length.out=length(years))))
# years Var2 Freq
# 1 1990 A 1
# 2 1991 A 0
# 3 1992 A 0
# 4 1993 A 0
# 5 1994 A 1
# 6 1995 A 0
# 7 1990 B 0
# 8 1991 B 0
# 9 1992 B 0
# 10 1993 B 0
# 11 1994 B 1
# 12 1995 B 1
# 13 1990 C 0
# 14 1991 C 0
# 15 1992 C 0
# 16 1993 C 2
# 17 1994 C 0
# 18 1995 C 0
Here is a step-by-step breakdown, using data.table:
library(data.table)
f <- as.data.table(f)
ALL <- f[, CJ(name=name, year=seq(min(year.start), max(year.end)))]
PRESENT <- f[, list(year = seq(year.start, year.end)), by=name]
setkey(ALL, name, year)
setkey(PRESENT, name, year)
ALL[, indicator := 0]
ALL[PRESENT, indicator := 1]
name year indicator
1: A 1993 1
2: A 1994 1
3: A 1995 0
4: B 1993 0
5: B 1994 1
6: B 1995 1
7: C 1993 1
8: C 1994 0
9: C 1995 0
Here's another solution, similar to the ones above, which aims to be straightforward:
zz <- cbind(name=f[1],year=rep(min(f[-1]):max(f[-1]),each=nrow(f)))
zz$indicator <- as.numeric((f$name==zz$name &
f$year.start<=zz$year &
f$year.end >=zz$year))
result <- zz[order(zz$name,zz$year),]
The first line builds a template with all the names and all the years. The second line sets indicator based on whether it is present in the range. The third line just reorders the result.
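The template-then-flag approach can be checked against the updated example in the question (A running from 1990 to 1994). The following is a sketch under that assumption, using expand.grid() and merge() rather than the cbind() call above; grid and result are illustrative names.

```r
# Updated example data from the question (A starts in 1990)
f <- data.frame(name = c("A", "B", "C"),
                year.start = c(1990, 1994, 1993),
                year.end   = c(1994, 1995, 1993),
                stringsAsFactors = FALSE)

# Template: every name crossed with every year in the overall range
grid <- expand.grid(year = min(f$year.start):max(f$year.end),
                    name = f$name,
                    stringsAsFactors = FALSE)

# Attach each name's interval, then flag the years that fall inside it
grid <- merge(grid, f, by = "name")
grid$indicator <- as.integer(grid$year >= grid$year.start &
                             grid$year <= grid$year.end)

result <- grid[order(grid$name, grid$year), c("name", "year", "indicator")]
print(result)
```

Because the test is on the interval and not on the individual start and end values, the years 1991 to 1993 for A are marked as present.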
Another base R solution:
f <- data.frame(name=c("A","B","C"),
                year.start=c(1993,1994,1993),year.end=c(1994,1995,1993), stringsAsFactors=F)
x <- expand.grid(unique(f$name),min(f$year.start):max(f$year.end))
names(x) <- c("name", "year")
x$indicator <- sapply(1:nrow(x), function(i) sum(x$name[i]==f$name & x$year[i] >= f$year.start & x$year[i] <= f$year.end))
