delete rows for duplicate variable in R - r

I have panel data with duplicate years, but I want to delete the row where job value is smaller:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3
I would want the following:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 1
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 1
2 Tom 1997 3
Would there be a way to do this?

you have different possibilities for instance with plyr and dplyr :
# plyr
ddply(tab, .(id, name, year), summarise, job=min(job))
# dplyr
tabg <- group_by(tab, id, name, year)
summarise(tabg, job=min(job))
# basic fonction
aggregate(tab[,"job", drop=FALSE], tab[,3:1], min)

You can use ddply for this:
x <- read.table(textConnection("id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3"),header=T)
library(plyr)
ddply(x,c("id","name","year"),summarise, job=max(job))
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
4 1 Jane 1997 400
5 2 Tom 1990 400
6 2 Tom 1992 500
7 2 Tom 1993 700
8 2 Tom 1997 900
Note that I have obtained what you asked for in the description. Your example output contradicts this. If you do want your example output, use min instead of max.

If your data is data frame df
library(data.table)
dt <- as.data.table(df)
dt[, .SD[which.min(job)], by = list(id, name, year)]

You could use base R with the function order, as suggested by James:
> tab[order(tab$job),][! duplicated(tab[order(tab$job), c('id', 'year')], fromLast=T), ]
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
5 1 Jane 1997 400
7 2 Tom 1990 400
8 2 Tom 1992 500
9 2 Tom 1993 700
11 2 Tom 1997 900

Related

How to join/ merge two tables using character values?

I would like to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether the row from table 1 was present in the 2nd table.
First table is a panel data set of some attributes of NBA players during a season:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
year<-c("1991","1992","1993","1991","1992","1993","1992","1992")
season<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
7 Larry Bird 1992
8 Larry Bird 1992
The second data.frame is a panel data set of some attributes of NBA players selected to the All-Star game:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
year<-c("1991","1992","1993","1991","1992","1993")
ALLSTARS<-data.frame(firstname,lastname,year)
firstname lastname year
1 Michael Jordan 1991
2 Michael Jordan 1992
3 Michael Jordan 1993
4 Magic Johnson 1991
5 Magic Johnson 1992
6 Magic Johnson 1993
My desired result looks like:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
I tried to use a left join. But not sure whether that makes sense:
test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
Here's a simple solution using data.table binary join which allows you to update a column by reference while joing
library(data.table)
setkey(setDT(season), firstname, lastname, year)[ALLSTARS, allstars := 1L]
season
# firstname lastname year allstars
# 1: Larry Bird 1992 NA
# 2: Larry Bird 1992 NA
# 3: Magic Johnson 1991 1
# 4: Magic Johnson 1992 1
# 5: Magic Johnson 1993 1
# 6: Michael Jordan 1991 1
# 7: Michael Jordan 1992 1
# 8: Michael Jordan 1993 1
Or using dplyr
library(dplyr)
ALLSTARS %>%
mutate(allstars = 1L) %>%
right_join(., season)
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 NA
# 8 Larry Bird 1992 NA
In base R:
ALLSTARS$allstars <- 1L
newdf <- merge(season, ALLSTARS, by=c('firstname', 'lastname', 'year'), all.x=TRUE)
newdf$allstars[is.na(newdf$allstars)] <- 0L
newdf
Or one I like for a different approach:
season$allstars <- (apply(season, 1, function(x) paste(x, collapse='')) %in%
apply(ALLSTARS, 1, function(x) paste(x, collapse='')))+0L
#
# firstname lastname year allstars
# 1 Michael Jordan 1991 1
# 2 Michael Jordan 1992 1
# 3 Michael Jordan 1993 1
# 4 Magic Johnson 1991 1
# 5 Magic Johnson 1992 1
# 6 Magic Johnson 1993 1
# 7 Larry Bird 1992 0
# 8 Larry Bird 1992 0
It looks like you are using join() from the plyr package. You were almost there: just preface your command with ALLSTARS$allstars <- 1. Then do your join as it is written and finally convert the NA values to 0. So:
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
test$allstars[is.na(test$allstars)] <- 0
Result:
firstname lastname year allstars
1 Michael Jordan 1991 1
2 Michael Jordan 1992 1
3 Michael Jordan 1993 1
4 Magic Johnson 1991 1
5 Magic Johnson 1992 1
6 Magic Johnson 1993 1
7 Larry Bird 1992 0
8 Larry Bird 1992 0
Though I personally would use left_join or right_join from the dplyr package, as in David's answer, instead of plyr's join(). Also note that you don't actually need the by argument of join() in this case as by default the function will try to join on all fields with common names, which is what you want here.

Cumulative sums with data.table and multiple by conditions [duplicate]

I have a data set that looks like this
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
Here, job2 denotes a dummy variable indicating whether a person was a Manager during that year or not. I want to do two things to this data set: first, I only want to preserve the row when the person became Boss for the first time. Second, I would like to see cumulative years a person worked as a Manager and store this information in the variable cumu_job2. Thus I would like to have:
id name year job job2 cumu_job2
1 Jane 1980 Worker 0 0
1 Jane 1981 Manager 1 1
1 Jane 1982 Manager 1 2
1 Jane 1983 Manager 1 3
1 Jane 1984 Manager 1 4
1 Jane 1985 Manager 1 5
1 Jane 1986 Boss 0 0
2 Bob 1985 Worker 0 0
2 Bob 1986 Worker 0 0
2 Bob 1987 Manager 1 1
2 Bob 1988 Boss 0 0
I have changed my examples and included the Worker position because this reflects more what I want to do with the original data set. The answers in this thread only works when there are only Managers and Boss in the data set - so any suggestions for making this work would be great. I'll be very much grateful!!
Here is the succinct dplyr solution for the same problem.
NOTE: Make sure that stringsAsFactors = FALSE while reading in the data.
library(dplyr)
dat %>%
group_by(name, job) %>%
filter(job != "Boss" | year == min(year)) %>%
mutate(cumu_job2 = cumsum(job2))
Output:
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
8 2 Bob 1985 Worker 0 0
9 2 Bob 1986 Worker 0 0
10 2 Bob 1987 Manager 1 1
11 2 Bob 1988 Boss 0 0
Explanation
Take the dataset
Group by name and job
Filter each group based on condition
Add cumu_job2 column.
Contributed by Matthew Dowle:
dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by = list(name, job)]
Explanation
Take the dataset
Run a filter and add a column within each Subset of Data (.SD)
Grouped by name and job
Older versions:
You have two different split apply combines here. One to get the cumulative jobs, and the other to get the first row of boss status. Here is an implementation in data.table where we basically do each analysis separately (well, kind of), and then collect everything in one place with rbind. The main thing to note is the by=id piece, which basically means the other expressions are evaluated for each id grouping in the data, which was what you correctly noted was missing from your attempt.
library(data.table)
dt <- as.data.table(df)
dt[, cumujob:=0L] # add column, set to zero
dt[job2==1, cumujob:=cumsum(job2), by=id] # cumsum for manager time by person
rbind(
dt[job2==1], # this is just the manager portion of the data
dt[job2==0, head(.SD, 1), by=id] # get first bossdom row
)[order(id, year)] # order by id, year
# id name year job job2 cumujob
# 1: 1 Jane 1980 Manager 1 1
# 2: 1 Jane 1981 Manager 1 2
# 3: 1 Jane 1982 Manager 1 3
# 4: 1 Jane 1983 Manager 1 4
# 5: 1 Jane 1984 Manager 1 5
# 6: 1 Jane 1985 Manager 1 6
# 7: 1 Jane 1986 Boss 0 0
# 8: 2 Bob 1985 Manager 1 1
# 9: 2 Bob 1986 Manager 1 2
# 10: 2 Bob 1987 Manager 1 3
# 11: 2 Bob 1988 Boss 0 0
Note this assumes table is sorted by year within each id, but if it isn't that's easy enough to fix.
Alternatively you could also achieve the same with:
ans <- dt[, .I[job != "Boss" | year == min(year)], by=list(name, job)]
ans <- dt[ans$V1]
ans[, cumujob := cumsum(job2), by=list(name,job)]
The idea is to basically get the row numbers where the condition matches (with .I - internal variable) and then subset dt on those row numbers (the $v1 part), then just perform the cumulative sum.
Here is a base solution using within and ave. We assume that the input is DF and that the data is sorted as in the question.
DF2 <- within(DF, {
seq = ave(id, id, job, FUN = seq_along)
job2 = (job == "Manager") + 0
cumu_job2 = ave(job2, id, job, FUN = cumsum)
})
subset(DF2, job != 'Boss' | seq == 1, select = - seq)
REVISION: Now uses within.
I think this does what you want, although the data must be sorted as you have presented it.
my.df <- read.table(text = '
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
', header = TRUE, stringsAsFactors = FALSE)
my.seq <- data.frame(rle(my.df$job)$lengths)
my.df$cumu_job2 <- as.vector(unlist(apply(my.seq, 1, function(x) seq(1,x))))
my.df2 <- my.df[!(my.df$job=='Boss' & my.df$cumu_job2 != 1),]
my.df2$cumu_job2[my.df2$job != 'Manager'] <- 0
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
9 2 Bob 1985 Worker 0 0
10 2 Bob 1986 Worker 0 0
11 2 Bob 1987 Manager 1 1
12 2 Bob 1988 Boss 0 0
#BrodieG's is way better:
The Data
dat <- read.table(text="id name year job job2
1 Jane 1980 Manager 1
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Manager 1
2 Bob 1986 Manager 1
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0", header=TRUE)
#The code:
inds1 <- rle(dat$job2)
inds2 <- cumsum(inds1[[1]])[inds1[[2]] == 1] + 1
ends <- cumsum(inds1[[1]])
starts <- c(1, head(ends + 1, -1))
inds3 <- mapply(":", starts, ends)
dat$id <- rep(1:length(inds3), sapply(inds3, length))
dat <- do.call(rbind, lapply(split(dat[, 1:5], dat$id ), function(x) {
if(x$job2[1] == 0){
x$cumu_job2 <- rep(0, nrow(x))
} else {
x$cumu_job2 <- 1:nrow(x)
}
x
}))
keeps <- dat$job2 > 0
keeps[inds2] <- TRUE
dat2 <- data.frame(dat[keeps, ], row.names = NULL)
dat2
## id name year job job2 cumu_job2
## 1 1 Jane 1980 Manager 1 1
## 2 1 Jane 1981 Manager 1 2
## 3 1 Jane 1982 Manager 1 3
## 4 1 Jane 1983 Manager 1 4
## 5 1 Jane 1984 Manager 1 5
## 6 1 Jane 1985 Manager 1 6
## 7 2 Jane 1986 Boss 0 0
## 8 3 Bob 1985 Manager 1 1
## 9 3 Bob 1986 Manager 1 2
## 10 3 Bob 1987 Manager 1 3
## 11 4 Bob 1988 Boss 0 0

select maximum row value by group

I've been trying to do this with my data by looking at other posts, but I keep getting an error. My data new looks like this:
id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90
I want to select the row with the highest year value by id. So the wanted output is:
id year name gdp
1 1982 Jamie 70
2 1992 Kate 67
3 1996 Joe 90
From Selecting Rows which contain daily max value in R I tried the following but did not work
ddply(new,~id,function(x){x[which.max(new$year),]})
I've also tried
tapply(new$year, new$id, max)
But this didn't give me the wanted output.
Any suggestions would really help!
Another option that scales well for large tables is using data.table.
DT <- read.table(text = "id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90",
header = TRUE)
require("data.table")
DT <- as.data.table(DT)
setkey(DT,id,year)
res = DT[,j=list(year=year[which.max(gdp)]),by=id]
res
setkey(res,id,year)
DT[res]
# id year name gdp
# 1: 1 1982 Jamie 70
# 2: 2 1992 Kate 67
# 3: 3 1996 Joe 90
Just use split:
df <- do.call(rbind, lapply(split(df, df$id),
function(subdf) subdf[which.max(subdf$year)[1], ]))
For example,
df <- data.frame(id = rep(1:10, each = 3), year = round(runif(30,0,10)) + 1980, gdp = round(runif(30, 40, 70)))
print(head(df))
# id year gdp
# 1 1 1990 49
# 2 1 1981 47
# 3 1 1987 69
# 4 2 1985 57
# 5 2 1989 41
# 6 2 1988 54
df <- do.call(rbind, lapply(split(df, df$id), function(subdf) subdf[which.max(subdf$year)[1], ]))
print(head(df))
# id year gdp
# 1 1 1990 49
# 2 2 1989 41
# 3 3 1989 55
# 4 4 1988 62
# 5 5 1989 48
# 6 6 1990 41
You can do this with duplicated
# your data
df <- read.table(text="id year name gdp
1 1980 Jamie 45
1 1981 Jamie 60
1 1982 Jamie 70
2 1990 Kate 40
2 1991 Kate 25
2 1992 Kate 67
3 1994 Joe 35
3 1995 Joe 78
3 1996 Joe 90" , header=TRUE)
# Sort by id and year (latest year is last for each id)
df <- df[order(df$id , df$year), ]
# Select the last row by id
df <- df[!duplicated(df$id, fromLast=TRUE), ]
ave works here yet again, and will account for a circumstance with multiple rows for the maximum year.
new[with(new, year == ave(year,id,FUN=max) ),]
# id year name gdp
#3 1 1982 Jamie 70
#6 2 1992 Kate 67
#9 3 1996 Joe 90
Your ddply effort looks good to me, but you referenced the original dataset in the callback function.
ddply(new,~id,function(x){x[which.max(new$year),]})
# should be
ddply(new,.(id),function(x){x[which.max(x$year),]})

how to cumulatively add values in one vector in R

I have a data set that looks like this
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
Here, job2 denotes a dummy variable indicating whether a person was a Manager during that year or not. I want to do two things to this data set: first, I only want to preserve the row when the person became Boss for the first time. Second, I would like to see cumulative years a person worked as a Manager and store this information in the variable cumu_job2. Thus I would like to have:
id name year job job2 cumu_job2
1 Jane 1980 Worker 0 0
1 Jane 1981 Manager 1 1
1 Jane 1982 Manager 1 2
1 Jane 1983 Manager 1 3
1 Jane 1984 Manager 1 4
1 Jane 1985 Manager 1 5
1 Jane 1986 Boss 0 0
2 Bob 1985 Worker 0 0
2 Bob 1986 Worker 0 0
2 Bob 1987 Manager 1 1
2 Bob 1988 Boss 0 0
I have changed my examples and included the Worker position because this reflects more what I want to do with the original data set. The answers in this thread only works when there are only Managers and Boss in the data set - so any suggestions for making this work would be great. I'll be very much grateful!!
Here is the succinct dplyr solution for the same problem.
NOTE: Make sure that stringsAsFactors = FALSE while reading in the data.
library(dplyr)
dat %>%
group_by(name, job) %>%
filter(job != "Boss" | year == min(year)) %>%
mutate(cumu_job2 = cumsum(job2))
Output:
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
8 2 Bob 1985 Worker 0 0
9 2 Bob 1986 Worker 0 0
10 2 Bob 1987 Manager 1 1
11 2 Bob 1988 Boss 0 0
Explanation
Take the dataset
Group by name and job
Filter each group based on condition
Add cumu_job2 column.
Contributed by Matthew Dowle:
dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by = list(name, job)]
Explanation
Take the dataset
Run a filter and add a column within each Subset of Data (.SD)
Grouped by name and job
Older versions:
You have two different split apply combines here. One to get the cumulative jobs, and the other to get the first row of boss status. Here is an implementation in data.table where we basically do each analysis separately (well, kind of), and then collect everything in one place with rbind. The main thing to note is the by=id piece, which basically means the other expressions are evaluated for each id grouping in the data, which was what you correctly noted was missing from your attempt.
library(data.table)
dt <- as.data.table(df)
dt[, cumujob:=0L] # add column, set to zero
dt[job2==1, cumujob:=cumsum(job2), by=id] # cumsum for manager time by person
rbind(
dt[job2==1], # this is just the manager portion of the data
dt[job2==0, head(.SD, 1), by=id] # get first bossdom row
)[order(id, year)] # order by id, year
# id name year job job2 cumujob
# 1: 1 Jane 1980 Manager 1 1
# 2: 1 Jane 1981 Manager 1 2
# 3: 1 Jane 1982 Manager 1 3
# 4: 1 Jane 1983 Manager 1 4
# 5: 1 Jane 1984 Manager 1 5
# 6: 1 Jane 1985 Manager 1 6
# 7: 1 Jane 1986 Boss 0 0
# 8: 2 Bob 1985 Manager 1 1
# 9: 2 Bob 1986 Manager 1 2
# 10: 2 Bob 1987 Manager 1 3
# 11: 2 Bob 1988 Boss 0 0
Note this assumes table is sorted by year within each id, but if it isn't that's easy enough to fix.
Alternatively you could also achieve the same with:
ans <- dt[, .I[job != "Boss" | year == min(year)], by=list(name, job)]
ans <- dt[ans$V1]
ans[, cumujob := cumsum(job2), by=list(name,job)]
The idea is to basically get the row numbers where the condition matches (with .I - internal variable) and then subset dt on those row numbers (the $v1 part), then just perform the cumulative sum.
Here is a base solution using within and ave. We assume that the input is DF and that the data is sorted as in the question.
DF2 <- within(DF, {
seq = ave(id, id, job, FUN = seq_along)
job2 = (job == "Manager") + 0
cumu_job2 = ave(job2, id, job, FUN = cumsum)
})
subset(DF2, job != 'Boss' | seq == 1, select = - seq)
REVISION: Now uses within.
I think this does what you want, although the data must be sorted as you have presented it.
my.df <- read.table(text = '
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
', header = TRUE, stringsAsFactors = FALSE)
my.seq <- data.frame(rle(my.df$job)$lengths)
my.df$cumu_job2 <- as.vector(unlist(apply(my.seq, 1, function(x) seq(1,x))))
my.df2 <- my.df[!(my.df$job=='Boss' & my.df$cumu_job2 != 1),]
my.df2$cumu_job2[my.df2$job != 'Manager'] <- 0
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
9 2 Bob 1985 Worker 0 0
10 2 Bob 1986 Worker 0 0
11 2 Bob 1987 Manager 1 1
12 2 Bob 1988 Boss 0 0
#BrodieG's is way better:
The Data
dat <- read.table(text="id name year job job2
1 Jane 1980 Manager 1
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Manager 1
2 Bob 1986 Manager 1
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0", header=TRUE)
#The code:
inds1 <- rle(dat$job2)
inds2 <- cumsum(inds1[[1]])[inds1[[2]] == 1] + 1
ends <- cumsum(inds1[[1]])
starts <- c(1, head(ends + 1, -1))
inds3 <- mapply(":", starts, ends)
dat$id <- rep(1:length(inds3), sapply(inds3, length))
dat <- do.call(rbind, lapply(split(dat[, 1:5], dat$id ), function(x) {
if(x$job2[1] == 0){
x$cumu_job2 <- rep(0, nrow(x))
} else {
x$cumu_job2 <- 1:nrow(x)
}
x
}))
keeps <- dat$job2 > 0
keeps[inds2] <- TRUE
dat2 <- data.frame(dat[keeps, ], row.names = NULL)
dat2
## id name year job job2 cumu_job2
## 1 1 Jane 1980 Manager 1 1
## 2 1 Jane 1981 Manager 1 2
## 3 1 Jane 1982 Manager 1 3
## 4 1 Jane 1983 Manager 1 4
## 5 1 Jane 1984 Manager 1 5
## 6 1 Jane 1985 Manager 1 6
## 7 2 Jane 1986 Boss 0 0
## 8 3 Bob 1985 Manager 1 1
## 9 3 Bob 1986 Manager 1 2
## 10 3 Bob 1987 Manager 1 3
## 11 4 Bob 1988 Boss 0 0

selecting rows of which the value of the variable is equal to certain vector

I have a longitudinal data called df for more than 1000 people that looks like the following:
id year name status
1 1984 James 4
1 1985 James 1
2 1983 John 2
2 1984 John 1
3 1980 Amy 2
3 1981 Amy 2
4 1930 Jane 4
4 1931 Jane 5
I'm trying to subset the data by certain id. For instance, I have a vector dd that consists of ids that I would like to subset:
dd<-c(1,3)
I've tried the following but it did not work, for instance:
subset<-subset(df, subset(df$id==dd))
or
subset<-subset(df, subset(unique(df$id))==dd))
or
subset<-df[which(unique(df$id)==dd),]
or I tried a for-loop
for (i in 1:2){
subset<-subset(df, subset=(unique(df$id)==dd[i]))
}
Would there be a way to only select the rows with the ids that match the numbers in in the vector dd?
Use %in% and logical indexing:
df[df$id %in% dd,]
id year name status
1 1 1984 James 4
2 1 1985 James 1
5 3 1980 Amy 2
6 3 1981 Amy 2
As an alternative you can use 'dplyr' a new package (author: Hadley Wickham ) which provides a blazingly fast set of tools for efficiently manipulating datasets.
require(dplyr)
filter(df,id %in% dd )
id year name status
1 1 1984 James 4
2 1 1985 James 1
3 3 1980 Amy 2
4 3 1981 Amy 2

Resources