R - clean up data based on preceding and following values

I have a table which is later divided into multiple intervals based on several conditions. In some rare cases, I have one or more rows which do not fall into the defined interval, so I'd like to perform some extra clean-up of the data.
For each group (name, location), if the value in stop == 0, I need to count how many of those rows are in the interval. If that count is less than 3, I need to check how many continuous rows are marked as stop == 1 above and below the zero-valued interval. If there is at least one row with stop == 1 immediately above and below the interval, I need to change the zeros in that interval to 1.
I hope the example data below will make it clearer:
df <- read.table(text="name location stop
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John London 0
John London 1
John London 1
John London 1
John London 1
John London 1
John London 1
John London 0
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = FALSE)

You could iterate over the rows, but it seems that all you want to do is replace every instance of 101 with 111 and 1001 with 1111 in stop. You can do this by collapsing the stop column into a string and then making the substitutions with gsub():
stopString = paste0(df$stop, collapse = "")
stopString = gsub("101","111",stopString)
stopString = gsub("1001","1111",stopString)
df$stop = as.numeric(unlist(strsplit(stopString,"")))
> df
name location stop
1 John London 1
2 John London 1
3 John London 1
4 John London 1
5 John London 1
6 John London 1
7 John London 1
8 John London 1
9 John London 1
10 John London 1
11 John London 1
12 John London 1
13 John London 1
14 John London 1
15 John London 1
16 John London 0
17 John New_York 0
18 John New_York 0
19 John New_York 0
20 John New_York 1
21 John New_York 0
Edit: grouping by name and location:
df <- read.table(text="name location stop
John London 1
John London 0
John London 1
John New_York 0
John New_York 1
John New_York 0
John New_York 0
John New_York 0
John New_York 1
John New_York 0
",header = TRUE, stringsAsFactors = TRUE)
f <- function(x) {
  stopString = paste0(x, collapse = "")
  stopString = gsub("101", "111", stopString)
  stopString = gsub("1001", "1111", stopString)
  as.numeric(unlist(strsplit(stopString, "")))
}
> df %>% dplyr::group_by(name, location) %>%
    dplyr::summarise(stop = stop, s = f(stop))
# A tibble: 10 x 4
# Groups: name, location [2]
name location stop s
<fct> <fct> <int> <dbl>
1 John London 1 1
2 John London 0 1
3 John London 1 1
4 John New_York 0 0
5 John New_York 1 1
6 John New_York 0 0
7 John New_York 0 0
8 John New_York 0 0
9 John New_York 1 1
10 John New_York 0 0
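If you need the rule from the question in its general form (flip a run of fewer than three zeros only when it is flanked by stop == 1 rows on both sides within a group), an rle()-based sketch could look like the following. The helper name fix_short_zero_runs is hypothetical; mutate() keeps the original row order.
library(dplyr)

# Hypothetical helper: within one group, turn any run of 0s shorter than 3
# into 1s when the run is surrounded by 1s on both sides.
fix_short_zero_runs <- function(stop) {
  r <- rle(stop)
  n <- length(r$values)
  for (i in seq_len(n)) {
    if (r$values[i] == 0 && r$lengths[i] < 3 &&
        i > 1 && i < n &&
        r$values[i - 1] == 1 && r$values[i + 1] == 1) {
      r$values[i] <- 1
    }
  }
  inverse.rle(r)
}

df %>%
  dplyr::group_by(name, location) %>%
  dplyr::mutate(stop = fix_short_zero_runs(stop)) %>%
  dplyr::ungroup()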

Related

R Dataframe Detecting Hidden Repeated Patterns by Group

I have a dataframe that looks like the one below:
person year location rank
Harry 2002 Los Angeles 1
Harry 2006 Boston 1
Harry 2006 Los Angeles 2
Harry 2006 Chicago 3
Peter 2001 New York 1
Peter 2002 New York 1
Lily 2005 Springfield 1
Lily 2007 New York 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 New York 2
Sam 2005 Springfield 1
Sam 2007 New York 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 New York 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
I want to know, at the person level, who has a location with rank = 1 in a certain year where that same location reappears in the next available year but with rank != 1. For example, the output should look like:
person yes/no
Harry 1
Peter 0
Lily 0
Sam 1
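For reference, one way to build this data frame (called df1 in the answer below) is sketched here; the quotes keep 'Los Angeles' and 'New York' as single fields.
df1 <- read.table(text="person year location rank
Harry 2002 'Los Angeles' 1
Harry 2006 Boston 1
Harry 2006 'Los Angeles' 2
Harry 2006 Chicago 3
Peter 2001 'New York' 1
Peter 2002 'New York' 1
Lily 2005 Springfield 1
Lily 2007 'New York' 1
Lily 2008 Boston 1
Lily 2011 Chicago 1
Lily 2011 'New York' 2
Sam 2005 Springfield 1
Sam 2007 'New York' 1
Sam 2008 Boston 1
Sam 2008 Springfield 2
Sam 2008 'New York' 3
Sam 2011 Chicago 1
Sam 2011 Springfield 2
", header = TRUE, stringsAsFactors = FALSE)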
Here's an approach with dplyr; it could probably be more concise.
library(dplyr)
df1 %>%
  # define year_num as a count of unique years [assumes data already sorted by year]
  group_by(person) %>%
  mutate(year_num = cumsum(year != lag(year, default = 0))) %>%
  # check for successive years with different ranks
  group_by(person, location) %>%
  mutate(next_yr_switch = year_num == lag(year_num, default = -Inf) + 1 & rank != lag(rank)) %>%
  group_by(person) %>%
  summarize(`yes/no` = sum(next_yr_switch))
## A tibble: 4 x 2
# person `yes/no`
#* <chr> <int>
#1 Harry 1
#2 Lily 0
#3 Peter 0
#4 Sam 1

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps, and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected; it comes from converting characters to numeric with as.numeric(as.character(df$likes)).
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
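Since the question mentions tidyr, here is one more sketch using tidyr::fill() (this assumes tidyr >= 1.0.0 for .direction = "updown" and, like the grep() approach above, it relies on the numeric entries in likes occurring only in the question == 0 rows):
library(dplyr)
library(tidyr)

df %>%
  group_by(participant, day) %>%
  # coerce likes; the fruit rows become NA (with the expected coercion warning)
  mutate(number = as.numeric(as.character(likes))) %>%
  # propagate the single numeric value to the rest of each group
  fill(number, .direction = "updown") %>%
  ungroup()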

Cumulative sums with data.table and multiple by conditions [duplicate]

I have a data set that looks like this
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
Here, job2 is a dummy variable indicating whether a person was a Manager during that year. I want to do two things with this data set: first, I want to preserve only the row in which the person became Boss for the first time. Second, I would like to compute the cumulative years a person worked as a Manager and store this information in the variable cumu_job2. Thus I would like to have:
id name year job job2 cumu_job2
1 Jane 1980 Worker 0 0
1 Jane 1981 Manager 1 1
1 Jane 1982 Manager 1 2
1 Jane 1983 Manager 1 3
1 Jane 1984 Manager 1 4
1 Jane 1985 Manager 1 5
1 Jane 1986 Boss 0 0
2 Bob 1985 Worker 0 0
2 Bob 1986 Worker 0 0
2 Bob 1987 Manager 1 1
2 Bob 1988 Boss 0 0
I have changed my examples and included the Worker position because this reflects more closely what I want to do with the original data set. The answers in this thread only work when there are only Managers and Bosses in the data set, so any suggestions for making this work would be great. I'd be very grateful!
Here is a succinct dplyr solution to the same problem.
NOTE: Make sure that stringsAsFactors = FALSE while reading in the data.
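For example, the question's table can be read into dat like this (a sketch, reproducing the data shown above):
dat <- read.table(text="id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
", header = TRUE, stringsAsFactors = FALSE)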
library(dplyr)
dat %>%
  group_by(name, job) %>%
  filter(job != "Boss" | year == min(year)) %>%
  mutate(cumu_job2 = cumsum(job2))
Output:
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
8 2 Bob 1985 Worker 0 0
9 2 Bob 1986 Worker 0 0
10 2 Bob 1987 Manager 1 1
11 2 Bob 1988 Boss 0 0
Explanation
Take the dataset
Group by name and job
Filter each group based on condition
Add cumu_job2 column.
Contributed by Matthew Dowle:
dt[, .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)],
by = list(name, job)]
Explanation
Take the dataset
Run a filter and add a column within each Subset of Data (.SD)
Grouped by name and job
Older versions:
You have two different split-apply-combine operations here: one to get the cumulative jobs, and the other to get the first row of Boss status. Here is an implementation in data.table where we basically do each analysis separately (well, kind of), and then collect everything in one place with rbind. The main thing to note is the by=id piece, which means the other expressions are evaluated for each id grouping in the data; this is what you correctly noted was missing from your attempt.
library(data.table)
dt <- as.data.table(df)
dt[, cumujob:=0L] # add column, set to zero
dt[job2==1, cumujob:=cumsum(job2), by=id] # cumsum for manager time by person
rbind(
dt[job2==1], # this is just the manager portion of the data
dt[job2==0, head(.SD, 1), by=id] # get first bossdom row
)[order(id, year)] # order by id, year
# id name year job job2 cumujob
# 1: 1 Jane 1980 Manager 1 1
# 2: 1 Jane 1981 Manager 1 2
# 3: 1 Jane 1982 Manager 1 3
# 4: 1 Jane 1983 Manager 1 4
# 5: 1 Jane 1984 Manager 1 5
# 6: 1 Jane 1985 Manager 1 6
# 7: 1 Jane 1986 Boss 0 0
# 8: 2 Bob 1985 Manager 1 1
# 9: 2 Bob 1986 Manager 1 2
# 10: 2 Bob 1987 Manager 1 3
# 11: 2 Bob 1988 Boss 0 0
Note this assumes table is sorted by year within each id, but if it isn't that's easy enough to fix.
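For instance, a minimal sketch of that fix, assuming dt has already been created as above:
# sort by year within each id before computing the cumulative sums
setkey(dt, id, year)   # or: dt <- dt[order(id, year)]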
Alternatively you could also achieve the same with:
ans <- dt[, .I[job != "Boss" | year == min(year)], by=list(name, job)]
ans <- dt[ans$V1]
ans[, cumujob := cumsum(job2), by=list(name,job)]
The idea is to get the row numbers where the condition matches (using .I, data.table's internal row-index variable), then subset dt on those row numbers (the ans$V1 part), and finally perform the cumulative sum.
Here is a base solution using within and ave. We assume that the input is DF and that the data is sorted as in the question.
DF2 <- within(DF, {
  seq = ave(id, id, job, FUN = seq_along)
  job2 = (job == "Manager") + 0
  cumu_job2 = ave(job2, id, job, FUN = cumsum)
})
subset(DF2, job != 'Boss' | seq == 1, select = - seq)
REVISION: Now uses within.
I think this does what you want, although the data must be sorted as you have presented it.
my.df <- read.table(text = '
id name year job job2
1 Jane 1980 Worker 0
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Worker 0
2 Bob 1986 Worker 0
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0
', header = TRUE, stringsAsFactors = FALSE)
my.seq <- data.frame(rle(my.df$job)$lengths)
my.df$cumu_job2 <- as.vector(unlist(apply(my.seq, 1, function(x) seq(1,x))))
my.df2 <- my.df[!(my.df$job=='Boss' & my.df$cumu_job2 != 1),]
my.df2$cumu_job2[my.df2$job != 'Manager'] <- 0
id name year job job2 cumu_job2
1 1 Jane 1980 Worker 0 0
2 1 Jane 1981 Manager 1 1
3 1 Jane 1982 Manager 1 2
4 1 Jane 1983 Manager 1 3
5 1 Jane 1984 Manager 1 4
6 1 Jane 1985 Manager 1 5
7 1 Jane 1986 Boss 0 0
9 2 Bob 1985 Worker 0 0
10 2 Bob 1986 Worker 0 0
11 2 Bob 1987 Manager 1 1
12 2 Bob 1988 Boss 0 0
#BrodieG's is way better:
The Data
dat <- read.table(text="id name year job job2
1 Jane 1980 Manager 1
1 Jane 1981 Manager 1
1 Jane 1982 Manager 1
1 Jane 1983 Manager 1
1 Jane 1984 Manager 1
1 Jane 1985 Manager 1
1 Jane 1986 Boss 0
1 Jane 1987 Boss 0
2 Bob 1985 Manager 1
2 Bob 1986 Manager 1
2 Bob 1987 Manager 1
2 Bob 1988 Boss 0
2 Bob 1989 Boss 0
2 Bob 1990 Boss 0
2 Bob 1991 Boss 0
2 Bob 1992 Boss 0", header=TRUE)
#The code:
inds1 <- rle(dat$job2)
inds2 <- cumsum(inds1[[1]])[inds1[[2]] == 1] + 1
ends <- cumsum(inds1[[1]])
starts <- c(1, head(ends + 1, -1))
inds3 <- mapply(":", starts, ends)
dat$id <- rep(1:length(inds3), sapply(inds3, length))
dat <- do.call(rbind, lapply(split(dat[, 1:5], dat$id ), function(x) {
if(x$job2[1] == 0){
x$cumu_job2 <- rep(0, nrow(x))
} else {
x$cumu_job2 <- 1:nrow(x)
}
x
}))
keeps <- dat$job2 > 0
keeps[inds2] <- TRUE
dat2 <- data.frame(dat[keeps, ], row.names = NULL)
dat2
## id name year job job2 cumu_job2
## 1 1 Jane 1980 Manager 1 1
## 2 1 Jane 1981 Manager 1 2
## 3 1 Jane 1982 Manager 1 3
## 4 1 Jane 1983 Manager 1 4
## 5 1 Jane 1984 Manager 1 5
## 6 1 Jane 1985 Manager 1 6
## 7 2 Jane 1986 Boss 0 0
## 8 3 Bob 1985 Manager 1 1
## 9 3 Bob 1986 Manager 1 2
## 10 3 Bob 1987 Manager 1 3
## 11 4 Bob 1988 Boss 0 0

delete rows for duplicate variable in R

I have panel data with duplicate years, and I want to delete the row where the job value is smaller:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3
I would want the following:
id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 1
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 1
2 Tom 1997 3
Would there be a way to do this?
You have different possibilities, for instance with plyr and dplyr:
# plyr
ddply(tab, .(id, name, year), summarise, job=min(job))
# dplyr
tabg <- group_by(tab, id, name, year)
summarise(tabg, job=min(job))
# base R function
aggregate(tab[,"job", drop=FALSE], tab[,3:1], min)
You can use ddply for this:
x <- read.table(textConnection("id name year job
1 Jane 1990 100
1 Jane 1992 200
1 Jane 1993 300
1 Jane 1993 1
1 Jane 1997 400
1 Jane 1997 2
2 Tom 1990 400
2 Tom 1992 500
2 Tom 1993 700
2 Tom 1993 1
2 Tom 1997 900
2 Tom 1997 3"),header=T)
library(plyr)
ddply(x,c("id","name","year"),summarise, job=max(job))
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
4 1 Jane 1997 400
5 2 Tom 1990 400
6 2 Tom 1992 500
7 2 Tom 1993 700
8 2 Tom 1997 900
Note that I have obtained what you asked for in the description. Your example output contradicts this. If you do want your example output, use min instead of max.
If your data is in a data frame df:
library(data.table)
dt <- as.data.table(df)
dt[, .SD[which.min(job)], by = list(id, name, year)]
You could use base R with the function order, as suggested by James:
> tab[order(tab$job),][! duplicated(tab[order(tab$job), c('id', 'year')], fromLast=T), ]
id name year job
1 1 Jane 1990 100
2 1 Jane 1992 200
3 1 Jane 1993 300
5 1 Jane 1997 400
7 2 Tom 1990 400
8 2 Tom 1992 500
9 2 Tom 1993 700
11 2 Tom 1997 900
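For a dplyr (>= 1.0.0) alternative, a sketch using slice_min() would reproduce the example output (keep the row with the smaller job within each id/name/year); swap in slice_max() to keep the larger value instead, as discussed above.
library(dplyr)

x %>%
  group_by(id, name, year) %>%
  slice_min(job, n = 1, with_ties = FALSE) %>%
  ungroup()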

how to cumulatively add values in one vector in R

