Removing rows where data isn't sequential in R with dplyr

I have a data frame where I am trying to remove rows where the year is not sequential.
Here is a sample of my data frame:
Name Year Position Year_diff FBv ind1 velo_diff
1 Aaron Heilman 2005 RP 2 90.1 TRUE 0.0
2 Aaron Heilman 2003 SP NA 89.4 NA 0.0
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
5 Alexi Ogando 2015 RP 2 94.5 TRUE 0.0
6 Alexi Ogando 2013 SP NA 93.4 FALSE 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
The expected output should be:
Name Year Position Year_diff FBv ind1 velo_diff
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
The reason Alexi Ogando's 2011-2012 rows are still there is that his SP-to-RP sequence happens in consecutive years; his 2013-2015 SP-to-RP sequence does not.
One element which might help: in each sequence where the years aren't sequential, velo_diff will be 0.0.
Would anybody know how to do this? All help is appreciated.

You can do a grouped filter, checking if the subsequent or previous year exists and if the Position matches accordingly:
library(dplyr)
df <- read.table(text = 'Name Year Position Year_diff FBv ind1 velo_diff
1 "Aaron Heilman" 2005 RP 2 90.1 TRUE 0.0
2 "Aaron Heilman" 2003 SP NA 89.4 NA 0.0
3 "Aaron Laffey" 2010 RP 1 86.8 TRUE -0.6
4 "Aaron Laffey" 2009 SP NA 87.4 NA 0.0
5 "Alexi Ogando" 2015 RP 2 94.5 TRUE 0.0
6 "Alexi Ogando" 2013 SP NA 93.4 FALSE 0.0
7 "Alexi Ogando" 2012 RP 1 97.0 TRUE 1.9
8 "Alexi Ogando" 2011 SP NA 95.1 NA 0.0', header = TRUE)
df %>% group_by(Name) %>%
filter(((Year - 1) %in% Year & Position == 'RP') |
((Year + 1) %in% Year & Position == 'SP'))
#> Source: local data frame [4 x 7]
#> Groups: Name [2]
#>
#> Name Year Position Year_diff FBv ind1 velo_diff
#> <fctr> <int> <fctr> <int> <dbl> <lgl> <dbl>
#> 1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#> 2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#> 3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#> 4 Alexi Ogando 2011 SP NA 95.1 NA 0.0

We can use data.table
library(data.table)
setDT(df1)[df1[, .I[abs(diff(Year))==1], .(Name, grp = cumsum(Position == "RP"))]$V1]
# Name Year Position Year_diff FBv ind1 velo_diff
#1: Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2: Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3: Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4: Alexi Ogando 2011 SP NA 95.1 NA 0.0
Or using the same methodology with dplyr
library(dplyr)
df1 %>%
  group_by(Name, grp = cumsum(Position == "RP")) %>%
  filter(abs(diff(Year)) == 1) %>% # the two steps below may not be needed
  ungroup() %>%
  select(-grp)
# A tibble: 4 × 7
# Name Year Position Year_diff FBv ind1 velo_diff
# <chr> <int> <chr> <int> <dbl> <lgl> <dbl>
#1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4 Alexi Ogando 2011 SP NA 95.1 NA 0.0
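As a side note, the `cumsum(Position == "RP")` grouping may be easier to see on a bare vector. A minimal sketch using Ogando's rows from the sample (base R only): because the data are in descending Year order, each RP row starts a new group that also captures the SP rows following it, and groups whose years aren't consecutive get dropped.

```r
# Alexi Ogando's rows, 2015 down to 2011, as in the sample data
Position <- c("RP", "SP", "RP", "SP")
Year     <- c(2015, 2013, 2012, 2011)

# Each RP row increments the counter, so every RP/SP pair shares a group id
grp <- cumsum(Position == "RP")
grp
#> [1] 1 1 2 2

# Group 1 (2015/2013) has a gap of 2 years and is dropped;
# group 2 (2012/2011) is consecutive and is kept
tapply(Year, grp, function(y) abs(diff(y)) == 1)
```

This is the same keep/drop decision that `filter(abs(diff(Year)) == 1)` makes per group in the answers above.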

Related

Averaging my data into quarterly means (nfrequency error message)

I'm trying to average my data into quarterly means, but when I use the following code I get this error:
code
quarterly = aggregate(overturning_ts, nfrequency = 4, mean)
error message
Error in aggregate.ts(overturning_ts, nfrequency = 4, mean) :
cannot change frequency from 1 to 4
data snippet
overturning_ts
year month day hour Quarter Days_since_start Overturning_Strength
[1,] 2004 4 2 0 2 1.0 9.689933
[2,] 2004 4 2 12 2 1.5 10.193495
[3,] 2004 4 3 0 2 2.0 10.660849
[4,] 2004 4 3 12 2 2.5 11.077229
[5,] 2004 4 4 0 2 3.0 11.432414
[6,] 2004 4 4 12 2 3.5 11.721769
All data is available here; after downloading, I just converted it to a time series to get overturning_ts: https://drive.google.com/file/d/1NV3aKsvpPkGatLnuUMbvLpxhcYs_gdM-/view?usp=sharing
The outcome I am looking for:
Qtr1 Qtr2 Qtr3 Qtr4
1960 160.1 129.7 84.8 120.1
1961 160.1 124.9 84.8 116.9
1962 169.7 140.9 89.7 123.3
Like this?
library(tidyverse)
df %>%
  group_by(year, Quarter) %>%
  summarise(avg_overturning = mean(Overturning_Strength, na.rm = TRUE)) %>%
  pivot_wider(names_from = Quarter,
              values_from = avg_overturning, names_sort = TRUE)
# A tibble: 11 x 5
# Groups: year [11]
year `1` `2` `3` `4`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2004 NA 15.3 23.7 17.7
2 2005 14.0 18.7 21.7 22.5
3 2006 17.1 17.7 20.5 20.8
4 2007 18.9 15.5 17.9 20.0
5 2008 18.5 15.5 16.1 20.2
6 2009 16.3 14.9 15.3 12.2
7 2010 8.89 16.2 19.7 15.1
8 2011 15.1 16.0 17.8 18.4
9 2012 15.8 11.9 16.4 16.5
10 2013 11.9 17.1 17.6 18.8
11 2014 15.1 NA NA NA
We can use base R
with(df1, tapply(Overturning_Strength, list(year, Quarter),
FUN = mean, na.rm = TRUE))
1 2 3 4
2004 NA 15.34713 23.74958 17.65220
2005 13.950342 18.66797 21.73983 22.49755
2006 17.116492 17.71430 20.50190 20.84159
2007 18.918347 15.46002 17.87220 20.01701
2008 18.508666 15.53064 16.06696 20.21658
2009 16.255357 14.85671 15.28269 12.16084
2010 8.889602 16.18042 19.74318 15.05649
2011 15.130970 15.96652 17.79070 18.35192
2012 15.793286 11.90334 16.37805 16.45706
2013 11.867353 17.07688 17.60640 18.81432
2014 15.119643 NA NA NA
Or with xtabs from base R
xtabs(Overturning_Strength ~ year + Quarter,
df1)/table(df1[c("year", "Quarter")])
Quarter
year 1 2 3 4
2004 15.347126 23.749583 17.652204
2005 13.950342 18.667970 21.739828 22.497550
2006 17.116492 17.714298 20.501897 20.841587
2007 18.918347 15.460020 17.872199 20.017007
2008 18.508666 15.530639 16.066960 20.216581
2009 16.255357 14.856708 15.282690 12.160845
2010 8.889602 16.180422 19.743183 15.056486
2011 15.130970 15.966518 17.790699 18.351916
2012 15.793286 11.903337 16.378045 16.457062
2013 11.867353 17.076883 17.606403 18.814323
2014 15.119643
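The division in the xtabs answer works because a table of per-group sums divided cell-by-cell by a table of per-group counts is a table of per-group means. A tiny sketch with made-up data:

```r
# Group means as "sum table / count table", mirroring the xtabs answer above
df_small <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 10))

sums   <- xtabs(x ~ g, df_small)   # per-group sums:   a = 4, b = 10
counts <- table(df_small$g)        # per-group counts: a = 2, b = 1
sums / counts                      # per-group means:  a = 2, b = 10
```

Cells with zero observations come out as 0/0 = NaN, which is why the 2004 Qtr1 cell is blank in the output above.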
As it seems your data is already structured with quarters as a column, a possible solution could be to use dplyr directly, without making it a time-series object with ts(). We group_by every year-quarter pair, summarise the strength value, and change to the desired wide format with pivot_wider.
library(dplyr)
library(tidyr) # for pivot_wider()

overturning |>
  select(year, Quarter, Overturning_Strength) |>
  group_by(year, Quarter) |>
  summarise(value = mean(Overturning_Strength)) |>
  ungroup() |>
  pivot_wider(id_cols = year, names_from = Quarter,
              names_prefix = "Qtr", names_sort = TRUE)
  # values_from defaults to the `value` column

Computing lags but grouping by two categories with dplyr

What I want is to create var3 using a lag (dplyr package), but it should be consistent with the year and the ID; that is, the lag should stay within the corresponding ID. The dataset is like an unbalanced panel.
YEAR ID VARS
2010 1 -
2011 1 -
2012 1 -
2010 2 -
2011 2 -
2012 2 -
2010 3 -
...
My issue is similar to the following question/post, but grouping by two categories:
dplyr: lead() and lag() wrong when used with group_by()
I tried to extend the solution, unsuccessfully (I get NAs).
Attempt #1:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
var3 = var1 - dplyr::lag(var2))
)
Attempt #2:
data %>%
group_by(YEAR,ID) %>%
summarise(var1 = ...
var2 = ...
gr = sprintf(YEAR,ID)
var3 = var1 - dplyr::lag(var2, order_by = gr))
)
Minimal example:
MyData <-
data.frame(YEAR = rep(seq(2010,2014),5),
ID = rep(1:5, each=5),
var1 = rnorm(n=25,mean=10,sd=3),
var2 = rnorm(n=25,mean=1,sd=1)
)
MyData %>%
group_by(YEAR,ID) %>%
summarise(var3 = var1 - dplyr::lag(var2)
)
Thanks in advance.
Do you mean group_by(ID) and effectively "order by YEAR"?
MyData %>%
group_by(ID) %>%
mutate(var3 = var1 - dplyr::lag(var2)) %>%
print(n=99)
# # A tibble: 25 x 5
# # Groups: ID [5]
# YEAR ID var1 var2 var3
# <int> <int> <dbl> <dbl> <dbl>
# 1 2010 1 11.1 1.16 NA
# 2 2011 1 13.5 -0.550 12.4
# 3 2012 1 10.2 2.11 10.7
# 4 2013 1 8.57 1.43 6.46
# 5 2014 1 12.6 1.89 11.2
# 6 2010 2 8.87 1.87 NA
# 7 2011 2 5.30 1.70 3.43
# 8 2012 2 6.81 0.956 5.11
# 9 2013 2 13.3 -0.0296 12.4
# 10 2014 2 9.98 -1.27 10.0
# 11 2010 3 8.62 0.258 NA
# 12 2011 3 12.4 2.00 12.2
# 13 2012 3 16.1 2.12 14.1
# 14 2013 3 8.48 2.83 6.37
# 15 2014 3 10.6 0.190 7.80
# 16 2010 4 12.3 0.887 NA
# 17 2011 4 10.9 1.07 10.0
# 18 2012 4 7.99 1.09 6.92
# 19 2013 4 10.1 1.95 9.03
# 20 2014 4 11.1 1.82 9.17
# 21 2010 5 15.1 1.67 NA
# 22 2011 5 10.4 0.492 8.76
# 23 2012 5 10.0 1.66 9.51
# 24 2013 5 10.6 0.567 8.91
# 25 2014 5 5.32 -0.881 4.76
(I've disregarded your summarise and used a mutate for now.)
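For what it's worth, the NAs in both attempts come from the grouping itself: each (YEAR, ID) pair identifies exactly one row, so every group has size 1 and lag() has nothing to look back at. A minimal sketch with hypothetical toy data:

```r
library(dplyr)

toy <- data.frame(YEAR = c(2010, 2011), ID = c(1, 1), var2 = c(5, 7))

# One row per (YEAR, ID) group, so lag() is NA on every row
bad <- toy %>% group_by(YEAR, ID) %>% mutate(lagged = lag(var2))

# Grouping by ID alone keeps the years together, so lag() works
good <- toy %>% group_by(ID) %>% mutate(lagged = lag(var2))
```

In `bad`, `lagged` is NA for both rows; in `good`, only the first year of the ID is NA.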

Filter a dataframe by keeping row dates of three days in a row preferably with dplyr

I would like to filter a dataframe based on its date column, keeping only rows that are part of a run of at least 3 consecutive days. I would like to do this as efficiently and quickly as possible, so a vectorized approach would be good.
I tried to inspire myself from the following link, but it didn't really go well, as it is a different problem:
How to filter rows based on difference in dates between rows in R?
I tried to do it with a for loop; I managed to put an indicator on the dates that are not consecutive, but it didn't give me the desired result, because it keeps all runs of consecutive dates even when they are shorter than 3.
tf is my dataframe:
library(lubridate) # for %m+% and days()

for (i in 2:(nrow(tf) - 1)) {
  if (tf$Date[i] != tf$Date[i + 1] %m+% days(-1)) {
    if (tf$Date[i] != tf$Date[i - 1] %m+% days(1)) {
      tf$Date[i] = as.Date(0)
    }
  }
}
The first 22 rows of my dataframe look something like this:
Date RR.x RR.y Y
1 1984-10-20 1 10.8 1984
2 1984-11-04 1 12.5 1984
3 1984-11-05 1 7.0 1984
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
7 1984-11-13 1 5.9 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
11 1986-11-17 1 14.1 1986
12 2003-10-17 1 7.8 2003
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
16 2003-11-15 1 26.4 2003
17 2003-11-20 1 10.0 2003
18 2011-10-29 1 10.0 2011
19 2011-11-04 1 11.4 2011
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
The result should be:
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
One possibility could be:
library(dplyr)

df %>%
  mutate(Date = as.Date(Date, format = "%Y-%m-%d"),
         diff = c(0, diff(Date))) %>%
  group_by(grp = cumsum(diff > 1 & lead(diff, default = last(diff)) == 1)) %>%
  filter(if_else(diff > 1 & lead(diff, default = last(diff)) == 1, 1, diff) == 1) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  select(-diff, -grp)
Date RR.x RR.y Y
<date> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984
2 1984-11-10 1 24.4 1984
3 1984-11-11 1 19 1984
4 1986-10-15 1 10.3 1986
5 1986-10-16 1 18.1 1986
6 1986-10-17 1 11.3 1986
7 2003-10-25 1 7.6 2003
8 2003-10-26 1 5 2003
9 2003-10-27 1 6.6 2003
10 2011-11-21 1 9.8 2011
11 2011-11-22 1 5.6 2011
12 2011-11-23 1 20.4 2011
Here's a base solution:
DF$Date <- as.Date(DF$Date)
rles <- rle(cumsum(c(1,diff(DF$Date)!=1)))
rles$values <- rles$lengths >= 3
DF[inverse.rle(rles), ]
Date RR.x RR.y Y
4 1984-11-09 1 22.9 1984
5 1984-11-10 1 24.4 1984
6 1984-11-11 1 19.0 1984
8 1986-10-15 1 10.3 1986
9 1986-10-16 1 18.1 1986
10 1986-10-17 1 11.3 1986
13 2003-10-25 1 7.6 2003
14 2003-10-26 1 5.0 2003
15 2003-10-27 1 6.6 2003
20 2011-11-21 1 9.8 2011
21 2011-11-22 1 5.6 2011
22 2011-11-23 1 20.4 2011
Similar approach in dplyr
DF %>%
  mutate(Date = as.Date(Date)) %>%
  add_count(IDs = cumsum(c(1, diff(Date) != 1))) %>%
  filter(n >= 3)
# A tibble: 12 x 6
Date RR.x RR.y Y IDs n
<date> <int> <dbl> <int> <dbl> <int>
1 1984-11-09 1 22.9 1984 3 3
2 1984-11-10 1 24.4 1984 3 3
3 1984-11-11 1 19 1984 3 3
4 1986-10-15 1 10.3 1986 5 3
5 1986-10-16 1 18.1 1986 5 3
6 1986-10-17 1 11.3 1986 5 3
7 2003-10-25 1 7.6 2003 8 3
8 2003-10-26 1 5 2003 8 3
9 2003-10-27 1 6.6 2003 8 3
10 2011-11-21 1 9.8 2011 13 3
11 2011-11-22 1 5.6 2011 13 3
12 2011-11-23 1 20.4 2011 13 3
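The rle()/inverse.rle() step in the base solution may be easier to follow on a bare vector. A small sketch with made-up dates: runs of consecutive days get a common id, rle() collapses those ids into run lengths, the lengths are turned into keep/drop flags, and inverse.rle() expands the flags back to one per row.

```r
dates <- as.Date(c("2020-01-01", "2020-01-02",                  # run of 2
                   "2020-01-10", "2020-01-11", "2020-01-12"))   # run of 3

run_id <- cumsum(c(1, diff(dates) != 1))  # new id whenever the gap isn't 1 day
rles <- rle(run_id)
rles$values <- rles$lengths >= 3          # replace run ids with keep/drop flags
keep <- inverse.rle(rles)                 # expand flags back to row length
dates[keep]
#> [1] "2020-01-10" "2020-01-11" "2020-01-12"
```

Only the 3-day run survives; the 2-day run is dropped, which is exactly the behaviour the question asks for.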

Filtering data based on sequential values in R

I have a data frame where I want to do some complex filtering based on sequential values.
Here is a sample of my data frame:
Name Year Difference_IP Position Position_num
1 Aaron Heilman 2011 35.1 RP 1
2 Aaron Heilman 2010 72.0 RP 1
3 Aaron Heilman 2009 72.1 RP 1
4 Aaron Heilman 2008 76.0 RP 1
5 Aaron Heilman 2007 86.0 RP 1
6 Aaron Heilman 2006 87.0 RP 1
7 Aaron Heilman 2005 24.0 RP 1
8 Aaron Heilman 2003 -62.0 SP 2
9 Aaron Laffey 2012 -71.8 SP 2
10 Aaron Laffey 2011 52.4 RP 1
11 Aaron Laffey 2010 5.2 RP 1
12 Aaron Laffey 2009 -97.0 SP 2
13 Aaron Laffey 2008 -93.2 SP 2
14 Aaron Laffey 2007 -49.1 SP 2
Team Start-IP Relief-IP
1 Diamondbacks 0.0 35.1
2 Diamondbacks 0.0 72.0
3 Cubs 0.0 72.1
4 Mets 0.0 76.0
5 Mets 0.0 86.0
6 Mets 0.0 87.0
7 Mets 42.0 66.0
8 Mets 63.2 1.2
9 Blue Jays 86.0 14.2
10 - - - 0.0 52.4
11 Indians 25.0 30.2
12 Indians 109.1 12.1
13 Indians 93.2 0.0
14 Indians 49.1 0.0
What I am trying to do is examine the Year when a player changed from SP to RP or from RP to SP. Here is the expected output:
Name Year Difference_IP Position Position_num
7 Aaron Heilman 2005 24.0 RP 1
8 Aaron Heilman 2003 -62.0 SP 2
9 Aaron Laffey 2012 -71.8 SP 2
10 Aaron Laffey 2011 52.4 RP 1
11 Aaron Laffey 2010 5.2 RP 1
12 Aaron Laffey 2009 -97.0 SP 2
Team Start-IP Relief-IP
7 Mets 42.0 66.0
8 Mets 63.2 1.2
9 Blue Jays 86.0 14.2
10 - - - 0.0 52.4
11 Indians 25.0 30.2
12 Indians 109.1 12.1
The reason Aaron Heilman's 2006-2011 rows are filtered out is that across those years his Position never changes.
I have tried a number of ways of obtaining this output, unfortunately, I am completely stumped. The closest I've been able to get is with this code:
df_1 <- df %>%
group_by(Name, Position) %>%
filter(row_number() == 1 & unique(Position == "RP") | row_number() == n() & unique(Position == "SP")) %>%
as.data.frame()
but that gets me this output, which isn't quite what I'm looking for:
Name Year Difference_IP Position Position_num
1 Aaron Heilman 2005 24.0 RP 1
2 Aaron Heilman 2003 -62.0 SP 2
3 Aaron Laffey 2012 -71.8 SP 2
4 Aaron Laffey 2010 5.2 RP 1
Team Start-IP Relief-IP
1 Mets 42.0 66.0
2 Mets 63.2 1.2
3 Blue Jays 86.0 14.2
4 Indians 25.0 30.2
The way I've been trying to think about it is: every time there is a transition from RP to SP or SP to RP from one Year to the next, those are the rows I want to keep.
Would anybody know how to do this? All help is greatly appreciated.
We can use lag and lead to create a logical vector for filtering:
library(dplyr)
df %>%
  group_by(Name) %>%
  filter(Position != lag(Position) | Position != lead(Position))
# Name Year Difference_IP Position Position_num Team `Start-IP` `Relief-IP`
# <chr> <int> <dbl> <chr> <int> <chr> <dbl> <dbl>
#1 Aaron Heilman 2005 24.0 RP 1 Mets 42.0 66.0
#2 Aaron Heilman 2003 -62.0 SP 2 Mets 63.2 1.2
#3 Aaron Laffey 2012 -71.8 SP 2 Blue Jays 86.0 14.2
#4 Aaron Laffey 2011 52.4 RP 1 - - - 0.0 52.4
#5 Aaron Laffey 2010 5.2 RP 1 Indians 25.0 30.2
#6 Aaron Laffey 2009 -97.0 SP 2 Indians 109.1 12.1
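One caveat with this filter: lag() on a group's first row and lead() on its last row return NA, so those comparisons evaluate to NA, and filter() treats NA like FALSE. Here that is harmless, because an edge row can still qualify through the other condition, but the `default` argument makes the boundary behaviour explicit. A small sketch with toy data:

```r
library(dplyr)

pos <- data.frame(Position = c("RP", "RP", "SP"))

# Without `default`, the first lag() and last lead() comparisons are NA.
# Supplying defaults makes the edge rows compare against themselves,
# so only genuine transitions (and their neighbours) pass the filter.
out <- pos %>%
  filter(Position != lag(Position, default = first(Position)) |
         Position != lead(Position, default = last(Position)))
out
```

With this toy input, the first RP row is dropped (no transition on either side) and the RP/SP transition pair is kept.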

creating index conditioned on value in other column; differences over time

I am struggling with the following problem:
The dataframe below contains the development of a value over time for various ids. What I try to get is the increase/decrease of these values relative to the value in the year when an event occurred. Several events can occur within one id, and a new event becomes the new baseline year for that id.
To make things clearer, I also add the outcome I want below
What i have
id value year event
a 100 1950 NA
a 101 1951 NA
a 102 1952 NA
a 103 1953 NA
a 104 1954 NA
a 105 1955 X
a 106 1956 NA
a 107 1957 NA
a 108 1958 NA
a 107 1959 Y
a 106 1960 NA
a 105 1961 NA
a 104.8 1962 NA
a 104.2 1963 NA
b 70 1970 NA
b 75 1971 NA
b 80 1972 NA
b 85 1973 NA
b 90 1974 NA
b 60 1975 Z
b 59 1976 NA
b 58 1977 NA
b 57 1978 NA
b 56 1979 NA
b 55 1980 W
b 54 1981 NA
b 53 1982 NA
b 52 1983 NA
b 51 1984 NA
What I am looking for
id value year event index growth
a 100 1950 NA 0
a 101 1951 NA 0
a 102 1952 NA 0
a 103 1953 NA 0
a 104 1954 NA 0
a 105 1955 X 1 1
a 106 1956 NA 2 1.00952381
a 107 1957 NA 3 1.019047619
a 108 1958 NA 4 1.028571429
a 107 1959 Y 1 1 #new baseline year
a 106 1960 NA 2 0.990654206
a 105 1961 NA 3 0.981308411
a 104.8 1962 NA 4 0.979439252
a 104.2 1963 NA 5 0.973831776
b 70 1970 NA 6
b 75 1971 NA 7
b 80 1972 NA 8
b 85 1973 NA 9
b 90 1974 NA 10
b 60 1975 Z 1 1
b 59 1976 NA 2 0.983333333
b 58 1977 NA 3 0.966666667
b 57 1978 NA 4 0.95
b 56 1979 NA 5 0.933333333
b 55 1980 W 1 1 #new baseline year
b 54 1981 NA 2 0.981818182
b 53 1982 NA 3 0.963636364
b 52 1983 NA 4 0.945454545
b 51 1984 NA 5 0.927272727
What I tried
Two related posts were quite helpful, and I managed to create differences between the years; however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is indeed the most efficient/elegant one. It seems a bit clumsy to me...
x <- ddply(x, .(id), transform, year.min=min(year[!is.na(event)])) #identifies first event year
x1 <- ddply(x[x$year>=x$year.min,], .(id), transform, index=seq_along(id)) #creates counter years following first event; prior years are removed
x1 <- x1[order(x1$id, x1$year),] #sort
x1 <- ddply(x1, .(id), transform, growth=100*(value/value[1])) #calculate difference, however, based on first event year; this is wrong.
library(Interact) # I then merge the df with the years prior to the first event, which were removed in the beginning
x$id.year <- interaction(x$id,x$year)
x1$id.year <- interaction(x1$id,x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year<x$year.min,],x1)
y <- y[order(y$id,y$year),]
Many thanks for any advice.
# Create a tag to indicate the start of each new event by id or
# when id changes
dat$tag <- with(dat, ave(as.character(event), as.character(id),
FUN=function(i) cumsum(!is.na(i))))
# Calculate the growth by id and tag
# this will also produce results for each id before an event has happened
dat$growth <- with(dat, ave(value, tag, id, FUN=function(i) i/i[1] ))
# remove growth prior to an event (this will be when tag equals zero as no
# event have occurred)
dat$growth[dat$tag==0] <- NA
Here is a solution with dplyr (na.locf() comes from the zoo package).
library(dplyr)
library(zoo) # for na.locf()

ana <- group_by(mydf, id) %>%
  do(na.locf(., na.rm = FALSE)) %>%
  mutate(value = as.numeric(value)) %>%
  group_by(id, event) %>%
  mutate(growth = value/value[1]) %>%
  mutate(index = row_number(event))
ana$growth[is.na(ana$event)] <- 0
id value year event growth index
1 a 100.0 1950 NA 0.0000000 1
2 a 101.0 1951 NA 0.0000000 2
3 a 102.0 1952 NA 0.0000000 3
4 a 103.0 1953 NA 0.0000000 4
5 a 104.0 1954 NA 0.0000000 5
6 a 105.0 1955 X 1.0000000 1
7 a 106.0 1956 X 1.0095238 2
8 a 107.0 1957 X 1.0190476 3
9 a 108.0 1958 X 1.0285714 4
10 a 107.0 1959 Y 1.0000000 1
11 a 106.0 1960 Y 0.9906542 2
12 a 105.0 1961 Y 0.9813084 3
13 a 104.8 1962 Y 0.9794393 4
14 a 104.2 1963 Y 0.9738318 5
15 b 70.0 1970 NA 0.0000000 1
16 b 75.0 1971 NA 0.0000000 2
17 b 80.0 1972 NA 0.0000000 3
18 b 85.0 1973 NA 0.0000000 4
19 b 90.0 1974 NA 0.0000000 5
20 b 60.0 1975 Z 1.0000000 1
21 b 59.0 1976 Z 0.9833333 2
22 b 58.0 1977 Z 0.9666667 3
23 b 57.0 1978 Z 0.9500000 4
24 b 56.0 1979 Z 0.9333333 5
25 b 55.0 1980 W 1.0000000 1
26 b 54.0 1981 W 0.9818182 2
27 b 53.0 1982 W 0.9636364 3
28 b 52.0 1983 W 0.9454545 4
Try:
ddf$index = 0
ddf$growth = 0
baseline = 0
start = FALSE
for (r in 1:nrow(ddf)) {
  if (is.na(ddf$event[r])) {
    if (start) {
      ddf$index[r] = ddf$index[r - 1] + 1
      ddf$growth[r] = ddf$value[r]/baseline
    } else {
      ddf$index[r] = 0
    }
  } else {
    start = TRUE
    ddf$index[r] = 1
    ddf$growth[r] = 1
    baseline = ddf$value[r]
  }
}
ddf
ddf
id value year event index growth
1 a 100.0 1950 <NA> 0 0.0000000
2 a 101.0 1951 <NA> 0 0.0000000
3 a 102.0 1952 <NA> 0 0.0000000
4 a 103.0 1953 <NA> 0 0.0000000
5 a 104.0 1954 <NA> 0 0.0000000
6 a 105.0 1955 X 1 1.0000000
7 a 106.0 1956 <NA> 2 1.0095238
8 a 107.0 1957 <NA> 3 1.0190476
9 a 108.0 1958 <NA> 4 1.0285714
10 a 107.0 1959 Y 1 1.0000000
11 a 106.0 1960 <NA> 2 0.9906542
12 a 105.0 1961 <NA> 3 0.9813084
13 a 104.8 1962 <NA> 4 0.9794393
14 a 104.2 1963 <NA> 5 0.9738318
15 b 70.0 1970 <NA> 6 0.6542056
16 b 75.0 1971 <NA> 7 0.7009346
17 b 80.0 1972 <NA> 8 0.7476636
18 b 85.0 1973 <NA> 9 0.7943925
19 b 90.0 1974 <NA> 10 0.8411215
20 b 60.0 1975 Z 1 1.0000000
21 b 59.0 1976 <NA> 2 0.9833333
22 b 58.0 1977 <NA> 3 0.9666667
23 b 57.0 1978 <NA> 4 0.9500000
24 b 56.0 1979 <NA> 5 0.9333333
25 b 55.0 1980 W 1 1.0000000
26 b 54.0 1981 <NA> 2 0.9818182
27 b 53.0 1982 <NA> 3 0.9636364
28 b 52.0 1983 <NA> 4 0.9454545
29 b 51.0 1984 <NA> 5 0.9272727
