In R: How can I check that I have consecutive years of data (to later be able to calculate growth)?

I have the dataframe (sample) below:
companyID year yearID
1 2010 1
1 2011 2
1 2012 3
1 2013 4
2 2010 1
2 2011 2
2 2016 3
2 2017 4
2 2018 5
3 2010 1
3 2011 2
3 2014 3
3 2017 4
3 2018 5
I have used a for loop in order to try and create a sequence column that starts a new number for each new sequence of numbers. I am new to R so my definitions may be a bit wrong. My for loop looks like this:
size1 <- c(1:3)
s <- 0
for (val1 in size) {
m <- max(sample[sample$companyID == val1, 4])
size2 <- c(1:m)
for (val2 in size2){
row <- sample[which(sample$companyID == val1 & sample$yearID == val2)]
m1 <- sample[sample$companyID == val1 & sample$yearID == val2, 2]
m2 <- sample[sample$CompanyID == val1 & sample$yearID == (val2-1), 2]
if(val2>1 && m1-m2 > 1) {
sample$sequence[row] s = s+1}
else {s = s}
}
}
Where m is the max value of the yearID per companyID, row is to identify that the value should be entered on the row where companyID = val1 and yearID = val2, m1 is from the year variable and is the latter year, whereas m2 is the former year. What I have tried to do is to change the sequence every time m1-m2 > 1 (when val2 > 1 also).
Desired outcome:
companyID year yearID sequence
1 2010 1 1
1 2011 2 1
1 2012 3 1
1 2013 4 1
2 2010 1 2
2 2011 2 2
2 2016 3 3
2 2017 4 3
2 2018 5 3
3 2010 1 4
3 2011 2 4
3 2014 3 5
3 2017 4 6
3 2018 5 6
Super appreciative if anyone can help!!

This is a good question! The steps are:
First, group_by companyID.
Calculate the difference between each row and the previous row of the year column with lag, to identify whether the years are consecutive.
Then group_by(companyID, yearID).
mutate a helper column sequence1 that puts a 1 on each row that starts a new run of consecutive years within the group.
ungroup and increase the sequence number each time a 1 occurs in sequence1.
Remove the helper columns sequence1 and deltaLag1.
library(tidyverse)
df1 <- df %>%
group_by(companyID) %>%
mutate(deltaLag1 = year - lag(year, 1)) %>%
group_by(companyID, yearID) %>%
mutate(sequence1 = case_when(is.na(deltaLag1) | deltaLag1 > 1 ~ 1,
TRUE ~ 2)) %>%
ungroup() %>%
mutate(sequence = cumsum(sequence1==1)) %>%
select(-deltaLag1, -sequence1)
Data:
df <- tribble(
~companyID, ~year, ~yearID,
1, 2010, 1,
1, 2011, 2,
1, 2012, 3,
1, 2013, 4,
2, 2010, 1,
2, 2011, 2,
2, 2016, 3,
2, 2017, 4,
2, 2018, 5,
3, 2010, 1,
3, 2011, 2,
3, 2014, 3,
3, 2017, 4,
3, 2018, 5)
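For readers who prefer to avoid the tidyverse, the same run-numbering idea can be sketched in base R. This is only a minimal sketch, assuming the rows are already sorted by companyID and year; the helper vector gap is a name introduced here for illustration.

```r
# Same sample data as above, without yearID (it is not needed here)
df <- data.frame(
  companyID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2016, 2017, 2018,
           2010, 2011, 2014, 2017, 2018)
)

# Difference to the previous year within each company (NA on a company's first row)
gap <- ave(df$year, df$companyID, FUN = function(y) c(NA, diff(y)))

# A new run starts on a company's first row or whenever the gap exceeds one year;
# the running count of run starts is the sequence number
df$sequence <- cumsum(is.na(gap) | gap > 1)
df
```

Because the count is not reset per company, the numbering continues across companies, as in the desired outcome of the question.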

It's not clear whether you want the exact desired outcome or just a check that the years are consecutive within each companyID.
According to your title message:
sample <- read.table(header = TRUE, text = "
companyID year yearID
1 2010 1
1 2011 2
1 2012 3
1 2013 4
2 2010 1
2 2011 2
2 2016 3
2 2017 4
2 2018 5
3 2010 1
3 2011 2
3 2014 3
3 2017 4
3 2018 5
")
library(data.table)
sample <- setDT(sample)
sample[ , diff_year := year - shift(year), by = companyID]
sample <- setDF(sample)
sample
#> companyID year yearID diff_year
#> 1 1 2010 1 NA
#> 2 1 2011 2 1
#> 3 1 2012 3 1
#> 4 1 2013 4 1
#> 5 2 2010 1 NA
#> 6 2 2011 2 1
#> 7 2 2016 3 5
#> 8 2 2017 4 1
#> 9 2 2018 5 1
#> 10 3 2010 1 NA
#> 11 3 2011 2 1
#> 12 3 2014 3 3
#> 13 3 2017 4 3
#> 14 3 2018 5 1
# Created on 2021-03-13 by the reprex package (v1.0.0.9002)
Related to Calculate difference between values in consecutive rows by group
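If a plain yes/no answer per company is enough, the year differences can be reduced to a single flag. The following is a minimal base R sketch, not part of the answer above; the names panel and has_gap are made up for illustration.

```r
# Same sample data, under a hypothetical name to avoid masking base::sample
panel <- data.frame(
  companyID = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  year = c(2010, 2011, 2012, 2013, 2010, 2011, 2016, 2017, 2018,
           2010, 2011, 2014, 2017, 2018)
)

# TRUE for a company if any step between its sorted years is not exactly one year
has_gap <- tapply(panel$year, panel$companyID,
                  function(y) any(diff(sort(y)) != 1))
has_gap
```

Here only company 1 has an unbroken series; companies 2 and 3 are flagged as having gaps.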

Related

creating a dummy variable with consecutive cases

I have a similar problem like this one:
How can I create a dummy variable over consecutive values by group id?
The difference is: as soon as I have Dummy = 1, I want the dummy to be 1 for the rest of my group (ID), since year is in descending order. So for example, df1:
df1 <-data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,0 ,0,1,0,1, 1,0,0,0))
should become:
df2 <- data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,1 ,0,1,1, 1, 1,1,1,1))
I've tried something like that (and some others) but that failed:
df2<- df1%>% group_by(ID) %>% arrange(ID , year) %>%
mutate(treated = case_when(Dummy == 1 ~ 1,
lag(Dummy, n= unique(n()), default = 0) == 1 ~ 1))
If your input data is as below then we can just use cummax():
library(dplyr)
df1 <-data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,0 ,0,1,0,1, 1,0,0,0))
df1
#> ID year value Dummy
#> 1 1 2014 14.144996 0
#> 2 1 2015 20.621603 0
#> 3 1 2016 8.325170 1
#> 4 1 2017 21.725028 0
#> 5 2 2014 11.894383 0
#> 6 2 2015 13.445744 1
#> 7 2 2016 3.332338 0
#> 8 2 2017 2.984941 1
#> 9 3 2014 17.551266 1
#> 10 3 2015 5.250556 0
#> 11 3 2016 11.062577 0
#> 12 3 2017 20.169439 0
df1 %>%
group_by(ID) %>%
mutate(Dummy = cummax(Dummy))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID year value Dummy
#> <int> <dbl> <dbl> <dbl>
#> 1 1 2014 14.1 0
#> 2 1 2015 20.6 0
#> 3 1 2016 8.33 1
#> 4 1 2017 21.7 1
#> 5 2 2014 11.9 0
#> 6 2 2015 13.4 1
#> 7 2 2016 3.33 1
#> 8 2 2017 2.98 1
#> 9 3 2014 17.6 1
#> 10 3 2015 5.25 1
#> 11 3 2016 11.1 1
#> 12 3 2017 20.2 1
Created on 2022-10-14 by the reprex package (v2.0.1)
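cummax() returns the running maximum, so once a 1 has appeared every later value in the group is at least 1. A minimal base R sketch of the same idea, assuming the rows are already in the order in which the dummy should carry forward (the value column is left out because it plays no role):

```r
df1 <- data.frame(ID = rep(1:3, each = 4),
                  year = rep(c(2014, 2015, 2016, 2017), 3),
                  Dummy = c(0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0))

# Running maximum of Dummy within each ID
df1$Dummy <- ave(df1$Dummy, df1$ID, FUN = cummax)
df1
```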

Min() ignoring zeros and NA with dplyr

I have a df that looks like this:
group year
1 2020
1 NA
1 0
2 2021
2 2006
3 NA
3 0
3 2010
3 2010
4 2006
4 2005
4 2010
And I want to group by group and then find the minimum year while ignoring NAs and 0 entries:
group year minYr
1 2020 2020
1 NA 2020
1 0 2020
2 2021 2006
2 2006 2006
3 NA 2010
3 0 2010
3 2010 2010
3 2010 2010
4 2006 2005
4 2005 2005
4 2010 2005
My initial approach
df <- df %>% group_by(group) %>% mutate (minYr = min(year, na.rm = TRUE)) caused a runtime error and didn't take care of the zeros.
Does anyone have a better way of doing this?
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(minYr = min(year[year > 0], na.rm = TRUE)) %>%
# mutate(minYr = min(year[year > 0 & !is.na(year)])) %>% # equivalent
ungroup()
# A tibble: 12 × 3
group year minYr
<dbl> <dbl> <dbl>
1 1 2020 2020
2 1 NA 2020
3 1 0 2020
4 2 2021 2006
5 2 2006 2006
6 3 NA 2010
7 3 0 2010
8 3 2010 2010
9 3 2010 2010
10 4 2006 2005
11 4 2005 2005
12 4 2010 2005
df1 <- structure(list(group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
year = c(2020, NA, 0, 2021, 2006, NA, 0, 2010, 2010, 2006, 2005, 2010)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -12L))
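One edge case the pipeline above does not cover: if a group contains only NA or 0 entries, min() on the empty selection returns Inf with a warning. The sketch below is one possible way to handle that in base R; the helper min_positive is a name introduced here and not part of any package.

```r
df1 <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4),
  year = c(2020, NA, 0, 2021, 2006, NA, 0, 2010, 2010, 2006, 2005, 2010)
)

# Hypothetical helper: smallest positive, non-missing value, or NA if there is none
min_positive <- function(y) {
  valid <- y[!is.na(y) & y > 0]
  if (length(valid) == 0) NA_real_ else min(valid)
}

# ave() recycles the single minimum over every row of the group
df1$minYr <- ave(df1$year, df1$group, FUN = min_positive)
df1
```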

Remove duplicate year rows by groups [duplicate]

I have a data.table of the following form:-
data <- data.table(group = rep(1:3, each = 4),
year = c(2011:2014, rep(2011:2012, each = 2),
2012, 2012, 2013, 2014), value = 1:12)
This is only an extract of my data.
So group 2 has 2 values for 2011 and 2012. And group 3 has 2 values for the year 2012. I want to just keep the first row for all the duplicated years.
So, in effect, my data.table will become the following:-
data <- data.table(group = c(rep(1, 4), rep(2, 2), rep(3, 3)),
year = c(2011:2014, 2011, 2012, 2012, 2013, 2014),
value = c(1:5, 7, 9, 11, 12))
How can I achieve this? Thanks in advance.
Try this data.table option with duplicated
> data[!duplicated(cbind(group, year))]
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
For data.tables you can pass by argument to unique -
library(data.table)
unique(data, by = c('group', 'year'))
# group year value
#1: 1 2011 1
#2: 1 2012 2
#3: 1 2013 3
#4: 1 2014 4
#5: 2 2011 5
#6: 2 2012 7
#7: 3 2012 9
#8: 3 2013 11
#9: 3 2014 12
Using base R
subset(data, !duplicated(cbind(group, year)))
One solution would be to use distinct from dplyr like so:
library(dplyr)
data %>%
distinct(group, year, .keep_all = TRUE)
Output:
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
This should do the trick:
library(tidyverse)
data %>%
group_by(group, year) %>%
filter(row_number() == 1) %>% # keep only the first row of each (group, year) pair
ungroup()
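All of the approaches above keep the first row of each (group, year) pair. As a minimal sketch on a plain data.frame (no data.table or dplyr needed), duplicated() can also keep the last row instead through its fromLast argument; the names first_rows and last_rows are made up for illustration.

```r
data <- data.frame(group = rep(1:3, each = 4),
                   year = c(2011:2014, rep(2011:2012, each = 2),
                            2012, 2012, 2013, 2014),
                   value = 1:12)

# Keep the first row of each (group, year) pair
first_rows <- data[!duplicated(data[c("group", "year")]), ]

# Keep the last row of each pair instead
last_rows <- data[!duplicated(data[c("group", "year")], fromLast = TRUE), ]

first_rows$value
last_rows$value
```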

How to modify a column based on a condition in a time series?

I have data on animal territories by month (1 = January etc.) for multiple individuals:
year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2
I want to add a column that has a 1 if two consecutive months exceed some value e.g. 10. One wrinkle is that my data can run over one year for a single id.
year month terr_size id new_col
2018 1 20 1 1
2018 2 30 1 1
2019 1 5 1 0
2019 2 10 1 0
2018 3 20 2 0
2018 5 25 2 1
2018 6 20 2 1
2018 7 20 2 1
2019 1 10 2 0
2019 2 5 2 0
2019 3 20 2 1
2019 4 30 2 1
This can be expressed compactly using a single left join in a single SQL statement.
Using the input shown in the Note at the end, perform a left self join using the indicated on condition and set new_col to 1 if for any original row both it and any matched rows have terr_size greater than or equal to 10. If there is no matched row then use coalesce to set new_col to 0.
library(sqldf)
sqldf("
select a.*,
coalesce(max(a.terr_size >= 10 and b.terr_size >= 10), 0)
new_col
from DF a
left join DF b on
a.id = b.id and
(12 * b.year + b.month = 12 * a.year + a.month + 1 or
12 * b.year + b.month = 12 * a.year + a.month - 1)
group by a.rowid")
giving:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 0
6 2018 5 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
Note
The input and output shown in the question are not consistent so to be clear we assumed this:
Lines <- "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 "
DF <- read.table(text = Lines, header = TRUE)
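The key trick in the join condition is the month index 12 * year + month, which makes December and the following January differ by exactly 1. The same logic can be sketched without SQL in base R. This is a minimal sketch under the same assumption as the SQL (threshold of at least 10); idx, key, big, prev_big and next_big are names introduced here.

```r
DF <- data.frame(
  year = c(2018, 2018, 2019, 2019, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019),
  month = c(1, 2, 1, 2, 3, 5, 6, 7, 1, 2, 3, 4),
  terr_size = c(20, 30, 5, 10, 20, 25, 20, 20, 10, 5, 20, 30),
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
)

idx <- 12 * DF$year + DF$month      # consecutive months differ by exactly 1
key <- paste(DF$id, idx)            # one key per individual and month
big <- DF$terr_size >= 10

# Look up whether the same individual's previous / next month is also big;
# match() gives NA when that month is not in the data
prev_big <- big[match(paste(DF$id, idx - 1), key)]
next_big <- big[match(paste(DF$id, idx + 1), key)]

# %in% TRUE turns a missing neighbour (NA) into FALSE
DF$new_col <- as.integer(big & (prev_big %in% TRUE | next_big %in% TRUE))
DF
```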
Your data:
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 2 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2 ", header = TRUE)
The idea is to create a date variable first.
Then you create two copies of your data by changing the dates one month ahead and one month back.
Memory should not be a problem for this kind of operation.
Each copy holds only three columns (id, the shifted date and the territory size), not the whole dataframe.
Then you can join the new columns to the original dataframe.
You then apply the condition you needed.
I created a magic_number variable for that.
At the end, I selected only the original columns plus the one you needed.
library(dplyr)
library(lubridate)
# the threshold number
magic_number <- 10
# create date variable
df <- df %>% mutate(date = make_date(year, month))
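# [p]revious month: move each row one month forward so it lines up with the following month
dfp <- df %>% transmute(id, date = date + months(1), terr_size_p = terr_size)
# [n]ext month: move each row one month back so it lines up with the preceding month
dfn <- df %>% transmute(id, date = date - months(1), terr_size_n = terr_size)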
# join by id and date
df <- df %>%
left_join(dfp, by = c("id", "date")) %>%
left_join(dfn, by = c("id", "date"))
# new_col is 1 when terr_size is over the threshold and so is the previous or the next month;
# a neighbouring month that is missing (NA after the join) counts as not over the threshold
df <- df %>%
mutate(new_col = as.numeric(terr_size > magic_number &
(coalesce(terr_size_p > magic_number, FALSE) |
coalesce(terr_size_n > magic_number, FALSE))))
# remove variables if there is no more use for them
df <- df %>% select(-terr_size_p, -terr_size_n, -date)
df
Result:
year month terr_size id new_col
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 2 25 2 1
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 1
12 2019 4 30 2 1
(The result is not exactly the same because your initial data and expected results do not correspond at row 5)
This solution handles the December-January issue we talked about in the comments.
I'm not exactly sure what the rule is, because your output doesn't follow the rule you describe (e.g. lines 1 and 5 have no earlier month to compare with, yet you put a 1; line 6 is separated by 2 months; you put a 1 in line 11 although the month before it was <10).
I assumed the most complicated scenario, so you can remove the extra conditions you don't need:
A row gets a 1 if the territory size stayed >10 for two consecutive months including this one (or if it is the individual's first recorded month and is >10).
df <- read.table(text = "year month terr_size id
2018 1 20 1
2018 2 30 1
2019 1 5 1
2019 2 10 1
2018 3 20 2
2018 5 25 2
2018 6 20 2
2018 7 20 2
2019 1 10 2
2019 2 5 2
2019 3 20 2
2019 4 30 2", header = TRUE)
Using dplyr and lag:
library(dplyr)
df %>% arrange(id, year,month) %>%
dplyr::mutate(newcol=case_when(is.na(lag(month))==TRUE & terr_size>10~1,
lag(id)!=id & terr_size>10~1,
id==lag(id) & year-lag(year)==0 & month-lag(month)==1 & terr_size>10 & lag(terr_size)>10~1,
id==lag(id) & year-lag(year)==1 & lag(month)-month==11 & terr_size>10 & lag(terr_size)>10~1,
TRUE~0))
output:
year month terr_size id newcol
1 2018 1 20 1 1
2 2018 2 30 1 1
3 2019 1 5 1 0
4 2019 2 10 1 0
5 2018 3 20 2 1
6 2018 5 25 2 0
7 2018 6 20 2 1
8 2018 7 20 2 1
9 2019 1 10 2 0
10 2019 2 5 2 0
11 2019 3 20 2 0
12 2019 4 30 2 1

how do I identify rows where an element appears for the first time?

I have the following data frame of student records. What I want is to identify students who joined a certain program in 2014 for the first time when they were in 9th grade.
names.first<-c('a','a','b','b','c','d')
names.last<-c('c','c','z','z','f','h')
year<-c(2014,2013,2014,2015,2015,2014)
grade<-c(9,8,9,10,10,10)
df<-data.frame(names.first,names.last,year,grade)
df
To do this, I have used the following statement to say that I want students where the program year==2014 and their grade ==9.
df$first.cohort<-ifelse(df$year==2014 & df$grade==9,1,0)
df
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can see, this would include students who didn't enter the program in 2014, such as student a, who started in 2013. How do I create an ifelse statement that only captures students who are in 9th grade and started the program in 2014 for the first time, so that the df looks like
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
We can use first after arranging by 'names' and 'year' to create the logical expression
library(dplyr)
df %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
For keeping the same order as in the input dataset, we can create a sequence column first and then do the arrange on the column after the mutate
df %>%
mutate(rn = row_number()) %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
Or use the same logic with data.table, which has the additional advantage of keeping the same order as in the input dataset
library(data.table)
setDT(df)[order(names, year), first.cohort := as.integer(grade == 9 &
first(year) == 2014), names]
Update
With the new example in the OP's post, we group by both 'names' columns
df %>%
arrange(names.first, names.last, year) %>%
group_by(names.first, names.last) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
Using dplyr
library(dplyr)
df%>%group_by(names)%>%dplyr::mutate(Fc=as.numeric((year==2014&grade==9)&(min(year)==2014)))
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0
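Both answers rest on the same idea: compare each row with the student's earliest year in the data. A minimal base R sketch of that idea on the two-column name data from the question, assuming a student is identified by first and last name together; student and first_year are names introduced here.

```r
df <- data.frame(names.first = c('a', 'a', 'b', 'b', 'c', 'd'),
                 names.last = c('c', 'c', 'z', 'z', 'f', 'h'),
                 year = c(2014, 2013, 2014, 2015, 2015, 2014),
                 grade = c(9, 8, 9, 10, 10, 10))

# One key per student (assumption: first + last name identifies a student)
student <- paste(df$names.first, df$names.last)

# Earliest year in which each student appears, repeated on every row of that student
first_year <- ave(df$year, student, FUN = min)

# 1 only for a 9th-grade row in 2014 whose student has no earlier record
df$first.cohort <- as.integer(df$year == 2014 & df$grade == 9 & first_year == 2014)
df
```

This keeps the input row order, so no sorting or re-sorting step is needed.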
