How do I identify rows where an element appears for the first time? - r

I have the following data frame of student records. What I want is to identify students who joined a certain program for the first time in 2014, when they were in 9th grade.
To do this, I used the following statement to select rows where the program year == 2014 and the grade == 9.
df$first.cohort<-ifelse(df$year==2014 & df$grade==9,1,0)
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can see, this would include students who didn't enter the program in 2014, such as student a, who started in 2013. How do I create an ifelse statement that only captures students who are in 9th grade and started the program in 2014 for the first time, so that the df looks like
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0

We can use first after arranging by 'names' and 'year' to create the logical expression
df %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
For keeping the same order as in the input dataset, we can create a sequence column first and then do the arrange on the column after the mutate
df %>%
mutate(rn = row_number()) %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
Or use the same logic with data.table, which has the additional advantage of keeping the same order as in the input dataset
setDT(df)[order(names, year), first.cohort := as.integer(grade == 9 &
first(year) == 2014), names]
With the new example in the OP's post, we group by both 'names' columns
df %>%
arrange(names.first, names.last, year) %>%
group_by(names.first, names.last) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
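For comparison, here is a minimal base R sketch of the same idea. It swaps first(year) after arrange for min(year) per student, computed with ave(), so no sorting is needed and the input order is kept. The data frame is typed in from the question's table, and the helper names key and first_year are only illustrative.

```r
# Sample data copied from the question's table
df <- data.frame(
  names.first = c("a", "a", "b", "b", "c", "d"),
  names.last  = c("c", "c", "z", "z", "f", "h"),
  year        = c(2014, 2013, 2014, 2015, 2015, 2014),
  grade       = c(9, 8, 9, 10, 10, 10)
)

# One key per student, so that ave() only sees groups that actually exist
key <- paste(df$names.first, df$names.last)

# Earliest program year per student, aligned with the original rows
first_year <- ave(df$year, key, FUN = min)

# Flag 9th-grade rows of students whose earliest year is 2014
df$first.cohort <- as.integer(df$grade == 9 & first_year == 2014)
df$first.cohort
# 0 0 1 0 0 0
```

Student a is excluded because the earliest year on record for that student is 2013, even though the 2014 row is in 9th grade.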

Using dplyr
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0


Count the occurrences of accidents until the next accident

I have the following data frame and I would like to create the "OUTPUT_COLUMN".
Explanation of columns:
ID is the identification number of the policy
ID_REG_YEAR is the identification number per Registration Year
CALENDAR_YEAR is the year in which the policy has exposure
NUMBER_OF_RENEWALS is the number of times the policy has been renewed
ACCIDENT indicates whether an accident occurred
Basically, if NUMBER_OF_RENEWALS = 0 then OUTPUT_COLUMN = 100. Any row with no earlier accident should also contain 100 (e.g. rows 13, 16, 17). If an accident occurred, I would like to count the number of renewals until the next accident.
ID ID_REG_YEAR CALENDAR_YEAR NUMBER_OF_RENEWALS ACCIDENT OUTPUT_COLUMN
1 A A_2015 2015 0 YES 100
2 A A_2015 2016 0 YES 100
3 A A_2016 2016 1 YES 0
4 A A_2016 2017 1 YES 0
5 A A_2017 2017 2 NO 1
6 A A_2017 2018 2 NO 1
7 A A_2018 2018 3 NO 2
8 A A_2018 2019 3 NO 2
9 A A_2019 2019 4 YES 0
10 A A_2019 2020 4 YES 0
11 B B_2015 2015 0 NO 100
12 B B_2015 2016 0 NO 100
13 B B_2016 2016 1 NO 100
14 C C_2013 2013 0 NO 100
15 C C_2013 2014 0 NO 100
16 C C_2014 2014 1 NO 100
17 C C_2014 2015 1 NO 100
18 C C_2015 2015 2 YES 0
19 C C_2015 2016 2 YES 0
20 C C_2016 2016 3 NO 1
21 C C_2016 2017 3 NO 1
22 C C_2017 2017 4 NO 2
23 C C_2017 2018 4 NO 2
24 C C_2018 2018 5 YES 0
25 C C_2018 2019 5 YES 0
26 C C_2019 2019 6 NO 1
27 C C_2019 2020 6 NO 1
28 C C_2020 2020 7 NO 2
Here is a dplyr solution. First, obtain a separate column for the registration year, which will be used to calculate renewals since prior accident (assumes this is years since last accident). Then, create a column to contain the year of the last accident after grouping by ID. Using fill this value will be propagated. The final outcome column will be set as either 100 (if no prior accident, or NUMBER_OF_RENEWALS is zero) vs. the registration year - last accident year.
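df %>%
separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
group_by(ID) %>%
mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
fill(LAST_ACCIDENT, .direction = "down") %>%
# 100 if there is no prior accident or no renewal yet; otherwise years since the last accident
# (NEW_OUTPUT is a placeholder column name)
mutate(NEW_OUTPUT = ifelse(is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0, 100, REG_YEAR - LAST_ACCIDENT))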
<chr> <chr> <int> <int> <int> <chr> <int> <int> <dbl>
1 A A 2015 2015 0 YES 100 2015 100
2 A A 2015 2016 0 YES 100 2015 100
3 A A 2016 2016 1 YES 0 2016 0
4 A A 2016 2017 1 YES 0 2016 0
5 A A 2017 2017 2 NO 1 2016 1
6 A A 2017 2018 2 NO 1 2016 1
7 A A 2018 2018 3 NO 2 2016 2
8 A A 2018 2019 3 NO 2 2016 2
9 A A 2019 2019 4 YES 0 2019 0
10 A A 2019 2020 4 YES 0 2019 0
# … with 18 more rows
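The fill step can be tried in isolation on a small example. The sketch below assumes a hypothetical one-policy data frame (toy) with one row per registration year, and YEARS_SINCE is an illustrative column name, not one from the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical mini data set: one policy, one row per registration year
toy <- data.frame(
  ID                 = "C",
  REG_YEAR           = 2013:2018,
  NUMBER_OF_RENEWALS = 0:5,
  ACCIDENT           = c("NO", "NO", "YES", "NO", "NO", "YES")
)

res <- toy %>%
  group_by(ID) %>%
  # Record the year only on accident rows, NA elsewhere
  mutate(LAST_ACCIDENT = ifelse(ACCIDENT == "YES", REG_YEAR, NA_integer_)) %>%
  # Carry the most recent accident year forward within each policy
  fill(LAST_ACCIDENT, .direction = "down") %>%
  # 100 when there is no earlier accident or no renewal yet, else years since it
  mutate(YEARS_SINCE = ifelse(is.na(LAST_ACCIDENT) | NUMBER_OF_RENEWALS == 0,
                              100, REG_YEAR - LAST_ACCIDENT)) %>%
  ungroup()

res$YEARS_SINCE
# 100 100 0 1 2 0
```

Rows before the first accident keep an NA in LAST_ACCIDENT after fill, which is what routes them to 100.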
Note: If you want to use your policy number (NUMBER_OF_RENEWALS) and not go by the year, you can do something similar. Instead of adding a column with the last accident year, you can include the last accident policy. Then, your output column could reflect the policy number instead of year (to consider the possibility that one or more years could be skipped).
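df %>%
separate(ID_REG_YEAR, into = c("ID_REG", "REG_YEAR"), convert = TRUE) %>%
group_by(ID) %>%
# Policy number (renewal count) at the most recent accident
mutate(LAST_ACCIDENT_POLICY = ifelse(ACCIDENT == "YES", NUMBER_OF_RENEWALS, NA_integer_)) %>%
fill(LAST_ACCIDENT_POLICY, .direction = "down") %>%
# NEW_OUTPUT is a placeholder column name
mutate(NEW_OUTPUT = ifelse(is.na(LAST_ACCIDENT_POLICY) | NUMBER_OF_RENEWALS == 0, 100, NUMBER_OF_RENEWALS - LAST_ACCIDENT_POLICY))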

Calculating cumulative sum for multiple columns in R

R newb, I'm trying to calculate the cumulative sum grouped by year, month, group and subgroup, also having multiple columns to calculate.
Sample of the data:
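df <- data.frame("Year"=2020, "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"), "Group"=c("A","A","A","B","A","B","B","B"),
"SubGroup"=c("a","a","b","b","a","b","a","b"), "V1"=c(10,10,20,20,50,50,10,10), "V2"=c(0,1,2,2,0,5,1,1))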
Year Month Group SubGroup V1 V2
1 2020 Jan A a 10 0
2 2020 Jan A a 10 1
3 2020 Jan A b 20 2
4 2020 Jan B b 20 2
5 2020 Feb A a 50 0
6 2020 Feb B b 50 5
7 2020 Feb B a 10 1
8 2020 Feb B b 10 1
Resulting Table wanted:
Year Month Group SubGroup V1 V2
1 2020 Jan A a 20 1
2 2020 Feb A a 70 1
3 2020 Jan A b 20 2
4 2020 Feb A b 20 2
5 2020 Jan B a 0 0
6 2020 Feb B a 10 1
7 2020 Jan B b 20 2
8 2020 Feb B b 80 8
From the sample table: in Jan 2020, the sum for Group 'A', SubGroup 'a' was 10 + 10 = 20. In Feb 2020 the value was 50, so the cumulative sum is 20 (from Jan) + 50 = 70, and so on.
If there is no value, it should be treated as 0.
I've tried a few approaches, but none got even close to the output I need. I would really appreciate some tips for this problem.
This is a simple group_by/mutate problem. The columns V1, V2 are chosen with across and cumsum applied to them.
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))
df %>%
group_by(Year, Group, SubGroup) %>%
mutate(across(V1:V2, ~cumsum(.x))) %>%
ungroup() %>%
arrange(Year, Group, SubGroup, Month)
## A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <chr> <fct> <chr> <chr> <dbl> <dbl>
#1 2020 Jan A a 10 0
#2 2020 Jan A a 20 1
#3 2020 Feb A a 70 1
#4 2020 Jan A b 20 2
#5 2020 Feb B a 10 1
#6 2020 Jan B b 20 2
#7 2020 Feb B b 70 7
#8 2020 Feb B b 80 8
If I understand what you are doing, you're taking the sum for each month, then doing the cumulative sums across the months. This is usually pretty easy in dplyr.
df %>%
group_by(Year, Month, Group, SubGroup) %>%
# Monthly sums per Group/SubGroup
summarize(
V1_sum = sum(V1),
V2_sum = sum(V2)
) %>%
group_by(Year, Group, SubGroup) %>%
# Running totals across the months
mutate(
V1_cumsum = cumsum(V1_sum),
V2_cumsum = cumsum(V2_sum)
)
# A tibble: 6 x 8
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1_sum V2_sum V1_cumsum V2_cumsum
# <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 2020 Feb A a 50 0 50 0
# 2 2020 Feb B a 10 1 10 1
# 3 2020 Feb B b 60 6 60 6
# 4 2020 Jan A a 20 1 70 1
# 5 2020 Jan A b 20 2 20 2
# 6 2020 Jan B b 20 2 80 8
But you'll notice that the monthly cumulative sums are backwards (i.e. January comes after February), because Month is a character column and the grouped output is sorted alphabetically. Also, you don't see the empty combinations because dplyr doesn't fill them in.
To fix the order of the months, you can either make your months numeric (convert to dates) or turn them into factors. You can add back 'missing' combinations of the grouping variables by using aggregate in base R instead of dplyr::summarize. aggregate includes all combinations of the grouping factors. aggregate converts the missing values to NA, but you can replace the NA with 0 with tidyr::replace_na, for example.
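df <- data.frame("Year"=2020, "Month"=c("Jan","Jan","Jan","Jan","Feb","Feb","Feb","Feb"), "Group"=c("A","A","A","B","A","B","B","B"),
"SubGroup"=c("a","a","b","b","a","b","a","b"), "V1"=c(10,10,20,20,50,50,10,10), "V2"=c(0,1,2,2,0,5,1,1))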
df$Month <- factor(df$Month, levels = c("Jan", "Feb"), ordered = TRUE)
# Get monthly sums
df1 <- with(df, aggregate(
list(V1_sum = V1, V2_sum = V2),
list(Year = Year, Month = Month, Group = Group, SubGroup = SubGroup),
FUN = sum, drop = FALSE
))
df1 <- df1 %>%
# Replace NA with 0
mutate(
V1_sum = replace_na(V1_sum, 0),
V2_sum = replace_na(V2_sum, 0)
) %>%
# Get cumulative sum across months
group_by(Year, Group, SubGroup) %>%
mutate(V1cumsum = cumsum(V1_sum),
V2cumsum = cumsum(V2_sum)) %>%
ungroup() %>%
select(Year, Month, Group, SubGroup, V1 = V1cumsum, V2 = V2cumsum)
This gives the same result as your example:
# # A tibble: 8 x 6
# Year Month Group SubGroup V1 V2
# <dbl> <ord> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 20 1
# 2 2020 Feb A a 70 1
# 3 2020 Jan B a 0 0
# 4 2020 Feb B a 10 1
# 5 2020 Jan A b 20 2
# 6 2020 Feb A b 20 2
# 7 2020 Jan B b 20 2
# 8 2020 Feb B b 80 8
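As an alternative to aggregate(..., drop = FALSE), the missing combinations can also be added with tidyr::complete. This is a sketch of a swapped-in technique, not what the answer above uses; it assumes dplyr 1.0 or later (for the .groups argument) and types the sample data in from the question's table.

```r
library(dplyr)
library(tidyr)

# Sample data typed in from the question's table
df <- data.frame(
  Year     = 2020,
  Month    = c("Jan", "Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Feb"),
  Group    = c("A", "A", "A", "B", "A", "B", "B", "B"),
  SubGroup = c("a", "a", "b", "b", "a", "b", "a", "b"),
  V1       = c(10, 10, 20, 20, 50, 50, 10, 10),
  V2       = c(0, 1, 2, 2, 0, 5, 1, 1)
)

# Factor levels make Jan sort before Feb
df$Month <- factor(df$Month, levels = c("Jan", "Feb"))

res <- df %>%
  # Monthly sums per Group/SubGroup
  group_by(Year, Month, Group, SubGroup) %>%
  summarise(V1 = sum(V1), V2 = sum(V2), .groups = "drop") %>%
  # Add the combinations that never occur, with 0 as their value
  complete(Year, Month, Group, SubGroup, fill = list(V1 = 0, V2 = 0)) %>%
  arrange(Year, Group, SubGroup, Month) %>%
  # Running totals across months within each Group/SubGroup
  group_by(Year, Group, SubGroup) %>%
  mutate(V1 = cumsum(V1), V2 = cumsum(V2)) %>%
  ungroup()

res$V1
# 20 70 20 20 0 10 20 80
```

The rows come out in the order of the table the question asks for (A/a, A/b, B/a, B/b, each with Jan before Feb).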
library(zoo) # provides as.yearmon()
df %>%
arrange(as.yearmon(paste0(Year, '-', Month), '%Y-%b'), Group, SubGroup) %>%
group_by(Year, Group, SubGroup) %>%
mutate(
V1 = cumsum(V1),
V2 = cumsum(V2)
) %>%
arrange(Year, Group, SubGroup, as.yearmon(paste0(Year, '-', Month), '%Y-%b')) #for desired output ordering
# A tibble: 8 x 6
# Groups: Year, Group, SubGroup [4]
# Year Month Group SubGroup V1 V2
# <chr> <chr> <chr> <chr> <dbl> <dbl>
# 1 2020 Jan A a 10 0
# 2 2020 Jan A a 20 1
# 3 2020 Feb A a 70 1
# 4 2020 Jan A b 20 2
# 5 2020 Feb B a 10 1
# 6 2020 Jan B b 20 2
# 7 2020 Feb B b 70 7
# 8 2020 Feb B b 80 8

r conditional subtract number

I am trying to do the following logic to create 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if a firm i at t appears in t+1, then subtract its count at t+1 from the sum_of_year at t+1,
if a firm i does not appear in t+1, then just put sum_of_year at t+1 as shown in the sample.
I am having difficulties in creating this conditional code.
How can I do this in a generalized version?
Thank you for your help.
One way is to use dplyr with the help of tidyr::complete. We complete the missing combinations of year and firm and fill count with 0. For each year, we subtract each firm's count from the sum of count for that year, and finally, for each firm, we take that value from the next year using lead.
df %>%
tidyr::complete(year, firm, fill = list(count = 0)) %>%
group_by(year) %>%
mutate(n = sum(count) - count) %>%
group_by(firm) %>%
mutate(subtract = lead(n)) %>%
filter(count != 0) %>%
select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA
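The same rule can also be written as a direct lookup, without completing the year/firm grid first. This is a minimal base R sketch, assuming one row per firm and year; the data frame is typed in from the question's table and the helper names (tot, next_tot, next_cnt) are illustrative.

```r
# Sample data typed in from the question's table
df <- data.frame(
  year  = c(1986, 1986, 1987, 1987, 1987, 1988, 1988),
  firm  = c("A", "B", "A", "C", "D", "C", "E"),
  count = c(1, 1, 2, 1, 1, 3, 2)
)

# Total count per year, named by year
tot <- tapply(df$count, df$year, sum)

# Total of the following year for each row (NA when there is no following year)
next_tot <- as.vector(tot[as.character(df$year + 1)])

# The same firm's count in the following year, 0 if the firm does not appear
next_cnt <- df$count[match(paste(df$firm, df$year + 1),
                           paste(df$firm, df$year))]
next_cnt[is.na(next_cnt)] <- 0

df$subtract <- next_tot - next_cnt
df$subtract
# 2 4 5 2 5 NA NA
```

Rows in the last year get NA because there is no following year to take a total from.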

Subtract rows varying one column but keeping others fixed

I have an experiment where I need to subtract values of two different treatments from the Control (baseline), but these subtractions must correspond to other columns, named block and year sampled.
Dummy data frame:
df <- data.frame("Treatment" = c("Control","Treat1", "Treat2"),
"Block" = rep(1:3, each=3), "Year" = rep(2011:2013, each=3),
"Value" = c(6,12,4,3,9,5,6,3,1));df
Treatment Block Year Value
1 Control 1 2011 6
2 Treat1 1 2011 12
3 Treat2 1 2011 4
4 Control 2 2012 3
5 Treat1 2 2012 9
6 Treat2 2 2012 5
7 Control 3 2013 6
8 Treat1 3 2013 3
9 Treat2 3 2013 1
Desired output:
Treatment Block Year Value
1 Control-Treat1 1 2011 -6
2 Control-Treat2 1 2011 2
3 Control-Treat1 2 2012 -6
4 Control-Treat2 2 2012 -2
5 Control-Treat1 3 2013 3
6 Control-Treat2 3 2013 5
Any suggestion, preferably using dplyr?
I have found similar questions but none addressing this specific issue.
We can use dplyr, group_by Block and subtract Value where Treatment == "Control" from each Value and remove the "Control" rows.
df %>%
group_by(Block) %>%
mutate(Value = Value[which.max(Treatment == "Control")] - Value) %>%
filter(Treatment != "Control")
# Treatment Block Year Value
# <fct> <int> <int> <dbl>
#1 Treat1 1 2011 -6
#2 Treat2 1 2011 2
#3 Treat1 2 2012 -6
#4 Treat2 2 2012 -2
#5 Treat1 3 2013 3
#6 Treat2 3 2013 5
It is not clear whether the Treatment values in the expected output (Control-Treat1, Control-Treat2) are shown only to demonstrate the calculation or whether the OP really wants them in the output. If they are needed, we can use
df %>%
group_by(Block) %>%
mutate(Value = Value[which.max(Treatment == "Control")] - Value,
Treatment = paste0("Control-", Treatment)) %>%
filter(Treatment != "Control-Control")
# Treatment Block Year Value
# <chr> <int> <int> <dbl>
#1 Control-Treat1 1 2011 -6
#2 Control-Treat2 1 2011 2
#3 Control-Treat1 2 2012 -6
#4 Control-Treat2 2 2012 -2
#5 Control-Treat1 3 2013 3
#6 Control-Treat2 3 2013 5
A somewhat different tidyverse possibility could be:
df %>%
spread(Treatment, Value) %>%
gather(var, val, -c(Block, Year, Control)) %>%
mutate(Value = Control - val,
Treatment = paste("Control", var, sep = " - ")) %>%
select(Treatment, Block, Year, Value) %>%
arrange(Block)
Treatment Block Year Value
1 Control - Treat1 1 2011 -6
2 Control - Treat2 1 2011 2
3 Control - Treat1 2 2012 -6
4 Control - Treat2 2 2012 -2
5 Control - Treat1 3 2013 3
6 Control - Treat2 3 2013 5
This can be done with an SQL self join like this:
library(sqldf)
sqldf("select a.Treatment || '-' || b.Treatment as Treatment,
a.Block, a.Year,
a.Value - b.Value as Value
from df a
join df b on a.block = b.block and
a.Treatment = 'Control' and
b.Treatment != 'Control'")
Treatment Block Year Value
1 Control-Treat1 1 2011 -6
2 Control-Treat2 1 2011 2
3 Control-Treat1 2 2012 -6
4 Control-Treat2 2 2012 -2
5 Control-Treat1 3 2013 3
6 Control-Treat2 3 2013 5
Another dplyr-tidyr approach: You can remove unwanted columns with select:
# dummy_df is the data frame shown in the question (df)
dummy_df %>%
spread(Treatment,Value) %>%
gather(key,value,Treat1:Treat2) %>%
group_by(Block,Year,key) %>%
mutate(Val=Control-value)
# A tibble: 6 x 6
# Groups: Block, Year, key [6]
Block Year Control key value Val
<int> <int> <dbl> <chr> <dbl> <dbl>
1 1 2011 6 Treat1 12 -6
2 2 2012 3 Treat1 9 -6
3 3 2013 6 Treat1 3 3
4 1 2011 6 Treat2 4 2
5 2 2012 3 Treat2 5 -2
6 3 2013 6 Treat2 1 5
Just the exact output:
dummy_df %>%
spread(Treatment,Value) %>%
gather(key,value,Treat1:Treat2) %>%
mutate(Treatment=paste0("Control-",key)) %>%
group_by(Block,Year,Treatment) %>%
mutate(Val=Control-value) %>%
select(Treatment, Block, Year, Control, Val) %>%
arrange(Block)
# A tibble: 6 x 5
# Groups: Block, Year, Treatment [6]
Treatment Block Year Control Val
<chr> <int> <int> <dbl> <dbl>
1 Control-Treat1 1 2011 6 -6
2 Control-Treat2 1 2011 6 2
3 Control-Treat1 2 2012 3 -6
4 Control-Treat2 2 2012 3 -2
5 Control-Treat1 3 2013 6 3
6 Control-Treat2 3 2013 6 5
Another tidyverse solution. We can use filter to separate "Control" and "Treatment" to different data frames, use left_join to combine them by Block and Year, and then process the data frame.
df2 <- df %>%
filter(!Treatment %in% "Control") %>%
left_join(df %>% filter(Treatment %in% "Control"),
by = c("Block", "Year")) %>%
mutate(Value = Value.y - Value.x) %>% # Control (.y) minus treatment (.x)
unite(Treatment, Treatment.y, Treatment.x, sep = "-") %>%
select(Treatment, Block, Year, Value)
# Treatment Block Year Value
# 1 Control-Treat1 1 2011 -6
# 2 Control-Treat2 1 2011 2
# 3 Control-Treat1 2 2012 -6
# 4 Control-Treat2 2 2012 -2
# 5 Control-Treat1 3 2013 3
# 6 Control-Treat2 3 2013 5
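Since spread and gather are superseded in current tidyr, the reshape-based answers above can also be sketched with pivot_wider and pivot_longer. This assumes tidyr 1.0 or later; the intermediate column name treated is illustrative.

```r
library(dplyr)
library(tidyr)

# Data frame from the question
df <- data.frame(Treatment = c("Control", "Treat1", "Treat2"),
                 Block = rep(1:3, each = 3),
                 Year = rep(2011:2013, each = 3),
                 Value = c(6, 12, 4, 3, 9, 5, 6, 3, 1))

res <- df %>%
  # One row per Block/Year with Control, Treat1 and Treat2 side by side
  pivot_wider(names_from = Treatment, values_from = Value) %>%
  # Back to one row per treatment, keeping Control as its own column
  pivot_longer(c(Treat1, Treat2), names_to = "Treatment", values_to = "treated") %>%
  mutate(Value = Control - treated,
         Treatment = paste0("Control-", Treatment)) %>%
  select(Treatment, Block, Year, Value)

res$Value
# -6 2 -6 -2 3 5
```

pivot_longer emits the treatments row by row, so the result is already ordered by Block without a separate arrange step.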

Search in row for a certain value and report the date

Here is my replicating example.
l=c("11/1/2012","7/1/2012","11/1/2010",0 ,"8/1/2012","6/1/2012","3/1/2012","NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j,k,l)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3","anytest","date")
I want to search through each row and find the first test column that equals 1. The new column I want to create is "anytest". It is 1 if test1, test2, or test3 equals 1, and 0 if none of them do. NA values are ignored: if test1 and test2 are NA but test3 equals 0, then anytest equals 0. I think I have made progress using this code:
anytestTRY = ifelse(rowSums(mydata[, c("test1", "test2", "test3")] == 1, na.rm = TRUE) > 0, 1, 0)
But now I am at a crossroads, because I also want to find, in each row, the first of test1, test2, or test3 that equals 1 and report the month and year for that test. So if test1 equals 0, test2 is NA, and test3 equals 1, I want the column I created called date to hold month3 and year3 in an analyzable date format. Thanks a million.
mydata = data.frame(a,b,c,d,e,f,g,h,i,j)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3")
mydata %>%
mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
group_by(id) %>% # for each id
nest() %>% # nest data
mutate(date = map(data, ~case_when(.$test1==1 ~ ymd(paste0(.$year1,"-",.$month1,"-",1)), # get date based on first test that is 1
.$test2==1 ~ ymd(paste0(.$year2,"-",.$month2,"-",1)),
.$test3==1 ~ ymd(paste0(.$year3,"-",.$month3,"-",1)))),
anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1) > 0 ~ "1", # create anytest column
sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
TRUE ~ "0")))) %>%
unnest() # unnest data
which returns:
# # A tibble: 8 x 12
# id date anytest test1 month1 year1 test2 month2 year2 test3 month3 year3
# <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2012-11-01 1 1 11 2012 1 10 2014 1 2 2011
# 2 2 2011-07-01 1 1 7 2011 0 4 2012 0 12 2012
# 3 3 2010-11-01 1 0 9 2012 1 11 2010 1 12 2011
# 4 4 NA 0 0 9 2014 0 10 2012 0 6 2012
# 5 5 2012-08-01 1 0 5 2014 0 10 2013 1 8 2012
# 6 6 2011-06-01 0 NA NA NA 1 6 2011 0 11 2014
# 7 7 2012-03-01 0 0 7 2011 NA NA NA 1 3 2012
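A base R sketch of the rule as the question states it (NA counts as "not 1") is shown below. It uses four of the rows from the output above as sample data, and the helper names hit, first_hit and pick are illustrative.

```r
# Four rows taken from the output above (ids 2, 4, 6 and 7)
mydata <- data.frame(
  id    = c(2, 4, 6, 7),
  test1 = c(1, 0, NA, 0), month1 = c(7, 9, NA, 7),   year1 = c(2011, 2014, NA, 2011),
  test2 = c(0, 0, 1, NA), month2 = c(4, 10, 6, NA),  year2 = c(2012, 2012, 2011, NA),
  test3 = c(0, 0, 0, 1),  month3 = c(12, 6, 11, 3),  year3 = c(2012, 2012, 2014, 2012)
)

tests  <- as.matrix(mydata[, c("test1", "test2", "test3")])
months <- as.matrix(mydata[, c("month1", "month2", "month3")])
years  <- as.matrix(mydata[, c("year1", "year2", "year3")])

# TRUE where a test equals 1; NA is treated as "not 1"
hit <- !is.na(tests) & tests == 1
mydata$anytest <- as.integer(rowSums(hit) > 0)

# Column index (1, 2 or 3) of the first test equal to 1, NA if there is none
first_hit <- unname(apply(hit, 1, function(r) which(r)[1]))

# Pick the month and year that belong to that test, row by row
pick <- cbind(seq_len(nrow(mydata)), first_hit)
mydata$date <- as.Date(ISOdate(years[pick], months[pick], 1))

mydata[, c("id", "anytest", "date")]
```

Under this reading, ids 6 and 7 get anytest = 1 because one of their non-missing tests equals 1, and id 4 gets an NA date because none does.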
