creating a dummy variable with consecutive cases - r

I have a problem similar to this one:
How can I create a dummy variable over consecutive values by group id?
The difference is that as soon as Dummy = 1, I want the dummy to be 1 for the rest of the group (ID), given that the rows are ordered by year. So, for example, df1:
df1 <-data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,0 ,0,1,0,1, 1,0,0,0))
should become:
df2 <- data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,1 ,0,1,1, 1, 1,1,1,1))
I've tried something like this (and a few other things), but it failed:
df2<- df1%>% group_by(ID) %>% arrange(ID , year) %>%
mutate(treated = case_when(Dummy == 1 ~ 1,
lag(Dummy, n= unique(n()), default = 0) == 1 ~ 1))

If your input data is as below then we can just use cummax():
library(dplyr)
df1 <-data.frame(ID = rep(seq(1:3), each = 4),
year = rep(c(2014, 2015, 2016, 2017),3),
value = runif(12, min = 0, max = 25),
Dummy = c(0,0,1,0 ,0,1,0,1, 1,0,0,0))
df1
#> ID year value Dummy
#> 1 1 2014 14.144996 0
#> 2 1 2015 20.621603 0
#> 3 1 2016 8.325170 1
#> 4 1 2017 21.725028 0
#> 5 2 2014 11.894383 0
#> 6 2 2015 13.445744 1
#> 7 2 2016 3.332338 0
#> 8 2 2017 2.984941 1
#> 9 3 2014 17.551266 1
#> 10 3 2015 5.250556 0
#> 11 3 2016 11.062577 0
#> 12 3 2017 20.169439 0
df1 %>%
group_by(ID) %>%
mutate(Dummy = cummax(Dummy))
#> # A tibble: 12 x 4
#> # Groups: ID [3]
#> ID year value Dummy
#> <int> <dbl> <dbl> <dbl>
#> 1 1 2014 14.1 0
#> 2 1 2015 20.6 0
#> 3 1 2016 8.33 1
#> 4 1 2017 21.7 1
#> 5 2 2014 11.9 0
#> 6 2 2015 13.4 1
#> 7 2 2016 3.33 1
#> 8 2 2017 2.98 1
#> 9 3 2014 17.6 1
#> 10 3 2015 5.25 1
#> 11 3 2016 11.1 1
#> 12 3 2017 20.2 1
Created on 2022-10-14 by the reprex package (v2.0.1)
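For readers who prefer to avoid dplyr, the same idea can be sketched in base R. This is only an illustration: it uses fixed values instead of runif() so the result is reproducible, it sorts by ID and year first on the assumption that "the rest of the group" means later years, and the column name treated is borrowed from the question's own attempt.

```r
# Same Dummy pattern as in the question, without the random value column
df1 <- data.frame(ID = rep(1:3, each = 4),
                  year = rep(2014:2017, 3),
                  Dummy = c(0, 0, 1, 0,  0, 1, 0, 1,  1, 0, 0, 0))

# cummax() depends on row order, so sort within each ID by year first
df1 <- df1[order(df1$ID, df1$year), ]

# ave() applies cummax() separately to each ID group:
# once a 1 has been seen, every later row in that group stays 1
df1$treated <- ave(df1$Dummy, df1$ID, FUN = cummax)
df1
```

Because cummax() only looks at rows that come earlier in the group, sorting before the call matters whenever the input is not already ordered.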

Related

R: Cumulative Mean Excluding Current Value?

I am working with the R programming language.
I have a dataset that looks something like this:
id = c(1,1,1,1,2,2,2)
year = c(2010,2011,2012,2013, 2012, 2013, 2014)
var = rnorm(7,7,7)
my_data = data.frame(id, year,var)
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658
For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).
As an example:
row 1: NA
row 2: 12.186300/1
row 3: (12.186300 + 19.069836)/2
row 4: (12.186300 + 19.069836 + 7.456078)/3
row 5: NA
row 6: 20.827933
row 7: (20.827933 + 5.029625)/2
I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:
transform(my_data, cmean = ave(var, id, FUN = cummean) - var)
id year var cmean
1 1 2010 12.186300 0.000000
2 1 2011 19.069836 -3.441768
3 1 2012 7.456078 5.447994
4 1 2013 14.875019 -1.478211
5 2 2012 20.827933 0.000000
6 2 2013 5.029625 7.899154
7 2 2014 -2.260658 10.126291
The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).
I then tried an answer provided here (Compute mean excluding current value):
my_data %>%
group_by(id) %>%
mutate(avg = (sum(var) - var)/(n() - 1))
# A tibble: 7 x 4
# Groups: id [2]
id year var avg
<dbl> <dbl> <dbl> <dbl>
1 1 2010 12.2 13.8
2 1 2011 19.1 11.5
3 1 2012 7.46 15.4
4 1 2013 14.9 12.9
5 2 2012 20.8 1.38
6 2 2013 5.03 9.28
But it is still not working.
Can someone please show me what I am doing wrong and what I can do to fix this problem?
Thanks!
df %>%
group_by(id)%>%
mutate(avg = lag(cummean(var)))
# A tibble: 7 × 4
# Groups: id [2]
id year var avg
<int> <int> <dbl> <dbl>
1 1 2010 12.2 NA
2 1 2011 19.1 12.2
3 1 2012 7.46 15.6
4 1 2013 14.9 12.9
5 2 2012 20.8 NA
6 2 2013 5.03 20.8
7 2 2014 -2.26 12.9
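To check the arithmetic without dplyr, here is a minimal base R sketch of the same "cumulative mean of the previous rows" idea. The helper name prev_cummean is hypothetical, and the data are the values printed in the question.

```r
my_data <- data.frame(
  id   = c(1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2012, 2013, 2014),
  var  = c(12.186300, 19.069836, 7.456078, 14.875019,
           20.827933, 5.029625, -2.260658)
)

# Hypothetical helper: mean of all values *before* each position.
# The first element has no predecessors, so it is NA.
prev_cummean <- function(x) {
  n_prev <- seq_len(length(x) - 1)          # 1, 2, ..., n - 1
  c(NA, head(cumsum(x), -1) / n_prev)       # running sum of earlier rows / their count
}

# ave() applies the helper within each id
my_data$avg <- ave(my_data$var, my_data$id, FUN = prev_cummean)
my_data
```

Row 3, for example, comes out as (12.186300 + 19.069836)/2, which matches the hand calculation in the question.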
With the help of some intermediate variables you can do it like so:
library(dplyr)
df <- read.table(text = "
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658", header=T)
df |>
  group_by(id) |>
  # equivalent one-liner: mutate(avg = lag(cummean(var)))
  mutate(id_g = row_number()) |>
  mutate(ms = cumsum(var)) |>
  # sum of the previous rows divided by their count;
  # the first row of each group has no previous rows, so it is NA
  mutate(cm = lag(ms) / (id_g - 1)) |>
  select(-id_g, -ms)
#> # A tibble: 7 × 4
#> # Groups: id [2]
#> id year var cm
#> <int> <int> <dbl> <dbl>
#> 1 1 2010 12.2 NA
#> 2 1 2011 19.1 12.2
#> 3 1 2012 7.46 15.6
#> 4 1 2013 14.9 12.9
#> 5 2 2012 20.8 NA
#> 6 2 2013 5.03 20.8
#> 7 2 2014 -2.26 12.9

Issue with pivot_longer command

I am trying to convert a wide dataset to a long one using pivot_longer. The column headers are "Program ID" and ‘Participant_Count_22’, ‘Participant_Count_21’, ‘Participant_Count_20’, ‘Participant_Count_19’ for the four years 2019-2022.
program_tot %>% pivot_longer(cols=c(‘Participant_Count_22’, ‘Participant_Count_21’, ‘Participant_Count_20’, ‘Participant_Count_19’),
names_to='year',
values_to='Participant_Count')
I am getting an unexpected input error message. Any tips?
You seem to have fancy single quotes around your column names. Note carefully the difference between quotes in ‘Participant_Count_22’, which R doesn't recognise as a string, and 'Participant_Count_22', which it does recognise as a string.
In any case, you will save yourself some time and complexity if you use the selection helpers to select columns. In your case you could use contains("Participant_Count"):
library(tidyverse)
program_tot %>%
pivot_longer(cols = contains("Participant_Count"),
names_to = 'year',
values_to = 'Participant_Count') %>%
mutate(year = as.numeric(paste0("20", sub("Participant_Count_", "", year))))
#> # A tibble: 20 x 3
#> Program.ID year Participant_Count
#> <fct> <dbl> <dbl>
#> 1 1 2022 5
#> 2 1 2021 5
#> 3 1 2020 6
#> 4 1 2019 2
#> 5 2 2022 7
#> 6 2 2021 2
#> 7 2 2020 8
#> 8 2 2019 5
#> 9 3 2022 4
#> 10 3 2021 8
#> 11 3 2020 1
#> 12 3 2019 7
#> 13 4 2022 3
#> 14 4 2021 1
#> 15 4 2020 3
#> 16 4 2019 3
#> 17 5 2022 5
#> 18 5 2021 3
#> 19 5 2020 2
#> 20 5 2019 1
Data used
program_tot <- data.frame('Program ID' = factor(1:5),
Participant_Count_22 = c(5, 7, 4, 3, 5),
Participant_Count_21 = c(5, 2, 8, 1, 3),
Participant_Count_20 = c(6, 8, 1, 3, 2),
Participant_Count_19 = c(2, 5, 7, 3, 1))
Created on 2022-09-28 with reprex v2.0.2
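The quote problem described above can be reproduced in isolation. The following base R sketch is only an illustration: it builds a line of code containing typographic quotes (written as \u2018 and \u2019 escapes), shows that R refuses to parse it, and then parses the same text after swapping in straight quotes. The variable names bad, good and cols are made up for the example.

```r
# Code text with curly quotes around the column name
bad <- "cols <- c(\u2018Participant_Count_22\u2019)"

# parse() raises an error on the curly quotes; capture it instead of stopping
res <- tryCatch(parse(text = bad), error = function(e) e)
inherits(res, "error")

# Replace both curly quotes with a straight single quote
good <- gsub("\u2018", "'", bad, fixed = TRUE)
good <- gsub("\u2019", "'", good, fixed = TRUE)

# The cleaned text is valid R and creates `cols`
eval(parse(text = good))
cols
```

Curly quotes usually sneak in when code is copied from a word processor or a rich-text email, so retyping the quotes in the R editor is normally enough.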

replace NA with value of the previous row

I have a dataframe like this one with a start and end month and year.
ID start_month start_year end_month end_year
1 1 2018 5 2019
2 5 1981 NA 1999
2 7 1973 NA 1981
2 7 1963 NA 1973
Several of the end months are missing, and I would like to fill them in so that the dates follow each other.
Within each ID, I would like to replace each NA with the start month of the row above minus 1.
For the date NA-1999, as it is the most recent date for subject 2 and there is no date after it, I would like to use 7 for the month.
I would like to get something like this:
ID start_month start_year end_month end_year
1 1 2018 5 2019
2 5 1981 7 1999
2 7 1973 4 1981
2 7 1963 6 1973
I thought of using this:
df<-df %>% group_by(ID) %>% replace(end_month = ifelse(is.na(end_month), length(start_month)-1 , 7)) %>% ungroup()
My "length(start_month)-1" argument and the replace function don't work, and I don't know what else to do.
I'm sorry if this isn't very clear, it's complicated to explain this in writing...
Thank you in advance for your help
If I understand you correctly, you want to replace NAs in end_month within the same ID by the following rules:
the start_month of the next (later) period minus 1, for any period that has a later period
7 for the last period in each ID
Is that correct?
If so, then this should do the trick:
library(dplyr)
df %>%
group_by(ID) %>%
arrange(ID, desc(start_year), desc(start_month)) %>%
mutate(
end_month = ifelse(is.na(end_month), lag(start_month) - 1, end_month),
end_month = ifelse(is.na(end_month), 7, end_month)
) %>%
ungroup()
#> # A tibble: 4 × 5
#> ID start_month start_year end_month end_year
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2018 5 2019
#> 2 2 5 1981 7 1999
#> 3 2 7 1973 4 1981
#> 4 2 7 1963 6 1973
Created on 2022-03-30 by the reprex package (v2.0.1)
Data
df <- tibble::tribble(
~ID, ~start_month, ~start_year, ~end_month, ~end_year,
1, 1, 2018, 5, 2019,
2, 5, 1981, NA, 1999,
2, 7, 1973, NA, 1981,
2, 7, 1963, NA, 1973
)
df
#> # A tibble: 4 × 5
#> ID start_month start_year end_month end_year
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2018 5 2019
#> 2 2 5 1981 NA 1999
#> 3 2 7 1973 NA 1981
#> 4 2 7 1963 NA 1973
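The two rules can also be sketched in base R, which makes the "start month of the later period" step explicit. This is a rough equivalent under the same assumptions as the answer (rows sorted from most recent to oldest within each ID, and 7 as the fallback month); the names prev_start and fill are invented for the example.

```r
df <- data.frame(ID = c(1, 2, 2, 2),
                 start_month = c(1, 5, 7, 7),
                 start_year = c(2018, 1981, 1973, 1963),
                 end_month = c(5, NA, NA, NA),
                 end_year = c(2019, 1999, 1981, 1973))

# Most recent period first within each ID
df <- df[order(df$ID, -df$start_year, -df$start_month), ]

# Start month of the later period (the row above within the same ID);
# NA for the most recent period of each ID
prev_start <- ave(df$start_month, df$ID,
                  FUN = function(x) c(NA, head(x, -1)))

# Rule 1: later period's start month minus 1; rule 2: 7 when there is no later period
fill <- ifelse(is.na(prev_start), 7, prev_start - 1)

# Only overwrite the missing end months
df$end_month <- ifelse(is.na(df$end_month), fill, df$end_month)
df
```

Note that if a later period starts in January, "start month minus 1" gives 0, so real data may need an extra rule for wrapping to December of the previous year.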

In R: How can I check that I have consecutive years of data (to later be able to calculate growth)?

I have the dataframe (sample) below:
companyID year yearID
1 2010 1
1 2011 2
1 2012 3
1 2013 4
2 2010 1
2 2011 2
2 2016 3
2 2017 4
2 2018 5
3 2010 1
3 2011 2
3 2014 3
3 2017 4
3 2018 5
I have used a for loop in order to try and create a sequence column that starts a new number for each new sequence of numbers. I am new to R so my definitions may be a bit wrong. My for loop looks like this:
size1 <- c(1:3)
s <- 0
for (val1 in size) {
m <- max(sample[sample$companyID == val1, 4])
size2 <- c(1:m)
for (val2 in size2){
row <- sample[which(sample$companyID == val1 & sample$yearID == val2)]
m1 <- sample[sample$companyID == val1 & sample$yearID == val2, 2]
m2 <- sample[sample$CompanyID == val1 & sample$yearID == (val2-1), 2]
if(val2>1 && m1-m2 > 1) {
sample$sequence[row] s = s+1}
else {s = s}
}
}
Where m is the max value of the yearID per companyID, row is to identify that the value should be entered on the row where companyID = val1 and yearID = val2, m1 is from the year variable and is the latter year, whereas m2 is the former year. What I have tried to do is to change the sequence every time m1-m2 > 1 (when val2 > 1 also).
Desired outcome:
companyID year yearID sequence
1 2010 1 1
1 2011 2 1
1 2012 3 1
1 2013 4 1
2 2010 1 2
2 2011 2 2
2 2016 3 3
2 2017 4 3
2 2018 5 3
3 2010 1 4
3 2011 2 4
3 2014 3 5
3 2017 4 6
3 2018 5 6
Super appreciative if anyone can help!!
This is a good question!
First group_by companyID.
Calculate the difference between consecutive rows of the year column with lag to identify whether the years are consecutive.
group_by companyID and yearID.
mutate a helper column sequence1 that assigns 1 to each row that starts a run of consecutive years.
ungroup and increment a sequence number each time 1 occurs in sequence1.
Remove the helper columns sequence1 and deltaLag1.
library(tidyverse)
df1 <- df %>%
group_by(companyID) %>%
mutate(deltaLag1 = year - lag(year, 1)) %>%
group_by(companyID, yearID) %>%
mutate(sequence1 = case_when(is.na(deltaLag1) | deltaLag1 > 1 ~ 1,
TRUE ~ 2)) %>%
ungroup() %>%
mutate(sequence = cumsum(sequence1==1)) %>%
select(-deltaLag1, -sequence1)
data
df <- tribble(
~companyID, ~year, ~yearID,
1, 2010, 1,
1, 2011, 2,
1, 2012, 3,
1, 2013, 4,
2, 2010, 1,
2, 2011, 2,
2, 2016, 3,
2, 2017, 4,
2, 2018, 5,
3, 2010, 1,
3, 2011, 2,
3, 2014, 3,
3, 2017, 4,
3, 2018, 5)
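The same logic fits in a few lines of base R, shown here as a sketch rather than a replacement for the pipeline above. It assumes the rows are already sorted by companyID and year; the intermediate name gap is made up for the example.

```r
df <- data.frame(
  companyID = c(rep(1, 4), rep(2, 5), rep(3, 5)),
  year = c(2010, 2011, 2012, 2013,
           2010, 2011, 2016, 2017, 2018,
           2010, 2011, 2014, 2017, 2018)
)

# Year difference to the previous row within each company;
# NA marks the first row of a company
gap <- ave(df$year, df$companyID, FUN = function(y) c(NA, diff(y)))

# A new sequence starts at the first row of a company or after a gap of more than one year;
# cumsum() over that flag numbers the runs 1, 2, 3, ...
df$sequence <- cumsum(is.na(gap) | gap > 1)
df
```

For the sample data this gives the sequence column from the desired outcome, with six runs in total.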
It's not clear if you want the exact desired outcome or check that you have consecutive years by companyID.
According to your title message:
sample <- read.table(header = TRUE, text = "
companyID year yearID
1 2010 1
1 2011 2
1 2012 3
1 2013 4
2 2010 1
2 2011 2
2 2016 3
2 2017 4
2 2018 5
3 2010 1
3 2011 2
3 2014 3
3 2017 4
3 2018 5
")
library(data.table)
sample <- setDT(sample)
sample[ , diff_year := year - shift(year), by = companyID]
sample <- setDF(sample)
sample
#> companyID year yearID diff_year
#> 1 1 2010 1 NA
#> 2 1 2011 2 1
#> 3 1 2012 3 1
#> 4 1 2013 4 1
#> 5 2 2010 1 NA
#> 6 2 2011 2 1
#> 7 2 2016 3 5
#> 8 2 2017 4 1
#> 9 2 2018 5 1
#> 10 3 2010 1 NA
#> 11 3 2011 2 1
#> 12 3 2014 3 3
#> 13 3 2017 4 3
#> 14 3 2018 5 1
# Created on 2021-03-13 by the reprex package (v1.0.0.9002)
Related to Calculate difference between values in consecutive rows by group
Regards,

Interpolate df column within each group

I have a data frame df and a sample vector years of the following kind:
> df <- data.frame(year = rep(c(2000, 2025, 2030, 2050), 2),
type = rep(c('a', 'b'), each = 4),
value = c(3, 9, 8, 6, 7, 5, 2, 10))
> years = seq(2010, 2050, 10)
> df
year type value
1 2000 a 3
2 2025 a 9
3 2030 a 8
4 2050 a 6
5 2000 b 7
6 2025 b 5
7 2030 b 2
8 2050 b 10
> years
[1] 2010 2020 2030 2040 2050
Now I would like to interpolate value within each group of type to get the values for years. My expected result looks like this (where values for 2010, 2020 and 2040 are interpolated):
> result
year type value
1 2010 a 5.4
2 2020 a 7.8
3 2030 a 8
4 2040 a 7
5 2050 a 6
6 2010 b 6.2
7 2020 b 5.4
8 2030 b 2
9 2040 b 6
10 2050 b 10
I have tried something like this but did not succeed as I am not allowed to change the length of the group. Any help is very much appreciated!
> result <- df %>%
group_by(type) %>%
mutate(year = years,
value = approx(year, value, years)$y)
Error: Problem with `mutate()` input `year`.
x Input `year` can't be recycled to size 4.
i Input `year` is `years`.
i Input `year` must be size 4 or 1, not 5.
i The error occurred in group 1: type = "a".
We can use complete to add the missing years for each 'type' and then apply approx:
library(dplyr)
library(tidyr)
df %>%
complete(year = years, type) %>%
group_by(type) %>%
mutate(value = approx(year, value, year)$y) %>%
ungroup %>%
arrange(type, year)
-output
# A tibble: 14 x 3
# year type value
# <dbl> <chr> <dbl>
# 1 2000 a 3
# 2 2010 a 5.4
# 3 2020 a 7.8
# 4 2025 a 9
# 5 2030 a 8
# 6 2040 a 7
# 7 2050 a 6
# 8 2000 b 7
# 9 2010 b 6.2
#10 2020 b 5.4
#11 2025 b 5
#12 2030 b 2
#13 2040 b 6
#14 2050 b 10
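If only the rows for years are wanted, as in the expected result, one option is to interpolate each group directly at those years. The following base R sketch is one possible way to do that, using the split / lapply / rbind pattern and approx() with its xout argument.

```r
df <- data.frame(year = rep(c(2000, 2025, 2030, 2050), 2),
                 type = rep(c("a", "b"), each = 4),
                 value = c(3, 9, 8, 6, 7, 5, 2, 10))
years <- seq(2010, 2050, 10)

# Interpolate each type separately at the requested years
pieces <- lapply(split(df, df$type), function(g) {
  data.frame(year = years,
             type = g$type[1],
             # linear interpolation of value over year, evaluated at `years`
             value = approx(g$year, g$value, xout = years)$y)
})

# Stack the per-type results into one data frame
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Because each group returns a new data frame of length(years) rows, this avoids the recycling error that mutate() raises when the output length differs from the group size.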
