Creating a binary variable if individual was observed in the previous year - r

Let's say I have an example dataframe in the following format:
df <- data.frame(c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 c(3, 3, 3, 2, 2, 2, 1, 1, 1),
                 c(23, 23, 34, 134, 134, NA, 45, NA, NA))
colnames(df) <- c("id", "year", "fte_wage")
df <- df[is.na(df$fte_wage) == FALSE,]
I want to create a binary variable (let's say, a column named "obs") indicating whether or not the individual was observed in the previous year. I have tried the following:
library(dplyr)
df2 <- df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  rowwise() %>%
  mutate(obs = ifelse((lag(year) %in% df[df$id == id, ]$year & year > lag(year)), 1, 0))
This generates a column of only 0 values. If I remove the second condition the code runs, but then lag(year) is misinterpreted, as it picks up values from other individuals as well.
My desired output would be a dataframe in the following format:
id year fte_wage obs
 1    1       23   0
 1    2       23   1
 1    3       43   1
 2    1       54   0
 2    2       32   1
 3    1       56   0

You can just group_by(id) and then check whether row_number() is greater than 1, i.e. whether the row is the first observation for that id or a later one.
library(tidyverse)
df <- data.frame("id" = c(1,2,3,1,2,3,1,2,3),
"year" = c(3,3,3,2,2,2,1,1,1),
"fte_wage" = c(23,23,34,134,134,NA,45,NA,NA))
df %>%
drop_na(fte_wage) %>%
arrange(id, year) %>%
group_by(id) %>%
mutate(obs = as.numeric(row_number() > 1))
#> # A tibble: 6 × 4
#> # Groups: id [3]
#> id year fte_wage obs
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 45 0
#> 2 1 2 134 1
#> 3 1 3 23 1
#> 4 2 2 134 0
#> 5 2 3 23 1
#> 6 3 3 34 0
Created on 2022-11-21 with reprex v2.0.2

This is one approach using dplyr without grouping.
library(dplyr)
df %>%
  na.omit() %>%
  arrange(id, year) %>%
  mutate(obs = (lag(id, default = FALSE) == id) * 1)
id year fte_wage obs
1 1 1 45 0
2 1 2 134 1
3 1 3 23 1
4 2 2 134 0
5 2 3 23 1
6 3 3 34 0

You could use diff in the following way:
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, year) %>%
  mutate(obs = +(c(0, diff(year)) == 1L))
Output:
# A tibble: 6 x 4
# Groups: id [3]
id year fte_wage obs
<dbl> <dbl> <dbl> <dbl>
1 1 1 45 0
2 1 2 134 1
3 1 3 23 1
4 2 2 134 0
5 2 3 23 1
6 3 3 34 0
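As a side note on the original attempt: rowwise() turns every row into its own one-row group, so lag(year) is always NA there, which is why only 0 values come out. If "observed in the previous year" is meant literally as year - 1, a plain grouped lag() works once rowwise() is dropped. This is just a sketch along those lines (not one of the answers above), using df with the NA fte_wage rows already removed as in the question:
library(dplyr)
df %>%
  arrange(id, year) %>%
  group_by(id) %>%
  # lag(year) is the previous observed year for the same id; the first row of
  # each id has no lag, so coalesce() treats it as "not observed before"
  mutate(obs = as.integer(coalesce(year - lag(year) == 1, FALSE))) %>%
  ungroup()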

Related

How to replace age with previous value + 1

I'm trying to replace missing age values in one wave by adding 1 to the value from the previous wave. So, for instance:
ID Age Wave
 1  20    1
 1  NA    2
 2  61    1
 2  NA    2
would become
ID Age Wave
 1  20    1
 1  21    2
 2  61    1
 2  62    2
library(tidyverse)
df %>%
  mutate(Age = case_when(is.na(Age) ~ lag(Age) + 1,
                         TRUE ~ Age))
# A tibble: 4 x 3
ID Age Wave
<dbl> <dbl> <dbl>
1 1 20 1
2 1 21 2
3 2 61 1
4 2 62 2
Base R
> ave(df$Age,df$ID,FUN=function(x){x[1]+seq_along(x)-1})
[1] 20 21 61 62
With tidyverse, assuming your data is in a data frame called df:
library(tidyverse)
df %>%
  group_by(ID) %>%
  arrange(ID, Wave) %>%
  mutate(missing_grp = cumsum((is.na(Age) != is.na(lag(Age))) | !is.na(Age))) %>%
  group_by(ID, missing_grp) %>%
  mutate(age_offset = cumsum(is.na(Age))) %>%
  group_by(ID) %>%
  fill(Age, .direction = 'down') %>%
  mutate(Age = Age + age_offset) %>%
  ungroup() %>%
  select(-missing_grp, -age_offset)
It also works with multiple successive missing ages.
For the following input:
df <- tribble(
  ~ID, ~Age, ~Wave,
  1, 21, 1,
  1, NA, 2,
  2, 61, 1,
  2, NA, 2,
  2, NA, 3,
  2, 70, 4,
  2, NA, 5,
)
it returns:
# A tibble: 7 × 3
ID Age Wave
<dbl> <dbl> <dbl>
1 1 21 1
2 1 22 2
3 2 61 1
4 2 62 2
5 2 63 3
6 2 70 4
7 2 71 5
In base R
within(df, Age[is.na(Age)] <- Age[which(is.na(Age)) - 1] + 1)
#> ID Age Wave
#> 1 1 20 1
#> 2 1 21 2
#> 3 2 61 1
#> 4 2 62 2
If you have more than two waves, we could use the row number:
library(tidyverse)

df |>
  group_by(ID) |>
  fill(Age) |>
  mutate(Age = Age + row_number() - 1) |>
  ungroup()
Output:
# A tibble: 5 × 3
ID Age Wave
<dbl> <dbl> <dbl>
1 1 21 1
2 1 22 2
3 2 61 1
4 2 62 2
5 2 63 3

How to count data frame elements grouped by multiple conditions in dplyr?

I am trying to use dplyr to count elements grouped by multiple conditions (columns) in a data frame. In the example below, the data frame output is shown at the top (I manually added the two right-most columns, desired_eleGrpCnt and explanation, to show what I am after) and the R code is underneath. I am trying to count the joint groupings of the Element and Group columns; my multiple-condition grouping attempt is the eleGrpCnt column. Any recommendations for the correct way to do this in dplyr? I thought that grouping by the combination (Element, Group) would work.
   Element Group origOrder eleCnt eleGrpCnt desired_eleGrpCnt explanation
   <chr>   <dbl>     <int>  <int>     <int>         <comment> <comment>
1 B 0 1 1 1 1 1st grouping of B where Group = 0
2 R 0 2 1 1 1 1st grouping of R where Group = 0
3 R 1 3 2 1 2 2nd grouping of R where Group = 1
4 R 1 4 3 2 2 2nd grouping of R where Group = 1
5 B 0 5 2 2 1 1st grouping of B where Group = 0
6 X 2 6 1 1 1 1st grouping of X where Group = 2
7 X 2 7 2 2 1 1st grouping of X where Group = 2
8 X 0 8 3 1 2 2nd grouping of X where Group = 0
9 X 0 9 4 2 2 2nd grouping of X where Group = 0
10 X -1 10 5 1 3 3rd grouping of X where Group = -1
library(dplyr)
myData6 <- data.frame(
  Element = c("B", "R", "R", "R", "B", "X", "X", "X", "X", "X"),
  Group = c(0, 0, 1, 1, 0, 2, 2, 0, 0, -1)
)

myData6 %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  group_by(Element, Group) %>%
  mutate(eleGrpCnt = row_number()) %>%
  ungroup()
If you group by element then the numbers you are looking for are simply the matches of Group against the unique values of Group:
library(dplyr)
myData6 %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  group_by(Element) %>%
  mutate(eleGrpCnt = match(Group, unique(Group)))
#> # A tibble: 10 x 5
#> # Groups: Element [3]
#> Element Group origOrder eleCnt eleGrpCnt
#> <chr> <dbl> <int> <int> <dbl>
#> 1 B 0 1 1 1
#> 2 R 0 2 1 1
#> 3 R 1 3 2 2
#> 4 R 1 4 3 2
#> 5 B 0 5 2 1
#> 6 X 2 6 1 1
#> 7 X 2 7 2 1
#> 8 X 0 8 3 2
#> 9 X 0 9 4 2
#> 10 X -1 10 5 3
Created on 2022-09-11 with reprex v2.0.2
Here's one approach; it sorts by Group value, but if you want the numbering to follow the original order of appearance we could add a step (see the sketch after the output below).
myData6 %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  ungroup() %>%
  arrange(Element, Group) %>%
  group_by(Element) %>%
  mutate(eleGrpCnt = cumsum(Group != lag(Group, default = -999))) %>%
  ungroup() %>%
  arrange(origOrder)
# A tibble: 10 × 5
Element Group origOrder eleCnt eleGrpCnt
<chr> <dbl> <int> <int> <int>
1 B 0 1 1 1
2 R 0 2 1 1
3 R 1 3 2 2
4 R 1 4 3 2
5 B 0 5 2 1
6 X 2 6 1 3
7 X 2 7 2 3
8 X 0 8 3 2
9 X 0 9 4 2
10 X -1 10 5 1
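That extra step could look like the sketch below, which numbers the groups by the order in which each (Element, Group) pair first appears; firstSeen is just a helper column introduced here, not part of the original answer:
library(dplyr)
myData6 %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(eleCnt = row_number()) %>%
  group_by(Element, Group) %>%
  # first row index at which this (Element, Group) pair occurs
  mutate(firstSeen = min(origOrder)) %>%
  arrange(Element, firstSeen) %>%
  group_by(Element) %>%
  mutate(eleGrpCnt = cumsum(Group != lag(Group, default = first(Group))) + 1L) %>%
  ungroup() %>%
  arrange(origOrder) %>%
  select(-firstSeen)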

How to flag the last row of a data frame group?

Suppose we start with the below dataframe df:
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
ID Period Value
1 1 1 10
2 1 2 12
3 1 3 11
4 5 1 4
5 5 2 6
Now using dplyr I add a "Calculate" column that multiplies Period and Value of each row, giving me the following:
> df %>% mutate(Calculate = Period * Value)
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 33
4 5 1 4 4
5 5 2 6 12
I'd like to modify the above "Calculate" to give me a value of 0, when reaching the last row for a given ID, so that the data frame output looks like:
ID Period Value Calculate
1 1 1 10 10
2 1 2 12 24
3 1 3 11 0
4 5 1 4 4
5 5 2 6 0
I was going to use the lead() function to peer at the next row to see if the ID changes, but wasn't sure what happens when reaching the end of the data frame.
How could this be accomplished using dplyr?
You can group_by ID and replace the last row for each ID with 0.
library(dplyr)
df %>%
  mutate(Calculate = Period * Value) %>%
  group_by(ID) %>%
  mutate(Calculate = replace(Calculate, n(), 0)) %>%
  ungroup()
# ID Period Value Calculate
# <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 10
#2 1 2 12 24
#3 1 3 11 0
#4 5 1 4 4
#5 5 2 6 0
Yet another possibility:
library(tidyverse)
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
df %>%
  mutate(Calculate = Period * Value) %>%
  group_by(ID) %>%
  mutate(Calculate = if_else(row_number() == n(), 0, Calculate)) %>%
  ungroup()
#> # A tibble: 5 × 4
#> ID Period Value Calculate
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
library(tidyverse)
df %>%
  mutate(Calculate = Period * Value * duplicated(ID, fromLast = TRUE))
#> ID Period Value Calculate
#> 1 1 1 10 10
#> 2 1 2 12 24
#> 3 1 3 11 0
#> 4 5 1 4 4
#> 5 5 2 6 0
Created on 2022-01-09 by the reprex package (v2.0.1)
This should also work. You could most likely replace rownum with Period as well.
ID <- c(1, 1, 1, 5, 5)
Period <- c(1,2,3,1,2)
Value <- c(10,12,11,4,6)
df <- data.frame(ID, Period, Value)
library(dplyr)

df <- df %>% mutate(Calculate = Period * Value)
df$rownum <- rownames(df)

df <- df %>%
  group_by(ID) %>%
  # compare as numbers so row "10" doesn't sort before row "2"
  mutate(Calculate = ifelse(as.numeric(rownum) == max(as.numeric(rownum)), 0, Calculate)) %>%
  ungroup()
A tibble: 5 × 5
ID Period Value Calculate rownum
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 1 10 10 1
2 1 2 12 24 2
3 1 3 11 0 3
4 5 1 4 4 4
5 5 2 6 0 5
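For reference, the lead() idea raised in the question also works: lead(ID) is NA on the very last row of the frame, so that case just needs to be caught explicitly. A sketch, assuming the rows for each ID are stored contiguously as in the example:
library(dplyr)

df <- data.frame(ID = c(1, 1, 1, 5, 5),
                 Period = c(1, 2, 3, 1, 2),
                 Value = c(10, 12, 11, 4, 6))

df %>%
  mutate(Calculate = Period * Value) %>%
  # the next row's ID differs (or doesn't exist) exactly on the last row of each ID block
  mutate(Calculate = if_else(is.na(lead(ID)) | lead(ID) != ID, 0, Calculate))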

Aggregate rows with specific shared value

I want to aggregate my data as follows:
Aggregate only for successive rows where status = 0
Keep age and sum up points
Example data:
da <- data.frame(userid = c(1, 1, 1, 1, 2, 2, 2, 2),
                 status = c(0, 0, 0, 1, 1, 1, 0, 0),
                 age = c(10, 10, 10, 11, 15, 16, 16, 16),
                 points = c(2, 2, 2, 6, 3, 5, 5, 5))
da
userid status age points
1 1 0 10 2
2 1 0 10 2
3 1 0 10 2
4 1 1 11 6
5 2 1 15 3
6 2 1 16 5
7 2 0 16 5
8 2 0 16 5
I would like to have:
da2
userid status age points
1 1 0 10 6
2 1 1 11 6
3 2 1 15 3
4 2 1 16 5
5 2 0 16 10
library(dplyr)

da %>%
  mutate(grp = with(rle(status),
                    rep(seq_along(values), lengths)) + cumsum(status != 0)) %>%
  group_by_at(vars(-points)) %>%
  summarise(points = sum(points)) %>%
  ungroup() %>%
  select(-grp)
## A tibble: 5 x 4
# userid status age points
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 10 6
#2 1 1 11 6
#3 2 0 16 10
#4 2 1 15 3
#5 2 1 16 5
You can use group_by from dplyr:
da %>%
  group_by(da$userid, cumsum(da$status), da$status) %>%
  summarise(age = max(age), points = sum(points))
Output:
`da$userid` `cumsum(da$status)` `da$status` age points
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 6
2 1 1 1 11 6
3 2 2 1 15 3
4 2 3 0 16 10
5 2 3 1 16 5
Exactly the same idea as above:
library(dplyr)
data1 <- da %>%
  group_by(userid, age, status) %>%
  filter(status == 0) %>%
  summarise(points = sum(points))
data2 <- da %>%
  group_by(userid, age, status) %>%
  filter(status != 0) %>%
  summarise(points = sum(points))
da2 <- rbind(data1, data2)
We need to be more careful with your specification of status equal to 0; I think the code of Quang Hoang works only for your specific example.
I hope it will help.
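One way to handle the "successive rows" requirement more generally (so that two separate runs of status 0 at the same age are not merged) is to build an explicit run id per userid. This is only a sketch of that idea; run and grp are helper columns introduced here, and the row order of the result may differ slightly from da2:
library(dplyr)
da %>%
  group_by(userid) %>%
  # run id changes whenever status changes within a userid
  mutate(run = cumsum(status != lag(status, default = -1)),
         # only runs of status == 0 collapse; every other row keeps its own group id
         grp = if_else(status == 0, as.numeric(run), row_number() + 0.5)) %>%
  group_by(userid, grp, status, age) %>%
  summarise(points = sum(points), .groups = "drop") %>%
  select(-grp)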

Frequency table from two filters and two summarise in dplyr

How can I combine the following codes to into one:
df %>% group_by(year) %>% filter(MIAPRFCD_J8==1 | MIAPRFCD_55==1) %>% summarise (Freq = n())
df %>% group_by(year) %>% filter(sum==1 | (MIAPRFCD_J8==1 & MIAPRFCD_55==1)) %>% summarise (reason_lv = n())
So the output will be one table (or data frame), grouped by year, with two columns of frequencies based on the above filters.
Here is the sample data:
df<- read.table(header=T, text='Act year MIAPRFCD_J8 MIAPRFCD_55 sum
1 2015 1 0 1
2 2016 1 0 1
3 2016 0 1 2
6 2016 1 1 3
7 2016 0 0 2
9 2015 1 0 1
11 2015 1 0 1
12 2015 0 1 2
15 2014 0 1 1
20 2014 1 0 1
60 2013 1 0 1')
The output after combining the codes would be:
year Freq reason_lv
2013 1 1
2014 2 2
2015 4 3
2016 3 2
Now that you've included your data, this is easy enough to solve. Here are two possible options; both get you the output you want, so it's mostly a matter of style.
Option 1: make two filtered data frames, then use an inner_join to join them together by year. (You could also build those data frames inline in the arguments to inner_join, but that's a little less clear.)
library(tidyverse)
df<- read.table(header=T,
text='Act year MIAPRFCD_J8 MIAPRFCD_55 sum
1 2015 1 0 1
2 2016 1 0 1
3 2016 0 1 2
6 2016 1 1 3
7 2016 0 0 2
9 2015 1 0 1
11 2015 1 0 1
12 2015 0 1 2
15 2014 0 1 1
20 2014 1 0 1
60 2013 1 0 1')
# option 1: two dataframes, then join
freq_df <- df %>%
  group_by(year) %>%
  filter(MIAPRFCD_J8 == 1 | MIAPRFCD_55 == 1) %>%
  summarise(Freq = n())

reason_df <- df %>%
  group_by(year) %>%
  filter(sum == 1 | (MIAPRFCD_J8 == 1 & MIAPRFCD_55 == 1)) %>%
  summarise(reason_lv = n())

inner_join(freq_df, reason_df, by = "year")
#> # A tibble: 4 x 3
#> year Freq reason_lv
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2014 2 2
#> 3 2015 4 3
#> 4 2016 3 2
Option 2: add boolean variables for whether the observation should count toward the Freq calculation and whether it should count toward the reason_lv calculation. Separate indicator variables help here because those two conditions aren't mutually exclusive.
# option 2: binary variables
df %>%
  mutate(getFreq = (MIAPRFCD_J8 == 1 | MIAPRFCD_55 == 1)) %>%
  mutate(getReason = (sum == 1 | (MIAPRFCD_J8 == 1 & MIAPRFCD_55 == 1))) %>%
  group_by(year) %>%
  summarise(Freq = sum(getFreq), reason_lv = sum(getReason))
#> # A tibble: 4 x 3
#> year Freq reason_lv
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2014 2 2
#> 3 2015 4 3
#> 4 2016 3 2
Created on 2018-04-23 by the reprex package (v0.2.0).
