insert NA if previous variable is 1 grouped by unique id - r

I have clinical data that records each patient at three time points, with a disease outcome indicated by a binary variable. It looks something like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,0,1,NA,0,0,0,1,0)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data
I want to create an outcome incidence variable. There are two conditions:
When a patient is coded a 1, I would like there to be a NA for any time period after for that patient.
If a patient has NA at time point 1 then time point 2 and 3 should also be NA.
For the example data it should now look like this:
patientid <- c(100,100,100,101,101,101,102,102,102)
time <- c(1,2,3,1,2,3,1,2,3)
outcome <- c(0,0,1,NA,NA,NA,0,1,NA)
Data<- data.frame(patientid=patientid,time=time,outcome=outcome)
Data

Not the smartest way, but you could use dplyr:
library(dplyr)
Data %>%
group_by(patientid) %>%
mutate(outcome = outcome * (1 - na_if(lag(cumsum(outcome) > 0, default = 0), 1))) %>%
ungroup()
which returns
# A tibble: 9 x 3
patientid time outcome
<dbl> <dbl> <dbl>
1 100 1 0
2 100 2 0
3 100 3 1
4 101 1 NA
5 101 2 NA
6 101 3 NA
7 102 1 0
8 102 2 1
9 102 3 NA
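To see why this works, it may help to unpack the expression for one patient at a time in base R. The sketch below is an illustration rather than the answer's own code: mask_after_event is a hypothetical helper name, and the lag is written by hand as c(FALSE, head(seen, -1)).

```r
# Hypothetical helper: set every value after the first 1 to NA,
# and everything after a leading NA as well (cumsum() propagates NA).
mask_after_event <- function(x) {
  seen <- cumsum(x) > 0                  # has a 1 occurred up to and including this row?
  prev_seen <- c(FALSE, head(seen, -1))  # the same flag, lagged by one row
  mask <- ifelse(prev_seen, NA, 1)       # NA once an earlier row was 1 (or NA)
  x * mask
}

mask_after_event(c(0, 0, 1))   # patient 100
mask_after_event(c(NA, 0, 0))  # patient 101
mask_after_event(c(0, 1, 0))   # patient 102
```

The leading NA case needs no special handling: once cumsum() hits NA, every later comparison is NA too, so the mask is NA from the second row onward.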

We can group_by() patientid, then use purrr::accumulate() and ifelse() here:
library(purrr)
library(dplyr)
Data%>%
group_by(patientid)%>%
mutate(outcome=accumulate(outcome, ~ifelse(.x==1 | is.na(.x), NA, .y)))
# A tibble: 9 x 3
# Groups: patientid [3]
patientid time outcome
<dbl> <dbl> <dbl>
1 100 1 0
2 100 2 0
3 100 3 1
4 101 1 NA
5 101 2 NA
6 101 3 NA
7 102 1 0
8 102 2 1
9 102 3 NA
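If purrr is not available, the same carry-forward logic can be sketched with base R's Reduce(..., accumulate = TRUE). The function name carry_na is made up for this illustration; acc is the value already kept for the previous time point and nxt is the next raw value.

```r
# Base R counterpart of purrr::accumulate(): Reduce() with accumulate = TRUE.
# Once the kept value is 1 or NA, every later time point becomes NA.
carry_na <- function(x) {
  Reduce(function(acc, nxt) if (is.na(acc) || acc == 1) NA_real_ else nxt,
         x, accumulate = TRUE)
}

carry_na(c(0, 1, 0, 0))  # 0 1 NA NA
carry_na(c(NA, 0, 0))    # NA NA NA
```

This can then be applied per patient, for example with ave(Data$outcome, Data$patientid, FUN = carry_na).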

You can create a function and apply it for each patientid.
library(dplyr)
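return_outcome <- function(x) {
  # If the first time point is missing, the whole series becomes NA
  if (is.na(first(x))) return(rep(NA_real_, length(x)))
  # Position of the first 1 (NA if this patient never has the outcome)
  val <- which(x == 1)[1]
  if (!is.na(val) && val < length(x))
    x[(val + 1):length(x)] <- NA
  x
}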
Data %>%
group_by(patientid) %>%
mutate(result = return_outcome(outcome)) %>%
ungroup()
# patientid time outcome result
# <dbl> <dbl> <dbl> <dbl>
#1 100 1 0 0
#2 100 2 0 0
#3 100 3 1 1
#4 101 1 NA NA
#5 101 2 0 NA
#6 101 3 0 NA
#7 102 1 0 0
#8 102 2 1 1
#9 102 3 0 NA

Related

How to count the cumulative number of subgroupings using dplyr?

I'm trying to compute the cumulative count of subgroupings using dplyr, as illustrated and explained in the image below. I am trying to solve for Flag2 in the image. Any recommendations for how to do this?
Beneath the image I also have the reproducible code that runs all columns up through Flag1 which works fine.
Reproducible code:
library(dplyr)
myData <-
data.frame(
Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
Group = c(0,0,1,1,2,2,0,3,3,0,0,0)
)
excelCopy <- myData %>%
group_by(Element) %>%
mutate(Element_Count = row_number()) %>%
mutate(Flag1 = case_when(Group > 0 ~ match(Group, unique(Group)),TRUE ~ Element_Count)) %>%
ungroup()
print.data.frame(excelCopy)
Using row_number() within each Element/Group combination and setting rows where Group is 0 to NA:
library(dplyr)
excelCopy |>
group_by(Element, Group) |>
mutate(Flag2 = ifelse(Group == 0, NA, row_number()))
Element Group Element_Count Flag1 Flag2
<chr> <dbl> <int> <int> <int>
1 A 0 1 1 NA
2 B 0 1 1 NA
3 B 1 2 2 1
4 B 1 3 2 2
5 B 2 4 3 1
6 B 2 5 3 2
7 A 0 2 2 NA
8 C 3 1 1 1
9 C 3 2 1 2
10 C 0 3 3 NA
11 C 0 4 4 NA
12 C 0 5 5 NA
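For comparison, a base R sketch of the same idea using ave(), which applies a function within each combination of grouping vectors. This is an alternative illustration, not part of the answer above; it rebuilds only the Element and Group columns from the question.

```r
myData <- data.frame(
  Element = c("A","B","B","B","B","B","A","C","C","C","C","C"),
  Group   = c(0,0,1,1,2,2,0,3,3,0,0,0)
)

# Running row count within each Element/Group combination
Flag2 <- ave(seq_along(myData$Group), myData$Element, myData$Group, FUN = seq_along)
# Group 0 means "no subgroup", so blank those rows out
Flag2[myData$Group == 0] <- NA
myData$Flag2 <- Flag2
myData
```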

R total multiple columns at once with n()

I don't know if I am missing something very obvious here or not, but I am having trouble getting the desired results format for a count. These are all yes, no or NA answers to a question.
My data looks a bit like:
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
df %>%
group_by(A, B, C)%>%
summarise(count = n())
I have also tried
df %>% count(A, B, C)
with exactly the same results.
I want to count the total number of 0, 1 and NA responses for each column: (rows and columns are interchangeable here, it's the count of response v column format of the table that I'm after.)
Response 0 1 NA
Column A 4 2 1
Column B 3 2 2
Column C 1 3 3
What I am getting instead is
A B C n
0 0 1 1
0 1 0 1
0 1 NA 1
0 NA 1 1
1 0 NA 2
NA NA 1 1
In other words, it's counting the number of times each unique combination of ABC appears. How do I get it to focus on counting the columns and not the rows?
You can apply the table() function across the columns:
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
t(apply(df, 2,table, useNA = "always"))
#> 0 1 <NA>
#> A 4 2 1
#> B 3 2 2
#> C 1 3 3
Created on 2022-08-05 by the reprex package (v2.0.1)
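One caveat worth keeping in mind: apply() only returns a tidy matrix here because every column happens to contain the same set of values. A hedged variant, assuming the only valid responses are 0 and 1, fixes the factor levels up front so that a column missing one of them still yields three cells:

```r
df <- data.frame(
  A = c(0, 1, 0, NA, 0, 1, 0),
  B = c(NA, 0, 1, NA, 0, 0, 1),
  C = c(1, NA, 0, 1, 1, NA, NA)
)

# Fixing the levels guarantees every column yields the same three cells
# (0, 1, <NA>), so sapply() can always simplify the result to a matrix.
counts <- t(sapply(df, function(col) {
  table(factor(col, levels = c(0, 1)), useNA = "always")
}))
counts
```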
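One alternative tidyverse solution would be the following (note that fct_explicit_na() is deprecated as of forcats 1.0.0 in favor of fct_na_value_to_level()):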
library(tidyverse)
df <- read.table(text = " A B C
0 NA 1
1 0 NA
0 1 0
NA NA 1
0 0 1
1 0 NA
0 1 NA ", header = TRUE)
x <- df %>%
mutate(across(everything(), ~fct_explicit_na(as.factor(.x),"NA"))) %>%
map(., ~c(table(.x))) %>%
bind_rows(.id = 'Response')
x
#> # A tibble: 3 × 4
#> Response `0` `1` `NA`
#> <chr> <int> <int> <int>
#> 1 A 4 2 1
#> 2 B 3 2 2
#> 3 C 1 3 3
I guess it might need some data re-shaping if you want to use dplyr::n().
First transform df into a "long" format, which gives a two-column data frame. Then group by every column (group_by_all()) and apply summarize(n()). Finally, transform the result back to a "wide" format.
library(tidyverse)
df %>% pivot_longer(everything(), names_to = "Response") %>%
group_by_all() %>%
summarize(n = n()) %>%
pivot_wider(names_from = "value", values_from = "n")
# A tibble: 3 × 4
# Groups: Response [3]
Response `0` `1` `NA`
<chr> <int> <int> <int>
1 A 4 2 1
2 B 3 2 2
3 C 1 3 3
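Since group_by_all() is superseded in current dplyr, the group-and-summarise step can also be written with count(). The sketch below assumes dplyr and tidyr are installed and rebuilds df as a plain data frame so that it runs on its own.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  A = c(0, 1, 0, NA, 0, 1, 0),
  B = c(NA, 0, 1, NA, 0, 0, 1),
  C = c(1, NA, 0, 1, 1, NA, NA)
)

out <- df %>%
  pivot_longer(everything(), names_to = "Response") %>%
  count(Response, value) %>%   # one row per column/response pair
  pivot_wider(names_from = value, values_from = n)
out
```

count() returns an ungrouped tibble, so there is no leftover grouping to remove afterwards.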
Using table and stack you can try:
t(table(stack(df), useNA = "ifany"))
Output
values
ind 0 1 <NA>
A 4 2 1
B 3 2 2
C 1 3 3
If you find yourself wanting to apply the same operation to multiple columns in your data it could be a hint that you should reshape your data to a "longer" format, such that each row represents a single observation. Once your data is in this format you can use table() to get the summary you're after:
df_tidy <-
df %>%
pivot_longer(cols = everything(), names_to = "group", values_to = "response")
print(df_tidy)
#> # A tibble: 21 x 2
#> group response
#> <chr> <int>
#> 1 A 0
#> 2 B NA
#> 3 C 1
#> 4 A 1
#> 5 B 0
#> 6 C NA
#> 7 A 0
#> 8 B 1
#> 9 C 0
#> 10 A NA
#> # … with 11 more rows
table(df_tidy, useNA = "ifany")
#> response
#> group 0 1 <NA>
#> A 4 2 1
#> B 3 2 2
#> C 1 3 3

Ignore zeros and NAs in cumsum

I need to assign numbers to sets of consecutive values in every column and create new columns. Eventually I want to find a sum of values in z column that correspond to the first consecutive numbers in each column.
My data looks something like this:
library(dplyr)
y1 = c(1,2,3,8,9,0)
y2 = c(0,0,0,4,5,6)
z = c(200,250,200,100,90,80)
yabc <- tibble(y1, y2, z)
# A tibble: 6 × 3
y1 y2 z
<dbl> <dbl> <dbl>
1 1 0 200
2 2 0 250
3 3 0 200
4 8 4 100
5 9 5 90
6 0 6 80
I tried the following formula:
yabc %>%
mutate_at(vars(starts_with("y")),
list(mod = ~ cumsum(c(FALSE, diff(.x)!=1))+1))
that gave me the following result:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 1
2 2 0 250 1 2
3 3 0 200 1 3
4 8 4 100 2 4
5 9 5 90 2 4
6 0 6 80 3 4
I am only interested in numbers greater than zero. I tried replacing zeros with NA, but it did not work either.
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 NA 200 1 1
2 2 NA 250 1 NA
3 3 NA 200 1 NA
4 8 4 100 2 NA
5 9 5 90 2 NA
6 NA 6 80 NA NA
What I would like the data to look like is:
# A tibble: 6 × 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
Is there any way to exclude zeros and start applying the formula only when .x is greater than 0? Or any other way to make the formula work the way I need? Thank you!
FYI: mutate_at has been superseded by across; I'll demonstrate the new method in my code.
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)) + 1L,
NA_integer_) )
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 2
# 5 9 5 90 2 2
# 6 0 6 80 NA 2
If this is sufficient (you don't care if it's 1 or 2 for the first effective group in y2_mod), then you're good. If you want to reduce them all to be 1-based, then
yabc %>%
mutate(
across(starts_with("y"),
list(mod = ~ if_else(.x > 0,
cumsum(.x > 0 & c(FALSE, diff(.x) != 1)),
NA_integer_))),
across(ends_with("_mod"),
~ if_else(is.na(.x), .x, match(.x, na.omit(unique(.x))))
)
)
# # A tibble: 6 x 5
# y1 y2 z y1_mod y2_mod
# <dbl> <dbl> <dbl> <int> <int>
# 1 1 0 200 1 NA
# 2 2 0 250 1 NA
# 3 3 0 200 1 NA
# 4 8 4 100 2 1
# 5 9 5 90 2 1
# 6 0 6 80 NA 1
Notes:
if_else is helpful for handling the NA-producing rows, but it requires its true and false results to share a type, which can be annoying or confusing. That is why the false= (third) argument has to be the type-specific NA. For example, cumsum(.)+1 produces a numeric (double), so the third argument would need to be NA_real_, since the default NA is logical. The alternatives are to use cumsum(.)+1L, which stays integer, together with NA_integer_, or (as in my second example) to use cumsum(.) by itself with NA_integer_, since the values are re-matched later and match(.) returns an integer.
I show the shift from your mutate_at to mutate(across(..)). One important difference from an ordinary mutate call is that across is run without assigning its return value to a name. In essence, it returns a named list in which each element is either an updated column or a new one, depending on .names. That argument takes a glue-like string for naming the calculated columns, so that new columns are added instead of the default action (no .names) of overwriting the columns in place. The other way of producing new (not in-place) columns is the one you used, a named list of functions, which is still a common and supported way to use across(..).
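The type bookkeeping described in the first note can be checked directly in base R. This is only a quick illustration on a made-up logical vector:

```r
flags <- c(TRUE, FALSE, TRUE)

typeof(NA)                  # "logical"
typeof(NA_integer_)         # "integer"
typeof(NA_real_)            # "double"
typeof(cumsum(flags))       # "integer": cumsum() of a logical stays integer
typeof(cumsum(flags) + 1L)  # "integer": adding an integer literal keeps the type
typeof(cumsum(flags) + 1)   # "double":  adding a plain 1 promotes to double
```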
library(data.table)
library(tidyverse)
yabc %>%
mutate(across(starts_with('y'),
~ as.integer(factor(`is.na<-`(rleid(.x - row_number()), !.x))),
.names = '{col}_mod'))
# A tibble: 6 x 5
y1 y2 z y1_mod y2_mod
<dbl> <dbl> <dbl> <int> <int>
1 1 0 200 1 NA
2 2 0 250 1 NA
3 3 0 200 1 NA
4 8 4 100 2 1
5 9 5 90 2 1
6 0 6 80 NA 1
The trick lies in knowing that for consecutive numbers, the difference between the number and their row_number() is the same:
For example, consider:
x <- c(1,2,3,6,7,8,10,11,12)
The consecutive numbers can be grouped as:
x - seq_along(x)
[1] 0 0 0 2 2 2 3 3 3
As you can see, consecutive numbers share the same difference. To turn those differences into group ids, we can use rleid():
rleid(x-seq_along(x))
[1] 1 1 1 2 2 2 3 3 3
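If data.table is not loaded, the same run ids can be sketched in base R with cumsum() over the points where the key changes. This is an equivalent for this example, not a claim about how rleid() is implemented:

```r
x <- c(1, 2, 3, 6, 7, 8, 10, 11, 12)

key    <- x - seq_along(x)                    # constant within a run of consecutive numbers
run_id <- cumsum(c(TRUE, diff(key) != 0))     # start a new id whenever the key changes
run_id
# [1] 1 1 1 2 2 2 3 3 3
```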
Another possible solution:
library(tidyverse)
y1=c(1,2,3,8,9,0)
y2=c(0,0,0,4,5,6)
z=c(200,250,200,100,90,80)
yabc<-tibble(y1,y2,z)
yabc %>%
mutate(across(starts_with("y"),
~if_else(.x==0, NA_real_, 1+cumsum(c(1,diff(.x)) != 1)), .names="{.col}_mod"))%>%
mutate(across(ends_with("mod"), ~ factor(.x) %>% as.numeric(.)))
#> # A tibble: 6 × 5
#> y1 y2 z y1_mod y2_mod
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 0 200 1 NA
#> 2 2 0 250 1 NA
#> 3 3 0 200 1 NA
#> 4 8 4 100 2 1
#> 5 9 5 90 2 1
#> 6 0 6 80 NA 1

Aggregate rows with specific shared value

I want to aggregate my data as follows:
Aggregate only for successive rows where status = 0
Keep age and sum up points
Example data:
da <- data.frame(userid = c(1,1,1,1,2,2,2,2), status = c(0,0,0,1,1,1,0,0), age = c(10,10,10,11,15,16,16,16), points = c(2,2,2,6,3,5,5,5))
da
userid status age points
1 1 0 10 2
2 1 0 10 2
3 1 0 10 2
4 1 1 11 6
5 2 1 15 3
6 2 1 16 5
7 2 0 16 5
8 2 0 16 5
I would like to have:
da2
userid status age points
1 1 0 10 6
2 1 1 11 6
3 2 1 15 3
4 2 1 16 5
5 2 0 16 10
da %>%
mutate(grp = with(rle(status),
rep(seq_along(values), lengths)) + cumsum(status != 0)) %>%
group_by_at(vars(-points)) %>%
summarise(points = sum(points)) %>%
ungroup() %>%
select(-grp)
## A tibble: 5 x 4
# userid status age points
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 10 6
#2 1 1 11 6
#3 2 0 16 10
#4 2 1 15 3
#5 2 1 16 5
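The answer above comes without commentary, so here is a sketch of how its grp helper column is built for the example data. The intermediate names run_id and bump are introduced here for illustration only.

```r
status <- c(0, 0, 0, 1, 1, 1, 0, 0)

run_id <- with(rle(status), rep(seq_along(values), lengths))  # 1 1 1 2 2 2 3 3
bump   <- cumsum(status != 0)                                 # 0 0 0 1 2 3 3 3
grp    <- run_id + bump                                       # 1 1 1 3 4 5 6 6
grp
```

Rows in the same run of zeros share both run_id and bump, so they land in one group and get summed. Every non-zero row increases bump by one, so it ends up in a group of its own and is left as is.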
You can use group_by from dplyr:
da %>% group_by(da$userid, cumsum(da$status), da$status) %>%
  summarise(age=max(age), points=sum(points))
Output:
`da$userid` `cumsum(da$status)` `da$status` age points
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0 0 10 6
2 1 1 1 11 6
3 2 2 1 15 3
4 2 3 0 16 10
5 2 3 1 16 5
Exactly the same idea as above:
library(dplyr)
# Sum points for the status == 0 rows within each userid/age
data1 <- da %>% group_by(userid, age, status) %>%
  filter(status == 0) %>%
  summarise(points = sum(points))
# Same aggregation for the remaining rows
data2 <- da %>%
  group_by(userid, age, status) %>%
  filter(status != 0) %>%
  summarise(points = sum(points))
da2 <- rbind(data1,
             data2)
We need to be more careful with your specification of status equal to 0. I think Quang Hoang's code works only for your specific example.
I hope it will help.
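To make the "successive rows" requirement explicit, here is a base R sketch that starts a new group whenever userid or status changes and also at every non-zero row, so only runs of consecutive zeros are merged. It assumes each user's rows are contiguous and that age is constant within a run of zeros; the column grp and the vector new_grp are helper names introduced here.

```r
da <- data.frame(userid = c(1,1,1,1,2,2,2,2), status = c(0,0,0,1,1,1,0,0),
                 age = c(10,10,10,11,15,16,16,16), points = c(2,2,2,6,3,5,5,5))

# TRUE where a new group starts: first row, change of user or status,
# or any row with non-zero status (those are never merged)
new_grp <- c(TRUE, diff(da$userid) != 0 | diff(da$status) != 0) | da$status != 0
da$grp  <- cumsum(new_grp)

da2 <- aggregate(points ~ userid + status + age + grp, data = da, FUN = sum)
da2 <- da2[order(da2$grp), c("userid", "status", "age", "points")]
rownames(da2) <- NULL
da2
```

Ordering by grp keeps the rows in their original sequence, which matches the layout of the desired da2 in the question.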

New variable that indicates the first occurrence of a specific value

I want to create a new variable that indicates the first specific observation of a value for a variable.
In the following example dataset I want to have a new variable "firstna" that is "1" for the first observation of "NA" for this player.
game_data <- data.frame(player = c(1,1,1,1,2,2,2,2), level = c(1,2,3,4,1,2,3,4), points = c(20,NA,NA,NA,20,40,NA,NA))
game_data
player level points
1 1 1 20
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 2 1 20
6 2 2 40
7 2 3 NA
8 2 4 NA
The resulting dataframe should look like this:
game_data_new <- data.frame(player = c(1,1,1,1,2,2,2,2), level = c(1,2,3,4,1,2,3,4), points = c(20,NA,NA,NA,20,40,NA,NA), firstna = c(0,1,0,0,0,0,1,0))
game_data_new
player level points firstna
1 1 1 20 0
2 1 2 NA 1
3 1 3 NA 0
4 1 4 NA 0
5 2 1 20 0
6 2 2 40 0
7 2 3 NA 1
8 2 4 NA 0
To be honest, I don't know how to do this. It would be perfect if there were a dplyr option to do so.
A base R solution:
ave(game_data$points, game_data$player,
FUN = function(x) seq_along(x) == match(NA, x, nomatch = 0))
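Another ave option to find the first NA by group (player). The is.na(x) check keeps a non-NA value that follows the first NA from being flagged as well:
game_data$firstna <- ave(game_data$points, game_data$player,
FUN = function(x) is.na(x) & cumsum(is.na(x)) == 1)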
game_data
# player level points firstna
#1 1 1 20 0
#2 1 2 NA 1
#3 1 3 NA 0
#4 1 4 NA 0
#5 2 1 20 0
#6 2 2 40 0
#7 2 3 NA 1
#8 2 4 NA 0
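The example data only has NAs at the end of each player's rows, so it is worth trying any of these approaches on less tidy input. A small sketch, with first_na_flag as a hypothetical helper name, that flags only the first NA and copes with groups that have no NA at all:

```r
# Hypothetical helper: 1 at the position of the first NA, 0 everywhere else
first_na_flag <- function(x) {
  pos <- match(TRUE, is.na(x), nomatch = 0L)  # 0 when the group has no NA
  as.integer(seq_along(x) == pos)
}

first_na_flag(c(20, NA, NA, NA))  # 0 1 0 0
first_na_flag(c(20, NA, 30, NA))  # 0 1 0 0  (a value reappears after the first NA)
first_na_flag(c(20, 40))          # 0 0      (no NA in the group)

# Applied per player on the question's data
player <- c(1,1,1,1,2,2,2,2)
points <- c(20,NA,NA,NA,20,40,NA,NA)
firstna <- ave(points, player, FUN = first_na_flag)
firstna
```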
Here is a solution with data.table:
library("data.table")
game_data <- data.table(player = c(1,1,1,1,2,2,2,2), level = c(1,2,3,4,1,2,3,4), points = c(20,NA,NA,NA,20,40,NA,NA))
game_data[, firstna:=is.na(points) & !is.na(shift(points)), player][]
# > game_data[, firstna:=is.na(points) & !is.na(shift(points)), player][]
# player level points firstna
# 1: 1 1 20 FALSE
# 2: 1 2 NA TRUE
# 3: 1 3 NA FALSE
# 4: 1 4 NA FALSE
# 5: 2 1 20 FALSE
# 6: 2 2 40 FALSE
# 7: 2 3 NA TRUE
# 8: 2 4 NA FALSE
You can do this by grouping by player and then using mutate() to check whether a row has an NA value and the previous row does not:
game_data %>%
group_by(player) %>%
mutate(firstna = ifelse(is.na(points) & lag(!is.na(points)),1,0)) %>%
ungroup()
Result:
# A tibble: 8 x 4
player level points firstna
<dbl> <dbl> <dbl> <dbl>
1 1 1 20 0
2 1 2 NA 1
3 1 3 NA 0
4 1 4 NA 0
5 2 1 20 0
6 2 2 40 0
7 2 3 NA 1
8 2 4 NA 0
library(tidyverse)
library(data.table)
data.frame(
player = c(1,1,1,1,2,2,2,2),
level = c(1,2,3,4,1,2,3,4),
points = c(20,NA,NA,NA,20,40,NA,NA)
) -> game_data
game_data_base1 <- game_data
game_data_dt <- data.table(game_data)
microbenchmark::microbenchmark(
better_base = game_data$first_na <- ave(
game_data$points,
game_data$player,
FUN=function(x) seq_along(x)==match(NA,x,nomatch=0)
),
brute_base = do.call(
rbind.data.frame,
lapply(
split(game_data, game_data$player),
function(x) {
x$firstna <- 0
na_loc <- which(is.na(x$points))
if (length(na_loc) > 0) x$firstna[na_loc[1]] <- 1
x
}
)
),
tidy = game_data %>%
group_by(player) %>%
mutate(firstna=as.numeric(is.na(points) & !duplicated(points))) %>%
ungroup(),
dt = game_data_dt[, firstna:=as.integer(is.na(points) & !is.na(shift(points))), player]
)
## Unit: microseconds
## expr min lq mean median uq max neval
## better_base 125.188 156.861 362.9829 191.6385 355.6675 3095.958 100
## brute_base 366.642 450.002 2782.6621 658.0380 1072.6475 174373.974 100
## tidy 998.924 1119.022 2528.3687 1509.0705 2516.9350 42406.778 100
## dt 330.428 421.211 1031.9978 535.8415 1042.1240 9671.991 100
game_data %>%
group_by(player) %>%
mutate(firstna=as.numeric(is.na(points) & !duplicated(points)))
Group by player, then create a boolean vector for cases that are both NA and not duplicates of previous rows.
# A tibble: 8 x 4
# Groups: player [2]
player level points firstna
<dbl> <dbl> <dbl> <dbl>
1 1 1 20 0
2 1 2 NA 1
3 1 3 NA 0
4 1 4 NA 0
5 2 1 20 0
6 2 2 40 0
7 2 3 NA 1
8 2 4 NA 0
If you want the 1s on the last non-NA line before an NA, replace the mutate line with this:
mutate(lastnonNA=as.numeric(!is.na(points) & is.na(lead(points))))
To flag the first row of every block of consecutive NAs within a player's group, including a block that runs all the way to the end of the group:
game_data %>%
group_by(player) %>%
mutate(firstna=as.numeric(is.na(points) & !duplicated(cbind(points,cumsum(!is.na(points))))))
Another way using base:
game_data$firstna <-
unlist(
tapply(game_data$points, game_data$player, function(x) {i<-which(is.na(x))[1];x[]<-0;x[i]<-1;x})
)
or as another ?ave clone:
ave(game_data$points, game_data$player, FUN = function(x) {
i<-which(is.na(x))[1];x[]<-0;x[i]<-1;x
})
An option using diff
transform(game_data, firstna = ave(is.na(points), player, FUN = function(x) c(0,diff(x))))
# player level points firstna
# 1 1 1 20 0
# 2 1 2 NA 1
# 3 1 3 NA 0
# 4 1 4 NA 0
# 5 2 1 20 0
# 6 2 2 40 0
# 7 2 3 NA 1
# 8 2 4 NA 0
And its dplyr equivalent:
library(dplyr)
game_data %>% group_by(player) %>% mutate(firstna = c(0,diff(is.na(points))))
# # A tibble: 8 x 4
# # Groups: player [2]
# player level points firstna
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 20 0
# 2 1 2 NA 1
# 3 1 3 NA 0
# 4 1 4 NA 0
# 5 2 1 20 0
# 6 2 2 40 0
# 7 2 3 NA 1
# 8 2 4 NA 0
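One thing worth checking before relying on the diff approach: it marks transitions rather than the first NA as such. On a made-up vector (not from the question's data) where a value reappears after an NA, it yields -1 at the return to a value and 1 at the start of every NA run, and it cannot flag an NA in the first row of a group:

```r
x <- c(20, NA, 30, NA)

trans <- c(0, diff(is.na(x)))
trans                           #  0  1 -1  1

# Keeping only the upward transitions marks the start of each NA run
run_start <- as.integer(trans == 1)
run_start                       #  0  1  0  1
```

For data shaped like the example, where NAs only ever trail the observed values, both forms give the same 0/1 column.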

Resources