Deleting unnecessary rows after column shuffling in a data frame in R

I have a data frame like the one below. The Status of each ID is recorded at different time points: 0 means the person is alive and 1 means dead.
ID Status
1 0
1 0
1 1
2 0
2 0
2 0
3 0
3 0
3 0
3 1
I want to shuffle the Status column so that each ID has a status of 1 at most once. Every row after that 1 should be NA. For instance, I want my data frame to look like this after shuffling:
ID Status
1 0
1 0
1 0
2 0
2 1
2 NA
3 0
3 1
3 NA
3 NA

From the data you posted and your example output, it looks like you want to randomly sample df$Status and then do the replacement. To get what you want in one step you could do:
set.seed(3)
df$Status <- ave(sample(df$Status), df$ID, FUN = function(x) replace(x, which(cumsum(x)>=1)[-1], NA))
df
# ID Status
#1 1 0
#2 1 0
#3 1 0
#4 2 1
#5 2 NA
#6 2 NA
#7 3 0
#8 3 0
#9 3 1
#10 3 NA
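To see what the replace() step does on its own, here is a minimal sketch on a fixed, unshuffled vector. The helper name na_after_first_one and the toy vectors are made up for illustration; they are not part of the answer above.

```r
# Hypothetical helper: keep everything up to and including the first 1,
# and set every later value in the group to NA.
na_after_first_one <- function(x) {
  hits <- which(cumsum(x) >= 1)  # positions at or after the first 1
  replace(x, hits[-1], NA)       # keep the first hit, blank out the rest
}

status <- c(0, 1, 0,  0, 0, 1,  1, 0, 0)  # toy data for three IDs
id     <- c(1, 1, 1,  2, 2, 2,  3, 3, 3)

res <- ave(status, id, FUN = na_after_first_one)
print(res)  # 0 1 NA 0 0 1 1 NA NA
```

An ID with no 1 at all is left unchanged, because hits is then empty and replace() has nothing to overwrite.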

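One option is to take the cumsum of the cumsum of Status to find the first 1 for each ID: the double cumulative sum is greater than 1 only on rows after the first 1.
Note that I have modified the OP's sample data frame so that it represents the data after shuffling.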
library(dplyr)
df %>% group_by(ID) %>%
mutate(Sum = cumsum(cumsum(Status))) %>%
mutate(Status = ifelse(Sum > 1, NA, Status)) %>%
select(-Sum)
# # A tibble: 10 x 2
# # Groups: ID [3]
# ID Status
# <int> <int>
# 1 1 0
# 2 1 0
# 3 1 1
# 4 2 0
# 5 2 1
# 6 2 NA
# 7 3 0
# 8 3 1
# 9 3 NA
# 10 3 NA
Data
df <- read.table(text =
"ID Status
1 0
1 0
1 1
2 0
2 1
2 0
3 0
3 1
3 0
3 0", header = TRUE)
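As a rough illustration of why the double cumsum works, here is the same logic on a single toy vector in base R. The variable names are only for this sketch.

```r
status <- c(0, 1, 0, 0)            # toy Status values for one ID

first_pass  <- cumsum(status)      # 0 1 1 1: reaches 1 at the first 1
second_pass <- cumsum(first_pass)  # 0 1 2 3: exceeds 1 only after the first 1

masked <- ifelse(second_pass > 1, NA, status)
print(masked)                      # 0 1 NA NA
```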

Related

Counting Frequencies of Sequences

Suppose there are two students, and each student takes an exam multiple times (e.g. result_id = 1 is the first exam, result_id = 2 is the second exam, etc.). The student can either "pass" (1) or "fail" (0).
The data looks something like this:
library(data.table)
my_data = data.frame(id = c(1,1,1,1,1,1,2,2,2,2,2,2,2,2,2), results = c(0,1,0,1,0,0,1,1,1,0,1,1,0,1,0), result_id = c(1,2,3,4,5,6,1,2,3,4,5,6,7,8,9))
my_data = setDT(my_data)
id results result_id
1: 1 0 1
2: 1 1 2
3: 1 0 3
4: 1 1 4
5: 1 0 5
6: 1 0 6
7: 2 1 1
8: 2 1 2
9: 2 1 3
10: 2 0 4
11: 2 1 5
12: 2 1 6
13: 2 0 7
14: 2 1 8
15: 2 0 9
I am interested in counting the number of times that a student passes an exam, given that the student passed the previous two exams.
I tried to do this with the following code:
my_data$current_exam = shift(my_data$results, 0)
my_data$prev_exam = shift(my_data$results, 1)
my_data$prev_2_exam = shift(my_data$results, 2)
# Count the number of exam results for each record
out <- my_data[!is.na(prev_exam), .(tally = .N), by = .(id, current_exam, prev_exam, prev_2_exam)]
out = na.omit(out)
My code produces the following results:
> out
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 2
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 1
However, I do not think that my code is correct.
For example, with Student_ID = 2 :
My code says that "Current_Exam = 1, Prev_Exam = 1, Prev_2_Exam = 0" happens 1 time, but looking at the actual data - this does not happen at all
Can someone please show me what I am doing wrong and how I can correct this?
Note: I think that this should be the expected output:
> expected_output
id current_exam prev_exam prev_2_exam tally
1: 1 0 1 0 2
2: 1 1 0 1 1
3: 1 0 0 1 1
4: 2 1 0 0 1
5: 2 1 1 0 1
6: 2 1 1 1 1
7: 2 0 1 1 2
8: 2 1 0 1 2
9: 2 0 1 0 0
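Your code shifts results across the whole table, so the first rows of one student pick up the last results of the previous student. The lagged values have to be computed within each id, which leaves NA at the start of each student's record.
. <- as.data.frame(my_data)[order(my_data$id, my_data$result_id), ] # plain data frame, sorted if needed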
.$p1 <- ave(.$results, .$id, FUN = \(x) c(NA, x[-length(x)]))
.$p2 <- ave(.$p1, .$id, FUN = \(x) c(NA, x[-length(x)]))
aggregate(list(tally=.$p1), .[c("id","results", "p1", "p2")], length)
# id results p1 p2 tally
#1 1 0 1 0 2
#2 2 0 1 0 1
#3 2 1 1 0 1
#4 1 0 0 1 1
#5 1 1 0 1 1
#6 2 1 0 1 2
#7 2 0 1 1 2
#8 2 1 1 1 1
.
# id results result_id p1 p2
#1 1 0 1 NA NA
#2 1 1 2 0 NA
#3 1 0 3 1 0
#4 1 1 4 0 1
#5 1 0 5 1 0
#6 1 0 6 0 1
#7 2 1 1 NA NA
#8 2 1 2 1 NA
#9 2 1 3 1 1
#10 2 0 4 1 1
#11 2 1 5 0 1
#12 2 1 6 1 0
#13 2 0 7 1 1
#14 2 1 8 0 1
#15 2 0 9 1 0
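To make the boundary problem concrete, here is a small sketch comparing a lag over the whole column with a lag computed per id. The toy vectors and the helper name lag1 are assumptions for illustration only.

```r
results <- c(0, 1, 1, 1)  # toy data: two students with two exams each
id      <- c(1, 1, 2, 2)

# Hypothetical helper: shift a vector down by one position, padding with NA.
lag1 <- function(x) c(NA, x[-length(x)])

global_lag  <- lag1(results)                 # ignores the id boundary
grouped_lag <- ave(results, id, FUN = lag1)  # restarts for each id

print(global_lag)   # NA 0 1 1  (student 2's first exam "sees" student 1's last)
print(grouped_lag)  # NA 0 NA 1
```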
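An option would be to use stats::filter to compute a rolling sum over three consecutive exams; a value of 3 marks three passes in a row. Calling it as stats::filter avoids picking up dplyr::filter if dplyr is attached.
cbind(., n=ave(.$results, .$id, FUN = \(x) stats::filter(x, c(1,1,1), sides=1)))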
# id results result_id n
#1 1 0 1 NA
#2 1 1 2 NA
#3 1 0 3 1
#4 1 1 4 2
#5 1 0 5 1
#6 1 0 6 1
#7 2 1 1 NA
#8 2 1 2 NA
#9 2 1 3 3
#10 2 0 4 2
#11 2 1 5 2
#12 2 1 6 2
#13 2 0 7 2
#14 2 1 8 2
#15 2 0 9 1
If you only need the number of times that a student passes an exam, given that the student passed the previous two exams:
sum(ave(.$results, .$id, FUN = \(x) stats::filter(x, c(1,1,1))==3), na.rm=TRUE)
#[1] 1
sum(ave(.$results, .$id, FUN = \(x)
x==1 & c(x[-1], 0) == 1 & c(x[-1:-2], 0, 0) == 1))
#[1] 1
When trying to count events that happen in series, cumsum() comes in quite handy. As opposed to creating multiple lagged variables, this scales well to counts across a larger number of events:
library(tidyverse)
d <- my_data |>
group_by(id) |> # group to cumulate within student only
mutate(
csum = cumsum(results), # cumulative sum of results
i = csum - lag(csum, 3, 0) # subtract the cumulative sum from 3 observations earlier; this gives the number of exams passed in the current and previous 2 observations
)
# Ungroup to get global count
d |>
ungroup() |>
count(i == 3) # count the cases where the number of exams passed within 3 observations equals 3
#> # A tibble: 2 × 2
#> `i == 3` n
#> <lgl> <int>
#> 1 FALSE 14
#> 2 TRUE 1
# Retaining the group gives counts by student
d |>
count(i == 3) # count the cases where the number of exams passed within 3 observations equals 3
#> # A tibble: 3 × 3
#> # Groups: id [2]
#> id `i == 3` n
#> <dbl> <lgl> <int>
#> 1 1 FALSE 6
#> 2 2 FALSE 8
#> 3 2 TRUE 1
Since you provided the data as data.table, here is how to do the same in that ecosystem:
my_data[ , csum := cumsum(results), .(id)]
my_data[ , i := csum - shift(csum, 3, fill = 0), .(id)] # shift() is data.table's own lag
my_data[ , .(n_cases = sum(i == 3)), id]
#> id n_cases
#> 1: 1 0
#> 2: 2 1
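For readers who want to check the cumulative-sum trick without any packages, here is a base R sketch of the same idea. The function name passes_in_window and its parameter k are made up for this example.

```r
# Hypothetical helper: number of passes in the current and previous k - 1 exams,
# computed as a difference of cumulative sums.
passes_in_window <- function(x, k = 3) {
  cs <- cumsum(x)
  # cumulative sum k positions earlier, treated as 0 before the first exam
  lagged <- c(rep(0, k), cs)[seq_along(cs)]
  cs - lagged
}

student2 <- c(1, 1, 1, 0, 1, 1, 0, 1, 0)  # results of student 2 from the question
window   <- passes_in_window(student2)

print(window)            # 1 2 3 2 2 2 2 2 1
print(sum(window == 3))  # 1
```

To get per-student counts this would still need to be applied within each id, for example through ave() or tapply().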
Here's an approach using dplyr. It uses the lag function to look back 1 and 2 results. If the sum together with the current result is 3, then the condition is met. In the example you provided, the condition is met only once.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(!is.na(threex))
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 1 0 3 0
2 1 1 4 0
3 1 0 5 0
4 1 0 6 0
5 2 1 3 1
6 2 0 4 0
7 2 1 5 0
8 2 1 6 0
9 2 0 7 0
10 2 1 8 0
11 2 0 9 0
If you then just want to capture the cases when the condition is met, add a filter.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1)
id results result_id threex
<dbl> <dbl> <dbl> <dbl>
1 2 1 3 1
If you are looking to understand how many times the condition is met per id, you can do this.
my_data %>%
group_by(id) %>%
mutate(threex = ifelse(results + lag(results,1) + lag(results, 2) == 3, 1, 0)) %>%
filter(threex == 1) %>%
select(id) %>%
summarize(count = n())
id count
<dbl> <int>
1 2 1

Filling in NA values with a sequence by group

I have a data set that looks like the following:
ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3
The first row of each ID starts with 0. I want to fill the NA values with sequential values by group. If there are values before and after the NA values, I need to fill the NA values with a sequence counting up to the first value after the NA values. If there are no values after the NA values, I need to fill the NA values with a sequence counting up from the last value before the NA value. The output should look like following:
ID Count
1 0
1 1
1 1
1 2
1 3
1 4
1 5
1 6
1 7
2 0
2 1
2 2
2 3
This is a little complicated, but I think this does what you want. I left all my helper columns in so you can see what's happening, but the non-needed columns can all be dropped at the end.
library(dplyr)
library(vctrs)
df %>%
group_by(ID, na_group = cumsum(!is.na(Count))) %>%
mutate(n_til_non_na = ifelse(is.na(Count), rev(row_number()), 0L)) %>%
group_by(ID) %>%
mutate(
fill_down = vec_fill_missing(Count, direction = "down"),
fill_up = vec_fill_missing(Count, direction = "up"),
result = case_when(
is.na(fill_up) ~ fill_down + cumsum(is.na(fill_up)),
is.na(Count) ~ fill_up - n_til_non_na,
TRUE ~ Count
)
) %>%
ungroup()
# # A tibble: 13 × 7
# ID Count na_group n_til_non_na fill_down fill_up result
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 1 0 0 0 0
# 2 1 1 2 0 1 1 1
# 3 1 NA 2 1 1 2 1
# 4 1 2 3 0 2 2 2
# 5 1 NA 3 5 2 NA 3
# 6 1 NA 3 4 2 NA 4
# 7 1 NA 3 3 2 NA 5
# 8 1 NA 3 2 2 NA 6
# 9 1 NA 3 1 2 NA 7
# 10 2 0 4 0 0 0 0
# 11 2 NA 4 2 0 3 1
# 12 2 NA 4 1 0 3 2
# 13 2 3 5 0 3 3 3
Using this sample data:
df = read.table(text = 'ID Count
1 0
1 1
1 NA
1 2
1 NA
1 NA
1 NA
1 NA
1 NA
2 0
2 NA
2 NA
2 3', header = TRUE)
You can use purrr::accumulate(), first backward, then forward. Going backward, replace each missing value with the following value - 1 to count down to it; then going forward, replace the remaining missing values with the previous value + 1 to count up. Here dat is the question's data set.
library(dplyr)
library(purrr)
dat %>%
group_by(ID) %>%
mutate(
Count = accumulate(
Count,
\(x, y) ifelse(is.na(x), y - 1, x),
.dir = "backward"
),
Count = accumulate(
Count,
\(x, y) ifelse(is.na(y), x + 1, y)
)
) %>%
ungroup()
# A tibble: 13 × 2
ID Count
<dbl> <dbl>
1 1 0
2 1 1
3 1 1
4 1 2
5 1 3
6 1 4
7 1 5
8 1 6
9 1 7
10 2 0
11 2 1
12 2 2
13 2 3
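The same two-pass idea can be sketched in base R with plain loops, which may make the order of operations easier to follow. The helper name fill_seq is an assumption for this example, and the data are retyped from the question.

```r
# Hypothetical helper: fill NA runs by counting down to the next known value,
# then counting up from the last known value for any NAs that remain.
fill_seq <- function(x) {
  n <- length(x)
  if (n < 2) return(x)
  for (i in rev(seq_len(n - 1))) {  # backward pass
    if (is.na(x[i]) && !is.na(x[i + 1])) x[i] <- x[i + 1] - 1
  }
  for (i in 2:n) {                  # forward pass
    if (is.na(x[i])) x[i] <- x[i - 1] + 1
  }
  x
}

ID    <- c(rep(1, 9), rep(2, 4))
Count <- c(0, 1, NA, 2, NA, NA, NA, NA, NA, 0, NA, NA, 3)

filled <- ave(Count, ID, FUN = fill_seq)
print(filled)  # 0 1 1 2 3 4 5 6 7 0 1 2 3
```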

Create variable that flags an ID if it has existed in any previous month

I am unsure of how to create a variable that flags an ID in the current month if the ID has existed in any previous month.
Example data:
ID<-c(1,2,3,2,3,4,1,5)
Month<-c(1,1,1,2,2,2,3,3)
Flag<-c(0,0,0,1,1,0,1,0)
have<-cbind(ID,Month)
> have
ID Month
1 1
2 1
3 1
2 2
3 2
4 2
1 3
5 3
want:
> want
ID Month Flag
1 1 0
2 1 0
3 1 0
2 2 1
3 2 1
4 2 0
1 3 1
5 3 0
A data.table approach (it assumes the rows are ordered by Month):
library(data.table)
# set to data.table format
DT <- as.data.table(have)
# initialise Signal column
DT[, Signal := 0]
# flag duplicates with a 1
DT[duplicated(ID), Signal := 1, by = Month][]
ID Month Signal
1: 1 1 0
2: 2 1 0
3: 3 1 0
4: 2 2 1
5: 3 2 1
6: 4 2 0
7: 1 3 1
8: 5 3 0
The idea comes from akrun in the comments. Here is the dplyr version:
First use as_tibble() to convert the matrix to a tibble,
then use ifelse() with duplicated(), as @akrun suggested.
library(tibble)
library(dplyr)
have %>%
as_tibble() %>%
mutate(flag = ifelse(duplicated(ID),1,0))
ID Month flag
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 0
4 2 2 1
5 3 2 1
6 4 2 0
7 1 3 1
8 5 3 0
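Both duplicated() answers depend on row order, and they would also flag an ID that appears twice within the same month. If that matters, one possible alternative is to compare each row's month with the first month in which its ID appears. This is only a sketch in base R; the column name first_month is made up here.

```r
ID    <- c(1, 2, 3, 2, 3, 4, 1, 5)
Month <- c(1, 1, 1, 2, 2, 2, 3, 3)
have  <- data.frame(ID, Month)

# Earliest month seen for each ID, repeated on every row of that ID
first_month <- ave(have$Month, have$ID, FUN = min)

# Flag rows whose month is later than the ID's first month
have$Flag <- as.integer(have$Month > first_month)
print(have$Flag)  # 0 0 0 1 1 0 1 0
```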

Applying a function for calculating AUC for each subject

I want to calculate the area under the curve (AUC) of concentration-TIME profiles for many subjects (~200 subjects). I am using the package MESS where:
AUC = auc(data$TIME,data$CONC, type = "spline")
How can I apply it to each unique ID in the data set and keep the results in R by adding a new "AUC" column to the original data set?
The data has the following columns:
ID TIME CONC
1 0 0
1 2 4
1 3 7
2 0 0
2 1 NA
2 3 5
2 4 10
One way would be like this. foo is your data.
library(MESS)
library(dplyr)
foo %>%
group_by(ID) %>%
summarize(AUC = auc(TIME,CONC, type = "spline"))
# ID AUC
#1 1 9.12500
#2 2 12.08335
If you want to keep all data, you could do this.
foo %>%
group_by(ID) %>%
mutate(AUC = auc(TIME,CONC, type = "spline"))
# ID TIME CONC AUC
#1 1 0 0 9.12500
#2 1 2 4 9.12500
#3 1 3 7 9.12500
#4 2 0 0 12.08335
#5 2 1 NA 12.08335
#6 2 3 5 12.08335
#7 2 4 10 12.08335
In my opinion, the dplyr solution provided by @jazzurro is the way to go, but here's a base approach for good measure.
d <- read.table(text='ID TIME CONC
1 0 0
1 2 4
1 3 7
2 0 0
2 1 NA
2 3 5
2 4 10', header=TRUE)
library(MESS)
auc <- t(sapply(split(d, d$ID), function(x) {
data.frame(ID=x$ID[1], auc=auc(x$TIME, x$CONC, type='spline'))
}))
merge(d, auc)
# ID TIME CONC auc
# 1 1 0 0 9.125
# 2 1 2 4 9.125
# 3 1 3 7 9.125
# 4 2 0 0 12.08335
# 5 2 1 NA 12.08335
# 6 2 3 5 12.08335
# 7 2 4 10 12.08335
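If MESS is not available, the same split-apply-combine pattern can be sketched with a hand-written trapezoidal rule. Note that this swaps in linear interpolation for the spline used above, so the numbers differ from the MESS output; trapz_auc is a hypothetical helper, and it assumes TIME is already sorted within each ID.

```r
d <- data.frame(ID   = c(1, 1, 1, 2, 2, 2, 2),
                TIME = c(0, 2, 3, 0, 1, 3, 4),
                CONC = c(0, 4, 7, 0, NA, 5, 10))

# Hypothetical helper: trapezoidal AUC after dropping incomplete pairs
trapz_auc <- function(time, conc) {
  ok   <- !is.na(time) & !is.na(conc)
  time <- time[ok]
  conc <- conc[ok]
  sum(diff(time) * (head(conc, -1) + tail(conc, -1)) / 2)
}

# One AUC per subject, named by ID
auc_by_id <- sapply(split(d, d$ID), function(g) trapz_auc(g$TIME, g$CONC))

# Copy each subject's AUC back onto its rows
d$AUC <- unname(auc_by_id[as.character(d$ID)])
print(auc_by_id)  # ID 1: 9.5, ID 2: 15
```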

expand data.frame to long format and increment value

I would like to convert my data from a short format to a long format and I imagine there is a simple way to do it (possibly with reshape2, plyr, dplyr, etc?).
For example, I have:
foo <- data.frame(id = 1:5,
y = c(0, 1, 0, 1, 0),
time = c(2, 3, 4, 2, 3))
id y time
1 0 2
2 1 3
3 0 4
4 1 2
5 0 3
I would like to expand/copy each row n times, where n is that row's value in the "time" column. However, I would also like the variable "time" to be incremented from 1 to n. That is, I would like to produce:
id y time
1 0 1
1 0 2
2 1 1
2 1 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 1 1
4 1 2
5 0 1
5 0 2
5 0 3
As a bonus, I would also like to do a sort of incrementing of the variable "y" where, for those ids with y = 1, y is set to 0 until the largest value of "time". That is, I would like to produce:
id y time
1 0 1
1 0 2
2 0 1
2 0 2
2 1 3
3 0 1
3 0 2
3 0 3
3 0 4
4 0 1
4 1 2
5 0 1
5 0 2
5 0 3
This seems like something that dplyr might already do, but I just don't know where to look. Regardless, any solution that avoids loops is helpful.
You can create a new data frame with the proper id and time columns for the long format, then merge that with the original. This leaves NA for the unmatched values, which can then be substituted with 0:
merge(foo,
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
),
all.y=TRUE
)
## id time y
## 1 1 1 NA
## 2 1 2 0
## 3 2 1 NA
## 4 2 2 NA
## 5 2 3 1
## 6 3 1 NA
## 7 3 2 NA
## 8 3 3 NA
## 9 3 4 0
## 10 4 1 NA
## 11 4 2 1
## 12 5 1 NA
## 13 5 2 NA
## 14 5 3 0
A similar merge works for the first expansion. Merge foo without the time column with the same created data frame as above:
merge(foo[c('id','y')],
with(foo,
data.frame(id=rep(id,time), time=sequence(time))
)
)
## id y time
## 1 1 0 1
## 2 1 0 2
## 3 2 1 1
## 4 2 1 2
## 5 2 1 3
## 6 3 0 1
## 7 3 0 2
## 8 3 0 3
## 9 3 0 4
## 10 4 1 1
## 11 4 1 2
## 12 5 0 1
## 13 5 0 2
## 14 5 0 3
It's not necessary to specify all (or all.y) in the latter expression because there are multiple time values for each matching id value, and these are expanded. In the prior case, the time values were matched from both data frames, and without specifying all (or all.y) you would get your original data back.
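For reference, this is roughly what the two building blocks of the created data frame produce on their own. The short vectors here are toy values, not the full foo.

```r
time <- c(2, 3, 4)         # toy "time" values for ids 1, 2, 3

ids   <- rep(1:3, time)    # each id repeated time[i] times
steps <- sequence(time)    # 1..time[i] for each id, concatenated

print(ids)    # 1 1 2 2 2 3 3 3 3
print(steps)  # 1 2 1 2 3 1 2 3 4
```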
The initial expansion can be achieved with:
newdat <- transform(
foo[rep(rownames(foo),foo$time),],
time = sequence(foo$time)
)
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 1 1
#2.1 2 1 2
#2.2 2 1 3
# etc
To get the complete solution, including the bonus part, then do:
newdat$y[-cumsum(foo$time)] <- 0
# id y time
#1 1 0 1
#1.1 1 0 2
#2 2 0 1
#2.1 2 0 2
#2.2 2 1 3
#etc
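The negative index works because cumsum(foo$time) gives the position of each id's last row in the expanded data. A small sketch with the question's values, using plain vectors instead of the data frame:

```r
time <- c(2, 3, 4, 2, 3)   # foo$time from the question
y    <- c(0, 1, 0, 1, 0)   # foo$y from the question

last_rows <- cumsum(time)  # 2 5 9 11 14: last row of each id after expansion
y_long    <- rep(y, time)  # y repeated for every expanded row

y_long[-last_rows] <- 0    # zero out everything except each id's last row
print(y_long)              # 0 0 0 0 1 0 0 0 0 0 1 0 0 0
```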
If you were really excitable, you could do it all in one step using within:
within(
foo[rep(rownames(foo),foo$time),],
{
time <- sequence(foo$time)
y[-cumsum(foo$time)] <- 0
}
)
If you're willing to go with "data.table", you can try:
library(data.table)
fooDT <- as.data.table(foo)
fooDT[, list(time = sequence(time)), by = list(id, y)]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 1 1
# 4: 2 1 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 1 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
And, for the bonus question:
fooDT[, list(time = sequence(time)),
by = list(id, y)][, y := {y[seq_len(.N - 1L)] <- 0; y}, # seq_len() stays correct when an id has a single row
by = id][]
# id y time
# 1: 1 0 1
# 2: 1 0 2
# 3: 2 0 1
# 4: 2 0 2
# 5: 2 1 3
# 6: 3 0 1
# 7: 3 0 2
# 8: 3 0 3
# 9: 3 0 4
# 10: 4 0 1
# 11: 4 1 2
# 12: 5 0 1
# 13: 5 0 2
# 14: 5 0 3
For the bonus question, alternatively:
fooDT[, list(time=seq_len(time)), by=list(id,y)][y == 1,
y := c(rep.int(0, .N-1L), 1), by=id][]
With dplyr (and magrittr for nice legibility):
library(magrittr)
library(dplyr)
foo[rep(1:nrow(foo), foo$time), ] %>%
group_by(id) %>%
mutate(time = 1:n(),
# keep the original y only on each id's last row, 0 elsewhere
y = ifelse(time == n(), y, 0))
Hope it helps
