Assign observation level values by grouping variable - r

Thanks in advance for any help.
I have the data frame below:
> df <- data.frame(
id = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
time = c(1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6,1,2,3,4,5,6),
mortality = c(NA,1,0,0,0,0,NA,0,0,0,0,1,NA,0,0,0,0,0,NA,0,0,1,0,0,NA,0,1,0,0,0)
)
> head(df)
id time mortality
1 1 1 NA
2 1 2 1
3 1 3 0
4 1 4 0
5 1 5 0
6 1 6 0
df$id represents individuals measured at six points in time throughout a survival trial. At the start of the trial all individuals are alive; they subsequently die or remain alive. df$mortality indicates the time period in which the individual died; for example, individual 1 died in time period 2.
I would like to create a new variable indicating what I have called cumulative survival. This would indicate if the individual had died in the current time period or any of the previous time periods. How would I code this?
I have tried a number of different ways using ifelse() statements and dplyr group_by() without success.
Below is what the new data frame should look like. Thanks
> df
id time mortality cum.survival
1 1 1 NA 0
2 1 2 1 1
3 1 3 0 1
4 1 4 0 1
5 1 5 0 1
6 1 6 0 1
7 2 1 NA 0
8 2 2 0 0
9 2 3 0 0
10 2 4 0 0
11 2 5 0 0
12 2 6 1 1
13 3 1 NA 0
14 3 2 0 0
15 3 3 0 0
16 3 4 0 0
17 3 5 0 0
18 3 6 0 0
19 4 1 NA 0
20 4 2 0 0
21 4 3 0 0
22 4 4 1 1
23 4 5 0 1
24 4 6 0 1
25 5 1 NA 0
26 5 2 0 0
27 5 3 1 1
28 5 4 0 1
29 5 5 0 1
30 5 6 0 1

Assuming each person dies at most once, we can also use cumsum.
First replace the NA in mortality with 0 and store the result as cum.survival:
df <- transform(df, cum.survival = replace(mortality, is.na(mortality), 0))
We can then use base R :
df$cum.survival <- with(df, ave(cum.survival, id, FUN = cumsum))
dplyr :
library(dplyr)
df %>% group_by(id) %>% mutate(cum.survival = cumsum(cum.survival))
Or data.table :
library(data.table)
setDT(df)[, cum.survival := cumsum(cum.survival), id]
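As a quick check (a sketch, not part of the original answer), the NA replacement and the cumsum can also be collapsed into a single dplyr step with coalesce(), which reproduces the expected cum.survival column:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(cum.survival = cumsum(coalesce(mortality, 0))) %>%
  ungroup()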
Another option is to compare the row index within the group to the index where the 1 occurs.
We can use which.max, adding a guard for individuals who never die (e.g. id 3), where which.max would otherwise point at the first 0:
df %>%
  group_by(id) %>%
  mutate(cum.survival = +(any(mortality == 1, na.rm = TRUE) &
                            row_number() >= which.max(mortality)))
Or match, replacing the NA produced for groups that contain no 1:
df %>%
  group_by(id) %>%
  mutate(cum.survival = coalesce(+(row_number() >= match(1, mortality)), 0))

An option using by:
df$cum.survival <- unlist(by(df$mortality, df$id, function(x) cummax(replace(x, is.na(x), 0L))))
or ave:
df$cum.survival <- ave(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L)))
or tapply:
df$cum.survival <- unlist(tapply(df$mortality, df$id, FUN=function(x) cummax(replace(x, is.na(x), 0L))))
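To see why cummax() works here, a minimal standalone check on one individual's mortality vector (the values for id 1 above):
x <- c(NA, 1, 0, 0, 0, 0)           # mortality for id 1
cummax(replace(x, is.na(x), 0L))    # 0 1 1 1 1 1 -- stays at 1 once the death is seen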

Which IDs have only zero-counts in variable across all days?

In my dataset there is the variable "cigarettes per day" (CPD) for 21 days and several subjects (ID). I want to know how many and which subjects never smoked (i.e. have only 0 in CPD) across the 21 days.
Here is an example dataset for 3 subjects and 5 days:
day <- c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5)
ID <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
CPD <- c(3,4,0,2,0,0,0,0,0,0,4,0,0,0,1)
df <- data.frame(day, ID, CPD)
what I want would be something like this:
day ID CPD
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
We may use a group-by approach, keeping only the groups where all CPD values are 0
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(CPD %in% 0)) %>%
ungroup
-output
# A tibble: 5 × 3
day ID CPD
<dbl> <dbl> <dbl>
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
Or without grouping
df %>%
filter(!ID %in% ID[CPD != 0])
day ID CPD
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
Or with base R
subset(df, !ID %in% ID[CPD != 0])
Here is a slightly modified dplyr (#akrun) approach:
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(all(CPD == 0))
# Groups: ID [1]
day ID CPD
<dbl> <dbl> <dbl>
1 1 2 0
2 2 2 0
3 3 2 0
4 4 2 0
5 5 2 0
and here is a data.table approach:
library(data.table)
setDT(df)[,if(all(CPD == 0)) .SD , by = ID]
ID day CPD
1: 2 1 0
2: 2 2 0
3: 2 3 0
4: 2 4 0
5: 2 5 0
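Since the question also asks how many and which subjects never smoked, a short follow-up (a sketch, assuming the filtered result is saved as never_smoked) reports both:
library(dplyr)
never_smoked <- df %>%
  group_by(ID) %>%
  filter(all(CPD == 0)) %>%
  ungroup()
n_distinct(never_smoked$ID)   # how many subjects never smoked
unique(never_smoked$ID)       # which subjects they are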

Subsetting data based on a value within ids in r

I'm trying to subset a dataset based on two criteria. Here is a snapshot of my data:
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
ids seq type
1 1 1 1
2 1 2 1
3 1 3 5
4 1 4 1
5 1 5 1
6 1 6 1
7 2 1 1
8 2 2 1
9 2 3 1
10 2 4 8
11 2 5 1
12 2 6 1
13 3 1 1
14 3 2 1
15 3 3 1
16 3 4 1
17 3 5 1
18 3 6 1
ids is the student id, seq is the sequence of the questions (items) students take, and type refers to the type of question: 1 is a simple item, while 5 or 8 are complicated items. What I would like to do first is generate a variable (complex) indicating whether or not the student has a complicated item (type = 5 | 8). Then I would like to get:
> data
ids seq type complex
1 1 1 1 1
2 1 2 1 1
3 1 3 5 1
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
10 2 4 8 1
11 2 5 1 1
12 2 6 1 1
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
The second step is to split the data within students.
(a) For students who have no complex items (complex = 0), I would like to split the data at the halfway point and get the two pieces below:
>simple.split.1
ids seq type complex
13 3 1 1 0
14 3 2 1 0
15 3 3 1 0
>simple.split.2
ids seq type complex
16 3 4 1 0
17 3 5 1 0
18 3 6 1 0
(b) For students who have complex items (complex = 1), I would like to use the complex item as the cutting point and split the data there. The data should then look like this (excluding the complex item itself):
>complex.split.1
ids seq type complex
1 1 1 1 1
2 1 2 1 1
7 2 1 1 1
8 2 2 1 1
9 2 3 1 1
>complex.split.2
ids seq type complex
4 1 4 1 1
5 1 5 1 1
6 1 6 1 1
11 2 5 1 1
12 2 6 1 1
Any thoughts?
Thanks
Here's a way to do it using the data.table and zoo packages and the split function:
library(data.table)
library(zoo)
## set data to data.table & add a flag of 1 where type is 5 or 8
setDT(data)[, complex := ifelse(type == 5 | type == 8, 1, NA_integer_), by = ids]
## carry the complex flag forward and backward within each id
data[, complex := na.locf(na.locf(complex, na.rm = FALSE), na.rm = FALSE, fromLast = TRUE), by = ids]
## replace remaining NA values in the complex column with 0
data[, complex := ifelse(is.na(complex), 0, complex)]
data <- data[!(type == 5 | type == 8), ] ## removing rows where type equals 5 or 8
complex <- split(data, data$complex) ## split data based on complex flag
complex_0 <- as.data.frame(complex$`0`) ## saving as data frame based on complex flag
complex_1 <- as.data.frame(complex$`1`)
split(complex_0, cut(complex_0$seq, 2)) ## split into equal parts
split(complex_1, cut(complex_1$seq, 2))
#$`(0.995,3.5]`
# ids seq type complex
#1 3 1 1 0
#2 3 2 1 0
#3 3 3 1 0
#$`(3.5,6]`
# ids seq type complex
#4 3 4 1 0
#5 3 5 1 0
#6 3 6 1 0
#$`(0.995,3.5]`
# ids seq type complex
#1 1 1 1 1
#2 1 2 1 1
#6 2 1 1 1
#7 2 2 1 1
#8 2 3 1 1
#$`(3.5,6]`
# ids seq type complex
#3 1 4 1 1
#4 1 5 1 1
#5 1 6 1 1
#9 2 5 1 1
#10 2 6 1 1
If you prefer using the tidyverse, here's an approach:
ids <- c(1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3)
seq <- c(1,2,3,4,5,6, 1,2,3,4,5,6, 1,2,3,4,5,6)
type <- c(1,1,5,1,1,1, 1,1,1,8,1,1, 1,1,1,1,1,1)
data <- data.frame(ids, seq, type)
library(dplyr)
step1.data <- data %>%
group_by(ids) %>%
mutate(complex = ifelse(any(type %in% c(5,8)), 1, 0)) %>%
ungroup()
simple.split.1 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq <= mean(seq)) %>% #if you happen to have more than 6 questions in seq, this gives the midpoint
ungroup()
simple.split.2 <- step1.data %>%
filter(complex == 0) %>%
group_by(ids) %>%
filter(seq > mean(seq)) %>%
ungroup()
complex.split.1 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq < min(seq[type %in% c(5,8)])) %>%
ungroup()
complex.split.2 <- step1.data %>%
filter(complex == 1) %>%
arrange(ids, seq) %>%
group_by(ids) %>%
filter(seq > min(seq[type %in% c(5,8)])) %>%
ungroup()

Add column to data frame with loop calculation from another data frame

I have two datasets, one at the individual level and one at the school level. I would like to calculate the proportion of fighting in each school using a loop (since I have >100 schools).
Current code:
for (i in levels(df$school_id)) {
school <- subset(df, school_id == i)
number_students <- nrow(school)
prop <- (sum(school$fight_binary, na.rm = TRUE))/number_students
df$proportion_fight[df$school_id == i] <- prop
}
I tried initializing the new column first, but when I run this loop nothing happens at all.
Here's some sample data
INDIVIDUAL LEVEL:
student_id school_id ever_fight
1 2 1
2 3 0
3 1 1
4 1 1
5 2 0
6 2 0
7 2 0
8 2 0
9 3 1
10 1 0
11 3 1
12 3 1
13 3 1
14 3 1
15 1 0
16 2 0
17 1 0
18 1 0
19 1 0
20 1 0
SCHOOL LEVEL (need to fill the second column with data from above):
school_id proportion_fight
1
2
3
We can use a group-by mean. (The loop as written likely does nothing because levels() returns NULL when school_id is not a factor, so the for loop iterates zero times.)
library(dplyr)
df1 %>%
group_by(school_id) %>%
summarise(proportion_fight = mean(ever_fight, na.rm = TRUE))
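If a base R equivalent is preferred (a sketch, assuming the individual-level data is in df1 as above), aggregate() yields the same school-level table:
aggregate(ever_fight ~ school_id, data = df1, FUN = mean)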

Deleting unnecessary rows after column shuffling in a data frame in R

I have a data frame as below. The Status of each ID is recorded at different time points: 0 means the person is alive and 1 means the person is dead.
ID Status
1 0
1 0
1 1
2 0
2 0
2 0
3 0
3 0
3 0
3 1
I want to shuffle the Status column so that each ID can have a status of 1 at most once; after that 1, I want NA in the remaining rows for that ID. For instance, I want my data frame to look like the one below after shuffling:
ID Status
1 0
1 0
1 0
2 0
2 1
2 NA
3 0
3 1
3 NA
3 NA
From the data you posted and your example output, it looks like you want to randomly sample df$Status and then do the replacement. To get what you want in one step you could do:
set.seed(3)
df$Status <- ave(sample(df$Status), df$ID, FUN = function(x) replace(x, which(cumsum(x)>=1)[-1], NA))
df
# ID Status
#1 1 0
#2 1 0
#3 1 0
#4 2 1
#5 2 NA
#6 2 NA
#7 3 0
#8 3 0
#9 3 1
#10 3 NA
One option is to use a cumsum of cumsum to identify the first 1 appearing within an ID.
Note that I have modified the OP's sample data frame (see Data below) to illustrate the logic of the reshuffling.
library(dplyr)
df %>% group_by(ID) %>%
mutate(Sum = cumsum(cumsum(Status))) %>%
mutate(Status = ifelse(Sum > 1, NA, Status)) %>%
select(-Sum)
# # A tibble: 10 x 2
# # Groups: ID [3]
# ID Status
# <int> <int>
# 1 1 0
# 2 1 0
# 3 1 1
# 4 2 0
# 5 2 1
# 6 2 NA
# 7 3 0
# 8 3 1
# 9 3 NA
# 10 3 NA
Data
df <- read.table(text =
"ID Status
1 0
1 0
1 1
2 0
2 1
2 0
3 0
3 1
3 0
3 0", header = TRUE)

Wide to long format with several variables

This question is related to a previous question I asked on converting from wide to long format in R with an additional complication.
The previous question is here: Wide to long data conversion
The wide data I start with looks like the following:
d2 <- data.frame('id' = c(1,2),
'Q1' = c(2,3),
'Q2' = c(1,3),
'Q3' = c(3,1),
'Q1_X_Opt_1' = c(0,0),
'Q1_X_Opt_2' = c(75,200),
'Q1_X_Opt_3' = c(150,300),
'Q2_X_Opt_1' = c(0,0),
'Q2_X_Opt_2' = c(150,200),
'Q2_X_Opt_3' = c(75,300),
'Q3_X_Opt_1' = c(0,0),
'Q3_X_Opt_2' = c(100,500),
'Q3_X_Opt_3' = c(150,300))
In this example, there are two individuals who have answered three questions. The answer to each question takes one of the values {1,2,3}, encoded in Q1, Q2, and Q3. So, in this example, individual 1 chose option 2 in Q1, option 1 in Q2, and option 3 in Q3.
Associated with each option there is also a variable X (the cost) that I need converted to long format as well. The output I am seeking looks like the following:
id question option choice cost
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
I have tried adapting the code from the answer to the prior question, but with no success so far. Thanks for any suggestions or comments.
It's not exactly elegant, but here's a tidyverse version:
library(tidyverse)
d3 <- d2 %>%
gather(option, cost, -id:-Q3) %>%
gather(question, choice, Q1:Q3) %>%
separate(option, c('question2', 'option'), extra = 'merge') %>%
filter(question == question2) %>%
mutate_at(vars(question, option), parse_number) %>%
mutate(choice = as.integer(option == choice)) %>%
select(1, 5, 3, 6, 4) %>%
arrange(id)
d3
#> id question option choice cost
#> 1 1 1 1 0 0
#> 2 1 1 2 1 75
#> 3 1 1 3 0 150
#> 4 1 2 1 1 0
#> 5 1 2 2 0 150
#> 6 1 2 3 0 75
#> 7 1 3 1 0 0
#> 8 1 3 2 0 100
#> 9 1 3 3 1 150
#> 10 2 1 1 0 0
#> 11 2 1 2 0 200
#> 12 2 1 3 1 300
#> 13 2 2 1 0 0
#> 14 2 2 2 0 200
#> 15 2 2 3 1 300
#> 16 2 3 1 1 0
#> 17 2 3 2 0 500
#> 18 2 3 3 0 300
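Since gather() is now superseded in tidyr, a rough equivalent using pivot_longer() may be worth noting (a sketch only, not part of the original answer; names_pattern splits the question and option out of the column names):
library(dplyr)
library(tidyr)
d3 <- d2 %>%
  pivot_longer(matches("_X_Opt_"),
               names_to = c("question", "option"),
               names_pattern = "Q(\\d)_X_Opt_(\\d)",
               values_to = "cost") %>%
  pivot_longer(Q1:Q3, names_to = "q", values_to = "answer", names_prefix = "Q") %>%
  filter(question == q) %>%
  mutate(across(c(question, option), as.integer),
         choice = as.integer(option == answer)) %>%
  select(id, question, option, choice, cost) %>%
  arrange(id, question, option)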
1) First melt the input, transforming it to long form. Then break the variable column apart on the underscores using read.table, giving columns named V1, V2, V3, V4 that represent the question (as a factor), junk, junk and the option, respectively. Append that back to m and set question to the factor level of V1 and option to V4. Sort by id to give the same ordering as in the question. (If the order does not matter this line can be omitted.)
Now put the parts together noting that choice is 1 if the appropriate column among the Q1/Q2/Q3 columns equals the option and 0 otherwise.
library(reshape2)
m <- melt(d2, id = 1:4)
m <- cbind(m, read.table(text = as.character(m$variable), sep = "_"))
m <- transform(m, question = as.numeric(V1), option = V4)
m <- m[order(m$id), ]
n <- nrow(m)
with(m, data.frame(id,
question,
option,
choice = (m[cbind(1:n, question + 1)] == option) + 0,
value))
The result is:
id question option choice value
1 1 1 1 0 0
2 1 1 2 1 75
3 1 1 3 0 150
4 1 2 1 1 0
5 1 2 2 0 150
6 1 2 3 0 75
7 1 3 1 0 0
8 1 3 2 0 100
9 1 3 3 1 150
10 2 1 1 0 0
11 2 1 2 0 200
12 2 1 3 1 300
13 2 2 1 0 0
14 2 2 2 0 200
15 2 2 3 1 300
16 2 3 1 1 0
17 2 3 2 0 500
18 2 3 3 0 300
2) This could also be expressed using magrittr, giving the same answer. Note that the last two pipes use the exposition operator %$%, providing an implicit with(., ...) around the subsequent expression:
library(magrittr)
library(reshape2)
d2 %>%
melt(id = 1:4) %>%
cbind(read.table(text = as.character(.$variable), sep = "_")) %>%
transform(question = as.numeric(V1), option = V4) %$%
.[order(id), ] %$%
data.frame(id,
question,
option,
choice = (.[cbind(1:nrow(.), question + 1)] == option) + 0,
value)
3) This can be translated to reshape2/dplyr/tidyr:
library(reshape2)
library(dplyr)
library(tidyr)
d2 %>%
melt(id = 1:4) %>%
separate(variable, c("question", "X", "Opt", "option")) %>%
arrange(id) %>%
mutate(question = as.numeric(factor(question)),
choice = (.[cbind(1:n(), question + 1)] == option) + 0) %>%
select(id, question, option, choice, value)
