I have a dataset in long format (i.e. multiple observations per ID). Each ID has multiple visits at which the individual was assessed for diseases (I show 3 in the toy example, but my real data has as many as 30), coded in consecutive columns (disease1-disease3). A value of 1 means they were diagnosed with the disease at the time of diagnosis_dt, and 0 means they did not have it. For each ID, I'm interested in summarizing whether or not they had each disease across all visits where diagnosis_dt falls between start_dt and end_dt. Some IDs have no diagnosis information and are consequently coded as NA in the respective columns; I'd still like to keep this information.
A toy example of my dataset is below:
library(dplyr)
library(data.table)
ex_dat <- data.frame(ID = c(rep("a", 3),
                            rep("b", 4),
                            rep("c", 5)),
                     start_dt = as.Date(c(rep("2009-01-01", 3),
                                          rep("2009-04-01", 4),
                                          rep("2009-02-01", 5))),
                     end_dt = as.Date(c(rep("2010-12-31", 3),
                                        rep("2011-03-31", 4),
                                        rep("2011-01-31", 5))),
                     diagnosis_dt = c(as.Date(c("2011-01-03", "2010-11-01", "2009-12-01")),
                                      as.Date(c("2011-04-03", "2010-11-01", "2009-12-01", "2011-12-01")),
                                      rep(NA, 5)),
                     disease1 = c(c(1, 0, 0),
                                  c(1, 1, 0, 1),
                                  rep(NA, 5)),
                     disease2 = c(c(1, 1, 0),
                                  c(0, 0, 0, 1),
                                  rep(NA, 5)),
                     disease3 = c(c(0, 0, 0),
                                  c(0, 0, 1, 0),
                                  rep(NA, 5)))
The desired output is:
ID disease1 disease2 disease3
1 a 0 1 0
2 b 1 0 1
3 c NA NA NA
I've been trying this for hours now and my latest attempt is:
out <- ex_dat %>%
  group_by(ID) %>%
  mutate_at(vars(disease1:disease3),
            function(x) ifelse(!is.na(.$diagnosis_dt) &
                                 between(.$diagnosis_dt, .$start_dt, .$end_dt) &
                                 sum(x) > 0,
                               1, 0)) %>%
  slice(1) %>%
  select(ID, disease1:disease3)
Here is a tidyverse solution, using filter to drop the rows that do not meet the desired condition and then complete to restore the missing groups as NA.
library(tidyverse)
ex_dat %>%
  # Group by ID
  group_by(ID) %>%
  # Keep only the rows where diagnosis_dt falls between start_dt and end_dt
  filter(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt) %>%
  # Summarize every variable that starts with "disease" by taking its max value
  summarize_at(vars(starts_with("disease")), max) %>%
  # Complete the missing IDs: those that had only NAs or never met the filter criteria
  complete(ID)
# A tibble: 3 x 4
# ID disease1 disease2 disease3
# <fct> <dbl> <dbl> <dbl>
# 1 a 0 1 0
# 2 b 1 0 1
# 3 c NA NA NA
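If you'd rather stay in data.table (which the question already loads), a rough sketch of the same idea is below: keep the in-window rows, take per-ID maxima, then join back onto the full set of IDs so the all-NA ones reappear. The names dt and res are my own, and .SDcols = patterns() assumes a reasonably recent data.table.
dt <- as.data.table(ex_dat)
# Keep in-window rows only; rows with NA dates drop out automatically in i
res <- dt[diagnosis_dt >= start_dt & diagnosis_dt <= end_dt,
          lapply(.SD, max), by = ID, .SDcols = patterns("^disease")]
# Right-join onto all IDs so IDs with no qualifying rows come back as NA
res[dt[, .(ID = unique(ID))], on = "ID"]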
Here's an approach with the dplyr across functionality (version >= 1.0.0):
library(dplyr)
ex_dat %>%
  group_by(ID) %>%
  summarize(across(-one_of(c("start_dt", "end_dt", "diagnosis_dt")),
                   ~ if_else(any(diagnosis_dt >= start_dt & diagnosis_dt <= end_dt & .),
                             1, 0)))
# A tibble: 3 x 4
# ID disease1 disease2 disease3
# <fct> <dbl> <dbl> <dbl>
#1 a 0 1 0
#2 b 1 0 1
#3 c NA NA NA
Note that using the & operator on the integer column . coerces it to logical. I'm using the -one_of tidyselect helper so that we don't even need to know how many disease columns there are; the columns actively being group_by-ed are automatically excluded.
Your version isn't working because 1) you need to summarize, not mutate, and 2) inside the function, . refers to the column currently being operated on, not the data frame coming through the pipe. Instead, you need to reference those columns bare (without .$) so they are looked up in the grouped data.
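For reference, a minimal corrected sketch of the attempt along those lines (bare column references, summarize_at instead of mutate_at) might look like:
out <- ex_dat %>%
  group_by(ID) %>%
  # Bare names resolve within each group; any() collapses across visits,
  # and all-NA groups propagate NA through any() into if_else()
  summarize_at(vars(disease1:disease3),
               ~ if_else(any(diagnosis_dt >= start_dt &
                               diagnosis_dt <= end_dt & . == 1),
                         1, 0))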
I have a dataframe like so
ID <- c('John', 'Bill', 'Alice','Paulina')
Type1 <- c(1,1,0,1)
Type2 <- c(0,1,1,0)
cluster <- c(1,2,3,1)
test <- data.frame(ID, Type1, Type2, cluster)
I want to group by cluster and sum the values in all the other columns, apart from ID, which should be dropped.
I achieved it through
test.sum <- test %>%
group_by(cluster)%>%
summarise(sum(Type1), sum(Type2))
However, I have thousands of types and I can't write out each column in summarise manually. Can you help me?
This is where across() and contains() come in incredibly useful for selecting the columns you want to summarise across:
test %>%
  group_by(cluster) %>%
  summarise(across(contains("Type"), sum))
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
Alternatively, pivoting the dataset into long and then back into wide means you can easily analyse all groups and clusters at once:
library(dplyr)
library(tidyr)
test %>%
  pivot_longer(-c(ID, cluster)) %>%
  group_by(cluster, name) %>%
  summarise(sum_value = sum(value)) %>%
  pivot_wider(names_from = "name", values_from = "sum_value")
cluster Type1 Type2
<dbl> <dbl> <dbl>
1 1 2 0
2 2 1 1
3 3 0 1
Base R
You can exploit split(), which plays the role of group_by() here. This should give you what you are looking for, regardless of how many Types you have.
my_split <- split(subset(test, select = grep('^Ty', names(test))), test$cluster)
my_sums <- sapply(my_split, \(x) colSums(x))
my_sums <- data.frame(cluster = as.numeric(gsub("\\D", '', colnames(my_sums))),
                      t(my_sums))
Output
> my_sums
cluster Type1 Type2
1 1 2 0
2 2 1 1
3 3 0 1
Note: use function(x) instead of \(x) if you run a version of R older than 4.1.0.
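For an even more compact base-R route, rowsum() computes within-group column sums directly; here is a small sketch on the same test data (my_sums2 is my own name):
# rowsum() sums each column within groups of cluster;
# its rows come back ordered by the sorted group values
my_sums2 <- data.frame(cluster = sort(unique(test$cluster)),
                       rowsum(test[grep('^Ty', names(test))], test$cluster))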
Consider the following dataframe
df <- data.frame(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 status = c(NA, 1, 1, NA, NA, 1, NA, 1, NA),
                 health = c(0, 1, 1, 1, 0, 1, 1, 0, 0))
For each group (i.e. the first column), I'm looking for a way to subset the row preceding the cell where 1 is first seen in the second column (labelled status). The expected output is
group status health
1 1 NA 0
2 2 NA 0
3 3 NA 1
I've tried resolving this with the filter and slice functions, but have not succeeded in subsetting the preceding rows. Any help is greatly appreciated.
One solution is with the tidyverse:
df %>%
  group_by(group) %>%
  mutate(gr = which(status == 1)[1] - 1) %>%
  slice(unique(gr)) %>%
  select(-gr)
# A tibble: 3 x 3
# Groups: group [3]
group status health
<dbl> <dbl> <dbl>
1 1 NA 0
2 2 NA 0
3 3 NA 1
or
df %>%
  group_by(group) %>%
  filter(row_number() == which(status == 1)[1] - 1)
or
df %>%
  group_by(group) %>%
  slice(which(lead(status == 1))[1])
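If you prefer base R, the same "row before the first 1" idea can be sketched with split():
# For each group, take the row just before the first status == 1
# (a group whose first 1 sits in row 1 contributes no row)
do.call(rbind, lapply(split(df, df$group),
                      function(d) d[which(d$status == 1)[1] - 1, ]))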
I have a longitudinal data set and would like to extract the latest, non-missing complete set of observations for each variable in the data set where id is a unique identifier, yr is year, and x1 and x2 are variables with missing values. The actual data set has 100s of variables over the course of 60 years.
data <- data.frame(id = rep(1:3, 3),
                   yr = rep(1:3, times = 1, each = 3),
                   x1 = c(1, 3, 7, NA, NA, NA, 9, 4, 10),
                   x2 = c(NA, NA, NA, 3, 9, 6, NA, NA, NA))
Below are my expected results. For x1, the latest complete set of observations is year 3. For x2, the latest complete set of observations is year 2.
Using base R
# Keep rows from the last two years in which at least one variable is fully observed
subset(data, yr %in% names(tail(which(sapply(split(data[c('x1', 'x2')], data$yr),
                                             function(x) any(colSums(!is.na(x)) == nrow(x)))),
                                2)))
Here's a tidyverse solution. First, I create the data frame.
# Create data frame
df <- data.frame(id = rep(1:3, 3),
                 yr = rep(1:3, times = 1, each = 3),
                 x1 = c(1, 3, 7, NA, NA, NA, 9, 4, 10),
                 x2 = c(NA, NA, NA, 3, 9, 6, NA, NA, NA))
Next, I load the required libraries.
# Load libraries
library(dplyr)
library(tidyr)
I then go from wide to long format, group by yr and key (i.e., the variable name), keep only the year-variable combinations with no NA values (i.e., those that are entirely non-NA), regroup by key, keep the data from the latest such year, switch back to wide format, and arrange to make the printed result look pretty.
df %>%
  gather("key", "val", x1, x2) %>%
  group_by(yr, key) %>%
  filter(all(!is.na(val))) %>%
  group_by(key) %>%
  filter(yr == max(yr)) %>%
  spread(key, val) %>%
  arrange(yr)
#> # A tibble: 6 x 4
#> id yr x1 x2
#> <int> <int> <dbl> <dbl>
#> 1 1 2 NA 3
#> 2 2 2 NA 9
#> 3 3 2 NA 6
#> 4 1 3 9 NA
#> 5 2 3 4 NA
#> 6 3 3 10 NA
Created on 2019-05-29 by the reprex package (v0.3.0)
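The same logic also translates to a data.table sketch (long and complete_yrs are my own names): melt to long format, flag the year-variable pairs with no NAs, and take the latest such year per variable.
library(data.table)
long <- melt(as.data.table(df), id.vars = c("id", "yr"))
# TRUE where a variable is fully observed in a given year
complete_yrs <- long[, .(complete = !anyNA(value)), by = .(variable, yr)]
complete_yrs[complete == TRUE, .(latest_yr = max(yr)), by = variable]
#    variable latest_yr
# 1:       x1         3
# 2:       x2         2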
I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use left_join because the first (left) file needs to stay as it is (I don't want it to return all combinations and add rows). But I also don't want it to match like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers only ever get the first match. I need it to return the first match, then the second, then the third (the dates are sorted so that the newest date is always first for every ID number), and so on, BUT without added rows. Is there any way to do this? Since I don't know how else to show you, I had included an example picture of what I need (not reproduced here). Not sure if I made myself clear, but thank you in advance!
You can add a second column of sub-IDs that follows the row order within each ID. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

df2 <- df2 %>%
  group_by(ID) %>%
  mutate(follow_id = row_number())

outcome <- df1 %>% inner_join(df2, by = c("ID", "follow_id"))
# A tibble: 7 x 3
# Groups:   ID [?]
     ID follow_id var1
  <dbl>     <int> <fct>
1     1         1 a
2     1         2 b
3     2         1 e
4     3         1 f
5     4         1 h
6     4         2 i
7     4         3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-incrementing id within each group, then join as usual:
df1 <- data.frame(id = c(1, 1, 2, 3, 4, 4, 4))
# Example dates are random (no seed set), so the exact dates below will vary per run
d1 <- sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by = "day"), 11)
df2 <- data.frame(id = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4), d1, d2 = d1 + sample.int(50, 11))
library(dplyr)
df11 <- df1 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()

df21 <- df2 %>%
  group_by(id) %>%
  mutate(id2 = 1:n()) %>%
  ungroup()

left_join(df11, df21, by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
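As a side note, data.table::rowid() builds the same within-group counter in a single step, so the grouping boilerplate can be sketched away:
library(data.table)
# rowid(id) numbers rows 1, 2, ... within each distinct id value
df11 <- df1 %>% mutate(id2 = rowid(id))
df21 <- df2 %>% mutate(id2 = rowid(id))
left_join(df11, df21, by = c("id", "id2"))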
I'm trying to modify a solution posted here: Create cohort dropout rate table from raw data.
I'd like to create a CUMULATIVE dropout rate table using these data.
DT <- data.table(
  id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
         11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
         21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35),
  year = c(2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014, 2014,
           2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,
           2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016),
  cohort = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
             2, 2, 2, 1, 1, 2, 1, 2, 1, 2,
             1, 1, 3, 3, 3, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3))
So far, I've been able to get to this point
library(tidyverse)
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(y2014_2015_dropouts = `2014` - `2015`,
         y2015_2016_dropouts = `2015` - `2016`) %>%
  mutate(y2014_2015_cumulative = y2014_2015_dropouts / `2014`,
         y2015_2016_cumulative = y2015_2016_dropouts / `2014` + y2014_2015_cumulative) %>%
  replace_na(list(y2014_2015_dropouts = 0.0,
                  y2015_2016_dropouts = 0.0)) %>%
  select(cohort, y2014_2015_dropouts, y2015_2016_dropouts,
         y2014_2015_cumulative, y2015_2016_cumulative)
A cumulative dropout rate table reflects the proportion of students within a class who dropped out of school across years.
# A tibble: 3 x 5
cohort y2014_2015_dropouts y2015_2016_dropouts y2014_2015_cumulative y2015_2016_cumulative
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 6 2 0.6 0.8
2 2 0 2 NA NA
3 3 0 0 NA NA
The last two columns of the tibble show that by the end of year 2014-2015, 60% of cohort 1 students dropped out; and by the end of year 2015-2016, 80% of cohort 1 students had dropped out.
I'd like to calculate the same for cohorts 2 and 3, but I don't know how to do it.
Here is an alternative data.table solution that keeps your data organized in a way that I find easier to deal with. Using your DT input data:
Organize and order by cohort and year:
DT2 <- DT[, .N, list(cohort, year)][order(cohort, year)]
Assign the year-over-year range label within each cohort:
DT2[, year := paste(shift(year), year, sep = "_"), by = cohort]
Get dropouts per year within each cohort:
DT2[, dropouts := ifelse(!is.na(shift(N)), shift(N) - N, 0), by = cohort]
Get the cumulative sum of proportion dropped out each year per cohort:
DT2[, cumul := cumsum(dropouts) / max(N), cohort]
Output:
> DT2
   cohort      year  N dropouts     cumul
1:      1   NA_2014 10        0 0.0000000
2:      1 2014_2015  4        6 0.6000000
3:      1 2015_2016  2        2 0.8000000
4:      2   NA_2015  6        0 0.0000000
5:      2 2015_2016  4        2 0.3333333
6:      3   NA_2016  9        0 0.0000000
Because you spread your data by year early in your pipe, and your 2014 column has NA values for everything related to cohort 2, you need to coalesce the denominator in your calculation for y2015_2016_cumulative. If you replace the definition for that variable from the current
y2015_2016_cumulative =y2015_2016_dropouts/`2014`+y2014_2015_cumulative
to
y2015_2016_cumulative =y2015_2016_dropouts/coalesce(`2014`, `2015`) +
coalesce(y2014_2015_cumulative, 0)
you should be good to go. The coalesce function tries the first argument, but inputs the second argument if the first is NA. That being said, this current method isn't extremely scalable. You would have to add additional coalesce statements for every year you added. If you keep your data in the tidy format, you can keep a running list at the year-cohort level using
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  group_by(cohort) %>%
  mutate(dropouts = lag(n) - n,
         dropout_rate = dropouts / max(n)) %>%
  replace_na(list(dropouts = 0, n = 0, dropout_rate = 0)) %>%
  mutate(cumulative_dropouts = cumsum(dropouts),
         cumulative_dropout_rate = cumulative_dropouts / max(n))
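Run against the DT above, this pipe should end with one row per year-cohort pair, roughly as follows (rates rounded; a sketch of the expected shape rather than verbatim console output):
#    year cohort     n dropouts dropout_rate cumulative_dropouts cumulative_dropout_rate
#   2014      1    10        0        0                        0                   0
#   2015      1     4        6        0.6                      6                   0.6
#   2015      2     6        0        0                        0                   0
#   2016      1     2        2        0.2                      8                   0.8
#   2016      2     4        2        0.333                    2                   0.333
#   2016      3     9        0        0                        0                   0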