R - Collapse Data by Grouped Row Observations

I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.
Here's a sample data frame:
id <- c("123","456","789","101","123","587","456","789")
HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)
DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
df <- data.frame(id,HTN,DM2,TIA)
df
Which comes out to:
> df
id HTN DM2 TIA
1 123 TRUE FALSE TRUE
2 456 FALSE FALSE TRUE
3 789 FALSE TRUE TRUE
4 101 FALSE TRUE TRUE
5 123 FALSE FALSE FALSE
6 587 TRUE TRUE TRUE
7 456 FALSE FALSE TRUE
8 789 FALSE TRUE TRUE
I'd like my output to look like this:
id <- c("101","123","456","587","789")
HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
df2 <- data.frame(id,HTN,DM2,TIA)
df2
id HTN DM2 TIA
1 101 FALSE TRUE TRUE
2 123 TRUE FALSE TRUE
3 456 FALSE FALSE TRUE
4 587 TRUE TRUE TRUE
5 789 FALSE TRUE TRUE
So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problems I'll need to collapse this way, though, and that much repetitive code just seems like a recipe for an occult error.
library(dplyr)
df3 <- df %>%
  arrange(id) %>%
  group_by(id)
Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?

We may use
df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
# id HTN DM2 TIA
# <fct> <lgl> <lgl> <lgl>
# 1 101 FALSE TRUE TRUE
# 2 123 TRUE FALSE TRUE
# 3 456 FALSE FALSE TRUE
# 4 587 TRUE TRUE TRUE
# 5 789 FALSE TRUE TRUE
In this way we first group by id, as you suggested. Then summarize_all() applies the function any to every remaining column: for each group it receives a logical vector (e.g. HTN for patient 101) and returns TRUE if any element is TRUE, and FALSE otherwise.
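A side note: summarize_all() still works but is superseded in current dplyr (>= 1.0.0); the same grouped reduction is usually written with across() now, as in this equivalent sketch:
library(dplyr)
# apply any() to every non-grouping column within each id
df %>%
  group_by(id) %>%
  summarize(across(everything(), any))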

A base R option would be
aggregate(. ~ id, df, any)
# id HTN DM2 TIA
#1 101 FALSE TRUE TRUE
#2 123 TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587 TRUE TRUE TRUE
#5 789 FALSE TRUE TRUE
Or with rowsum
rowsum(+(df[-1]), group = df$id) > 0
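Note that rowsum() returns a matrix with the ids as row names rather than a data frame; a small sketch, assuming we want the original shape back:
# unary + coerces the logical columns to integers so rowsum() can add them up
m <- rowsum(+(df[-1]), group = df$id) > 0
df2 <- data.frame(id = rownames(m), m, row.names = NULL)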

If we prefer data.table we might use:
setDT(df)[, lapply(.SD, any), keyby = id]
id HTN DM2 TIA
1: 101 FALSE TRUE TRUE
2: 123 TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587 TRUE TRUE TRUE
5: 789 FALSE TRUE TRUE

Related

How to drop rows by condition in R?

From this dataframe I need to drop all the rows which have TRUE in every column. However, since I need to automate the process, I can't drop them by column names or column indexes. I need something else.
df1 <- c(TRUE,TRUE,FALSE,TRUE,TRUE)
df2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
df3 <- c(FALSE,TRUE,TRUE,TRUE,TRUE)
df <- data.frame(df1,df2,df3)
df1 df2 df3
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE FALSE TRUE
4 TRUE TRUE TRUE
5 TRUE TRUE TRUE
This should be the fastest solution, since do.call(pmin, df) computes the row-wise minimum across the columns, which is FALSE unless every entry in the row is TRUE:
df[!do.call(pmin, df), ]
# df1 df2 df3
# 1 TRUE TRUE FALSE
# 2 TRUE FALSE TRUE
# 3 FALSE FALSE TRUE
base R:
df[!apply(df, 1, all), ]
# df1 df2 df3
#1 TRUE TRUE FALSE
#2 TRUE FALSE TRUE
#3 FALSE FALSE TRUE
tidyverse:
library(dplyr)
filter(df, !if_all(everything()))
# df1 df2 df3
#1 TRUE TRUE FALSE
#2 TRUE FALSE TRUE
#3 FALSE FALSE TRUE
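Equivalently, the condition can be phrased with if_any(), keeping the rows where at least one column is FALSE:
# keep rows in which any column is FALSE
filter(df, if_any(everything(), ~ !.x))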
We can use the rowwise() function from the dplyr library:
library(dplyr)
df |> rowwise() |> filter(!all(c_across(everything())))
output
# A tibble: 3 × 3
# Rowwise:
df1 df2 df3
<lgl> <lgl> <lgl>
1 TRUE TRUE FALSE
2 TRUE FALSE TRUE
3 FALSE FALSE TRUE

IF TRUE then Variable name

I have a series of TRUE and FALSE variables representing search findings, e.g. cellphones, knives, money, etc. My goal is to replace the value TRUE with the name of the variable. Note that I would like to do this for 15 or more variables.
df <- data.frame(cellphone = c(TRUE, TRUE, FALSE, TRUE, FALSE),
money = c(FALSE, FALSE, FALSE, TRUE, FALSE),
knife = c(TRUE,TRUE,FALSE, FALSE, FALSE),
whatIneed = c("cellphone", "cellphone", "", "cellphone",""))
cellphone money knife whatIneed
1 TRUE FALSE TRUE cellphone
2 TRUE FALSE TRUE cellphone
3 FALSE FALSE FALSE
4 TRUE TRUE FALSE cellphone
5 FALSE FALSE FALSE
In base R, an option is to loop rowwise over the subset of columns that are logical and pick the first column name for which the value is TRUE:
df$whatIneed <- apply(df[1:3], 1, function(x) names(x)[x][1])
df$whatIneed
[1] "cellphone" "cellphone" NA "cellphone" NA
If we want to do this individually on each column
library(dplyr)
df %>%
mutate(across(where(is.logical),
~ case_when(.x ~ cur_column()), .names = "{.col}_name"))
Output:
cellphone money knife whatIneed cellphone_name money_name knife_name
1 TRUE FALSE TRUE cellphone cellphone <NA> knife
2 TRUE FALSE TRUE cellphone cellphone <NA> knife
3 FALSE FALSE FALSE <NA> <NA> <NA>
4 TRUE TRUE FALSE cellphone cellphone money <NA>
5 FALSE FALSE FALSE <NA> <NA> <NA>

R: Is it possible to combine the boolean data in multiple select columns of partially duplicate rows?

First off, I apologize for so horrendously wording my question. I cannot figure out a better, more succinct way of writing it, so hopefully what follows will help make it clear - any suggestions to improve its clarity are welcome, so as to make it more accessible to people in the future struggling with the same thing.
I am working with a dataframe in R which contains some rows with duplicate ID tags. There are four columns associated with each row which contain boolean values, and per row only one registers as true, in such a way that if an ID tag is repeated, the columns in which the boolean is true will be different. Below is a very short example section of the data I am working with:
dbsid l_e l_d n_e b_c
CCH00090 TRUE FALSE FALSE FALSE
CCH00091 FALSE FALSE TRUE FALSE
CCH00090 FALSE TRUE FALSE FALSE
I am hoping to end up with the following (though on a much larger scale):
dbsid l_e l_d n_e b_c
CCH00090 TRUE TRUE FALSE FALSE
CCH00091 FALSE FALSE TRUE FALSE
but cannot figure out any way to go about producing such an output. Note that the boolean data in the case of the duplicate entry has been combined so that the true values are kept over the false ones. I've been looking at the aggregate function, but have had no luck coercing it into performing the above.
Is it possible to do so? Thanks for taking the time to read through my question.
You can do this with summarize_all from dplyr:
library(dplyr)
df %>%
group_by(dbsid) %>%
summarize_all(sum)
Result:
# A tibble: 2 x 5
dbsid l_e l_d n_e b_c
<fctr> <int> <int> <int> <int>
1 CCH00090 1 1 0 0
2 CCH00091 0 0 1 0
Or with any (as suggested by @Ryan):
df %>%
group_by(dbsid) %>%
summarize_all(any)
Result:
# A tibble: 2 x 5
dbsid l_e l_d n_e b_c
<fctr> <lgl> <lgl> <lgl> <lgl>
1 CCH00090 TRUE TRUE FALSE FALSE
2 CCH00091 FALSE FALSE TRUE FALSE
Data:
df = structure(list(dbsid = structure(c(1L, 2L, 1L), .Label = c("CCH00090",
"CCH00091"), class = "factor"), l_e = c(TRUE, FALSE, FALSE),
l_d = c(FALSE, FALSE, TRUE), n_e = c(FALSE, TRUE, FALSE),
b_c = c(FALSE, FALSE, FALSE)), .Names = c("dbsid", "l_e",
"l_d", "n_e", "b_c"), class = "data.frame", row.names = c(NA,
-3L))
You can apply the any function across all the rows having the same dbsid, for all variables.
library(data.table)
setDT(df)
df[, lapply(.SD, any), by = dbsid]
# dbsid l_e l_d n_e b_c
# 1: CCH00090 TRUE TRUE FALSE FALSE
# 2: CCH00091 FALSE FALSE TRUE FALSE
Data used
df <- fread("dbsid l_e l_d n_e b_c
CCH00090 TRUE FALSE FALSE FALSE
CCH00091 FALSE FALSE TRUE FALSE
CCH00090 FALSE TRUE FALSE FALSE")

More efficient ways to use R than 'for' loops

I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part in a consistent 'streak' from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
x <- as.numeric(dat[i,2:6])
y <- max(grep(1, x))
finalyear[i] <- y
streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!
We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.
dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])
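A quick illustration of that coercion trick: double negation turns a row sum into a logical, and multiplication then treats it as 0 or 1.
!!c(0, 2)      # FALSE TRUE
3 * !!c(0, 2)  # 0 3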
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or, as suggested by @ColonelBeauvel, a compact dplyr version (it could fit on one line, but two lines keep it readable):
library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
streak=rowSums(dat[-1])==finalyear)
For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on comment by #Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Here is a solution with dplyr and tidyr.
library(dplyr)
library(tidyr)
gather(data = dat, year, value, -ids) %>%
  mutate(year = as.integer(gsub("X", "", year))) %>%
  group_by(ids) %>%
  summarize(finalyear = last(year[value]),
            streak = all(value[year <= finalyear]))
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE
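As a side note, gather() is superseded in current tidyr; the reshaping step could equivalently be written with pivot_longer():
library(tidyr)
# equivalent to gather(dat, year, value, -ids)
pivot_longer(dat, -ids, names_to = "year", values_to = "value")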
Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test the rle as being shorter than 3 and the first value being TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$length)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
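To see why the length test works, compare what rle() returns for a streak row and a non-streak row:
rle(c(TRUE, TRUE, TRUE, FALSE, FALSE))$lengths  # 3 2   -> two runs: streak
rle(c(TRUE, FALSE, TRUE, TRUE, TRUE))$lengths   # 1 1 3 -> three runs: no streak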
There's probably loads of ways of working out finalyear, this just finds the last element of each row which is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4

Counting Falses before Trues in R

I'm trying to use R to find the average number of attempts before a success in a dataframe with 300,000+ rows. Data is structured as below.
EventID SubjectID ActionID Success DateUpdated
a b c TRUE 2014-06-21 20:20:08.575032+00
b a c FALSE 2014-06-20 02:58:40.70699+00
I'm still learning my way through R. It looks like I can use ddply to separate the frame out based on Subject and Action (I want to see how many times a given subject tries an action before achieving a success), but I can't figure out how to write the formula I need to apply.
library(data.table)
# example data
dt = data.table(group = c(1,1,1,1,1,2,2), success = c(F,F,T,F,T,F,T))
# group success
#1: 1 FALSE
#2: 1 FALSE
#3: 1 TRUE
#4: 1 FALSE
#5: 1 TRUE
#6: 2 FALSE
#7: 2 TRUE
dt[, which(success)[1] - 1, by = group]
# group V1
#1: 1 2
#2: 2 1
Replace group with list(subject, action) or whatever is appropriate for your data (after converting it to data.table from data.frame).
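To get the average the question asks for, one more step on top of that result (a sketch; na.rm = TRUE guards against groups that never record a success, whose count comes out NA):
res <- dt[, .(fails = which(success)[1] - 1L), by = group]
mean(res$fails, na.rm = TRUE)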
To follow up on Tarehman's suggestion, since I like rle,
foo <- rle(data$Success)
mean(foo$lengths[foo$values==FALSE])
This might be an answer to a totally different question, but does this get close to what you want?
tfs <- sample(c(FALSE,TRUE),size = 50, replace = TRUE, prob = c(0.8,0.2))
tfs_sums <- cumsum(!tfs)
repsums <- tfs_sums[duplicated(tfs_sums)]
mean(repsums - c(0,repsums[-length(repsums)]))
tfs
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[20] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[39] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
repsums
1 6 8 9 20 20 20 20 24 26 31 36
repsums - c(0,repsums[-length(repsums)])
1 5 2 1 11 0 0 0 4 2 5 5
The last vector shown is the length of each continuous "run" of FALSE values in the vector tfs
You could use a data.table workaround to get what you need, as follows:
library(data.table)
df=data.frame(EventID=c("a","b","c","d"),SubjectID=c("b","a","a","a"),ActionID=c("c","c","c","c"),Success=c(TRUE,FALSE,FALSE,TRUE))
dt=data.table(df)
dt[, Index := 1:.N, by = c("SubjectID", "ActionID", "Success")]
Now this Index column will hold the running count you need for each subject/action/success combination. You then aggregate to get the maximum:
result <- aggregate(Index ~ SubjectID + ActionID, data = dt, FUN = max)
So this will give you the max index, which is the number of FALSEs before you hit a TRUE. Note that you might need further processing to filter out subjects that have never had a TRUE.
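A sketch of that filtering step, building on the result object above: keep only the subject/action pairs that recorded at least one success.
# subject/action pairs with at least one TRUE
ever_true <- unique(dt[Success == TRUE, .(SubjectID, ActionID)])
# drop the pairs that never succeeded
result <- merge(result, ever_true, by = c("SubjectID", "ActionID"))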
