I have a series of TRUE and FALSE variables representing search findings, e.g. cellphones, knives, money, etc. My goal is to replace the TRUE values with the name of the variable. Note that I would like to do this for 15 or more variables.
df <- data.frame(cellphone = c(TRUE, TRUE, FALSE, TRUE, FALSE),
                 money = c(FALSE, FALSE, FALSE, TRUE, FALSE),
                 knife = c(TRUE, TRUE, FALSE, FALSE, FALSE),
                 whatIneed = c("cellphone", "cellphone", "", "cellphone", ""))
cellphone money knife whatIneed
1 TRUE FALSE TRUE cellphone
2 TRUE FALSE TRUE cellphone
3 FALSE FALSE FALSE
4 TRUE TRUE FALSE cellphone
5 FALSE FALSE FALSE
In base R, one option is to loop row-wise over the logical columns and pick the first column name where the value is TRUE:
df$whatIneed <- apply(df[1:3], 1, function(x) names(x)[x][1])
df$whatIneed
[1] "cellphone" "cellphone" NA "cellphone" NA
If we want to do this individually for each column:
library(dplyr)
df %>%
  mutate(across(where(is.logical),
                ~ case_when(.x ~ cur_column()), .names = "{.col}_name"))
Output:
cellphone money knife whatIneed cellphone_name money_name knife_name
1 TRUE FALSE TRUE cellphone cellphone <NA> knife
2 TRUE FALSE TRUE cellphone cellphone <NA> knife
3 FALSE FALSE FALSE <NA> <NA> <NA>
4 TRUE TRUE FALSE cellphone cellphone money <NA>
5 FALSE FALSE FALSE <NA> <NA> <NA>
Related
I want to group a data frame using a condition which, when applied, gives the output "TRUE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE ..." and so on. I want to group in a way that gives me the rows from each TRUE to the last FALSE before the next TRUE, i.e. for the output above I'd get 3 groups: "TRUE FALSE FALSE FALSE FALSE", "TRUE FALSE FALSE" and "TRUE FALSE".
I hope this makes sense lol
We could do this by counting the cumulative number of TRUE values and using that as our group.
a <- data.frame(log = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))
a %>%
  mutate(group = cumsum(log))
or in base R:
a$group = cumsum(a$log)
Result
log group
1 TRUE 1
2 FALSE 1
3 FALSE 1
4 FALSE 1
5 FALSE 1
6 TRUE 2
7 FALSE 2
8 FALSE 2
9 TRUE 3
10 FALSE 3
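Once the group column exists it can drive any grouped operation; for instance, a sketch (assuming dplyr is loaded) that counts how long each TRUE-to-FALSE streak is:
library(dplyr)
# Sketch: use the cumulative-sum group to measure the length of each streak
a %>%
  mutate(group = cumsum(log)) %>%
  count(group)
#   group n
# 1     1 5
# 2     2 3
# 3     3 2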
I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.
Here's a sample data frame:
id <- c("123","456","789","101","123","587","456","789")
HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE)
DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
df <- data.frame(id,HTN,DM2,TIA)
df
Which comes out to:
> df
id HTN DM2 TIA
1 123 TRUE FALSE TRUE
2 456 FALSE FALSE TRUE
3 789 FALSE TRUE TRUE
4 101 FALSE TRUE TRUE
5 123 FALSE FALSE FALSE
6 587 TRUE TRUE TRUE
7 456 FALSE FALSE TRUE
8 789 FALSE TRUE TRUE
I'd like my output to look like this:
id <- c("101","123","456","587","789")
HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
df2 <- data.frame(id,HTN,DM2,TIA)
df2
id HTN DM2 TIA
1 101 FALSE TRUE TRUE
2 123 TRUE FALSE TRUE
3 456 FALSE FALSE TRUE
4 587 TRUE TRUE TRUE
5 789 FALSE TRUE TRUE
So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problems I'll need to collapse in this way, though, and that much repetitive code just seems like a recipe for an occult error.
df3 <- df %>%
arrange(id) %>%
group_by(id)
Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?
We may use
df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
# id HTN DM2 TIA
# <fct> <lgl> <lgl> <lgl>
# 1 101 FALSE TRUE TRUE
# 2 123 TRUE FALSE TRUE
# 3 456 FALSE FALSE TRUE
# 4 587 TRUE TRUE TRUE
# 5 789 FALSE TRUE TRUE
In this way we first group by id, as you suggested. Then we summarize all the variables with the function any(): given a logical vector (e.g., HTN for patient 101), it returns TRUE if any of the rows is TRUE, and FALSE otherwise.
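Note that summarize_all() is superseded in more recent dplyr releases; an equivalent written with across() (a sketch, assuming dplyr >= 1.0) would be:
df %>%
  group_by(id) %>%
  summarize(across(everything(), any))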
A base R option would be
aggregate(.~ id, df, any)
# id HTN DM2 TIA
#1 101 FALSE TRUE TRUE
#2 123 TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587 TRUE TRUE TRUE
#5 789 FALSE TRUE TRUE
Or with rowsum
rowsum(+(df[-1]), group = df$id) > 0
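The rowsum() call returns a logical matrix with the ids as row names rather than a data frame; if the data-frame shape of df2 is wanted, it can be wrapped like this (a sketch; m is just an illustrative intermediate name):
# Sketch: convert the rowsum() result back into a data frame with an id column
m <- rowsum(+(df[-1]), group = df$id) > 0
data.frame(id = rownames(m), m, row.names = NULL)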
If we prefer data.table we might use:
setDT(df)[, lapply(.SD, any), keyby = id]
id HTN DM2 TIA
1: 101 FALSE TRUE TRUE
2: 123 TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587 TRUE TRUE TRUE
5: 789 FALSE TRUE TRUE
I have a logical dataframe like:
> test
apple apple apple kiwi kiwi banana banana banana apple orange
1 FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
2 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
3 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
My aim is to combine the columns that share the same column name. That is to say, the output should be a data frame with 4 columns (apple, kiwi, banana, orange).
I tried :
testmerge <- test[, !duplicated(colnames(test))]
But the output is not what I'm looking for. For each row, across the columns sharing a name, the output should be TRUE as long as there is at least one TRUE, and FALSE if there are no TRUEs.
For instance, the first row of the first column should be TRUE instead of FALSE.
Undesired testmerge output:
apple kiwi banana orange
1 FALSE FALSE FALSE FALSE
2 TRUE TRUE TRUE FALSE
3 FALSE TRUE FALSE FALSE
Desired output:
apple kiwi banana orange
1 TRUE TRUE TRUE FALSE
2 TRUE TRUE TRUE FALSE
3 FALSE TRUE FALSE FALSE
Replicate dataframe:
test <- structure(list(apple = c(FALSE, TRUE, FALSE), apple = c(TRUE, TRUE, FALSE),
                       apple = c(FALSE, TRUE, FALSE), kiwi = c(FALSE, TRUE, TRUE),
                       kiwi = c(TRUE, TRUE, TRUE), banana = c(FALSE, TRUE, FALSE),
                       banana = c(TRUE, FALSE, FALSE), banana = c(TRUE, TRUE, FALSE),
                       apple = c(TRUE, TRUE, FALSE), orange = c(FALSE, FALSE, FALSE)),
                  .Names = c("apple", "apple", "apple", "kiwi", "kiwi", "banana",
                             "banana", "banana", "apple", "orange"),
                  row.names = c(NA, -3L), class = "data.frame")
Using sapply and rowSums:
as.data.frame(
  sapply(unique(colnames(test)),
         function(i) {
           rowSums(test[, grepl(i, colnames(test)), drop = FALSE]) > 0
         })
)
#output
# apple kiwi banana orange
# 1 TRUE TRUE TRUE FALSE
# 2 TRUE TRUE TRUE FALSE
# 3 FALSE TRUE FALSE FALSE
We are subsetting the dataframe based on the fruit names, then computing rowSums. TRUE is 1 and FALSE is 0, so a rowSums value greater than zero means the row has at least one TRUE value. I use drop = FALSE so that the subset stays a dataframe in cases like orange, where there is only one matching column.
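One caveat with grepl() here is that a fruit name that happens to be a substring of another (say, a hypothetical pineapple column alongside apple) would be matched as well; comparing the names exactly avoids that. A sketch of the same idea with exact matching:
as.data.frame(
  sapply(unique(colnames(test)),
         function(i) rowSums(test[, colnames(test) == i, drop = FALSE]) > 0)
)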
Note:
If the data is long then the Reduce() solution by #akrun works better, but if the data is wide then rowSums() is more efficient.
There may be more efficient ways to achieve this, but here's a try.
I would suggest converting the column names to unique ones using make.unique, then converting to long format, checking your condition by a row id and the column names (with the make.unique suffix stripped off again), and then converting back to a wide format, something like:
library(data.table)
setnames(setDT(test), make.unique(names(test)))      # Make column names unique
res <- melt(test[, id := .I], id = "id"              # Add a row index and melt by it
            )[, sum(value) > 0,                      # Check the condition
              by = .(id, Names = sub("\\..*", "", variable))]   # by row id and original names
dcast(res, id ~ Names, value.var = "V1")             # Convert back to wide format
# id apple banana kiwi orange
# 1: 1 TRUE TRUE TRUE FALSE
# 2: 2 TRUE TRUE TRUE FALSE
# 3: 3 FALSE FALSE TRUE FALSE
Another option would be to split the sequence of columns by the column names into a list, loop through that list, subset the data with each set of column indices, and use Reduce to check whether any value in each row is TRUE.
sapply(split(seq_along(test), names(test)), function(i) Reduce(`|`, test[i]))
# apple banana kiwi orange
#[1,] TRUE TRUE TRUE FALSE
#[2,] TRUE TRUE TRUE FALSE
#[3,] FALSE FALSE TRUE FALSE
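sapply() returns a logical matrix here; if a data frame is needed, as in the desired output, the call can be wrapped in as.data.frame():
# Sketch: the same split/Reduce approach, coerced back to a data frame
as.data.frame(
  sapply(split(seq_along(test), names(test)), function(i) Reduce(`|`, test[i]))
)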
I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part for a consistent 'streak' from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5,
                  "1999" = c(T, T, T, F, T),
                  "2000" = c(T, F, T, F, T),
                  "2001" = c(T, T, T, T, T),
                  "2002" = c(F, T, T, T, T),
                  "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
  x <- as.numeric(dat[i, 2:6])
  y <- max(grep(1, x))
  finalyear[i] <- y
  streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!
We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without any TRUE values, we can make sure 0 is returned for those rows by multiplying by the double negation of rowSums(): for an all-FALSE row the sum is 0, !!0 is FALSE, which is coerced back to 0, and multiplying by 0 returns 0.
dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])
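A tiny made-up example (the two-column data frame m below is purely illustrative) shows why the extra factor matters: for an all-FALSE row max.col() still has to pick some column, and the !!rowSums() term zeroes that spurious pick out.
# Sketch: behaviour of max.col() on a row with no TRUE values
m <- data.frame(a = c(TRUE, FALSE), b = c(FALSE, FALSE))
max.col(m, 'last')                  # 1 2 -- the all-FALSE row still "picks" column 2
max.col(m, 'last') * !!rowSums(m)   # 1 0 -- the all-FALSE row is zeroed out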
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with 'finalyear':
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or a one-liner (it could fit on a single line, but it is split over two lines to make it clearer), suggested by #ColonelBeauvel:
library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
streak=rowSums(dat[-1])==finalyear)
For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on comment by #Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
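On the earlier point about growing vectors: if you do want to keep the loop, preallocating the result vectors removes most of that overhead. A sketch of the original loop with preallocation:
# Sketch: the same loop as in the question, but with preallocated result vectors
n <- nrow(dat)
finalyear <- integer(n)
streak <- logical(n)
for (i in seq_len(n)) {
  x <- as.numeric(dat[i, 2:6])
  finalyear[i] <- max(which(x == 1))
  streak[i] <- sum(x) == finalyear[i]
}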
Here is a solution with dplyr and tidyr.
gather(data = dat, year, value, -ids) %>%
  mutate(year = as.integer(gsub("X", "", year))) %>%
  group_by(ids) %>%
  summarize(finalyear = last(year[value]),
            streak = !any(value[year <= finalyear] == FALSE))
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE
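gather() has since been superseded by pivot_longer(); an equivalent sketch (assuming tidyr >= 1.0 and the original dat without the added columns) would be:
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(-ids, names_to = "year", values_to = "value") %>%
  mutate(year = as.integer(gsub("X", "", year))) %>%
  group_by(ids) %>%
  summarize(finalyear = last(year[value]),
            streak = !any(value[year <= finalyear] == FALSE))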
Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test that the rle has at most two runs and that the first value is TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$length)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
There are probably loads of ways of working out finalyear; this one just finds the last element of each row which is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4
I have a data.frame with 6,000 obs
SubjectID : int 1,2,3,4...
Arthritis : logi FALSE FALSE TRUE FALSE FALSE
Stroke : logi TRUE FALSE FALSE FALSE FALSE
Diabetes : logi TRUE FALSE FALSE FALSE FALSE
Cancer : logi FALSE FALSE FALSE FALSE TRUE
I am trying to find rows where every disease is absent. I can do it for a single disease with this:
subset(RHV.FINAL, Arthritis=="FALSE")
I have tried this for all diseases, which works, but is cumbersome:
subset(RHV.FINAL, Arthritis=="FALSE" & Stroke=="FALSE" & Diabetes=="FALSE" & Cancer=="FALSE")
Is there a more elegant solution?
Can you not use rowSums? It's a bit hard to tell from the str of your data as you have posted it. A small example that can be recreated in an R session would be good (dput).
df[rowSums(df) == 0, ]
For example...
set.seed(123)
df <- data.frame(id = 1:5,
                 A = sample(c(T, F), 5, replace = TRUE),
                 B = sample(c(T, F), 5, replace = TRUE),
                 C = sample(c(T, F), 5, replace = TRUE))
id A B C
1 1 TRUE TRUE FALSE
2 2 FALSE FALSE TRUE
3 3 TRUE FALSE FALSE
4 4 FALSE FALSE FALSE
5 5 FALSE TRUE TRUE
# df[,-1] to exclude id variable in first column (thanks #DidzisElferts)
df[rowSums(df[, -1]) == 0, ]
id A B C
4 4 FALSE FALSE FALSE
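If dplyr is in use anyway, a rough equivalent with if_all() (available in dplyr 1.0.4 and later) keeps the rows where every column other than id is FALSE:
library(dplyr)
# Sketch: keep rows in which all logical columns are FALSE
df %>% filter(if_all(-id, ~ !.x))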