Merge same column names of a logical/binary dataframe

I have a logical dataframe like:
> test
apple apple apple kiwi kiwi banana banana banana apple orange
1 FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
2 TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
3 FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
My aim is to combine the columns that share a column name. That is, the output should be a dataframe with 4 columns (apple, kiwi, banana, orange).
I tried:
testmerge <- test[, !duplicated(colnames(test))]
But the output is not what I am looking for. For each row and each column name, the output should be TRUE as long as there is at least one TRUE among the columns with that name, and FALSE if there are none.
For instance, the first row of the first column should be TRUE instead of FALSE.
Undesired testmerge output:
apple kiwi banana orange
1 FALSE FALSE FALSE FALSE
2 TRUE TRUE TRUE FALSE
3 FALSE TRUE FALSE FALSE
Desired output:
apple kiwi banana orange
1 TRUE TRUE TRUE FALSE
2 TRUE TRUE TRUE FALSE
3 FALSE TRUE FALSE FALSE
Replicate dataframe:
test <- structure(list(apple = c(FALSE, TRUE, FALSE), apple = c(TRUE, TRUE,
FALSE), apple = c(FALSE, TRUE, FALSE), kiwi = c(FALSE, TRUE, TRUE
), kiwi = c(TRUE, TRUE, TRUE), banana = c(FALSE, TRUE, FALSE), banana = c(TRUE,
FALSE, FALSE), banana = c(TRUE, TRUE, FALSE), apple = c(TRUE, TRUE,
FALSE), orange = c(FALSE, FALSE, FALSE)), .Names = c("apple", "apple",
"apple", "kiwi", "kiwi", "banana", "banana", "banana", "apple", "orange"), row.names = c(NA,
-3L), class = "data.frame")

Using sapply and rowSums:
as.data.frame(
sapply(unique(colnames(test)),
function(i){
rowSums(test[, grepl(i, colnames(test)), drop = FALSE]) > 0})
)
#output
# apple kiwi banana orange
# 1 TRUE TRUE TRUE FALSE
# 2 TRUE TRUE TRUE FALSE
# 3 FALSE TRUE FALSE FALSE
We subset the dataframe by fruit name, then compute rowSums. TRUE counts as 1 and FALSE as 0, so a row sum greater than zero means there is at least one TRUE value. drop = FALSE keeps the subset as a dataframe in cases like orange, where there is only one column.
Note:
If the data is long, the Reduce() solution by @akrun works better, but if the data is wide, rowSums() is more efficient.
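One caveat worth sketching: the answer above selects columns with grepl(), which treats each name as a regular expression, so a name that is a substring of another name would pull in extra columns. The data below is hypothetical (the "pineapple" column is not in the question) and only illustrates an exact-match variant using ==.

```r
# Hypothetical data: "apple" is a substring of "pineapple"
fruit <- data.frame(c(TRUE, FALSE), c(FALSE, FALSE), c(FALSE, TRUE))
names(fruit) <- c("apple", "pineapple", "apple")

# Compare names exactly instead of pattern matching
merged <- sapply(unique(names(fruit)), function(nm) {
  rowSums(fruit[, names(fruit) == nm, drop = FALSE]) > 0
})
merged
```

With grepl("apple", ...) the pineapple column would also be counted under apple; with == it is kept separate.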

There may be more efficient ways to achieve this, but here's a try.
I would suggest converting the column names to unique ones using make.unique, then converting to long format, checking your condition by row id and the original column names (with the make.unique suffix stripped), and then converting back to wide format, something like
library(data.table)
setnames(setDT(test), make.unique(names(test))) # Make column names unique
res <- melt(test[, id := .I], id = "id" # Add a row index and melt by it
)[, sum(value) > 0, # Check condition >>
by = .(id, Names = sub("\\..*", "", variable))] # by row id and unique names
dcast(res, id ~ Names, value.var = "V1") # Convert back to wide format
# id apple banana kiwi orange
# 1: 1 TRUE TRUE TRUE FALSE
# 2: 2 TRUE TRUE TRUE FALSE
# 3: 3 FALSE FALSE TRUE FALSE

Another option would be to split the column indices of the dataset by column name into a list, loop through the list, subset the dataset by each set of indices, and use Reduce to check whether there is any TRUE in each row.
sapply(split(seq_along(test), names(test)), function(i) Reduce(`|`, test[i]))
# apple banana kiwi orange
#[1,] TRUE TRUE TRUE FALSE
#[2,] TRUE TRUE TRUE FALSE
#[3,] FALSE FALSE TRUE FALSE
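To make the Reduce step concrete, here is a minimal sketch on a plain list of logical vectors (the values are the three leading apple columns from the question; the variable names are only for illustration). Reduce applies `|` pairwise, so the result is an element-wise OR across all vectors.

```r
# Three logical vectors of equal length, one per duplicated column
cols <- list(c(FALSE, TRUE, FALSE),
             c(TRUE,  TRUE, FALSE),
             c(FALSE, TRUE, FALSE))

# Equivalent to cols[[1]] | cols[[2]] | cols[[3]]
any_true <- Reduce(`|`, cols)
any_true
```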

Related

How to group dataframe by Conditioning in R

I want to group a data frame using a condition which, when applied, gives the output "True False False False False True False False True False ..." and so on. I want to group in a way that gives me the rows from each "True" to the last "False" before the next "True", i.e. for the mentioned output I'd get 3 groups (the rows which give) "True False False False False", "True False False" and "True False"
I hope this makes sense lol

We could do this by counting the cumulative number of TRUEs and using that as our group.
library(dplyr)
a <- data.frame(log = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE))
a %>%
mutate(group = cumsum(log))
or in base R:
a$group = cumsum(a$log)
Result
log group
1 TRUE 1
2 FALSE 1
3 FALSE 1
4 FALSE 1
5 FALSE 1
6 TRUE 2
7 FALSE 2
8 FALSE 2
9 TRUE 3
10 FALSE 3
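Once the group column exists, the groups can be pulled apart in base R. The following is a small sketch using split(), which is not part of the answer above; note that any FALSE rows before the first TRUE would land in a group labelled 0.

```r
a <- data.frame(log = c(TRUE, FALSE, FALSE, FALSE, FALSE,
                        TRUE, FALSE, FALSE, TRUE, FALSE))

# cumsum() increases by one at every TRUE, so each TRUE starts a new group
groups <- split(a, cumsum(a$log))

# Number of rows in each group
group_sizes <- sapply(groups, nrow)
group_sizes
```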

IF TRUE then Variable name

I have a series of TRUE and FALSE variables representing search findings, e.g. cellphones, knives, money, etc. My goal is to replace the TRUE values with the name of the variable. Note that I would like to do this for 15 or more variables.
df <- data.frame(cellphone = c(TRUE, TRUE, FALSE, TRUE, FALSE),
money = c(FALSE, FALSE, FALSE, TRUE, FALSE),
knife = c(TRUE,TRUE,FALSE, FALSE, FALSE),
whatIneed = c("cellphone", "cellphone", "", "cellphone",""))
cellphone money knife whatIneed
1 TRUE FALSE TRUE cellphone
2 TRUE FALSE TRUE cellphone
3 FALSE FALSE FALSE
4 TRUE TRUE FALSE cellphone
5 FALSE FALSE FALSE

In base R, an option is to loop row-wise over the logical columns and take the first column name that is TRUE in each row (rows with no TRUE give NA):
df$whatIneed <- apply(df[1:3], 1, function(x) names(x)[x][1])
df$whatIneed
[1] "cellphone" "cellphone" NA "cellphone" NA
If we want to do this individually on each column
library(dplyr)
df %>%
mutate(across(where(is.logical),
~ case_when(.x ~ cur_column()), .names = "{.col}_name"))
-output
cellphone money knife whatIneed cellphone_name money_name knife_name
1 TRUE FALSE TRUE cellphone cellphone <NA> knife
2 TRUE FALSE TRUE cellphone cellphone <NA> knife
3 FALSE FALSE FALSE <NA> <NA> <NA>
4 TRUE TRUE FALSE cellphone cellphone money <NA>
5 FALSE FALSE FALSE <NA> <NA> <NA>
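If every TRUE column should be reported rather than only the first, the same row-wise idea can be extended. This is a sketch under that assumption (the question only asks for one name per row); found and all_found are illustrative names.

```r
found <- data.frame(cellphone = c(TRUE, TRUE, FALSE, TRUE, FALSE),
                    money     = c(FALSE, FALSE, FALSE, TRUE, FALSE),
                    knife     = c(TRUE, TRUE, FALSE, FALSE, FALSE))

# For each row, paste together the names of all columns that are TRUE;
# rows with no TRUE give an empty string
all_found <- apply(found, 1, function(x) paste(names(x)[x], collapse = ", "))
unname(all_found)
```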

R - Collapse Data by Grouped Row Observations

I'm working with a large data frame of hospitalization records. Many patients have two or more hospitalizations, and their past medical history may be incomplete at one or more of the hospitalizations. I'd like to collapse all the information from each of their hospitalizations into a single list of medical problems for each patient.
Here's a sample data frame:
id <- c("123","456","789","101","123","587","456","789")
HTN <- c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE)
DM2 <- c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)
TIA <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
df <- data.frame(id,HTN,DM2,TIA)
df
Which comes out to:
> df
id HTN DM2 TIA
1 123 TRUE FALSE TRUE
2 456 FALSE FALSE TRUE
3 789 FALSE TRUE TRUE
4 101 FALSE TRUE TRUE
5 123 FALSE FALSE FALSE
6 587 TRUE TRUE TRUE
7 456 FALSE FALSE TRUE
8 789 FALSE TRUE TRUE
I'd like my output to look like this:
id <- c("101","123","456","587","789")
HTN <- c(FALSE,TRUE,FALSE,TRUE,FALSE)
DM2 <- c(TRUE,FALSE,FALSE,TRUE,TRUE)
TIA <- c(TRUE,TRUE,TRUE,TRUE,TRUE)
df2 <- data.frame(id,HTN,DM2,TIA)
df2
id HTN DM2 TIA
1 101 FALSE TRUE TRUE
2 123 TRUE FALSE TRUE
3 456 FALSE FALSE TRUE
4 587 TRUE TRUE TRUE
5 789 FALSE TRUE TRUE
So far I've got a pretty good hunch that arranging and grouping my data is the right place to start, and I think I could make it work by creating a new variable for each medical problem. I have about 30 medical problems I'll need to collapse in this way, though, and that much repetitive code just seems like a recipe for an occult error.
df3 <- df %>%
arrange(id) %>%
group_by(id)
Looking around I haven't been able to find a particularly elegant way to go about this. Is there some slick dplyr function I'm overlooking?

We may use
df %>% group_by(id) %>% summarize_all(any)
# A tibble: 5 x 4
# id HTN DM2 TIA
# <fct> <lgl> <lgl> <lgl>
# 1 101 FALSE TRUE TRUE
# 2 123 TRUE FALSE TRUE
# 3 456 FALSE FALSE TRUE
# 4 587 TRUE TRUE TRUE
# 5 789 FALSE TRUE TRUE
In this way we first indeed group by id, as you suggested. Then we summarize all the variables with a function any: we provide a logical vector (e.g., HTN for patient 101) and return TRUE if in any of the rows we have TRUE and FALSE otherwise.

A base R option would be
aggregate(.~ id, df, any)
# id HTN DM2 TIA
#1 101 FALSE TRUE TRUE
#2 123 TRUE FALSE TRUE
#3 456 FALSE FALSE TRUE
#4 587 TRUE TRUE TRUE
#5 789 FALSE TRUE TRUE
Or with rowsum
rowsum(+(df[-1]), group = df$id) > 0
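The rowsum line is terse, so here is a step-by-step sketch of the same idea on a plain matrix, using the data from the question. The intermediate names flags, counts and collapsed are only for illustration.

```r
id <- c("123", "456", "789", "101", "123", "587", "456", "789")
flags <- cbind(
  HTN = c(TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE),
  DM2 = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE),
  TIA = c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Adding 0L turns TRUE/FALSE into 1/0 so the flags can be summed per id
counts <- rowsum(flags + 0L, group = id)

# A positive count means at least one hospitalization had the flag set
collapsed <- counts > 0
collapsed
```

The result is a logical matrix with one row per id (as row names) rather than a data frame with an id column.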

If we prefer data.table we might use:
setDT(df)[, lapply(.SD, any), keyby = id]
id HTN DM2 TIA
1: 101 FALSE TRUE TRUE
2: 123 TRUE FALSE TRUE
3: 456 FALSE FALSE TRUE
4: 587 TRUE TRUE TRUE
5: 789 FALSE TRUE TRUE

Subset data based on sequence of characters across rows

How can I subset df by a pattern of consecutive rows of characters? In the example below, I'd like to subset the data that have history values of "TRUE", "FALSE", "TRUE" consecutively. The data below is a bit odd but you get the idea!
value <- c(1/1/16,1/2/16, 1/3/16, 1/4/16, 1/5/16, 1/6/16, 1/7/16, 1/8/16, 1/9/16, 1/10/16)
history <- c("TRUE", "FALSE", "TRUE", "TRUE", "FALSE", "TRUE", "TRUE", "TRUE", "FALSE", "TRUE")
df <- data.frame(value, history)
df
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
7 0.008928571 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE
I've tried grepl, but that works for character strings - not sequences of characters consecutively across rows.
The output would be the same df as above, but without row 7, as that doesn't follow the pattern mentioned.

You could do...
s = c("TRUE", "FALSE", "TRUE")
library(data.table)
w = as.data.table(embed(history, length(s)))[as.list(s), on=paste0("V", seq_along(s)), which=TRUE]
df$v <- FALSE
df$v[w + rep(seq_along(s)-1L, each=length(w))] <- TRUE
value history v
1 0.062500000 TRUE TRUE
2 0.031250000 FALSE TRUE
3 0.020833333 TRUE TRUE
4 0.015625000 TRUE TRUE
5 0.012500000 FALSE TRUE
6 0.010416667 TRUE TRUE
7 0.008928571 TRUE FALSE
8 0.007812500 TRUE TRUE
9 0.006944444 FALSE TRUE
10 0.006250000 TRUE TRUE
You can then filter like subset(df, v == TRUE).
This works using data.table joins, x[i, which=TRUE] which looks up i = as.list(s) in x = embed(history, length(s)) and reports which rows of x are matched:
> as.data.table(as.list(s))
V1 V2 V3
1: TRUE FALSE TRUE
> as.data.table(embed(history, length(s)))
V1 V2 V3
1: TRUE FALSE TRUE
2: TRUE TRUE FALSE
3: FALSE TRUE TRUE
4: TRUE FALSE TRUE
5: TRUE TRUE FALSE
6: TRUE TRUE TRUE
7: FALSE TRUE TRUE
8: TRUE FALSE TRUE
The w + rep(...) is the same as @GGrothendieck's outer(...) except here w contains the position of the start of a match, not the end.
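For comparison, a package-free sketch of the same sliding-window idea follows, assuming history has already been converted to logical. One detail to keep in mind: embed() returns each window in reverse order, which does not matter for a palindromic pattern such as TRUE, FALSE, TRUE but would for others, so the columns are flipped back here.

```r
history <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE)
pattern <- c(TRUE, FALSE, TRUE)
k <- length(pattern)

# Row i of windows holds history[i], ..., history[i + k - 1]
windows <- embed(history, k)[, k:1, drop = FALSE]

# Start positions of every window equal to the pattern
starts <- which(apply(windows, 1, function(w) all(w == pattern)))

# Expand each start to all k positions it covers
keep <- sort(unique(c(outer(starts, seq_len(k) - 1L, "+"))))
keep
```

Indexing the data frame with keep would then drop row 7, as in the answers above.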

The data in the question looks very strange so we used the data in the Note at the end. If you really have a character vector or factor with value "TRUE" and "FALSE" it can readily be translated to logicals using:
df <- transform(df, history = history == "TRUE")
1) rollapply First define the pattern and then search for it using a moving window with rollapplyr. That gives a logical vector which is TRUE if it is the end of such a pattern match. Find the indexes of the TRUEs and include the prior two indexes as well. Finally perform the subset.
library(zoo)
pattern <- c(TRUE, FALSE, TRUE)
ix <- which(rollapplyr(df$history, length(pattern), identical, pattern, fill = FALSE))
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "-"))))
df[ix, ]
giving:
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE
1a) magrittr The code in (1) could be expressed using magrittr. (Solution (2) could also be expressed using magrittr following similar ideas.)
library(magrittr)
library(zoo)
df %>%
extract(
extract(.,, "history") %>%
rollapplyr(length(pattern), identical, pattern, fill = FALSE) %>%
which %>%
outer(seq_along(pattern) - 1L, "-") %>%
sort %>%
unique, )
2) gregexpr Using pattern defined above we convert it to a character string of 0s and 1s and also convert df$history to such a string. We can then use gregexpr to find the indexes of the first element of each match and then expand that to all indexes and subset. We get the same answer as before. This alternative uses no packages.
collapse <- function(x) paste0(x + 0, collapse = "")
ix <- gregexpr(collapse(pattern), collapse(df$history))[[1]]
ix <- unique(sort(c(outer(ix, seq_along(pattern) - 1L, "+"))))
df[ix, ]
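A possible caveat with the gregexpr approach: matches are found without overlap, so when two occurrences of the pattern share an element, the second one is skipped. That does not affect the data in the Note, but the hypothetical vector below shows the effect, along with one way around it using a zero-width lookahead with perl = TRUE.

```r
collapse <- function(x) paste0(x + 0, collapse = "")
pattern <- c(TRUE, FALSE, TRUE)

# Hypothetical history "10101": the pattern occurs at positions 1 and 3
overlapping <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

# Plain matching consumes positions 1-3, so the match at 3 is missed
plain <- as.integer(gregexpr(collapse(pattern), collapse(overlapping))[[1]])

# A lookahead matches without consuming characters, so overlaps are found
lookahead <- as.integer(
  gregexpr(paste0("(?=", collapse(pattern), ")"),
           collapse(overlapping), perl = TRUE)[[1]]
)
plain
lookahead
```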
Note
Lines <- "
value history
1 0.062500000 TRUE
2 0.031250000 FALSE
3 0.020833333 TRUE
4 0.015625000 TRUE
5 0.012500000 FALSE
6 0.010416667 TRUE
7 0.008928571 TRUE
8 0.007812500 TRUE
9 0.006944444 FALSE
10 0.006250000 TRUE"
df <- read.table(text = Lines)

Option using dplyr's lag:
library(dplyr)
df <- data.frame(value, history)
n<- grepl("TRUE, FALSE, TRUE", paste(lag(lag(history)), (lag(history)), history, sep = ", "))[-(1:2)]
cond <- n |lag(n)|lag(lag(n))
cond <- c(cond, cond[length(history)-2], cond[length(history)-2])
df[cond, ]

More efficient ways to use R than 'for' loops

I'm a relative newcomer to R so I'm sorry if there's an obvious answer to this. I've looked at other questions and I think 'apply' is the answer but I can't work out how to use it in this case.
I've got a longitudinal survey where participants are invited every year. In some years they fail to take part, and sometimes they die. I need to identify which participants have taken part in a consistent 'streak' from the start of the survey (i.e. if they stop, they stop for good).
I've done this with a 'for' loop, which works fine in the example below. But I have many years and many participants, and the loop is very slow. Is there a faster approach I could use?
In the example, TRUE means they participated in that year. The loop creates two vectors - 'finalyear' for the last year they took part, and 'streak' to show if they completed all years before the finalyear (i.e. cases 1, 3 and 5).
dat <- data.frame(ids = 1:5, "1999" = c(T, T, T, F, T), "2000" = c(T, F, T, F, T), "2001" = c(T, T, T, T, T), "2002" = c(F, T, T, T, T), "2003" = c(F, T, T, T, F))
finalyear <- NULL
streak <- NULL
for (i in 1:nrow(dat)) {
x <- as.numeric(dat[i,2:6])
y <- max(grep(1, x))
finalyear[i] <- y
streak[i] <- sum(x) == y
}
dat$finalyear <- finalyear
dat$streak <- streak
Thanks!

We could use max.col and rowSums as a vectorized approach.
dat$finalyear <- max.col(dat[-1], 'last')
If there are rows without TRUE values, we can make sure to return 0 for that row by multiplying with the double negation of rowSums. The FALSE will be coerced to 0 and multiplying with 0 returns 0 for that row.
dat$finalyear <- max.col(dat[-1], 'last')*!!rowSums(dat[-1])
Then, we create the 'streak' column by comparing the rowSums of columns 2:6 with that of 'finalyear'
dat$streak <- rowSums(dat[,2:6])==dat$finalyear
dat
# ids X1999 X2000 X2001 X2002 X2003 finalyear streak
#1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
#2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
#3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
#4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
#5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
Or as a single call (it could fit on one line, but is split over two for readability), suggested by @ColonelBeauvel
library(dplyr)
mutate(dat, finalyear=max.col(dat[-1], 'last'),
streak=rowSums(dat[-1])==finalyear)
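To see what the 'last' tie-breaking and the zero guard do, here is a small sketch on a plain logical matrix. The third row (no participation at all) is hypothetical and not in the question's data; part, last_true and streak are illustrative names.

```r
part <- rbind(c(TRUE,  TRUE,  TRUE,  FALSE, FALSE),
              c(TRUE,  FALSE, TRUE,  TRUE,  TRUE),
              c(FALSE, FALSE, FALSE, FALSE, FALSE))

# max.col() returns the column of the row maximum; with ties.method = "last"
# that is the last TRUE. Multiplying by (rowSums > 0) zeroes all-FALSE rows,
# which would otherwise report the last column.
last_true <- max.col(part + 0, ties.method = "last") * (rowSums(part) > 0)

# A streak means the number of TRUEs equals the position of the last TRUE
streak <- rowSums(part) == last_true
last_true
streak
```

Note that under this definition an all-FALSE row compares 0 with 0 and is reported as a streak, so such rows may need separate handling.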

For-loops are not inherently bad in R, but they are slow if you grow vectors iteratively (like you are doing). There are often better ways to do things. Example of a solution with only apply-functions:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:7],MARGIN=1,function(x){sum(x[1:5])==x[6]})
Or option 2, based on a comment by @Spacedman:
dat$finalyear <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))})
dat$streak <- apply(dat[,2:6],MARGIN=1,function(x){max(which(x))==sum(x)})
> dat
ids X1999 X2000 X2001 X2002 X2003 finalyear streak
1 1 TRUE TRUE TRUE FALSE FALSE 3 TRUE
2 2 TRUE FALSE TRUE TRUE TRUE 5 FALSE
3 3 TRUE TRUE TRUE TRUE TRUE 5 TRUE
4 4 FALSE FALSE TRUE TRUE TRUE 5 FALSE
5 5 TRUE TRUE TRUE TRUE FALSE 4 TRUE
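For completeness, the point about growing vectors can be sketched by keeping the loop but allocating the result vectors up front and working on a matrix instead of data frame rows. This is only an illustration of preallocation, not a claim about how much faster it is on the real data.

```r
dat <- data.frame(ids = 1:5,
                  "1999" = c(TRUE, TRUE, TRUE, FALSE, TRUE),
                  "2000" = c(TRUE, FALSE, TRUE, FALSE, TRUE),
                  "2001" = c(TRUE, TRUE, TRUE, TRUE, TRUE),
                  "2002" = c(FALSE, TRUE, TRUE, TRUE, TRUE),
                  "2003" = c(FALSE, TRUE, TRUE, TRUE, FALSE))

yrs <- as.matrix(dat[, 2:6])   # matrix row access is cheaper than dat[i, ]
n <- nrow(yrs)
finalyear <- integer(n)        # allocate once instead of growing from NULL
streak <- logical(n)

for (i in seq_len(n)) {
  x <- yrs[i, ]
  finalyear[i] <- max(which(x))        # position of the last TRUE
  streak[i] <- sum(x) == finalyear[i]  # all earlier years TRUE as well
}
```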

Here is a solution with dplyr and tidyr.
library(dplyr)
library(tidyr)
gather(data = dat,year,value,-ids) %>%
mutate(year=as.integer(gsub("X","",year))) %>%
group_by(ids) %>%
summarize(finalyear=last(year[value]),
streak=all(value[year <= finalyear])) # TRUE only if every year up to finalyear is TRUE
output
ids finalyear streak
1 1 2001 TRUE
2 2 2003 FALSE
3 3 2003 TRUE
4 4 2003 FALSE
5 5 2002 TRUE

Here's a base version using apply to loop over rows and rle to see how often the state changes. Your condition seems to be equivalent to the state starting as TRUE and only ever changing to FALSE at most once, so I test the rle as being shorter than 3 and the first value being TRUE:
> dat$streak = apply(dat[,2:6],1,function(r){r[1] & length(rle(r)$lengths)<=2})
>
> dat
ids X1999 X2000 X2001 X2002 X2003 streak
1 1 TRUE TRUE TRUE FALSE FALSE TRUE
2 2 TRUE FALSE TRUE TRUE TRUE FALSE
3 3 TRUE TRUE TRUE TRUE TRUE TRUE
4 4 FALSE FALSE TRUE TRUE TRUE FALSE
5 5 TRUE TRUE TRUE TRUE FALSE TRUE
There's probably loads of ways of working out finalyear, this just finds the last element of each row which is TRUE:
> dat$finalyear = apply(dat[,2:6], 1, function(r){max(which(r))})
> dat
ids X1999 X2000 X2001 X2002 X2003 streak finalyear
1 1 TRUE TRUE TRUE FALSE FALSE TRUE 3
2 2 TRUE FALSE TRUE TRUE TRUE FALSE 5
3 3 TRUE TRUE TRUE TRUE TRUE TRUE 5
4 4 FALSE FALSE TRUE TRUE TRUE FALSE 5
5 5 TRUE TRUE TRUE TRUE FALSE TRUE 4
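As a quick illustration of what rle returns, here is row 2 of the data on its own (r and runs are illustrative names). The row starts TRUE but has three runs, so it fails the "at most two runs" test.

```r
r <- c(TRUE, FALSE, TRUE, TRUE, TRUE)

# rle() compresses consecutive equal values into (value, run length) pairs
runs <- rle(r)
runs$values
runs$lengths

# Streak test: starts TRUE and changes state at most once
is_streak <- r[1] && length(runs$lengths) <= 2
is_streak
```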
