Extract the edges of connected matrix cells - r

I have an n x p binary matrix with disconnected TRUE values. How do I extract the boundaries/edges of horizontally or vertically connected cells (i.e. not diagonally connected ones)?
An example:
m <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE), .Dim = c(20L, 20L))
library(plot.matrix)
plot(m)
The plot.matrix plot of the example data with the desired edges/boundaries overlayed manually.
The desired solution should find the largest group of connected cells and remove any holes inside the connected cells. It would be great to find a mathematical/base R solution for this.
EDIT "What have I tried" part has been moved to one of the answers.

Here is one solution I found (moved it from the original question's "What have I tried" part to an answer since it solves the problem). It is not entirely optimal as it uses several external GIS packages but does address the issue.
## Load the required packages
library(raster)
library(ggplot2)
library(spatialEco)
image(m) ## Plot the original matrix
You'll need to transpose the matrix to account for the different philosophies of rows&columns between the raster package and R's native matrix. See #Dspanes comment here and this answer.
m <- t(m)
r <- raster::raster(m[nrow(m):1,])
Now you can use R's GIS capabilities to solve the problem
rc <- raster::clump(r, directions = 4) # Clump Rook's wise (comes from Chess)
id.df <- na.omit(as.data.frame(freq(rc))) ## To find the clump with most cells
rc[rc != id.df[which.max(id.df$count),"value"]] <- NA ## Remove the rest
p <- raster::rasterToPolygons(rc, dissolve = TRUE) ## Convert to polygons
# Remove holes (not relevant for this particular example but is required by the OP)
p <- spatialEco::remove.holes(p)
dt <- ggplot2::fortify(p) # Convert to data.frame
## Add the boundaries to the image. Note that the offset is because the image
## function plots the midpoints and the boundaries are plotted as a line
## Different orientation results from how the image function plots the matrix
lines(dt$long, dt$lat, col = "cyan")
I posted this because I found a solution to the question myself and thought this might help someone one day.

Related

Best approach to visualise presence/absence of events in multiple groups [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
I have a dataset where the presence/absence of mutations in 40 particular genes has been recorded comparing normal tissue (e.g. lung tissue) vs a tumour from that tissue (e.g. lung tumor) for twenty tissue types. I am struggling to find the best way to visualise this data.
A subset of the data:
Gene Lung_Normal Lung_Cancer Skin_Normal Skin_Cancer Brain_Normal Brain_Cancer
Gene_1 TRUE TRUE TRUE TRUE TRUE TRUE
Gene_2 TRUE TRUE TRUE TRUE TRUE TRUE
Gene_3 FALSE TRUE FALSE FALSE FALSE FALSE
Gene_4 FALSE FALSE FALSE FALSE FALSE FALSE
Gene_5 FALSE TRUE FALSE FALSE FALSE TRUE
Gene_6 FALSE FALSE TRUE TRUE TRUE TRUE
Gene_7 FALSE FALSE FALSE TRUE FALSE FALSE
Gene_8 FALSE FALSE FALSE TRUE FALSE TRUE
Gene_9 FALSE TRUE FALSE FALSE FALSE FALSE
Gene_10 FALSE FALSE FALSE TRUE FALSE TRUE
The key message we want to convey is that while the same 3-4 genes are often mutated in normal tissues, each tumor has many more additional genes mutated and there is more diversity in the tumors. I could just leave it as a table like this, but I would love to find a good way to visualise the information in a clear way.
I would like to try making a figure, like a circus plot, with a single circle with two rings representing all the data. The inner ring would be the normal tissues, the outer ring would be the cancer tissues, with each segment containing the relevant normal tissue on the inner ring and the relevant cancer tissue on the outer ring. Each gene would be colour coded and only shown if mutated. So for all normal tissues the segment would show 2-3 colours for the 2-3 mutated genes, while the outer cancer segment would show many more colour segments, representing the many more mutations.
However I have not found a plotting software that could create such a visualisation. Does anyone know of a way to make a visualisation like this? Even just pointing me towards an R package would be very helpful. I have looked into circos and radar plots but I have not found a package that can make the type of visualisation I have in mind, only showing the events that occur in each case.
If anyone thinks a different kind of visualisation could represent this data please let me know I would be happy to consider alternatives that represent the data with clarity.
Thank you in advance.
Not sure if this is what you're looking for, but I took a stab at it. Also, I'm not entirely sure from the description above what you want to do with the different types of cells - Lung, Skin, Brain? If this isn't what you're looking for, perhaps you could post a drawing of what the intended output should look like.
In the picture below, the inner ring is normal cells and the outer ring is cancer cells. My answer here benefited from this post.
## Make the data
tib <- tibble::tribble(
~Gene, ~Lung_Normal, ~Lung_Cancer, ~Skin_Normal, ~Skin_Cancer, ~Brain_Normal, ~Brain_Cancer,
"Gene_1", TRUE , TRUE , TRUE , TRUE , TRUE , TRUE,
"Gene_2", TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
"Gene_3", FALSE , TRUE , FALSE , FALSE , FALSE , FALSE,
"Gene_4", FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
"Gene_5", FALSE , TRUE , FALSE , FALSE , FALSE , TRUE,
"Gene_6", FALSE, FALSE, TRUE, TRUE, TRUE, TRUE,
"Gene_7", FALSE , FALSE , FALSE , TRUE , FALSE , FALSE,
"Gene_8", FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
"Gene_9", FALSE , TRUE , FALSE , FALSE , FALSE , FALSE,
"Gene_10", FALSE, FALSE, FALSE, TRUE, FALSE, TRUE)
library(tidyr)
library(dplyr)
## Re-arrange into long format
tib <- tib %>%
pivot_longer(cols=-Gene, names_pattern="(.*)_(.*)", names_to=c("type", ".value")) %>%
pivot_longer(c(Normal, Cancer), names_to = "diag", values_to="val") %>%
# code colors as the gene if it's mutated, otherwise Unmutated
mutate(f = case_when(val ~ Gene, TRUE ~ "Unmutated")) %>%
group_by(Gene, f, diag) %>%
summarise(s = n()) %>%
mutate(diag = factor(diag, levels=c("Normal", "Cancer")),
f = factor(f, levels=c(paste("Gene", c(1,2,6,3,5,7,8,9,10,4), sep="_"), "Unmutated")))
library(ggplot2)
library(RColorBrewer)
ggplot(tib, aes(x=diag,
y = s,
fill=f)) +
geom_bar(stat="identity") +
coord_polar("y") +
theme_void() +
scale_fill_manual(values=c(brewer.pal(9, "Paired"), "gray75")) +
labs(fill = "Mutations")
EDIT
Here' is what it looks like with the data Allan suggested. This approach doesn't scale quite as well as the need for having lots of colors is going to make the plot less readable.
df <- structure(list(genes = c("Gene1", "Gene2", "Gene3", "Gene4",
"Gene5", "Gene6", "Gene7", "Gene8", "Gene9", "Gene10", "Gene11",
"Gene12", "Gene13", "Gene14", "Gene15", "Gene16", "Gene17", "Gene18",
"Gene19", "Gene20", "Gene21", "Gene22", "Gene23", "Gene24", "Gene25",
"Gene26", "Gene27", "Gene28", "Gene29", "Gene30", "Gene31", "Gene32",
"Gene33", "Gene34", "Gene35", "Gene36", "Gene37", "Gene38", "Gene39",
"Gene40"), bone_cancer = c(FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE), bone_normal = c(FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
TRUE, FALSE, TRUE), brain_cancer = c(TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE), brain_normal = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE), breast_cancer = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE), breast_normal = c(TRUE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE), colon_cancer = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE), colon_normal = c(FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE), kidney_cancer = c(FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
kidney_normal = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), liver_cancer = c(FALSE,
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE), liver_normal = c(TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE), lung_cancer = c(TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
lung_normal = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), prostate_cancer = c(TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, TRUE), prostate_normal = c(TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE), skin_cancer = c(FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), skin_normal = c(TRUE,
FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE), thyroid_cancer = c(FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE), thyroid_normal = c(FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)),
class = "data.frame", row.names = c(NA, 40L))
names(df)[1] <- "Gene"
tib <- df %>%
pivot_longer(cols=-Gene, names_pattern="(.*)_(.*)", names_to=c("type", ".value")) %>%
pivot_longer(c(normal, cancer), names_to = "diag", values_to="val") %>%
# code colors as the gene if it's mutated, otherwise Unmutated
mutate(f = case_when(val ~ Gene, TRUE ~ "Unmutated")) %>%
group_by(Gene, f, diag) %>%
summarise(s = n()) %>%
ungroup() %>%
group_by(Gene) %>%
mutate(diag = factor(diag, levels=c("normal", "cancer")))
levs <- tib %>%
dplyr::select(f, s) %>%
summarise(pct_mutated = sum(s*(f!= "Unmutated"))/sum(s)) %>%
arrange(-pct_mutated) %>%
dplyr::select(Gene) %>%
pull()
tib<- tib %>%
mutate(f = factor(f, levels=c(levs, "Unmutated")))
library(ggplot2)
library(RColorBrewer)
ggplot(tib, aes(x=diag,
y = s,
fill=f)) +
geom_bar(stat="identity") +
coord_polar("y") +
theme_void() +
scale_fill_manual(values=c(rainbow(length(levels(tib$f))-1), "gray75")) +
labs(fill = "Mutations")
An alternative would be a heatmap. You could either do this by faceting cancer vs normal, or adjusting the fill to reflect mutations in cancer only, normal tissue only, both, or neither.
For both of these, it would first be necessary to reshape your data:
Option 1
library(tidyr)
library(dplyr)
library(ggplot2)
df %>%
pivot_longer(-1) %>%
separate(name, into = c("tissue", "state"), sep = "_") %>%
mutate(genes = factor(genes, paste0("Gene", 1:40))) %>%
ggplot(aes(tissue, genes, fill = value)) + geom_tile(color = "black") +
facet_grid(.~state) +
scale_x_discrete(guide = guide_axis(n.dodge = 2))
Option 2
df %>%
pivot_longer(-1) %>%
separate(name, into = c("tissue", "state"), sep = "_") %>%
mutate(genes = factor(genes, paste0("Gene", 1:40))) %>%
group_by(genes, tissue) %>%
summarize(mutations = factor(2 + diff(value) + 2 * all(value))) %>%
ggplot(aes(tissue, genes, fill = mutations)) + geom_tile(color = "black") +
scale_x_discrete(guide = guide_axis(n.dodge = 2)) +
scale_fill_discrete(labels = c("Neither", "Cancer only", "Healthy only", "Both"))
For these I have used sample data that should approximate your data structure:
df <- structure(list(genes = c("Gene1", "Gene2", "Gene3", "Gene4",
"Gene5", "Gene6", "Gene7", "Gene8", "Gene9", "Gene10", "Gene11",
"Gene12", "Gene13", "Gene14", "Gene15", "Gene16", "Gene17", "Gene18",
"Gene19", "Gene20", "Gene21", "Gene22", "Gene23", "Gene24", "Gene25",
"Gene26", "Gene27", "Gene28", "Gene29", "Gene30", "Gene31", "Gene32",
"Gene33", "Gene34", "Gene35", "Gene36", "Gene37", "Gene38", "Gene39",
"Gene40"), bone_cancer = c(FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE), bone_normal = c(FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
TRUE, FALSE, TRUE), brain_cancer = c(TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE), brain_normal = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE), breast_cancer = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE), breast_normal = c(TRUE,
FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE), colon_cancer = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE), colon_normal = c(FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE), kidney_cancer = c(FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
kidney_normal = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE), liver_cancer = c(FALSE,
FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE), liver_normal = c(TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE), lung_cancer = c(TRUE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
lung_normal = c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), prostate_cancer = c(TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, TRUE), prostate_normal = c(TRUE, FALSE, FALSE,
FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE), skin_cancer = c(FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE), skin_normal = c(TRUE,
FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE,
TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE), thyroid_cancer = c(FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE), thyroid_normal = c(FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)), class = "data.frame", row.names = c(NA,
40L))
Here is an alternative. Similar to a heat map, but instead using colored points. I also chose to hide the non-mutations since I presume that we are more interested in visualizing where the mutations occur. If the FALSE values are also colored, it increases the cognitive load by providing an extra emphasis for our brains to interpret.
# Creating the data.
gene_data <- structure(list(Gene = c("Gene_1", "Gene_2", "Gene_3", "Gene_4",
"Gene_5", "Gene_6", "Gene_7", "Gene_8", "Gene_9", "Gene_10"),
Lung_Normal = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE), Lung_Cancer = c(TRUE, TRUE, TRUE, FALSE,
TRUE, FALSE, FALSE, FALSE, TRUE, FALSE), Skin_Normal = c(TRUE,
TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE
), Skin_Cancer = c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
TRUE, TRUE, FALSE, TRUE), Brain_Normal = c(TRUE, TRUE, FALSE,
FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE), Brain_Cancer = c(TRUE,
TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Processing the data a little:
mutation_data <- gene_data %>%
pivot_longer(-Gene, values_to = "Mutation") %>% # making long form
separate(name, into = c("Type", "Status")) %>% # splitting cell type and cancer status
mutate(Mutation = as.numeric(Mutation), # cleaning data for plotting
Type = factor(Type, levels = c("Cancer", "Normal")),
Gene = factor(Gene, levels = paste0("Gene_", 1:10))) %>% #genes appear in order
filter(Mutation == 1) # Removing FALSE for cleaner plot
Building the plot
ggplot() +
# plotting all cancer mutations in red
geom_point(data = mutation_data %>% filter(Status == "Cancer"),
aes(x = Type, y = Gene, color = Status), size = 4) +
# plotting all mutations present in normal and cancer in black over top of the red
geom_point(data = mutation_data %>% filter(Status == "Normal"),
aes(x = Type, y = Gene, color = Status), size = 4) +
# formatting the color scale and legend
scale_color_manual(name = "Mutation locations", values = c("Cancer" = "Red", "Normal" = "Black"),
labels = c("Cancer cells only", "Cancer and normal cells")) +
# providing title, subtitle, and x label
labs(title = "Gene mutations in brain, lung, and skin cells",
subtitle = "Cancer cells vs. normal cells",
x = "Cell type") +
theme_bw() # setting a theme I like

How do I replace sequences of TRUE values with values indicating where they begin and end?

This seems like it should be a straightforward application of which, but I can't figure it out. I have a matrix indicating whether a person was present or absent in a given survey wave. I would like to convert it to a list of vectors, one list element per row of the matrix, indicating the ranges of time that a person is present. Here's a working example of what I'm trying to do:
in.wave <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE,
TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE,
FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE), .Dim = c(108L, 4L), .Dimnames = list(NULL, c("wave5",
"wave6", "wave7", "wave8")))
head(in.wave)
# wave5 wave6 wave7 wave8
# [1,] TRUE TRUE FALSE FALSE
# [2,] TRUE TRUE TRUE FALSE
# [3,] TRUE TRUE TRUE TRUE
# [4,] TRUE TRUE TRUE TRUE
# [5,] TRUE TRUE TRUE TRUE
# [6,] TRUE TRUE TRUE TRUE
# Current approach is pure brute force:
possibilities <- expand.grid(list(c(T, F), c(T, F), c(T, F), c(T, F)))
output <- list(
c(5.0, 8.0),
c(5.4, 8.0),
c(5.0, 5.4, 6.4, 8.0),
c(6.4, 8.0),
c(5.0, 6.4, 7.4, 8.0),
c(5.4, 6.4, 7.4, 8.0),
c(5.0, 5.4, 7.4, 8.0),
c(7.4, 8.0),
c(5.0, 7.4),
c(5.4, 7.4),
c(5.0, 5.4, 6.4, 7.4),
c(6.4, 7.4),
c(5.0, 6.4),
c(5.4, 6.4),
c(5.0, 5.4),
c(0)
)
desired <- apply(in.wave, 1, function(trial) {
output[[which(apply(possibilities, 1, function(x) all(trial == x)))]]
})
head(desired)
# [[1]]
# [1] 5.0 6.4
#
# [[2]]
# [1] 5.0 7.4
#
# [[3]]
# [1] 5 8
#
# [[4]]
# [1] 5 8
#
# [[5]]
# [1] 5 8
#
# [[6]]
# [1] 5 8
As the sample code shows, I'm currently doing this by brute force -- I'm enumerating all 2 ^ 4 possibilities, writing down what the output should be, and then looking up the correct output for each row of in.wave. Since I'll be expanding this to 8 columns, I would rather not write out all 2 ^ 8 possibilities.
The desired output is a list of vectors of even length, where each pair of elements indicates when someone enters and leaves the survey. So, for example, if you have a person who appears in all the waves, the desired output would be the vector c(5.0, 8.0), and if you have a person who is absent in wave 7, then the desired output would be c(5.0, 6.4, 7.4, 8.0).
If people couldn't be missing in a wave in the middle, you might use something like range with which to get the values where they were present. But having multiple spells is throwing me off. Any ideas how to solve this concisely?
Following 42's suggestion, I put together a function that makes use of rle. This seems to give the right answer, but hopefully someone else on here will have a more elegant solution.
makeJoinLeaveVector <- function(x) {
# Recodes a logical vector into a set of paired values indicating when people
# enter or leave a population
#
# Args:
# x: a logical vector
#
# Returns:
# A vector of length(rle(x)$values) * 2 with paired values indicating 0.6
# before each first value of TRUE in a set of them, and .4 after each last
# value of TRUE in a set of them.
stopifnot(is.logical(x))
# Get the run length encoding
x.rle <- rle(x)
# Now save a vector for each of the TRUE values
is.true <- which(x.rle$values)
out <- c(is.true, is.true + x.rle$lengths[is.true]) - 0.6
names(out) <- NULL
# Recode first and last possible values
out <- ifelse(out == 0.4, 1.0, out)
out <- ifelse(out == 4.4, 4.0, out)
return(sort(out) + 4)
}
desired <- apply(in.wave, 1, makeJoinLeaveVector)
head(desired)
# [[1]]
# [1] 5.0 6.4
#
# [[2]]
# [1] 5.0 7.4
#
# [[3]]
# [1] 5 8
#
# [[4]]
# [1] 5 8
#
# [[5]]
# [1] 5 8
#
# [[6]]
# [1] 5 8
This isn't more elegant, but I thought I'd add it now I've written it.
The end bits of the code could probably be written much more cleanly.
The approach is to calculate in.wave[, 2:4] - in.wave[, 1:3] which gives a -1 when someone exits the survey and a 1 when someone enters the survey.
times <- c(5, 5.4, 6.4, 7.4, 8)
# exiting gives a -1
# entering gives a 1
# No change gives a 0
transitions <- in.wave[, 2:4] - in.wave[, 1:3]
# Find the entrance and exit points of the survey
exits <- apply(transitions, 1, function(x) times[which(x == -1) + 1])
entrances <- apply(transitions, 1, function(x) times[which(x == 1) + 1])
# Combine entrances and exits
desired <- lapply(seq_len(nrow(in.wave)), function(x) sort(c(entrances[[x]], exits[[x]])))
# If no entrances, subject must have been in study from the beginning
desired <-
lapply(seq_len(nrow(in.wave)), function(x) if(length(entrances[[x]]) == 0){
c(times[1], desired[[x]])
} else {
desired[[x]]
})
# If no exits, subject must have remained in study until the end.
desired <-
lapply(seq_len(nrow(in.wave)), function(x) if(length(exits[[x]]) == 0){
c(desired[[x]], times[5])
} else {
desired[[x]]
})

GGVIS plot for logical matrix

I'm trying to plot a logical matrix similar to the question here, the difference is that I'm trying to do it using ggvis so that I can use the hover tool (the data has several thousand rows so I'd like to see row/column names when I hover over it). The following code worked for me with ggplot2.
library(reshape2)
library(ggplot2)
melted = melt(matrix)
ggplot(melted, aes(x = Var2, y = Var1)) +
geom_tile(aes(fill = value)) +
scale_file_manual(values = c("black", "red")) +
theme(axis.text.x = element_blank(), axis.text.y = element_blank()) +
coord_fixed(ratio = 1/10)
You can find an example for layer_rects here, I just made some adjustments.
Using the example for the linked question:
Load data
mm <- structure(c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE), .Dim = c(10L, 10L), .Dimnames = list(NULL,
c("n1", "n2", "n3", "n4", "n5", "n1.1", "n2.1", "n3.1", "n4.1",
"n5.1")))
Melt and change types
library(reshape2)
melted <- melt(mm)
melted$value <- as.numeric(melted$value)
melted$Var1 <- as.factor(melted$Var1)
Plot
melted %>%
ggvis(~Var2, ~Var1, fill = ~value) %>%
layer_rects(width = band(), height = band()) %>%
scale_nominal("x", padding = 0, points = FALSE) %>%
scale_nominal("y", padding = 0, points = FALSE)

Logical variable in data.table is not truely binary

I tried hard to come up with a simple example demonstrating my issue. Unfortunately, I failed. So please apologize.
I am working with data.table 1.9.6 on a very large dataset with more than 100 million rows and nearly 60 columns. After doing several operations I face the following issue with column del which I try to explain giving the following outputs:
> data[, class(del)]
[1] "logical"
> data[, summary(del)]
Mode FALSE TRUE NA's
logical 124763883 2088978 0
So far so good, everything is fine. But then I noticed:
> nrow(data[del==TRUE])
[1] 0
> nrow(data[del==FALSE])
[1] 126790922
Seems that some of the entries are either TRUE nor FALSE but in fact somewhere in between 0 and 1:
> nrow(data[del<0.5])
[1] 124763883
> nrow(data[del>0.5])
[1] 2088978
But how can that happen? I only assigned TRUE and FALSE's to this column. Again, I would have loved to produce a little example but it does not work, in a sense that the issue disappears if I select a subset. For example if a select row 47639 (id is unique):
> data[id==47639, list(del)]
del
1: TRUE
del seems to be set to TRUE. So if I first select this specific row and then test whether del is TRUE it works:
> data[id==47639][del==TRUE, list(del)]
del
1: TRUE
But data[del==TRUE][id==47639, list(del)] produces no output.
Empty data.table (0 rows) of 1 col: del
[Edit] I have spent a considerable amount of time in reducing the size of my dataset to make a reproduceable example. It now has only 132 rows and 1 column and still the odd behaviour. This is what I get:
data[, summary(del)]
Mode FALSE TRUE NA's
logical 129 3 0
> nrow(data[del==TRUE])
[1] 1
> nrow(data[del==FALSE])
[1] 127
> nrow(data[del<0.5])
[1] 129
> nrow(data[del>0.5])
[1] 3
Using dput to post the sample here fails because:
> dput(data)
structure(list(del = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)), .Names = "del", class = c("data.table",
"data.frame"), row.names = c(NA, -132L), .internal.selfref = <pointer: 0x1914ca8>, index = structure(integer(0), "`__del`" = integer(0)))
So dput and dget together produce:
> dput(data, "dt")
> dget("dt")
Error in parse(file = file, keep.source = keep.source) :
dt:16:62: unexpected '<'
15: FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)), .Names = "del", class = c("data.table",
16: "data.frame"), row.names = c(NA, -132L), .internal.selfref = <
This looks like a bug to me. Of course, I could use the workaround data[(del)] but that means rewriting all my previous code to make it safe. What concerns me most is that I do not know which operation exactly corrupts data.
I could also provide a tiny RData file if that is of any help but I do not know how to post it here correctly.

Print rows if condition met, R

I'm trying to figure out how to print the rows in my logical data.frame that have a TRUE's in them, example:
DATA:
structure(list(X1 = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE,
TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE,
FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE,
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
TRUE, FALSE, TRUE, TRUE, FALSE), X2 = c(TRUE, TRUE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE,
TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE,
FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE,
FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE), X3 = c(TRUE, FALSE,
TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE,
FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE,
FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE,
TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE,
FALSE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE,
TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE,
TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE,
FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE,
FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)), .Names = c("X1",
"X2", "X3"), row.names = c(NA, -100L), class = "data.frame")
FUNCTION (not working):
t(apply(X = data, MARGIN = 1, FUN = function(x){x[x == TRUE,]}))
I want the TRUE check to be applied across each row (MARGIN = 1)and then return that row if there is even one TRUE statement.
You shouldn't use apply for subsetting directly. You can use apply to make an logical vector you can then use for indexing. (And really you should avoid apply with data.frames unless all the columns are of the same type.) If you really only have TRUE/FALSE values you can do a quick trick with rowSums. If you've saved your data in a data.frame called dd, you can do
dd[ rowSums(dd)>0 , ]
to get all the rows with at least one TRUE. If you wanted to use apply to find the desired rows, you could use the any() function which checks for at least one TRUE value.
dd[ apply(dd, 1, any), ]

Resources