R tableGrob heatmap or conditional formatting in column - r

Is there a way to create an effect similar to Excel's conditional formatting -> color scales when presenting a table as a grid.table/tableGrob object? The color indicator should be red for the lower values and green for the higher values in the column.
That object format is needed so the table can be presented in grid format along with plots.
Thank you.

You can do this within tableGrob. You create a vector of colours, and then assign these to the cells.
So using the data from clemens's answer:
library(gridExtra)
library(grid)
# define colour vector
# change `vec` argument of `findInterval` to suit your cut-points
cols <- c("red" ,"orange", "green") [findInterval(my_data$Balance, c(-Inf, 1e4, 2e4, Inf))]
# or
# https://stackoverflow.com/questions/34517031/red-amber-green-sequential-palette-for-treemap-in-r
cols <- colorRampPalette(c("red", "yellow", "green"))(nrow(my_data))[rank(my_data$Balance)]
# create tables individually for each column
# this makes it easy to assign colours to rows
t1 <- tableGrob(my_data["Balance"],
theme=ttheme_default(
core=list(bg_params = list(fill=cols)),
colhead = list(bg_params=list(fill="white", col="grey90"))),
rows = NULL)
t2 <- tableGrob(my_data["ID"],
theme=ttheme_default(
core=list(bg_params = list(fill="white", col="grey90")),
colhead = list(bg_params=list(fill="white", col="grey90"))),
rows = NULL)
# join tables
tab <- gtable_combine(t2, t1)
# grid.newpage() ; grid.draw(tab)
# if you also want to add a black border
# https://stackoverflow.com/questions/31506294/gtable-put-a-black-line-around-all-cells-in-the-table-body
library(gtable)
tab <- gtable::gtable_add_grob(tab,
grobs = rectGrob(gp=gpar(fill=NA, lwd=2)),
t = 1, b = nrow(tab), l = 1, r = ncol(tab))
grid.newpage() ; grid.draw(tab)
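To see what the two colour vectors contain before they go into tableGrob, here is a minimal sketch in base R. The vector balance is a hypothetical five-value toy example, not the data from the question, and ties.method = "first" is my own addition: it keeps the ranks integer-valued when values repeat, so they can index the palette safely.

```r
# Hypothetical toy balances (made up for illustration)
balance <- c(-188, 4839, 11030, 21034, 58697)

# Binned scale: findInterval() returns the index of the interval each value
# falls into, and that index picks one of the three colours
bin_cols <- c("red", "orange", "green")[findInterval(balance, c(-Inf, 1e4, 2e4, Inf))]

# Continuous scale: build one palette entry per row, then pick by rank
ramp <- colorRampPalette(c("red", "yellow", "green"))(length(balance))
rank_cols <- ramp[rank(balance, ties.method = "first")]

print(data.frame(balance, bin_cols, rank_cols))
```

Either vector can be passed as fill in bg_params, one colour per row of the column being shaded.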

You could use tableHTML for that:
library(tableHTML)
for the dataset:
set.seed(666)
my_data <- data.frame(ID = 101:117,
Balance = sample(-1000:60000, 17))
ID Balance
1 101 46237
2 102 11030
3 103 58657
4 104 11280
5 105 21034
6 106 44296
7 107 58697
8 108 29381
9 109 -188
10 110 14854
11 111 46322
12 112 -2
13 113 4839
14 114 7670
15 115 11875
16 116 48475
17 117 1228
You can then create an HTML table using the tableHTML() function and apply a colour rank with the RAG theme to the 2nd column of the table:
my_data %>%
tableHTML(rownames = FALSE,
widths = c(50, 100)) %>%
add_css_conditional_column(columns = 2,
colour_rank_theme = 'RAG',
decreasing = TRUE)
The result looks like this:

Perhaps the most natural solution is to use heatmap():
heatmap(data.matrix(mtcars))
This yields a heatmap with the default colours. You can change them with the col argument (e.g. col = cm.colors(256)) or supply your own colour palette to achieve the desired output.
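As a rough sketch of the "own color palette" option, the call below uses a red-yellow-green ramp. The choices scale = "column", Rowv = NA and Colv = NA are my assumptions about what a table-like display needs (per-column colouring, no reordering); note that heatmap() draws a base-graphics plot rather than returning a grob.

```r
# Red-to-green palette, similar in spirit to Excel's colour scale
rag <- colorRampPalette(c("red", "yellow", "green"))(256)

# scale = "column" standardises each column so that columns with large values
# do not dominate the colour range; Rowv/Colv = NA switch off the clustering
# and keep rows and columns in their original order
hm <- heatmap(data.matrix(mtcars), col = rag, scale = "column",
              Rowv = NA, Colv = NA)
```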

A solution I found is the following. It only works if the data is sorted, and you have to give the number of rows (17, based on your screenshot). Note that blues9 holds only nine colours, so it is interpolated to 17 here:
theme = ttheme_default(
core = list(bg_params = list(fill = colorRampPalette(blues9)(17))))
Hope that helps. I am also looking for alternatives myself.

Related

Continuous data binning based on observation distribution/frequency to decide bin range r dplyr

I have been searching the internet for days without luck. Any suggestions would be highly appreciated, especially in a tidyverse-friendly syntax.
I have a tibble with approx. 4300 rows/obs and 320 columns. One column is my dependent variable, a continuous numeric column called "RR" (Response Ratios). My goal is to bin the RR values into 10 factor levels, to be used later for machine-learning classification.
I have experimented with the cut() function with this code:
df <- era.af.Al_noNaN %>%
rationalize() %>%
drop_na(RR) %>%
mutate(RR_MyQuantile = cut(RR,
breaks = unique(quantile(RR, probs = seq.int(0,1, by = 1 / numbers_of_bins))),
include.lowest = TRUE))
But I have no luck: my bins come out with an equal n in each, which does not reflect the distribution of the data. I have studied a bit here https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b but I simply cannot achieve the same in R.
Here is the distribution of my RR data values grouped into classes (not what I want).
You can try hist() to get the breaks. It is meant for plotting histograms, but it also returns the associated data as a side effect. In the example below, plot = FALSE suppresses the plot and exposes the breaks, which are then passed to cut(). This gives you equal-width cutoffs that preserve the distribution of the variable.
hist(iris$Sepal.Length, breaks = 5, plot = FALSE)
# $breaks
# [1] 4 5 6 7 8
#
# $counts
# [1] 32 57 49 12
#
# ...<omitted>
breaks <- hist(iris$Sepal.Length, breaks = 5, plot = FALSE)$breaks
dat <- iris %>%
mutate(sepal_length_group = cut(Sepal.Length, breaks = breaks))
dat %>%
count(sepal_length_group)
# sepal_length_group n
# 1 (4,5] 32
# 2 (5,6] 57
# 3 (6,7] 49
# 4 (7,8] 12
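To make the difference between the two binning styles concrete, here is a small sketch on a hypothetical skewed vector x (made up for illustration). The explicit seq() breaks stand in for the equal-width breaks that hist() would compute.

```r
# Hypothetical right-skewed data: many small values, a few large ones
x <- c(1:16, 40, 60, 80, 100)

# Quantile breaks: every bin receives the same number of observations,
# so the counts say nothing about the shape of the distribution
q_breaks <- unique(quantile(x, probs = seq(0, 1, by = 1 / 4)))
quantile_bins <- cut(x, breaks = q_breaks, include.lowest = TRUE)
table(quantile_bins)

# Equal-width breaks: the counts per bin follow the distribution
width_bins <- cut(x, breaks = seq(0, 100, by = 25))
table(width_bins)
```

The quantile version puts 5 values in each bin, while the equal-width version shows that most of the data sits in the first bin.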
Thank you!
I also experimented with cut() followed by count(). Setting labels = FALSE returns integer codes that can be used in a further mutate() to build a new column with character names for the interval groups.
numbers_of_bins = 10
df <- era.af.Al_noNaN %>%
rationalize() %>%
drop_na(RR) %>%
mutate(RR_MyQuantile = cut(RR,
breaks = unique(quantile(RR, probs = seq.int(0,1, by = 1 / numbers_of_bins))),
include.lowest = TRUE))
head(df$RR_MyQuantile,10)
df %>%
group_by(RR_MyQuantile) %>%
count()

Selecting 10 names based on 10 highest numbers of other column

I want to select the top 10 voted restaurants, and plot them together.
I want to create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
This returned every column of the data set for the 10 highest vote counts, but I want just the number of votes and the restaurant names.
My question is: how do I select only the top 10 highest vote counts with their restaurant names, and plot them together?
expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes
Another simple approach with base functions creating another variable:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
tibble(
restaurant = c("res1", "res2", "res3", "res4"),
votes = c(2, 5, 8, 6)
)
df %>%
arrange(-votes) %>%
head(3) %>%
ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
geom_col() +
coord_flip()
The top_n command also works in this case but is designed for grouped data.
It's more efficient, though less readable, to use base functions:
#toy data
d <- data.frame(list(Names = sample(LETTERS, size = 15), value = rnorm(25, 10, n = 15)))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
#reorder by, and retain, the top 10
newdata <- data.frame()
for (i in 1:10) {
newdata <- rbind(newdata,d[which(d$value == sort(d$value, decreasing = T)[1:10][i]),])
}
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
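The same selection can also be written without a loop. This is a minimal sketch using order(), with toy data generated the same way as above (the seed is my own choice, so the values will differ from the printed output):

```r
set.seed(1)  # arbitrary seed, only for reproducibility of this sketch
d <- data.frame(Names = sample(LETTERS, size = 15),
                value = rnorm(n = 15, mean = 25, sd = 10))

# order() returns the row positions sorted by value; keep the first ten
top10 <- d[order(d$value, decreasing = TRUE)[1:10], ]
top10

# Named vector for a bar plot of the ten highest values
barplot(setNames(top10$value, top10$Names), xlab = "Names", ylab = "value")
```

Unlike the loop with which(), this also keeps exactly ten rows when two values happen to be equal.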

cbind 1:nrows of same ID variable value to original data.frame

I have a large dataframe, where a variable id (first column) recurs with different values in the second column. My idea is to order the dataframe, to split it into a list and then lapply a function which cbinds the sequence 1:nrows(variable id) to each group. My code so far:
DF <- DF[order(DF[,1]),]
DF <- split(DF,DF[,1])
DF <- lapply(1:length(DF), function(i) cbind(DF[[i]], 1:length(DF[[i]])))
But this gives me an error: arguments imply different number of rows.
Can you elaborate?
> head(DF, n=50)
cell area
1 1 121.2130
2 2 81.3555
3 3 81.5862
4 4 83.6345
...
33 1 121.3270
34 2 80.7832
35 3 81.1816
36 4 83.3340
DF <- DF[order(DF$cell),]
What I want is:
> head(DF, n=50)
cell area counter
1 1 121.213 1
33 1 121.327 2
65 1 122.171 3
97 1 122.913 4
129 1 123.697 5
161 1 124.474 6
...and so on.
This is my code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF <- splitstackshape::getanID(DF, "cell")[] # thanks to akrun's answer
ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = cell)) +
geom_line(aes(group = cell)) + geom_point(size=0.1)
}
And the plot looks like this:
Most cells increase in area, only some decrease. This is only a first try to visualize my data, so what you can't see very well is that the areas drop down periodically due to cell division.
Additional question:
There is a problem I didn't take into account beforehand, which is that after a cell division a new cell is added to the data.frame and is handed the initial index 1 (you see in the image that all cells start from .id=1, not later), which is not what I want - it needs to inherit the index of its creation time. First thing that comes into my mind is that I could use a parsing mechanism that does this job for a newly added cell variable:
DF$.id[DF$cell != temporary.cellindex] <- max(DF$.id[DF$cell != temporary.cellindex])
Do you have a better idea? Thanks.
There is a boundary condition which may ease the problem: fixed number of cells at the beginning (32). Another solution would be to cut away all data before the last daughter cell is created.
Update: Additional question solved, here's the code:
cell.areas.t <- function(file) {
dat = paste(file)
DF <- read.table(dat, col.names = c("cell","area"))
DF$.id <- c(0, cumsum(diff(DF$cell) < 0)) + 1L # Indexing
title <- getwd()
myplot <- ggplot2::ggplot(data = DF, aes(x = .id , y = area, color = factor(cell))) +
geom_line(aes(group = cell)) + geom_line(size=0.1) + theme(legend.position="none") + ggtitle(title)
#save the plot
image=myplot
ggsave(file="cell_areas_time.svg", plot=image, width=10, height=8)
}
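The indexing line in the function relies on the cell numbers restarting at every time step. A short sketch on a hypothetical cell sequence (three cells at first, a fourth appearing at the second time step) shows what it computes:

```r
# Hypothetical cell column: 3 cells at time 1, then 4 cells at times 2 and 3
cell <- c(1, 2, 3,  1, 2, 3, 4,  1, 2, 3, 4)

# diff(cell) < 0 is TRUE wherever the cell number drops, i.e. where a new
# time step starts; cumsum() counts those restarts
id <- c(0, cumsum(diff(cell) < 0)) + 1L

data.frame(cell, id)
```

Cell 4 gets id 2 on its first appearance, so a newly created cell inherits the time step at which it was created instead of starting at 1.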
We can use getanID from splitstackshape
library(splitstackshape)
getanID(DF, "cell")[]
There's a much easier way to accomplish that goal: use ave with seq_along.
# ave() works on a vector, so pass a row index rather than the whole data frame
DF$group_seq <- ave(seq_len(nrow(DF)), DF[, 1], FUN = seq_along)
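The error in the question comes from length(), which counts the columns of a data frame, not its rows. Below is a small sketch on hypothetical data (the area values are made up) showing the ave() counter and a corrected version of the split/lapply approach; counter and counter2 are illustrative column names.

```r
# Hypothetical data in the shape described in the question
DF <- data.frame(cell = c(1, 2, 3, 1, 2, 3, 1, 2),
                 area = c(121.2, 81.4, 81.6, 121.3, 80.8, 81.2, 122.2, 81.0))
DF <- DF[order(DF$cell), ]

length(DF)  # number of columns
nrow(DF)    # number of rows

# Running counter within each cell, computed on a plain index vector
DF$counter <- ave(seq_len(nrow(DF)), DF$cell, FUN = seq_along)

# The split/lapply idea works once nrow() replaces length()
parts <- lapply(split(DF, DF$cell),
                function(g) cbind(g, counter2 = seq_len(nrow(g))))
DF2 <- do.call(rbind, parts)
DF2
```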

Scatter Plot Matrices

I have a matrix mat[n,m] and I'd like to use splom to plot the scatterplots of mat[,"col4"] as a function of all other column values. I'd also like to give a different color to the points of certain row numbers, which are stored in rowID[]. I've seen examples using splom, but they plot all variables against all variables and use a grouping column to change the color of points. Is it possible to do what I want using splom (or another R function)?
Example:
set.seed(1)
mat <- matrix(sample(0:100, 16), ncol=4)
dimnames(mat) <- list(rownames(mat, do.NULL = FALSE, prefix = "row"),
colnames(mat, do.NULL = FALSE, prefix = "col"))
mat
col1 col2 col3 col4
row1 26 19 58 61
row2 37 86 5 33
row3 56 97 18 66
row4 89 62 15 42
rowID <- matrix(c(1,3), ncol=1, nrow=2)
Thanks to https://stackoverflow.com/a/16033003/1262767
I've been using featurePlot function of caret package but I don't know how to change color of some specific points (that's why I'm interested in splom):
featurePlot(mat, mat[, "col4"], plot = "scatter",
## Add some space between the panels
between = list(x = 1, y = 1), main = "testSet",
## Add a background grid ('g') and a smoother ('smooth')
type = c("g", "p", "s"))
This doesn't really seem like a good fit for splom. I think you'd be better off reshaping your data and using a standard xyplot. For example
library(lattice)
library(reshape2)
mm <- melt(cbind(data.frame(mat), high = 1:nrow(mat) %in% rowID), c("col4", "high"))
xyplot(col4 ~ value | variable, mm, groups = high)
which gives one panel per remaining column, with the rows listed in rowID drawn in a different colour.
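If you would rather avoid extra packages, a base-graphics sketch of the same idea is below. It is a different technique from the lattice answer: one plot() call per column, with a colour vector built from rowID. The matrix values are copied from the printed example in the question; pt_col and others are illustrative names.

```r
# Matrix from the question, entered explicitly so the sketch is self-contained
mat <- matrix(c(26, 37, 56, 89,
                19, 86, 97, 62,
                58,  5, 18, 15,
                61, 33, 66, 42),
              ncol = 4,
              dimnames = list(paste0("row", 1:4), paste0("col", 1:4)))
rowID <- c(1, 3)

# One colour per row: highlighted rows in red, the rest in black
pt_col <- ifelse(seq_len(nrow(mat)) %in% rowID, "red", "black")

# Plot col4 against every other column, side by side
others <- setdiff(colnames(mat), "col4")
op <- par(mfrow = c(1, length(others)))
for (v in others) {
  plot(mat[, v], mat[, "col4"], col = pt_col, pch = 19,
       xlab = v, ylab = "col4")
}
par(op)  # restore the previous graphics settings
```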

Plotting only a subset of the points?

I am trying to plot the CDF curve for a large dataset containing about 29 million values using ggplot. I compute it like this:
mycounts = ddply(idata.frame(newdata), .(Type), transform, ecd = ecdf(Value)(Value))
plot = ggplot(mycounts, aes(x=Value, y=ecd))
This is taking ages to plot. I was wondering if there is a clean way to plot only a sample of this dataset (say, every 10th point or 50th point) without compromising on the actual result?
I am not sure about your data structure, but a simple sample call might be enough:
n <- nrow(mycounts) # number of cases in data frame
mycounts <- mycounts[sample(n, round(n/10)), ] # get an n/10 sample to the same data frame
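A deterministic alternative to random sampling is to build the ECDF from all values and then evaluate it only on a coarse grid. This is a sketch under my own assumptions: values is simulated stand-in data, and 200 grid points is an arbitrary choice.

```r
set.seed(42)            # arbitrary seed for the simulated stand-in data
values <- rexp(1e5)     # hypothetical stand-in for the real Value column

# ecdf() uses every observation, so the curve itself stays exact
Fn <- ecdf(values)

# Evaluate it on 200 evenly spaced points instead of at every observation
grid <- seq(min(values), max(values), length.out = 200)
thin <- data.frame(Value = grid, ecd = Fn(grid))

plot(thin$Value, thin$ecd, type = "s", xlab = "Value", ylab = "ecd")
```

The plotted data frame has 200 rows regardless of the size of the input, and each retained point lies exactly on the full-data ECDF.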
Instead of taking every n-th point, can you quantize your data set down to a sufficient resolution before plotting it? That way, you won't have to plot resolution you don't need (or can't see).
Here's one way you can do it. (The function I've written below is generic, but the example uses names from your question.)
library(ggplot2)
library(plyr)
## A data set containing two ramps up to 100, one by 1, one by 10
tens <- data.frame(Type = factor(c(rep(10, 10), rep(1, 100))),
Value = c(1:10 * 10, 1:100))
## Given a data frame and ddply-style arguments, partition the frame
## using ddply and summarize the values in each partition with a
## quantized ecdf. The resulting data frame for each partition has
## two columns: value and value_ecdf.
dd_ecdf <- function(df, ..., .quantizer = identity, .value = value) {
value_colname <- deparse(substitute(.value))
ddply(df, ..., .fun = function(rdf) {
xs <- rdf[[value_colname]]
qxs <- sort(unique(.quantizer(xs)))
data.frame(value = qxs, value_ecdf = ecdf(xs)(qxs))
})
}
## Plot each type's ECDF (w/o quantization)
tens_cdf <- dd_ecdf(tens, .(Type), .value = Value)
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdf)
## Plot each type's ECDF (quantizing to nearest 25)
rounder <- function(...) function(x) round_any(x, ...)
tens_cdfq <- dd_ecdf(tens, .(Type), .value = Value, .quantizer = rounder(25))
qplot(value, value_ecdf, color = Type, geom = "step", data = tens_cdfq)
While the original data set and the ecdf set had 110 rows, the quantized-ecdf set is much reduced:
> dim(tens)
[1] 110 2
> dim(tens_cdf)
[1] 110 3
> dim(tens_cdfq)
[1] 10 3
> tens_cdfq
Type value value_ecdf
1 1 0 0.00
2 1 25 0.25
3 1 50 0.50
4 1 75 0.75
5 1 100 1.00
6 10 0 0.00
7 10 25 0.20
8 10 50 0.50
9 10 75 0.70
10 10 100 1.00
I hope this helps! :-)
