Reverse "heat" in one column of a heatmap() - r

I'm using heatmap to plot the leader for each of the respective pitching performance categories for some baseball data. My problem is that I need to reverse the "heat" of just one of the columns, because the best ERA is the lowest, not the highest. Here's the code; mlb2010 was imported from a SQL database via RSQLite.
mlb10 <- sapply(2:length(mlb2010), function(i) {
  mlb2010[, i] <- as.numeric(mlb2010[, i])
})
rc <- rainbow(nrow(mlb10), start = 0, end = .3)
cc <- rainbow(ncol(mlb10), start = 0, end = .3)
heatmap(mlb10, col = rev(heat.colors(256)), scale = "column",
        Rowv = NULL, Colv = NA, RowSideColors = rc, ColSideColors = cc,
        margins = c(5, 10), labRow = c(mlb2010$team), labCol = names(al2010)[-1],
        xlab = "Performance factors", ylab = "Team",
        main = c("Relating Performance to Payroll", "2010 MLB Season"))
I have tried the revC argument in heatmap with no success. Is that what I should be using? Or does that reorder all of the columns rather than reversing the values inside one column? I've also tried an sapply over the colors, to no avail.
Any help would be greatly appreciated.

Per request from the OP, posting the basics of the solution.
Just negate the ERA column, e.g. mlb2010$ERA <- -mlb2010$ERA before building mlb10, to reverse its order, then plot as in the post.
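As a fuller illustration, here is a minimal self-contained sketch of the same idea; the team names and numbers below are made up purely for demonstration:
# toy pitching data: W and SO are "higher is better", ERA is "lower is better"
stats <- data.frame(team = c("NYY", "BOS", "TBR", "TOR"),
                    W   = c(95, 89, 96, 85),
                    ERA = c(4.06, 4.20, 3.78, 4.22),
                    SO  = c(1154, 1214, 1189, 1184))
m <- as.matrix(stats[, -1])
rownames(m) <- stats$team
# negate ERA so the best (lowest) ERA gets the "hottest" colour
m[, "ERA"] <- -m[, "ERA"]
heatmap(m, col = rev(heat.colors(256)), scale = "column",
        Colv = NA, xlab = "Performance factors", ylab = "Team")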

Related

How can I delete column names in heatmap.2 in R?

I've built a nice heatmap.2 in R, but I am trying to remove the column names at the bottom of the heatmap because they are illegible, and I have not been able to delete them.
I've tried labCol = NULL and nothing seems to happen; colCol = "white" also doesn't work. Any suggestions?
library(gplots)        # for heatmap.2()
library(RColorBrewer)  # for brewer.pal()
data <- read.csv("Symptoms Only.csv", row.names = 1)
data$Subject_type <- gsub("0", "Contact", data$Subject_type)
data$Subject_type <- gsub("1", "Survivor", data$Subject_type)
data$Gender <- gsub("0", "Male", data$Gender)
data$Gender <- gsub("1", "Female", data$Gender)
condition_colors <- unlist(lapply(data$Subject_type, function(x) {
  if (grepl("Contact", x)) "#FFC0CB"
  else if (grepl("Survivor", x)) "#808080"
}))
condition_colors
colnames(data)
data <- data[-c(1:3)]
colnames(data)
data <- t(data)
data <- as.matrix(data)
x <- data
# heat.clust() is a clustering helper function, not defined in this snippet
z <- heat.clust(x,
                scaledim = "column",
                zlim = c(-3, 3),
                zlim_select = c("dend", "outdata"),
                reorder = c("column", "row"),
                distfun = function(x) as.dist(1 - cor(t(x))),
                hclustfun = function(x) hclust(x, method = "ward.D"),
                scalefun = scale)
heatmap.2(z$data,
          Rowv = z$Rowv,
          Colv = z$Colv,
          trace = "none",
          scale = "none",
          symbreaks = TRUE,
          srtCol = 90,
          adjCol = c(0.8, 1),
          key = FALSE,
          dendrogram = "both",
          lhei = c(1, 5),
          cexRow = 1.1,
          margins = c(4, 7),
          labCol = NULL,
          xlab = "complete",
          ColSideColors = condition_colors,
          col = rev(colorRampPalette(brewer.pal(10, "RdBu"))(256)))
The heatmap is drawn with base R graphics, so passing xaxt = "n" will suppress the column axis and remove the column names. Alternatively, you could set the column names to empty strings, but that is probably not advisable.
library(gplots)
data(mtcars)
x <- as.matrix(mtcars)
# correct solution: suppress the column axis
heatmap.2(x, xaxt = "n")
# not so good: blank out the column names
colnames(x) <- rep("", ncol(x))
heatmap.2(x)

How to programmatically determine the column indices of principal components using FactoMineR package?

Given a data frame containing mixed variables (i.e. both categorical and continuous) like this:
digits <- 0:9
# set seed for reproducibility
set.seed(17)
# function to create random strings
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                 studLoc = sample(createRandString(10)),
                 finalmark = sample(c(0:100), 10),
                 subj1mark = sample(c(0:100), 10),
                 subj2mark = sample(c(0:100), 10))
I perform unsupervised feature selection using the package FactoMineR
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
The variable df.princomp is a list. Thereafter, to visualize the principal components, I use fviz_screeplot() and fviz_contrib() like this:
# library(factoextra)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component",
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")
factoextra::fviz_contrib(df.princomp, choice = "var",
                         axes = 1, top = 10, sort.val = c("desc"))
which gives Fig1 (a scree plot) and Fig2 (a contribution plot), described below.
Explanation of Fig1: Fig1 is a scree plot, a simple line-segment plot that shows the fraction of total variance in the data explained by each Principal Component (PC). We can see that the first three PCs collectively account for 43.8% of the total variance. The question now naturally arises: "What are these variables?" This is shown in Fig2.
Explanation of Fig2: This figure visualizes the contribution of rows/columns from the results of the Principal Component Analysis (PCA). From here I can see that the variables name, studLoc and finalmark are the most important ones for further analysis.
Further analysis, where I'm stuck: to derive the contributions of the aforementioned variables name, studLoc and finalmark, I use the principal component variable df.princomp (see above), like df.princomp$quanti.var$contrib[,4] and df.princomp$quali.var$contrib[,2:3].
I have to manually specify the column indices [,2:3] and [,4].
What I want: I want to know how to assign the column indices dynamically, so that I do not have to hard-code an index like [,2:3] into calls on the list df.princomp.
I've already looked at the following similar questions (1, 2, 3 and 4) but cannot find a solution there. Any help or suggestions to solve this problem would be appreciated.
Not sure if my interpretation of your question is correct; apologies if not. From what I gather, you are using PCA as an initial tool to show which variables are most important in explaining the dataset. You then want to go back to your original data and select those variables quickly, without manually coding the indices each time, to use them for some other analysis.
If this is correct, then I have saved the data behind the contribution plot, filtered for the variables with the greatest contributions, and used that result to create a new data frame with these variables alone.
digits <- 0:9
# set seed for reproducibility
set.seed(17)
# function to create random strings
createRandString <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}
df <- data.frame(ID = c(1:10), name = sample(letters[1:10]),
                 studLoc = sample(createRandString(10)),
                 finalmark = sample(c(0:100), 10),
                 subj1mark = sample(c(0:100), 10),
                 subj2mark = sample(c(0:100), 10))
df.princomp <- FactoMineR::FAMD(df, graph = FALSE)
factoextra::fviz_screeplot(df.princomp, addlabels = TRUE,
                           barfill = "gray", barcolor = "black",
                           ylim = c(0, 50), xlab = "Principal Component",
                           ylab = "Percentage of explained variance",
                           main = "Principal Component (PC) for mixed variables")
# find the top contributing variables to the overall variation in the dataset
# here I am choosing the top 10 variables (although we only have 6 in our df)
# note you can specify which axes to look at with axes=, even axes = c(1, 2)
f <- factoextra::fviz_contrib(df.princomp, choice = "var",
                              axes = c(1), top = 10, sort.val = c("desc"))
# save the data behind the contribution plot
dat <- f$data
# keep the variables whose contribution is higher than, say, 20
r <- rownames(dat[dat$contrib > 20, ])
# extract these from your original data frame into a new data frame for further analysis
new <- df[r]
new
#    finalmark name    studLoc
# 1         53    b POTYQ0002N
# 2         73    i LWMTW1195I
# 3         95    d VTUGO1685F
# 4         39    f YCGGS5755N
# 5         97    c GOSWE3283C
# 6         58    g APBQD6181U
# 7         67    a VUJOG1460V
# 8         64    h YXOGP1897F
# 9         15    j NFUOB6042V
# 10        81    e QYTHG0783G
Based on your comment, where you said you wanted to 'Find variables with value greater than 5 in Dim.1 AND Dim.2 and save these variables to a new data frame', I would do this:
# top contributors to both Dim 1 and Dim 2
f <- factoextra::fviz_contrib(df.princomp, choice = "var",
                              axes = c(1, 2), top = 10, sort.val = c("desc"))
# save the data behind the contribution plot
dat <- f$data
# keep the variables whose contribution is higher than 5
r <- rownames(dat[dat$contrib > 5, ])
# extract these from your original data frame into a new data frame for further analysis
new <- df[r]
new
(This keeps all the original variables in our new data frame since they all contributed more than 5% to the total variance)
There are a lot of ways to extract the contributions of individual variables to PCs. For purely numeric input, one can run a PCA with prcomp and look at $rotation (I spoke too soon and forgot you've got factors here, so prcomp won't work directly). Since you are using factoextra::fviz_contrib, it makes sense to check how that function extracts this information under the hood. Type factoextra::fviz_contrib at the console and read the function:
> factoextra::fviz_contrib
function (X, choice = c("row", "col", "var", "ind", "quanti.var",
    "quali.var", "group", "partial.axes"), axes = 1, fill = "steelblue",
    color = "steelblue", sort.val = c("desc", "asc", "none"),
    top = Inf, xtickslab.rt = 45, ggtheme = theme_minimal(),
    ...)
{
    sort.val <- match.arg(sort.val)
    choice = match.arg(choice)
    title <- .build_title(choice[1], "Contribution", axes)
    dd <- facto_summarize(X, element = choice, result = "contrib",
        axes = axes)
    contrib <- dd$contrib
    names(contrib) <- rownames(dd)
    theo_contrib <- 100/length(contrib)
    if (length(axes) > 1) {
        eig <- get_eigenvalue(X)[axes, 1]
        theo_contrib <- sum(theo_contrib * eig)/sum(eig)
    }
    df <- data.frame(name = factor(names(contrib), levels = names(contrib)),
        contrib = contrib)
    if (choice == "quanti.var") {
        df$Groups <- .get_quanti_var_groups(X)
        if (missing(fill))
            fill <- "Groups"
        if (missing(color))
            color <- "Groups"
    }
    p <- ggpubr::ggbarplot(df, x = "name", y = "contrib", fill = fill,
        color = color, sort.val = sort.val, top = top, main = title,
        xlab = FALSE, ylab = "Contributions (%)", xtickslab.rt = xtickslab.rt,
        ggtheme = ggtheme, sort.by.groups = FALSE, ...) +
        geom_hline(yintercept = theo_contrib, linetype = 2, color = "red")
    p
}
<environment: namespace:factoextra>
So it's really just calling facto_summarize from the same package. By analogy you can do the same thing, simply call:
> dd <- factoextra::facto_summarize(df.princomp, element = "var", result = "contrib", axes = 1)
> dd
               name    contrib
ID               ID  0.9924561
finalmark finalmark 21.4149175
subj1mark subj1mark  7.1874438
subj2mark subj2mark 16.6831560
name           name 26.8610132
studLoc     studLoc 26.8610132
And that's the table corresponding to your figure 2. For PC2 use axes = 2 and so on.
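For instance, the analogous call for the second component is just (output omitted here):
# same extraction, but for the contributions to PC2
factoextra::facto_summarize(df.princomp, element = "var",
                            result = "contrib", axes = 2)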
Regarding "how to programmatically determine the column indices of the PCs", I'm not 100% sure I understand what you want, but if you just want to say for column "finalmark", grab its contribution to PC3 you can do the following:
library(tidyverse)
# make a tidy table of all column names in the original df
# with their contributions to all PCs
contribution_df <- map_df(set_names(1:5),
                          ~factoextra::facto_summarize(df.princomp, element = "var",
                                                       result = "contrib", axes = .x),
                          .id = "PC")
# get the contribution of column 'finalmark' to every PC
contribution_df %>%
  filter(name == "finalmark")
# get the contribution of column 'finalmark' to PC3
contribution_df %>%
  filter(name == "finalmark" & PC == 3)
# or, just the numeric value of the contribution
filter(contribution_df, name == "finalmark" & PC == 3)$contrib
BTW, I think ID in your example is treated as numeric instead of as a factor, but since it's just an example I'm not bothering with it.
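If you did want ID treated as categorical, converting it before calling FAMD should do it; a one-line sketch:
# make FAMD treat ID as a qualitative (categorical) variable
df$ID <- as.factor(df$ID)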

R: Get quantmod's chartSeries and AddTA to not show last value

When using chartSeries, by default it also shows the last value at the top left of the plot. Is there any way to prevent it from doing that?
When adding a new TA with addTA, you can avoid the last value on the plot by setting the argument legend = "", but only if you're drawing the TA on a new plot. If the TA is added to a previously plotted graphic, it'll show the last value regardless of what you pass to the legend argument.
getSymbols ("AAPL", src = "google")
chartSeries(AAPL)
What can I use here to prevent it from printing the last value on the plot?
addTA(EMA(Cl(AAPL)), on = 1, legend = "")
This still prints the last value on the top left of the plot. The weird part is that it doesn't do it if you're plotting on a new plot like this:
addTA(EMA(Cl(AAPL)), legend = "")
Is it like this by default, or is there something I can do to get around it?
The last value is shown by default (yes, annoyingly), and you'll likely have to modify the source code to remove the last number that addTA shows.
I don't use addTA, but rather add_TA and chart_Series, because I think they look much better (the second-generation charts in quantmod). Here is a solution that removes the last number for the add_TA version, but you must be willing to modify the source code.
In add_TA, you'll need to modify approximately lines 56-60 of the source. Replace the text.exp block, which is this:
# this is inside add_TA:
if (is.na(on)) {
    plot_object$add_frame(ylim = c(0, 1), asp = 0.15)
    plot_object$next_frame()
    text.exp <- expression(text(x = c(1, 1 + strwidth(name)),
        y = 0.3, labels = c(name, round(last(xdata[xsubset]), 5)),
        col = c(1, col), adj = c(0, 0), cex = 0.9,
        offset = 0, pos = 4))
    plot_object$add(text.exp, env = c(lenv, plot_object$Env),
with these modifications:
if (is.na(on)) {
    plot_object$add_frame(ylim = c(0, 1), asp = 0.15)
    plot_object$next_frame()
    text.exp <- expression(text(x = c(strwidth(name)),  # <- affects the label on the subchart
        y = 0.3, labels = name, col = c(col), adj = c(0), cex = 0.9,
        offset = 1, pos = 4))
    plot_object$add(text.exp, env = c(lenv, plot_object$Env),
        expr = TRUE)
    ...
and assign the modified code to a new variable, called, say, add_TA.mine:
add_TA.mine <- function (x, order = NULL, on = NA, legend = "auto",
    yaxis = list(NULL, NULL), col = 1, taType = NULL, ...)
{
    lenv <- new.env()
    lenv$name <- deparse(substitute(x))
    lenv$plot_ta <- function(x, ta, on, taType, col = col, ...) {
        xdata <- x$Env$xdata
        ....
    }
    ....[all the code for the rest of the function, with the modifications]....
    plot_object
}
Now, just run the code with the modified function:
library(quantmod)
getSymbols("AAPL")
environment(add_TA.mine) <- environment(get("add_TA", envir = asNamespace("quantmod")))
assignInNamespace(x = "add_TA", value = add_TA.mine, ns = "quantmod")
chart_Series(AAPL, subset = "2017")
add_TA(RSI(Cl(AAPL)))
# or, equivalently, call it via the namespace:
quantmod:::add_TA(RSI(Cl(AAPL)))
You can see that the last value is no longer printed.
(You could make the same kinds of changes in the old addTA code, perhaps via chartSeries, if you really want to stick to the old plots.)
If you're happy with the changes and want to make them permanent in add_TA, you can recompile the quantmod source code yourself with your modifications (i.e. download the quantmod source code and rebuild the package). If you make a mess of things, you can always redownload the original quantmod source code.

turn off grid lines for R xyplot timeseries

I am plotting a time series with the timePlot function of the openair package in R. The graph has grey grid lines in the background that I would like to turn off, but I cannot find a way to do it. I would expect something simple such as grid = FALSE, but that is not the case. It appears to be rather complex, requiring extra arguments that are passed on to xyplot of the lattice package. I believe the answer lies somewhere in par.settings, but all my attempts have failed. Does anyone have any suggestions?
Here is my script:
timeozone <- import(i, date = "date", date.format = "%m/%d/%Y",
                    header = TRUE, na.strings = "")
ROMO <- timePlot(timeozone, pollutant = c("C7", "C9", "C10"),
                 group = TRUE, stack = FALSE, y.relation = "same",
                 date.breaks = 9, lty = c(1, 2, 3), lwd = c(2, 3, 3),
                 fontsize = 15, cols = c("black", "black"),
                 ylab = "Ozone (ppbv)")
panel <- function(x, y) {
  panel.grid(h = 0, v = 0)
  panel.xyplot(x, y)
}
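For what it's worth, the grid-control idea behind that panel function can be seen in plain lattice. The sketch below uses a built-in dataset; whether timePlot actually forwards a user-supplied panel function this way is an assumption that would need checking against openair's documentation:
library(lattice)
# in lattice, background grid lines appear only if the panel function
# draws them; a panel that calls panel.xyplot() but never panel.grid()
# yields a grid-free plot
xyplot(uptake ~ conc, data = CO2, type = "p",
       panel = function(x, y, ...) {
         # deliberately no panel.grid() call here
         panel.xyplot(x, y, ...)
       })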

How to change the color and width of lines with par function in R

I have a question about the par function in R.
I want to change the color and/or width of a line in a graph with the par function. (I am using par because the gaps.plot command below does not accept a col argument; gaps.plot is used after the synth command.)
So I used the following commands, but I noticed that the lines of the BOX are changed rather than the lines of the GRAPHS.
synth1 <- read.csv(file = "C:\\Users\\Research\\R\\synthinR_v4.csv", header = TRUE)
attach(synth1)
library("Synth")
dataprep.out34 <- dataprep(foo = synth1,
                           predictors = c("lncdsales", "md1", "md2", "md3", "md4",
                                          "md5", "md6", "md7", "md8", "md9", "md10",
                                          "md11", "yd1", "yd2", "yd3", "yd4", "yd5",
                                          "yd6", "yd7", "yd8"),
                           predictors.op = "mean",
                           time.predictors.prior = -13:1,
                           dependent = "lndigital",
                           unit.variable = "artistalbumcode",
                           time.variable = "release",
                           treatment.identifier = 34,
                           controls.identifier = c(1:33, 35:49),
                           time.optimize.ssr = -13:1,
                           time.plot = -13:25)
synth.out34 <- synth(data.prep.obj = dataprep.out34, method = "BFGS")
par(lwd = 2, col = "#cccccc")
gaps.plot(synth.res = synth.out34, dataprep.res = dataprep.out34,
          Ylab = " Log Digital Sales ", Xlab = "Release",
          Ylim = c(-7, 7), Main = NA)
Does anyone know how to fix this problem?
Thank you in advance for your willingness to help. I greatly appreciate it!
The col argument to par sets the default plotting colour (i.e. when col is not explicitly specified in plotting calls), but unfortunately col = "black" is hard-coded into the source of gaps.plot.
You can make a modified copy of the function by either (1) viewing the source (F2 in RStudio, or just executing gaps.plot), editing it and assigning it to a new object, or (2) doing something like the following:
gaps.plot2 <- eval(parse(text = gsub('col = "black"', 'col = "red"',
                                     deparse(Synth:::gaps.plot))))
and then using gaps.plot2 as you would use gaps.plot:
gaps.plot2(synth.res = synth.out34, dataprep.res = dataprep.out34,
           Ylab = " Log Digital Sales ", Xlab = "Release",
           Ylim = c(-7, 7), Main = NA)
Alter the lwd similarly. For example, to make the lines red with a width of 3, use nested gsub calls like this:
gaps.plot2 <- eval(parse(text = gsub('lwd = 2', 'lwd = 3',
                                     gsub('col = "black"', 'col = "red"',
                                          deparse(Synth:::gaps.plot)))))
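One caveat: a function rebuilt with eval(parse(...)) is created in your calling environment rather than in Synth's namespace, so if gaps.plot relies on any unexported helper functions you may also need to reset its environment:
# point the rebuilt copy back at Synth's namespace so that any
# unexported helper functions it calls can still be found
environment(gaps.plot2) <- asNamespace("Synth")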
