I can build very basic plots in R, but I'm trying to make my heatmap look more professional and I'm not sure how to do it. I have a data frame df of 11 observations of 11 variables:
> dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5", "6", "7",
"8", "9", "10+"), class = "data.frame")
So I transform df into a matrix to get a heatmap:
heatmap(data.matrix(df), Rowv=NA, Colv=NA, col = heat.colors(256), scale="column", margins=c(5,10))
This is what the plot looks like:
I'm not sure how to:
Change the ordering of the row and column labels. I want both to start from 0 in the top-left corner and ascend until 10+.
I'd also like more granularity in the colour. Right now you can't even tell the difference in values by looking at the colour...
Is heatmap from base R even the right tool for this? I looked up a few examples and I wasn't sure if there's a better library to achieve what I want.
There are several libraries that offer heatmap functionality. IMO base heatmap and gplots::heatmap.2 did not age well and are no longer the best options. Three good possibilities are ggplot2::geom_tile, pheatmap, and ComplexHeatmap.
Example data
Let's assume we have a matrix
dta <- matrix(rnorm(25), nrow=5)
rownames(dta) <- letters[1:5]
colnames(dta) <- LETTERS[1:5]
ggplot2::geom_tile
The ggplot2 version requires your data to be a tidy data frame, so we first reshape the matrix with tidyr::pivot_longer().
library(dplyr)
library(tidyr)
library(ggplot2)

dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level))
pheatmap
The pheatmap package is quite good at generating modern heatmaps. It takes a matrix as input. It can cluster the rows and columns and make a dendrogram, which is often a desired feature. It can also scale rows and columns (effectively plotting a Z-score).
pheatmap::pheatmap(dta,
                   scale = "none",
                   cluster_rows = FALSE,
                   cluster_cols = FALSE)
Note that the positions of rows and columns are not the same as with ggplot. You can look at the options that allow some useful customization. For example, suppose our rows have classes defined elsewhere:
ann_df <- data.frame(row.names = rownames(dta),
                     classification = rep(c("first", "second"), times = c(2, 3)))

pheatmap::pheatmap(dta,
                   scale = "none",
                   cluster_rows = FALSE,
                   cluster_cols = FALSE,
                   annotation_row = ann_df,
                   gaps_row = c(2))
Color scale
One of the big aspects that make your heatmap look professional is the color scale. On ggplot, you should check out scale_fill_gradient2().
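As a minimal sketch of that idea (reusing the random dta matrix from above; the colour choices and the midpoint of 0 are just placeholders, pick values that are meaningful for your data):

```r
library(dplyr)
library(tidyr)
library(tibble)
library(ggplot2)

set.seed(1)
dta <- matrix(rnorm(25), nrow = 5,
              dimnames = list(letters[1:5], LETTERS[1:5]))

# Reshape to long format, then apply a diverging fill scale centered on 0
p <- dta %>%
  as_tibble(rownames = "myrows") %>%
  pivot_longer(cols = -myrows, names_to = "mycols", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = myrows, y = mycols, fill = level)) +
  scale_fill_gradient2(low = "navy", mid = "white", high = "firebrick",
                       midpoint = 0)
p
```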
On pheatmap, you can try these settings for color as a starting point (each call builds a palette of a fixed length, here 100 colours; see the documentation of these functions):
color = scales::div_gradient_pal(low = "navy",
                                 mid = "green",
                                 high = "yellow")(seq(0, 1, length.out = 100)),
color = colorRampPalette(RColorBrewer::brewer.pal(n = 9,
                                                  name = "Blues"))(100),
color = viridisLite::plasma(100),
ComplexHeatmap
Finally, a package that has gained success recently is ComplexHeatmap. It is based on pheatmap but offers many additional options. See the link in zx8754's comment for a detailed book full of examples.
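A minimal sketch of a ComplexHeatmap call (with the example dta matrix from above; clustering is on by default, so it is disabled here to keep the original row/column order):

```r
library(ComplexHeatmap)

set.seed(1)
dta <- matrix(rnorm(25), nrow = 5,
              dimnames = list(letters[1:5], LETTERS[1:5]))

# Heatmap() is the central function; "name" sets the legend title.
Heatmap(dta,
        name = "level",
        cluster_rows = FALSE,
        cluster_columns = FALSE)
```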
Related
I'm trying to create a function in R that will take in two names (characters) and two doubles. It will then output a ggplot2 heatmap. Here's the data:
> dput(df)
structure(list(`0` = c(0.0608, 0.0791, 0.0514, 0.0223, 0.0072,
0.0019, 4e-04, 1e-04, 0, 0, 0), `1` = c(0.0912, 0.1186, 0.0771,
0.0334, 0.0109, 0.0028, 6e-04, 1e-04, 0, 0, 0), `2` = c(0.0684,
0.0889, 0.0578, 0.025, 0.0081, 0.0021, 5e-04, 1e-04, 0, 0, 0),
`3` = c(0.0342, 0.0445, 0.0289, 0.0125, 0.0041, 0.0011, 2e-04,
0, 0, 0, 0), `4` = c(0.0128, 0.0167, 0.0108, 0.0047, 0.0015,
4e-04, 1e-04, 0, 0, 0, 0), `5` = c(0.0038, 0.005, 0.0033,
0.0014, 5e-04, 1e-04, 0, 0, 0, 0, 0), `6` = c(0.001, 0.0013,
8e-04, 4e-04, 1e-04, 0, 0, 0, 0, 0, 0), `7` = c(2e-04, 3e-04,
2e-04, 1e-04, 0, 0, 0, 0, 0, 0, 0), `8` = c(0, 1e-04, 0,
0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0), `10+` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), class = "data.frame", row.names = c("0",
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10+"))
Now if I hardcode the names on the axes (homeScore and awayScore) it works:
df %>%
  as_tibble(rownames = "awayScore") %>%
  pivot_longer(cols = -awayScore, names_to = "homeScore", values_to = "probability") %>%
  mutate_at(vars(awayScore, homeScore), ~forcats::fct_relevel(.x, "10+", after = 10)) %>%
  ggplot() +
  geom_tile(aes(x = awayScore, y = homeScore, fill = probability)) +
  scale_fill_gradient2(low = "red", mid = "white", high = muted("blue")) +
  theme(plot.margin = unit(c(2, 2, 2, 2), "cm"))
This is the result:
But now, I want homeScore and awayScore to be variables in a function. So this is my new function with the same df:
TestFunction <- function(home, away) {
  df %>%
    as_tibble(rownames = away) %>%
    pivot_longer(cols = -away, names_to = "home", values_to = "probability") %>%
    mutate_at(vars(away, home), ~forcats::fct_relevel(.x, "10+", after = 10)) %>%
    ggplot() +
    geom_tile(aes(x = away, y = home, fill = probability)) +
    scale_fill_gradient2(low = "red", mid = "white", high = muted("blue")) +
    theme(plot.margin = unit(c(2, 2, 2, 2), "cm"))
}
But this plot is not what's expected:
What do I have to change so that homeScore and awayScore on the axes can be passed in as function arguments?
When you want to pass column names as strings instead of bare variable names, plain aes() won't work; one option is aes_string() (note that aes_string() is deprecated in recent ggplot2 releases in favour of the .data pronoun, but it still works).
Here is your updated code:
TestFunction <- function(home, away) {
  df %>%
    as_tibble(rownames = away) %>%
    pivot_longer(cols = -away, names_to = home, values_to = "probability") %>%
    mutate_at(vars(away, home), ~forcats::fct_relevel(.x, "10+", after = 10)) %>%
    ggplot() +
    geom_tile(aes_string(x = away, y = home, fill = "probability")) +
    scale_fill_gradient2(low = "red", mid = "white", high = muted("blue")) +
    theme(plot.margin = unit(c(2, 2, 2, 2), "cm"))
}

TestFunction("homeScore", "awayScore")
Note that I changed the pivot_longer line to use names_to = home (you had it hardcoded as "home"), and in aes_string I put quotes around fill = "probability".
Output:
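Since aes_string() has been deprecated in recent ggplot2 releases, here is a sketch of the same function written with the tidy-eval .data pronoun and all_of() instead (assuming the same df as above; behaviour should be identical):

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)  # for muted()

TestFunction <- function(home, away) {
  df %>%
    as_tibble(rownames = away) %>%
    pivot_longer(cols = -all_of(away), names_to = home,
                 values_to = "probability") %>%
    # across()/all_of() is the current replacement for mutate_at()/vars()
    mutate(across(all_of(c(away, home)),
                  ~ forcats::fct_relevel(.x, "10+", after = 10))) %>%
    ggplot() +
    # .data[[...]] looks the column up by its string name
    geom_tile(aes(x = .data[[away]], y = .data[[home]], fill = probability)) +
    scale_fill_gradient2(low = "red", mid = "white", high = muted("blue")) +
    theme(plot.margin = unit(c(2, 2, 2, 2), "cm"))
}

TestFunction("homeScore", "awayScore")
```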
I have a dataframe that I want to plot a heatmap of:
dput(df)
structure(list(`0` = c(6.08, 7.91, 5.14, 2.23, 0.72, 0.19, 0.04,
0.01, 0, 0, 0), `1` = c(9.12, 11.86, 7.71, 3.34, 1.09, 0.28,
0.06, 0.01, 0, 0, 0), `2` = c(6.84, 8.89, 5.78, 2.5, 0.81, 0.21,
0.05, 0.01, 0, 0, 0), `3` = c(3.42, 4.45, 2.89, 1.25, 0.41, 0.11,
0.02, 0, 0, 0, 0), `4` = c(1.28, 1.67, 1.08, 0.47, 0.15, 0.04,
0.01, 0, 0, 0, 0), `5` = c(0.38, 0.5, 0.33, 0.14, 0.05, 0.01,
0, 0, 0, 0, 0), `6` = c(0.1, 0.13, 0.08, 0.04, 0.01, 0, 0, 0,
0, 0, 0), `7` = c(0.02, 0.03, 0.02, 0.01, 0, 0, 0, 0, 0, 0, 0
), `8` = c(0, 0.01, 0, 0, 0, 0, 0, 0, 0, 0, 0), `9` = c(0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0), `10 or more` = c(0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0)), row.names = c("0", "1", "2", "3", "4", "5",
"6", "7", "8", "9", "10 or more"), class = "data.frame")
Now to plot the heatmap using ggplot2 this is how I approach the solution:
df %>%
  as_tibble(rownames = "homeScore") %>%
  pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
  ggplot() +
  geom_tile(aes(x = homeScore, y = awayScore, fill = level))
The problem I face is that the columns and rows are being sorted on (0,1,10+,2,..) instead of (0,1,2,...10+). Here's the example:
How do I sort the values such that 10+ is the last for row and column, instead of the third?
As @Nate already mentioned, you have to convert your vars to factors and put the levels in the right order. Instead of converting via factor(as.numeric(.)) (which converts "10 or more" to NA), I would recommend using forcats::fct_relevel, which allows you to change the order of the levels. For example, forcats::fct_relevel(homeScore, "10 or more", after = 10) will reorder the levels so that 10 or more becomes the last level. Try this:
library(ggplot2)
library(tidyr)
library(dplyr)
library(forcats)
df %>%
  as_tibble(rownames = "homeScore") %>%
  pivot_longer(cols = -homeScore, names_to = "awayScore", values_to = "level") %>%
  mutate_at(vars(homeScore, awayScore), ~forcats::fct_relevel(.x, "10 or more", after = 10)) %>%
  ggplot() +
  geom_tile(aes(x = homeScore, y = awayScore, fill = level))
This is my first time doing time series, and I want to calculate a time-series regression, but I first have to convert my data frame to a ts object. The problem is that I have missing quarters, so the ts conversion fails. How can I impute the missing quarters and their respective values to get a complete ts object?
Example with quarter 1981.3 missing (columns are Quarter.y, GDPGrowth, AverageCONS, BRGOVMEHR, ApprovalGOV):
  Quarter.y GDPGrowth AverageCONS BRGOVMEHR ApprovalGOV
1    1981.1       1.2  0.00000000         0    7.881275
2    1981.2      -0.2  1.17870604         0    7.676712
3    1981.4      -0.1  0.00000000         0    7.333129
4    1982.1       0.4  0.00000000         0    7.266816
This is the dput.
structure(list(Quarter.y = c(1981.1, 1981.2, 1981.4, 1982.1,
1982.4, 1983.4, 1984.1, 1984.3, 1984.4, 1985.2, 1985.4, 1986.1,
1986.4, 1987.2, 1987.4, 1988.2, 1988.4, 1989.2, 1989.4, 1990.1,
1990.2, 1990.4, 1991.2, 1991.4, 1992.2, 1992.4, 1993.2, 1993.3,
1993.4, 1994.1, 1994.2, 1994.3, 1995.1, 1995.2, 1995.3, 1995.4,
1996.1, 1996.2, 1996.4, 1997.1, 1997.2, 1997.4, 1998.2, 1998.4,
1999.1, 1999.2, 1999.4, 2000.1, 2000.2, 2000.3, 2000.4, 2001.1,
2001.2, 2001.3, 2001.4, 2002.1, 2002.3, 2002.4, 2003.1, 2003.3,
2003.4, 2004.1, 2004.2, 2004.4, 2005.1), GDPGrowth = c(1.2, -0.2,
-0.1, 0.4, 0.2, 1.4, 1.3, 2.7, 0.8, 1, 0.6, -0.6, 1, 2.2, 1.6,
1.8, 1.2, 0.4, 1.2, 2.1, 0.5, 1.8, -0.5, 1.3, -0.7, -0.3, 0,
0.6, -0.1, 1.4, 0.3, 0.7, -0.3, 0.8, 0.2, 0, -0.8, 1.4, 0.7,
-0.5, 1.3, 0.7, -0.4, -0.4, 1.2, 0, 0.8, 1.4, 0.8, 0, -0.3, 2,
0, -0.2, -0.2, -0.5, 0.5, -0.2, -1.5, 0.8, 0.3, -0.2, 0.5, -0.1,
0), AverageCONS = c(0, 1.17870603993396, 0, 0, 0, 0, 3.61936244127144,
0.416666666666667, 0, 4.32707915240231, 7.6364088578926, 0.257076257076257,
0, 22.4207759459411, 8.04871523114194, 14.6555459609091, 16.2250782932878,
1.17084307021898, 6.78706870557528, 0, 0, 9.7539118534683, 0,
0, 0, 0.0402095172505245, 0, 0, 0.0636265006342972, 0, 0.171974252305606,
0, 11.1524740312643, 2.68040672020172, 6.2111801242236, 3.24760735460988,
28.2976799963101, 0, 7.5732270977962, 0, 0, 1.49412278319752,
70.2064896755162, 34.0042105697558, 18.5823772614653, 18.0896275972026,
8.41449577357745, 10, 0, 0, 34.7491138493683, 8.36236933797909,
39.6563615833003, 74.4262295081967, 22.3611248302746, 10, 22.911760840126,
0, 0.0666722800439236, 0, 50.3843726943174, 0, 0, 1.72909969128655,
0), BRGOVMEHR = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), ApprovalGOV = c(7.88127469343306, 7.67671239967451,
7.33312916270801, 7.26681550104175, 7.57774047678561, 7.44440610328638,
7.51950710910292, 7.43455364650782, 7.21735906430465, 6.93572597869877,
7.33281216290814, 7.34110311728404, 7.81953413874152, 7.78638539397364,
7.5580444419889, 6.70433786365357, 6.76698016084029, 6.9932335657093,
7.89148017350285, 8.24705859843768, 8.20161269191644, 6.13021112846596,
7.44587363057623, 7.48928443049876, 6.66476678973035, 6.48625864551226,
6.13914369458924, 6.05634679973895, 5.9702909369734, 6.19216550005443,
6.87967122943963, 7.49940214266322, 7.08991191311255, 7.3351806925688,
7.46762039999888, 7.22336518577119, 6.75192112299076, 6.61614229895973,
6.39396085192013, 6.09682321355397, 5.99711627005931, 5.95861508444216,
6.31451929735713, 6.07326509257996, 7.58677238161551, 7.24041796080827,
6.69547440053917, 7.72437292977251, 7.61985191697131, 7.85861327446016,
7.78974162557168, 8.00182694049075, 7.82019060836613, 7.58946475073855,
7.89751118735182, 7.14978411180804, 7.4578134112501, 6.24455242517448,
5.8823776113788, 6.08103241246385, 5.73879250743035, 5.68128370028589,
5.83312282222293, 6.49992542902688, 6.45920800878159)), row.names = c(NA,
-65L), class = c("tbl_df", "tbl", "data.frame"))
If the input is dat then convert that to a zoo object z having a yearqtr index. Converting z to a ts object will insert NAs for the missing entries and then we can use na.approx on that to fill in those NAs.
library(zoo)
z <- read.zoo(dat, FUN = function(x) as.yearqtr(paste(x), "%Y.%q"))
na.approx(as.ts(z))
EDIT
Simplified slightly.
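For reference, a self-contained toy version of the same idea (hypothetical values matching the four-row example in the question, with 1981 Q3 missing) behaves like this:

```r
library(zoo)

dat <- data.frame(Quarter.y = c(1981.1, 1981.2, 1981.4, 1982.1),
                  GDPGrowth = c(1.2, -0.2, -0.1, 0.4))

# The first column becomes the index; "%Y.%q" parses "1981.1" as 1981 Q1.
z <- read.zoo(dat, FUN = function(x) as.yearqtr(paste(x), "%Y.%q"))

# as.ts() inserts an NA for the missing 1981 Q3,
# and na.approx() fills it by linear interpolation (here: -0.15).
na.approx(as.ts(z))
```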
The code below uses the tidyverse packages to create a new dataframe with 4 rows per each year in your dataset and then uses left_join to join your original dataset to that template. This will return NA values for missing quarters in your original data.
EDIT: removed assigning data. Data provided is in the df object.
library(tidyverse)
yrs <- df %>%
  mutate(year = str_replace_all(Quarter.y, "\\.\\d{1}$", "")) %>%
  distinct(year) %>%
  as_vector()

impute <- tibble(yr = as.integer(rep(yrs, 4))) %>%
  group_by(yr) %>%
  mutate(qtr = row_number()) %>%
  ungroup() %>%
  arrange(yr, qtr) %>%
  mutate(yr_qtr = paste0(yr, ".", qtr) %>% as.numeric()) %>%
  left_join(df, by = c("yr_qtr" = "Quarter.y"))
As far as imputing missing values goes, I'm not sure what the right approach here is -- it will depend on what patterns you expect to see/what a reasonable default value is.
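As an alternative sketch, tidyr::complete() can build the same quarterly template more compactly. This assumes every year should have quarters 1 through 4; the toy data below reuses the four-row example from the question:

```r
library(tidyverse)

df <- tibble(Quarter.y = c(1981.1, 1981.2, 1981.4, 1982.1),
             GDPGrowth = c(1.2, -0.2, -0.1, 0.4))

full_df <- df %>%
  # split "1981.1" into year and quarter, converting both to numbers
  separate(Quarter.y, into = c("yr", "qtr"), sep = "\\.", convert = TRUE) %>%
  # expand to every yr x quarter 1:4 combination; missing rows get NA
  complete(yr, qtr = 1:4) %>%
  # rebuild the original yyyy.q key
  unite("yr_qtr", yr, qtr, sep = ".", remove = FALSE) %>%
  mutate(yr_qtr = as.numeric(yr_qtr)) %>%
  arrange(yr, qtr)

full_df
```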
I am trying to find several statistics for each correlation between two data sets, using corr.test. I have two data sets, df1 and df2:
df1<- structure(list(col1A = c(1.64, 0.03, 0, 4.202, 2.981, 0.055,
0, 0.002, 0.005, 0, 0.002, 0.649, 2.55, 2.762, 6.402), col2A = c(2.635,
0.019, 0, 5.638, 3.542, 0.793, 0.259, 0, 0.046, 0.004, 0.017,
0.971, 3.81, 3.104, 5.849), col3A = c(0.91, 0.037, 0, 5.757,
3.916, 0.022, 0, 0, 0.003, 0, 0.262, 0.136, 2.874, 3.466, 5.003
), col4A = c(1.027, 0.021, 0, 4.697, 2.832, 0.038, 0.032, 0.001,
0.003, 0, 0, 0.317, 2.743, 3.187, 6.455)), class = "data.frame", row.names = c(NA,
-15L))
The second data set looks like this:
df2<-structure(list(col1 = c(2.172, 0, 0, 4.353, 4.581, 0.001, 0.027,
0, 0.002, 0, 0, 0.087, 2.129, 4.317, 5.849), col2 = c(2.093,
0, 0, 4.235, 3.166, 0, 0, 0.006, 0.01, 0, 0, 0.475, 0, 2.62,
5.364), col3 = c(3.322, 0, 0, 4.332, 4.018, 0.049, 0.169, 0.004,
0.02, 0, 0.032, 1.354, 2.944, 4.323, 5.44), col4 = c(0.928, 0.018,
0, 3.943, 3.723, 0.02, 0, 0, 0, 0, 0.075, 0.136, 3.982, 3.875,
5.83)), row.names = c("A", "AA", "AAA", "Aab", "buy", "yuyn",
"gff", "fgd", "kil", "lilk", "hhk", "lolo", "fdd", "vgfh", "nghg"
), class = "data.frame")
I want to obtain all possible correlations between the two data sets and extract all p-values and adjusted p-values.
I use
library(psych)
corr.test(df1,df2, use = "pairwise",method="pearson",adjust="holm",alpha=.05,ci=TRUE,minlength=5)
It does not give me any p-values, and I also cannot control any sort of permutation to calculate the adjusted p-values.
I was thinking of using the following (sample() shuffles y for each permutation):
x <- df1[, 1]
y <- df2[, 2]
corr_init <- cor(x, y)  # original correlation
N <- 1000               # number of permutations
count <- 0              # counts correlations greater than corr_init
for (i in 1:N) {
  y_perm <- sample(y)
  if (cor(y_perm, x) > corr_init) count <- count + 1
}
p <- count / N  # final p-value
But then I'd have to do it one pair at a time, and I'd still need to extract each column and test...
I am wondering if there is a better way to calculate all correlations between the two data sets and get the R values, p-values, and adjusted p-values with a specific number of randomizations?
It could be done using the Hmisc package:
library(Hmisc)
df1_cor_matrix <- rcorr(as.matrix(df1), type = "pearson")
df2_cor_matrix <- rcorr(as.matrix(df2), type = "pearson")
You can then extract out the coefficients using the following:
df1_coef <- df1_cor_matrix$r
df2_coef <- df2_cor_matrix$r
You can extract the p-values using the following:
df1_p_values <- df1_cor_matrix$P
df2_p_values <- df2_cor_matrix$P
You could get the adjusted p-values using the rcorr.adjust function from the RcmdrMisc package. Note that it takes the data itself (not the rcorr output):
library(RcmdrMisc)
rcorr.adjust(as.matrix(df1), type = "pearson")
rcorr.adjust(as.matrix(df2), type = "pearson")
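If you specifically need control over the number of randomizations, here is a sketch of a permutation test looped over all column pairs. It uses the df1 and df2 from the question; N = 1000, the two-sided empirical p-value with the +1 correction, and the Holm adjustment are all just example choices:

```r
set.seed(42)
N <- 1000  # number of permutations

# Two-sided empirical p-value for cor(x, y) under permutation of y
perm_cor_p <- function(x, y, N = 1000) {
  obs  <- cor(x, y)
  perm <- replicate(N, cor(x, sample(y)))
  c(r = obs, p = (sum(abs(perm) >= abs(obs)) + 1) / (N + 1))
}

# All pairwise combinations of df1 and df2 columns
res <- expand.grid(v1 = names(df1), v2 = names(df2),
                   stringsAsFactors = FALSE)
stats <- t(mapply(function(a, b) perm_cor_p(df1[[a]], df2[[b]], N),
                  res$v1, res$v2))
res <- cbind(res, stats)

# Adjust across all tests at once
res$p_adj <- p.adjust(res$p, method = "holm")
res
```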
I am trying to run a multiple imputation using the mice function (from the package of the same name) in R. I get a Warning that events have been logged. Here is the output from mice(.)$loggedEvents from my MWE (see below):
it im dep meth out
1 1 X pmm H
I'm not sure what is causing this warning or what its implications are. From what I understand, it can be caused by collinearity amongst variables, but that should be prevented by using remove_collinear = FALSE, and this isn't fixing the warning.
MWE:
library(mice)

Pop <- data.frame(X = c(   NA,  0.02, -1.15,  0.54, -0.61, -2.07),
                  Z = c( 0.83,  1.40, -3.07, -0.07, -0.20, -1.90),
                  D = c(    0,     0,     0,     1,     0,     0),
                  H = c( 0.01,  0.01,  0.01,  0.01,  0.02,  0.02))

Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = TRUE)
Obviously my original issue involved many more rows and columns of data and a higher number of imputations and iterations, but I've managed to trim it down to this MWE.
Any help in figuring out what's causing this problem would be great. Is there some sort of cut-off that mice uses when deciding if/when a covariate is collinear? If that cut-off is very high, would it override the remove_collinear = FALSE parameter?
This isn't a full answer, but I couldn't fit the reply in a comment.
The logged events warning can arise from a variety of issues. The issue raised can be identified from the "meth" column in the mice()$loggedEvents output.
The two issues I know of are collinearity and a predictor that is constant across all values (or perhaps one that is constant within the missing/non-missing groups also satisfies this criterion). I've added some variables to highlight these:
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07, NA),
Z1 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
Z2 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
D = c( 0, 0, 0, 1, 0, 0, 1),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
it im dep meth out
1 0 0 collinear Z2
2 1 1 X pmm H
Pop <- data.frame(X = c( NA, 0.02, -1.15, 0.54, -0.61, -2.07, NA),
Z1 = c( 0.83, 1.40, -3.07, -0.07, -0.20, -1.90, 2.00),
consvar = c( 0.83, 0.83, 0.83, 0.83, 0.83, 0.83, 0.83),
D = c( 0, 0, 0, 1, 0, 0, 1),
H = c( 0.01, 0.01, 0.01, 0.01, 0.02, 0.02, 0.02))
Pop.Imp <- mice(Pop, m = 1, maxit = 1, print = T)
Pop.Imp$loggedEvents
it im dep meth out
1 0 0 constant consvar
2 1 1 X pmm H
Unfortunately I don't know what issue the "pmm" entry refers to. Maybe it has something to do with predictive mean matching (the chosen imputation method) not being able to work in such a small dataset?