Heatmap in ggplot2 issue with fill - r

I'm trying to make a heatmap using ggplot2. What I want to be plotted is in the form of a matrix which is the result of a function.
Here is the data:
Image A B C D E F
1 3 23 45 23 45 90
2 4 34 34 34 34 89
3 34 33 24 89 23 67
4 3 45 234 90 12 78
5 78 89 34 23 12 56
6 56 90 56 67 34 45
Here is the function:
vector_a <- names(master)[2:4]
vector_b <- names(master)[5:6]
heatmap_prep <- function(dataframe, vector_a, vector_b) {
  # One row per variable in vector_a, one column per variable in vector_b
  dummy <- as.data.frame(matrix(0, nrow = length(vector_a), ncol = length(vector_b)))
  for (i in seq_along(vector_a)) {
    first_value <- dataframe[[vector_a[i]]]
    for (j in seq_along(vector_b)) {
      second_value <- dataframe[[vector_b[j]]]
      dummy[i, j] <- cor(first_value, second_value, method = "spearman")
    }
  }
  rownames(dummy) <- vector_a
  colnames(dummy) <- vector_b
  return(as.matrix(dummy))
}
heatmap_data_matrix1 <- heatmap_prep(master, vector_a, vector_b)
Using the data in heatmap_data_matrix1, I want to create a heatmap using the following code:
library(ggplot2)
if (length(grep("ggplot2", (.packages() ))) == 0){
library(ggplot2)
}
p <- ggplot(data = heatmap_data_matrix1, aes(x = vector_a, y = vector_b)
+ geom_tile(aes(fill = ))
However, this does not work. How should I reformat my data/code so this heatmap can be created? What should I put under "fill="?
Thanks!

Because many R functions are vectorized, and you rarely need to pre-allocate a result, the for loop is unnecessary. cor() accepts two data frames (or matrices) and returns the whole correlation matrix at once, so you can simply run cor(x, y, method = "spearman").
Regarding your question of what to put in for fill, you'll need to reshape your data to the configuration that ggplot2 uses (long format).
The gather function from tidyr does this: it places the row and column names of the correlation matrix into two separate columns, and the r value into a third column that is then used for fill.
library(tidyverse) # attaches tidyr, tibble, ggplot2, and the %>% pipe
heatmap_function <- function(df, a, b) {
  cor_data <- cor(df[a], df[b], method = "spearman") %>%
    as.data.frame() %>%          # row names (a) are carried over from the matrix
    rownames_to_column("x") %>%
    gather(y, fill, -x)
  ggplot(cor_data, aes(x = x, y = y, fill = fill)) +
    geom_tile()
}
This results in:
heatmap_function(master, c("A","B","C"), c("D","E"))
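The long format mentioned above can also be produced without tidyr. The following is a minimal base R sketch, not the answer's own method: it rebuilds the question's table as master, and the names cor_mat and cor_long are chosen here purely for illustration.

```r
# Data copied from the table in the question
master <- data.frame(
  Image = 1:6,
  A = c(3, 4, 34, 3, 78, 56),
  B = c(23, 34, 33, 45, 89, 90),
  C = c(45, 34, 24, 234, 34, 56),
  D = c(23, 34, 89, 90, 23, 67),
  E = c(45, 34, 23, 12, 12, 34),
  F = c(90, 89, 67, 78, 56, 45)
)

# cor() on two data frames returns a 3 x 2 matrix with row and column names
cor_mat <- cor(master[c("A", "B", "C")], master[c("D", "E")], method = "spearman")

# as.table() + as.data.frame() unrolls the matrix into one row per cell
cor_long <- as.data.frame(as.table(cor_mat), responseName = "r")
names(cor_long)[1:2] <- c("x", "y")

# Draw the heatmap only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(cor_long, ggplot2::aes(x = x, y = y, fill = r)) +
    ggplot2::geom_tile()
}
```

Each row of cor_long is one tile: x and y give its position and r is the value mapped to fill.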

Related

Calculating the distance between coordinates R

We have a set of 50 csv files from participants, currently being read into a list as
file_paths <- fs::dir_ls("data")
file_paths
file_contents <- list()
for (i in seq_along(file_paths)) {
  file_contents[[i]] <- read_csv(
    file = file_paths[[i]]
  )
}
dt <- set_names(file_contents, file_paths)
My data looks like this:
level time X Y Type
1 1 355. -10.6 22.36 P
1 1 371. -33 24.85 O
1 2 389. -10.58 17.23 P
1 2 402. -16.7 30.46 O
1 3 419. -29.41 17.32 P
1 4 429. -10.28 26.36 O
2 5 438. -26.86 32.98 P
2 6 451. -21 17.06 O
2 7 463. -21 32.98 P
2 8 474. -19.9 17.06 O
We have 70 sets of coordinates per csv.
Time does not matter for this, but I would like to split up by the level column at some stage.
For every 'P' I want to compare it to the matching 'O' and get the distance between their coordinates. The first P will always match with the first O, and so on.
For now, I have them split into two different lists, though this may be completely the wrong way to do it! I'm having trouble figuring out how to take all of these csv files and get the distances for all of them; the list seems to cause issues with most functions (like dist).
Here is how I've pulled the right information so far
for (i in seq_along (dt)) {
pLoc[[i]] <- dplyr::filter(dt[[i]], grepl("P", type))
oLoc[[i]] <- dplyr::filter(dt[[i]], grepl("o", type))
pX[[i]] <- pLoc[[i]] %>% pull(as.numeric(headX))
pY[[i]] <- pLoc[[i]] %>% pull(as.numeric(headY))
pCoordinates[[i]] <- cbind(pX[[i]], pY[[i]])
}
[EDITED] Following comments, here is how you can do it with the raster library:
library(raster)
library(dplyr)
df = data.frame(
  x = c(10, 20, 15, 9),
  y = c(45, 34, 54, 24),
  type = c("P", "O", "P", "O")
)
# Put each P row next to its matching O row (matched by order)
df = cbind(df[df$type == "P", ] %>%
             dplyr::select(-type) %>%
             dplyr::rename(xP = x,
                           yP = y),
           df[df$type == "O", ] %>%
             dplyr::select(-type) %>%
             dplyr::rename(xO = x,
                           yO = y))
The following could probably be achieved more efficiently with some form of the apply() function:
v = c()
for (i in seq_len(nrow(df))) {
  # lonlat = FALSE: treat the coordinates as planar (Euclidean distance)
  dist = raster::pointDistance(lonlat = FALSE,
                               p1 = c(df$xP[i], df$yP[i]),
                               p2 = c(df$xO[i], df$yO[i]))
  v = c(v, dist)
}
df$dist = v
print(df)
xP yP xO yO dist
1 10 45 20 34 14.86607
3 15 54 9 24 30.59412
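Because the coordinates are planar, the same distances can be computed without a loop or the raster package. The following is a minimal base R sketch under that assumption. pair_distances is a hypothetical helper name, and the column names x, y and type follow the toy data above rather than the real csv files.

```r
# Hypothetical helper: pairs the i-th P row with the i-th O row
pair_distances <- function(d) {
  p <- d[d$type == "P", ]
  o <- d[d$type == "O", ]
  stopifnot(nrow(p) == nrow(o))  # every P needs a matching O
  data.frame(
    xP = p$x, yP = p$y,
    xO = o$x, yO = o$y,
    dist = sqrt((p$x - o$x)^2 + (p$y - o$y)^2)  # vectorized Euclidean distance
  )
}

df <- data.frame(
  x = c(10, 20, 15, 9),
  y = c(45, 34, 54, 24),
  type = c("P", "O", "P", "O")
)
res <- pair_distances(df)

# Applying the helper to a named list of data frames (here two copies of df,
# standing in for the list of csv contents) returns one result per file
all_res <- lapply(list(file1 = df, file2 = df), pair_distances)
```

The stopifnot() check is there because matching by position silently gives wrong pairs if a file has an unequal number of P and O rows.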

Accessing values by their row name and column name, instead of numbers

I have a table with multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs     Exam1  Exam2  Exam3  Exam4  ....  Exam100
Ellie   12            48     33           64
Kate    98     34     21     76
Joe     22     53     49                  72
Van     77            40     12
Xavier                       88           92
What I want is to be able to look up the marks for a given row (IDs) and a given column (exams), like this:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of how Ellie did in Exam2, Exam3 and Exam100 compared with the rest of the exams.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selectedvalues, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should plot Ellie's marks for the exams of interest versus the rest of her marks (a blank should not be treated as zero; it is still a blank).
Red: Marks for Exam3, 100 and 2 , Blue: The marks for the remaining 97 exams
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!
For accessing your data at least you can do the following:
library(dplyr) # for %>% and select()
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
df['Joe','Exam2']
[1] 53
Next, I prepared an example with randomly generated numbers to illustrate what you could do. First, let us create an example data frame:
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot2 it is recommended to convert the data to long format:
library(tidyr) # for gather()
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
IIUC, you want to plot each ID's selected exams against all other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs, then build the aggregated means and the ggplots.
Within by(), create a SelectValues indicator column for the selected exams, then graph with a vertical line at each group's mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})
Output
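The question asks that blanks stay blank rather than become zero. The following is a minimal base R sketch of that alternative, assuming the five-column table above: it keeps the NAs, looks values up by row and column name, and drops missing values only when summarising. The names selected, ellie_selected and ellie_mean are chosen here for illustration.

```r
txt <- "IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92"

# row.names = 1 turns the IDs column into row names
exams_df <- read.table(text = txt, header = TRUE, row.names = 1)

# Look up a single mark by row name and column name
exams_df["Ellie", "Exam3"]

# Marks for the exams of interest; the blank Exam2 stays NA
selected <- c("Exam2", "Exam3", "Exam100")
ellie_selected <- unlist(exams_df["Ellie", selected])

# na.rm = TRUE ignores the blank instead of counting it as zero
ellie_mean <- mean(ellie_selected, na.rm = TRUE)
```

ggplot2's geom_density() likewise drops non-finite values (with a warning), so a long-format version of this table could be plotted without filling the NAs first.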

Call objects in function wrapped around ggplot2-function [duplicate]

I am writing a function to iterate my plots over various variables. Unfortunately, I am getting an error I don't understand.
library(ggplot2)
library(dplyr)
library(purrr)
df <- data.frame(af = c(rep(1,6),rep(2,6),rep(3,6)),
p = c(rep(c(rep("A",2),rep("B",2),rep("C",2)),3)),
ele.1 = sample(c(1:100), size=6),
ele.2 = sample(c(1:100), size=6),
ele.3 = sample(c(1:100), size=6))
af p ele.1 ele.2 ele.3
1 A 99 1 68
1 A 55 38 72
1 B 70 36 13
1 B 86 77 89
1 C 7 24 49
1 C 89 23 53
2 A 99 1 68
2 A 55 38 72
....
test <- function(.x = df, .af = 1, .p = c("A", "B"), .var = ele.1) {
  .x %>%
    filter(af == .af & p %in% .p) %>%
    ggplot(aes(x = .var, y = ele.2)) +
    geom_point() +
    geom_path()
}
test(df)
This results in:
Error in FUN(X[[i]], ...) : object 'ele.1' not found
In addition: Warning message:
In FUN(X[[i]], ...) : restarting interrupted promise evaluation
How can I refer to the column ele.1 in ggplot when it is wrapped in that function?
I hope this is not a duplicate of another question.
Cheers
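If you pass the .var argument as a character string, it runs. Is this what you are looking for?
test <- function(.x = df, .af = 1, .p = c("A", "B"), .var = "ele.1") {
  # Assign the filtered rows so that the plot uses them
  .x <- .x %>% filter(af == .af & p %in% .p)
  # Put aes() in ggplot() so that both geom_point() and geom_path() inherit it
  ggplot(.x, aes(x = .x[[.var]], y = .x[["ele.2"]])) +
    geom_point() +
    geom_path()
}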
test(df)
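An alternative sketch, assuming ggplot2 3.0.0 or later: inside aes(), the .data pronoun looks a column up by its name as a string, which also gives cleaner axis labels. plot_by_name and its argument names are hypothetical, and the data frame is a seeded stand-in for the one in the question.

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  af = rep(1:3, each = 6),
  p = rep(rep(c("A", "B", "C"), each = 2), times = 3),
  ele.1 = sample(1:100, 18),
  ele.2 = sample(1:100, 18)
)

# Hypothetical wrapper: `var` is the name of the x column, given as a string
plot_by_name <- function(data, af_value = 1, p_values = c("A", "B"), var = "ele.1") {
  keep <- data$af == af_value & data$p %in% p_values
  ggplot(data[keep, ], aes(x = .data[[var]], y = .data[["ele.2"]])) +
    geom_point() +
    geom_path()
}

p1 <- plot_by_name(df)                  # x = ele.1
p2 <- plot_by_name(df, var = "ele.2")   # x = ele.2
```

Because var is an ordinary string, the function can be mapped over a character vector of column names, for example with lapply(c("ele.1", "ele.2"), function(v) plot_by_name(df, var = v)).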

Stacked bar plot with percentages in separate columns

I am attempting to draw a stacked bar plot with the following data, using either ggplot2 or the barplot function in R. I have failed with both.
str(ISCE_LENGUAJE5_APE_DEC)
'data.frame': 50 obs. of 5 variables:
$ Nombre : Factor w/ 49 levels "C.E. DE BORAUDO",..: 6 5 25 21 16 7 27 45 24 38 ...
$ v2014_5L_porNivInsu: int 100 93 73 67 67 65 63 60 59 54 ...
$ v2014_5L_porNivMini: int 0 7 22 26 32 32 37 26 34 35 ...
$ v2014_5L_porNivSati: int 0 0 4 6 2 3 0 12 6 10 ...
$ v2014_5L_porNivAvan: int 0 0 1 2 0 0 0 2 1 2 ...
The integers are percentage values: the sum of the v2014... columns for each observation is 100.
I have attempted to use ggplot2, but I only manage to plot one of the variables, not the stacked bar with all four.
ggplot(ISCE_LENGUAJE5_APE_DEC, aes(x=Nombre, y= v2014_5L_porNivInsu)) + geom_bar(stat="identity")
I can't figure out how to pass the values for all four columns to the y parameter.
If I only pass x, I get an error:
ggplot(ISCE_LENGUAJE5_APE_DEC, aes(x=Nombre)) + geom_bar(stat="identity")
Error in exists(name, envir = env, mode = mode) :
argument "env" is missing, with no default
I found this answer, but don't understand the data transformations used. Thank you for any help provided.
ggplot2 works with data expressed in "long" format. The function melt from package reshape2 is your friend.
Because you did not provide a reproducible example, I generated some data.
v2014 <- data.frame(v2014_5L_porNivInsu = sample(1:100, 50, replace = TRUE),
v2014_5L_porNivMini = sample(1:50, 50, replace = TRUE),
v2014_5L_porNivSati = sample(0:10, 50, replace = TRUE),
v2014_5L_porNivAvan = sample(0:2, 50, replace = TRUE))
v2014_prop <- t(apply(v2014, 1, function(x) {x / sum(x) * 100}))
ISCE_LENGUAJE5_APE_DEC <- data.frame(Nombre = factor(sample(1:100, 50)),
v2014_prop)
You first express your table in long format using melt.
library(reshape2)
gg <- melt(ISCE_LENGUAJE5_APE_DEC, id = "Nombre")
See what your new table, gg, looks like.
str(gg)
head(gg)
In your ggplot call, use the data frame gg. The x-axis is Nombre and the y-axis is value (the proportions). Each bar is segmented by fill colour according to the variable column, where melt has stored the v2014_... names as factor levels instead of column headers.
library(ggplot2)
ggplot(gg, aes(x = Nombre, y = value, fill = variable)) +
geom_bar(stat = "identity")
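To make the transformation less opaque, here is a minimal base R sketch of roughly what melt does, on a tiny made-up table. The school names, the shortened column names and the numbers are invented for illustration; stack() is used here in place of melt.

```r
# Made-up wide table: one row per school, one column per level (percentages)
wide <- data.frame(
  Nombre = c("School A", "School B"),
  Insu = c(60, 30),
  Mini = c(30, 50),
  Sati = c(10, 20)
)
level_cols <- c("Insu", "Mini", "Sati")

# stack() piles the level columns on top of each other:
# `values` holds the numbers, `ind` records which column each came from
stacked <- stack(wide[level_cols])

long <- data.frame(
  Nombre = rep(wide$Nombre, times = length(level_cols)),
  variable = stacked$ind,
  value = stacked$values
)
```

long now has one row per school and level (6 rows), which is the shape that aes(x = Nombre, y = value, fill = variable) expects.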

Correct parameters for the geom_line layer in ggplot2

I have a dataframe as such:
1 Pos like 77
2 Neg like 58
3 Pos make 44
4 Neg make 34
5 Pos movi 154
6 Neg movi 145
...
20 Neg will 45
I would like to produce a plot using the geom_text layer in ggplot2.
I have used this code
q <- ggplot(my_data_set, aes(x=value, y=value, label=variable))
q <- q + geom_text()
q
which produced this plot:
Obviously, this is not an ideal plot.
I would like to produce a plot similar, except I would like to have the Positive class on the x-axis, and the Negative class on the y-axis.
UPDATE: Here is an example of something I am attempting to emulate:
I can't seem to figure out the correct way to give the arguments to the geom_line layer.
What is the correct way to plot the value of the Positive arguments on the X-axis, and the value of the Negative arguments on the Y-axis, given the data frame I have?
Thanks for your attention.
my_data_set <- read.table(text = "
id variable value
Pos like 77
Neg like 58
Pos make 44
Neg make 34
Pos movi 154
Neg movi 145", header = TRUE)
library(data.table)
my_data_set <- as.data.frame(data.table(my_data_set)[, list(
Y = value[id == "Neg"],
X = value[id == "Pos"]),
by = variable])
library(ggplot2)
q <- ggplot(my_data_set, aes(x=X, y=Y, label=variable))
q <- q + geom_text()
q
This can also be easily done with reshape2 (with the same result as David Arenburg's answer):
df <- read.table(text = "id variable value
Pos like 77
Neg like 58
Pos make 44
Neg make 34
Pos movi 154
Neg movi 145", header = TRUE)
require(reshape2)
df2 <- dcast(df, variable ~ id, value.var="value")
library(ggplot2)
ggplot(df2, aes(x=Pos, y=Neg, label=variable)) +
geom_text()
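For completeness, the same long-to-wide step can be sketched in base R with reshape(), with no extra packages. This is an alternative to the dcast call above, and the name wide is chosen here for illustration.

```r
df <- read.table(text = "id variable value
Pos like 77
Neg like 58
Pos make 44
Neg make 34
Pos movi 154
Neg movi 145", header = TRUE)

# One row per word; the Pos/Neg values become columns value.Pos and value.Neg
wide <- reshape(df, idvar = "variable", timevar = "id", direction = "wide")

# Drop the "value." prefix so the columns are simply Pos and Neg
names(wide) <- sub("^value\\.", "", names(wide))
```

wide can then be passed to ggplot(wide, aes(x = Pos, y = Neg, label = variable)) + geom_text() exactly as in the answers above.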
which results in:
