How to plot dataframe in R as a heatmap/grid? - r

I would be extremely grateful for some help with R. I would like to plot a dataframe of gridded data (like for like running down the diagonal, from top left to bottom right). I've seen quite a few examples using ggplot2, however, I simply lack the experience necessary with R to manipulate the data structures; I've been programming in LISP and Java for years yet my head won't get around R :-(
The data looks like this:
                    tension cluster migraineNoAura migraineAura
tension                  NA     1.5   6.960453e+00     3.596953
cluster        1.943113e+08      NA             NA           NA
migraineNoAura 8.462798e+00      NA             NA     7.499999
migraineAura   2.833333e+00      NA   7.148313e+07           NA
This is only a small subset, it's a 60x60 data frame. Notice the NAs.
I'm hoping for a 60x60 grid, coloured by the value and the x and y labeled using the names from the data frame.

First, you need to format your data frame from wide format to long format. The following is an example using tidyverse to format the data frame.
library(tidyverse)
dt2 <- dt %>%
rownames_to_column() %>%
gather(colname, value, -rowname)
head(dt2)
# rowname colname value
# 1 tension tension NA
# 2 cluster tension 1.943113e+08
# 3 migraineNoAura tension 8.462798e+00
# 4 migraineAura tension 2.833333e+00
# 5 tension cluster 1.500000e+00
# 6 cluster cluster NA
Now we are ready to plot the heatmap with ggplot2, using geom_tile.
ggplot(dt2, aes(x = rowname, y = colname, fill = value)) +
geom_tile()
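A possible refinement, sketched under assumptions: the question asks for like-for-like cells to run down the diagonal from top left to bottom right, and the values span many orders of magnitude. The example below rebuilds the four rows shown in the question as a stand-in for the 60x60 data frame (the name dt is taken from the answer above), fixes the axis order with factor levels, and fills by log10(value). The log scale is a suggestion, not part of the original answer.

```r
library(tidyr)
library(tibble)
library(ggplot2)

# Stand-in for the real 60x60 data frame: the four rows shown in the question
dt <- data.frame(
  tension        = c(NA, 1.943113e+08, 8.462798, 2.833333),
  cluster        = c(1.5, NA, NA, NA),
  migraineNoAura = c(6.960453, NA, NA, 7.148313e+07),
  migraineAura   = c(3.596953, NA, 7.499999, NA),
  row.names      = c("tension", "cluster", "migraineNoAura", "migraineAura")
)

# Wide to long: one row per (rowname, colname) cell
dt_long <- pivot_longer(
  rownames_to_column(dt, var = "rowname"),
  cols = -rowname,
  names_to = "colname",
  values_to = "value"
)

# Keep the data frame's own ordering on both axes. The y axis is drawn
# bottom to top, so the row levels are reversed to put the first row on top.
dt_long$colname <- factor(dt_long$colname, levels = colnames(dt))
dt_long$rowname <- factor(dt_long$rowname, levels = rev(rownames(dt)))

# Fill by log10(value) so very large cells do not wash out the small ones;
# NA cells are drawn in light grey.
p <- ggplot(dt_long, aes(x = colname, y = rowname, fill = log10(value))) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "darkred", na.value = "grey90") +
  labs(x = NULL, y = NULL, fill = "log10(value)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

Printing p draws the grid. With 60 labels per axis, the rotated x-axis text and a smaller base font size will probably be needed to keep the labels readable.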

Related

Arithmetic operations and labelling in ggplot or R

I have a file that looks like this
2 3 LOGIC:A
2 5 LOGIC:A
3 4 LOGIC:Z
I plotted column 1 on x axis vs column 2 on y with column 3 acting as a legend
ggplot(Data, aes(V1, V2, col = V3)) + geom_point()
However, is it possible in ggplot itself to subtract column 1 from column 2 and label the 10 rows with the highest absolute difference, using the column 3 values as labels on those scatter points? I don't want to label the entire dataset, just the top 10 highest deltas.
You can try this (if your original data frame is Data):
library(dplyr)
library(ggplot2)
Data$sub <- abs(Data$V2 - Data$V1)
Data2<- Data %>%
top_n(10,sub)
ggplot()+ geom_text(data=Data2,aes(V1,V2-0.1,label=V3))+
geom_point(data=Data,aes(V1,V2))
With the library dplyr you can filter the top values of a dataframe.
You can change "0.1" for a better value in your plot
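As an alternative sketch, the labels can live in the same data frame as the points, so a single ggplot call keeps the colour legend from the question. The data below is made up to match the V1/V2/V3 layout, and the column names delta and label are introduced here for illustration only. Note that top_n keeps ties, so it can return more than 10 rows; order() is used here to keep exactly 10.

```r
library(ggplot2)

# Made-up data in the same shape as the question's file (V1, V2, V3)
set.seed(42)
Data <- data.frame(
  V1 = sample(1:20, 30, replace = TRUE),
  V2 = sample(1:20, 30, replace = TRUE),
  V3 = paste0("LOGIC:", sample(LETTERS, 30, replace = TRUE))
)

# Absolute difference between the two columns
Data$delta <- abs(Data$V2 - Data$V1)

# Positions of the 10 largest deltas (ties are broken by row order)
top_rows <- order(Data$delta, decreasing = TRUE)[1:10]

# Empty label everywhere except the top 10 rows
Data$label <- ""
Data$label[top_rows] <- as.character(Data$V3[top_rows])

p <- ggplot(Data, aes(V1, V2, colour = V3)) +
  geom_point() +
  # vjust nudges the text above the point instead of editing the y value
  geom_text(aes(label = label), vjust = -0.8, show.legend = FALSE)
```

Printing p draws all points and labels only the 10 rows with the largest deltas.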

Scatter plot in ggplot, one numeric variable across two groups

I would like to create a scatter plot in ggplot2 which displays male test_scores on the x-axis and female test_scores on the y-axis using the dataset below. I can easily create a geom_line plot splitting male and female and putting the date ("dts") on the x-axis.
library(tidyverse)
#create data
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
"2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
test <- round(runif(10,.5,1),2)
semester <- data.frame("dts" = as.Date(dts), "sex" = sex, "test_scores" =
test)
#show the geom_line plot
ggplot(semester, aes(x = dts, y = test_scores, color = sex)) + geom_line()
It seems with only one time series, ggplot2 does better with the data in wide format than long format. For instance, I could easily create two columns, "male_scores" and "female_scores" and plot those against each other, but I would like to keep my data tidy and in long format.
Cheers and thank you.
You've over-tidied. Tidying data isn't just the mechanism of making it as long as possible; it's making it as wide as necessary.
For example, if you had location as X and Y for animal sightings, you wouldn't have two rows, one with a "label" column containing "X" and the X coordinate in a "value" column, and another with "Y" in the "label" column and the Y coordinate in the "value" column. Unless you really were storing the data in a key-value store, but that's another story...
Widen your data and put the test scores for male and female into test_score_male and test_score_female; then they are the x and y aesthetics for your scatter plot.
The problem with keeping the data long is that you will not have a corresponding X value for a given Y value. The reason for this is the structure of the dataset:
dts sex test_scores
1 2011-01-02 M 0.67
2 2011-01-02 F 0.78
3 2011-01-03 M 0.58
4 2011-01-04 F 0.58
5 2011-01-05 M 0.51
If you were to use the code:
ggplot(semester, aes(x = semester$test_scores[semester$sex == 'M'],
y = semester$test_scores[semester$sex == 'F'],
color = sex)) + geom_point()
ggplot will throw an error. The main reason is that once you subset the male scores, there are no corresponding female scores for that subset, and each subset is shorter than the data frame it is plotted against. You need to first collapse the data down to the date level. As you correctly point out, the data is no longer in long format at that point.
I would recommend for this one off plot creating a wide dataset. There are multiple ways of doing that, but that is a different topic.
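One way to build that wide dataset, as a minimal sketch: tidyr's pivot_wider can spread the sex column into one score column per sex. The data below is hypothetical and assumes exactly one male and one female score per date, which the sample in the question does not guarantee; the test_score_ prefix is chosen here for illustration.

```r
library(tidyr)
library(ggplot2)

# Hypothetical data: one male and one female score for each date
semester <- data.frame(
  dts = as.Date(rep(c("2011-01-02", "2011-01-03", "2011-01-04",
                      "2011-01-05", "2011-01-06"), each = 2)),
  sex = rep(c("M", "F"), times = 5),
  test_scores = c(0.67, 0.78, 0.58, 0.58, 0.51, 0.92, 0.74, 0.66, 0.85, 0.71)
)

# One row per date, with columns test_score_M and test_score_F
semester_wide <- pivot_wider(
  semester,
  names_from = sex,
  values_from = test_scores,
  names_prefix = "test_score_"
)

# Male scores on the x axis, female scores on the y axis
p <- ggplot(semester_wide, aes(x = test_score_M, y = test_score_F)) +
  geom_point()
```

If a date has several scores for the same sex, pivot_wider cannot pair them unambiguously, so the scores would need to be summarised per date first (for example, averaged).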

Plot multiline graph with custom x axis in R [duplicate]

I don't have much experience with R and since I am trying to create a fairly specific graph in R, I hope some of you can help me out.
I have the results of four classifiers run on five different datasets. To get an accurate result, each classifier was run on the same dataset 10 times. So now I have a table of results as follows:
DataSet1 DataSet1 DataSet1 ... DataSet2 DataSet2 ...
Classifier1 0.6 0.5 0.7 0.3 0.2
Classifier2 0.4 0.5 0.6 0.6 0.7
And so on.
What I am trying to get is four separate lines in different colors, one for each of the four classifiers. The y-axis would represent the classification results, and the x-axis should show the five different datasets.
Each "mark" on the x-axis should be one dataset and the point on the y-axis for each graph would be the mean value of the 10 results for that classifier on that specific dataset.
I have tried using ggplot2 to achieve this by creating a data frame out of the data and melting it with the dataset names as variables. I might not truly understand what melting really does.
I am not very familiar with creating graphs and plots and apologize if my description is clumsy and lacking.
I would appreciate any help greatly.
TL;DR: Reformat input data + ggplot with facets
Input
Since test data wasn't provided, I created dummy data
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1)
dummydata <-
matrix(
data = sample(do.call("c", select(iris, -Species)), 10*5*4, replace = T),
nrow=4, ncol=10*5
)
rownames(dummydata) <- paste0("Classifier", 1:4)
colnames(dummydata) <- rep(paste0("DataSet", 1:5), each=10)
dummydata
Here, dummydata is a matrix that looks like
# DataSet1 DataSet1 ... (x10 total columns of each Dataset) ... DataSet5
# Classifier1 3.1 5.6 ...
# Classifier2 2.8 1.3 ...
# ...
# Classifier4 1.3 ...
Reformat input to a workable state
Make sure column names are unique
Make sure object is a data frame
Make sure there is a column for the row name
Make the data frame long (to be used by ggplot)
We do so by:
## Make col names unique
colnames(dummydata) <- paste(colnames(dummydata), 1:10, sep="_")
dummydata_reformat <-
dummydata %>%
## make sure it is data frame
## with a column for the classifier type i.e. rowname
as.data.frame() %>%
tibble::rownames_to_column("classifier") %>%
## reformat data
gather(dataset,value,-classifier) %>%
separate(dataset, into=c("dset", "x"), sep="_")
The data now looks like
#> dummydata_reformat
# classifier dset x value
#1 Classifier1 DataSet1 1 3.1
#2 Classifier2 DataSet1 1 2.8
#3 Classifier3 DataSet1 1 1.6
# ...
Plot
## Plot
dummydata_reformat %>%
ggplot(aes(x=dset,
y=value
## can add: "color = classifier" to color by classifier
## but since you are splitting the plots by classifier,
## this does not make sense
)) +
geom_point() +
xlab("") +
facet_wrap(~classifier) +
theme(
## rotate the x-axis to fit text
axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)
)
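The question also asks for the mean of the 10 runs per classifier and dataset, drawn as one coloured line per classifier. A minimal sketch of that step follows, using freshly generated stand-in data in the same long layout (classifier, dset, value); the names results and means are introduced here for illustration.

```r
library(ggplot2)

# Stand-in long-format results: 4 classifiers x 5 datasets x 10 runs
set.seed(1)
results <- expand.grid(
  run = 1:10,
  dset = paste0("DataSet", 1:5),
  classifier = paste0("Classifier", 1:4),
  stringsAsFactors = FALSE
)
results$value <- runif(nrow(results))

# Mean of the 10 runs for every classifier/dataset pair
means <- aggregate(value ~ classifier + dset, data = results, FUN = mean)

# One line per classifier; group is needed because the x axis is discrete
p <- ggplot(means, aes(x = dset, y = value,
                       colour = classifier, group = classifier)) +
  geom_line() +
  geom_point() +
  labs(x = "", y = "mean result")
```

Printing p draws four coloured lines over the five datasets. The same aggregate call should work on dummydata_reformat from the answer above, since it has the same three columns.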

How to draw a basic histogram with X and Y axis in R

I want to make a simple histogram from two vectors:
values <- c(1,2,3,4,5,6,7,8)
freq <- c(4,6,4,4,3,2,1,1)
df <- data.frame(values,freq)
Now the data.frame df contains the following values:
values freq
1 4
2 6
3 4
4 4
5 3
6 2
7 1
8 1
Now I want to draw a simple histogram, in which values are on the x axis and freq is on y axis. I am trying to use the hist function, but I am not able to give two variables. How can I make a simple histogram from this data?
Using ggplot2:
library(ggplot2)
ggplot(df, aes(x = values, y = freq)) +
geom_bar(stat="identity")
Since you have the frequencies already, what you really want is a bar plot:
barplot(df$freq,names.arg=df$values)
If you've got your heart set on using hist, you should do:
hist(rep(df$values,df$freq))
Please read ?barplot and ?hist for further plotting options.
Also, because I'm somewhat of a zealot, I think the code looks cleaner if you use data.table:
library(data.table)
setDT(df) #convert df to a data.table by reference
df[,barplot(freq,names.arg=values)]
and
df[,hist(rep(values,freq))]
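To see why hist(rep(...)) works, it may help to look at what rep produces: each value is repeated as many times as its frequency, which turns the frequency table back into raw observations. The sketch below also sets explicit breaks centred on the integers, which is an addition to the answers above; with hist's default breaks, neighbouring values can end up in the same bin.

```r
values <- c(1, 2, 3, 4, 5, 6, 7, 8)
freq   <- c(4, 6, 4, 4, 3, 2, 1, 1)

# Rebuild the raw observations: 1 appears 4 times, 2 appears 6 times, ...
expanded <- rep(values, freq)
length(expanded)   # 25 observations, the sum of freq

# One bin per value: breaks at 0.5, 1.5, ..., 8.5.
# plot = FALSE returns the histogram object without drawing it.
h <- hist(expanded, breaks = seq(0.5, 8.5, by = 1), plot = FALSE)
h$counts           # bin counts, identical to freq
```

Dropping plot = FALSE draws the histogram with one bar per value.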

How to plot histogram with means calculated by factor levels from multiple columns

I am new to R, and maybe my question looks silly, but I spent half the day trying to solve it on my own with no luck. I've found no tutorial that illustrates how to do it; if you know of one, please point me to it. I want to plot a histogram with means calculated by factor levels from several columns. My initial data looks like this (simplified version):
code_group scale1 scale2
1 5 3
2 3 2
3 5 2
So I need a histogram where each bar is coloured by code_group and its height is the mean for that level of code_group, with the x-axis labelled scale1 and scale2. Every label contains three bars (one for each of the three levels of the factor code_group). I've managed to calculate the means for each level on my own; they look like this:
code_group scale1 scale2
1 -1.0270270 0.05405405
2 -1.0882353 0.14705882
3 -0.7931034 -0.34482759
but I have no idea how to plot them in a histogram! Thanks in advance!
Assuming you mean bar chart and not histogram (please clarify your question if this isn't the case), you can melt your data and plot it with ggplot like this:
library(ggplot2)
library(reshape2)
##
mdf <- melt(
df,
id.vars="code_group",
variable.name="scale_type",
value.name="mean_value")
##
ggplot(
mdf,
aes(x=scale_type,
y=mean_value,
fill=factor(code_group)))+
geom_bar(stat="identity",position="dodge")
Data:
df <- read.table(
text="code_group scale1 scale2
1 -1.0270270 0.05405405
2 -1.0882353 0.14705882
3 -0.7931034 -0.34482759",
header=TRUE)
Edit:
You could just make the modifications to the data itself (or a copy of it) like below:
mdf2 <- mdf
mdf2$code_group <- factor(
mdf2$code_group,
levels=1:3,
labels=c("neutral",
"likers",
"lovers"))
names(mdf2)[1] <- "group"
##
ggplot(
mdf2,
aes(x=scale_type,
y=mean_value,
fill=group))+
geom_bar(stat="identity",position="dodge")
##
Given the mean values you provided, you could do something like this:
To recreate your simplified dataset:
d=data.frame(code_group=c(1,2,3),scale1=c(-1.02,-1.08,-0.79),scale2=c(0.05,.15,-0.34))
To create your graph:
barplot(c(d[,'scale1'],d[,'scale2']),col=d[,'code_group'],names.arg=c(paste('scale1',unique(d[,'code_group']),sep='_'),paste('scale2',unique(d[,'code_group']),sep='_')))
This gives a bar plot with six bars, labelled scale1_1 to scale2_3 and coloured by code_group.
