I don't have much experience with R and since I am trying to create a fairly specific graph in R, I hope some of you can help me out.
I have data of the results of four classifiers being used on five different datasets. To get an accurate result each classifier was run on the same dataset 10 times. So now I have a table of the results as follows:
DataSet1 DataSet1 DataSet1 ... DataSet2 DataSet2 ...
Classifier1 0.6 0.5 0.7 0.3 0.2
Classifier2 0.4 0.5 0.6 0.6 0.7
And so on.
What I am trying to get for my graph is to have four separate graphs in different colors representing the four classifiers. The y-axis would just represent the results of the classifications and the x-axis should portray the five different datasets.
Each "mark" on the x-axis should be one dataset and the point on the y-axis for each graph would be the mean value of the 10 results for that classifier on that specific dataset.
I have tried using ggplot2 to achieve this by creating a data frame out of the data and melting it with the dataset names as variables. I might not truly understand what melting really does.
I am not very familiar with creating graphs and plots and apologize if my description is clumsy and lacking.
I would appreciate any help greatly.
TL;DR: Reformat input data + ggplot with facets
Input
Since test data wasn't provided, I created dummy data
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1)
dummydata <-
matrix(
data = sample(do.call("c", select(iris, -Species)), 10*5*4, replace = TRUE),
nrow=4, ncol=10*5
)
rownames(dummydata) <- paste0("Classifier", 1:4)
colnames(dummydata) <- rep(paste0("DataSet", 1:5), each=10)
dummydata
Here, dummydata is a matrix that looks like
# DataSet1 DataSet1 ... (x10 total columns of each Dataset) ... DataSet5
# Classifier1 3.1 5.6 ...
# Classifier2 2.8 1.3 ...
# ...
# Classifier4 1.3 ...
Reformat input to a workable state
Make sure column names are unique
Make sure object is a data frame
Make sure there is a column for the row name
Make the data frame long (to be used by ggplot)
We do so by:
## Make col names unique
colnames(dummydata) <- paste(colnames(dummydata), 1:10, sep="_")
dummydata_reformat <-
dummydata %>%
## make sure it is data frame
## with a column for the classifier type i.e. rowname
as.data.frame() %>%
tibble::rownames_to_column("classifier") %>%
## reformat data
gather(dataset,value,-classifier) %>%
separate(dataset, into=c("dset", "x"), sep="_")
The data now looks like
#> dummydata_reformat
# classifier dset x value
#1 Classifier1 DataSet1 1 3.1
#2 Classifier2 DataSet1 1 2.8
#3 Classifier3 DataSet1 1 1.6
# ...
Plot
## Plot
dummydata_reformat %>%
ggplot(aes(x=dset,
y=value
## can add: "color = classifier" to color by classifier
## but since you are splitting the plots by classifier,
## this does not make sense
)) +
geom_point() +
xlab("") +
facet_wrap(~classifier) +
theme(
## rotate the x-axis to fit text
axis.text.x = element_text(angle=90, hjust=1, vjust=0.5)
)
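The question asked for one point per classifier and dataset, placed at the mean of the 10 runs, whereas the plot above shows every run. A minimal sketch of that variant follows. It rebuilds a made-up results matrix (the names results, results_long, results_mean and p_mean are mine, not from the question) and draws colored lines in a single panel instead of facets:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the results table: 4 classifiers x (5 datasets x 10 runs)
results <- matrix(
  runif(4 * 50), nrow = 4,
  dimnames = list(
    paste0("Classifier", 1:4),
    paste(rep(paste0("DataSet", 1:5), each = 10), 1:10, sep = "_")
  )
)

# Long format: one row per classifier, dataset and run
results_long <- results %>%
  as.data.frame() %>%
  tibble::rownames_to_column("classifier") %>%
  pivot_longer(-classifier, names_to = "dataset", values_to = "value") %>%
  separate(dataset, into = c("dset", "run"), sep = "_")

# Mean of the 10 runs for every classifier/dataset pair
results_mean <- results_long %>%
  group_by(classifier, dset) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# One colored line per classifier, one x position per dataset
p_mean <- ggplot(results_mean,
                 aes(x = dset, y = mean_value,
                     color = classifier, group = classifier)) +
  geom_line() +
  geom_point() +
  xlab("")
p_mean
```

Adding facet_wrap(~classifier) to p_mean would bring back the four separate panels.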
Related
I would be extremely grateful for some help with R. I would like to plot a dataframe of gridded data (like for like running down the diagonal, from top left to bottom right). I've seen quite a few examples using ggplot2, however, I simply lack the experience necessary with R to manipulate the data structures; I've been programming in LISP and Java for years yet my head won't get around R :-(
The data looks like this:
tension cluster migraineNoAura migraineAura
tension NA 1.5 6.960453e+00 3.596953
cluster 1.943113e+08 NA NA NA
migraineNoAura 8.462798e+00 NA NA 7.499999
migraineAura 2.833333e+00 NA 7.148313e+07 NA
This is only a small subset, it's a 60x60 data frame. Notice the NAs.
I'm hoping for a 60x60 grid, coloured by the value and the x and y labeled using the names from the data frame.
First, you need to format your data frame from wide format to long format. The following is an example using tidyverse to format the data frame.
library(tidyverse)
dt2 <- dt %>%
rownames_to_column() %>%
gather(colname, value, -rowname)
head(dt2)
# rowname colname value
# 1 tension tension NA
# 2 cluster tension 1.943113e+08
# 3 migraineNoAura tension 8.462798e+00
# 4 migraineAura tension 2.833333e+00
# 5 tension cluster 1.500000e+00
# 6 cluster cluster NA
Now we are ready to use ggplot2 to plot the heatmap using geom_tile.
ggplot(dt2, aes(x = rowname, y = colname, fill = value)) +
geom_tile()
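One caveat: character columns are ordered alphabetically on the axes, so the diagonal may not run from top left to bottom right. Here is a hedged sketch that keeps the original row/column order by using factors. The 4x4 dt below is a hand-typed stand-in for the real 60x60 data frame, and the log10 fill and the colors are my own choices to cope with the wide value range:

```r
library(ggplot2)

# Hypothetical 4x4 stand-in for the 60x60 data frame `dt` from the question
nm <- c("tension", "cluster", "migraineNoAura", "migraineAura")
dt <- as.data.frame(matrix(
  c(NA,         1.5, 6.960453,   3.596953,
    1.943113e8, NA,  NA,         NA,
    8.462798,   NA,  NA,         7.499999,
    2.833333,   NA,  7.148313e7, NA),
  nrow = 4, byrow = TRUE, dimnames = list(nm, nm)
))

# Base-R route to long format; factor levels follow the original order
dt_long <- as.data.frame(as.table(as.matrix(dt)), responseName = "value")
names(dt_long)[1:2] <- c("rowname", "colname")

# Reverse the row levels so the first row is drawn at the top
dt_long$rowname <- factor(dt_long$rowname, levels = rev(nm))

p_heat <- ggplot(dt_long, aes(x = colname, y = rowname, fill = log10(value))) +
  geom_tile(colour = "white") +
  scale_fill_gradient(low = "lightyellow", high = "red", na.value = "grey90") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p_heat
```

NA cells are drawn in grey through na.value, so the gaps in the grid stay visible.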
I would like to create a scatter plot in ggplot2 which displays male test_scores on the x-axis and female test_scores on the y-axis using the dataset below. I can easily create a geom_line plot splitting male and female and putting the date ("dts") on the x-axis.
library(tidyverse)
#create data
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
"2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
test <- round(runif(10,.5,1),2)
semester <- data.frame("dts" = as.Date(dts), "sex" = sex, "test_scores" =
test)
#show the geom_line plot
ggplot(semester, aes(x = dts, y = test_scores, color = sex)) + geom_line()
It seems with only one time series, ggplot2 does better with the data in wide format than long format. For instance, I could easily create two columns, "male_scores" and "female_scores" and plot those against each other, but I would like to keep my data tidy and in long format.
Cheers and thank you.
You've over-tidied. Tidying data isn't just the mechanism of making it as long as possible; it's making it as wide as necessary.
For example, if you had location as X and Y for animal sightings you wouldn't have two rows, one with a "label" column containing "X" and the X coordinate in a "value" column and another with "Y" in the "label" column and the Y coordinate in the "value" column - unless you really were storing the data in a key-value store, but that's another story...
Widen your data and put the test scores for male and female into test_score_male and test_score_female; they are then the x and y aesthetics for your scatter plot.
The problem with keeping the data long is that you will not have a corresponding X value for a given Y value. The reason for this is the structure of the dataset:
dts sex test_scores
1 2011-01-02 M 0.67
2 2011-01-02 F 0.78
3 2011-01-03 M 0.58
4 2011-01-04 F 0.58
5 2011-01-05 M 0.51
If you were to use the code --
ggplot(semester, aes(x = semester$test_scores[semester$sex=='M',] ,
y = semester$test_scores[semester$sex=='F',],
color = sex)) + geom_point()
ggplot will throw an error. The main reason is that after subsetting the male scores there are no corresponding female scores for that subset. You need to first collapse the data down to a date level. As you correctly point out, the data isn't in long format at that point.
For this one-off plot I would recommend creating a wide dataset. There are multiple ways of doing that, but that is a different topic.
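As one possible way to do the widening, here is a sketch using tidyr::pivot_wider. The seed, the names semester_wide and p_scatter, and the choice to average repeated scores are my assumptions: the example data has two M and two F rows for 2011-01-02, so something has to collapse them to one value per date and sex.

```r
library(tidyr)
library(ggplot2)

set.seed(1)  # fixed seed so the made-up scores are reproducible
dts <- c("2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05",
         "2011-01-02","2011-01-02","2011-01-03","2011-01-04","2011-01-05")
sex <- c("M","F","M","F","M","F","M","F","M","F")
semester <- data.frame(dts = as.Date(dts), sex = sex,
                       test_scores = round(runif(10, .5, 1), 2))

# One row per date, one column per sex; duplicates are averaged (assumption)
semester_wide <- pivot_wider(
  semester,
  names_from   = sex,
  values_from  = test_scores,
  names_prefix = "test_score_",
  values_fn    = mean
)

# Male scores on x, female scores on y, one point per date
p_scatter <- ggplot(semester_wide, aes(x = test_score_M, y = test_score_F)) +
  geom_point()
p_scatter
```

The long semester data frame is left untouched, so it can still feed the geom_line plot.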
I am new to R and maybe my question looks silly; I spent half the day trying to solve it on my own with no luck. I've found no tutorial that illustrates how to do it, and if you know of such a tutorial, pointers are welcome. I want to plot a histogram with means calculated by factors from columns. My initial data looks like this (simplified version):
code_group scale1 scale2
1 5 3
2 3 2
3 5 2
So I need a histogram where each bar is colored by code_group and its value is the mean for each level of code_group, with x-axis labels scale1 and scale2. Every label contains three bars (for the three levels of the factor code_group). I've managed to calculate the means for each level on my own; they look like this:
code_group scale1 scale2
1 -1.0270270 0.05405405
2 -1.0882353 0.14705882
3 -0.7931034 -0.34482759
but I have no idea how to plot it in a histogram! Thanks in advance!
Assuming you mean bar chart and not histogram (please clarify your question if this isn't the case), you can melt your data and plot it with ggplot like this:
library(ggplot2)
library(reshape2)
##
mdf <- melt(
df,
id.vars="code_group",
variable.name="scale_type",
value.name="mean_value")
##
ggplot(
mdf,
aes(x=scale_type,
y=mean_value,
fill=factor(code_group)))+
geom_bar(stat="identity",position="dodge")
Data:
df <- read.table(
text="code_group scale1 scale2
1 -1.0270270 0.05405405
2 -1.0882353 0.14705882
3 -0.7931034 -0.34482759",
header=TRUE)
Edit:
You could just make the modifications to the data itself (or a copy of it) like below:
mdf2 <- mdf
mdf2$code_group <- factor(
mdf2$code_group,
levels=1:3,
labels=c("neutral",
"likers",
"lovers"))
names(mdf2)[1] <- "group"
##
ggplot(
mdf2,
aes(x=scale_type,
y=mean_value,
fill=group))+
geom_bar(stat="identity",position="dodge")
##
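If you start from the raw rows rather than precomputed means, the averaging step can be folded in as well. A rough sketch, with made-up raw responses (the raw data frame and its value range are assumptions, not the asker's data):

```r
library(reshape2)
library(ggplot2)

set.seed(1)
# Made-up raw responses: 10 rows per code_group, two scale columns
raw <- data.frame(
  code_group = rep(1:3, each = 10),
  scale1 = sample(-3:3, 30, replace = TRUE),
  scale2 = sample(-3:3, 30, replace = TRUE)
)

# Mean of each scale within each level of code_group
means <- aggregate(cbind(scale1, scale2) ~ code_group, data = raw, FUN = mean)

# Same melt step as above, applied to the computed means
means_long <- melt(means, id.vars = "code_group",
                   variable.name = "scale_type", value.name = "mean_value")

p_bar <- ggplot(means_long,
                aes(x = scale_type, y = mean_value, fill = factor(code_group))) +
  geom_col(position = "dodge")
p_bar
```

geom_col() is shorthand for geom_bar(stat="identity"), so the result matches the plots above.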
Given the mean values you provided, you could do something like this:
To recreate your simplified dataset:
d=data.frame(code_group=c(1,2,3),scale1=c(-1.02,-1.08,-0.79),scale2=c(0.05,.15,-0.34))
To create your graph:
barplot(c(d[,'scale1'],d[,'scale2']),col=d[,'code_group'],names.arg=c(paste('scale1',unique(d[,'code_group']),sep='_'),paste('scale2',unique(d[,'code_group']),sep='_')))
This will give you the following graph:
I feel like this question has been asked many times before, however from the questions I've looked at, none of the solutions so far have worked for me.
I wish to plot the values of two correlation matrices as scatter plots, next to each other in one plot with the same y range (from 0 - 1).
My original data are time series spanning several years of 112 companies, which I've split into two subsets, period A & period B. The original data is a zoo object.
I have then created the correlation matrices for both periods:
corr_A <- cor(series_A)
corr_B <- cor(series_B)
For further data analysis, I have removed the double entries:
corr_A[lower.tri(corr_A, diag=TRUE)] <- NA
corr_A <- as.vector(corr_A)
corr_A <- corr_A[!is.na(corr_A)]
corr_B[lower.tri(corr_B, diag=TRUE)] <- NA
corr_B <- as.vector(corr_B)
corr_B <- corr_B[!is.na(corr_B)]
As a result, I have two vectors, each with a length of 6216 (111 + 110 + 109 + .. + 1 = 6216).
I have then combined these vectors into a matrix:
correlation <- matrix(c(corr_A, corr_B), nrow=6216)
colnames(correlation) <- c("period_A", "period_B")
I would now like to plot this matrix so the result looks similar to this picture:
I've tried to plot using xyplot from lattice:
xyplot(period_A + period_B ~ X, correlation)
However, in the resulting plot, the two scatter plots are stacked over each other:
I have also tried changing the matrix itself - instead of using 6216 rows, I have used 12432 rows, and then indexed the first 6216 rows as "period_A" and the last 6216 rows as "period_B" - the resulting plot looks quite similar to my desired plot:
Is there any way I can create my desired plot using xyplot? Or are there any other (ggplot, car) methods of generating the plot?
Edit (added sample data for reproducible example):
head(correlation) #data frame with 6216 rows, 3 columns
X period_A period_B
1 0.5 0.4
2 0.3 0.6
3 0.2 0.4
4 0.6 0.6
I've finally found a solution:
https://stats.stackexchange.com/questions/63203/boxplot-equivalent-for-heavy-tailed-distributions
First of all, we stack the data.
correlation <- stack(correlation)
Then, we use stripplot (from lattice) combined with jitter=TRUE to create the desired plot.
stripplot(correlation$values ~ correlation$ind, jitter=T)
The resulting plot looks exactly like my desired plot and can be manipulated using the standard lattice/plot commands (ylab, xlab, xlim, ylim,scales, etc.).
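For anyone who wants to try this without the original series, here is a self-contained sketch. The random values are a stand-in for the real correlations, and it assumes correlation is a data frame holding only the two period columns (stack() would otherwise stack an X index column too). corr_long and p_strip are my own names:

```r
library(lattice)

set.seed(1)
# Made-up stand-in for the two vectors of pairwise correlations
correlation <- data.frame(period_A = runif(200), period_B = runif(200))

# stack() returns two columns: `values` and the source column name in `ind`
corr_long <- stack(correlation)

# One jittered strip per period, sharing the same y range
p_strip <- stripplot(values ~ ind, data = corr_long,
                     jitter.data = TRUE, ylim = c(0, 1),
                     xlab = "", ylab = "correlation")
print(p_strip)
```

Using the formula with data= avoids repeating correlation$ and gives cleaner default axis labels.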
I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
The success matrix is the probability of success and the samples matrix is the corresponding number of samples that was observed in each case. I'd like to make a bar graph that groups each column together with the height being determined by the success matrix. The color of each bar needs to be a color (scaled from 1 to MAX) that corresponds to the number of observations (i.e., small samples would be more red, for instance, whereas high samples would be green perhaps).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
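The claim that melt walks both matrices in the same order is easy to check before relying on the cbind. A small sketch, regenerating the matrices with a seed of my choosing (the intermediate names ms and mc are mine):

```r
library(reshape2)

set.seed(1)
MAX <- 100
m <- 5
n <- 40
success <- matrix(runif(m * n, 0, 1), m, n)
samples <- floor(MAX * matrix(runif(m * n), m)) + 1

ms <- melt(success)  # columns: Var1 (row), Var2 (column), value
mc <- melt(samples)

# The row/column index columns must match before the values are bound together
stopifnot(identical(ms[, 1:2], mc[, 1:2]))

data.long <- cbind(ms, mc[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
```

If the two matrices ever had different dimensions, the stopifnot would fail instead of silently misaligning the counts.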
Using @BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
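To see what that ifelse does, here it is on a toy vector (the counts are made up): every count above the threshold is replaced by the overall maximum, so all of those bars get the same top-of-scale color, while counts at or below the threshold keep their own value.

```r
count <- c(8, 58, 40, 25, 26)  # made-up sample counts
threshold <- 25

# Counts above the threshold collapse to max(count); the rest are unchanged
fill <- ifelse(count > threshold, max(count), count)
fill
# 8 58 58 25 58
```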
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.