Run points() after plot() on a dataframe - r

I'm new to R and want to plot specific points over an existing plot. I'm using the swiss data frame, which I visualize with plot(swiss).
After this, I want to add the outliers given by the Mahalanobis distance:
mu_hat <- apply(swiss, 2, mean); sigma_hat <- cov(swiss)
mahalanobis_distance <- mahalanobis(swiss, mu_hat, sigma_hat)
outliers <- swiss[names(mahalanobis_distance[mahalanobis_distance > 10]),]
points(outliers, pch = 'x', col = 'red')
but this last line has no effect: the outlier points aren't added to the previous plot. If I repeat this procedure on a pair of variables, say
plot(swiss[2:3])
points(outliers[2:3], pch = 'x', col = 'red')
the red points are added to the plot.
Question: is there any restriction on how the points() function can be used with a multivariate data frame?

Here's a solution using GGally::ggpairs. It's a little ugly as we need to modify the ggally_points function to specify the desired color scheme.
I've assumed that mu_hat = colMeans(swiss) and sigma_hat = cov(swiss).
library(dplyr)
library(GGally)
swiss %>%
  bind_cols(distance = mahalanobis(swiss, colMeans(swiss), cov(swiss))) %>%
  mutate(is_outlier = ifelse(distance > 10, "yes", "no")) %>%
  ggpairs(columns = 1:6,
          mapping = aes(color = is_outlier),
          upper = list(continuous = function(data, mapping, ...) {
            ggally_points(data = data, mapping = mapping) +
              scale_colour_manual(values = c("black", "red"))
          }),
          lower = list(continuous = function(data, mapping, ...) {
            ggally_points(data = data, mapping = mapping) +
              scale_colour_manual(values = c("black", "red"))
          }),
          axisLabels = "internal")

Unfortunately this isn't possible the way you're currently doing things. When plotting a data frame R produces many plots and aligns them. What you're actually seeing there is 6 by 6 = 36 individual plots which have all been aligned to look nice.
When you call points(), it places the points on the current plot, which doesn't really make sense when you have 36 plots, at least not in the way you want it to.
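One base R route, sketched here on the assumption that the threshold of 10 from the question is what you want: plot() on a data frame dispatches to pairs(), and pairs() hands its col and pch arguments to every panel. That means the outliers can be marked while the matrix is drawn, rather than added afterwards. The names distance and is_outlier are only illustrative.

```r
# Flag outliers by Mahalanobis distance (threshold 10, as in the question)
distance <- mahalanobis(swiss, colMeans(swiss), cov(swiss))
is_outlier <- distance > 10

# pairs() passes col and pch on to each panel, one value per row
pairs(swiss,
      col = ifelse(is_outlier, "red", "black"),
      pch = ifelse(is_outlier, 4, 1))
```

Because the colour and symbol vectors have one entry per row of swiss, each panel draws the same rows in red crosses.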
ggplot2 is a really powerful tool in R; it provides far greater customizability. For example, you could set up the data frame to include your outliers, labelled as "outlier", and show them in each plot that you have set up as facets. The more you explore it, the more likely you are to find other plots that suit your needs as well.
Plotting a data frame in base R is a good exploratory tool. You could set up those outliers as a separate data frame and plot it, so you can see each of the 6 by 6 plots side by side and compare. It all depends on your goal. If your goal is to produce exactly what you've described, the ggplot2 package will help you create something more professional. As @Gregor suggested in the comments, looking up the function ggpairs from the GGally package would be a good place to start.
A quick google image search shows some funky plots akin to what you're after and then some!

Related

Why aren't any points showing up in the qqcomp function when using plotstyle="ggplot"?

I want to compare the fit of different distributions to my data in a single plot. The qqcomp function from the fitdistrplus package pretty much does exactly what I want to do. The only problem I have however, is that it's mostly written using base R plot and all my other plots are written in ggplot2. I basically just want to customize the qqcomp plots to look like they have been made in ggplot2.
From the documentation (https://www.rdocumentation.org/packages/fitdistrplus/versions/1.0-14/topics/graphcomp) I get that this is totally possible by setting plotstyle="ggplot". If I do this however, no points are showing up on the plot, even though it worked perfectly without the plotstyle argument. Here is a little example to visualize my problem:
library(fitdistrplus)
library(ggplot2)
set.seed(42)
vec <- rgamma(100, shape=2)
fit.norm <- fitdist(vec, "norm")
fit.gamma <- fitdist(vec, "gamma")
fit.weibull <- fitdist(vec, "weibull")
model.list <- list(fit.norm, fit.gamma, fit.weibull)
qqcomp(model.list)
This gives the following output:
While this:
qqcomp(model.list, plotstyle="ggplot")
gives the following output:
Why are the points not showing up? Am I doing something wrong here or is this a bug?
EDIT:
So I haven't figured out why this doesn't work, but there is a pretty easy workaround. The function call qqcomp(model.list, plotstyle="ggplot") still returns a ggplot object, which includes the data used to make the plot. Using that data, one can easily write one's own plot function that does exactly what is needed. It's not very elegant, but until someone finds out why it's not working as expected I will just use this method.
I was able to reproduce your error and, indeed, it's really intriguing. Maybe you should contact the developers of this package to report this bug.
Otherwise, you can reproduce this Q-Q plot with ggplot2 and stat_qq by passing the corresponding distribution function and its associated parameters (stored in $estimate):
library(ggplot2)
df = data.frame(vec)
ggplot(df, aes(sample = vec))+
stat_qq(distribution = qgamma, dparams = as.list(fit.gamma$estimate), color = "green")+
stat_qq(distribution = qnorm, dparams = as.list(fit.norm$estimate), color = "red")+
stat_qq(distribution = qweibull, dparams = as.list(fit.weibull$estimate), color = "blue")+
geom_abline(slope = 1, color = "black")+
labs(title = "Q-Q Plots", x = "Theoretical quantiles", y = "Empirical quantiles")
Hope it will help you.

Multiple histograms with title and mean as a line?

I'm struggling with the histogram function in my exploratory analysis. I would like to run a couple of variables in my dataset through a histogram function and, for each, add a title and a line at the arithmetic mean. This is how far I've got (but the main title is still missing):
histo.abline <-function(x){
hist(x)
abline(v = mean(x, na.rm = TRUE), col = "blue", lwd = 4)}
sapply(dataset[c(7:10)], histo.abline)
I tried to add a main argument in the histogram function, but it just doesn't pick up the right variable name from my dataset. When I put main=x there, it returns NULL for each variable. colnames, names and other functions didn't work either. Could you help me?
You can try to do it with ggplot2:
library(ggplot2)
histo.abline <-function(dataset,colnum){
p<-ggplot(dataset,aes(dataset[,colnum]))+geom_histogram(bins=5,fill=I("blue"),col=I("red"), alpha=I(.2))+
geom_vline(xintercept = mean(dataset[,colnum], na.rm = TRUE))+xlab(as.character(names(dataset)[colnum]))
return(p)
}
Since you have not provided data, let's work with mtcars and create a list of histograms:
dataset=mtcars
listOfHistograms<-lapply(3:7,function(x) histo.abline(dataset,x))
Your list has 5 histograms; you can plot, for instance, the first one with:
print(listOfHistograms[[1]])
More histogram options for ggplot here: https://www.r-bloggers.com/how-to-make-a-histogram-with-ggplot2/
hope this helps
EDIT: Multiple Plot in one graph
One way to do it is through cowplot library:
library(cowplot)
plot_grid(plotlist=listOfHistograms[1:4])
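For the base R function in the question, a likely reason the title goes missing is that sapply() hands only the column values to the function, so the column name is not available inside it. A minimal sketch, assuming mtcars as stand-in data, passes the name in as a second argument; histo_abline and cols are illustrative names.

```r
# Histogram with a title and a vertical line at the mean
histo_abline <- function(x, title) {
  hist(x, main = title, xlab = title)
  abline(v = mean(x, na.rm = TRUE), col = "blue", lwd = 4)
}

dataset <- mtcars                 # stand-in data
cols <- names(dataset)[3:7]

# Map() walks over the columns and their names in parallel
invisible(Map(histo_abline, dataset[cols], cols))
```

With your own data, replace 3:7 with the column positions you need, for example 7:10.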

Plot a table with box size changing

Does anyone have an idea how this kind of chart is plotted? It looks like a heat map, but instead of color, the size of each cell indicates the magnitude. I want to plot a figure like this, but I don't know how to produce it. Can this be done in R or Matlab?
Try scatter:
scatter(x,y,sz,c,'s','filled');
where x and y are the positions of each square, sz is the size (must be a vector of the same length as x and y), and c is a length(x)-by-3 matrix with the RGB color value for each entry. The labels for the plot can be set with set(gca,properties) or xticklabels:
X=30;
Y=10;
[x,y]=meshgrid(1:X,1:Y);
x=reshape(x,[size(x,1)*size(x,2) 1]);
y=reshape(y,[size(y,1)*size(y,2) 1]);
sz=50;
sz=sz*(1+rand(size(x)));
c=[1*ones(length(x),1) repmat(rand(size(x)),[1 2])];
scatter(x,y,sz,c,'s','filled');
xlab={'ACC';'BLCA';etc}
xticks(1:X)
xticklabels(xlab)
set(get(gca,'XLabel'),'Rotation',90);
ylab={'RAPGEB6';etc}
yticks(1:Y)
yticklabels(ylab)
EDIT: xticks, yticks and friends are only available from R2016b onwards; if you have an older version, use set instead:
set(gca,'XTick',1:X,'XTickLabel',xlab,'XTickLabelRotation',90) % rotation only available from R2014b onwards
set(gca,'YTick',1:Y,'YTickLabel',ylab)
In R, you can use ggplot2, which allows you to map your values (gene expression in your case?) onto the size variable. Here is a simulation that resembles your data structure:
my_data <- matrix(rnorm(8*26,mean=0,sd=1), nrow=8, ncol=26,
dimnames = list(paste0("gene",1:8), LETTERS))
Then, you can process the data frame to be ready for ggplot2 data visualization:
library(reshape)
dat_m <- melt(my_data, varnames = c("gene", "cancer"))
Now, use ggplot2::geom_tile() to map the values onto the size variable. You may update additional features of the plot.
library(ggplot2)
ggplot(data=dat_m, aes(cancer, gene)) +
geom_tile(aes(size=value, fill="red"), color="white") +
scale_fill_discrete(guide="none") + ## hide scale
scale_size_continuous(guide="none") ## hide another scale
In R, the corrplot package can be used. Specifically, you have to use method = 'square' when creating the plot.
Try this as an example:
library(corrplot)
corrplot(cor(mtcars), method = 'square', col = 'red')
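If you would rather avoid extra packages, base graphics has symbols(), which draws one square per cell with a side length you supply. The sketch below uses made-up, non-negative values; vals and cells are illustrative names, and the 0.12 inch maximum size is an arbitrary choice.

```r
set.seed(1)
# Simulated non-negative values: 8 genes by 26 cancer types (made-up data)
vals <- matrix(runif(8 * 26), nrow = 8,
               dimnames = list(paste0("gene", 1:8), LETTERS))

# One (row, col) pair per cell; row varies fastest, matching as.vector(vals)
cells <- expand.grid(row = seq_len(nrow(vals)), col = seq_len(ncol(vals)))

# squares gives side lengths, so take sqrt to make area proportional to value
symbols(cells$col, cells$row, squares = sqrt(as.vector(vals)),
        inches = 0.12, bg = "red", fg = "red",
        xaxt = "n", yaxt = "n", xlab = "", ylab = "")
axis(1, at = seq_len(ncol(vals)), labels = colnames(vals), las = 2)
axis(2, at = seq_len(nrow(vals)), labels = rownames(vals), las = 1)
```

Values that can be negative would need rescaling first, since a side length cannot be negative.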

How to color different groups in qqplot?

I'm plotting some Q-Q plots using the qqplot function. It's very convenient to use, except that I want to color the data points based on their IDs. For example:
library(qualityTools)
n=(rnorm(n=500, m=1, sd=1) )
id=c(rep(1,250),rep(2,250))
myData=data.frame(x=n,y=id)
qqPlot(myData$x, "normal",confbounds = FALSE)
So the plot looks like:
I need to color the dots based on their "id" values, for example blue for the ones with id=1, and red for the ones with id=2. I would greatly appreciate your help.
You can try setting col = myData$y. I'm not sure how the qqPlot function works from that package, but if you're not stuck with using that function, you can do this in base R.
Using base R functions, it would look something like this:
# The example data, as generated in the question
n <- rnorm(n=500, m=1, sd=1)
id <- c(rep(1,250), rep(2,250))
myData <- data.frame(x=n,y=id)
# The plot
qqnorm(myData$x, col = myData$y)
qqline(myData$x, lty = 2)
Not sure how helpful the colors will be due to the overplotting in this particular example.
I've not used qqPlot before, but if you want to use it, there is a way to achieve what you want. It looks like the function invisibly passes back the data used in the plot. That means we can do something like this:
# Use qqPlot - it generates a graph, but ignore that for now
plotData <- qqPlot(myData$x, "normal",confbounds = FALSE, col = sample(colors(), nrow(myData)))
# Given that you have the data generated, you can create your own plot instead ...
with(plotData, {
plot(x, y, col = ifelse(id == 1, "red", "blue"))
abline(int, slope)
})
Hope that helps.
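One caveat when building the plot by hand: a Q-Q plot pairs sorted sample values with theoretical quantiles, so if the plotting data is sorted (an assumption about qqPlot's return value that is not verified here), the id vector in its original order no longer lines up with the points. A minimal base R sketch that reorders the ids together with the sample; ord, sample_q, id_sorted and theo_q are illustrative names.

```r
set.seed(42)
n <- rnorm(n = 500, mean = 1, sd = 1)
id <- c(rep(1, 250), rep(2, 250))
myData <- data.frame(x = n, y = id)

# Sort the sample and carry the ids along in the same order
ord <- order(myData$x)
sample_q <- myData$x[ord]
id_sorted <- myData$y[ord]

# Theoretical standard normal quantiles at the usual plotting positions
theo_q <- qnorm(ppoints(length(sample_q)))

plot(theo_q, sample_q,
     col = ifelse(id_sorted == 1, "blue", "red"),
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(myData$x, lty = 2)
```

This follows the colours asked for in the question: blue for id 1 and red for id 2.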

Plotting three densities on the same graph in different line patterns with titles etc

I am very, very new to R so please forgive the basic nature of my question. In short, I have done a lot of Google searching to try to answer this, but I find that even the basic guides available, and simple discussions on forums are assuming more prior knowledge than I have, especially when it comes to outlining what all of the coding terms are and what changing them means for a plot.
In short I have a tab formatted table with three columns of data that I wish to plot densities for on a single graph. I would like the lines to be different patterns (dotted, dashed etc. whatever makes it easy to tell them apart, I cannot use colours as my supervisor is colour blind).
I have code that reads in the data and makes accessible the columns I am interested in:
mydata <- read.table("c:/Users/Demon/Desktop/Thesis/Fst_all_genome.txt", header=TRUE,
sep="\t")
fstdata <- data.frame(Fst_ceu_mkk =rnorm(10),
Fst_ceu_yri =rnorm(10),
Fst_mkk_yri =rnorm(10))
Where do I go from here?
Appendix A of 'An Introduction to R' has a nice walkthrough tutorial you can do in ten minutes; it teaches among other things about line types etc
After that, plotting densities was explained dozens of times here too; search in the search box above for eg '[r] density'. There is also the R Graph Gallery (possibly down right now) and more.
A nice, free guide I often recommend is John Verzani's simpleR which stresses graphs a lot and will teach you what you need here.
Two options for you to explore using high-level graphics.
# dummy data
d = data.frame(x = rnorm(10), y = rnorm(10), z = rnorm(10))
You first need to reshape the data from wide to long format,
require(reshape2)
m = melt(d)
ggplot2 graphics
require(ggplot2)
ggplot(data = m, mapping = aes(x = value, linetype = variable)) +
geom_line(stat = "density")
Lattice graphics
Using the same melt()ed data,
require(lattice)
densityplot( ~ value, data = m, group = variable,
auto.key = TRUE, par.settings = col.whitebg())
If you need something very simple, you could simply do:
plot(density(mydata$col_1))
lines(density(mydata$col_2), lty = 2)
lines(density(mydata$col_3), lty = 3)
If the second and third density curves are far away from the first, you'll need to define the x and y limits of the plotting region explicitly:
dens1 <- density(mydata$col_1)
dens2 <- density(mydata$col_2)
dens3 <- density(mydata$col_3)
plot(dens1, xlim = range(dens1$x, dens2$x, dens3$x),
ylim = range(dens1$y, dens2$y, dens3$y))
lines(dens2, lty = 2)
lines(dens3, lty = 3)
Hope this helps.
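To add the title, axis labels and a legend that tells the line patterns apart, the base R approach can be extended as in the sketch below. It uses random stand-in data with the column names from the question; the title and axis label text are placeholders.

```r
set.seed(1)
# Stand-in data with the column names from the question
fstdata <- data.frame(Fst_ceu_mkk = rnorm(10),
                      Fst_ceu_yri = rnorm(10),
                      Fst_mkk_yri = rnorm(10))

# One density estimate per column
dens <- lapply(fstdata, density)

# Axis limits wide enough for all curves
xlim <- range(sapply(dens, function(d) range(d$x)))
ylim <- range(sapply(dens, function(d) range(d$y)))

# Empty plot with title and labels, then one line type per column
plot(NA, xlim = xlim, ylim = ylim,
     main = "Fst densities", xlab = "Fst", ylab = "Density")
for (i in seq_along(dens)) lines(dens[[i]], lty = i)
legend("topright", legend = names(dens), lty = seq_along(dens))
```

Line types 1, 2 and 3 are solid, dashed and dotted, so the curves stay distinguishable without colour.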
