Multiple histograms with title and mean as a line? - r

I'm struggeling with the histogram function in my exploratory analysis. I would like to run a couple of variables in my dataset through a histogram function and for each add the title and a line at the arithmetic mean. This is how far I've got (but the main title is still missing):
histo.abline <-function(x){
hist(x)
abline(v = mean(x, na.rm = TRUE), col = "blue", lwd = 4)}
sapply(dataset[c(7:10)], histo.abline)
I tried to add a main argument in the histogram function but it just doesn't pick the right variable name of my dataset vector. When I put main=x there, it says returns NULL for each variable. Colnames, names and other functions didn't work either. Could you help me?

you can try to do it with ggplot:
library(ggplot)
histo.abline <-function(dataset,colnum){
p<-ggplot(dataset,aes(dataset[,colnum]))+geom_histogram(bins=5,fill=I("blue"),col=I("red"), alpha=I(.2))+
geom_vline(xintercept = mean(dataset[,colnum], na.rm = TRUE))+xlab(as.character(names(dataset)[colnum]))
return(p)
}
since you have not provided data lets work with mtcars and create a list of histograms
dataset=mtcars
listOfHistograms<-lapply(3:7,function(x) histo.abline(dataset,x))
your list has 5 histograms that you can plot for instance the first by:
print(listOfHistograms[[1]])
More histogram options for ggplot here: https://www.r-bloggers.com/how-to-make-a-histogram-with-ggplot2/
hope this helps
EDIT: Multiple Plot in one graph
One way to do it is through cowplot library:
library(cowplot)
plot_grid(plotlist=listOfHistograms[1:4])

Related

Error in axis(side = side, at = at, labels = labels, ...) : invalid value specified for graphical parameter "pch"

I have applied DBSCAN algorithm on built-in dataset iris in R. But I am getting error when tried to visualise the output using the plot( ).
Following is my code.
library(fpc)
library(dbscan)
data("iris")
head(iris,2)
data1 <- iris[,1:4]
head(data1,2)
set.seed(220)
db <- dbscan(data1,eps = 0.45,minPts = 5)
table(db$cluster,iris$Species)
plot(db,data1,main = 'DBSCAN')
Error: Error in axis(side = side, at = at, labels = labels, ...) :
invalid value specified for graphical parameter "pch"
How to rectify this error?
I have a suggestion below, but first I see two issues:
You're loading two packages, fpc and dbscan, both of which have different functions named dbscan(). This could create tricky bugs later (e.g. if you change the order in which you load the packages, different functions will be run).
It's not clear what you're trying to plot, either what the x- or y-axes should be or the type of plot. The function plot() generally takes a vector of values for the x-axis and another for the y-axis (although not always, consult ?plot), but here you're passing it a data.frame and a dbscan object, and it doesn't know how to handle it.
Here's one way of approaching it, using ggplot() to make a scatterplot, and dplyr for some convenience functions:
# load our packages
# note: only loading dbscacn, not loading fpc since we're not using it
library(dbscan)
library(ggplot2)
library(dplyr)
# run dbscan::dbscan() on the first four columns of iris
db <- dbscan::dbscan(iris[,1:4],eps = 0.45,minPts = 5)
# create a new data frame by binding the derived clusters to the original data
# this keeps our input and output in the same dataframe for ease of reference
data2 <- bind_cols(iris, cluster = factor(db$cluster))
# make a table to confirm it gives the same results as the original code
table(data2$cluster, data2$Species)
# using ggplot, make a point plot with "jitter" so each point is visible
# x-axis is species, y-axis is cluster, also coloured according to cluster
ggplot(data2) +
geom_point(mapping = aes(x=Species, y = cluster, colour = cluster),
position = "jitter") +
labs(title = "DBSCAN")
Here's the image it generates:
If you're looking for something else, please be more specific about what the final plot should look like.

how can i show two histogram in same window but different plots in R?

I want to show the effect of removing outliers on my histogram, so I have to plot both hists together.
boxplot(Costs, Costs1,
xlab=" Costs and Costs after removig outliers",
col=topo.colors(2))
so i tried this:
hist(Costs,Costs1,main="Histogram of Maintenance_cost ",col="blue",
border="darkblue",xlab="Total_cost",ylab=" ",yaxt = 'n',
#ylim=c(0,3000),
#xlim=c(0,max(My_Costs)),
breaks=60)
the first code give me to box plot, but I tried it for hist it doesn't work
can anyone tell me how to do it in R?
For a base R solution, use par with mfrow.
set.seed(1234)
Costs = rnorm(5000, 100, 20)
OUT = which(Costs %in% boxplot(Costs, plot=FALSE)$out)
Costs1 = Costs[-OUT]
par(mfrow=c(1,2), mar=c(5,1,2,1))
hist(Costs,main="Histogram of Maintenance_cost ",col="blue",
border="darkblue",xlab="Total_cost",ylab=" ",yaxt = 'n',
breaks=60, xlim=c(30,170))
hist(Costs1,main="Maintenance_cost without outliers",col="blue",
border="darkblue",xlab="Total_cost",ylab=" ",yaxt = 'n',
breaks=60, xlim=c(30,170))
For multiple plots, you should use ggplot2 with facet_wrap. Here is an example:
Plot several histograms with ggplot in one window with several variables

Run points() after plot() on a dataframe

I'm new to R and want to plot specific points over an existing plot. I'm using the swiss data frame, which I visualize through the plot(swiss) function.
After this, want to add outliers given by the Mahalanobis distance:
mu_hat <- apply(swiss, 2, mean); sigma_hat <- cov(swiss)
mahalanobis_distance <- mahalanobis(swiss, mu_hat, sigma_hat)
outliers <- swiss[names(mahalanobis_distance[mahalanobis_distance > 10]),]
points(outliers, pch = 'x', col = 'red')
but this last line has no effect, as the outlier points aren't added to the previous plot. I see that if repeat this procedure on a pair of variables, say
plot(swiss[2:3])
points(outliers[2:3], pch = 'x', col = 'red')
the red points are added to the plot.
Ask: is there any restriction to how the points() function can be used for a multivariate data frame?
Here's a solution using GGally::ggpairs. It's a little ugly as we need to modify the ggally_points function to specify the desired color scheme.
I've assumed that mu_hat = colMeans(swiss) and sigma_hat = cov(swiss).
library(dplyr)
library(GGally)
swiss %>%
bind_cols(distance = mahalanobis(swiss, colMeans(swiss), cov(swiss))) %>%
mutate(is_outlier = ifelse(distance > 10, "yes", "no")) %>%
ggpairs(columns = 1:6,
mapping = aes(color = is_outlier),
upper = list(continuous = function(data, mapping, ...) {
ggally_points(data = data, mapping = mapping) +
scale_colour_manual(values = c("black", "red"))
}),
lower = list(continuous = function(data, mapping, ...) {
ggally_points(data = data, mapping = mapping) +
scale_colour_manual(values = c("black", "red"))
}),
axisLabels = "internal")
Unfortunately this isn't possible the way you're currently doing things. When plotting a data frame R produces many plots and aligns them. What you're actually seeing there is 6 by 6 = 36 individual plots which have all been aligned to look nice.
When you use the dots command, it tells it to place the dots on the current plot. Which doesn't really make sense when you have 36 plots, at least not the way you want it to.
ggplot is a really powerful tool in R, it provides far greater combustibility. For example you could set up the dataframe to include your outliers, but have them labelled as "outlier" and place it in each plot that you have set up as facets. The more you explore it you might find there are better plots which suit your needs as well.
Plotting a dataframe in base R is a good exploratory tool. You could set up those outliers as a separate dataframe and plot it, so you can see each of the 6 by 6 plots side by side and compare. It all depends on your goal. If you're goal is to produce exactly as you've described, the ggplot2 package will help you create something more professional. As #Gregor suggested in the comments, looking up the function ggpairs from the GGally package would be a good place to start.
A quick google image search shows some funky plots akin to what you're after and then some!
Find it here

How to color different groups in qqplot?

I'm plotting some Q-Q plots using the qqplot function. It's very convenient to use, except that I want to color the data points based on their IDs. For example:
library(qualityTools)
n=(rnorm(n=500, m=1, sd=1) )
id=c(rep(1,250),rep(2,250))
myData=data.frame(x=n,y=id)
qqPlot(myData$x, "normal",confbounds = FALSE)
So the plot looks like:
I need to color the dots based on their "id" values, for example blue for the ones with id=1, and red for the ones with id=2. I would greatly appreciate your help.
You can try setting col = myData$y. I'm not sure how the qqPlot function works from that package, but if you're not stuck with using that function, you can do this in base R.
Using base R functions, it would look something like this:
# The example data, as generated in the question
n <- rnorm(n=500, m=1, sd=1)
id <- c(rep(1,250), rep(2,250))
myData <- data.frame(x=n,y=id)
# The plot
qqnorm(myData$x, col = myData$y)
qqline(myData$x, lty = 2)
Not sure how helpful the colors will be due to the overplotting in this particular example.
Not used qqPlot before, but it you want to use it, there is a way to achieve what you want. It looks like the function invisibly passes back the data used in the plot. That means we can do something like this:
# Use qqPlot - it generates a graph, but ignore that for now
plotData <- qqPlot(myData$x, "normal",confbounds = FALSE, col = sample(colors(), nrow(myData)))
# Given that you have the data generated, you can create your own plot instead ...
with(plotData, {
plot(x, y, col = ifelse(id == 1, "red", "blue"))
abline(int, slope)
})
Hope that helps.

How to plot a violin scatter boxplot (in R)?

I just came by the following plot:
And wondered how can it be done in R? (or other softwares)
Update 10.03.11: Thank you everyone who participated in answering this question - you gave wonderful solutions! I've compiled all the solution presented here (as well as some others I've came by online) in a post on my blog.
Make.Funny.Plot does more or less what I think it should do. To be adapted according to your own needs, and might be optimized a bit, but this should be a nice start.
Make.Funny.Plot <- function(x){
unique.vals <- length(unique(x))
N <- length(x)
N.val <- min(N/20,unique.vals)
if(unique.vals>N.val){
x <- ave(x,cut(x,N.val),FUN=min)
x <- signif(x,4)
}
# construct the outline of the plot
outline <- as.vector(table(x))
outline <- outline/max(outline)
# determine some correction to make the V shape,
# based on the range
y.corr <- diff(range(x))*0.05
# Get the unique values
yval <- sort(unique(x))
plot(c(-1,1),c(min(yval),max(yval)),
type="n",xaxt="n",xlab="")
for(i in 1:length(yval)){
n <- sum(x==yval[i])
x.plot <- seq(-outline[i],outline[i],length=n)
y.plot <- yval[i]+abs(x.plot)*y.corr
points(x.plot,y.plot,pch=19,cex=0.5)
}
}
N <- 500
x <- rpois(N,4)+abs(rnorm(N))
Make.Funny.Plot(x)
EDIT : corrected so it always works.
I recently came upon the beeswarm package, that bears some similarity.
The bee swarm plot is a
one-dimensional scatter plot like
"stripchart", but with closely-packed,
non-overlapping points.
Here's an example:
library(beeswarm)
beeswarm(time_survival ~ event_survival, data = breast,
method = 'smile',
pch = 16, pwcol = as.numeric(ER),
xlab = '', ylab = 'Follow-up time (months)',
labels = c('Censored', 'Metastasis'))
legend('topright', legend = levels(breast$ER),
title = 'ER', pch = 16, col = 1:2)
(source: eklund at www.cbs.dtu.dk)
I have come up with the code similar to Joris, still I think this is more than a stem plot; here I mean that they y value in each series is a absolute value of a distance to the in-bin mean, and x value is more about whether the value is lower or higher than mean.
Example code (sometimes throws warnings but works):
px<-function(x,N=40,...){
x<-sort(x);
#Cutting in bins
cut(x,N)->p;
#Calculate the means over bins
sapply(levels(p),function(i) mean(x[p==i]))->meansl;
means<-meansl[p];
#Calculate the mins over bins
sapply(levels(p),function(i) min(x[p==i]))->minl;
mins<-minl[p];
#Each dot is one value.
#X is an order of a value inside bin, moved so that the values lower than bin mean go below 0
X<-rep(0,length(x));
for(e in levels(p)) X[p==e]<-(1:sum(p==e))-1-sum((x-means)[p==e]<0);
#Y is a bin minum + absolute value of a difference between value and its bin mean
plot(X,mins+abs(x-means),pch=19,cex=0.5,...);
}
Try the vioplot package:
library(vioplot)
vioplot(rnorm(100))
(with awful default color ;-)
There is also wvioplot() in the wvioplot package, for weighted violin plot, and beanplot, which combines violin and rug plots. They are also available through the lattice package, see ?panel.violin.
Since this hasn't been mentioned yet, there is also ggbeeswarm as a relatively new R package based on ggplot2.
Which adds another geom to ggplot to be used instead of geom_jitter or the like.
In particular geom_quasirandom (see second example below) produces really good results and I have in fact adapted it as default plot.
Noteworthy is also the package vipor (VIolin POints in R) which produces plots using the standard R graphics and is in fact also used by ggbeeswarm behind the scenes.
set.seed(12345)
install.packages('ggbeeswarm')
library(ggplot2)
library(ggbeeswarm)
ggplot(iris,aes(Species, Sepal.Length)) + geom_beeswarm()
ggplot(iris,aes(Species, Sepal.Length)) + geom_quasirandom()
#compare to jitter
ggplot(iris,aes(Species, Sepal.Length)) + geom_jitter()

Resources