Imagine we have 7 categories (e.g. religion), and we would like to plot them not in a linear way, but in clusters that are automatically chosen to be nicely aligned. Here the individuals within groups have the same response, but should not be plotted on one line (which happens when plotting ordinal data).
So to sum it up:
automatically using available graph space
grouping without order, spread around canvas
individuals remain visible; no overlapping
it would be nice to have the individuals within groups bound by some (invisible) circle
Are there any packages designed for this purpose? What are keywords I need to look for?
Example data:
religion <- sample(1:7, 100, T)
# No overlap here, but I would like to see the group part come out more.
plot(religion)
After assigning coordinates to the center of each group,
you can use wordcloud::textplot to avoid overlapping labels.
# Data
n <- 100
k <- 7
religion <- sample(1:k, n, TRUE)
names(religion) <- outer(LETTERS, LETTERS, paste0)[1:n]
# Position of the groups
x <- runif(k)
y <- runif(k)
# Plot
library(wordcloud)
textplot(
x[religion], y[religion], names(religion),
xlim=c(0,1), ylim=c(0,1), axes=FALSE, xlab="", ylab=""
)
Alternatively, you can build a graph with a clique (or a tree)
for each group,
and use one of the many graph-layout algorithms in igraph.
library(igraph)
A <- outer( religion, religion, `==` )
g <- graph.adjacency(A)
plot(g)
plot(minimum.spanning.tree(g))
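For example, an explicit force-directed layout tends to pull each clique into its own tight cluster. A quick sketch (the layout function and plotting parameters here are just one choice among many):
# Fruchterman-Reingold layout; hide labels and arrows so individuals show as plain points
plot(g, layout = layout.fruchterman.reingold(g),
     vertex.size = 3, vertex.label = NA, edge.arrow.size = 0)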
In the image you linked, each point has three values associated with it: x and y coordinates and a group (colour). If you only have one piece of information for each individual, you can do something like this:
set.seed(1)
centers <- data.frame(religion=1:7, cx=runif(7), cy=runif(7))
eps <- 0.04
data <- within(merge(data.frame(religion=sample(1:7, 100, T)), centers),
{
x <- cx+rnorm(length(cx),sd=eps)
y <- cy+rnorm(length(cy),sd=eps)
})
with(data, plot(x,y,col=religion, pch=16))
Note that I'm creating random centers for each group and also creating small displacements around these centers for each observation. You'll have to play around with the parameter eps, and maybe set the centers manually, if you want to pursue this path.
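For instance, manually chosen centers (the coordinates below are arbitrary) would simply replace the runif() calls above:
centers <- data.frame(religion = 1:7,
                      cx = c(0.15, 0.50, 0.85, 0.30, 0.70, 0.15, 0.85),
                      cy = c(0.20, 0.85, 0.20, 0.55, 0.55, 0.85, 0.85))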
I need to make a histogram for my variable, which is 'travel time'. Inside it, I need to plot the regression (correlation) data, i.e. my observed data vs. predicted, and I need to repeat this for different times of day and week (in simple words, make a matrix of such figures using the par function). For now, I can draw histograms and arrange them in matrix form, but I am facing a problem with the inset plot (plotting the x and y data together with a y = x line, and placing each inset within its corresponding histogram in the matrix). How can I do that, as in the figure below? Any help would be appreciated. Thanks!
One way to do this is to loop over your data and create the desired plot on every iteration. Here is a not very polished example, but it shows the logic of how a small plot can be drawn over a larger plot. You will have to tweak the code to get it to work the way you need, but it shouldn't be that difficult.
# create some sample dataset (your x values)
a <- c(rnorm(100,0,1))
b <- c(rnorm(100,2,1))
# create their "y" values counterparts
x <- a + 3
y <- b + 4
# bind the data into two dataframes (explanatory variables in one, explained in the other)
data1 <- cbind(a,b)
data2 <- cbind(x,y)
# set dimensions of the plot matrix
par(mfrow = c(2,1))
# for each of the explanatory - explained pair
for (i in 1:ncol(data2))
{
# set positioning of the histogram
par("plt" = c(0.1,0.95,0.15,0.9))
# plot the histogram
hist(data1[, i])
# set positioning of the small plot
par("plt" = c(0.7, 0.95, 0.7, 0.95))
# plot the small plot over the histogram
par(new = TRUE)
plot(data1[, i], data2[, i])
# add the y = x reference line to the small plot
lines(data1[, i], data1[, i])
}
I have around 20,000 points in my scatter plot. I have a list of interesting points and want to show those points in the scatter plot with a different color. Is there any simple way to do it? Thank you.
Further explanation,
I have a matrix consisting of 20,000 rows, let's say R1 to R20000, and 4 columns, let's say A, B, C, and D. Each row has its own row name. I want to make a scatter plot between A and C. That is easy with plot(data$A, data$C).
On the other hand, I have a list of row names and I want to check where those points fall in the scatter plot. Let's say R1, R3, R5, R10, R20, R25.
I just want to give R1, R3, R5, R10, R20, R25 a different color from the other points in the scatter plot. Sorry if my explanation is not clear.
If your data is in a simple form, then it is easy to do. For example:
# Make some toy data
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
# List of indices (or a logical vector) defining your interesting points
is.interesting <- sample(1000, 30)
# Create vector/column of colours
dat$col <- "lightgrey"
dat$col[is.interesting] <- "red"
# Plot
with(dat, plot(x, y, col = col, pch = 16))
Without a reproducible example, it's hard to say anything more specific.
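That said, if your interesting points are given as row names (as in your example), match() against rownames() gives the indices. A rough sketch, continuing from the toy data above (the row names and the list of interesting names here are made up):
rownames(dat) <- paste0("R", seq_len(nrow(dat)))          # pretend row names
interesting_names <- c("R1", "R3", "R5", "R10", "R20", "R25")
is.interesting <- match(interesting_names, rownames(dat))  # positions of those rows
dat$col <- "lightgrey"
dat$col[is.interesting] <- "red"
with(dat, plot(x, y, col = col, pch = 16))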
I have a perspective plot of a locfit model and I wish to add two things to it
Predictor variables as points in the 3D space
Color the surface according to the Z axis value
For the first, I have tried to use the trans3d function. But I get the following error even though my variables are in vector format:
Error in cbind(x, y, z, 1) %*% pmat : requires numeric/complex matrix/vector arguments
Here is a snippet of my code
library(locfit)
X <- as.matrix(loc1[,1:2])
Y <- as.matrix(loc1[,3])
zz <- locfit(Y~X,kern="bisq")
pmat <- plot(zz,type="persp",zlab="Amount",xlab="",ylab="",main="Plains",
phi = 30, theta = 30, ticktype="detailed")
x1 <- as.vector(X[,1])
x2 <- as.vector(X[,2])
Y <- as.vector(Y)
points(trans3d(x1,x2,Y,pmat))
My "loc1" data can be found here - https://www.dropbox.com/s/0kdpd5hxsywnvu2/loc1_amountfreq.txt?dl=0
TL;DR: not really possible within plot.locfit, but you can reconstruct it.
I don't think plot.locfit has good support for this sort of customisation. Supposedly get.data=T in your plot call will plot the original data points (point 1), and it does seem to do so, except if type="persp". So no luck there. Alternatively you can points(trans3d(...)) as you have done, except you need the perspective matrix returned by persp, and plot.locfit.3d does not return it. So again, no luck.
For colouring, typically you make a colour scale (http://r.789695.n4.nabble.com/colour-by-z-value-persp-in-raster-package-td4428254.html) and assign each z facet the colour that goes with it. However, you need the z-values of the surface (not the z-values of your original data) for this, and plot.locfit does not appear to return this either.
So to do what you want, you'll essentially be re-coding plot.locfit yourself (not hard, though a bit kludgy).
You could put this into a function so you can reuse it (a sketch of such a wrapper follows the code below).
We:
make a uniform grid of x-y points
calculate the value of the fit at each point
use these to draw the surface (with a colour scale), saving the perspective matrix so that we can
plot your original data
so:
# make a grid of x and y coords, calculate the fit at those points
n.x <- 20 # number of x points in the x-y grid
n.y <- 30 # number of y points in the x-y grid
zz <- locfit(Total ~ Mex_Freq + Cal_Freq, data=loc1, kern="bisq")
xs <- with(loc1, seq(min(Mex_Freq), max(Mex_Freq), length.out=n.x))
ys <- with(loc1, seq(min(Cal_Freq), max(Cal_Freq), length.out=n.y))
xys <- expand.grid(Mex_Freq=xs, Cal_Freq=ys)
zs <- matrix(predict(zz, xys), nrow=length(xs))
# generate a colour scale
n.cols <- 100 # number of colours
palette <- colorRampPalette(c('blue', 'green'))(n.cols) # from blue to green
# palette <- colorRampPalette(c(rgb(0,0,1,.8), rgb(0,1,0,.8)), alpha=T)(n.cols) # if you want transparency for example
# work out which colour each facet should get: average the z-values of the
# four corners of each facet, then split those means into n.cols bins
zfacet <- (zs[-1, -1] + zs[-1, -ncol(zs)] +
           zs[-nrow(zs), -1] + zs[-nrow(zs), -ncol(zs)]) / 4
facetcol <- cut(zfacet, n.cols)
# draw surface, with colours (col=...)
pmat <- persp(x=xs, y=ys, zs, theta=30, phi=30, ticktype='detailed', main="plains", xlab="", ylab="", zlab="Amount", col=palette[facetcol])
# draw your original data
with(loc1, points(trans3d(Mex_Freq,Cal_Freq,Total,pmat), pch=20))
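If you want to reuse this, one way is to wrap the steps above in a function. The sketch below assumes a two-predictor locfit fit and takes the variable names as strings; the function name and the defaults are made up, not part of locfit:
persp_locfit <- function(fit, data, xvar, yvar, zvar,
                         n.x = 20, n.y = 30, n.cols = 100, ...) {
  # uniform x-y grid over the observed range of the predictors
  xs <- seq(min(data[[xvar]]), max(data[[xvar]]), length.out = n.x)
  ys <- seq(min(data[[yvar]]), max(data[[yvar]]), length.out = n.y)
  grid <- setNames(expand.grid(xs, ys), c(xvar, yvar))
  zs <- matrix(predict(fit, grid), nrow = n.x)
  # colour each facet by the mean z-value of its four corners
  pal <- colorRampPalette(c("blue", "green"))(n.cols)
  zfacet <- (zs[-1, -1] + zs[-1, -ncol(zs)] +
             zs[-nrow(zs), -1] + zs[-nrow(zs), -ncol(zs)]) / 4
  pmat <- persp(xs, ys, zs, col = pal[cut(zfacet, n.cols)], ...)
  # overlay the original observations
  points(trans3d(data[[xvar]], data[[yvar]], data[[zvar]], pmat), pch = 20)
  invisible(pmat)
}
# e.g. persp_locfit(zz, loc1, "Mex_Freq", "Cal_Freq", "Total",
#                   theta = 30, phi = 30, ticktype = "detailed")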
Note: it doesn't look that pretty! You might want to adjust, say, your colour-scale colours, or the transparency of the facets, etc. Re: adding a legend, there are some other questions that deal with that.
(PS: what a shame ggplot doesn't do 3D scatter plots.)
Here is some data to work with.
df <- data.frame(x1=c(234,543,342,634,123,453,456,542,765,141,636,3000),x2=c(645,123,246,864,134,975,341,573,145,468,413,636))
If I plot these data, it will produce a simple scatter plot with an obvious outlier:
plot(df$x2,df$x1)
Then I can always write the code below to remove the y-axis outlier(s).
plot(df$x2,df$x1,ylim=c(0,800))
So my question is: is there a way to exclude obvious outliers from scatterplots automatically, like outline=F does for boxplots, for example? To my knowledge, outline=F doesn't work with scatterplots.
This is relevant because I have hundreds of scatterplots and I want to exclude all obvious outlying data points without setting ylim(...) for each individual scatterplot.
You could write a function that returns the index of what you define as an obvious outlier. Then use that function to subset your data before plotting.
Here all observations with "a" exceeding 5 * median of "a" are excluded.
df <- data.frame(a = c(1,3,4,2,100), b=c(1,3,2,4,2))
f <- function(x){
which(x$a > 5*median(x$a))
}
with(df[-f(df),], plot(b, a))
There is no easy yes/no option to do what you are looking for (the question of defining what is an "obvious outlier" for a generic scatterplot is potentially quite problematic).
That said, it should not be too difficult to write a reasonable function to give y-axis limits from a set of data points. If we take "obvious outlier" to mean a point with y value significantly above or below the bulk of the sample (which could be justified assuming a sufficient distribution of x values), then you could use something like:
ybounds <- function(y){  # y is the response variable in the dataframe
  bounds <- quantile(y, probs=c(0.05, 0.95), type=3, names=FALSE)
  return(bounds + c(-1,1) * 0.1 * (bounds[2]-bounds[1]))
}
Then plot each dataframe with plot(df$x, df$y, ylim=ybounds(df$y))
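For instance, with the example dataframe from the question (taking x1 as the response):
plot(df$x2, df$x1, ylim = ybounds(df$x1))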
I have come across a number of situations where I want to plot more points than I really ought to be -- the main holdup is that when I share my plots with people or embed them in papers, they occupy too much space. It's very straightforward to randomly sample rows in a dataframe.
If I want a truly random sample for a point plot, it's easy to say:
qplot(x, y, data=myDf[sample(1:nrow(myDf), 1000), ])
However, I was wondering if there were more effective (ideally canned) ways to specify the number of plot points such that your actual data is accurately reflected in the plot. So here is an example.
Suppose I am plotting something like the CCDF of a heavy tailed distribution, e.g.
ccdf <- function(myList, density=FALSE)
{
  # generates the CCDF of a list or vector
  freqs <- table(myList)
  X <- rev(as.numeric(names(freqs)))
  Y <- cumsum(rev(as.numeric(freqs)))
  data.frame(x=X, count=Y)
}
qplot(x,count,data=ccdf(rlnorm(10000,3,2.4)),log='xy')
This will produce a plot where the x and y axes become increasingly dense. Here it would be ideal to have fewer points plotted for large x or y values.
Does anybody have any tips or suggestions for dealing with similar issues?
Thanks,
-e
I tend to use png files rather than vector based graphics such as pdf or eps for this situation. The files are much smaller, although you lose resolution.
If it's a more conventional scatterplot, then using semi-transparent colours also helps, as well as solving the over-plotting problem. For example,
x <- rnorm(10000); y <- rnorm(10000)
library(ggplot2)   # provides qplot() and alpha()
qplot(x, y, colour = I(alpha("blue", 1/25)))
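A rough sketch combining the two ideas with base graphics (the file name, size and alpha level are arbitrary choices):
# render to a bitmap device and use a semi-transparent colour
png("scatter.png", width = 800, height = 800, res = 120)
plot(x, y, pch = 20, col = rgb(0, 0, 1, alpha = 0.04))
dev.off()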
Beyond Rob's suggestions, one plot function I like as it does the 'thinning' for you is hexbin; an example is at the R Graph Gallery.
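A minimal hexbin sketch (assuming the hexbin package is installed; the data below is just illustrative):
library(hexbin)
x <- rlnorm(20000, 3, 2.4)
y <- x * exp(rnorm(20000, 0, 0.5))
# bin the points into hexagons and plot counts instead of 20,000 points
plot(hexbin(log10(x), log10(y), xbins = 40))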
Here is one possible solution for downsampling the plot with respect to the x-axis, if it is log-transformed. It log-transforms the x values, rounds that quantity, and picks the median x value in each bin:
downsampled_qplot <- function(x,y,data,rounding=0, ...) {
# assumes we are doing log=xy or log=x
group = factor(round(log(data$x),rounding))
d <- do.call(rbind, by(data, group,
  function(X) X[order(X$x)[ceiling(nrow(X)/2)], ]))
qplot(x,count,data=d, ...)
}
Using the definition of ccdf() from above, we can then compare the original plot of the CCDF of the distribution with the downsampled version:
myccdf=ccdf(rlnorm(10000,3,2.4))
qplot(x,count,data=myccdf,log='xy',main='original')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=1,main='rounding = 1')
downsampled_qplot(x,count,data=myccdf,log='xy',rounding=0,main='rounding = 0')
In PDF format, the original plot takes up 640K, and the downsampled versions occupy 20K and 8K, respectively.
I'd either make image files (the png or jpeg devices) as Rob already mentioned, or I'd make a 2D histogram. An alternative to the 2D histogram is a smoothed scatterplot; it makes a similar graphic but has a smoother cutoff between dense and sparse regions of the space.
If you've never seen addictedtor before, it's worth a look. It has some very nice graphics generated in R with images and sample code.
Here's the sample code from the addictedtor site:
2-d histogram:
require(gplots)
# example data, bivariate normal, no correlation
x <- rnorm(2000, sd=4)
y <- rnorm(2000, sd=1)
# separate scales for each axis, this looks circular
hist2d(x,y, nbins=50, col = c("white",heat.colors(16)))
rug(x,side=1)
rug(y,side=2)
box()
smoothscatter:
library("geneplotter") ## from BioConductor
require("RColorBrewer") ## from CRAN
x1 <- matrix(rnorm(1e4), ncol=2)
x2 <- matrix(rnorm(1e4, mean=3, sd=1.5), ncol=2)
x <- rbind(x1,x2)
layout(matrix(1:4, ncol=2, byrow=TRUE))
op <- par(mar=rep(2,4))
smoothScatter(x, nrpoints=0)
smoothScatter(x)
smoothScatter(x, nrpoints=Inf,
colramp=colorRampPalette(brewer.pal(9,"YlOrRd")),
bandwidth=40)
colors <- densCols(x)
plot(x, col=colors, pch=20)
par(op)