I'd like to plot a dataset that consists of two vectors of length 100. The difference between the means of the two vectors is large and the variance within each is considerably smaller, so it is difficult to plot both vectors and still see the variation within each.
What I'd like is to be able to set the breaks manually, so that we can see both the difference between the vectors and the variation within them.
Consider this data set:
a=rnorm(100,sd=0.005)+1
b=rnorm(100,sd=0.005)+10
vec = c(a,b)
Neither plot(vec) nor plot(vec,log="y") gives satisfying results, as it is not possible to distinguish the variation within the vector (see picture).
I'd like the breaks on the y-axis to be (min(a), max(a), 5, min(b), max(b)) (and get equal distance between them). How could one achieve that?
Depending on exactly what you are trying to do, a simple transformation of the data in each part of the vector might be enough:
vec2 <- c( (a - min(a))/ (max(a)-min(a)) , 3 + (b - min(b))/ (max(b)-min(b)) )
plot(vec2, axes=F)
box()
axis(1)
axis(2, at=c(0,1,2,3,4), labels = round(c(min(a), max(a), 5, min(b), max(b)),2))
Alternative approaches might be a custom transformation in ggplot, a secondary axis in ggplot, breaking the graph into facets, or using ggbreak.
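As a rough illustration of the "breaking the graph into facets" idea without any extra packages, here is a minimal base-R sketch that stacks two panels, each with its own y-range. It swaps par(mfrow) in for ggplot facets, and the seed and panel margins are arbitrary choices of mine.

```r
set.seed(42)                       # arbitrary seed, only for reproducibility
a <- rnorm(100, sd = 0.005) + 1
b <- rnorm(100, sd = 0.005) + 10

# Two stacked panels; each panel picks its own y-limits from its data
op <- par(mfrow = c(2, 1), mar = c(2, 4, 1, 1))
plot(b, ylab = "b")                # upper panel: values around 10
plot(a, ylab = "a")                # lower panel: values around 1
par(op)                            # restore the previous layout and margins
```

The gap between the panels is not to scale, so the size of the difference between a and b has to be read from the axis labels rather than from the distance on the page.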
I want to plot the distribution of a dataset as a histogram in R. I tried different values of the breaks argument (default, Freedman-Diaconis, and Scott) to get the best representation. I am considering a log scale later, but first I want to see the raw distribution without any scaling. However, the results look different; why is that? The dataset I use can be downloaded from here data or here data. The code I'm running is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = 200)
result is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks = "Scott")
Result is
hist(as.matrix(deviation_all_genes_all_spots), xlim = c(-(1*10^(4)), 10^(4.5)), breaks="Freedman-Diaconis")
result is
Please help. Thank you very much.
Histograms are very sensitive to the choice of cell break points. Even for the same (!) number of cells, the histogram can become considerably different by just a small shift of the cell borders. It is thus generally preferable to use kernel density estimators instead of histograms, because they do not depend on random cell border placement:
# increase n if you have a wide range of values
d <- density(as.matrix(deviation_all_genes_all_spots), n=512)
plot(d$x, d$y)
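To see the sensitivity to cell borders for yourself, here is a small sketch on made-up normal data (not your dataset): both histograms use the same cell width, and only the grid is shifted by half a cell.

```r
set.seed(1)          # made-up data, only for illustration
x <- rnorm(200)

# Same cell width (0.5); the second grid is shifted by 0.25
h1 <- hist(x, breaks = seq(-5,    5,    by = 0.5), plot = FALSE)
h2 <- hist(x, breaks = seq(-5.25, 5.25, by = 0.5), plot = FALSE)

# Compare the counts in the cells around zero
rbind(mids1 = h1$mids[9:12], counts1 = h1$counts[9:12])
rbind(mids2 = h2$mids[9:12], counts2 = h2$counts[9:12])
```

Both objects describe the same 200 values, but the counts per cell, and therefore the shape of the bars, depend on where the grid starts.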
In your second and third call of hist, you ask for an automatic way to select the number of cells and the cell borders. Obviously, this results in more cells than in your first call with breaks=200. You can query the cells from the return value of hist, e.g.
h <- hist(as.matrix(deviation_all_genes_all_spots))
cat(sprintf("number of cells = %i\n", length(h$mids)))
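If you want to compare the rules side by side, something like the following sketch may help. It uses made-up heavy-tailed data in place of your matrix, and the nclass.* helpers from grDevices, which return the number of cells each rule suggests.

```r
set.seed(1)
# Made-up data with a few extreme values, standing in for the real matrix
x <- c(rnorm(5000), rnorm(50, mean = 0, sd = 50))

# Number of cells suggested by each rule
suggested <- c(Sturges = nclass.Sturges(x),
               Scott   = nclass.scott(x),
               FD      = nclass.FD(x))
suggested

# Number of cells hist() actually used for each rule
used <- sapply(c("Sturges", "Scott", "FD"), function(rule)
  length(hist(x, breaks = rule, plot = FALSE)$mids))
used

# A single number is only a suggestion: hist() rounds to "pretty" break points
length(hist(x, breaks = 200, plot = FALSE)$mids)
```

Note that xlim only changes the visible window; the cells are always computed from the full range of the data, so a few extreme values can change the cell width considerably.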
I use Julia with Plots to generate my plots.
I want to plot data (A, B), and I know that all the interesting data lies in two regions of A. The two regions should be plotted next to each other in one plot.
My A data is evenly spaced, so I cut out the interesting pieces and glued them together into one object.
My problem is that I don't know how to manipulate the scale on the x-axis.
When I just plot the B data against their array index, I basically get the form I want. I just need the numbers from A on the x-axis.
I give here a toy example
using Plots
N=5000
B=rand(N)
A=(1:1:N)
xl_1=100
xu_1=160
xl_2=600
xu_2=650
A_new=vcat(A[xl_1:xu_1],A[xl_2:xu_2])
B_new=vcat(B[xl_1:xu_1],B[xl_2:xu_2])
plot(A_new,B_new) # This leaves the spacing between the data explicit
plot(B_new) # This creates basically the right spacing, but
# without the right x axis grid
I did not find anything on how to use two successive xlims, which is why I tried it this way.
You can't pass two successive xlims, because you can't have a break in the axis. That is by design in Plots.
So your possibilities are: 1) to have two subplots with different parts of the plot, or 2) to plot with the index, and just change the axis labels.
The second approach would use a command like xticks = ([1, 50, 100, 150], ["1", "50", "600", "650"]), but I'd recommend the first, as it is strictly speaking a more correct way of displaying the data:
plot(
plot(A[xl_1:xu_1], B[xl_1:xu_1], legend = false),
plot(A[xl_2:xu_2], B[xl_2:xu_2], yshowaxis = false),
link = :y
)
So, I've spent the last four hours trying to find an efficient way of plotting the curve(s) of a function with two variables - to no avail. The only answer that I could actually put to practice wasn't producing a multiple-line graph as I expected.
I created a function with two variables, x and y, and it returns a continuous numeric value. I wanted to plot in a single screen the result of this function with certain values of x and all possible values of y within a given range (y is also a continuous variable).
Something like that:
These two questions did help a little, but I still can't get there:
Plotting a function curve in R with 2 or more variables
How to plot function of multiple variables in R by initializing all variables but one
I also used the mosaic package and plotFun function, but the results were rather unappealing and not very readable: https://www.youtube.com/watch?v=Y-s7EEsOg1E.
Maybe the problem is my lack of proficiency with R - though I've been using it for months so I'm not such a noob. Please enlighten me.
Say we have a simple function with two arguments:
fun <- function(x, y) 0.5*x - 0.01*x^2 + sqrt(abs(y)/2)
And we want to evaluate it on the following x and y values:
xs <- seq(-100, 100, by=1)
ys <- c(0, 100, 300)
This line below might be a bit hard to understand but it does all of the work:
res <- mapply(fun, list(xs), ys)
mapply allows us to run a function with multiple arguments across a range of values. Here we provide it with only one value for the "x" argument (note that xs is a long vector, but since it is wrapped in a list it counts as a single instance). We also provide multiple values for the "y" argument. So the function will run 3 times, each time with the same value of x and a different value of y.
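If the recycling is hard to picture, this short check may help. It repeats the definitions from above so it runs on its own, and then compares the mapply result with outer(), which I bring in only as a cross-check; outer() works here because fun is vectorised.

```r
fun <- function(x, y) 0.5*x - 0.01*x^2 + sqrt(abs(y)/2)
xs <- seq(-100, 100, by = 1)
ys <- c(0, 100, 300)

res <- mapply(fun, list(xs), ys)
dim(res)                     # 201 rows (one per x), 3 columns (one per y)

# Column j holds fun(xs, ys[j])
all.equal(res[, 2], fun(xs, ys[2]))

# outer() evaluates fun on every (x, y) pair and gives the same layout
res2 <- outer(xs, ys, fun)
all.equal(res, res2)
```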
Results are arranged column-wise so in the end we have 3 columns. Now we only have to plot:
cols <- c("black", "cornflowerblue", "orange")
matplot(xs, res, col=cols, type="l", lty=1, lwd=2, xlab="x", ylab="result")
legend("bottomright", legend=ys, title="value of y", lwd=2, col=cols)
Here the matplot function does all the work - it plots a line for every column in the provided matrix. Everything else is decoration.
Here is the result:
I'm looking to plot a set of sparklines in R with just a 0 and 1 state that looks like this:
Does anyone know how I might create something like that ideally with no extra libraries?
I don't know of any simple way to do this, so I'm going to build up this plot from scratch. This would probably be a lot easier to design in illustrator or something like that, but here's one way to do it in R (if you don't want to read the whole step-by-step, I provide my solution wrapped in a reusable function at the bottom of the post).
Step 1: Sparklines
You can use the pch argument of the points function to define the plotting symbol. ASCII symbols are supported, which means you can use the "pipe" symbol for vertical lines. The ASCII code for this symbol is 124, so to use it for our plotting symbol we could do something like:
plot(df, pch=124)
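If you would rather not memorise the codes, utf8ToInt will look them up. The 0/1 series below is made up purely to show the idea of switching between a pipe and a blank.

```r
utf8ToInt("|")   # 124
utf8ToInt(" ")   # 32

state <- c(1, 0, 1, 1, 0)   # made-up 0/1 series
pch <- ifelse(state == 1, utf8ToInt("|"), utf8ToInt(" "))

# One symbol per position: a pipe where state is 1, nothing where it is 0
plot(seq_along(state), rep(1, length(state)),
     pch = pch, axes = FALSE, ann = FALSE)
```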
Step 2: labels and numbers
We can put text on the plot by using the text command:
text(x,y,char_vect)
Step 3: Alignment
This is basically just going to take a lot of trial and error to get right, but it'll help if we use values relative to our data.
Here's the sample data I'm working with:
df = data.frame(replicate(4, rbinom(50, 1, .7)))
colnames(df) = c('steps','atewell','code','listenedtoshell')
I'm going to start out by plotting an empty box to use as our canvas. To make my life a little easier, I'm going to set the coordinates of the box relative to values meaningful to my data. The Y positions of the 4 data series will be the same across all plotting elements, so I'm going to store that for convenience.
n=ncol(df)
m=nrow(df)
plot(1:m,
seq(1,n, length.out=m),
# The following arguments suppress plotting values and axis elements
type='n',
xaxt='n',
yaxt='n',
ann=F)
With this box in place, I can start adding elements. For each element, the X values will all be the same, so we can use rep to set that vector, and seq to set the Y vector relative to the Y range of our plot (1:n). I'm going to shift the positions by percentages of the X and Y ranges to align my values, and modify the size of the text using the cex parameter. Ultimately, I found that this works out:
ypos = rev(seq(1+.1*n,n*.9, length.out=n))
text(rep(1,n),
ypos,
colnames(df), # These are our labels
pos=4, # This positions the text to the right of the coordinate
cex=2) # Increase the size of the text
I reversed the sequence of Y values because I built my sequence in ascending order, and the values on the Y axis in my plot increase from bottom to top. Reversing the Y values then makes it so the series in my dataframe will print from top to bottom.
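To make the effect of rev concrete, here are the positions for the four series in my sample data; the column names are repeated so the snippet runs on its own.

```r
n <- 4
series <- c('steps', 'atewell', 'code', 'listenedtoshell')

ypos <- rev(seq(1 + 0.1 * n, n * 0.9, length.out = n))
setNames(round(ypos, 2), series)
# The first column gets the largest Y (3.6), the last the smallest (1.4)
```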
I then repeated this process for the second label, shifting the X values over but keeping the Y values the same.
text(rep(.37*m,n), # Shifted towards the middle of the plot
ypos,
colSums(df), # new label
pos=4,
cex=2)
Finally, we shift X over one last time and use points to build the sparklines with the pipe symbol as described earlier. I'm going to do something sort of weird here: I'm going to tell points to plot at as many positions as I have data points, but I'm going to use ifelse to determine whether or not to plot a pipe symbol at each one. This way everything will be properly spaced. When I don't want to plot a line, I'll use a 'space' as my plotting symbol (ASCII code 32). I will repeat this procedure, looping through all columns in my dataframe:
for(i in 1:n){
points(seq(.5*m,m, length.out=m),
rep(ypos[i],m),
pch=ifelse(df[,i], 124, 32), # This determines whether to plot or not
cex=2,
col='gray')
}
So, piecing it all together and wrapping it in a function, we have:
df = data.frame(replicate(4, rbinom(50, 1, .7)))
colnames(df) = c('steps','atewell','code','listenedtoshell')
BinarySparklines = function(df,
                            L_adj=1,
                            mid_L_adj=0.37,
                            mid_R_adj=0.5,
                            R_adj=1,
                            bottom_adj=0.1,
                            top_adj=0.9,
                            spark_col='gray',
                            cex1=2,
                            cex2=2,
                            cex3=2
){
  # 'adj' parameters are scalar multipliers in [-1,1]. For most purposes, use [0,1].
  # The exception is L_adj which is any value in the domain of the plot.
  # L_adj < mid_L_adj < mid_R_adj < R_adj
  # and
  # bottom_adj < top_adj
  n=ncol(df)
  m=nrow(df)
  plot(1:m,
       seq(1,n, length.out=m),
       # The following arguments suppress plotting values and axis elements
       type='n',
       xaxt='n',
       yaxt='n',
       ann=F)
  # bottom_adj and top_adj set the lowest and highest row positions
  ypos = rev(seq(1+bottom_adj*n,n*top_adj, length.out=n))
  text(rep(L_adj,n),
       ypos,
       colnames(df), # These are our labels
       pos=4, # This positions the text to the right of the coordinate
       cex=cex1) # Increase the size of the text
  text(rep(mid_L_adj*m,n), # Shifted towards the middle of the plot
       ypos,
       colSums(df), # new label
       pos=4,
       cex=cex2)
  for(i in 1:n){
    points(seq(mid_R_adj*m, R_adj*m, length.out=m),
           rep(ypos[i],m),
           pch=ifelse(df[,i], 124, 32), # This determines whether to plot or not
           cex=cex3,
           col=spark_col)
  }
}
BinarySparklines(df)
Which gives us the following result:
Try playing with the alignment parameters and see what happens. For instance, to shrink the side margins, you could try decreasing the L_adj parameter and increasing the R_adj parameter like so:
BinarySparklines(df, L_adj=-1, R_adj=1.02)
It took a bit of trial and error to get the alignment right for the result I provided (which is what I used to inform the default values for BinarySparklines), but I hope I've given you some intuition about how I achieved it and how moving things using percentages of the plotting range made my life easier. In any event, I hope this serves as both a proof of concept and a template for your code. I'm sorry I don't have an easier solution for you, but I think this basically gets the job done.
I did my prototyping in Rstudio so I didn't have to specify the dimensions of my plot, but for posterity I had 832 x 456 with the aspect ratio maintained.
I am looking to improve upon output I implemented in R based on Jeromy's answer here (thanks!). Mine is a 31x31 matrix with positive and negative values, and uses basically the same ggplot2 code:
library(ggplot2)
library(reshape)
z<-cor(insheet3,use="complete.obs",method="kendall")
zm<-melt(z)
ggplot(zm, aes(X1,X2, fill=value)) + geom_tile() +
scale_fill_gradient2(low = "blue", high = "dark violet")
I need to change three things:
1. Right now, the rows appear in reverse alphabetical order, which means no visible data trends. How can I influence the order of the rows and columns, such that either:
   A. (Preferred:) The columns are ordered by correlation value (negative to positive or vice versa), as they are in the ellipse package output on that same page; or
   B. The columns are manually ordered, so that I can group similar variables?
2. Along the bottom X-axis, my variable names are overlapping dramatically and are unreadable. They need to remain long (i.e., OrthoPhos, Ammonia, Residential...), so how can I rotate their labels 90 degrees?
3. Is there a way to remove the "X1" and "X2" labels along each axis?
Thank you!
Following what I'll call an extensive/religious R journey into correlation matrix possibilities, I wanted to share what I'm finally going to use. Also, thanks to the previous answerers; I've found that there are many "right" answers to this.
Since my reviewers insisted I include numbers and not just colors, and that I stay away from more "confusing" and "busy" output like correlogram, I finally found "image" and based my final output on this example. Thanks #Marcinthebox.
Also to appease StackOverflow, here is a link to the image, rather than the image itself.
Because some of these specifications took a while to figure out and were critical to the final output, here's my code, shortened as much as I could.
#Subsetting to only the vectors I want to see in the correlation, as ordered
insheet<-subset(insheet1,
select=c("Cond", "CL", "SO4", "TN", "TP", "OrthoPhos", "DO", ...., "Rural"))
#Defining "high" and "low" colors
library(colorspace)
mycolors<-diverge_hcl(8, h = c(8, 240), c = 80, l = c(50,100), power = 1)
#Correlating them into a matrix
sheet<-cor(insheet,use="complete.obs")
#Making it!
par(mar=c(5.5, 5.5, 2, 1)) #Widens the margins for the axis labels; must be set before plotting
image(x=seq(dim(sheet)[2]), y=seq(dim(sheet)[2]), z=sheet, ann=FALSE,
col=mycolors, xlab="x column", ylab="y column", xaxt='n', yaxt='n')
text(expand.grid(x=seq(dim(sheet)[2]), y=seq(dim(sheet)[2])),
labels=round(c(sheet),2), cex=0.5)
axis(1, 1:dim(insheet)[2], colnames(insheet), las=2)
axis(2, 1:dim(insheet)[2], colnames(insheet), las=2)
I was also able to for-loop this to output multiple .wmf files, once errors were suppressed. Too bad I couldn't visualize significant p-values as well... another time. Thanks!
I assume that you mean "clustering" for point 1.?
For such tasks I prefer the heatmap.2() function from the gplots package, which offers various clustering options.
For point 2 and 3: The heatmap.2() function will also take care of the 90º rotation and the labels since it is using a data matrix as input instead of a data table.
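If you would rather stay in base R, the same idea can be sketched without gplots: cluster the variables on 1 - correlation, reuse the resulting order for rows and columns, and rotate the labels with las=2. This is not what heatmap.2() does internally, only a rough stand-in, and the built-in mtcars data is used in place of the real table.

```r
z <- cor(mtcars, method = "kendall")   # mtcars stands in for the real data

# Treat 1 - correlation as a distance, cluster, and reuse the leaf order
ord <- hclust(as.dist(1 - z))$order
z_ord <- z[ord, ord]

op <- par(mar = c(5.5, 5.5, 2, 1))     # margins must be set before plotting
k <- ncol(z_ord)
image(x = 1:k, y = 1:k, z = z_ord, zlim = c(-1, 1),
      xaxt = "n", yaxt = "n", ann = FALSE)   # ann = FALSE drops axis titles
axis(1, at = 1:k, labels = colnames(z_ord), las = 2)  # las = 2: perpendicular labels
axis(2, at = 1:k, labels = rownames(z_ord), las = 2)
par(op)
```

Which variables end up next to each other depends on the linkage method (hclust defaults to complete linkage), so it is worth trying a couple before settling on one. For a manual order, replace ord with a vector of column names in the order you want.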