I'm trying to plot a data.frame that contains positive and negative values, but not all of the points appear. Does anyone know if it is possible to adapt the code to plot all the points?
example = data.frame(X1=c(-1,-0.5,1),X2=c(-1,0.7,1),X3=c(1,-0.5,2))
ggtern(data=example, aes(x=X1,y=X2,z=X3)) + geom_point()
Well, actually your points are getting plotted, but they lie outside the plot region.
First, to understand why: each row of your data must sum to unity, and if it doesn't it will be coerced that way, so what you will actually be plotting is the following:
example = example / matrix(rep(rowSums(example),ncol(example)),nrow(example),ncol(example))
example
X1 X2 X3
1 1.000000 1.000000 -1.000000
2 1.666667 -2.333333 1.666667
3 0.250000 0.250000 0.500000
Now the rows sum to unity:
print(rowSums(example))
[1] 1 1 1
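Equivalently, the same row normalization can be written with sweep (a sketch, reusing the example data from the question):

```r
# Same normalization as above: divide each row by its row sum
example <- data.frame(X1 = c(-1, -0.5, 1),
                      X2 = c(-1,  0.7, 1),
                      X3 = c( 1, -0.5, 2))
normalized <- sweep(example, 1, rowSums(example), "/")
rowSums(normalized)   # each row now sums to 1
```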
You have negative values, which are nonsensical in terms of 'composition'; however, negative concentrations can still be plotted, since they numerically represent points outside the ternary area. Let's expand the limits and render to see where they lie:
ggtern(data=example, aes(x=X1,y=X2,z=X3)) +
geom_mask() +
geom_point() +
limit_tern(5,5,5)
I would like to connect observations from my df with a common point, i.e. the centerpoint (0,0) using ggplot2.
x y
1 5 4
2 -4 -2
3 -1 5
4 2 -8
Using geom_point(), I get the following.
Now, I would like to have lines connecting the four observations with the centerpoint at (0,0), like in the following (not made with R):
Is this possible at all using ggplot2?
I found a solution:
ggplot(df, aes(x, y)) + geom_point() + geom_segment(aes(xend = 0, yend = 0))
Answer based on @Roland's comments on the question.
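For completeness, a self-contained sketch of the same solution, recreating the data.frame from the question:

```r
library(ggplot2)

# Recreate the data from the question
df <- data.frame(x = c(5, -4, -1, 2),
                 y = c(4, -2, 5, -8))

# Each observation gets a segment ending at the centerpoint (0, 0)
ggplot(df, aes(x, y)) +
  geom_point() +
  geom_segment(aes(xend = 0, yend = 0))
```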
Here is a plot of several different time series that I made in R:
I made these using a simple loop:
for (i in 1:ngroups) {
  x[paste0("Group_", i)] = apply(x[, group == i], 1, mean)
}
plot(x$Group_1, type = "l", ylim = c(0, 300))
for (i in 2:ngroups) {
  lines(x[paste0("Group_", i)], col = i)
}
I also could have made this plot using matplot. Now, as you can see, each group is the mean of several other columns. What I would like to do is plot the series as in the plot above, but additionally show the range of the underlying data contributing to each mean. For example, the purple line would be bounded by a region shaded light purple. At any given time index, the purple region would extend from the lowest value in the purple group to the highest (or, say, from the 5th to the 95th percentile). Is there an elegant/clever way to do this?
Here is an answer using the graphics package (the base graphics that come with R). I also try to explain how the polygon (which is used to draw the CI) is constructed. This can be repurposed to solve your problem, for which I do not have the exact data.
# Values for noise and CI size
s.e. <- 0.25 # standard error of noise
interval <- s.e.*qnorm(0.975) # standard error * 97.5% quantile
# Values for Fake Data
x <- 1:10 # x values
y <- (x-1)*0.5 + rnorm(length(x), mean=0, sd=s.e.) # generate y values
# Main Plot
ylim <- c(min(y)-interval, max(y)+interval) # account for CI when determining ylim
plot(x, y, type="l", lwd=2, ylim=ylim) # plot x and y
# Determine the x values that will go into CI
CI.x.top <- x # x values going forward
CI.x.bot <- rev(x) # x values backwards
CI.x <- c(CI.x.top, CI.x.bot) # polygons are drawn clockwise
# Determine the Y values for CI
CI.y.top <- y+interval # top of CI
CI.y.bot <- rev(y)-interval # bottom of CI, but rev Y!
CI.y <- c(CI.y.top,CI.y.bot) # forward, then backward
# Add a polygon for the CI
CI.col <- adjustcolor("blue",alpha.f=0.25) # Pick a pretty CI color
polygon(CI.x, CI.y, col=CI.col, border=NA) # draw the polygon
# Point out path of polygon
arrows(CI.x.top[1], CI.y.top[1]+0.1, CI.x.top[3], CI.y.top[3]+0.1)
arrows(CI.x.top[5], CI.y.top[5]+0.1, CI.x.top[7], CI.y.top[7]+0.1)
arrows(CI.x.bot[1], CI.y.bot[1]-0.1, CI.x.bot[3], CI.y.bot[3]-0.1)
arrows(CI.x.bot[6], CI.y.bot[6]-0.1, CI.x.bot[8], CI.y.bot[8]-0.1)
# Add legend to explain what the arrows are
legend("topleft", legend="Arrows indicate path\nfor drawing polygon", xjust=0.5, bty="n")
And here is the final result:
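For the grouped series in the question, the same idea can also be expressed with ggplot2's geom_ribbon. This is only a sketch under assumed data: the long-format columns time, value and group are hypothetical stand-ins for the question's actual data.

```r
library(ggplot2)
library(dplyr)

# Hypothetical long-format data: several replicates per (time, group)
set.seed(1)
dat <- expand.grid(time = 1:50, group = letters[1:3], rep = 1:5)
dat$value <- with(dat, 100 + 20 * as.numeric(group) + rnorm(nrow(dat), sd = 10))

# Summarise each group at each time: mean plus 5th/95th percentiles
summ <- dat %>%
  group_by(group, time) %>%
  summarise(mid = mean(value),
            lo  = quantile(value, 0.05),
            hi  = quantile(value, 0.95),
            .groups = "drop")

# Shaded percentile band behind each group's mean line
ggplot(summ, aes(time, mid, colour = group, fill = group)) +
  geom_ribbon(aes(ymin = lo, ymax = hi), alpha = 0.25, colour = NA) +
  geom_line()
```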
I have made a df using some random data.
Here's the df
df
x y
1 1 3.1667912
2 1 3.5301539
3 1 3.8497014
4 1 4.4494311
5 1 3.8306889
6 1 4.7681518
7 1 2.8516945
8 1 1.8350802
9 1 5.8163498
10 1 4.8589443
11 2 0.3419090
12 2 2.7940851
13 2 1.9688636
14 2 1.3475315
15 2 0.9316124
16 2 1.3208475
17 2 3.0367743
18 2 3.2340156
19 2 1.8188969
20 2 2.5050162
You can plot the mean and its confidence interval using stat_summary with mean_cl_normal (which requires the Hmisc package) and geom = "smooth":
ggplot(df,aes(x=x,y=y))+geom_point() +
stat_summary(fun.data=mean_cl_normal, geom="smooth", colour="red")
As someone commented, mean_cl_boot may be better, so I used it:
ggplot(df,aes(x=x,y=y))+geom_point() +
stat_summary(fun.data=mean_cl_boot, geom="smooth", colour="red")
They are indeed a little different. You can also adjust the confidence level via the conf.int argument, depending on your needs.
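For example, a sketch widening the bootstrap interval to 99% by passing conf.int through stat_summary's fun.args (reusing the df from above):

```r
library(ggplot2)

# Same data as above; widen the bootstrap interval to 99%
ggplot(df, aes(x = x, y = y)) +
  geom_point() +
  stat_summary(fun.data = mean_cl_boot,
               fun.args = list(conf.int = 0.99),
               geom = "smooth", colour = "red")
```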
I was investigating the interpretation of a biplot and meaning of loadings/scores in PCA in this question: What are the principal components scores?
According to the author of the first answer the scores are:
x y
John -44.6 33.2
Mike -51.9 48.8
Kate -21.1 44.35
According to the second answer, regarding "The interpretation of the four axes in a biplot":
The left and bottom axes are showing [normalized] principal component
scores; the top and right axes are showing the loadings.
So, theoretically after plotting the biplot from "What are principal components scores" I should get on the left and bottom axes the scores:
x y
John -44.6 33.2
Mike -51.9 48.8
Kate -21.1 44.35
and on the right and top the loadings.
I entered the data he provided in R:
DF<-data.frame(Maths=c(80, 90, 95), Science=c(85, 85, 80), English=c(60, 70, 40), Music=c(55, 45, 50))
pca = prcomp(DF, scale = FALSE)
biplot(pca)
This is the plot I got:
Firstly, the left and bottom axes represent the loadings of the principal components. The top and right axes represent the scores, BUT they do not correspond to the scores the author of the post provided (observation 3, Kate, has positive scores on the plot but a negative score on PC1 according to Tony Breyal in the first answer to the linked question).
If I am doing or understanding something wrong, where is my mistake?
There are a few nuances you missed:
biplot.princomp function
For some reason biplot.princomp scales the loading and score axes differently. So the scores you see are transformed. To get the actual values you can invoke the biplot function like this:
biplot(pca, scale=0)
see help(biplot.princomp) for more.
Now the values are actual scores. You can confirm this by comparing the plot with pca$x.
Centering.
However, the result is still not the same as the answer you found on Cross Validated.
This is because Tony Breyal calculated the scores manually, using non-centered data. The prcomp function centers the data by default and then uses the centered data to compute the scores.
So you can center the data first:
> scale(DF, scale=FALSE)
Maths Science English Music
[1,] -8.333333 1.666667 3.333333 5
[2,] 1.666667 1.666667 13.333333 -5
[3,] 6.666667 -3.333333 -16.666667 0
And now use these numbers to get the scores as per the answer:
x y
John 0.28*(-8.3) + -0.17*1.6 + -0.94*3 + 0.07*5 0.77*(-8.3) + -0.08*1.6 + 0.19*3 + -0.60*5
Mike 0.28*1.6 + -0.17*1.6 + -0.94*13 + 0.07*(-5) 0.77*1.6 + -0.08*1.6 + 0.19*13 + -0.60*(-5)
Kate 0.28*6.6 + -0.17*(-3.3) + -0.94*(-16) + 0.07*0 0.77*6.6 + -0.08*(-3.3) + 0.19*(-16) + -0.60*0
After doing this you should get the same scores as those plotted by biplot(pca, scale=0).
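To check this numerically, here is a sketch that recomputes the scores from the centered data and the rotation matrix (recreating DF and pca from the question):

```r
# Recreate the example from the question
DF  <- data.frame(Maths = c(80, 90, 95), Science = c(85, 85, 80),
                  English = c(60, 70, 40), Music = c(55, 45, 50))
pca <- prcomp(DF, scale = FALSE)

# Center the data, then project onto the loadings
centered <- scale(DF, scale = FALSE)
manual.scores <- centered %*% pca$rotation

# Compare with the scores stored by prcomp;
# should agree up to numerical tolerance
all.equal(unname(manual.scores), unname(pca$x))
```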
Hope this helps.
Not sure whether this should go on Cross Validated or not, but we'll see. Basically I obtained data from an instrument just recently (masses of compounds from 0 to 630), which I binned into 0.025-wide bins before plotting the histogram seen below:
I want to identify the bins that have high frequency and stand out against the background noise (the background noise increases as you move from right to left on the x-axis). Imagine drawing a curved line on top of the points that have almost blurred together into a black lump, and then selecting the bins that sit above that curve for further investigation; that's what I'm trying to do. I plotted a kernel density estimate to see if I could overlay it on top of my histogram and use it to identify points above the curve. However, the density plot makes no headway with this, as the density values are too small (see the second plot). Does anyone have any recommendations on how I can go about solving this problem? The blue line is the overlaid density function and the red line represents the ideal solution (I need a way of somehow automating this in R).
The data below is only part of my dataset, so it's not really a good representation of my plot (which contains about 300,000 points), and as my bin sizes are quite small (0.025), there's a huge spread of data (in total there are 25,000 or so bins).
df <- read.table(header = TRUE, text = "
values
1 323.881306
2 1.003373
3 14.982121
4 27.995091
5 28.998639
6 95.983138
7 2.0117459
8 1.9095478
9 1.0072853
10 0.9038475
11 0.0055748
12 7.0964916
13 8.0725191
14 9.0765316
15 14.0102531
16 15.0137390
17 19.7887675
18 25.1072689
19 25.8338140
20 30.0151683
21 34.0635308
22 42.0393751
23 42.0504938
")
bin <- seq(0, 324, by = 0.025)
hist(df$values, breaks = bin, prob=TRUE, col = "grey")
lines(density(df$values), col = "blue")
Assuming you're dealing with a vector bin.densities that has the densities for each bin, a simple way to find outliers would be:
look at a window around each bin, say +- 50 bins
current.bin <- 1
window.size <- 50
# parentheses are needed: `:` binds more tightly than `-` and `+`
window <- bin.densities[(current.bin - window.size):(current.bin + window.size)]
find the 95% upper and lower quantile value (or really any value you think works)
lower.quant <- quantile(window, 0.05)
upper.quant <- quantile(window, 0.95)
then say that the current bin is an outlier if it falls outside your quantile range.
this.is.too.high <- (bin.densities[current.bin] > upper.quant)
this.is.too.low <- (bin.densities[current.bin] < lower.quant)
#final result
this.is.outlier <- this.is.too.high | this.is.too.low
I haven't actually tested this code, but it is the general approach I would take. You can play around with the window size and the quantile percentages until the results look reasonable. Again, not exactly complex math, but hopefully it helps.
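Wrapping the steps above into a runnable sketch that loops over all bins (the vector bin.densities is assumed to hold your per-bin densities; the window is clipped at the ends of the vector so the first and last bins don't index out of range):

```r
flag.outlier.bins <- function(bin.densities, window.size = 50,
                              lower.p = 0.05, upper.p = 0.95) {
  n <- length(bin.densities)
  is.outlier <- logical(n)
  for (i in seq_len(n)) {
    # clip the window at the ends of the vector
    window <- bin.densities[max(1, i - window.size):min(n, i + window.size)]
    lo <- quantile(window, lower.p)
    hi <- quantile(window, upper.p)
    is.outlier[i] <- bin.densities[i] < lo | bin.densities[i] > hi
  }
  is.outlier
}

# Toy example: a flat background with one spike
d <- c(rep(1, 100), 25, rep(1, 100))
which(flag.outlier.bins(d))   # only the spike at index 101 is flagged
```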
I am trying to use rgl to surface plot this matrix "r":
library(rgl)
Banana Apple BlueBerry Kiwi Raisin Strawberry
Chicago 1.0000000 0.9972928 0.9947779 1.0623767 0.9976347 0.9993892
Wilmette 1.0016507 1.0000000 0.9976524 0.9863927 0.9985248 1.0016828
Winnetka 1.0040722 1.0025362 1.0000000 0.9886008 1.0016501 0.9955785
Glenview 0.9961316 1.0105463 1.0167024 1.0000000 1.0129399 1.0123440
Deerfield 1.0023308 1.0026052 0.9979093 0.9870921 1.0000000 1.0025606
Wheeling 1.0073697 0.9985745 1.0045129 0.9870925 1.0008054 1.0000000
rgl.surface(1:6 , 1:6 , r, color="red", back="lines")
Since the z-values are so close together in magnitude, the surface plot looks almost flat, even though there are subtle bumps in the data.
1) How do I make it so that I have a zoomed in version where I can see as much detail as possible?
2) Is there a way to show in different colors the data (faces) that have the biggest "slope", and so that the labels of the columns and rows of the matrix are preserved on the 3D surface (maybe just using the first three letters of the labels)? In other words, so I can see that the Kiwi in Chicago and the Kiwi in Wilmette causes the greatest min/max variation?
Something like this?
library(rgl)
library(colorRamps) # for matlab.like(...)
palette <- matlab.like(10) # palette of 10 colors
rlim <- range(r[!is.na(r)])
colors <- palette[9*(r-rlim[1])/diff(rlim) + 1]
open3d(scale=c(1/6,1/6,1/diff(range(r))))
surface3d(1:6 , 1:6 , r, color=colors, back="lines")
Part of your problem is that you were using rgl.surface(...) incorrectly: its second argument is the matrix of z-values. With surface3d(...) the arguments are x, y, z, in that order.
EDIT: Response to OP's comment.
Using your ex post facto dataset...
open3d(scale=c(1/6,1/6,1/diff(range(r))))
bbox3d(color="white")
surface3d(1:6 , 1:6 , r, color=colors, back="lines")
axis3d('x--',labels=rownames(r),tick=TRUE)
axis3d('y-+',labels=colnames(r),tick=TRUE)
axis3d('z--',tick=TRUE)