Breaks in between bars, R histogram

data:
varx <- c(1.234, 1.32, 1.54, 2.1 , 2.76, 3.2, 4.56, 5.123, 6.1, 6.9)
hist(varx)
This gives me the default histogram, with bars that touch.
What I would like to do is create the same histogram, but with spaces between the bars.
I've tried the approach from "How to separate the two leftmost bins of a histogram in R", but with no luck.
When I do it on my actual data I get:
This is my actual data:
a <- c(2.6667
,4.45238
,5.80952
,3.09524
,3.52381
,4.04762
,4.53488
,3.80952
,5.7619
,3.42857
,4.57143
,6.04762
,4.02381
,5.47619
,4.09524
,6.18182
,4.85714
,4.52381
,5.61905
,4.90476
,4.42857
,5.31818
,2.47619
,5
,2.78571
,4.61905
,3.71429
,2.47619
,4.33333
,4.80952
,6.52381
,5.06349
,4.06977
,5.2381
,5.90476
,4.04762
,3.95238
,2.42857
,4.38333
,4.225
,3.96667
,3.875
,3.375
,4.18333
,5.45
,4.45
,3.76667
,4.975
,2.2
,5.53846
,6.1
,5.9
,4.25
,5.7
,3.475
,3.5
,4
,4.38333
,3.81667
,3.9661
,1.2332
,1.2443
,5.4323
,2.324
,1.342
,1.321
,3.81667
,3.9661
,1.2332
,1.2443
,5.4323
,2.324
,1.342
,1.321
,4.32
,6.43
,6.98
,4.321
,3.253
,2.123
,1.234)
Why do I get these skinny bars and how do I remove them?

The code works, but needs smaller offsets around the breaks:
varx <- c(1.234, 1.32, 1.54, 2.1, 2.76, 3.2, 4.56, 5.123, 6.1, 6.9)
hist(varx, breaks = rep(1:7, each = 2) + c(-.04, .04), freq = TRUE)
This returns a warning because the bins are no longer of equal width, and in that case hist() prefers to plot "density" instead of "frequency". Change to freq = FALSE if you prefer.

In general this is a bad idea: a histogram shows continuous data, and gaps between the bars obscure that. You can use the code from the linked question with smaller gaps (your values fall into the original gaps, which is where the skinny bars come from):
hist(varx, breaks = rep(1:7, each = 2) + c(-.05, .05))
But this is not a general solution: any value closer than 0.05 to a cutoff will end up in a gap region.
Alternatively, we can make a bar plot of factored data using ggplot2, depending on how you want to round the values. Here I have used both the floor (rounding down to the nearest integer) and rounding to the nearest integer:
library(ggplot2)
varx <- as.data.frame(varx)
varx$floor <- floor(varx$varx)
varx$round <- round(varx$varx)
ggplot(varx, aes(x = as.factor(floor))) + geom_bar()
ggplot(varx, aes(x = as.factor(round))) + geom_bar()
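The same idea can be sketched in base R without ggplot2: count the values per bin first, then draw the counts as a bar plot, which has gaps between bars by default. This is only a sketch; the breaks 1:7, the bin labels, and the space value are assumptions chosen for the example data. Because every value is counted before anything is drawn, no observation can fall into a gap.

```r
varx <- c(1.234, 1.32, 1.54, 2.1, 2.76, 3.2, 4.56, 5.123, 6.1, 6.9)

# Bin the data without drawing anything; breaks = 1:7 is an assumed choice
h <- hist(varx, breaks = 1:7, plot = FALSE)

# Label each bar with the interval it covers, e.g. "1-2"
bin_labels <- paste(head(h$breaks, -1), tail(h$breaks, -1), sep = "-")

# barplot() separates the bars; 'space' sets the gap as a fraction of bar width
barplot(h$counts, names.arg = bin_labels, space = 0.3,
        xlab = "varx", ylab = "Frequency")
```

The x axis is then categorical rather than continuous, so this works best when the bins are of equal width.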

In case anyone is looking for a more vanilla solution, you can just set the border argument for hist to be the same color as the background of the plot:
par(mfrow=1:2)
# connected bars
hist(y <- rnorm(100))
# seemingly disconnected bars
hist(y, border=par('bg'))
Adding artificial separation between bars

Related

Bins vs. Breaks

I'm new to coding (particularly R) and wanted to know what the difference is between
Breaks =
vs.
Bins()
and in which scenarios you would use one over the other.
Thanks in advance for the clarification!
If this is in relation to something like histograms in ggplot2, the bins argument automatically divides your data into a set number of columns, whereas the breaks argument specifies exactly where the boundaries between those columns go. As an example, we can look at these two plots:
library(ggplot2)
library(dplyr) # provides the %>% pipe
#### Automatically Separates into Bins ####
iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)
#### Manually Inserts Breaks at Designated Spots ####
iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(breaks = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
The first plot was automatically divided into 10 bins (columns).
Since the data takes decimal values and is bounded between 4.3 and 7.9, the second plot, with manual breaks at the integers 1 to 10 (explicitly I'm saying "I want Sepal Length 1 to 10"), doesn't end up looking the same: most of its bins are empty.
If I want to set the breaks at much more precise locations, I can do this instead with the breaks argument:
iris %>%
ggplot(aes(x=Sepal.Length))+
geom_histogram(breaks=c(4.0,
4.3,
5.0,
5.3,
6.0,
6.3,
7.0,
7.3,
8.0))
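To see the difference in numbers rather than pictures, here is a rough base R sketch that counts the observations per bin by hand. It uses cut() and table() as stand-ins for what a histogram does internally; this is an illustration, not ggplot2's exact binning algorithm, and the variable names are made up for the example.

```r
x <- iris$Sepal.Length

# "breaks": you choose the bin edges yourself (here the integers 1 to 10)
edges <- 1:10
manual_counts <- table(cut(x, breaks = edges, include.lowest = TRUE))

# "bins": you choose only how many bins you want; cut() with a single
# number splits the range of the data into that many equal-width intervals
n_bins <- 10
auto_counts <- table(cut(x, breaks = n_bins))

manual_counts  # bins below 4 and above 8 are empty
auto_counts    # all 10 bins lie inside the range of the data
```

With manual edges the bins can sit anywhere, including where there is no data; with a bin count the edges always follow the data.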

Plotting ECDF results in horizontal lines beyond the expected range. How can I prevent this?

I want to plot the ECDF of a vector. The vector has range [0,1]. However, plotting ecdf(data) results in horizontal lines that extend beyond this range. I want to create a plot that does not have these lines beyond the range [0,1].
Calling plot.stepfun shows that the function chooses a vector of abscissa values that includes the values -0.16 and 1.16, but I don't know why. I have tried manually selecting the abscissa values using the argument xval, but this made no difference.
I have tried using ggplot2, but again this made no difference.
I have also tried removing the first and last values of the vector, which are 0 and 1, but again this made no difference.
I could of course just use MS Paint, but that seems like a poor solution to the problem.
data <- c(0, 0.0267937939860966, 0.0831161599875003, 0.089312646620322,
0.09, 0.162046969424378, 0.214535013990776, 0.216, 0.254227922418882,
0.29770882206774, 0.3, 0.346218858110426, 0.3483, 0.351120057363453,
0.446176768935429, 0.469316812739393, 0.47178, 0.506720537855168,
0.51, 0.53499413030498, 0.577201705567453, 0.579825, 0.61501969832776,
0.653481161056275, 0.657, 0.667975762603373, 0.6705828, 0.685122481157394,
0.742234640167266, 0.74470167, 0.745169566125031, 0.756545373540315,
0.7599, 0.795669365154443, 0.801746023714245, 0.803996766, 0.828933122166261,
0.83193, 0.837497330035643, 0.848695641093207, 0.8506916541,
0.87169919974533, 0.879781895687186, 0.882351, 0.885279431049518,
0.8870099004, 0.899358675688768, 0.913502229556406, 0.914974950051,
0.915505354483016, 0.9176457, 0.921514704291551, 0.935095914758442,
0.9363300788754, 0.939114814765667, 0.940605918657197, 0.94235199,
0.951503562401266, 0.95252438490057, 0.952993345228527, 0.958244748310785,
0.959646393, 0.963897452890123, 0.964732400211852, 0.970641607614244,
0.9717524751, 0.973212104364713, 0.973888411695313, 0.979355426072477,
0.980181739205269, 0.98022673257, 0.980724900269631, 0.985376582975203,
0.985481180229861, 0.98580953864678, 0.986158712799, 0.989235347816543,
0.989578152973373, 0.989788073567854, 0.9903110989593, 0.9923627402258,
0.992816530697457, 0.99321776927151, 0.994414541359167, 0.994946291138756,
0.995252438490057, 0.995922615192192, 0.9964442204999, 0.99667670694304,
0.997028536003077, 0.997497885105047, 0.997673694860128, 0.997837868960133,
0.998239132446338, 0.998371586402089, 0.998429033902679, 0.998860091673285,
0.998860110481463, 0.999173901730287, 0.999202077337024, 0.999402017502492,
0.999441454135917, 0.999567612655648, 0.999609017895142, 0.999687669141686,
0.999726312526599, 0.999774606597093, 0.999808418768619, 0.999837491356504,
0.999865893138033, 0.99988293066653, 0.999906125196623, 0.999915732168455,
0.999934287637636, 0.999939389009237, 0.999954001346345, 0.999956435850389,
0.999967800942442, 0.999968709576019, 0.999977460659709, 0.999984222461796,
0.999988955723257, 0.99999226900628, 0.999994588304396, 0.999996211813077,
0.999997348269154, 1)
plot(ecdf(data), do.points=FALSE)
I would like to be able to plot the ECDF with the x axis matching the range of the vector, that is, [0,1].
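One possible workaround, sketched here under assumptions, is to skip plot.stepfun entirely: evaluate the ECDF at the sorted data values and draw the result as a step line, so nothing is drawn left of the minimum or right of the maximum. The helper name plot_ecdf_clipped is hypothetical (it is not part of any package), and the short vector below is just a handful of the values from the data above.

```r
# Hypothetical helper: draws the ECDF of v as a step line that starts at
# min(v) and stops at max(v)
plot_ecdf_clipped <- function(v, ...) {
  Fn <- ecdf(v)
  xs <- sort(unique(v))
  # Prepend (min, 0) so the first jump is drawn as a vertical segment
  steps <- data.frame(x = c(xs[1], xs), y = c(0, Fn(xs)))
  plot(steps$x, steps$y, type = "s", xlim = range(v), ylim = c(0, 1),
       xlab = "x", ylab = "Fn(x)", ...)
  invisible(steps)
}

# A few of the values from the vector above, for illustration
subset_values <- c(0, 0.09, 0.216, 0.3, 0.51, 0.657, 0.7599, 0.882351, 1)
steps <- plot_ecdf_clipped(subset_values)
```

The axis itself may still be padded slightly beyond [0,1] by the default axis style, but no line is drawn in that padding.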

How to display both raw and percent in Venn diagram

I am plotting a Venn diagram using the function draw.triple.venn() from library(VennDiagram). This is my code in R:
g = draw.triple.venn(area1 = 4.1, area2 = 5.6, area3 = 15.9,
                     n12 = 1.3, n23 = 4.2, n13 = 2.3, n123 = 1.2,
                     category = c("Landuse", "Environment", "Space"),
                     fill = c("darkgray", "gray", "lightgrey"),
                     print.mode = c("percent", "percent", "percent"),
                     sigdig = 2, ind = TRUE)
grid.arrange(gTree(children=g))
This is the current figure:
Now, I would like to display both 'percentage' and 'raw' for each fraction. In the package description it states: 'print.mode' can be either 'raw' or 'percent'. This is the format that the numbers will be printed in. Can pass in a vector with the second element being printed under the first.
This seems to suggest that both 'raw' and 'percent' can be displayed together. Any suggestions on how to do this?
Also, how can I make sure the number of digits is consistent, i.e. have 56.0% (rather than 56%) and 0.5% (rather than 0.53%)? I have set sigdig=2, which I thought would force consistency there.
Moreover, is there a way to control the fill colour of each fraction (as compared to only a vector of 3 colours)?
Finally, is there any way to add text manually? I would like to note the proportion of residual variation in the bottom left corner.
This is a link to the package https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf
Any help with this is much appreciated.
Using print.mode = c("raw", "percent") works to include both raw and percent values.
The function grid::grid.text("some label", x=0.1, y=0.1) can be used to add text manually.

retrieve x and y value based on graph in r

I'm new to R and would like to ask for some help. I have x (values) and prob (their probabilities) as follows:
x <- c(0.00, 1.08, 2.08, 3.08, 4.08, 4.64, 4.68)
prob <- c(0.000, 0.600, 0.370, 0.010, 0.006, 0.006, 0.006)
My aim is to construct an estimated distribution graph based on those values. So far, I have used qplot(x,prob,geom=c("point", "smooth"),span=0.55) to make it, and the result is shown here:
https://i.stack.imgur.com/aVgNk.png
My questions are:
Are there any other ways to construct a nice distribution like that without using qplot?
I need to retrieve all the x values (i.e., 0.5, 1, 1.2, etc.) and their corresponding prob values. How can I do that?
I've been searching for a while, but with no luck.
Thank you all
If you're looking to predict the values of prob for given values of x, this is one way to do it. Note I'm using a loess prediction function here (because I believe it's the default for ggplot's smooth geom, which you've used), which may or may not be appropriate for you.
x <- c(0.00, 1.08, 2.08, 3.08, 4.08, 4.64, 4.68)
prob <- c(0.000, 0.600, 0.370, 0.010, 0.006, 0.006, 0.006)
First make a data frame with one column. I'll put a whole lot of data points into that column, just to make a bunch of predictions.
df <- data.frame( datapoints = seq.int( 0, max(x), 0.1 ) )
Then create a prediction column. I'm using the predict function and passing it a loess fit. The loess function is given your input data, and predict is asked to use that fit to predict values for df$datapoints.
df$predicted <- predict( loess( prob ~ x, span = 0.55 ), df$datapoints )
Here's what the output looks like.
> head( df )
datapoints predicted
1 0.0 0.01971800
2 0.1 0.09229939
3 0.2 0.15914675
4 0.3 0.22037484
5 0.4 0.27609841
6 0.5 0.32643223
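If only a few specific x values are of interest, such as the 0.5, 1 and 1.2 mentioned in the question, the same fit can be queried directly. This is a small sketch along the same lines; the names fit, wanted and res are made up for the example, and with only seven input points loess may emit warnings.

```r
x <- c(0.00, 1.08, 2.08, 3.08, 4.08, 4.64, 4.68)
prob <- c(0.000, 0.600, 0.370, 0.010, 0.006, 0.006, 0.006)

# Same smoother and span as above
fit <- loess(prob ~ x, span = 0.55)

# x values of interest (taken from the question)
wanted <- c(0.5, 1, 1.2)

# One row per requested x, with the smoothed prob next to it
res <- data.frame(x = wanted,
                  predicted = predict(fit, newdata = data.frame(x = wanted)))
res
```

Values of x outside the range of the original data return NA, since loess does not extrapolate by default.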
On the plotting side of things, ggplot2 is a good way to go, so I don't see a reason to shy away from qplot here. If you want more flexibility in what you get from ggplot2, you can code the functions more explicitly (as @Jan Sila has mentioned in another answer). Here's a way with ggplot2's more common (and more flexible) syntax:
plot <- ggplot( data = df,
mapping = aes( x = datapoints,
y = predicted ) ) +
geom_point() +
geom_smooth( span = 0.55 )
plot
You can get the observations once you specify the probability distribution. Have a look here. This will help you and walk you through the MASS package.
Nicer graphs? I think ggplot is the best (I'm also pretty sure that graph is from ggplot2). If you want exactly that, then you want a blue geom_line and, on top of that, a geom_point with the same mapping :) Try having a look at tutorials, or we can help you out with that.

In R package "segmented", How could I set the slope of one of lines in the model to 0?

I am using the R package segmented to calculate parameters for a model, in which the response variable is linearly correlated with the explanatory variable until a breakpoint, then the response variable becomes independent from the explanatory variable. In other words, a segmented linear model with the second part having a slope = 0.
What I already did is:
linear1 <- lm(Y ~ X)
linear2 <- segmented (linear1, seg.Z = ~ X, psi = 2)
This gives a model that has a very good first line, but the second line is not horizontal (although its slope is not significant). I want to make the second line horizontal. (psi = 2 is the place where I observed a breakpoint.)
Also, when I use "abline" to show the broken line on the plot, it only shows the first part of the model and gives a warning: "only using the first two of 4 regression coefficients". How can I display both parts of the model?
To input my data into R:
X <- c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0)
Y <- c(1.31, 1.60, 1.86, 2.16, 2.44, 2.71, 3.00, 3.24, 3.57, 3.81, 3.80, 3.83, 3.78, 3.94, 3.75, 3.89)
This is as easy as using the plot method for segmented class objects provided by the package segmented and linked in the help for segmented
Assuming your data is in the data.frame d
linear2 <- segmented (linear1, seg.Z = ~ X, psi = 2, data = d)
plot(linear2)
points(Y~X, data = d)
An easy way to fudge a horizontal line would be to replace the third coefficient (the change in slope after the breakpoint) with the value required for the second line to be horizontal, i.e. minus the first slope:
fudgedmodel <- linear2
fudgedmodel$coefficients[3] <- - fudgedmodel$coefficients[2]
plot(fudgedmodel)
points(Y~X, data = d)
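If the breakpoint can be treated as known, a plateau model can also be sketched in base R without segmented. This is a different technique from the one above, and it rests on a loud assumption: the breakpoint is fixed at psi = 2 rather than estimated from the data. The names psi and plateau are made up for the example.

```r
X <- c(0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0)
Y <- c(1.31, 1.60, 1.86, 2.16, 2.44, 2.71, 3.00, 3.24, 3.57, 3.81, 3.80, 3.83, 3.78, 3.94, 3.75, 3.89)

psi <- 2  # assumed, fixed breakpoint (not estimated)

# pmin(X, psi) grows with X up to psi and is constant afterwards, so the
# fitted line rises until the breakpoint and has slope 0 to the right of it
plateau <- lm(Y ~ pmin(X, psi))

plot(X, Y)
lines(X, fitted(plateau), col = "red")  # X is already sorted
```

The fit is continuous at the breakpoint by construction; it does not give a standard error for the breakpoint, which is what segmented adds.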
I was searching for the same thing and found a neat answer in this post from the R help mailing list:
https://stat.ethz.ch/pipermail/r-help/2007-July/137625.html
Here's an edited version of that answer that cuts straight to the solution:
library(segmented)
# simulate data - linear slope down until some point, at which slope=0
n<-50
x<-1:n/n
y<- 0-pmin(x-.5,0)+rnorm(50)*.03
plot(x,y) #This should be your scatterplot..
abline(0,0,lty=2)
# a parsimonious modelling: constrain right slope=0
# NB. This is probably what you want...
o<-lm(y~1)
xx<- -x
o2<-segmented(o,seg.Z=~xx,psi=list(xx=-.3))
slope(o2)
points(x,fitted(o2),col=2)
# now constrain \hat{\mu}(x)=0 for x>psi (you can do this if you know what the value of y is when x becomes independent)
o<-lm(y~0)
xx<- -x
o3<-segmented(o,seg.Z=~xx,psi=list(xx=-.3))
slope(o3)
points(x,fitted(o3),col=3)
You should get something like this. Red points are the first method, which sounds like the one for you. Green points are the second method, which only applies if you already know the value that y takes once it becomes independent of x:
