Bins vs. Breaks - r

I'm new to coding (particularly R) and wanted to know what the differences between
Breaks =
vs.
Bins()
are and in what scenarios you would use one over the other.
Thanks in advance for the clarification!

If this is about something like histograms in ggplot2: the bins argument automatically splits the range of your data into a set number of equal-width bins (columns), whereas the breaks argument specifies exactly where the bin boundaries go. As an example, we can look at these two plots:
library(ggplot2)
library(dplyr)  # provides the %>% pipe

#### Automatically Separates into Bins ####
iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(bins = 10)

#### Manually Inserts Breaks at Designated Spots ####
iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(breaks = c(1, 2, 3, 4, 5,
                            6, 7, 8, 9, 10))
The first plot is automatically split into 10 bins (columns).
Since the data is decimal-valued and bounded between 4.3 and 7.9, the second plot, with 10 manual breaks at the integers 1 to 10 (explicitly, I'm saying "I want bin edges at Sepal Length 1 to 10"), doesn't end up looking the same: ten breaks give nine bins, and only the bins between 4 and 8 contain any data.
If I want the bin edges at much more precise locations, I can pass them to the breaks argument instead:
iris %>%
  ggplot(aes(x = Sepal.Length)) +
  geom_histogram(breaks = c(4.0, 4.3, 5.0, 5.3, 6.0,
                            6.3, 7.0, 7.3, 8.0))
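To compare the two arguments without reading the bars off a picture, here is a minimal sketch. It assumes ggplot2 is installed and uses its layer_data() helper to pull out the computed bars; the object names (p_bins, d_bins, and so on) are made up for this example.

```r
library(ggplot2)

# Same histogram two ways: a bin count vs. explicit bin edges
p_bins   <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 10)
p_breaks <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(breaks = 1:10)

# layer_data() returns one row per bar, including its edges and its count
d_bins   <- layer_data(p_bins)
d_breaks <- layer_data(p_breaks)

d_bins[, c("xmin", "xmax", "count")]    # edges chosen by ggplot2 from the data range
d_breaks[, c("xmin", "xmax", "count")]  # edges exactly at 1, 2, ..., 10
```

With breaks = 1:10 there are ten edges and therefore nine bars, and only the ones between 4 and 8 have a non-zero count.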

Related

Finding the x-value at a certain y-value on a ggplot

I am currently having some difficulties trying to find the 50% effective concentration (EC50) for one of my datasets. In short, the data shows how levels of glutathione (GSH) in cells are depleted from 100% when the cells are exposed to a substance known as HEMA.
GSH50 <- read.table("Master list for all GSH data T9 TVN.csv", header = TRUE, sep = ";", dec = ",")
After some further subsetting, I end up with a plot like this:
[Figure: GSH plot]
I have several more plots in addition to this, so I need to find the EC50 value for each one so I can then compare them with each other (the problem is consistent across several plots, so if it can be fixed here it should be fixed on the others as well).
From an earlier dataset with almost the same setup (the only difference being the x-axis values) I managed to get a fairly accurate EC50 using a setup like this:
HG <- approxfun(x, y)
optimize(function(t0) abs(HG(t0) - 50), interval = range(x))
I then took my EC50 value from the output of the optimize function. However, this does not work on the current data for some reason: when I input the value from optimize, I end up getting this plot instead:
[Figure: GSH plot]
If somebody has any idea how I can fix this issue, it would be most appreciated.
Edit
If you want a reproducible dataset I gathered the averages of the data, and as such the plot should still be similar to the GSH plots I have shown:
Concentration <- seq(from = 0, to = 9, by=1)
GSH <- c(100, 67.405, 47.78, 39.2325, 33.97, 28.435, 26.97, 24.5125, 23.5275, 21.565)
df <- data.frame(Concentration, GSH)
ggplot(df, aes(Concentration, GSH)) + geom_smooth()
I am quite certain that the dose is high enough to reach the lower level, but I have not stored the model anywhere. I hope the example data provided is enough.
Edit2
I should mention that the approxfun and optimize code does work for the example when we use geom_line(), but for some reason it is not as accurate with geom_smooth().
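One possible explanation, offered as a sketch rather than a definitive answer: for small datasets geom_smooth() draws a loess fit by default, not straight lines between the points, so an EC50 taken from approxfun() (which matches geom_line()) need not lie on the smoothed curve. Assuming the plot uses the default loess settings, the crossing can be searched for on the fitted curve itself. Here uniroot() is swapped in for optimize(), and the names fit, ec50_linear and ec50_smooth are made up for this example.

```r
Concentration <- seq(from = 0, to = 9, by = 1)
GSH <- c(100, 67.405, 47.78, 39.2325, 33.97, 28.435, 26.97, 24.5125, 23.5275, 21.565)
df <- data.frame(Concentration, GSH)

# Linear interpolation between the points: this is what geom_line() draws
HG <- approxfun(df$Concentration, df$GSH)
ec50_linear <- uniroot(function(t0) HG(t0) - 50,
                       interval = range(df$Concentration), tol = 1e-6)$root

# Loess fit with default settings: assumed to match what geom_smooth() draws here
fit <- loess(GSH ~ Concentration, data = df)
ec50_smooth <- uniroot(
  function(t0) predict(fit, newdata = data.frame(Concentration = t0)) - 50,
  interval = range(df$Concentration), tol = 1e-6
)$root

ec50_linear  # where the straight-line segments cross 50
ec50_smooth  # where the smoothed curve crosses 50
```

If the two values differ noticeably, the mismatch is between the two curves, not in the root-finding step.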

Adding Label *row number* into the Plot

How can I modify this code so that the plot shows, for each point on the graph,
its corresponding row number as a label?
inter <- seq(7.5, 21.5, 1)
LogDifference <- c("na",1.5,0.8,0.6,0.01,-0.57,-0.11,0.41,0.068,-0.19,-0.31,0.05,0.14,0.6,0.5)
S<-data.frame(inter,LogDifference)
plot(x = S$inter,S$LogDifference)
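First of all, notice that your basic plot is not doing what you want.
Because LogDifference contains the string "na", the whole vector is stored as text. In versions of R before 4.0, data.frame() turns it into a factor, so the y values being plotted are the factor codes (the numbers 1 to 14). I think that you wanted
the numerical values that are in LogDifference. You can fix this by
first converting LogDifference to character, then converting
to numeric (in R 4.0 and later the column is already character, and the same conversion still works). I am just leaving out the "na".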
After that, you can use text to place labels next to the points.
The full code is:
inter <- seq(7.5, 21.5, 1)
LogDifference <- c("na",1.5,0.8,0.6,0.01,-0.57,-0.11,0.41,0.068,
-0.19,-0.31,0.05,0.14,0.6,0.5)
S<-data.frame(inter,LogDifference)
plot(x = S$inter[-1], as.numeric(as.character(S$LogDifference[-1])))
text(x=inter[-1]+0.4, y=as.numeric(as.character(LogDifference[-1]))+0.05, labels=2:15)
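An alternative sketch, assuming it is acceptable to store the missing value as a real NA instead of the string "na": the column then stays numeric, no conversion is needed, and the labels can be generated from the row numbers instead of being typed in. The variable ok is just a helper name for this example.

```r
inter <- seq(7.5, 21.5, 1)
LogDifference <- c(NA, 1.5, 0.8, 0.6, 0.01, -0.57, -0.11, 0.41, 0.068,
                   -0.19, -0.31, 0.05, 0.14, 0.6, 0.5)
S <- data.frame(inter, LogDifference)

# Rows that actually have a value to plot
ok <- !is.na(S$LogDifference)

plot(S$inter[ok], S$LogDifference[ok], xlab = "inter", ylab = "LogDifference")
# which(ok) gives the original row numbers; pos = 4 puts each label to the right of its point
text(S$inter[ok], S$LogDifference[ok], labels = which(ok), pos = 4)
```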

R: Create a more readable X-axis after binning data in ggplot2. Turn bins into whole numbers

I have a dummy variable, call it "drink", and a corresponding age variable that holds a precise age estimate (several decimal places) for each person in a dataset. I want to first "bin" the age variable, compute the mean of the "drink" dummy within each bin, and then graph the result. My code to do so looks like this:
df$bins <- cut(df$age, seq(from = 17, to = 31, by = .2), include.lowest = TRUE)
df.plot <- ddply(df, .(bins), summarise, avg.drink = mean(drinks_alcohol))
qplot(bins, avg.drink, data = df.plot)
This works well enough, but the x-axis in the graph is unreadable because it shows a label for every one of the narrow bins. Is there a way to modify the x-axis to show, for example, ages 19-23 only, with the "ticks" still aligning with the correct bins? For example, in my current code there is a bin for (19, 19.2] and another bin for (20, 20.2]. I would want only the bins that start at whole numbers to be labeled on the x-axis, using the first number (19, 20), not the second (19.2, 20.2).
Is there any straightforward way to do this?
The most direct way to specify axis labels is with the appropriate scale function... in the case of factors on the x axis, scale_x_discrete. It will use whatever labels you give it with the labels argument, or you can give it a function that formats things as you like.
To "manually" specify the labels, you just need to create a vector of appropriate length. In this case, if your factor values are intervals beginning with seq(17, 31.8, by = 0.2) and you want to label only the bins beginning with integers, then your labels vector will be
bin_starts = seq(17, 31.8, by = 0.2)
bin_labels = ifelse(abs(bin_starts - round(bin_starts)) < 0.0001, as.character(bin_starts), "")
(I compare against a small tolerance instead of testing for exact equality in case of floating-point precision problems. Using round() here, not trunc(), also catches a value that lands just below an integer.)
A more robust solution would be to label the factor levels with the number at the start of the interval from the beginning. cut also has a labels argument.
my_breaks = seq(17, 32, by = 0.2)
df$bins <- cut(df$age, breaks = my_breaks, labels = head(my_breaks, -1),
include.lowest = TRUE)
You could then fairly easily write a formatter (following templates from the scales package) to print only the ones you want:
int_only = function(x) {
  # test if we can coerce to numeric, if not do nothing
  num = suppressWarnings(as.numeric(x))
  if (any(is.na(num))) return(x)
  # otherwise return integers as labels and blanks for everything else
  return(ifelse(abs(num - round(num)) < 1e-10, as.character(num), ""))
}
Then, using the nicely formatted data created above, you should be able to pass int_only as a formatter function to labels to get the labels you want. (Note: untested! necessary tweaks left as an exercise for the reader, though I'll gladly accept edits :) )
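A sketch of how the pieces might fit together. The data frame below is simulated purely for illustration (the real df is not shown in the question), aggregate() is swapped in for ddply() to avoid the plyr dependency, and the breaks run to 31 as in the question. Passing a function to the labels argument of scale_x_discrete() is standard ggplot2 behaviour.

```r
library(ggplot2)

set.seed(1)
# Hypothetical stand-in for the real data
df <- data.frame(age = runif(500, 17, 31),
                 drinks_alcohol = rbinom(500, 1, 0.5))

# Label each bin with the number at the start of its interval
my_breaks <- seq(17, 31, by = 0.2)
df$bins <- cut(df$age, breaks = my_breaks, labels = head(my_breaks, -1),
               include.lowest = TRUE)

# Mean of the dummy within each bin
df.plot <- aggregate(drinks_alcohol ~ bins, data = df, FUN = mean)

# Keep labels that are whole numbers, blank out the rest
int_only <- function(x) {
  num <- suppressWarnings(as.numeric(x))
  if (anyNA(num)) return(x)
  ifelse(abs(num - round(num)) < 1e-8, as.character(num), "")
}

p <- ggplot(df.plot, aes(bins, drinks_alcohol)) +
  geom_point() +
  scale_x_discrete(labels = int_only)
print(p)
```

Every bin keeps its tick, so the points stay aligned, but only the ticks at whole ages carry a label.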

Breaks in between bars, R histogram

data:
varx <- c(1.234, 1.32, 1.54, 2.1 , 2.76, 3.2, 4.56, 5.123, 6.1, 6.9)
hist(varx)
Gives me:
[Figure: histogram of varx]
What I would like to do is create the same histogram but with spaces in between the bars.
I've tried what is suggested in "How to separate the two leftmost bins of a histogram in R", but no luck.
When I do it on my actual data I get:
[Figure: histogram of the actual data, with skinny bars]
This is my actual data:
a <- c(2.6667, 4.45238, 5.80952, 3.09524, 3.52381, 4.04762, 4.53488, 3.80952,
       5.7619, 3.42857, 4.57143, 6.04762, 4.02381, 5.47619, 4.09524, 6.18182,
       4.85714, 4.52381, 5.61905, 4.90476, 4.42857, 5.31818, 2.47619, 5,
       2.78571, 4.61905, 3.71429, 2.47619, 4.33333, 4.80952, 6.52381, 5.06349,
       4.06977, 5.2381, 5.90476, 4.04762, 3.95238, 2.42857, 4.38333, 4.225,
       3.96667, 3.875, 3.375, 4.18333, 5.45, 4.45, 3.76667, 4.975,
       2.2, 5.53846, 6.1, 5.9, 4.25, 5.7, 3.475, 3.5,
       4, 4.38333, 3.81667, 3.9661, 1.2332, 1.2443, 5.4323, 2.324,
       1.342, 1.321, 3.81667, 3.9661, 1.2332, 1.2443, 5.4323, 2.324,
       1.342, 1.321, 4.32, 6.43, 6.98, 4.321, 3.253, 2.123,
       1.234)
Why do I get these skinny bars and how do I remove them?
The code works, but needs smaller offsets around each break:
varx <- c(1.234, 1.32, 1.54, 2.1 , 2.76, 3.2, 4.56, 5.123, 6.1, 6.9)
hist(varx, breaks=rep(1:7,each=2)+c(-.04,.04), freq=TRUE)
This returns a warning, because with unequal bin widths hist prefers to plot "density" instead of "frequency". Change to freq=FALSE if you prefer.
In general this is a bad idea - histograms show the continuity of data, and gaps ruin that. You can use the previous code with smaller gaps (your values hit the previous gaps):
hist(varx,breaks=rep(1:7,each=2)+c(-.05,.05))
But this is not a general solution - any values closer than 0.05 to the cutoff will end up in the gap region.
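To see why the skinny bars appear on the real data, here is a small sketch that counts how many values land in the narrow "gap" bins. The vector vals holds a handful of values copied from the question's a; the names brks and gap_bins are made up for this example.

```r
# A few values taken from the question's data `a`
vals <- c(4, 5, 3.9661, 4.975, 4.45, 2.2)

# Same break construction as above: a narrow bin around each integer
brks <- rep(1:7, each = 2) + c(-0.05, 0.05)

# Compute the bins without drawing anything
h <- hist(vals, breaks = brks, plot = FALSE)

# Odd-numbered bins are the narrow gap bins around 1, 2, ..., 7
gap_bins <- seq(1, length(h$counts), by = 2)
h$counts[gap_bins]
sum(h$counts[gap_bins])   # how many values landed in a gap
```

Four of these six values sit within 0.05 of an integer, so they are drawn as tall, narrow bars inside what were meant to be empty gaps.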
We can make a bar plot of factored data using ggplot2, depending on how you want to round values. In this case, I show two options: taking the floor (rounding down to the nearest integer) and rounding to the nearest integer:
library(ggplot2)
varx <- as.data.frame(varx)
varx$floor <- floor(varx$varx)
varx$round <- round(varx$varx)
ggplot(varx, aes(x = as.factor(floor))) + geom_bar()
ggplot(varx, aes(x = as.factor(round))) + geom_bar()
In case anyone is looking for a more vanilla solution, you can just set the border argument for hist to be the same color as the background of the plot:
par(mfrow=1:2)
# connected bars
hist(y <- rnorm(100))
# seemingly disconnected bars
hist(y, border=par('bg'))
[Figure: Adding artificial separation between bars]
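Another base R option, sketched here under the assumption that a bar chart of the binned counts is an acceptable substitute for a true histogram: let hist() do the binning without drawing, then hand the counts to barplot(), which can leave a gap between bars. The label format is an arbitrary choice for this example.

```r
varx <- c(1.234, 1.32, 1.54, 2.1, 2.76, 3.2, 4.56, 5.123, 6.1, 6.9)

# Compute the bins only; nothing is drawn yet
h <- hist(varx, plot = FALSE)

# Label each bar with the interval it covers
bin_labels <- paste0(head(h$breaks, -1), "-", tail(h$breaks, -1))

# space sets the gap between bars, as a fraction of the bar width
barplot(h$counts, names.arg = bin_labels, space = 0.3,
        xlab = "varx", ylab = "Frequency")
```

Because the binning happens before the bars are drawn, no value can fall into a gap.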

Histogram of uniform distribution not plotted correctly in R

When I run the code
hist(1:5)
or
hist(c(1,2,3,4,5))
The generated histogram shows that the first number "1" has frequency of 2 when there is only one "1" in the array.
I also tried
hist(c(1,2,3,7,7,7,9))
but it still shows the first bar as twice as high as the second one.
However, when I run
hist(c(1:10))
the bars are all the same height.
I'm pretty new to statistics and R so I don't know the reason behind this. I hope somebody can help me understand why this is happening. Thank you
Taking your first example, hist(1:5), you have five numbers, which get put into four bins. So two of those five get lumped into one.
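The histogram has breaks at 1, 2, 3, 4, and 5, and by default each bin includes its right endpoint (the first bin also includes its left endpoint), so the rule hist uses to decide where a number is plotted is roughly:
#pseudocode
if (lower_break < i && i <= upper_break) { plot i in this bin }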
You can specify the breaks manually to solve this:
hist(1:5, breaks=0:5)
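The counts can be checked directly without reading them off the plot. A small sketch using plot = FALSE, which makes hist() return the computed bins without drawing them (h_default and h_manual are just names for this example):

```r
h_default <- hist(1:5, plot = FALSE)
h_default$breaks   # bin edges chosen by hist()
h_default$counts   # the first bin holds both 1 and 2

h_manual <- hist(1:5, breaks = 0:5, plot = FALSE)
h_manual$counts    # one value per bin

# cut() with the same options shows the intervals behind the default bins
table(cut(1:5, breaks = 1:5, include.lowest = TRUE, right = TRUE))
```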
Try this:
> trace("hist.default", quote(print(fuzzybreaks)), at = 25)
Tracing function "hist.default" in package "graphics"
[1] "hist.default"
>
> out <- hist(1:5)
Tracing hist.default(1:5) step 25
[1] 0.9999999 2.0000001 3.0000001 4.0000001 5.0000001
> out$count
[1] 2 1 1 1
which shows the actual fuzzybreaks value it is using as well as the count in each bin. Clearly there are two points in the first bin (between 0.9999999 and 2.0000001) and one point in every other bin.
Compare with:
> out <- hist(1:5, breaks = 0:5 + 0.5)
Tracing hist.default(1:5, breaks = 0:5 + 0.5) step 25
[1] 0.4999999 1.5000001 2.5000001 3.5000001 4.5000001 5.5000001
> out$count
[1] 1 1 1 1 1
Now there is clearly one point in each bin.
What you are seeing is that hist is placing 1:5 into four bins. So there will be one bin with 2 counts.
If you specify the cutoff points like so:
hist(1:5, breaks=(c(0.5, 1.5, 2.5, 3.5, 4.5 , 5.5)))
then you will get the behaviour that you expect.
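The same check can be applied to the other examples in the question. A sketch, again with plot = FALSE; it only shows what hist() computes for these inputs under its default settings (h10 and h7 are names chosen for this example):

```r
h10 <- hist(1:10, plot = FALSE)
h10$breaks   # edges chosen by default
h10$counts   # equal counts, hence bars of equal height

h7 <- hist(c(1, 2, 3, 7, 7, 7, 9), plot = FALSE)
h7$breaks
h7$counts    # the first bin again collects more than one distinct value
```

Looking at the breaks next to the counts shows which values share a bin in each case.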
