I am looking for a way to better visualize the relationship between an independent continuous variable and a binary response variable.
I am trying to understand how I can add a second y-axis to the existing plot below. I want to get a visual sense of the response rate over different numerical ranges.
How can I add the response percent for any given histogram bin? For example, if there were 10 observations in a bin and 2 were the positive class, this would show a response of 20%.
Ideally this would be dynamic, since I might change the number of bins. For instance, I have 10 here, but I might want 20 next time.
This would be a connected line chart, with the per-bin percentages described above on the right y-axis.
In other words, I want the positive-class rate displayed as a line chart, with the percentage shown on the right y-axis.
library(mlbench)
library(tidyverse)
data(Sonar) ## from mlbench
library(ggplot2)
ggplot(Sonar, aes(x=V11, fill=Class)) +
geom_histogram(col='black', bins = 10) +
scale_fill_manual(values=c("purple", "green")) +
labs(title = "Count Left Y Axis; 'R' class percent of BIN in Right Y Axis" ,
x = 'Variable Value (in this case V11)', y = 'Count of Observations')
Not sure if this is what you are after but the description you gave sounded very similar to a conditional density plot.
ggplot probably has an alternative to this, but with base R:
cdplot(Class ~ V1, Sonar, col=c("cornflowerblue", "orange"), main="Conditional density plot")
And the result:
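For the ggplot2 route the question asks about, a minimal sketch might look like the following. It is not part of the answer above. It assumes the same Sonar data and V11 variable as the question, treats "R" as the positive class, and introduces the hypothetical names n_bins, breaks, bin_stats and scale_factor. The per-bin rate is rescaled onto the count axis, and the secondary axis undoes that rescaling so the right-hand ticks read as percentages.

```r
library(mlbench)
library(ggplot2)
data(Sonar)

n_bins <- 10  # assumption: change to 20 etc.; everything below follows
breaks <- seq(min(Sonar$V11), max(Sonar$V11), length.out = n_bins + 1)
breaks[length(breaks)] <- max(Sonar$V11)  # guard against rounding at the top edge

# Assign each observation to a bin using the same breaks as the histogram
bin <- cut(Sonar$V11, breaks = breaks, include.lowest = TRUE)

bin_stats <- data.frame(
  mid  = (head(breaks, -1) + tail(breaks, -1)) / 2,         # bin midpoints
  n    = as.vector(table(bin)),                             # observations per bin
  rate = as.vector(tapply(Sonar$Class == "R", bin, mean))   # share of "R"; NA if bin is empty
)

# Rescale rates (0..1) so that 100% sits at the height of the tallest bar
scale_factor <- max(bin_stats$n)

p <- ggplot(Sonar, aes(x = V11)) +
  geom_histogram(aes(fill = Class), breaks = breaks, colour = "black") +
  geom_line(data = bin_stats, aes(x = mid, y = rate * scale_factor)) +
  geom_point(data = bin_stats, aes(x = mid, y = rate * scale_factor)) +
  scale_fill_manual(values = c("purple", "green")) +
  scale_y_continuous(
    name = "Count of observations",
    # the secondary axis reverses the rescaling and converts to percent
    sec.axis = sec_axis(~ . / scale_factor * 100, name = "'R' class percent of bin")
  ) +
  labs(x = "V11")
print(p)
```

Because the histogram and the rate table share one breaks vector, changing n_bins keeps the line aligned with the bars.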
How, in R, should I make a histogram with a categorical variable on the x-axis and
the frequency of a continuous variable on the y-axis?
Is this correct?
There are a couple of ways one could interpret "one graph" in the title of the question. That said, using the ggplot2 package, there are at least a couple of ways to render histograms by group on a single page of results.
First, we'll create a data frame that contains a normally distributed random variable with a mean of 100 and a standard deviation of 20. We also include a group variable that has one of four values: A, B, C, or D.
set.seed(950141237) # for reproducibility of results
df <- data.frame(group = rep(c("A","B","C","D"),200),
y_value = rnorm(800,mean=100,sd = 20))
The resulting data frame has 800 rows of randomly generated values from a normal distribution, assigned into 4 groups of 200 observations.
Next, we will render this in ggplot2::ggplot() as a histogram, where the color of the bars is based on the value of group.
ggplot(data = df,aes(x = y_value, fill = group)) + geom_histogram()
...and the resulting chart looks like this:
In this style of histogram the values from each group are stacked atop each other (i.e. the frequency of group A is added to B, etc. before rendering the chart), which might not be what the original poster intended.
We can verify the "stacking" behavior by removing the fill = group argument from aes().
# verify the stacking behavior
ggplot(data = df,aes(x = y_value)) + geom_histogram()
...and the output, which looks just like the first chart, but drawn in a single color.
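The same check can be done numerically rather than by eye. The sketch below is an addition, not part of the original answer. It rebuilds the data frame from above, fixes bins = 30 in both plots (an assumption, chosen to match the geom_histogram() default), and compares the per-bin totals of the grouped plot with the counts of the ungrouped one using ggplot2's layer_data().

```r
library(ggplot2)

set.seed(950141237)  # same seed as above
df <- data.frame(group = rep(c("A", "B", "C", "D"), 200),
                 y_value = rnorm(800, mean = 100, sd = 20))

p_stacked <- ggplot(df, aes(x = y_value, fill = group)) + geom_histogram(bins = 30)
p_plain   <- ggplot(df, aes(x = y_value)) + geom_histogram(bins = 30)

stacked <- layer_data(p_stacked)  # one row per bin per group
plain   <- layer_data(p_plain)    # one row per bin

# Sum the four group counts within each bin (bins identified by their centre x)
stacked_totals <- as.vector(tapply(stacked$count, round(stacked$x, 6), sum))

# Should be TRUE if the grouped bars are simply stacked versions of the plain ones
same_totals <- isTRUE(all.equal(stacked_totals, plain$count))
print(same_totals)
```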
Another way to render the data is to use group with facet_wrap(), where each distribution appears in a different facet on one chart.
ggplot(data = df,aes(x = y_value)) + geom_histogram() + facet_wrap(~group)
The resulting chart looks like this:
The facet approach makes it easier to see differences in frequency of y values between the groups.
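If a single panel is still preferred but stacking is not wanted, one further option (an addition here, not something the original answer covers) is to overlay the groups with position = "identity" and some transparency. The alpha value and bins = 30 below are arbitrary choices for illustration.

```r
library(ggplot2)

set.seed(950141237)  # same seed as above
df <- data.frame(group = rep(c("A", "B", "C", "D"), 200),
                 y_value = rnorm(800, mean = 100, sd = 20))

# position = "identity" draws every group's bars from zero instead of stacking them;
# alpha makes the overlapping bars visible through each other
p_overlay <- ggplot(df, aes(x = y_value, fill = group)) +
  geom_histogram(position = "identity", alpha = 0.4, bins = 30)
print(p_overlay)
```

With four overlapping groups this can get hard to read, which is one reason the facet version may still be the clearer choice.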
I am trying to make a plot of proportions of a binomial distribution (yes/no) depending on one ordinal and one continuous variable. Somehow, when I include the continuous one as the color of the dots, the appearance of the plot changes radically. Can someone help me include the third variable without having the plot turn into the table-looking result below?
Code as follows:
#making table with proportions of people who switch (1),
## after arsenic level and education.
educ_switch <- prop.table(table(welldata$educ[welldata$switch==1],
welldata$arsenic[welldata$switch==1],
welldata$switch[welldata$switch==1]))
educ_switch <- as_data_frame(educ_switch, make.names=TRUE)
#remove observations where the proportion is 0
educ_switch1 <- educ_switch[which (educ_switch$proportion>0),]
p <- ggplot(educ_switch1, aes(x = educ, y=proportion))
If I do p + geom_point()
I get the following picture:
But when I try to distinguish the third variable by coloring it with p + geom_point(aes(colour = arsenic))
I get this weird looking thing instead:
I have run into a problem plotting two different data sets with a secondary axis, as described in this previous post: how-to-use-facets-with-a-dual-y-axis-ggplot.
I am trying to use geom_point and geom_bar, but since the geom_bar data range is different, the bars do not appear on the graph.
Here is what I have tried:
point_data=data.frame(gr=seq(1,10),point_y=rnorm(10,0.25,0.1))
bar_data=data.frame(gr=seq(1,10),bar_y=rnorm(10,5,1))
library(ggplot2)
sec_axis_plot <- ggplot(point_data, aes(y=point_y, x=gr,col="red")) + #Enc vs Wafer
geom_point(size=5.5,alpha=1,stat='identity')+
geom_bar(data=bar_data,aes(x = gr, y = bar_y, fill = gr),stat = "identity") +
scale_y_continuous(sec.axis = sec_axis(trans=~ .*15,
name = 'bar_y',breaks=seq(0,10,0.5)),breaks=seq(0.10,0.5,0.05),limits = c(0.1,0.5),expand=c(0,0))+
facet_wrap(~gr, strip.position = 'bottom',nrow=1)+
theme_bw()
As can be seen, bar_data is removed. Is it possible to plot them together in this context?
thx
You're running into problems here because the transformation of the second axis is only used to create the second axis -- it has no impact on the data. Your bar_data is still being plotted on the original axis, which only goes up to 0.5 because of your limits. This prevents the bars from appearing.
In order to make the data show up in the same range, you have to normalize the bar data so that it falls in the same range as the point data. Then, the axis transformation has to undo this normalization so that you get the appropriate tick labels. Like so:
# Normalizer to bring bar data into point data range. This makes
# highest bar equal to highest point. You can use a different
# normalization if you want (e.g., this could be the constant 15
# like you had in your example, though that's fragile if the data
# changes).
normalizer <- max(bar_data$bar_y) / max(point_data$point_y)
sec_axis_plot <- ggplot(point_data,
aes(y=point_y, x=gr)) +
# Plot the bars first so they're on the bottom. Use geom_col,
# which creates bars with specified height as y.
geom_col(data=bar_data,
aes(x = gr,
y = bar_y / normalizer)) + # NORMALIZE Y !!!
# stat="identity" and alpha=1 are defaults for geom_point
geom_point(size=5.5) +
# Create second axis. Notice that the transformation undoes
# the normalization we did for bar_y in geom_col.
scale_y_continuous(sec.axis = sec_axis(trans= ~.*normalizer,
name = 'bar_y')) +
theme_bw()
This gives you the following plot:
I removed some of your bells and whistles to make the axis-specific stuff more clear, but you should be able to add it back in no problem. A couple of notes though:
Remember that the second axis is created by a 1-1 transformation of the primary axis, so make sure they cover the same limits under the transformation. If you have bars that should go to zero, the primary axis should include the untransformed analogue of zero.
Make sure that the data normalization and the axis transformation undo each other so that your axis lines up with the values you're plotting.
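A quick way to check both notes before plotting is to write the normalization and its inverse as two small functions and assert that they round-trip. This is only a sketch: to_primary and to_secondary are hypothetical helper names, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
point_data <- data.frame(gr = 1:10, point_y = rnorm(10, 0.25, 0.1))
bar_data   <- data.frame(gr = 1:10, bar_y = rnorm(10, 5, 1))

normalizer <- max(bar_data$bar_y) / max(point_data$point_y)

to_primary   <- function(y) y / normalizer  # applied to the bar data before plotting
to_secondary <- function(y) y * normalizer  # the transformation given to sec_axis()

# The two steps should undo each other
round_trip_ok <- isTRUE(all.equal(to_secondary(to_primary(bar_data$bar_y)),
                                  bar_data$bar_y))

# The tallest bar should land at the height of the highest point
tops_match <- isTRUE(all.equal(max(to_primary(bar_data$bar_y)),
                               max(point_data$point_y)))

# Zero on the secondary axis corresponds to zero on the primary axis,
# so bars that start at zero need the primary limits to include zero
zero_maps_to_zero <- to_primary(0) == 0

stopifnot(round_trip_ok, tops_match, zero_maps_to_zero)
```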
I'm trying to plot points along the genome: there will be plot points for every chromosome. My data file looks like this:
CHROM BP P DP
1 234567 0.0000555 30
.....
Y 12345678 0.09 14
I'm using ggplot2 to plot P values, coloured by DP, for each chromosome, using the following:
mc.points <- ggplot(sample,aes(x = BP,y = P, colour =DP)) +
geom_point() +
labs(x = "Chromosome",y = "P") +
scale_color_gradient2(low = "green", high = "red")
However, instead of being plotted at each BP in the right chromosomal order, it's being plotted by BP without any regard for chromosome number.
Is there a way to sort the data to make this happen (i.e. order by chromosome, then BP)? I've tried to make CHROM and BP factors, but this seems to crash R. In addition, if this is possible, is there a way to label the ticks on the x-axis with chromosome numbers rather than BP (similar to a Manhattan plot)?
I can provide dummy data if need be but this is quite long.
Just to provide an update: facet_grid seems to solve my problem, but I was wondering whether I can transform this. It splits the grid by chromosome, but doesn't plot the chromosomes on the same x-axis in consecutive order. Instead it draws 22 different plots using the same x-axis scale. Any solutions?
Have you tried something like this untested code before the plot:
sample$BP <- factor(sample$BP,
  levels = sample[!duplicated(sample[, "BP"]), "BP"][
    order(sample[!duplicated(sample[, "BP"]), "CHROM"])]
)
It would have been easier, and perhaps more compact, if you had included a suitable sample for testing. In the future you should NOT use the name `sample`, since it is an important R function name.
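A different technique from the factor approach above, and the one Manhattan-style plots commonly use, is to give every point a cumulative genome position: its BP plus the combined length of all preceding chromosomes. The sketch below uses made-up toy data with the same column names as the question. The names gwas, chrom_levels, offset, pos and centers are hypothetical, and each chromosome's length is approximated by its largest observed BP.

```r
library(ggplot2)

# Hypothetical toy data with the same columns as the question
gwas <- data.frame(
  CHROM = c("2", "1", "Y", "1", "X", "2"),
  BP    = c(500000, 234567, 12345678, 900000, 42, 1000),
  P     = c(0.2, 0.0000555, 0.09, 0.5, 0.01, 0.7),
  DP    = c(22, 30, 14, 18, 25, 40)
)

# Order chromosomes 1..22, X, Y rather than alphabetically
chrom_levels <- c(1:22, "X", "Y")
gwas$CHROM <- factor(gwas$CHROM, levels = chrom_levels)
gwas <- droplevels(gwas[order(gwas$CHROM, gwas$BP), ])

# Approximate each chromosome's length by its largest observed BP,
# then offset every chromosome by the total length of those before it
chrom_len <- tapply(gwas$BP, gwas$CHROM, max)
offset <- c(0, head(cumsum(as.vector(chrom_len)), -1))
names(offset) <- names(chrom_len)
gwas$pos <- gwas$BP + unname(offset[as.character(gwas$CHROM)])

# One tick per chromosome, placed at the middle of its span
centers <- tapply(gwas$pos, gwas$CHROM, function(x) mean(range(x)))

p <- ggplot(gwas, aes(x = pos, y = P, colour = DP)) +
  geom_point() +
  scale_x_continuous(breaks = as.vector(centers), labels = names(centers)) +
  labs(x = "Chromosome", y = "P")
print(p)
```

This keeps all chromosomes on one continuous x-axis in consecutive order, which is what the facet_grid update in the question was missing.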
What I have is a 3-Levels Repeated Measures Factor and a continuous variable (Scores in psychological questionnaire, measured only once pre-experiment, NEO), which showed significant interaction together in a Linear Mixed Effects Model with a Dependent Variable (DV; State-Scores measured at each time level, IAS).
To see the nature of this interaction, I would like to create a plot with time levels on the x-axis, State-Score on the y-axis and multiple curves for the continuous variable, similar to this. The continuous variable should be categorized into, say, quartiles (so I get 4 different curves), which is exactly what I can't achieve. So far I get a separate curve for each value of the continuous variable.
My goal is also comparable to this, but I need the categorical (time) variable on the x-axis, not as separate curves.
I tried a lot of different plot functions in R but didn't manage to get what I want, maybe because I am not so skilled with R.
For example,
ggplot(Data_long, aes(x = time, y = IAS, colour = NEO, group = NEO)) +
geom_line()
from the first link shows me dozens of curves (one for each value of the measurement NEO), and I can't find how to group continuous variables in a meaningful way in ggplot.
Edit:
Original Data:
http://www.pastebin.ca/2598926
(I hope it is not too inconvenient.)
This object (Data_long) was created/converted with the following line:
Data_long <- transform(Data_long0, neo.binned=cut(NEO,c(25,38,46,55,73),labels=c(".25",".50",".75","1.00")))
Every value in the neo.binned col seems to be set correctly with enough cases per quantile.
What I then tried and didn't work:
ggplot(Data_long, aes(x = time, y = ias, color = neo.binned)) + stat_summary(fun.y="median",geom="line")
geom_path: Each group consist of only one observation. Do you need to adjust the group aesthetic?
I got 92 subjects and values for NEO between 26-73. Any hints what to enter for cut and labels function? Quantiles are 0% 25% 50% 75% 100% 26 38 46 55 73.
Do you mean something like this? Here, your data is binned according to NEO into three classes, and then the median of IAS over these bins is drawn. Check out ?cut.
Data_long <- transform(Data_long, neo.binned=cut(NEO,c(0,3,7,10),labels=c("lo","med","hi")))
Plot everything in one plot.
ggplot(Data_long, aes(x = time, y = IAS, color = neo.binned)) +
  stat_summary(aes(group=neo.binned),fun.y="median",geom="line")
And, stealing from CMichael's answer, you can also split it across multiple facets (you linked to faceted plots in your question):
ggplot(Data_long,aes(x=time,y=IAS)) +
  stat_summary(fun.y="median",geom="line") +
  facet_grid(neo.binned ~ .)
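On the question of what to pass to cut(): one option is to take the breaks straight from quantile() and set include.lowest = TRUE, since otherwise the smallest value equals the first break and falls outside the first interval, becoming NA. The sketch below uses simulated data, not the pastebin data. The subject count and NEO range follow the question, while the seed, the IAS values and the labels Q1 to Q4 are assumptions. It uses the fun argument, which needs ggplot2 3.3.0 or later (older versions use fun.y as above).

```r
library(ggplot2)

set.seed(42)  # arbitrary seed, only for reproducibility
n_subj <- 92
neo <- sample(26:73, n_subj, replace = TRUE)  # one NEO score per subject

# Simulated long-format data: three time levels per subject
Data_long <- data.frame(
  subject = rep(seq_len(n_subj), each = 3),
  time    = factor(rep(c("t1", "t2", "t3"), times = n_subj)),
  NEO     = rep(neo, each = 3)
)
Data_long$IAS <- rnorm(nrow(Data_long), mean = 50, sd = 10)

# Quartile breaks computed once per subject, then applied to the long data
q_breaks <- quantile(neo, probs = seq(0, 1, 0.25))
Data_long$neo.binned <- cut(Data_long$NEO,
                            breaks = q_breaks,
                            include.lowest = TRUE,  # keep the minimum in Q1
                            labels = c("Q1", "Q2", "Q3", "Q4"))

# group = neo.binned tells ggplot to connect the medians within each quartile,
# which avoids the "each group consist of only one observation" message
p <- ggplot(Data_long, aes(x = time, y = IAS, colour = neo.binned)) +
  stat_summary(aes(group = neo.binned), fun = median, geom = "line")
print(p)
```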
Do you mean faceting @ziggystar's initial plot?
quantiles = quantile(Data_long$NEO,c(0.25,0.5,0.75))
Data_long$NEOQuantile = ifelse(Data_long$NEO<=quantiles[1],"first NEO Quantile",
                         ifelse(Data_long$NEO<=quantiles[2],
                                "second NEO Quantile",
                                ifelse(Data_long$NEO<=quantiles[3],
                                       "third NEO Quantile","fourth NEO Quantile")))
require(ggplot2)
p = ggplot(Data_long,aes(x=time,y=IAS)) + stat_quantile(quantiles=c(1),formula=y ~ x)
p = p + facet_grid(.~NEOQuantile)
p