paired data for a facet_wrap - r

Imagine I have data foo below. Each row contains a measurement (y) on a species and each species is paired with another (species.pair). So in the example below, species a is paired with e, b with f, and so on. The number of observations for each species varies. I'd like to plot the density of each species's distribution along with its partner's distribution in its own facet. Below I hand coded this with the column sppPairs. The species are all unique and each has a match in species.pair. I'm unsure of how to make the grouping column sppPairs below. I'm sure there is some clever way to do this with {dplyr} but I can't figure out what to do. Some kind of pasting species to species.pair I imagine? Any help much appreciated.
foo <- data.frame(species = rep(letters[1:8],each=10),
species.pair = rep(letters[c(5:8,1:4)],each=10),
y=rnorm(80))
# species and species pair match exactly
all(unique(foo$species) %in% unique(foo$species.pair))
# what I want
foo$sppPairs <- c(rep("a:e",10),
rep("b:f",10),
rep("c:g",10),
rep("d:h",10),
rep("a:e",10),
rep("b:f",10),
rep("c:g",10),
rep("d:h",10))
p1 <- ggplot(foo,aes(y,fill=species))
p1 <- p1 + geom_density(alpha=0.5)
p1 <- p1 + facet_wrap(~sppPairs)
p1

Yes, you can use apply on the appropriate columns to paste the sorted elements together in the correct order (otherwise a:e is different from e:a and so on, and you end up with 8 groups instead of 4):
library(ggplot2)
foo <- data.frame(species = rep(letters[1:8], each = 10),
species.pair = rep(letters[c(5:8, 1:4)], each = 10),
y = rnorm(80))
foo$sppPairs <- apply(foo[c("species", "species.pair")], 1,
function(x) paste(sort(x), collapse = ":"))
ggplot(foo, aes(y, fill = species)) +
geom_density(alpha = 0.5) +
facet_wrap(~sppPairs)
Created on 2020-10-05 by the reprex package (v0.3.0)
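For larger data frames, a vectorised variant of the same idea avoids the row-wise apply call. This is a minimal sketch rather than part of the answer above: it assumes both columns can be coerced to character, and it relies on pmin() and pmax() to put each pair in alphabetical order before pasting.

```r
# Sketch: build the pair label without apply(); same foo layout as above
set.seed(1)  # only to make the example reproducible
foo <- data.frame(species = rep(letters[1:8], each = 10),
                  species.pair = rep(letters[c(5:8, 1:4)], each = 10),
                  y = rnorm(80))

# as.character() guards against factor columns (the default before R 4.0)
s1 <- as.character(foo$species)
s2 <- as.character(foo$species.pair)

# pmin()/pmax() compare strings element-wise, so ("e", "a") and ("a", "e")
# both end up as "a:e"
foo$sppPairs <- paste(pmin(s1, s2), pmax(s1, s2), sep = ":")

unique(foo$sppPairs)
```

The resulting sppPairs column can be passed to facet_wrap(~sppPairs) exactly as before.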

Related

ggplot legend values rescale

I would like to rescale the values on the legend of a plot coming from conditional_effects.
By doing something like this
plot(conditional_effects(brm_c_5, effects = "t:w_c_ratio",cond = conditions5), rug = T, points = T)
I'm getting the following
For time being I'm doing
p_col_1 <- ggplot_build(p_col_1)
and then I'm changing the ranges here p_col_1$plot$scales$scales[[3]]$range$range and here p_col_1$plot$scales$scales[[4]]$range$range, but I don't trust this solution.
EXAMPLE:
As an example, please see this code. The default values of kidney$age range from 10 to 69, but let's say that I want to rescale them to run from -1 to 1. I could use the ggplot_build approach, but I'm looking for a smarter and more elegant solution.
library(brms)
fit1 <- brm(time | cens(censored) ~ age + sex + disease,
data = kidney, family = weibull, init = "0")
fit1
p_tr <- (plot(conditional_effects(fit1, effects = "disease:age"), rug = T, points = T)[[1]])
p_tr <- ggplot_build(p_tr)
p_tr$plot$scales$scales[[3]]$range$range <- c(1,0, -1) %>% as.character()
p_tr$plot$scales$scales[[4]]$range$range <- c(1, 0 ,-1)%>% as.character()
plot(p_tr %>% ggplot_gtable)
How could I rescale the values of w_c_ratio from -0.9:+0.9 in the original scale (which is going from 2 to 10)?

Filter datapoints in a scatterplot using a linear equation

I am writing a function to filter datapoints out of my plot based on a linear equation.
I currently have the following function (with a different function within it):
MD_filter<- function(dataframe, mz_col){
#In-function MD calculation
MZ<- mz_col
MZR<- trunc(mz_col, digits = 0)#Either floor() or trunc() can be used for this part.
MD<- as.numeric(MZ-MZR)
dataframe<- dataframe%>%
dplyr::mutate(MD)%>%
dplyr::select(MD, everything())
#fit data to m/z defect maxima equation
f<- function(x){#This could be problem `1`, maybe resolved by leaving x....
y<-0.00112*x + 0.01953
return(y)}
fit<-as.data.frame(t(apply(dataframe,1,f)))# t() transforms df to matrix...?
filtered<-dataframe[which((dataframe$MD<= fit$MZ)),]
#keep rows in dataframe if MD is less than or equal to fitted value (mz after equation)
#As "fit" calculated the maximum MD value for each MZ value in the MZ column, we subset fit$MZ, as this contains the dataframe MZ values.
#The MD calculated at the very start, needs to be lower than the equivalent MZ value of the fitted dataframe.
filtered<-write.table(filtered,"feature_list_mz_defect_filtered.txt",sep="\t",col.names=NA)
#Now we have pre filter dataframe (dataframe) and post filter df (filtered)
#2 Different plots: (highlight to be removed as well, so we need a 3rd eventually)
MD_plot<- ggplot(dataframe, aes(x= MZ, y = MD)) +
geom_point() +#THE FOLLOWING PART DOES NOT WORK YET
ggtitle(paste("Unfiltered MD data - ", dataframe))
#stat_smooth(method="lm", se=FALSE)-> For linear line through the plot, but may not be necessary to show
return(MD_plot)#While I do get a plot, I have not yet gotten the equation. I could use my earlier approach maybe.
MD_plot_2<- ggplot(filtered, aes(x= MZ, y = MD)) +#Filtered is basically the second dataframe,
#which subsets datapoints with an Y value (which is the MD), below the linear equation MD...
geom_point() +#THE FOLLOWING PART DOES NOT WORK YET
ggtitle(paste("Filtered MD data - ", dataframe))
#stat_smooth(method="lm", se=FALSE) -> For linear line through the plot, but may not be necessary to show
return(MD_plot_2)
}
The function works as follows:
The argument inputs are a dataframe and a specific column inside that dataframe which I call the mz_col.
From this column a second column, the MD column, is generated.
From here on out I want to make two plots:
ggplot 1: A plot with the mz_col (MZ) values on the X axis and the MD values on the Y axis
ggplot 2: EXACTLY the same as ggplot 1, but I want to filter out the datapoints if MD exceeds the linear equation y<-0.00112*x + 0.01953 (as is visible in the code). This linear line is basically my maximum filter limit in the plot, everything above this I want gone.
I've tried many different solutions. I swapped the "x" argument with mz_col among many other solutions such as trying to use plot() instead of ggplot. Currently I'm getting no plot, but I do get this:
Basically my question is: How do I solve my function, so I can get my two plots? The first plot is no real issue, this already works, but the second plot just won't filter out datapoints based on my linear equation.
I'm quite new to SO and R, so I apologize if anything is unclear. Please let me know if any clarification is needed, and thanks in advance for all the help!
Reproducible sample data:
structure(list(mz = c(446.0394, 346.043, 199.0446, 199.0464, 97.057, 657.0595, 879.0606, 978.0631, 199.0684, 199.0707, 199.0724, 86.0748, 199.0761, 196.0789, 199.0819, 199.0852, 199.0878, 199.089, 908.0935, 147.0958, 199.0999,199.1299, 199.1322, 199.1384, 199.1398, 199.1434, 124.1475, 199.1513, 187.156, 199.1686, 199.1766, 199.1797, 199.1879, 199.1924, 187.1959, 479.1981, 169.1999, 109.2049, 399.2092, 299.2125, 159.2146, 199.2242, 356.2405, 69.2423, 956.4337, 978.5537, 199.5695, 676.5769, 199.5851, 500.6021, 260.6039, 270.6081, 200.6114, 200.6131, 200.6172, 200.6221,
200.6315, 200.6402, 200.6476, 200.766, 200.8591, 200.8732, 200.8768,
200.89, 200.8937, 200.8972, 200.9067, 200.9127, 200.9147, 200.9231,
200.9253, 200.9288, 200.9324, 200.935, 200.9468, 200.9515, 200.9536,
200.9557, 200.9568, 200.9594, 200.9661, 200.968, 200.9729, 200.9745,
200.9819, 200.9837, 200.9858, 200.9937)), row.names = c(NA, -88L), class = c("tbl_df", "tbl", "data.frame"))
I got a bit lost trying to follow your code, but based on your description, does the following work for you?
library(dplyr)
library(ggplot2)
MD_filter <- function(dataframe, mz_col, a = 0.01953, b = 0.00112){
# rename column so that rest of function doesn't depend on inputted column name
dataframe[["MZ"]] <- dataframe[[mz_col]]
# process dataframe
dataframe <- dataframe %>%
select(MZ) %>%
mutate(MD = MZ - trunc(MZ, digits = 0),
MD.limit = a + b*MZ)
p1 <- ggplot(dataframe,
aes(x = MZ, y = MD)) +
geom_point() +
geom_smooth(method = "lm", se = F) +
ggtitle("Unfiltered MD data")
p2 <- p1 %+% filter(dataframe, MD <= MD.limit) +
expand_limits(y = range(dataframe[["MD"]])) + # optional (if you want same
# y-axis range for both plots)
ggtitle("Filtered MD data")
cowplot::plot_grid(p1, p2, nrow = 1)
}
Data & usage
dd <- structure(list(mz = c(
446.0394, 346.043, 199.0446, 199.0464, 97.057, 657.0595, 879.0606,
978.0631, 199.0684, 199.0707, 199.0724, 86.0748, 199.0761, 196.0789,
199.0819, 199.0852, 199.0878, 199.089, 908.0935, 147.0958, 199.0999,
199.1299, 199.1322, 199.1384, 199.1398, 199.1434, 124.1475, 199.1513,
187.156, 199.1686, 199.1766, 199.1797, 199.1879, 199.1924, 187.1959,
479.1981, 169.1999, 109.2049, 399.2092, 299.2125, 159.2146, 199.2242,
356.2405, 69.2423, 956.4337, 978.5537, 199.5695, 676.5769, 199.5851,
500.6021, 260.6039, 270.6081, 200.6114, 200.6131, 200.6172, 200.6221,
200.6315, 200.6402, 200.6476, 200.766, 200.8591, 200.8732, 200.8768,
200.89, 200.8937, 200.8972, 200.9067, 200.9127, 200.9147, 200.9231,
200.9253, 200.9288, 200.9324, 200.935, 200.9468, 200.9515, 200.9536,
200.9557, 200.9568, 200.9594, 200.9661, 200.968, 200.9729, 200.9745,
200.9819, 200.9837, 200.9858, 200.9937)),
row.names = c(NA, -88L),
class = c("tbl_df", "tbl", "data.frame"))
MD_filter(dd, "mz")
# MD_filter(dd, "mz", a = 0.02, b = 0.001) # if you want to change the limit
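To see what the filter keeps, independently of the plotting, the rule MD <= a + b * MZ can be checked by hand on a few values. The following is a small base R sketch using three m/z values from the sample data and the same default a and b as the function above; the variable names are only illustrative.

```r
# Three m/z values taken from the sample data
mz <- c(446.0394, 200.9937, 97.057)

md    <- mz - trunc(mz)           # mass defect: the fractional part of m/z
limit <- 0.01953 + 0.00112 * mz   # maximum allowed mass defect at each m/z
keep  <- md <= limit              # TRUE for points on or below the line

data.frame(mz, md, limit, keep)
```

Here the second value (200.9937) has a mass defect of about 0.99, well above its limit of about 0.24, so it is the only one of the three that is dropped.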

Find the y-coordinate at intersection of two curves when x is known

Background and Summary of Objective
I am trying to find the y-coordinate at the intersection of two plotted curves using R. I will provide complete details and sample data below, but in the hopes that this is a simple problem, I'll be more concise up front.
The cumulative frequencies of two curves (c1 and c2 for simplicity) are defined by the following function, where a and b are known coefficients:
f(x)=1/(1+exp(-(a+bx)))
Using the uniroot() function, I found "x" at the intersection of c1 and c2.
I had assumed that if x is known then determining y should be simple substitution: for example, if x = 10, y=1/(1+exp(-(a+b*10))) (again, a and b are known values); however, as will be shown below, this is not the case.
The objective of this post is to determine how to find y-coordinate.
Details
This data replicates respondents' stated price at which they find the product's price to be too.cheap (i.e., they question its quality) and the price at which they feel the product is a bargain.
The data will be cleaned before use to ensure that too.cheap is always less than the bargain price.
The cumulative frequency for the bargain price will be inverted to become not.bargain.
The intersection of bargain and too.cheap will represent the point at which an equal share of respondents feel the price is not a bargain and too.cheap --- the point of marginal cheapness ("pmc").
Getting to the point where I'm having a challenge will take a number of steps.
Step 1: Generate some data
# load libraries for all steps
library(car)
library(ggplot2)
# function that generates the data
so.create.test.dataset <- function(n, mean){
step.to.bargain <- round(rnorm(n = n, 3, sd = 0.75), 2)
price.too.cheap <- round(rnorm(n = n, mean = mean, sd = floor(mean * 100 / 4) / 100), 2)
price.bargain <- price.too.cheap + step.to.bargain
df.temp <- cbind(price.too.cheap,
price.bargain)
df.temp <- as.data.frame(df.temp)
return(df.temp)
}
# create 389 "observations" where the too.cheap has a mean value of 10.50
# the function will also create a "bargain" price by
#adding random values with a mean of 3.00 to the too.cheap price
so.test.df <- so.create.test.dataset(n = 389, mean = 10.50)
Step 2: Create a data frame of cumulative frequencies
so.get.count <- function(p.points, p.vector){
cc.temp <- as.data.frame(table(p.vector))
cc.merged <- merge(p.points, cc.temp, by.x = "price.point", by.y = "p.vector", all.x = T)
cc.extracted <- cc.merged[,"Freq"]
cc.extracted[is.na(cc.extracted)] <- 0
return(cc.extracted)
}
so.get.df.price<-function(df){
# creates cumulative frequencies for three variables
# using the price points provided by respondents
# extract and sort all unique price points
# Thanks to akrun for their help with this step
price.point <- sort(unique(unlist(round(df, 2))))
#create a new data frame to work with having a row for each price point
dfp <- as.data.frame(price.point)
# Create cumulative frequencies (as percentages) for each variable
dfp$too.cheap.share <- 1 - (cumsum(so.get.count(dfp, df$price.too.cheap)) / nrow(df))
dfp$bargain.share <- 1 - cumsum(so.get.count(dfp, df$price.bargain)) / nrow(df)
dfp$not.bargain.share <- 1 - dfp$bargain.share# bargain inverted so curves will intersect
return(dfp)
}
so.df.price <- so.get.df.price(so.test.df)
Step 3: Estimate the curves for the cumulative frequencies
# Too Cheap
so.l <- lm(logit(so.df.price$too.cheap.share, percents = TRUE)~so.df.price$price.point)
so.cof.TCh <- coef(so.l)
so.temp.nls <- nls(too.cheap.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.TCh[1], b = so.cof.TCh[2]), data = so.df.price, trace = TRUE)
so.df.price$Pr.TCh <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data= so.df.price, trace=TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
# Thanks to John Fox & Sanford Weisberg - "An R Companion to Applied Regression, second edition"
At this point, we can plot and compare the "observed" cumulative frequencies against the estimated frequencies
ggplot(data = so.df.price, aes(x = price.point))+
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap"))+
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain"))+
geom_line(aes(y = so.df.price$too.cheap.share, colour = "too.cheap.share"))+
geom_line(aes(y = so.df.price$not.bargain.share, colour = "not.bargain.share"))+
scale_y_continuous(name = "Cumulative Frequency")
The estimate appears to fit the observations reasonably well.
Step 4: Find the intersection point for the two estimate functions
so.f <- function(x, a, b){
# model for the curves
1 / (1 + exp(-(a + b * x)))
}
# note, this function may also be used in step 3
#I was building as I went and I don't want to risk a transpositional error that breaks the example
so.pmc.x <- uniroot(function(x) so.f(x, so.cof.TCh[1], so.cof.TCh[2]) - so.f(x, so.cof.NBr[1], so.cof.NBr[2]), c(0, 50), tol = 0.01)$root
We may visually test the so.pmc.x by plotting it with the two estimates. If it is correct, a vertical line for so.pmc.x should pass through the intersection of too.cheap and not.bargain.
ggplot(data = so.df.price, aes(x = price.point)) +
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap")) +
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain")) +
scale_y_continuous(name = "Cumulative Frequency") +
geom_vline(aes(xintercept = so.pmc.x))
...which it does.
Step 5: Find y
Here is where I get stumped, and I'm sure I'm overlooking something very basic.
If a curve is defined by f(x) = 1/(1+exp(-(a+bx))), and a, b and x are all known, then shouldn't y be the result of 1/(1+exp(-(a+bx))) for either estimate?
In this instance, it is not.
# We attempt to use the too.cheap estimate to find y
so.pmc.y <- so.f(so.pmc.x, so.cof.TCh[1], so.cof.TCh[2])
# In theory, y for not.bargain at price.point so.pmc.x should be the same
so.pmc.y2 <- so.f(so.pmc.x, so.cof.NBr[1], so.cof.NBr[2])
EDIT: This is where the error occurs (see solution below).
a != so.cof.NBr[1] and b != so.cof.NBr[2]; instead, a and b should be defined as the coefficients from so.temp.nls (not so.l).
# Which they are
#> so.pmc.y
#(Intercept)
# 0.02830516
#> so.pmc.y2
#(Intercept)
# 0.0283046
If we calculate the correct value for y, a horizontal line at yintercept = so.pmc.y, should pass through the intersection of too.cheap and not.bargain.
...which it obviously does not.
So how does one estimate y?
I've solved this, and as I suspected, it was a simple error.
My assumption that y = 1/(1+exp(-(a+bx))) is correct.
The issue is that I was using the wrong a, b coefficients.
My curve was defined using the coefficients in so.cof.NBr as defined by so.l.
#Not Bargain
so.l <- lm(logit(not.bargain.share, percents = TRUE) ~ price.point, so.df.price)
so.cof.NBr <- coef(so.l)
so.temp.nls <- nls(not.bargain.share ~ 1 / (1 + exp(-(a + b * price.point))), start = list(a = so.cof.NBr[1], b = so.cof.NBr[2]), data= so.df.price, trace=TRUE)
so.df.price$Pr.NBr <- predict(so.temp.nls, so.df.price$price.point, lwd=2)
But the resulting curve is so.temp.nls, NOT so.l.
Therefore, once I find so.pmc.x I need to extract the correct coefficients from so.temp.nls and use those to find y.
# extract coefficients from so.temp.nls
so.co <- coef(so.temp.nls)
# find y
so.pmc.y <- 1 / (1 + exp(-(so.co[1] + so.co[2] * so.pmc.x)))
ggplot(data = so.df.price, aes(x = price.point))+
geom_line(aes(y = so.df.price$Pr.TCh, colour = "Too Cheap"))+
geom_line(aes(y = so.df.price$Pr.NBr, colour = "Not Bargain"))+
scale_y_continuous(name = "Cumulative Frequency")+
geom_hline(aes(yintercept = so.pmc.y))
Yielding the following...
which graphically depicts the correct answer.
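As a cross-check on uniroot(): because 1/(1+exp(-z)) is strictly increasing in z, two curves of this form meet exactly where their linear parts are equal, a1 + b1*x = a2 + b2*x, which gives x = (a2 - a1)/(b1 - b2) when b1 != b2. The sketch below illustrates this with made-up coefficients (not the fitted ones from the nls models); curve.f, co1 and co2 are hypothetical names used only here.

```r
# Hypothetical coefficients, for illustration only
co1 <- c(a = 8,   b = -0.8)   # a falling curve, like "too cheap"
co2 <- c(a = -10, b = 0.8)    # a rising curve, like "not bargain"

# Same model as so.f, taking both coefficients as one named vector
curve.f <- function(x, co) 1 / (1 + exp(-(co[["a"]] + co[["b"]] * x)))

# Closed form: the curves cross where a1 + b1 * x == a2 + b2 * x
x.closed <- (co2[["a"]] - co1[["a"]]) / (co1[["b"]] - co2[["b"]])

# Numerical root, found the same way as in Step 4
x.root <- uniroot(function(x) curve.f(x, co1) - curve.f(x, co2),
                  c(0, 50), tol = 1e-9)$root

# y is the same whichever curve is evaluated at the crossing point,
# provided the coefficients are the ones that actually define the curves
y1 <- curve.f(x.closed, co1)
y2 <- curve.f(x.closed, co2)
c(x.closed = x.closed, x.root = x.root, y1 = y1, y2 = y2)
```

With fitted models, the same check would use coef() of each nls fit for both the root search and the final substitution.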

adding strip to single dotplot from lattice package

I have a situation where I have two different kinds of plots and I need to plot them together, with strips on top stating their headings. I have to use the lattice package only, so ggplot2 is not an option. Thanks
I have a sample code here :
library(lattice)
library(latticeExtra)
X<-rnorm(100)
Y<-rnorm(100)
S<- rnorm(500)
df1<-data.frame(X,Y,S)
p1<-dotplot(X~Y, data=df1)
p2<-dotplot(X~S, data=df1)
#combining plots
c(p1,p2)
As mentioned in the comments, you need to stack the variables Y and S and create an additional column indicating what variable each value comes from.
> df2 <- reshape(df1, varying=2:3, v.names='Value', timevar='Var', times=c('Y','S'), direction='long')
> head(df2)
X Var Value id
1.Y -2.3720450 Y 2.3965643 1
2.Y 0.8082862 Y 0.0215850 2
3.Y 0.3774736 Y -0.6385176 3
4.Y 0.7161986 Y -0.3908185 4
5.Y -0.3633583 Y -0.9611222 5
6.Y -0.3484920 Y 3.3387813 6
Then
> dotplot(X ~ Value | Var, data = df2)
should do what you want.
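An alternative to reshape() for the stacking step is stack() from base R, sketched below. This assumes X, Y and S all have the same length (100 here, unlike the 500 values of S in the question); stack() names the resulting columns values and ind.

```r
set.seed(1)  # reproducibility only
df1 <- data.frame(X = rnorm(100), Y = rnorm(100), S = rnorm(100))

# stack() puts Y and S into one 'values' column, with 'ind' naming the source;
# X is recycled so that each stacked row keeps its original X value
df2 <- cbind(X = df1$X, stack(df1[c("Y", "S")]))
head(df2)

# One panel, with its own strip, per source variable:
# lattice::dotplot(X ~ values | ind, data = df2)
```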
Either:
p1 <- dotplot(X~Y, data=df1)
p2 <- dotplot(X~S, data=df1)
c(
"Here is plot 1" = p1
,"And here second one" = p2
)
or:
p1 <- dotplot(X~Y|"Here is plot 1", data=df1)
p2 <- dotplot(X~S|"And here second one", data=df1)
c(p1, p2)
The first one seems better because you create the strip where you need it.

Generating "2D" histogram in R

I am new to R and I would like to know how to generate histograms for the following situation :
I initially have a regular frequency table with 2 columns : Column A is the category (or bin) and Column B is the number of cases that fall in that category
Col A Col B
1-10 7
11-20 4
21-30 5
From this initial frequency table, I create a table with 3 columns : Col A is again the category (or bin), but now Col B is the "fraction of total cases", so for the category 1-10, column B will have the value 7/(7+4+5) = 7/16 . Now there is also a third column, Col C which is "fraction of total cases falling between the categories 1-20", so for 1-10, the value for Col C would be 7/(7+4) = 7/11. The complete table would look like below :
Col A Col B Col C
1-10 7/16 7/11
11-20 4/16 4/11
21-30 5/16 0
How do I generate a histogram from this 3-column table above ? My X axis should be the bin (1-10, 11-20 etc.) and my Y axis should be the fraction, however for every bin I have two fractions (Col B and Col C), so there will be two fraction "bars" for every bin in the histogram.
Any help would be greatly appreciated.
The data:
dat <- data.frame(A = c("1-10", "11-20", "21-30"), B = c(7, 4, 5))
Now, calculate the proportions and create a new object:
dat2 <- rbind(B = dat$B/sum(dat$B), C = c(dat$B[1:2]/sum(dat$B[1:2]), 0))
colnames(dat2) <- dat$A
Plot:
barplot(dat2, beside = TRUE, legend = rownames(dat2))
Your title should be changed to "Dodged Bar Chart" instead of "2D histogram", because a histogram has a continuous scale on the x axis, unlike a bar chart, and is used for comparing the distributions of univariate data, or the distributions of univariate data modeled on a dependent factor. You are trying to compare colB vs colC, which can be effectively visualized using a 2D scatter plot but not with a bar chart. A better way to compare the distributions of colB and colC using histograms would be to plot two histograms separately and check the change in location of the data points.
If you want to compare the distributions of colB and colC, try the following code. I rounded the values to get reasonable data matching your description. Note that random sampling is involved, so every time you run the same code there will be a slight change in the distribution, but that will not affect the inference about the distributions of colB and colC.
library("ggplot2")
# 44 datapoints between 1-10
a <- rep(1:10, 4)
a <- c(a, sample(a, size=4, replace=FALSE))
# 25 datapoints between 11-20
b <- rep(11:20, 2)
b <- c(b, sample(b, size=5, replace=FALSE))
# 31 datapoints between 21-30
c <- rep(21:30, 3)
c <- c(c, sample(c, size=1, replace=FALSE))
colB <- c(a, b, c)
# 64 datapoints between 1-10
a <- rep(1:10, 6)
a <- c(a, sample(a, size=4, replace=FALSE))
# 36 datapoints between 11-20
b <- rep(11:20, 3)
b <- c(b, sample(b, size=6, replace=FALSE))
colC <- c(a, b)
df <- data.frame(cbind(colB, colC=colC))
write.table(df, file = "data")
data <- read.table("data", header=TRUE)
data
ggplot(data=data, aes(x=colB, xmin=1, xmax=30)) + stat_bin(binwidth = 1)
ggplot(data=data, aes(x=colC, xmin=1, xmax=30)) + stat_bin(binwidth = 1)
# if you want density distribution, then you can try something like this:
ggplot(data=data, aes(x=colB, y = ..density.., xmin=1, xmax=30)) + stat_bin(binwidth = 1)
ggplot(data=data, aes(x=colC, y = ..density.., xmin=1, xmax=30)) + stat_bin(binwidth = 1)
HTH
-Sathish
