Y coordinates of best-fit curve in ggplot2 - r

I have managed to generate pseudotime vs gene expression plots in Monocle for individual markers using the following code:
library("monocle")
lung <- load_lung()
diff_test_res <- differentialGeneTest(
  lung,
  fullModelFormulaStr = "~genotype"
)
ordering_genes <- diff_test_res[diff_test_res$qval < 0.01, "gene_id"]
lung <- setOrderingFilter(lung, ordering_genes)
plot_ordering_genes(lung)
#> Warning: Transformation introduced infinite values in continuous y-axis
lung <- reduceDimension(
  lung,
  max_components = 2,
  method = 'DDRTree'
)
lung <- orderCells(lung)
lung_expressed_genes <- fData(lung)[fData(lung)$num_cells_expressed >= 5, "gene_id"]
lung_filtered <- lung[lung_expressed_genes, ]
my_genes <- rownames(lung_filtered)[1:3]
lung_subset <- lung_filtered[my_genes, ]
plot_genes_in_pseudotime(lung_subset, color_by = "genotype")
The "plot_genes_in_pseudotime" function on the final line generates a best fit curve of the plotted data. I was wondering if the y coordinates of this curve can somehow be obtained for say, every 0.01 units along the pseudotime axis? You can find the code and example plots here: http://cole-trapnell-lab.github.io/monocle-release/docs/#trajectory-step-3-order-cells-along-the-trajectory

You can access the Pseudotime and "expectation" values that make up the curve in plot$data (monocle just plots Pseudotime against spline-smoothed mean expression for the specified genes).
You can then use approxfun to do 1d interpolation and evaluate a grid of points along the range of pseudotime.
NOTE: I am not sure this is a sensible thing to do. Pseudotime is a fairly loose and woolly thing, and reading too much into minute changes in pseudotime is likely to lead to pretty shaky conclusions.
In any case, if you're interested in using this type of approach, I would just read the code on GitHub, as it should be fairly easy to reproduce the output.
options(stringsAsFactors = FALSE)
library("monocle")
lung <- load_lung()
#> Removing 4 outliers
diff_test_res <- differentialGeneTest(
  lung,
  fullModelFormulaStr = "~genotype"
)
ordering_genes <- diff_test_res[diff_test_res$qval < 0.01, "gene_id"]
lung <- setOrderingFilter(lung, ordering_genes)
lung <- reduceDimension(
  lung,
  max_components = 2,
  method = 'DDRTree'
)
lung <- orderCells(lung)
lung_expressed_genes <- fData(lung)[fData(lung)$num_cells_expressed >= 5, "gene_id"]
lung_filtered <- lung[as.character(lung_expressed_genes), ]
my_genes <- rownames(lung_filtered)[1:3]
## Use only 1 gene here. Otherwise the plot data will include multiple genes
lung_subset <- lung_filtered["ENSMUSG00000000031.9", ]
p <- plot_genes_in_pseudotime(lung_subset, color_by = "genotype")
df <- p$data
fun <- approxfun(df$Pseudotime, df$expectation)
s <- seq(min(df$Pseudotime), max(df$Pseudotime), by = 0.01)
plot(s, fun(s))
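If you want the coordinates themselves rather than a plot, you can pair them up in a data frame (using the fun and s objects defined just above):
# pair each pseudotime value with the interpolated expression value
curve_coords <- data.frame(Pseudotime = s, expectation = fun(s))
head(curve_coords)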

Related

Plot 3D regression surface using plot_ly

I am trying to plot a regression model for a data set with measurements of "mue" (the friction coefficient for braking a train), train speed, and train temperature. I built a simple regression model using lm so I could test plotting with plot_ly. The attached plot shows the blue markers of the original data, but the plotted surface doesn't look right; it should look more like a regression surface. I also plotted in 2D to make sure the regression actually works, and it does. I've posted the code below and am wondering if anyone here has any advice. I've been trying everything I can find online and none of it seems to work. I think the issue might have to do with building a grid? I've tried that a few times, but I always get error messages about vectors not matching up, etc. I'd be happy to post that as well if needed. Thank you!
3D Regression Plot
2D Regression Plot
library(caret)  # createDataPartition()
library(plotly) # plot_ly() and the %>% pipe
set.seed(123) # random number seed, for reproducibility
training.samples <- avg.frame$avg.mue %>%
  createDataPartition(p = 0.8, list = FALSE) # pick 80 percent of data
train.data <- avg.frame[training.samples, ] # 80 percent is training data
test.data <- avg.frame[-training.samples, ] # 20 percent is test data
model_2 <- lm(avg.mue ~ avg.speed + avg.temp, data = train.data)
vals <- predict(model_2, train.data)
avg.mue <- matrix(vals, nrow = length(test.data$avg.speed), ncol = length(test.data$avg.temp))
plane <- avg.mue
p <- plot_ly(data = train.data, z = ~avg.mue, x = ~avg.speed, y = ~avg.temp, opacity = 0.6) %>%
  add_markers()
p %>%
  add_surface(z = ~plane, x = ~avg.speed, y = ~avg.temp, showscale = FALSE) %>%
  layout(showlegend = FALSE)
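One thing worth trying (a sketch, not a tested answer; it assumes avg.frame has the avg.mue, avg.speed and avg.temp columns used above): predict over a regular grid of the two predictors and reshape the predictions into a matrix whose rows follow y (temp) and whose columns follow x (speed), which is the orientation add_surface expects.
# regular grid over the predictor ranges; the resolution of 50 is arbitrary
speed_seq <- seq(min(train.data$avg.speed), max(train.data$avg.speed), length.out = 50)
temp_seq <- seq(min(train.data$avg.temp), max(train.data$avg.temp), length.out = 50)
grid <- expand.grid(avg.speed = speed_seq, avg.temp = temp_seq)
# expand.grid varies avg.speed fastest, so fill the matrix by speed and then
# transpose: plotly wants z[j, i] to match y[j] (temp) and x[i] (speed)
z_mat <- t(matrix(predict(model_2, newdata = grid),
                  nrow = length(speed_seq), ncol = length(temp_seq)))
plot_ly(train.data, x = ~avg.speed, y = ~avg.temp, z = ~avg.mue, opacity = 0.6) %>%
  add_markers() %>%
  add_surface(x = speed_seq, y = temp_seq, z = z_mat,
              showscale = FALSE, inherit = FALSE) # don't inherit the marker aesthetics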

3D plot in r of variables from expand grid

I'm trying to plot a 3D plane from three variables. I've read many questions on the topic but haven't been able to find what I'm looking for.
I have two sets of variables:
prob <- seq(0,1,by=0.01)
n <- seq(999,9999, by = 1000)
n <- c(9,99,n)
combis <- expand.grid(prob,n)
which I then use to produce my results:
res <- apply(combis,1,calc,pos=pos)
where the values of res can be between 0 and 2/3.
So I'm trying to make a 3D plot where n and prob are x and z, and y is res. However, most packages I've found require matrices and the like, and I cannot get this to work.
Any help would be appreciated, and sorry if I haven't found the answer.
Assuming that res is just a vector, you can combine your data, reshape it into a wide-format matrix, and then plot with something like the lattice package:
prob <- seq(0, 1, by = 0.01)
n <- seq(999, 9999, by = 1000)
n <- c(9, 99, n)
combis <- expand.grid(prob, n)
res <- runif(n = nrow(combis), 0, 0.67) # generate sample data for res
dat <- cbind(combis, res)
library(reshape2)
datm <- acast(data = dat, Var1 ~ Var2, value.var = "res") # cast it into wide format
library(lattice)
library(latticeExtra)
# rows of datm are prob (Var1), columns are n (Var2), cell values are res
cloud(datm, panel.3d.cloud = panel.3dbars, xlab = "prob", ylab = "n", zlab = "res")
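If a continuous surface reads better than bars, the same matrix can be handed to lattice's wireframe (a small variation on the code above, reusing datm):
# wireframe uses the same row/column mapping as cloud(); drape colours by height
wireframe(datm, xlab = "prob", ylab = "n", zlab = "res", drape = TRUE)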

R PCA makes graph that is fishy, can't ID why

Link to data as txt file here
I'm having trouble with this PCA. PC1 results appear binary, and I can't figure out why as none of my variables are binary.
library(dplyr)   # %>%, select(), ungroup()
library(ggplot2)
df <- bees
pca_dat_condition <- bees %>% ungroup() %>%
  select(Length.1:Length.25, OBJECTID, Local, Elevation, Longitude,
         Latitude, Cubital.Index) %>%
  na.omit()
pca_dat_first <- pca_dat_condition %>% # remove the final nonnumerical information
  select(-Local, -OBJECTID, -Elevation, -Longitude, -Latitude)
pca <- pca_dat_first %>%
  scale() %>%
  prcomp()
# add identifying information back into PCA data
pca_data <- data.frame(pca$x,
                       Local = pca_dat_condition$Local,
                       ID = pca_dat_condition$OBJECTID,
                       elevation = pca_dat_condition$Elevation,
                       Longitude = pca_dat_condition$Longitude,
                       Latitude = pca_dat_condition$Latitude)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Latitude)) +
  geom_point() + ggtitle("PC1 vs PC2: All Individuals") +
  scale_colour_gradient(low = "blue", high = "red")
I'm not getting any error messages with the code, and when I look at the data frame nothing looks out of place. Should I be using a different function for the PCA? Any insight into why my graph may look like this?
Previously, I did the same PCA but for the average values for each Local (whereas this is each individual), and it came out as a normal PCA with no clear clustering. I don't understand why this problem would arise when looking at individual points. It's possible I merged some other data frames in a wonky way, but the structure of the dataset seems completely normal.
This is how the PCA looks.
bees <- read.csv(paste0("https://gist.githubusercontent.com/AkselA/",
                        "08a4e78a6a29a918ed597e9a32adc228/raw/",
                        "6d0005fad4cb91830bcf7087176283b18683e9cd/bees.csv"),
                 header=TRUE)
# bees <- bees[bees[,1] < 10,] # This will remove the three offending rows
bees <- na.omit(bees)
bees.cond <- bees[, grep("Length|OBJ|Loc|Ele|Lon|Lat|Cubi", colnames(bees))]
bees.first <- bees[, grep("Length|Cubi", colnames(bees))]
summary(bees.first)
# log-scaled histograms of each variable, to make stray outliers visible
par(mfrow=c(7, 4), mar=rep(1, 4))
q <- lapply(1:ncol(bees.first), function(x) {
  h <- hist(scale(bees.first[, x]), plot=FALSE)
  h$counts <- log1p(h$counts)
  plot(h, main="", axes=FALSE, ann=FALSE)
  legend("topright", legend=names(bees.first[x]),
         bty="n", cex=0.8, adj=c(0, -2), xpd=NA)
})
bees.pca <- prcomp(bees.first, scale.=TRUE)
biplot(bees.pca)
Before removing the outliers
After
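If you'd rather locate the offending rows programmatically than eyeball the histograms, something along these lines should work (a sketch: boxplot.stats flags points beyond 1.5 × IQR, applied here to the first column, which is what the commented-out filter above removes):
out <- boxplot.stats(bees[, 1])$out # extreme values of the first column
which(bees[, 1] %in% out)           # row indices of the outlying rows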

How to plot numerous polygons in each data category?

I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused about how to organize the data so that multiple ellipses are plotted and guides (color, fills, linetypes, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data in such a way that it works in the long format preferred by ggplot and preserves the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
geom_point(aes(x, y, color = group)) +
theme_classic()
I am using a package called SIBER, which will fit Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following creates a list with one element per group of data, and each element contains an n × 6 matrix (n = number of posterior draws) for the fitted ellipse: the first four columns are a covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert the n × 6 entries to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do it all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER. In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
                 xlab = c("Community | Group"),
                 ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
                 bty = "L",
                 las = 1,
                 main = "SIBER ellipses on each group")
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer.
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)){
  # a dummy variable to build in the loop
  ell <- NULL
  post.id <- NULL
  for (j in 1:n.posts){
    # covariance matrix
    Sigma <- matrix(ellipses.posterior[[i]][j, 1:4], 2, 2)
    # mean
    mu <- ellipses.posterior[[i]][j, 5:6]
    # ellipse points
    out <- ellipse::ellipse(Sigma, centre = mu, level = p.ell)
    ell <- rbind(ell, out)
    post.id <- c(post.id, rep(j, nrow(out)))
  }
  ell <- as.data.frame(ell)
  ell$rep <- post.id
  all_ellipses[[i]] <- ell
}
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = TRUE
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
                           nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
  geom_point(aes(color = factor(group):factor(community)), size = 2) +
  ylab(expression(paste(delta^{15}, "N (\u2030)"))) +
  xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
  theme(text = element_text(size = 15))
print(first.plot)
Now we can try to add the posterior ellipses on top and facet by group.
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# ellipse_df columns were renamed above to match the aesthetics
third.plot <- second.plot +
  geom_polygon(data = ellipse_df,
               mapping = aes(iso1, iso2,
                             group = rep,
                             color = factor(group):factor(community),
                             fill = NULL),
               fill = NA,
               alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group

Bootstrap a proportion (of factor levels) in ggplot2

I have a long format data frame where rows represent the responses (one of four categories) of different people. An example dataset is provided here:
df <- data.frame(person = c(rep("A", 100), rep("B", 100)),
                 resp = c(sample(4, 100, replace = TRUE),
                          sample(4, 100, replace = TRUE)))
df$resp <- factor(df$resp)
summary(df)
 person  resp
 A:100   1:52
 B:100   2:55
         3:54
         4:39
I want to present a chart where the x-axis plots the response category, the y-axis shows the proportion of responses in a category, and where error bars are calculated via bootstrapping (sampling with replacement).
I can calculate the proportion (in an extremely kludgy way; I'm sure this could be improved but this is not my main concern):
library(plyr)
pFrame <- ddply(df, .(person, resp), summarise, trials = length(resp))
# can't figure out how to calculate the proportion with plyr.
pFrame$prop <- NA
people <- unique(df$person)
responses <- unique(df$resp)
for (i in 1:length(people)){
  nTrials <- nrow(subset(df, person == people[i]))
  for (j in 1:4){
    pFrame$prop[pFrame$person == people[i] & pFrame$resp == responses[j]] <-
      pFrame$trials[pFrame$person == people[i] & pFrame$resp == responses[j]] / nTrials
  }
}
and plot it:
ggplot(pFrame, aes(x = resp, y = prop, colour = person)) + geom_point()
but I would really like to use something like stat_summary(fun.data = "mean_cl_boot") to show the variability in the proportions (i.e., acting on the original data frame df and bootstrapping over the rows). I've made a few attempts at creating custom functions, but this doesn't seem trivial, because the factor levels need to be transformed for the bootstrap first.
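As an aside, the proportion step the question calls kludgy can be written much more compactly; this is a sketch with dplyr that produces the same pFrame (assuming the df defined above):
library(dplyr)
pFrame <- df %>%
  count(person, resp, name = "trials") %>% # trials per person/response pair
  group_by(person) %>%
  mutate(prop = trials / sum(trials)) %>%  # proportion within each person
  ungroup()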
I couldn't get ggplot's "mean_cl_boot" to work. Here is an alternative solution though:
library(boot)
library(reshape2) # melt()
library(ggplot2)
summary_for_plot <- melt(prop.table(table(df), 1))
names(summary_for_plot) <- c("person", "resp", "V1")
# function for boot(): recompute the proportion table on each resample
summary_function <- function(df, d){
  melt(prop.table(table(df[d, ]), 1))[, 3]
}
bootres <- boot(df, statistic = summary_function, R = 100)
# get the standard deviation of each bootstrapped proportion, used for the
# confidence intervals (one column of bootres$t per row of summary_for_plot)
summary_for_plot$sd <- apply(bootres$t, 2, sd)
ggplot(summary_for_plot, aes(x = resp, y = V1, color = person)) + geom_point() +
  geom_errorbar(aes(ymin = V1 - sd, ymax = V1 + sd), width = 0.2)
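If you'd rather show percentile intervals than symmetric ±1 sd bars, the same bootres object can be summarised with quantiles instead (a small variation on the code above):
# 2.5% and 97.5% quantiles of each bootstrapped proportion
ci <- apply(bootres$t, 2, quantile, probs = c(0.025, 0.975))
summary_for_plot$lower <- ci[1, ]
summary_for_plot$upper <- ci[2, ]
ggplot(summary_for_plot, aes(x = resp, y = V1, color = person)) + geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)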
