Plot 3D regression surface using plot_ly - r

I am trying to plot a regression model for a data set with measurements of "mue" (the friction coefficient of a braking train), the speed of the train and the temperature of the train. I built a simple regression model using lm so I could test plotting with plot_ly. The attached plot shows the original data as blue markers, and the plotted surface doesn't look right. It should look more like a regression surface... I also plotted in 2D to make sure the regression actually works, and it does. I've posted the code below and am wondering if anyone here has any advice. I've been trying everything I can find online and none of it seems to be working. I think the issue might have to do with building a grid? I've tried that a few times, but I always get error messages about vectors not matching up, etc. I'd be happy to post that as well if needed. Thank you!
3D Regression Plot
2D Regression Plot
set.seed(123) # seed the random number generator
training.samples <- avg.frame$avg.mue %>%
createDataPartition(p = 0.8, list = FALSE) # pick 80 percent of data
train.data <- avg.frame[training.samples, ] # 80 percent is training data
test.data <- avg.frame[-training.samples, ] # 20 percent is test data
model_2 <- lm(avg.mue ~ avg.speed + avg.temp, data = train.data)
vals <- predict(model_2, train.data)
avg.mue <- matrix(vals, nrow = length(test.data$avg.speed), ncol = length(test.data$avg.temp))
plane <- avg.mue
p <- plot_ly(data = train.data, z = ~avg.mue, x = ~avg.speed, y = ~avg.temp, opacity = 0.6) %>%
add_markers()
p %>% add_surface(z = ~plane, x = ~avg.speed, y = ~avg.temp, showscale = FALSE) %>%
layout(showlegend = FALSE)
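On the grid idea mentioned in the question, here is a minimal sketch of one way to build the surface. It uses made-up data in place of avg.frame, and the names speed_seq, temp_seq and grid are illustrative, not from the original code. The assumption is that plotly's surface trace wants z as a matrix whose rows follow y and whose columns follow x, so the predictions are made on a regular grid and reshaped accordingly.

```r
# Made-up stand-in for train.data (hypothetical values, not the asker's data)
set.seed(1)
train.data <- data.frame(avg.speed = runif(50, 10, 100),
                         avg.temp  = runif(50, 20, 300))
train.data$avg.mue <- 0.4 - 0.001 * train.data$avg.speed +
  0.0002 * train.data$avg.temp + rnorm(50, sd = 0.01)

model_2 <- lm(avg.mue ~ avg.speed + avg.temp, data = train.data)

# Regular grid over the observed range of both predictors
speed_seq <- seq(min(train.data$avg.speed), max(train.data$avg.speed), length.out = 30)
temp_seq  <- seq(min(train.data$avg.temp),  max(train.data$avg.temp),  length.out = 25)
grid <- expand.grid(avg.speed = speed_seq, avg.temp = temp_seq)
grid$fit <- predict(model_2, newdata = grid)

# expand.grid varies avg.speed fastest, so filling by row gives
# plane[i, j] = prediction at temp_seq[i] (y) and speed_seq[j] (x)
plane <- matrix(grid$fit, nrow = length(temp_seq), ncol = length(speed_seq),
                byrow = TRUE)

# Only build the plot if plotly is installed
if (requireNamespace("plotly", quietly = TRUE)) {
  p <- plotly::plot_ly()
  p <- plotly::add_markers(p, data = train.data,
                           x = ~avg.speed, y = ~avg.temp, z = ~avg.mue)
  p <- plotly::add_surface(p, x = speed_seq, y = temp_seq, z = plane,
                           showscale = FALSE, opacity = 0.6)
}
```

Printing p displays the figure. The key difference from the code above is that the surface is predicted on a grid of x and y values rather than reshaped from predictions at the scattered training points.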

Related

Y Coordinates of Best Fit Curve in Ggplot2

I have managed to generate pseudotime vs gene expression plots in Monocle for individual markers using the following code:
library("monocle")
lung <- load_lung()
diff_test_res <- differentialGeneTest(
lung,
fullModelFormulaStr = "~genotype"
)
ordering_genes <- diff_test_res[diff_test_res$qval < 0.01, "gene_id"]
lung <- setOrderingFilter(lung, ordering_genes)
plot_ordering_genes(lung)
#> Warning: Transformation introduced infinite values in continuous y-axis
lung <- reduceDimension(
lung,
max_components = 2,
method = 'DDRTree'
)
lung <- orderCells(lung)
lung_expressed_genes <- fData(lung)[fData(lung)$num_cells_expressed >= 5, "gene_id"]
lung_filtered <- lung[lung_expressed_genes, ]
my_genes <- rownames(lung_filtered)[1:3]
lung_subset <- lung_filtered[my_genes, ]
plot_genes_in_pseudotime(lung_subset, color_by = "genotype")
The "plot_genes_in_pseudotime" function on the final line generates a best fit curve of the plotted data. I was wondering if the y coordinates of this curve can somehow be obtained for say, every 0.01 units along the pseudotime axis? You can find the code and example plots here: http://cole-trapnell-lab.github.io/monocle-release/docs/#trajectory-step-3-order-cells-along-the-trajectory
You can access the Pseudotime and "expectation" values that comprise the curve in plot$data (monocle just plots Pseudotime against spline-smoothed mean expression for the specified genes).
You can then use approxfun to do linear interpolation between those points and evaluate it on a grid of values along the range of pseudotime.
NOTE: I am not sure this is a sensible thing to do. Pseudotime is a fairly loose and woolly thing, and reading deeply into minute changes in pseudotime is likely to lead to pretty shaky conclusions.
In any case, if you're interested in using this type of approach I would just read the code on github as it should be fairly easy to reproduce the output.
options(stringsAsFactors = FALSE)
library("monocle")
lung <- load_lung()
#> Removing 4 outliers
diff_test_res <- differentialGeneTest(
lung,
fullModelFormulaStr = "~genotype"
)
ordering_genes <- diff_test_res[diff_test_res$qval < 0.01, "gene_id"]
lung <- setOrderingFilter(lung, ordering_genes)
lung <- reduceDimension(
lung,
max_components = 2,
method = 'DDRTree'
)
lung <- orderCells(lung)
lung_expressed_genes <- fData(lung)[fData(lung)$num_cells_expressed >= 5, "gene_id"]
lung_filtered <- lung[as.character(lung_expressed_genes), ]
my_genes <- rownames(lung_filtered)[1:3]
## Use only 1 gene here. Otherwise the plot data will include multiple genes
lung_subset <- lung_filtered["ENSMUSG00000000031.9", ]
p <- plot_genes_in_pseudotime(lung_subset, color_by = "genotype")
df <- p$data
fun <- approxfun(df$Pseudotime, df$expectation)
s <- seq(min(df$Pseudotime), max(df$Pseudotime), by = 0.01)
plot(s, fun(s))
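The interpolation step can also be tried on its own, without monocle. This is a minimal sketch in which df is a made-up stand-in for p$data, and curve_df is an illustrative name for the table of interpolated y coordinates.

```r
# Hypothetical stand-in for p$data: a handful of points on the fitted curve
df <- data.frame(Pseudotime  = c(0, 1, 2.5, 4),
                 expectation = c(1, 3, 2, 6))

# approxfun returns a function that interpolates linearly between the points
fun <- approxfun(df$Pseudotime, df$expectation)

# Evaluate the curve every 0.01 units along the pseudotime axis
s <- seq(min(df$Pseudotime), max(df$Pseudotime), by = 0.01)
curve_df <- data.frame(Pseudotime = s, expectation = fun(s))
head(curve_df)
```

Outside the observed pseudotime range, fun returns NA by default, which is why the grid is limited to min and max of the data.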

R - Inconsistent p-value in running Spearman correlation

My problem is that when I compute a running correlation, for some odd reason I do not get the same p-value for the same estimate/correlation values.
My target is to calculate a running Spearman correlation on two vectors in the same data.frame (subject1 and subject2 in the example below). In addition, my window (length of the vector) and stride (the jumps/steps between each window) are constant. As such, when looking at the formula below (from wiki) I should get the same critical t and hence the same p-value for the same Spearman correlation. This is because n stays the same (it's the same window size) and r is the same. However, my end p-value is different.
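For reference, the reasoning can be checked numerically. This is a minimal sketch assuming the formula meant is the usual t approximation for Spearman's rho, t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The helper name spearman_t_p is hypothetical, and the cross-check uses base R's cor.test rather than pspearman.

```r
# Two-sided p-value from the t approximation for Spearman's rho
spearman_t_p <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# The same r and the same window size n always give the same p-value
spearman_t_p(0.6, 20)
spearman_t_p(0.6, 20)

# Cross-check against base R on made-up data without ties
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)
rho <- cor(x, y, method = "spearman")
p_base <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value
all.equal(spearman_t_p(rho, 20), p_base)
```

Under this approximation the p-value depends only on r and n, so any difference in p-value between windows of the same size implies a difference in the estimate.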
#Needed pkgs
require(tidyverse)
require(pspearman)
require(gtools)
#Sample data
set.seed(528)
subject1 <- rnorm(40, mean = 85, sd = 5)
set.seed(528)
subject2 <- c(
lag(subject1[1:21]) - 10,
rnorm(n = 6, mean = 85, sd = 5),
lag(subject1[length(subject1):28]) - 10)
df <- data.frame(subject1 = subject1,
subject2 = subject2) %>%
rowid_to_column(var = "Time")
df[is.na(df)] <- subject1[1] - 10
rm(subject1, subject2)
#Function for Spearman
psSpearman <- function(x, y)
{
out <- pspearman::spearman.test(x, y,
alternative = "two.sided",
approximation = "t-distribution") %>%
broom::tidy()
return(data.frame(estimate = out$estimate,
statistic = out$statistic,
p.value = out$p.value))
}
#Running correlation along the subjects
dfRunningCor <- running(df$subject1, df$subject2,
fun = psSpearman,
width = 20,
allow.fewer = FALSE,
by = 1,
pad = FALSE,
align = "right") %>%
t() %>%
as.data.frame()
#Arranging the Results into easy to handle data.frame
Results <- do.call(rbind.data.frame, dfRunningCor) %>%
t() %>%
as.data.frame() %>%
rownames_to_column(var = "Win") %>%
gather(CorValue, Value, -Win) %>%
separate(Win, c("fromIndex", "toIndex")) %>%
mutate(fromIndex = as.numeric(substring(fromIndex, 2)),
toIndex = as.numeric(toIndex, 2)) %>%
spread(CorValue, Value) %>%
arrange(fromIndex) %>%
select(fromIndex, toIndex, estimate, statistic, p.value)
My problem is that when I plot the Results with the estimates (Spearman rho; estimate) against the window number (fromIndex) and color by the p-value, I should get something like a "tunnel"/"path" of the same color across the same area - I don't.
For example, in the picture below, points at the same height in the red circle should have the same color - but they don't.
Code for the graph:
Results %>%
ggplot(aes(fromIndex, estimate, color = p.value)) +
geom_line()
What I have found so far is that it might be due to:
1. Functions like Hmisc::rcorr() tend not to give the same p.value in small samples or with many ties. This is why I use pspearman::spearman.test, which from what I read here is supposed to solve this problem.
2. Small sample size - I tried using a bigger sample size. I still get the same problem.
3. I tried rounding my p values - I still get the same problem.
Thank you for your help!
Edit.
Could it be "pseudo" coloring by ggplot? Could it be that ggplot just carries the "last" color forward until the next point? Is that why I get "light blue" from point 5 to 6 but "dark blue" from point 7 to 8?
The results you obtain for the p.value variable are coherent with the estimate value.
You can check it as follows:
Results$orderestimate <- order(-abs(Results$estimate))
Results$orderp.value <- order(abs(Results$p.value))
identical(Results$orderestimate ,Results$orderp.value)
I don't think you should include a colour for the p.value in the graph, it is an unnecessary visual distraction and it is hard to interpret.
If I were you I would only display the p.value and perhaps include a point to indicate the sign of the estimate variable.
p <- Results %>%
ggplot(aes(fromIndex, p.value)) +
geom_line()
# If you want to display the sign of the estimate
Results$estimate.sign <- as.factor(sign(Results$estimate))
p+geom_point( aes(color = estimate.sign ))

R PCA makes graph that is fishy, can't ID why

Link to data as txt file here
I'm having trouble with this PCA. PC1 results appear binary, and I can't figure out why as none of my variables are binary.
df = bees
pca_dat_condition <- bees %>% ungroup() %>%
select(Length.1:Length.25, OBJECTID, Local, Elevation, Longitude,
Latitude, Cubital.Index) %>%
na.omit()
pca_dat_first <- pca_dat_condition %>% #remove the final nonnumerical information
select(-Local, -OBJECTID, -Elevation, -Longitude, -Latitude)
pca <- pca_dat_first%>%
scale() %>%
prcomp()
# add identifying information back into PCA data
pca_data <- data.frame(pca$x, Local=pca_dat_condition$Local, ID =
pca_dat_condition$OBJECTID, elevation = pca_dat_condition$Elevation,
Longitude = pca_dat_condition$Longitude, Latitude =
pca_dat_condition$Latitude)
ggplot(pca_data, aes(x=PC1, y=PC2, color = Latitude)) +
geom_point() +ggtitle("PC1 vs PC2: All Individuals") +
scale_colour_gradient(low = "blue", high = "red")
I'm not getting any error messages with the code, and when I look at the data frame nothing looks out of place. Should I be using a different function for the PCA? Any insight into why my graph may look like this?
Previously, I did the same PCA but for the average values for each Local (whereas this is each individual), and it came out as a normal PCA with no clear clustering. I don't understand why this problem would arise when looking at individual points. It's possible I merged some other data frames in a wonky way, but the structure of the dataset seems completely normal.
This is how the PCA looks.
bees <- read.csv(paste0("https://gist.githubusercontent.com/AkselA/",
"08a4e78a6a29a918ed597e9a32adc228/raw/",
"6d0005fad4cb91830bcf7087176283b18683e9cd/bees.csv"),
header=TRUE)
# bees <- bees[bees[,1] < 10,] # This will remove the three offending rows
bees <- na.omit(bees)
bees.cond <- bees[, grep("Length|OBJ|Loc|Ele|Lon|Lat|Cubi", colnames(bees))]
bees.first <- bees[, grep("Length|Cubi", colnames(bees))]
summary(bees.first)
par(mfrow=c(7, 4), mar=rep(1, 4))
q <- lapply(1:ncol(bees.first), function(x) {
h <- hist(scale(bees.first[, x]), plot=FALSE)
h$counts <- log1p(h$counts)
plot(h, main="", axes=FALSE, ann=FALSE)
legend("topright", legend=names(bees.first[x]),
bty="n", cex=0.8, adj=c(0, -2), xpd=NA)
})
bees.pca <- prcomp(bees.first, scale.=TRUE)
biplot(bees.pca)
Before removing the outliers
After
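The commented-out line in the code above suggests that a few extreme rows drive the result. Here is a minimal sketch of that effect with made-up data (not the bees data). The three shifted rows are a hypothetical stand-in for the offending rows.

```r
set.seed(42)
# Made-up data: 200 ordinary rows, 5 numeric variables
X <- matrix(rnorm(200 * 5), nrow = 200, ncol = 5)
# Three hypothetical "offending" rows with extreme values in every column
X[1:3, ] <- X[1:3, ] + 100

pca_raw <- prcomp(X, scale. = TRUE)
pc1 <- pca_raw$x[, 1]

# The outliers sit far from everything else on PC1,
# so the scores fall into two tight clumps
range(abs(pc1[1:3]))
range(abs(pc1[-(1:3)]))

# Dropping the extreme rows restores a continuous spread on PC1
pca_clean <- prcomp(X[-(1:3), ], scale. = TRUE)
summary(pca_clean$x[, 1])
```

With the extreme rows included, PC1 mostly encodes "outlier or not", which is one way a component can look binary even though no input variable is.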

How to plot numerous polygons in each data category?

I am working with ggplot to plot bivariate data in groups along with standard ellipses of these data using a separate set of tools. These return n=100 x,y coordinates that define each ellipse, and then for each group, I would like to plot about 10-25 ellipses.
Conceptually, how can this be achieved? I can plot a single ellipse easily using geom_polygon, but I am confused how to get the data organized to make it work so multiple ellipses are plotted and guides (color, fills, linetypes, etc.) are applied per group.
In the traditional R plotting, I could just keep adding lines using a for loop.
Thanks!
UPDATE: Here is a CSV containing 100 coordinates for a single ellipse.
Data
Let's say I have three groups of bivariate data to which the ellipse fitting has been applied: Green, Red, Blue. For each group, I'd like to plot several ellipses.
I don't know how I would organize the data so that it works in the long format preferred by ggplot and preserves the group affiliations. Would a list work?
UPDATE2:
Here is a csv of raw x and y data organized into two groups: river and lake
Data
The data plot like this:
test.data <- read.csv("ellipse_test_data.csv")
ggplot(test.data) +
geom_point(aes(x, y, color = group)) +
theme_classic()
I am using a package called SIBER, which will fit Bayesian ellipses to the data for comparing groups by ellipse area, etc. The output of the following creates a list with number of elements = number of groups of data, and each element contains a 6 x n (n=number of draws) for each fitted ellipse - first four columns are a covariance matrix Sigma in vector format and the last two are the bivariate means:
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^5 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 100 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.test <- siberMVN(siber.test, parms, priors)
First few rows of the first element in the list:
$`1.river`
Sigma2[1,1] Sigma2[2,1] Sigma2[1,2] Sigma2[2,2] mu[1] mu[2]
[1,] 1.2882740 2.407070e-01 2.407070e-01 1.922637 -15.52846 12.14774
[2,] 1.0677979 -3.997169e-02 -3.997169e-02 2.448872 -15.49182 12.37709
[3,] 1.1440816 7.257331e-01 7.257331e-01 4.040416 -15.30151 12.14947
I would like to be able to extract a random number of these ellipses and plot them with ggplot using alpha transparency.
The package SIBER has a function (addEllipse) to convert the '6 x n' entries to a set number of x and y points that define an ellipse, but I don't know how to organize that output for ggplot. I thought there might be an elegant way to do this all internally with ggplot.
The ideal output would be something like this, but in ggplot so the ellipses could match the aesthetics of the levels of data:
Here is some code to do this on the bundled demo dataset from SIBER.
In this example we try to create some plots of the multiple samples of the posterior ellipses using ggplot2.
library(SIBER)
library(ggplot2)
library(dplyr)
library(ellipse)
Fit a basic SIBER model to the example data bundled with the package.
# load in the included demonstration dataset
data("demo.siber.data")
#
# create the siber object
siber.example <- createSiberObject(demo.siber.data)
# Calculate summary statistics for each group: TA, SEA and SEAc
group.ML <- groupMetricsML(siber.example)
# options for running jags
parms <- list()
parms$n.iter <- 2 * 10^4 # number of iterations to run the model for
parms$n.burnin <- 1 * 10^3 # discard the first set of values
parms$n.thin <- 10 # thin the posterior by this many
parms$n.chains <- 2 # run this many chains
# define the priors
priors <- list()
priors$R <- 1 * diag(2)
priors$k <- 2
priors$tau.mu <- 1.0E-3
# fit the ellipses which uses an Inverse Wishart prior
# on the covariance matrix Sigma, and a vague normal prior on the
# means. Fitting is via the JAGS method.
ellipses.posterior <- siberMVN(siber.example, parms, priors)
# The posterior estimates of the ellipses for each group can be used to
# calculate the SEA.B for each group.
SEA.B <- siberEllipses(ellipses.posterior)
siberDensityPlot(SEA.B, xticklabels = colnames(group.ML),
xlab = c("Community | Group"),
ylab = expression("Standard Ellipse Area " ('\u2030' ^2) ),
bty = "L",
las = 1,
main = "SIBER ellipses on each group"
)
Now we want to create some plots of some sample ellipses from these distributions. We need to create a data.frame object of all the ellipses for each group. In this example we simply take the first 10 posterior draws, assuming them to be independent of one another, but you could take a random sample if you prefer.
# how many of the posterior draws do you want?
n.posts <- 10
# decide how big an ellipse you want to draw
p.ell <- 0.95
# for a standard ellipse use
# p.ell <- pchisq(1,2)
# a list to store the results
all_ellipses <- list()
# loop over groups
for (i in 1:length(ellipses.posterior)){
# a dummy variable to build in the loop
ell <- NULL
post.id <- NULL
for ( j in 1:n.posts){
# covariance matrix
Sigma <- matrix(ellipses.posterior[[i]][j,1:4], 2, 2)
# mean
mu <- ellipses.posterior[[i]][j,5:6]
# ellipse points
out <- ellipse::ellipse(Sigma, centre = mu , level = p.ell)
ell <- rbind(ell, out)
post.id <- c(post.id, rep(j, nrow(out)))
}
ell <- as.data.frame(ell)
ell$rep <- post.id
all_ellipses[[i]] <- ell
}
ellipse_df <- bind_rows(all_ellipses, .id = "id")
# now we need the group and community names
# extract them from the ellipses.posterior list
group_comm_names <- names(ellipses.posterior)[as.numeric(ellipse_df$id)]
# split them and convert to a matrix, NB byrow = TRUE
split_group_comm <- matrix(unlist(strsplit(group_comm_names, "[.]")),
nrow(ellipse_df), 2, byrow = TRUE)
ellipse_df$community <- split_group_comm[,1]
ellipse_df$group <- split_group_comm[,2]
ellipse_df <- dplyr::rename(ellipse_df, iso1 = x, iso2 = y)
Now to create the plots. First plot all the raw data as we want.
first.plot <- ggplot(data = demo.siber.data, aes(iso1, iso2)) +
geom_point(aes(color = factor(group):factor(community)), size = 2)+
ylab(expression(paste(delta^{15}, "N (\u2030)")))+
xlab(expression(paste(delta^{13}, "C (\u2030)"))) +
theme(text = element_text(size=15))
print(first.plot)
Now we can try to add the posterior ellipses on top and facet by group
second.plot <- first.plot + facet_wrap(~factor(group):factor(community))
print(second.plot)
# add the posterior ellipses, one polygon per posterior draw (group = rep)
third.plot <- second.plot +
geom_polygon(data = ellipse_df,
mapping = aes(iso1, iso2,
group = rep,
color = factor(group):factor(community),
fill = NULL),
fill = NA,
alpha = 0.2)
print(third.plot)
Facet-wrapped plot of sample of posterior ellipses by group
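Stripped of SIBER, the data layout behind this approach can be sketched as follows. Everything here is made up for illustration: the helper ellipse_points, the random covariance draws, and the column name poly_id. The idea is one long data frame with a row per ellipse point, a column for the data group, and an id that is unique per polygon.

```r
# Points on an ellipse for a 2x2 covariance matrix Sigma centred at mu
# (hypothetical helper; scale = 1 gives the standard ellipse)
ellipse_points <- function(Sigma, mu, n = 100, scale = 1) {
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- rbind(cos(theta), sin(theta))   # 2 x n unit circle
  L      <- t(chol(Sigma))                  # Sigma = L %*% t(L)
  pts    <- t(mu + scale * L %*% circle)    # n x 2, shifted to the centre
  data.frame(x = pts[, 1], y = pts[, 2])
}

set.seed(1)
groups  <- c("river", "lake")
n_draws <- 5
ell_list <- list()
for (grp in groups) {
  for (j in seq_len(n_draws)) {
    Sigma <- diag(2) * runif(1, 0.5, 2)     # made-up covariance draw
    mu    <- rnorm(2)                       # made-up centre
    e <- ellipse_points(Sigma, mu)
    e$group <- grp
    e$draw  <- j
    ell_list[[length(ell_list) + 1]] <- e
  }
}
ell_df <- do.call(rbind, ell_list)

# One id per polygon so ggplot knows which points belong together
ell_df$poly_id <- interaction(ell_df$group, ell_df$draw)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  plt <- ggplot(ell_df, aes(x, y, group = poly_id, colour = group)) +
    geom_polygon(fill = NA)
}
```

Mapping group to poly_id keeps each ellipse a separate polygon, while colour (or fill, linetype) is mapped to the data group so guides are built per group rather than per ellipse.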

Restrict fitted regression line (abline) to range of data used in model

Is it possible to draw an abline of a fit only in a certain range of x-values?
I have a dataset with a linear fit of a subset of that dataset:
# The dataset:
daten <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
# make a linear fit for the datapoints 3, 4, 5
daten_fit <- lm(formula = y~x, data = daten, subset = 3:5)
When I plot the data and draw a regression line:
plot (y ~ x, data = daten)
abline(reg = daten_fit)
The line is drawn for the full range of x-values in the original data. But, I want to draw the regression line only for the subset of data that was used for curve fitting. There were 2 ideas that came to my mind:
Draw a second line that is thicker, but is only shown in the range 3:5. I checked the parameters for abline, lines and segments but I could not find anything.
Add small ticks at the respective positions, perpendicular to the abline. I have no idea how I could do this. This would be the nicer way, of course.
Do you have any idea for a solution?
The answer is no: it is not possible to get abline() to draw the fitted line over only the part of the plot region where the model was fitted. This is because it uses only the model coefficients to draw the line, not predictions from the model. If you look closely, you'll see that the line drawn actually extends outside the plot region, covering the plot frame where it exits the region.
The simplest solution to such problems is to predict from the model for the regions you want.
# The dataset:
daten <- data.frame(x = c(0:6), y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
# make a linear fit for the datapoints 3, 4, 5
mod <- lm(y~x, data = daten, subset = 3:5)
First, we get the range of x values we want to differentiate:
xr <- with(daten, range(x[3:5]))
then we predict for a set of evenly spaced points on this range using the model:
pred <- data.frame(x = seq(from = xr[1], to = xr[2], length = 50))
pred <- transform(pred, yhat = predict(mod, newdata = pred))
Now plot the data and the model using abline():
plot(y ~ x, data = daten)
abline(mod)
then add in the region you want to emphasise:
lines(yhat ~ x, data = pred, col = "red", lwd = 2)
Which gives us this plot:
If you have a model that is more complex than that which can be handled by abline(), then we take a slightly different strategy, predicting over the range of the available, plotted data to draw the line, and then pick out the interval we want to highlight. The following code does that:
## range of all `x` data
xr2 <- with(daten, range(x))
## same as before
pred <- data.frame(x = seq(from = xr2[1], to = xr2[2], length = 100))
pred <- transform(pred, yhat = predict(mod, newdata = pred))
## plot the data and the fitted model line
plot(y ~ x, data = daten)
lines(yhat ~ x, data = pred)
## add emphasis to the interval used in fitting
with(pred, lines(yhat ~ x, data = pred, subset = x >= xr[1] & x <= xr[2],
lwd = 2, col = "red"))
What we do here is use the subset argument to pick out the values from the predictions that are in the interval used in fitting, the vector we pass to subset is a logical vector of TRUE and FALSE values indicating which data are in the region of interest and lines() only plots a line along those data.
R> head(with(pred, x >= xr[1] & x <= xr[2]))
[1] FALSE FALSE FALSE FALSE FALSE FALSE
One might wonder why I have done predictions over 50 or 100 evenly spaced values of the predictor variable when we could, in this case, just have predicted for the start and the end of the data or region of interest and joined the two points. Well, not all modelling exercises are that simple - your double log model from a previous question is a case in point - and the generic solution I outline above will work in all cases, whereas simply joining two predictions won't.
@Andrie has furnished you with a solution to Idea 2.
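For the straight-line case only, the two-point shortcut mentioned above can be sketched with segments(). This is a minimal illustration using the data from the question. The name ends is illustrative, and the shortcut is valid only because the fitted model is linear in x.

```r
# The dataset and the fit on data points 3, 4, 5
daten <- data.frame(x = 0:6, y = c(0.3, 0.1, 0.9, 3.1, 5, 4.9, 6.2))
mod <- lm(y ~ x, data = daten, subset = 3:5)

# Predict only at the two ends of the range used in fitting
xr <- range(daten$x[3:5])
ends <- predict(mod, newdata = data.frame(x = xr))

# Plot the data and join the two predictions with a single segment
plot(y ~ x, data = daten)
segments(xr[1], ends[1], xr[2], ends[2], col = "red", lwd = 2)
```

For a curved fit, two end points would draw a chord rather than the fitted curve, which is the reason for the denser grid of predictions.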
One way would be to use colours to distinguish between points that are fitted and those that aren't:
daten_fit <- lm(formula = y~x, data = daten[3:5, ])
plot(y ~ x, data = daten)
points(y ~ x, data = daten[3:5, ], col="red")
abline(reg=daten_fit, col="red")
The second way is to plot the tick marks on the x-axis. These ticks are called rugs, and can be drawn using the rug function. But first you have to calculate the range:
plot(y ~ x, data = daten)
#points(y ~ x, data = daten[3:5, ], col="red")
abline(reg=daten_fit, col="red")
rug(range(daten[3:5, 1]), lwd=3, col="red")
This is a somewhat basic plotting question -- use the ylim=c(low, high) option with suitable options for low and high.
You may want to read the "An Introduction to R" manual that came with your R version, and the other fine contributed documentation on the CRAN site.
