Count Points in R scatter plot above/below threshold - r

I have a scatter plot of two RNA-seq samples normalized to a log2 scale, and I am working in R. I expect most points to fall near the x = y line, but I am interested in finding the number of points that fall outside a band around that line, to eliminate the 'noise'. How do I get the count of points above and below a certain threshold (i.e. the red lines, y = x + 1 and y = x - 1)?
plot(log2(data$SRR850589_sorted/19108931*1000000+.5),
log2(data$SRR850604_sorted/22989410*1000000+.5),
xlab="log2(SRR850589)", ylab="log2(SRR850604)")
abline(0,1,col='cyan')
abline(1,1,col='red')
abline(-1,1,col='red')

For each x,y pair, use the fitted function to determine how far each y-value is from the fitted line at each x-value. For example, let's assume we want to mark points that are at least 5 units away from the fitted line (in the vertical direction):
set.seed(10)
x = runif(100, 0, 10)
y = 2*x + 5 + rnorm(100,0,5)
m1 = lm(y ~ x)
To get the count of points more than 5 units away:
table(abs(y - (coef(m1)[1] + coef(m1)[2]*x)) > 5)
FALSE TRUE
63 37
To plot the points:
plot(x,y,type="n")
abline(coef(m1))
abline(coef(m1) + c(5,0), col="red")
abline(coef(m1) - c(5,0), col="red")
points(x,y, pch=16, col=ifelse(abs(y - (coef(m1)[1] + coef(m1)[2]*x)) > 5, "blue","red"))
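For the case in the question, where the reference is the y = x line rather than a fitted line, the same idea reduces to comparing y - x with the threshold. The following is a minimal sketch on simulated data; x, y, and threshold are stand-ins I made up, not the asker's columns:

```r
# Simulated stand-ins for the two log2-normalized samples (hypothetical data)
set.seed(1)
x <- rnorm(1000, mean = 5, sd = 2)
y <- x + rnorm(1000, mean = 0, sd = 0.8)

threshold <- 1   # half-width of the band around y = x (the red lines)
d <- y - x       # vertical offset of each point from the y = x line

n_above  <- sum(d >  threshold)        # points above y = x + 1
n_below  <- sum(d < -threshold)        # points below y = x - 1
n_inside <- sum(abs(d) <= threshold)   # points inside the band

c(above = n_above, below = n_below, inside = n_inside)
```

With the real data, x and y would be the two log2-transformed columns passed to plot() in the question.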

Related

How to graph a wave with increasing wavelengths?

I was looking at a bunch of electromagnetic spectrum diagrams and realized that the wave they show never actually corresponds with the given wavelengths in the diagram. For example, the wavelength of the wave at infrared (lambda = 800 nm) should appear 800 times longer than the wavelength at X-ray (lambda = 1 nm).
How can I plot a wave in R such that its wavelength increases in proportion to the wavelength given on the x-axis? i.e. f(x) = lambda and f(n * x) = n * lambda
Simple example just over a small part of the electromagnetic spectrum:
library(ggplot2)
# Make a simple dataset
max.x <- 800 # Maximum wavelength in nanometers (nm) to show
steps <- 10000 # More steps make plotting smoother
waveforms <- as.data.frame(matrix(data=NA, nrow=steps+1, ncol=2, dimnames=list(NULL, c("x", "y"))))
waveforms$x <- seq(0,max.x, by=max.x/steps)
# We can make a wave with periods that increase easily
waveforms$y <- sin(sqrt(waveforms$x))
ggplot()+
geom_line(data=waveforms, aes(x=x, y=y), color="red")+
scale_x_continuous("Wavelength in nanometers (nm)")
...but just eyeballing the graph, a wavelength at the low end of the spectrum isn't 800 times shorter than a wavelength at the upper end. What is the actual formula that would make the wavelengths proportional?
I think you need the cumulative sum of the reciprocal of wavelength (obviously you have to skip 0 because otherwise your cumulative sum would be infinite):
waveforms$y <- c(0, sin(cumsum(1 / waveforms$x[-1])))
ggplot()+
geom_line(data=waveforms, aes(x=x, y=y), color="red")+
scale_x_continuous("Wavelength in nanometers (nm)")
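The cumulative sum above approximates an integral of 1/x, so a closed-form version is also possible. This is a sketch under my own assumptions: the axis starts at x_min = 1 nm instead of 0, and the phase is scaled by 2*pi so that the local wavelength (2*pi divided by the rate of change of the phase) equals x itself. It uses base graphics rather than ggplot2:

```r
# Phase whose derivative is 2*pi/x, so the local wavelength at x is x:
# phase(x) = 2*pi * integral from x_min to x of (1/s) ds = 2*pi * log(x / x_min)
x_min <- 1     # assumed lower bound in nm (must be > 0)
x_max <- 800   # upper bound in nm, as in the question
x <- seq(x_min, x_max, length.out = 100001)

phase <- 2 * pi * log(x / x_min)
y <- sin(phase)

plot(x, y, type = "l", col = "red",
     xlab = "Wavelength in nanometers (nm)", ylab = "")
```

Between 1 nm and 800 nm this gives only log(800), roughly 6.7, full cycles, which is a consequence of making each cycle as long as the wavelength it sits at.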

Erratic behavior of a density plot

I have two numerical variables whose densities I estimated with the density command in R. The code is the following:
d0<-density(T0,n=2^14)
df_density0<-data.frame(x=d0$x,y=d0$y,stringsAsFactors = FALSE)
d1<-density(T1,n=2^14)
df_density1<-data.frame(x=d1$x,y=d1$y,stringsAsFactors = FALSE)
Initially, I had left the number of equally spaced points n at the default value 512, but then I realized that the area under a density plot d1 was not equal to 1 (it was around 13). Then I selected the proper number n in order to obtain an AUC (area under curve) near to 1 for both the density plots. In this way:
library(zoo)
x <- df_density0$x
y <- df_density0$y
id <- order(x)
AUC0 <- sum(diff(x[id])*rollmean(y[id],2))
x <- df_density1$x
y <- df_density1$y
id <- order(x)
AUC1 <- sum(diff(x[id])*rollmean(y[id],2))
For n=2^14 I obtained AUC0 and AUC1 equal to 0.9999... I plotted these density curves and I obtained the following graphic (blue is for T0 and red is for T1):
Since these graphs were indistinguishable from each other, I selected a logarithmic scale for the vertical axis:
Is it possible that I get such a result? Should I change the kernel field in the density function?
P.S. To draw these graphs I exported d0 and d1 to two CSV files and imported them into LaTeX. I obtained the same plots in R. Moreover,
> nrow(T0)
[1] 9760
> nrow(T1)
[1] 1963
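As a point of comparison for the AUC calculation above, here is a small base-R check of the same trapezoidal rule without zoo. It runs on simulated values, since T0 and T1 are not available; sample_values is a hypothetical stand-in:

```r
set.seed(42)
sample_values <- rnorm(2000)   # hypothetical stand-in for T0 or T1

d <- density(sample_values)    # default grid of n = 512 points

# Trapezoidal rule: interval width times the mean of the two end heights
auc <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
auc
```

For well-behaved data like this the default grid already integrates to about 1, so an area far from 1 at n = 512 may point to something specific in the data (for example a few extreme values stretching the grid) rather than to the integration method.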

Calculating the distance along a line that each point would intersect at

I would like to fit a line through two points from a random distribution of points, then calculate the location along that line at which each point's orthogonal projection falls. I am not interested in the residual distance of each point from the line (points above/below the line are treated equally); I am only interested in the location along the line where that point would intersect it (e.g. points at different distances from the line but at the same orthogonal location would have the same value). The data aren't connected to the line explicitly, as the abline is drawn from the location of only 2 points, so I can't extract these values in a classic residual type way. I don't think this is difficult, but I can't wrap my head around how to calculate it and it's really bugging me!
I have explored the dist2d function, but that calculates the orthogonal distance of each point to the line. Is there a way to use that value to then calculate the hypotenuse from the data point to some fixed constant point on the line, and then in turn calculate the adjacent distance from that constant? I would really appreciate any help!
#here is some example starter code to visualise what I mean
#get random data
r = rnorm(100)
t = rnorm(100)
#bind and turn into a df
data = cbind(r,t)
data = as.data.frame(data)
head(data)
#plot
plot(data)
#want to draw abline between 2 points
#isolate points of interest
#here randomly select first two rows
d = data[c(1:2),]
head(d)
#calculate abline through selected points
fit = lm(t ~ r, d) #named fit so the lm() function is not shadowed
abline(fit)
#draw points to see which ones they cut through
points(d$r, d$t, bg = "red", pch = 21)
This code below works.
# Create dataframe
data = data.frame(x = rnorm(100), y = rnorm(100))
plot(data, xlim=c(-3, 3), ylim=c(-3, 3))
# Select two points
data$x1_red <- data[1,1]; data$y1_red <- data[1,2]; data$x2_red <- data[2,1]; data$y2_red <- data[2,2];
points(data$x1_red, data$y1_red, bg = "red", pch = 21); points(data$x2_red, data$y2_red, bg = "red", pch = 21);
# Show a red line where the points intersect
# Get its slope (m_red) and intercept (b_red)
data$m_red <- (data[2,2] - data[1,2]) / (data[2,1] - data[1,1])
data$b_red <- data$y1_red - data$m_red * data$x1_red
abline(data$b_red, data$m_red, col='red')
# Calculate the orthogonal slope
data$m_blue <- (-1/data$m_red)
abline(0, data$m_blue, col='blue')
# Solve for each point's b-intercept (if using the blue slope)
# y = m_blue * x + b
# b = y - m_blue * x
data$b <- data$y - data$m_blue * data$x
# Solve for where each point (using the m_blue slope) intersects the red line (x' and y')
# y' = m_blue * x' + b
# y' = m_red * x' + b_red
# Set those equations equal to each other and solve for x'
data$x_intersect <- (data$b_red - data$b) / (data$m_blue - data$m_red)
# Then solve for y'
data$y_intersect <- data$m_blue * data$x_intersect + data$b
# Calculate the distance between the point and where it intersects the red line
data$dist <- sqrt( (data$x - data$x_intersect)^2 + (data$y - data$y_intersect)^2 )
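The code above ends with each point's distance to the line, whereas the question asks for the position along the line. One way to get that position is a vector projection onto the line's direction. The sketch below is my own illustration, not part of the answer: pts, p1, p2, u, and the along column are made-up names, and the first selected point is arbitrarily taken as the zero of the line coordinate.

```r
set.seed(3)
pts <- data.frame(x = rnorm(100), y = rnorm(100))

p1 <- c(pts$x[1], pts$y[1])   # first selected point, used as the origin on the line
p2 <- c(pts$x[2], pts$y[2])   # second selected point
u  <- (p2 - p1) / sqrt(sum((p2 - p1)^2))   # unit vector along the line

# Signed distance along the line from p1 to each point's orthogonal projection
pts$along <- (pts$x - p1[1]) * u[1] + (pts$y - p1[2]) * u[2]

# Coordinates of the projected points on the line
pts$x_proj <- p1[1] + pts$along * u[1]
pts$y_proj <- p1[2] + pts$along * u[2]

head(pts)
```

Two points with the same along value project onto the same spot on the line regardless of how far they sit from it, and the dot-product form also works when the line is vertical, where a slope-based formula would divide by zero.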

Filling a curve with points that fit under the curve in R plot

I was wondering how I can efficiently (using short R code) fill a curve with points that can fill up the area under my curve?
I have tried something without success, here is my R code:
data = rnorm(1000) ## random data points to fill the curve
curve(dnorm(x), -4, 4) ## curve to be filled by "data" above
points(data) ## plotting the points to fill the curve
Here's a method that uses interpolation to ensure that the plotted points won't exceed the height of the curve (although, if you want the actual point markers to not stick out above the curve, you'll need to set the threshold slightly below the height of the curve):
# Curve to be filled
c.pts = as.data.frame(curve(dnorm(x), -4, 4))
# Generate 1000 random points in the same x-interval and with y value between
# zero and the maximum y-value of the curve
set.seed(2)
pts = data.frame(x=runif(1000,-4,4), y=runif(1000,0,max(c.pts$y)))
# Using interpolation, keep only those points whose y-value is less than y(x)
pts = pts[pts$y < approx(c.pts$x,c.pts$y,xout=pts$x)$y, ]
# Plot the points
points(pts, pch=16, col="red", cex=0.7)
A method for plotting exactly a desired number of points under a curve
Responding to @d.b's comment, here's a way to get exactly a desired number of points plotted under a curve:
First, let's figure out how many random points we need to generate over the entire plot region in order to get (roughly) a target number of points under the curve. We do this as follows:
Calculate the area under the curve as a fraction of the area of the rectangle bounded by zero and the maximum height of the curve on the vertical axis, and by the width of the curve on the horizontal axis.
The number of random points we need to generate is the target number of points, divided by the area ratio calculated above.
# Area ratio
aa = sum(c.pts$y*median(diff(c.pts$x)))/(diff(c(-4,4))*max(c.pts$y))
# Target number of points under curve
n.target = 1000
# Number of random points to generate
n = ceiling(n.target/aa)
But we need more points than this to ensure we get at least n.target, because random variation will result in fewer than n.target points about half the time, once we limit the plotted points to those below the curve. So we'll add an excess.factor in order to generate more points under the curve than we need, then we'll just randomly select n.target of those points to plot. Here's a function that takes care of the entire process for a general curve.
# Plot a specified number of points under a curve
pts.under.curve = function(data, n.target=1000, excess.factor=1.5) {
  # Area under curve as fraction of area of plot region
  aa = sum(data$y*median(diff(data$x)))/(diff(range(data$x))*max(data$y))
  # Number of random points to generate (rounded up to a whole number)
  n = ceiling(excess.factor*n.target/aa)
  # Generate n random points in x-range of the data and with y value between
  # zero and the maximum y-value of the curve
  pts = data.frame(x=runif(n,min(data$x),max(data$x)), y=runif(n,0,max(data$y)))
  # Using interpolation, keep only those points whose y-value is less than y(x)
  pts = pts[pts$y < approx(data$x,data$y,xout=pts$x)$y, ]
  # Randomly select n.target points (or all of them if fewer survived)
  pts = pts[sample(nrow(pts), min(n.target, nrow(pts))), ]
  # Plot the points
  points(pts, pch=16, col="red", cex=0.7)
}
Let's run the function for the original curve:
c.pts = as.data.frame(curve(dnorm(x), -4, 4))
pts.under.curve(c.pts)
Now let's test it with a different distribution:
# Curve to be filled
c.pts = as.data.frame(curve(df(x, df1=100, df2=20),0,5,n=1001))
pts.under.curve(c.pts, n.target=200)
n_points = 10000 #A large number
#Store curve in a variable and plot
cc = curve(dnorm(x), -4, 4, n = n_points)
#Generate n_points random points
p = data.frame(x = seq(-4,4,length.out = n_points), y = rnorm(n = n_points))
#OR p = data.frame(x = runif(n_points,-4,4), y = rnorm(n = n_points))
#Find the index of the interval of cc$x that each p$x falls into
p$ind = findInterval(p$x, cc$x)
#Only retain those points within the curve whose p$y are smaller than cc$y
p2 = p[p$y >= 0 & p$y < cc$y[p$ind],] #may need p[p$y < 0.90 * cc$y[p$ind],] or something
#Plot points
points(p2$x, p2$y)

Generating values of x,y for equally spaced x in an interval

I am new to working in R and I would like to generate values of x,y to plot for lowess smoothing. I would like to generate equally spaced x values in an interval for a given function.
For example, I would like to generate the values for the function:
f(x) = 5x^3 - 2x^2 -2x +1
in the interval of [-5,5].
(p.s. my background is in biology so I don't understand the technical things as well as I would like!)
You mean something like this?
f1 <- function(x) 5*x^3 - 2*x^2 - 2*x + 1
seqx <- seq(-5,5, by = 0.1)
plot(seqx, f1(seqx), pch = 20)
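Since the question mentions lowess smoothing, here is a short sketch that feeds equally spaced values into lowess(). The polynomial is written as stated in the question, 5x^3 - 2x^2 - 2x + 1; the number of points (101) and the smoother span f = 0.2 are arbitrary choices of mine:

```r
f <- function(x) 5 * x^3 - 2 * x^2 - 2 * x + 1

x <- seq(-5, 5, length.out = 101)   # 101 equally spaced values, step 0.1
y <- f(x)

smoothed <- lowess(x, y, f = 0.2)   # returns a list with components x and y

plot(x, y, pch = 20)
lines(smoothed, col = "red")
```

Using length.out fixes the number of points, while by = 0.1 (as in the answer above) fixes the spacing; for this interval both give the same grid.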
