Assigning colours to data sets - r

I've got the following code that generates a random data set with a graph of the following,
x1=abs(rnorm(200))
x2=abs(rnorm(200))-7*x1^2
plot(x1,x2)
My goal is to separate the data so that the first 100 points are blue and the remaining 100 points are red in a data.frame. So I have two quick questions,
1) How do I separate the data so as I move along x1 the first 100 points are blue and the other are red? I've added an image below for clarification, mind my artistic talent with the snipping tool.
2) If after the colours are assigned, is a simple z=data.frame(x1,x2, colours) enough to get the data into a dataset so that I may run the data using some basic machine learning tools, such as SVM, Bagging and Boosting?
Cheers for the help.

set.seed(42)
dat <- data.frame(x1 = abs(rnorm(200)))
dat$x2 <- abs(rnorm(200)) - 7*dat$x1^2
dat$col <- ifelse(rank(dat$x1) <= 100, "blue", "red")
plot(x2 ~ x1, data = dat, col = col)
# also: plot(dat$x1, dat$x2, col = dat$col)
The "first 100" is subjective depending on your needs and the context of the data. One might also want the euclidean distance from origin (pythagorean), manhattan distance, or some other valuation. Or x1 <= mean(x1) or x1 <= median(x1). Lots of ways, this is just one way, where we use ifelse to differentiate/assign.

Related

How to make a range of points on a graph i n r a different colour

I have made a graph with the year on the x-axis and sea level rise on the y-axis.
I am trying to make the data from 2025 (predictions) a different colour to that before 2025.
I have grouped and labeled the predicted data using this code and have also included the code for my graph
predictions=data[which(data$Year>2024 & data$Year<2121),]
plot(data$Sea.Level..cm.~data$Year,xlab="Year",ylab="Sea Level (cm)",pch=21,col=c("Blue"))
How do I go from here in making the predictions red but the previous data blue?
Thanks in advance
You can try something like this. Toy data used.
vec <- c( rep(2001,10), rep(2002, 3) )
tf <- (vec < 2002) + 1
barplot( 1:length(vec), vec, col=c("red","blue")[tf], names=vec )
You can provide a vector of colours to plot in the col argument. For example, following your existing syntax:
predictions <- rep("red", nrow(data))
predictions[which(data$Year>2024 & data$Year<2121)] <- "blue"
plot(data$Sea.Level..cm.~data$Year,xlab="Year",ylab="Sea Level (cm)",pch=21,col=predictions)

How can I fit a smooth hysteresis in R?

I have a measurment of which should fit an hysteresis. For visualisation purpose I would like to plot a line approximating the hysteresis to help explain this pattern.
I created an example in the following image using the code below.
I would like to have an output similar to the green curve - however I don't have this data directly available, and I don't care whether it is pointy.
However most smoothing functions such as smooth.spline which I plotted in blue - allow no loops. The closest I can find is from the bezier library - plotted in red. Not nicely visible here but it produces a loop, however it fits poorly (and gives some warnings and takes quite some time).
Can you suggest a method?
set.seed(12345)
up <- seq(0,1,length.out=100)^3
down <- sqrt(seq(1,0,length.out=100))
x <- c(seq(0,1,length.out=length(up)),
seq(1,0, length.out=length(down)))
data <- data.frame(x=x, y=c(up,down),
measuredx=x + rnorm(length(x))*0.01,
measuredy=c(up,down) + rnorm(length(up)+length(down))*0.03)
with(data,plot(measuredx,measuredy, type = "p"))
with(data,lines(x,y, col='green'))
sp <- with(data,smooth.spline(measuredx, measuredy))
with(sp, lines(x,y, col="blue"))
library(bezier)
bf <- bezierCurveFit(as.matrix(data[,c(1,3)]))
lines(bezier(t=seq(0, 1, length=500), p=bf$p), col="red", cex=0.25)
UPDATE
As it turns out my actual problem is slightly different I ask another question to reflect my actual issue in the question: How to fit a smooth hysteresis in a poorly distributed data set?
set.seed(12345)
up <- seq(0,1,length.out=100)^3
down <- sqrt(seq(1,0,length.out=100))
x <- c(seq(0,1,length.out=length(up)),
seq(1,0, length.out=length(down)))
data <- data.frame(x=x, y=c(up,down),
measuredx=x + rnorm(length(x))*0.01,
measuredy=c(up,down) + rnorm(length(up)+length(down))*0.03)
Instead of smoothing data$measuredy directly over data$measuredx, do two separate smoothing, by smoothing each against a time stamp variable. Then combine the fitted values from two smoothing. This is a general way for smoothing a closed curve or a loop. (See also Q & A: Smoothing Continuous 2D Points)
t <- seq_len(nrow(data) + 1)
xs <- smooth.spline(t, c(data$measuredx, data$measuredx[1]))$y
ys <- smooth.spline(t, c(data$measuredy, data$measuredy[1]))$y
with(data, plot(measuredx, measuredy))
lines(xs, ys)
c(data$measuredx, data$measuredx[1]) for example is just to ensure that the last value in the vector agrees with the first, so that it completes a cycle.
The curve is not really closed at the bottom left corner, because smooth.spline is doing smoothing not interpolation, so even if we have ensure that data vector completes a cycle, the fitted one may not be a closed one. A practical workaround is to use weighted regression, imposing heavy weight on this spot to make it closed.
t <- seq_len(nrow(data) + 1)
w <- rep(1, length(t)) ## initially identical weight everywhere
w[c(1, length(w))] <- 100000 ## give heavy weight
xs <- smooth.spline(t, c(data$measuredx, data$measuredx[1]), w)$y
ys <- smooth.spline(t, c(data$measuredy, data$measuredy[1]), w)$y
with(data, plot(measuredx, measuredy), col = 8)
lines(xs, ys, lwd = 2)

Identify spikes/peaks in density plot by group

I created a density plot with ggplot2 package for R. I would like to identify the spikes/peaks in the plot which occur between 0.01 and 0.02. There are too many legends to pick it out so I deleted all legends. I tried to filter my data out to find most number of rows that a group has between 0.01 and 0.02. Then I filtered out the selected group to see whether the spike/peak is gone but no, it is there plotted still. Can you suggest a way to identify these spikes/peaks in these plots?
Here is some code :
ggplot(NumofHitsnormalized, aes(NumofHits_norm, fill = name)) + geom_density(alpha=0.2) + theme(legend.position="none") + xlim(0.0 , 0.15)
## To filter out the data that is in the range of first spike
test <- NumofHitsnormalized[which(NumofHitsnormalized$NumofHits_norm > 0.01 & NumofHitsnormalized$NumofHits_norm <0.02),]
## To figure it out which group (name column) has the most number of rows ##thus I thought maybe I could get the data that lead to spike
testMatrix <- matrix(ncol=2, nrow= length(unique(test$name)))
for (i in 1:length(unique(test$name))){
testMatrix[i,1] <- unique(test$name)[i]
testMatrix[i,2] <- nrow(unique(test$name)[i])}
Konrad,
This is the new plot made after I filtered my data out with extremevalues package. There are new peaks and they are located at different intervals and it also says 96% of the initial groups have data in the new plot (though number of rows in filtered data reduced to 0.023% percent of the initial dataset) so I cant identify which peaks belong to which groups.
I had a similar problem to this.
How i did was to create a rolling mean and sd of the y values with a 3 window.
Calculate the average sd of your baseline data ( the data you know won't have peaks)
Set a threshold value
If above threshold, 1, else 0.
d5$roll_mean = runMean(d5$`Current (pA)`,n=3)
d5$roll_sd = runSD(x = d5$`Current (pA)`,n = 3)
d5$delta = ifelse(d5$roll_sd>1,1,0)
currents = subset(d5,d5$delta==1,na.rm=TRUE) # Finds all peaks
my threshold was a sd > 1. depending on your data you may want to use mean or sd. for slow rising peaks mean would be a better idea than sd.
Without looking at the code, I drafted this simple function to add TRUE/FALSE flags to variables indicating outliers:
GenerateOutlierFlag <- function(x) {
# Load required packages
Vectorize(require)(package = c("extremevalues"), char = TRUE)
# Run check for ouliers
out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iLeft,
TRUE,FALSE)
out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iRight,
TRUE,out_flg)
return(out_flg)
}
If you care to read about the extremevalues package you will see that it provides some flexibility in terms of identifying outliers but broadly speaking it's a good tool for finding various peaks or spikes in the data.
Side point
You could actually optimise it significantly by creating one object corresponding to getOutliers(x, method = "I") instead of calling the method twice.
More sensible syntax
GenerateOutlierFlag <- function(x) {
# Load required packages
require("extremevalues")
# Outliers object
outObj <- getOutliers(x, method = "I")
# Run check for ouliers
out_flg <- ifelse(1:length(x) %in% outObj$iLeft,
TRUE,FALSE)
out_flg <- ifelse(1:length(x) %in% outObj$iRight,
TRUE,out_flg)
return(out_flg)
}
Results
x <- c(1:10, 1000000, -99099999)
table(GenerateOutlierFlag(x))
FALSE TRUE
10 2

Long vector-plot/Coverage plot in R

I really need your R skills here. Been working with this plot for several days now. I'm a R newbie, so that might explain it.
I have sequence coverage data for chromosomes (basically a value for each position along the length of every chromosome, making the length of the vectors many millions). I want to make a nice coverage plot of my reads. This is what I got so far:
Looks alright, but I'm missing y-labels so I can tell which chromosome it is, and also I've been having trouble modifying the x-axis, so it ends where the coverage ends. Additionally, my own data is much much bigger, making this plot in particular take extremely long time. Which is why I tried this HilbertVis plotLongVector. It works but I can't figure out how to modify it, the x-axis, the labels, how to make the y-axis logged, and the vectors all get the same length on the plot even though they are not equally long.
source("http://bioconductor.org/biocLite.R")
biocLite("HilbertVis")
library(HilbertVis)
chr1 <- abs(makeRandomTestData(len=1.3e+07))
chr2 <- abs(makeRandomTestData(len=1e+07))
par(mfcol=c(8, 1), mar=c(1, 1, 1, 1), ylog=T)
# 1st way of trying with some code I found on stackoverflow
# Chr1
plotCoverage <- function(chr1, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr1[start:end]), type="l")
}
plotCoverage(chr1, start=1, end=length(chr1)) # Plots coverage result.
# Chr2
plotCoverage <- function(chr2, start, end) { # Defines coverage plotting function.
plot.new()
plot.window(c(start, length(chr1)), c(0, 10))
axis(1, labels=F)
axis(4)
lines(start:end, log(chr2[start:end]), type="l")
}
plotCoverage(chr2, start=1, end=length(chr2)) # Plots coverage result.
# 2nd way of trying with plotLongVector
plotLongVector(chr1, bty="n", ylab="Chr1") # ylab doesn't work
plotLongVector(chr2, bty="n")
Then I have another vector called genes that are of special interest. They are about the same length as the chromosome-vectors but in my data they contain more zeroes than values.
genes_chr1 <- abs(makeRandomTestData(len=1.3e+07))
genes_chr2 <- abs(makeRandomTestData(len=1e+07))
These gene vectors I would like plotted as a red dot under the chromosomes! Basically, if the vector has a value there (>0), it is presented as a dot (or line) under the long vector plot. This I have not idea how to add! But it seems fairly straightforward.
Please help me! Thank you so much.
DISCLAIMER: Please do not simply copy and paste this code to run off the entire positions of your chromosome. Please sample positions (for example, as #Gx1sptDTDa shows) and plot those. Otherwise you'd probably get a huge black filled rectangle after many many hours, if your computer survives the drain.
Using ggplot2, this is really easily achieved using geom_area. Here, I've generated some random data for three chromosomes with 300 positions, just to show an example. You can build up on this, I hope.
# construct a test data with 3 chromosomes and 100 positions
# and random coverage between 0 and 500
set.seed(45)
chr <- rep(paste0("chr", 1:3), each=100)
pos <- rep(1:100, 3)
cov <- sample(0:500, 300)
df <- data.frame(chr, pos, cov)
require(ggplot2)
p <- ggplot(data = df, aes(x=pos, y=cov)) + geom_area(aes(fill=chr))
p + facet_wrap(~ chr, ncol=1)
You could use the ggplot2 package.
I'm not sure what exactly you want, but here's what I did:
This has 7000 random data points (about double the amount of genes on Chromosome 1 in reality). I used alpha to show dense areas (not many here, as it's random data).
library(ggplot2)
Chr1_cov <- sample(1.3e+07,7000)
Chr1 <- data.frame(Cov=Chr1_cov,fil=1)
pl <- qplot(Cov,fil,data=Chr1,geom="pointrange",ymin=0,ymax=1.1,xlab="Chromosome 1",ylab="-",alpha=I(1/50))
print(pl)
And that's it. This ran in less than a second. ggplot2 has a humongous amount of settings, so just try some out. Use facets to create multiple graphs.
The code beneath is for a sort of moving average, and then plotting the output of that. It is not a real moving average, as a real moving average would have (almost) the same amount of data points as the original - it will only make the data smoother. This code, however, takes an average for every n points. It will of course run quite a bit faster, but you will loose a lot of detailed information.
VeryLongVector <- sample(500,1e+07,replace=TRUE)
movAv <- function(vector,n){
chops <- as.integer(length(vector)/n)
count <- 0
pos <- 0
Cov <-0
pos[1:chops] <- 0
Cov[1:chops] <- 0
for(c in 1:chops){
tmpcount <- count + n
tmppos <- median(count:tmpcount)
tmpCov <- mean(vector[count:tmpcount])
pos[c] <- tmppos
Cov[c] <- tmpCov
count <- count + n
}
result <- data.frame(pos=pos,cov=Cov)
return(result)
}
Chr1 <- movAv(VeryLongVector,10000)
qplot(pos,cov,data=Chr1,geom="line")

Indexing Issues

I've been trying to plot the difference between two sets of information (the residuals). Both sets of data have similar (yet different) characteristics, and both data sets go from 0 to the same X value. The only inconsistency is that they are indexed differently, so while the first graph reaches X in A steps, the second reaches X in B steps. Thus, I cannot simply subtract the dependent variable values of one data frame from the other. I am speaking in very general terms, so I've provided a simple example. I want to plot the residuals between two data sets that look like this:
data1 <- data.frame(x1=c(1,2,3,4,5,6), y1=c(10,5,7,3,2,4))
data2 <- data.frame(x2=c(1,3,6), y2=c(1,3,2))
plot(data1, y1 ~ x1, type = 'l', lty = 1, col = 'blue', xlim = c(1,6), ylim = c(0,10))
points(data2$y2 ~ data2$x2, type = 'l', lty = 1, col = 'red')
So I guess my question is:
How can I plot the residuals of two functions (like the above) that are indexed differently. Is there a function that will solve for the residuals between the two data sets?
EDIT1: The example was faulty, Spacedman helped me to rectify this.
If a linear interpolation is good enough, you can use approx to interpolate at a bunch of X coordinates. EG:
> xout = sort(unique(c(seq(1,6,len=100),data1$x1,data2$x2))) # include data coords (untested)
> d1 = approx(data1$x1,data1$y1,xout)
> d2 = approx(data2$x2,data2$y2,xout)
> plot(xout,d1$y-d2$y,type="l")

Resources