Converting NMR ascii file to peak list - r

I have some Bruker NMR spectra that I am using to develop a program as part of a project. My program needs to work on the actual spectrum, so I converted the 1r files of the Bruker NMR spectra to ASCII. For carnitine, this is what the ASCII file looks like (this is only a snapshot; the complete list runs to thousands of lines):
-0.807434 -23644
-0.807067 -22980
-0.806701 -22967
-0.806334 -24513
-0.805967 -27609
-0.805601 -31145
-0.805234 -33951
-0.804867 -35553
-0.804501 -35880
-0.804134 -35240
-0.803767 -34626
-0.8034 -34613
-0.803034 -34312
-0.802667 -32411
-0.8023 -28925
-0.801934 -25177
-0.801567 -22132
-0.8012 -19395
and this is what the spectrum is:
(source: wisc.edu)
My program has to identify the peaks from this data, so I need to know how to interpret these numbers and how exactly they are converted into the corresponding values in the spectrum. So far this is what I have learnt:
1.) The first column represents the spectral point position (ppm).
2.) The second column represents the intensity at each point.
3.) Notice that in the second column there are some numbers which are not perfectly aligned but sit closer to the first column, e.g. -34613, -28925, -19395. I think this is significant.
For the sake of full disclosure: I am doing the programming in R.
NOTE: I have also asked this on Biostar, but I think I have a better chance of getting an answer here because not many people seem to be answering questions there.
EDIT: Here is one solution that I have found plausible:
A friend gave me the idea to use an awk script that checks where the slope of the intensities changes sign, in order to find the local maxima. Here is a working script:
awk 'BEGIN{dydx = 0;}
{
if(NR > 1)
{ dydx = ($2 - y0)/($1 - x0); }
if(NR > 2 && last * dydx < 0)
{ printf( "%.4f %.4f\n", (x0 + $1)/2, log((dydx<0)?-dydx:dydx)); } ;
last=dydx; x0=$1; y0=$2
}' /home/chaitanya/Work/nmr_spectra/caffeine/pdata/1/spectrumtext.txt | awk '$2 > 17'
Tell me if you don't understand it and I will improve the explanation.
Also, there is this related question I asked.
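For reference, the same sign-change idea can be sketched in R, the language the question targets. This is only a minimal sketch: the function name `local_maxima` and its `min_height` argument are made up for illustration, and the data are the first few rows of the snapshot above.

```r
# Hypothetical helper: strict local maxima of a sampled spectrum.
local_maxima <- function(ppm, intensity, min_height = -Inf) {
  s <- sign(diff(intensity))  # +1 rising, -1 falling, 0 flat
  # point i is a maximum when the slope into it is positive and the slope out is negative
  idx <- which(head(s, -1) > 0 & tail(s, -1) < 0) + 1
  idx <- idx[intensity[idx] >= min_height]
  data.frame(ppm = ppm[idx], intensity = intensity[idx])
}

# In practice the two columns could be read with something like
# spec <- read.table("spectrumtext.txt", col.names = c("ppm", "intensity"))
ppm <- c(-0.807434, -0.807067, -0.806701, -0.806334, -0.805967)
intensity <- c(-23644, -22980, -22967, -24513, -27609)
peaks <- local_maxima(ppm, intensity)
print(peaks)  # one row: ppm -0.806701, intensity -22967
```

On a real, noisy spectrum this will flag every small wiggle, so some smoothing or a sensible `min_height` threshold is probably needed first.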

Here's a worked example with a reproducible piece of code. I don't claim it's any good with regard to the strategy or coding, but it could get you started.
find_peaks <- function (x, y, n.fine = length(x), interval = range(x), ...) {
  maxdif <- max(diff(x)) # longest distance between successive points
  ## selected interval for the search
  range.ind <- seq(which.min(abs(x - interval[1])),
                   which.min(abs(x - interval[2])))
  x <- x[range.ind]
  y <- y[range.ind]
  ## smooth the data
  spl <- smooth.spline(x, y, ...)
  ## finer x positions
  x.fine <- seq(range(x)[1], range(x)[2], length = n.fine)
  ## predicted y positions
  y.spl <- predict(spl, x.fine, der = 0)$y
  ## testing numerically the second derivative
  test <- diff(diff((y.spl), 1) > 0, 1)
  maxima <- which(test == -1) + 1
  ## according to this criterion, we found rough positions
  guess <- data.frame(x=x.fine[maxima], y=y.spl[maxima])
  ## cost function to maximize
  obj <- function(x) predict(spl, x)$y
  ## optimize the peak position around each guess
  fit <- data.frame(do.call(rbind,
    lapply(guess$x, function(g) {
      fit <- optimize(obj, interval = g + c(-1,1) * maxdif, maximum=TRUE)
      data.frame(x=fit$maximum,y=fit$objective)
    })))
  ## return both guesses and fits
  invisible(list(guess=guess, fit=fit))
}
set.seed(123)
x <- seq(1, 15, length=100)
y <- jitter(cos(x), a=0.2)
plot(x,y)
res <- find_peaks(x,y)
points(res$guess,col="blue")
points(res$fit,col="red")

The package PROcess has a function to find peaks in spectra. There are many more if you search for something like "peak finding R".

Related

How can I stop a while loop and start another while loop where the previous one started

I am trying to read the letters in the list in groups of three, moving forward one letter at a time, until I find "t a c" in the data frame. Then I want to group the rest of them in threes, moving forward three letters at a time, until I find "a t t".
An example of what I am trying to say:
"agttacgtaattatgat"
It should give:
agt,gtt,tta,tac stop, gta,att stop ,atg,tga,gat
(the data frame's name is agen)
My code for that:
y=c()
x=1
while(x<853){
x=x+1
rt<-paste(agen[x],agen[x+1],agen[x+2])
y=c(y,rt)
ff<-data.frame(y)
if(ff=="t a c"){break}
}
ay=c()
while(x<853){
x=x+3
art<-paste(agen[x],agen[x+1],agen[x+2])
ay=c(ay,art)
aff<-data.frame(ay)
if(aff=="a t t"){break}
}
The first one is working fine, but the second one does not break.
There will be a lot of stops and starts in the code, so can you help me write a loop that can do the job?
I only roughly understand what you need, but here is a code example that may do it. I used the example you specified, with a vector holding your DNA bases as elements instead of a 'data frame'. I also changed a few style things.
agen_string <- "agttacgtaattatgat"
# This is not a data frame but a vector. I don't know why you try to use a data frame.
agen <- strsplit(agen_string, split = "")[[1]]
y <- c()
x <- 0 # Start with 0. Otherwise, you wouldn't find 'tac' in the beginning
# Search for 'tac' triplet
while(x < length(agen)){
  x <- x + 1
  rt <- paste(agen[x], agen[x+1], agen[x+2], sep = "")
  print(rt)
  y <- c(y, rt)
  #ff <- data.frame(y)
  if(rt == "tac"){
    print("stop")
    break
  }
}
ay <- c()
# Search for 'att' triplet, now moving three bases at a time
while(x < length(agen)) {
  x <- x + 3
  art <- paste(agen[x], agen[x+1], agen[x+2], sep = "")
  print(art)
  ay <- c(ay, art)
  #aff<-data.frame(ay)
  if(art == "att"){
    print("stop")
    break
  }
}
If you work more on DNA sequences, you may want to use a more specialized R-package, like Biostrings for example.
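Since the question mentions many stops and starts, the two loops above could be folded into one loop that switches its step size each time the current target triplet is found. The following is only a sketch under that reading of the question; the function name `split_triplets` and its arguments are made up for illustration.

```r
# Hypothetical helper: walk along the bases, stepping by 1 until `start` is
# seen, then by 3 until `stop` is seen, then by 1 again, and so on.
split_triplets <- function(bases, start = "tac", stop = "att") {
  n <- length(bases)
  i <- 1
  step <- 1
  target <- start
  out <- character(0)
  while (i + 2 <= n) {
    triplet <- paste(bases[i:(i + 2)], collapse = "")
    out <- c(out, triplet)
    if (triplet == target) {
      i <- i + 3                      # continue right after the matched triplet
      if (step == 1) {                # switch between the two modes
        step <- 3; target <- stop
      } else {
        step <- 1; target <- start
      }
    } else {
      i <- i + step
    }
  }
  out
}

agen <- strsplit("agttacgtaattatgat", split = "")[[1]]
out <- split_triplets(agen)
print(out)
# "agt" "gtt" "tta" "tac" "gta" "att" "atg" "tga" "gat"
```

The condition `i + 2 <= n` also avoids pasting `NA` when fewer than three bases remain at the end of the sequence.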

What is the fastest method to map decision tree nodes to one-hot vectors?

Consider the function f, which takes decision-tree node parameters in {-1,+1} and maps them to a one-hot vector, for example [0,0,0,1].
I think this will end up being one of the bottlenecks of a program I'm working on, so I'd like to know if anyone finds a faster way to map the parameters to the vector.
f<-function(h){
  # function takes as arguments:
  # an m-bit vector of potential split decisions (h)
  # function returns:
  # an m+1-length one-hot indicator vector
  theta_vec = c(rep(0,length(h)+1))
  position = length(h)+1
  for(bit in seq(1,length(h),2)){
    if(h[bit]>0){
      position=position
    }
    else{
      position=position/2
    }
  }
  theta_vec[position]=1
  return(theta_vec)
}
Thank you for your help
I think I've got a solution that runs in a quarter of the time. Are you able to refactor so that you use (0,1) instead of (-1,1), and pass the input as a list of rows instead of a vector? I find it's easier to interpret when thinking about the problem, although the function below could be rewritten to use a vector as input.
findPos <- function(h){
  # find number of rows from input
  N <- length(h)
  # go through and pick out the values in each tree that are valid based
  # on previous route
  out <- c(h[[1]], rep(0, N-1))
  for(i in 2:N){
    out[i] <- h[[i]][sum(out[i:(i-1)] * 2^(i-1)/(2^((i-1):1))) + 1]
  }
  # now find the final position in the bottom row and return as a vector
  out_pos <- sum(out * 2^N/(2^(1:N))) + 1
  full_vec <- rep(0, 2^N)
  full_vec[out_pos] <- 1
  return(full_vec)
}
# a couple of examples
f(c(0,1,1))
findPos(list(0, c(1,1)))
f(c(1,1,1))
findPos(list(1, c(1,1)))
# works with larger trees
findPos(list(1, c(1,1), c(1,0,0,0)))
# check time using microbenchmark package
microbenchmark::microbenchmark(
  "old" = {
    f(c(0,1,1))
  },
  "new" = {
    findPos(list(0, c(1,1)))
  }
)
Best
Jonny
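Another option, keeping the original (-1,+1) vector input, is to drop the loop from f altogether. The original loop only inspects the odd-indexed entries of h and halves the position once for each entry that is not positive, so the final position can be computed in one step. The name `f_vec` is made up here, and whether this is actually faster for your inputs is an assumption worth checking with microbenchmark.

```r
# Hypothetical loop-free rewrite of f(): same inputs, same output.
f_vec <- function(h) {
  m <- length(h)
  odd <- h[seq(1, m, 2)]                    # the entries the original loop visits
  position <- (m + 1) / 2^sum(odd <= 0)     # one halving per non-positive entry
  theta_vec <- numeric(m + 1)
  theta_vec[position] <- 1
  theta_vec
}

f_vec(c(-1, 1, 1))  # 0 1 0 0
f_vec(c(1, 1, 1))   # 0 0 0 1
```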

R for-loop iterating from central value out to extremes

I'm trying to improve the speed of my code, which is trying to optimise a value using 3 variables which have large ranges. The most likely output uses values in the middle of the ranges, so it is wasting time starting from the lowest possible value of each variable. I want to start from the middle value and iterate out! The actual problem has thousands of lines with numbers from 150-650. C,H and O limits will be defined somewhat based on the starting number, but will always be more likely at a central value in the defined range. Is there a way to define the for loop to work outwards like I want? The only, quite shabby, way I can think of is to simply redefine the value within the loop from a vector (e.g. 1=20, 2=21, 3=19, etc). See current code below:
set_error<-2.5
ct<-c(325.00214,325.00952,325.02004,325.02762,325.03535,325.03831,325.04588, 325.05641,325.06402,325.06766,325.07167,325.07454,325.10396)
FormFun<-function(x){
  for(C in 1:40){
    for(H in 1:80){
      for(O in 1:40){
        test_mass=C*12+H*1.007825+O*15.9949146-1.0072765
        error<-1000000*abs(test_mass-x)/x
        if(error<set_error){
          result<-paste("C",C,"H",H,"O",O,sep ="")
          return(result)
          break;break;break;break
        }
      }
    }
  }
}
old_t <- Sys.time()
ct2<-lapply(ct,FormFun)
new_t <- Sys.time() - old_t # calculate difference
print(new_t)
Use vectorization and create a closure:
FormFun1_fac <- function(gr) {
  force(gr) # evaluate the grid now so that it is stored in the closure's environment
  function(x, set_error){
    test_mass <- with(gr, C*12+H*1.007825+O*15.9949146-1.0072765)
    error <- 1000000 * abs(test_mass - x) / x
    ind <- which(error < set_error)[1]
    if (is.na(ind)) return(NULL)
    paste0("C", gr[ind, "C"],"H", gr[ind, "H"],"O", gr[ind, "O"])
  }
}
FormFun1 <- FormFun1_fac(expand.grid(C = 1:40, H = 1:80, O = 1:40))
ct21 <- lapply(ct, FormFun1, set_error = set_error)
all.equal(ct2, ct21)
#[1] TRUE
This saves a grid of all combinations of C, H, O in the function environment and calculates the error for all combinations (which is fast in vectorized code). The first combination that passes the test is returned.
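To get the centre-outwards order the question asks about, one possibility is to sort the candidate values by their distance from the middle of the range. The sketch below is illustrative only: `center_out`, `centre_dist` and `gr_sorted` are made-up names, and the way the three distances are combined into one score is an assumption, not something taken from the question.

```r
# Hypothetical helper: the integers lo..hi ordered from the middle outwards.
center_out <- function(lo, hi) {
  v <- lo:hi
  v[order(abs(v - (lo + hi) / 2))]
}
center_out(1, 7)  # 4 3 5 2 6 1 7

# The same idea applied to the full grid: sort the rows by a combined,
# range-scaled distance from the centre of each variable's range.
gr <- expand.grid(C = 1:40, H = 1:80, O = 1:40)
centre_dist <- abs(gr$C - 20.5) / 40 + abs(gr$H - 40.5) / 80 + abs(gr$O - 20.5) / 40
gr_sorted <- gr[order(centre_dist), ]
head(gr_sorted, 3)
```

Passing `gr_sorted` instead of the plain grid to `FormFun1_fac` would make the first match the most central one under this score, at the cost of no longer reproducing the nested loops' result exactly.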

Get sequence of adjacent cell numbers out of a raster

I am currently working on a script that loads a TIF file into a raster object, crops it and plots two points (the starting point and the point of destination, selected via the click function) onto that raster. I then want it to get the cell numbers of those two points. None of that has caused any trouble.
Now I have tried to write a while loop that gets me the number of a random cell adjacent to the current cell (beginning from the starting point) until that cell number equals the cell number of my point of destination. My idea was to "walk" across the raster until I have reached my point of destination, or at least the column containing it (to reduce computation time). The numbers of the cells I cross during that walk should be stored in a vector ("Path").
I select the adjacent cell (i.e. choose my next step) by randomly sampling from a vector of numbers that, when added to the current cell number, give the number of an adjacent cell. I have multiple vectors from which to sample, as the number of possible directions in which to "walk" depends on the position of the current cell (e.g. I can't "walk" to the cell to my lower right, n + (ncol_dispersal + 1), if I am currently positioned at the bottom of the raster). The script looks like this so far:
library(gdistance)
library(raster)
library(rgdal)
library(sp)
setwd("C:/Users/Giaco/Dropbox/Random Walk")
altdata <- raster("altitude.tif")
plot(altdata)
e <- extent(92760.79,93345.79,204017.5,204242.5)
dispersal_area <- crop(altdata,e)
plot(dispersal_area)
points(92790.79,204137.5,pch=16,cex=1)
points(93300.79,204062.5,pch=16,cex=1)
Pts <- matrix(c(92790.79,204137.5,93300.79,204062.5),nrow=2,ncol=2,byrow=TRUE)
Start <- cellFromXY(dispersal_area,Pts[1,])
End <- cellFromXY(dispersal_area,Pts[2,])
nrow_dispersal <- nrow(dispersal_area)
ncol_dispersal <- ncol(dispersal_area)
col_start <- colFromCell(dispersal_area,Start)
row_start <- rowFromCell(dispersal_area,Start)
col_end <- colFromCell(dispersal_area,End)
row_end <- rowFromCell(dispersal_area,End)
upper_left_corner <- cellFromRowCol(dispersal_area,1,1)
lower_left_corner <- cellFromRowCol(dispersal_area,14,1)
sample_standard <- c(1,(ncol_dispersal+1),(ncol_dispersal*-1+1))
sample_top <- c(1,ncol_dispersal,(ncol_dispersal+1))
sample_bottom <- c(1,(ncol_dispersal*-1+1),(ncol_dispersal*-1))
sample_left <- c(1,(ncol_dispersal+1),(ncol_dispersal*-1+1))
sample_upper_left <- c(1,ncol_dispersal,(ncol_dispersal+1))
sample_lower_left <- c(1,(ncol_dispersal*-1+1),(ncol_dispersal*-1))
Path <- c()
Path[1] <- Start
n <- Start
counter <- 1
while (n != End)
{
  n = Start+sample(sample_standard,1)
  if (colFromCell(dispersal_area,n)==col_end) {
    n=End
    break
  } else if (n==upper_left_corner) {
    n = n+sample(sample_upper_left,1)
  } else if(n==lower_left_corner){
    n = n+sample(sample_lower_left,1)
  } else if(colFromCell(dispersal_area,n)==1) {
    n = n+sample(sample_left,1)
  } else if(rowFromCell(dispersal_area,n)==1){
    n = n+sample(sample_top,1)
  } else if(rowFromCell(dispersal_area,n)==nrow_dispersal) {
    n = n+sample(sample_bottom,1)
  }
  counter <- counter+1
  Path[counter] <- n
}
When I run the script and print the Path vector, it returns a very long vector (I always have to stop it as it never finishes computing) which contains only a few different numbers. Why is that? I have been staring at this all day, but I simply can't figure out where I went wrong. There must be something wrong with the while loop, but I don't see it.
If any of you could help me out with this I would be really thankful.
Thanks in advance!
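Here is a simple and reproducible example (that also answers your question). Note that in your loop every step is computed from Start (n = Start + sample(...)) rather than from the current cell n, so the walk never gets more than a step or two away from the starting point; that is why Path contains only a few different numbers.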
library(gdistance)
r <- raster(system.file("external/maungawhau.grd", package="gdistance"))
r <- aggregate(r, 5)
p <- matrix(c(2667531, 6478843, 2667731, 6479227), ncol=2, byrow=TRUE)
start <- cellFromXY(r, p[1,])
end <- cellFromXY(r, p[2,])
counter <- 1
cell <- start
path <- cell
while (cell != end) {
  a <- adjacent(r, cell, pairs=FALSE)
  cell <- sample(a, 1)
  path <- c(path, cell)
}
xy <- xyFromCell(r, path)
plot(r)
lines(xy)
or
cols <- rainbow(nrow(xy))
for (i in 1:(nrow(xy)-1)) { lines(xy[i:(i+1), ], col=cols[i]) }
This is pretty fast on this coarse raster, but it could indeed take a very long time to reach a particular cell on a large raster by random walk.
Perhaps there are functions in gdistance that are more useful?
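To make the edge handling from the question explicit without any packages, the neighbour lookup can also be written with plain row/column arithmetic. This is only a sketch: `neighbours`, `random_walk` and `max_steps` are made-up names, cells are assumed to be numbered row by row from the top-left corner (as the raster package does), and the grid is assumed to have more than one cell.

```r
# Hypothetical helper: cell numbers of the (up to 8) neighbours of `cell`
# on a grid with nr rows and nc columns; border cells simply get fewer.
neighbours <- function(cell, nr, nc) {
  row <- (cell - 1) %/% nc + 1
  col <- (cell - 1) %% nc + 1
  d <- expand.grid(dr = -1:1, dc = -1:1)
  d <- d[!(d$dr == 0 & d$dc == 0), ]           # drop the cell itself
  rr <- row + d$dr
  cc <- col + d$dc
  keep <- rr >= 1 & rr <= nr & cc >= 1 & cc <= nc
  (rr[keep] - 1) * nc + cc[keep]
}

# Random walk that always steps from the CURRENT cell, with a safety cap.
random_walk <- function(start, end, nr, nc, max_steps = 1e5) {
  path <- start
  cell <- start
  while (cell != end && length(path) < max_steps) {
    nb <- neighbours(cell, nr, nc)
    cell <- nb[sample.int(length(nb), 1)]      # index sampling avoids the sample(x, 1) pitfall
    path <- c(path, cell)
  }
  path
}

set.seed(42)
path <- random_walk(start = 1, end = 20, nr = 4, nc = 5)
length(path)
```

Sampling an index with `sample.int` rather than calling `sample(nb, 1)` matters when a cell has a single neighbour, because `sample(x, 1)` with a length-one numeric `x` samples from `1:x` instead.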

How to append bootstrapped values of cluster's (tree) nodes in NEWICK format in R

I want to make a tree (cluster) using the Interactive Tree of Life web-based tool (iTOL). As an input file (or string), this tool uses the Newick format, which is a way of representing graph-theoretical trees with edge lengths using parentheses and commas. Besides that, additional information can be included, such as bootstrapped values of the cluster's nodes.
For example, here I created dataset for a cluster analysis using clusterGeneration package:
library(clusterGeneration)
set.seed(1)
tmp1 <- genRandomClust(numClust=3, sepVal=0.3, numNonNoisy=5,
numNoisy=3, numOutlier=5, numReplicate=2, fileName="chk1")
data <- tmp1$datList[[2]]
Afterwards I performed cluster analysis and assessed support for the cluster's nodes by bootstrap using pvclust package:
set.seed(2)
y <- pvclust(data=data,method.hclust="average",method.dist="correlation",nboot=100)
plot(y)
Here is the cluster and bootstrapped values:
In order to make a Newick file, I used ape package:
library(ape)
yy<-as.phylo(y$hclust)
write.tree(yy,digits=2)
write.tree function will print tree in a Newick format:
((x2:0.45,x6:0.45):0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14):0.0064):0.12):0.22,(x5:0.28,x8:0.28):0.2):0.011);
Those numbers represent branch lengths (the cluster's edge lengths). Following the instructions on the iTOL help page ("Uploading and working with your own trees" section), I manually added bootstrapped values to my Newick file (the numbers directly after each closing parenthesis below):
((x2:0.45,x6:0.45)74:0.043,((x7:0.26,(x4:0.14,(x1:0.14,x3:0.14)55:0.0064)68:0.12)100:0.22,(x5:0.28,x8:0.28)100:0.2)63:0.011);
It works fine when I upload the string into iTOL. However, I have a huge cluster and doing it by hand seems tedious...
QUESTION: What would be a code that can perform it instead of manual typing?
Bootstrap values can be obtained by:
(round(y$edges,2)*100)[,1:2]
Branch lengths used to form Newick file can be obtained by:
yy$edge.length
I tried to figure out how write.tree function works after debugging it. However, I noticed that it internally calls function .write.tree2 and I couldn't understand how to efficiently change the original code and obtain bootstrapped values in appropriate position in a Newick file.
Any suggestions are welcome.
Here is one solution for you: objects of class phylo have an available slot called node.label that, appropriately, gives you the label of a node. You can use it to store your bootstrap values. They will be written in your Newick file at the appropriate place, as you can see in the code of .write.tree2:
> .write.tree2
function (phy, digits = 10, tree.prefix = "")
{
brl <- !is.null(phy$edge.length)
nodelab <- !is.null(phy$node.label)
...
if (is.null(phy$root.edge)) {
cp(")")
if (nodelab)
cp(phy$node.label[1])
cp(";")
}
else {
cp(")")
if (nodelab)
cp(phy$node.label[1])
cp(":")
cp(sprintf(f.d, phy$root.edge))
cp(";")
}
...
The real difficulty is to find the proper order of the nodes. I searched and searched but couldn't find a way to find the right order a posteriori.... so that means we will have to get that information during the transformation from an object of class hclust to an object of class phylo.
And luckily, if you look into the function as.phylo.hclust, there is a vector containing the nodes index in their correct order vis-à-vis the previous hclust object:
> as.phylo.hclust
function (x, ...)
{
N <- dim(x$merge)[1]
edge <- matrix(0L, 2 * N, 2)
edge.length <- numeric(2 * N)
node <- integer(N) #<-This one
...
Which means we can make our own as.phylo.hclust with a nodenames parameter as long as it is in the same order as the nodes in the hclust object (which is the case in your example since pvclust keeps a coherent order internally, i. e. the order of the nodes in the hclust is the same as in the table in which you picked the bootstraps):
# NB: in the following function definition I only modified the commented lines
as.phylo.hclust.with.nodenames <- function (x, nodenames, ...) #We add a nodenames argument
{
  N <- dim(x$merge)[1]
  edge <- matrix(0L, 2 * N, 2)
  edge.length <- numeric(2 * N)
  node <- integer(N)
  node[N] <- N + 2L
  cur.nod <- N + 3L
  j <- 1L
  for (i in N:1) {
    edge[j:(j + 1), 1] <- node[i]
    for (l in 1:2) {
      k <- j + l - 1L
      y <- x$merge[i, l]
      if (y > 0) {
        edge[k, 2] <- node[y] <- cur.nod
        cur.nod <- cur.nod + 1L
        edge.length[k] <- x$height[i] - x$height[y]
      }
      else {
        edge[k, 2] <- -y
        edge.length[k] <- x$height[i]
      }
    }
    j <- j + 2L
  }
  if (is.null(x$labels))
    x$labels <- as.character(1:(N + 1))
  node.lab <- nodenames[order(node)] #Here we define our node labels
  obj <- list(edge = edge, edge.length = edge.length/2, tip.label = x$labels,
              Nnode = N, node.label = node.lab) #And you put them in the final object
  class(obj) <- "phylo"
  reorder(obj)
}
In the end, here is how you would use this new function in your case study:
bootstraps <- (round(y$edges,2)*100)[,1:2]
yy<-as.phylo.hclust.with.nodenames(y$hclust, nodenames=bootstraps[,2])
write.tree(yy,tree.names=TRUE,digits=2)
[1] "((x5:0.27,x8:0.27)100:0.24,((x7:0.25,(x4:0.14,(x1:0.13,x3:0.13)61:0.014)99:0.11)100:0.23,(x2:0.46,x6:0.46)56:0.022)61:0.027)100;"
#See the bootstraps ^^^ here for instance
plot(yy,show.node.label=TRUE) #To show that the order is correct
plot(y) #To compare with (here I used the yellow value)
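As a cross-check of where the labels end up, the Newick string can also be built directly from the hclust object in base R, without going through ape. This is only a sketch under stated assumptions: `hclust_to_newick` is a made-up name, `node_labels[i]` is assumed to belong to row `i` of `hc$merge` (the same ordering assumption as above), and branch lengths are the plain height differences, not halved as in as.phylo.hclust.

```r
# Hypothetical helper: recursive hclust -> Newick conversion with node labels.
hclust_to_newick <- function(hc, node_labels = NULL, digits = 2) {
  tips <- if (is.null(hc$labels)) as.character(seq_len(nrow(hc$merge) + 1)) else hc$labels
  build <- function(i) {
    # negative i is a leaf, positive i is the cluster formed in merge row i
    if (i < 0) return(list(txt = tips[-i], height = 0))
    kids <- lapply(hc$merge[i, ], build)
    parts <- vapply(kids, function(k) {
      paste0(k$txt, ":", round(hc$height[i] - k$height, digits))
    }, character(1))
    lab <- if (is.null(node_labels)) "" else as.character(node_labels[i])
    list(txt = paste0("(", paste(parts, collapse = ","), ")", lab),
         height = hc$height[i])
  }
  paste0(build(nrow(hc$merge))$txt, ";")
}

# Tiny example: a and b merge at height 1, then c joins at height 3.5
d <- as.dist(matrix(c(0, 1, 4,
                      1, 0, 3,
                      4, 3, 0), nrow = 3,
                    dimnames = list(c("a", "b", "c"), c("a", "b", "c"))))
hc <- hclust(d, method = "average")
newick <- hclust_to_newick(hc, node_labels = c(80, 100))
print(newick)
```

With pvclust output the call would look something like `hclust_to_newick(y$hclust, node_labels = bootstraps[,2])`. The recursion depth grows with the height of the tree, so very unbalanced, very large trees may need a non-recursive version.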
