EDIT: I changed my example at the end but I'm still stumped.
How can I apply a trapezoidal integration function so that each matrix point is integrated through the 3rd dimension of an array?
I am working with a large array of data with dimensions (2160, 4320, 46), and I want to use trapezoidal integration over the third dimension of the array (each matrix point down through the 46 layers of the third dimension). So, for example, the values at 1,1,1; 1,1,2; 1,1,3; etc. are passed as the y points to the integration function.
EDIT: I have been told that this function can be found in the CRAN package caTools by Jarek Tuszynski.
I have a custom function from my PI for the trapezoidal integration:
trapz = function(x, y){
idx = 2:length(x)
return (as.double( (x[idx] - x[idx-1]) %*% (y[idx] + y[idx-1])) / 2)
}
And I have used apply functions on such arrays before like this:
data_output = apply(X = data, MARGIN = 1:2 , FUN = function(k) (mean(na.omit(k))) )
But I can't figure out how to get the trapz function to work along the margins that I want.
Working with simple example code, if I create an array like this:
x.mat = matrix(1:100, nrow = 10, ncol = 10)
y.mat = matrix(1:100, nrow = 10, ncol = 10)
library(abind)
x.array = abind(x.mat,(x.mat+1),(x.mat+2),along = 3, force.array = T)
y.array = abind(y.mat,(y.mat+1),(y.mat+2),along = 3, force.array = T)
apply(MARGIN = 1:2, FUN = trapz, X = x.array, y = y.array)
The output is a matrix of the correct dimensions (10,10) but every number is a 4.
Please help me understand what I'm doing wrong.
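A possible explanation, offered as a sketch rather than a definitive answer: apply() hands each 3-element slice of x.array to trapz as its first argument (x), while y = y.array is passed whole on every call. trapz then only reads y[1], y[2], y[3], which are always 1, 2, 3, and the x spacing is always 1, so every cell evaluates to (3 + 5) / 2 = 4. The code below assumes two possible intents: a single x vector shared by all points, or a separate x series per point. The names x, res_shared, x2, y2 and res_paired are made up for illustration, and array() stands in for abind so that the example needs no extra package.

```r
trapz <- function(x, y) {
  idx <- 2:length(x)
  as.double((x[idx] - x[idx - 1]) %*% (y[idx] + y[idx - 1])) / 2
}

# Same toy data as in the question, built with base R instead of abind
x.mat <- matrix(1:100, nrow = 10, ncol = 10)
x.array <- array(c(x.mat, x.mat + 1, x.mat + 2), dim = c(10, 10, 3))
y.array <- x.array

# Case 1 (assumption): one x vector shared by every matrix point.
# Each slice along the third dimension is passed as y, not as x.
x <- 1:3
res_shared <- apply(y.array, 1:2, function(y) trapz(x, y))

# Case 2 (assumption): every point has its own x series.
# Flatten both arrays so that row i holds the series of point i.
d <- dim(x.array)
x2 <- matrix(x.array, nrow = d[1] * d[2])
y2 <- matrix(y.array, nrow = d[1] * d[2])
res_paired <- matrix(
  vapply(seq_len(nrow(x2)), function(i) trapz(x2[i, ], y2[i, ]), numeric(1)),
  nrow = d[1], ncol = d[2]
)
```

For the full (2160, 4320, 46) array, the first form would be called as apply(data, 1:2, function(y) trapz(x, y)) with x of length 46. That makes about 9.3 million function calls, so it may be slow.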
I have 15 protein sequences in FASTA format in one file. I have to align them pairwise, both globally and locally, and then generate a 15x15 distance score matrix to construct a dendrogram.
But when I do this, a sequence is not aligned with itself and I get an NA result. Moreover, AxB gives a score of 12131 but BxA gives NA, so R cannot construct the phylogenetic tree.
What should I do?
I'm using this script for the loop, but it only reads the pairs one way:
for (i in 1:150) {
globalpwa<-pairwiseAlignment(toString(ProtDF[D[1,i],2])
,toString(ProtDF[D[2,i],2]),
substitutionMatrix = "BLOSUM62",
gapOpening = 0,
gapExtension = -2,
scoreOnly=FALSE,
type="global")
ScoreX[i]<-c(globalpwa@score)
nameSeq1[i]<-c(as.character(ProtDF[D[1,i],1]))
nameSeq2[i]<-c(as.character(ProtDF[D[2,i],1]))
}
I used an example fasta file, protein sequence of RPS29 in fungi.
ProtDF <-
c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR"
)
You were right to use combn. Since you need a distance score matrix for the dendrogram, it is better to store the values in a matrix. Below, I name the rows and columns of the matrix after the FASTA names and fill in each pairwise score in both triangles, so that AxB and BxA hold the same value:
library(Biostrings)
# you can also read in your file
# like ProtDF = readAAStringSet("fasta")
ProtDF=AAStringSet(ProtDF)
# combination like you want
# here we just use the names
D = combn(names(ProtDF),2)
#make the pairwise matrix
mat = matrix(NA,ncol=length(ProtDF),nrow=length(ProtDF))
rownames(mat) = names(ProtDF)
colnames(mat) = names(ProtDF)
# loop through D
for(idx in 1:ncol(D)){
i <- D[1,idx]
j <- D[2,idx]
globalpwa<-pairwiseAlignment(ProtDF[i],
ProtDF[j],
substitutionMatrix = "BLOSUM62",
gapOpening = 0,
gapExtension = -2,
scoreOnly=FALSE,
type="global")
mat[i,j]<-globalpwa@score
mat[j,i]<-globalpwa@score
}
# if you need to make diagonal zero
diag(mat) <- 0
# plot
plot(hclust(as.dist(mat)))
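To show the fill pattern on its own, without Bioconductor installed, here is a minimal base R sketch. It rests on some assumptions: the three short strings are toy sequences, and utils::adist (a generic edit distance) is only a stand-in for the alignment score, not a replacement for pairwiseAlignment. The point is that combn yields each unordered pair once, so both mat[i, j] and mat[j, i] have to be written to avoid NA in one triangle.

```r
# Toy sequences (hypothetical, shortened for illustration)
seqs <- c(a = "MSHESVWNSRPRTYGKG",
          b = "MAHENVWYSHPRKFGKG",
          c = "MSFEPSGPHSHRKPFGKG")

# Each unordered pair appears exactly once
pairs <- combn(names(seqs), 2)

# Start from zeros so the diagonal (self-distance) is already filled
mat <- matrix(0, nrow = length(seqs), ncol = length(seqs),
              dimnames = list(names(seqs), names(seqs)))

for (idx in seq_len(ncol(pairs))) {
  i <- pairs[1, idx]
  j <- pairs[2, idx]
  # Stand-in distance; swap in the real alignment score here
  d <- as.numeric(utils::adist(seqs[i], seqs[j]))
  mat[i, j] <- d
  mat[j, i] <- d
}

# A symmetric matrix without NA can be passed to as.dist() and hclust()
tree <- hclust(as.dist(mat))
```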
An alternate method, if you're interested, using the same example as above:
library(DECIPHER)
ProtDF <- c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR")
# All pairwise alignments:
# Convert characters to an AA String Set
ProtDF <- AAStringSet(ProtDF)
# Initialize some outputs
AliMat <- matrix(data = list(),
ncol = length(ProtDF),
nrow = length(ProtDF))
DistMat <- matrix(data = 0,
ncol = length(ProtDF),
nrow = length(ProtDF))
# loop through the upper triangle of your matrix
for (m1 in seq_len(length(ProtDF) - 1L)) {
for (m2 in (m1 + 1L):length(ProtDF)) {
# Align each pair
AliMat[[m1, m2]] <- AlignSeqs(myXStringSet = ProtDF[c(m1, m2)],
verbose = FALSE)
# Assign a distance to each alignment, fill both triangles of the matrix
DistMat[m1, m2] <- DistMat[m2, m1] <- DistanceMatrix(myXStringSet = AliMat[[m1, m2]],
type = "dist", # return a single value
includeTerminalGaps = TRUE, # return a global distance
verbose = FALSE)
}
}
dimnames(DistMat) <- list(names(ProtDF),
names(ProtDF))
Dend01 <- IdClusters(myDistMatrix = DistMat,
method = "NJ",
type = "dendrogram",
showPlot = FALSE,
verbose = FALSE)
# A single multiple alignment:
AllAli <- AlignSeqs(myXStringSet = ProtDF,
verbose = FALSE)
AllDist <- DistanceMatrix(myXStringSet = AllAli,
type = "matrix",
verbose = FALSE,
includeTerminalGaps = TRUE)
Dend02 <- IdClusters(myDistMatrix = AllDist,
method = "NJ",
type = "dendrogram",
showPlot = FALSE,
verbose = FALSE)
[Figure: Dend01, from all the pairwise alignments]
[Figure: Dend02, from a single multiple alignment]
Finally, DECIPHER has a function for opening your alignment in your browser just to look at it. If your alignments are huge this can be a bit of a mistake, but in this case (and in cases up to a few hundred short sequences) it is just fine:
BrowseSeqs(AllAli)
A side note about BrowseSeqs: for some reason it doesn't behave well with Safari, but it plays just fine with Chrome. Sequences are displayed in the order in which they exist in the aligned string set.
EDIT: BrowseSeqs DOES play fine with Safari directly, but it has a weird issue when incorporated into HTML files knitted with RMarkdown. Weird, but not applicable here.
Another big aside: all of the functions I've used have a processors argument, which is set to 1 by default. If you'd like to get greedy with your cores, you can set it to NULL and it'll grab everything available. This can be pretty useful if you're aligning very large string sets; for trivially small things like this example, not so much.
combn is a great function and I use it all the time. However, for really simple setups like this I prefer looping through the upper triangle, but that's just a personal preference.
I am working on an estimation module, where we are computing seasonality variations and forecasting. Previously, we were using fixed 5-order sinusoidal functions for estimation. The formula was as follows:
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(4*pi*doy/365)+ z[4]*cos(4*pi*doy/365)
+z[5]*sin(6*pi*doy/365)+ z[6]*cos(6*pi*doy/365)
+z[7]*sin(8*pi*doy/365)+ z[8]*cos(8*pi*doy/365)
+ z[9]*sin(10*pi*doy/365)+ z[10]*cos(10*pi*doy/365))
Now, we have tried some modifications in our model. Using Fast Fourier Transform, we are able to generate the orders for trigonometric functions automatically.
For example, on my current dataset, I have the following array of orders.
order_FFT = [2, 6, 10, 24], such that
order_FFT[0] = 2
order_FFT[1] = 6
order_FFT[2] = 10
order_FFT[3] = 24
There are 4 orders here. With another dataset, there could be more or fewer orders. Therefore, I need to define a for loop so that the formula adapts to whatever orders are found.
With my current dataset and corresponding orders_FFT array, the for loop should execute the following formula:
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
which basically means
doy_seasonality = exp(z[1]*sin(order_FFT[0]*pi*doy/365)+z[2]*cos(order_FFT[0]*pi*doy/365)
+z[3]*sin(order_FFT[1]*pi*doy/365)+ z[4]*cos(order_FFT[1]*pi*doy/365)
+z[5]*sin(order_FFT[2]*pi*doy/365)+ z[6]*cos(order_FFT[2]*pi*doy/365)
+z[7]*sin(order_FFT[3]*pi*doy/365)+ z[8]*cos(order_FFT[3]*pi*doy/365))
I am at a loss trying to figure out a for loop code for this. Sorry that I am not able to show my own efforts here.
I would not use a loop. Here is an R approach:
#Some test data
set.seed(42)
z <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)
#separate coefficients for sin and cos in two rows:
z <- matrix(z, nrow = 2)
#calculate the sines and cosines:
sins <- outer(doy, order_FFT, function(x, y) sin(x * pi * y / 365))
cosins <- outer(doy, order_FFT, function(x, y) cos(x * pi * y / 365))
#use matrix products to multiply and sum
doy_seasonality2 <- c(exp(sins %*% z[1,] + cosins %*% z[2,]))
Does it produce the same result?
doy_seasonality = exp(z[1]*sin(2*pi*doy/365)+z[2]*cos(2*pi*doy/365)
+z[3]*sin(6*pi*doy/365)+ z[4]*cos(6*pi*doy/365)
+z[5]*sin(10*pi*doy/365)+ z[6]*cos(10*pi*doy/365)
+z[7]*sin(24*pi*doy/365)+ z[8]*cos(24*pi*doy/365))
all.equal(doy_seasonality, doy_seasonality2)
#[1] TRUE
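For completeness, since the question asked for a for loop, here is a sketch of a loop that builds the same exponent term by term. It assumes the coefficient layout from the question, where z[2*k - 1] multiplies the sine and z[2*k] multiplies the cosine of the k-th order, and it uses R's 1-based indexing for order_FFT. The names log_seasonality, angle and doy_seasonality_loop are made up for illustration.

```r
# Same test data as above
set.seed(42)
z <- rnorm(8)
doy <- 1:365
order_FFT <- c(2, 6, 10, 24)

# Accumulate the exponent one order at a time
log_seasonality <- numeric(length(doy))
for (k in seq_along(order_FFT)) {
  angle <- order_FFT[k] * pi * doy / 365
  log_seasonality <- log_seasonality +
    z[2 * k - 1] * sin(angle) + z[2 * k] * cos(angle)
}
doy_seasonality_loop <- exp(log_seasonality)
```

The loop adapts to any length of order_FFT, as long as z holds two coefficients per order.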
I'm using the R code for the implementation of the SVM-RFE algorithm from this source http://www.uccor.edu.ar/paginas/seminarios/Software/SVM_RFE_R_implementation.pdf but I made a small modification so that the R code uses the gnum library. The code is the following:
svmrfeFeatureRanking = function(x,y){
n = ncol(x)
survivingFeaturesIndexes = seq(1:n)
featureRankedList = vector(length=n)
rankedFeatureIndex = n
while(length(survivingFeaturesIndexes)>0){
#train the support vector machine
svmModel = SVM(x[, survivingFeaturesIndexes], y, C = 10, cache_size=500,kernel="linear" )
#compute ranking criteria
rankingCriteria = svmModel$w * svmModel$w
#rank the features
ranking = sort(rankingCriteria, index.return = TRUE)$ix
#update feature ranked list
featureRankedList[rankedFeatureIndex] = survivingFeaturesIndexes[ranking[1]]
rankedFeatureIndex = rankedFeatureIndex - 1
#eliminate the feature with smallest ranking criterion
(survivingFeaturesIndexes = survivingFeaturesIndexes[-ranking[1]])
}
return (featureRankedList)
}
That function receives a matrix as input for x and a factor as input for y. When I use the function on some data, I get the following error message in the last iterations:
Error in if (nrow(x) != length(y)) { : argument is of length zero
Debugging the code, I got this:
3 SVM.default(x[, survivingFeaturesIndexes], y, C = 10, cache_size = 500,
kernel = "linear")
2 SVM(x[, survivingFeaturesIndexes], y, C = 10, cache_size = 500,
kernel = "linear")
1 svmrfeFeatureRanking(sdatx, ym)
So, what's the error of the function?
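It looks like your matrix is reduced to a plain vector when only one feature remains: indexing a matrix with a single column drops the column dimension, so nrow(x) returns NULL and the comparison inside SVM has length zero. Try this (subsetting with drop = FALSE should work as well):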
svmModel = SVM(as.matrix(x[, survivingFeaturesIndexes]), y, C = 10, cache_size=500,kernel="linear" )
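The dimension dropping can be reproduced in base R without the SVM package. This is a small sketch with a made-up matrix x and illustrative names one_col, kept and wrapped:

```r
x <- matrix(1:6, nrow = 3)

# Selecting a single column drops the dimensions: the result is a vector
one_col <- x[, 1]
nrow(one_col)               # NULL, which is what breaks the nrow(x) != length(y) test

# Either of these keeps a 3 x 1 matrix
kept <- x[, 1, drop = FALSE]
wrapped <- as.matrix(one_col)
dim(kept)
dim(wrapped)
```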
Example:
require(data.table)
example = matrix(c(rnorm(15, 5, 1), rep(1:3, each=5)), ncol = 2, nrow = 15)
example = data.table(example)
setnames(example, old=c("V1","V2"), new=c("target", "index"))
example
threshold = 100
accumulating_cost = function(x,y) { x-cumsum(y) }
whats_left = accumulating_cost(threshold, example$target)
whats_left
I want whats_left to consist of the difference between threshold and the cumulative sum of values in example$target for which example$index = 1, and 2, and 3. So I used the following for loop:
rm(whats_left)
whats_left = vector("list")
for(i in 1:max(example$index)) {
whats_left[[i]] = accumulating_cost(threshold, example$target[example$index==i])
}
whats_left = unlist(whats_left)
whats_left
plot(whats_left~c(1:15))
I know for loops aren't the devil in R, but I'm habituating myself to use vectorization when possible (including getting away from apply, being a for loop wrapper). I'm pretty sure it's possible here, but I can't figure out how to do it. Any help would be much appreciated.
All you are trying to do is accumulate the cost by index. Thus, you might want to use the by argument, as in
example[, accumulating_cost(threshold, target), by = index]
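If data.table is not available, a similar grouped calculation can be sketched in base R with ave(), which applies a function within each group and returns the results in the original row order. This is an alternative I am suggesting, not part of the answer above; the data frame below mirrors the question's example, but with a fixed seed of my own choosing.

```r
set.seed(1)
example <- data.frame(target = rnorm(15, 5, 1),
                      index = rep(1:3, each = 5))
threshold <- 100

# Within each index group, subtract the running total from the threshold
whats_left <- ave(example$target, example$index,
                  FUN = function(v) threshold - cumsum(v))
```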
I am trying to write a function that would generate first n terms with the given equation.
For example: x_2 = f(x_1) = x_1*r + 3, and x_n = f(x_(n-1))
Here is my code:
super = function(x,r,n){
x[n] = r*x+3
x1=seq(x,x[n],,n)
return(x1)
}
When I try to run it I keep getting: Error in super(0.6, 2, 100) : could not find function "x".
But if I make a basic code like:
n=88
x=0.6
x1 = seq(x,100,,n)
everything works
Thanks in advance for any inputs
We can slice the input and apply function on the new slice.
getFirstN = function(arr, func, n){
slice = arr[1:n]
return(lapply(slice, func))
}
e.g.:
square = function(x) {
return(x * x)
}
example = c(1, 2, 3, 4)
print(getFirstN(example, square, 2))
# the output is a list holding 1 and 4
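For the recurrence in the question itself, x_n = r*x_(n-1) + 3, each term depends on the previous one, so seq() cannot produce it. A minimal sketch with a plain loop follows, assuming the first argument is the starting value x_1. The name generate_terms is made up for illustration.

```r
generate_terms <- function(x1, r, n) {
  x <- numeric(n)   # pre-allocate the result
  x[1] <- x1
  # Each new term is computed from the one before it
  for (i in seq_len(n - 1)) {
    x[i + 1] <- r * x[i] + 3
  }
  x
}

first_terms <- generate_terms(0.6, 2, 4)
```

With x1 = 0.6 and r = 2, the first four terms are 0.6, 4.2, 11.4 and 25.8.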