Fitting multiple peaks to a dataset and extracting individual peak information in R - r

I have a number of datasets composed of counts of features at various elevations. Currently there is data for each 1m interval from 1-30m. When plotted, many of my datasets exhibit 3-4 peaks, which are indicative of height layers.
Here is a sample dataset:
Height <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30)
Counts <-c(4000,2000,500,300,200,100,0,0,400,700,800,800,500,1000,1500,2000,2500,2200,1700,1100,500,0,0,1000,1500,2000,3000,4000,4000,2000)
I would like to fit some manner of curve function to these datasets in order to determine the total number of ‘peaks’, the peak center location (i.e. height) and peak width.
Some time ago I performed this kind of analysis by manually fitting multiple Gaussian functions in the fityk software; however, I would like to know whether such a process can be automated in R.
I've explored a number of other posts concerning fitting peaks to histograms, for example with the mixtools package, however I do not know whether individual peak information can be extracted from such fits.
Any help you can supply would be greatly appreciated.

"How do I fit a curve to my data" is way too broad of a question, because there are countless ways to do this. It's also probably more suited for https://stats.stackexchange.com/ than here. However, ksmooth from base R is a pretty good starting point for a basic smoother:
plot(Height, Counts)
# Kernel smoother from the stats package; bandwidth controls how aggressively nearby peaks are merged
smoothCounts <- ksmooth(Height, Counts, kernel = "normal", bandwidth = 2)
# Local maxima of the smoothed curve: the slope changes from positive to negative
dsmooth <- diff(smoothCounts$y)
locmax <- sign(c(0, dsmooth)) > 0 & sign(c(dsmooth, 0)) < 0
lines(smoothCounts)
points(smoothCounts$x[locmax], smoothCounts$y[locmax], cex = 3, col = 2)

A simple peak identification could be along the following lines. Looks reasonable?
library(data.table)
dt <- data.table(
  Height = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30),
  Counts = c(4000,2000,500,300,200,100,0,0,400,700,800,800,500,1000,1500,2000,2500,2200,1700,1100,500,0,0,1000,1500,2000,3000,4000,4000,2000)
)
# crude slope into each point: dCounts/dHeight
dt[, d1 := c(NA, diff(Counts))]
# the same slope shifted back by one, i.e. the slope out of each point
# (a second difference would be even cruder, so adjacent slopes are compared instead)
dt[, d2 := c(tail(d1, -1), NA)]
# local maxima: counts rise (or stay flat) into the point and fall (or stay flat) out of it
dtpeaks <- dt[d1 >= 0 & d2 <= 0]
I'm not very sure how you would calculate FWHM for the peaks; if you can explain the process, I should be able to help.
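For the original goal of extracting peak centres and widths, one possible follow-up is to fit a sum of Gaussians with nls() and convert each fitted sigma to a FWHM. This is only a sketch: the gauss3 helper and the hand-picked starting values below are mine (read roughly off the plot), and nls() can fail to converge if they are poor:
# Sum of three Gaussians: a = amplitude, m = centre, s = sigma of each peak
gauss3 <- function(x, a1, m1, s1, a2, m2, s2, a3, m3, s3) {
  a1 * exp(-(x - m1)^2 / (2 * s1^2)) +
    a2 * exp(-(x - m2)^2 / (2 * s2^2)) +
    a3 * exp(-(x - m3)^2 / (2 * s3^2))
}
# Starting values are rough guesses for the three visible peaks
fit <- nls(Counts ~ gauss3(Height, a1, m1, s1, a2, m2, s2, a3, m3, s3),
           start = list(a1 = 4000, m1 = 1,  s1 = 2,
                        a2 = 2500, m2 = 17, s2 = 3,
                        a3 = 4000, m3 = 28, s3 = 2))
coef(fit)                                              # fitted centres (m) and sigmas (s)
2 * sqrt(2 * log(2)) * coef(fit)[c("s1", "s2", "s3")]  # FWHM of each fitted peak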

Spatstat Point Pattern Analysis for colocalization

I am trying to do some colocalization analysis, i.e. I want to show whether one cell type tends to show up significantly closer to another, different cell type in a microscopy image.
I tried to do this with the R spatstat package and was able to visualize my dataset:
mypattern is one kind of cell and mypattern2 is another kind of cell. When you look at the L-plots you can see that there is some kind of clustering, as the curve deviates from Poisson.
I thought about using a nearest-neighbour approach, which is the nncross function in spatstat.
But how can I now show whether this distance is random (two random point patterns) or significantly relevant? Does anyone have an idea? I have seen a lot about simulations like Monte Carlo but I have no idea how to begin coding...
I would be glad for any help!
Kind regards,
Hashirama
The L function should not be used here because the data are highly inhomogeneous.
I suggest you combine the two point patterns into a single "marked" point pattern,
X <- superimpose(A=mypattern1, B=mypattern2)
Then estimate the spatially-varying densities of points
D <- density(split(X))
plot(D)
or the spatially varying proportions of each type of cell
R <- relrisk(X)
plot(R)
You can also use segregation.test or a contingency table of nearest neighbours (dixon).
See Chapter 14 of the spatstat book and the help files for relrisk, density.splitppp and segregation.test.
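To address the significance question directly, here is a minimal sketch of a Monte Carlo segregation test. The two patterns below are synthetic stand-ins (I don't have your data), and nsim = 99 is an arbitrary choice:
library(spatstat)
# Synthetic stand-ins for the two cell patterns (replace with your own ppp objects)
set.seed(42)
mypattern1 <- rpoispp(50)                                  # cell type A, homogeneous
mypattern2 <- rpoispp(function(x, y) 100 * x, lmax = 100)  # cell type B, inhomogeneous
# Combine into a single marked point pattern and test whether the types are segregated
X <- superimpose(A = mypattern1, B = mypattern2)
seg <- segregation.test(X, nsim = 99)
print(seg)
# Spatially varying probability of each cell type
plot(relrisk(X))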

How can I achieve hierarchical clustering with p-values for a large dataset?

I am trying to carry out hierarchical cluster analysis (based on Ward's method) on a large dataset (thousands of records and 13 variables) representing multi-species observations of marine predators, to identify possible significant clusters in species composition.
Each record has date, time etc and presence/absence data (0 / 1) for each species.
I attempted hierarchical clustering with the function pvclust. I transposed the data (pvclust clusters the columns of a table), then ran pvclust selecting Jaccard distance (“binary” in R) as the distance measure (suitable for species presence/absence data) and Ward’s method (“ward.D2”). I used parallel = TRUE to reduce computation time. However, with the default nboot = 1000 my computer could not finish the computation within hours and finally threw an error, so I tried a lower nboot (100).
I cannot provide my dataset here, and I do not think it makes sense to provide a small test dataset, as one of the main issues seems to be the size of the dataset itself. However, here are the lines of code I used for the transposition, clustering and plotting:
tdata <- t(data)
cluster <- pvclust(tdata, method.hclust = "ward.D2", method.dist = "binary",
                   nboot = 100, parallel = TRUE)
plot(cluster, labels=FALSE)
This is the dendrogram I obtained (never mind the confusion at the lower levels due to overlap of branches).
As you can see, the p-values at the higher branches of the dendrogram all seem to be 0.
Now, I understand that my data may not be perfect, but I still think there is something wrong with the method I am using, as I would not expect all these values to be zero even with very low significance in the clusters.
So my questions would be
is there anything I got wrong in the pvclust function itself?
could my low nboot (due to a “weak” computer) be a reason for the non-significance of my results?
are there other functions in R I could try for hierarchical clustering that also deliver p-values?
Thanks in advance!
Edit: I have tried to run the same code on a subset of 500 records with nboot = 1000. This finished in a reasonable computation time, but the output is still not very satisfying (see the second dendrogram, obtained for a subset of 500 records with nboot = 1000).

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input on this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step is a vertical profile of species sightings in the ocean; the profiles were taken 6 h apart. Within each profile the sightings are spaced 0.1 m apart vertically (and the 5 profiles 6 h apart in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example:
I find the R documentation on this not so great, so what I have done so far is use the MTS package with the ccm function to create cross-correlation matrices. However, interpreting the figures is rather difficult with such sparse documentation, so I would appreciate some help with that.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save in file cross_correlation_stack.csv or change as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 = read.csv(d1)
# USING package MTS
mod1<-ccm(d2,lag=1000,level=T)
#USING base R
acf(d2,lag.max=1000)
# MQ plot also from MTS package
mq(d2,lag=1000)
The ccm and mq calls each produce their figures, and the acf call produces a matrix of ACF/CCF plots (the figures are not reproduced here).
My question now is whether I am going in the right direction, or whether there are better-suited packages and commands for this.
Since the default figures don't get any titles etc., what exactly am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it calculates the autocovariance or autocorrelation, which I assume is not what I want. But then again it's the only command that seems to work on multivariate data. I am confused.
The plot with the significance values shows that after a lag of about 150 (15 meters) the p-values increase. How would you interpret that for my data, where sightings are at 0.1 m intervals and many lags up to 100-150 are significant? Would that mean that peaks in sightings are stable over the 5 time steps at a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at. Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the columns of d2 (the data frame read in above). Something like:
cc <- vector("list", choose(dim(d2)[2], 2))
par(mfrow = c(ceiling(choose(dim(d2)[2], 2) / 2), 2))
cnt <- 1
for (i in 1:(dim(d2)[2] - 1)) {
  for (j in (i + 1):dim(d2)[2]) {
    cc[[cnt]] <- ccf(d2[, i], d2[, j],
                     main = paste0("Cross-correlation of ", colnames(d2)[i],
                                   " with ", colnames(d2)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCF's and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
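If you then want the lag at which each pair is most strongly correlated, you can pull it out of the stored objects afterwards; a small sketch, assuming the loop above has been run:
# Lag with the largest absolute cross-correlation for each pair
best_lags <- sapply(cc, function(ccfobj) ccfobj$lag[which.max(abs(ccfobj$acf))])
best_lags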
All of that said, however, the ccf is only defined for data that are more-or-less normally distributed, but your data are clearly overdispersed with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association" such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
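If you want to experiment with the mutual-information route, a minimal sketch using infotheo could look like this (the column indices are placeholders for two of your profiles):
library(infotheo)
x <- d2[, 1]   # placeholder: first profile
y <- d2[, 2]   # placeholder: second profile
# infotheo works on discrete data, so bin the counts first
xd <- discretize(x)
yd <- discretize(y)
# Mutual information between the two profiles (in nats by default)
mutinformation(xd, yd)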

Understanding `scale` in R

I'm trying to understand the definition of scale that R provides. I have data (mydata) that I want to make a heat map with, and there is a VERY strong positive skew. I've created a heatmap with a dendrogram for both scale(mydata) and log(mydata), and the dendrograms are different for both. Why? What does it mean to scale my data, versus log-transform my data? And which would be more appropriate if I want to look at the dendrogram illustrating the relationship between the columns of my data?
Thank you for any help! I've read the definitions but they go right over my head.
log simply takes the logarithm (base e, by default) of each element of the vector.
scale, with default settings, will calculate the mean and standard deviation of the entire vector, then "scale" each element by those values by subtracting the mean and dividing by the sd. (If you use scale(x, scale=FALSE), it will only subtract the mean but not divide by the std deviation.)
Note that these two give you the same values:
set.seed(1)
x <- runif(7)
# Manually scaling
(x - mean(x)) / sd(x)
scale(x)
It provides nothing else but a standardization of the data. The values it creates are known under several different names, one of them being z-scores ("Z" because the normal distribution is also known as the "Z distribution").
More can be found here:
http://en.wikipedia.org/wiki/Standard_score
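A small aside: the object returned by scale() also stores the centering and scaling values it used as attributes, so you can recover them later:
s <- scale(x)
attr(s, "scaled:center")   # the mean that was subtracted
attr(s, "scaled:scale")    # the standard deviation that was divided by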
This is a late addition, but I was looking for information on the scale function myself and thought it might help somebody else as well.
To modify the response from Ricardo Saporta a little bit: with center = FALSE, scaling is done using the root mean square rather than the standard deviation (at least in version 3.6.1 of R). I base this on “Becker, R. (2018). The New S Language. CRC Press.” and on my own experimentation.
X <- rnorm(20)                                          # any numeric vector
X.man.scaled <- X / sqrt(sum(X^2) / (length(X) - 1))    # divide by the root mean square
X.aut.scaled <- scale(X, center = FALSE)
The results of X.man.scaled and X.aut.scaled are exactly the same; I show it without centering for simplicity.
I would respond in a comment but did not have enough reputation.
I thought I would contribute by providing a concrete example of the practical use of the scale function. Say you have 3 test scores (Math, Science, and English) that you want to compare. You may even want to generate a composite score based on the 3 tests for each observation. Your data could look like this:
student_id <- seq(1,10)
math <- c(502,600,412,358,495,512,410,625,573,522)
science <- c(95,99,80,82,75,85,80,95,89,86)
english <- c(25,22,18,15,20,28,15,30,27,18)
df <- data.frame(student_id,math,science,english)
Obviously it would not make sense to compare the means of these 3 scores, as their scales are vastly different. By scaling them, however, you get more comparable scoring units:
z <- scale(df[,2:4],center=TRUE,scale=TRUE)
You could then use these scaled results to create a composite score. For instance, average the values and assign a grade based on the percentiles of this average. Hope this helped!
Note: I borrowed this example from the book "R In Action". It's a great book! Would definitely recommend.
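One possible way to build the composite score suggested above (my own sketch, not taken from the book; the percentile step is just a crude rank-based version):
df$composite  <- rowMeans(z)                    # average of the three z-scores
df$percentile <- rank(df$composite) / nrow(df)  # crude percentile of the composite
df[order(-df$composite), ]                      # students ranked by composite score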
