Peak to Gaussian function estimation in R

I have a continuous flow of input data: more than 100,000 records. Each record has two values, a time and an intensity. The data contains many peaks; let's look at a part of the data.
Objectives: search for the peaks -> identify them -> calculate their areas.
Problem: a huge peak (like the one between 8.0 and 8.5) could contain multiple Gaussian peaks (the Gaussian is just one "normal" case; there are other types of estimation functions as well).
Question: how can I "deconvolute" these peaks in order to measure the area of each of them?
Example: I want to do something similar to the following MATLAB code:
iPeak

Well, if "iPeak" is good enough, just port the code to R. There are a couple packages in R that do thresholded peak finding. Naturally I forgot their names, being away from my main machine. If I can refresh my brain... possible fits for you: PROcess, ppc, seewave, Peaks, and the one I like: pracma.
Note: I found these using the incredibly useful sos package.
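As a rough sketch of how both steps might look in R with pracma and nls (the data frame sig, the thresholds and all starting values below are hypothetical and would have to be adapted to the real data):
library(pracma)

# hypothetical input: one time and one intensity value per record
# sig <- data.frame(time = ..., intensity = ...)

# 1) thresholded peak finding: each row of pk holds the peak height, the index
#    of the maximum, and the indices where the peak begins and ends
pk <- findpeaks(sig$intensity, minpeakheight = 50, minpeakdistance = 20)

# 2) deconvolute one overlapping region (e.g. time 8.0 to 8.5) into two Gaussians
reg <- subset(sig, time >= 8.0 & time <= 8.5)
fit <- nls(intensity ~ a1 * exp(-(time - m1)^2 / (2 * s1^2)) +
                       a2 * exp(-(time - m2)^2 / (2 * s2^2)),
           data  = reg,
           start = list(a1 = 100, m1 = 8.15, s1 = 0.05,
                        a2 = 80,  m2 = 8.35, s2 = 0.05))

# area under each Gaussian component: amplitude * sd * sqrt(2 * pi)
co <- coef(fit)
areas <- c(co["a1"] * co["s1"], co["a2"] * co["s2"]) * sqrt(2 * pi)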

Related

Why is k-means clustering ignoring a significant patch of data?

I'm working with a set of coordinates, and I want to understand dynamically (I have many sets that need to go through this process) how many distinct groups there are within the data. My approach was to apply k-means to see whether it would find the centroids, and I could go from there.
When plotting some data with six visually distinct clusters, the k-means algorithm keeps ignoring two significant clusters while putting many centroids into another.
See image below:
Red points are the coordinate data and blue points are the centroids that k-means has produced. In this specific case I've gone for 15 centroids (arbitrary), but it still doesn't recognise those patches of data on the right-hand side, placing a single centroid midway between them while putting eight into the cluster in the top right.
Admittedly there are slightly more data points in the top right, but not by much.
I'm using the standard k-means algorithm in R and just feeding in the x and y coordinates. I've tried standardising the data, but this makes no difference.
Any thoughts on why this happens, or other methodologies that could be applied to dynamically determine the number of distinct clusters in the data?
You could try a self-organising map (SOM):
this is a clustering algorithm based on neural networks which creates a discretised representation of the input space of the training samples, called a map, and is therefore also a method for dimensionality reduction.
This algorithm is also well suited to clustering because it does not require an a priori choice of the number of clusters (in k-means you need to choose k; here you don't). In your case it may well find the appropriate number of clusters automatically, and you can actually visualise it.
You can find a very nice Python package called somoclu which implements this algorithm and provides an easy way to visualise the result. Alternatively you can stay in R: here you can find a blog post with a tutorial, and the CRAN package manual for SOM.
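If you stay in R, a minimal sketch with the kohonen package (version 3.x) might look like this; the coordinate matrix xy and the 6x6 grid size are hypothetical:
library(kohonen)

# hypothetical input: a two-column matrix of x/y coordinates, standardised
xy <- scale(cbind(df$x, df$y))

som_fit <- som(xy, grid = somgrid(xdim = 6, ydim = 6, topo = "hexagonal"))
plot(som_fit, type = "mapping")   # where the points land on the map
plot(som_fit, type = "counts")    # how many points each unit attracts

# group the SOM codebook vectors into clusters, e.g. by hierarchical clustering
codes <- getCodes(som_fit)
clusters <- cutree(hclust(dist(codes)), k = 6)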
K-means is a randomised algorithm and it can get stuck in local minima.
Because of this, it is common to run k-means several times and keep the result with the smallest within-cluster sum of squares, i.e. the best of the local minima found.
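In base R the usual guard against this is the nstart argument of kmeans, which restarts from several random initialisations and keeps the solution with the lowest within-cluster sum of squares; an elbow plot of that sum over k also gives a crude way to pick the number of clusters. A minimal sketch (the coordinate matrix xy is hypothetical):
# restart from 25 random initialisations and keep the best local minimum
km <- kmeans(xy, centers = 6, nstart = 25)

# crude elbow plot: total within-cluster sum of squares as a function of k
wss <- sapply(1:15, function(k) kmeans(xy, centers = k, nstart = 25)$tot.withinss)
plot(1:15, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")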

Multiple regression lines to define a set of data

I am trying to use a regression model to establish a relationship between two parameters, A and B (more specifically, runtime and workload), so that I can, for example, recommend an optimal workload or quantify how strongly one affects the other. I am using rlm (robust linear model) for this purpose, since it saves me the trouble of dealing with outliers beforehand.
However, rather than output one single regression line, I would like to determine a band that can confidently explain most of the points. Here is an image I took from the web; the additional red lines are what I want to determine.
This is what I had in mind:
1. Take the mean of the residuals of all the points lying above the line, then shift the original regression line upwards by some multiple of it, e.g. mean + k*sigma. The same can be done for the points below the line.
2. In SVMs, in order to find the support vectors, parallel lines are drawn (essentially the middle line is shifted until support vectors are found on either side). I had something like that in mind: play around with the intercepts a little, count the number of points the band explains, and keep a threshold so you can stop somewhere.
The problem is that I am unable to implement this in R, and for that matter I am not sure whether these approaches even work. I would like to know what you would suggest. Also, is there a classic way to do this using one of the many R packages?
Thanks a lot for helping. Appreciate it.
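A minimal sketch of the residual-shift idea described above (the data frame df with columns workload and runtime and the 90% coverage target are hypothetical):
library(MASS)

fit <- rlm(runtime ~ workload, data = df)

# shift the fitted line by empirical quantiles of the residuals so that the
# band between the two red lines covers roughly 90% of the points
res <- residuals(fit)
lo  <- quantile(res, 0.05)
hi  <- quantile(res, 0.95)

plot(df$workload, df$runtime)
abline(fit)
abline(coef(fit)[1] + lo, coef(fit)[2], col = "red")
abline(coef(fit)[1] + hi, coef(fit)[2], col = "red")

# fraction of points actually inside the band
mean(res >= lo & res <= hi)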

R - Approach to find outliers/artefacts in blood pressure curve

Do you have any ideas on how to approach the problem of finding artefacts/outliers in a blood pressure curve? My goal is to write a program that finds the start and end of each artefact. Here are some examples of different artefacts; the green area is the correct blood pressure curve and the red one is the artefact that needs to be detected:
And this is an example of a whole blood pressure curve:
My first idea was to calculate the mean of the whole curve as well as the means over short intervals of the curve, and then find out where they differ. But the blood pressure varies so much that I don't think this can work: it would flag too many non-existent "artefacts".
Thanks for your input!
EDIT: Here is some data for two example artefacts:
Artefact1
Artefact2
Without any data, there is only the option of pointing you towards different methods.
First (without knowing your data, which is always a huge drawback), I would point you towards Markov switching models, which can be analysed using the HiddenMarkov package or the HMM package. (Unfortunately the RHmm package that the first link describes is no longer maintained.)
You might also find it worthwhile to look into Twitter's outlier detection.
Furthermore, there are many blog posts that look into change point detection or regime changes. I find this R-bloggers post very helpful for a start. It refers to the cpm package ("Sequential and Batch Change Detection Using Parametric and Nonparametric Methods"), the bcp package ("Bayesian Analysis of Change Point Problems"), and the ecp package ("Non-Parametric Multiple Change-Point Analysis of Multivariate Data"). You probably want to look into the first two, as you don't have multivariate data.
Does that help you get started?
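For a first experiment, a minimal sketch with the bcp package could look like this (the data frame df2 with a BP column is the same hypothetical object used in the answer below; the 0.5 cut-off on the posterior probability is arbitrary):
library(bcp)

bp <- as.numeric(df2$BP)   # pressure readings as a plain numeric vector

fit <- bcp(bp)             # Bayesian change point analysis
plot(fit)                  # posterior means and change point probabilities

# indices where a change point is likely, i.e. candidate artefact boundaries
which(fit$posterior.prob > 0.5)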
I can offer a graphical answer that does not use any statistical algorithm. From your data I observe that the "abnormal" sequences seem to present either constant portions or, conversely, very high variations. Working on the derivative and setting limits on it could therefore work. Here is a workaround:
library(forecast)
test <- ma(as.numeric(df2$BP), order = 50)   # smooth out micro-variations
test <- test[complete.cases(test)]           # drop the NAs that ma() introduces
# flag points with a large smoothed derivative; smoothing the flag itself
# bridges tiny gaps inside an abnormal block
flag <- ma(as.numeric(abs(diff(test)) > 1), order = 10) > 0.1
flag <- c(flag, FALSE)                       # pad to the length of test
flag[is.na(flag)] <- FALSE
abnormal <- test
abnormal[!flag] <- NA                        # keep only the flagged portions
plot(x = seq_along(test), y = test, type = "l")
lines(x = seq_along(test), y = abnormal, col = "red")
What it does: it first "smooths" the data with a moving average to prevent micro-variations from being detected. Then it applies diff (the derivative) and tests whether its absolute value is greater than 1 (this value has to be adjusted manually depending on the smoothing amplitude). Then, in order to get a whole "block" of abnormal sequence without tiny gaps, the boolean flag is itself smoothed and tested against 0.1 so that the boundaries of the zone are captured better. Finally, the spotted portions are overplotted in red.
This works for one type of abnormality. For the other type, you could instead set a low threshold on the derivative and play with the tuning parameters of the smoothing.
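For example, the flat-line type of artefact could be flagged in the same way by testing the smoothed derivative against a small cut-off instead; the 0.01 and 0.9 values are guesses to be tuned, and the variables reuse the code above:
# flag near-constant stretches: the smoothed |derivative| stays below a small cut-off
flat <- ma(as.numeric(abs(diff(test)) < 0.01), order = 10) > 0.9
flat <- c(flat, FALSE)
flat[is.na(flat)] <- FALSE
flatline <- test
flatline[!flat] <- NA
lines(x = seq_along(test), y = flatline, col = "blue")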

Clustering time series in R

I have a problem with clustering time series in R.
I googled a lot and found nothing that fits my problem.
I have made an STL decomposition of the time series.
The trend components are in a matrix with 64 columns, one for every series.
Now I want to cluster these series into similar groups, taking both the curve shapes and the shifts in time into account. I found some functions that cover one of these aspects, but not both.
First I tried to calculate a distance matrix with the DTW distance; this gave me clusters based on the values and accounted for the time shift, but not for the shape of the time series. After that I tried some correlation-based clustering, but then the time shift was not recognised and the result did not satisfy my requirements.
Is there a function that covers my problem, or do I have to build something up on my own? I am thankful for any kind of help; after two days of tutorials and examples I am totally uninspired. I hope I have explained the problem well enough.
I attached a picture; here you can see some example time series.
It shows the problem: the two series in the middle are put into one cluster, although the upper one and the one at the bottom each have the same shape as one of the middle series.
Have you tried the R package dtwclust?
https://cran.r-project.org/web/packages/dtwclust/index.html
(I'm just starting to explore this package, but it seems to cover many aspects of time series clustering and it has lots of good references.)
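A minimal sketch with dtwclust, assuming the 64 trend components sit in the columns of a matrix trend_mat (a hypothetical name), could be:
library(dtwclust)

# tsclust expects one series per list element, so split the matrix by column
series <- as.list(as.data.frame(trend_mat))

cl <- tsclust(series, type = "partitional", k = 4L,
              distance = "dtw_basic", centroid = "dba",
              args = tsclust_args(dist = list(window.size = 10L)))

plot(cl)        # series grouped by cluster
cl@cluster      # cluster membership for each of the 64 series

Swapping in distance = "sbd" with centroid = "shape" gives the shape-based alternative if the DTW grouping still ignores the curve shapes.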
You can also use the kml package, which is designed specifically for longitudinal data. You can consult its help; it contains the following example:
library(kml)
### Generation of some data
cld1 <- generateArtificialLongData(25)
### We suspect 3, 4 or 6 clusters and want 3 redrawings.
### We want to "see" what happens (so toPlot is 'both').
kml(cld1, c(3, 4, 6), 3, toPlot = "both")
### 4 seems to be the best. We want more redrawings (10 this time),
### without plotting again, to get the result as fast as possible.
kml(cld1, 4, 10)
Example cluster

Fitting a binormal distribution in R

As the title says, I have some data that is roughly binormally distributed and I would like to find its two underlying components.
I am fitting to the data distribution the sum of two normals with means m1 and m2 and standard deviations s1 and s2. The two Gaussians are scaled by weight factors such that w1 + w2 = 1.
I can manage to do this using the vglm function of the VGAM package, for example:
fitRes <- vglm(mydata ~ 1,
               mix2normal1(equalsd = FALSE, iphi = w,
                           imu1 = m1, imu2 = m2, isd1 = s1, isd2 = s2))
This is painfully slow and can take several minutes depending on the data, but I can live with that.
Now I would like to see how the distribution of my data changes over time, so essentially I break my data up into a few (30-50) blocks and repeat the fit for each of them.
So, here are the questions:
1) How do I speed up the fitting? I tried nls and mle, which look much faster, but I mostly failed to get a good fit (although I succeeded in getting every possible error these functions could throw at me). It is also not clear to me how to impose constraints with those functions (w in [0, 1] and w1 + w2 = 1).
2) How do I automagically choose good starting parameters (I know this is a million-dollar question, but you never know, maybe someone has the answer)? Right now I have a little interface that lets me choose the parameters and visually check what the initial distribution would look like, which is very cool, but I would like to do this automatically.
I thought of using the x values corresponding to the 3rd and 4th quartiles of y as starting values for the two means. Do you think that would be a reasonable thing to do?
First things first:
Did you search for "fit mixture model" on RSeek.org?
Did you look at the Cluster Analysis & Finite Mixture Models Task View?
There has been a lot of research into mixture models, so you may well find something.
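One concrete option from that Task View is mixtools, whose EM fitter is usually much faster than vglm and accepts starting values directly; a minimal sketch, with the starting values guessed from the data roughly as you suggest (all numbers here are hypothetical):
library(mixtools)

# starting means taken from the data itself, e.g. the lower and upper quartiles
m_start <- as.numeric(quantile(mydata, c(0.25, 0.75)))

fit <- normalmixEM(mydata, k = 2,
                   lambda = c(0.5, 0.5),            # weights, constrained to sum to 1
                   mu     = m_start,
                   sigma  = rep(sd(mydata) / 2, 2))

fit$lambda   # estimated w1, w2
fit$mu       # estimated m1, m2
fit$sigma    # estimated s1, s2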
