Testing CSR on lpp with R

I recently posted a "very newbie to R" question about the correct way of doing this; if you are interested, you can find it [here].
I have now managed to develop a simple R script that does the job, but the results are what trouble me.
Long story short, I'm using R to analyze lpp (linear point pattern) objects with mad.test. That function performs a hypothesis test where the null hypothesis is that the points are randomly distributed. Currently I have 88 lpps to analyze, and according to the p.value, 86 of them are randomly distributed and 2 of them are not.
These are the two not randomly distributed lpps.
Looking at them you can see some kind of clustering in the first one, but the second one only has three points, and it seems to me that there is no way one can claim that just three points do not correspond to a random distribution. There are other tracks with one, two or three points, but they all fall into the "random" lpps category, so I don't know why this one is different.
So here is the question: how few points are too few for CSR testing?
I have also noticed that these two lpps have a much lower $statistic$rank than the others. I have tried to find out what that means but I'm clueless right now, so here is another newbie question: is $statistic$rank some kind of quality indicator, and can I therefore use it to group my lpp analyses into "significant" ones and "too few points" ones?
My R script and all the shp files can be downloaded from here (850 Kb).
Thank you so much for your help.

It is impossible to give a universal answer to the question of how many points are needed for an analysis. Usually 0, 1 and 2 are too few for a standalone analysis. However, if they are part of repeated measurements of the same thing they might still be interesting. I would normally also say that your example with 3 points is too few to say anything interesting. However, an extreme example would be a single long line segment where one point occurs close to one end and the two others occur close to each other at the other end. That is not very likely to happen under CSR, so you may be inclined not to believe that hypothesis. This appears to be what happened in your case.
Regarding your question about the rank, you might want to read up a bit more on the Monte Carlo test you are performing. Basically, you summarise the point pattern by a single number (the maximum absolute deviation of the linear K-function) and then you look at how extreme this number is compared to numbers generated at random from CSR. Assuming you use 99 simulations of CSR, you have 100 numbers in total. If your data ranks as the most extreme ($statistic$rank==1) among these, it has a p-value of 1%. If it ranks as the 50th number, the p-value is 50%. If you used another number of simulations you have to calculate accordingly, i.e. with 199 simulations rank 1 is 0.5%, rank 2 is 1%, etc.
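In code, the rank-to-p-value arithmetic is simply this (a minimal sketch of the convention described above, p = rank / (nsim + 1)):

nsim <- 99                   # number of CSR simulations
rank_of_data <- 1            # data is the most extreme of the nsim + 1 values
rank_of_data / (nsim + 1)    # p-value: 0.01, i.e. 1%

nsim <- 199
rank_of_data / (nsim + 1)    # with 199 simulations, rank 1 gives 0.005, i.e. 0.5%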

There is a fundamental problem here with multiple testing. You are applying a hypothesis test 88 times. The test is (by default) designed to give a false positive in 5 percent (1 in 20) of applications, so if the null hypothesis is true, you should expect 88/20 = 4.4 false positives to have occurred in your 88 tests. Getting only 2 positive results ("non-random") is therefore entirely consistent with the null hypothesis that ALL of the patterns are random. My conclusion is that the patterns are random.
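If you want to account for this explicitly, here is a minimal sketch using base R's p.adjust (the runif call is only a placeholder for your 88 mad.test p-values):

n_tests <- 88
alpha   <- 0.05
n_tests * alpha                                    # expected false positives under H0: 4.4

pvals  <- runif(n_tests)                           # placeholder for the 88 mad.test p-values
p_bonf <- p.adjust(pvals, method = "bonferroni")   # family-wise error rate control
p_bh   <- p.adjust(pvals, method = "BH")           # false discovery rate control
sum(p_bonf < alpha); sum(p_bh < alpha)             # how many results survive the correction?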

Related

LASSO coefficients equal to 0 using opt1D

I have a question about LASSO. It is driving me crazy because it is something I cannot solve with my background alone; I'm a biologist.
Briefly, I ran LASSO using the R library "penalized". In particular I used the opt1D function with around 500 simulations on a numerical data.frame with around 30 columns, which are the biomarkers (gene expression) I want to test, and around 3000 rows, which are people, of whom around 50 are tumours and all the others are normals.
Unfortunately, with L1 regularization all, and I really mean all, coefficients across the 500 simulations are 0. If I check the L2 matrix of coefficients, they are close to 0. My point is that I cannot believe that none of my biomarkers is able to distinguish between normals and tumours.
I don't know if what I have done is all I can do to check the discriminatory potential of my molecules. Is there something else I can do to understand why they are all 0, and is there something else I can do to verify that they really are not able to stratify my cohort?
Did you consider fitting your data without penalization before using regularization? L1 regularization will naturally result in a significant number of zero coefficients.
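A minimal sketch of such an unpenalized fit (assuming 'dat' is a data.frame holding the ~30 biomarker columns plus a 0/1 column 'tumour'; both names are placeholders):

fit <- glm(tumour ~ ., data = dat, family = binomial)   # plain logistic regression, no penalty
summary(fit)   # which biomarkers get non-trivial coefficients and small p-values?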
As a side note I would first run PCA/PCoA and see whether or not your genes separate according to your class variable. This could save you some time and allow you to trim your data set to those genes that show the greatest differences across your class variable. Also if you have relatively little experience with R I would suggest using a linear modeling package such as Limma since it has excellent documentation and many examples that are easy to follow.
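A minimal sketch of that PCA check (assuming 'expr' is your 3000 x 30 expression data.frame and 'class' is a factor with levels "normal"/"tumour"; both are stand-ins for your objects):

pc <- prcomp(expr, center = TRUE, scale. = TRUE)   # PCA on the standardized biomarkers
plot(pc$x[, 1], pc$x[, 2],
     col = ifelse(class == "tumour", "red", "grey40"),
     xlab = "PC1", ylab = "PC2")                   # do the tumours separate at all?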

Suggested Neural Network for small, highly varying dataset?

I am currently working with a small dataset of training values, no more than 20, and am getting large MSE. The input data vectors themselves consist of 16 parameters, many of which are binary variables. Across all the training values, a majority of the 16 parameters stay the same (but not all). The remaining input variables, across all the exemplars, vary a lot amongst one another. This is to say, two exemplars might appear to be the same except for two parameters in which they differ, one parameter being a binary variable, and another being a continuous variable, where the difference could be greater than a single standard deviation (for that variable's set of values).
My single output variable (as of now) can either be a continuous variable, OR, depending on the true difficulty of reducing the error in my situation, I can make this a classification problem instead with 12 different classes.
I have been researching neural networks other than my current implementation of a feed-forward MLP, having read up on stochastic NNs, ladder NNs, and many forms of recurrent NNs. I am stuck on which one I should investigate, as I do not have time to try every NN available.
While my description may be vague, could anyone make a suggestion as to which network I should investigate to minimize my cost function (as of now, MSE) the most?
If my current setup turns out to be intractable because of the sheer difficulty of predicting the correct output from such a small set of highly variable training values, which network would work best should my dataset be expanded to the order of thousands of exemplars (at the cost of a significantly more redundant, seemingly homogeneous set of input values)?
Any help is most certainly appreciated.
20 samples is very small, especially if you have 16 input variables. It will be hard to determine which of those inputs is responsible for your output value. If you keep your network simple (fewer layers and units) you may be able to get by with roughly as many samples as you would need for traditional regression.
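To illustrate the "keep it simple" suggestion, here is a minimal sketch with the nnet package (my choice of package, not the answerer's; X and y stand in for the 20 x 16 input matrix and the continuous output, and the size/decay values are placeholders to tune):

library(nnet)
set.seed(1)
fit <- nnet(x = X, y = y, size = 2, linout = TRUE,   # tiny hidden layer, linear output
            decay = 0.1, maxit = 500)                # weight decay to curb overfitting
mean((predict(fit, X) - y)^2)                        # training MSE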

R - Approach to find outliers/artefacts in blood pressure curve

Do you guys have an idea how to approach the problem of finding artefacts/outliers in a blood pressure curve? My goal is to write a program that finds the start and end of each artefact. Here are some examples of different artefacts; the green area is the correct blood pressure curve and the red one is the artefact that needs to be detected:
And this is an example of a whole blood pressure curve:
My first idea was to calculate the mean of the whole curve and many means over short intervals of the curve, and then find out where they differ. But the blood pressure varies so much that I don't think this could work, because it would find too many non-existent "artefacts".
Thanks for your input!
EDIT: Here is some data for two example artefacts:
Artefact1
Artefact2
Without any data there is just the option to point you towards different methods.
First (without knowing your data, which is always a huge drawback), I would point you towards Markov switching models, which can be analysed using the HiddenMarkov-package, or the HMM-package. (Unfortunately the RHmm-package that the first link describes is no longer maintained)
You might find it worthwhile to look into Twitter's outlier detection.
Furthermore, there are many blogposts that look into change point detection or regime changes. I find this R-bloggers blog post very helpful for a start. It refers to the CPM-package, which stands for "Sequential and Batch Change Detection Using Parametric and Nonparametric Methods", the BCP-package ("Bayesian Analysis of Change Point Problems"), and the ECP-package ("Non-Parametric Multiple Change-Point Analysis of Multivariate Data"). You probably want to look into the first two as you don't have multivariate data.
Does that help you get started?
I can offer a graphical answer that does not use any statistical algorithm. From your data I observe that the "abnormal" sequences seem to contain either constant portions or, conversely, very high variations. Working on the derivative and setting limits on it could work. Here is a workaround:
require(forecast)
bp <- as.numeric(df2$BP)                   # raw blood pressure series
bp <- ma(bp, order = 50)                   # moving average to smooth out micro-variations
bp <- bp[!is.na(bp)]
# flag samples where the derivative of the smoothed curve is large
flag <- ma(0 + (abs(diff(bp)) > 1), order = 10) > 0.1
flag <- c(FALSE, flag)                     # pad so it lines up with 'bp'
flag[is.na(flag)] <- FALSE
abnormal <- bp
abnormal[!flag] <- NA                      # keep only the flagged portions
plot(x = seq_along(bp), y = bp, type = 'l')
lines(x = seq_along(bp), y = abnormal, col = 'red')
What it does: first it "smooths" the data with a moving average to prevent the micro-variations from being detected. Then it applies the "diff" function (derivative) and tests whether it is greater than 1 (this value has to be adjusted manually depending on the smoothing amplitude). Then, in order to get a whole "block" of abnormal sequence without tiny gaps, we again apply a smoothing to the boolean flag and test whether it exceeds 0.1, to better capture the boundaries of the zone. Finally, the spotted portions are overplotted in red.
This works for one type of abnormality. For the other type (the flat portions), you could instead set a low threshold on the derivative, as sketched below, and play with the tuning parameters of the smoothing.
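A possible sketch for those flat portions, reusing the smoothed series 'bp' from above (the 0.02 cut-off is a placeholder to tune):

flat <- ma(0 + (abs(diff(bp)) < 0.02), order = 10) > 0.1   # derivative close to zero
flat <- c(FALSE, flat); flat[is.na(flat)] <- FALSE          # align with 'bp'
lines(x = seq_along(bp), y = ifelse(flat, bp, NA), col = 'blue')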

Obtaining noise in DBSCAN using R

I have a dataset consisting of bets for soccer matches. I am carrying out outlier detection using 3 parameters, the odds that the home team wins, the odds that the match ends in a draw, and the odds that the away team wins.
Each record looks something like this:
Home Draw Away
1.320 5.700 13.500
I have identified the clusters but am having difficulty identifying which one contains the noise; the most plausible seems to be the last cluster (i.e. if I have 10 clusters, cluster 10 would be the noise).
Is this the correct way of obtaining outliers from my dataset using DBSCAN, or is there a better way?
Also, how can I know how many clusters I have, in order to obtain the last one (the one with the noise), without checking manually?
I am completely new to statistical programming and outlier detection, I apologise if I sound utterly clueless.
Read the documentation, please.
integer vector coding cluster membership with noise observations (singletons) coded as 0
It's there, just search for the word "noise" in the manual of dbscan.
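For concreteness, a minimal sketch with fpc::dbscan ('odds' stands in for your data frame of home/draw/away odds; the eps and MinPts values are placeholders to tune):

library(fpc)
fit <- dbscan(scale(odds), eps = 0.5, MinPts = 5)   # cluster the standardized odds
outliers <- odds[fit$cluster == 0, ]                # cluster 0 == noise points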

How to normalize benchmark results to obtain distribution of ratios correctly?

To give a bit of context: I am measuring the performance of virtual machines (VMs), or systems software in general, and usually want to compare different optimizations for a performance problem. Performance is measured as absolute runtime for a number of benchmarks, and usually for a number of VM configurations varying over the number of CPU cores used, different benchmark parameters, etc. To get reliable results, each configuration is measured about 100 times. Thus, I end up with quite a number of measurements for all kinds of different parameters, and I am usually interested in the speedup for all of them, comparing the VM with and without a certain optimization.
What I currently do is pick one specific series of measurements, let's say the measurements for a VM with and without an optimization (VM-norm and VM-opt) running benchmark A on 1 core.
Since I want to compare the results across the different benchmarks and numbers of cores, I cannot use absolute runtimes, but need to normalize them somehow. Thus, I pair up the 100 measurements for benchmark A on 1 core for VM-norm with the corresponding 100 measurements of VM-opt to calculate the VM-opt/VM-norm ratios.
When I do that, taking the measurements just in the order I got them, I obviously get quite a high variation in my 100 resulting VM-opt/VM-norm ratios. So I thought: OK, let's assume the variation in my measurements comes from non-deterministic effects and the same effects cause variation in the same way for VM-opt and VM-norm. Then, naively, it should be OK to sort the measurements before pairing them up. And, as expected, that of course reduces the variation.
However, my half-knowledge tells me that is not the best way and perhaps not even correct.
Since I am eventually interested in the distribution of those ratios, to visualize them with beanplots, a colleague suggested using the Cartesian product instead of pairing sorted measurements. That sounds like it would better account for the random nature of pairing two arbitrary measurements for comparison. But I am still wondering what a statistician would suggest for such a problem.
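Concretely, the two pairing schemes amount to something like this (a sketch; 'norm' and 'opt' stand in for the 100 runtimes of VM-norm and VM-opt for one benchmark/core configuration):

ratios_paired <- sort(opt) / sort(norm)           # pair up the sorted measurements
ratios_cart   <- as.vector(outer(opt, 1 / norm))  # all 100 x 100 pairwise ratios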
In the end, I am really interested in plotting the distribution of the ratios with R as bean or violin plots. Simple boxplots, or just mean + stddev, tell me too little about what is going on. These distributions usually point at artifacts produced by the complex interactions inside these much too complex computers, and that's what I am interested in.
Any pointers to approaches for how to produce and work with such ratios in a correct way are very welcome.
PS: This is a repost, the original was posted at https://stats.stackexchange.com/questions/15947/how-to-normalize-benchmark-results-to-obtain-distribution-of-ratios-correctly
I found it puzzling that you got such a minimal response on "Cross Validated". This does not seem like a specific R question, but rather a request for how to design an analysis. Perhaps the audience there thought you were asking too broad a question, but if that is the case then the [R] forum is even worse, since we generally tackle problems where data is actually provided; we deal with requests for implementation construction in our language. I agree that violin plots are preferable to boxplots for examining distributions (when there is sufficient data, and I am not sure that 100 samples per group makes the grade in this instance), but in any case that means the "R answer" is that you just need to refer to the proper R help pages:
library(lattice)
?xyplot
?panel.violin
Further comments would require more details and preferably some data examples constructed in R. You may want to refer to the page where "great question design is outlined".
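For what it's worth, the example on the ?panel.violin page boils down to something like this (a sketch; 'd' stands in for a data frame with one ratio per row and a factor identifying the benchmark):

library(lattice)
bwplot(benchmark ~ ratio, data = d,
       panel = function(..., box.ratio) {
         panel.violin(..., col = "lightblue", varwidth = FALSE, box.ratio = box.ratio)
         panel.bwplot(..., fill = NULL, box.ratio = 0.1)   # overlay a slim boxplot
       })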
One further graphical method: if you are interested in the ratios of two paired variates but do not want to "commit" to just x/y, then you can examine them by plotting one against the other and then drawing iso-ratio lines by repeatedly using abline(a=0, b= ). I think 100 samples is pretty "thin" for doing density estimates, but there are 2d density methods if you can gather more data.
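A sketch of that iso-ratio idea ('norm' and 'opt' again stand in for the paired runtimes; the ratio levels shown are arbitrary):

plot(norm, opt, xlab = "VM-norm runtime", ylab = "VM-opt runtime")
for (r in c(0.8, 0.9, 1, 1.1, 1.25))
  abline(a = 0, b = r, lty = if (r == 1) 1 else 3)   # solid line marks ratio 1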
