I am new to R and having trouble fitting a low-pass filter to my data. I am measuring the force exerted on a treadmill over a period of 30 seconds with a sample rate of 250 Hz.
The data contains negative force values, as seen in this image.
This is due to ripples in the signal or background noise. I need to be able to filter out any force signal < 0, and for this I have used the butter function from the signal package:
ritLowPass = function(s, frqCutOff, bPlot = F)
{
  f = butter(4, frqCutOff/(smpRate/2), "low");  # lowpass filter
  s.lp = rev(filter(f, rev(filter(f, s))));
  if (bPlot) {
    idx = (1*smpRate):(4*smpRate);
    plot(x=idx/smpRate, y=s[idx], xlab="time/s", ylab="signal", ty="l");
    lines(x=idx/smpRate, y=s.lp[idx], col="red", lwd=2)
  }
  return(data.frame(s.lp));
}
VT_filter <- ritLowPass(guest$Fz, 250, bPlot)
sample data:
Time Fz
0 3.769
0.004 -32.94
0.008 -117.305
0.012 -142.329
0.016 -55.35
0.02 -27.362
0.024 29.039
0.028 73.718
0.032 76.633
0.036 4.482
0.04 -80.949
0.044 -114.279
0.048 -102.968
0.052 -9.76
0.056 35.405
0.06 152.541
0.064 79.249
0.068 50.147
0.072 22.547
0.076 47.757
0.08 -29.123
0.084 57.384
0.088 88.715
0.092 195.115
0.096 118.752
0.1 183.22
0.104 157.957
0.108 37.992
0.112 -7.893
When I run the code I get the following error:
VT_filter <- ritLowPass(guest$Fz, 250, bPlot)
Error in butter.default(4, frqCutOff/(smpRate/2), "low") :
butter: critical frequencies must be in (0 1)
Called from: butter.default(4, frqCutOff/(smpRate/2), "low")
I wonder if I should be using a high-pass filter instead, or is there another option for attenuating any force signal lower than zero?
Preamble
I'm not sure I can see anything in the data that suggests your "culprit" frequency is 250 Hz, or that you should cut frequencies above this value.
If you're trying to remove signal noise at a specific frequency, you'll need to find the noise frequency first. spectrum is your friend.
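For instance, a minimal sketch of that first diagnostic step (assuming guest$Fz from the question, sampled at 250 Hz):
# plot the periodogram; peaks indicate the frequencies that dominate the noise
spectrum(ts(guest$Fz, frequency = 250))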
However, assuming you actually want to filter frequencies above 250 Hz:
Short Answer
If you want to filter frequencies above 250 Hz, your sampling frequency needs to be at least 500 Hz.
Long Answer
Your filter can only filter between frequencies of 0 and the Nyquist frequency, i.e. 0 to (Sampling Frequency)/2. This is a hard limit of information theory, not an implementation issue.
You're asking it to filter something that is twice the Nyquist Frequency.
help(butter) gives the following about the W parameter:
W: critical frequencies of the filter. ... For digital filters, W must be between 0 and 1 where 1 is the Nyquist frequency.
The cutoff value you are trying to assign to the filter is (250)/(250/2) = 2. The function is telling you this is outside its capabilities (or the capabilities of any digital filter).
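For illustration, a cutoff that actually lies inside (0, 1), e.g. a purely hypothetical 20 Hz low-pass at your 250 Hz sampling rate:
library(signal)
smpRate <- 250
f <- butter(4, 20/(smpRate/2), "low")   # 20/(250/2) = 0.16, which is inside (0, 1)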
From the question it looks like you did not read the whole of the ?butter help page. The frequencies for pretty much all filter design functions in the package are specified relative to the Nyquist frequency, so whenever a function asks you for a frequency f_1 you are expected to provide f_1/(f_sample/2), and the result is expected to lie between 0 and 1 because your signal is expected not to contain unrecoverable distortions (i.e. frequencies above Nyquist). The manuals do not spell out this simple relation explicitly and contain a few mistakes (like the formula for the bilinear transform function), but you are of course expected to have some general, basic knowledge of the topic before you attempt to use the package, so it is not a big deal.
Also, if the only thing that worries you is the negative signal values, why bother filtering at all? (Here I use the word "filtering" in the DSP-textbook sense, which is probably not what you mean in the question.) You can simply do something like guest$Fz[guest$Fz < 0] = 0. This is generally a better idea than using NAs or removing the samples entirely, because missing values, and therefore irregular sampling, create global signal artifacts that are much worse than the local high-frequency spikes from replacing a single sample value with another. You can then apply some data-smoothing method afterwards if you feel the need to make your signal look nicer.
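A minimal sketch of that clip-then-smooth idea (assuming guest$Fz as in the question; runmed() is just one arbitrary choice of smoother):
Fz <- guest$Fz
Fz[Fz < 0] <- 0                   # clip negative force values
Fz_smooth <- runmed(Fz, k = 11)   # optional median smoothing; the window width is arbitrary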
In fact, my guess is that this is a purely educational test signal and that you really do need to low-pass filter it; the single cutoff frequency required is well below the 250 Hz sampling rate, and the negative values are not a problem in themselves but rather an indication of very poor or nonexistent filtering. But who knows...
Related
I have a question on a specific implementation of a Nelder-Mead algorithm (1) that handles box constraints in an unusual way. I cannot find anything about it in any paper (25 papers), textbook (searched 4 of them) or on the internet.
I have a typical optimisation problem: min f(x) with a box constraint -0.25 <= x_i <= 250
The expected approach would be to use a penalty function and make sure that all evaluations of f(x) are "unattractive" when x is out of bounds.
The algorithm works differently: the implementation in question does not touch f(x). Instead it distorts the parameter space using an inverse hyperbolic tangent, atanh. Now the simplex algorithm can operate freely in a space without bounds and pick any point it likes. Before it evaluates f(x) in order to assess the solution at x, the algorithm switches back into normal space.
At first glance I found the idea ingenious. This way we avoid the disadvantages of penalty functions. But now I am having doubts. The distorted space affects termination behaviour. One termination criterion is the size of the simplex; by inflating the parameter space with atanh we also inflate the simplex size.
Experiments with the algorithm also show that it does not work as intended. I do not yet understand how this happens, but I do get results that are out of bounds. I can say that almost half of the returned local minima are out of bounds.
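To make the concern concrete, this is roughly the kind of transformation pair being described (a sketch of the idea, not the exact code used in the implementation):
to_unbounded <- function(x, lower, upper) atanh(2 * (x - lower) / (upper - lower) - 1)  # box -> real line
to_bounded   <- function(z, lower, upper) lower + (upper - lower) * (tanh(z) + 1) / 2   # real line -> box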
As an example, take a look at nmkb() optimising the Rosenbrock function when we gradually change the width of the box constraint:
rosbkext <- function(x) {
  # Extended Rosenbrock function
  n <- length(x)
  sum(100*(x[1:(n-1)]^2 - x[2:n])^2 + (x[1:(n-1)] - 1)^2)
}
np <- 6 #12
for (box in c(2, 4, 12, 24, 32, 64, 128)) {
  set.seed(123)
  p0 <- rnorm(np)
  p0[p0 > +2] <- +2 - 1E-8
  p0[p0 < -2] <- -2 + 1E-8
  ctrl <- list(maxfeval = 5E4, tol = 1E-8)
  o <- nmkb(fn = rosbkext, par = p0, lower = -box, upper = +box, control = ctrl)
  print(o$message)
  cat("f(", format(o$par, digits = 2), ") =", format(o$value, digits = 3), "\n")
}
The output shows that it claims to converge, but in three cases it does not: it fails for bounds of (-2, 2) and (-12, 12), which I might accept, but then it also fails at (-128, 128). I also tried the same with the unconstrained dfoptim::nmk(). No trouble there; it converges perfectly.
[1] "Successful convergence"
f( -0.99 0.98 0.97 0.95 0.90 0.81 ) = 3.97
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 4.42e-09
[1] "Successful convergence"
f( -0.99 0.98 0.97 0.95 0.90 0.81 ) = 3.97
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 1.3e-08
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 4.22e-09
[1] "Successful convergence"
f( 1 1 1 1 1 1 ) = 8.22e-09
[1] "Successful convergence"
f( -0.99 0.98 0.97 0.95 0.90 0.81 ) = 3.97
Why does the constrained algorithm have more trouble converging than the unconstrained one?
Footnote (1): I am referring to the Nelder-Mead implementation used in the optimx package in R. This package calls another package dfoptim with the nmkb-function.
(This question has nothing to do with optimx, which is just a wrapper for R packages providing unconstrained optimization.)
The function in question is nmkb() in the dfoptim package for gradient-free optimization routines. The approach to transform bounded regions into unbounded spaces is a common one and can be applied with many different transformation functions, sometimes depending on the kind of the boundary and/or the type of the objective function. It may also be applied, e.g., to transform unbounded integration domains into bounded ones.
The approach is problematic if the optimum lies at the boundary, because the optimal point will be sent to (nearly) infinity and cannot ultimately be reached. The routine will not converge, or the solution will be quite inaccurate.
If you think the algorithm is not working correctly, you should write to the authors of that package and -- that is important -- add one or two examples for what you think are bugs or incorrect solutions. Without explicit code examples no one here is able to help you.
(1) Those transformations define bijective maps between bounded and unbounded regions and the theory behind this approach is obvious. You may read about possible transformations in books on multivariate calculus.
(2) The approach with penalties outside the bounds has its own drawbacks, for instance the target function will not be smooth at the boundaries, and the BFGS method may not be appropriate anymore.
(3) You could try the Hooke-Jeeves algorithm through function hjkb() in the same dfoptim package. It will be slower, but uses a different approach for treating the boundaries, no transformations involved.
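A minimal sketch of what that could look like, reusing rosbkext and p0 from the question (the control options are assumptions borrowed from the question's nmkb call):
library(dfoptim)
o <- hjkb(par = p0, fn = rosbkext, lower = -box, upper = +box,
          control = list(maxfeval = 5e4, tol = 1e-8))
o$par
o$value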
EDIT (after discussion with Erwin Kalvelagen above)
There appear to be local minima (with some coordinates negative).
If you set the lower bounds to 0, nmkb() will find the global minimum (1,1,1,1,1,1) in any case.
Watch out: starting values have to be feasible, that is, all their coordinates must be greater than 0.
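For example, under those constraints (rosbkext as defined in the question; the starting point below is an arbitrary feasible choice), something like this should find the global minimum:
library(dfoptim)
set.seed(123)
p0 <- abs(rnorm(6)) + 1e-8     # feasible start: every coordinate strictly greater than 0
o  <- nmkb(fn = rosbkext, par = p0, lower = 0, upper = 128,
           control = list(maxfeval = 5e4, tol = 1e-8))
o$par   # should be close to (1, 1, 1, 1, 1, 1)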
I want to calculate the silhouette for cluster evaluation. There are some packages in R, for example cluster and clValid. Here is my code using the cluster package:
# load the data
# a data from the UCI website with 434874 obs. and 3 variables
data <- read.csv("./data/spatial_network.txt",sep="\t",header = F)
# apply kmeans
km_res <- kmeans(data,20,iter.max = 1000,
nstart=20,algorithm="MacQueen")
# calculate silhouette
library(cluster)
sil <- silhouette(km_res$cluster, dist(data))
# plot silhouette
library(factoextra)
fviz_silhouette(sil)
The code works well for smaller data, say data with 50,000 obs; however, I get an error like "Error: cannot allocate vector of size 704.5 Gb" when the data size is a bit larger. This might be a problem for the Dunn index and other internal indices on large datasets.
I have 32 GB RAM in my computer. The problem comes from calculating dist(data). I am wondering if it is possible not to calculate dist(data) in advance, and instead compute the corresponding distances only when they are required in the silhouette formula.
I appreciate your help regarding this problem and how I can calculate silhouette for large and very large datasets.
You can implement Silhouette yourself.
It only needs every distance twice, so storing an entire distance matrix is not necessary. It may run a bit slower because it computes distances twice, but at the same time the better memory efficiency may well make up for that.
It will still take a LONG time though.
You should consider using only a subsample (do you really need to consider all points?) as well as alternatives such as the simplified silhouette, in particular with k-means... You gain very little from extra data with such methods, so you may as well use a subsample.
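A minimal, unoptimised sketch of what "implement it yourself" could look like (this is a hand-rolled function, not cluster::silhouette; Euclidean distance, O(n) memory per point but still O(n^2) time, and singleton clusters get a silhouette of 0):
silhouette_low_mem <- function(x, clusters) {
  x <- as.matrix(x)
  ids <- unique(clusters)
  sil <- numeric(nrow(x))
  for (i in seq_len(nrow(x))) {
    d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))     # distances from point i to every point
    same <- clusters == clusters[i]
    if (sum(same) == 1) { sil[i] <- 0; next }     # singleton cluster: silhouette is 0
    a <- sum(d[same]) / (sum(same) - 1)           # mean distance to own cluster, excluding i
    b <- min(vapply(ids[ids != clusters[i]],
                    function(k) mean(d[clusters == k]), numeric(1)))
    sil[i] <- (b - a) / max(a, b)
  }
  sil
}
For the k-means result from the question this would be called as sil <- silhouette_low_mem(data, km_res$cluster).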
Anony-Mousse's answer is perfect, particularly the subsampling. This is very important for very large datasets due to the increase in computational cost.
Here is another solution for calculating internal measures such as the silhouette and the Dunn index, using the R package clusterCrit. clusterCrit calculates clustering validation indices and does not require the entire distance matrix in advance. However, it might be slow, as Anony-Mousse discussed. Please see the link below for the clusterCrit documentation:
https://www.rdocumentation.org/packages/clusterCrit/versions/1.2.8/topics/intCriteria
clusterCrit also calculates most of the internal measures for cluster validation.
Example:
intCriteria(as.matrix(data), km_res$cluster, c("Silhouette","Calinski_Harabasz","Dunn"))
Alternatively, it is possible to calculate the silhouette index without building the full distance matrix yourself: you can use the clues package, which improves on both the time and the memory used by the cluster package. Here is an example:
library(rbenchmark)
library(cluster)
library(clues)
set.seed(123)
x = c(rnorm(1000,0,0.9), rnorm(1000,4,1), rnorm(1000,-5,1))
y = c(rnorm(1000,0,0.9), rnorm(1000,6,1), rnorm(1000, 5,1))
cluster = rep(as.factor(1:3),each = 1000)
df <- cbind(x,y)
head(df)
x y
[1,] -0.50442808 -0.13527673
[2,] -0.20715974 -0.29498142
[3,] 1.40283748 -1.30334876
[4,] 0.06345755 -0.62755613
[5,] 0.11635896 2.33864121
[6,] 1.54355849 -0.03367351
Runtime comparison between the two functions
benchmark(f1 = silhouette(as.integer(cluster), dist = dist(df)),
f2 = get_Silhouette(y = df, mem = cluster))
test replications elapsed relative user.self sys.self user.child sys.child
1 f1 100 15.16 1.902 13.00 1.64 NA NA
2 f2 100 7.97 1.000 7.76 0.00 NA NA
Comparison in memory usage between the two functions
library(pryr)
object_size(silhouette(as.integer(cluster), dist = dist(df)))
73.9 kB
object_size(get_Silhouette(y = df, mem = cluster))
36.6 kB
In conclusion, clues::get_Silhouette reduces both the time and the memory used while producing the same result.
I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps.
Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes to complete process "B". Zero days = 10%, one day = 20%, etc.
Process "B" can't be started until process "A" is complete, so "B" is dependent upon "A".
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
How can I calculate the probability density function of the time to complete "A" and "B"?
This is what I'd expect as the output for the following example:
totallength <- 0 # initialize
totallength[1:(length(a) + length(b))] <- 0 # initialize
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
I have an approach in Visual Basic that used three for loops (one for each of the steps, and one for the output), but I hope I don't have to loop in R.
Since this seems to be a pretty standard process flow question, part two of my question is whether any libraries exist to model operations flow so I'm not creating this from scratch.
The efficient way to do this sort of operation is to use a convolution:
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the Fast Fourier Transform, or FFT).
You can confirm that each of these values is correct using the formulas you posted:
(expected <- c(a[1]*b[1], a[1]*b[2] + a[2]*b[1], a[1]*b[3] + a[2]*b[2] + a[3]*b[1], a[1]*b[4] + a[2]*b[3] + a[3]*b[2], a[1]*b[5] + a[2]*b[4] + a[3]*b[3], a[2]*b[5] + a[3]*b[4], a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
See the package distr. Choosing the term "multiply" is unfortunate, since the situation described is not one where the contributions to the probabilities are independent (where multiplication of probabilities would be the natural term to use). It is rather a sort of sequential addition, and that is exactly what the distr package provides as its interpretation of what "+" should mean when used as a symbolic manipulation of two discrete distributions.
A <- DiscreteDistribution(setNames(0:2, c("Zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("Zero2", "one2", "two2",
                                          "three2", "four2")), b)
?'operators-methods' # where operations on 2 DiscreteDistribution are convolution
plot(A+B)
After a bit of nosing around I see that the actual numeric values can be found here:
A.then.B <- A + B
> environment(A.then.B@d)$dx
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
It seems like there should be a method for displaying the probabilities, and since I am not a regular user of this fascinating package there may well be one. Do read the vignette and the code demos, which I have not yet done. Further noodling around convinces me that the right place to look is the companion package distrDoc, whose vignette is 100+ pages long. It should not have required any effort to find it, either, since that advice is in the messages printed when the package is loaded... except, in my defence, there were a couple of pages of messages, so it was more tempting to jump straight into coding and the help pages.
I'm not familiar with a dedicated package that does exactly what your example describes, but let me suggest a more robust solution for this problem.
You are looking for a method to estimate the distribution of a process composed of n steps (in your case 2), which might not be as easy to compute as in your example.
The approach I would use is a simulation: draw 10k observations from the underlying distributions, then calculate the density function of the simulated results.
Using your example, we can do the following:
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x,y))
# map the uniform draw for process A to days using the cumulative probabilities of a = c(.1, .4, .5)
z[x>=0   & x<0.1, a_days:=0]
z[x>=0.1 & x<0.5, a_days:=1]
z[x>=0.5 & x<=1,  a_days:=2]
# map the uniform draw for process B to days using the cumulative probabilities of b = c(.1, .2, .3, .3, .1)
z[y>=0   & y<0.1, b_days:=0]
z[y>=0.1 & y<0.3, b_days:=1]
z[y>=0.3 & y<0.6, b_days:=2]
z[y>=0.6 & y<0.9, b_days:=3]
z[y>=0.9 & y<=1,  b_days:=4]
z[,total_days:=a_days+b_days]
hist(z[,total_days])
This will give a very good proxy of the density, and the approach would also work if your second process were drawn from, say, an exponential distribution, in which case you would use the rexp function to generate b_days directly.
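As an aside, the same simulation can be written more compactly by drawing the day counts directly with sample(), using the a and b vectors from the question (a sketch, not a replacement for the data.table approach above):
set.seed(1)
a_days <- sample(0:2, 10000, replace = TRUE, prob = a)
b_days <- sample(0:4, 10000, replace = TRUE, prob = b)
hist(a_days + b_days, breaks = seq(-0.5, 6.5, by = 1), freq = FALSE)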
I have a series of data obtained from a molecular dynamics simulation, so the values are sequential in time and correlated to some extent. I can calculate the mean as the average of the data; I want to estimate the error associated with the mean calculated in this way.
According to this book I need to calculate the "statistical inefficiency", or roughly the correlation time for the data in the series. For this I have to divide the series into blocks of varying length and, for each block length (t_b), calculate the variance of the block averages (v_b). Then, if the variance of the whole series is v_a (that is, v_b when t_b = 1), I have to obtain the limit, as t_b tends to infinity, of (t_b*v_b/v_a), and that is the inefficiency s.
Then the error in the mean is sqrt(v_a*s/N), where N is the total number of points. So this means that only one in every s points is uncorrelated.
I assume this can be done with R, and maybe there's some package that does it already, but I'm new to R. Can anyone tell me how to do it? I have already found out how to read the data series and calculate the mean and variance.
A data sample, as requested:
# t(ps) dH/dl(kJ/mol)
0.0000 582.228
0.0100 564.735
0.0200 569.055
0.0300 549.917
0.0400 546.697
0.0500 548.909
0.0600 567.297
0.0700 638.917
0.0800 707.283
0.0900 703.356
0.1000 685.474
0.1100 678.07
0.1200 687.718
0.1300 656.729
0.1400 628.763
0.1500 660.771
0.1600 663.446
0.1700 637.967
0.1800 615.503
0.1900 605.887
0.2000 618.627
0.2100 587.309
0.2200 458.355
0.2300 459.002
0.2400 577.784
0.2500 545.657
0.2600 478.857
0.2700 533.303
0.2800 576.064
0.2900 558.402
0.3000 548.072
... and this goes on until 500 ps. Of course, the data I need to analyze is the second column.
Suppose x is holding the sequence of data (e.g., data from your second column).
v = var(x)    # variance of the whole series (v_a, i.e. v_b at t_b = 1)
m = mean(x)
n = length(x)
si = c()
for (t in seq(2, 1000)) {
  nblocks = floor(n/t)
  # split the series into nblocks blocks of length t and average each block
  xg = split(x[1:(nblocks*t)], factor(rep(1:nblocks, rep(t, nblocks))))
  v2 = sum((sapply(xg, mean) - m)**2)/nblocks   # variance of the block averages (v_b)
  si = c(si, t*v2/v)                            # statistical inefficiency estimate t_b*v_b/v_a
}
plot(si)
The image below is what I got from some of my own time-series data. You have your lower limit of t_b when the curve of si becomes approximately flat (slope = 0). See http://dx.doi.org/10.1063/1.1638996 as well.
There are a couple different ways to calculate the statistical inefficiency, or integrated autocorrelation time. The easiest, in R, is with the CODA package. They have a function, effectiveSize, which gives you the effective sample size, which is the total number of samples divided by the statistical inefficiency. The asymptotic estimator for the standard deviation in the mean is sd(x)/sqrt(effectiveSize(x)).
require('coda')
n_eff = effectiveSize(x)
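The standard error of the mean then follows directly from the formula above:
se_mean = sd(x)/sqrt(n_eff)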
Well, it's never too late to contribute to a question, is it?
As I am doing some molecular simulation myself, I stumbled upon this problem but did not see this thread until now. I found that the method proposed by Allen & Tildesley seems a bit outdated compared to modern error analysis methods. The rest of the book is good enough to be worth a look though.
While Sunhwan Jo's answer is correct concerning the block averages method, for error analysis you can also find other methods like the jackknife and bootstrap methods (closely related to one another) here: http://www.helsinki.fi/~rummukai/lectures/montecarlo_oulu/lectures/mc_notes5.pdf
In short, with the bootstrap method you make a series of random artificial samples from your data and calculate the value you want on each new sample. I wrote a short piece of Python code to work some data out (it relies on numpy):
import numpy

def Bootstrap(data):
    B = 100                     # arbitrary number of artificial samplings
    means = numpy.zeros(B)
    sizeB = data.shape[0] // 4  # arbitrary resample size, proportional to the size of
                                # your sampling (assuming you pass a numpy array)
    for n in range(B):
        for i in range(sizeB):
            # if data is a multi-column array you may have to pick the column you use
            # after randint, else it will give you a one-dimensional array. Check the doc.
            means[n] = means[n] + data[numpy.random.randint(0, high=data.shape[0])]
            # Assuming your desired value is the mean of the values; any calculation is ok.
        means[n] = means[n] / sizeB
    es = numpy.std(means, ddof=1)   # spread of the bootstrap means = statistical error
    return es
I know it can be upgraded but it's a first shot. With your data, I get the following:
Mean = 594.84368
Std = 66.48475
Statistical error = 9.99105
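For anyone who would rather stay in R, here is a rough equivalent of the bootstrap sketch above (B and the resample size are as arbitrary as in the Python version):
boot_se <- function(x, B = 100) {
  sizeB <- floor(length(x)/4)                                    # arbitrary resample size
  means <- replicate(B, mean(sample(x, sizeB, replace = TRUE)))  # mean of each artificial sample
  sd(means)                                                      # spread of the means = statistical error
}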
I hope this helps anyone stumbling across this problem in statistical analysis of data. If I am wrong about anything (first post, and I am no mathematician), any correction is welcome.
First off, sorry about the long post. I figured it's better to give context to get good answers (I hope!). Some time ago I wrote an R function that gets all pairwise interactions of variables in a data frame. This worked fine at the time, but now a colleague would like me to do this with a much larger dataset. They don't know how many variables they are going to have in the end, but they are guessing approximately 2,500 - 3,000. My function below is way too slow for this (4 minutes for 100 variables). At the bottom of this post I have included some timings for various numbers of variables and total numbers of interactions. I have the results of calling Rprof() on the 100-variable run of my function, so if anyone wants to take a look at it, let me know. I don't want to make a super long post any longer than it needs to be.
What I'd like to know is whether there is anything I can do to speed this function up. I tried going directly to glm.fit, but as far as I understood, for that to be useful the computation of the design matrices and all the other stuff that I frankly don't understand needs to be the same for each model, which is not the case for my analysis, although perhaps I am wrong about this.
Any ideas on how to make this run faster would be greatly appreciated. I am planning on using parallelization to run the analysis in the end, but I don't know how many CPUs I am going to have access to, though I'd say it won't be more than 8.
Thanks in advance,
Cheers
Davy.
getInteractions2 = function(data, fSNPcol, ccCol)
{
  # fSNPcol is the number of the column that contains the first SNP
  # ccCol is the number of the column that contains the outcome variable
  require(lmtest)
  a = data.frame()
  snps = names(data)[-1:-(fSNPcol-1)]
  names(data)[ccCol] = "PHENOTYPE"
  terms = as.data.frame(t(combn(snps,2)))
  attach(data)
  fit1 = c()
  fit2 = c()
  pval = c()
  for(i in 1:length(terms$V1))
  {
    fit1 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i])),family="binomial")
    fit2 = glm(PHENOTYPE~get(as.character(terms$V1[i]))+get(as.character(terms$V2[i]))+I(get(as.character(terms$V1[i]))*get(as.character(terms$V2[i]))),family="binomial")
    a = lrtest(fit1, fit2)
    pval = c(pval, a[2,"Pr(>Chisq)"])
  }
  detach(data)
  results = cbind(terms,pval)
  return(results)
}
The table below shows the system.time results for increasing numbers of variables passed through the function. n is the number of variables, and Ints is the number of pairwise interactions given by that number of variables.
n Ints user.self sys.self elapsed
time 10 45 1.20 0.00 1.30
time 15 105 3.40 0.00 3.43
time 20 190 6.62 0.00 6.85
...
time 90 4005 178.04 0.07 195.84
time 95 4465 199.97 0.13 218.30
time 100 4950 221.15 0.08 242.18
Some code to reproduce a data frame in case you want to look at timings or the Rprof() results. Please don't run this unless your machine is super fast, or you're prepared to wait for about 15-20 minutes.
df = data.frame(paste("sid",1:2000,sep=""),rbinom(2000,1,.5))
gtypes = matrix(nrow=2000, ncol=3000)
gtypes = apply(gtypes,2,function(x){x=sample(0:2, 2000, replace=T);x})
snps = paste("rs", 1000:3999,sep="")
df = cbind(df,gtypes)
names(df) = c("sid", "status", snps)
times = c()
for(i in seq(10,100, by=5)){
  if(i==100){Rprof()}
  time = system.time((pvals = getInteractions2(df[,1:i], 3, 2)))
  print(time)
  times = rbind(times, time)
  if(i==100){Rprof(NULL)}
}
numI = function(n){return(((n^2)-n)/2)}
timings = cbind(seq(10,100,by=5), sapply(seq(10,100,by=5), numI), times)
So I have sort of solved this (with help from the R mailing lists) and am posting it here in case it's useful to anyone.
Basically, where the SNPs or variables are independent (i.e. not in LD, not correlated) you can centre each SNP/variable at its mean like so:
rs1cent <- rs1 - mean(rs1)
rs2cent <- rs2 - mean(rs2)
You can then test for correlation between the phenotype and the interaction as a screening step:
rs12interaction <- rs1cent*rs2cent
cor(PHENOTYPE, rs12interaction)
and then fully investigate, using the full glm, any that seem to be correlated. The cut-off choice is, as ever, arbitrary.
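A minimal sketch of that screening step, vectorised over all pairs at once (this assumes the simulated df, snps and status columns from the reproduction code in the question; the 0.05 cut-off is purely illustrative):
X <- scale(as.matrix(df[, snps]), center = TRUE, scale = FALSE)   # centre every SNP at its mean
pairs <- combn(length(snps), 2)                                   # all pairwise combinations
screen_cor <- apply(pairs, 2, function(p) cor(df$status, X[, p[1]] * X[, p[2]]))
candidates <- which(abs(screen_cor) > 0.05)                       # pairs worth a full glm fit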
Other suggestions were to use a Rao score test, which involves fitting only the null-hypothesis model, thus halving the computation time for this step, but I don't really understand how this works (yet! more reading required).
Anyway, there you go. Maybe it will be of use to someone someday.