fast perl t-test function - r

I'm using Perl together with R to analyze a large dataset of samples. For each pair of samples, I calculate the t-test p-value. Currently, I'm using the Statistics::R module to export values from Perl to R, and then use the t.test function. However, this process is extremely slow. I was wondering if someone knows a Perl function that will do the same procedure in a more efficient manner.
Thanks!

The volume of data, the number of dataset pairs, and perhaps even the code you have written would probably help us identify why your code is slow. For instance, sending many small datasets to R would be slow, but can probably be sped up simply by sending all the data at once.
For a pure Perl solution, you first need to compute the test statistic (that is easy, and already done in Statistics::TTest, for instance), and then to convert it to a p-value (you need something like R's pt function, but I am not sure an equivalent is readily available in Perl -- you could send the T-values to R, in one block, at the end, to convert them to p-values).
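If you go that route, the conversion itself is a one-liner in R. A minimal sketch, where tvals and dfs are placeholder vectors standing in for the statistics and degrees of freedom you would ship over from Perl in one block:
tvals <- c(2.1, -0.4, 3.7)              # example t statistics
dfs   <- c(30, 30, 45)                  # matching degrees of freedom
pvals <- 2 * pt(-abs(tvals), df = dfs)  # two-sided p-values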

You can also try PDL, in particular PDL::Stats.

The Statistics::TTest module gives you a p-value.
use strict;
use warnings;
use feature 'say';
use Statistics::TTest;
my @r1 = map { rand(10) } 1..32;
my @r2 = map { rand(10) - 2 } 1..32;
my $ttest = Statistics::TTest->new;
$ttest->load_data(\@r1, \@r2);
say "p-value = prob > |T| = ", $ttest->{t_prob};
Playing around a bit, I find that the p-values that this gives you are slightly lower than what you get from R. R is apparently doing something that reduces the degrees of freedom, but my knowledge of statistics is insufficient to explain what it's doing or why. (In the above example, the difference is about 1%. If you use samples of 320 floats instead of 32, then the difference is 50% or even more, but it's a difference between 1e-12 and 1.5e-12.) If you need precise p-values, you will want to take care.
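A plausible explanation, offered here as an assumption rather than something established above: R's t.test defaults to the Welch approximation (var.equal = FALSE), which reduces the effective degrees of freedom when the sample variances differ. A minimal R check of how the degrees of freedom change:
set.seed(1)
x <- rnorm(32, mean = 5)
y <- rnorm(32, mean = 4)
t.test(x, y)$parameter                    # Welch df, usually a bit below 62
t.test(x, y, var.equal = TRUE)$parameter  # pooled Student df, exactly 62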

Related

How to remove the models that failed convergence from a set of random questions?

I want to include some random replications of model estimations (e.g., a GARCH model) in the question. The code randomly draws a different data series each time. In this process, the GARCH estimation for some random data series may not achieve numerical convergence. Therefore, I need to code the question in such a way that models that failed to converge are removed from the set of questions. How can I code this when I use R-exams?
Basic idea
In general when using random data in the generation of exercises, there is a chance that sometimes something goes wrong, e.g., the solution does not fall into a desired range (i.e., becomes too large or too small), or the solution does not even exist due to mathematical intractability or numerical problems (as you point out) etc.
Of course, it is best to avoid such problems in the data-generating process so that they do not occur at all. However, this is not always possible, or not worth the effort because the problems occur very rarely. In such situations I typically use a while() loop to re-generate the random data if necessary. As this might potentially run for several iterations, it is important, though, to make the probability that it is needed sufficiently small.
Worked example
A worked example can be found in the fourfold exercise that ships with the package. It randomly generates a fourfold table with probabilities that should subsequently be reconstructed from partial information in the actual exercise. In order for the exercise to be well-defined all entries of the table must be (strictly) between 0 and 1 and they must sum up to 1. The simulation code actually tries to assure that but edge cases might occur. Rather than writing more code to avoid these edge cases, a simple while() loop tries to catch them and sample a new table if needed:
ok <- FALSE
while(!ok) {
  ## ... generate probabilities ...
  tab <- cbind(c(prob1, prob3), c(prob2, prob4))
  ## ... compute solutions ...
  ok <- sum(tab) == 1 & all(tab > 0) & all(tab < 1)
}
Application to catching errors
The same type of strategy can also be used for other problems, such as the ones you describe. You can wrap the model estimation in code like
fit <- try(mymodel(...), silent = TRUE)
and then use something like
ok <- !inherits(fit, "try-error")
In addition to not producing an error you might require, say, that all coefficients are positive (or something like that). Then you would do:
ok <- !inherits(fit, "try-error") && all(coef(fit) > 0)
Analogously, you could check the convergence of the model etc.
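Putting the pieces together, a minimal sketch of what this could look like for your setting; simulate_series() and mymodel() are hypothetical stand-ins for your own data generator and GARCH estimation call:
ok <- FALSE
while(!ok) {
  y <- simulate_series()                 # draw a new random data series
  fit <- try(mymodel(y), silent = TRUE)  # estimation may fail to converge
  ok <- !inherits(fit, "try-error") && all(coef(fit) > 0)
}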

R cluster analysis and dendrogram with correlation matrix

I have to perform a cluster analysis on a large amount of data. Since I have a lot of missing values, I made a correlation matrix.
corloads = cor(df1[,2:185], use = "pairwise.complete.obs")
Now I am not sure how to go on. I read a lot of articles and examples, but nothing really works for me. How can I find out how many clusters are good for me?
I already tried this:
dissimilarity = 1 - corloads
distance = as.dist(dissimilarity)
plot(hclust(distance), main="Dissimilarity = 1 - Correlation", xlab="")
I got a plot, but it's very messy and I don't know how to read it or how to go on.
Any idea how to improve it? And what can I actually get out of it?
I also wanted to create a Screeplot. I read that there will be a curve where you can see how many clusters are correct.
I also performed a cluster analysis and chose 2-20 clusters, but the results are so long that I have no idea how to handle them and what is important to look at.
Several methods are available to determine the "optimal number of clusters", although it is a controversial topic.
The kgs function (the Kelley-Gardner-Sutcliffe penalty from the maptree package) is helpful for getting the optimal number of clusters.
Following your code, one would do:
library(maptree)
clus <- hclust(distance)
op_k <- kgs(clus, distance, maxclus = 20)
plot(names(op_k), op_k, xlab = "# clusters", ylab = "penalty")
So the optimal number of clusters according to the kgs function is the minimum value of op_k, as you can see in the plot.
You can get it with
min(op_k)
Note that I set the maximum number of clusters allowed to 20. You can set this argument to NULL.
Check this page for more methods.
Hope it helps you.
Edit
To find which is the optimal number of clusters, you can do
op_k[which(op_k == min(op_k))]
Plus
Also see this post to find the excellent, graph-filled answer from @Ben.
Edit
op_k[which(op_k == min(op_k))]
still gives the penalty value, not the number of clusters. To find the optimal number of clusters, use
as.integer(names(op_k[which(op_k == min(op_k))]))
I'm happy to learn about the kgs function. Another option is using the find_k function from the dendextend package (it uses the average silhouette width). But given the kgs function, I might just add it as another option to the package.
Also note the dendextend::color_branches function, to color your dendrogram with the number of clusters you end up choosing (you can see more about this here: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#setting-a-dendrograms-branches )
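For completeness, a small sketch of that dendextend route, under the assumption that you keep the hclust(distance) tree from above:
library(dendextend)
dend <- as.dendrogram(hclust(distance))
k <- find_k(dend)$k                  # picks k by average silhouette width
dend <- color_branches(dend, k = k)  # colour branches by the chosen clusters
plot(dend)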

Apply a function to an array of data and save result in another array

I tried this on two different occasions. I wanted to demonstrate overfitting and tinker around with power analysis.
I generated some data with only five points and wanted to demonstrate how you can get a perfect fit but a bad model:
a<-4
b<-1
n<-5
x<-c(1:n)
y=a * x+b+rnorm(n,0,100)
model1=lm(y~x)
model2=lm(y~poly(x,2))
model3=lm(y~poly(x,3))
model4=lm(y~poly(x,4))
However, I bet there is a way to do this more elegantly; all that I could figure out was to use sapply() in some way:
modeldat=c(1:4)
model[i]=sapply(modeldat,function(i) model2=lm(y~poly(x,i)))
So after tinkering around, I got no results and just stuck to the "solution" shown above.
Now, today I played around with the "pwr" library to calculate the sample size needed for some hypothetical data. This is the ugly solution I came up with and I bet there is some way to do it with sapply() as well. Doing it like this, however:
test=(seq(0,1,by=0.1))
power<-function(x){
pwr.t.test(,d=coensD,sig.level=0.05,power=x,type="two.sample")
}
sapply(test,power)
only produces an error:
Error in uniroot(function(n) eval(p.body) - power, c(2 + 1e-10,
1e+09)) : f() values at end points not of opposite sign
So after googling around for about two hours, I am none the wiser and did this...
plist=0
powerList=pwr.t.test(,d=1,sig.level=0.05,power=0.8,type="two.sample")
plist[1]=powerList$n
powerList=pwr.t.test(,d=1,sig.level=0.05,power=0.7,type="two.sample")
plist[2]=powerList$n
powerList=pwr.t.test(,d=1,sig.level=0.05,power=0.6,type="two.sample")
plist[3]=powerList$n
powerList=pwr.t.test(,d=1,sig.level=0.05,power=0.5,type="two.sample")
plist[4]=powerList$n
powerList=pwr.t.test(,d=1,sig.level=0.05,power=0.4,type="two.sample")
plist[5]=powerList$n
In both cases I want to apply a function to an array of values and store the results in another array. Is there an obvious and easy way to do this?
df <- data.frame(model = 1:4)
apply(df[, 1, drop = FALSE], 1, function(i) lm(y ~ poly(x, i)))
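A simpler route, offered as a sketch rather than a correction of the answer above: lapply() keeps the fitted models in a list (the natural container for model objects), and sapply() collects the numeric sample sizes from pwr.t.test():
models <- lapply(1:4, function(i) lm(y ~ poly(x, i)))
library(pwr)
powers <- seq(0.4, 0.8, by = 0.1)   # powers of 0 or 1 make pwr.t.test fail
n_needed <- sapply(powers, function(p)
  pwr.t.test(d = 1, sig.level = 0.05, power = p, type = "two.sample")$n)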

how to avoid R fisher.test workspace errors

I am performing a Fisher's exact test on a large number of contingency tables and saving the p-value for a bioinformatics problem. Some of these contingency tables are large, so I've increased the workspace as much as I can; but when I run the following code I get an error:
result <- fisher.test(data,workspace=2e9)
LDSTP is too small for this problem. Try increasing the size of the workspace.
If I increase the size of the workspace I get another error:
result <- fisher.test(data,workspace=2e10)
cannot allocate memory block of size 134217728Tb
Now I could just simulate pvals:
result <- fisher.test(data, simulate.p.value = TRUE, B = 1e5)
but I'm afraid I'll need a huge number of simulations to get accurate results, since my p-values may be extremely small in some cases.
Thus my question: is there some way to check preemptively whether a contingency table is too complex to calculate exactly? In those cases alone I could switch to a large number of simulations with B=1e10 or something, or at least just skip those tables with a value of "NA" so that my job actually finishes.
Maybe you could use tryCatch to get the desired behaviour when fisher.test fails? Something like this, maybe:
tryCatchFisher <- function(...){
  tryCatch(fisher.test(...)$p.value,
           error = function(e) 'too big')
}
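A hedged variant of the same idea (fisher_or_simulate is my own name, not anything from the question): instead of returning a flag, fall back to a simulated p-value whenever the exact computation fails for lack of workspace:
fisher_or_simulate <- function(tab, B = 1e5) {
  tryCatch(fisher.test(tab, workspace = 2e8)$p.value,
           error = function(e)
             fisher.test(tab, simulate.p.value = TRUE, B = B)$p.value)
}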

What software package can you suggest for a programmer who rarely works with statistics?

Being a programmer I occasionally find the need to analyze large amounts of data such as performance logs or memory usage data, and I am always frustrated by how much time it takes me to do something that I expect to be easier.
As an example to put the question in context, let me quickly show you an example from a CSV file I received today (heavily filtered for brevity):
date,time,PS Eden Space used,PS Old Gen Used, PS Perm Gen Used
2011-06-28,00:00:03,45004472,184177208,94048296
2011-06-28,00:00:18,45292232,184177208,94048296
I have about 100,000 data points like this with different variables that I want to plot in a scatter plot in order to look for correlations. Usually the data needs to be processed in some way for presentation purposes (such as converting nanoseconds to milliseconds and rounding fractional values), some columns may need to be added or inverted, or combined (like the date/time columns).
The usual recommendation for this kind of work is R and I have recently made a serious effort to use it, but after a few days of work my experience has been that most tasks that I expect to be simple seem to require many steps and have special cases; solutions are often non-generic (for example, adding a data set to an existing plot). It just seems to be one of those languages that people love because of all the powerful libraries that have accumulated over the years rather than the quality and usefulness of the core language.
Don't get me wrong, I understand the value of R to people who are using it, it's just that given how rarely I spend time on this kind of thing I think that I will never become an expert on it, and to a non-expert every single task just becomes too cumbersome.
Microsoft Excel is great in terms of usability but it just isn't powerful enough to handle large data sets. Also, both R and Excel tend to freeze completely (!) with no way out other than waiting or killing the process if you accidentally make the wrong kind of plot over too much data.
So, stack overflow, can you recommend something that is better suited for me? I'd hate to have to give up and develop my own tool, I have enough projects already. I'd love something interactive that could use hardware acceleration for the plot and/or culling to avoid spending too much time on rendering.
@flodin It would have been useful for you to provide an example of the code you use to read such a file into R. I regularly work with data sets of the size you mention and do not have the problems you describe. One thing that might be biting you, if you don't use R often, is that if you don't tell R what the column types are, it has to do some snooping on the file first, and that all takes time. Look at the argument colClasses in ?read.table.
For your example file, I would do:
dat <- read.csv("foo.csv", colClasses = c(rep("character",2), rep("integer", 3)))
then post process the date and time variables into an R date-time object class such as POSIXct, with something like:
dat <- transform(dat, dateTime = as.POSIXct(paste(date, time)))
As an example, let's read in your example data set, replicate it 50,000 times and write it out, then time different ways of reading it in, with foo containing your data:
> foo <- read.csv("log.csv")
> foo
date time PS.Eden.Space.used PS.Old.Gen.Used
1 2011-06-28 00:00:03 45004472 184177208
2 2011-06-28 00:00:18 45292232 184177208
PS.Perm.Gen.Used
1 94048296
2 94048296
Replicate that, 50000 times:
out <- data.frame(matrix(nrow = nrow(foo) * 50000, ncol = ncol(foo)))
out[, 1] <- rep(foo[,1], times = 50000)
out[, 2] <- rep(foo[,2], times = 50000)
out[, 3] <- rep(foo[,3], times = 50000)
out[, 4] <- rep(foo[,4], times = 50000)
out[, 5] <- rep(foo[,5], times = 50000)
names(out) <- names(foo)
Write it out:
write.csv(out, file = "bigLog.csv", row.names = FALSE)
Time loading the naive way and the proper way:
system.time(in1 <- read.csv("bigLog.csv"))
system.time(in2 <- read.csv("bigLog.csv",
                            colClasses = c(rep("character", 2),
                                           rep("integer", 3))))
Which is very quick on my modest laptop:
> system.time(in1 <- read.csv("bigLog.csv"))
user system elapsed
0.355 0.008 0.366
> system.time(in2 <- read.csv("bigLog.csv",
                              colClasses = c(rep("character", 2),
                                             rep("integer", 3))))
user system elapsed
0.282 0.003 0.287
For both ways of reading in.
As for plotting, the graphics can be a bit slow, but depending on your OS this can be sped up a bit by changing the device you plot to - on Linux, for example, don't use the default X11() device, which uses Cairo; instead try the old X window device without anti-aliasing. Also, what are you hoping to see in a data set as large as 100,000 observations on a graphics device with not many pixels? Perhaps try to rethink your strategy for data analysis --- no stats software will be able to save you from doing something ill-advised.
It sounds as if you are developing code/analysis as you go along, on the full data set. It would be far more sensible to just work with a small subset of the data when developing new code or new ways of looking at your data, say with a random sample of 1000 rows, and work with that object instead of the whole data object. That way you guard against accidentally doing something that is slow:
working <- out[sample(nrow(out), 1000), ]
for example. Then use working instead of out. Alternatively, whilst testing and writing a script, set argument nrows to say 1000 in the call to load the data into R (see ?read.csv). That way whilst testing you only read in a subset of the data, but one simple change will allow you to run your script against the full data set.
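For instance, a minimal sketch of that nrows trick for the file generated above (drop nrows, or set it to -1, for the full run):
dat <- read.csv("bigLog.csv", nrows = 1000,
                colClasses = c(rep("character", 2), rep("integer", 3)))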
For data sets of the size you are talking about, I see no problem whatsoever in using R. Your point about not becoming expert enough to use R will more than likely apply to other scripting languages that might be suggested, such as Python. There is a barrier to entry, but that is to be expected if you want the power of a language such as Python or R. If you write scripts that are well commented (instead of just plugging away at the command line) and focus on a few key data import/manipulation steps, a bit of plotting and some simple analysis, it shouldn't take long to master that small subset of the language.
R is a great tool, but I never had to resort to using it. Instead I find Python to be more than adequate for my needs when I need to pull data out of huge logs. Python really comes with "batteries included", with built-in support for working with CSV files.
The simplest example of reading a CSV file:
import csv
with open('some.csv', 'rb') as f:
reader = csv.reader(f)
for row in reader:
print row
To use another separator, e.g. a tab, and extract the n-th column, use
spamReader = csv.reader(open('spam.csv', 'rb'), delimiter='\t')
for row in spamReader:
    print row[n]
To operate on columns use the built-in list data-type, it's extremely versatile!
To create beautiful plots I use matplotlib
The python tutorial is a great way to get started! If you get stuck, there is always stackoverflow ;-)
There seem to be several questions mixed together:
Can you draw plots quicker and more easily?
Can you do things in R with less learning effort?
Are there other tools which require less learning effort than R?
I'll answer these in turn.
There are three plotting systems in R, namely base, lattice and ggplot2 graphics. Base graphics will render quickest, but making them look pretty can involve pathological coding. ggplot2 is the opposite, and lattice is somewhere in between.
Reading in CSV data, cleaning it and drawing a scatterplot sounds like a pretty straightforward task, and the tools are definitely there in R for solving such problems. Try asking a question here about specific bits of code that feel clunky, and we'll see if we can fix it for you. If your datasets all look similar, then you can probably reuse most of your code over and over. You could also give the ggplot2 web app a try.
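To give a flavour of that "straightforward task", here is a hedged sketch of the whole round trip for the log file above using ggplot2 (the column names follow the CSV header shown earlier; nothing here comes from the original answer):
library(ggplot2)
dat <- read.csv("bigLog.csv",
                colClasses = c(rep("character", 2), rep("integer", 3)))
dat$dateTime <- as.POSIXct(paste(dat$date, dat$time))
ggplot(dat, aes(x = dateTime, y = PS.Old.Gen.Used / 1e6)) +
  geom_point(alpha = 0.3) +
  labs(x = "time", y = "PS Old Gen used (MB)")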
The two obvious alternative languages for data processing are MATLAB (and its derivatives: Octave, Scilab, AcslX) and Python. Either of these will be suitable for your needs, and MATLAB in particular has a pretty shallow learning curve. Finally, you could pick a graph-specific tool like gnuplot or Prism.
SAS can handle larger data sets than R or Excel, however many (if not most) people--myself included--find it a lot harder to learn. Depending on exactly what you need to do, it might be worthwhile to load the CSV into an RDBMS and do some of the computations (eg correlations, rounding) there, and then export only what you need to R to generate graphics.
ETA: There's also SPSS, and Revolution; the former might not be able to handle the size of data that you've got, and the latter is, from what I've heard, a distributed version of R (that, unlike R, is not free).
