R package for calculating the moles of oxygen in air

Is there an R package that includes a function for calculating the moles of oxygen in air, given temperature, pressure, etc.?
I'm looking for something like marelac, but for air, not water.

Solve PV = nRT for n:
n = PV/RT
Make a function:
moles_n <- function(press,               # pressure, in SI unit pascals
                    vol,                 # volume, in SI unit cubic metres
                    temp,                # temperature, in kelvin
                    R_const = 8.3144598  # J mol^-1 K^-1; Wikipedia's "(48)" suffix is the
                                         # uncertainty in the last digits, not R syntax
                    ){
  press * vol / (R_const * temp)
}
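A quick sanity check of the ideal-gas version (my own addition): one cubic metre of gas at 101325 Pa and 273.15 K should hold roughly 1/0.0224 ≈ 44.6 mol, matching the familiar 22.4 L molar volume.
moles_n(press = 101325, vol = 1, temp = 273.15)
# about 44.6 mol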
Units and constant looked up at: https://en.wikipedia.org/wiki/Gas_constant. (I can remember some physical constants but this one went back to high-school chemistry class, more than half a century ago in my case.) I suppose if one were entirely scrupulous, one would put in the correction factors for different gases. See: https://en.wikipedia.org/wiki/Van_der_Waals_equation and http://chemed.chem.purdue.edu/genchem/topicreview/bp/ch4/deviation5.html
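If you do want the Van der Waals correction, here is a minimal sketch that solves the Van der Waals equation for n by root-finding. The constants a and b below are approximate literature values for O2 (a ≈ 0.1382 Pa m^6 mol^-2, b ≈ 3.186e-5 m^3 mol^-1); check a reference before relying on them.
vdw_moles <- function(press, vol, temp,
                      a = 0.1382,    # Pa m^6 mol^-2, approximate value for O2
                      b = 3.186e-5,  # m^3 mol^-1, approximate value for O2
                      R_const = 8.3144598) {
  # (press + a*n^2/vol^2) * (vol - n*b) = n * R_const * temp, solved for n
  f <- function(n) (press + a * n^2 / vol^2) * (vol - n * b) - n * R_const * temp
  n_ideal <- press * vol / (R_const * temp)  # ideal-gas estimate brackets the root
  uniroot(f, interval = c(0, 2 * n_ideal))$root
}
vdw_moles(101325, 1, 298.15)  # barely differs from the ideal-gas ~40.9 mol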
I thought maybe there could be a package as you requested (although you should realize that package-request questions generally get closed). I found two, the CHNOSZ package and the seacarb package, that do indicate an effort to implement thermodynamic functions, but my perusal of the function summaries makes me think these are also primarily for aqueous solutions:
http://finzi.psych.upenn.edu/R/library/CHNOSZ/html/00Index.html
Eventually I found the package IAPWS95, which you should examine carefully:
http://finzi.psych.upenn.edu/R/library/IAPWS95/html/00Index.html
The R way of searching that I find most efficient is to use sos::findFn:
library(sos)
findFn("pressure temperature gas")
found 89 matches; retrieving 5 pages
2 3 4 5
Downloaded 87 links in 21 packages.

Related

Methods in R for large complex survey data sets?

I am not a survey methodologist or demographer, but am an avid fan of Thomas Lumley's R survey package. I've been working with a relatively large complex survey data set, the Healthcare Cost and Utilization Project (HCUP) National Emergency Department Sample (NEDS). As described by the Agency for Healthcare Research and Quality, it is "Discharge data for ED visits from 947 hospitals located in 30 States, approximating a 20-percent stratified sample of U.S. hospital-based EDs"
The full dataset from 2006 to 2012 consists of 198,102,435 observations. I've subsetted the data to 40,073,358 traumatic injury-related discharges with 66 variables. Running even simple survey procedures on these data takes inordinately long amounts of time. I've tried throwing RAM at it (late 2013 Mac Pro, 3.7GHz Quad Core, 128GB (!) memory), using multicore when available, subsetting, and working with an out-of-memory DBMS like MonetDB. Design-based survey procedures still take hours. Sometimes many hours. Some modestly complex analyses take upwards of 15 hours. I am guessing that most of the computational effort is tied to what must be a humongous covariance matrix.
As one might expect, working with the raw data is orders of magnitude faster. More interestingly, depending on the procedure, with a data set this large the unadjusted estimates can be quite close to the survey results. (See examples below) The design-based results are clearly more precise and preferred, but several hours of computing time vs seconds is a not inconsiderable cost for that added precision. It begins to look like a very long walk around the block.
Is there anyone who's had experience with this? Are there ways to optimize R survey procedures for large data sets? Perhaps make better use of parallel processing? Are Bayesian approaches using INLA or Hamiltonian methods like Stan a possible solution? Or, are some unadjusted estimates, especially for relative measures, acceptable when the survey is large and representative enough?
Here are a couple of examples of unadjusted estimates approximating survey results.
In this first example, svymean in memory took a bit less than an hour, while out of memory it required well over 3 hours. The direct calculation took under a second. More importantly, the point estimates (34.75 for svymean and 34.77 unadjusted) as well as the standard errors (0.0039 and 0.0037) are quite close.
# 1a. svymean in memory
svydes <- svydesign(
  id      = ~KEY_ED,
  strata  = ~interaction(NEDS_STRATUM, YEAR),  # note YEAR interaction
  weights = ~DISCWT,
  nest    = TRUE,
  data    = inj
)
system.time(meanAGE<-svymean(~age, svydes, na.rm=T))
user system elapsed
3082.131 143.628 3208.822
> meanAGE
mean SE
age 34.746 0.0039
# 1b. svymean out of memory
db_design <-
  svydesign(
    weights = ~discwt,                         # weight variable column
    nest    = TRUE,                            # whether or not psus are nested within strata
    strata  = ~interaction(neds_stratum, yr),  # stratification variable column
    id      = ~key_ed,
    data    = "nedsinj0612",                   # table name within the monet database
    dbtype  = "MonetDBLite",
    dbname  = "~/HCUP/HCUP NEDS/monet"         # folder location
  )
system.time(meanAGE<-svymean(~age, db_design, na.rm=T))
user system elapsed
11749.302 549.609 12224.233
Warning message:
'isIdCurrent' is deprecated.
Use 'dbIsValid' instead.
See help("Deprecated")
mean SE
age 34.746 0.0039
# 1.c unadjusted mean and s.e.
system.time(print(mean(inj$AGE, na.rm=T)))
[1] 34.77108
user system elapsed
0.407 0.249 0.653
sterr <- function(x) sd(x, na.rm=T)/sqrt(length(x)) # little function for s.e.; note length(x) also counts NAs, so sum(!is.na(x)) would be stricter
system.time(print(sterr(inj$AGE)))
[1] 0.003706483
user system elapsed
0.257 0.139 0.394
There is a similar correspondence between the results of svymean vs mean applied to subsets of data using svyby (nearly 2 hours) vs tapply (4 seconds or so):
# 2.a svyby .. svymean
system.time(AGEbyYear<-svyby(~age, ~yr, db_design, svymean, na.rm=T, vartype = c( 'ci' , 'se' )))
user system elapsed
4600.050 376.661 6594.196
yr age se ci_l ci_u
2006 2006 33.83112 0.009939669 33.81163 33.85060
2007 2007 34.07261 0.010055909 34.05290 34.09232
2008 2008 34.57061 0.009968646 34.55107 34.59014
2009 2009 34.87537 0.010577461 34.85464 34.89610
2010 2010 35.31072 0.010465413 35.29021 35.33124
2011 2011 35.33135 0.010312395 35.31114 35.35157
2012 2012 35.30092 0.010313871 35.28071 35.32114
# 2.b tapply ... mean
system.time(print(tapply(inj$AGE, inj$YEAR, mean, na.rm=T)))
2006 2007 2008 2009 2010 2011 2012
33.86900 34.08656 34.60711 34.81538 35.27819 35.36932 35.38931
user system elapsed
3.388 1.166 4.529
system.time(print(tapply(inj$AGE, inj$YEAR, sterr)))
2006 2007 2008 2009 2010 2011 2012
0.009577755 0.009620235 0.009565588 0.009936695 0.009906659 0.010148218 0.009880995
user system elapsed
3.237 0.990 4.186
The correspondence between survey and unadjusted results starts to break down with absolute counts, which requires writing a small function that appeals to the survey object and uses a bit of Dr. Lumley's code to weight the counts:
# 3.a svytotal
system.time(print(svytotal(~adj_cost, svydes, na.rm=T)))
total SE
adj_cost 9.975e+10 26685092
user system elapsed
10005.837 610.701 10577.755
# 3.b "direct" calculation
SurvTot <- function(x){
  N <- sum(1 / svydes$prob)
  m <- mean(x, na.rm = TRUE)
  total <- m * N
  return(total)
}
> system.time(print(SurvTot(inj$adj_cost)))
[1] 1.18511e+11
user system elapsed
0.735 0.311 0.989
The results are much less acceptable, though still within the margin of error established by the survey procedure. And again, 3 hours vs. 1 second is an appreciable cost for the more precise results.
Update: 10 Feb 2016
Thanks Severin and Anthony for allowing me to borrow your synapses. Sorry for the delay in following up; it has taken a little time to try out both your suggestions.
Severin, you are right in your observation that the Revolution Analytics/MRO build is faster for some operations. It looks like this has to do with the BLAS ("Basic Linear Algebra Subprograms") library shipped with CRAN R, which is more precise, but slower. So I optimized the BLAS on my machine with the proprietary (but free with Macs) Apple Accelerate vecLib, which allows multithreading (see http://blog.quadrivio.com/2015/06/improved-r-performance-with-openblas.html). This seemed to shave some time off the operations, e.g. from 3 hours for a svyby/svymean to a bit over 2 hours.
Anthony, I had less luck with the replicate-weight design approach: type="bootstrap" with replicates=20 ran for about 39 hours before I quit out; type="BRR" returned the error "Can't split with odd numbers of PSUs in a stratum", and when I set the options to small="merge", large="merge", it ran for several hours before the OS heaved a huge sigh and ran out of application memory; type="JKn" returned the error "cannot allocate vector of size 11964693.8 Gb".
Again, many thanks for your suggestions. For now, I will resign myself to running these analyses piecemeal and over long periods of time. If I do eventually come up with a better approach, I'll post it on SO.
for huge data sets, linearized designs (svydesign) are much slower than replication designs (svrepdesign). review the weighting functions within survey::as.svrepdesign and use one of them to directly make a replication design. you cannot use linearization for this task. and you are likely better off not even using as.svrepdesign but instead using the functions within it.
for one example using cluster=, strata=, and fpc= directly into a replicate-weighted design, see
https://github.com/ajdamico/asdfree/blob/master/Censo%20Demografico/download%20and%20import.R#L405-L429
note you can also view minute-by-minute speed tests (with timestamps for each event) here http://monetdb.cwi.nl/testweb/web/eanthony/
also note that the replicates= argument is nearly 100% responsible for the speed at which the design will run. so perhaps make two designs, one for coefficients (with just a couple of replicates) and another for SEs (with as many as you can tolerate). run your coefficients interactively and refine which numbers you need during the day, then leave the bigger processes that require SE calculations running overnight
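for concreteness, here is a minimal sketch of building a bootstrap replicate-weight design directly with survey::svrepdesign(). the replicate-weight columns (repwt_1 ... repwt_20) are hypothetical placeholders for whatever your weighting step produces, and depending on how the replicates were made the scale= / rscales= arguments may also be needed:
library(survey)
# assumes inj already holds the analysis weight DISCWT plus pre-computed
# bootstrap replicate weights named repwt_1 ... repwt_20 (hypothetical names)
rep_design <- svrepdesign(
  data             = inj,
  weights          = ~DISCWT,
  repweights       = "repwt_[0-9]+",  # regex matching the replicate-weight columns
  type             = "bootstrap",
  combined.weights = TRUE
)
svymean(~age, rep_design, na.rm = TRUE)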
Been a while, but closing the loop on this. As Dr. Lumley refers to in the recent comment above, Charco Hui resurrected the experimental sqlsurvey package as "svydb", which I've found to be a great tool for working with very large survey data sets in R. See a related post here: How to get svydb R package for large survey data sets to return standard errors

Cross-correlation of 5 time series (distance) and interpretation

I would appreciate some input in this a lot!
I have data for 5 time series (an example of one step in the series is in the plot below), where each step in the series is a vertical profile of species sightings in the ocean, and the profiles were investigated 6 h apart. Within each profile the observations are spaced vertically by 0.1 m (and the 5 steps are spaced 6 h apart in time).
What I want to do is calculate the multivariate cross-correlation between all series in order to find out at which lag the profiles are most correlated and stable over time.
Profile example (plot not reproduced here).
I find the R documentation on this not so great, so what I have done so far is use the ccm function from the MTS package to create cross-correlation matrices. However, interpreting the figures is rather difficult with such sparse documentation, and I would appreciate some help with that a lot.
Data example:
http://pastebin.com/embed_iframe.php?i=8gdAeGP4
Save in file cross_correlation_stack.csv or change as you wish.
library(dplyr)
library(MTS)
library(data.table)
d1 <- file.path('cross_correlation_stack.csv')
d2 <- read.csv(d1)
# USING package MTS
mod1 <- ccm(d2, lag = 1000, level = TRUE)
# USING base R
acf(d2, lag.max = 1000)
# MQ plot, also from the MTS package
mq(d2, lag = 1000)
Running this produces several plots from the ccm and mq commands (the cross-correlation matrices and a significance plot), and in parallel the acf command from above produces a multivariate ACF plot; the figures are not reproduced here.
My question now is whether somebody can tell me if I am going in the right direction, or whether there are better-suited packages and commands.
Since the default figures don't get any titles etc.: what am I looking at, specifically in the ccm figures?
The acf command was proposed somewhere, but can I use it here? Its documentation says it ... calculates autocovariance or autocorrelation ..., which I assume is not what I want. But then again it's the only command that seems to work multivariate. I am confused.
The plot with the significance values shows that the p-values increase after a lag of 150 (15 meters). How would you interpret that with regard to my data, with 0.1 m intervals of species sightings and many lags up to 100-150 significant? Would that mean that peaks in sightings are stable over the 5 time steps on a scale of 150 lags, i.e. 15 meters?
Either way, it would be nice if somebody who has worked with this before could explain what I am looking at! Any input is highly appreciated!
You can use the base R function ccf(), which will estimate the cross-correlation function between any two variables x and y. However, it only works on vectors, so you'll have to loop over the pairs of columns in your data frame (called d1 below). Something like:
cc  <- vector("list", choose(dim(d1)[2], 2))
par(mfrow = c(ceiling(choose(dim(d1)[2], 2) / 2), 2))
cnt <- 1
for(i in 1:(dim(d1)[2] - 1)) {
  for(j in (i + 1):dim(d1)[2]) {
    cc[[cnt]] <- ccf(d1[, i], d1[, j],
                     main = paste0("Cross-correlation of ", colnames(d1)[i],
                                   " with ", colnames(d1)[j]))
    cnt <- cnt + 1
  }
}
This will plot each of the estimated CCFs and store the estimates in the list cc. It is important to remember that the lag-k value returned by ccf(x,y) is an estimate of the correlation between x[t+k] and y[t].
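As a quick self-contained check of that lag convention (a toy example of mine, not your data): x below is a noisy copy of y shifted forward by 2 steps, so the cross-correlation should peak at lag k = 2.
set.seed(1)
y <- rnorm(200)
x <- c(rep(0, 2), y[1:198]) + rnorm(200, sd = 0.1)  # x[t] is (noisy) y[t - 2]
cc <- ccf(x, y, plot = FALSE)
cc$lag[which.max(cc$acf)]  # expect 2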
All of that said, however, the ccf is really only appropriate for data that are more-or-less normally distributed, and your data are clearly overdispersed, with all of those zeroes. Therefore, lacking some adequate transformation, you should really look into other metrics of "association", such as the mutual information as estimated from entropy. I suggest checking out the R packages entropy and infotheo.
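A minimal sketch of that alternative, assuming infotheo's discretize()/mutinformation() interface and taking the first two profile columns of d1 as an example pair:
library(infotheo)
# discretize() bins continuous values (equal-frequency bins by default);
# mutual information is then estimated on the binned data
x_disc <- discretize(d1[, 1])
y_disc <- discretize(d1[, 2])
mutinformation(x_disc, y_disc)  # MI in nats; larger = stronger association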

Interpreting the phom R package - persistent homology - topological analysis of data

I am learning to analyze the topology of data with the phom package for R.
I would like to understand (characterize) a set of data (a matrix: 3500 rows, 10 columns). To achieve this, the R package phom runs a persistent homology computation that describes the data.
(Reference: the following 4-minute video describes what we are seeking to do with homology in topology: http://www.youtube.com/embed/XfWibrh6stw?rel=0&autoplay=1).
Using the R-package "phom" (link: http://cran.r-project.org/web/packages/phom/phom.pdf) the following example can be run.
I need help to properly understand how the pHom function works and how to interpret the resulting data (plot).
Below is Example #1 from the reference manual of the phom package, run in R.
Load Packages
library(phom)
library(Rcpp)
Example 1
x <- runif(100)
y <- runif(100)
points <- t(as.matrix(rbind(x, y)))
max_dim <- 2
max_f <- 0.2
intervals <- pHom(points, max_dim, max_f, metric = "manhattan")
plotPersistenceDiagram(intervals, max_dim, max_f,
                       title = "Random Points in Cube with l_1 Norm")
I would kindly appreciate if someone would be able to help me with:
Question:
a.) What does the value max_f mean, and where does it come from? From my data? Do I set it?
b.) The plot produced by plotPersistenceDiagram (if you run the example in R you will see it): how do I interpret it?
Thank you.
Note: in order to run the phom package you need the Rcpp package and a recent version of R (3.0.3 at the time of writing).
The previous example was run in R after loading the phom and Rcpp packages respectively.
This is totally the wrong venue for this question, but just in case you're still struggling with it a year later, I happen to know the answer.
Computing persistent homology has two steps:
1. Turn the point cloud into a filtration of simplicial complexes
2. Compute the homology of each simplicial complex in the filtration
The "filtration" part of step 1 means you have to compute a simplicial complex for a whole range of parameters. The parameter in this case is epsilon, the distance threshold within which points are connected. The max_f variable caps the sweep: epsilon runs from zero up to max_f.
plotPersistenceDiagram displays the homological "persistence barcodes" as points instead of lines. The x-coordinate of a point is the birth time of that topological feature (the value of epsilon at which it first appears), and the y-coordinate is its death time (the value of epsilon at which it disappears).
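Features that persist over a wide range of epsilon sit far above the diagonal; short-lived features near it are usually noise. As a small illustration (my own sketch, assuming the same positional arguments pHom(points, max_dim, max_f) as in the question's example), points sampled from a noisy circle should show one long-lived dimension-1 feature, the loop, well above the diagonal:
theta <- runif(100, 0, 2 * pi)
circle <- cbind(cos(theta), sin(theta)) + matrix(rnorm(200, sd = 0.05), ncol = 2)
intervals <- pHom(circle, 1, 1)  # homology up to dimension 1, epsilon swept to 1
plotPersistenceDiagram(intervals, 1, 1, title = "Noisy circle")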

Error probability function

I have DNA amplicons with base mismatches, which can arise during the PCR amplification process. My interest is in the probability that a sequence contains errors, given the per-base error rate, the number of mismatches, and the number of bases in the amplicon.
I came across an article [Cummings, S. M. et al (2010). Solutions for PCR, cloning and sequencing errors in population genetic analysis. Conservation Genetics, 11(3), 1095–1097. doi:10.1007/s10592-009-9864-6]
that proposes a formula (given in the paper) to calculate the probability mass function in such cases.
I implemented the formula in R as shown here:
pcr.prob <- function(k, N, eps){
  v <- numeric(k)
  for(i in 1:k) {
    v[i] <- choose(N, k - i) * (eps^(k - i)) * (1 - eps)^(N - (k - i))
  }
  1 - sum(v)
}
From the article: suppose we analysed an 800 bp amplicon using a PCR of 30 cycles with 1.85e-5 misincorporations per base per cycle, and found 10 unique sequences that are each 3 bp different from their most similar sequence. The probability that a novel sequence was generated by three independent PCR errors equals P = 0.0011.
However when I use my implementation of the formula I get a different value.
pcr.prob(3,800,0.0000185)
[1] 5.323567e-07
What could I be doing wrong in my implementation? Am I misinterpreting something?
Thanks
I think they've got the right number (0.00113), but it is badly explained in their paper.
The calculation you want to be doing is:
pbinom(3, 800, 1 - (1 - 1.85e-5)^30, lower.tail = FALSE)
I.e. what's the probability of seeing more than three modifications across 800 independent bases, given 30 amplification cycles that each have a 1.85e-5 chance per base of going wrong? The 1 - (1 - 1.85e-5)^30 term is the probability that a base does not stay correct through all 30 cycles.
Somewhat statsy, may be worth a move…
Thinking about this more: you will start to see floating-point inaccuracies when working with very small probabilities here. That is, 1 - x, where x is a small number, will start to go wrong when the absolute value of x drops below about 1e-10. Working with log-probabilities is a good idea at this point; specifically, the log1p function is a great help. Using:
pbinom(3, 800, 1 - exp(log1p(-1.85e-5) * 30), lower.tail = FALSE)
will continue to work even when the error incorporation rate is very low.
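A further refinement along the same lines (my addition, not from the original answer): expm1() also removes the remaining 1 - exp(...) cancellation when the per-cycle rate is tiny.
# -expm1(u) computes 1 - exp(u) without the cancellation in the subtraction
pbinom(3, 800, -expm1(30 * log1p(-1.85e-5)), lower.tail = FALSE)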

Easiest way to determine time complexity from run times

Let's suppose I am trying to analyze an algorithm, and all I can do is run it with different inputs. I can construct a set of points (x, y) as (sample size, run time).
I would like to dynamically categorize the algorithm into a complexity class (linear, quadratic, exponential, logarithmic, etc..)
Ideally I could give an equation that more or less approximates the behavior.
I am just not sure what the best way to do this is.
For a polynomial of any degree I can create regression curves and come up with some measure of fit, but I don't really have a clue how I would do that for a nonpolynomial function. It is harder since I don't have any prior knowledge of what shape I should try to fit.
This may be more of a math question than a programming question, but it is very interesting to me. I'm not a mathematician, so there may be a simpler established method to get a reasonable function from a set of points that I just don't know about. Does anyone have any ideas for solving a problem like this? Is there a numerical library for C# that could help me crunch the numbers?
Well, there are not that many complexity classes you really care about, so let's say: linear, quadratic, polynomial (degree > 2), exponential, and logarithmic.
For each of these you could use the largest (x, y) pair to solve for the unknown variable. Let y = f(x) denote the runtime of your algorithm as a function of the sample size. Let's assume that f(1) = 0; if it isn't, we can always subtract off the value y(1) from each of the y's, which just eliminates the constant term in f(x). Let y(end) denote the last (and largest) value of y in your (x, y) data set.
At this point we can solve for the unknown in each canonical form:
f(x) = c*x
f(x) = c*x^2
f(x) = x^c
f(x) = c^x
f(x) = log(x)/log(c)
Since there is only a single unknown in each equation, we can use any point to solve for it. Consider the following data, generated from a polynomial of random degree > 2:
x = [ 1 2 3 4 5 6 7 8 9 10 ];
y = [ 0 6 19 44 81 135 206 297 411 550 ];
If we use the last point to solve for c for each possibility (assuming this gives the least noisy estimate):
550 = c*10 -> c = 55
550 = c*10^2 -> c = 5.5
550 = 10^c -> c = log(550)/log(10) ~= 2.74
550 = c^10 -> c = 550^(1/10) ~= 1.88
550 = log(10)/log(c) -> c = 10^(1/550) ~= 1.0042
We can now compare how well each of these functions fits the remaining data. Here is a plot:
I'm new and I can't post images so look at the plot here: http://i.stack.imgur.com/UH6T8.png
The true data are shown as red asterisks, the linear fit as a green line, the quadratic in blue, the polynomial in black, the exponential in pink, and the log fit in green with O's. It should be pretty clear from the residuals which function fits your data best.
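To make that comparison concrete, here is a minimal sketch in R (used for consistency with the rest of this page, though the question mentions C#), with the constants taken from the single-point solutions above; the smallest sum of squared residuals indicates the best-fitting form.
x <- 1:10
y <- c(0, 6, 19, 44, 81, 135, 206, 297, 411, 550)
fits <- list(
  linear      = function(x) 55 * x,
  quadratic   = function(x) 5.5 * x^2,
  power       = function(x) x^2.74,
  exponential = function(x) 1.88^x,
  logarithmic = function(x) log(x) / log(1.0042)
)
sapply(fits, function(f) sum((y - f(x))^2))  # smallest value = best-fitting form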
Curve fitting used to be an art, but is now somehow decadent :) (That's a joke for the physicists around)
A lot of progress has been made, that allows simple mortals to guess (some) non trivial functional dependencies.
I'll not enter into a description of the methods and limitations, but instead I'll refer you to eureqa, which is a very nice piece of software developed at Cornell.
Eureqa (pronounced "eureka") is a software tool for detecting equations and hidden mathematical relationships in your data. Its goal is to identify the simplest mathematical formulas which could describe the underlying mechanisms that produced the data. Eureqa is free to download and use. Look for the program download, video tutorial, user forum, and other reference materials.
I tried eureqa several times with very good results if the models are not too complicated. I think it is good enough for distinguishing between polynomials, logs and exponentials.
HTH!
Post Scriptum:
Regrettably the software isn't free anymore :(
