I’m looking for R resources, and I started with “An Introduction to R” at r-project.org. I got stumped almost immediately.
I think I've figured out what’s going on, and my question is basically
Are there resources to help me figure out something like this more
easily?
The preface of the Introduction to R suggests starting with the introductory session in Appendix A, and right at the start is this code and remark.
x <- rnorm(50)
y <- rnorm(x)
Generate two pseudo-random normal vectors of x- and y-coordinates.
The documentation says the (first and only non-optional) parameter to rnorm is the length of the result vector. So x <- rnorm(50) produces a vector of 50 random values from a normal distribution with mean 0 and standard deviation 1.
So far so good. But why does rnorm(x) seem to do what y <- rnorm(50) or y <- rnorm(length(x)) would have done? Either of these alternatives seems clearer to me.
My guess as to what happens is this:
The wrapper for rnorm didn’t care what kind of thing x is and just passed to the underlying C function a pointer to the C struct for x as an R object.
R objects represented in C are structs followed by “data”; the data of the C representation of an R vector of reals starts with two integers, the first of which is the vector's length. (The vector elements follow those integers.) I found this out by reading up on R internals here.
If a C function were written to find the value of an R integer from a passed pointer-to-R-object, and it were called with a pointer to an R vector of reals, it would find the vector’s length in the place it would look for the single integer.
In addition to my main question of “How can I figure out something like this more easily?”, I wouldn’t mind knowing whether what I think is going on is correct, and whether rnorm(x) is idiomatic R in this context or more of a sloppy choice. Given that it does something useful, can it be relied upon, or is it just lucky behavior for an expression that isn’t well-defined in R?
I’m used to strongly-typed languages like C or SQL, which have easier-to-follow (for me) semantics and which also have more comprehensive references available, so any references for R that have a programming-language-theory focus or are aimed at people used to strong typing would be good, too.
It is documented behavior. From ?rnorm:
Usage: [...]
rnorm(n, mean = 0, sd = 1)
Arguments:
[...]
n: number of observations. If ‘length(n) > 1’, the length is
taken to be the number required.
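So rnorm(x) with a length-50 x is documented to behave exactly like rnorm(50). A quick sketch confirming that only the length of the first argument matters:

```r
# When n has length > 1, rnorm uses only length(n), as ?rnorm documents.
set.seed(1)
a <- rnorm(50)     # 50 draws
set.seed(1)
b <- rnorm(a)      # length(a) == 50, so this also produces 50 draws
identical(a, b)    # TRUE: the values in a are ignored, only its length counts
```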
I wish to find an optimisation tool in R that lets me determine the value of an input parameter (say, a specific value between 0.001 and 0.1) that results in my function producing a desired output value.
My function takes an input parameter and computes a value. I want this output value to exactly match a predetermined number, so the function outputs the absolute of the difference between these two values; when they are identical, the output of the function is zero.
I've tried optimize(), but it seems to be set up to minimise the input parameter, not the output value. I also tried uniroot(), but it produces the error f() values at end points not of opposite sign, suggesting that it doesn't like the fact that increasing/decreasing the input parameter reduces the output value up to a point, but going beyond that point then increases it again.
Apologies if I'm missing something obvious here—I'm completely new to optimising functions.
Indeed you are missing something obvious :-) There is a straightforward way to formulate your problem.
Assume the function whose output must equal the desired value is f.
Define a function g satisfying
g <- function(x) f(x) - output_value
Now you can use uniroot to find a zero of g. But you must provide endpoints that satisfy the requirements of uniroot. I.e. the value of g for one endpoint must be positive and the value of g for the other endpoint must be negative (or the other way around).
Example:
f <- function(x) x - 10
g <- function(x) f(x) - 8
then
uniroot(g,c(0,20))
will do what you want but
uniroot(g,c(0,2))
will issue the error message values at end points not of opposite sign.
You could also use an optimization function, but then you want to minimize the absolute value of g, not g itself. To set you straight: optimize does not minimize the input parameter. Read the help thoroughly.
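Using the toy f from the example above, both routes can be sketched side by side:

```r
f <- function(x) x - 10
g <- function(x) f(x) - 8          # zero where f(x) equals the target value 8

# Route 1: root finding; the endpoints must bracket a sign change.
root <- uniroot(g, c(0, 20))$root  # close to 18

# Route 2: minimisation of |g|; no sign change required.
opt <- optimize(function(x) abs(g(x)), c(0, 20))$minimum  # also close to 18
```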
I'm doing some statistical analysis with R software (bootstrapped Kolmogorov-Smirnov tests) of very large data sets, meaning that my p values are all incredibly small. I've Bonferroni corrected for the large number of tests that I've performed meaning that my alpha value is also very small in order to reject the null hypothesis.
The problem is, R presents me with p values of 0 in some cases where the p value is presumably so small that it cannot be presented (these are usually for the very large sample sizes). While I can happily reject the null hypothesis for these tests, the data is for publication, so I'll need to write p < ..... but I don't know what the lowest reportable values in R are?
I'm using the ks.boot function in case that matters.
Any help would be much appreciated!
.Machine$double.xmin gives you the smallest non-zero normalized floating-point number. On most systems that's 2.225074e-308. However, I don't believe this is a sensible limit.
Instead I suggest that in Matching::ks.boot you change the line
ks.boot.pval <- bbcount/nboots
to
ks.boot.pval <- log(bbcount) - log(nboots)
and work on the log scale.
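To see why the log scale is the safer place to work, here is a base-R illustration (not part of ks.boot itself):

```r
.Machine$double.xmin    # smallest normalized double, about 2.2e-308
exp(-1000)              # a probability this small underflows to exactly 0
log_p <- -1000          # but on the log scale it is an ordinary number
exp(log_p) == 0         # TRUE: converting back underflows again
```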
Edit:
You can use trace to modify the function.
Step 1: Look at the function body, to find out where to add additional code.
as.list(body(ks.boot))
You'll see that element 17 is ks.boot.pval <- bbcount/nboots, so we need to add the modified code directly after that.
Step 2: trace the function.
trace(ks.boot, quote(ks.boot.pval <- log(bbcount) - log(nboots)), at = 18)
Step 3: Now you can use ks.boot and it will return the logarithm of the bootstrap p-value as ks.boot.pvalue. Note that you cannot use summary.ks.boot since it calls format.pval, which will not show you negative values.
Step 4: Use untrace(ks.boot) to remove the modifications.
I don't know whether ks.boot has methods in the packages Rmpfr or gmp but if it does, or you feel like rolling your own code, you can work with arbitrary precision and arbitrary size numbers.
What is the best R idiom to compute sums of elements within a sliding window?
Conceptually I want the following:
output <- numeric(length(input) - lag + 1)
for (i in 1:(length(input) - lag + 1))
output[i] <- sum(input[i:(i + lag - 1)])
In other words, every output element should be the sum of a fixed number of input elements (called lag here), resulting in an appropriately shorter result vector. I know that I can theoretically write this as
output = diff(cumsum(c(0, input)), lag = lag)
but I am worried about the precision here. I have a setup in mind where all the values would have the same sign, and the vectors would be pretty large. Summing up the values up front might lead to pretty large numbers, so there won't be many significant digits left for the individual differences. This feels bad.
I would imagine that it should be possible to do better than that, at least when using a single function instead of two. An implementation could maintain the current sum, always adding one element and subtracting another for each iteration. Since that would still accumulate rounding errors along the way, one could perform the computations separately from both ends, and if the results at the center were too far off, compute a fresh result from the center and thus increase precision in a divide-and-conquer approach.
Do you know of any implementation which does anything like this?
Or is there a reason why this won't work as I think it should?
Or perhaps a reason why the diff(cumsum(…)) approach isn't as bad as it seems?
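For reference, the two formulations agree on a small example (values chosen arbitrarily):

```r
input <- c(1, 2, 3, 4, 5)
lag <- 3
# explicit loop over windows
loop_out <- sapply(seq_len(length(input) - lag + 1),
                   function(i) sum(input[i:(i + lag - 1)]))
# cumulative-sum trick
fast_out <- diff(cumsum(c(0, input)), lag = lag)
loop_out   # 6 9 12
fast_out   # 6 9 12
```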
Edit: I had some off-by-one mistakes in my above formulations, making them inconsistent. Now they seem to agree on test data. lag should be the number of elements summed, and I'd expect a shorter vector as a result. I'm not dealing with time series objects, so absolute time alignment is not that relevant to me.
I had seen some noisy-looking things in my real data, which I had assumed to be due to such numeric problems. Since several different approaches to compute these values, using different suggestions from answers and comments, still led to similar results, it might be that the strangeness of my data is not in fact due to numeric issues.
So in order to evaluate answers, I used the following setup:
library(Rmpfr)
library(caTools)
len <- 1024*1024*8
lag <- 3
precBits <- 128
taillen <- 6
set.seed(42) # reproducible
input <- runif(len)
input <- input + runif(len, min=-1e-9, max=1e-9) # use >32 bits
options(digits = 22)
# Reference: sum everything separately using high precision.
output <- mpfr(rep(0, taillen), precBits = precBits)
for (i in 1:taillen)
output[i] <- sum(mpfr(input[(len-taillen+i-lag+1):(len-taillen+i)],
precBits=precBits))
output
addResult <- function(data, name) {
n <- c(rownames(resmat), name)
r <- rbind(resmat, as.numeric(tail(data, taillen)))
rownames(r) <- n
assign("resmat", r, parent.frame())
}
# reference solution, rounded to nearest double, assumed to be correct
resmat <- matrix(as.numeric(output), nrow=1)
rownames(resmat) <- "Reference"
# my original solution
addResult(diff(cumsum(c(0, input)), lag=lag), "diff+cumsum")
# filter as suggested by Matthew Plourde
addResult(filter(input, rep(1, lag), sides=1)[lag:length(input)], "filter")
# caTools as suggested by Joshua Ulrich
addResult(lag*runmean(input, lag, alg="exact", endrule="trim"), "caTools")
The result for this looks as follows:
[,1] [,2]
Reference 2.380384891521345469556 2.036472557725210297264
diff+cumsum 2.380384892225265502930 2.036472558043897151947
filter 2.380384891521345469556 2.036472557725210741353
caTools 2.380384891521345469556 2.036472557725210741353
[,3] [,4]
Reference 1.999147923481302324689 1.998499369297661143463
diff+cumsum 1.999147923663258552551 1.998499369248747825623
filter 1.999147923481302324689 1.998499369297661143463
caTools 1.999147923481302324689 1.998499369297661143463
[,5] [,6]
Reference 2.363071143676507723796 1.939272651346203080180
diff+cumsum 2.363071143627166748047 1.939272651448845863342
filter 2.363071143676507723796 1.939272651346203080180
caTools 2.363071143676507723796 1.939272651346203080180
The result indicates that diff+cumsum is still surprisingly accurate. (It appeared even more accurate before I thought of adding that second runif vector.) filter and caTools both are almost indistinguishable from the perfect result. As for performance, I haven't tested that (yet). I only know that the Rmpfr cumsum with 128 bits was slow enough that I didn't feel like waiting on its completion. Feel free to edit this question if you have a performance benchmark, or new suggestions to add to the comparison.
I can't speak to whether this is such an implementation, but there is
filter(input, sides=2, filter=rep(1, lag+1))
Looking at the body of filter, it looks like the hard work gets passed off to a C routine, C_rfilter, so perhaps you could examine this to see if it satisfies your precision requirements. Otherwise, @JoshuaUlrich's suggestion sounds promising.
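With the window definition from the question (each output is the sum of lag consecutive inputs), the one-sided form used in the question's benchmark lines up as expected; a small check:

```r
input <- c(1, 2, 3, 4, 5)
lag <- 3
out <- stats::filter(input, rep(1, lag), sides = 1)
out                                  # NA NA 6 9 12: the first lag-1 entries are NA
sums <- as.numeric(out[lag:length(input)])
sums                                 # 6 9 12
```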
This answer is based on the comment from Joshua Ulrich.
The package caTools provides a function runmean which computes my partial sum, divided by the window size (or rather the number of not-NA elements in the window in question). Quoting from its documentation:
In case of runmean(..., alg="exact") function a special algorithm is used (see references section) to ensure that round-off errors do not accumulate. As a result runmean is more accurate than filter(x, rep(1/k,k)) and runmean(..., alg="C") functions.
Note:
Function runmean(..., alg="exact") is based by code by Vadim Ogranovich, which is based on Python code (see last reference), pointed out by Gabor Grothendieck.
References:
About round-off error correction used in runmean: Shewchuk, Jonathan
Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates
More on round-off error correction can be found at:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/393090
The code stores the sum of the current window using a sequence of double precision floating point values, where smaller values represent the round-off error incurred by larger elements. Therefore there shouldn't be any accumulation of rounding errors even if the input data is processed in a single pass, adding one element and removing another at each step. The final result should be as exact as double precision arithmetic can represent it.
Algorithms other than exact seem to yield somewhat different results, though, so I probably wouldn't suggest these.
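The single-pass add-one/remove-one idea can be sketched in plain R with Kahan-style compensation. This is only an illustration of the principle, not the caTools code, which uses Shewchuk's stronger expansion scheme:

```r
# Illustrative sketch: sliding-window sum with a Kahan compensation term,
# adding the entering element and subtracting the leaving one at each step.
sliding_sum <- function(input, lag) {
  out <- numeric(length(input) - lag + 1)
  s <- 0; comp <- 0                  # running sum and compensation term
  for (i in seq_along(input)) {
    delta <- input[i]                # element entering the window
    y <- delta - comp; t <- s + y; comp <- (t - s) - y; s <- t
    if (i >= lag) {
      out[i - lag + 1] <- s
      delta <- -input[i - lag + 1]   # element leaving the window
      y <- delta - comp; t <- s + y; comp <- (t - s) - y; s <- t
    }
  }
  out
}
sliding_sum(c(1, 2, 3, 4, 5), 3)     # 6 9 12
```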
It seems a bit unfortunate that the source code contains a runsum_exact function, but it is commented out. The division to obtain the mean, combined with the multiplication to get back to the sum, will introduce rounding errors which could have been avoided. To this the CHANGES file says:
11) caTools-1.11 (Dec 2010)
Fully retired runsum.exact, which was not working for a while, use runmean with "exact" option instead.
At the moment (caTools version 1.14 from 2012-05-22) the package appears to be orphaned.
In recent conversations with fellow students, I have been advocating for avoiding globals except to store constants. This is a sort of typical applied statistics-type program where everyone writes their own code and project sizes are on the small side, so it can be hard for people to see the trouble caused by sloppy habits.
In talking about avoidance of globals, I'm focusing on the following reasons why globals might cause trouble, but I'd like to have some examples in R and/or Stata to go with the principles (and any other principles you might find important), and I'm having a hard time coming up with believable ones.
Non-locality: Globals make debugging harder because they make understanding the flow of code harder
Implicit coupling: Globals break the simplicity of functional programming by allowing complex interactions between distant segments of code
Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions
A useful answer to this question would be a reproducible and self-contained code snippet in which globals cause a specific type of trouble, ideally with another code snippet in which the problem is corrected. I can generate the corrected solutions if necessary, so the example of the problem is more important.
Relevant links:
Global Variables are Bad
Are global variables bad?
I also have the pleasure of teaching R to undergraduate students who have no experience with programming. The problem I found was that most examples of when globals are bad are rather simplistic and don't really get the point across.
Instead, I try to illustrate the principle of least astonishment. I use examples where it is tricky to figure out what was going on. Here are some examples:
I ask the class to write down what they think the final value of i will be:
i = 10
for(i in 1:5)
i = i + 1
i
Some of the class guess correctly. Then I ask should you ever write code like this?
In some sense i is a global variable that is being changed.
What does the following piece of code return:
x = 5:10
x[x=1]
The problem is what exactly we mean by x: inside the brackets, x = 1 is parsed as an argument (an assignment, not a comparison), so this is x[1] and returns 5; testing for equality would be x[x == 1].
Does the following function return a global or local variable:
z = 0
f = function() {
if(runif(1) < 0.5)
z = 1
return(z)
}
Answer: both. Again discuss why this is bad.
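A deterministic wrapper around that third example makes the point concrete (the seed is arbitrary):

```r
z <- 0
f <- function() {
  if (runif(1) < 0.5)
    z <- 1          # assigns a *local* z; the global z is untouched
  return(z)
}
set.seed(42)
res <- c(f(), f(), f(), f())
res   # a mix of 0s (the global z) and 1s (the local z), depending on the draws
```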
Oh, the wonderful smell of globals...
All of the answers in this post gave R examples, and the OP wanted some Stata examples, as well. So let me chime in with these.
Unlike R, Stata does take care of locality of its local macros (the ones that you create with the local command), so the issue of "Is this a global z or a local z that is being returned?" never comes up. (Gosh... how can you R guys write any code at all if locality is not enforced???) Stata has a different quirk, though, namely that a non-existent local or global macro is evaluated as an empty string, which may or may not be desirable.
I have seen globals used for several main reasons:
Globals are often used as shortcuts for variable lists, as in
sysuse auto, clear
regress price $myvars
I suspect that the main usage of such construct is for someone who switches between interactive typing and storing the code in a do-file as they try multiple specifications. Say they try regression with homoskedastic standard errors, heteroskedastic standard errors, and median regression:
regress price mpg foreign
regress price mpg foreign, robust
qreg price mpg foreign
And then they run these regressions with another set of variables, then with yet another one, and finally they give up and set this up as a do-file myreg.do with
regress price $myvars
regress price $myvars, robust
qreg price $myvars
exit
to be accompanied with an appropriate setting of the global macro. So far so good; the snippet
global myvars mpg foreign
do myreg
produces the desirable results. Now let's say they email their famous do-file that claims to produce very good regression results to collaborators, and instruct them to type
do myreg
What will their collaborators see? In the best case, the mean and the median of mpg if they started a new instance of Stata (failed coupling: myreg.do did not really know you meant to run this with a non-empty variable list). But if the collaborators had something in the works, and too had a global myvars defined (name collision)... man, would that be a disaster.
Globals are used for directory or file names, as in:
use $mydir\data1, clear
God only knows what will be loaded. In large projects, though, it does come in handy. You would want to define global mydir somewhere in your master do-file, maybe even as
global mydir `c(pwd)'
Globals can be used to store an unpredictable crap, like a whole command:
capture $RunThis
God only knows what will be executed; let's just hope it is not ! format c:\. This is the worst case of implicit strong coupling, but since I am not even sure that RunThis will contain anything meaningful, I put a capture in front of it, and will be prepared to handle a non-zero return code, _rc. (See, however, my example below.)
Stata's own use of globals is for God settings, like the type I error probability/confidence level: the global $S_level is always defined (and you must be a total idiot to redefine this global, although of course it is technically doable). This is, however, mostly a legacy issue with code of version 5 and below (roughly), as the same information can be obtained from a less fragile system constant:
set level 90
display $S_level
display c(level)
Thankfully, globals are quite explicit in Stata, and hence are easy to debug and remove. In some of the above situations, and certainly in the first one, you'd want to pass parameters to do-files which are seen as the local `0' inside the do-file. Instead of using globals in the myreg.do file, I would probably code it as
unab varlist : `0'
regress price `varlist'
regress price `varlist', robust
qreg price `varlist'
exit
The unab thing will serve as an element of protection: if the input is not a legal varlist, the program will stop with an error message.
In the worst cases I've seen, the global was used only once after having been defined.
There are occasions when you do want to use globals, because otherwise you'd have to pass the bloody thing to every other do-file or a program. One example where I found the globals pretty much unavoidable was coding a maximum likelihood estimator where I did not know in advance how many equations and parameters I would have. Stata insists that the (user-supplied) likelihood evaluator will have specific equations. So I had to accumulate my equations in the globals, and then call my evaluator with the globals in the descriptions of the syntax that Stata would need to parse:
args lf $parameters
where lf was the objective function (the log-likelihood). I encountered this at least twice, in the normal mixture package (denormix) and confirmatory factor analysis package (confa); you can findit both of them, of course.
One R example of a global variable that divides opinion is the stringsAsFactors issue on reading data into R or creating a data frame.
set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
options("stringsAsFactors" = FALSE)
set.seed(1)
str(data.frame(A = sample(LETTERS, 100, replace = TRUE),
DATES = as.character(seq(Sys.Date(), length = 100, by = "days"))))
options("stringsAsFactors" = TRUE) ## reset
This can't really be corrected because of the way options are implemented in R - anything could change them without you knowing it and thus the same chunk of code is not guaranteed to return exactly the same object. John Chambers bemoans this feature in his recent book.
A pathological example in R is the use of one of the globals available in R, pi, to compute the area of a circle.
> r <- 3
> pi * r^2
[1] 28.27433
>
> pi <- 2
> pi * r^2
[1] 18
>
> foo <- function(r) {
+ pi * r^2
+ }
> foo(r)
[1] 18
>
> rm(pi)
> foo(r)
[1] 28.27433
> pi * r^2
[1] 28.27433
Of course, one can write the function foo() defensively by forcing the use of base::pi but such recourse may not be available in normal user code unless packaged up and using a NAMESPACE:
> foo <- function(r) {
+ base::pi * r^2
+ }
> foo(r = 3)
[1] 28.27433
> pi <- 2
> foo(r = 3)
[1] 28.27433
> rm(pi)
This highlights the mess you can get into by relying on anything that is not solely in the scope of your function or passed in explicitly as an argument.
Here's an interesting pathological example involving replacement functions, the global assign, and x defined both globally and locally...
x <- c(1,NA,NA,NA,1,NA,1,NA)
local({
#some other code involving some other x begin
x <- c(NA,2,3,4)
#some other code involving some other x end
#now you want to replace NAs in the global/parent frame x with 0s
x[is.na(x)] <<- 0
})
x
[1] 0 NA NA NA 0 NA 1 NA
Instead of returning [1] 1 0 0 0 1 0 1 0, the replacement function computes is.na(x) from the local x (length 4, giving TRUE FALSE FALSE FALSE), and that logical index is then recycled against the global x being assigned to (length 8), zeroing positions 1 and 5. This behavior is documented in the R Language Definition.
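One defensive rewrite (a sketch, not the only option) computes the index on the object you actually intend to modify:

```r
x <- c(1, NA, NA, NA, 1, NA, 1, NA)
local({
  x <- c(NA, 2, 3, 4)                 # an unrelated local x
  gx <- get("x", envir = globalenv()) # fetch the global x explicitly
  gx[is.na(gx)] <- 0                  # index and modify the same object
  assign("x", gx, envir = globalenv())
})
x   # 1 0 0 0 1 0 1 0
```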
One quick but convincing example in R is to run the line like:
.Random.seed <- 'normal'
I chose 'normal' as something someone might choose, but you could use anything there.
Now run any code that uses generated random numbers, for example:
rnorm(10)
Then you can point out that the same thing could happen for any global variable.
I also use the example of:
x <- 27
z <- somefunctionthatusesglobals(5)
Then ask the students what the value of x is; the answer is that we don't know.
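One hypothetical definition of somefunctionthatusesglobals shows why we can't know:

```r
x <- 27
somefunctionthatusesglobals <- function(y) {
  x <<- y * 2        # hidden side effect on the global x
  y + 1
}
z <- somefunctionthatusesglobals(5)
z   # 6
x   # 10, not 27: the call silently changed it
```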
Through trial and error I've learned that I need to be very explicit in naming my function arguments (and ensure enough checks at the start and along the function) to make everything as robust as possible. This is especially true if you have variables stored in the global environment but then try to debug a function with custom values - and something doesn't add up! Here is a simple example that combines bad checks with calling a global variable.
glob.arg <- "snake"
customFunction <- function(arg1) {
if (is.numeric(arg1)) {
glob.arg <- "elephant"
}
return(strsplit(glob.arg, "n"))
}
customFunction(arg1 = 1) #argument correct, expected results
customFunction(arg1 = "rubble") #works, but may have unexpected results
An example sketch that came up while trying to teach this today. Specifically, this focuses on trying to give intuition as to why globals can cause problems, so it abstracts away as much as possible in an attempt to state what can and cannot be concluded just from the code (leaving the function as a black box).
The set up
Here is some code. Decide whether it will return an error or not based on only the criteria given.
The code
stopifnot( all( x!=0 ) )
y <- f(x)
5/x
The criteria
Case 1: f() is a properly-behaved function, which uses only local variables.
Case 2: f() is not necessarily a properly-behaved function, which could potentially use global assignment.
The answer
Case 1: The code will not return an error, since line one checks that there are no x's equal to zero and line three divides by x.
Case 2: The code could misbehave, since f() could e.g. subtract 1 from x and assign it back to the x in the parent environment; any x element equal to 1 would then become zero, and the third line would silently produce Inf rather than the intended values (numeric division by zero in R yields Inf rather than an error, which makes the bug even harder to spot), despite the check on line one.
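A concrete, hypothetical f() for Case 2:

```r
x <- c(1, 2, 3)
f <- function(v) {
  x <<- x - 1        # global assignment: a side effect on the caller's x
  sum(v)
}
stopifnot(all(x != 0))   # passes: x is still c(1, 2, 3) here
y <- f(x)
5 / x                    # x is now c(0, 1, 2): the first element gives Inf
```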
Here's one attempt at an answer that would make sense to statisticsy types.
Namespace collisions: Common names (x, i, and so forth) get re-used, causing namespace collisions
First we define a log likelihood function,
logLik <- function(x) {
y <<- x^2+2
return(sum(sqrt(y+7)))
}
Now we write an unrelated function to return the sum of squares of an input. Because we're lazy, we'll do this by passing y as a global variable,
sumSq <- function() {
return(sum(y^2))
}
y <<- seq(5)
sumSq()
[1] 55
Our log likelihood function seems to behave exactly as we'd expect, taking an argument and returning a value,
> logLik(seq(12))
[1] 88.40761
But what's up with our other function?
> sumSq()
[1] 633538
Of course, this is a trivial example, as will be any example that doesn't exist in a complex program. But hopefully it'll spark a discussion about how much harder it is to keep track of globals than locals.
In R you may also show them that there is often no need for a global: from within a function you can still read a variable from the function's enclosing environment, even when a local variable of the same name shadows it. For example, the code below prints "bbb" for the local zz and "aaa" for the zz in the function's environment:
zz <- "aaa"
x <- function(y) {
  zz <- "bbb"
  cat("value of zz from within the function:\n")
  cat(zz, "\n")
  cat("value of zz from the function's enclosing environment:\n")
  with(environment(x), cat(zz, "\n"))
}
x(1)