Modelling an infinite series in R

I'm trying to write code in R to approximate the infinite Taylor series for the well function W(u) from the Theis hydrogeological equation, W(u) = -0.5772 - ln(u) + u - u^2/(2*2!) + u^3/(3*3!) - u^4/(4*4!) + ...
I'm pretty new to functional programming, so this was a challenge! This is my attempt:
Wu <- function(u, repeats = 100) {
  result <- numeric(repeats)
  for (i in seq_along(result)) {
    result[i] <- -((-u)^i) / (i * factorial(i))
  }
  return(sum(result) - log(u) - 0.5772)
}
I've compared the results with values from a data table available here: https://pubs.usgs.gov/wsp/wsp1536-E/pdf/wsp_1536-E_b.pdf - see below (excuse verbose code - should have made a csv, with hindsight):
Wu_QC <- data.frame(u = c(1.0*10^-15, 4.1*10^-14, 9.9*10^-13, 7.0*10^-12, 3.7*10^-11,
                          2.3*10^-10, 6.8*10^-9, 5.7*10^-8, 8.4*10^-7, 6.3*10^-6,
                          3.1*10^-5, 7.4*10^-4, 5.1*10^-3, 2.9*10^-2, 8.7*10^-1,
                          4.6, 9.90),
                    Wu_table = c(33.9616, 30.2480, 27.0639, 25.1079, 23.4429,
                                 21.6157, 18.2291, 16.1030, 13.4126, 11.3978,
                                 9.8043, 6.6324, 4.7064, 2.9920, 0.2742,
                                 0.001841, 0.000004637))
Wu_QC$rep_100 <- Wu(Wu_QC$u, 100)
The good news is the formula gives identical results for repeats = 50, 100, 150 and 170 (so I've just given you the 100 version above). The bad news is that, while the function performs well for u < ~10^-3, it goes off the rails and gives negative outputs for numbers within an order of magnitude or so of 1. This doesn't happen when I call the function on an individual number, e.g.:
> Wu(4.6)
[1] 0.001856671
Which is the correct answer to 2 significant figures.
Can anyone spot what I've done wrong and/or suggest a better way to code this equation? I think the problem is something to do with my for loop and/or an issue with the factorials generating infinite numbers as u gets larger, but I'm not at all certain.
Thanks!

As it says on page 93 of your reference, W is also known as the exponential integral.
Then, e.g., the package expint provides a function to compute W(u):
library(expint)
expint(10^(-8))
# [1] 17.84347
expint(4.6)
# [1] 0.001841006
where the results agree exactly with the table you referenced.
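Since expint() appears to accept a vector of u values, the whole table from the question can be checked in one call (a quick sketch using the Wu_QC data frame defined above):
Wu_QC$W_expint <- expint(Wu_QC$u)  # vectorized over u, unlike the original Wu() loop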

You can write a function that takes in a value together with the repetition times and outputs the required value:
w <- function(u, l) {
  a <- 2:l
  -0.5772 - log(u) + u + sum(u^a * rep(c(-1, 1), length.out = l - 1) / a / factorial(a))
}
transform(Wu_QC, new = Vectorize(w)(u, 170))
u Wu_table new
1 1.0e-15 3.39616e+01 3.396158e+01
2 4.1e-14 3.02480e+01 3.024800e+01
3 9.9e-13 2.70639e+01 2.706387e+01
4 7.0e-12 2.51079e+01 2.510791e+01
5 3.7e-11 2.34429e+01 2.344290e+01
6 2.3e-10 2.16157e+01 2.161574e+01
7 6.8e-09 1.82291e+01 1.822914e+01
8 5.7e-08 1.61030e+01 1.610301e+01
9 8.4e-07 1.34126e+01 1.341266e+01
10 6.3e-06 1.13978e+01 1.139777e+01
11 3.1e-05 9.80430e+00 9.804354e+00
12 7.4e-04 6.63240e+00 6.632400e+00
13 5.1e-03 4.70640e+00 4.706408e+00
14 2.9e-02 2.99200e+00 2.992051e+00
15 8.7e-01 2.74200e-01 2.741930e-01
16 4.6e+00 1.84100e-03 1.856671e-03
17 9.9e+00 4.63700e-06 2.030179e-05
As u becomes large the estimate drifts from the table. Going beyond 170 terms is not possible in double precision anyway (factorial(171) overflows to Inf in R), but the residual error here is mostly the truncated Euler-Mascheroni constant: replacing -0.5772 with -0.5772156649 brings, for example, the u = 4.6 result from 1.856671e-3 to the tabulated 1.841006e-3. For arbitrary-precision arithmetic you could try another platform, e.g. Python.

I think I may have solved this myself (though borrowing heavily from Onyambo's answer!) Here's my code:
well_func2 <- function(u, l = 100) {
  result <- numeric(length(u))
  a <- 2:l
  for (i in seq_along(u)) {
    result[i] <- -0.5772 - log(u[i]) + u[i] +
      sum(u[i]^a * rep(c(-1, 1), length.out = l - 1) / a / factorial(a))
  }
  return(result)
}
As far as I can tell so far, this matches the tabulated results well for u < 5 (as did Onyambo's code), and it gives the same result for vector and single-value inputs. (The original Wu() failed on vectors because result[i] <- -((-u)^i)/(i * factorial(i)) assigns a whole vector to one element, keeping only the term for u[1] with a warning, while log(u) stayed vectorized.)
Still needs a bit more testing, and there's probably a tidier way to code it using map() or similar instead of the for loop, but I'm happy enough for now. Thought I'd share in case anyone else has the same problem.
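For what it's worth, here is a sketch of one loop-free variant using base vapply() instead of map() (same series as well_func2, only lightly tested):
well_func3 <- function(u, l = 100) {
  a <- 2:l
  signs <- rep(c(-1, 1), length.out = l - 1)  # (-1)^(n+1) for n = 2, 3, ...
  vapply(u, function(ui) {
    -0.5772 - log(ui) + ui + sum(ui^a * signs / a / factorial(a))
  }, numeric(1))
}
well_func3(Wu_QC$u)  # returns a vector the same length as the input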

Related

Similar function to R "rep" in jags to create array?

Is there a similar function in jags as the R function rep? I want to create an array using similar code as the following:
n ~ dmulti(pi, N) # pi is a 3 dimensional probability vector, N is fixed
# the dimension of n is hard coded in this line:
a <- c(rep(0, n[1]), rep(1, n[2]), rep(2, n[3]))
I read through the manual and wasn't able to find a way to achieve this. I understand that Stan would probably allow this but I couldn't use Stan because I need to do inference on discrete parameters. I really appreciate your help!
This question is also posted on the JAGS help forum.
I have added a rep function to the development version (the future JAGS 4.0.0), as Matt and John have alluded to. It requires the second argument to be fixed so that the length of the resulting vector can be determined at compile time.
The short answer is no, I'm afraid not. One of the stipulations of the JAGS/BUGS language is that variables must have fixed dimensions (with every element defined exactly once) - in your example a will change dimension size depending on the vector n. There may be other ways to get the result you are looking for, but not using this approach.
Incidentally, you use n twice in that bit of code (LHS and RHS of the multinomial distribution), which is not allowed - although that may just be a typo :)
Matt
You could populate your vector with some loops:
library(R2jags)
M <- function() {
  for (i in 1:n[1]) {
    a[i] <- 0
  }
  for (i in 1:n[2]) {
    a[i + n[1]] <- 1
  }
  for (i in 1:n[3]) {
    a[i + sum(n[1:2])] <- 2
  }
}
j <- jags(list(n=3:5), NULL, 'a', M, DIC=FALSE)
j$BUGSoutput$mean$a
## [1] 0 0 0 1 1 1 1 2 2 2 2 2
However, as @MattDenwood alluded to, if the sum of the elements of n is variable this will throw an error - a must be of constant length throughout the simulation.
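If n really is fixed and known before the model is compiled, another option (a small sketch using the n = 3:5 from the example above) is to build a in R and pass it to JAGS as data:
n <- 3:5
a <- rep(0:2, times = n)  # equivalent to c(rep(0, n[1]), rep(1, n[2]), rep(2, n[3]))
a
## [1] 0 0 0 1 1 1 1 2 2 2 2 2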

Dynamic variables in base R

How can I create dependent variables in R?
For example
a <- 1
b <- a*2
a <- 2
b
# [1] 2
But I expected the result to be 4. How can R maintain such a relationship automatically?
Thank you very much
Explanation: I'm trying to create something like an Excel spreadsheet, with relationships (formulas or functions) between cells. The input to R would be, for example, a csv (some values, some functions or formulas) and the output only the values.
It sounds like you're looking for makeActiveBinding
a <- 1
makeActiveBinding('b', function() a * 2, .GlobalEnv)
b
# [1] 2
a <- 2
b
# [1] 4
The syntax is simpler if you want to use Hadley's nifty pryr package:
library(pryr)
b %<a-% (a * 2)
Most people don't expect variables to behave like this, however. So if you're writing code that others will be reading, I don't recommend using this feature of R. Explicitly update b when a changes or make b a function of a.
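For reference, a sketch of those two plainer alternatives:
a <- 2
b <- a * 2                 # explicit update: rerun this line whenever a changes
b_of <- function(a) a * 2  # or make b a function of a
b_of(2)
# [1] 4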
Warning: This isn't a good idea and task callbacks really should only be used if you know what you're doing.
You can do something like this but it's tedious and there are better ways to achieve your goal. You can make a function that will be called after every top level evaluation that basically does the reassignment for you.
modified <- function(expr, value, ok, visible) {
  if (exists("a")) {
    assign("b", a * 2, envir = .GlobalEnv)
  }
  return(TRUE)
}
addTaskCallback(modified)
After running that you should be able to get this...
> a
Error: object 'a' not found
> b
Error: object 'b' not found
> a <- 2
> a
[1] 2
> b
[1] 4
> a <- 3
> a
[1] 3
> b
[1] 6
Note that if you want to emulate a spreadsheet it would probably just be better to define a function to take your input and do all the necessary calculations to get your desired output. R isn't Excel and it would be best if you don't treat it like Excel.
R doesn't work like that. Variables only change when assigned new values. This is a good thing, because it means things don't change magically. Suppose 20 lines later you want to know the value of b: when did it change? What does it depend on?
R is not a spreadsheet.
Just to spell it out a bit more.
sales = 100
costs = 90
profit = sales - costs
Now profit has the value 10.
sales = 120
Only sales has changed.
profit = sales - costs
That changes profit to 30.
If you have a complex calculation you would normally write a function:
computeProfit = function(sales, costs){return(sales - costs)}
and then do:
profit = computeProfit(sales, costs)
whenever you want to compute the profits from the sales and the costs.
Although what you want to do is not completely possible in R, with a simple modification of b into a function and thanks to lexical scoping, you actually can have a "dependent variable" (sort of).
Define a:
a <- 1
Define b like this:
b <- function() {
  a * 2
}
Then, instead of using b to get the value of b, use b()
b() ##gives 2
a <- 4
b() ##gives 8

R's signal package's filter not matching with Matlab's filter function

In Matlab, there is a 1-D filter function http://www.mathworks.com/help/matlab/ref/filter.html .
In R's signal package, the description of its filter function states: Generic filtering function. The default is to filter with an ARMA filter of given coefficients. The default filtering operation follows Matlab/Octave conventions.
However, the answers don't match if I give the same specification.
In MATLAB (correct answer):
x=[4 3 5 2 7 3]
filter(2/3,[1 -1/3],x,x(1)*1/3)
ans =
4.0000 3.3333 4.4444 2.8148 5.6049 3.8683
In R, if I follow Matlab/Octave's convention (incorrect answer):
library(signal)
x<-c(4,3,5,2,7,3)
filter(2/3,c(1,-1/3),x,x[1]*1/3)
Time Series:
Start = 1
End = 6
Frequency = 1
[1] 3.111111 3.037037 4.345679 2.781893 5.593964 3.864655
I tried a lot of other examples too. R's signal package's filter function doesn't appear to follow the Matlab/Octave conventions even though the documentation says it does. Perhaps I'm using the filter function incorrectly in R. Can someone help me?
I believe the answer is in the documentation (shock!!!!)
matlab:
The filter is a "Direct Form II Transposed"
implementation of the standard difference equation:
a(1)*y(n) = b(1)*x(n) + b(2)*x(n-1) + ... + b(nb+1)*x(n-nb)
                      - a(2)*y(n-1) - ... - a(na+1)*y(n-na)
If a(1) is not equal to 1, filter normalizes the filter coefficients by a(1).
[emphasis mine]
R:
a[1]*y[n] + a[2]*y[n-1] + … + a[n]*y[1] = b[1]*x[n] + b[2]*x[n-1] + … + b[m]*x[1]
Thanks for raising this issue a couple of years back... I bumped into it as well and think I have an answer. Essentially, R and Matlab appear to handle the filter's initial conditions differently.
If no initial value is provided (that is, the initial conditions are left at their default, which is zero in both R and Matlab), the results are very similar.
R
library(signal)
x<-c(4,3,5,2,7,3)
filter(2/3,cbind(1,-1/3),x, 0.00)
2.666667 2.888889 4.296296 2.765432 5.588477 3.862826
Matlab
x=[4 3 5 2 7 3]
filter(2/3,[1 -1/3],x,0.00)
2.6667 2.8889 4.2963 2.7654 5.5885 3.8628
Now, if we start tweaking the initial conditions, the results diverge.
R
library(signal)
x<-c(4,3,5,2,7,3)
filter(2/3,cbind(1,-1/3),x, 0.05)
2.683333 2.894444 4.298148 2.766049 5.588683 3.862894
Matlab
x=[4 3 5 2 7 3]
filter(2/3,[1 -1/3],x,0.05)
2.7167 2.9056 4.3019 2.7673 5.5891 3.8630
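To see where the divergence comes from, below is a minimal hand-rolled version of Matlab's order-1 "Direct Form II Transposed" recursion (the name matlab_filter1 is mine, and the sketch is specialised to one b coefficient and two a coefficients). Treating Matlab's fourth argument as the initial state zi of the single delay element reproduces Matlab's output from the question exactly:
matlab_filter1 <- function(b, a, x, zi = 0) {
  y <- numeric(length(x))
  z <- zi                    # initial state of the delay element
  for (n in seq_along(x)) {
    y[n] <- b[1] * x[n] + z  # y(n) = b(1)*x(n) + state
    z <- -a[2] * y[n]        # next state (b has no second coefficient here)
  }
  y
}
x <- c(4, 3, 5, 2, 7, 3)
matlab_filter1(2/3, c(1, -1/3), x, zi = x[1] * 1/3)
## [1] 4.000000 3.333333 4.444444 2.814815 5.604938 3.868313
This matches the 4.0000 3.3333 4.4444 2.8148 5.6049 3.8683 from the question, which suggests the two implementations differ in how they interpret the initial state, not in the filtering itself.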
Hope it helps!

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14))
sizes <- lengths(data)  # bin sizes: 4 4 1 3 3
for (rand in 1:no.of.randomizations) {
  rand.data <- partition.sample(seq(1, 15), partitions = sizes, replace = FALSE)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the right bin sizes given already by "sizes".
I've been trying to write one such function myself, since the task seems to be not so hard. However, the partitioning of a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could please point me to a smart solution that is around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would in addition be interested in a trick to do the above feat for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample(unlist(data))
list(one = x[1:4], two = x[5:8], three = x[9], four = x[10:12], five = x[13:15])
As mentioned above, "no.of.randomizations" may have meant the number of repeated applications of this process, in which case you may want to wrap replicate() around it:
replic <- replicate(n = 4, {
  x <- sample(unlist(data))
  list(x[1:4], x[5:8], x[9], x[10:12], x[13:15])
})
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector holding a unique permutation of "data" and then split it into a list of vectors of lengths "sizes", via a factor argument supplied to split(). For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14))
sizes <- lengths(data)  # a plain integer vector: rep()'s times argument needs a vector, not a list
So far, everything as above
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then by provided as argument to split()
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for alright. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy to have to go via an additional (possibly) long vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors, it could become quite slow, I guess.
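For completeness, the split() trick can be wrapped into the partition.sample-style helper the question asked for (the name comes from the question's pseudocode, not from any package):
partition.sample <- function(x, partitions) {
  cut.by <- rep(seq_along(partitions), times = partitions)  # group IDs expanded by bin size
  split(sample(x), cut.by)
}
rand.data <- partition.sample(1:15, sizes)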
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)

Testing if rows of a matrix or data frame are sorted in R

What is an efficient way to test if rows in a matrix are sorted? [Update: see Aaron's Rcpp answer - straightforward & very fast.]
I am porting some code that uses issorted(,'rows') from Matlab. As it seems that is.unsorted does not extend beyond vectors, I'm writing or looking for something else. The naive method is to check that the sorted version of the matrix (or data frame) is the same as the original, but that's obviously inefficient.
NB: For sorting, a la sortrows() in Matlab, my code essentially uses SortedDF <- DF[do.call(order, DF),] (it's wrapped in a larger function that converts matrices to data frames, passes parameters to order, etc.). I wouldn't be surprised if there are faster implementations (data table comes to mind).
Update 1: To clarify: I'm not testing for sorting intra-row or intra-columns. (Such sorting generally results in an algebraically different matrix.)
As an example for creating an unsorted matrix:
set.seed(0)
x <- as.data.frame(matrix(sample(3, 60, replace = TRUE), ncol = 6, byrow = TRUE))
Its sorted version is:
y <- x[do.call(order, x),]
A proper test, say testSorted, would return FALSE for testSorted(x) and TRUE for testSorted(y).
Update 2:
The answers below are all good - they are concise and do the test. Regarding efficiency, it looks like these are sorting the data after all.
I've tried these with rather large matrices, such as 1M x 10, (just changing the creation of x above) and all have about the same time and memory cost. What's peculiar is that they all consume more time for unsorted objects (about 5.5 seconds for 1Mx10) than for sorted ones (about 0.5 seconds for y). This suggests they're sorting before testing.
I tested by creating a z matrix:
z <- y
z[,2] <- y[,1]
z[,1] <- y[,2]
In this case, all of the methods take about 0.85 seconds to complete. Anyway, finishing in 5.5 seconds isn't terrible (in fact, that seems to be right about the time necessary to sort the object), but knowing that a sorted matrix is 11X faster suggests that a test that doesn't sort could be even faster. In the case of the 1M row matrix, the first three rows of x are:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 3 1 2 2 3 1 3 3 2 2
2 1 1 1 3 2 3 2 3 3 2
3 3 3 1 2 1 1 2 1 2 3
There's no need to look beyond row 2, though vectorization isn't a bad idea.
(I've also added the byrow argument for the creation of x, so that row values don't depend on the size of x.)
Update 3:
Another comparison for this testing can be found with the sort -c command in Linux. If the file is already written (using write.table()), with 1M rows, then time sort -c myfile.txt takes 0.003 seconds for the unsorted data and 0.101 seconds for the sorted data. I don't intend to write out to a file, but it's a useful comparison.
Update 4:
Aaron's Rcpp method bested all other methods offered here and that I've tried (including the sort -c comparison above: in-memory is expected to beat on-disk). As for the ratio relative to other methods, it's hard to tell: the denominator is too small to give an accurate measurement, and I've not extensively explored microbenchmark. The speedups can be very large (4-5 orders of magnitude) for some matrices (e.g. one made with rnorm), but this is misleading - checking can terminate after only a couple of rows. I've had speedups with the example matrices of about 25-60 for the unsorted and about 1.1X for the sorted, as the competing methods were already very fast if the data is sorted.
Since this does the right thing (i.e. no sorting, just testing), and does it very quickly, it's the accepted answer.
If y is sorted then do.call(order,y) returns 1:nrow(y).
testSorted <- function(y) { all(do.call(order, y) == 1:nrow(y)) }
Note this doesn't build and compare a sorted copy of the matrix, but it also doesn't bail out as soon as it finds a mismatch.
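For the x and y defined in the question this behaves as required:
testSorted(x)
# [1] FALSE
testSorted(y)
# [1] TRUE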
Well, why don't you use:
all(do.call(order, y)==seq(nrow(y)))
That avoids creating the ordered matrix, and ensures it checks your style of ordering.
Newer: I decided I could use the Rcpp practice...
library(Rcpp)
library(inline)
isRowSorted <- cxxfunction(signature(A="numeric"), body='
  Rcpp::NumericMatrix Am(A);
  for (int i = 1; i < Am.nrow(); i++) {
    for (int j = 0; j < Am.ncol(); j++) {
      if (Am(i-1,j) < Am(i,j)) { break; }               // row pair strictly in order: next row
      if (Am(i-1,j) > Am(i,j)) { return(wrap(false)); } // out of order: not sorted
      // equal: tie in this column, compare the next one
    }
  }
  return(wrap(true));
', plugin="Rcpp")
rownames(y) <- NULL # because as.matrix is faster without rownames
isRowSorted(as.matrix(y))
New: This R-only hack is the same speed for all matrices; it's definitely faster for sorted matrices; for unsorted ones it depends on the nature of the unsortedness.
iss3 <- function(x) {
  x2 <- sign(do.call(cbind, lapply(x, diff)))
  x3 <- t(x2) * (2^((ncol(x) - 1):0))
  all(colSums(x3) >= 0)
}
Original: This is faster for some unsorted matrices. How much faster will depends on where the unsorted elements are; this looks at the matrix column by column so unsortedness on the left side will be noticed much faster than unsorted on the right, while top/bottomness doesn't matter nearly as much.
iss2 <- function(y) {
  b <- c(0, nrow(y))
  for (i in 1:ncol(y)) {
    z <- rle(y[, i])
    b2 <- cumsum(z$lengths)
    sp <- split(z$values, cut(b2, breaks = b))
    for (spi in sp) {
      if (is.unsorted(spi)) return(FALSE)
    }
    b <- c(0, b2)
  }
  return(TRUE)
}
Well, the brute-force approach is to loop and compare, aborting as soon as a violation is found.
That approach can be implemented and tested easily in R, and then be carried over to a simple C++ function we can connect to R via inline and Rcpp (or plain C if you must) as looping is something that really benefits from an implementation in a compiled language.
Otherwise, can you not use something like diff() and check if all increments are non-negative?
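A sketch of that diff() idea (a quick pre-check of mine, not an exact test): diff() on a matrix takes differences down each column, and if every column is non-decreasing the rows are necessarily in lexicographic order.
quickRowCheck <- function(m) all(diff(as.matrix(m)) >= 0)
# TRUE guarantees the rows are sorted; FALSE is inconclusive, since columns
# after the first need not be monotone in a row-sorted matrix.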
You can use your do.call statement with is.unsorted:
issorted.matrix <- function(A) { !is.unsorted(do.call("order", data.frame(A))) }
