create multi-dimensional array with named dimensions and elements compactly - r

A while ago I wanted a simpler way of creating multidimensional arrays with named dimensions.
I ended up writing a function that has worked really well for me, but I worry it may be somewhat of a hack and that there may be a better way of doing it. So before passing this on to a colleague I'm seeking advice here.
What I want to do is to be able to create multi-dimensional arrays where the name of a dimension is specified by the name of a vector and the names of the elements are specified by the contents of that vector. e.g.
sex <- c("F","M")
name2 <- c("a","b","c")
This can be written out like so.
dimnames1 <- list( sex=sex, name2=name2 )
dim1 <- sapply(dimnames1, function(x) length(x))
a <- array(0,dim=dim1, dimnames=dimnames1)
a
name2
sex a b c
F 0 0 0
M 0 0 0
But I wanted to be able to keep this more compact, so I wrote this function that enables that.
array_named <- function( ...)
{
listArgs <- as.list(match.call()[-1])
#works only if args are specified by actual ranges not by a varname
#dimnames1 <- lapply(listArgs,eval)
#works, I'm not sure why n=3
dimnames1 <- lapply(listArgs,function(x){eval.parent(x, n=3)})
#setting dimensions of array from dimnames1
dim1 <- sapply(dimnames1, function(x) length(x))
#creating array and filling with fill value
a <- array(0, dim=dim1, dimnames=dimnames1)
return(a)
}
This allows passing vectors by name :
array_named( sex=sex, name2=name2 )
name2
sex a b c
F 0 0 0
M 0 0 0
and directly e.g.
array_named( a=c(1,2), b=c('x','y') )
b
a x y
1 0 0
2 0 0
Are there problems with this? Is there a more sensible way of doing it?
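One possible simplification, offered as a sketch rather than a definitive answer: `list(...)` already evaluates the arguments in the caller's environment and keeps their names, so `match.call()` and `eval.parent()` may not be needed at all. The name `array_named2` and the `fill` argument are hypothetical additions for illustration.

```r
# Sketch: build a named array directly from the named arguments.
# `array_named2` and `fill` are illustrative names, not from the original post.
array_named2 <- function(..., fill = 0) {
  dimnames1 <- list(...)               # evaluated in the caller, names preserved
  dim1 <- unname(lengths(dimnames1))   # one extent per dimension
  array(fill, dim = dim1, dimnames = dimnames1)
}

sex <- c("F", "M")
name2 <- c("a", "b", "c")
a <- array_named2(sex = sex, name2 = name2)
b <- array_named2(a = c(1, 2), b = c("x", "y"))
```

Both calling styles from the question (passing existing vectors, or passing values directly) go through the same code path here.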

Related

How to create matrix of all 2^n binary sequences of length n using recursion in R?

I know I can use expand.grid for this, but I am trying to learn actual programming. My goal is to take what I have below and use a recursion to get all 2^n binary sequences of length n.
I can do this for n = 1, but I don't understand how I would use the same function in a recursive way to get the answer for higher dimensions.
Here is for n = 1:
binseq <- function(n){
binmat <- matrix(nrow = 2^n, ncol = n)
r <- 0 #row counter
for (i in 0:1) {
r <- r + 1
binmat[r,] <- i
}
return(binmat)
}
I know I probably have to use cbind in the return statement. My intuition says the return statement should be something like cbind(binseq(n-1), binseq(n)). But, honestly, I'm completely lost at this point.
The desired output should produce something like what expand.grid gives:
n = 5
expand.grid(replicate(n, 0:1, simplify = FALSE))
It should just be a matrix as binmat is being filled recursively.
As requested in a comment (below), here is a limited implementation for binary sequences only:
eg.binary <- function(n, digits=0:1) {
if (n <= 0) return(matrix(0,0,0))
if (n == 1) return(matrix(digits, 2))
x <- eg.binary(n-1, digits) # pass digits along so non-default digits survive the recursion
rbind(cbind(digits[1], x), cbind(digits[2], x))
}
After taking care of an initial case that R cannot handle correctly, it treats the "base case" of n=1 and then recursively obtains all n-1-digit binary strings and prepends each digit to each of them. The digits are prepended so that the binary strings end up in their usual lexicographic order (the same as expand.grid).
Example:
eg.binary(3)
[,1] [,2] [,3]
[1,] 0 0 0
[2,] 0 0 1
[3,] 0 1 0
[4,] 0 1 1
[5,] 1 0 0
[6,] 1 0 1
[7,] 1 1 0
[8,] 1 1 1
A general explanation (with a more flexible solution) follows.
Distill the problem down to the basic operation of tacking the values of an array y onto the rows of a dataframe X, associating a whole copy of X with each value (via cbind) and appending the whole lot (via rbind):
cross <- function(X, y) {
do.call("rbind", lapply(y, function(z) cbind(X, z)))
}
For example,
cross(data.frame(A=1:2, b=letters[1:2]), c("X","Y"))
A b z
1 1 a X
2 2 b X
3 1 a Y
4 2 b Y
(Let's worry about the column names later.)
The recursive solution for a list of such arrays y assumes you have already carried out these operations for all but the last element of the list. It has to start somewhere, which evidently consists of converting an array into a one-column data frame. Thus:
eg_ <- function(y) {
n <- length(y)
if (n <= 1) {
as.data.frame(y)
} else {
cross(eg_(y[-n]), y[[n]])
}
}
Why the funny name? Because we might want to do some post-processing, such as giving the result nice names. Here's a fuller implementation:
eg <- function(y) {
# (Define `eg_` here to keep it local to `eg` if you like)
X <- eg_(y)
names.default <- paste0("Var", seq.int(length(y)))
if (is.null(names(y))) {
colnames(X) <- names.default
} else {
colnames(X) <- ifelse(names(y)=="", names.default, names(y))
}
X
}
For example:
eg(replicate(3, 0:1, simplify=FALSE))
Var1 Var2 Var3
1 0 0 0
2 1 0 0
3 0 1 0
4 1 1 0
5 0 0 1
6 1 0 1
7 0 1 1
8 1 1 1
eg(list(0:1, B=2:3))
Var1 B
1 0 2
2 1 2
3 0 3
4 1 3
Apparently this was the desired recursive code:
binseq <- function(n){
if(n == 1){
binmat <- matrix(c(0,1), nrow = 2, ncol = 1)
}else if(n > 1){
A <- binseq(n-1)
B <- cbind(rep(0, nrow(A)), A)
C <- cbind(rep(1, nrow(A)), A)
binmat <- rbind(B,C)
}
return(binmat)
}
Basically, for n = 1 we create a [0, 1] matrix. For every n thereafter we add a column of 0's to the matrix for n - 1 and, separately, a column of 1's. Then we rbind the two matrices to get the final product. So I get what the algorithm is doing, but I don't really understand what the recursion is doing. For example, I don't understand the step from n = 2 to n = 3 based on the algorithm.
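To make the step from n = 2 to n = 3 concrete, here is a small sketch that performs that one step by hand and compares it with the recursive call. It reuses binseq as defined above (written more compactly); the names `A`, `B`, `C` and `step3` are only for illustration.

```r
binseq <- function(n) {
  if (n == 1) return(matrix(c(0, 1), nrow = 2, ncol = 1))
  A <- binseq(n - 1)                   # all sequences of length n - 1
  rbind(cbind(0, A), cbind(1, A))      # prefix each of them with 0, then with 1
}

A <- binseq(2)        # rows: 00, 01, 10, 11
B <- cbind(0, A)      # rows: 000, 001, 010, 011
C <- cbind(1, A)      # rows: 100, 101, 110, 111
step3 <- rbind(B, C)  # stacking the two halves gives all 8 sequences

all(step3 == binseq(3))
```

In other words, binseq(3) never builds anything from scratch: it asks binseq(2) for the four shorter sequences and only adds the leading digit.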

Create a vector in a loop for every pair of samples

I do a pairwise calculation between my samples and I want every pairwise calculation to be stored in a separate vector. For 3 comparisons, I have:
sample_12 <- vector(mode="numeric", length = 10)
sample_13 <- vector(mode="numeric", length = 10)
sample_23 <- vector(mode="numeric", length = 10)
Is there a possibility to create these vectors with the corresponding names in a loop so it can work for any given number of samples?
I tried the following code but I can't access the vectors outside the for-loop, how could I solve this issue?
pop = 3
sample = vector(mode="numeric", length = 10)
for (i in 1:(pop - 1)) {
for (j in (i + 1):pop) {
name <- paste("sample",i,j, sep = "")
name <- vector(mode="numeric", length = 10)
}
}
You can use the "assign" function:
pop = 3
sample = vector(mode="numeric", length = 10)
pop_combos <- combn(pop, 2)
for (i in 1:ncol(pop_combos)) {
name <- paste("sample_",
pop_combos[,i][1],
pop_combos[,i][2],
sep="")
assign(name, sample)
}
Outside the loop you can now access the vectors:
> sample_12
[1] 0 0 0 0 0 0 0 0 0 0
Use a list:
pop = 3
combinations = apply(combn(pop, m = 2), 2, paste, collapse = "_")
sample = replicate(n = length(combinations), numeric(10), simplify = FALSE)
names(sample) = combinations
sample
# $`1_2`
# [1] 0 0 0 0 0 0 0 0 0 0
#
# $`1_3`
# [1] 0 0 0 0 0 0 0 0 0 0
#
# $`2_3`
# [1] 0 0 0 0 0 0 0 0 0 0
You can then access each element of the list, e.g., sample[["1_3"]]. This scales up very easily and doesn't require pasting together names and using assign and get, which is just asking for hard-to-find bugs. You can use lapply or for loops to iterate over each item in the list trivially. Depending on your use case, it might make more sense to use the default simplify = TRUE inside replicate and keep it as a matrix or data frame. The only reason to use a list would be if some of the vectors needed to be different lengths.
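As a rough illustration of iterating over such a list, the sketch below fills each vector from its pair of sample indices. The fill value (the sum of the two indices) is a placeholder assumption standing in for the real pairwise calculation, and the names `pairs`, `keys` and `samples` are hypothetical.

```r
pop <- 3
pairs <- combn(pop, 2)                           # one column per pair of samples
keys <- apply(pairs, 2, paste, collapse = "_")   # "1_2", "1_3", "2_3"
samples <- setNames(
  replicate(length(keys), numeric(10), simplify = FALSE),
  keys
)

# Placeholder computation: replace with the real pairwise calculation.
for (k in seq_along(samples)) {
  samples[[k]] <- rep(pairs[1, k] + pairs[2, k], 10)
}

sapply(samples, sum)   # one summary value per pair, named by pair
```

Because the names live in the list, the same loop works unchanged for any value of pop.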
Is something like this what you are looking for?
Suppose that you have saved all the vectors as rows/columns in a numeric data.frame df.
list.values <- list()
col <- ncol(df)
row <- nrow(df)
for (i in seq_len(row * col)) {
rr <- (i - 1) %/% col + 1 # row of the i-th element, counting row by row
cc <- (i - 1) %% col + 1 # column of the i-th element
list.values[[i]] <- df - df[rr, cc]
}
Now list.values[[(rr - 1) * col + cc]] holds the data frame of differences between all the elements and the element [rr, cc].
E.g. to see the differences between the whole data frame and the element [2, 3], use View(list.values[[(2 - 1) * col + 3]]).

R: Remove the number of occurrences of values in one vector from another vector, but not all

Apologies for the confusing title, but I don't know how to express my problem otherwise. In R, I have the following problem which I want to solve:
x <- seq(1,1, length.out=10)
y <- seq(0,0, length.out=10)
z <- c(x, y)
p <- c(1,0,1,1,0,0)
How can I remove the values in p from z, so that a new vector i has three fewer occurrences of 1 and three fewer occurrences of 0 than z? The order of the 1's and 0's in z should not matter (they might be in random order), and other numbers may be involved as well. The expected result is:
i
> 1 1 1 1 1 1 1 0 0 0 0 0 0 0
Thanks in advance!
Similar to @VincentGuillemot's answer, but in a functional programming style, using the purrr package:
library(purrr)
i <- z
map(p, function(x) { i <<- i[-min(which(i == x))]})
i
> i
[1] 1 1 1 1 1 1 1 0 0 0 0 0 0 0
There might be numerous better ways to do it:
i <- z
for (val in p) {
if (val %in% i) {
i <- i[ - which(i==val)[1] ]
}
}
Another solution that I like better because it does not require a test (thanks to @Franck's suggestion):
for (val in p)
i <- i[ - match(val, i, nomatch = integer(0) ) ]
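The loop-based approaches above can also be wrapped in a small helper. This is a sketch under the assumption that values of p which do not occur in z should simply be skipped; `remove_once` is a hypothetical name, not a function from any package.

```r
# Remove one occurrence from z for each element of p.
# Values of p that are not present in z are ignored (an assumption).
remove_once <- function(z, p) {
  for (val in p) {
    pos <- match(val, z)            # first position of val, NA if absent
    if (!is.na(pos)) z <- z[-pos]
  }
  z
}

z <- c(rep(1, 10), rep(0, 10))
p <- c(1, 0, 1, 1, 0, 0)
i <- remove_once(z, p)
table(i)
```

Checking `is.na(pos)` explicitly avoids relying on what a negative index does when nothing matches.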

R: Pass data.frame by reference to a function

I pass a data.frame as a parameter to a function that wants to alter the data inside:
x <- data.frame(value=c(1,2,3,4))
f <- function(d){
for(i in 1:nrow(d)) {
if(d$value[i] %% 2 == 0){
d$value[i] <-0
}
}
print(d)
}
When I execute f(x) I can see how the data.frame inside gets modified:
> f(x)
value
1 1
2 0
3 3
4 0
However, the original data.frame I passed is unmodified:
> x
value
1 1
2 2
3 3
4 4
Usually I have overcome this by returning the modified one:
f <- function(d){
for(i in 1:nrow(d)) {
if(d$value[i] %% 2 == 0){
d$value[i] <-0
}
}
d
}
And then call the method reassigning the content:
> x <- f(x)
> x
value
1 1
2 0
3 3
4 0
However, I wonder what the effect of this behaviour is on a very large data.frame: is a new one created for the function's execution? What is the R-ish way of doing this?
Is there a way to modify the original one without creating another one in memory?
Actually, in R (almost) every modification is performed on a copy of the previous data (copy-on-write behavior).
So, for example, when you do d$value[i] <- 0 inside your function, some copies are actually created. You usually won't notice this since it's well optimized, but you can trace it with the tracemem function.
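A minimal sketch of such tracing follows. It assumes an R build with memory profiling enabled (hence the `capabilities("profmem")` guard), and `zero_even_df` is a hypothetical vectorized stand-in for the function f in the question.

```r
x <- data.frame(value = c(1, 2, 3, 4))

# Hypothetical stand-in for f: zero out the even values and return the result.
zero_even_df <- function(d) {
  d$value[d$value %% 2 == 0] <- 0
  d
}

if (capabilities("profmem")) {
  tracemem(x)             # report whenever the object bound to x is duplicated
  y <- zero_even_df(x)    # the modification inside the function triggers a copy
  untracemem(x)
} else {
  y <- zero_even_df(x)    # tracing unavailable; the semantics are the same
}

x$value   # the caller's data frame is left untouched
y$value   # the modified copy returned by the function
```

The exact messages tracemem prints depend on the R version, so none are shown here.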
That being said, if your data.frame is not really big you can stick with your function returning the modified object, since it's just one more copy after all.
But if your dataset is really big and making a copy every time would be really expensive, you can use data.table, which allows in-place modifications, e.g. :
library(data.table)
d <- data.table(value=c(1,2,3,4))
f <- function(d){
for(i in 1:nrow(d)) {
if(d$value[i] %% 2 == 0){
set(d,i,1L,0) # special function of data.table (see also ?`:=` )
}
}
print(d)
}
f(d)
print(d)
# results :
> f(d)
value
1: 1
2: 0
3: 3
4: 0
>
> print(d)
value
1: 1
2: 0
3: 3
4: 0
N.B.
In this specific case, the loop can be replaced with a "vectorized" and more efficient version e.g. :
d[d$value %% 2 == 0,'value'] <- 0
but maybe your real loop code is much more convoluted and cannot be vectorized easily.
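For completeness, a vectorized update can also be done by reference with data.table's `:=` operator (mentioned in the comment above). This is a sketch that assumes the data.table package is installed; `zero_even` is a hypothetical function name.

```r
library(data.table)   # assumes the data.table package is available

# Hypothetical helper: set even values to 0 in place, without copying dt.
zero_even <- function(dt) {
  dt[value %% 2 == 0, value := 0]   # `:=` modifies the column by reference
  invisible(dt)
}

d <- data.table(value = c(1, 2, 3, 4))
zero_even(d)
d$value   # the caller's table has been changed, no reassignment needed
```

Note that this is exactly the behaviour the question asks for, and also a departure from R's usual value semantics, so it is worth making such side effects obvious in the function name or documentation.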

Is this R code of Rao score test for the Bernoulli data model correct?

I am a complete statistical noob and new to R, hence the question. I've tried to find an implementation of the Rao score test for the particular case when one's data is binary and each observation has a Bernoulli distribution. I stumbled upon anova in the R language but failed to understand how to use it. Therefore, I tried implementing the Rao score test for this particular case myself:
rao.score.bern <- function(data, p0) {
# assume `data` is a list of 0s and 1s
y <- sum(data)
n <- length(data)
phat <- y / n
z <- (phat - p0) / sqrt(p0 * (1 - p0) / n)
p.value <- 2 * (1 - pnorm(abs(z)))
}
I am pretty sure that there is a bug in my code because it produces only two distinct p-values in the following scenario:
p0 <- 1 / 4
p <- seq(from=0.01, to=0.5, by=0.01)
n <- seq(from=5, to=70, by=1)
g <- expand.grid(n, p)
data <- apply(g, 1, function(x) rbinom(x[1], 1, x[2]))
p.values <- sapply(data, function(x) rao.score.bern(x[[1]], p0))
Could someone please show me where the problem is? Could you perhaps point me to a built-in solution in R?
First test, then debug.
Test
Does rao.score.bern work at all?
rao.score.bern(c(0,0,0,1,1,1), 1/6)
This returns...nothing! The last line of the function is an assignment, and the value of an assignment is returned invisibly. Fix it by replacing the last line by
2 * (1 - pnorm(abs(z)))
This eliminates the unnecessary assignment.
rao.score.bern(c(0,0,0,1,1,1), 1/6)
[1] 0.02845974
OK, now we're getting somewhere.
Debug
Unfortunately, the code still doesn't work. Let's debug by yanking the call to rao.score.bern and replacing it by something that shows us the input. Don't apply it to the large input you created! Use a small piece of it:
sapply(data[1:5], function(x) x[[1]])
[1] 0 0 0 0 0
That's not what you expected, is it? It's returning just one zero for each element of data. What about this?
sapply(data[1:5], function(x) x)
[[1]]
[1] 0 0 0 0 0
[[2]]
[1] 0 0 0 0 0 0
...
[[5]]
[1] 0 0 0 0 0 0 0 0 0
Much better! The variable x in the call to sapply refers to the entire vector, which is what you want to pass to your routine. Whence
p.values <- sapply(data, function(x) rao.score.bern(x, p0)); hist(p.values)
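As for a built-in solution: to my understanding, prop.test with the continuity correction turned off computes the same score statistic (its chi-squared statistic is z squared), so it can serve as a cross-check. The sketch below compares the two on the small example used above; `rao_score_bern` is a renamed copy of the corrected function, not a new method.

```r
# Corrected version of the function from the question (renamed for this sketch).
rao_score_bern <- function(data, p0) {
  n <- length(data)
  z <- (mean(data) - p0) / sqrt(p0 * (1 - p0) / n)
  2 * (1 - pnorm(abs(z)))
}

x <- c(0, 0, 0, 1, 1, 1)
p0 <- 1 / 6

by_hand <- rao_score_bern(x, p0)

# prop.test warns about the chi-squared approximation for small samples,
# so the warning is suppressed here.
built_in <- suppressWarnings(
  prop.test(sum(x), length(x), p = p0, correct = FALSE)
)$p.value

all.equal(by_hand, built_in)
```

For samples this small an exact test such as binom.test may be more appropriate than either approximation.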

Resources