Swapping byrow for a vector without converting to a matrix - R

I have a matrix that I'm storing as a vector for speed and memory considerations. I want to essentially swap from byrow=FALSE to byrow=TRUE without actually converting it to a matrix (again for speed and memory reasons; the data could potentially be very large).
It's trivial to do by going through a call to matrix, e.g. with a 2x3 matrix:
> a <- 1:6
> a
[1] 1 2 3 4 5 6
> as.vector(matrix(a, nrow=2, ncol=3, byrow=TRUE))
[1] 1 4 2 5 3 6
I think I could come up with a manual solution involving pulling out every i-th entry and reordering, but was hoping there might be a more straightforward solution.
Any ideas?
Thanks.
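For what it's worth, here is a minimal sketch of the index-permutation idea alluded to above; byrow_swap, nr and nc are illustrative names, not established functions. It still allocates an index vector as long as the data, so the memory saving is modest, but it avoids building the intermediate matrix:
byrow_swap <- function(a, nr, nc) {
  k <- seq_along(a) - 1L
  # element k (0-based) of the byrow reading sits at 0-based
  # column-major position (k %% nr) * nc + k %/% nr in the original
  a[(k %% nr) * nc + k %/% nr + 1L]
}
byrow_swap(1:6, 2, 3)
# [1] 1 4 2 5 3 6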

Related

R: Compare vectors of differing lengths

I'm actually having trouble phrasing my question, so if anyone has feedback on that, I'd love to hear it.
I'm working in R and have a vector and a data frame, of different lengths:
xp.data <- c(400,500,600,700)
XPTable <- data.frame("Level"=1:10,"XP"=c(10,50,100,200,400,600,700,800,900,1000))
What I'm hoping to obtain is a new vector:
> lv.data
[1] 5 5 6 7
The goal is to do so without using a loop, as the xp.data vector can be any length, and the XPTable data frame can also be of varying lengths.
If I were doing this with a single value rather than a vector for xp.data, I'd just use:
max(XPTable$Level[XPTable$XP <= xp.data])
However, this only works if xp.data has a length of 1.
findInterval does this in a single vectorized call: for each value of xp.data it returns the index of the last entry of the (ascending) XP column that the value is greater than or equal to, which here coincides with the Level:
lv.data <- findInterval(xp.data, XPTable$XP)
print(lv.data)
# [1] 5 5 6 7
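Note that findInterval assumes XPTable$XP is sorted ascending, and that Level here happens to equal the row index. If Level were not simply 1:nrow(XPTable), a small variation on the above indexes into it explicitly:
# same idea, but robust to Level not being 1, 2, ..., n
# (values below the first XP threshold return index 0 and would be dropped)
lv.data <- XPTable$Level[findInterval(xp.data, XPTable$XP)]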

R Matching function between two matrices

Sorry if this has been posted before. I looked for the answer both on Google and Stackoverflow and couldn't find a solution.
Right now I have two matrices of data in R. I am trying to loop through each row in one matrix and find the row in the other matrix that is most similar by some distance metric (for now, least squares). I figured out one method, but it is O(n^2), which is prohibitive for my data.
I think this might be similar to some dictionary learning techniques but I couldn't find anything.
Thanks!
Both matrices are just 30 by n matrices with a number at each entry.
distance.fun <- function(mat1, mat2) {
  match <- c()
  for (i in 1:nrow(mat1)) {
    if (!all(is.na(mat1[i, ]))) {  # skip all-NA rows
      dist <- c()
      for (j in 1:nrow(mat2)) {
        dist[j] <- sum((mat1[i, ] - mat2[j, ])^2)
      }
      match[i] <- which.min(dist)  # row of mat2 with the smallest distance
    }
  }
  return(match)
}
A better strategy is to compute the distance matrix all at once first, then extract the minima. Here's an example using simulated data:
set.seed(15)
mat1<-matrix(runif(2*25), ncol=2)
mat2<-matrix(runif(2*25), ncol=2)
Here's a helper function that calculates the distances between the rows of one matrix and the rows of another. It uses the built-in dist function, which does some unnecessary within-group comparisons that we then have to filter out, but it may still perform better overall:
distab <- function(m1, m2) {
  stopifnot(ncol(m1) == ncol(m2))
  # full distance matrix on the stacked rows, keeping only the
  # m1-rows x m2-rows block
  m <- as.matrix(dist(rbind(m1, m2)))[1:nrow(m1), -(1:nrow(m1))]
  rownames(m) <- rownames(m1)
  colnames(m) <- rownames(m2)
  m
}
mydist<-distab(mat1, mat2)
Now that we have the between-group distances, we just need to find, for each row of mat1, the column (i.e. row of mat2) that minimizes the distance:
best <- apply(mydist, 1, which.min)
rr <- cbind(m1.row=seq.int(nrow(mat1)), best.m2.row = best)
head(rr) #just print a few
# m1.row best.m2.row
# [1,] 1 1
# [2,] 2 14
# [3,] 3 7
# [4,] 4 3
# [5,] 5 23
# [6,] 6 15
Note that with a strategy like this (as well as with your original implementation) it is possible for multiple rows of mat1 to match the same row in mat2, and for some rows of mat2 to be left unmatched.
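If even the single full distance matrix is too expensive, a kd-tree nearest-neighbour search avoids the O(n^2) cost for Euclidean distance. A minimal sketch using the FNN package (my suggestion, not part of the original answer):
library(FNN)  # install.packages("FNN") if needed
# for each row of mat1, the index of the single nearest row of mat2
nn <- get.knnx(data = mat2, query = mat1, k = 1)
best <- nn$nn.index[, 1]
Nearest neighbours under Euclidean distance are the same as under squared Euclidean distance, so this matches the least-squares criterion in the question.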

Data Structures for Creating BIG Data in R

I'm writing a gene-level analysis script in R, and I'll have to handle large amounts of data.
My initial idea was to create a super list structure, a set of lists within lists. Essentially the structure is
#12.8 mins
list[[1:8]][[1:1000]][[1:6]][[1:1000]]
This is huge and takes in excess of 12 minutes purely to set up the data structure. Streamlining this process, I can get it down to about 1.6 minutes when setting up one value of the 1:8 list, so essentially...
#1.6 mins
list[[1:1]][[1:1000]][[1:6]][[1:1000]]
Normally I'd create the structure on the fly, as and when it's needed; however, I'm distributing the 1:1000 steps, which means I don't know what order they'll come back in.
Are there any other packages for handling the creation of this level of data?
Could I use any more efficient data structures in my approach?
I apologise if this seems like the wrong approach entirely, but this is my first time handling big data in R.
Note that lists are vectors, and like any other vector, they can have a dim attribute.
l <- vector("list", 8 * 1000 * 6 * 1000)
dim(l) <- c(8, 1000, 6, 1000)
This is effectively instantaneous. You access individual elements with [[, e.g. l[[1, 2, 3, 4]].
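Continuing with l from above, this also addresses the out-of-order results from the distributed steps, since assignment works the same way (the indices and the rnorm payload here are made up):
# slot a worker's result straight into its position, in whatever
# order it arrives
l[[1, 37, 2, 500]] <- rnorm(10)
length(l[[1, 37, 2, 500]])
# [1] 10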
A different strategy is to create a vector and a partitioning, e.g., to represent
list(1:4, 5:7)
as
l = list(data=1:7, partition=c(4, 7))
then one can do vectorized calculations, e.g.,
logl = list(data=log(l$data), partition = l$partition)
and other clever things. This avoids creating complicated lists and the iteration that implies. This approach is formalized in the *List classes of the Bioconductor IRanges package.
> library(IRanges)
> l <- NumericList(1:4, 5:7)
> l
NumericList of length 2
[[1]] 1 2 3 4
[[2]] 5 6 7
> log(l)
NumericList of length 2
[[1]] 0 0.693147180559945 1.09861228866811 1.38629436111989
[[2]] 1.6094379124341 1.79175946922805 1.94591014905531
One idiom for working with this data is to unlist, transform, then relist; both unlist and relist are inexpensive, so the long-hand version of the above is relist(log(unlist(l)), l).
Depending on your data structure, the DataFrame class may be appropriate, e.g., the following can be manipulated like a data.frame (subset, etc.) but contains *List elements.
> DataFrame(Sample=c("A", "B"), VariableA=l, LogA=log(l))
DataFrame with 2 rows and 3 columns
Sample VariableA LogA
<character> <NumericList> <NumericList>
1 A 1,2,3,... 0,0.693147180559945,1.09861228866811,...
2 B 5,6,7 1.6094379124341,1.79175946922805,1.94591014905531
For genomic data where the coordinates of genes (or other features) on chromosomes is of fundamental importance, the GenomicRanges package and GRanges / GRangesList classes are appropriate.
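For illustration, a minimal GRanges construction (the chromosome, coordinates, and gene names are made up):
library(GenomicRanges)
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(100, 500), width = 50),
              gene = c("geneA", "geneB"))  # hypothetical features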

Subsetting R array: dimension lost when its length is 1

When subsetting arrays, R behaves differently depending on whether one of the dimensions is of length 1 or not. If a dimension has length 1, that dimension is lost during subsetting:
ax <- array(1:24, c(2,3,4))
ay <- array(1:12, c(1,3,4))
dim(ax)
#[1] 2 3 4
dim(ay)
#[1] 1 3 4
dim(ax[,1:2,])
#[1] 2 2 4
dim(ay[,1:2,])
#[1] 2 4
From my point of view, ax and ay are the same, and performing the same subset operation on them should return an array with the same dimensions. I can see that the way that R is handling the two cases might be useful, but it's undesirable in the code that I'm writing. It means that when I pass a subsetted array to another function, the function will get an array that's missing a dimension, if I happened to reduce a dimension to length 1 at an earlier stage. (So in this case R's flexibility is making my code less flexible!)
How can I prevent R from losing a dimension of length 1 during subsetting? Is there another way of indexing? Some flag to set?
As you've found out, by default R drops dimensions of length 1. Adding drop = FALSE while indexing prevents this:
> dim(ay[,1:2,])
[1] 2 4
> dim(ax[,1:2,])
[1] 2 2 4
> dim(ay[,1:2,,drop = FALSE])
[1] 1 2 4
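This matters most when the slice is handed to other code. A small illustration, where f is a hypothetical downstream function that requires a 3-dimensional array:
f <- function(a) stopifnot(length(dim(a)) == 3)
f(ay[, 1:2, , drop = FALSE])  # fine: dim is 1 2 4
# f(ay[, 1:2, ])              # would fail: dim is 2 4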

R: a for statement wanted that allows for the use of values from each row

I'm pretty new to R.
I'm reading in a file that looks like this:
1 2 1
1 4 2
1 6 4
and storing it in a matrix:
matrix <- read.delim("filename",...)
Does anyone know how to make a for statement that adds up the first and last numbers of one row per iteration?
So the output would be:
2
3
5
Many thanks!
Edit: My bad, I should have made this more clear...
I'm actually more interested in an actual for loop where I can use multiple values from any column of that specific row in each iteration. The adding up of numbers was just an example; I'm actually planning on doing much more with those values (for more than 2 columns), and there are many rows.
So something along the lines of:
for (i in matrix_i) {  # where i means each row
  # do something with column j and column x from row i, for example add them up
}
If you want to get a vector out of this, it is simpler (and marginally computationally faster) to use apply rather than a for statement. In this case,
sums = apply(m, 1, function(x) x[1] + x[3])
Also, you shouldn't call your variable matrix, since that is the name of a built-in function.
ETA: There is an even easier and computationally faster way. R lets you pull out columns and add them together (since they are vectors, they will get added elementwise):
sums = m[, 1] + m[, 3]
m[, 1] means the first column of the data.
Something along these lines should work rather efficiently (i.e. this is a vectorised approach):
m <- matrix(c(1,1,1,2,4,6,1,2,4), 3, 3)
# [,1] [,2] [,3]
# [1,] 1 2 1
# [2,] 1 4 2
# [3,] 1 6 4
v <- m[,1] + m[,3]
# [1] 2 3 5
You probably can use an apply function or a vectorized approach, and if you can you really should; but you asked how to do it in a for loop, so here's how to do that. (Let's call your matrix m.)
results <- numeric(nrow(m))
for (row in 1:nrow(m)) {
  results[row] <- m[row, 1] + m[row, 3]
}
This is probably one of those 100 ways to skin a cat questions. You are perhaps looking for the rowSums function, although you might also find many answers using the apply function.
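For the concrete example, rowSums restricted to the two relevant columns gives the same result as the vectorized answers above:
rowSums(m[, c(1, 3)])  # sum columns 1 and 3 of each row
# [1] 2 3 5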