Why is an empty matrix 208 bytes? [duplicate]

This question already has an answer here: How to compute the size of the allocated memory for a general type
I was interested in the memory usage of matrices in R when I observed something strange. In a loop, I grew the number of columns of a matrix and, at each step, computed the object size like this:
x <- 10
size <- matrix(1:x, x, 2)
for (i in 1:x) {
  m <- matrix(1, 2, i)
  size[i, 2] <- object.size(m)
}
Plotting memory use against the number of columns gives:
plot(size[,1], size[,2], xlab="n columns", ylab="memory")
It seems that matrices with 2 rows and 5, 6, 7 or 8 columns use the exact same memory. How can we explain that?

To understand what's going on here, you need to know a little bit about the memory overhead associated with objects in R. Every object, even one that holds no data, has 40 bytes of overhead associated with it:
x0 <- numeric()
object.size(x0)
# 40 bytes
This memory is used to store the type of the object (as returned by typeof()), and other metadata needed for memory management.
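This fixed overhead is the same whatever the type; a quick check (a sketch, assuming the same R build that reports 40 bytes above):
sapply(list(numeric(), integer(), logical(), character()), object.size)
# [1] 40 40 40 40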
After ignoring this overhead, you might expect that the memory usage of a vector is proportional to the length of the vector. Let's check that out with a couple of plots:
sizes <- sapply(0:50, function(n) object.size(seq_len(n)))
plot(c(0, 50), c(0, max(sizes)), xlab = "Length", ylab = "Bytes",
     type = "n")
abline(h = 40, col = "grey80")
abline(h = 40 + 128, col = "grey80")
abline(a = 40, b = 4, col = "grey90", lwd = 4)
lines(sizes, type = "s")
It looks like memory usage is roughly proportional to the length of the vector, but there is a big discontinuity at 168 bytes and small discontinuities every few steps. The big discontinuity is because R has two storage pools for vectors: small vectors, managed by R, and big vectors, managed by the OS. (This is a performance optimisation: allocating lots of small chunks of memory is expensive.) Small vectors can only be 8, 16, 32, 48, 64 or 128 bytes long which, once we remove the 40-byte overhead, is exactly what we see:
sizes - 40
# [1] 0 8 8 16 16 32 32 32 32 48 48 48 48 64 64 64 64 128 128 128 128
# [22] 128 128 128 128 128 128 128 128 128 128 128 128 136 136 144 144 152 152 160 160 168
# [43] 168 176 176 184 184 192 192 200 200
The step from 64 to 128 causes the big step, then once we've crossed into the big vector pool, vectors are allocated in chunks of 8 bytes (memory comes in units of a certain size, and R can't ask for half a unit):
diff(sizes)
# [1] 8 0 8 0 16 0 0 0 16 0 0 0 16 0 0 0 64 0 0 0 0 0 0 0 0 0 0 0
# [29] 0 0 0 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0 8 0
So how does this behaviour correspond to what you see with matrices? Well, first we need to look at the overhead associated with a matrix:
xv <- numeric()
xm <- matrix(xv)
object.size(xm)
# 208 bytes
object.size(xm) - object.size(xv)
# 168 bytes
So a matrix needs an extra 168 bytes of storage compared to a vector. Why 168 bytes? It's because a matrix has a dim attribute containing two integers, and attributes are stored in a pairlist (an older version of list()):
object.size(pairlist(dims = c(1L, 1L)))
# 168 bytes
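As a minimal check that the dim attribute is the only attribute a bare matrix carries (xm1 is just a throwaway example object):
xm1 <- matrix(1)
attributes(xm1)
# $dim
# [1] 1 1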
If we re-draw the previous plot using matrices instead of vectors, and increase all the constants on the y-axis by 168, you can see that the discontinuity corresponds exactly to the jump from the small vector pool to the big vector pool:
msizes <- sapply(0:50, function(n) object.size(as.matrix(seq_len(n))))
plot(c(0, 50), c(168, max(msizes)), xlab = "Length", ylab = "Bytes",
     type = "n")
abline(h = 40 + 168, col = "grey80")
abline(h = 40 + 168 + 128, col = "grey80")
abline(a = 40 + 168, b = 4, col = "grey90", lwd = 4)
lines(msizes, type = "s")

This seems to only happen for a very specific range of columns at the small end. Looking at matrices with 1-100 columns, I see the following:
I do not see any other plateaus, even if I increase the number of columns to, say, 10000:
Intrigued, I've looked a bit further, putting your code in a function:
sizes <- function(nrow, ncol) {
  size <- matrix(1:ncol, ncol, 2)
  for (i in seq_len(ncol)) {
    m <- matrix(1, nrow, i)
    size[i, 2] <- object.size(m)
  }
  plot(size[, 1], size[, 2])
  size
}
Interestingly, we still see the plateau and straight line at low numbers if we increase the number of rows: the plateau shrinks and moves backwards, finally straightening out into a line by the time we hit nrow = 8. This indicates that the behaviour occurs for a very specific range of cell counts in a matrix: 9-16.
Memory Allocation
As @Hadley pointed out in his comment, there is a similar thread on the memory allocation of vectors, which arrives at the formula 40 + 8 * floor(n / 2) bytes for a numeric vector of length n.
For matrices the overhead is slightly different, and the stepping relationship doesn't hold (as seen in my plots). Instead, I have come up with the formula 208 + 8 * n bytes, where n is the number of cells in the matrix (nrow * ncol), except where n is between 9 and 16:
Matrix size - 208 bytes for "double" matrices, 1 row, 1-20 columns:
> sapply(1:20, function(x) { object.size(matrix(1, 1, x)) })-208
[1] 0 8 24 24 40 40 56 56 120 120 120 120 120 120 120 120 128 136 144
[20] 152
HOWEVER, if we change the type of the matrix to integer or logical, we do see the stepwise behaviour in memory allocation described in the thread above:
Matrix size - 208 bytes for "integer" matrices 1 row, 1-20 columns:
> sapply(1:20, function(x) { object.size(matrix(1L, 1, x)) })-208
[1] 0 0 8 8 24 24 24 24 40 40 40 40 56 56 56 56 120 120 120
[20] 120
Similarly for "logical" matrices:
> sapply(1:20, function(x) { object.size(matrix(TRUE, 1, x)) })-208
[1] 0 0 8 8 24 24 24 24 40 40 40 40 56 56 56 56 120 120 120
[20] 120
It is surprising that we do not see the same behaviour with a matrix of type double, as it is just a "numeric" vector with a dim attribute attached (see the R language definition).
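One way to see this equivalence directly is to compare a vector and a matrix holding the same data; the difference should be just the attribute overhead discussed above (a sketch):
v <- rep(1, 10)          # ten doubles as a plain vector
m <- matrix(1, 2, 5)     # the same ten doubles plus a dim attribute
object.size(m) - object.size(v)  # the dim-attribute overhead only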
The big step we see in memory allocation comes from R having two memory pools, one for small vectors and one for large vectors, and that happens to be where the jump is made. Hadley Wickham explains this in detail in his response.

Looking at numeric vectors with lengths from 1 to 20, I got this figure.
x <- 20
size <- matrix(1:x, x, 2)
for (i in seq_len(x)) {
  m <- rep(1, i)
  size[i, 2] <- object.size(m)
}
plot(size[, 1], size[, 2])
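The same numbers can be collected without the explicit loop, in the style of the sapply() calls above (a sketch):
sizes <- sapply(1:20, function(n) object.size(rep(1, n)))
plot(1:20, sizes, xlab = "length", ylab = "bytes")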

Related

filling a vector with a numberline from negative to positive in R

I want to create a range of values with a +/- interval given a mid point. The mid point and intervals are variables and can change.
For example, if my mid point is 0 and my interval is 5, I want my vector to be comprised of
[-5,-4,-3,-2,-1,0,1,2,3,4,5]
If my mid point is 140 and interval is 5, the vector would be
[135,136,137,138,139,140,141,142,143,144,145]
I initially thought this would be easy to do in a single for loop, but I am totally stumped on how to do it in an elegant fashion in R. The only way I can think of is to calculate the negative and positive values separately and then join the elements to form a vector.
You could use this function, which relies on seq():
mid_int <- function(mid_point, interval) {
  seq(mid_point - interval, mid_point + interval, by = 1)
}
mid_int(0, 5)
[1] -5 -4 -3 -2 -1 0 1 2 3 4 5
mid_int(140, 5)
[1] 135 136 137 138 139 140 141 142 143 144 145
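For unit steps, an equivalent without seq() is also possible (the name mid_int2 is just for illustration):
mid_int2 <- function(mid_point, interval) mid_point + (-interval:interval)
mid_int2(140, 5)
[1] 135 136 137 138 139 140 141 142 143 144 145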

Expanding a list of numbers into a matrix (list with n values to multiply to a n x n matrix)

I have a set of numbers, which I want to expand into a matrix.
There are 4 values in the list which I want to expand into a 4x4 matrix.
Here is some example data
freq <- c(627,449,813,111)
I want to expand this into a matrix so that it looks like the table below. Apologies, I have just copied and pasted the data, so it's not R output, but I hope it gets the idea across.
          1    2    3    4  Total
1       197  141  255   35    627
2       141  101  183   25    449
3       255  183  330   45    813
4        35   25   45    6    111
Total   627  449  813  111   2000
Each cell is (row total) x (column total) / (table total). The value at [1,1] = (627 x 627) / 2000 = 197, the value at [2,1] = (627 x 449) / 2000 = 141, and so on.
Is there a function that will create this matrix? I will try to do it via a loop, but I was hoping there is a function or matrix-calculation trick that can do this more efficiently. Apologies if I didn't articulate the above too well; any help is greatly appreciated. Thanks
freq <- c(627,449,813,111)
round(outer(freq, freq)/sum(freq))
#> [,1] [,2] [,3] [,4]
#> [1,] 197 141 255 35
#> [2,] 141 101 183 25
#> [3,] 255 183 330 45
#> [4,] 35 25 45 6
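If you also want the Total row and column from your example, addmargins() from the stats package can append them; note that the margins here are sums of the rounded cells, so they may drift slightly from the original freq values:
m <- round(outer(freq, freq) / sum(freq))
addmargins(m)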
It doesn't really matter here, but it is good practice to avoid constructions like outer(x, x) / sum(x) in favour of ones like tcrossprod(x / sqrt(sum(x))):
round(tcrossprod(freq / sqrt(sum(freq))))
## [,1] [,2] [,3] [,4]
## [1,] 197 141 255 35
## [2,] 141 101 183 25
## [3,] 255 183 330 45
## [4,] 35 25 45 6
There are a few issues with the outer approach:
outer(x, x) evaluates tcrossprod(as.vector(x), as.vector(x)) internally. The as.vector calls and everything else that happens inside of outer are completely redundant if x is already a vector. The as.vector calls are actually worse than redundant: if x has any attributes, then as.vector(x) requires a deep copy of x.
Naively doing A <- outer(x, x); A / sum(x) requires R to allocate memory for two n-by-n matrices. For large enough n, that can be quite wasteful, if not impossible. R is clever enough to avoid the second allocation if you compute outer(x, x) / sum(x) directly. However, such optimizations are low level, come with a number of gotchas, and are not even documented in ?Arithmetic, so it can be unsafe to rely on them.
outer(x, x) can result in underflow or overflow if the elements of x are very (very) small or large.
tcrossprod(x / sqrt(sum(x))) avoids all of these issues by scaling x before computing an outer product and cutting out all of the redundancies of outer.
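To gauge the difference on your own machine, a quick benchmark sketch (assuming the microbenchmark package is installed):
library(microbenchmark)
x <- runif(1e3)
microbenchmark(
  outer      = outer(x, x) / sum(x),
  tcrossprod = tcrossprod(x / sqrt(sum(x)))
)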

Alternative to for loop R

I have written a function that compares the similarity of IP addresses and lets the user select the level of detail in the octets. For example, with the addresses 255.255.255.0 and 255.255.255.1, a user could specify that they only want to compare the first octet, the first and second, the first through third, etc.
The function is below:
did.change.ip <- function(vec, detail) {
  counter <- 2
  result.vec <- FALSE
  r.list <- strsplit(vec, '.', fixed = TRUE)
  for (i in vec) {
    if (counter > length(vec)) {
      break
    }
    first <- as.numeric(r.list[[counter - 1]][1:detail])
    second <- as.numeric(r.list[[counter]][1:detail])
    if (sum(first == second) == detail) {
      result.vec <- append(result.vec, FALSE)
    } else {
      result.vec <- append(result.vec, TRUE)
    }
    counter <- counter + 1
  }
  return(result.vec)
}
and it's really slow once the data starts getting larger. For a dataset of 500,000 rows, the system.time() results are:
   user  system elapsed
 208.36    0.59  209.84
Are there any R power users who have insight on how to write this more efficiently? I know lapply() is the preferred method for looping over vectors/data frames, but I'm stumped as to how to access the previous element of a vector this way. I've tried to sketch something out quickly, but it returns a syntax error:
test=function(vec, detail){
rlist=strsplit(vec, '.', fixed=TRUE)
r.value=vapply(rlist, function(x,detail) ifelse(x[1:detail]==x[1:detail] TRUE, FALSE))
}
I've created some sample data for testing purposes below:
stack.data=structure(list(V1 = c("247.116.209.66", "195.121.47.105", "182.136.49.12",
"237.123.100.50", "120.30.174.18", "29.85.72.70", "18.186.76.177",
"33.248.142.26", "109.97.92.50", "217.138.155.145", "20.203.156.2",
"71.1.51.190", "31.225.208.60", "55.25.129.73", "211.204.249.244",
"198.137.15.53", "234.106.102.196", "244.3.87.9", "205.242.10.22",
"243.61.212.19", "32.165.79.86", "190.207.159.147", "157.153.136.100",
"36.151.152.15", "2.254.210.246", "3.42.1.208", "30.11.229.18",
"72.187.36.103", "98.114.189.34", "67.93.180.224")), .Names = "V1", class = "data.frame", row.names = c(NA,
-30L))
Here's another solution just using base R.
did.change.ip <- function(vec, detail = 4) {
  ipv <- scan(text = paste(vec, collapse = "\n"),
              what = c(replicate(detail, integer()), replicate(4 - detail, NULL)),
              sep = ".", quiet = TRUE)
  c(FALSE, rowSums(vapply(ipv[!sapply(ipv, is.null)],
                          diff, integer(length(vec) - 1)) != 0) > 0)
}
Here we use scan() to break up the IP address into numbers, then we look down each octet for differences using diff(). This is faster than the original proposal, but slightly slower than @josilber's stringr solution (benchmarked with microbenchmark on 3,000 IP addresses):
Unit: milliseconds
   expr       min        lq    median        uq       max neval
   orig 35.251886 35.716921 36.019354 36.700550 90.159992   100
   scan  2.062189  2.116391  2.170110  2.236658  3.563771   100
 strngr  2.027232  2.075018  2.136114  2.200096  3.535227   100
The simplest way I can think of to do this is to build a transformed vector that only includes the parts of the IP you want. Then it's a one-liner to check if each element is equal to the one before it:
library(stringr)
did.change.josilber <- function(vec, detail) {
  s <- str_extract(vec, paste0("^(\\d+\\.){", detail, "}"))
  return(s != c(s[1], s[1:(length(s) - 1)]))
}
This seems reasonably efficient for 500,000 rows:
set.seed(144)
big.vec <- sample(stack.data[,1], 500000, replace=T)
system.time(did.change.josilber(big.vec, 3))
#    user  system elapsed
#   0.527   0.030   0.554
The biggest issue with your code is that you call append() on each iteration, which requires reallocating your vector 500,000 times. You can read more about this in the second circle of The R Inferno.
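To see that cost in isolation, here is a toy sketch (the names slow and fast are just for illustration): growing a vector with append() reallocates on every iteration, while preallocating and filling by index does not.
slow <- function(n) { x <- c();        for (i in 1:n) x <- append(x, i); x }  # reallocates each time
fast <- function(n) { x <- integer(n); for (i in 1:n) x[i] <- i;         x }  # allocates once
identical(slow(1e4), fast(1e4))  # TRUE, but fast() scales far better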
Not sure if all you want is counts, but this is potentially a solution:
library(dplyr)
library(tidyr)
# split ip addresses into "octets"
octets <- stack.data %>%
  separate(V1, c("first", "second", "third", "fourth"))
# how many shared both their first and second octets?
octets %>%
  group_by(first, second) %>%
  summarize(n = n())
first second n
1 109 97 1
2 120 30 1
3 157 153 1
4 18 186 1
5 182 136 1
6 190 207 1
7 195 121 1
8 198 137 1
9 2 254 1
10 20 203 1
11 205 242 1
12 211 204 1
13 217 138 1
14 234 106 1
15 237 123 1
16 243 61 1
17 244 3 1
18 247 116 1
19 29 85 1
20 3 42 1
21 30 11 1
22 31 225 1
23 32 165 1
24 33 248 1
25 36 151 1
26 55 25 1
27 67 93 1
28 71 1 1
29 72 187 1
30 98 114 1
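As an aside, dplyr's count() collapses the group_by()/summarize() pair above into a single call (a sketch, assuming a dplyr version that provides count()):
octets %>% count(first, second)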

R : how to Detect Pattern in Matrix By Row

I have a big matrix with 4 columns, containing values normalized by column (mean ~ 0 and standard deviation = 1).
I would like to see if there is a pattern in the matrix and, if so, to cluster the rows by pattern. By pattern I mean the ordering of the values within a given row. For example,
for row N: if the value in column 1 < column 2 < column 3 < column 4, then it is, let's say, pattern 1.
Basically there are 4^4 = 256 possible patterns (in theory).
Is there a way in R to do this ?
Thanks in advance
Rad
Yes. (Although the number of distinct permutations is only 24 = 4*3*2: after the first value is chosen, there are only three possible second values, and after the second is specified, only two orderings are left.) The order() function applied to each row should give the desired permutations of 1, 2, 3, 4:
mtx <- matrix(rnorm(10000), ncol = 4)
res <- apply(mtx, 1, function(x) paste(order(x), collapse = "."))
> table(res)
res
1.2.3.4 1.2.4.3 1.3.2.4 1.3.4.2 1.4.2.3 1.4.3.2
98 112 95 120 114 118
2.1.3.4 2.1.4.3 2.3.1.4 2.3.4.1 2.4.1.3 2.4.3.1
101 114 105 102 104 122
3.1.2.4 3.1.4.2 3.2.1.4 3.2.4.1 3.4.1.2 3.4.2.1
105 82 107 90 97 86
4.1.2.3 4.1.3.2 4.2.1.3 4.2.3.1 4.3.1.2 4.3.2.1
99 93 100 108 118 110
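To actually cluster the rows by pattern, as the question asks, the pattern strings in res can be used to split the row indices (a short sketch building on the code above):
groups <- split(seq_len(nrow(mtx)), res)  # row indices grouped by pattern
head(groups[["1.2.3.4"]])                 # rows whose values increase left to right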

Modified Bootstrapping

I'm interested in developing a modified bootstrap that samples a vector of length x, with replacement, but must meet a number of criteria before stopping the sampling. I'm attempting to calculate confidence intervals for lambda, a population growth rate, over 10000 iterations, but in some groupings of individuals, say vector 13, there are very few individuals growing out of the group. Typical bootstrapping would lead to a fair number of instances where growth in this vector does not occur, and hence the model falls apart. Each vector consists of a certain number of 1's, 2's, and 3's, where 1 represents staying within a group, 2 growing out of a group, and 3 death. Here is what I have so far, without the modification; it is likely not the best approach time-wise, but I am new to R.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for (i in 1:n) {
  index <- sample(stage, replace = TRUE)
  stay <- length(index[index == 1]) / length(index)
  moved <- length(index[index == 2]) / length(index)
  stagestay <- rbind(stagestay, stay)
  stagemoved <- rbind(stagemoved, moved)
}
Currently, this simply resamples st13 with replacement, with no constraints. My question is then: in what way can I modify the sample function to continue sampling these numbers until the length of "index" is at least that of st13 AND at least one instance of a 2 is present in "index"?
Thanks very much,
Kristopher Hennig
Masters Student
University of Mississippi
Oxford, MS, 38677
Update:
The answer from @lselzer reminded me that the requirement was for the length of the sample to be at least as long as st13. My code above just keeps sampling until it finds a bootstrap sample that contains a 2. The code of @lselzer grows the sample, one new index at a time, until the sample contains a 2. This is quite inefficient, as you might have to call sample() many times until you get a 2. My code might repeat for a long time before a 2 is returned in the sample. So can we do any better?
One way would be to draw a large sample with replacement using a single call to sample(). Check which entries are 2s and see if there is a 2 within the first length(st13) entries. If there is, return those entries; if not, find the first 2 in the large sample and return all entries up to and including that one. If there are no 2s at all, add on another large sample and repeat. Here is some code:
#runs
n <- 100 #00
stage <- st13
stagedead <- stagemoved <- stagestay <- Size <- vector()
sampSize <- 100 * (len <- length(stage)) ## sample size to try
for (i in seq_len(n)) {
  ## take a large sample
  samp <- sample(stage, size = sampSize, replace = TRUE)
  ## check if there are any `2`s and which they are,
  ## and if there are no 2s expand the sample
  while (length((twos <- which(samp == 2))) < 1) {
    samp <- c(samp, sample(stage, size = sampSize, replace = TRUE))
  }
  ## now we have a sample containing at least one 2,
  ## so set index to the required set of elements
  if ((min.two <- min(twos)) <= len) {
    index <- samp[seq_len(len)]
  } else {
    index <- samp[seq_len(min.two)]
  }
  stay <- length(index[index == 1]) / length(index)
  moved <- length(index[index == 2]) / length(index)
  stagestay[i] <- stay
  stagemoved[i] <- moved
  Size[i] <- length(index)
}
Here is a really degenerate vector with only a single 2 in 46 entries:
R> st14 <- sample(c(rep(1, 45), 2))
R> st14
[1] 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[39] 1 1 1 1 1 1 1 1
If I use the above loop on it rather than st13, I get the following for the minimum sample size required to get a 2 on each of the 100 runs:
R> Size
[1] 65 46 46 46 75 46 46 57 46 106 46 46 46 66 46 46 46 46
[19] 46 46 46 46 46 279 52 46 63 70 46 46 90 107 46 46 46 87
[37] 130 46 46 46 46 46 46 60 46 167 46 46 46 71 77 46 46 84
[55] 58 90 112 52 46 53 85 46 59 302 108 46 46 46 46 46 174 46
[73] 165 103 46 110 46 80 46 166 46 46 46 65 46 46 46 286 71 46
[91] 131 61 46 46 141 46 46 53 47 83
So it would suggest that the sampSize I chose (100 * length(stage)) is a bit of overkill here, but as all the operations we are using are vectorised we probably don't incur too much penalty for the overly long initial sample size, and we certainly don't incur any extra sample() calls.
Original:
If I understand you correctly, the problem is that sample() might not return any 2s at all. If so, we can continue sampling until it does, using the repeat control-flow construct.
I've altered your code accordingly, and optimised it a bit, because you should never grow objects in a loop the way you were doing. There are other ways this could be improved, but I'll stick with the loop for now. An explanation comes below.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagedead <- stagemoved <- stagestay <- vector()
for (i in seq_len(n)) {
  repeat {
    index <- sample(stage, replace = TRUE)
    if (any(index == 2)) {
      break
    }
  }
  stay <- length(index[index == 1]) / length(index)
  moved <- length(index[index == 2]) / length(index)
  stagestay[i] <- stay
  stagemoved[i] <- moved
}
This is the main change related to your Q:
repeat {
  index <- sample(stage, replace = TRUE)
  if (any(index == 2)) {
    break
  }
}
What this does is repeat the code contained in the braces until a break is triggered to jump us out of the repeat loop. So we take a bootstrap sample, then check if the sample contains any 2s. If it does, we break out and carry on with the rest of the current for-loop iteration. If the sample doesn't contain any 2s, the break is not triggered and we go round again, taking another sample. This continues until we do get a sample with a 2 in it.
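If you prefer, the repeat idea can be packaged as a small helper that returns one valid bootstrap sample (the name sample_until_2 is just for illustration):
sample_until_2 <- function(stage) {
  repeat {
    index <- sample(stage, replace = TRUE)
    if (any(index == 2)) return(index)  # keep resampling until a 2 appears
  }
}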
For starters, sample has a size argument which you could use to match the length of st13. The second part of your question could be solved using a while loop.
st13 <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
          1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,3,3)
#runs
n <- 10000
stage <- st13
stagestay <- vector()
stagemoved <- vector()
stagedead <- vector()
for (i in 1:n) {
  index <- sample(stage, length(stage), replace = TRUE)
  while (!any(index == 2)) {
    index <- c(index, sample(stage, 1, replace = TRUE))
  }
  stay <- length(index[index == 1]) / length(index)
  moved <- length(index[index == 2]) / length(index)
  stagestay[i] <- stay
  stagemoved[i] <- moved
}
While I was writing this, Gavin posted his answer, which is similar to mine, but I added the size argument to be sure index is at least the length of st13.
