I've been trying to keep track of various objects in memory using data.table::address or .Internal(address()), but have noticed that some objects return the same address every time, while others are almost always different. What is going on here?
I've noticed that the addresses of objects like lists (data.tables, data.frames, etc.) remain constant (as reported by these functions), whereas if I index into a list with [], i.e. address(lst[1]), I get a different result nearly every time. On the other hand, lst[[1]] returns the same value each time, and the addresses of constants like address(pi) remain constant, whereas address(1) is volatile. Why is this happening?
## Create some data.tables of different sizes and plot the addresses
library(data.table)
par(mfrow = c(2, 2))
for (i in 2:5) {
  dat <- data.table(a = 1:10^i)
  ## Constant addresses
  addr1 <- address(dat)
  addr2 <- address(dat[[1]])
  addr3 <- address(dat$a)  # same as addr2
  ## Varying addresses
  addrs <- replicate(5000, address(dat[1]))
  plot(density(as.integer(as.hexmode(addrs))), main = sprintf("N: %g", nrow(dat)))
  abline(v = as.integer(as.hexmode(c(addr1, addr2, addr3))), col = 1:3, lwd = 2, lty = 1:3)
  legend("topleft", c("dat", "dat[[1]]", "dat$a"), col = 1:3, lwd = 2, lty = 1:3)
}
Here are some examples of what I'm talking about, with data.tables of different sizes. The plots are just densities of the results of address(dat[1]) (converted to integers), and the vertical lines mark the constant addresses associated with the data.table.
First off, I can replicate your results, so I did a bit of investigating and dug through some code.
When you access the first member of dat using dat[1], you are actually creating a slice built from the data in dat[[1]] or dat$a. To take a slice, R first copies the data and then returns the slice you want.
So, basically, you see what you see because the [] indexing syntax returns a slice containing the first element of dat, which is a copy of dat$a and will therefore sit at an arbitrary memory location.
The [[]] syntax returns a reference to the actual vector that is the column in your data.table or data.frame, and hence its address is invariant (or at least it is until you change a member of that column).
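The same distinction shows up with a plain list. A minimal sketch (using data.table's address(); the actual hex values will differ per session):
lst <- list(a = 1:5)
address(lst$a)       # the column vector itself
address(lst[["a"]])  # same address: [[]] returns the vector without copying
address(lst["a"])    # changes on every call: [] allocates a new one-element list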
This could be confusing because, of course, doing dat[1] = 6 or similar will alter the value(s) stored in your data structure. However, if you look at address(dat[[1]]) before and after making such a change, you will notice that the reference now points to a different vector (the copy), e.g.
> dat <- data.table(a=1:10000)
> dat
a
1: 1
2: 2
3: 3
4: 4
5: 5
---
9996: 9996
9997: 9997
9998: 9998
9999: 9999
10000: 10000
> address(dat[[1]])
[1] "000000000CF389D8"
> address(dat[[1]])
[1] "000000000CF389D8"
> dat[1] = 100
> address(dat[[1]])
[1] "000000000D035B38"
> dat
a
1: 100
2: 2
3: 3
4: 4
5: 5
---
9996: 9996
9997: 9997
9998: 9998
9999: 9999
10000: 10000
>
Looking at the source code for data.frame (rather than data.table): the slice indexing ([]) is implemented in [.data.frame, whereas the direct indexing ([[]]) goes through [[.data.frame. You can see that the latter is simpler, and to cut a long story short, the former returns a copy. If you change a slice directly (e.g. dat[1] = 5), there is some logic in the replacement method [<-.data.frame that ensures the data frame now references the updated copy.
This is probably simple, but I'm new to R and it doesn't work like GrADS, so I've been searching high and low for examples, but to no avail.
I have two sets of data. Data A (1997) and Data B (2000)
Data A has 35 headings (apples, orange, grape etc). 200 observations.
Data B has 35 headings (apples, orange, grape, etc). 200 observations.
The only difference between the two datasets is the year.
So I would like to correlate the two datasets, i.e. the 200 observations under Apples (1997) vs the 200 observations under Apples (2000), so one heading should give me only one value.
I've converted all the header names to V1,V2,V3...
So now I need to do this:
x<-1
while(x<35) {
new(x)=cor(1997$V(x),2000$V(x))
print(new(x))
}
and then I get this error:
Error in pptn26$V(x) : attempt to apply non-function.
Any advice is highly appreciated!
Your error comes directly from using parentheses where R isn't expecting them. You'll get the same type of error if you do 1(x): 1 is not a function, so if you put it right next to parentheses with no white space in between, you're attempting to apply a non-function.
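Reduced to its essence (a throwaway example):
1(2)
# Error: attempt to apply non-function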
I'm also a bit surprised at how you are managing to get all the way to that error before running into several others, but I suppose that has something to do with when R evaluates what...
Here's how to get the behavior you're looking for:
mapply(cor, A, B)
# provided A is the name of your 1997 data frame and B the 2000
Here's an example with simulated data:
set.seed(123)
A <- data.frame(x = 1:10, y = sample(10), z = rnorm(10))
B <- data.frame(x = 4:13, y = sample(10), z = rnorm(10))
mapply(cor, A, B)
# x y z
# 1.0000000 0.1393939 -0.2402058
In its typical usage, mapply takes an n-ary function and n objects that provide the n arguments for that function. Here the n-ary function is cor, and the objects are A and B, each a data frame. A data frame is structured as a list of vectors, the columns of the data frame. So mapply will loop along your columns for you, making 35 calls to cor, each time with the next column of both A and B.
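If it helps to see what mapply is doing, here is a long-hand equivalent of the call above (a sketch, assuming the A and B from the example):
res <- numeric(ncol(A))  # one correlation per column
names(res) <- names(A)
for (j in seq_along(A)) res[j] <- cor(A[[j]], B[[j]])
res  # same values as mapply(cor, A, B)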
If you have managed to figure out how to name your data frames 1997 and 2000, kudos. It's not easy to do that. It's also going to cause you headaches. You'll want to have a syntactically valid name for your data frame(s). That means they should start with a letter (or a dot, but really a letter). See the R FAQ for the details.
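To illustrate both the trick and the headaches (a hypothetical sketch; assign() is one of the few ways to create such a name in the first place):
assign("1997", data.frame(Apples = rnorm(5)))  # possible, but awkward
`1997`$Apples       # every later use needs backticks
A <- `1997`         # the easiest fix: rebind to a valid name
make.names("1997")  # "X1997" -- the name R itself would choose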
I read that using seq_along() allows one to handle the empty case much better, but the concept is not entirely clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
print(median(df[[i]]))
}
# Case 3
for(i in df) print(median(i))
What is the difference between these three procedures when the data.frame is populated, and what happens in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while Case 2 and 3 are not triggered.
In essence, the error in Case 1 is due to ncol(df) being 0. This makes the sequence 1:ncol(df) evaluate to 1:0, which creates the vector c(1, 0). The for loop therefore tries to access column 1, which does not exist, and so the subscript is found to be out of bounds.
Meanwhile, in Cases 2 and 3 the loop body is never executed, since there are no elements to process in their respective collections: both are empty, i.e. they have length 0.
As this question specifically relates to what the heck is happening with seq_along(), let's take a traditional seq_along() example by constructing a full vector a and looking at the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, there is a corresponding index that was created by seq_along to be accessed.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
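For completeness: if you prefer counting columns explicitly, seq_len() is a safe drop-in replacement for the 1:ncol(df) pattern (a sketch):
for (i in seq_len(ncol(df))) {
  print(median(df[[i]]))
}
seq_len(0)
# integer(0), so the empty case is skipped just as with seq_along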
Now, under the traditional assumption that there is some data within the data.frame (a very bad assumption for any kind of developer to make)...
set.seed(1234)
df <- data.frame(matrix(rnorm(40), 4))
All three cases would be operating as expected. That is, you would receive a median per column of the data.frame.
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349
I am new to R and am trying to create a new dataframe of bootstrapped resamples of groups of different sizes. My dataframe has 6 variables and a group designation, and there are 128 groups of different Ns. Here is an example of my data:
head(PhenoM2)
ID Name PhenoNames Group HML RML FML TML FHD BIB
1 378607 PaleoAleut PaleoAleut 1 323.5 248.75 434.50 355.75 46.84 NA
2 378664 PaleoAleut PaleoAleut 1 NA 238.50 441.50 353.00 45.83 277.0
3 378377 PaleoAleut PaleoAleut 1 309.5 227.75 419.00 332.25 46.39 284.0
4 378463 PaleoAleut PaleoAleut 1 283.5 228.75 397.75 331.00 44.37 255.5
5 378602 PaleoAleut PaleoAleut 1 279.5 230.00 393.00 329.50 45.93 265.0
6 378610 PaleoAleut PaleoAleut 1 307.5 234.25 419.50 338.50 43.98 271.5
Pulling from this question - bootstrap resampling for hierarchical/multilevel data - and taking some advice from others (thanks!) I wrote the code:
resample.M <- NULL
for (i in 1000) {
  groups <- unique(PhenoM2$"Group")
  for (ii in 1:128)
    data.i.ii <- PhenoM2[PhenoM2$"Group" == groups[ii], ]
  resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii), replace = T), ]
}
Unfortunately, this gives me the warning:
In resample.M[i] <- data.i.ii[sample(1:nrow(data.i.ii), replace = T),:
number of items to replace is not a multiple of replacement length
Which I understand, since each of the 128 groups has a different N and none of them is a multiple of 1000. I put in resample.M[i] to try to accumulate all of the 1000 resamples of the 128 groups into a single database, and I'm pretty sure the problem is here.
Nearly all of the examples of for loops I've read create a vector database - numeric(1000) - and then plug in the information, but since I want all of the data (which includes factors, integers, and numerics) this doesn't work. I tried making a matrix to put the info in (there are 2187 unique individuals in the dataframe):
resample.M <- matrix(ncol=2187000,nrow=10)
But it's giving me the same warning.
So, since I'm sure I'm missing something basic here, I have three questions:
How can I get this code to resample all of the groups (with replacement and based on their individual Ns)?
How can I get this code to repeat this resampling 1000x?
How can I get the resamples of every group into the same database?
Thank you so much for your insight and expertise!
I think you may have wanted to use double square brackets to store the results in a list, i.e. resample.M[[i]] <- .... Apart from that, it makes more sense to write PhenoM2$Group than PhenoM2$"Group", and groups <- unique(PhenoM2$Group) can go outside of your for loop, since you only need to compute it once. Also, replace 1:128 with 1:length(groups) or seq_along(groups), so that you don't hard-code the length of the vector.
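Putting those fixes together, here is a sketch of what the corrected base-R loop might look like (do.call(rbind, ...) combines the per-group resamples of each iteration):
resample.M <- vector("list", 1000)
groups <- unique(PhenoM2$Group)  # computed once, outside the loop
for (i in seq_along(resample.M)) {
  one.rep <- lapply(seq_along(groups), function(ii) {
    d <- PhenoM2[PhenoM2$Group == groups[ii], ]
    d[sample(1:nrow(d), replace = TRUE), ]  # resample rows within this group
  })
  resample.M[[i]] <- do.call(rbind, one.rep)
}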
Because you will often need to operate on data frames grouped by some variable, I suggest you familiarise yourself with a package designed to do that, rather than using for loops, which can be very slow. The best one for a beginner in R may be plyr, which has an easy syntax (although there are many possibilities, including the slightly more "advanced" packages like dplyr and data.table).
So for a subset d <- subset(PhenoM2, Group == 1), you already have the function you need to perform on it: function(d) d[sample(1:nrow(d), replace = TRUE),].
Now, to go over all such subsets, perform this operation and then arrange the results in a new data frame named samples, you do
library(plyr)
samples <- ddply(PhenoM2, .(Group),
                 function(d) d[sample(1:nrow(d), replace = TRUE), ])
So what remains is to iterate this 1000 or however many times you want. You can use a for loop for this, storing the results in a list. Note that you need to use double square bracket [[ to set elements of the list.
n <- 1000                    # number of iterations
samples <- vector("list", n) # list of length n to store results
for (i in seq_along(samples))
  samples[[i]] <- ddply(PhenoM2, .(Group),
                        function(d) d[sample(1:nrow(d), replace = TRUE), ])
An alternative way would be to use the function replicate, which performs the same task many times.
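For instance (a sketch, with n and plyr as above; simplify = FALSE keeps the result as a list of data frames rather than an array):
samples <- replicate(n,
                     ddply(PhenoM2, .(Group),
                           function(d) d[sample(1:nrow(d), replace = TRUE), ]),
                     simplify = FALSE)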
Once you have done this, all resamples will be stored in a list. I am not sure what you mean by "How can I get the resamples of every group into the same database". If you want to combine them into a single data frame, do all.samples <- do.call(rbind, samples). In general, you can reshape your list of samples using do.call and lapply together with a suitable function.
How can I make R check whether an object is too large to print in the console? "Too large" here means larger than a user-defined value.
Example: You have a list f_data with two elements f_data$data (a 100MB data.frame) and f_data$info (for instance, a vector). Assume you want to inspect the first few entries of the f_data$data data.frame but you make a mistake and type head(f_data) instead of head(f_data$data). R will try to print the whole content of f_data to the console (which would take forever).
Is there somewhere an option that I can set in order to suppress the output of objects that are larger than let's say 1MB?
Edit: Thank you guys for your help. After setting the max.print option I realized that this does indeed give the desired output. BUT the problem that the output takes very long to show up still persists. I will give a proper example below.
df_nrow=100000
df_ncol=100
#create list with first element being a large data.frame
#second element is a short vector
test_list=list(df=data.frame(matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol)),
vec=1:110)
#only print the first 100 elements of an object
options(max.print=100)
#head correctly displays the first row of the data.frame
#BUT for some reason the output takes really long to show up in the console (~30sec)
head(test_list)
#let's try to see how long exactly
system.time(head(test_list))
# user system elapsed
# 0 0 0
#well, obviously system.time is not the proper tool to measure this
#the same problem if I just print the object to the console without using head
test_list$df
I assume that R performs some sort of analysis on the object being printed, and this is what takes so long.
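Wrapping the print call itself does register the time, which supports this (a quick sketch; exact timings will of course vary by machine):
#print the head explicitly so the console output is part of the timed expression
system.time(print(head(test_list)))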
Edit 2:
As per my comment below, I checked whether the problem persists if I use a matrix instead of a data.frame.
#create list with first element being a large MATRIX
test_list=list(mat=matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol),vec=1:110)
#no problem
head(test_list)
#no problem
test_list$mat
Could it be that the output to the console is not really efficiently implemented for data.frame objects?
I think there is no such option, but you can check the size of an object with object.size and print it only if it is below a threshold (measured in bytes), for example:
print.small.objects <- function(x, threshold = 1e06, ...)
{
  if (object.size(x) < threshold) {
    print(x, ...)
  } else {
    cat("too big object\n")
    print(object.size(x))
  }
}
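A quick usage sketch (relying on the function and its 1e06-byte default threshold defined above):
x <- rnorm(100)
print.small.objects(x)    # well under the threshold: printed as usual
big <- data.frame(matrix(rnorm(2e6), ncol = 10))
print.small.objects(big)  # over the threshold: message plus its size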
Here's an example that you could adjust up to 100MB. It basically only prints the first 6 rows and 5 columns if the object's size is above 8e5 bytes. You could also turn this into a function and place it in your .Rprofile.
> lst <- list(data.frame(replicate(100, rnorm(1000))), 1:10)
> sapply(lst, object.size)
# [1] 810968 88
> lapply(lst, function(x){
if(object.size(x) > 8e5) head(x)[1:5] else x
})
#[[1]]
# X1 X2 X3 X4 X5
#1 0.3398235 -1.7290077 -0.35367971 0.09874918 -0.8562069
#2 0.2318548 -0.3415523 -0.38346083 -0.08333569 -1.1091982
#3 0.0714407 -1.4561768 0.50131914 -0.54899188 0.1652095
#4 -0.5170228 1.7343073 -0.05602883 0.87855313 0.4025590
#5 0.6962212 -0.3179930 0.28016057 1.05414456 -0.5172885
#6 0.9471200 1.4424843 -1.46323827 -0.78004192 -1.3611820
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
I'm writing a gene-level analysis script in R and I'll have to handle large amounts of data.
My initial idea was to create a super list structure, a set of lists within lists. Essentially the structure is
#12.8 mins
list[[1:8]][[1:1000]][[1:6]][[1:1000]]
This is huge and takes in excess of 12 mins purely to set up the data structure. Streamlining this process, I can get it down to about 1.6 mins when setting up just one value of the 1:8 list, so essentially...
#1.6 mins
list[[1:1]][[1:1000]][[1:6]][[1:1000]]
Normally I'd create the structure as and when it's needed, on the fly; however, I'm distributing the 1:1000 steps, which means I don't know which order they'll come back in.
Are there any other packages for handling the creation of this level of data?
Could I use any more efficient data structures in my approach?
I apologise if this seems like the wrong approach entirely, but this is my first time handling big data in R.
Note that lists are vectors, and like any other vector, they can have a dim attribute.
l <- vector("list", 8 * 1000 * 6 * 1000)
dim(l) <- c(8, 1000, 6, 1000)
This is effectively instantaneous. You access individual elements with [[, e.g. l[[1, 2, 3, 4]].
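For example, filling and reading a single cell uses ordinary array indexing (a small sketch):
l[[1, 2, 3, 4]] <- rnorm(5)
l[[1, 2, 3, 4]]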
A different strategy is to create a vector and a partitioning, e.g., to represent
list(1:4, 5:7)
as
l = list(data=1:7, partition=c(4, 7))
then one can do vectorized calculations, e.g.,
logl = list(data=log(l$data), partition = l$partition)
and other clever things. This avoids creating complicated lists and the iterations they imply. This approach is formalized in the *List classes of the Bioconductor IRanges package.
> library(IRanges)
> l <- NumericList(1:4, 5:7)
> l
NumericList of length 2
[[1]] 1 2 3 4
[[2]] 5 6 7
> log(l)
NumericList of length 2
[[1]] 0 0.693147180559945 1.09861228866811 1.38629436111989
[[2]] 1.6094379124341 1.79175946922805 1.94591014905531
One idiom for working with this data is to unlist, transform, then relist; both unlist and relist are inexpensive, so the long-hand version of the above is relist(log(unlist(l)), l).
Depending on your data structure, the DataFrame class may be appropriate, e.g., the following can be manipulated like a data.frame (subset, etc) but contains *List elements.
> DataFrame(Sample=c("A", "B"), VariableA=l, LogA=log(l))
DataFrame with 2 rows and 3 columns
Sample VariableA LogA
<character> <NumericList> <NumericList>
1 A 1,2,3,... 0,0.693147180559945,1.09861228866811,...
2 B 5,6,7 1.6094379124341,1.79175946922805,1.94591014905531
For genomic data where the coordinates of genes (or other features) on chromosomes is of fundamental importance, the GenomicRanges package and GRanges / GRangesList classes are appropriate.