How can I make R check whether an object is too large to print in the console? "Too large" here means larger than a user-defined value.
Example: You have a list f_data with two elements f_data$data (a 100MB data.frame) and f_data$info (for instance, a vector). Assume you want to inspect the first few entries of the f_data$data data.frame but you make a mistake and type head(f_data) instead of head(f_data$data). R will try to print the whole content of f_data to the console (which would take forever).
Is there an option somewhere that I can set in order to suppress the output of objects that are larger than, say, 1 MB?
Edit: Thank you all for your help. After setting the max.print option I realized that this does indeed give the desired output. BUT the problem that the output takes very long to show up still persists. I will give a proper example below.
df_nrow=100000
df_ncol=100
#create list with first element being a large data.frame
#second element is a short vector
test_list=list(df=data.frame(matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol)),
               vec=1:110)
#only print the first 100 elements of an object
options(max.print=100)
#head correctly displays the first row of the data.frame
#BUT for some reason the output takes really long to show up in the console (~30sec)
head(test_list)
#let's try to see how long exactly
system.time(head(test_list))
# user system elapsed
# 0 0 0
#well, obviously system.time is not the proper tool to measure this
#the same problem occurs if I just print the object to the console without using head
test_list$df
I assume that R performs some sort of analysis on the object being printed and this is what takes so long.
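For what it's worth, the printing can be timed by wrapping the call in print() explicitly, since head() itself is fast and it's the top-level auto-printing that's slow:
system.time(print(head(test_list)))
# or capture the formatted output without sending it to the console
system.time(capture.output(test_list$df))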
Edit 2:
As per my comment below, I checked whether the problem persists if I use a matrix instead of a data.frame.
#create list with first element being a large MATRIX
test_list=list(mat=matrix(rnorm(df_nrow*df_ncol),nrow=df_nrow,ncol=df_ncol),vec=1:110)
#no problem
head(test_list)
#no problem
test_list$mat
Could it be that the output to the console is not really efficiently implemented for data.frame objects?
I think there is no such option, but you can check the size of an object with object.size and print it only if it is smaller than a threshold (measured in bytes), for example:
print.small.objects <- function(x, threshold = 1e06, ...)
{
  # print x only if its size is below the threshold (in bytes);
  # otherwise just report the size
  if (object.size(x) < threshold) {
    print(x, ...)
  } else {
    cat("too big object\n")
    print(object.size(x))
  }
}
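For example, with the test_list from the question:
print.small.objects(test_list$vec)  # well under 1 MB: printed normally
print.small.objects(test_list$df)   # roughly 80 MB: prints "too big object" and the size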
Here's an example that you could adjust up to 100 MB. It basically only prints the first 6 rows and 5 columns if the object's size is above 8e5 bytes. You could also turn this into a function and place it in your .Rprofile.
> lst <- list(data.frame(replicate(100, rnorm(1000))), 1:10)
> sapply(lst, object.size)
# [1] 810968 88
> lapply(lst, function(x){
+   if(object.size(x) > 8e5) head(x)[1:5] else x
+ })
#[[1]]
# X1 X2 X3 X4 X5
#1 0.3398235 -1.7290077 -0.35367971 0.09874918 -0.8562069
#2 0.2318548 -0.3415523 -0.38346083 -0.08333569 -1.1091982
#3 0.0714407 -1.4561768 0.50131914 -0.54899188 0.1652095
#4 -0.5170228 1.7343073 -0.05602883 0.87855313 0.4025590
#5 0.6962212 -0.3179930 0.28016057 1.05414456 -0.5172885
#6 0.9471200 1.4424843 -1.46323827 -0.78004192 -1.3611820
#
#[[2]]
# [1] 1 2 3 4 5 6 7 8 9 10
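To turn the anonymous function above into something you could keep in your .Rprofile, a small sketch (the name safe_head is just illustrative):
safe_head <- function(x, threshold = 8e5) {
  # show only the first 6 rows and 5 columns of objects above the threshold
  if (object.size(x) > threshold) head(x)[1:5] else x
}
lapply(lst, safe_head)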
I read that using seq_along() handles the empty case much better, but this concept is not so clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Consider these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
print(median(df[[i]]))
}
# Case 3
for(i in df) print(median(i))
What is the difference between these three approaches when the data.frame contains data, and what happens in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while Cases 2 and 3 run without error.
In essence, the error in Case 1 is due to ncol(df) being 0. This makes the sequence 1:ncol(df) evaluate to 1:0, i.e. the vector c(1, 0). The for loop therefore tries to access column 1, which does not exist, so the subscript is out of bounds.
Meanwhile, in Cases 2 and 3 the loop body is never executed, since the sequences being iterated over are empty, i.e. they have length 0.
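You can see this directly:
df <- data.frame()
ncol(df)       # 0
1:ncol(df)     # [1] 1 0  -- two iterations, starting with the non-existent column 1
seq_along(df)  # integer(0) -- zero iterations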
As this question specifically relates to what is happening with seq_along(), let's take a traditional seq_along() example by constructing a full vector a and looking at the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, there is a corresponding index that was created by seq_along to be accessed.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
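If you do want numeric column indices, seq_len(ncol(df)) is the safe counterpart, because seq_len(0) returns integer(0):
for (i in seq_len(ncol(df))) {
  print(median(df[[i]]))
}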
Now, under the traditional assumption that there is some data within the data.frame (a very bad assumption for any kind of developer to make)...
set.seed(1234)
df <- data.frame(matrix(rnorm(40), 4))
All three cases would then operate as expected; that is, you would receive a median for each column of the data.frame (output truncated here):
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349
I've been trying to keep track of various objects in memory using data.table::address or .Internal(address()), but have noticed that some objects return the same address every time, while others are almost always different. What is going on here?
I've noticed that the addresses of objects like lists (data.tables, data.frames, etc.) remain constant (as reported by these functions), whereas if I report the address of a [ subset of a list, i.e. address(lst[1]), I get different results nearly every time. On the other hand, lst[[1]] returns the same value, and the addresses of constants like address(pi) remain constant whereas address(1) is volatile. Why is this happening?
## Create some data.tables of different sizes and plot the addresses
library(data.table)
par(mfrow = c(2,2))
for (i in 2:5) {
dat <- data.table(a=1:10^i)
## Constants
addr1 <- address(dat)
addr2 <- address(dat[[1]])
addr3 <- address(dat$a) # same as addr2
## Vary
addrs <- replicate(5000, address(dat[1]))
plot(density(as.integer(as.hexmode(addrs))), main=sprintf("N: %g", nrow(dat)))
abline(v=as.integer(as.hexmode(c(addr1, addr2, addr3))), col=1:3, lwd=2, lty=1:3)
legend("topleft", c("dat", "dat[[1]]", "dat$a"), col=1:3, lwd=2, lty=1:3)
}
Here are some examples of what I'm talking about with different sized data.tables. They are just densities of the results from address(dat[1]) (converted to an integer), and the lines correspond to the constant addresses of the data.table.
First off, I can replicate your results, so I did a bit of an investigation and dived through some code.
When you access the first member of dat using dat[1], you are actually creating a slice made from the list in dat[[1]] (equivalently dat$a). To take a slice, R first copies the list and then returns the slice you want.
So, basically, you see what you see because the [] syntax for indexing returns a slice containing the first element of dat, which is a copy of dat$a and will therefore sit at an arbitrary memory location.
The [[]] syntax returns a reference to the actual list that is the column in your data.table or data.frame and hence its address is invariant (or at least it is until you change a member of that list).
This could be confusing, because of course doing dat[1] = 6 or similar will alter the value(s) of the list in your data structure. However, if you look at address(dat[[1]]) before and after making such a change, you will notice that in fact the reference is now to a different list (the copy) e.g.
> dat <- data.table(a=1:10000)
> dat
a
1: 1
2: 2
3: 3
4: 4
5: 5
---
9996: 9996
9997: 9997
9998: 9998
9999: 9999
10000: 10000
> address(dat[[1]])
[1] "000000000CF389D8"
> address(dat[[1]])
[1] "000000000CF389D8"
> dat[1] = 100
> address(dat[[1]])
[1] "000000000D035B38"
> dat
a
1: 100
2: 2
3: 3
4: 4
5: 5
---
9996: 9996
9997: 9997
9998: 9998
9999: 9999
10000: 10000
>
Looking at the source code for data.frame (rather than data.table), the code that does the slice indexing ([]) is here, whereas the direct indexing ([[]]) is here. You can see that the latter is simpler and to cut a long story short, the former returns a copy. If you change a slice directly (e.g. dat[1] = 5), there is some logic here that handles ensuring that the data frame now references the updated copy.
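If you want to watch the copy-on-modify behaviour directly in base R, tracemem() reports whenever the traced object is duplicated; a minimal sketch on a plain data.frame:
x <- data.frame(a = 1:10)
tracemem(x)      # start tracing duplications of x
x[1, 1] <- 100   # copy-on-modify: tracemem reports the duplication(s)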
Hi,
I would like to perform a comparison operation on my vector: it contains numerical values that I want to transform as 2^x. However, if any value is greater than 65000 after transformation, I would like no transformation to be applied to the entire vector.
Currently I'm trying this:
final <- ifelse(2^vec > 65000, vec, 2^vec)
It works, but element by element: if a value is greater than 65000 after transformation this code returns the initial value, but if it doesn't exceed 65000 it returns the transformed value, so I end up with a mixed vector of transformed and non-transformed values.
Here is an example:
> vec
32.82 576.47 36.45 78.93 8.77 63.28 176.86 1.88 291.97 35.59
And the result of my code:
> final
32.820000 576.470000 36.450000 78.930000 436.549065 63.280000 176.860000 3.680751 291.970000 35.590000
Here you can see that some values have been transformed and some not. In this kind of situation I would like final = vec. I tried a "break" instead of vec for the "yes" condition in the ifelse, but it doesn't work. Probably something like that could work, but I don't know what.
If someone has an idea ^^
Thanks
How's this?
log_if_bigger = function(vec, thresh){
if(any(vec>thresh)){
return(log2(vec))
}else{
return(vec)
}
}
Usage:
# if any values bigger than 0, then log - here there are:
> log_if_bigger(c(1,2,3,4),0)
[1] 0.000000 1.000000 1.584963 2.000000
# if any values bigger than 9, then log - here there aren't:
> log_if_bigger(c(1,2,3,4),9)
[1] 1 2 3 4
Then you just want something like:
final = log_if_bigger(vec, 65000)
or possibly:
final = log_if_bigger(vec, log2(65000))
based on your condition where you test 2^vec>65000
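If you instead want the transformation exactly as stated in the question (exponentiate, unless any 2^ value would exceed 65000), a direct variant would be the following sketch (pow2_unless_too_big is just an illustrative name):
pow2_unless_too_big <- function(vec, thresh = 65000) {
  out <- 2^vec
  # return the original vector untouched if any transformed value overflows
  if (any(out > thresh)) vec else out
}
final <- pow2_unless_too_big(vec)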
I am trying to append a new row to a matrix each time I run a function: the first time the function is run the matrix is created, and on each succeeding run a new row of values is appended.
Here is some dummy data. Let's say x and y are the sides of a rectangle and z some sort of ID. In reality, these are not known in advance but are outputted by the function. The real function takes a species directory as argument, reads shapefiles, merges polygons and does a bunch of other things, but outputs the surface area. For each species (i.e. run of the function) I would like to store each outputted area in a matrix or a data.frame for further analysis, instead of outputting it to individual variables.
myfunc <- function(x, y, z){
area <- x*y
id <- z
tmp <- cbind(area,id)
assign('mtrx', rbind(tmp), envir=.GlobalEnv)
}
The above obviously only creates the matrix and overwrites it each time the function is run.
Any pointers would be very much appreciated!
If, as in your example, you know the values for x, y and z in advance, it makes sense to say something like:
> f1 <- function(x, y, z) c(x*y, z)
> mapply(f1, x=seq(4), y=seq(4), z=seq(4))
     [,1] [,2] [,3] [,4]
[1,]    1    4    9   16
[2,]    1    2    3    4
If the values for these variables are returned by another function, then perhaps best to store them until you're ready to run all the values through the final function (e.g. f1 above).
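For instance, with hypothetical stored outputs xs, ys and zs from earlier computations:
xs <- c(2, 3, 4); ys <- c(5, 7, 9); zs <- c(1, 2, 3)
mapply(f1, x = xs, y = ys, z = zs)  # one column of results per input triple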
You say
a new row with values is appended
but in RAM a new matrix is created (assigned) with the new row added each time you append. (You're in Circle 2 of The R Inferno.)
For small sized data this is not likely to be a problem in practice.
Also, using assign can make scoping awkward when calling a function within an environment (e.g. another function), so generally best to avoid if possible. There's usually a better alternative.
Here's the basic idea.
myfunc <- function(ID) {
# do a bunch of stuff based on ID
# calculate area
area <- 2*ID + rnorm(1,0,10) # fake the area...
return(c(ID=ID,area=area))
}
ID.list <- 1:100 # vector of IDs
result <- do.call(rbind,lapply(ID.list,myfunc))
# head(result)
# ID area
# [1,] 1 -14.794850
# [2,] 2 13.777036
# [3,] 3 17.807578
# [4,] 4 21.070712
# [5,] 5 11.904047
# [6,] 6 3.735771
Return ID and area as a named vector with c(ID=ID, area=area). Do this for all ID's with the call to lapply(...). Then bind them all together using do.call(rbind,...).
I highly recommend against this method, but you need to use get in that last line:
assign('mtrx', rbind(get('mtrx', envir=parent.frame()), tmp), envir=.GlobalEnv)
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
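ID sample_value  log_message
 1           34  FIRST_EVENT
 2           56
 3           78  SECOND_EVENT
 4           98
 5          234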
Now, I'd like to take that and turn it into this:
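ID sample_value  log_message   current_event
 1           34  FIRST_EVENT   FIRST_EVENT
 2           56                FIRST_EVENT
 3           78  SECOND_EVENT  SECOND_EVENT
 4           98                SECOND_EVENT
 5          234                SECOND_EVENT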
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
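A minimal illustration (assuming zoo is installed):
library(zoo)
x <- c("A", NA, "B", NA, NA)
na.locf(x)
# [1] "A" "A" "B" "B" "B"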
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector; in this case each component is a vector of length two. We want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) the result is added as a new variable directly into the data dat.