data.table fread function - r

I am using the new data.table:::fread function (fastest read function I've used in R so far) and I got the following (self-explanatory) exception:
R) fread(path)
Error in fread(path) : Coercing integer64 to real needs to be implemented
My file (a csv, but separated by tabs) does indeed hold big integers like 902160000671352000. My question then: can I tell fread NOT to read the second column (where those monster ints are)?

Good question. Not yet, but yes you will be able to. Agree with all comments.
The TO DO list is at the top of the readfile.c source. If there's anything missing, please let me know. That list covers allowing type overrides, implementing the unimplemented coercions and allowing columns to be skipped. Hopefully it will all be done for the first release in 1.9.0.
fread is currently in v1.8.7, which is in development on R-Forge. When finished it'll be released as 1.9.0 to CRAN. The .0 indicates that new features might possibly change argument names and behaviour; i.e., don't be surprised if backwards-incompatible changes are made to fread in 1.9.1. Given its nature, it's hard to imagine anything major will change, though. But that's why I publicised its availability on R-Forge: to get it into the wild early and get things like this right.
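(Update for later readers: released versions of data.table did gain these abilities, so a current fread can skip the offending column or override its detected type directly. A minimal sketch; the column position and the name "big_id" stand in for your second column:)
library(data.table)
# skip the second column entirely, by position or by name
dt <- fread(path, drop = 2)
# dt <- fread(path, drop = "big_id")
# or keep it, but force its type instead of letting detection pick integer64
dt <- fread(path, colClasses = list(character = 2))
# dt <- fread(path, integer64 = "character")  # applies to every integer64-detected column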

Related

How to make an R object immutable? [duplicate]

I'm working in R, and I'd like to define some variables that I (or one of my collaborators) cannot change. In C++ I'd do this:
const std::string path( "/projects/current" );
How do I do this in the R programming language?
Edit for clarity: I know that I can define strings like this in R:
path = "/projects/current"
What I really want is a language construct that guarantees that nobody can ever change the value associated with the variable named "path."
Edit to respond to comments:
It's technically true that const is a compile-time guarantee, but it would be fine in my mind for the R interpreter to stop execution with an error message. For example, look what happens when you try to assign a value to a numeric constant:
> 7 = 3
Error in 7 = 3 : invalid (do_set) left-hand side to assignment
So what I really want is a language feature that allows you to assign values once and only once, and there should be some kind of error when you try to assign a new value to a variable declared as const. I don't care if the error occurs at run-time, especially if there's no compilation phase. This might not technically be const by the Wikipedia definition, but it's very close. It also looks like this is not possible in the R programming language.
See lockBinding:
a <- 1
lockBinding("a", globalenv())
a <- 2
Error: cannot change value of locked binding for 'a'
Since you are planning to distribute your code to others, you could (should?) consider creating a package. Within that package, create a NAMESPACE. There you can define variables that will have a constant value, at least as seen by the functions that your package uses. Have a look at Tierney (2003), Name Space Management for R.
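As a sketch of that (the package and variable names below are made up): put the value in one of the package's R files; when the package is loaded its namespace is sealed, so the binding is effectively read-only for your package's functions.
# R/constants.R inside a package called, say, myconsts
PROJECT_PATH <- "/projects/current"
# after library(myconsts), the namespace binding is locked:
# assign("PROJECT_PATH", "other", envir = asNamespace("myconsts"))
# Error: cannot change value of locked binding for 'PROJECT_PATH'
A user can still create their own PROJECT_PATH in the global environment, but your package's functions will keep seeing the locked copy in the namespace.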
I'm pretty sure that this isn't possible in R. If you're worried about accidentally re-writing the value, then the easiest thing to do is to put all of your constants into a list structure; then you know when you're using those values. Something like:
my.consts<-list(pi=3.14159,e=2.718,c=3e8)
Then when you need to access them, you have an aide-mémoire to remind you what not to do, and it also pushes them out of your normal namespace.
Another place to ask would be the R development mailing list. Hope this helps.
(Edited for new idea:) The bindenv functions provide an
experimental interface for adjustments to environments and bindings within environments. They allow for locking environments as well as individual bindings, and for linking a variable to a function.
This seems like the sort of thing that could give a false sense of security (like a const pointer to a non-const variable) but it might help.
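For concreteness, the pieces referred to there are lockEnvironment(), lockBinding() and makeActiveBinding(); a quick illustration (the names e, x, y and z are arbitrary):
e <- new.env()
e$x <- 1
lockEnvironment(e, bindings = TRUE)  # lock the environment and every binding in it
e$x <- 2   # Error: cannot change value of locked binding for 'x'
e$y <- 3   # Error: cannot add bindings to a locked environment
# "linking a variable to a function": every read of z re-runs the function
makeActiveBinding("z", function() Sys.time(), globalenv())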
(Edited for focus:) const is a compile-time guarantee, not a lock-down on bits in memory. Since R doesn't have a compile phase where it looks at all the code at once (it is built for interactive use), there's no way to check that future instructions won't violate any guarantee. If there's a right way to do this, the folks at the R-help list will know. My suggested workaround: fake your own compilation. Write a script to preprocess your R code that will manually substitute the corresponding literal for each appearance of your "constant" variables.
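A very rough sketch of that fake-compilation idea (the file names and the constant table are invented, and plain string substitution like this is blunt; it cannot tell a variable from text that merely looks like one):
consts <- c(PATH = '"/projects/current"', MAX_ITER = "100")
src <- readLines("analysis.R")
for (nm in names(consts)) {
  src <- gsub(paste0("\\b", nm, "\\b"), consts[[nm]], src)  # whole-word substitution
}
writeLines(src, "analysis_compiled.R")
source("analysis_compiled.R")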
(Original:) What benefit are you hoping to get from having a variable that acts like a C "const"?
Since R has exclusively call-by-value semantics (unless you do some munging with environments), there isn't any reason to worry about clobbering your variables by calling functions on them. Adopting some sort of naming conventions or using some OOP structure is probably the right solution if you're worried about you and your collaborators accidentally using variables with the same names.
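For example:
x <- c(1, 2, 3)
f <- function(v) { v[1] <- 99; v }
f(x)   # returns the modified copy: 99 2 3
x      # the caller's x is unchanged: 1 2 3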
The feature you're looking for may exist, but I doubt it, given the origin of R as an interactive environment where you'd want to be able to undo your actions.
R doesn't have a language constant feature. The list idea above is good; I personally use a naming convention like ALL_CAPS.
I took the answer below from this website
The simplest sort of R expression is just a constant value, typically a numeric value (a number) or a character value (a piece of text). For example, if we need to specify a number of seconds corresponding to 10 minutes, we specify a number.
> 600
[1] 600
If we need to specify the name of a file that we want to read data from, we specify the name as a character value. Character values must be surrounded by either double-quotes or single-quotes.
> "http://www.census.gov/ipc/www/popclockworld.html"
[1] "http://www.census.gov/ipc/www/popclockworld.html"

fread issue with integer64

I am trying to read some larger data files into R with fread using integer64 = "numeric", but for some reason the conversion does not work reliably (it used to work in the past). Some of my outcome data ends up as integer, some as integer64 and some as numeric, which is probably not intended. The problem seems to be known: https://github.com/Rdatatable/data.table/issues/2607
My question is: What is the best current workaround to deal with this? If someone has an idea how to post sample data to illustrate the issue more clearly, please feel free to contribute to this post.
I guess this affects a lot of people who are using numbers with absolute value >= 2^31. Also see the documentation of the integer64 argument of fread in this regard: "integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as base::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".
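One workaround sketch (not a fix of the underlying detection issue, and the file name below is a placeholder) is to read the file as-is and then coerce whatever still arrives as integer64 yourself:
library(data.table)
library(bit64)
DT <- fread("mydata.csv")
cols <- names(DT)[vapply(DT, bit64::is.integer64, logical(1))]
if (length(cols)) DT[, (cols) := lapply(.SD, as.numeric), .SDcols = cols]
str(DT)  # the previously integer64 columns are now numeric (with possible loss of precision)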

attr(*, "internal.selfref")=<externalptr> appearing in data.table Rstudio

I am a new user of the R data.table package, and I have noticed something unusual in my data.tables that I have not found explained in the documentation or elsewhere on this site.
When using the data.table package within RStudio, and viewing a specific data.table within the 'Environment' panel, I see the following string appearing at the end of the data.table:
attr(*, ".internal.selfref")=<externalptr>
If I print the same data.table within the Console, this string does not appear.
Is this a bug, or just an inherent feature of data.table (or Rstudio)? Should I be concerned about whether this is affecting how these data are handled by downstream processes?
The versions I am running are as follows:
data.table Version 1.9.6
Rstudio Version 0.99.447
OSX 10.10.5
Apologies in advance if this is just me being an ignorant newbie.
I actually asked Matt Dowle, the primary author of the data.table package, this very question a little while ago.
Is this a bug, or just an inherent feature of data.table (or Rstudio)?
Apparently this attribute is used internally by data.table; it isn't a bug in RStudio. In fact, RStudio is doing its job by showing the attributes of the object.
Should I be concerned about whether this is affecting how these data are handled by downstream processes?
No, this isn't going to affect anything.
For those who are curious about why this attribute is created, I believe it's explained in the data.table manual under the section for setkey():
In v1.7.8, the key<- syntax was deprecated. The <- method copies the whole table and we know of no way to avoid that copy without a change in R itself. Please use the set* functions instead, which make no copy at all. setkey accepts unquoted column names for convenience, whilst setkeyv accepts one vector of column names.
The problem (for data.table) with the copy by key<- (other than being slower) is that R doesn't maintain the over-allocated truelength, but it looks as though it has. Adding a column by reference using := after a key<- was therefore a memory overwrite and eventually a segfault; the over-allocated memory wasn't really there after key<-'s copy. data.tables now have an attribute .internal.selfref to catch and warn about such copies. This attribute has been implemented in a way that is friendly with identical() and object.size().
For the same reason, please use the other set* functions which modify objects by reference, rather than using the <- operator which results in copying the entire object.
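In code, the distinction the manual draws looks roughly like this (DT and the column name are placeholders):
library(data.table)
DT <- data.table(id = c("b", "a"), value = 1:2)
setkey(DT, id)       # sorts and sets the key by reference, no copy
setkeyv(DT, "id")    # same, but takes a character vector of column names
# key(DT) <- "id"    # the deprecated form: copies the whole table first
DT[, extra := 42]    # adding a column by reference is safe after setkey()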

Easiest way to save an S4 class

Probably the most basic question on S4 classes imaginable here.
What is the simplest way to save an S4 class you have defined so that you can reuse it elsewhere. I have a project where I'm taking a number of very large datasets and compiling summary information from them into small S4 objects. Since I'll therefore be switching R sessions to create the summary object for each dataset, it'd be good to be able to load in the definition of the class from a saved object (or have it load automatically) rather than having to include the long definition of the object at the top of each script (which I assume is bad practice anyway because the code defining the object might become inconsistent).
So what's the syntax along the lines of saveclass("myClass"), loadclass("myclass") or am I just thinking about this in the wrong way?
setClass("track", representation(x="numeric", y="numeric"))
x <- new("track", x=1:4, y=5:8)
save as binary
fn <- tempfile()
save(x, ascii=FALSE, file=fn)
rm(x)
load(fn)
x
save as ASCII
save(x, ascii=TRUE, file=fn)
ASCII text representation from which to regenerate the data
dput(x, file=fn)
y <- dget(fn)
The original source can be found here.
From the question, I think you really do want to include the class definition at the top of each script (although not literally; see below), rather than saving a binary representation of the class definition and load that. The reason is the general one that binary representations are more fragile (subject to changes in software implementation) compared to simple text representations (for instance, in the not too distant past S4 objects were based on simple lists with a class attribute; more recently they have been built around an S4 'bit' set on the underlying C-level data representation).
Instead of copying and pasting the definition into each script, the best practice really is to include the class definition (and related methods) in an R package, and to load the package at the top of the script. It is not actually hard to write packages; an easy way to get started is to use RStudio to create a 'New Project' as an 'R package'. Use a version number in the package to keep track of the specific version of the class definition / methods you're using, and version control (svn or git, for instance) to make it easy to track the changes / explorations you make as your class matures. Share with your colleagues and eventually the larger R community to let others benefit from your hard work and insight!
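As a sketch, the package route can start as small as one file plus the usual package skeleton (all names below are placeholders):
# R/track-class.R inside a package called, say, trackutils
setClass("track", representation(x = "numeric", y = "numeric"))
track <- function(x, y) new("track", x = x, y = y)   # a small constructor
setMethod("show", "track", function(object) {
  cat("track with", length(object@x), "points\n")
})
# plus exportClasses("track") (and export(track)) in the package NAMESPACE
Each analysis script then just starts with library(trackutils), and the class definition travels with the package version instead of being pasted into every file.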

Out of memory when modifying a big R data.frame

I have a big data frame taking about 900MB ram. Then I tried to modify it like this:
dataframe[[17]][37544]=0
This seems to make R use more than 3 GB of RAM, and R complains "Error: cannot allocate vector of size 3.0 Mb" (I am on a 32-bit machine).
I found that this way is better:
dataframe[37544, 17]=0
but R's memory footprint still doubles and the command takes quite some time to run.
From a C/C++ background, I am really confused about this behavior. I thought something like dataframe[37544, 17]=0 should be completed in a blink without costing any extra memory (only one cell should be modified). What is R doing for those commands I posted? What is the right way to modify some elements in a data frame then without doubling the memory footprint?
Thanks so much for your help!
Tao
Following up on Joran suggesting data.table, here are some links. Your object, at 900MB, is manageable in RAM even in 32bit R, with no copies at all.
When should I use the := operator in data.table?
Why has data.table defined := rather than overloading <-?
Also, data.table v1.8.0 (not yet on CRAN but stable on R-Forge) has a set() function which provides even faster assignment to elements, as fast as assignment to a matrix (appropriate for use inside loops, for example). See the latest NEWS for more details and examples. Also see ?":=" which is linked from ?data.table.
And, here are 12 questions on Stack Overflow with the data.table tag containing the word "reference".
For completeness :
require(data.table)
DT = as.data.table(dataframe)
# say column name 17 is 'Q' (i.e. LETTERS[17])
# then any of the following :
DT[37544, Q:=0] # using column name (often preferred)
DT[37544, 17:=0, with=FALSE] # using column number
col = "Q"
DT[37544, col:=0, with=FALSE] # variable holding name
col = 17
DT[37544, col:=0, with=FALSE] # variable holding number
set(DT,37544L,17L,0) # using set(i,j,value) in v1.8.0
set(DT,37544L,"Q",0)
But please do see the linked questions and the package's documentation to see how := is more general than this simple example; e.g., combining := with binary search in an i join.
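As a taste of that, continuing the DT above (and assuming the column names from the example):
setkey(DT, Q)            # key the table; sorts by Q, by reference
DT[J(0), flag := TRUE]   # binary-search join on Q == 0, assigning by reference to the matched rows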
Look up 'copy-on-write' in the context of R discussions related to memory. As soon as one part of a (potentially really large) data structure changes, a copy is made.
A useful rule of thumb is that if your largest object is N mb/gb/... large, you need around 3*N of RAM. Such is life with an interpreted system.
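You can actually watch those copies happen with tracemem(), which prints a line each time R duplicates the object (it needs an R build with memory profiling enabled, which the standard CRAN binaries have):
df <- data.frame(a = rnorm(10), b = rnorm(10))
tracemem(df)      # start tracing this object
df[3, 1] <- 0     # one or more "tracemem[...]" lines: the data frame was copied
untracemem(df)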
Years ago, when I had to handle large amounts of data on 32-bit machines with (relative to the data volume) relatively little RAM, I got good use out of early versions of the bigmemory package. It uses the 'external pointer' interface to keep large gobs of memory outside of R. That saves you not only the '3x' factor, but possibly more, as you may get away with non-contiguous memory (contiguous memory being the other thing R likes).
Data frames are the worst structure you can choose to make modifications to. Because of the rather complex handling of all their features (such as keeping row names in sync, partial matching, etc.), which is done in pure R code (unlike most other objects, which can go straight to C), they tend to force additional copies, as you can't edit them in place. Check R-devel for the detailed discussions on this; it has been discussed at length several times.
The practical rule is to never use data frames for large data, unless you treat them as read-only. You will be orders of magnitude more efficient if you work on vectors or matrices instead.
There is a type of object called an ffdf in the ff package which is basically a data.frame stored on disk. In addition to the other tips above, you can try that.
You can also try the RSQLite package.

Resources