Related
I'm trying to create a function that extracts a column from a big.matrix object in Rcpp (so that it can be analyzed in cpp before bringing the results to R), but I can't figure out how to get it to recognise NA's (they are now presented as -2147483648 - as shown in my minimal example below). It would be even better if I could access the function GetMatrixCols (src/bigmemory.cpp) straight from Rcpp, but I've yet to discover a way to do that.
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>
#include <bigmemory/isna.hpp>
using namespace Rcpp;
//Logic for extracting column from a Big Matrix object
template <typename T>
NumericVector GetColumn_logic(XPtr<BigMatrix> pMat, MatrixAccessor<T> mat, int cn) {
NumericVector nv(pMat->nrow());
for(int i = 0; i < pMat->nrow(); i++) {
if(isna(mat[cn][i])) {
nv[i] = NA_INTEGER;
} else {
nv[i] = mat[cn][i];
}
}
return nv;
}
//' Extract Column from a Big Matrix.
//'
//' #param pBigMat A bigmemory object address.
//' #param colNum Column Number to extract. Indexing starts from zero.
//' #export
// [[Rcpp::export]]
NumericVector GetColumn(SEXP pBigMat, int colNum) {
XPtr<BigMatrix> xpMat(pBigMat);
switch(xpMat->matrix_type()) {
case 1: return GetColumn_logic(xpMat, MatrixAccessor<char>(*xpMat), colNum);
case 2: return GetColumn_logic(xpMat, MatrixAccessor<short>(*xpMat), colNum);
case 4: return GetColumn_logic(xpMat, MatrixAccessor<int>(*xpMat), colNum);
case 6: return GetColumn_logic(xpMat, MatrixAccessor<float>(*xpMat), colNum);
case 8: return GetColumn_logic(xpMat, MatrixAccessor<double>(*xpMat), colNum);
default: throw Rcpp::exception("Unknown type detected for big.matrix object!");
}
}
/*** R
bm <- bigmemory::as.big.matrix(as.matrix(reshape2::melt(matrix(c(1:4,NA,6:20),4,5))))
bigmemory:::CGetType(bm#address)
bigmemory:::GetCols.bm(bm, 3)
GetColumn(bm#address, 2)
*/
That's a great one! Stay with me for a moment:
tl;dr: It works once fixed:
R> sourceCpp("/tmp/bigmemEx.cpp")
R> bm <- bigmemory::as.big.matrix(as.matrix(reshape2::melt(matrix(c(1:4,NA,6:20),4,5))))
R> bigmemory:::CGetType(bm#address)
[1] 4
R> bigmemory:::GetCols.bm(bm, 3)
[1] 1 2 3 4 NA 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
R> GetColumn(bm#address, 2)
[1] 1 2 3 4 NA 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
R>
The trouble starts at the inside. When you create your matrix as
matrix(c(1:4,NA,6:20),4,5)
what do you get? Integer!
R> matrix(c(1:4,NA,6:20),4,5)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 NA 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
R> class(matrix(c(1:4,NA,6:20),4,5))
[1] "matrix"
R> typeof(matrix(c(1:4,NA,6:20),4,5))
[1] "integer"
R>
Not a problem per se, but a problem once you remember that the IEEE 754standard has NaN defined for floating point only (correct if I'm wrong).
The other issue is that you reflexively used NumericVector in your, but operate on integers. Now R has NaN, and even NA, for floating point and integer, but 'normal libraries' outside of R do not. And a bigmemory by design represents things outside of R, you're stuck.
The fix is simple enough: use IntegerVector (or equivalently convert your integer data on input). Below is my altered version of your code.
// -*- mode: C++; c-indent-level: 4; c-basic-offset: 4; indent-tabs-mode: nil; -*-
#include <Rcpp.h>
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::depends(BH, bigmemory)]]
#include <bigmemory/MatrixAccessor.hpp>
#include <bigmemory/isna.hpp>
using namespace Rcpp;
//Logic for extracting column from a Big Matrix object
template <typename T>
IntegerVector GetColumn_logic(XPtr<BigMatrix> pMat, MatrixAccessor<T> mat, int cn) {
IntegerVector nv(pMat->nrow());
for(int i = 0; i < pMat->nrow(); i++) {
if(isna(mat[cn][i])) {
nv[i] = NA_INTEGER;
} else {
nv[i] = mat[cn][i];
}
}
return nv;
}
//' Extract Column from a Big Matrix.
//'
//' #param pBigMat A bigmemory object address.
//' #param colNum Column Number to extract. Indexing starts from zero.
//' #export
// [[Rcpp::export]]
IntegerVector GetColumn(SEXP pBigMat, int colNum) {
XPtr<BigMatrix> xpMat(pBigMat);
switch(xpMat->matrix_type()) {
case 1: return GetColumn_logic(xpMat, MatrixAccessor<char>(*xpMat), colNum);
case 2: return GetColumn_logic(xpMat, MatrixAccessor<short>(*xpMat), colNum);
case 4: return GetColumn_logic(xpMat, MatrixAccessor<int>(*xpMat), colNum);
case 6: return GetColumn_logic(xpMat, MatrixAccessor<float>(*xpMat), colNum);
case 8: return GetColumn_logic(xpMat, MatrixAccessor<double>(*xpMat), colNum);
default: throw Rcpp::exception("Unknown type detected for big.matrix object!");
}
}
/*** R
bm <- bigmemory::as.big.matrix(as.matrix(reshape2::melt(matrix(c(1:4,NA,6:20),4,5))))
bigmemory:::CGetType(bm#address)
bigmemory:::GetCols.bm(bm, 3)
GetColumn(bm#address, 2)
*/
Accessing a column of a big.matrix in Rcpp is not difficult,
you can for example get an std vector, an Armadillo vector or an Eigen vector
with the following code (there may exist cleaner code):
// [[Rcpp::depends(RcppEigen, RcppArmadillo, bigmemory, BH)]]
#include <RcppArmadillo.h>
#include <RcppEigen.h>
#include <bigmemory/BigMatrix.h>
#include <bigmemory/MatrixAccessor.hpp>
using namespace Rcpp;
using namespace arma;
using namespace Eigen;
using namespace std;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
ListOf<IntegerVector> AccessVector(SEXP pBigMat, int j) {
XPtr<BigMatrix> xpMat(pBigMat);
MatrixAccessor<int> macc(*xpMat);
int n = xpMat->nrow();
// Bigmemory
cout << "Bigmemory:";
for (int i = 0; i < n; i++) {
cout << macc[j][i] << ' ';
}
cout << endl;
// STD VECTOR
vector<int> stdvec(macc[j], macc[j] + n);
// ARMA VECTOR
Row<int> armavec(macc[j], n); // Replace Row by Col if you want
// EIGEN VECTOR
VectorXi eigenvec(n);
memcpy(&(eigenvec(0)), macc[j], n * sizeof(int));
return(List::create(_["Std vector"] = stdvec,
_["Arma vector"] = armavec,
_["Eigen vector"] = eigenvec));
}
AccessVector(bm#address, 2) gets you:
Bigmemory:1 2 3 4 -2147483648 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$`Std vector`
[1] 1 2 3 4 NA 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
$`Arma vector`
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
[1,] 1 2 3 4 NA 6 7 8 9 10 11 12 13 14 15
[,16] [,17] [,18] [,19] [,20]
[1,] 16 17 18 19 20
$`Eigen vector`
[1] 1 2 3 4 NA 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
You can see that C doesn't know about NAs but when returning to R, you keep them.
So, it depends on what operations you want to do in Rcpp on the columns. I think if you use directly Eigen or Armadillo operations, it should be OK, but you will certainly get lots of NAs in your result.
Maybe it would be clearer if you say what are these operations you want to do.
Im having some troubles using factors in functions, or just to make use of them in basic calculations. I have a data-frame something like this (but with as many as 6000 different factors).
df<- data.frame( p <- runif(20)*100,
q = sample(1:100,20, replace = T),
tt = c("e","e","f","f","f","i","h","e","i","i","f","f","j","j","h","h","h","e","j","i"),
ta = c("a","a","a","b","b","b","a","a","c","c","a","b","a","a","c","c","b","a","c","b"))
colnames(df)<-c("p","q","ta","tt")
Now price = p and quantity = q are my variables, and tt and ta are different factors.
Now, I would first like to find the average price per unit of q by each different factor in tt
(p*q ) / sum(q) by tt
This would in this case give me a list of 3 different sums, by a, b and c (I have 6000 different factors so I need to do it smart :) ).
I have tried using split to make lists, and in this case i can get each individual tt factor to contain the prices and another for the quantity, but I cant seem to get them to for example make an average. I've also tried to use tapply, but again I can't see how I can incorporate factors into this?
EDIT: I can see I need to clearify:
I need to find 3 sums, the average price pr. q given each factor, so in this simplified case it would be:
a: Sum of p*q for (Row (1,2,3, 7, 11, 13,14,18) / sum (q for row Row (1,2,3, 7, 11, 13,14,18)
So the result should be the average price for a, b and c, which is just 3 values.
I'd use plyr to do this:
library(plyr)
ddply(df, .(tt), mutate, new_col = (p*q) / sum(q))
p q ta tt new_col
1 73.92499 70 e a 11.29857879
2 58.49011 60 e a 7.66245932
3 17.23246 27 f a 1.01588711
4 64.74637 42 h a 5.93743967
5 55.89372 45 e a 5.49174103
6 25.87318 83 f a 4.68880732
7 12.35469 23 j a 0.62043207
8 1.19060 83 j a 0.21576367
9 84.18467 25 e a 4.59523322
10 73.59459 66 f b 10.07726727
11 26.12099 99 f b 5.36509998
12 25.63809 80 i b 4.25528535
13 54.74334 90 f b 10.22178577
14 69.45430 50 h b 7.20480246
15 52.71006 97 i b 10.60762667
16 17.78591 54 i c 5.16365066
17 0.15036 41 i c 0.03314388
18 85.57796 30 h c 13.80289670
19 54.38938 44 h c 12.86630433
20 44.50439 17 j c 4.06760541
plyr does have a reputation for being slow, data.table provides similar functionality, but much higher performance.
If I understood corectly you'r problem this should be the answer. Give it a try and responde, that I can adjust it if it's needed.
myRes <- function(tt) {
out <- NULL;
qsum <- sum(as.numeric(df[,"q"]))
psum <- sum(as.numeric(df[,"p"]))
for (var in tt) {
index <- which(df["tt"] == var)
out <- c(out, ((qsum *psum) / sum(df[index,"q"])))
}
return (out)
}
threeValue <- myRes(levels(df[, "tt"]));
I have a data frame which I then split into three (or any number) of dataframes.
What I’m trying to do is to automatically process each column in each dataframe and add lagged versions of existing variables.
For example if there were three variables in each data.frame (V1, V2, V3) I would like to automatically (without hardcoding) add V1.lag, V2.lag and V3.lag.
Here is what I have so far, but I’m stuck now.
Any help would be highly apprecaited.
dd<-data.frame(matrix(rnorm(216),72,3),c(rep("A",24),rep("B",24),rep("C",24)),c(rep("J",36),rep("K",36)));
colnames(dd) <- c("v1", "v2", "v3", "dim1", "dim2");
dd;
dds <- split(dd, dd$dim1);
dds;
# Missing step 1: Automatically create v1.lag, v2.lag, v3.lag, etc (if required)
Finally I would like to merge the three data frames into one big dataframe which will include newly created variables.
# Missing step 2: Merge data frames into single data frame
Any help would be highly appreciated.
EDIT:
In comments section I asked about moving averages instead of lags. here is the solution:
ma <- function(x, f=c(1,1,1)){as.numeric(filter(x, f, sides=1)/length(f));}
foo <- function(df, f = c(1,1,1)) {
nums <- sapply(df, is.numeric); ## which are numeric vars
nams <- paste(names(df)[nums], "ma", length(f), sep = "."); ## generate new names foo.ma
df[, nams] <- lapply(which(nums), function(id, df, f) ma(df[[id]], f = f), df = df, f = f); ## apply ma to each numeric variable
df; ## return
}
Here is one option:
## reuse #Andrie's clag() function as lag() is silly
clag <- function(x, n = 1) c(rep(NA, n), head(x, -n))
## wrapper function to do the addition of lag variables for single DF
foo <- function(df, n = 1) {
nums <- sapply(df, is.numeric) ## which are numeric vars
nams <- paste(names(df)[nums], "lag", sep = ".") ## generate new names foo.lag
df[, nams] <- lapply(which(nums), function(id, df, n) clag(df[[id]], n = n),
df = df, n = n) ## apply clag to each numeric variable
df ## return
}
lapply(dds, foo)
Which gives:
> lapply(dds, foo)
$A
v1 v2 v3 dim1 dim2 v1.lag v2.lag v3.lag
1 -1.15107343 1.47671548 -0.146501739 A J NA NA NA
2 -1.61068272 -0.85397093 -1.240187604 A J -1.15107343 1.47671548 -0.146501739
3 -1.23470282 -0.26194027 1.938344030 A J -1.61068272 -0.85397093 -1.240187604
4 -0.57874043 -0.44600138 0.326069423 A J -1.23470282 -0.26194027 1.938344030
5 0.16139066 -1.95804742 -0.744678169 A J -0.57874043 -0.44600138 0.326069423
6 -1.01497027 0.36850034 1.532640065 A J 0.16139066 -1.95804742 -0.744678169
7 0.72288058 -0.40115543 -0.686450596 A J -1.01497027 0.36850034 1.532640065
8 -0.51300447 0.19686310 0.441649595 A J 0.72288058 -0.40115543 -0.686450596
9 0.95439966 -2.03513002 -0.897784897 A J -0.51300447 0.19686310 0.441649595
10 -1.36736081 -0.41040962 -0.459403176 A J 0.95439966 -2.03513002 -0.897784897
11 0.59503846 0.28925760 -0.003095389 A J -1.36736081 -0.41040962 -0.459403176
12 -0.37951869 0.49551357 0.269412108 A J 0.59503846 0.28925760 -0.003095389
13 -0.52953401 -0.28433351 1.125505917 A J -0.37951869 0.49551357 0.269412108
14 -1.73466020 0.25442637 -1.094139749 A J -0.52953401 -0.28433351 1.125505917
15 0.08479137 -0.11688894 -1.034378216 A J -1.73466020 0.25442637 -1.094139749
16 -2.45854464 0.15806266 -2.275995527 A J 0.08479137 -0.11688894 -1.034378216
17 1.10663502 1.28587230 0.070334868 A J -2.45854464 0.15806266 -2.275995527
18 -0.01945585 1.63659116 -0.137040232 A J 1.10663502 1.28587230 0.070334868
19 0.59026606 -1.95724134 -0.480014930 A J -0.01945585 1.63659116 -0.137040232
20 -0.32245933 1.35372005 1.348717525 A J 0.59026606 -1.95724134 -0.480014930
21 -0.42560327 -1.30145328 2.020609480 A J -0.32245933 1.35372005 1.348717525
22 1.19550777 0.18417336 0.099232994 A J -0.42560327 -1.30145328 2.020609480
23 1.20198621 0.05926023 -0.171505810 A J 1.19550777 0.18417336 0.099232994
24 -1.00667141 1.32441782 0.056696824 A J 1.20198621 0.05926023 -0.171505810
$B
v1 v2 v3 dim1 dim2 v1.lag v2.lag v3.lag
25 0.7878614 0.10354576 -0.69308980 B J NA NA NA
26 0.5824551 0.42319616 0.42734938 B J 0.7878614 0.10354576 -0.69308980
27 -0.2769730 1.51559382 -0.64106570 B J 0.5824551 0.42319616 0.42734938
28 -0.5736416 -1.58745816 -1.13274631 B J -0.2769730 1.51559382 -0.64106570
29 -1.9082145 -0.26148604 -0.04699411 B J -0.5736416 -1.58745816 -1.13274631
30 -1.6254549 0.39390814 -1.79993619 B J -1.9082145 -0.26148604 -0.04699411
31 0.3963274 1.79667985 0.92873142 B J -1.6254549 0.39390814 -1.79993619
32 -0.5889415 -0.04690351 1.43394978 B J 0.3963274 1.79667985 0.92873142
33 0.4683819 -1.34023029 0.18749782 B J -0.5889415 -0.04690351 1.43394978
34 0.7373052 -0.93470320 -1.14528378 B J 0.4683819 -1.34023029 0.18749782
35 -0.7751348 -1.26533917 0.11246728 B J 0.7373052 -0.93470320 -1.14528378
36 1.7786627 -0.19757164 0.14150980 B J -0.7751348 -1.26533917 0.11246728
37 1.8570412 -2.15174901 1.07751105 B K 1.7786627 -0.19757164 0.14150980
38 0.5128697 0.40112948 -0.94826274 B K 1.8570412 -2.15174901 1.07751105
39 0.8710264 -0.59978467 0.54462858 B K 0.5128697 0.40112948 -0.94826274
40 -0.3711512 -0.15632337 0.15832543 B K 0.8710264 -0.59978467 0.54462858
41 1.4505624 0.20915835 2.59369653 B K -0.3711512 -0.15632337 0.15832543
42 0.0871329 0.25440471 0.30096063 B K 1.4505624 0.20915835 2.59369653
43 -0.7398342 -1.72678544 0.45534941 B K 0.0871329 0.25440471 0.30096063
44 0.1953264 -0.60560630 -0.36884626 B K -0.7398342 -1.72678544 0.45534941
45 -0.2702493 0.50747209 -0.50699830 B K 0.1953264 -0.60560630 -0.36884626
46 0.2987449 0.46347722 1.20725190 B K -0.2702493 0.50747209 -0.50699830
47 -0.5682779 -0.71470625 -0.07865078 B K 0.2987449 0.46347722 1.20725190
48 -1.5291983 1.80092050 -1.73317395 B K -0.5682779 -0.71470625 -0.07865078
$C
v1 v2 v3 dim1 dim2 v1.lag v2.lag v3.lag
49 0.06095825 -0.518263220 0.510999371 C K NA NA NA
50 0.40077713 0.477989115 0.855752036 C K 0.06095825 -0.518263220 0.510999371
51 0.06763037 0.802110426 -0.102536186 C K 0.40077713 0.477989115 0.855752036
52 -0.90530986 -0.005452101 -0.089703589 C K 0.06763037 0.802110426 -0.102536186
53 -0.79360209 0.299844218 -0.765164525 C K -0.90530986 -0.005452101 -0.089703589
54 1.34050298 -1.093705314 -0.955952912 C K -0.79360209 0.299844218 -0.765164525
55 0.45377712 0.054978470 0.382874895 C K 1.34050298 -1.093705314 -0.955952912
56 0.95283101 -0.564193352 1.458002944 C K 0.45377712 0.054978470 0.382874895
57 1.09157807 -1.351894599 -1.366084414 C K 0.95283101 -0.564193352 1.458002944
58 2.71993062 -1.126272793 1.374046159 C K 1.09157807 -1.351894599 -1.366084414
59 -0.04685281 0.423085481 -0.455903151 C K 2.71993062 -1.126272793 1.374046159
60 -0.31055449 0.818291875 0.400386018 C K -0.04685281 0.423085481 -0.455903151
61 -0.54904545 1.542272313 0.648135340 C K -0.31055449 0.818291875 0.400386018
62 -0.72914142 1.495482707 -0.212135011 C K -0.54904545 1.542272313 0.648135340
63 -0.27374611 -1.309254707 -0.005125047 C K -0.72914142 1.495482707 -0.212135011
64 0.87439910 -2.666588138 1.043778597 C K -0.27374611 -1.309254707 -0.005125047
65 1.07142042 0.446233778 -0.286784683 C K 0.87439910 -2.666588138 1.043778597
66 -0.10431808 0.510820156 0.405309569 C K 1.07142042 0.446233778 -0.286784683
67 -1.04006019 -0.041327622 1.202855549 C K -0.10431808 0.510820156 0.405309569
68 0.41084794 -0.376796559 -1.147032471 C K -1.04006019 -0.041327622 1.202855549
69 0.88329788 -0.344611311 1.862998306 C K 0.41084794 -0.376796559 -1.147032471
70 -0.67916248 1.396061431 0.697517685 C K 0.88329788 -0.344611311 1.862998306
71 3.55359528 -0.207825480 -0.949834845 C K -0.67916248 1.396061431 0.697517685
72 0.11329113 0.294747300 -0.955891419 C K 3.55359528 -0.207825480 -0.949834845
For the last bit, the combine step, save the above:
dds <- lapply(dds, foo)
then use do.call() to rbind() the individual data frames together, as in:
df2 <- do.call(rbind, dds)
which gives:
> head(df2)
v1 v2 v3 dim1 dim2 v1.lag v2.lag v3.lag
A.1 -1.1510734 1.4767155 -0.1465017 A J NA NA NA
A.2 -1.6106827 -0.8539709 -1.2401876 A J -1.1510734 1.4767155 -0.1465017
A.3 -1.2347028 -0.2619403 1.9383440 A J -1.6106827 -0.8539709 -1.2401876
A.4 -0.5787404 -0.4460014 0.3260694 A J -1.2347028 -0.2619403 1.9383440
A.5 0.1613907 -1.9580474 -0.7446782 A J -0.5787404 -0.4460014 0.3260694
A.6 -1.0149703 0.3685003 1.5326401 A J 0.1613907 -1.9580474 -0.7446782
Use the plyr package to do all of this one step:
library(plyr)
clag <- function(x, n=1)c(rep(NA, n), head(x, -n))
x <- ddply(dd, .(dim1), transform,
v1.lag=clag(v1), v2.lag=clag(v2), v3.lag=clag(v3))
head(x)
v1 v2 v3 dim1 dim2 v1.lag v2.lag v3.lag
1 0.4465910 -0.2564334 -0.9122640 A J NA NA NA
2 -0.3748563 -0.9461061 0.1641274 A J 0.4465910 -0.2564334 -0.9122640
3 -0.5010834 -0.4413026 -0.7509968 A J -0.3748563 -0.9461061 0.1641274
4 -0.5278584 -0.6377017 0.5528831 A J -0.5010834 -0.4413026 -0.7509968
5 -0.4290586 0.4687849 0.6885102 A J -0.5278584 -0.6377017 0.5528831
6 0.1179935 -0.2742456 -0.1945482 A J -0.4290586 0.4687849 0.6885102
I have a few questions/suggestions concerning data.table.
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15)
R) X[,list(sum(y)),by=list(x)]
x V1
1: q 6
2: w 9
3: e 6
I think it is too bad that one has to write
R) X[,list(y=sum(y)),by=list(x)]
x y
1: q 6
2: w 9
3: e 6
It should default to keeping the same column name (ie: y) where the function calls only one column, this would be a massive gain in most of the cases, typically in finance as we usually look as weighted sums or last time or...
=> Is there any variable I can set to default to this behaviour ?
When doing a selectI might want to do a calculus on few columns and apply another operation for all other columns.
I mean too bad that when I want this:
R) X = data.table(x=c("q","q","q","w","w","e"),y=1:6,z=10:15,t=20:25,u=30:35)
R) X
x y z t u
1: q 1 10 20 30
2: q 2 11 21 31
3: q 3 12 22 32
4: w 4 13 23 33
5: w 5 14 24 34
6: e 6 15 25 35
R) X[,list(y=sum(y),z=last(z),t=last(t),u=last(u)),by=list(x)] #LOOOOOOOOOOONGGGG
#EXPR
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
I cannot write it like...
R) X[,list(sum(y)),by=list(x),defaultFn=last] #defaultFn would be
applied to all remaniing columns
=> Can I do this somehow (may be setting an option)?
Thanks
On part 1, that's not a bad idea. We already do that for expressions in by, and something close is already on the list for j :
FR#2286 Inferred naming could apply to j=colname[...]
Find max per group and return another column
But if we did do that it would probably need to be turned on via an option, to maintain backwards compatibility. I've added a link in that FR back to this question.
On the 2nd part how about :
X[,c(y=sum(y),lapply(.SD,last)[-1]),by=x]
x y z t u
1: q 6 12 22 32
2: w 9 14 24 34
3: e 6 15 25 35
Please ask multiple questions separately, though. Each question on S.O. is supposed to be a single question.
I'm dealing with a categorical variable retrieved from a database and am wanting to use factors to maintain the "fullness" of the data.
For instance, I have a table which stores colors and their associated numerical ID
ID | Color
------+-------
1 | Black
1805 | Red
3704 | White
So I'd like to use a factor to store this information in a data frame such as:
Car Model | Color
----------+-------
Civic | Black
Accord | White
Sentra | Red
where the color column is a factor and the underlying data stored, rather than being a string, is actually c(1, 3704, 1805) -- to IDs associated with each color.
So I can create a custom factor by modifying the levels attribute of an object of the factor class to achieve this effect.
Unfortunately, as you can see in the example, my IDs are not incremented. In my application, I have ~30 levels and the maximum ID for one level is ~9,000. Because the levels are stored in an array for a factor, that means I'm storing an integer vector of length 9,000 with only 30 elements in it.
Is there any way to use a hash or list to accomplish this effect more efficiently? i.e. if I were to use a hash in the levels attribute of a factor, I could store all 30 elements with whatever indices I please without having to create an array of size max(ID).
Thanks in advance!
Well, I'm pretty sure you can't change how factors work. A factor always has level ids that are integer numbers 1..n where n is the number of levels.
...but you can easily have a translation vector to get to your color ids:
# The translation vector...
colorIds <- c(Black=1,Red=1805,White=3704)
# Create a factor with the correct levels
# (but with level ids that are 1,2,3...)
f <- factor(c('Red','Black','Red','White'), levels=names(colorIds))
as.integer(f) # 2 1 2 3
# Translate level ids to your color ids
colorIds[f] # 1805 1 1805 3704
Technically, colorIds does not need to define the names of the colors, but it makes it easier to have in one place since the names are used when creating the levels for the factor. You want to specify the levels explicitly so that the numbering of them matches even if the levels are not in alphabetical order (as yours happen to be).
EDIT It is however possible to create a class deriving from factor that has the codes as an attribute. Lets call this new glorious class foo:
foo <- function(x = character(), levels, codes) {
f <- factor(x, levels)
attr(f, 'codes') <- codes
class(f) <- c('foo', class(f))
f
}
`[.foo` <- function(x, ...) {
y <- NextMethod('[')
attr(y, 'codes') <- attr(x, 'codes')
y
}
as.integer.foo <- function(x, ...) attr(x,'codes')[unclass(x)]
# Try it out
set.seed(42)
f <- foo(sample(LETTERS[1:5], 10, replace=TRUE), levels=LETTERS[1:5], codes=101:105)
d <- data.frame(i=11:15, f=f)
# Try subsetting it...
d2 <- d[2:5,]
# Gets the codes, not the level ids...
as.integer(d2$f) # 105 102 105 104
You could then also fix print.foo etc...
In thinking about it, the only feature that a "level" needs to implement in order to have a valid factor is the [ accessor. So any object implementing the [ accessor could be viewed as a vector from the standpoint of any interfacing function.
I looked into the hash class, but saw that it uses the normal R behavior (as is seen in lists) of returning a slice of the original hash when only using a single bracket (while extracting the actual value when using the double bracket). However, it I were to override this using setMethod(), I was actually able to get the desired behavior.
library(hash)
setMethod(
'[' ,
signature( x="hash", i="ANY", j="missing", drop = "missing") ,
function(
x,i,j, ... ,
drop
) {
if (class(i) == "factor"){
#presumably trying to lookup the values associated with the ordered keys in this hash
toReturn <- NULL
for (k in make.keys(as.integer(i))){
toReturn <- c(toReturn, get(k, envir=x#.xData))
}
return(toReturn)
}
#default, just make keys and get from the environment
toReturn <- NULL
for (k in make.keys(i)){
toReturn <- c(toReturn, get(k, envir=x#.xData))
}
return(toReturn)
}
)
as.character.hash <- function(h){
as.character(values(h))
}
print.hash <- function(h){
print(as.character(h))
}
h <- hash(1:26, letters)
df <- data.frame(ID=1:26, letter=26:1, stringsAsFactors=FALSE)
attributes(df$letter)$class <- "factor"
attributes(df$letter)$levels <- h
> df
ID letter
1 1 z
2 2 y
3 3 x
4 4 w
5 5 v
6 6 u
7 7 t
8 8 s
9 9 r
10 10 q
11 11 p
12 12 o
13 13 n
14 14 m
15 15 l
16 16 k
17 17 j
18 18 i
19 19 h
20 20 g
21 21 f
22 22 e
23 23 d
24 24 c
25 25 b
26 26 a
> attributes(df$letter)$levels
<hash> containing 26 key-value pair(s).
1 : a
10 : j
11 : k
12 : l
13 : m
14 : n
15 : o
16 : p
17 : q
18 : r
19 : s
2 : b
20 : t
21 : u
22 : v
23 : w
24 : x
25 : y
26 : z
3 : c
4 : d
5 : e
6 : f
7 : g
8 : h
9 : i
>
> df[1,2]
[1] z
Levels: a j k l m n o p q r s b t u v w x y z c d e f g h i
> as.integer(df$letter)
[1] 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2
[26] 1
Any feedback on this? As best I can tell, everything's working. It looks like it works properly as far as printing, and the underlying data stored in the actual data.frame is untouched, so I don't feel like I'm jeopardizing anything there. I may even be able to get away with adding a new class into my package which just implements this accessor to avoid having to add a dependency on the hash class.
Any feedback or points on what I'm overlooking would be much appreciated.