Suppose I have the following simple data.table:
DT <- data.table(VAL = sample(c(1, 2, 3), 10, replace = TRUE),Group = c(rep("A",5),rep("B",5)))
I can calculate the mean per group via:
DT[,lapply(.SD,function(x){mean(x)}),by=Group]
I could also use:
DT[,lapply(.SD,function(x){sum(x)/.N}),by=Group]
But my question is, why does the following NOT work:
DT[,lapply(.SD,function(x){sum(x)/nrow(x)}),by=Group]
From my understanding, .SD is a sub-data.table of the full data.table, so inside function(x) I should be able to refer to the number of rows of x. In other words, why can I calculate sum(x) but not nrow(x) in .SD? I did not find anything in the documentation about this.
.SD is a data.table, but when you lapply over it, each x is a single column, i.e. a plain vector. nrow() returns NULL for a plain vector, so sum(x)/nrow(x) does not give the mean. If you use length(x) instead, you get the number of rows in the group.
DT[,lapply(.SD,function(x){sum(x)/length(x)}),by=Group]
# Group VAL
# 1: A 2.0
# 2: B 1.6
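To see the difference between .SD and the x that lapply passes along, here is a small sketch. The fixed values (instead of sample()) are my own assumption so that the result is reproducible.

```r
library(data.table)

# fixed values instead of sample(), so the numbers are reproducible
DT <- data.table(VAL = c(1, 2, 3, 1, 3), Group = c("A", "A", "A", "B", "B"))

# .SD itself is a data.table, so nrow() works on it
rows_per_group <- DT[, .(n = nrow(.SD)), by = Group]

# lapply() hands each column to the function as a plain vector
x <- DT$VAL
nrow_is_null <- is.null(nrow(x))            # nrow() of a vector is NULL
empty_result <- sum(x) / nrow(x)            # arithmetic with NULL gives numeric(0)

# length() is the vector counterpart of nrow()
means <- DT[, lapply(.SD, function(x) sum(x) / length(x)), by = Group]
```

So the division by nrow(x) does not raise an error by itself; it just produces a zero-length result instead of one mean per group.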
In R's data.table, one can chain multiple operations by putting together square brackets, each of which can use non-standard evaluation of e.g. column names for whatever is the current transformation in the chain. For example:
dt[,
.(agg1=mean(var1), agg2=mean(var2)),
by=.(col1, col2)
][,
.(agg2=ceiling(agg2), agg3=agg2^2)
]
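To make the chain above concrete, here is a runnable sketch with made-up data (the values of var1, var2, col1 and col2 are assumptions for illustration only). Note that agg2^2 in the second step sees the column agg2 produced by the first step, not the ceiling() computed next to it.

```r
library(data.table)

# hypothetical data, just to have something to chain on
dt <- data.table(col1 = c("a", "a", "b", "b"),
                 col2 = c("x", "x", "y", "y"),
                 var1 = c(1, 3, 5, 7),
                 var2 = c(1.5, 2.0, 2.5, 3.0))

res <- dt[,
  .(agg1 = mean(var1), agg2 = mean(var2)),
  by = .(col1, col2)
][,
  # both expressions refer to the agg2 column of the intermediate table
  .(agg2 = ceiling(agg2), agg3 = agg2^2)
]
```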
Suppose I want to do an operation which involves computing some function that would take as input a data.frame, which I want to put in a data.table chain taking the current object in the chain, while using these results in e.g. a by clause. In magrittr/dplyr chains, I can use a . to refer to the object that is passed to a pipe and do arbitrary operations with it, but in data.table the chains work quite differently.
For example, suppose I have this table and I want to compute a percentage of total across groups:
library(data.table)
dt = data.table(col1 = c(1,1,1,2,2,3), col2=10:15)
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
In this snippet, I referred to the full object name inside [ to compute the total of the column as sum(dt$col2), which bypasses the by part. But if it were in a chain and these columns were calculated through other operations, I wouldn't have been able to simply use dt like that as that's an external variable (that is, it's evaluated in an environment level outside the [ scope).
How can I refer to the current object/table inside a chain? To be clear, I am not asking about how to do this specific grouping operation, but about a general way of accessing the current object so as to use it in arbitrary functions.
We could use .SDcols or directly .SD and select the column as a vector with [[
dt[, .(agg1 = sum(col2)/sum(dt$col2)), by=.(col1)][,
.(agg1 = ceiling(.SD[["agg1"]]))]
Interesting question!
If I understand correctly, the OP wants to create complex chains of data.table expressions. One sample use case is to compute a percentage of total across groups in a chain like in
dt[, .(agg1 = sum(col2) / sum(dt$col2)), by = col1]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
Here col2 appears in two places. First, to compute the group totals, and second to compute the grand total.
data.table has a special symbol .SD which stands for Subset, Selfsame, or Self-reference of the Data (please, see the vignette Using .SD for Data Analysis for details and examples).
When used in grouping (by = ), .SD refers to the subset of the data within each group and thus cannot be used to refer to the whole (ungrouped) dataset. So,
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, .(agg1 = sum(col2) / sum(.SD$col2)), by = col1]
returns the trivial result 1 for each group.
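A quick check of this behaviour, as a sketch: with by, .SD only holds the rows of the current group; without by, it holds all rows.

```r
library(data.table)

dt <- data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)

# grouped: .SD is the per-group subset, so the ratio is always 1
grouped <- dt[, .(rows_in_SD = nrow(.SD),
                  ratio = sum(col2) / sum(.SD$col2)),
              by = col1]

# ungrouped: .SD is the whole table
rows_ungrouped <- dt[, nrow(.SD)]
```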
Chaining within a chain
However, I believe I have found a way to accomplish OP's goal by chaining within a chain:
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, .SD[, .(agg1 = sum(col2)), by = col1][, .(col1, agg1 = agg1 / sum(col2))]]
col1 agg1
1: 1 0.44
2: 2 0.36
3: 3 0.20
Here,
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[, j]
is a data.table chain with a data.table created on-the-fly, i.e., without assigning to a variable name, and with the j expression being another data.table chain
.SD[, .(agg1 = sum(col2)), by = col1][, .(col1, agg1 = agg1 / sum(col2))]
The inner chain consists of three parts:
1. .SD refers to the whole dataset.
2. The next expression computes the aggregates by group, returning a data.table with two columns, col1 and agg1.
3. The last expression divides agg1 by the grand total sum(col2). As col2 is not included in the result of step 2, it is taken from the whole dataset.
However, if you do not want to rely on scoping rules, you can achieve the same result by explicitly computing the grand total beforehand:
data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)[
, {tmp <- sum(col2); .SD[, .(agg1 = sum(col2) / tmp), by = col1]}]
Here, we use the fact that in data.table j can be an arbitrary expression (enclosed in curly braces).
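The same idea should carry over to the general case in the question, i.e. handing the current table of a chain to an arbitrary function. A minimal sketch, assuming a hypothetical helper grand_total() that expects a data.frame-like object; the first step of the chain is also made up, just so that the table in the second step is not available under a variable name.

```r
library(data.table)

dt <- data.table(col1 = c(1, 1, 1, 2, 2, 3), col2 = 10:15)

# hypothetical helper that takes a whole data.frame / data.table
grand_total <- function(df) sum(df$col2)

res <- dt[
  # step 1: some transformation, so the intermediate table has no name
  , .(col2 = col2 * 2), by = col1
][
  # step 2: no 'by' here, so .SD is the whole intermediate table
  , {
    tot <- grand_total(.SD)
    .SD[, .(share = sum(col2) / tot), by = col1]
  }
]
```

The grouping then happens inside the braces on .SD, while anything that needs the full table is computed before it.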
I created the following data.table as an example:
dt <- data.table(x = c(1, 12, 200, 1600))
dt[, y := " "]
My goal is to fill the y column with the x values extended by leading zeros such that each entry in y consists of four digits (i.e. 0001, 0012, 0200, 1600).
My idea is as follows:
dt[, y := x] # fill column with original values
dt[nchar(as.integer(x)) < 4, y := paste(paste(rep(0, 4-nchar(as.integer(x))), collapse=""), x, sep="")]
This command is supposed to check whether x consists of less than 4 digits, and, if so, generate the required number of zeros and paste them at the beginning of the string. Executing the statement however yields the message "Error in rep(0, 4 - nchar(as.integer(x))) : invalid 'times' argument".
I know that my basic idea is correct since the following command works properly:
dt[nchar(as.integer(x)) < 4, y := paste(paste(rep(0, 4), collapse=""), x, sep="")]
Here, I simply replaced the second argument in rep() by an arbitrary number (4 in this case).
Therefore, rep() obviously has some problems understanding the column reference made by x. Other functions (e.g. as.numeric() and many many more) don't have problems with this.
Thanks for any help!
Just use formatC:
library(data.table)
dt <- data.table(x = c(1, 12, 200, 1600))
dt[, y := formatC(x, width = 4, format = "d", flag = "0")] # pad with zeros to four digits
dt
x y
1: 1 0001
2: 12 0012
3: 200 0200
4: 1600 1600
I think the problem is that you're feeding rep() a vector of length>1.
There is probably a formatting function that does this directly. Below is a step-by-step workaround.
dt <- data.frame(x = c(1, 12, 200, 1600))
dt$times_to_rep<-4-nchar(dt$x)
dt$power_of_ten<-10^dt$times_to_rep
dt$zeros<-substring(dt$power_of_ten,2,nchar(dt$power_of_ten))
dt$y<-paste0(dt$zeros, dt$x)
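To illustrate why rep() complains, here is a small sketch: inside [ the expression in j is evaluated once for the whole column, so the 'times' argument becomes a vector. strrep() and sprintf() are two base R alternatives that are vectorised; using them here is my suggestion, not something from the question.

```r
library(data.table)

dt <- data.table(x = c(1, 12, 200, 1600))

# this is a vector with one entry per row, not a single number
n_missing <- 4 - nchar(as.integer(dt$x))

# rep() with a 'times' vector whose length differs from its first argument fails
rep_error <- tryCatch(rep(0, n_missing),
                      error = function(e) conditionMessage(e))

# strrep() repeats a string a different number of times per element
dt[, y_strrep := paste0(strrep("0", 4 - nchar(as.integer(x))), x)]

# sprintf() pads with zeros directly
dt[, y_sprintf := sprintf("%04d", as.integer(x))]
```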
Imagine I have a data.table DT that has columns a, b, c. I want to filter rows based on a (say, select only those with value "A") and then compute the sum of b by c. I can do this efficiently, using binary search for filtering, by
setkey(DT, a)
DT[.("A"), .(sum.b = sum(b)), by = .(c)]
What if I then want to filter rows based on the value of the newly obtained sum.b? If I want to keep rows where sum.b equals one of c(3, 4, 5), I can do that by saying
DT[.("A"), .(sum.b = sum(b)), by = .(c)][sum.b %in% c(3, 4, 5)]
but the latter operation uses a vector scan, which is slow. Is there a way to set keys "on the fly" while chaining? Ideally I would have
DT[.("A"), .(sum.b = sum(b)), by = .(c)][??set sum.b as key??][.(c(3, 4, 5))]
where I don't know the middle step.
The middle step you are asking about would be the following (DT here stands for the intermediate result that contains sum.b):
# unnamed args
DT[,.SD,,sum.b]
# named args
DT[j = .SD, keyby = sum.b]
# semi named
DT[, .SD, keyby = sum.b]
Yet you should benchmark it on your data, as it may be slower than a vector scan because the key has to be set first.
It looks like eddi already provided that solution in a comment. The FR mentioned by him is data.table#1105.
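Put together as one chain, this could look like the sketch below. The data is made up, and nomatch = 0L is my addition: without it the keyed lookup returns NA rows for the values 4 and 5 that have no match, which is not what the %in% filter does.

```r
library(data.table)

# made-up data with the columns a, b, c from the question
DT <- data.table(a = c("A", "A", "A", "A", "B", "B"),
                 b = c(1, 2, 3, 1, 5, 6),
                 c = c("u", "u", "v", "w", "u", "v"))
setkey(DT, a)

wanted <- c(3, 4, 5)

# filter by key, aggregate, then key the intermediate result on sum.b
keyed <- DT[.("A"), .(sum.b = sum(b)), by = .(c)][, .SD, keyby = sum.b]

# keyed lookup on sum.b; nomatch = 0L drops values without a match
res <- keyed[.(wanted), nomatch = 0L]
```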
While reading a data set using fread, I've noticed that sometimes I'm getting duplicated column names, for example (fread doesn't have check.names argument)
> data.table( x = 1, x = 2)
x x
1: 1 2
The question is: is there any way to remove 1 of 2 columns if they have the same name?
How about
dt[, .SD, .SDcols = unique(names(dt))]
This selects the first occurrence of each name (I'm not sure how you want to handle this).
As @DavidArenburg suggests in the comments above, you could use check.names=TRUE in data.table() or fread()
.SDcols approaches would return a copy of the columns you're selecting. Instead, just remove the duplicated columns by reference using :=.
dt[, which(duplicated(names(dt))) := NULL]
# x
# 1: 1
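For completeness, a small sketch of both routes on a toy table (the extra column y is an assumption to show that non-duplicated columns are left alone): dropping the later duplicates by position, or letting check.names make the names unique at creation time.

```r
library(data.table)

dt <- data.table(x = 1, x = 2, y = 3)

# TRUE marks every repeated occurrence of a name
dup <- duplicated(names(dt))

# drop the duplicated columns by position, by reference
dt[, which(dup) := NULL]

# alternatively, keep both columns but give them unique names
dt_checked <- data.table(x = 1, x = 2, check.names = TRUE)
```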
Different approaches:
1. Indexing
my.data.table <- my.data.table[ ,-2]
2. Subsetting
my.data.table <- subset(my.data.table, select = -2)
3. Making unique names if 1. and 2. are not ideal (when having hundreds of columns, for instance)
setnames(my.data.table, make.names(names = names(my.data.table), unique=TRUE))
4. Optionally, systematically delete variables whose names meet some criterion. Here, we get rid of all variables whose name ends with ".X", X being a digit (make.names numbers the duplicates starting at 1).
my.data.table <- subset(my.data.table,
select = !grepl(pattern = "\\.\\d$", x = names(my.data.table)))
This thread has discussed doing it for a data frame. I want to do something a little more complicated than that:
dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)) , B = rnorm(12, 5, 2))
dt2 <- dt[order(dt$A, dt$B)] # Sorting
# Always shows the factor from A
do.call(rbind, by(
dt2, dt2$A,
function(x) data.table(A = x[,A][1], B = x[,B][4])
)
)
# This is in reply to Vlo's comment below. If I do this, it returns both rows as NA
do.call(rbind,
by(dt2, dt2$A, function(x) x[4])
)
# Take the max value of B according to each factor A
do.call(rbind, by(dt2, dt2$A,
function(x) tail(x,1))
)
What are more efficient ways to do this with native data.table functions?
In data.table, you can refer to columns as if they are variables within the scope of dt. So, you don't need the $. That is,
dt2 = dt[order(A, B)] # no need for dt$
is sufficient. And if you want the 4th element of B for every group in A:
dt2[, list(B=B[4L]), by=A]
# A B
# 1: a NA
# 2: b 6.579446
# 3: c 6.378689
Refer to @Vlo's answer for your second question.
From the way you're using data.tables, it seems like you've not gone through any vignettes or talks. It'd be helpful for you to check out the Introduction and the FAQ vignettes or tutorials from the homepage; especially, Matt's #user2014 tutorial amidst others.
First statement makes no sense to me, here is the second
# Take the max value of B according to each factor A
dt2[, list(B=max(B)), by=A]
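Both answers can be tried on fixed data. In the sketch below the values of B are made up (instead of rnorm()) so that the results are reproducible; the .SD[.N] line is an extra that mirrors the tail(x, 1) call from the question.

```r
library(data.table)

# fixed values instead of rnorm(), so the results are reproducible
dt <- data.table(A = c(rep("a", 3), rep("b", 4), rep("c", 5)),
                 B = c(3, 1, 2, 8, 5, 7, 6, 10, 14, 11, 13, 12))
dt2 <- dt[order(A, B)]

# 4th element of B per group; group "a" has only 3 rows, so it gets NA
fourth <- dt2[, list(B = B[4L]), by = A]

# maximum of B per group
largest <- dt2[, list(B = max(B)), by = A]

# last row per group, like tail(x, 1) in the by() version
last_row <- dt2[, .SD[.N], by = A]
```

Since dt2 is sorted by B within A, the last row per group holds the same values as the maximum.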