Create column names based on "by" argument the data.table way

Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a and b, holding the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table-command similar to .N which addresses the value of the by-group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table-way with a for-loop instead of a by-argument would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}

Based on the example shown, we can use dcast from data.table to convert the long format to wide, and then join the result with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or, as @CathG mentioned in the comments, for earlier versions either use merge or set the key column and then join.
merge(dt, dcast.data.table(dt, val~var, value.var='val'))
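The join on 'val' above relies on each value of val identifying exactly one row. As a hedged sketch for the case where that does not hold, one option is to cast on a temporary row id instead. The column name rid and the extra third row are assumptions made for illustration, not part of the original example.

```r
library(data.table)

# Toy data with a repeated val (the third row is an added assumption)
dt <- data.table(var = c("a", "b", "a"), val = c(1, 2, 2))

# Hypothetical helper column: one id per original row
dt[, rid := .I]

# Long to wide, one output row per original row
wide <- dcast(dt, rid ~ var, value.var = "val")

# Join back on the row id, then drop the helper column from both tables
res <- dt[wide, on = "rid"][, rid := NULL]
dt[, rid := NULL]
res
```

Casting on the row id keeps one output row per input row, so no aggregation function is needed even when val repeats.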

Related

Applying a function to the whole data table by groups

Let's suppose the following data table:
a = runif(40)
b = c(rep(NA,5), runif(5), rep(NA,3),runif(3),NA,runif(3), c(rep(NA,3), runif(7), rep(NA,4), runif(3), NA,NA, runif(1)))
c = rep(1:4,each=10)
DT = data.table(a,b,c)
I want to eliminate the rows with the first NA values in b for every unique value in c (first NAs when c==1, when c==2...), but not the rows with the NAs that come after.
I can do it by using a loop:
for(i in unique(DT$c))
{
first_NA = which(DT$c==i)[1]
last_NA = which(!is.na(DT[,b]) & DT$c==i)[1] - 1
DT = DT[-c(first_NA:last_NA)]
}
But I wonder if there is any simpler way of doing this by using a function for the whole data table using groups (by in data table or groupby in dplyr), without just applying it to columns.
Thank you!

You can filter out the first NA values in b through
DT[, .SD[cumsum( !is.na(b) ) != 0], by = .(c)]
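To see why the cumsum filter drops only the leading NAs, here is a small deterministic sketch. The toy values are assumptions chosen for illustration, and the second variant, which collects row numbers with .I, is an alternative idiom rather than part of the answer above.

```r
library(data.table)

# Group 1 starts with two NAs and has a later NA; group 2 starts with a value
DT <- data.table(b = c(NA, NA, 1, NA, 2, NA, 3, 4), c = rep(1:2, each = 4))

# cumsum(!is.na(b)) stays at 0 until the first non-NA value in each group,
# so only the leading NA rows are filtered out
res <- DT[, .SD[cumsum(!is.na(b)) != 0], by = c]

# Alternative: collect the row numbers to keep, then subset once
keep <- DT[, .I[cumsum(!is.na(b)) != 0], by = c]$V1
res_idx <- DT[keep]
```

The .I variant keeps the original column order (b, c), whereas the .SD variant puts the grouping column first.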

You have to mark these rows, then keep those not marked.
# mark the leading NA rows of b within each group of c
DT[, by=c,
flag := cumsum(!is.na(b)) == 0 # TRUE until the first non-NA value of b
]
# discard marked
DT <- DT[(!flag)]
# remove flag
DT[, flag:=NULL]
or in one line
DT[, by=c, flag:=cumsum(!is.na(b)) == 0][(!flag)][, flag:=NULL]

How to construct an empty data.table with the column names of an existing data.table?

I would like to create an empty data.table in R with column names taken from another, existing data.table.
Somehow I could not find a solution for that.
I would like to do something like this:
require(data.table)
dt1 <- data.table(fn = c("A","B","C"), x = c(1,2,3), y = c(2,3,4), a = 1, b = 2, c = 3)
dt2 <- data.table(names=colnames(dt1)) # Gives 6 rows instead of 6 cols
How can this be achieved?
Thanks!

You can also take zero rows of your existing dt1 and keep the result as dt2:
dt2 <- dt1[0,]
dt2
Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
It isn't precisely what you wanted, but it is one solution.

One option could be:
dt2 <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))), names(dt1))
Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
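The two approaches above both give zero rows and the same column names, but they differ in column types. The following sketch compares them; the variable names by_slice and by_matrix are made up for illustration.

```r
library(data.table)

dt1 <- data.table(fn = c("A", "B", "C"), x = c(1, 2, 3), y = c(2, 3, 4),
                  a = 1, b = 2, c = 3)

# Zero-row slice: keeps the names and the column types of dt1
by_slice <- dt1[0]

# Empty matrix route: keeps the names, but every column is logical
by_matrix <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))),
                      names(dt1))

sapply(by_slice, class)
sapply(by_matrix, class)
```

If rows will later be appended with rbind or rbindlist, the zero-row slice is probably the safer template, since its column types already match dt1.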

How to explicitly name the count column generated by the .N function?

I want to group-by a data table by an id column and then count how many times each id occurs. This can be done as follows:
dt <- data.table(id = c(1, 1, 2))
dt_by_id <- dt[, .N, by = id]
dt_by_id
id N
1: 1 2
2: 2 1
That works fine, but I want the N column to have a different name (e.g. count). The help says:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in
advance and for convenience generally. When grouping by i, .N is the
number of rows in x matched to, for each row of i, regardless of
whether nomatch is NA or 0. It is renamed to N (no dot) in the result
(otherwise a column called ".N" could conflict with the .N variable,
see FAQ 4.6 for more details and example), unless it is explicitly
named; ... .
How to "explicitly name" the N-column when creating the dt_by_id data table? (I know how to rename it afterwards.) I tried
dt_by_id <- dt[, count = .N, by = id]
but this led to
Error in `[.data.table`(dt, , count = .N, by = id) :
unused argument (count = .N)

You have to wrap the output of your calculation in a list if you want to give it your own name:
dt[, .(count=.N), by = id]
This is identical to dt[, list(count=.N), by = id], if you prefer; . is an alias for list here.
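The same list form extends to several named summary columns at once. A minimal sketch, assuming a hypothetical extra column x that is not in the original example:

```r
library(data.table)

# x is a hypothetical value column added for illustration
dt <- data.table(id = c(1, 1, 2), x = c(10, 20, 30))

# Each element of .() becomes a named column in the grouped result
dt_by_id <- dt[, .(count = .N, total = sum(x)), by = id]
dt_by_id
```

Naming every element inside .() avoids the default names (N, V1, V2, ...) that data.table would otherwise assign.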

If the count column has already been created, rename it with setnames
setnames(dt_by_id, "N", 'count')
or with rename from dplyr
library(dplyr)
dt_by_id %>%
rename(count = N)
# id count
#1: 1 2
#2: 2 1

Using dplyr::count(x, name = "new_name") replaces the default column name n with a name of your choice.
library(dplyr)
dt <- data.frame(id = c(1, 1, 2))
dt %>%
dplyr::count(id, name = "count")

data.table in r : subset using column index

DT is a data.table with columns "A" (column index 1), "B" (column index 2), "C", etc.
For example, the following code creates a subset DT1 consisting of the rows where A == 2:
DT1 <- DT[A==2, ]
But how can I make a subset like DT1 using only the column index?
For example, code like the following does not work:
DT1 <- DT[.SD==2, .SDcols = 1]

It is not recommended to use column indices instead of column names, as it makes your code difficult to understand and fragile to any changes that could happen to your data. (See, for example, the first paragraph of the first question in the package FAQ.) However, you can subset by column index as follows:
DT = data.table(A = 1:5, B = 2:6, C = 3:7)
DT[DT[[1]] == 2]
# A B C
#1: 2 3 4

We can get the row index with .I and use that to subset the DT
DT[DT[, .I[.SD==2], .SDcols = 1]]
# A B C
#1: 2 3 4
data
DT <- data.table(A = 1:5, B = 2:6, C = 3:7)
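If the index comes from elsewhere in a script, one hedged option is to resolve it to a column name once and filter by name from then on. The variable names idx and col_name below are assumptions for illustration.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

idx <- 1L                      # hypothetical column index supplied by the caller
col_name <- names(DT)[idx]     # resolve the index to a name once

# get() looks the column up by name inside the data.table's scope
res <- DT[get(col_name) == 2]
res
```

Resolving the index up front keeps the positional dependency in a single place, so the rest of the code still reads in terms of a column name.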

What is a concise and clear idiom for mapping values into a data.table

A relatively common task is to assign ("map") a manual value to each row based on a lookup in a small map.
In data.table the most obvious ways of doing this create very convoluted code, and so I wonder if I'm missing an idiom that produces this result with clearer coding.
Consider this example, where we start with a large data.table that has a Name column containing either a, b, c, or d.
library(data.table)
DT = data.table(ID = 1:4000, Name = rep(letters[1:4],1000), X = rnorm(4000))
setkey(DT, ID)
We now want to assign the scores (a = 1, b = 4, c = 6, d = 3) in an extra column. Perhaps, joining to a small table like this:
Weights = data.table(Name = c("a", "b", "c", "d"), W = c(1,4,6,3))
setkey(DT, Name)
setkey(Weights, Name)
DT = Weights[DT]
However, note that we have had to muck up the index for DT to do this, and the columns are reordered. So the job is not complete until we do:
setkey(DT, ID)
setcolorder(DT, c("ID", "Name", "X", "W"))
And even though this gets there in the end, the weight setting is problematic because the values and names are not kept side by side, which is asking for a typo. Something like this would be better:
WeightList = list(a = 1, b = 3, c = 6, d = 3)
But how can we then look up from this list into DT?
At first glance it looks like we can do
DT[, W := WeightList[Name]]
> DT
ID Name X W
1: 1 a -0.05006513 1
2: 2 b 0.01637769 3
3: 3 c 2.18922366 6
4: 4 d 0.18327623 3
5: 5 a -1.44108171 1
---
3996: 3996 d 0.70507702 3
3997: 3997 a 0.42989246 1
3998: 3998 b 1.31611236 3
3999: 3999 c -1.43431163 6
4000: 4000 d 0.32244477 3
But that W column is not well formed, and simple operations on it don't work
> DT[, W + 1]
Error in W + 1 : non-numeric argument to binary operator

Using the on argument together with := was designed with these cases in mind: there is no need to reorder (set a key for the join) or to copy the entire data.table (as happens when you don't use :=) just to add column(s).
require(data.table)
DT = data.table(ID = 1:4000, Name = rep(letters[1:4],1000), X = rnorm(4000))
setkey(DT, ID)
Weights = data.table(Name = c("a", "b", "c", "d"), W = c(1,4,6,3))
DT[Weights, W := W, on="Name"]
key(DT) # [1] "ID"
DT is updated by reference, and key is retained.
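A smaller version of the same update join makes the result easy to check by eye. This sketch uses the i. prefix to state explicitly that W comes from the Weights table; the 8-row DT is an assumption for illustration.

```r
library(data.table)

DT <- data.table(ID = 1:8, Name = rep(letters[1:4], 2))
setkey(DT, ID)

Weights <- data.table(Name = c("a", "b", "c", "d"), W = c(1, 4, 6, 3))

# Update join: for each matching row of DT, take W from the i table (Weights)
DT[Weights, W := i.W, on = "Name"]

key(DT)    # the key on ID is untouched
names(DT)  # the new column is appended at the end
```

Rows of DT whose Name has no match in Weights would be left as NA in W.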

You are assigning a list column to DT and hence can't add integers to it (without using unlist first at least).
You could replace your list with an ordinary named numeric vector, and your code will work just fine. For instance
WeightList <- c(a = 1, b = 3, c = 6, d = 3)
Or a slightly more robust way of creating this vector would be
WeightList <- setNames(c(1, 3, 6, 3), letters[1:4])
Then your code works as before:
DT[, W := WeightList[Name]]
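As a quick check that the named-vector lookup yields an ordinary numeric column, here is a minimal sketch on a 4-row table. The small DT and the use of unname() to drop the vector names before assignment are additions for illustration.

```r
library(data.table)

DT <- data.table(ID = 1:4, Name = letters[1:4])
WeightList <- c(a = 1, b = 3, c = 6, d = 3)

# Indexing a named vector by Name returns one weight per row;
# unname() drops the names so W is a plain numeric column
DT[, W := unname(WeightList[Name])]

plus_one <- DT[, W + 1]   # arithmetic now works
plus_one
```

A Name that is missing from WeightList would give NA in W rather than an error, so it may be worth checking anyNA(DT$W) afterwards.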