data.table - group by all except one column - r

Can I group by all columns except one using data.table? I have a lot of columns, so I'd rather avoid writing out all the colnames.
The reason is that I'd like to collapse duplicates in a table where I know one column has no relevance.
library(data.table)
DT <- structure(list(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A",
"B", "C")), .Names = c("N", "val", "collapse"), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
> DT
N val collapse
1: 1 50 A
2: 2 60 B
3: 2 60 C
That is, given DT, is there something like DT[, print(.SD), by = !collapse] which gives:
> DT[, print(.SD), .(N, val)]
collapse
1: A
collapse
1: B
2: C
without actually having to specify .(N, val)? I realise I can do this by copy and pasting the column names, but I thought there might be some elegant way to do this too.

To group by all columns except one, you can use:
by = setdiff(names(DT), "collapse")
Explanation: setdiff takes the general form setdiff(x, y), which returns all values of x that are not in y. Here it returns all column names except the collapse column.
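Applied to the question's DT, a minimal sketch that collapses the duplicates (toString is my choice here for joining the collapsed values, not something from the question):

```r
library(data.table)

DT <- data.table(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A", "B", "C"))

# group by every column except 'collapse', joining the collapsed values
res <- DT[, .(collapse = toString(collapse)), by = setdiff(names(DT), "collapse")]
res
#    N val collapse
# 1: 1  50        A
# 2: 2  60     B, C
```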
Two alternatives:
# with '%in%'
names(DT)[!names(DT) %in% "collapse"]
# with 'is.element'
names(DT)[!is.element(names(DT), "collapse")]
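All three expressions return the same character vector of grouping columns; checked on the question's DT:

```r
library(data.table)

DT <- data.table(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A", "B", "C"))

nm1 <- setdiff(names(DT), "collapse")
nm2 <- names(DT)[!names(DT) %in% "collapse"]
nm3 <- names(DT)[!is.element(names(DT), "collapse")]

nm1                  # "N" "val"
identical(nm1, nm2)  # TRUE
identical(nm2, nm3)  # TRUE
```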

Adding something to a list of dates in a column

Suppose a data.table is:
dt <- structure(list(type = c("A", "B", "C"), dates = c("21-07-2011",
"22-11-2011,01-12-2011", "07-08-2012,14-08-2012,18-08-2012,11-10-2012"
)), class = c("data.table", "data.frame"), row.names = c(NA, -3L))
Check it:
type dates
1: A 21-07-2011
2: B 22-11-2011,01-12-2011
3: C 07-08-2012,14-08-2012,18-08-2012,11-10-2012
I need to add, say, 5 to each of the dates in second column, ie, I want the result to be as under:
type dates
1: A 26-07-2011
2: B 27-11-2011,06-12-2011
3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
Any help would be appreciated.
Using only basic R you can do:
dt$dates = sapply(dt$dates, function(x) {
  dates = as.Date(strsplit(x, ",")[[1]], format = "%d-%m-%Y")
  paste(format(dates + 5, "%d-%m-%Y"), collapse = ",")
})
Result:
> dt
type dates
1: A 26-07-2011
2: B 27-11-2011,06-12-2011
3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
This procedure is practically the same as the one given by akrun, but without the extra libraries.
Grouped by 'type', we split the 'dates' on "," (with strsplit), convert to a Date class object with dmy (from lubridate), add 5, format back to the original format of the data, paste into a single string, and assign (:=) to update the 'dates' column in the dataset:
library(lubridate)
library(data.table)
dt[, dates := paste(format(dmy(unlist(strsplit(dates, ","))) + 5,
'%d-%m-%Y'), collapse=','), by = type]
dt
# type dates
#1: A 26-07-2011
#2: B 27-11-2011,06-12-2011
#3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
Another option, without splitting, converting to Date, or reformatting, is a regex method with gsubfn:
library(gsubfn)
dt[, dates := gsubfn("^(\\d+)", ~ as.numeric(x) + 5,
gsubfn(",(\\d+)", ~sprintf(",%02d", as.numeric(x) + 5), dates))]
dt
# type dates
#1: A 26-07-2011
#2: B 27-11-2011,06-12-2011
#3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
NOTE: The second method should be faster, as we are not splitting, converting, pasting, etc. However, it does pure integer arithmetic on the day field, so it only works while day + 5 stays within the month (e.g. 28-08-2012 would become the invalid 33-08-2012).

r - rowwise iteration on a data.table, again

There are at least a couple of Q/As that are similar to this but I can't seem to get the hang of it. Here's a reproducible example. DT holds the data. I want food(n) = food(n-1) * xRatio.food(n)
DT <- fread("year c_Crust xRatio.c_Crust
X2005 0.01504110 NA
X2010 NA 0.9883415
X2015 NA 1.0685221
X2020 NA 1.0664189
X2025 NA 1.0348418
X2030 NA 1.0370386
X2035 NA 1.0333771
X2040 NA 1.0165511
X2045 NA 1.0010563
X2050 NA 1.0056368")
The code that gets closest to the formula is
DT[,res := food[1] * cumprod(xRatio.food[-1])]
but the res value is shifted up, and the first value is recycled to the last row with a warning. I want the first value of res to be NA.
I'd rename/reshape...
myDT = melt(DT, id = "year", meas=list(2,3),
variable.name = "food",
value.name = c("value", "xRatio"))[, food := "c_Crust"][]
# or for this example with only one food...
myDT = DT[, .(year, food = "c_Crust", xRatio = xRatio.c_Crust, value = c_Crust)]
... then do the computation per food group with the data in long form:
myDT[, v := replace(first(value)*cumprod(replace(xRatio, 1, 1)), 1, NA), by=food]
# or more readably, to me anyways
library(magrittr)
myDT[, v := first(value)*cumprod(xRatio %>% replace(1, 1)) %>% replace(1, NA), by=food]
Alternately, there's myDT[, v := c(NA, first(value)*cumprod(xRatio[-1])), by=food], extending the OP's code, though I prefer just operating on full-length vectors with replace rather than trying to build vectors with c, since the latter can run into weird edge cases (like if there is only one row, will it do the right thing?).
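As a quick check of the replace-based computation, here is a trimmed-down version of the question's data (three years only, already in long form):

```r
library(data.table)

# trimmed-down version of the question's table, already in long form
myDT <- data.table(year   = c("X2005", "X2010", "X2015"),
                   food   = "c_Crust",
                   value  = c(0.0150411, NA, NA),
                   xRatio = c(NA, 0.9883415, 1.0685221))

# first value stays NA; each later row is value[1] times the
# cumulative product of the ratios up to that row
myDT[, v := replace(first(value) * cumprod(replace(xRatio, 1, 1)), 1, NA), by = food]
myDT$v
```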

sample data.table rows with different conditions

I have a data.table with multiple columns. One of these columns currently works as a 'key' (keyb for the example). Another column (say A) may or may not contain data. I would like to supply a vector of keys and, for each key in the vector that appears in the table, randomly sample two rows: one where A contains data and one where it does not.
MRE:
#data.table
trys <- structure(list(keyb = c("x", "x", "x", "x", "x", "y", "y", "y",
"y", "y"), A = c("1", "", "1", "", "", "1", "", "", "1", "")), .Names = c("keyb",
"A"), row.names = c(NA, -10L), class = c("data.table", "data.frame"
))
setkey(trys,keyb)
#list with keys
list_try <- structure(list(a = "x", b = c("r", "y","x")), .Names = c("a", "b"))
I could, for instance subset the data.table based on the elements that appear in list_try:
trys[keyb %in% list_try[[2]]]
My original (and probably inefficient) idea was to chain a sample of two rows per key, where the A column has data or no data, and then merge. But it does not work:
#here I was trying to sample rows based on whether A has data or not
#here for rows where A has no data
trys[keyb %in% list_try[[2]]][nchar(A)==0][sample(.N, 2), ,by = keyb]
#here for rows where A has data
trys[keyb %in% list_try[[2]]][nchar(A)==1][sample(.N, 2), ,by = keyb]
In this case, my expected output would be two data.tables (one for a and one for b in list_try), with two rows per matching key: the data.table for a would have two rows (one with and one without data in A), and the one for b four rows (two with and two without data in A).
Please let me know if I can make this post any clearer
You could add A to the by statement too, converting it to a binary vector with A != "". Combined with a binary join (adding nomatch = 0L to remove non-matches), you can then sample from the row index .I within those two groups and subset the original data set.
For a single subset case
trys[trys[list_try[[2]], nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1]
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x
For the more general case, when you want to create separate data sets according to a list of keys, you can embed this in lapply:
lapply(list_try,
function(x) trys[trys[x, nomatch = 0L, sample(.I, 1L), by = .(keyb, A != "")]$V1])
# $a
# keyb A
# 1: x 1
# 2: x
#
# $b
# keyb A
# 1: y 1
# 2: y
# 3: x 1
# 4: x

What is a concise and clear idiom for mapping values into a data.table

A relatively common task is to need to assign ("map") a manual value to each row based on a lookup into a small map.
In data.table the most obvious ways of doing this create very convoluted code, and so I wonder if I'm missing an idiom that produces this result with clearer coding.
Consider this example, where we start with a large data.table that has a Name column containing either a, b, c, or d.
library(data.table)
DT = data.table(ID = 1:4000, Name = rep(letters[1:4],1000), X = rnorm(4000))
setkey(DT, ID)
We now want to assign the scores (a = 1, b = 4, c = 6, d = 3) in an extra column. Perhaps, joining to a small table like this:
Weights = data.table(Name = c("a", "b", "c", "d"), W = c(1,4,6,3))
setkey(DT, Name)
setkey(Weights, Name)
DT = Weights[DT]
However, note that we have had to muck up the key of DT to do this, and the columns are reordered. So the job is not complete until we do:
setkey(DT, ID)
setcolorder(DT, c("ID", "Name", "X", "W"))
And even though this has got there in the end the weight setting is problematic because the values and names are not joined together, which is asking for a typo. Something like this would be better:
WeightList = list(a = 1, b = 3, c = 6, d = 3)
But how can we then look up from this list into DT?
At first glance it looks like we can do
DT[, W := WeightList[Name]]
> DT
ID Name X W
1: 1 a -0.05006513 1
2: 2 b 0.01637769 3
3: 3 c 2.18922366 6
4: 4 d 0.18327623 3
5: 5 a -1.44108171 1
---
3996: 3996 d 0.70507702 3
3997: 3997 a 0.42989246 1
3998: 3998 b 1.31611236 3
3999: 3999 c -1.43431163 6
4000: 4000 d 0.32244477 3
But that W column is not well formed, and simple operations on it don't work
> DT[, W + 1]
Error in W + 1 : non-numeric argument to binary operator
Using the on argument along with := was designed with these cases in mind, i.e., no need to reorder (setting keys for a join) or copy the entire data.table (as happens when you don't use :=) just to add column(s).
require(data.table)
DT = data.table(ID = 1:4000, Name = rep(letters[1:4],1000), X = rnorm(4000))
setkey(DT, ID)
Weights = data.table(Name = c("a", "b", "c", "d"), W = c(1,4,6,3))
DT[Weights, W := W, on="Name"]
key(DT) # [1] "ID"
DT is updated by reference, and key is retained.
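To see that the update happens in place and the key survives, a small sketch (hypothetical 8-row table, same shape as the question's; the i. prefix makes explicit that W comes from Weights, which is equivalent to the answer's W := W here since DT has no W column yet):

```r
library(data.table)

DT <- data.table(ID = 1:8, Name = rep(letters[1:4], 2), X = rnorm(8))
setkey(DT, ID)
Weights <- data.table(Name = c("a", "b", "c", "d"), W = c(1, 4, 6, 3))

# update join: look up W by Name without re-keying DT
DT[Weights, W := i.W, on = "Name"]

key(DT)             # still "ID"
DT[Name == "b", W]  # 4 for every 'b' row
```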
You are assigning a list column to DT and hence can't add integers to it (without using unlist first at least).
You could change your list to a regular named integer/numeric vector and your code will work just fine. For instance:
WeightList <- c(a = 1, b = 3, c = 6, d = 3)
Or a bit more robust method creating this vector could be
WeightList <- setNames(c(1, 3, 6, 3), letters[1:4])
Then, your code as before
DT[, W := WeightList[Name]]
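With the named numeric vector, the lookup yields an ordinary numeric column, so arithmetic works; a minimal sketch on a small 8-row table (my own shrunken version of the question's DT):

```r
library(data.table)

DT <- data.table(ID = 1:8, Name = rep(letters[1:4], 2))
WeightList <- setNames(c(1, 3, 6, 3), letters[1:4])

DT[, W := WeightList[Name]]  # numeric column, not a list column

class(DT$W)  # "numeric"
DT[, W + 1]  # works now
```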

Create column names based on "by" argument the data.table way

Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a, and b with the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table command, similar to .N, that refers to the value of the current by group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table-way with a for-loop instead of a by-argument would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}
Based on the example shown, we can use dcast from data.table to convert the long format to wide, and join with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or, as @CathG mentioned in the comments, for previous versions either merge or set the key column and then join.
merge(dt, dcast.data.table(dt, val~var, value.var='val'))
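Putting the merge variant together end-to-end on the question's dt (merging on the shared val column):

```r
library(data.table)

dt <- data.table(var = c("a", "b"), val = c(1, 2))

# wide table: one column per value of 'var'
wide <- dcast(dt, val ~ var, value.var = "val")

# merge on the shared 'val' column
res <- merge(dt, wide, by = "val")
res
#    val var  a  b
# 1:   1   a  1 NA
# 2:   2   b NA  2
```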
