I am trying to understand why I can't order by a new variable that I create in the same call.
Currently I need two lines: one to create the new variable and one to order by it.
Can this be done in a single line in data.table?
DF <- data.table(ID = c(1,2,1,2,1,1,1,1,2), Value = c(1,1,1,1,1,1,1,1,1))
newDF <- DF[order(-Count), .(Count = .N), by = ID]
# Gives error: Error in eval(v, x, parent.frame()) : object 'Count' not found
# Works Correctly
newDF <- DF[, .(Count = .N), by = ID]
newDF <- newDF[order(-Count)]
> newDF
ID Count
1: 1 6
2: 2 3
The error occurs because i is evaluated before j, so Count does not exist yet when order(-Count) runs. You can chain both operations in a single line:
DF[, .(Count = .N), by = ID][order(-Count)]
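As a minimal sketch of what the chained call does, the two steps can be written out separately. The names counts, sorted and chained are only illustrative; setorder() is shown as an alternative that sorts the aggregated table by reference.

```r
library(data.table)

DF <- data.table(ID = c(1, 2, 1, 2, 1, 1, 1, 1, 2), Value = 1)

# Step 1: j builds the Count column, one row per ID
counts <- DF[, .(Count = .N), by = ID]

# Step 2: the second [] operates on the aggregated table, where Count exists
sorted <- counts[order(-Count)]

# The same two steps chained in one expression
chained <- DF[, .(Count = .N), by = ID][order(-Count)]

# Alternative: sort the aggregated table in place, descending by Count
by_ref <- DF[, .(Count = .N), by = ID]
setorder(by_ref, -Count)
```

All three results should contain the same rows in the same order; the chained form just avoids the intermediate name.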
So I'm new to data.table and don't understand how I can modify by reference at the same time that I perform an operation on chosen columns using the .SD symbol. I have two examples.
Example 1
> DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
> DT
group1:1 group1:2 group2:1
1: 1 1 1
Let's say, for example, that I simply want to choose only the columns whose names contain "group1:". I know it's pretty straightforward to just reassign the result of the operation to the same object, like so:
cols1 <- names(DT)[grep("group1:", names(DT))]
DT <- DT[, .SD, .SDcols = cols1]
From reading the data.table vignette on reference semantics, my understanding is that the above does not modify by reference, whereas a similar operation that uses := would do so. Is this accurate? If so, is there a better way to do this operation that does modify by reference? In trying to figure this out I got stuck on how to combine the .SD symbol and the := operator. I tried
DT[, c(cols1) := .SD, .SDcols = cols1]
DT[, c(cols1) := lapply(.SD,function(x)x), .SDcols = cols1]
neither of which gave the result I wanted.
Example 2
Say I want to perform a different operation dcast that uses .SD as input. Example data table:
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> DT
x y z
1: 1 A 5
2: 2 A 6
3: 1 B 7
4: 2 B 8
Again, I know I can just reassign like so:
> DT <- dcast(DT, x ~ y, value.var = "z")
> DT
x A B
1: 1 5 7
2: 2 6 8
But don't understand why the following does not work (or whether it would be preferable in some circumstances):
> DT <- data.table(x = c(1,2,1,2), y = c("A","A","B","B"), z = 5:8)
> cols <- c("x", unique(DT$y))
> DT[, cols := dcast(.SD, x ~ y, value.var = "z")]
In your example,
cols1 <- names(DT)[grep("group1:", names(DT))]
DT[, c(cols1) := .SD, .SDcols = cols1] # not this
DT[, (cols1) := .SD, .SDcols = cols1] # this will work
Below is another example, which replaces NA values with 0 in the numeric columns (selected through .SDcols) by reference.
The trick is to store the vector of column names first and wrap it in parentheses on the left-hand side of :=.
colnames = DT[, names(.SD), .SDcols = is.numeric] # column name vector
DT[, (colnames) := lapply(.SD, nafill, fill = 0), .SDcols= is.numeric]
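For Example 1 in the question, a hedged sketch of doing both things by reference: update the chosen columns with (cols1) := lapply(.SD, ...), then drop the remaining columns with := NULL rather than reassigning DT. The multiplication by 10 and the name drop_cols are assumptions made only so the effect is visible; address() is used to check that the table object was not copied.

```r
library(data.table)

DT <- data.table("group1:1" = 1, "group1:2" = 1, "group2:1" = 1)
cols1 <- grep("group1:", names(DT), value = TRUE)

addr_before <- address(DT)

# Update the chosen columns in place: the parentheses make data.table
# evaluate cols1 instead of creating a column literally named "cols1"
DT[, (cols1) := lapply(.SD, function(x) x * 10), .SDcols = cols1]

# Drop every other column by reference instead of reassigning DT
drop_cols <- setdiff(names(DT), cols1)
DT[, (drop_cols) := NULL]

# Same memory address as before, so the table itself was not copied
same_object <- address(DT) == addr_before
```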
This may have been asked before and I have looked through Reference semantics but I can't seem to find the answer. SO also suggested revising my title, so I will be fine if someone posts a link to the answer!
I have an MWE below. I am trying to aggregate column val by the day of each month. From my understanding, in SCENARIO 1 in the code below, since I am not assigning the result of lapply to any new column through :=, the data.table is only printed.
However, in SCENARIO 2, when I assign new column variables by reference using := the new columns are created (with the correct values) but the value is repeated for every hour of the day, when I want just the daily values.
SCENARIO 3 also gives the desired result, but requires the creation of a new data.table.
I also wouldn't think of set because value iterates by row, and I need to group certain columns.
Thanks for any help,
library(data.table)
library(magrittr)
set.seed(123)
# create data.table to group by
dt <- data.table(year = rep(2018, times = 24 * 31),
month = rep(1, times = 24 * 31),
day = rep(1:31, each = 24),
hour = rep(0:23, times = 31)) %>%
.[, val := sample(100, size = .N, replace = TRUE)]  # .N instead of nrow(dt): dt does not exist yet at this point
# SCENARIO 1
# creates desired dataframe but only prints it, doesn't modify dt by reference (because it is missing `:=`)
dt[, lapply(.SD,
sum),
.SDcols = "val",
by = .(year,
month,
day)]
# Scenario 2
# creates desired val column, but creates duplicate val values for all rows of original grouping by data.table
dt[, val := lapply(.SD,
sum),
.SDcols = "val",
by = .(year,
month,
day)]
# SCENARIO 3
# this also works, but requires creating a new data.table
new_dt <- dt[, lapply(.SD,
sum),
.SDcols = "val",
by = .(year,
month,
day)]
I don't see any problem with creating a new data.table object; you can assign it to the same name to overwrite the original.
dt <- dt[, lapply(.SD,
sum),
.SDcols = "val",
by = .(year,
month,
day)]
Currently you cannot change the number of rows of a data.table by reference; you have to reassign, as in dt <- unique(dt), according to the discussion in this feature request: https://github.com/Rdatatable/data.table/issues/635.
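A small sketch of the difference between the scenarios, on made-up data with two days of three rows each. The column name day_total and the table name daily are only illustrative.

```r
library(data.table)

dt <- data.table(day = rep(1:2, each = 3), val = 1:6)

# := keeps every row, so the group total is recycled within each day
dt[, day_total := sum(val), by = day]
rows_after_assign <- nrow(dt)

# Aggregating in j without := returns one row per group as a new table
daily <- dt[, .(val = sum(val)), by = day]
rows_after_aggregate <- nrow(daily)
```

Because := never changes the number of rows, the one-row-per-day result has to live in a new object (or be assigned back to dt).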
I want to group-by a data table by an id column and then count how many times each id occurs. This can be done as follows:
dt <- data.table(id = c(1, 1, 2))
dt_by_id <- dt[, .N, by = id]
dt_by_id
id N
1: 1 2
2: 2 1
That works fine, but I want the N column to have a different name (e.g. count). The help says:
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in
advance and for convenience generally. When grouping by i, .N is the
number of rows in x matched to, for each row of i, regardless of
whether nomatch is NA or 0. It is renamed to N (no dot) in the result
(otherwise a column called ".N" could conflict with the .N variable,
see FAQ 4.6 for more details and example), unless it is explicitly
named; ... .
How to "explicitly name" the N-column when creating the dt_by_id data table? (I know how to rename it afterwards.) I tried
dt_by_id <- dt[, count = .N, by = id]
but this led to
Error in `[.data.table`(dt, , count = .N, by = id) :
unused argument (count = .N)
You have to wrap the output of your calculation in a list if you want to give it your own name:
dt[, .(count=.N), by = id]
This is identical to dt[, list(count=.N), by = id], if you prefer; . is an alias for list here.
If the table has already been created with the default name, use setnames to rename the column by reference:
setnames(dt_by_id, "N", 'count')
or use rename from dplyr:
library(dplyr)
dt_by_id %>%
rename(count = N)
# id count
#1: 1 2
#2: 2 1
With dplyr::count(x, ..., name = "new_name") you can replace the default column name n with a name of your choice.
library(dplyr)
dt <- data.frame(id = c(1, 1, 2))
dt %>%
  count(id, name = "count")
Suppose I have the following data.table:
player_id prestige_score_0 prestige_score_1 prestige_score_2 prestige_score_3 prestige_score_4
1: 100284 0.0001774623 2.519792e-03 5.870781e-03 7.430179e-03 7.937716e-03
2: 103819 0.0001774623 1.426482e-03 3.904329e-03 5.526974e-03 6.373850e-03
3: 100656 0.0001774623 2.142518e-03 4.221423e-03 5.822705e-03 6.533448e-03
4: 104745 0.0001774623 1.084913e-03 3.061197e-03 4.383649e-03 5.091851e-03
5: 104925 0.0001774623 1.488457e-03 2.926728e-03 4.360301e-03 5.068171e-03
I want to find the difference between the values in each pair of adjacent columns, starting from column prestige_score_0.
A single step should look like this: df[,prestige_score_0] - df[,prestige_score_1]
How can I do this in data.table (saving the differences as a data.table and keeping player_id as well)?
This is how you can do this in a tidy way:
# make it tidy
df2 <- melt(df,
id = "player_id",
variable.name = "column_name",
value.name = "prestige_score")
# extract numbers from column names
df2[, score_number := as.numeric(gsub("prestige_score_", "", column_name))]
# compute differences by player
df2[, diff := prestige_score - shift(prestige_score, n = 1L, type = "lead"),
by = player_id]
# if necessary, reshape back to original format
dcast(df2, player_id ~ score_number, value.var = c("prestige_score", "diff"))
You can subtract a shifted version of the whole dt from itself:
dt = data.table(id=c("A","B"),matrix(rexp(10, rate=.1), ncol=5))
dt_shift = data.table(id=dt[,id], dt[, 2:(ncol(dt)-1)] - dt[,3:ncol(dt)])
You could use a for loop -
# ncol(df) - 2: skip player_id, and the last score column has no right-hand neighbour
for (i in seq_len(ncol(df) - 2)) {
  df[[paste0("diff_", i - 1, "_", i)]] <- df[[paste0("prestige_score_", i - 1)]] -
    df[[paste0("prestige_score_", i)]]
}
This might not be the most efficient if you have a lot of columns though.
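A loop-free sketch of the same idea, assuming the score columns appear in the table in increasing order. The toy values and the names score_cols, left, right, diffs and result are made up for illustration.

```r
library(data.table)

# Toy data with the same column layout as the question (values are made up)
df <- data.table(player_id = c(100284, 103819),
                 prestige_score_0 = c(1, 2),
                 prestige_score_1 = c(3, 5),
                 prestige_score_2 = c(6, 9))

score_cols <- grep("^prestige_score_", names(df), value = TRUE)
left  <- score_cols[-length(score_cols)]  # every score column except the last
right <- score_cols[-1]                   # every score column except the first

# Subtract each column's right-hand neighbour from it, one pair at a time
diffs <- Map(function(a, b) df[[a]] - df[[b]], left, right)
names(diffs) <- paste0("diff_", seq_along(left) - 1, "_", seq_along(left))

# Keep player_id next to the differences in a new data.table
result <- as.data.table(c(list(player_id = df$player_id), diffs))
```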
I am trying to do a min/max aggregate on a dynamically chosen column in a data.table. It works perfectly for numeric columns but I cannot get it to work on Date columns unless I create a temporary data.table.
It works when I use the name:
dt <- data.table(Index=1:31, Date = seq(as.Date('2015-01-01'), as.Date('2015-01-31'), by='days'))
dt[, .(minValue = min(Date), maxValue = max(Date))]
# minValue maxValue
# 1: 2015-01-01 2015-01-31
It does not work when I use with=FALSE:
colName = 'Date'
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)), :
# could not find function "."
I can use .SDcols on a numeric column:
colName = 'Index'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# minValue maxValue
# 1: 1 31
But I get an error when I do the same thing for a Date column:
colName = 'Date'
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
If I use lapply(.SD, min) or sapply() then the dates are changed to numbers.
The following works and does not seem to waste memory and is fast. Is there anything better?
a <- dt[, colName, with=F]
setnames(a, 'a')
a[, .(minValue = min(a), maxValue = max(a))]
On your first attempt:
dt[, .(minValue = min(colName), maxValue = max(colName)), with=F]
# Error in `[.data.table`(dt, , .(minValue = min(colName), maxValue = max(colName)), :
# could not find function "."
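With with = FALSE, j is no longer evaluated as an expression inside the data.table; it is treated as a vector of column names or positions, so the .() alias is not available there. The Introduction to data.table vignette explains what with= means, and it is easier to follow if you know the with() function from base R.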
On the second one:
dt[, .(minValue = min(.SD), maxValue = max(.SD)), .SDcols=colName]
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
This seems to be an issue with calling min() and max() on a data.frame/data.table that has a column with attributes (such as a Date). Here's an MRE.
df = data.frame(x=as.Date("2015-01-01"))
min(df)
# Error in FUN(X[[i]], ...) :
# only defined on a data frame with all numeric variables
To answer your question, you can use get():
dt[, .(min = min(get(colName)), max = max(get(colName)))]
Or, as @Frank suggested, use the [[ operator to subset the column:
dt[, .(min = min(.SD[[colName]]), max = max(.SD[[colName]]))]
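One way to package this for reuse is sketched below. col_range is a hypothetical helper, not part of data.table; it restricts .SD to the requested column with .SDcols and extracts that single column with [[, so min() and max() receive a plain vector and the Date class is kept.

```r
library(data.table)

dt <- data.table(Index = 1:31,
                 Date = seq(as.Date("2015-01-01"), as.Date("2015-01-31"), by = "days"))

# Hypothetical helper: min and max of a column chosen at run time
col_range <- function(x, col_name) {
  # .SD holds only the requested column, so .SD[[1L]] is that column as a vector
  x[, .(minValue = min(.SD[[1L]]), maxValue = max(.SD[[1L]])), .SDcols = col_name]
}

date_range  <- col_range(dt, "Date")
index_range <- col_range(dt, "Index")
```

Going through .SDcols rather than get() is a design choice here: it avoids a name lookup that could pick up a column that happens to share the name of the function argument.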
There's not yet a nicer way of applying .SD to multiple functions (because base R doesn't seem to have one AFAICT, and data.table tries to use base R functions as much as possible). There's a FR #1063 to address this issue. If/when that gets implemented, then one could do, for example:
# NOTE: not yet implemented, FR #1063
dt[, colwise(.SD, min, max), .SDcols = colName]