I am trying to reshape data using dcast.data.table, but when I pass a predefined list of functions, dcast.data.table throws an error.
require(data.table)
require(Hmisc)
n <- 2
contributors <- 1:2
dates <- 2
DT <- data.table(ID = rep(rep(1:n, contributors), each = dates))
DT[, contributor := c(1,1,2,2,2,3)]
DT[, date := c(1,2,1,1,2,2)]
DT[, amount := rnorm(.N)]
DT[, rate := c(1,1,1,3,3,4)]
DT
# ID contributor date amount rate
# 1: 1 1 1 -1.3888607 1
# 2: 1 1 2 -0.2787888 1
# 3: 2 2 1 -0.1333213 1
# 4: 2 2 1 0.6359504 3
# 5: 2 2 2 -0.2842529 3
# 6: 2 3 2 -2.6564554 4
var.list <- as.list(Cs(amount, rate))
collapse <- function(x) paste(x, collapse = ',')
fun.list <- list(sum, collapse)
dcast.data.table(data = DT, ID + contributor ~ date,
fun.aggregate = fun.list,
value.var = var.list, fill = NA)
# Error in aggregate_funs(fun.call, lvals, sep, ...) :
# When 'fun.aggregate' and 'value.var' are both lists, 'value.var' must be either of length =1 or =length(fun.aggregate).
But the lengths are equal:
length(var.list) == length(fun.list)
# [1] TRUE
When fun.aggregate is defined directly in the dcast call, there is no problem:
dcast.data.table(data = DT, ID + contributor ~ date,
fun.aggregate = list(sum, collapse),
value.var = var.list, fill = NA)
# ID contributor amount_sum_1 amount_sum_2 rate_collapse_1 rate_collapse_2
# 1: 1 1 -1.3888607 -0.2787888 1 1
# 2: 2 2 0.5026291 -0.2842529 1,3 3
# 3: 2 3 NA -2.6564554 NA 4
I would like to know why this is happening and how I could bypass this error, so that I can use a predefined function list in dcast.data.table.
For what it's worth, you can build the call to dcast by hand, using substitute() to pass the user provided list literal to dcast, like so:
z = as.data.table(expand.grid(a=LETTERS[1:3],b=1:3,c=5:6,d=3:4,stringsAsFactors =FALSE))[sample(36,9)]
myfun = function(DT,fmla,funs,vars)
do.call("dcast", list(DT, fmla, fun.aggregate = substitute(funs), value.var = vars))
myfun(z,a~.,list(sum,mean),list('c','d'))
> a c_sum d_mean
> 1: A 24 3.500000
> 2: B 10 3.500000
> 3: C 18 3.333333
However, your user (i.e. whoever calls myfun() in this example) will still have to provide a list literal. This approach does not get around the internals of dcast, which walk the AST of the argument passed to fun.aggregate and expect a list literal there.
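Because dcast inspects the unevaluated fun.aggregate argument, one way to keep a "predefined" set of functions is to store the list literal itself as a quoted expression and splice it into the call. The following is only a sketch under that assumption: fun.expr is a hypothetical name, the data is a small made-up table, and behaviour may differ between data.table versions.

```r
library(data.table)

# Small made-up table with the same columns as in the question
DT <- data.table(
  ID          = c(1, 1, 2, 2, 2, 2),
  contributor = c(1, 1, 2, 2, 2, 3),
  date        = c(1, 2, 1, 1, 2, 2),
  amount      = c(10, 20, 30, 40, 50, 60),
  rate        = c(1, 1, 1, 3, 3, 4)
)
collapse <- function(x) paste(x, collapse = ",")

# Hypothetical name: the list literal is stored unevaluated, not as a list
fun.expr <- quote(list(sum, collapse))

# bquote() splices the literal into the call, so dcast sees
# fun.aggregate = list(sum, collapse) exactly as if it had been typed inline
res <- eval(bquote(
  dcast(DT, ID + contributor ~ date,
        fun.aggregate = .(fun.expr),
        value.var = list("amount", "rate"), fill = NA)
))
res
```

The functions are still chosen ahead of time, but what is stored is the expression rather than the evaluated list, which is the form dcast expects.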
I need to find the two lowest unique minima of the data column in a data.table that has two ids, id1 and id2:
n <- 12
set.seed(1234)
id1 <- rep(1:2, each = 6)
id2 <- rep(1:6, each = 2)
data <- 100+100*rnorm(n)
dt <- data.table(id1=id1, id2=id2, data=data)
Below is the function that, given the second id id2, calculates the two unique minima at the same time and returns them as a vector:
detect_two_lower <- function(ids, values){
dt <- data.table(ids, values)
dt <- dt[, .(V1=min(values, na.rm = T))
, by = ids
][order(V1)]
min_1 <- dt$V1[1]
min_2 <- dt$V1[2]
nn <- c(min_1 = min_1, min_2 = min_2)
}
detect_two_lower <- memoise(detect_two_lower)
Then apply the function on the data.table, grouping by = id1:
dt[, `:=` ( min_1 = detect_two_lower(id2, data)[1]
,min_2 = detect_two_lower(id2, data)[2])
, by = id1
]
The calculation runs as expected (see below). Note, however, that the code calls detect_two_lower twice with the same parameters. As a workaround I tried to minimize the reworking with memoise, but I would like to avoid this patch.
Is there a better way to accomplish the same result?
dt
id1 id2 data min_1 min_2
1: 1 1 -20.7065749 -134.5697703 -20.70657
2: 1 1 127.7429242 -134.5697703 -20.70657
3: 1 2 208.4441177 -134.5697703 -20.70657
4: 1 2 -134.5697703 -134.5697703 -20.70657
5: 1 3 142.9124689 -134.5697703 -20.70657
6: 1 3 150.6055892 -134.5697703 -20.70657
7: 2 4 42.5260040 0.1613555 10.99622
8: 2 4 45.3368144 0.1613555 10.99622
9: 2 5 43.5548001 0.1613555 10.99622
10: 2 5 10.9962171 0.1613555 10.99622
11: 2 6 52.2807300 0.1613555 10.99622
12: 2 6 0.1613555 0.1613555 10.99622
Return a list from the function
library(data.table)
detect_two_lower <- function(ids, values){
dt <- data.table(ids, values)
dt <- dt[, .(V1=min(values, na.rm = T)), by = ids][order(V1)]
as.list(dt$V1[1:2]) # keep only the two lowest minima so the list matches the two target columns
}
So you can assign them directly:
dt[, c('min_1', 'min_2') := detect_two_lower(id2, data), id1]
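The list-return idea can be checked end to end on the question's data. The sketch below uses a hypothetical helper, two_lowest(), that computes the per-id minima with split() and vapply() instead of a nested data.table, and returns exactly two list elements so the list length matches the two columns being assigned.

```r
library(data.table)

set.seed(1234)
dt <- data.table(id1 = rep(1:2, each = 6),
                 id2 = rep(1:6, each = 2),
                 data = 100 + 100 * rnorm(12))

# Hypothetical helper: minimum per id, sorted, first two returned as a list
two_lowest <- function(ids, values) {
  mins <- sort(vapply(split(values, ids), min, numeric(1), na.rm = TRUE))
  as.list(unname(mins[1:2]))
}

# One call per group fills both columns at once
dt[, c("min_1", "min_2") := two_lowest(id2, data), by = id1]
dt
```

Since the function is called only once per group, the memoise workaround is no longer needed.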
I have a single column in a data.table:
library(data.table)
DT <- data.table(con=c(1:5))
The result should be a data.table with a new column x calculated as follows: the first value of x is the first value of con (here: 1); the second value is the second value of con multiplied by the first value of x; the third value is the third value of con multiplied by the second value of x, and so on. Result:
DT <- data.table(con=c(1:5), x = c(1,2,6,24,120))
I tried using shift, but it did not help. Below are some lines of my code:
DT <- data.table(con=c(1:5))
DT[, x := shift(con,1, type = "lead")]
DT[, x := shift(x, 1)]
DT[, x := con * x]
You are looking for cumprod:
DT[,x:=cumprod(con)]
DT
con x
1: 1 1
2: 2 2
3: 3 6
4: 4 24
5: 5 120
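To see why cumprod matches the recurrence described in the question, the sketch below writes the recurrence as an explicit loop (x_loop is a hypothetical name used only for the comparison) and checks it against the cumprod column.

```r
library(data.table)

DT <- data.table(con = 1:5)

# The recurrence from the question, written out step by step
x_loop <- numeric(nrow(DT))
x_loop[1] <- DT$con[1]
for (i in 2:nrow(DT)) {
  x_loop[i] <- DT$con[i] * x_loop[i - 1]
}

# The vectorised equivalent
DT[, x := cumprod(con)]

all.equal(DT$x, x_loop)
```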
We can use the accumulate function from the purrr package.
library(data.table)
library(purrr)
DT <- data.table(con=c(1:5))
DT[, x := accumulate(con, `*`)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
Or the Reduce function from base R.
DT <- data.table(con=c(1:5))
DT[, x:= Reduce(`*`, con, accumulate = TRUE)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
For a data.table DT grouped by site, sorted by time t, I need to change the last value of a variable in each group. I assume it should be possible to do this by reference using :=, but I haven't found a way that works yet.
Sample data:
require(data.table) # using 1.8.11
DT <- data.table(site=c(rep("A",5), rep("B",4)),t=c(1:5,1:4),a=as.double(c(11:15,21:24)))
setkey(DT, site, t)
DT
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 15
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 24
The desired result is to change the last value of a in each group, for example to 999, so the result looks like:
# site t a
# 1: A 1 11
# 2: A 2 12
# 3: A 3 13
# 4: A 4 14
# 5: A 5 999
# 6: B 1 21
# 7: B 2 22
# 8: B 3 23
# 9: B 4 999
It seems like .I and/or .N should be used, but I haven't found a form that works. The use of := in the same statement as .I[.N] gives an error. The following gives me the row numbers where the assignment is to be made:
DT[, .I[.N], by=site]
# site V1
# 1: A 5
# 2: B 9
but I don't seem to be able to use this with a := assignment. The following give errors:
DT[.N, a:=999, by=site]
# Null data.table (0 rows and 0 cols)
DT[, .I[.N, a:=999], by=site]
# Error in `:=`(a, 999) :
# := and `:=`(...) are defined for use in j, once only and in particular ways.
# See help(":="). Check is.data.table(DT) is TRUE.
DT[.I[.N], a:=999, by=site]
# Null data.table (0 rows and 0 cols)
Is there a way to do this by reference in data.table? Or is this better done another way in R?
Currently you can use:
DT[DT[, .I[.N], by = site][['V1']], a := 999]
# or, avoiding the overhead of a second call to `[.data.table`
set(DT, i = DT[,.I[.N],by='site'][['V1']], j = 'a', value = 999L)
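The first form can be read as two steps: collect the last row number of each group, then assign by reference at those rows. A minimal sketch on the question's sample data, with last_rows as a hypothetical intermediate name:

```r
library(data.table)

DT <- data.table(site = c(rep("A", 5), rep("B", 4)),
                 t = c(1:5, 1:4),
                 a = as.double(c(11:15, 21:24)))
setkey(DT, site, t)

# Step 1: .I holds row numbers in the full table, .N is the group size,
# so .I[.N] is the row number of the last row in each group
last_rows <- DT[, .I[.N], by = site]$V1

# Step 2: assign by reference at exactly those rows
DT[last_rows, a := 999]
DT
```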
Alternative approaches:
Use replace:
DT[, a := replace(a, .N, 999), by = site]
or shift the replacement to the RHS, wrapped by {} and return the full vector
DT[, a := {a[.N] <- 999L; a}, by = site]
or use mult='last' and take advantage of by-without-by. This requires the data.table to be keyed by the groups of interest.
DT[unique(site), a := 999, mult = 'last']
There is a feature request #2793 that would allow
DT[, a[.N] := 999]
but this is yet to be implemented
I have a data.table and want to apply a function to each column within each subset of rows.
Normally one would do as follows: DT[, lapply(.SD, function), by = y]
But in my case the function does not return a single value but a vector.
Is there a way to do something like this?
library(data.table)
set.seed(9)
DT <- data.table(x1=letters[sample(x=2L,size=6,replace=TRUE)],
x2=letters[sample(x=2L,size=6,replace=TRUE)],
y=rep(1:2,3), key="y")
DT
# x1 x2 y
#1: a a 1
#2: a b 1
#3: a a 1
#4: a a 2
#5: a b 2
#6: a a 2
DT[, lapply(.SD, table), by = y]
# Desired Result, something like this:
# x1_a x2_a x2_b
# 3 2 1
# 3 2 1
Thanks in advance, and also: I would not mind if the result of the function must have a fixed length.
You simply need to unlist the table and then coerce back to a list:
> DTCounts <- DT[, as.list(unlist(lapply(.SD, table))), by=y]
> DTCounts
y x1.a x2.a x2.b
1: 1 3 2 1
2: 2 3 2 1
If you do not like the dots in the names, you can sub them out:
> setnames(DTCounts, sub("\\.", "_", names(DTCounts)))
> DTCounts
y x1_a x2_a x2_b
1: 1 3 2 1
2: 2 3 2 1
Note that if not all values in a column are present for each group (i.e., if x2=c("a", "b") when y=1, but x2=c("b", "b") when y=2), then the above breaks.
The solution is to make the columns factors before counting.
# check which columns are already factors
DT[, lapply(.SD, is.factor)]
# convert the columns that are not
columnsToConvert <- c("x1", "x2") # or .. <- setdiff(names(DT), "y")
DT <- cbind(DT[, lapply(.SD, factor), .SDcols=columnsToConvert], y=DT[, y])
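The effect of the factor conversion can be seen on a small made-up table in which the value "a" is missing from one group. This sketch converts the column by reference with :=, which is a different route from the cbind call but has the same intent; table() on a factor reports a zero count for an absent level, so every group yields the same columns.

```r
library(data.table)

# Made-up data: "a" never occurs when y == 2
DT <- data.table(x2 = c("a", "b", "b", "b"), y = c(1, 1, 2, 2))

# Convert to factor so every group knows about both levels
DT[, x2 := factor(x2)]

counts <- DT[, as.list(unlist(lapply(.SD, table))), by = y]
counts
```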
I am trying to get many lm models to work inside a function, and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that contain only two distinct values: a constant and NA.
My reproducible code:
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
> df
x y z d
1: 1 1 NA 2
2: 2 1 NA 2
3: 3 NA NA 2
4: NA NA NA 2
5: 5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value when NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
x y
1: 1 1
2: 2 1
3: 3 NA
4: NA NA
5: 5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
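The same uniqueN test can also be used to build a vector of column names to keep, which is convenient if the names are needed later. This is a sketch only: keep is a hypothetical name, and the character column s is added here (it is not in the question) to show that the test is not limited to numeric columns.

```r
library(data.table)

df <- data.table(x = c(1, 2, 3, NA, 5),
                 y = c(1, 1, NA, NA, NA),
                 z = c(NA, NA, NA, NA, NA),
                 d = c(2, 2, 2, 2, 2),
                 s = c("a", "a", NA, "a", "a"))  # extra column, assumption

# TRUE for columns with more than one distinct non-NA value
keep <- names(df)[vapply(df, function(v) uniqueN(v, na.rm = TRUE) > 1L,
                         logical(1))]

df1 <- df[, keep, with = FALSE]
df1
```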
There is a simple solution using the Filter function in base R.
library(data.table)
df <- data.table(x=c(1,2,3,NA,5), y=c(1,1,NA,NA,NA),z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
# Select only columns for which SD is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[,which(df[,
unlist(
sapply(.SD,function(x) length(unique(x[!is.na(x)])) >1))]),
with=FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table, we count the number of unique non-NA values and keep only the columns that have more than one.
If you really mean DROPPING those columns, here is a solution:
library(data.table)
dt <- data.table(x=c(1,2,3,NA,5),
y=c(1,1,NA,NA,NA),
z=c(NA,NA,NA,NA,NA),
d=c(2,2,2,2,2))
for (col in names(copy(dt))){
v = var(dt[[col]], na.rm = TRUE)
if (v == 0 | is.na(v)) dt[, (col) := NULL]
}
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
Final code:
same <- sapply( df, function(.col){ all( is.na(.col) | .col[1L] == .col ) } )
df1 <- df[,!same, with=F]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
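The reason the one-character change matters is how all() treats NA. A short base R sketch on the y column from the question:

```r
y <- c(1, 1, NA, NA, NA)

# Comparing against NA yields NA, and all() over TRUE and NA is NA,
# so the original test never returns TRUE for this column
original_test <- all(is.na(y)) || all(y[1L] == y)

# TRUE | NA is TRUE, so combining element-wise with is.na()
# removes the NAs before all() sees them
fixed_test <- all(is.na(y) | y[1L] == y)

original_test
fixed_test
```

Here original_test is NA and fixed_test is TRUE, so only the fixed version marks y as constant.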
To remove constant columns in Python with pandas:
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print (len(constant_col))
print (constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print (len(constant_col))
print (constant_col)
Then you drop the columns using the drop method
library(janitor)
df %>%
remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5