Value as column names in data.table - r

I have the following data.table:
dat<-data.table(Y=as.factor(c("a","b","a")),"a"=c(1,2,3),"b"=c(3,2,1))
It looks like:
Y a b
1: a 1 3
2: b 2 2
3: a 3 1
What I want is to subtract 1 from the value in the column named by Y. E.g. the Y value of the first row is "a", so the value of column "a" in the first row should be reduced by one.
The result should be:
Y a b
1: a 0 3
2: b 2 1
3: a 2 1
Is this possible? If yes, how? Thank you!

Using self-joins and get:
for (yval in dat[ , unique(Y)]){
dat[yval, (yval) := get(yval) - 1L, on = "Y"]
}
dat[]
# Y a b
# 1: a 0 3
# 2: b 2 1
# 3: a 2 1

We can use melt/dcast to do this. After creating a row sequence ('N'), melt the dataset to 'long' format, subtract 1 from the 'value' column where the 'Y' and 'variable' elements are equal, assign (:=) the output to 'value', then dcast the 'long' format back to 'wide'.
dcast(melt(dat[, N := 1:.N], id.var = c("Y", "N"))[Y==variable,
value := value -1], N + Y ~variable, value.var = "value")[, N := NULL][]
# Y a b
#1: a 0 3
#2: b 2 1
#3: a 2 1

First, an apply call makes the actual transformation. We apply by row and use the first element as the name of the element to access and overwrite. For some reason the values I was accessing in a and b were strings, so I used as.numeric to turn them into numbers. I don't know if this is normal in data.tables or a result of using apply on one, since I don't normally use data.tables.
tformDat <- apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})
Then you need to reformat back to the original data.table format
data.table(t(tformDat))
The whole thing can be done in one line.
data.table(t(apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})))
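On the strings: apply() first converts its input with as.matrix(), and a matrix can hold only one type, so the factor column Y forces everything to character. A minimal sketch of that, plus converting the columns back afterwards (the names `m` and `res` are only for illustration):

```r
library(data.table)

dat <- data.table(Y = as.factor(c("a", "b", "a")), a = c(1, 2, 3), b = c(3, 2, 1))

# apply() works on as.matrix(dat); the factor column makes the matrix character
m <- as.matrix(dat)
is.character(m)

# Same transformation as above, then restore numeric columns a and b
res <- data.table(t(apply(dat, 1, function(x) {
  x[x[1]] <- as.numeric(x[x[1]]) - 1
  x
})))
res[, c("a", "b") := lapply(.SD, as.numeric), .SDcols = c("a", "b")]
res
```

Note that Y comes back as character rather than factor with this approach.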

Related

R perform summary operation and subset result by data.table column

I want to use a list external to my data.table to inform what a new column of data should be, in that data.table. In this case, the length of the list element corresponding to a data.table attribute;
# dummy list. I am interested in extracting the vector length of each list element
l <- list(a=c(3,5,6,32,4), b=c(34,5,6,34,2,4,6,7), c = c(3,4,5))
# dummy dt, the underscore number in Attri2 is the element of the list i want the length of
dt <- data.table(Attri1 = c("t","y","h","g","d","e","d"),
Attri2 = c("fghd_1","sdafsf_3","ser_1","fggx_2","sada_2","sfesf_3","asdas_2"))
# extract that number to a new attribute, just for clarity
dt[, list_gp := tstrsplit(Attri2, "_", fixed=TRUE, keep=2)]
# then calculate the lengths of the vectors in the list, and attempt to subset by the index taken above
dt[,list_len := '[['(lapply(l, length),list_gp)]
Error in lapply(l, length)[[list_gp]] : no such index at level 1
I envisaged the list_len column to be 5,3,5,8,8,3,8
A couple of things:
tstrsplit gives you a string, so convert it to a number.
I'm not quite sure about the [[ construct there; see the proposed solution:
dt[, list_gp := as.numeric( tstrsplit(Attri2, "_", fixed=TRUE, keep=2)[[1]] )]
dt[, list_len := sapply( l[ list_gp ], length ) ]
Output:
> dt
Attri1 Attri2 list_gp list_len
1: t fghd_1 1 5
2: y sdafsf_3 3 3
3: h ser_1 1 5
4: g fggx_2 2 8
5: d sada_2 2 8
6: e sfesf_3 3 3
7: d asdas_2 2 8
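A possible variation on the same idea, assuming a version of R that has base lengths(): compute all the list lengths once and index them by position. This is only a sketch; the unname() call is there just to keep the new column free of names.

```r
library(data.table)

l <- list(a = c(3, 5, 6, 32, 4), b = c(34, 5, 6, 34, 2, 4, 6, 7), c = c(3, 4, 5))
dt <- data.table(
  Attri1 = c("t", "y", "h", "g", "d", "e", "d"),
  Attri2 = c("fghd_1", "sdafsf_3", "ser_1", "fggx_2", "sada_2", "sfesf_3", "asdas_2")
)

# Index of the list element, taken from the part after the underscore
dt[, list_gp := as.integer(tstrsplit(Attri2, "_", fixed = TRUE, keep = 2)[[1]])]

# lengths() gives one length per list element; index it by position
dt[, list_len := unname(lengths(l))[list_gp]]
dt
```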

data.table "out of range", how to add value to new row

While working with data.frame it is simple to insert new value by using row number;
df1 <- data.frame(c(1:3))
df1[4,1] <- 1
> df1
c.1.3.
1 1
2 2
3 3
4 1
It is not working with data.table;
df1 <- data.table(c(1:3))
df1[4,1] <- 1
Error in `[<-.data.table`(`*tmp*`, 4, 1, value = 1) : i[1] is 4 which is out of range [1,nrow=3].
How can I do it?
data.tables were designed to work much faster for common operations such as subsetting, joining, grouping and sorting, and as a result they differ from data.frames in some ways.
Some operations, like the one you pointed out, do not work on data.tables; you need to use data.table-specific operations instead.
dt1 <- data.table(c(1:3))
dt1 <- rbindlist(list(dt1, list(1)), use.names=FALSE)
dt1
# V1
# 1: 1
# 2: 2
# 3: 3
# 4: 1
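The same append can also be sketched with rbind(), which has a data.table method. Either way the result is a new table, so it has to be assigned back. The column name V1 is the default that data.table gives an unnamed column.

```r
library(data.table)

dt1 <- data.table(c(1:3))

# rbind() returns a new table, so assign the result back to dt1
dt1 <- rbind(dt1, data.table(V1 = 1L))
dt1
```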

Naming Aggregate variable(s) in data.table by reference in R

I would like to know if it's possible to name an aggregate variable by a dynamic reference at the time of aggregation in data.table.
Please note that I know I can rename the variable after aggregation by reference and that is not what I'm asking here!
Let's say I've got a data.table DT with three variables var1, var2, and var3.
> DT
var1 var2 var3
1: 1 1 A
2: 3 0 A
3: 2 2 B
4: 1 0 A
5: 0 2 C
I would like to dynamically name the aggregate variable, based on the names stored in a vector OR a string variable
var_string <- c('agg_var1', 'agg_var2')
# the following doesn't work
DT_agg <- DT[, .( (var_string[1]) = sum(v1 + v2)), by = .( (var_string[2]) = var3)]
#this is the output I want
> DT_agg
agg_var2 agg_var1
1: A 6
2: B 4
3: C 2
The code above doesn't work; it gives an error like this:
Error: unexpected '=' in "DT_agg <- DT[, .( (var_string[1]) = sum(v1 + v2)), by = .( (var_string[2]) = var3)="
I'm only interested to know if it's possible to do this at the same time as aggregation, rather than renaming the columns afterwards, which I already know how to do.
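One possible approach, offered as a sketch rather than a definitive answer: as far as I can tell, both j and by accept a list whose names were set at run time, so setNames() can supply the names during aggregation. The DT below is rebuilt from the printed table, with the column names var1, var2 and var3 assumed from that printout.

```r
library(data.table)

DT <- data.table(var1 = c(1, 3, 2, 1, 0),
                 var2 = c(1, 0, 2, 0, 2),
                 var3 = c("A", "A", "B", "A", "C"))
var_string <- c("agg_var1", "agg_var2")

# Build the named lists for j and by with setNames() instead of name = value
DT_agg <- DT[, setNames(list(sum(var1 + var2)), var_string[1]),
             by = setNames(list(var3), var_string[2])]
DT_agg
```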

Returning values based on matching strings and then expanding to other rows in same group

I have a dataset with 11,000 rows in the following format:
Case Type
A x
A y
A z
B a
B b
B z
...where Case and Type are both multiletter character strings. I want to add a new column of dummies for rows containing Type==x or y, which I can easily do using the following line of code:
df$Quality <- ifelse(grepl("x|y", df$Type), 1, 0)
This produces the following:
Case Type Quality
A x 1
A y 1
A z 0
B a 0
B b 0
B z 0
There are quite a few threads on how to do this. However, I couldn't find any that explain how to expand the returned values across groups. Specifically, I'd like Quality==1 if any observations in a given Case contain x or y. The results should then look like:
Case Type Quality
A x 1
A y 1
A z 1
B a 0
B b 0
B z 0
...such that row 3 is also coded Quality==1 even though it doesn't contain Type x or y because another row in Case A does. The answer must be simple, but I'd be grateful for some help!
Similar to the idea of @Psidom, we can use the base R function ave
df$Quality <- as.numeric(as.logical(ave(df$Type, df$Case, FUN = function(i)
any(grepl("x|y", i)))))
# Case Type Quality
#1 A x 1
#2 A y 1
#3 A z 1
#4 B a 0
#5 B b 0
#6 B z 0
We can reduce this further, as suggested by @thelatemail in the comments,
df$Quality <- as.numeric(ave(grepl("[xy]", df$Type), df$Case, FUN=any))
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'Case', check with %in% whether the elements 'x' and 'y' are in the 'Type' column. If both of them need to be there, use all; if either one is enough, replace all with any. Convert the logical result to binary with as.integer and assign (:=) it to the new column 'Quality'.
library(data.table)
setDT(df1)[, Quality := as.integer(all(c('x', 'y') %in% Type)), by = Case]
df1
# Case Type Quality
#1: A x 1
#2: A y 1
#3: A z 1
#4: B a 0
#5: B b 0
#6: B z 0
Or using the OP's method
setDT(df1)[, Quality := as.integer(any(grepl("[xy]", Type))), by = Case]
Or with dplyr, we use the same methodology as in data.table
library(dplyr)
df1 %>%
group_by(Case) %>%
mutate(Quality = as.integer(all(c('x', 'y') %in% Type)))
#mutate(Quality = as.integer(any(c('x', 'y') %in% Type)))
Or another base R option with table
tbl <- with(df1, table(Case, grepl("[xy]", Type)))[,2]
transform(df1, Quality = +(Case %in% names(tbl[tbl!=0])))
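The choice between all and any in the data.table and dplyr versions only shows up when a Case contains one of the two types but not the other. A small sketch with made-up data (the table `d` and the columns `q_all` and `q_any` are hypothetical, not from the question):

```r
library(data.table)

# Made-up data: Case A has both x and y, Case B only x, Case C neither
d <- data.table(Case = c("A", "A", "B", "B", "C"),
                Type = c("x", "y", "x", "z", "z"))

# all(): 1 only when the Case contains both x and y
d[, q_all := as.integer(all(c("x", "y") %in% Type)), by = Case]

# any(): 1 when the Case contains x or y, which is what the question asks for
d[, q_any := as.integer(any(c("x", "y") %in% Type)), by = Case]
d
```

Here Case B gets 0 with all and 1 with any, so any is the one that matches "x or y".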

Supply arguments to data.table as (1) vector of strings AND (2) variablenames

Imagine you want to apply a function row-wise on a data.table. The function's arguments correspond to fixed data.table columns as well as dynamically generated column names.
Is there a way to supply fixed and dynamic column names as argument to a function while using data.tables?
The problems are:
Both, variablenames and dynamically generated strings as argument to a function over a datatable
The dynamic column name strings are stored in a vector with > 1 entries (get() won't work)
The dynamic column's values need to be supplied as a vector to the function
This illustrates it:
library('data.table')
# Sample dataframe
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3) #fixed and dynamic column names
setkey(D, id)
# Sample function
foo <-function(fix, dynvector){ rep(fix,length(dynvector)) %*% dynvector}
# It does not matter what this function does.
# The result when passing column names not dynamically
D[, "new" := foo(fix,c(dyn1,dyn2)), by=id]
# id fix dyn1 dyn2 new
# 1: 1 1 1 1 2
# 2: 2 2 2 2 8
# 3: 3 3 3 3 18
I want to get rid of the c(dyn1,dyn2). I need to get the column names dyn1, dyn2 from another vector which holds them as string.
This is how far I got:
# Now we try it dynamically
cn <-paste("dyn",1:2,sep="") #vector holding column names "dyn1", "dyn2"
# Approaches that don't work
D[, "new" := foo(fix,c(cn)), by=id] #wrong as using a mere string
D[, "new" := foo(fix,c(cn)), by=id, with=F] #does not work
D[, "new" := foo(fix,c(get(cn))), by=id] #uses only the first element "dyn1"
D[, "new" := foo(fix,c(mget(cn, .GlobalEnv, inherits=T))), by=id] #does not work
D[, "new" := foo(fix,c(.SD)), by=id, .SDcols=cn] #does not work
I suppose mget() is the solution, but I know too little about scoping to figure it out.
Thanks! JBJ
Update: Solution
based on the answer by BondedDust
D[, "new" := foo(fix,sapply(cn, function(x) {get(x)})), by=id]
I wasn't able to figure out what you were trying to do with the matrix-multiplication, but this shows how to create new variables with varying and fixed inputs to a function:
D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
setkey(D, id)
foo <-function(fix, dynvector){ fix* dynvector}
D[, paste("new",1:2,sep="_") := lapply( c(dyn1,dyn2), foo, fix=fix), by=id]
#----------
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 1
2: 2 2 2 2 4 4
3: 3 3 3 3 9 9
So you need to use a vector of character values to get columns. This is a bit of an extension to this question: Why do I need to wrap `get` in a dummy function within a J `lapply` call?
> D <- data.table(id=1:3, fix=1:3, dyn1=1:3, dyn2=1:3)
> setkey(D, id)
> id1 <- parse(text=cn)
> foo <-function( fix, dynvector){ fix*dynvector}
> D[, paste("new",1:2,sep="_") := lapply( sapply( cn, function(x) {get(x)}) , foo, fix=fix) ]
Warning message:
In `[.data.table`(D, , `:=`(paste("new", 1:2, sep = "_"), lapply(sapply(cn, :
Supplied 2 columns to be assigned a list (length 6) of values (4 unused)
> D
id fix dyn1 dyn2 new_1 new_2
1: 1 1 1 1 1 2
2: 2 2 2 2 2 4
3: 3 3 3 3 3 6
You could probably also use the methods in "create an expression from a function for data.table to eval".
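Another way to hand the dynamic columns to the function as one vector, sketched here as an assumption-laden alternative rather than the accepted solution: restrict .SD to the names in cn with .SDcols and unlist it inside each group. Since each id is a single row, unlist(.SD) is just the dyn values for that row; as.vector() is added only to drop the 1x1 matrix that %*% returns.

```r
library(data.table)

D <- data.table(id = 1:3, fix = 1:3, dyn1 = 1:3, dyn2 = 1:3)
cn <- paste0("dyn", 1:2)  # column names held as strings

# Function from the question
foo <- function(fix, dynvector) rep(fix, length(dynvector)) %*% dynvector

# .SD holds only the columns named in cn; fix is still visible in j by name
D[, new := as.vector(foo(fix, unlist(.SD))), by = id, .SDcols = cn]
D
```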
