I am trying to combine a lot of data.table manipulations into some faster code. I am creating an example with a smaller data.table, and I am hopeful someone has a better solution than the clunky (embarrassing) code I developed.
For each group, I want to:
1) Verify there is both a TRUE and a FALSE in column w, and if there is:
2) Subtract the value of x corresponding to the highest value of v from each value of x in the same group, and put that number in a new column.
So in group 3, if the highest v value is 10, and in the same row x is 0.212, I would subtract 0.212 from every x value corresponding to group 3 and put the result in a new column.
3) Remove all rows corresponding to groups without both a TRUE and a FALSE in column w.
library(data.table)
set.seed(1)
test <- data.table(v=1:12, w=runif(12)<0.5, x=runif(12),
                   y=sample(2,12,replace=TRUE), z=sample(letters[1:3],12,replace=TRUE))
setkey(test,y,z)
test[,group:=.GRP,by=key(test)]
A chained version can look like this without needing to set a table key:
result <- test[
# First, flag the groups to keep (both TRUE and FALSE present) in 'rowselect'
, rowselect := (0 < sum(w) & sum(w) < .N)
, by = .(y,z)][
# Select only the rows that we need
rowselect == TRUE][
# get rid of the temp column
, rowselect := NULL][
# create a new column 'u' to store the values
, u := x - x[max(v) == v]
, by = .(y,z)]
The result looks like this:
> result
v w x y z u
1: 1 TRUE 0.6870228 1 c 0.4748803
2: 3 FALSE 0.7698414 1 c 0.5576989
3: 7 FALSE 0.3800352 1 c 0.1678927
4: 10 TRUE 0.2121425 1 c 0.0000000
5: 8 FALSE 0.7774452 2 b 0.6518901
6: 12 TRUE 0.1255551 2 b 0.0000000
I have the following problem. I have a list of sets encoded in a data.table sets, where id.s is the id of a set and id.e is one of its elements (one row per set-element pair).
For each set s there is its value v(s). The values of the function v() are in another data.table v, in which each row contains a set id id.s and its value.
library(data.table)
sets <- data.table(
  id.s = c(1,2,2,3,3,3,4,4,4,4),
  id.e = c(3,3,4,2,3,4,1,2,3,4))
v <- data.table(id.s = 1:4, value = c(1/10,2/10,3/10,4/10))
I need to calculate a new function v'() such that

v'(a) = sum over all supersets b ⊇ a of (-1)^|b \ a| * v(b)

where |s| denotes the cardinality of the set s (the number of elements) and b \ a denotes set subtraction (removing from b the elements it shares with a).
Right now I do it using a for-loop, updating row by row. However, this takes too much time for large data.tables with thousands of sets and thousands of elements.
Do you have any idea how to make it easier?
My current code:
# convert data.table to wide format
dc <- dcast(sets, id.s ~ id.e, drop = FALSE, value.var = "id.e" , fill = 0)
# take columns corresponding to elements id.e
cols <- names(dc)[-1]
# convert columns cols to 0-1 coding
dc[, (cols) := lapply(.SD, function(x) ifelse(x > 0,1,0)), .SDcols = cols]
# join dc with v
dc <- dc[v, on = "id.s"]
# calculate the cardinality of each set
dc[, cardinality := sum(.SD > 0), .SDcols = cols, by = id.s]
# prepare column for new value
dc[, value2 := 0]
# id.s 1 2 3 4 value cardinality value2
#1: 1 0 0 1 0 0.1 1 0
#2: 2 0 0 1 1 0.2 2 0
#3: 3 0 1 1 1 0.3 3 0
#4: 4 1 1 1 1 0.4 4 0
# for each set (row of dc)
for(i in 1:nrow(dc)) {
row <- dc[i,]
set <- as.numeric(row[,cols, with = F])
row.cardinality <- as.numeric(row$cardinality)
# find its supersets
dc[,is.superset := ifelse(rowSums(mapply("*",dc[,cols,with=FALSE],set))==row.cardinality,1,0)][]
# use the formula to update the value
res <- dc[is.superset==1,][, sum := sum((-1)^(cardinality - row.cardinality)*value)]$sum[1]
dc[i,value2 := res]
}
dc[, .(id.s, value2)]
# id.s value2
#1: 1 -0.2
#2: 2 0.3
#3: 3 -0.1
#4: 4 0.4
This might work for you:
Make a little function to get the supersets of each set:
get_superset <- function(el, setvalue) {
c(setvalue, sets[id.s!=setvalue, setequal(intersect(el, id.e), el), by=id.s][V1==TRUE, id.s])
}
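For example (an illustrative call, not part of the original answer), set 2 = {3, 4} is contained in sets 3 and 4, so:
get_superset(c(3, 4), 2)
# [1] 2 3 4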
1. Get the cardinality of each set in the sets object, but also save it separately for later use (see step 4):
sets[, cardinality:=.N, by=.(id.s)]
cardinality = unique(sets[, .(id.s, cardinality)])
2. Add supersets, by set, using the above function:
sets <- unique(sets[,!c("id.e")][sets[, .("supersets"=get_superset(id.e, .GRP)),by=id.s], on=.(id.s)])
(Note: as an alternative, step 2 could be broken into three sub-steps, like this:)
# 2a. Get the supersets
supersets = sets[, .("supersets"=get_superset(id.e, .GRP)),by=id.s]
# 2b. Merge the supersets on the original sets
sets = sets[supersets, on=.(id.s)]
# 2c. Retain only necessary columns, and make unique
sets = unique(sets[, .(id.s, cardinality,supersets)])
3. Add value:
sets <- sets[v,on=.(supersets=id.s)][order(id.s)]
4. Grab the cardinality of each superset:
sets <- sets[cardinality, on=.(supersets=id.s)]
5. Get the result (i.e. estimate your v'() function):
result = sets[, .(value2 = sum((-1)^(i.cardinality-cardinality)*value)), by=.(id.s)]
Output:
id.s value2
1: 1 -0.2
2: 2 0.3
3: 3 -0.1
4: 4 0.4
I have a data frame like this:
x=data.frame(type = c('a','b','c','a','b','a','b','c'),
value=c(5,2,3,2,10,6,7,8))
Every item has attributes a, b, and c, but some items may have missing records, i.e. only a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g assuming each occurrence of a starts a new group, use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, slightly more complex but needed if some groups are missing type a, is the following. It assumes that x lists the type values in order within each group, so that if a type is less than the prior type it must start a new group.
# convert type to numeric codes via factor() so this also works when type is character
g <- cumsum(c(-1, diff(as.numeric(factor(x$type)))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then whether the b and c in the second group actually form a second group or are part of the first group is not determinable.
I want to read a data frame, check whether the first column is T or F, and depending on this add a new entry to a new column using data from the second column:
If z[,1] == TRUE, set z[,4] to 2*z[,2];
else set z[,4] to z[,2].
That is, if the row in column 1 is TRUE, set the new entry to 2 times the second column; otherwise just set it to the value of the second column at that index.
Let's create z:
set.seed(4)
z <- data.frame(first=c(T, F, F, T, F), second=sample(-2:2),
third=letters[5:1], stringsAsFactors=FALSE)
z
Here is my for loop:
for(i in 1:nrow(z)){
  if(z$first[i] == TRUE){
    z$newVar2[i] <- 2*z$second[i]
  } else {
    z$newVar2[i] <- z$second[i]
  }
}
Here is without a for loop:
z$newVar<-ifelse(z$first==TRUE, 2*z$second, z$second)
Is there a way to do this with apply? Is there a more efficient way to accomplish this task?
Not exactly what you asked, but if you are working with this kind of data structure, you might as well explore the data.table way of going about it:
#Make data.table
library(data.table)
setDT(z)
setkey(z)  # keys (and sorts) by all columns, hence the row order in the print below
#Write function to do all the stuff
myfun <- function(first, second){ifelse(first, 2*second, second)}
#Do stuff
z[, newvar2:=myfun(first, second)]
#Printing z
first second third newvar2
1: FALSE -2 d -2
2: FALSE -1 a -1
3: FALSE 1 c 1
4: TRUE 0 e 0
5: TRUE 2 b 4
We can use data.table in a more efficient way still, without defining a function, by making use of the fact that TRUE == 1:
## use set.seed because we are sampling
set.seed(123)
z <- data.frame(first=c(T, F, F, T, F),
second=sample(-2:2),
third=letters[5:1], stringsAsFactors=FALSE)
library(data.table)
setDT(z)[, newvar2 := (first + 1) * second]
z
# first second third newvar2
# 1: TRUE -1 e -2
# 2: FALSE 1 d 1
# 3: FALSE 2 c 2
# 4: TRUE 0 b 0
# 5: FALSE -2 a -2
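To address the apply part of the question: apply() would coerce each mixed-type row to character, so a row-wise version is better expressed with mapply. A minimal sketch (not from either answer above):
# row-wise equivalent of the ifelse() one-liner
z$newVar3 <- mapply(function(f, s) if (f) 2 * s else s, z$first, z$second)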
I have followed the data.table introduction. A key is set on the x column of the data.table and then queried. I have tried to set the key on the v column instead and it does not work as expected. Any ideas of what I am doing wrong?
> set.seed(34)
> DT = data.table(x=c("b","b","b","a","a"),v=rnorm(5))
> DT
x v
1: b -0.1388900
2: b 1.1998129
3: b -0.7477224
4: a -0.5752482
5: a -0.2635815
> setkey(DT,v)
> DT[1.1998129,]
x v
1: b -0.7477224
EXPECTED:
x v
1: b 1.1998129
When the first argument of [.data.table is a number, it will not do a join, but a simple row number lookup. Since after the setkey your data.table looks like so:
DT
# x v
#1: b -0.7477224
#2: a -0.5752482
#3: a -0.2635815
#4: b -0.1388900
#5: b 1.1998129
And since as.integer(1.1998129) is equal to 1, you get the first row.
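For example, any numeric i is simply truncated to an integer row number:
DT[1.9] # as.integer(1.9) is 1, so this is a lookup of row 1, not a join
# x v
#1: b -0.7477224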
Now if you intended to do a join instead, you have to use the syntax DT[J(...)] or DT[.(...)], and that will work as expected, provided you use the correct number. (As a convenience, you're not required to use J when dealing with e.g. character columns, because there is no other default meaning for something like DT["a"].)
DT[J(v[5])]
# x v
#1: b 1.199813
Note that DT[J(1.1998129)] will not work, because:
DT$v[5] == 1.1998129
#[1] FALSE
You could print out a lot of digits, and that would work:
options(digits = 22)
DT$v[5]
#[1] 1.199812896606383683107
DT$v[5] == 1.199812896606383683107
#[1] TRUE
DT[J(1.199812896606383683107)]
# x v
#1: b 1.199812896606383683107
but there is an additional subtlety here, worth noting, in that R and data.table have different precisions for when floating point numbers are equal:
DT$v[5] == 1.19981289660638
#[1] FALSE
DT[J(1.19981289660638)]
# x v
#1: b 1.199812896606379908349
Long story short: be careful when joining floating point numbers.
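As an aside (not part of the original answer): the precision data.table uses when matching doubles can be controlled with setNumericRounding(), which sets how many of a double's last bytes are rounded off when joining or grouping on numeric columns:
getNumericRounding() # query the current setting
setNumericRounding(2) # round off the last 2 bytes of the significand
DT[J(1.19981289660638)] # with rounding on, nearby doubles can match
setNumericRounding(0) # 0 = full-precision matching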
I have a data.table that I want to filter based on some inequality criteria:
dt <- data.table(A=letters[1:3], B=2:4)
dt
# A B
# 1: a 2
# 2: b 3
# 3: c 4
dt[B>2]
# A B
# 1: b 3
# 2: c 4
The above works well as a vector scan solution. But I can't work out how to combine this with variable names for the columns:
mycol <- "B"
dt[mycol > 2]
# A B // Nothing has changed
# 1: a 2
# 2: b 3
# 3: c 4
How do I work around this? I know I can use binary search by setting keys using setkeyv(dt, mycol) but I can't see a way of doing a binary search based on some inequality criteria.
OK then: use get(mycol), because you want the argument to dt[ to be the contents of the object "mycol". I believe dt[mycol ...] looks for a "mycol" thingie in the data.table object itself, and of course there is no such animal.
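A minimal example of that:
mycol <- "B"
dt[get(mycol) > 2]
# A B
# 1: b 3
# 2: c 4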
There is an accessor function provided for this. j is evaluated within the frame of X, i.e. your data.table, unless you specify with = FALSE. This would be the canonical way of doing it:
dt[ , mycol , with = FALSE ]
B
1: 2
2: 3
3: 4
Return column, logical comparison, subset rows...
dt[ c( dt[ , mycol , with = FALSE ] > 2 ) ]
Another alternative is to use [[ to retrieve B as a vector, and subset using that:
dt[dt[[mycol]] > 2]
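This gives the same subset as the get(mycol) approach:
# A B
# 1: b 3
# 2: c 4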