This question might be too general, but I feel it comes up again and again in my work and thus is probably of interest to others.
Suppose I want to create a data table (or matrix) that is based off of expanding a grid.
library(data.table)
Vmat1 = data.table(expand.grid(c(list(d = 1:5, w = 1:(3)))))
Suppose however, that if I were to do this, this would result in a memory error for the true power set. However, there are constraints that I want to impose, for example:
If w>1, then it must be that d<3
This then gives smaller final set that would not result in a memory error:
Vmat1[w>1 & d<3 | w==1]
d w
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
6: 1 2
7: 2 2
8: 1 3
9: 2 3
My question is, is it possible to ex ante impose the restriction when creating the grid? It is too costly to build the full power set and then condition and reduce.
If your concern is memory you can split into sets as #chinsoon suggests.
# Filter Method
V = CJ(d = 1:5, w = 1:3) # same as data.table(expand.grid(c(list(d = 1:5, w = 1:(3))))) except ordering
a <- V[w>1 & d<3 | w==1]
# Sets Method
d <- 1:5
w <- 1:3
b <- rbindlist(list(
CJ(d = d[d < 3], w = w[w > 1])
, CJ(d = d, w = w[w == 1])))
all(a == b[order(d, w)])
# [1] TRUE
I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables along with many more numerical variables which I would like to simply pass by reference.
Example dataset:
library('data.table')
d <- data.table(1:3, # there are lots of numerics, so I want to avoid copying
letters[1:3], # convert these to factor then dummy variable
10:12,
LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
>dummyDT(d)
V1 V3 V2.b V2.c V4.Y V4.Z
1: 1 10 0 0 0 0
2: 2 11 1 0 1 0
3: 3 12 0 1 0 1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
# set levels in order of appearance to avoid default re-sort by alpha
m <- contrasts(factor(x, levels=unique(x)))
dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
m
}
dummyDT <- function(d){
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <-
data.table(
do.call(cbind, lapply(toDummy, function(j) {
categorToMatrix(d[[j]], name_prefix = names(d)[j])
} )
)
)
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
(I do not care about maintaining original column ordering.)
I have tried in addition to the above, the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from #Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categoricalToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but then when I try to lapply to multiple columns, these lists get somehow coerced into string-rows and recycled, which is baffling me:
dummyDT2 <- function(d){
stopifnot(inherits(d,'data.table'))
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) ,
.SDcols=isChar]
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full-rank dummy variables (because I'm using this for regression) but rather contrast variables (n-1 of these for n levels) and I have multiple numeric variables that I do not want to OHE.
I am trying to combine a lot of data.table manipulations into a some faster code. I am creating an example with a smaller data.table and I hopeful someone has a better solution than the clunky (embarrassing) code I developed.
For each group, I want to:
1) Verify there is both a TRUE and FALSE in column w, and if there is:
2) Subtract the value of x corresponding to the highest value of v from each
value of x in the same group and put that that number in a new column
So in group 3, if the highest v value is 10, and in the same row x is 0.212,
I would subtract 0.212 from every x value corresponding to group 3 and put that number in a new column
3) Remove all rows corresponding to groups without both a TRUE and a FALSE in column w.
set.seed(1)
test <- data.table(v=1:12, w=runif(12)<0.5, x=runif(12),
y=sample(2,12,replace=TRUE), z=sample(letters[1:3],12,replace=TRUE) )
setkey(test,y,z)
test[,group:=.GRP,by=key(test)]
A chained version can look like this without needing to set a table key:
result <- test[
# First, identify groups to remove and store in 'rowselect'
, rowselect := (0 < sum(w) & sum(w) < .N)
, by = .(y,z)][
# Select only the rows that we need
rowselect == TRUE][
# get rid of the temp column
, rowselect := NULL][
# create a new column 'u' to store the values
, u := x - x[max(v) == v]
, by = .(y,z)]
The result looks like this:
> result
v w x y z u
1: 1 TRUE 0.6870228 1 c 0.4748803
2: 3 FALSE 0.7698414 1 c 0.5576989
3: 7 FALSE 0.3800352 1 c 0.1678927
4: 10 TRUE 0.2121425 1 c 0.0000000
5: 8 FALSE 0.7774452 2 b 0.6518901
6: 12 TRUE 0.1255551 2 b 0.0000000
I would like to remove the columns that have all zeros. But, some of the columns appear to have non numeric values. How can I remove the non numeric columns, and the columns with all zeros. It would be helpful if the non numeric column name was printed, or the column number, so I can determine if it was ok to remove the column.
Here's what I'm trying, but it doesn't work when the data table has non numeric values.
removeColsAllZeros = function(ddt) {
m <- as.matrix(ddt)
# isNumericColList <- lapply(1:ncol(m), function(ii,mm){is.numeric(mm[,ii])}, mm=m)
# indexNonNumericCols <- which(!unlist(isNumericColList))
mnz <- m[, colSums(abs(m),na.rm = TRUE) != 0]
return(mnz)
}
Here's a simple function that can be applied to all columns in your data frame, returning just the ones that are numeric and not all zero:
# Fake data
dat = data.frame(x=rnorm(5),
y=rep(0,5),
z=sample(c(1,0),5,replace=TRUE),
w=sample(LETTERS[1:3],5,replace=TRUE),
stringsAsFactors=FALSE)
dat
x y z w
1 0.5450570 0 0 B
2 0.5292899 0 0 B
3 -0.2142306 0 1 C
4 -0.7246841 0 0 C
5 -0.7567683 0 1 A
# Remove columns with all zeros or that are not numeric
dat[, !sapply(names(dat), function(col) {all(dat[,col]==0) |
!is.numeric(dat[,col])})]
x z
1 0.5450570 0
2 0.5292899 0
3 -0.2142306 1
4 -0.7246841 0
5 -0.7567683 1
To unpack this, the function checks, for a single column of dat, whether it has all zeros or is not numeric. sapply then "applies" this function to every column in the data frame, returning a logical vector with TRUE for columns of dat with all zeros or that are non-numeric, and FALSE for columns that are numeric and not all zeros. The ! ("NOT") before sapply just reverses the FALSE and TRUE values:
!sapply(names(dat), function(col) {
all(dat[, col]==0) | !is.numeric(dat[, col])
})
x y z w
TRUE FALSE TRUE FALSE
Then we use this logical vector to return only those columns of dat that are TRUE.
dat[ , c(TRUE, FALSE, TRUE, FALSE)]
x z
1 0.5450570 0
2 0.5292899 0
3 -0.2142306 1
4 -0.7246841 0
5 -0.7567683 1
Finally, to check the non-numeric columns that were removed, do the following, which will return all non-numeric columns:
dat[, sapply(names(dat), function(col) {!is.numeric(dat[,col])})]
This is not compact but works on data table after modifying #eipi10's code.
# toy data
set.seed(1)
dat = data.frame(x=rnorm(5),
y=rep(0,5),
z=sample(c(1,0),5,replace=TRUE),
w=sample(LETTERS[1:3],5,replace=TRUE),
stringsAsFactors=FALSE)
# code for a data table
library(data.table)
setDT(dat)
idx = sapply(dat, function(x){ !(all(x==0) | !is.numeric(x)) })
dat[, .SD, .SDcols = idx]
# x z
# 1: -0.6264538 1
# 2: 0.1836433 1
# 3: -0.8356286 0
# 4: 1.5952808 1
# 5: 0.3295078 0
Both of the other answers were helpful, but they didn't totally answer the question. Here's a function with to identify and remove the non-numeric and all zero columns from a data table. This was helpful and provided additional insight into the data set.
removeColsAllZeros = function(ddt) {
# Identify and remove nonnumeric cols and cols with all zeros
idx_all_zeros = ddt[, lapply(.SD, function(x){ (is.numeric(x) & all(x==0)) })]
idx_not_numeric = ddt[, lapply(.SD, function(x){ (!is.numeric(x)) })]
idx_all_zeros = which(unlist(idx_all_zeros))
idx_not_numeric = which(unlist(idx_not_numeric))
# Print bad column names
if (length(idx_all_zeros)>0) {
cat('Numeric columns with all zeros are\n',paste(names(ddt)[idx_all_zeros],collapse='\n'),'\n')
flush.console()
}
if (length(idx_not_numeric)>0) {
cat('Nonnumeric columns are\n',paste(names(ddt)[idx_not_numeric],collapse='\n'),'\n')
flush.console()
}
# Determine the numeric columns that have nonzero values
idx_bad = union(idx_all_zeros, idx_not_numeric)
idx_good = setdiff(seq(1,ncol(ddt)), idx_bad)
# Return nonzero numeric data
ddt[, .SD, .SDcols = idx_good]
}
according to my last question i have an new belonging question. After Editing my post and ask there and wait abot a week i want to try it here again.
This time with a better example:
Equip<- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,6,6,6)
Notif <-c(1,1,3,4,2,2,2,5,6,7,9,9,15,10,11,12,13,14,16,17,18,19)
rank <- c(1,1,2,3,1,1,1,1,2,3,1,1,2,1,2,3,1,2,3,4,5,6)
Component <- c("Ventil","Motor","Ventil","Ventil","Vergaser","Vergaser","Bremse",
"Lichtmaschine","Bremse","Lichtmaschine","Bremse","Motor","Lichtmaschine",
"Bremse","Bremse","Motor","Vergaser","Motor","Vergaser","Motor",
"Vergaser","Motor")
df <- data.frame(Equip,Notif,rank,Component)
Equip is my subject and rank the actual visit number. Component is the subject what have to be looked for.
I want to have an output like this:
If an Equip(subject) was visited 2 times( rank 1 and 2) look by all Equips with rank 1&2 , if there is any Component which was regarded the first and the second time.
If an Equip(subject) was visited 3 times (rank 1 ,2 and 3) for this look by all Equips, if there is any Component list up 3 times like Equip 1, rank 1, Component Motor, Equip 1, rank 2, Component Motor, Equip 1, rank 3, Component Motor
The output should have the name of the Component, like True "Motor"
I have a code but with this, i can just compare the 1 and the 2 visit, the 2 and the 3 together and so on( i cannot split up again with the ranks, like Equips with 2 ranks, Equips with 3 ranks and so on)
the code is this:
a <- lapply(split(df,df$Equip),function(x){
ll <- split(x,x$rank)
if(length(ll)>1 )
ii <- intersect(ll[[1]]$Component,ll[[2]]$Component ) ## test intersection
else
ii <- NA
c(length(ii)> 0 && !is.na(ii),ii)
})
b <- unlist(a)
c <- table(b,b)
rowSums(c)
Hopefully you can help me. Please ask if there are any questions.
according to your question about the output, and to your way of solution,
Equip Component V1 idx
1: 1 Ventil TRUE 3
2: 2 NA False 1
3: 3 NA False 3
4: 4 NA FALSE 2
5: 5 NA FALSE 3
6: 6 NA FALSE 6
Something like that, but if its easier, Equip and idx is not neccessarilly needed
for Equip with 2 ranks:
TRUE FALSE
0 1
for Equip with 3 ranks:
TRUE FALSE
1 2
for Equip with 6 ranks:
TRUE FALSE
0 1
Here's the output I think would be of interest to you. Its using data.table.
First, we create a data.table from your data.frame df with keys = Equip, Component as follows.
require(data.table) # load package
# then create the data.table with keys as specified above
# Check that both these columns are already sorted out for you!
dt <- data.table(df, key=c("Equip", "Component"))
Second, we create a function that'll give the desired output for a given rank query (2, 3 etc..)
this.check <- function(idx) {
chk <- seq(1, idx)
o <- subset(dt[, all(chk %in% rank), by=c("Equip", "Component")], V1 == TRUE)
if (nrow(o) > 0) o[, idx:=idx]
}
What does this do? Let's run this for rank=1,2. We run this by:
> this.check(2)
# output
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
This tells you that for Equip = 1 and 5, there are Components = Ventil and Bremse with rank = 1 and 2, respectively (indicated with idx=2). You also get the column V1 = TRUE, even though I, as #Carl pointed out already, don't understand the need for this. If you require, you can change the column names of this output by using setnames
Third, we use this function to query ranks=1,2, then ranks=1,2,3 .. and so on. This can be accomplished with a simple lapply as follows:
# Let's run the function for idx = 2 to 6.
# This will check from rank = 1,2 until rank=1,2,3,4,5,6
o <- lapply(2:6, function(idx) {
this.check(idx)
})
> o
[[1]]
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
[[2]]
Equip Component V1 idx
1: 1 Ventil TRUE 3
[[3]]
NULL
[[4]]
NULL
[[5]]
NULL
It shows that for rank=1,2 and rank=1,2,3 you have some Component. For others there's nothing = NULL.
Finally, we can bind all of these together using rbind to get one single data.table as follows:
o <- do.call(rbind, o)
> o
Equip Component V1 idx
1: 1 Ventil TRUE 2
2: 5 Bremse TRUE 2
3: 1 Ventil TRUE 3
Here, idx=2 are the Component that satisfies rank=1,2 and idx=3 are the ones that satisfy rank=1,2,3.
Putting it all together:
this.check <- function(idx) {
chk <- seq(1, idx)
o <- subset(dt[, all(chk %in% rank), by=c("Equip", "Component")], V1 == TRUE)
if (nrow(o) > 0) o[, idx:=idx]
}
o <- do.call(rbind, lapply(2:6, function(idx) {
this.check(idx)
}))
I hope this helps.
Edit: (After series of exchanges in comments, this is the new solution I propose. I hope this is what you are after.)
require(data.table)
dt <- data.table(df, key=c("Equip", "Component"))
dt[, `:=`(e.max=max(rank)), by=Equip]
dt[, `:=`(ec.max=max(rank)), by=c("Equip", "Component")]
setkey(dt, "e.max", "ec.max")
this.check <- function(idx) {
t1 <- dt[J(idx,idx)]
t2 <- t1[, identical(as.numeric(seq_len(idx)), as.numeric(rank)),
by=c("Equip", "Component")]
o <- table(t2$V1)
if (length(o) == 1)
o <- c(o, "TRUE"=0)
o <- c("idx"=idx, o)
}
o <- do.call(rbind, lapply(2:6, function(idx) this.check(idx)))
> o
# idx FALSE TRUE
# [1,] 2 1 0
# [2,] 3 2 1
# [3,] 4 1 0
# [4,] 5 1 0
# [5,] 6 1 0
If I make an array of your data, columnwise, as
foo<-cbind(Equip,Notif, rank, Component)
eqp<-1 # later, loop over all values
foo[c( which( foo[,1]==eqp & (foo[,3]==1 | foo[,3]==2) ) ),4]
[1] "Ventil" "Motor" "Ventil"
Feed those results to table and extract items with count ==2
Clearly any item which shows up twice is what you want.
This is not an answer I'd recommend using, since tools like ddply and aggregate will do this much more cleanly, but I want to be sure that this is the answer you're after, assuming a loop over eqp values in the original Equip .