Retaining by variables in R data.table by-without-by

I'd like to retain the by variables in a by-without-by operation using data.table.
I have a by-without-by that used to work (ca. 2 years ago), and now with the latest version of data.table I think the behavior must have changed.
Here's a reproducible example:
library(data.table)
dt <- data.table( by1 = letters[1:3], by2 = LETTERS[1:3], x = runif(3) )
by <- c("by1","by2")
allPermutationsOfByvars <- do.call(CJ, sapply(dt[,by,with=FALSE], unique, simplify=FALSE)) ## CJ() to form index
setkeyv(dt, by)
dt[ allPermutationsOfByvars, list( x = x ) ]
Which produces:
> dt[ allPermutationsOfByvars, list( x = x ) ]
           x
1: 0.9880997
2:        NA
3:        NA
4:        NA
5: 0.4650647
6:        NA
7:        NA
8:        NA
9: 0.4899873
I could just do:
> cbind( allPermutationsOfByvars, dt[ allPermutationsOfByvars, list( x = x ) ] )
   by1 by2         x
1:   a   A 0.9880997
2:   a   B        NA
3:   a   C        NA
4:   b   A        NA
5:   b   B 0.4650647
6:   b   C        NA
7:   c   A        NA
8:   c   B        NA
9:   c   C 0.4899873
Which indeed works, but is inelegant and possibly inefficient.
Is there an argument I'm missing or a clever stratagem to retain the by variables?

Add by = .EACHI to get the "by-without-by" aka by-EACH-element-of-I:
dt[allPermutationsOfByvars, x, by = .EACHI]
And this is how I'd have done the initial part:
allPermutationsOfByvars = dt[, do.call(CJ, unique(setDT(mget(by))))]
Finally, the on argument is usually the better choice now (vs setkey); see the sketch below.
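For instance, a minimal sketch combining the two on the question's data (assuming a data.table version where on= can replace setkeyv()):
# No setkeyv() needed: on= supplies the join columns, by = .EACHI retains them
dt[allPermutationsOfByvars, .(x = x), on = c("by1", "by2"), by = .EACHI]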


Combining list of data.tables based on data.table name

This is an extension to the question I posted a number of years ago: here
In summary, I have an indeterminate (in advance) number of data.tables in a list, some of which have duplicate names. I want to combine all data.tables with the same name using rbindlist, filling any columns that don't have corresponding values with NA, i.e. using the fill = TRUE argument of rbindlist.
I then want to return a list with the data.tables and name them based on their original names.
I'll give an example:
library(data.table)
# Create data for list A
A_1 <- data.table(A = 1:2,B = 2:3)
A_2 <- data.table(C = 100:102,D = 300:302,E = 1:3)
A <- list(AA = A_1,BB = A_2)
# Create data for list B
B_1 <- data.table(A = 2:4,B = 1:3,F = 4:6)
B_2 <- data.table(C = 10:12,D = 20:22,G = 1:3,I = 10:12)
B <- list(AA = B_1,BB = B_2)
# Create data for list C
C_1 <- data.table(I = 1:2,J = 3:4)
C_2 <- data.table(C = 1:3,D = 4:6)
C <- list(AA = C_1,BB = C_2)
# Create list DT which is a combination of lists A, B and C
DT <- c(A,B,C)
I can combine them easily enough using the following code.
library(purrr)
pmap(.l = list(y = unique(names(DT))),
     .f = function(y) rbindlist(DT[names(DT) %in% y], fill = TRUE)) %>%
  set_names(unique(names(DT)))
However, in practice DT is produced by another function, and I'd strongly prefer to chain (%>%) straight from that function's output into the pmap approach above, without first creating DT as an intermediate object.
I've tried the following:
# As a substitute for the function code creating DT, I'll simply put DT itself below as
# a representation of the output
DT %>% pmap(.l = list(y = unique(names(.))),
            .f = function(y) rbindlist(.[names(.) %in% y], fill = TRUE))
This doesn't work and gives an error about unused arguments to the function. Do you know why it doesn't work?
Note that I haven't then used 'set_names' to name the output from the pmap function, as the unique names of DT would have been 'forgotten' after creating the pmap output. I'm not sure how to use chains to do that (or if it's possible without including that output in the data.table output itself).
Let me know if this is unclear and I can try to explain it another way.
Any help would be greatly appreciated!
Thanks
Phil
We could wrap the right-hand side in braces ({}). Without the braces, the pipe inserts DT as the first argument of pmap(); since .l and .f are already supplied by name, DT falls through to pmap()'s ... and is handed to the anonymous function as an extra, unused argument, which is the error you saw. Inside braces, . is only used where it is written:
library(purrr) # version 1.0.0
library(dplyr)
DT %>% {
  unq <- unique(names(.))
  pmap(.l = list(y = unq),
       .f = function(y) rbindlist(.[names(.) %in% y], fill = TRUE)) %>%
    setNames(unq)
}
Or, instead of looping over the names and rbinding, split by name, which returns a named list, and then rbind each element:
DT %>%
  split(names(.)) %>%
  map(list_rbind)
Output:
$AA
    A  B  F  I  J
1:  1  2 NA NA NA
2:  2  3 NA NA NA
3:  2  1  4 NA NA
4:  3  2  5 NA NA
5:  4  3  6 NA NA
6: NA NA NA  1  3
7: NA NA NA  2  4

$BB
     C   D  E  G  I
1: 100 300  1 NA NA
2: 101 301  2 NA NA
3: 102 302  3 NA NA
4:  10  20 NA  1 10
5:  11  21 NA  2 11
6:  12  22 NA  3 12
7:   1   4 NA NA NA
8:   2   5 NA NA NA
9:   3   6 NA NA NA
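If you prefer rbindlist()'s explicit fill behaviour over list_rbind(), the same split-based idea can be sketched as:
DT %>%
  split(names(.)) %>%
  map(~ rbindlist(.x, fill = TRUE))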

How to optimize this for loop for bigger data in R?

I have some reproducible data below (my original dataset contains about 2,000,000 rows). At that scale my for loop becomes inefficient and takes a long time to run. Is there a more efficient way to process this data? I've attached my code with the reproducible data.
#----Reproducible data example--------------------#
#Upload first data set#
words1<-c("How","did","Quebec","nationalists","see","their","province","as","a","nation","in","the","1960s")
words2<-c("Why","does","volicty","effect","time",'?',NA,NA,NA,NA,NA,NA,NA)
words3<-c("How","do","I","wash","a","car",NA,NA,NA,NA,NA,NA,NA)
library<-c("The","the","How","see","as","a","for","then","than","example")
embedding1<-c(.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
embedding2<-c(.1,.5,.4,.8,.9,.3,.98,.73,.48,.56)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")
#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1, embedding2)
names(df2)<-c("library","embedding1","embedding2")
df2$meanembedding=rowMeans(df2[c("embedding1","embedding2")],na.rm=T)
df2<-df2[,-c(2,3)]
#-----Find columns--------#
l=ncol(df)
names<-names(df)
head(names)
classes<-sapply(df[,c(1:l)],class)
head(classes)
#------Combine and match library to training data------#
require(gridExtra)
List = list()
for (name in names) {
  df1 <- df[, name]
  df1 <- as.data.frame(df1)
  x_train2 <- merge(x = df1, y = df2,
                    by.x = "df1", by.y = 'library', all.x = TRUE, sort = FALSE)
  x_train2 <- x_train2[, -1]
  x_train2 <- as.data.frame(x_train2)
  names(x_train2) <- name
  List[[length(List) + 1]] = x_train2
}
A better approach would be to use lapply:
myList2 <- lapply(names(df), function(x){
  y <- merge(x = df[, x, drop = FALSE],
             y = df2,
             by.x = x,
             by.y = 'library',
             all.x = TRUE,
             sort = FALSE)[, -1, drop = FALSE]
  names(y) <- x
  return(y)
})
We loop over the vector names(df), subset and merge on the fly, using drop = FALSE to prevent the simplification from a one-column data.frame to a vector, and overwrite the column name. The output is a list.
Postscript: you technically do not need the drop = FALSE if you use df[x] instead of df[, x], as @RuiBarradas pointed out. But it is helpful to know about the drop = FALSE option in cases where you need to subset both rows and columns.
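A quick illustration of the difference in base R:
str(df[, "words1"])               # character vector: the dimension was dropped
str(df["words1"])                 # one-column data.frame: no drop
str(df[, "words1", drop = FALSE]) # one-column data.frame via drop = FALSE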
When joining on large data volumes, give data.table a try:
library(data.table)
dt  <- as.data.table(df)
dt2 <- as.data.table(df2)
lapply(names(dt), function(x) {
  on_expr <- parse(text = paste0("c( library = \"", x, "\" )"))
  dt2[dt, on = eval(on_expr)][, 2]
})
# [[1]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.80
# 6: NA
# 7: NA
# 8: 0.90
# 9: 0.30
# 10: NA
# 11: NA
# 12: 0.55
# 13: NA
#
# [[2]]
# meanembedding
# 1: NA
# 2: NA
# 3: NA
# 4: NA
# 5: NA
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
#
# [[3]]
# meanembedding
# 1: 0.55
# 2: NA
# 3: NA
# 4: NA
# 5: 0.30
# 6: NA
# 7: NA
# 8: NA
# 9: NA
# 10: NA
# 11: NA
# 12: NA
# 13: NA
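As an aside, the on= specification can also be built without parse(); a sketch, assuming a data.table version whose on= accepts a pre-built named character vector (names = columns of dt2, values = columns of dt):
lapply(names(dt), function(x) {
  # setNames(x, "library") builds c(library = x), i.e. dt2$library == dt[[x]];
  # selecting meanembedding by name replaces the positional [, 2]
  dt2[dt, on = setNames(x, "library")][, .(meanembedding)]
})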

Inline ifelse assignment in data.table

Let the following data set be given:
library('data.table')
set.seed(1234)
DT <- data.table(x = LETTERS[1:10], y =sample(10))
my.rows <- sample(1:dim(DT)[1], 3)
I want to add a new column to the data set such that, whenever the rows of the data set match the row numbers given by my.rows the entry is populated with, say, true, or false otherwise.
I have got DT[my.rows, z:= "true"], which gives
head(DT)
   x y    z
1: A 2   NA
2: B 6   NA
3: C 5 true
4: D 8   NA
5: E 9 true
6: F 4   NA
but I do not know how to automatically populate the else condition as well, at the same time. I guess I should make use of some sort of inline ifelse but I am lacking the correct syntax.
We can compare 'my.rows' with the sequence of rows using %in% to create a logical vector and assign it (:=) to create the 'z' column.
DT[, z:= 1:.N %in% my.rows ]
Or another option would be to create 'z' as a column of 'FALSE', using 'my.rows' as 'i', we assign the elements in 'z' that correspond to 'i' as 'TRUE'.
DT[, z:= FALSE][my.rows, z:= TRUE]
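A quick way to convince yourself the two forms agree (working on copies so DT itself is untouched):
tmp1 <- copy(DT)[, z := 1:.N %in% my.rows]
tmp2 <- copy(DT)[, z := FALSE][my.rows, z := TRUE]
identical(tmp1$z, tmp2$z)  # TRUE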
Another option, using ifelse over the row index (note this fills the else branch with NA rather than FALSE, as the output below shows):
DT <- cbind(DT, z = ifelse(DT[, .I] %in% my.rows, T, NA))
> DT
# x y z
# 1: A 2 NA
# 2: B 6 NA
# 3: C 5 TRUE
# 4: D 8 NA
# 5: E 9 TRUE
# 6: F 4 NA
# 7: G 1 TRUE
# 8: H 7 NA
# 9: I 10 NA
#10: J 3 NA

R: Merge data.table and fill in NAs

Suppose 3 data tables:
dt1<-data.table(Type=c("a","b"),x=1:2)
dt2<-data.table(Type=c("a","b"),y=3:4)
dt3<-data.table(Type=c("c","d"),z=3:4)
I want to merge them into 1 data table, so I do this:
dt4<-merge(dt1,dt2,by="Type") # No error, produces what I want
dt5<-merge(dt4,dt3,by="Type") # Produces empty data.table (0 rows) of 4 cols: Type,x,y,z
Is there a way to make dt5 like this instead?:
> dt5
   Type  x  y  z
1:    a  1  3 NA
2:    b  2  4 NA
3:    c NA NA  3
4:    d NA NA  4
While you explore the all argument to merge, I'll also offer you an alternative that you might want to consider:
Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), list(dt1, dt2, dt3))
# Type x y z
# 1: a 1 3 NA
# 2: b 2 4 NA
# 3: c NA NA 3
# 4: d NA NA 4
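Spelled out for just the three tables in the question: all = TRUE requests a full outer join, so unmatched Types are kept and filled with NA:
dt4 <- merge(dt1, dt2, by = "Type", all = TRUE)
dt5 <- merge(dt4, dt3, by = "Type", all = TRUE)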
If you know in advance the unique values you have in your Type column you can use J and then join tables the data.table way. You should set the key for each table so data.table knows what to join on, like this...
# setkeys
setkey( dt1 , Type )
setkey( dt2 , Type )
setkey( dt3 , Type )
# Join
dt1[ dt2[ dt3[ J( letters[1:4] ) , ] ] ]
# Type x y z
#1: a 1 3 NA
#2: b 2 4 NA
#3: c NA NA 3
#4: d NA NA 4
This shows off data.table's compound queries (i.e. dt1[dt2[dt3[...]]] ) which are wicked!
If you don't know in advance the unique values for the key column you can make a list of your tables and use lapply to quickly run through them getting the unique values to make your J expression...
# A simple way to get the unique values to make 'J', assuming they are in
# the first column. Note the use of `[[` to extract the column as a vector;
# with data.tables, single-bracket `[`(x, 1) would take the first row instead.
ll <- list( dt1 , dt2 , dt3 )
vals <- unique( unlist( lapply( ll , `[[` , 1 ) ) )
#[1] "a" "b" "c" "d"
Then use it like before, i.e. dt1[ dt2[ dt3[ J( vals ) , ] ] ].
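Putting the pieces together as one sketch (keys set as above):
ll   <- list( dt1 , dt2 , dt3 )
vals <- unique( unlist( lapply( ll , `[[` , 1 ) ) )
dt1[ dt2[ dt3[ J( vals ) , ] ] ]  # same result as the keyed join above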

Remove constant columns with or without NAs

I am trying to get many lm models to work in a function and I need to automatically drop constant columns from my data.table. Thus, I want to keep only columns with two or more unique values, excluding NA from the count.
I tried several methods found on SO, but I am still not able to drop columns that have two values: a constant and NAs.
My reproducible code:
library(data.table)
df <- data.table(x = c(1,2,3,NA,5), y = c(1,1,NA,NA,NA), z = c(NA,NA,NA,NA,NA),
                 d = c(2,2,2,2,2))
> df
    x  y  z d
1:  1  1 NA 2
2:  2  1 NA 2
3:  3 NA NA 2
4: NA NA NA 2
5:  5 NA NA 2
My intention is to drop columns y, z, and d since they are constant, including y, which has only one unique value when NAs are omitted.
I tried this:
same <- sapply(df, function(.col){ all(is.na(.col)) || all(.col[1L] == .col)})
df1 <- df[ , !same, with = FALSE]
> df1
    x  y
1:  1  1
2:  2  1
3:  3 NA
4: NA NA
5:  5 NA
As seen, 'y' is still there ...
Any help?
Because you have a data.table, you may use uniqueN and its na.rm argument:
df[ , lapply(.SD, function(v) if(uniqueN(v, na.rm = TRUE) > 1) v)]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
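To see how each column is scored, run uniqueN() over the columns directly:
sapply(df, uniqueN, na.rm = TRUE)
# x y z d
# 4 1 0 1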
A base alternative could be Filter(function(x) length(unique(x[!is.na(x)])) > 1, df)
There is a simple solution using the function Filter from base R:
library(data.table)
df <- data.table(x = c(1,2,3,NA,5), y = c(1,1,NA,NA,NA), z = c(NA,NA,NA,NA,NA),
                 d = c(2,2,2,2,2))
# Select only columns for which SD is not 0
> Filter(function(x) sd(x, na.rm = TRUE) != 0, df)
x
1: 1
2: 2
3: 3
4: NA
5: 5
Note: Don't forget to use na.rm = TRUE.
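Without it, sd() returns NA for any column containing an NA, so even the non-constant x would be dropped:
sd(df$x)               # NA  -> x would wrongly be dropped
sd(df$x, na.rm = TRUE) # 1.707825 -> x is kept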
Check if the variance is zero:
df[, sapply(df, var, na.rm = TRUE) != 0, with = FALSE]
# x
# 1: 1
# 2: 2
# 3: 3
# 4: NA
# 5: 5
Here is an option:
df[, which(df[, unlist(sapply(.SD, function(x) length(unique(x[!is.na(x)])) > 1))]),
   with = FALSE]
x
1: 1
2: 2
3: 3
4: NA
5: 5
For each column of the data.table we count the number of unique values other than NA, and keep only the columns that have more than one such value.
If you really mean dropping those columns (by reference), here is a solution:
library(data.table)
dt <- data.table(x = c(1,2,3,NA,5),
                 y = c(1,1,NA,NA,NA),
                 z = c(NA,NA,NA,NA,NA),
                 d = c(2,2,2,2,2))
for (col in names(copy(dt))) {
  v = var(dt[[col]], na.rm = TRUE)
  if (v == 0 | is.na(v)) dt[, (col) := NULL]
}
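The loop keys off var(): before it runs, the per-column variances with na.rm = TRUE are 0 for the constant y and d and NA for the all-NA z, so all three get removed by reference:
sapply(dt, var, na.rm = TRUE)
#        x        y        z        d
# 2.916667        0       NA        0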
Just change
all(is.na(.col)) || all(.col[1L] == .col)
to
all(is.na(.col) | .col[1L] == .col)
The element-wise version flags a column as constant when every entry is either NA or equal to the first entry, so y (one repeated value plus NAs) is now caught too. Final code:
same <- sapply(df, function(.col) all(is.na(.col) | .col[1L] == .col))
df1 <- df[, !same, with = FALSE]
Result:
x
1: 1
2: 2
3: 3
4: NA
5: 5
For removing constant columns with pandas (note: this answer is Python, not R):
Numeric columns:
constant_col = [const for const in df.columns if df[const].std() == 0]
print(len(constant_col))
print(constant_col)
Categorical columns:
constant_col = [const for const in df.columns if len(df[const].unique()) == 1]
print(len(constant_col))
print(constant_col)
Then drop the columns using the drop method, e.g. df.drop(columns=constant_col).
library(janitor)
df %>%
  remove_constant(na.rm = TRUE)
x
1: 1
2: 2
3: 3
4: NA
5: 5
