I would like to change a data.table by doing a join within a function. I understand that data.tables work by reference, so assumed that reassigning a joined version of a data.table to itself would change the original data.table. What simple thing have I misunderstood?
Thanks!
library('data.table')
# function to restrict DT to subset, by join
join_test <- function(DT) {
test_dt = data.table(a = c('a', 'b'), c = c('x', 'y'))
setkey(test_dt, 'a')
setkey(DT, 'a')
DT <- DT[test_dt]
}
DT = data.table(a = c("a","b","c"), b = 1:3)
print(DT)
# a b
# 1: a 1
# 2: b 2
# 3: c 3
haskey(DT)
# [1] FALSE
join_test(DT)
print(DT)
# a b
# 1: a 1
# 2: b 2
# 3: c 3
haskey(DT)
# [1] TRUE
(haskey calls included just to double-check that some of the by reference changes work)
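You can do it by reference, since an update join assigns columns based on the joined values without saving the joined table back. Your version fails because DT <- DT[test_dt] only rebinds the local name DT inside the function, so the caller's table is never touched (setkey, by contrast, does modify it in place, which is why haskey changes). With an update join you need to explicitly pick the columns you're after, and note that it adds column c but does not drop the unmatched row: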
join_test <- function(DT) {
test_dt = data.table(a = c('a', 'b'), c = c('x', 'y'))
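# i.c is test_dt's column c; the prefix keeps the lookup unambiguous if DT already has a column c
DT[test_dt, c := i.c, on = 'a']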
}
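To make the difference concrete, here is a minimal sketch contrasting the two behaviours. It assumes the same DT as in the question; the function names join_local and join_by_ref are made up for this illustration.

```r
library(data.table)

# Rebinds the local name DT to a new joined table; the caller's object is untouched.
join_local <- function(DT) {
  test_dt <- data.table(a = c("a", "b"), c = c("x", "y"))
  DT <- DT[test_dt, on = "a"]
  invisible(NULL)
}

# Adds column c to the caller's table in place via an update join.
join_by_ref <- function(DT) {
  test_dt <- data.table(a = c("a", "b"), c = c("x", "y"))
  DT[test_dt, c := i.c, on = "a"]
  invisible(DT)
}

DT <- data.table(a = c("a", "b", "c"), b = 1:3)
join_local(DT)
names(DT)    # still only a and b
join_by_ref(DT)
names(DT)    # now includes c; the row with a == "c" has c = NA
```

The update join keeps all three rows, so if you also need to drop the unmatched rows you still have to return the joined table and reassign it.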
Having your function return the data table and storing the result in DT will get you what you want.
join_test <- function(DT) {
test_dt = data.table(a = c('a', 'b'), c = c('x', 'y'))
setkey(test_dt, 'a')
setkey(DT, 'a')
DT <- DT[test_dt]
return(DT)
}
DT = data.table(a = c("a","b","c"), b = 1:3)
DT <- join_test(DT)
print(DT)
# a b c
# 1: a 1 x
# 2: b 2 y
I encountered this code in a Kaggle notebook:
corrplot.mixed(corr = cor(videos[,c("category_id","views","likes",
"dislikes","comment_count"),with=F]))
videos is a data.frame, and "category_id", "views", "likes", "dislikes" and "comment_count" are columns in it.
What does the with parameter do when selecting a subset of a data frame?
As mentioned by @user20650, videos might be a data.table. In that case, though, your code should work even without with = F.
Consider this example:
library(data.table)
dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
To subset columns a and b using a character vector, you could do
dt[, c('a', 'b'), with = F]
# a b
#1: 1 5
#2: 2 4
#3: 3 3
#4: 4 2
#5: 5 1
However, as mentioned, this works the same without with = F:
dt[, c('a', 'b')]
with = F is helpful when you have a vector of column names stored in a variable.
cols <- c('a', 'b')
dt[, cols] ##Error
dt[, cols, with = F] ##Works
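For completeness, here is a short sketch of two other ways to select columns whose names are stored in a variable. It reuses the dt and cols from the example above; the result names res_dots, res_sd and res_with are only for illustration.

```r
library(data.table)

dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
cols <- c("a", "b")

# The .. prefix tells data.table to look up cols in the calling scope
res_dots <- dt[, ..cols]

# .SDcols restricts .SD to the named columns
res_sd <- dt[, .SD, .SDcols = cols]

# Both give the same table as with = FALSE
res_with <- dt[, cols, with = FALSE]
```

Which form to use is mostly a matter of taste; the .. prefix is the shortest when all you need is the selection.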
I have a specific data.table question: is there a way to do an update join, but by group? Let me give an example:
df1 <- data.table(ID = rep(letters[1:3],each = 3),x = c(runif(3,0,1),runif(3,1,2),runif(3,2,3)))
df2 <- data.table(ID = c(letters[1],letters[1:5]))
> df2
ID
1: a
2: a
3: b
4: c
5: d
6: e
> df1
ID x
1: a 0.9719153
2: a 0.8897171
3: a 0.7067390
4: b 1.2122764
5: b 1.7441528
6: b 1.3389710
7: c 2.8898255
8: c 2.0388562
9: c 2.3025064
I would like to do something like
df2[df1,plouf := sample(i.x),on ="ID"]
But for each ID group, meaning that plouf would be a sample of the x values for each corresponding ID. The above line of code does not work this way; it samples the whole x vector:
> df2
ID plouf
1: a 1.3099715
2: a 0.8540039
3: b 2.0767138
4: c 0.6530148
5: d NA
6: e NA
You can see that the values of plouf are not the x values corresponding to the ID group in df1. I would like the plouf values to be between 0 and 1 for a, 1 and 2 for b, and 2 and 3 for c. I want to sample without replacement.
I tried :
df2[df1,plouf := as.numeric(sample(i.x,.N)),on ="ID",by = .EACHI]
which does not work:
Error in sample.int(length(x), size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
This other attempt seems to be working:
df2$plouf <- df2[df1,on ="ID"][,sample(x,df2[ID == ID2,.N]),by = .(ID2 = ID)]$V1
But I find it hard to read and understand, it could be problematic with more than one grouping variable, and I am not sure it is efficient. I am sure there is a nice simple way to write it, but I can't find it. Any ideas?
Another option:
df1[df2[, .N, ID], on=.(ID), sample(x, N), by=.EACHI]
output:
ID V1
1: a 0.2655087
2: a 0.3721239
3: b 1.2016819
4: c 2.6607978
5: d NA
6: e NA
data:
library(data.table)
set.seed(0L)
df1 <- data.table(ID = rep(letters[1:3],each = 3),x = c(runif(3,0,1),runif(3,1,2),runif(3,2,3)))
df2 <- data.table(ID = c(letters[1],letters[1:5]))
Addressing comment:
library(data.table)
set.seed(0L)
df1 <- data.table(ID = rep(letters[1:3],each = 3),
NAME = rep(LETTERS[1:3],each = 3),
x = c(runif(3,0,1),runif(3,1,2),runif(3,2,3)))
df2 <- data.table(ID = c(letters[1],letters[1:5]),
NAME = c(LETTERS[1],LETTERS[1:5]))
df2[, ri := rowid(ID, NAME)][
df1[df2[, .N, .(ID, NAME)], on=.(ID, NAME), .(ri=1L:N, VAL=sample(x, N)), by=.EACHI],
on=.(ri, ID, NAME), VAL := VAL]
df2
If it is too repetitive to type ID, NAME, you can use
cols <- c("ID", "NAME")
df2[, ri := rowidv(.SD, cols)][
df1[df2[, .N, cols], on=cols, .(ri=1L:N, VAL=sample(x, N)), by=.EACHI],
on=c("ri", cols), VAL := VAL]
df2
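One caveat with the sample(x, N) calls above: when x is a single number of at least 1, sample() draws from 1:x instead of returning x itself, which would matter for an ID that has only one row in df1. A possible guard, sketched here on the single-key data, is to sample indices instead. The helper name resample is an assumption for this illustration (it follows the pattern shown in ?sample).

```r
library(data.table)
set.seed(0L)

# Index-based sampling, the pattern suggested in ?sample: safe for length-1 x
resample <- function(x, n) x[sample.int(length(x), n)]

df1 <- data.table(ID = rep(letters[1:3], each = 3),
                  x = c(runif(3, 0, 1), runif(3, 1, 2), runif(3, 2, 3)))
df2 <- data.table(ID = c(letters[1], letters[1:5]))

# N is the per-ID row count of df2; unmatched IDs (d, e) see x = NA and return NA
res <- df1[df2[, .N, ID], on = .(ID), resample(x, N), by = .EACHI]
res
```

Like sample(), this still raises an error if df2 asks for more values than an ID has in df1, which is the expected behaviour when sampling without replacement.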
Sample with replacement
You can do that like this:
df2[, plouf := df1[df2, on = .(ID),
sample(x, size = 1),
by=.EACHI]$V1]
You can join on the ID variable, but you must specify by=.EACHI so that the sample is drawn separately for each row of df2 from the matching x values. The $V1 extracts the computed column, which data.table names V1 by default.
Result:
ID plouf
1: a 0.042188292
2: a 0.002502247
3: b 1.145714600
4: c 2.541768627
5: d NA
6: e NA
Sample without replacement
It's not pretty but it works:
df2[, plouf := NA_real_]
# create temporary table of number of samples required for each group
temp <- df2[, .N, by = ID]
for (id in temp$ID) {
  pool <- df1[ID == id, x]
  n <- temp[ID == id, N]
  # leave NA for IDs missing from df1 or with too few values to sample without replacement
  if (length(pool) < n) next
  # sample indices so that a single-value pool is not treated as 1:x by sample()
  df2[ID == id, plouf := pool[sample.int(length(pool), n)]]
}
Thanks to @David Arenburg for help
data.table chaining is graceful and intuitive. Everything just lines up like a machine. But sometimes we have to introduce an operation like dcast or melt.
How can I integrate all the operations into the []? Simply because it's more graceful, I admit.
library(data.table)
library(magrittr) # provides %>%
DT <- data.table(A = rep(letters[1:3],4), B = rep(1:4,3), C = rep(c("OK", "NG"),6))
DT.1 <- DT[,.N, by = .(B,C)] %>% dcast(B~C)
DT.2 <- DT.1[,.N, by = .(NG)]
# NG N
#1: NA 2
#2: 3 2
#same
DT <- data.table(A = rep(letters[1:3],4), B = rep(1:4,3), C = rep(c("OK", "NG"),6))[,.N, by = .(B, C)] %>%
dcast(B~C) %>% .[,.N, by =.(NG)]
Can I remove the %>% and integrate into the []?
Thanks
What about using .SD to this end:
DT[, .N, by = .(B, C)
][, dcast(.SD, B ~ C)
][, .N, by = .(NG)]
NG N
1: NA 2
2: 3 2
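Another way to avoid the pipe, sketched here as an alternative and not as the only option, is to wrap the first step in dcast() and chain [ directly onto its result. The name res is only for illustration, and value.var = "N" is spelled out to avoid relying on dcast guessing the value column.

```r
library(data.table)

DT <- data.table(A = rep(letters[1:3], 4), B = rep(1:4, 3), C = rep(c("OK", "NG"), 6))

# dcast() returns a data.table, so [ can be chained straight onto the call
res <- dcast(DT[, .N, by = .(B, C)], B ~ C, value.var = "N")[, .N, by = .(NG)]
res
```

This reads inside-out instead of left to right, so the .SD version may still be preferable when the chain gets longer.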
I am trying to perform a character operation (paste) in a column from one data.table using data from a second data.table.
Since I am also performing other unrelated merge operations before and after this particular code, the rows order might change, so I am currently setting the order both before and after this manipulation.
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3)) # N used
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
# without merge
DT1 <- DT1[order(ID)]
DT2 <- DT2[order(ID)]
DT1[, N := paste0(N, "/", DT2$N)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
I know a merge of the two DTs (by definition) would take care of the matching, but this creates extra columns that I need to remove afterwards.
# using merge
DT1 <- merge(DT1, DT2, by = "ID")
DT1[, N := paste0(N.x, "/", N.y)]
DT1[, c("N.x", "N.y") := list(NULL, NULL)]
DT1
# ID N
# 1: a 4/10
# 2: b 1/10
# 3: c 3/15
Is there a more intelligent way of doing this using data.table?
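We can use an update join after converting the 'N' column of DT1 to character, since an update join cannot change the type of an existing column. In the join, i.N refers to the N column of DT2: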
DT1[DT2, N := paste0(N, "/", i.N), on = .(ID)]
DT1
# ID N
#1: a 4/10
#2: b 1/10
#3: c 3/15
data
DT1 <- data.table(ID = c("a", "b", "c"), N = c(4,1,3))
DT2 <- data.table(ID = c("b","a","c"), N = c(10,10, 15)) # N total
DT1[, N:= as.character(N)]
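If you would rather not convert N up front, one possible variant is to write the joined result to a new column and then swap it in. This is only a sketch; the temporary column name ratio is an assumption for this illustration.

```r
library(data.table)

DT1 <- data.table(ID = c("a", "b", "c"), N = c(4, 1, 3))    # N used
DT2 <- data.table(ID = c("b", "a", "c"), N = c(10, 10, 15)) # N total

# N is DT1's column, i.N is DT2's; rows are matched on ID, so row order does not matter
DT1[DT2, ratio := paste0(N, "/", i.N), on = .(ID)]

# swap the numeric N for the character ratio
DT1[, N := NULL]
setnames(DT1, "ratio", "N")
DT1
```

Either way there is no need to sort the two tables first, because the join matches rows by ID.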
DT is a data.table with columns "A" (column index 1), "B" (column index 2), "C", etc.
For example, the following code makes a subset DT1 consisting of the rows where A == 2:
DT1 <- DT[A==2, ]
But how can I make a subset like DT1 using only the column index?
For example, code like the following does not work:
DT1 <- DT[.SD==2, .SDcols = 1]
It is not recommended to use column indices instead of column names, as it makes your code difficult to understand and fragile to any changes that could happen to your data. (See, for example, the first paragraph of the first question in the package FAQ.) However, you can subset with a column index as follows:
DT = data.table(A = 1:5, B = 2:6, C = 3:7)
DT[DT[[1]] == 2]
# A B C
#1: 2 3 4
We can get the row indices with .I and use them to subset DT:
DT[DT[, .I[.SD==2], .SDcols = 1]]
# A B C
#1: 2 3 4
data
DT <- data.table(A = 1:5, B = 2:6, C = 3:7)
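A middle ground, offered here only as a sketch, is to resolve the index to a column name once and then filter by that name, which keeps the rest of the code readable. The variable name col is an assumption for this illustration.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

col <- names(DT)[1]        # resolve index 1 to the column name "A" once
DT1 <- DT[get(col) == 2]   # filter on that column by name
DT1
```

This still depends on the column order at the point where names(DT)[1] is evaluated, so the fragility warning above applies to that one line.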