I am looping through different data.tables and through the variables in each data.table, but I'm having trouble referencing the variables inside the for loop.
dt1 <- data.table(a1 = c(1,2,3), a2 = c(4,5,2))
dt2 <- data.table(a1 = c(1,43,1), a2 = c(52,4,1))
For each data.table, I want to find the average of each variable over the observations where that variable != 1. Below is my attempt, which doesn't work:
dtname = 'dt'
ind = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d, sep=''))
  for (v in ind) {
    varname <- paste0('a', v, sep='')
    df1 <- df %>%
      filter(varname != 1) %>%
      summarise(varname = mean(varname))
    print(df1)
  }
}
The desired output is to print the average of a1 = c(2,3) in dt1, the average of a2 = c(4,5,2) in dt1, the average of a1 = c(43) in dt2, and the average of a2 = c(52,4) in dt2.
What am I doing wrong here? In general, how should I reference a variable inside a for loop (varname) that is pieced together from the looping index (v) and something else?
For a purely data.table way, I would combine the different data.tables and compute the averages:
# Concatenate the data.tables:
all_dt <- rbind("dt1" = dt1, "dt2" = dt2, idcol = "origin")
all_dt
# origin a1 a2
# 1: dt1 1 4
# 2: dt1 2 5
# 3: dt1 3 2
# 4: dt2 1 52
# 5: dt2 43 4
# 6: dt2 1 1
# Melt so that "a1" and "a2" are labels in a group column:
all_dt <- melt(all_dt, id.vars="origin")
all_dt
# origin variable value
# 1: dt1 a1 1
# 2: dt1 a1 2
# 3: dt1 a1 3
# 4: dt2 a1 1
# 5: dt2 a1 43
# 6: dt2 a1 1
# 7: dt1 a2 4
# 8: dt1 a2 5
# 9: dt1 a2 2
# 10: dt2 a2 52
# 11: dt2 a2 4
# 12: dt2 a2 1
# Compute averages by each data.table and column group, ignoring 1s:
all_dt[value != 1, .(mean = mean(value)), by = .(origin, variable)]
# origin variable mean
# 1: dt1 a1 2.500000
# 2: dt2 a1 43.000000
# 3: dt1 a2 3.666667
# 4: dt2 a2 28.000000
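If you still want one result per original table, as in the question, the grouped summary can be split back apart (a small addition on my part, using data.table's split method):
means <- all_dt[value != 1, .(mean = mean(value)), by = .(origin, variable)]
split(means, by = "origin")  # a list with one data.table per origin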
I figured out a solution based on the comments of @Amar and @Scott Richie:
for (d in ind) {
  df <- get(paste0('dt', d))
  for (v in ind) {
    varname <- paste0('a', v)
    df1 <- df[eval(as.name(varname)) != 1,
              .(mean = mean(eval(as.name(varname))))]
    print(df1)
  }
}
Thanks EVERYONE!
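An equivalent, slightly shorter form (my aside, not from the original comments) uses get() instead of eval(as.name()):
for (d in ind) {
  df <- get(paste0('dt', d))
  for (v in ind) {
    varname <- paste0('a', v)
    # get() resolves the column-name string inside the data.table frame
    print(df[get(varname) != 1, .(mean = mean(get(varname)))])
  }
}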
Would go for a vectorised approach. You are using R!
One possible way:
require(dplyr)
dt1[dt1==1] <- NA #replace 1 with NA
dt1 %>% summarise_all(mean, na.rm = TRUE) #mean of all columns.
a1 a2
1 2.5 3.666667
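A variant that avoids overwriting dt1 in place (a sketch, assuming dplyr >= 0.8 for the ~ lambda syntax):
dt1 %>%
  mutate_all(~ na_if(., 1)) %>%        # turn 1s into NA without modifying dt1
  summarise_all(mean, na.rm = TRUE)    # mean of all columns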
It is not very clear what you are trying to do, but if you want to replace the values in the data frame with the means of its columns, I would suggest using the data.frame type instead, as it is easier to index. Here is code that should work:
dt1 <- data.frame(a1 = c(1,2,3), a2 = c(4,5,2))
dt2 <- data.frame(a1 = c(1,43,1), a2 = c(52,4,1))
dtname = 'dt'
ind = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d))
  for (i in 1:nrow(df)) {
    for (j in 1:ncol(df)) {
      if (df[i, j] != 1) {
        df[, j] <- mean(df[, j])
      }
    }
  }
  print(df)
}
The reason your code was not working before is that the variables were being treated like strings, not actual variables. You can see this by printing the data type of varname:
dtname = 'dt'
ind = c('1', '2')
for (d in ind) {
  df <- get(paste0('dt', d))
  for (v in ind) {
    varname <- paste0('a', v)
    print(class(varname))
  }
}
Which just returns "character"
Another solution using variable names and the data.frame type would be to index the df with the name string held in varname:
df[[varname]]
Here are two helpful links for this kind of operation:
* link 1: How to find the mean of a column
* link 2: Data frames
Let's say I have two data.tables, dt_a and dt_b, defined as below.
library(data.table)
set.seed(20201111L)
dt_a <- data.table(
foo = c("a", "b", "c")
)
dt_b <- data.table(
bar = sample(c("a", "b", "c"), 10L, replace=TRUE),
value = runif(10L)
)
dt_b[]
## bar value
## 1: c 0.4904536
## 2: c 0.9067509
## 3: b 0.1831664
## 4: c 0.0203943
## 5: c 0.8707686
## 6: a 0.4224133
## 7: a 0.6025349
## 8: b 0.4916672
## 9: a 0.4566726
## 10: b 0.8841110
I want to left join dt_b on dt_a by reference, summing over the multiple matches. A way to do so would be to first create a summary of dt_b (thus solving the multiple-match issue) and merge it afterwards.
dt_b_summary <- dt_b[, .(value=sum(value)), bar]
dt_a[dt_b_summary, value_good:=value, on=c(foo="bar")]
dt_a[]
## foo value_good
## 1: a 1.481621
## 2: b 1.558945
## 3: c 2.288367
However, this allocates memory for the intermediate object dt_b_summary, which is inefficient.
I would like to get the same result by joining directly on dt_b and summing over the multiple matches. I'm looking for something like the below, but it won't work:
dt_a[dt_b, value_bad:=sum(value), on=c(foo="bar")]
dt_a[]
## foo value_good value_bad
## 1: a 1.481621 5.328933
## 2: b 1.558945 5.328933
## 3: c 2.288367 5.328933
Does anyone know if something like this is possible?
We can use by = .EACHI, which evaluates j once for each row of i:
library(data.table)
dt_b[dt_a, .(value = sum(value)), on = .(bar = foo), by = .EACHI]
# bar value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
If we want to update the original object 'dt_a'
dt_a[, value := dt_b[.SD, sum(value), on = .(bar = foo), by = .EACHI]$V1]
dt_a
# foo value
#1: a 1.481621
#2: b 1.558945
#3: c 2.288367
For multiple columns
dt_b$value1 <- dt_b$value
nm1 <- c('value', 'value1')
dt_a[, (nm1) := dt_b[.SD, lapply(.SD, sum), on = .(bar = foo),
                     by = .EACHI][, .SD, .SDcols = nm1]]
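With the objects defined above, both columns should now hold the per-group sums (value1 is an exact copy of value):
dt_a
#    foo    value   value1
# 1:   a 1.481621 1.481621
# 2:   b 1.558945 1.558945
# 3:   c 2.288367 2.288367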
I've got two large data.tables, DT1 (2M rows x 300 cols) and DT2 (50M rows x 2 cols), and I would like to merge the values of DT1's columns into a new column in DT2, based on the name of the column specified in a DT2 column. I'd like to achieve this without having to melt DT1, and using data.table operations only, if possible.
Here's a sample dataset.
> require(data.table)
> DT1 <- data.table(ID = c('A', 'B', 'C', 'D'), col1 = (1:4), col2 = (5:8), col3 = (9:12), col4 = (13:16))
> DT1
ID col1 col2 col3 col4
1: A 1 5 9 13
2: B 2 6 10 14
3: C 3 7 11 15
4: D 4 8 12 16
> DT2
ID col
1: A col1
2: B col2
3: B col3
4: C col1
5: A col4
#desired output
> DT2_merge
ID col col_value
1: A col1 1
2: B col2 6
3: B col3 10
4: C col1 3
5: A col4 13
Since dealing with two large data.tables, hoping to find the most efficient way of doing this.
Maybe there is a pure data.table way to do this, but one option is matrix subsetting, where each row of a two-column matrix is treated as a (row, column) coordinate pair:
library(data.table)
setDF(DT1)
DT2[, col_value := DT1[cbind(match(ID, DT1$ID), match(col, names(DT1)))]]
DT2
# ID col col_value
#1: A col1 1
#2: B col2 6
#3: B col3 10
#4: C col1 3
#5: A col4 13
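As a small illustration of that coordinate-pair behaviour (hypothetical data, not from the question):
df <- data.frame(a = 1:3, b = 4:6)
df[cbind(c(1, 3), c(2, 1))]   # elements at (row 1, col 2) and (row 3, col 1)
# [1] 4 3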
Using set():
setkey(DT1, "ID")
setkey(DT2, "ID")
for (k in names(DT1)[-1]) {
  # rows of DT2 that ask for column k
  rows <- which(DT2[["col"]] == k)
  # join DT1 to those rows on ID, pull column k, and assign it to col_value
  set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
}
ID col col_value
1: A col1 1
2: A col4 13
3: B col2 6
4: B col3 10
5: C col1 3
Note: Setting the key up front speeds up the process but reorders the rows.
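If the original row order matters, one option (my suggestion, not part of the original answer) is to record the order before keying and restore it afterwards:
DT2[, orig_order := .I]          # remember the original row positions
setkey(DT1, "ID")
setkey(DT2, "ID")
# ... run the set() loop above ...
setorder(DT2, orig_order)        # restore the original order
DT2[, orig_order := NULL]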
You can use lookup tables to find the indices for subsetting like:
setDF(DT1)
DT2[, col_value := DT1[matrix(c(setNames(seq_len(nrow(DT1)), DT1$ID)[DT2$ID],
setNames(2:NCOL(DT1), colnames(DT1)[-1])[DT2$col]), ncol=2)]]
DT2
# ID col col_value
#1: A col1 1
#2: B col2 6
#3: B col3 10
#4: C col1 3
#5: A col4 13
Using a matrix for subsetting is currently not supported in data.table, so if you have a data.frame instead of a data.table you can do it in base R with:
DT2$col_value <- DT1[matrix(c(setNames(seq_len(nrow(DT1)), DT1$ID)[DT2$ID],
setNames(2:NCOL(DT1), colnames(DT1)[-1])[DT2$col]), ncol=2)]
You can also change your data structure beforehand and switch from matrix- to vector-subsetting:
DT1ID <- setNames(seq_len(nrow(DT1)), DT1$ID)
DT1 <- as.matrix(DT1[,-1])
DT2$col <- as.integer(substring(DT2$col, 4))
DT2$col_value <- DT1[c(DT1ID[DT2$ID] + (DT2$col-1)*nrow(DT1))]
Maybe also try fastmatch:
library(fastmatch)
DT1 <- as.matrix(DT1[,-1], rownames=DT1$ID)
DT2$col <- as.integer(substring(DT2$col, 4))
DT2$col_value <- DT1[c(fmatch(DT2$ID, rownames(DT1)) + (DT2$col-1)*nrow(DT1))]
Or you can avoid the lookup during subsetting and use levels when creating the factors:
DT1 <- as.matrix(DT1[,-1], rownames=DT1$ID, colnames=colnames(DT1)[-1])
DT2$ID <- factor(DT2$ID, levels=rownames(DT1))
DT2$col <- factor(DT2$col, levels=colnames(DT1))
DT2$col_value <- DT1[c(unclass(DT2$ID) + (unclass(DT2$col)-1)*nrow(DT1))]
Here are three solutions that are also applicable to data.frame():
Solution 1
DT2$col_value <- apply(DT2, 1, function(v) DT1[which(DT1$ID==v[1]),which(colnames(DT1)==v[2])])
Solution 2 (the same as the solution by @Ronak Shah), likely much faster than Solution 1 on a large dataset
DT2$col_value <- DT1[cbind(match(DT2$ID,DT1$ID),match(DT2$col,colnames(DT1)))]
Solution 3 (maybe the fastest)
m <- as.matrix(DT1[, -1])   # drop the ID column
rownames(m) <- DT1$ID
# character matrix subsetting: each row of as.matrix(DT2) is a
# (rowname, colname) pair
DT2$col_value <- m[as.matrix(DT2)]
Testing some of the methods on a larger data set to show their performance:
#sindri_baldur
library(data.table)
DT1 <- data.table(ID = rownames(x1), x1)
DT2 <- as.data.table(x2)
setkey(DT1, "ID")
setkey(DT2, "ID")
system.time(for (k in names(DT1)[-1]) {
rows <- which(DT2[["col"]] == k)
set(DT2, i = rows, j = "col_value", DT1[DT2[rows], ..k])
})
#User: 6.696
#Ronak Shah
library(data.table)
DT1 <- data.table(ID = rownames(x1), x1)
DT2 <- as.data.table(x2)
setDF(DT1)
system.time(DT2[, col_value := DT1[cbind(match(ID, DT1$ID), match(col, names(DT1)))]])
#User: 5.210
#Using fastmatch
library(fastmatch)
DT1 <- x1
DT2 <- x2
system.time(DT2$col_value <- DT1[c(fmatch(DT2$ID, rownames(DT1))
+ (fmatch(DT2$col, colnames(DT1))-1)*nrow(DT1))])
#User: 0.061
#Using factors
DT1 <- x1
DT2 <- x2
system.time(DT2$col_value <- DT1[c(unclass(DT2$ID) + (unclass(DT2$col)-1)*nrow(DT1))])
#User: 0.024
Data:
set.seed(7)
nrows <- 1e5
ncols <- 300
x1 <- matrix(sample(0:20, nrows*ncols, replace=TRUE), ncol=ncols,
             dimnames=list(sample(do.call("paste0",
                             expand.grid(rep(list(letters),
                               ceiling(log(nrows, length(letters)))))),
                             nrows),
                           seq_len(ncols)))
x2 <- data.frame(ID=factor(sample(rownames(x1), nrows*10, replace=TRUE),
                           levels=rownames(x1)),
                 col=factor(sample(colnames(x1), nrows*10, replace=TRUE),
                            levels=colnames(x1)))
I'm looking for (1) the name of the following operation and (2) a (cleaner) method for it in R (base and data.table preferred).
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
  # If missing, look in d2
  missing <- is.na(d1[[col]])
  d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
  # If the column is missing entirely, add it from d2
  d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS: This question has likely been asked before, but I lack the vocabulary to search for it.
Assuming you are working with two data.frames, here is a base R solution:
#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]
#replace NAs in d with the corresponding values from mask
replace(d, is.na(d), mask[is.na(d)])
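This should reproduce the expected output from the question:
#   id  x  y z
# 1  1  1 97 d
# 2  2 14  3 c
# 3  3  4  2 b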
If you don't mind, we can rewrite your question as a general matrix-coalesce question (i.e. any number of matrices, columns, and rows), which does not seem to have been asked before.
edit:
Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R
coalesce.mat <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    # align elt's rows with ans by id, then fill ans's NAs from elt
    rn <- match(ans$id, elt$id)
    ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
  }
  ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
        lapply(list(d1, d2), function(x) {
          # pad each data.frame with the columns it is missing
          x[, setdiff(allcols, names(x))] <- NA
          x
        }))
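Given the d1 and d2 from the question, this should again match the expected output:
#   id  x  y z
# 1  1  1 97 d
# 2  2 14  3 c
# 3  3  4  2 b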
edit:
A possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan:
coalesce1a <- function(...) {
  ans <- ..1
  for (elt in list(...)[-1]) {
    # fill positions that are still NA from the next vector
    i <- which(is.na(ans))
    ans[i] <- elt[i]
  }
  ans
}
setDT(d1)
setDT(d2)
#melt into long format and full outer join the two
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"),
             by=c("id", "variable"), all=TRUE)
#coalesce the two value columns
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot back into the original format and subset to the ids in d1
dcast.data.table(mdt, id ~ variable, value.var="value")[d1, .SD, on=.(id)]
Here is a possibility using dplyr::left_join:
left_join(d1, d2, by = "id") %>%
  mutate(
    x = ifelse(!is.na(x.x), x.x, x.y),
    y = ifelse(!is.na(y.x), y.x, y.y)) %>%
  select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
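Since the real data.frames have hundreds of columns, hard-coding x and y won't scale; a sketch of a generalisation (my addition, assuming dplyr's coalesce) could look like this:
joined <- left_join(d1, d2, by = "id", suffix = c(".x", ".y"))
common <- setdiff(intersect(names(d1), names(d2)), "id")
for (col in common) {
  # prefer the d1 value, fall back to the d2 value
  joined[[col]] <- dplyr::coalesce(joined[[paste0(col, ".x")]],
                                   joined[[paste0(col, ".y")]])
}
joined[, c("id", common, setdiff(names(d2), names(d1)))]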
We can use data.table with coalesce from dplyr. Create vectors of the column names that are common to both datasets ('nm1') and of those only in the second ('nm2'). Convert the first dataset to a data.table (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced columns of the first and second datasets (the second's versions carry the prefix i. for common columns) to update the values in the first dataset:
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
                                   mget(paste0("i.", nm1))),
                               mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b
I have the following data:
library(data.table)
dt1 <- data.table(var1 = c("wk1","wk1","wk2"),
var2 = c(1,2,3))
dt2 <- data.table(var3 = c("a","b","c"),
var2 = c(1,2,3))
lista <- list(dt1,dt2)
dt_main <- data.table(var1 = c("wk1","wk2"),
var4 = c(100,200))
I want to merge all elements of lista which contain the variable var1 with the dt_main data.table, so in the end I would like lista to look like this:
dt1 <- data.table(var1 = c("wk1","wk1","wk2"),
var2 = c(1,2,3),
var4 = c(100,100,200))
dt2 <- data.table(var3 = c("a","b","c"),
var2 = c(1,2,3))
lista <- list(dt1,dt2)
I tried
mapply(function(X, Y) {
  if ("var1" %in% names(X)) {
    X <- merge(X, Y, by = "var1")
  }
}, X = lista, Y = dt_main)
but it does not work. Any help?
You could use an lapply and merge inside the function:
lapply(lista, function(x) if (!is.null(x$var1)) {
  #the function checks if there is a var1 column
  #and if there is, it gets merged to the x data.table
  return(merge(dt_main, x, by = 'var1', all.x = TRUE))
} else {
  #otherwise it just returns the data.table
  return(x)
})
# [[1]]
# var1 var4 var2
# 1: wk1 100 1
# 2: wk1 100 2
# 3: wk2 200 3
#
# [[2]]
# var3 var2
# 1: a 1
# 2: b 2
# 3: c 3
A somewhat different way of doing this:
lapply(lista, function(x)
  if ('var1' %in% names(x))
    x[dt_main, on = 'var1', var4 := var4][]
  else x
)
which gives:
[[1]]
var1 var2 var4
1: wk1 1 100
2: wk1 2 100
3: wk2 3 200
[[2]]
var3 var2
1: a 1
2: b 2
3: c 3
When assigning by reference in a data.table using a column from a second data.table, the results are inconsistent. When there are no matches on the key columns of the two data.tables, it appears the assignment expression y := y is ignored entirely; not even NAs are returned.
library(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
print(dt1[dt2, y := y])
## id x # Would have also expected column: y
## 1: 1 3 # NA
## 2: 2 4 # NA
However, when there is a partial match, the non-matching rows get a placeholder NA.
dt2[, id := 2:3]
print(dt1[dt2, y := y])
## id x y
## 1: 1 3 NA # <-- placeholder NA here
## 2: 2 4 5
This wreaks havoc on later code that assumes a y column exists in all cases; otherwise I keep having to write cumbersome additional checks to handle both cases.
Is there an elegant way around this inconsistency?
With this recent commit, this issue (#759) is now fixed in v1.9.7. It works as expected when nomatch=NA (the current default).
require(data.table)
dt1 <- data.table(id = 1:2, x = 3:4, key = "id")
dt2 <- data.table(id = 3:4, y = 5:6, key = "id")
dt1[dt2, y := y][]
# id x y
# 1: 1 3 NA
# 2: 2 4 NA
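For versions before the fix, one workaround (my suggestion, not part of the original answer) is to pre-allocate the column so it exists even when the join finds no matches:
dt1[, y := NA_integer_]   # create the column up front
dt1[dt2, y := i.y]        # the join then only fills matching rows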
Using merge works:
> dt3 <- merge(dt1, dt2, by='id', all.x=TRUE)
> dt3
id x y
1: 1 3 NA
2: 2 4 NA