I do a simple join of 2 data.tables as follows:
set.seed(1)
DT1 <- data.table(
  Idx = rep(1:100),
  x1 = round(rnorm(100, 0.75, 0.3), 2),
  x2 = round(rnorm(100, 0.75, 0.3), 2),
  x3 = round(rnorm(100, 0.75, 0.3), 2))
DT2 <- data.table(
  Idx2 = rep(1:100),
  x1 = round(rep(pi, 100), 2),
  targetcol = rep(999, 100))
DT2[DT1,on = c(Idx2 = "Idx")]
This works, but the result contains a column i.x1, which I do not want. I only want to include 'targetcol', hence the name. The problem is that in another example I have many of these duplicate columns prefixed with 'i.', and I would like to delete them, or better, exclude them during the merge. I know this should be possible with X[Y, .(...)], but I didn't find the right way to fill the dots in .(...) with all but one column, i.e. with all but i.x1. So I wonder what is the best way to select multiple columns in data.table with the list syntax as above?
Secondly, I tried data.table's merge syntax:
merge(x = DT1, y = DT2[,c("Idx2","targetcol")], by.x = "Idx",by.y = "Idx2", all.x=TRUE)
but it leads to a different column ordering and naming (x1.x and x1.y), and moreover, I read it is slower than the join syntax above.
What is the best method to solve this (also in case there are many more columns and duplicates; this was just to illustrate the issue)?
Answer moved from the comments, with slight modification, from HubertL's code:
DT1[DT2[, .(Idx2, targetcol)], on = c(Idx = "Idx2")]
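If there are many columns to carry over and you don't want to type them all out, here is a hedged sketch of building the selection programmatically, assuming the unwanted duplicates are exactly the names DT2 shares with DT1 (apart from the join key):
keep <- setdiff(names(DT2), setdiff(names(DT1), "Idx"))  # here: "Idx2", "targetcol"
DT1[DT2[, ..keep], on = c(Idx = "Idx2")]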
I didn't quite know how to word the title, but here is what I'm trying to do. I'd like to grow the data table dt1 using columns from dt2. In dt1, there are duplicated values in the column I'm updating/merging by. My goal is to populate new columns in dt1 at the duplicates only if a condition specified by another variable is met. Let me demonstrate what I mean:
library(data.table)
dt1 <- data.table(common_var = c(rep("a", 3), rep("b", 2)),
                  condition_var = c("update1", rep(c("update2", "update3"), 2)),
                  other_var = 1:5)
dt2 <- data.table(common_var = c("a", "b", "C", "d"),
                  new_var1 = 11:14,
                  new_var2 = 21:24)
# What I want to obtain is the following
dt_goal <- data.table(common_var = dt1$common_var,
                      condition_var = dt1$condition_var,
                      other_var = dt1$other_var,
                      new_var1 = c(11, NA, NA, 12, NA),
                      new_var2 = c(21, NA, NA, 22, NA))
dt_goal
Updating by reference or merging populates all the matching rows (as expected), but this is not what I want:
# Updating by reference populates all the duplicate rows as expected
# (doesn't work for my purpose)
dt1[, names(dt2) := as.list(dt2[match(dt1$common_var, dt2$common_var),])]
# merging also populates duplicate rows as expected.
# dt3 <- merge(dt1, dt2, by="common_var")
I tried overriding the rows of merged dt3 (or updated dt1) with NAs where I don't want to have data:
# alldup() is a helper flagging all occurrences of duplicated values, e.g. duplicated(x) | duplicated(x, fromLast = TRUE)
dt3 <- dt3[which(alldup(dt3$common_var) & dt3$condition_var %in% c("update2", "update3")), names(dt2)[2:3] := NA]
dt3
The logic in the code above finds duplicates and the unwanted conditional cases, and replaces the selected columns with NA. This partially works, with two problems:
1) If the value to keep (update1) isn't present among a group of duplicate rows (group b in my example), those rows get erased too
2) This approach requires hard-coding the case I want to keep. In my real-world application, I will loop this type of data prep and the conditional values will change. I know the priority for updating the data table though:
order_to_populate_dups <- c("update1", "update2", "update3")
In other words, I want a code to grow the data table as follows:
1) When no duplicates, add columns by reference (or merge) normally
2) When duplicates are present under the id variable, look at condition_var
2a) If you see update1 add data, if not, next
2b) If you see update2 add data, if not, next
2c) If you see update3 add data, if not, next, ...
I couldn't locate a solution for this problem on SO. Please let me know if this is somehow a duplicate.
Thanks!
Are you looking for something like:
cols <- paste0("new_var", 1:2)
remap <- c(update1=1, update2=2, update3=3)
dt1[, rp := remap[condition_var]]
setkey(dt1, common_var, rp)
dt1[rowid(common_var)==1L, (cols) :=
    dt2[.SD, on=.(common_var), mget(cols)]]
Explanation:
You can use factor or a vector to remap your character vector into something that can be ordered accordingly. Then use setkey to sort the data before performing an update join on the first row of each group of common_var.
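A minimal sketch of the factor alternative mentioned above (equivalent to the named remap vector; the order of the levels encodes the priority):
dt1[, rp := as.integer(factor(condition_var, levels = c("update1", "update2", "update3")))]
setkey(dt1, common_var, rp)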
Please let me know if I understood your example correctly or not. I can change the solution if needed.
# order dt1 by the common variable and condition
setorder(dt1, common_var, condition_var)
# calculate row_id for each group (grouped by common_var)
dt1[, row_index := rowid(common_var)]
# assume dt2 has only one row per common_var
dt2[, row_index := 1]
# left join on common_var and row_index, reorder columns.
dt3 <- dt2[dt1, on = c('common_var', 'row_index')][, list(common_var, condition_var, other_var, new_var1, new_var2)]
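Note that dt2[, row_index := 1] assumes one row per common_var in dt2. If dt2 could itself contain duplicates, a hedged variant would number its rows the same way, so the join pairs the k-th duplicate in dt1 with the k-th row in dt2:
dt2[, row_index := rowid(common_var)]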
Following up on this question, how would you assign values to multiple columns in a data table using the ":=" sign?
For example:
x <- data.table(a = 1:3, b = 1:6, c = 11:16)
I can get what I want using two lines:
x[a>2, b:=NA]
x[a>2, c:=NA]
but would like to be able to do it in one, something like this:
x[a>2, .(b:=NA, c:=NA)]
But unfortunately that doesn't work. Is there another way?
We can use the := once with
x[a >2, `:=`(b = NA, c = NA)]
If there are many columns, another option is set
for(nm in names(x)[-1]) set(x, i=which(x[["a"]]>2), j=nm, value = NA)
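If the column names are already in a character vector, a small sketch of the same assignment using the functional form of := (cols here is an assumed helper vector, not from the original question):
cols <- c("b", "c")
x[a > 2, (cols) := NA]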
I want to merge two data tables that have common column names. See my script below. But I need to obtain the column names with code rather than entering them manually as below.
Basically, I need to create a vector of column names for each data table.
setkeyv(Tab_1, c("State","County_ID","Year"))
setkeyv(Tab_2, c("State","County_ID","Year"))
sub_Merge <- merge(Tab_1, Tab_2, all.x = TRUE)
For example something like this below,
setkeyv(Tab_1, as.vector(colnames(Tab_1)))
setkeyv(Tab_2, as.vector(colnames(Tab_2)))
sub_Merge <- merge(Tab_1, Tab_2, all.x = TRUE)
Any help is appreciated.
With data.table, it's pretty concise:
dt1[dt2, on = names(dt1)[names(dt1) %in% names(dt2)]]
data.table uses the dt[i, j, by] structure. Putting another data.table in the i slot asks to join it to the data.table in the dt position. In a join, you can add an on= statement to specify which columns to base the join on, if any keyed columns already present in the two data.tables aren't suitable for use as such. In the code above, names(dt1)[names(dt1) %in% names(dt2)] returns a character vector of the column names found in both dt1 and dt2, and feeds them into the on= clause. The idea of doing it this way is that you can calculate shared column names on the fly and don't have to write each one out.
This depends on having no duplicate values in dt1, and wanting to join on ALL shared columns in dt1 and dt2.
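An equivalent hedged sketch using intersect() for the shared names (on= accepts a character vector):
shared <- intersect(names(dt1), names(dt2))
dt1[dt2, on = shared]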
I used this mock data:
dt1 <-
data.table(
a = LETTERS[1:10],
b = letters[1:10],
c = runif(10),
d = runif(10)
)
dt2 <-
data.table(
a = LETTERS[1:10],
b = letters[1:10],
e = runif(10),
f = runif(10)
)
I'm trying to do something similar but different enough from what's described here:
Update subset of data.table based on join
Specifically, I'd like to assign, for matching key values (person_id is a key in both tables), column values from the table control. ci is the column index. The statement below complains that 'with=F' was not used; when I delete those parts, it also doesn't work as expected. Any suggestions?
To rephrase: I'd like to set the subset of flatData that corresponds to control FROM control.
flatData[J(eval(control$person_id)), ci, with=F] = control[, ci, with=F]
To give a reproducible example using classic R:
x = data.frame(a = 1:3, b = 1:3, key = c('a', 'b', 'c'))
y = data.frame(a = c(2, 5), b = c(11, 2), key = c('a', 'b'))
colidx = match(c('a', 'b'), colnames(y))
x[x$key %in% y$key, colidx] = y[, colidx]
As an aside, someone please explain how to easily assign SETS of columns without using indices! Indices and data.table are a marriage made in hell.
You can use the := operator along with the join simultaneously as follows:
First prepare data:
require(data.table) ## >= 1.9.0
setDT(x) ## converts DF to DT by reference
setDT(y)
setkey(x, key) ## set key column
setkey(y, key)
Now the one-liner:
x[y, c("a", "b") := list(i.a, i.b)]
:= modifies by reference (in-place). The rows to modify are provided by the indices computed from the join in i.
i.a and i.b are the names data.table generates internally for easy access to i's columns when x and i share column names in a join of the form x[i].
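If there are many such columns, a hedged generalization of the one-liner builds the assignment from a character vector of names (cols is an assumed helper, not part of the original question):
cols <- c("a", "b")
x[y, (cols) := mget(paste0("i.", cols))]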
HTH
PS: In your example, y's columns a and b are of type numeric and x's are of type integer, so when this is run on your data you'll get a warning that the types didn't match and a coercion had to take place.
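If you want to avoid that coercion warning, one sketch (assuming numeric is the type you want to keep) is to convert x's columns first:
x[, c("a", "b") := lapply(.SD, as.numeric), .SDcols = c("a", "b")]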
I am trying to compile data from several files using for loops in R. I would like to get all the data into one table. Following calculation is just an example.
library(reshape)
dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2), "Density_3" = rnorm(10,4,2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""), "Density_1" = rnorm(10,4,2), "Density_2" = rnorm(10,4,2))
dat <- c("dat1", "dat2")
for(i in 1:length(dat)){
data <- get(dat[i])
melt.data <- melt(data, id = 1)
assign(paste(dat[i], "tbl", sep=""), cast(melt.data, ~ variable, mean))
}
rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column into dat2? I would like to get the same column name ("Density_3" in this case) and fill it up with zeros, if it does not already exist. Assume that I have ~100 tables with number of columns (Density_1, 2, 3 etc) varying between 5 and 6.
I tried following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
dat.all$Density_3 <- 0
} else {
dat.all$Density_3 <- dat.all$Density3}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while, I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill.
I would have constructed "dat" not as a character vector but as a list of two data frames, used aggregate(..., FUN = mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill, that is), and then called do.call(rbind.fill, ...) on the resulting list. At any rate, this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
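For reference, a minimal sketch of the list-of-data-frames approach described above, using colMeans in place of aggregate for brevity (assumes plyr is installed):
library(plyr)
dats <- list(dat1, dat2)                  # a list instead of get()/assign()
tbls <- lapply(dats, function(d) as.data.frame(as.list(colMeans(d[-1]))))
do.call(rbind.fill, tbls)                 # missing Density_3 is filled with NA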