Obtain column names as vector in R - r

I want to merge two data tables that share some column names. See my script below. I need to obtain the column names with code rather than entering them manually as I do below.
Basically, I need to create a vector of column names for each data table.
setkeyv(Tab_1, c("State","County_ID","Year"))
setkeyv(Tab_2, c("State","County_ID","Year"))
sub_Merge <- merge(Tab_1, Tab_2, all.x = TRUE)
For example something like this below,
setkeyv(Tab_1, as.vector(colnames(Tab_1)))
setkeyv(Tab_2, as.vector(colnames(Tab_2)))
sub_Merge <- merge(Tab_1, Tab_2, all.x = TRUE)
Any help is appreciated.

With data.table, it's pretty concise:
dt1[dt2, on = names(dt1)[names(dt1) %in% names(dt2)]]
data.table uses the dt[i, j, by] structure. Putting another data.table in the i slot asks to join it to the data.table in the dt position. In a join, you can add an on= argument to specify which columns to base the join on, if any keyed columns already present in the two data.tables aren't suitable for use as such. In the code above, names(dt1)[names(dt1) %in% names(dt2)] returns a character vector of the column names found in both dt1 and dt2, and feeds it into the on= argument. The idea of doing it this way is that you can calculate shared column names on the fly and don't have to write each one out.
This depends on having no duplicate values in dt1, and wanting to join on ALL shared columns in dt1 and dt2.
I used this mock data:
dt1 <-
data.table(
a = LETTERS[1:10],
b = letters[1:10],
c = runif(10),
d = runif(10)
)
dt2 <-
data.table(
a = LETTERS[1:10],
b = letters[1:10],
e = runif(10),
f = runif(10)
)
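A minimal sketch of the same idea applied to the question's tables, assuming Tab_1 and Tab_2 look roughly like the made-up data below (the x and y columns and all values are hypothetical). intersect() is used here as an equivalent way to compute the shared names, and the result is passed to merge() via by= so that no keys need to be set:

```r
library(data.table)

# Hypothetical stand-ins for the question's tables (values are made up)
Tab_1 <- data.table(State = c("A", "A", "B"), County_ID = c(1L, 2L, 1L),
                    Year = 2000L, x = c(0.1, 0.2, 0.3))
Tab_2 <- data.table(State = c("A", "B"), County_ID = c(1L, 1L),
                    Year = 2000L, y = c(10, 30))

# Column names present in both tables, computed rather than typed out
shared <- intersect(names(Tab_1), names(Tab_2))

# Left join keeping every row of Tab_1, as in merge(..., all.x = TRUE)
sub_Merge <- merge(Tab_1, Tab_2, by = shared, all.x = TRUE)

# The same join in x[i, on = ] form: the rows of Tab_1 (in i) drive the result
sub_Join <- Tab_2[Tab_1, on = shared]
```

Note that in the x[i] form the table in i determines which rows are kept, so Tab_1 goes inside the brackets when every row of Tab_1 should survive.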

Related

Update data.table by reference but populate only certain rows when duplicates are present using a prioritized vector

I didn't quite know how to word the title, but here is what I'm trying to do. I'd like to grow the data table dt1 using columns from dt2. In dt1, there are duplicated data in the column I'm updating/merging by. My goal is to populate new columns in dt1 at duplicates only if a condition specified by another variable is met. Let me demonstrate what I mean:
library(data.table)
dt1 <- data.table(common_var = c(rep("a", 3), rep("b", 2)),
condition_var = c("update1", rep(c("update2", "update3"), 2)),
other_var = 1:5)
dt2 <- data.table(common_var = c("a", "b", "C", "d"),
new_var1 = 11:14,
new_var2 = 21:24)
# What I want to obtain is the following
dt_goal <- data.table(common_var = dt1$common_var,
condition_var = dt1$condition_var,
other_var = dt1$other_var,
new_var1 = c(11, NA, NA, 12, NA),
new_var2 = c(21, NA, NA, 22, NA))
dt_goal
Updating by reference or merging populates all the matching rows (as expected), but this is not what I want:
# Updating by reference populates all the duplicate rows as expected
# (doesn't work for my purpose)
dt1[, names(dt2) := as.list(dt2[match(dt1$common_var, dt2$common_var),])]
# merging also populates duplicate rows as expected.
# dt3 <- merge(dt1, dt2, by="common_var")
I tried overriding the rows of merged dt3 (or updated dt1) with NAs where I don't want to have data:
dt3 <- dt3[which(alldup(dt3$common_var) & dt3$condition_var %in% c("update2", "update3")), names(dt2)[2:3] := NA]
dt3
The logic in the code above finds duplicates and the unwanted conditional cases, and replaces the selected columns with NA. This partially works, with two problems:
1) If the value to keep (update1) isn't present in other duplicate rows (b in my example), they get erased too
2) This approach requires hard-coding the case I want to keep. In my real-world application, I will loop this type of data prep and the conditional values will change. I know the priority for updating the data table though:
order_to_populate_dups <- c("update1", "update2", "update3")
In other words, I want a code to grow the data table as follows:
1) When no duplicates, add columns by reference (or merge) normally
2) When duplicates are present under the id variable, look at condition_var
2a) If you see update1 add data, if not, next
2b) If you see update2 add data, if not, next
2c) If you see update3 add data, if not, next, ...
I couldn't locate a solution for this problem in SO. Please let me know if this is somehow duplicate.
Thanks!
Are you looking for something like:
cols <- paste0("new_var", 1:2)
remap <- c(update1=1, update2=2, update3=3)
dt1[, rp := remap[condition_var]]
setkey(dt1, common_var, rp)
dt1[rowid(common_var)==1L, (cols) :=
dt2[.SD, on=.(common_var), mget(paste0("x.",cols))]]
Explanation:
You can use factor or a vector to remap your character vector into something that can be ordered accordingly. Then use setkey to sort the data before performing an update join on the first row of each group of common_var.
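As a variation on this idea, here is a minimal sketch that ranks rows with match() against the question's order_to_populate_dups vector and picks the best-ranked row per group with .I, so no sorting is needed. The helper names rp, idx, m and picked are only illustrative, and the sketch assumes dt2 has one row per common_var:

```r
library(data.table)

dt1 <- data.table(common_var = c(rep("a", 3), rep("b", 2)),
                  condition_var = c("update1", rep(c("update2", "update3"), 2)),
                  other_var = 1:5)
dt2 <- data.table(common_var = c("a", "b", "C", "d"),
                  new_var1 = 11:14,
                  new_var2 = 21:24)

order_to_populate_dups <- c("update1", "update2", "update3")
cols <- c("new_var1", "new_var2")

# Rank each row by its position in the priority vector (1 = highest priority)
dt1[, rp := match(condition_var, order_to_populate_dups)]

# Row number of the best-ranked row within each common_var group
idx <- dt1[, .I[which.min(rp)], by = common_var]$V1

# Look up the matching dt2 rows and assign them only to the chosen rows
m <- match(dt1$common_var[idx], dt2$common_var)
picked <- dt2[m, cols, with = FALSE]
dt1[idx, (cols) := picked]
```

Rows that were not picked keep NA in the new columns, which is the layout shown in dt_goal.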
Please let me know whether I understood your example correctly. I can change the solution if needed.
# order dt1 by the common variable and condition
setorder(dt1, common_var, condition_var)
# calculate row_id for each group (grouped by common_var)
dt1[, row_index := rowid(common_var)]
# assume dt2 has only one row per common_var
dt2[, row_index := 1]
# left join on common_var and row_index, reorder columns.
dt3 <- dt2[dt1, on = c('common_var', 'row_index')][, list(common_var, condition_var, other_var, new_var1, new_var2)]

How to construct an empty data.table with the column names of an existing data.table?

I would like to create an empty data.table in R with column names from another existing data.table.
Somehow I could not find a solution for that.
I would like to do something like that:
require(data.table)
dt1 <- data.table(fn = c("A","B","C"), x = c(1,2,3), y = c(2,3,4), a = 1, b = 2, c = 3)
dt2 <- data.table(names=colnames(dt1)) # Gives 6 rows instead of 6 cols
How can this be achieved?
Thanks!
You can also take a zero-row slice of your old dt1 and keep it as dt2:
dt2 <- dt1[0,]
dt2
Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
It isn't precisely what you wanted, but it is one solution.
One option could be:
dt2 <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))), names(dt1))
Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
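The two answers differ in one respect that may matter later: a short sketch comparing the column classes each approach produces. The names dt2_slice and dt2_matrix are only illustrative:

```r
library(data.table)

dt1 <- data.table(fn = c("A", "B", "C"), x = c(1, 2, 3), y = c(2, 3, 4),
                  a = 1, b = 2, c = 3)

# Zero-row slice: keeps the names and the column classes of dt1
dt2_slice <- dt1[0]

# Matrix route: keeps the names, but every column comes out as logical
dt2_matrix <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))),
                       names(dt1))

# Compare the classes of the two empty tables
sapply(dt2_slice, class)
sapply(dt2_matrix, class)
```

If rows will be appended afterwards, for example with rbind(), the zero-row slice avoids having to convert the column types.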

Duplicated col in datatable join

I do a simple join of two data.tables as follows:
set.seed(1)
DT1 <- data.table(
Idx = rep(1:100),
x1 = round(rnorm(100,0.75,0.3),2),
x2 = round(rnorm(100,0.75,0.3),2),
x3 = round(rnorm(100,0.75,0.3),2))
DT2 <- data.table(
Idx2 = rep(1:100),
x1 = round(rep(pi,100),2),
targetcol = rep(999,100))
DT2[DT1,on = c(Idx2 = "Idx")]
This works, but there is a column i.x1 in the result, which I do not want. I only want to include 'targetcol', hence the name. The problem is that in another example I have many of these duplicate columns with the 'i.' prefix, so I would like to delete them or, better, exclude them during the merge. I know this should be possible with X[Y, .(...)], but I couldn't find the right way to fill the dots in .(...) with all columns but one, i.e. with all but i.x1. So I wonder what the best way is to select multiple columns in data.table with the list syntax as above?
Secondly I tried the newer merge syntax of datatable:
merge(x = DT1, y = DT2[,c("Idx2","targetcol")], by.x = "Idx",by.y = "Idx2", all.x=TRUE)
but it leads to a different column ordering, naming (x1.x and x1.y), and moreover, I read it is slower than the other way.
What is the best method to solve this (also in case there are many more columns and duplicates; this was just to illustrate the issue)?
Answer moved from comments, with a slight modification of HubertL's code:
DT1[DT2[, .(Idx2, targetcol)], on = c(Idx = "Idx2")]
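For the case with many duplicate columns, one possible sketch is to compute the columns to keep from DT2 instead of listing them: the join column plus everything that is not already in DT1. The variable names key_col, keep and res are hypothetical, and the `..` prefix assumes a reasonably recent data.table version:

```r
library(data.table)

set.seed(1)
DT1 <- data.table(Idx = 1:100,
                  x1 = round(rnorm(100, 0.75, 0.3), 2),
                  x2 = round(rnorm(100, 0.75, 0.3), 2),
                  x3 = round(rnorm(100, 0.75, 0.3), 2))
DT2 <- data.table(Idx2 = 1:100,
                  x1 = round(rep(pi, 100), 2),
                  targetcol = rep(999, 100))

key_col <- "Idx2"

# Join column plus every DT2 column whose name does not clash with DT1
keep <- union(key_col, setdiff(names(DT2), names(DT1)))

# Subset DT2 to those columns before joining, so no i.* columns appear
res <- DT1[DT2[, ..keep], on = c(Idx = key_col)]
names(res)
```

Because the clashing columns are dropped from DT2 before the join, the result contains only DT1's columns and targetcol.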

data.table join (multiple) selected columns with new names

I like to join two tables that have some identical columns (names and values) and others that are not. I'm only interested in joining those that are not identical and I would like to determine a new name for them. The way I currently do it seems verbose and hard to handle for the real tables I have with 100+ columns, i.e. I would like to determine the columns to be joined in advance and not in join statement. Reproducible example:
# create table 1
DT1 = data.table(id = 1:5, x=letters[1:5], a=11:15, b=21:25)
# create table 2 with changed values for a, b via pre-determined cols
DT2 = copy(DT1)
cols <- c("a", "b")
DT2[, (cols) := lapply(.SD, function(x) x*2), .SDcols = cols]
# both of these work, but they are verbose for many columns
DT1[DT2, c("a_new", "b_new") := list(i.a, i.b), on=c(id="id")]
DT1[DT2, `:=` (a_new=i.a, b_new=i.b), on = c(id="id")]
I was thinking about something like this (doesn't work):
cols_new <- c("a_new", "b_new")
cols <- c("a", "b")
DT1[DT2, cols_new := i.cols, on=c(id="id")]
Updated answer based on Arun's recommendation:
cols_old <- c('i.a', 'i.b')
DT1[DT2, (cols_new) := mget(cols_old), on = c(id = "id")]
you could also generate the cols_old by doing:
paste0('i.', gsub('_new', '', cols_new, fixed = TRUE))
See history for the old answer.
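Put together with the question's data, the approach can be sketched end to end as follows. Both name vectors are derived from cols here, which is one possible convention rather than part of the original answer, and 2L is used only to keep the columns integer:

```r
library(data.table)

# Table 1 and a copy with changed values for the pre-determined columns
DT1 <- data.table(id = 1:5, x = letters[1:5], a = 11:15, b = 21:25)
DT2 <- copy(DT1)
cols <- c("a", "b")
DT2[, (cols) := lapply(.SD, function(x) x * 2L), .SDcols = cols]

# Target names in DT1 and the i.-prefixed source names from DT2
cols_new <- paste0(cols, "_new")
cols_old <- paste0("i.", cols)

# Update join: add the renamed DT2 columns to DT1 by reference
DT1[DT2, (cols_new) := mget(cols_old), on = "id"]
DT1
```

Only cols has to be edited when the set of columns changes, which should scale to tables with 100+ columns.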

assigning a subset of data.table rows and columns by join

I'm trying to do something similar but different enough from what's described here:
Update subset of data.table based on join
Specifically, I'd like to assign to matching key values (person_id is a key in both tables) column values from table control. ci is the column index. The statement below says 'with=F' was not used. When I delete those parts, it also doesn't work as expected. Any suggestions?
To rephrase: I'd like to set the subset of flatData that corresponds to control FROM control.
flatData[J(eval(control$person_id)), ci, with=F] = control[, ci, with=F]
To give a reproducible example using classic R:
x = data.frame(a = 1:3, b = 1:3, key = c('a', 'b', 'c'))
y = data.frame(a = c(2, 5), b = c(11, 2), key = c('a', 'b'))
colidx = match(c('a', 'b'), colnames(y))
x[x$key %in% y$key, colidx] = y[, colidx]
As an aside, someone please explain how to easily assign SETS of columns without using indices! Indices and data.table are a marriage made in hell.
You can use the := operator along with the join simultaneously as follows:
First prepare data:
require(data.table) ## >= 1.9.0
setDT(x) ## converts DF to DT by reference
setDT(y)
setkey(x, key) ## set key column
setkey(y, key)
Now the one-liner:
x[y, c("a", "b") := list(i.a, i.b)]
:= modifies by reference (in-place). The rows to modify are provided by the indices computed from the join in i.
i.a and i.b are the column names data.table internally generates for easy access to i's columns when both x and i have identical column names, when performing a join of the form x[i].
HTH
PS: In your example, y's columns a and b are of type numeric and x's are of type integer, so when you run this on your data you'll get a warning that the types didn't match and a coercion had to take place.
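On the aside about assigning sets of columns by name: a minimal sketch that combines the update join with a character vector of column names and on=, so neither column indices nor setkey() are needed. Declaring y's columns as integer (an assumption made here, not part of the question's data) also avoids the coercion warning mentioned above:

```r
library(data.table)

x <- data.frame(a = 1:3, b = 1:3, key = c("a", "b", "c"),
                stringsAsFactors = FALSE)
# Integer literals so that y's column types match x's
y <- data.frame(a = c(2L, 5L), b = c(11L, 2L), key = c("a", "b"),
                stringsAsFactors = FALSE)
setDT(x)
setDT(y)

# Columns to copy from y into the matching rows of x, chosen by name
cols <- c("a", "b")
x[y, (cols) := mget(paste0("i.", cols)), on = "key"]
x
```

Rows of x without a match in y (here key "c") are left unchanged.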
