R Replace a value with a character from another data.table - r

Here I met some problems in R about replacement coding.
Here is the original data.table. There are two datatables:
dt1 <- data.table(V1 = c(1,"A"))
dt2 <- data.table("1" = c(4,5,6), "A" = c("c","d","e"))
Now I want to replace values in dt1 with value in dt2 by matching relationship.
The desired output should be:
dt3 <- data.table(V1 = c("4,5,6", "c,d,e"))
That is, I want to replace values in dt1 with all values in the corresponding column in dt2. And this is a simple example, I want to apply it to the whole data.table in R.
I met so big trouble in dealing with this, so please help me.

Here is a way from your input to desired output.
dt1[, V1 := sapply(dt2, paste, collapse = ',')[V1]]
# Test
all.equal(dt1, dt3)
[1] TRUE
PS. Are you sure that storing the values separated by a comma in a string is the best?

We may do
dt1[, V1 := unlist(lapply(V1, function(x) toString(dt2[[x]])))]
dt1
V1
1: 4, 5, 6
2: c, d, e

Related

How to obtain values from data table? [duplicate]

I have the following data.table (DT):
DT <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9)
I would like to select a subset of the variables programmatically (dynamically), by using an object where the relevant variable names are stored. For example, I want to select the two columns "V1" and "V3" stored in a variable "keep"
keep <- c("V1", "V3")
If we were to select the "keep" columns from a data.frame, the following would work:
DT[keep]
Unfortunately, this is not working when this is a data.table. I thought the data.frame and data.table are identical with this kind of behavior, but apperently they aren't. Anybody able to advise on the correct syntax?
This is covered in FAQ 1.1, 1.2 and 2.17.
Some possibilities:
DT[, keep, with = FALSE]
DT[, c('V1', 'V3'), with = FALSE]
DT[, c(1, 3), with = FALSE]
DT[, list(V1, V3)]
The reason DF[c('V1','V3')] works as it does for a data.frame is covered in ?`[.data.frame`
Data frames can be indexed in several modes. When [ and [[ are used
with a single vector index (x[i] or x[[i]]), they index the data frame
as if it were a list. In this usage a drop argument is ignored, with a
warning.
From data.table 1.10.2, you may use the .. prefix when subsetting columns programmatically:
When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers [...] It is experimental.
Thus:
DT[ , ..keep]
# V1 V3
# 1: 1 7
# 2: 2 8
# 3: 3 9
Some more possibilities:
DT[, .SD, .SDcols = keep]
DT[, mget(keep)]

Update data.table by reference but populate only certain rows when duplicates are present using a prioritized vector

I didn't quite know how to word the title, but here is what I'm trying to do. I'd like to grow the data table dt1 using columns from dt2. In dt1, there are duplicated data in the column I'm updating/merging by. My goal is to populate new columns in dt1 at duplicates only if a condition is met
specified by another variable. Let me demonstrate what I mean:
library(data.table)
dt1 <- data.table(common_var = c(rep("a", 3), rep("b", 2)),
condition_var = c("update1", rep(c("update2", "update3"), 2)),
other_var = 1:5)
dt2 <- data.table(common_var = c("a", "b", "C", "d"),
new_var1 = 11:14,
new_var2 = 21:24)
# What I want to obtain is the following
dt_goal <- data.table(common_var = dt1$common_var,
condition_var = dt1$condition_var,
other_var = dt1$other_var,
new_var1 = c(11, NA, NA, 12, NA),
new_var2 = c(21, NA, NA, 22, NA))
dt_goal
Updating by reference or merging populates all the matching rows (as expected), but this is not what I want:
# Updating by reference populates all the duplicate rows as expected
# (doesn't work for my purpose)
dt1[, names(dt2) := as.list(dt2[match(dt1$common_var, dt2$common_var),])]
# merging also populates duplicate rows as expected.
# dt3 <- merge(dt1, dt2, by="common_var")
I tried overriding the rows of merged dt3 (or updated dt1) with NAs where I don't want to have data:
dt3 <- dt3[which(alldup(dt3$common_var) & dt3$condition_var %in% c("update2", "update3")), names(dt2)[2:3] := NA]
dt3
The logic in the code above finds duplicates and the unwanted conditional cases, and replaces the selected columns with NA. This partially works, with two problems:
1) If the value to keep (update1) isn't present in other duplicate rows (b in my example), they get erased too
2) This approach requires hard-coding the case I want to keep. In my real-world application, I will loop this type of data prep and the conditional values will change. I know the priority for updating the data table though:
order_to_populate_dups <- c("update1", "update2", "update3")
In other words, I want a code to grow the data table as follows:
1) When no duplicates, add columns by reference (or merge) normally
2) When duplicates are present under the id variable, look at condition_var
2a) If you see update1 add data, if not, next
2b) If you see update2 add data, if not, next
2c) If you see update3 add data, if not, next, ...
I couldn't locate a solution for this problem in SO. Please let me know if this is somehow duplicate.
Thanks!
Are you looking for something like:
cols <- paste0("new_var", 1:2)
remap <- c(update1=1, update2=2, update3=3)
dt1[, rp := remap[condition_var]]
setkey(dt1, common_var, rp)
dt1[rowid(common_var)==1L, (cols) :=
dt2[.SD, on=.(common_var), mget(paste0("i.",cols))]
Explanation:
You can use factor or a vector to remap your character vector into something that can be ordered accordingly. Then use setkey to sort the data before performing an update join on the first row of each group of common_var.
Please let me know if i understood your example correctly or not. I can change the solution if needed.
# order dt1 by the common variable and
setorder(dt1, common_var, condition_var) condition
# calculate row_id for each group (grouped by common_var)
dt1[, row_index := rowid(common_var)]
# assume dt2 has only one row per common_var
dt2[, row_index := 1]
# left join on common_var and row_index, reorder columns.
dt3 <- dt2[dt1, on = c('common_var', 'row_index')][, list(common_var, condition_var, other_var, new_var1, new_var2)]

How to construct an empty data.table with the colum names of an existing data.table?

I would like to create an empty data.table in R with colum names from another existing data.table.
Somehow I could not find a solution for that.
I would like to do something like that:
require(data.table)
dt1 <- data.table(fn = c("A","B","C"), x = c(1,2,3), y = c(2,3,4), a = 1, b = 2, c = 3)
dt2 <- data.table(names=colnames(dt1)) # Gives 6 rows instead of 6 cols
How can this be achieved?
Thanks!
You can also take your old dt1, clear it and keep as dt2
dt2 <- dt1[0,]
dt2
Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c
It isn't precisely what did you want, but it always some solution.
One option could be:
dt2 <- setnames(data.table(matrix(nrow = 0, ncol = length(dt1))), names(dt1))
Empty data.table (0 rows and 6 cols): fn,x,y,a,b,c

select part of a string after certain number of special character

I have a data.table with a column
V1
a_b_c_las_poi
a_b_c_kiosk_pran
a_b_c_qwer_ok
I would like to add a new column to this data.table which does not include the last part of string after the "_".
UPDATE
So i would like the output to be
a_b_c_las
a_b_c_kiosk
a_b_c_qwer
If k is the number of fields to keep:
k <- 2
DT[, V1 := do.call(paste, c(read.table(text=V1, fill=TRUE, sep="_")[1:k], sep = "_"))]
fill=TRUE can be omitted if all rows have the same number of fields.
Note: DT in a reproducible form is:
library(data.table)
DF <- data.frame(V1 = c("a_b_c_las_poi", "a_b_c_kiosk_pran", "a_b_c_qwer_ok"),
stringsAsFactors = FALSE)
DT <- as.data.table(DF)
You can do this with sub and a regular expression.
sub("(.*)_.*", "\\1", V1)
[1] "a_b_c_las" "a_b_c_kiosk" "a_b_c_qwer"

How to select columns programmatically in a data.table?

I have the following data.table (DT):
DT <- data.table(V1 = 1:3, V2 = 4:6, V3 = 7:9)
I would like to select a subset of the variables programmatically (dynamically), by using an object where the relevant variable names are stored. For example, I want to select the two columns "V1" and "V3" stored in a variable "keep"
keep <- c("V1", "V3")
If we were to select the "keep" columns from a data.frame, the following would work:
DT[keep]
Unfortunately, this is not working when this is a data.table. I thought the data.frame and data.table are identical with this kind of behavior, but apperently they aren't. Anybody able to advise on the correct syntax?
This is covered in FAQ 1.1, 1.2 and 2.17.
Some possibilities:
DT[, keep, with = FALSE]
DT[, c('V1', 'V3'), with = FALSE]
DT[, c(1, 3), with = FALSE]
DT[, list(V1, V3)]
The reason DF[c('V1','V3')] works as it does for a data.frame is covered in ?`[.data.frame`
Data frames can be indexed in several modes. When [ and [[ are used
with a single vector index (x[i] or x[[i]]), they index the data frame
as if it were a list. In this usage a drop argument is ignored, with a
warning.
From data.table 1.10.2, you may use the .. prefix when subsetting columns programmatically:
When j is a symbol prefixed with .. it will be looked up in calling scope and its value taken to be column names or numbers [...] It is experimental.
Thus:
DT[ , ..keep]
# V1 V3
# 1: 1 7
# 2: 2 8
# 3: 3 9
Some more possibilities:
DT[, .SD, .SDcols = keep]
DT[, mget(keep)]

Resources