R / data.table() merge on named subset of another data.table

I'm trying to put together several files and need to do a bunch of merges on column names that are created inside a loop. I can do this fine using data.frame() but am having issues using similar code with a data.table():
library(data.table)
df1 <- data.frame(id = 1:20, col1 = runif(20))
df2 <- data.frame(id = 1:20, col1 = runif(20))
newColNum <- 5
newColName <- paste('col',newColNum ,sep='')
df1[,newColName] <- runif(20)
df2 <- merge(df2, df1[,c('id',newColName)], by = 'id', all.x = T) # Works fine
######################
dt1 <- data.table(id = 1:20, col1 = runif(20))
dt2 <- data.table(id = 1:20, col1 = runif(20))
newColNum <- 5
newColName <- paste('col',newColNum ,sep='')
dt1[,newColName] <- runif(20)
dt2 <- merge(dt2, dt1[,c('id',newColName)], by = 'id', all.x = T) # Doesn't work
Any suggestions?

This really has nothing to do with merge(), and everything to do with how the j (i.e. column) index is, by default, interpreted by [.data.table().
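For example, with the objects above, the default evaluation of j means the expression is run as ordinary R code, so you get back a character vector instead of two columns:
dt1[, c('id', newColName)]
# [1] "id"   "col5"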
You can make the whole statement work by setting with=FALSE, which causes the j index to be interpreted as it would be in a data.frame:
dt2 <- merge(dt2, dt1[,c('id',newColName), with=FALSE], by = 'id', all.x = T)
head(dt2, 3)
#    id      col1       col5
# 1:  1 0.4954940 0.07779748
# 2:  2 0.1498613 0.12707070
# 3:  3 0.8969374 0.66894157
More precisely, from ?data.table:
with: By default 'with=TRUE' and 'j' is evaluated within the frame of 'x'. The column names can be used as variables. When 'with=FALSE', 'j' is a vector of names or positions to select.
Note that this could also be avoided by storing the column names in a variable, like so:
cols = c('id', newColName)
dt1[ , ..cols]
The .. prefix signals to "look up one level", i.e. to find cols in the calling scope rather than among the columns of dt1.
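Putting it together, a sketch of the original merge using the ..cols form:
cols <- c('id', newColName)
dt2 <- merge(dt2, dt1[, ..cols], by = 'id', all.x = TRUE)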

Alternatively, try dt1[, list(id, get(newColName))] in your merge.
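One caveat (my observation, not part of the original answer): get() drops the original column name, so the selected column typically comes back as V2 and may need renaming:
merged <- merge(dt2, dt1[, list(id, get(newColName))], by = 'id', all.x = TRUE)
setnames(merged, 'V2', newColName)  # restore the intended column name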

Related

Splitting string in column by separator and adding those as new columns in the same data frame using R

I have a column in dataframe df with value 'name>year>format'. Now I want to split this column by > and add those values to new columns named name, year and format. How can I do this in R?
You can do that easily using the separate() function in tidyr:
library(tidyr)
library(dplyr)
data <- data.frame(A = c("Joe>1993>student"))
data %>%
separate(A, into = c("name", "year", "format"), sep = ">", remove = FALSE)
#                  A name year  format
# 1 Joe>1993>student  Joe 1993 student
If you do not want the original column in the result data frame, change remove to TRUE.
An option is read.table in base R:
cbind(df, read.table(text = as.character(df$column), sep = ">",
                     header = FALSE, col.names = c("name", "year", "format")))
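For instance, with a small hypothetical data frame holding the combined values in a column named A:
df <- data.frame(A = "Joe>1993>student")
cbind(df, read.table(text = as.character(df$A), sep = ">",
                     header = FALSE, col.names = c("name", "year", "format")))
#                  A name year  format
# 1 Joe>1993>student  Joe 1993 student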
If your data is big, it would be a good idea to use data.table, as it is very fast.
If you know how many fields your "combined" column has:
Suppose the column has 3 fields, and you know it:
library(data.table)
# the 1:3 should be replaced by 1:n, where n is the number of fields
dt1[, paste0("V", 1:3) := tstrsplit(y, split = ">", fixed = TRUE)]
If you DON'T know in advance how many fields the column has:
Now we can get some help from the stringi package:
library(data.table)
library(stringi)
maxFields <- dt2[, max(stri_count_fixed(y, ">")) + 1]
dt2[, paste0("V", 1:maxFields) := tstrsplit(y, split = ">", fixed = TRUE, fill = NA)]
Data used:
library(data.table)
dt1 <- data.table(x = c("A", "B"), y = c("letter>2018>pdf", "code>2020>Rmd"))
dt2 <- rbind(dt1, data.table(x = "C", y = "report>2019>html>pdf"))
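For reference, the second snippet applied to dt2 should give output along these lines (shorter strings are padded with NA):
dt2
#    x                    y     V1   V2   V3   V4
# 1: A      letter>2018>pdf letter 2018  pdf <NA>
# 2: B        code>2020>Rmd   code 2020  Rmd <NA>
# 3: C report>2019>html>pdf report 2019 html  pdf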

How to write a function to translate data to higher dimension

If you are familiar with SVMs, you know that data can be mapped to a higher dimension in order to deal with non-linearity.
I want to do that. I have 19 features and I want to do this:
for any pair of features x_i and x_j I have to find:
sqrt(2)*x_i*x_j
and also the square of each feature:
(x_i)^2
so the new features will be:
(x_1)^2, (x_2)^2,...,(x_19)^2, sqrt(2)*x_1*x_2, sqrt(2)*x_1*x_3,...
and at the end remove columns whose values are all zero.
Example:
col1 col2 col3
   1    2    6
New data frame:
 col1  col2  col3            col4            col5            col6
(1)^2 (2)^2 (6)^2 sqrt(2)*(1)*(2) sqrt(2)*(1)*(6) sqrt(2)*(2)*(6)
I use the data.table package for these kinds of operations. You will need gtools as well, for generating the combinations of the features.
# input data frame
df <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9)
library(data.table)
library(gtools)
# convert to data table to do this
dt <- as.data.table(df)
# specify the feature variables
features <- c("x1", "x2", "x3")
# squares columns
dt[, (paste0(features, "_", "squared")) := lapply(.SD, function(x) x^2),
   .SDcols = features]
# combinations columns
all_combs <- as.data.table(gtools::combinations(v = features, n = length(features), r = 2))
for(i in 1:nrow(all_combs)){
  set(dt,
      j = paste0(all_combs[i, V1], "_", all_combs[i, V2]),
      value = sqrt(2) * dt[, get(all_combs[i, V1]) * get(all_combs[i, V2])])
}
# convert back to data frame
df2 <- as.data.frame(dt)
df2
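The question also asks to drop, at the end, any columns whose values are all zero; a minimal base-R sketch for that final step (assuming the df2 produced above):
# keep only columns containing at least one non-zero value
df2 <- df2[, colSums(df2 != 0) > 0, drop = FALSE]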

Joining data frames without returning all matching combinations

I have a list of data.frames (in this example only 2):
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
I want to join them into a single data.frame only by a subset of the shared column names, in this case by id.
If I use:
library(dplyr)
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
The shared column names, which I'm not joining by, get mutated with the .x and .y suffixes:
  id       val.x       val1     val.y      val2
1  G -0.05612874  0.2914462  2.087167 0.7876396
2  G -0.05612874  0.2914462 -0.255027 1.4411577
3  J -0.15579551 -0.4432919 -1.286301 1.0273924
In reality, for the shared column names that I'm not joining by, it's good enough to take their values from a single data.frame in the list, whichever one they exist in, with respect to the joined id.
I don't know these shared column names in advance, but that's not difficult to find out:
E.g.:
df.list.colnames <- unlist(lapply(df.list,function(l) colnames(l %>% dplyr::select(-id))))
df.list.colnames <- table(df.list.colnames)
repeating.colnames <- names(df.list.colnames)[which(df.list.colnames > 1)]
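With df1 and df2 above, the only shared measurement column is val:
repeating.colnames
# [1] "val"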
Which will then allow me to separate them from the data.frames in the list:
repeating.colnames.df <- do.call(rbind, lapply(df.list, function(r) r %>% dplyr::select_(.dots = c("id", repeating.colnames)))) %>%
  unique()
I can then join the list of data.frames, excluding these columns, as above:
for(r in 1:length(df.list)) df.list[[r]] <- df.list[[r]] %>% dplyr::select_(.dots = paste0("-",repeating.colnames))
df <- df.list %>% purrr::reduce(dplyr::inner_join,by="id")
And now I'm left with adding repeating.colnames.df back to that. I don't know of any join in dplyr that won't return all combinations between df and repeating.colnames.df, so it seems that all I can do is apply over each df$id, pick the first match in repeating.colnames.df, and join the result with df.
Is there anything less cumbersome for this situation?
If I followed correctly, I think you can handle this by writing a custom function to pass into reduce that identifies the common column names (excluding your joining columns) and excludes those columns from the "second" table in the merge. As reduce works through the list, the function will "accumulate" the unique columns, defaulting to the columns in the "left-most" table.
Something like this:
library(dplyr)
library(purrr)
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
df.list <- list(df1,df2)
fun <- function(df1, df2, by_col = "id"){
  df1_names <- names(df1)
  df2_names <- names(df2)
  dup_cols <- intersect(df1_names[!df1_names %in% by_col], df2_names[!df2_names %in% by_col])
  out <- dplyr::inner_join(df1, df2[, !(df2_names %in% dup_cols)], by = by_col)
  return(out)
}
df_chase <- df.list %>% reduce(fun,by_col="id")
Created on 2019-01-15 by the reprex package (v0.2.1)
If I compare df_chase to your final solution, I get the same answer:
> all.equal(df_chase, df_orig)
[1] TRUE
You can also just get rid of the duplicate columns from one of the data frames, since you say you don't really care about them, and simply use base::merge():
set.seed(1)
df1 <- data.frame(id = sample(LETTERS,50,replace=T), val = rnorm(50), val1 = rnorm(50), stringsAsFactors = F)
df2 <- data.frame(id = sample(LETTERS,30,replace=T), val = rnorm(30), val2 = rnorm(30), stringsAsFactors = F)
duplicates = names(df1) == names(df2) & names(df1) != "id"
df2 = df2[, !duplicates]
df12 = base::merge.data.frame(df1, df2, by = "id")
head(df12)
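The duplicated val column should then appear only once in the result:
names(df12)
# [1] "id"   "val"  "val1" "val2"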

Assign by reference a list of results to a number of columns of a data.table

Imagine you have 2 distributions resulting from two simulations stored in a data.frame:
sim1 = 1:10
sim2 = 91:100
sim = data.frame(sim1, sim2)
Now, we want to find the 10th and 90th percentiles of each distribution. This can be done by:
diffSim = ncol(sim)
confidenceInterval = c(0.1, 0.9)
results = lapply(1:diffSim, function(j) {quantile(sim[, j], confidenceInterval,
                                                  names = FALSE, type = 3)})
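For context, results here is a list of two length-2 numeric vectors (the 10th and 90th percentiles of each column):
str(results)
# List of 2
#  $ : num [1:2] 1 9
#  $ : num [1:2] 91 99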
I would like to store these results in a data.table by assigning by reference (:=). However, I first need to get results into the appropriate shape (i.e. a data.table of 1 row and 4 columns). To do so, I subsequently apply unlist, matrix and as.data.table to results:
DT = data.table(Col1 = "Result")
DT[, c("col2", "col3", "col4", "col5") := as.data.table(matrix(unlist(results), nrow = 1))]
I don't like this at all. Is there a shorter way of doing this?
Not necessarily shorter, but everything in data.table:
library(data.table)
setDT(sim)[, .(col1 = 'Result',
               cols = paste0('col', 2:5),
               vals = unlist(lapply(.SD, quantile, probs = confidenceInterval, type = 3)))
           ][, dcast(.SD, col1 ~ cols, value.var = 'vals')]
which gives:
     col1 col2 col3 col4 col5
1: Result    1    9   91   99

how to lapply to one column in a list of data tables

I have a list of data.tables with the same structure; some columns are numeric, some character.
library(data.table)
dt1 <- data.table(x = c(1:5), y = "test")
dt2 <- data.table(x = c(1:5), y = "test")
mylist <- list(A = dt1, B = dt2)
I want to apply a function, say sum or mean, that cannot be applied across the whole data.table because there are some character columns.
I have tried different combinations, like lapply(mylist$y, sum) or lapply(mylist[2], sum), but they don't work.
You can create an anonymous function inside lapply in which you subset and perform the needed calculation (promoting my comment to an answer):
lapply(mylist, function(i) i[, sum(x)])
# or:
lapply(mylist, function(i) sum(i[["x"]]))
which gives:
$A
[1] 76
$B
[1] 99
Another example giving you the number of unique y-values for x > 3:
lapply(mylist, function(i) i[x>3, uniqueN(y)])
which gives:
$A
[1] 10
$B
[1] 14
Used data (note that in data.table() the shorter x is recycled, with a warning, to match the 26 letters in y, which is why the sums above are 76 and 99):
dt1 <- data.table(x = c(1:5), y = letters)
dt2 <- data.table(x = c(1:7), y = letters)
mylist <- list(A = dt1, B = dt2)
I really think the purrr package makes these problems easier to think about, by letting you break the problem up into bite-sized pieces:
library(purrr)
dt1 <- data_frame(x = c(1:5), y = letters[1:5])
dt2 <- data_frame(x = c(1:5), y = letters[1:5])
mylist <- list(A = dt1, B = dt2)
map(mylist, "y") %>%
  map(length)
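With the five-row example data above, this returns the length of each y column:
$A
[1] 5

$B
[1] 5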
You can also use something like this to apply a function conditionally, here summing only the numeric columns and leaving the character columns untouched:
map(mylist, ~map_if(., is.numeric, sum))
You could also use nested lapply() functions like so:
dt1 <- data.table(x = c(1:5), y = letters[1:5])
dt2 <- data.table(x = c(6:10), y = letters[1:5])
mylist <- list(A = dt1, B = dt2)
lapply(lapply(mylist, function(x) x[[1]]), mean)
# $A
# [1] 3
# $B
# [1] 8
There are many options here, it seems. With my code, it might be instructive to look at what the inner lapply() returns and how the outer lapply() handles it, to understand why this works.
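Concretely, the inner call extracts the first column of each table, and the outer call then averages each vector:
lapply(mylist, function(x) x[[1]])
# $A
# [1] 1 2 3 4 5
#
# $B
# [1]  6  7  8  9 10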
