I have a vector with 2 elements:
v1 <- c('X1','X2')
I want to create possible combinations of these elements.
The resultant data frame would look like:
structure(list(ID = c(1, 2, 3, 4), c1 = c("X1", "X2", "X1", "X2"
), c2 = c("X1", "X1", "X2", "X2")), class = "data.frame", row.names = c(NA,
-4L))
Here, the rows with ID=2 and ID=3 contain the same elements, just arranged in a different order. I would like to treat these 2 rows as duplicates, so the final output should have only 3 rows.
i.e. 3 combinations
X1, X1
X1, X2
X2, X2
In my actual dataset, the vector v1 has 16 such elements.
I have tried the expand.grid approach to obtain the possible combinations, but the result exceeds the machine's memory limit because the number of combinations with 16 elements is too large. This is probably made worse by the duplicates described above.
Can someone help me get all possible combinations without any duplicates?
I am looking for a solution that uses data.table functionality, as I believe it would be much faster.
Thanks in advance.
Here is a data.table solution using your sample data:
First, create your combinations. Using unique = TRUE cuts back on the number of combinations.
library(data.table)
# CJ() already returns a data.table, so no setDT() is needed
data <- CJ(df$c1, df$c2, unique = TRUE)
Then, filter out duplicates:
data[!duplicated(t(apply(data, 1, sort))),]
This gives us:
V1 V2
1 X1 X1
2 X2 X1
10 X2 X2
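As a hedged alternative to sorting every row, the same result can be sketched in base R by building index pairs and keeping only those with i <= j. The names idx and res are illustrative and not part of the answer above. Note that this still builds the full grid of index pairs before filtering.

```r
v1 <- c("X1", "X2")

# All ordered index pairs (i, j); i varies fastest
idx <- expand.grid(i = seq_along(v1), j = seq_along(v1))

# Keep one representative per unordered pair: those with i <= j
idx <- idx[idx$i <= idx$j, ]

# Map the indices back to the original values
res <- data.frame(c1 = v1[idx$i], c2 = v1[idx$j], stringsAsFactors = FALSE)
print(res)
```

Comparing integer positions avoids the row-wise apply/sort step, which tends to be the slow part on larger inputs.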
I would look into the ?expand.grid function for this type of task.
expand.grid(v1, v1)
Var1 Var2
1 X1 X1
2 X2 X1
3 X1 X2
4 X2 X2
dat4 is the final output.
v1 <- c('X1','X2')
library(data.table)
dat <- expand.grid(v1, v1, stringsAsFactors = FALSE)
setDT(dat)
# A function to combine and sort string from two columns
f <- function(x, y){
z <- paste(sort(c(x, y)), collapse = "-")
return(z)
}
# Apply the f function to each row
dat2 <- dat[, comb := f(Var1, Var2), by = 1:nrow(dat)]
# Remove the duplicated elements in the comb column
dat3 <- unique(dat2, by = "comb")
# Select the columns
dat4 <- dat3[, c("Var1", "Var2")]
print(dat4)
# Var1 Var2
# 1: X1 X1
# 2: X2 X1
# 3: X2 X2
You may want to check RcppAlgos::comboGeneral, which does exactly what you want and is known to be fast and memory efficient. Just do something like this:
vars <- paste0("X", 1:2)
RcppAlgos::comboGeneral(vars, length(vars), repetition = TRUE)
Output
[,1] [,2]
[1,] "X1" "X1"
[2,] "X1" "X2"
[3,] "X2" "X2"
On my laptop with 16 GB of RAM, I can run this function for up to 14 variables, and it takes less than 5 s to finish, so speed is less of a concern than memory. Note, however, that you need at least 17.9 GB of RAM to hold all 16-variable combinations.
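The memory figure can be sanity-checked by hand: the number of combinations with repetition of k items drawn from n is choose(n + k - 1, k). The sketch below assumes 4 bytes per cell, as for an integer matrix; that cell size is an assumption, and a character matrix would need more.

```r
n <- 16  # number of distinct elements in the vector
k <- 16  # number of elements per combination

# Combinations with repetition: choose(n + k - 1, k)
n_rows <- choose(n + k - 1, k)

# Approximate size of an n_rows x k matrix, assuming 4 bytes per cell
gib <- n_rows * k * 4 / 2^30

# For comparison: unordered pairs (k = 2) of the same 16 elements
n_pairs <- choose(n + 2 - 1, 2)

print(n_rows)
print(round(gib, 1))
print(n_pairs)
```

If only pairs are needed, as in the two-column example from the question, the result has just 136 rows and memory is not an issue.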
We can use crossing from tidyr
library(tidyr)
crossing(v1, v1)
Related
I'm writing a function that needs to work with different datasets. The columns that have to be passed inside the function look somewhat like the following data frames:
df1 <- data.frame(x1 = c("d","e","f","g"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(456,678,876,987))
df2 <- data.frame(x1 = c("a","b","c","d"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(123,324,345,564))
From these I need to find out whether any of the df1$x1 values are present in df2$x1. If so, print the entire row of df1 whose x1 value is present in df2$x1.
I need to use the data frames inside the function but I can't specify the column names explicitly. So I need to find a way to access the columns without exactly using the column name.
The desired output:
x1 x2 x3 x4
d Aug 2017 456 common
My problem is, I can't use any kind of function where I need to specify the column names explicitly. For example, inner join cannot be performed since I have to specify
by = 'col_name'
You can use match with column indices:
df1[na.omit(match(df2[, 1], df1[, 1])), ]
# x1 x2 x3
#1 d Aug 2017 456
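Building on positional indexing, here is a minimal sketch of a function that also adds the x4 label from the desired output. The function name find_common and its arguments i and j are hypothetical; they are column positions, so no column name is hardcoded for the lookup.

```r
df1 <- data.frame(x1 = c("d", "e", "f", "g"),
                  x2 = c("Aug 2017", "Sep 2017", "Oct 2017", "Nov 2017"),
                  x3 = c(456, 678, 876, 987), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = c("a", "b", "c", "d"),
                  x2 = c("Aug 2017", "Sep 2017", "Oct 2017", "Nov 2017"),
                  x3 = c(123, 324, 345, 564), stringsAsFactors = FALSE)

# i and j are the positions of the columns to compare in d1 and d2
find_common <- function(d1, d2, i = 1, j = 1) {
  out <- d1[d1[[i]] %in% d2[[j]], , drop = FALSE]
  out$x4 <- rep("common", nrow(out))  # label the matching rows
  out
}

res <- find_common(df1, df2)
print(res)
```

Unlike the match() call above, %in% keeps every matching row of df1 in its original order, which matters if the compared column contains repeated values.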
Here are three simple examples of functions that you might use to return the rows you want, where the function itself does not need to hardcode the name of the column, and does not require the indices to match:
Pass the frames directly, and the column names as strings:
f <- function(d1, d2, col1, col2) d1[which(d1[,col1] %in% d2[,col2]),]
Usage and Output
f(df1,df2, "x1", "x1")
x1 x2 x3
1 d Aug 2017 456
Pass only the values of the columns as vectors:
f <- function(x,y) which(x %in% y)
Usage and Output
library(dplyr)
df1 %>% filter(row_number() %in% f(x1, df2$x1))
x1 x2 x3
1 d Aug 2017 456
Pass the frames and the unquoted columns, using the {{}} operator
f <- function(d1, d2, col1, col2) {
filter(d1, {{col1}} %in% pull(d2, {{col2}}))
}
Usage and Output:
f(df1,df2,x1,x1)
x1 x2 x3
1 d Aug 2017 456
If you are familiar with SVMs, you know that we can map data to a higher dimension in order to deal with non-linearity.
I want to do that. I have 19 features and I want to do the following:
for every pair of features x_i and x_j, I have to compute:
sqrt(2)*x_i*x_j
and also the square of each feature:
(x_i)^2
so the new features will be:
(x_1)^2, (x_2)^2,...,(x_19)^2, sqrt(2)*x_1*x_2, sqrt(2)*x_1*x_3,...
Finally, I want to remove the columns whose values are all zero.
example
col1 col2 col3
1 2 6
new data frame
col1 col2 col3 col4 col5 col6
(1)^2 (2)^2 (6)^2 sqrt(2)*(1)*(2) sqrt(2)*(1)*(6) sqrt(2)*(2)*(6)
I use the data.table package for these kinds of operations. You will also need gtools to build the combinations of the features.
# input data frame
df <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9)
library(data.table)
library(gtools)
# convert to data table to do this
dt <- as.data.table(df)
# specify the feature variables
features <- c("x1", "x2", "x3")
# squares columns
dt[, (paste0(features, "_", "squared")) := lapply(.SD, function(x) x^2),
.SDcols = features]
# combinations columns
all_combs <- as.data.table(gtools::combinations(v=features, n=length(features), r=2))
for(i in 1:nrow(all_combs)){
set(dt,
j = paste0(all_combs[i, V1], "_", all_combs[i, V2]),
value = sqrt(2) * dt[, get(all_combs[i, V1])*get(all_combs[i, V2])])
}
# convert back to data frame
df2 <- as.data.frame(dt)
df2
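The question also asks to drop columns that are all zero, which the code above does not do. A base R sketch of the whole transformation, using combn instead of gtools, might look like this. The column suffixes _sq and _x_ and the extra all-zero column col4 are assumptions added for illustration.

```r
# Example row from the question, plus an all-zero column to show the filter
df <- data.frame(col1 = 1, col2 = 2, col3 = 6, col4 = 0)

# Squared features
squares <- df^2
names(squares) <- paste0(names(df), "_sq")

# Cross terms sqrt(2) * x_i * x_j for every pair i < j
pair_idx <- combn(ncol(df), 2, simplify = FALSE)
cross <- lapply(pair_idx, function(p) sqrt(2) * df[[p[1]]] * df[[p[2]]])
names(cross) <- vapply(pair_idx,
                       function(p) paste(names(df)[p], collapse = "_x_"),
                       character(1))

out <- cbind(squares, as.data.frame(cross))

# Drop columns whose values are all zero
out <- out[, colSums(out != 0) > 0, drop = FALSE]
print(out)
```

With 19 features this yields 19 squared columns and choose(19, 2) = 171 cross terms before the zero-column filter.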
I have output from older software that wraps the record for each transaction into multiple rows. I want to unwrap these rows into one flat dataframe. I have found solutions to unwrap columns, but not rows, and can do what I need in a loop, but the output is large and I would prefer a faster solution than a loop.
Example: I read into R from a .csv file 6 pieces of information about each of two transactions ("tran") that come wrapped into four rows.
The following represents and mimics my data as I read it into R from a .csv file:
V1 <- c("tran1.col1", "tran1.col4","tran2.col1", "tran2.col4")
V2 <- c("tran1.col2", "tran1.col5", "tran2.col2", "tran2.col5")
V3 <- c("tran1.col3", "tran1.col6", "tran2.col3", "tran2.col6")
df <- as.data.frame(matrix(c(V1, V2, V3), ncol = 3))
I am looking to transform the above to the following:
X1 <- c("tran1.col1", "tran2.col1")
X2 <- c("tran1.col2", "tran2.col2")
X3 <- c("tran1.col3", "tran2.col3")
X4 <- c("tran1.col4", "tran2.col4")
X5 <- c("tran1.col5", "tran2.col5")
X6 <- c("tran1.col6", "tran2.col6")
df.x <- as.data.frame(matrix(c(X1, X2, X3, X4, X5, X6), ncol = 6))
I've looked at tidy routines to gather and spread data frames, as well as melt and dcast in reshape2, but as far as I can tell, I need to unwrap the rows first.
If all your inputs have 6 pieces of information by however many transactions, then the following should work.
vec <- as.character(unlist(t(df)))
df.x <- as.data.frame(matrix(vec, ncol = 6, byrow = T))
To break it down to explain what's happening ...
# Transpose the df (to a matrix); named m to avoid masking matrix()
m <- t(df)
# With the matrix in this orientation, flattening it column by column
# produces a vector in the correct sequence (i.e. tran1.col1,
# tran1.col2 .. tran2.col1, tran2.col2)
vec <- as.vector(m)
# Now we can coerce it back to a data.frame, defining the number of columns
# and filling it by row (rather than by column)
df.x <- as.data.frame(matrix(vec, ncol = 6, byrow = TRUE))
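If the number of wrapped rows per transaction differs between files, the same idea can be sketched as a small helper. The name unwrap_rows and its rows_per_record argument are hypothetical; the sketch assumes every record spans the same number of rows.

```r
V1 <- c("tran1.col1", "tran1.col4", "tran2.col1", "tran2.col4")
V2 <- c("tran1.col2", "tran1.col5", "tran2.col2", "tran2.col5")
V3 <- c("tran1.col3", "tran1.col6", "tran2.col3", "tran2.col6")
df <- as.data.frame(matrix(c(V1, V2, V3), ncol = 3))

unwrap_rows <- function(d, rows_per_record) {
  # Every record must occupy the same number of rows
  stopifnot(nrow(d) %% rows_per_record == 0)
  # Flatten row by row: t() puts each original row into a column,
  # and as.vector() reads the matrix column by column
  vals <- as.vector(t(as.matrix(d)))
  as.data.frame(matrix(vals, ncol = ncol(d) * rows_per_record, byrow = TRUE),
                stringsAsFactors = FALSE)
}

res <- unwrap_rows(df, rows_per_record = 2)
print(res)
```

Because the work is done by a single transpose and reshape, no loop over transactions is needed.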
I can't seem to find this specifically (I looked here: How to split a character vector into data frame?) and a few other places.
I am trying to split a character vector in R into a data frame, with a set number of columns, filling in NA for any extras or missing. As below (reproducible):
###Reproduce column vector
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2", "ee2", "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")
###Desired data frame separating 6 columns
df.desired <- data.frame(col1=c("a1","aa2","x1","rr2"),col2=c("b1","bb2","x2","tt3"),col3=c("c1","cc2","x3","bb4"),col4=c("d1","dd2","x4",NA),col5=c("e1","ee2","x5",NA),col6=c("f1","ff2","x6",NA),stringsAsFactors = F)
Thanks in advance!
1) base Create a matrix of NA values of the requisite dimensions and then fill it with cv up to its length. Transpose that and convert to a data frame.
mat <- t(replace(matrix(NA, 6, ceiling(length(cv) / 6)), seq_along(cv), cv))
as.data.frame(mat, stringsAsFactors = FALSE)
2) another base solution Using the cv2 copy of cv expand its length to that required and then reshape it into a matrix. We used cv2 in order to preserve the original cv but if you don't mind adding NAs to the end of cv then you could just use it instead of creating cv2 reducing the code by one line (two lines if we can use mat rather than needing a data frame). This solution avoids needing to use transpose by making use of the byrow argument of matrix.
cv2 <- cv
length(cv2) <- 6 * ceiling(length(cv) / 6)
mat <- matrix(cv2,, 6, byrow = TRUE)
as.data.frame(mat, stringsAsFactors = FALSE)
3) base solution using ts This one gets the row and column indexes by extracting them from the times of a ts object rather than calculating the dimensions via numeric calculation. To do that create the times, tt, of a ts object from cv. tt itself is a ts object for which as.integer(tt) is the row index numbers and cycle(tt) is the column index numbers. Finally use tapply with that:
tt <- time(ts(cv, frequency = 6))
mat <- tapply(cv, list(as.integer(tt), cycle(tt)), c)
as.data.frame(mat, stringsAsFactors = FALSE)
4) rollapply Like (3), this one does not explicitly calculate the dimensions of mat. It uses rollapply in the zoo package with a simple function, Fill, to avoid this. The Fill function returns its argument x padded out with NAs on the right to a length of 6.
library(zoo)
Fill <- function(x) { length(x) <- 6; x }
mat <- rollapplyr(cv, 6, by = 6, Fill, align = "left", partial = TRUE)
as.data.frame(mat, stringsAsFactors = FALSE)
In all alternatives above omit the last line if a matrix mat is adequate as the result.
Added
As of R 4.0, stringsAsFactors = FALSE is the default, so it could be omitted above.
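For reuse with other column counts, approach (2) could be wrapped in a small helper. This is only a sketch; the name to_wide and its n_col argument are hypothetical.

```r
cv <- c("a1", "b1", "c1", "d1", "e1", "f1", "aa2", "bb2", "cc2", "dd2",
        "ee2", "ff2", "x1", "x2", "x3", "x4", "x5", "x6", "rr2", "tt3", "bb4")

to_wide <- function(x, n_col) {
  # Extending the length pads the local copy of x with NA
  length(x) <- n_col * ceiling(length(x) / n_col)
  as.data.frame(matrix(x, ncol = n_col, byrow = TRUE),
                stringsAsFactors = FALSE)
}

res <- to_wide(cv, 6)
print(res)
```

Since the padding happens on the function's local copy, the original cv is left unchanged.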
1) base R - split the vector using a grouping variable created with gl and then append NA at the end with length<-
lst <- split(cv, as.integer(gl(length(cv), 6, length(cv))))
as.data.frame(do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
# V1 V2 V3 V4 V5 V6
#1 a1 b1 c1 d1 e1 f1
#2 aa2 bb2 cc2 dd2 ee2 ff2
#3 x1 x2 x3 x4 x5 x6
#4 rr2 tt3 bb4 <NA> <NA> <NA>
I have two data frames:
df1
x1 x2
1 a
2 b
3 c
4 d
and
df2
x1 x2
2 zz
3 qq
I want to replace some of the values in df1$x2 with values from df2$x2, based on a match between df1$x1 and df2$x1, to produce:
df1
x1 x2
1 a
2 zz
3 qq
4 d
Use match(), assuming the values in df1$x1 are unique.
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
df1$x2[match(df2$x1,df1$x1)] <- df2$x2
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
If the values aren't unique, use:
for(id in 1:nrow(df2)){
df1$x2[df1$x1 %in% df2$x1[id]] <- df2$x2[id]
}
We can use {powerjoin}, and handle the conflicting columns with coalesce_yx
library(powerjoin)
df1 <- data.frame(x1 = 1:4, x2 = letters[1:4], stringsAsFactors = FALSE)
df2 <- data.frame(x1 = 2:3, x2 = c("zz", "qq"), stringsAsFactors = FALSE)
power_left_join(df1, df2, by = "x1", conflict = coalesce_yx)
#> x1 x2
#> 1 1 a
#> 2 2 zz
#> 3 3 qq
#> 4 4 d
The first part of Joris' answer is good, but in the case of non-unique values in df1, the row-wise for-loop will not scale well on large data.frames.
You could use a data.table "update join" to modify in place, which will be quite fast:
library(data.table)
setDT(df1); setDT(df2)
df1[df2, on = .(x1), x2 := i.x2]
Or, assuming you don't care about maintaining row order, you could use SQL-inspired dplyr:
library(dplyr)
union_all(
inner_join( df1["x1"], df2 ), # x1 from df1 with matches in df2, x2 from df2
anti_join( df1, df2["x1"] ) # rows of df1 with no match in df2
) # %>% arrange(x1) # optional, won't maintain an arbitrary row order
Either of these will scale much better than the row-wise for-loop.
I see that Joris and Aaron have both chosen to build examples without factors. I can certainly understand that choice. For the reader with columns that are already factors, there would also be the option of coercion to "character". There is a strategy that avoids that constraint and also allows for the possibility that there may be indices in df2 that are not in df1, which I believe would invalidate Joris Meys' but not Aaron's solution posted so far:
df1 <- data.frame(x1=1:4,x2=letters[1:4], stringsAsFactors=TRUE)
df2 <- data.frame(x1=c(2,3,5), x2=c("zz", "qq", "xx"), stringsAsFactors=TRUE)
It requires that the levels be expanded to include the union of the levels of both factor variables, and also that the non-matching entries (= NA values) in match(df2$x1, df1$x1) be dropped.
df1$x2 <- factor(df1$x2 , levels=c(levels(df1$x2), levels(df2$x2)) )
df1$x2[na.omit(match(df2$x1,df1$x1))] <- df2$x2[which(df2$x1 %in% df1$x1)]
df1
#-----------
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
(Note that since R 4.0.0, stringsAsFactors no longer defaults to TRUE in the data.frame function, unlike for most of the history of R.)
You can do it by matching the other way too but it's more complicated. Joris's solution is better but I'm putting this here also as a reminder to think about which way you want to match.
df1 <- data.frame(x1=1:4, x2=letters[1:4], stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3, x2=c("zz", "qq"), stringsAsFactors=FALSE)
swap <- df2$x2[match(df1$x1, df2$x1)]
ok <- !is.na(swap)
df1$x2[ok] <- swap[ok]
> df1
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
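Matching in this direction also copes with repeated keys in df1 and with keys in df2 that are absent from df1. A minimal sketch wrapping it in a function follows; the name update_by_key and its key and value arguments are hypothetical, and the example data is extended with a repeated key and an unmatched key for illustration.

```r
update_by_key <- function(target, updates, key = "x1", value = "x2") {
  # For each row of target, the position of its key in updates (NA if none)
  pos <- match(target[[key]], updates[[key]])
  hit <- !is.na(pos)
  # Overwrite only the rows that found a match
  target[[value]][hit] <- updates[[value]][pos[hit]]
  target
}

df1 <- data.frame(x1 = c(1, 2, 2, 3, 4),
                  x2 = c("a", "b", "b2", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x1 = c(2, 3, 5),
                  x2 = c("zz", "qq", "xx"), stringsAsFactors = FALSE)

res <- update_by_key(df1, df2)
print(res)
```

Both rows of df1 with key 2 receive "zz", and the key 5 that exists only in df2 is ignored.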
It can be done with dplyr.
library(dplyr)
full_join(df1,df2,by = c("x1" = "x1")) %>%
transmute(x1 = x1,x2 = coalesce(x2.y,x2.x))
x1 x2
1 1 a
2 2 zz
3 3 qq
4 4 d
I'm new here, but the following dplyr approach seems to work as well.
It is similar to, but slightly different from, one of the answers above:
df3 <- anti_join(df1, df2, by = "x1")
df3 <- rbind(df3, df2)
df3
As of dplyr 1.0.0 there is a function specifically for this:
library(dplyr)
df1 <- data.frame(x1=1:4,x2=letters[1:4],stringsAsFactors=FALSE)
df2 <- data.frame(x1=2:3,x2=c("zz","qq"),stringsAsFactors=FALSE)
rows_update(df1, df2, by = "x1")
See https://stackoverflow.com/a/65254214/2738526