Subset data.table to represent the connections of objects - r

Given is a data.table representing the relations between 6 objects:
# create sample data.table
library(data.table)
x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.table(x1, x2)
The 1st column (x1) represents the objects.
The 2nd column (x2) represents the connections with other objects.
# check combinations
dt[dt$x1 != dt$x2]
Object 4 has no connections with other objects.
Objects 1, 2 and 3 are connected, as well as objects 5 and 6.
Now, a new column should be created in which all connected objects get the same number (ID).
The resulting data.table should look like:
x3 <- c(1,1,1,1,1,1,1,1,1,2,3,3)
dt.res <- data.table(dt, x3)
How can this be achieved?

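library(data.table)

x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.frame(x1, x2)
# start with every row labelled by its own object
dt$x3 <- dt$x1
dt
for (i in seq_len(nrow(dt))) {
  if (dt$x3[i] != dt$x2[i]) {
    # relabel the connected object's group with the current row's label
    dt$x3[dt$x3 == dt$x2[i]] <- dt$x3[i]
  }
}
# number the groups 1, 2, 3, ... in order of appearance
setDT(dt)[, id := .GRP, by = x3]
dt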
Create x3 as a copy of x1.
Iterate over the rows and check whether x3 differs from x2.
If it does, replace every element of x3 that equals the current x2 value with the current value of x3.
Convert to a data.table with setDT and number the groups with .GRP to get the ID.
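The loop above makes a single pass over the rows, which is enough for the sample data. I have not checked that one pass suffices for every ordering of connections, so here is a hedged sketch of a variant that repeats the relabelling until nothing changes. The names nodes and label are my own, not from the question, and this is only one possible way to do it:

```r
library(data.table)

x1 <- c(1,1,1,2,2,2,3,3,3,4,5,6)
x2 <- c(1,2,3,1,2,3,1,2,3,4,6,5)
dt <- data.table(x1, x2)

# one label per object, initially the object's own number (hypothetical helper names)
nodes <- sort(unique(c(dt$x1, dt$x2)))
label <- setNames(nodes, nodes)

repeat {
  old <- label
  for (k in seq_len(nrow(dt))) {
    a <- as.character(dt$x1[k])
    b <- as.character(dt$x2[k])
    # both ends of a connection take the smaller of their two labels
    lo <- min(label[a], label[b])
    label[a] <- lo
    label[b] <- lo
  }
  # stop once a full pass changes nothing
  if (identical(label, old)) break
}

# turn the group labels into consecutive IDs 1, 2, 3, ...
dt[, x3 := match(label[as.character(x1)], unique(label))]
print(dt)
```

Each object ends up with the smallest object number in its group, and match() then renumbers those groups consecutively.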

Related

find the common rows in 2 different dataframes

I'm writing a function that needs to work with different datasets. The columns that have to be passed inside the function look somewhat like the following data frames:
df1 <- data.frame(x1 = c("d","e","f","g"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(456,678,876,987))
df2 <- data.frame(x1 = c("a","b","c","d"), x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"), x3 = c(123,324,345,564))
From these I need to find out if any of the df1$x1 values are present in df2$x1. If present, print the entire row of df1 where that value occurs.
I need to use the data frames inside the function but I can't specify the column names explicitly. So I need to find a way to access the columns without exactly using the column name.
The desired output:
x1 x2 x3 x4
d Aug 2017 456 common
My problem is, I can't use any kind of function where I need to specify the column names explicitly. For example, inner join cannot be performed since I have to specify
by = 'col_name'
You can use match with column indices:
df1[na.omit(match(df2[, 1], df1[, 1])), ]
# x1 x2 x3
#1 d Aug 2017 456
Here are three simple examples of functions that you might use to return the rows you want, where the function itself does not need to hardcode the name of the column, and does not require the indices to match:
Pass the frames directly, and the column names as strings:
f <- function(d1, d2, col1, col2) d1[which(d1[,col1] %in% d2[,col2]),]
Usage and Output
f(df1,df2, "x1", "x1")
x1 x2 x3
1 d Aug 2017 456
Pass only the values of the columns as vectors:
f <- function(x,y) which(x %in% y)
Usage and Output
df1 %>% filter(row_number() %in% f(x1, df2$x1))
x1 x2 x3
1 d Aug 2017 456
Pass the frames and the unquoted columns, using the {{}} operator
f <- function(d1, d2, col1, col2) {
filter(d1, {{col1}} %in% pull(d2, {{col2}}))
}
Usage and Output:
f(df1,df2,x1,x1)
x1 x2 x3
1 d Aug 2017 456
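None of the answers above adds the x4 column from the desired output. As a rough sketch, assuming the key column sits at the same position in both frames, a function can take that position as an argument and append the flag itself. The name common_rows and the col argument are hypothetical:

```r
df1 <- data.frame(x1 = c("d","e","f","g"),
                  x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"),
                  x3 = c(456,678,876,987))
df2 <- data.frame(x1 = c("a","b","c","d"),
                  x2 = c("Aug 2017","Sep 2017","Oct 2017","Nov 2017"),
                  x3 = c(123,324,345,564))

# hypothetical helper: col is a column position, so no column name is hardcoded
common_rows <- function(d1, d2, col = 1) {
  # keep the rows of d1 whose key also occurs in d2
  out <- d1[d1[[col]] %in% d2[[col]], , drop = FALSE]
  # flag the kept rows, as in the desired output
  out$x4 <- rep("common", nrow(out))
  out
}

res <- common_rows(df1, df2)
print(res)
```

Passing col = 2 would compare the second columns instead, still without naming them.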

Assign multiple columns when using mutate in dtplyr

Is there a way of getting my data table to look like my target table when using dtplyr and mutate?
A Dummy table
library(data.table)
library(dtplyr)
library(dplyr)
id <- rep(c("A","B"),each=3)
x1 <- rnorm(6)
x2 <- rnorm(6)
dat <- data.table(id,x1,x2)
A dummy function
my_fun <- function(x,y){
cbind(a = x+10,b=y-10)
}
And I would like to use this type of syntax
dat |>
group_by(id) |>
mutate(my_fun(x = x1,y = x2))
Where the end result will look like this
data.table(id, x1, x2, a=x1+10,b=x2-10)
I would like to have a generic solution that works for functions with variable number of columns returned but is that possible?
I think we would need more information about how this would work with a variable number of columns:
Are the columns named in a specific way?
Do the output columns need to be named in a specific way?
Are there standard calculations being done to each column dependent on name? E.g., x1 = +10 and x2 = -10?
At any rate, here is a solution that works with your provided data to return the data.table you specified:
my_fun <- function(data, ...){
dots <- list(...)
cbind(data,
a = data[[dots$x]] + 10,
b = data[[dots$y]] - 10
)
}
dat |>
my_fun(x = "x1", y = "x2")
id x1 x2 a b
1: A 0.8485309 -0.3532837 10.848531 -10.353284
2: A 0.7248478 -1.6561564 10.724848 -11.656156
3: A -1.3629114 0.4210139 8.637089 -9.578986
4: B -1.7934827 0.6717033 8.206517 -9.328297
5: B -1.0971890 -0.3008422 8.902811 -10.300842
6: B 0.4396630 -0.7447419 10.439663 -10.744742
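If staying within dtplyr's mutate is not a hard requirement, plain data.table syntax may get closer to the generic case. The sketch below swaps in data.table's own j expression for mutate, and keeps the question's my_fun unchanged. The seed is my addition for reproducibility:

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
id <- rep(c("A", "B"), each = 3)
dat <- data.table(id, x1 = rnorm(6), x2 = rnorm(6))

# the question's helper: returns a matrix with named columns a and b
my_fun <- function(x, y) {
  cbind(a = x + 10, b = y - 10)
}

# per group, append whatever columns my_fun() returns to the group's rows (.SD)
res <- dat[, c(.SD, as.data.table(my_fun(x1, x2))), by = id]
print(res)
```

Because the new columns come from the names of the returned matrix, a function returning a different number of named columns should work the same way without changing the call.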

Create unique possible combinations from the elements in vector in R

I have a vector with 2 elements:
v1 <- c('X1','X2')
I want to create possible combinations of these elements.
The resultant data frame would look like:
structure(list(ID = c(1, 2, 3, 4), c1 = c("X1", "X2", "X1", "X2"
), c2 = c("X1", "X1", "X2", "X2")), class = "data.frame", row.names = c(NA,
-4L))
Here, rows with ID=2 and ID=3 have same elements (however arranged in different order). I would like to consider these 2 rows as duplicate. Hence, the final output will have 3 rows only.
i.e. 3 combinations
X1, X1
X1, X2
X2, X2
In my actual dataset, I have 16 such elements in the vector V1.
I have tried using expand.grid approach for obtaining possible combinations but this actually exceeds the machine limit. (number of combinations with 16 elements will be too large). This is potentially due to duplications described above.
Can someone help me get all possible combinations without any duplicates?
I am looking for a solution that uses data.table functionality, since I believe it would be much faster.
Thanks in advance.
Here is a data.table solution using your sample data:
First, create your combinations. Using unique = TRUE cuts back on the number of combinations.
library(data.table)
# df is the example data frame from the question; CJ() already returns a data.table
data <- CJ(df$c1, df$c2, unique = TRUE)
Then, filter out duplicates:
data[!duplicated(t(apply(data, 1, sort))),]
This gives us:
V1 V2
1 X1 X1
2 X2 X1
10 X2 X2
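For the two-column case only, the duplicate check can also be folded into the cross join itself. This is a minimal sketch, assuming just pairs are needed (not longer combinations), and the column names c1 and c2 are taken from the question's example:

```r
library(data.table)

v1 <- c("X1", "X2")

# CJ() builds the sorted cross join; keeping rows with c1 <= c2
# retains exactly one ordering of each unordered pair
pairs <- CJ(c1 = v1, c2 = v1)[c1 <= c2]
print(pairs)
```

This avoids sorting every row with apply(), but it still builds the full cross join first, so it does not help with the memory limit for long combinations.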
I would look into the ?expand.grid function for this type of task.
expand.grid(v1, v1)
Var1 Var2
1 X1 X1
2 X2 X1
3 X1 X2
4 X2 X2
dat4 is the final output.
v1 <- c('X1','X2')
library(data.table)
dat <- expand.grid(v1, v1, stringsAsFactors = FALSE)
setDT(dat)
# A function to combine and sort string from two columns
f <- function(x, y){
z <- paste(sort(c(x, y)), collapse = "-")
return(z)
}
# Apply the f function to each row
dat2 <- dat[, comb := f(Var1, Var2), by = 1:nrow(dat)]
# Remove the duplicated elements in the comb column
dat3 <- unique(dat2, by = "comb")
# Select the columns
dat4 <- dat3[, c("Var1", "Var2")]
print(dat4)
# Var1 Var2
# 1: X1 X1
# 2: X2 X1
# 3: X2 X2
You may want to check RcppAlgos::comboGeneral, which does exactly what you want and is known to be fast and memory efficient. Just do something like this:
vars <- paste0("X", 1:2)
RcppAlgos::comboGeneral(vars, length(vars), repetition = TRUE)
Output
[,1] [,2]
[1,] "X1" "X1"
[2,] "X1" "X2"
[3,] "X2" "X2"
On my laptop with 16 GB of RAM, I can run this function for up to 14 variables, and it takes less than 5 s to finish, so speed is less of a concern than memory. Note, however, that you need at least 17.9 GB of RAM to hold all 16-variable combinations.
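To see where numbers of this size come from, the count of combinations with repetition can be computed directly with choose(). The sketch below assumes combinations of length m drawn from n items, as in the comboGeneral call above. The 4 bytes per cell is my own assumption about the storage, chosen because it reproduces the 17.9 figure:

```r
n <- 16  # number of distinct elements
m <- 16  # length of each combination

# combinations with repetition: choose(n + m - 1, m)
n_comb <- choose(n + m - 1, m)

# the full grid that expand.grid() would try to build
n_grid <- n^m

# rough memory for an n_comb x m table, assuming 4 bytes per cell
gib <- n_comb * m * 4 / 2^30

print(n_comb)
print(n_grid)
print(round(gib, 1))
```

With n = 2 and m = 2 the same formula gives choose(3, 2) = 3, matching the three rows in the small example.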
We can use crossing from tidyr
library(tidyr)
crossing(v1, v1)

Unwrapping a row list before transforming it

I have output from older software that wraps the record for each transaction into multiple rows. I want to unwrap these rows into one flat dataframe. I have found solutions to unwrap columns, but not rows, and can do what I need in a loop, but the output is large and I would prefer a faster solution than a loop.
Example: I read into R from a .csv file 6 pieces of information about each of two transactions ("tran") that come wrapped into four rows.
The following represents and mimics my data as I read it into R from a .csv file:
V1 <- c("tran1.col1", "tran1.col4","tran2.col1", "tran2.col4")
V2 <- c("tran1.col2", "tran1.col5", "tran2.col2", "tran2.col5")
V3 <- c("tran1.col3", "tran1.col6", "tran2.col3", "tran2.col6")
df <- as.data.frame(matrix(c(V1, V2, V3), ncol = 3))
I am looking to transform the above to the following:
X1 <- c("tran1.col1", "tran2.col1")
X2 <- c("tran1.col2", "tran2.col2")
X3 <- c("tran1.col3", "tran2.col3")
X4 <- c("tran1.col4", "tran2.col4")
X5 <- c("tran1.col5", "tran2.col5")
X6 <- c("tran1.col6", "tran2.col6")
df.x <- as.data.frame(matrix(c(X1, X2, X3, X4, X5, X6), ncol = 6))
I've looked at tidy routines to gather and spread data as well as melt and dcast in reshape, but as far as I can tell, I need to unwrap the rows first.
If all your inputs have 6 pieces of information by however many transactions, then the following should work.
vec <- as.character(unlist(t(df)))
df.x <- as.data.frame(matrix(vec, ncol = 6, byrow = T))
To break it down to explain what's happening ...
# Transpose the df (to a matrix)
mat <- t(df)
# With the matrix in this orientation, reading it column by column
# yields a vector in the correct sequence (i.e. tran1.col1,
# tran1.col2 .. tran2.col1, tran2.col2)
vec <- as.vector(mat)
# Now we can coerce it back to a data.frame, defining the number of columns
# and creating it by row (rather than column)
df.x <- as.data.frame(matrix(vec, ncol = 6, byrow = TRUE))
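If the number of fields per transaction varies between files, the same steps can be wrapped in a small function. This is only a sketch: the name unwrap_rows and its n_fields argument are hypothetical, and it assumes every transaction has the same number of fields with no padding cells:

```r
V1 <- c("tran1.col1", "tran1.col4", "tran2.col1", "tran2.col4")
V2 <- c("tran1.col2", "tran1.col5", "tran2.col2", "tran2.col5")
V3 <- c("tran1.col3", "tran1.col6", "tran2.col3", "tran2.col6")
df <- as.data.frame(matrix(c(V1, V2, V3), ncol = 3))

# hypothetical helper: n_fields is the number of fields per transaction
unwrap_rows <- function(df, n_fields) {
  # read the wrapped rows left to right, top to bottom
  vals <- as.vector(t(as.matrix(df)))
  # the cells must divide evenly into transactions
  stopifnot(length(vals) %% n_fields == 0)
  as.data.frame(matrix(vals, ncol = n_fields, byrow = TRUE),
                stringsAsFactors = FALSE)
}

out <- unwrap_rows(df, 6)
print(out)
```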

Replace value in column with corresponding value from another column in same dataframe

I'm trying to match a specific value in one column and replace it with the corresponding value from another column (same row). This is probably very easy, but I have been trying to find a solution with for loop, sub, subset, and data.table, and I have not succeeded. There must be a neat way of doing this.
Example data, where we aim at swapping a in the first column with the corresponding value in the second column and outputting the column again.
df <- data.frame(rbind(c('a','D'),c('H','W'),c('Q','E'),c('a','F'),c('U','P'),c('a','B')))
df$X1 <- as.character(df$X1)
df$X2 <- as.character(df$X2)
# not working
for (i in seq_along(df$X1)){
a <- df$X1[i]
b <- df$X2[i]
x <- ifelse(a[i=='a'], a[i]<-b[i], do.nothing )
print(x)
}
The output would be like this;
X1 X2
1 D a
2 H W
3 Q E
4 F a
5 U P
6 B a
(The switch isn't necessary.) It's the first column I'm interested in.
Any pointer would be appreciated, thanks!
There are several alternatives. Here are three:
Most basic, using data.frames:
df[ df$X1 == "a" , "X1" ] <- df[ df$X1 == "a", "X2" ]
More terse, using with:
df$X1 <- with( df, ifelse( X1 == "a", X2, X1 ) )
Most terse and transparent, using data.table:
library(data.table) ## >= 1.9.0
setDT(df) ## converts to data.table by reference, no need for `<-`
df[ X1 == "a", X1 := X2 ]
Here's another approach that also works if you have more than one condition (swap "a" for a vector of values).
find.a <- df$X1 %in% "a"
df[find.a, "X1"] <- df[find.a, "X2"]
df
X1 X2
1 D D
2 H W
3 Q E
4 F F
5 U P
6 B B
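The %in% approach above extends to several values at once. A minimal sketch, where the vector targets is a hypothetical set of values to replace and the columns are plain character vectors:

```r
df <- data.frame(X1 = c("a", "H", "Q", "a", "U", "a"),
                 X2 = c("D", "W", "E", "F", "P", "B"),
                 stringsAsFactors = FALSE)

# hypothetical: replace both "a" and "U" with the value from X2
targets <- c("a", "U")
hit <- df$X1 %in% targets

# copy X2 into X1 only on the matching rows
df$X1[hit] <- df$X2[hit]
print(df)
```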
