sum of combinations rows (2 rows) in dataframe without repetitions - r

I need to extend a data frame by adding the sum of every pair of rows (combinations without repetition) and writing the result to a new data frame. The result has far more rows than the original data frame, so I would like to avoid a loop and use something like apply instead. Example data frame:
1 3 6
2 2 4
5 1 2
6 4 1
The result should be:
1 3 6
2 2 4
5 1 2
6 4 1
3 5 10
6 4 8
7 7 7
7 3 6
8 6 5
11 5 3

We can use combn to generate the combinations of row numbers taken 2 at a time, pass it a custom function that adds each pair of rows, and bind the results to the original data frame.
rbind(df, do.call("rbind",
combn(1:nrow(df), 2, function(x) df[x[1], ] + df[x[2], ], simplify = FALSE)))
# V1 V2 V3
#1 1 3 6
#2 2 2 4
#3 5 1 2
#4 6 4 1
#11 3 5 10
#23 6 4 8
#32 7 7 7
#21 7 3 6
#22 8 6 5
#31 11 5 3
FYI, the key part here is
combn(1:nrow(df), 2) #which gives
# [,1] [,2] [,3] [,4] [,5] [,6]
#[1,] 1 1 1 2 2 3
#[2,] 2 3 4 3 4 4
Each column of this matrix is one pair of row numbers, and those pairs are used to subset rows from the original data frame.
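The same pairs can also be applied in one step instead of one function call per pair, by indexing the data frame with whole rows of the combn matrix. This is a minimal sketch, assuming df holds the example data with the default column names V1 to V3; the names idx, pair_sums and result are illustrative and not part of the answer above.

```r
# Example data from the question (column names V1..V3 are an assumption)
df <- data.frame(V1 = c(1, 2, 5, 6),
                 V2 = c(3, 2, 1, 4),
                 V3 = c(6, 4, 2, 1))

# Each column of idx is one pair of row numbers
idx <- combn(nrow(df), 2)

# Add the first row of every pair to the second row of every pair at once
pair_sums <- df[idx[1, ], ] + df[idx[2, ], ]
rownames(pair_sums) <- NULL  # drop the repeated row labels

# Original rows first, then the six pairwise sums
result <- rbind(df, pair_sums)
result
```

Because the addition happens on two whole data frames, the number of R-level function calls no longer grows with the number of pairs, which may matter for larger inputs.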

Related

Generating an vector with rep and seq but without the c() function [duplicate]

This question already has answers here:
R repeating sequence add 1 each repeat
(2 answers)
Closed 5 months ago.
Suppose that I am not allowed to use the c() function.
My target is to generate the vector
"1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9"
Here is my attempt:
rep(seq(1, 5, 1), 5)
# [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(0:4,rep(5,5))
# [1] 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
So basically I am summing them up. But I wonder if there is a better way that uses the rep and seq functions ONLY.
Like so:
1:5 + rep(0:4, each = 5)
# [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
I like the sequence option as well (the from argument requires R 4.0.0 or later):
sequence(rep(5, 5), 1:5)
You could do
rep(1:5, each=5) + rep.int(0:4, 5)
# [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
Just to be precise and use seq as well:
rep(seq.int(1, 5), each=5) + rep.int(0:4, 5)
(PS: You can remove the .ints, but it's slower.)
One possible way:
as.vector(sapply(1:5, `+`, 0:4))
# [1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
I would also propose the outer() function:
library(dplyr) # re-exports the magrittr pipe %>%
outer(1:5, 0:4, "+") %>%
array()
Or, without the magrittr %>% pipe, using the native pipe available in R 4.1 and later:
outer(1:5, 0:4, "+") |>
array()
Explanation.
The first function, outer(), builds a 5 by 5 matrix from the sequences 1:5 and 0:4 and fills each cell with the sum of the corresponding values:
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 2 3 4 5 6
[3,] 3 4 5 6 7
[4,] 4 5 6 7 8
[5,] 5 6 7 8 9
The second, array(), flattens that matrix column by column into a one-dimensional array, which is the required vector:
[1] 1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 4 5 6 7 8 5 6 7 8 9
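For reference, the rep-based and outer-based approaches above can be checked against each other and wrapped in a small helper. This is only a sketch: the function name shifted_blocks and its argument n are hypothetical, and it uses seq_len in place of seq purely for brevity.

```r
# Hypothetical helper: n blocks of length n, each block shifted up by 1
shifted_blocks <- function(n) {
  rep(seq_len(n), times = n) + rep(seq_len(n) - 1L, each = n)
}

target <- shifted_blocks(5)
target

# The outer() approach gives the same values once the matrix is flattened
from_outer <- as.vector(outer(1:5, 0:4, "+"))
all(target == from_outer)
```

Both versions rely on the same idea: one term repeats the base sequence, the other supplies the per-block offset.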

melt the lower half from symmetric matrix in R

Given that I have a three by three symmetric matrix.
> x<-matrix(1:9,3)
> x[lower.tri(x)] = t(x)[lower.tri(x)]
> x
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 4 5 8
[3,] 7 8 9
Then I use the reshape2 library to convert it to long format.
> library(reshape2)
> x <- melt(x)
> x
Var1 Var2 value
1 1 1 1
2 2 1 4
3 3 1 7
4 1 2 4
5 2 2 5
6 3 2 8
7 1 3 7
8 2 3 8
9 3 3 9
As the upper and lower triangles are identical, I only need half of the result, which should look like this:
Var1 Var2 value
1 1 1
2 1 4
3 1 7
2 2 5
3 2 8
3 3 9
Any elegant approach to do this?
You can set the values in the upper (or lower) triangle of the original matrix x (before it is melted) to NA, and then melt while dropping missing values. This assumes the matrix has no missing values to begin with, or that you don't need to keep them in the result if it does:
x[upper.tri(x)] = NA
reshape2::melt(x, na.rm = TRUE)
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
Since 'x' has already been reassigned to the melted data frame, we can instead sort the first two columns within each row, flag the duplicated index pairs, and use that logical index to keep only the non-duplicated rows:
x[!duplicated(t(apply(x[1:2], 1, sort))),]
# Var1 Var2 value
#1 1 1 1
#2 2 1 4
#3 3 1 7
#5 2 2 5
#6 3 2 8
#9 3 3 9
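If reshape2 is not available, the same long format can be sketched in base R by asking which() for the row and column positions of the lower triangle. This is a minimal sketch under that assumption; the names keep, idx and long are illustrative and do not come from the answers above.

```r
# Rebuild the symmetric example matrix from the question
x <- matrix(1:9, 3)
x[lower.tri(x)] <- t(x)[lower.tri(x)]

# Logical mask for the lower triangle, diagonal included
keep <- lower.tri(x, diag = TRUE)

# Row and column positions of the kept cells, in column-major order
idx <- which(keep, arr.ind = TRUE)

long <- data.frame(Var1 = idx[, "row"],
                   Var2 = idx[, "col"],
                   value = x[keep])
long
```

Unlike the NA approach, this leaves x untouched and keeps any genuine missing values that fall inside the lower triangle.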

Reduce columns of a matrix by a function in R

I have a matrix sort of like:
data <- round(runif(30)*10)
dimnames <- list(c("1","2","3","4","5"),c("1","2","3","2","3","2"))
values <- matrix(data, ncol=6, dimnames=dimnames)
# 1 2 3 2 3 2
# 1 5 4 9 6 7 8
# 2 6 9 9 1 2 5
# 3 1 2 5 3 10 1
# 4 6 5 1 8 6 4
# 5 6 4 5 9 4 4
Some of the column names are the same. I want to essentially reduce the columns in this matrix by taking the min of all values in the same row where the columns have the same name. For this particular matrix, the result would look like this:
# 1 2 3
# 1 5 4 7
# 2 6 1 2
# 3 1 1 5
# 4 6 4 1
# 5 6 4 4
The actual data set I'm using here has around 50,000 columns and 4,500 rows. None of the values are missing and the result will have around 40,000 columns. The way I tried to solve this was by melting the data then using group_by from dplyr before reshaping back to a matrix. The problem is that it takes forever to generate the data frame from the melt and I'd like to be able to iterate faster.
We can use rowMins from library(matrixStats)
library(matrixStats)
res <- vapply(split(seq_len(ncol(values)), colnames(values)),
function(i) rowMins(values[, i, drop = FALSE]), numeric(nrow(values)))
res
# 1 2 3
#[1,] 5 4 7
#[2,] 6 1 2
#[3,] 1 1 5
#[4,] 6 4 1
#[5,] 6 4 4
row.names(res) <- row.names(values)
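For comparison, the same grouping idea can be sketched with base R only, replacing rowMins with pmin applied across the columns of each group. This is a sketch rather than a tuned solution: the matrix is typed in from the printed example so the result is reproducible, the name groups is illustrative, and no claim is made about how it performs on 50,000 columns.

```r
# Example matrix from the question, entered by hand (column-major order)
values <- matrix(c(5, 6, 1, 6, 6,
                   4, 9, 2, 5, 4,
                   9, 9, 5, 1, 5,
                   6, 1, 3, 8, 9,
                   7, 2, 10, 6, 4,
                   8, 5, 1, 4, 4),
                 ncol = 6,
                 dimnames = list(as.character(1:5),
                                 c("1", "2", "3", "2", "3", "2")))

# Column positions grouped by column name
groups <- split(seq_len(ncol(values)), colnames(values))

# For each group, take the element-wise minimum over its columns
res <- vapply(groups,
              function(i) do.call(pmin, lapply(i, function(j) values[, j])),
              numeric(nrow(values)))
rownames(res) <- rownames(values)
res
```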

Rearranging the columns of a data frame [duplicate]

This question already has answers here:
Splitting triplicates into duplicates
(3 answers)
Closed 8 years ago.
Given a data frame, I'd like to rearrange it and return another data frame of 2 columns. The 2 columns of this data frame are made up of any 2 elements of a row in the original data frame. So we will have C(ncol,2) * nrow number of rows in the second data frame. Here's an example. Given the data frame z, I'd like to return x. How can I do this?
> z = data.frame(A = c(1,2,3), B = c(4,5,6), C = c(7,8,9))
> z
A B C
1 1 4 7
2 2 5 8
3 3 6 9
> x
A B
1 1 4
2 1 7
3 4 7
4 2 5
5 2 8
6 5 8
7 3 6
8 3 9
9 6 9
Or, you could try:
matrix(apply(z, 1, combn,2), ncol=2, byrow=TRUE)
# [,1] [,2]
#[1,] 1 4
#[2,] 1 7
#[3,] 4 7
#[4,] 2 5
#[5,] 2 8
#[6,] 5 8
#[7,] 3 6
#[8,] 3 9
#[9,] 6 9
To get data.frame as output
setNames(as.data.frame(matrix(apply(z, 1, combn,2), ncol=2, byrow=TRUE)), LETTERS[1:2])
Something like this would work
newz <- setNames(do.call(rbind.data.frame, lapply(split(z, 1:nrow(z)), function(x)
t(combn(x,2)))),
c("A","B"))
newz
# A B
# 1.1 1 4
# 1.2 1 7
# 1.3 4 7
# 2.1 2 5
# 2.2 2 8
# 2.3 5 8
# 3.1 3 6
# 3.2 3 9
# 3.3 6 9
This generates the new rows using all combinations of the columns via combn(). If you hate the default rownames, you can get rid of them with
rownames(newz)<-NULL
newz
# A B
# 1 1 4
# 2 1 7
# 3 4 7
# 4 2 5
# 5 2 8
# 6 5 8
# 7 3 6
# 8 3 9
# 9 6 9
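Another way to sketch this is to compute the column pairs once with combn and then pull the paired columns out of the matrix, rather than calling combn on every row. This assumes all columns of z share one type so that as.matrix is safe; the names pairs, m and x2 are illustrative.

```r
z <- data.frame(A = c(1, 2, 3), B = c(4, 5, 6), C = c(7, 8, 9))

# Each column of pairs is one pair of column positions: (1,2), (1,3), (2,3)
pairs <- combn(ncol(z), 2)
m <- as.matrix(z)

# Transposing before flattening keeps the pairs of one row together
x2 <- data.frame(A = as.vector(t(m[, pairs[1, ]])),
                 B = as.vector(t(m[, pairs[2, ]])))
x2
```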

convert rows after column

I have a CSV file that reads like this:
1 5
2 3
3 2
4 6
5 3
6 7
7 2
8 1
9 1
What I want to do is this:
1 5 4 6 7 2
2 3 5 3 8 1
3 2 6 7 9 1
That is, after every third row, I want the following rows to start a new pair of columns placed side by side. Any advice?
Thanks a lot
Here's a way to do this with matrix indexing. It's a bit strange, but I find it interesting so I will post it.
You want an index matrix, with indices as follows. This gives the order of your data as a matrix (column-major order):
1, 1
2, 1
3, 1
1, 2
2, 2
3, 2
4, 1
...
8, 2
9, 2
This gives the pattern that you need to select the elements. Here's one approach to building such a matrix. Say that your data is in the object dat, a data frame or matrix:
m <- matrix(
c(
outer(rep(1:3, 2), seq(0,nrow(dat)-1,by=3), FUN='+'),
rep(rep(1:2, each=3), nrow(dat)/3)
),
ncol=2
)
The outer expression is the first column of the desired index matrix, and the rep expression is the second column. Now just index dat with this index matrix, and build a result matrix with three rows:
matrix(dat[m], nrow=3)
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 5 4 6 7 2
## [2,] 2 3 5 3 8 1
## [3,] 3 2 6 7 9 1
Another option is to split the rows into groups of three and bind the groups side by side:
a <- read.table(text = "1 5
2 3
3 2
4 6
5 3
6 7
7 2
8 1
9 1")
(seq_len(nrow(a))-1) %/% 3
# [1] 0 0 0 1 1 1 2 2 2
split(a, (seq_len(nrow(a))-1) %/% 3)
# $`0`
# V1 V2
# 1 1 5
# 2 2 3
# 3 3 2
# $`1`
# V1 V2
# 4 4 6
# 5 5 3
# 6 6 7
# $`2`
# V1 V2
# 7 7 2
# 8 8 1
# 9 9 1
do.call(cbind,split(a, (seq_len(nrow(a))-1) %/% 3))
# 0.V1 0.V2 1.V1 1.V2 2.V1 2.V2
# 1 1 5 4 6 7 2
# 2 2 3 5 3 8 1
# 3 3 2 6 7 9 1
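The same rearrangement can also be sketched by reshaping the data into a three-dimensional array and permuting its dimensions. This is a minimal sketch, assuming the number of rows is a multiple of the block size; the names block, arr and wide are illustrative and not from the answers above.

```r
# Example data from the question, entered directly
a <- data.frame(V1 = 1:9, V2 = c(5, 3, 2, 6, 3, 7, 2, 1, 1))

block <- 3                      # rows per block (assumed to divide nrow(a))
m <- as.matrix(a)
stopifnot(nrow(m) %% block == 0)

# Dimensions: row within block, block number, original column
arr <- array(m, dim = c(block, nrow(m) / block, ncol(m)))

# Reorder so the original column varies fastest within each block
wide <- matrix(aperm(arr, c(1, 3, 2)), nrow = block)
wide
```

The result is a plain matrix, so column names would need to be added separately if they matter.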
