Removing duplicates on subset of columns in R

I have a table which is
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 10 0.00040803 0.00255277
[2,] 1 11 3 0.01765470 0.01584580
[3,] 1 6 2 0.15514850 0.15509000
[4,] 1 8 14 0.02100531 0.02572320
[5,] 1 9 4 0.04748648 0.00843252
[6,] 2 5 10 0.00040760 0.06782680
[7,] 2 11 3 0.01765480 0.01584580
[8,] 2 6 2 0.15514810 0.15509000
[9,] 2 8 14 0.02100491 0.02572320
[10,] 2 9 4 0.04748608 0.00843252
[11,] 3 5 10 0.00040760 0.06782680
[12,] 3 11 3 0.01765480 0.01584580
[13,] 3 8 14 0.02100391 0.02572320
[14,] 3 9 4 0.04748508 0.00843252
[15,] 4 5 10 0.00040760 0.06782680
[16,] 4 11 3 0.01765480 0.01584580
[17,] 4 8 14 0.02100391 0.02572320
[18,] 4 9 4 0.04748508 0.00843252
[19,] 5 8 14 0.02100391 0.02572320
[20,] 5 9 4 0.04748508 0.00843252
I want to remove duplicates from this table. However, only columns 2, 3 and 4 matter. Example: rows 1, 6, 11 and 15 are identical if only columns 2, 3 and 4 are considered. Note for column 4: is it possible to treat two values as the same as long as they are within 10e-5 of each other? Rows 1 and 6 would then be considered identical although the value in column 4 differs slightly (within the tolerance I mentioned).
Then it would be great to get an output like this:
column 2 value | column 3 value | column 1 value at which the pair was first observed (with the tolerance) (in the example 1) | column 1 value at which the pair was last observed (with the tolerance) (in the example 4) | value of column 4 at first appearance (0.00040803 in the example)

This is a way of thinking about it, but I'm not sure it's what you're looking for. The logic should be able to get you started though.
dat <- as.data.frame(YOUR_DATA_SET)  # placeholder: substitute your own matrix or data frame
dat
V1 V2 V3 V4 V5
1 1 5 10 0.00040803 0.00255277
2 1 11 3 0.01765470 0.01584580
3 1 6 2 0.15514850 0.15509000
4 1 8 14 0.02100531 0.02572320
5 1 9 4 0.04748648 0.00843252
# TRUNCATED
dat <- dat[, c(2, 3, 4)]
dat$V4 <- round(dat$V4, 5)
unique(dat)
V2 V3 V4
1 5 10 0.00041
2 11 3 0.01765
3 6 2 0.15515
4 8 14 0.02101
5 9 4 0.04749
9 8 14 0.02100
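Note that rounding is not quite the same as a tolerance: rows 4 and 9 survive as separate rows in the output above even though their column 4 values are much closer than the requested tolerance. A minimal sketch of why, using the two values from the question (the variable names and the choice of 1e-5 as the tolerance are assumptions for illustration):

```r
tol <- 1e-5          # assumed tolerance
a <- 0.02100531      # column 4, row 4 of the question
b <- 0.02100491      # column 4, row 9 of the question

# Rounding to 5 decimals puts the two values on different sides of a boundary
rounded_equal <- round(a, 5) == round(b, 5)

# A direct comparison against the tolerance treats them as equal
within_tol <- abs(a - b) < tol

rounded_equal
within_tol
```

If a true tolerance is needed, comparing absolute differences is the safer route; rounding is only an approximation of it.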

You could do something like this:
# read your data
yy <- read.csv('your-data.csv', header = FALSE)
## V1 V2 V3 V4 V5
## 1 1 5 10 0.00040803 0.00255277
## 2 1 11 3 0.01765470 0.01584580
## 3 1 6 2 0.15514850 0.15509000
## 4 1 8 14 0.02100531 0.02572320
# create a logical matrix indicating value is within tolerance
mat.eq.tol <- sapply(yy$V4, function(x) abs(yy$V4-x) < 1E-5)
# minimum index
eq.min <- apply(mat.eq.tol, 1, function(x) min(which(x)))
# maximum index
eq.max <- apply(mat.eq.tol, 1, function(x) max(which(x)))
# combine result
res <- cbind(yy$V2, yy$V3, yy$V1[eq.min], yy$V1[eq.max], yy$V4[eq.min])
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 10 1 4 0.00040803
## [2,] 11 3 1 4 0.01765470
## [3,] 6 2 1 2 0.15514850
## [4,] 8 14 1 5 0.02100531
## [5,] 9 4 1 5 0.04748648
## [6,] 5 10 1 4 0.00040803
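The code above compares column 4 only and returns one row per input row. A possible variation, sketched here under assumptions, groups on columns 2 and 3 first and then applies the tolerance within each group, which yields one row per pair. The helper name summarise_group, the column names first_V1, last_V1 and first_V4, and the use of only six rows from the question are illustrative choices, not part of the original answer. Rows in a group that are not within the tolerance of the group's first value are simply ignored in this sketch.

```r
# A few rows from the question (V5 omitted because it is not used)
dat <- data.frame(
  V1 = c(1, 1, 2, 2, 3, 3),
  V2 = c(5, 8, 5, 8, 5, 8),
  V3 = c(10, 14, 10, 14, 10, 14),
  V4 = c(0.00040803, 0.02100531, 0.00040760,
         0.02100491, 0.00040760, 0.02100391)
)
tol <- 1e-5  # assumed tolerance

# Hypothetical helper: summarise one (V2, V3) group
summarise_group <- function(g) {
  # rows whose V4 is within the tolerance of the first V4 in the group
  keep <- abs(g$V4 - g$V4[1]) < tol
  data.frame(
    V2 = g$V2[1],
    V3 = g$V3[1],
    first_V1 = min(g$V1[keep]),
    last_V1 = max(g$V1[keep]),
    first_V4 = g$V4[1]
  )
}

# Split by the (V2, V3) pair, summarise each piece, and stack the results
groups <- split(dat, list(dat$V2, dat$V3), drop = TRUE)
res <- do.call(rbind, lapply(groups, summarise_group))
rownames(res) <- NULL
res
```

Comparing against the first value of each group keeps the logic simple; if values drift gradually across rows, a different rule (for example, comparing each row with its predecessor) may fit better.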

Related

Create matrix row-index which increments when rowsum > 100, and following row

I have a matrix:
mat <- matrix(c(2,11,3,1,2,4,55,65,12,4,6,6,7,9,3,23,16,77,5,5,7),ncol = 3, byrow = TRUE)
[,1] [,2] [,3]
[1,] 2 11 3
[2,] 1 2 4
[3,] 55 65 12
[4,] 4 6 6
[5,] 7 9 3
[6,] 23 16 77
[7,] 5 5 7
I want to add a column with a row index. The index starts at 1 and repeats the same value until it reaches a row whose row sum is > 100, where it moves to the next value; the row after that one also gets a new value.
Indx [,2] [,3] [,4]
[1,] 1 2 11 3
[2,] 1 1 2 4
[3,] 2 55 65 12
[4,] 3 4 6 6
[5,] 3 7 9 3
[6,] 4 23 16 77
[7,] 5 5 5 7

Using rle:
matRle <- rle(rowSums(mat) > 100)$lengths
cbind(rep(seq(length(matRle)), matRle), mat)
# [,1] [,2] [,3] [,4]
# [1,] 1 2 11 3
# [2,] 1 1 2 4
# [3,] 2 55 65 12
# [4,] 3 4 6 6
# [5,] 3 7 9 3
# [6,] 4 23 16 77
# [7,] 5 5 5 7

A solution using dplyr.
library(dplyr)
mat2 <- mat %>%
as.data.frame() %>%
mutate(Indx = cumsum(rowSums(mat) > 100 | lag(rowSums(mat) > 100, default = TRUE))) %>%
select(Indx, paste0("V", 1:ncol(mat))) %>%
as.matrix()
mat2
# Indx V1 V2 V3
# [1,] 1 2 11 3
# [2,] 1 1 2 4
# [3,] 2 55 65 12
# [4,] 3 4 6 6
# [5,] 3 7 9 3
# [6,] 4 23 16 77
# [7,] 5 5 5 7

cbind(cumsum(replace(a<-rowSums(mat)>100,which(a==1)+1,1))+1,mat)
[,1] [,2] [,3] [,4]
[1,] 1 2 11 3
[2,] 1 1 2 4
[3,] 2 55 65 12
[4,] 3 4 6 6
[5,] 3 7 9 3
[6,] 4 23 16 77
[7,] 5 5 5 7
How this works:
First, flag the rows whose row sum is greater than 100:
a<-rowSums(mat)>100
The row after each flagged row must also start a new index, so mark those rows as well with replace and then take the cumulative sum:
cumsum(replace(a,which(a==1)+1,1))
This count starts from zero, so add 1.
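The same steps can be written out with named intermediate values. This is a minimal sketch using the matrix from the question; the names big, starts and idx are illustrative. It shifts the flags by one position instead of using replace, which avoids writing past the end of the vector when the last row is itself above 100.

```r
mat <- matrix(c(2, 11, 3, 1, 2, 4, 55, 65, 12, 4, 6, 6,
                7, 9, 3, 23, 16, 77, 5, 5, 7),
              ncol = 3, byrow = TRUE)

# Step 1: flag rows whose sum exceeds 100
big <- rowSums(mat) > 100

# Step 2: a new index starts at each flagged row and at the row right after it
starts <- big | c(FALSE, head(big, -1))

# Step 3: the cumulative sum counts the starts; add 1 so the index begins at 1
idx <- cumsum(starts) + 1

cbind(Indx = idx, mat)
```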

We could do this with rleid from data.table
library(data.table)
cbind(Indx = rleid(rowSums(mat) > 100), mat)
# Indx
#[1,] 1 2 11 3
#[2,] 1 1 2 4
#[3,] 2 55 65 12
#[4,] 3 4 6 6
#[5,] 3 7 9 3
#[6,] 4 23 16 77
#[7,] 5 5 5 7

What is the best way to tidy a matrix in R

Is there a best practice means of "tidying" a matrix/array? By "tidy" in this context I mean
one row per element of the matrix
one column per dimension. the elements of these columns give you the "coordinates" of the matrix element which is stored on that row
I have an example here for a 2d matrix, but ideally this would work with an array also (This example works for mm <- array(1:18, c(3,3,3)), but I thought that would be too much to paste in here)
mm <- matrix(1:9, nrow = 3)
mm
#> [,1] [,2] [,3]
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9
inds <- which(mm > -Inf, arr.ind = TRUE)
cbind(inds, value = mm[inds])
#> row col value
#> [1,] 1 1 1
#> [2,] 2 1 2
#> [3,] 3 1 3
#> [4,] 1 2 4
#> [5,] 2 2 5
#> [6,] 3 2 6
#> [7,] 1 3 7
#> [8,] 2 3 8
#> [9,] 3 3 9

as.data.frame.table
One way to convert from wide to long is the following. See ?as.data.frame.table for more information. No packages are used.
mm <- matrix(1:9, 3)
long <- as.data.frame.table(mm)
The code gives this data.frame:
> long
Var1 Var2 Freq
1 A A 1
2 B A 2
3 C A 3
4 A B 4
5 B B 5
6 C B 6
7 A C 7
8 B C 8
9 C C 9
numbers
If you prefer row and column numbers:
long[1:2] <- lapply(long[1:2], as.numeric)
giving:
> long
Var1 Var2 Freq
1 1 1 1
2 2 1 2
3 3 1 3
4 1 2 4
5 2 2 5
6 3 2 6
7 1 3 7
8 2 3 8
9 3 3 9
names
Note that the output above uses A, B, C, ... because there were no row or column names. They would have been used if present. That is, had there been row and column names and dimension names, the output would look like this:
mm2 <- array(1:9, c(3, 3), dimnames = list(A = c("a", "b", "c"), B = c("x", "y", "z")))
as.data.frame.table(mm2, responseName = "Val")
giving:
A B Val
1 a x 1
2 b x 2
3 c x 3
4 a y 4
5 b y 5
6 c y 6
7 a z 7
8 b z 8
9 c z 9
3d
Here is a 3d example:
as.data.frame.table(array(1:8, c(2,2,2)))
giving:
Var1 Var2 Var3 Freq
1 A A A 1
2 B A A 2
3 A B A 3
4 B B A 4
5 A A B 5
6 B A B 6
7 A B B 7
8 B B B 8
2d only
For 2d one can alternatively use row and col:
sapply(list(row(mm), col(mm), mm), c)
or
cbind(c(row(mm)), c(col(mm)), c(mm))
Either of these gives this matrix:
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 1 2
[3,] 3 1 3
[4,] 1 2 4
[5,] 2 2 5
[6,] 3 2 6
[7,] 1 3 7
[8,] 2 3 8
[9,] 3 3 9

Another method is to use arrayInd together with cbind like this.
# a 3 X 3 X 2 array
mm <- array(1:18, dim=c(3,3,2))
Similar to your code, but with the more natural arrayInd function, we have
# get array in desired format
myMat <- cbind(c(mm), arrayInd(seq_along(mm), .dim=dim(mm)))
# add column names
colnames(myMat) <- c("values", letters[24:26])
which returns
myMat
values x y z
[1,] 1 1 1 1
[2,] 2 2 1 1
[3,] 3 3 1 1
[4,] 4 1 2 1
[5,] 5 2 2 1
[6,] 6 3 2 1
[7,] 7 1 3 1
[8,] 8 2 3 1
[9,] 9 3 3 1
[10,] 10 1 1 2
[11,] 11 2 1 2
[12,] 12 3 1 2
[13,] 13 1 2 2
[14,] 14 2 2 2
[15,] 15 3 2 2
[16,] 16 1 3 2
[17,] 17 2 3 2
[18,] 18 3 3 2

How can I find repeated values/ data points and their index in 2D matrix of a dataframe in R?

For example suppose I have matrix A
x y z f
1 1 2 A 1005
2 2 4 B 1002
3 3 2 B 1001
4 4 8 C 1001
5 5 10 D 1004
6 6 12 D 1004
7 7 11 E 1005
8 8 14 E 1003
From this matrix I want to find the repeated values like 1001, 1005, D, 2 (in third column) and I also want to find their index (which row, or which position).
I am new to R!
Obviously it is possible to do with simple searching element by element by using a for loop, but I want to know, is there any function available in R for this kind of problem.
Furthermore, I tried using duplicated and unique, both functions are giving me the duplicated row number or column number, they are also giving me how many of them were repeated, but I can not search for whole matrix using both of them!

You can write a rather simple function to get this information. Note that this solution works with a matrix; it does not work with a data.frame. A similar function could be written for a data.frame using the fact that a data.frame is a special kind of list.
# example data
set.seed(234)
m <- matrix(sample(1:10, size=100, replace=TRUE), 10)
find_matches <- function(mat, value) {
nr <- nrow(mat)
val_match <- which(mat == value)
out <- matrix(NA, nrow= length(val_match), ncol= 2)
# subtract 1 before dividing so that matches in the last row map correctly
out[,2] <- (val_match - 1) %/% nr + 1
out[,1] <- (val_match - 1) %% nr + 1
return(out)
}
R> m
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 8 6 6 7 6 7 4 10 6 9
[2,] 8 6 6 3 10 4 5 4 6 9
[3,] 1 6 9 2 9 2 3 6 4 2
[4,] 8 6 7 8 3 9 9 4 9 2
[5,] 1 1 5 6 7 1 5 1 10 6
[6,] 7 5 4 7 8 2 4 4 7 10
[7,] 10 4 7 8 3 1 8 6 3 4
[8,] 8 8 2 2 7 5 6 4 10 4
[9,] 10 2 9 6 6 9 7 2 4 7
[10,] 3 9 9 4 2 7 7 2 9 6
R> find_matches(m, 8)
[,1] [,2]
[1,] 1 1
[2,] 2 1
[3,] 4 1
[4,] 8 1
[5,] 8 2
[6,] 4 4
[7,] 7 4
[8,] 6 5
[9,] 7 7
In this function, the row index is output in column 1 and the column index is output in column 2
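Base R can also report positions directly with which(..., arr.ind = TRUE), and duplicated can be applied column by column to find which values repeat. The following is a minimal sketch on the data from the question; the helper name repeated_positions is illustrative, and treating each column separately is an assumption about what is wanted.

```r
# The data from the question, as a data.frame
A <- data.frame(
  x = 1:8,
  y = c(2, 4, 2, 8, 10, 12, 11, 14),
  z = c("A", "B", "B", "C", "D", "D", "E", "E"),
  f = c(1005, 1002, 1001, 1001, 1004, 1004, 1005, 1003),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one column, list each repeated value
# together with the row numbers where it occurs
repeated_positions <- function(col) {
  dup_vals <- unique(col[duplicated(col)])
  lapply(setNames(dup_vals, dup_vals), function(v) which(col == v))
}

# Apply the helper to every column; columns without repeats give an empty list
res <- lapply(A, repeated_positions)
res$f

# Row and column positions of one value across several numeric columns
pos <- which(as.matrix(A[, c("y", "f")]) == 1001, arr.ind = TRUE)
pos
```

Because a data.frame is a list of columns, lapply visits each column in turn, so no explicit loop over elements is needed.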

All possible combinations over groups

I have 5 groups: G1, G2,…,G5 with n1,n2,…,n5 elements in each group respectively. I select 2 elements from each of the 4 groups and 1 element from the 5th group. How do I generate all possible combinations in R?

The question does not say whether the groups are mutually exclusive, so assume:
1. the groups are mutually exclusive;
2. the subsets of the groups (n1, n2, ...) are filled from the same elements;
3. just for the sake of argument, |G1|=|G2|=|G3|=5 (change the following code accordingly for groups with different numbers of elements).
The following is a 3-set mock-up answer that can be generalized to an arbitrary number of groups. Assume the group names are G1, G2, G3.
library(causfinder)
gctemplate(5,2,2) # Elements are coded as: 1,2,3,4,5; |sub-G1|=2; |sub-G2|=2; |sub-G3|=5-(2+2)=1
# In the following table, each number represents a unique element; the table is the solution.
My package (causfinder) is not on CRAN, so the code of the function gctemplate is given after the table.
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5 sub-G1={1,2} sub-G2={3,4} sub-G3={5}
[2,] 1 2 3 5 4
[3,] 1 2 4 5 3 sub-G1={1,2} sub-G2={4,5} sub-G3={3}
[4,] 1 3 2 4 5
[5,] 1 3 2 5 4
[6,] 1 3 4 5 2
[7,] 1 4 2 3 5
[8,] 1 4 2 5 3
[9,] 1 4 3 5 2
[10,] 1 5 2 3 4
[11,] 1 5 2 4 3
[12,] 1 5 3 4 2
[13,] 2 3 1 4 5
[14,] 2 3 1 5 4
[15,] 2 3 4 5 1
[16,] 2 4 1 3 5
[17,] 2 4 1 5 3
[18,] 2 4 3 5 1
[19,] 2 5 1 3 4
[20,] 2 5 1 4 3
[21,] 2 5 3 4 1
[22,] 3 4 1 2 5
[23,] 3 4 1 5 2
[24,] 3 4 2 5 1
[25,] 3 5 1 2 4
[26,] 3 5 1 4 2
[27,] 3 5 2 4 1
[28,] 4 5 1 2 3
[29,] 4 5 1 3 2
[30,] 4 5 2 3 1
The code of gctemplate:
gctemplate <- function(nvars, ncausers, ndependents){
independents <- combn(nvars, ncausers)
patinajnumber <- dim(combn(nvars - ncausers, ndependents))[[2]]
independentspatinajednumber <- dim(combn(nvars, ncausers))[[2]]*patinajnumber
dependents <- matrix(, nrow = dim(combn(nvars, ncausers))[[2]]*patinajnumber, ncol = ndependents)
for (i in as.integer(1:dim(combn(nvars, ncausers))[[2]])){
dependents[(patinajnumber*(i-1)+1):(patinajnumber*i),] <- t(combn(setdiff(seq(1:nvars), independents[,i]), ndependents))
}
independentspatinajed <- matrix(, nrow = dim(combn(nvars, ncausers))[[2]]*patinajnumber, ncol = ncausers)
for (i in as.integer(1:dim(combn(nvars, ncausers))[[2]])){
for (j in as.integer(1:patinajnumber)){
independentspatinajed[(i-1)*patinajnumber+j,] <- independents[,i]
}}
independentsdependents <- cbind(independentspatinajed, dependents)
others <- matrix(, nrow = dim(combn(nvars, ncausers))[[2]]*patinajnumber, ncol = nvars - ncausers - ndependents)
for (i in as.integer(1:((dim(combn(nvars, ncausers))[[2]])*patinajnumber))){
others[i, ] <- setdiff(seq(1:nvars), independentsdependents[i,])
}
causalitiestemplate <- cbind(independentsdependents, others)
causalitiestemplate
}
The table above is the solution for G1, G2, G3. The same logic generalizes to the 5-group case.
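If the five groups are instead read as five separate sets of elements, as the question's wording suggests, base R's combn and expand.grid are enough. This is a minimal sketch under that assumption; the group contents, the vector k of pick sizes, and the names choices, grid and combos are made up for illustration.

```r
# Hypothetical groups; replace with your own elements
groups <- list(
  G1 = c("a1", "a2", "a3"),
  G2 = c("b1", "b2"),
  G3 = c("c1", "c2", "c3"),
  G4 = c("d1", "d2"),
  G5 = c("e1", "e2")
)
k <- c(2, 2, 2, 2, 1)  # how many elements to pick from each group

# All ways to pick k[i] elements from group i, kept as a list per group
choices <- Map(function(g, size) combn(g, size, simplify = FALSE), groups, k)

# Every combination of one choice index per group
grid <- expand.grid(lapply(choices, seq_along))

# Look up the chosen elements for each row of the grid
combos <- t(apply(grid, 1, function(ix) {
  unlist(Map(function(ch, i) ch[[i]], choices, ix), use.names = FALSE)
}))
dim(combos)
```

Each row of combos is one complete selection, and the number of rows equals the product of choose(n_i, k_i) over the groups, so the result can grow quickly for large groups.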

Create dataframe of all array indices in R

Using R, I'm trying to construct a dataframe of the row and col numbers of a given matrix. E.g., if
a <- matrix(c(1:15), nrow=5, ncol=3)
then I'm looking to construct a dataframe that gives:
row col
1 1
1 2
1 3
. .
5 1
5 2
5 3
What I've tried:
row <- matrix(row(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
col <- matrix(col(a), ncol=1, nrow=dim(a)[1]*dim(a)[2], byrow=T)
out <- cbind(row, col)
colnames(out) <- c("row", "col")
results in:
row col
[1,] 1 1
[2,] 2 1
[3,] 3 1
[4,] 4 1
[5,] 5 1
[6,] 1 2
[7,] 2 2
[8,] 3 2
[9,] 4 2
[10,] 5 2
[11,] 1 3
[12,] 2 3
[13,] 3 3
[14,] 4 3
[15,] 5 3
This isn't what I'm looking for, as the sequence of rows and cols is suddenly reversed, even though I specified "byrow=T". I don't see if and where I'm making a mistake, but I would hugely appreciate suggestions to overcome this problem. Thanks in advance!

I'd use expand.grid on the vectors 1:ncol and 1:nrow, then flip the columns with [,2:1] to get them in the order you want:
> expand.grid(seq(ncol(a)),seq(nrow(a)))[,2:1]
Var2 Var1
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3

Use row and col, but more directly manipulate their output ordering since they return corresponding indices in place for the input array. Use t to get the non-default order you want in the end:
data.frame(row = as.vector(t(row(a))), col = as.vector(t(col(a))))
row col
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
10 4 1
11 4 2
12 4 3
13 5 1
14 5 2
15 5 3
Or, as a matrix not a data.frame:
cbind(as.vector(t(row(a))), as.vector(t(col(a))))
[,1] [,2]
[1,] 1 1
[2,] 1 2
[3,] 1 3
[4,] 2 1
[5,] 2 2
[6,] 2 3
[7,] 3 1
[8,] 3 2
[9,] 3 3
[10,] 4 1
[11,] 4 2
[12,] 4 3
[13,] 5 1
[14,] 5 2
[15,] 5 3
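As for why byrow=T had no effect in the question's attempt: matrix() flattens its input to a plain vector first, and filling a single column by row or by column gives the same order. A small sketch of that behaviour, with illustrative variable names:

```r
a <- matrix(1:15, nrow = 5, ncol = 3)

# With a single column, byrow changes nothing: the data are still taken
# from row(a) in its stored (column-major) order
one_col <- as.vector(matrix(row(a), ncol = 1, byrow = TRUE))
stored <- as.vector(row(a))
identical(one_col, stored)

# Transposing first makes the stored order run along the rows of a
row_major <- as.vector(t(row(a)))
row_major
```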

You may want to have a look at ?expand.grid, which does just about exactly what you want to achieve.

Since there are many ways to skin a cat, I'll chip in with yet another variant based on rep:
data.frame(row=rep(seq(nrow(a)), each=ncol(a)), col=rep(seq(ncol(a)), nrow(a)))
...but to announce a "winner", I think you need to time the solutions:
# Make up a huge matrix...
a <- matrix(runif(1e7), 1e4)
system.time( a1<-data.frame(row = as.vector(t(row(a))),
col = as.vector(t(col(a)))) ) # 0.68 secs
system.time( a2<-expand.grid(col = seq(ncol(a)),
row = seq(nrow(a)))[,2:1] ) # 0.49 secs
system.time( a3<-data.frame(row=rep(seq(nrow(a)), each=ncol(a)),
col=rep(seq(ncol(a)), nrow(a))) ) # 0.59 secs
identical(a1, a2) && identical(a1, a3) # TRUE
...so it seems @Spacedman has the speediest solution!
