Remove duplicated columns in matrix - r

I have a data set of dimension 401 x 5677. Some columns of this matrix are identical but have different column names.
Now I want to keep only one column from each set of repeated columns, and also get the index j of each column removed.
Let us use the following matrix as an example:
B=matrix(c(1,4,0,2,56,7,1,4,0,33,2,5), nrow=3)
colnames(B)<-c("a","b","c","d")
What I did so far (on my real matrix G) is:
corrG<-cor(G)
Gtest=G
for (i in 1:nrow(corrG)){
for (j in 1:ncol(corrG)){
if (i<j && corrG[i,j]==1){
Gtest[,j]=NA
}
}
}
Gfinal<-Gtest[,complete.cases(t(Gtest))]
My code returns a matrix that still contains (!) some duplicated columns.
Any help?

Try the duplicated function on the transpose of the matrix:
duplicated.columns <- duplicated(t(your.matrix))
new.matrix <- your.matrix[, !duplicated.columns]
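The question also asks for the indices of the removed columns. A minimal sketch on the example matrix B, reusing the same logical vector with which() (the names dup, removed_idx and B_unique are illustrative and not part of the original answer). Note also that comparing cor() output to exactly 1 is fragile with floating-point values, and a correlation of 1 only shows that two columns are linearly related, not that they are identical:

```r
B <- matrix(c(1, 4, 0, 2, 56, 7, 1, 4, 0, 33, 2, 5), nrow = 3)
colnames(B) <- c("a", "b", "c", "d")

# TRUE for every column that repeats an earlier column
dup <- duplicated(t(B))

# Positions and names of the columns that will be dropped
removed_idx <- which(dup)
removed_names <- colnames(B)[dup]

# Keep the first occurrence of each distinct column;
# drop = FALSE keeps a matrix even if only one column survives
B_unique <- B[, !dup, drop = FALSE]
```

Here removed_idx is the index j asked for in the question; on B it points at column "c", which repeats column "a".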

One line answer
B = matrix(c(1, 4, 0, 2, 56, 7, 1, 4, 0, 33, 2, 5), nrow = 3)
colnames(B) <- c("a", "b", "c", "d")
B
##      a  b c  d
## [1,] 1  2 1 33
## [2,] 4 56 4  2
## [3,] 0  7 0  5
B[, !duplicated(t(B))]
##      a  b  d
## [1,] 1  2 33
## [2,] 4 56  2
## [3,] 0  7  5

Related

Change the column of same values to column of all zeros in R

Assume I have a list called LS1 that contains 20 matrices of size 100 by 5. Some columns might hold a single repeated value, for example a column that is all 100. I want to set such columns to all zeros. I can write a for loop to do that, but I want to do it more efficiently with lapply and apply. One example of such a matrix is
1 2 3 4 5
1 3 4 5 6
1 5 6 8 9
I want the first column, which is all ones, to be changed to all zeros.
This is what I have done:
A = lapply(LS1, function(x) {apply(x, 2, function(x1) {if (max(x1) == min(x1)) {0}})})
but this makes all the values NULL. Can anyone suggest doing this with lapply and apply?
This should work, especially for integer matrices.
lapply(lst,
function(mat) {
all_dupes = apply(mat, 2, function(x) length(unique(x)) ==1)
mat[, all_dupes] = 0L
return(mat)
}
)
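For context on the NULL results in the question: an if without an else returns NULL when its condition is FALSE, so every non-constant column came back as NULL (and constant columns as a single 0 rather than a full column). A small runnable sketch of the approach above, assuming a made-up two-element list in place of the real LS1 and a hypothetical helper name zero_constant_cols:

```r
# Hypothetical small list standing in for the real LS1
LS1 <- list(
  matrix(c(1, 1, 1,  2, 3, 5,  3, 4, 6), nrow = 3),
  matrix(c(7, 8, 9,  100, 100, 100), nrow = 3)
)

# Replace every constant column with zeros and return the whole matrix
zero_constant_cols <- function(mat) {
  is_constant <- apply(mat, 2, function(x) length(unique(x)) == 1)
  mat[, is_constant] <- 0
  mat
}

res <- lapply(LS1, zero_constant_cols)
```

The key difference from the attempt in the question is that the function modifies the matrix in place and returns it, instead of returning a value per column.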
This is my solution:
df <- data.frame(a = c(1, 1, 1),
b = c(2, 3, 5),
c = c(4, 5, 8),
d = c(5, 6, 9),
e = c(5, 5, 5))
A = data.frame(lapply(df, function(x) (max(x) != min(x)) * x))
> A
  a b c d e
1 0 2 4 5 0
2 0 3 5 6 0
3 0 5 8 9 0
If you use sapply instead, the result is a matrix:
A = sapply(df, function(x) (max(x) != min(x)) * x)
> A
     a b c d e
[1,] 0 2 4 5 0
[2,] 0 3 5 6 0
[3,] 0 5 8 9 0

Generate matrices using positive integer solutions of the indefinite equation

I asked a similar question previously, but this one is a little more tricky. I have a matrix (say A) of POSITIVE INTEGER solutions (previously NON-NEGATIVE solutions) to the indefinite equation x1+x2+x3 = 8. I also have another matrix (say B) with columns
0 1 0 1
0 0 1 1
I want to generate matrices using rows of A and the columns of B.
For example, let (2,2,4) be one solution (one row) of the matrix A. In this case, I cannot simply use rep. So I tried to generate all the three-column matrices from matrix B and then apply rep, but couldn't figure that out. I use the following lines to generate a list of all three-column matrices.
cols <- combn(ncol(B), 3, simplify=F, FUN=as.numeric)
M3 <- lapply(cols, function(x) cbind(B[,x]))
For an example, cols[[1]]
[1] 1 2 3
Then, the columns of my new matrix would be
0 0 1 1 0 0 0 0
0 0 0 0 1 1 1 1
The columns of this new matrix are repeats of the columns of B, i.e., the first column 2 times, the second column 2 times and the third column 4 times. I want to apply this procedure to all the rows of matrix A. How do I do this?
The help page for rep (?rep) says:
if times is a vector of the same length as x (after replication by
each), the result consists of x[1] repeated times[1] times, x[2]
repeated times[2] times and so on.
The basic idea is:
B <- matrix(c(0, 1, 0, 1, 0, 0, 1, 1), byrow = T, nrow = 2)
cols <- combn(ncol(B), 3, simplify=F, FUN=as.numeric)
a1 <- c(2, 2, 4)
cols[[1]] # [1] 1 2 3
rep(cols[[1]], a1) # [1] 1 1 2 2 3 3 3 3
B[, rep(cols[[1]], a1)]
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# [1,]    0    0    1    1    0    0    0    0
# [2,]    0    0    0    0    1    1    1    1
testA <- rbind(c(2,2,4), c(2,1,5), c(2,3,3))
## apply(..., lapply(...)) approach (output is list in list)
apply(testA, 1, function(x) lapply(cols, function(y) B[, rep(y, x)]))
## other approach using combination of indices
ind <- expand.grid(ind_cols = 1:length(cols), ind_A = 1:nrow(testA))
col_ind <- apply(ind, 1, function(x) rep(cols[[x[1]]], testA[x[2],]))
lapply(1:ncol(col_ind), function(x) B[, col_ind[,x]]) # output is list
library(dplyr)
apply(col_ind, 2, function(x) t(B[, x])) %>% matrix(ncol = 8, byrow=T) # output is matrix
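To run the same idea over the full solution matrix rather than testA, here is a hedged sketch. It assumes A is not already available and builds it by choosing two cut points among 1..7 (one common way to enumerate positive integer solutions of x1+x2+x3 = 8); the names cuts and result are illustrative only:

```r
B <- matrix(c(0, 1, 0, 1,
              0, 0, 1, 1), byrow = TRUE, nrow = 2)

# Positive integer solutions of x1 + x2 + x3 = 8:
# pick 2 cut points p1 < p2 in 1..7, the parts are p1, p2 - p1, 8 - p2
cuts <- combn(7, 2)
A <- t(apply(cuts, 2, function(p) diff(c(0, p, 8))))

# All ways to choose three columns of B
cols <- combn(ncol(B), 3, simplify = FALSE)

# One list entry per row of A, each holding one matrix per column triple
result <- lapply(seq_len(nrow(A)), function(i) {
  lapply(cols, function(y) B[, rep(y, times = A[i, ])])
})
```

result[[i]][[k]] is then the 2 x 8 matrix built from row i of A and the k-th column triple of B.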

Reading Every Other Column in CSV into alternating matrix in R

I need to read in a CSV file with no headers and with an unknown number of columns and rows. However, the columns alternate: every other column belongs in one matrix while the columns in between belong in a different matrix. Example
CSV input:
1,2,3,4
1,2,3,4
1,2,3,4
1,2,3,4
Desired result would be equivalent to:
matrix1 <- matrix(c( 1, 3,
1, 3,
1, 3,
1, 3), NumberOfRows, NumberOfColumns, byrow=T);
and
matrix2 <- matrix(c( 2, 4,
2, 4,
2, 4,
2, 4), NumberOfRows, NumberOfColumns, byrow=T);
I have tried something like this (but this seems overly complex and doesn't work anyways). Isn't there a simple way to do this in R?
mydata<- read.csv("~/Desktop/file.csv", header=FALSE, nrows=4000);
columnCount<-ncol(mydata);
rowCount<-nrow(mydata);
evenColumns <- matrix(); oddColumns <-matrix();
for (i in 1:columnCount) {
if (i %% 2) {
for (l in 1:rowCount){
col <- 1;
evenColumns[col, l] <-mydata[i,l];
col<-col+1;
}
}
else {
for (l in 1:rowCount){
col <-1;
oddColumns[col, l] <-mydata[i,l];
col<-col+1;
}
}
}
How should this be done properly in R?
You can get the column numbers with seq:
full = read.csv("mat.csv", header=FALSE)
odds = as.matrix(full[, seq(1, ncol(full), by=2)])
evens = as.matrix(full[, seq(2, ncol(full), by=2)])
Output:
> odds
     V1 V3
[1,]  1  3
[2,]  1  3
[3,]  1  3
[4,]  1  3
> evens
     V2 V4
[1,]  2  4
[2,]  2  4
[3,]  2  4
[4,]  2  4
This is similar to a related problem; select the columns by the parity of their index:
mat.even <- mydata[,which(1:ncol(mydata) %% 2 == 0)]
mat.odd <- mydata[,which(1:ncol(mydata) %% 2 == 1)]
Every other starting with the first:
> cdat[ , c(TRUE,FALSE)]
  V1 V3
1  1  3
2  1  3
3  1  3
4  1  3
Every other starting with the second:
> cdat[ , !c(TRUE,FALSE)]
  V2 V4
1  2  4
2  2  4
3  2  4
4  2  4
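The logical-index trick works because R recycles a short logical vector across all columns. A minimal sketch, assuming the CSV content from the question is supplied inline through read.csv(text = ...) instead of a file, and converting to matrices as the question asks (odd_mat and even_mat are illustrative names):

```r
# Stand-in for the CSV file from the question
cdat <- read.csv(text = "1,2,3,4\n1,2,3,4\n1,2,3,4\n1,2,3,4", header = FALSE)

# c(TRUE, FALSE) is recycled to TRUE, FALSE, TRUE, FALSE, ... over the columns;
# drop = FALSE keeps a data frame even when only one column is selected
odd_mat  <- as.matrix(cdat[, c(TRUE, FALSE), drop = FALSE])
even_mat <- as.matrix(cdat[, c(FALSE, TRUE), drop = FALSE])
```

This needs no knowledge of the number of columns, which suits a file of unknown width.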

counting zeros in columns in data frame in R and express as percentage

I want to count the number of zeros in each column of an R data frame and express it as a percentage. This percentage should be added as the last row of the original data frame.
example
x <- c(0, 4, 6, 0, 10)
y <- c(3, 0, 9, 12, 15)
z <- c(3, 6, 9, 0, 15)
data_a <- cbind(x,y,z)
I want to count the zeros in each column and express the count as a percentage.
Thanks
x <- c(0, 4, 6, 0, 10)
y <- c(3, 0, 9, 12, 15)
z <- c(3, 6, 9, 0, 15)
data_a <- cbind(x,y,z)
#This is a matrix not a data.frame.
res <- colSums(data_a==0)/nrow(data_a)*100
If you must, rbind to the matrix (usually not really a good idea).
rbind(data_a, res)
#      x  y  z
#      0  3  3
#      4  0  6
#      6  9  9
#      0 12  0
#     10 15 15
# res 40 20 20
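Here is one more method using lapply. It works only for a data frame (on a matrix, lapply would iterate over the individual elements), so the matrix is converted first; multiplying by 100 gives a percentage:
lapply(as.data.frame(data_a), function(x){ length(which(x==0))/length(x)*100})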
A combination of prop.table and some *apply work can give you the same answer as @Roland's:
> prop <- apply(data_a, 2, function(x) prop.table(table(x))*100)
> rbind(data_a, sapply(prop, "[", 1))
      x  y  z
[1,]  0  3  3
[2,]  4  0  6
[3,]  6  9  9
[4,]  0 12  0
[5,] 10 15 15
[6,] 40 20 20
This is probably inelegant, but this is how I went about it when my columns had NAs:
# Number of zeros in each column (NAs ignored)
numZero <- colSums(vars == 0, na.rm = TRUE)
# Number of NA entries in each column
numNA <- colSums(is.na(vars))
# Total number of rows, repeated for each column
numSamp <- rep(nrow(vars), ncol(vars))
# Combine the three
varCheck <- as.data.frame(cbind(numZero, numNA, numSamp))
# Number of non-NA observations for each variable
varCheck$numTotal <- varCheck$numSamp - varCheck$numNA
# Proportion of zeros among the non-NA observations (multiply by 100 for a percentage)
varCheck$pctZero <- varCheck$numZero / varCheck$numTotal
# Show the variables that are more than 99% zeros
varCheck[which(varCheck$pctZero > 0.99),]
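A shorter route to the same percentage, sketched under the assumption that vars is a numeric data frame (the sample data below is made up): the mean of a logical vector is the share of TRUE values, so colMeans on vars == 0 with na.rm = TRUE gives the share of zeros among the non-NA entries directly.

```r
# Made-up data with one NA, standing in for the real vars
vars <- data.frame(x = c(0, 4, NA, 0, 10),
                   y = c(3, 0, 9, 12, 15))

# Share of zeros among the non-NA entries of each column, as a percentage
pct_zero <- colMeans(vars == 0, na.rm = TRUE) * 100
```

For x this counts 2 zeros out of 4 non-NA values, and for y 1 zero out of 5.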

Matrix elements manipulation

b = c(1,1,2,2,3,3,4,4,1)
c = c(10,10,20,20,30,30,40,40,5)
a <- NULL
a <- matrix(c(b,c), ncol=2)
I want to compare consecutive numbers in the first column of this matrix. If a number equals the one that follows it (here 1 = 1, and so on), I want to add the corresponding numbers in the second column together (10 + 10 = 20, and so on), so that each pair yields a single value, and store these sums in a separate vector.
The output from the matrix I am looking for is as follows:
     [,1] [,2] [,3]
[1,]    1   10   20
[2,]    1   10   40
[3,]    2   20   62
[4,]    2   20   85
[5,]    3   30    5
[6,]    3   32
[7,]    4   40
[8,]    4   45
[9,]    1    5
I am quite new to R and struggling with this. Thank you in advance!
This sounds like a job for rle and tapply:
b = c(1,1,2,2,3,3,4,4,1)
c = c(10,10,20,20,30,30,40,40,5)
a <- NULL
a <- matrix(c(b,c), ncol=2)
A <- rle(a[, 1])$lengths
tapply(a[, 2], rep(seq_along(A), A), sum)
#  1  2  3  4  5
# 20 40 60 80  5
Explanation:
rle identifies the run-lengths of the items in the first column of matrix "a".
We create a grouping variable for tapply from the run-lengths using rep(seq_along(A), A).
We put those two things together in tapply to get the sums you want.
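To make the explanation concrete, here is a small sketch that prints the intermediate pieces (the names v, runs and grp are illustrative; v is used instead of c to avoid masking the base function):

```r
b <- c(1, 1, 2, 2, 3, 3, 4, 4, 1)
v <- c(10, 10, 20, 20, 30, 30, 40, 40, 5)

runs <- rle(b)
runs$lengths   # 2 2 2 2 1
runs$values    # 1 2 3 4 1

# One group id per run, repeated for the length of that run
grp <- rep(seq_along(runs$lengths), runs$lengths)   # 1 1 2 2 3 3 4 4 5

# Sum the second vector within each run
sums <- tapply(v, grp, sum)
```

Grouping by run rather than by value matters here: the trailing 1 gets its own group (id 5) instead of being merged with the two 1s at the start, which is what tapply(v, b, sum) would do.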
Is this what you want? I bet there are clean base solutions, but I gave it a try with rollsum from the zoo package:
library(zoo)
mm <- cbind(c(1, 1, 2, 2, 3, 3, 4, 4, 1), c(10, 10, 20, 20, 30, 30, 40, 40, 5))
# calculate all lagged sums of column 2
sums <- rollsum(x = mm[ , 2], k = 2)
# calculate differences between consecutive numbers in column 1
diffs <- diff(mm[ , 1])
# select sums where diff is 0, i.e. where the two consecutive numbers in column 1 are equal.
sums2 <- sums[diffs == 0]
sums2
# [1] 20 40 60 80
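For comparison, a base R sketch of the same pairwise idea without zoo, assuming a window of exactly two rows (pair_sums and same are illustrative names): the sum of each element with its successor is just the vector without its last element plus the vector without its first.

```r
mm <- cbind(c(1, 1, 2, 2, 3, 3, 4, 4, 1), c(10, 10, 20, 20, 30, 30, 40, 40, 5))

# Sum of each value in column 2 with the value that follows it
pair_sums <- mm[-nrow(mm), 2] + mm[-1, 2]

# TRUE where two consecutive values in column 1 are equal
same <- diff(mm[, 1]) == 0

sums2 <- pair_sums[same]
```

As with the rollsum version, the final unpaired row (value 5) does not appear in the result, unlike in the rle/tapply answer.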
