R, incidence matrix, remove named columns based on their column sums

In an incidence matrix with named columns, I want to remove columns that contain only ones.
For instance, in
  a b c
1 0 1 1
1 1 0 1
column c should be removed. I am thinking of something like this:
# get column names of 1-cols
useless <- colnames(featureMatrix)[?]
# remove columns
featureMatrix <- featureMatrix[, !colnames(featureMatrix) %in% useless]
What is missing is the condition based on the column sum.

m <- matrix(c(0,1,1,0,1,1),2)
rownames(m) <- c(1,1)
colnames(m) <- c("a","b","c")
m[,colMeans(m)!=1]
# a b
# 1 0 1
# 1 1 0
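Since the condition asked about in the question is based on the column sum, an equivalent check uses colSums: a column that contains only ones sums to nrow(m). This also gives the name-based removal sketched in the question:
useless <- colnames(m)[colSums(m) == nrow(m)]
useless
# [1] "c"
m[, !colnames(m) %in% useless]
#   a b
# 1 0 1
# 1 1 0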


R: Finding values of one vector in another and corresponding values

If I have a data.frame (df1) as follows:
Name Count
a 1
b 2
c 3
and another data.frame (df2) like:
Name Count
aa 0
ba 0
ca 0
b 0
a 0
c 0
I want to get the values corresponding to df1 in df2, where the names match. I am currently trying:
idx = which(df2$Name %in% df1$Name)
df2[idx,2] = df1$Count
This seems to swap or permute some of the stored counts. What would be a method that preserves the order of the original data.frame?
You can get the indices for replacement using match:
df2[match(df1$Name, df2$Name),]$Count <- df1$Count
As to why your solution doesn't work, compare the output of:
which(df2$Name %in% df1$Name)
[1] 4 5 6
and
match(df1$Name, df2$Name)
[1] 5 4 6
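For a reproducible sketch, the two data frames from the question can be set up like this (assuming Name is character and Count is numeric):
df1 <- data.frame(Name = c("a", "b", "c"), Count = 1:3, stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("aa", "ba", "ca", "b", "a", "c"), Count = 0, stringsAsFactors = FALSE)
df2[match(df1$Name, df2$Name), "Count"] <- df1$Count
df2
#   Name Count
# 1   aa     0
# 2   ba     0
# 3   ca     0
# 4    b     2
# 5    a     1
# 6    c     3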

Replace value per row with value in first column

My question is very simple. I have a data frame with various numbers in each row and more than 100 columns. The first column is always a non-zero number. What I want to do is replace each non-zero number in each row (excluding the first column) with the first number in the row (the value of the first column).
I would think along the lines of an ifelse and a for loop that iterates through rows, but there must be a simpler vectorised way to do it...
Another approach is to use sapply, which is more efficient than looping. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:
First, find the row indices of the column that are non-zero using which.
Set these rows in x to the corresponding values in the first column of df.
Return the column.
Note that the operations in the function are all "vectorized" over the column. That is, no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
See this for an excellent review of the *apply family of functions.
Hope this helps.
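To illustrate the call, here is the same one-liner on a small made-up data frame (the df below is my own example, not data from the question):
df <- data.frame(val = 1:3, a = c(1, 0, 2), b = c(0, 5, 0), c = c(7, 0, 1))
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x != 0); x[ind] <- df[ind, 1]; x})
df
#   val a b c
# 1   1 1 0 1
# 2   2 0 2 0
# 3   3 3 0 3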
Since your data is not that big, I suggest you use a simple loop:
for (i in 1:nrow(mydata)) {
  for (j in 2:ncol(mydata)) {
    mydata[i,j] <- ifelse(mydata[i,j]==0, 0, mydata[i,1])
  }
}
Suppose your data frame is dat; I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 1 0 0 1
#3 3 0 1 0 1 0
#4 4 1 1 1 1 1
#5 5 1 1 0 0 0
My code above gives:
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 2 0 0 2
#3 3 0 3 0 3 0
#4 4 4 4 4 4 4
#5 5 5 5 0 0 0
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
"colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
# user system elapsed
# 0.352 0.056 0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
# user system elapsed
# 7.804 0.108 7.919
My solution is 20 times faster!
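As an aside (my own sketch, not part of either answer above): on purely numeric data the same replacement can be written as a multiplication by a logical mask, relying on column-wise recycling of the first column:
## starting again from the small set.seed(0) example dat above
mat <- as.matrix(dat[, -1])
dat[, -1] <- dat[[1]] * (mat != 0)  # nonzero cells pick up the row's first value, zeros stay zero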

R - setting a value based on a function applied to other values in the same row

I have a dataframe containing (surprise) data. I have one column which I wish to populate on a per-row basis, calculated from the values of other columns in the same row.
From googling, it seems like I need apply, or one of its close relatives. Unfortunately I haven't managed to make it actually work.
Example code:
#Example function
getCode <- function(ar1, ar2, ar3){
  if(ar1==1 && ar2==1 && ar3==1){
    return(1)
  } else if(ar1==0 && ar2==0 && ar3==0){
    return(0)
  }
  return(2)
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
#Add column for new data
df[,"x"] <- 0
#Apply function to new column
df[,"x"] <- apply(df[,"x"], 1, getCode(df[,"a"], df[,"b"], df[,"c"]))
I would like df to be taken from:
a b c x
1 1 1 1 0
2 1 0 1 0
3 0 0 0 0
to
a b c x
1 1 1 1 1
2 1 0 1 2
3 0 0 0 0
Unfortunately running this spits out:
Error in match.fun(FUN) : 'getCode(df[, "a"], df[, "b"], df[,
"c"])' is not a function, character or symbol
I'm new to R, so apologies if the answer is blindingly simple. Thanks.
A few things: apply would be along the dataframe itself (i.e. apply(df, 1, someFunc)), and it's more idiomatic to access columns by name using the $ operator, so if I have a dataframe named df with a column named a, access a with df$a.
In this case, I like to do an sapply along the index of the dataframe, and then use that index to get the appropriate elements from the dataframe.
df$x <- sapply(1:nrow(df), function(i) getCode(df$a[i], df$b[i], df$c[i]))
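A possible alternative (my suggestion, not from the original answers): since getCode takes three scalar arguments, mapply can walk the three columns in parallel without indexing by row:
df$x <- mapply(getCode, df$a, df$b, df$c)
df
#   a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0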
As @devmacrile mentioned above, I would just modify the function to accept a vector of 3 elements as input and use it within an apply command, as you mentioned.
#Example function
getCode <- function(x){
  ifelse(x[1]==1 & x[2]==1 & x[3]==1,
         1,
         ifelse(x[1]==0 & x[2]==0 & x[3]==0,
                0,
                2))
}
#Create data frame
a = c(1,1,0)
b = c(1,0,0)
c = c(1,1,0)
df <- data.frame(a,b,c)
df
# a b c
# 1 1 1 1
# 2 1 0 1
# 3 0 0 0
# create your new column of results
df$x = apply(df, 1, getCode)
df
# a b c x
# 1 1 1 1 1
# 2 1 0 1 2
# 3 0 0 0 0

Filling in missing rows/columns in distance matrices in R

I have two distance matrices, but either of them can have items missing, and the items can be out of order. For example:
matrix #1 (missing item c)
a b d
a 0 2 3
b 2 0 4
d 3 4 0
matrix #2 (missing item b, and items out of order)
d c a
d 0 1 2
c 1 0 1
a 2 1 0
I want to find the difference between the matrices, while assuming that any missing items are 0. So, my resulting matrix should be:
a b c d
a 0 2 1 1
b 2 0 0 4
c 1 0 0 1
d 1 4 1 0
What's the best way to go about this? Should I be sorting both matrices and then filling in missing columns/rows so that I can then just abs(m1-m2), or is there a way to use row/column headings to have them automatically "match up" when subtracting?
These matrices are 5000x5000 or so, and I'll have about a 1000 to do pairwise comparison on, so I'd rather take a hit on preprocessing the data if that will make each computation significantly faster.
Any hints or suggestions are welcome. I'm usually a non-R programmer, so an iterative solution of the kind I would normally come up with would take forever. I'm hoping for the "R way" of doing things, which should be significantly faster.
We create a names index ('Un1'), which is the union of the column names of the first ('m1') and second ('m2') matrices. Two new 0 matrices ('m1N', 'm2N') are created by specifying the dimensions and dimnames based on 'Un1'. By row/column indexing, we change the 0 values in these matrices to the values in 'm1' and 'm2', subtract, and take the absolute value.
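For reference, the example matrices from the question can be reconstructed like this, so that the code below runs as-is:
m1 <- matrix(c(0,2,3, 2,0,4, 3,4,0), nrow = 3,
             dimnames = list(c("a","b","d"), c("a","b","d")))
m2 <- matrix(c(0,1,2, 1,0,1, 2,1,0), nrow = 3,
             dimnames = list(c("d","c","a"), c("d","c","a")))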
Un1 <- sort(union(colnames(m1), colnames(m2)))
m1N <- matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1))
m2N <- m1N
m1N[rownames(m1), colnames(m1)] <- m1
m2N[rownames(m2), colnames(m2)] <- m2
abs(m1N-m2N)
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0
Update
If we have several matrices with object names m followed by numbers, we can place them in a list. We get the object names using ls and the values in a list with mget. Loop through the list with lapply to get the column names, use union as the function in Reduce, and sort to get the unique elements.
lst <- mget(ls(pattern='m\\d+')) #change the pattern accordingly
Un1 <- sort(Reduce(union, lapply(lst, colnames)))
We can create another list with matrix of 0s.
lst1 <- lapply(seq_along(lst), function(i)
matrix(0, ncol=length(Un1), nrow=length(Un1), dimnames=list(Un1, Un1)))
We can change the corresponding elements of 'lst1' using the row/column index of corresponding matrices of 'lst' using Map.
lst2 <- Map(function(x,y) {x[rownames(y), colnames(y)] <- y; x}, lst1, lst)
If we need pairwise difference, combn may be an option
lst3 <- combn(seq_along(lst2),2, FUN=function(x)
list(abs(lst2[[x[1]]]-lst2[[x[2]]])))
names(lst3) <- combn(seq_along(lst2), 2, FUN=paste, collapse='_')
Another approach using match (beginning is similar to #akrun):
func = function(cols, m)
{
  res = `dimnames<-`(m[match(cols, rownames(m)), match(cols, colnames(m))],
                     list(cols, cols))
  ifelse(is.na(res), 0, res)
}
cols = sort(union(colnames(m1), colnames(m2)))
abs(func(cols,m1) - func(cols,m2))
# a b c d
#a 0 2 1 1
#b 2 0 0 4
#c 1 0 0 1
#d 1 4 1 0

R: count number of matches in matrix row

I have a matrix mat.
mat<-matrix(
c('a','a','b','a','b','b'),
nrow=3, ncol=2)
I want to make a vector of the count matches in each row of the matrix. For example, let's say I wanted to count the number of matches of the letter a in each row. The first row of the matrix has an a,a: two matches of a. The second row of the matrix has an a,b: one match of a.
I can count the number of matches of the character a in a row with this line of code:
sum(!is.na(charmatch(mat[1,c(1,2)],"a"))) # first row, returns 2
sum(!is.na(charmatch(mat[2,c(1,2)],"a"))) # second row, returns 1
I want to vectorize this counting procedure. In other words, I want to do something like this
as.vector(rowsum(!is.na(charmatch(mat[,c(1,2)], "a"))))
So that it returns a vector like 2, 1, 0, meaning 2 matches of a in row 1 of the matrix, 1 match in row 2, and 0 matches in row 3.
You can just do
rowSums(mat=='a', na.rm=TRUE)
#[1] 2 1 0
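To see why this works: mat == 'a' returns a logical matrix, and rowSums counts each TRUE as 1:
mat == 'a'
#       [,1]  [,2]
# [1,]  TRUE  TRUE
# [2,]  TRUE FALSE
# [3,] FALSE FALSE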
For all unique values
Un <- sort(unique(c(mat)))
res <- sapply(Map(`==`, list(mat), Un), rowSums, na.rm=TRUE)
colnames(res) <- Un
res
# a b
#[1,] 2 0
#[2,] 1 1
#[3,] 0 2
Or, as contributed by @Ananda Mahto, a faster approach would be
lvl <- sort(unique(c(mat)))
vapply(lvl, function(x) rowSums(mat == x, na.rm = TRUE), numeric(nrow(mat)))
If you wanted to do this for all values, you can try one of the following:
table with factor in apply
levs <- unique(c(mat))
t(apply(mat, 1, function(x) table(factor(x, levs))))
# a b
# [1,] 2 0
# [2,] 1 1
# [3,] 0 2
melt and dcast with fun.aggregate = length from "reshape2"
library(reshape2)
dcast(melt(mat), Var1 ~ value, value.var = "Var2")
# Aggregation function missing: defaulting to length
# Var1 a b
# 1 1 2 0
# 2 2 1 1
# 3 3 0 2
Better yet would just be table after manually creating the values to tabulate:
table(rep(sequence(nrow(mat)), ncol(mat)), c(mat))
#
# a b
# 1 2 0
# 2 1 1
# 3 0 2
