R: count number of matches in matrix row

I have a matrix mat.
mat<-matrix(
c('a','a','b','a','b','b'),
nrow=3, ncol=2)
I want to make a vector holding the number of matches in each row of the matrix. For example, let's say I wanted to count the number of matches of the letter a in each row. The first row of the matrix has a,a: two matches of a. The second row of the matrix has a,b: one match of a.
I can count the number of matches of the character a in a row with this line of code:
sum(!is.na(charmatch(mat[1,c(1,2)],"a"))) # first row, returns 2
sum(!is.na(charmatch(mat[2,c(1,2)],"a"))) # second row, returns 1
I want to vectorize this counting procedure. In other words, I want to do something like this
as.vector(rowsum(!is.na(charmatch(mat[,c(1,2)], "a"))))
So that it returns a vector like this 2,1,0 which means 2 matches of a in row 1 of the matrix, 1 match of a in row 2 of the matrix, 0 matches of a in row 3 of the matrix.

You can just do
rowSums(mat=='a', na.rm=TRUE)
#[1] 2 1 0
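To see why this works, here is a minimal sketch that splits the one-liner into its two steps. The names is_a and counts are only for illustration; the matrix is the one from the question.

```r
# Matrix from the question (filled column by column)
mat <- matrix(c('a', 'a', 'b', 'a', 'b', 'b'), nrow = 3, ncol = 2)

# Step 1: element-wise comparison gives a logical matrix of the same shape
is_a <- mat == 'a'

# Step 2: TRUE counts as 1 and FALSE as 0, so row sums are match counts
counts <- rowSums(is_a, na.rm = TRUE)
counts
```

The na.rm = TRUE argument only matters if the matrix contains NA; a comparison with NA yields NA, which would otherwise turn the whole row sum into NA.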
For all unique values
Un <- sort(unique(c(mat)))
res <- sapply(Map(`==`, list(mat), Un), rowSums, na.rm=TRUE)
colnames(res) <- Un
res
# a b
#[1,] 2 0
#[2,] 1 1
#[3,] 0 2
Or, as contributed by @Ananda Mahto, a faster approach would be
lvl <- sort(unique(c(mat)))
vapply(lvl, function(x) rowSums(mat == x, na.rm = TRUE), numeric(nrow(mat)))

If you wanted to do this for all values, you can try one of the following:
table with factor in apply
levs <- unique(c(mat))
t(apply(mat, 1, function(x) table(factor(x, levs))))
# a b
# [1,] 2 0
# [2,] 1 1
# [3,] 0 2
melt and dcast with fun.aggregate = length from "reshape2"
library(reshape2)
dcast(melt(mat), Var1 ~ value, value.var = "Var2")
# Aggregation function missing: defaulting to length
# Var1 a b
# 1 1 2 0
# 2 2 1 1
# 3 3 0 2
Better yet would just be table after manually creating the values to tabulate:
table(rep(sequence(nrow(mat)), ncol(mat)), c(mat))
#
# a b
# 1 2 0
# 2 1 1
# 3 0 2
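As a small variation on the same idea, the row index for every cell can probably be taken from row() instead of being built with rep(). This is only a sketch; the dimension labels row and value are names chosen here for readability.

```r
# Matrix from the question
mat <- matrix(c('a', 'a', 'b', 'a', 'b', 'b'), nrow = 3, ncol = 2)

# row(mat) has the same shape as mat and holds each cell's row number;
# c() flattens both in column-major order so the two vectors line up
counts <- table(row = c(row(mat)), value = c(mat))
counts

# Counts for a single value can then be pulled out by name
counts[, "a"]
```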

Related

padding sequence in R

I would like to create left-padded columns, with the number of columns given by the maximum number of elements in any row of a column, in an efficient way.
Let me detail the input and output
My input data is this:
input <- data.frame(path=c("2","4,3,4","3,1"))
input
path
1 2
2 4,3,4
3 3,1
expected output:
V1 V2 V3
1 0 0 2
2 4 3 4
3 0 3 1
The logic, based on the input, is:
1 - Find the row with the most elements (in this example it is row 2, which holds 4,3,4, so the maximum is 3). That tells us the output needs 3 columns.
2 - For any row with fewer than 3 elements, pad on the left with 0.
The first row has only 1 element, so it needs two 0 columns in front.
The second row has three elements, so it fills all columns directly. The last row has 2 elements, so it needs one 0 in front.
My attempt, which works but is messy:
input$path <- as.character(input$path)
lst <- strsplit(input$path, ",")
column_to_create <- max(lengths(lst))
output <- list()
i <- 1
for(i in 1:length(lst)){
if (length(lst[[i]]) < column_to_create) {
nb_create <- column_to_create - length(lst[[i]])
output[[i]] <- c(rep(0,nb_create),lst[[i]])
}
else{
output[[i]] <- lst[[i]]
}
}
output <- lapply(output,as.numeric)
do.call(rbind,output)
[,1] [,2] [,3]
[1,] 0 0 2
[2,] 4 3 4
[3,] 0 3 1
Here's a tidyverse solution
library(tidyverse)
input %>%
separate(path, into=c("V1","V2","V3"), ",", fill="left") %>%
replace(is.na(.), 0)
This separates path by "," into three columns, fills from the left when a row has too few pieces, and replaces the resulting NA values with 0.
Output
V1 V2 V3
1 0 0 2
2 4 3 4
3 0 3 1
----------------------------------------------------------------------------------------
Generalizing for unknown number of columns
Iterate through each row, determine length of vector after strsplit, and save the max as num.cols. paste new column names
num.cols <- max(sapply(1:nrow(input), function(x) length(unlist(strsplit(as.character(input$path[x]), ",")))))
new.cols <- paste0("V", 1:num.cols)
Now you can use new.cols to define your column names
input %>%
separate(path, into=new.cols, ",", fill="left") %>%
replace(is.na(.), 0)
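For comparison, the same left padding can be sketched in base R without naming the columns up front. This is an illustrative alternative, not part of the answer above; pieces, width and padded are names made up for the example.

```r
# Input from the question, kept as character rather than factor
input <- data.frame(path = c("2", "4,3,4", "3,1"), stringsAsFactors = FALSE)

# Split each row into its pieces and find the widest row
pieces <- strsplit(input$path, ",")
width <- max(lengths(pieces))

# Left-pad every row with zeros up to that width, then stack the rows
padded <- do.call(rbind, lapply(pieces, function(p) {
  c(rep(0, width - length(p)), as.numeric(p))
}))
colnames(padded) <- paste0("V", seq_len(width))
padded
```

The result is a numeric matrix rather than a data frame; wrapping it in as.data.frame() would give columns V1, V2, V3.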

Replace value per row with value in first column

My question is very simple. I have a data frame with various numbers in each row, more than 100 columns. First column is always a non zero number. What I want to do is replace each nonzero number in each row (excluding the first column) with the first number in the row (the value of the first column)
I would think in the lines of an ifelse and a for loop that iterates through rows but there must be a simpler vectorised way to do it...
Another approach is to use sapply, which is more efficient than looping. Assuming your data is in a data frame df:
df[,-1] <- sapply(df[,-1], function(x) {ind <- which(x!=0); x[ind] = df[ind,1]; return(x)})
Here, we are applying the function over each and all columns of df except for the first column. In the function, x is each of these columns in turn:
First, find the row indices of the column that are non-zero using which.
Set these rows in x to the corresponding values in the rows of the first column of df.
Return the column.
Note that the operations in the function are all "vectorized" over the column. That is, no looping over the rows of the column. The result from sapply is a matrix of the processed columns, which replaces all columns of df that are not the first column.
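The steps inside the function can be traced on a single column. The sketch below uses a small made-up data frame (the values are assumptions for illustration only) and the same names x and ind as the answer.

```r
# Toy data: first column is the value to copy, the rest hold zeros and non-zeros
df <- data.frame(val = c(10, 20, 30), a = c(1, 0, 5), b = c(0, 7, 0))

# Process one column by hand, as the function does for each column in turn
x <- df$a
ind <- which(x != 0)   # rows of this column that are non-zero
x[ind] <- df[ind, 1]   # overwrite them with the first-column value of the same row
x
```

Here ind is c(1, 3), so rows 1 and 3 of column a are replaced and the zero in row 2 is left alone.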
See this for an excellent review of the *apply family of functions.
Hope this helps.
Since your data is not that big, I suggest you use a simple loop
for (i in 1:nrow(mydata))
{
for (j in 2:ncol(mydata))
{
mydata[i,j]<- ifelse(mydata[i,j]==0 ,0 ,mydata[i,1])
}
}
Suppose your data frame is dat, I have a fully-vectorized solution for you:
mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
new_dat <- "colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))
Example
set.seed(0)
dat <- "colnames<-"(cbind.data.frame(1:5, matrix(sample(0:1, 25, TRUE), 5)),
c("val", letters[1:5]))
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 1 0 0 1
#3 3 0 1 0 1 0
#4 4 1 1 1 1 1
#5 5 1 1 0 0 0
My code above gives:
# val a b c d e
#1 1 1 0 0 1 1
#2 2 0 2 0 0 2
#3 3 0 3 0 3 0
#4 4 4 4 4 4 4
#5 5 5 5 0 0 0
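The replacement works because which() on a matrix returns column-major positions, and repeating the first column once per matrix column lines up with those positions. An equivalent lookup, sketched here on a small made-up data frame, takes the row number of each position from row(mat) instead. This is a variation for illustration, not the code that was benchmarked.

```r
# Toy data (values are assumptions chosen so the result is easy to check)
dat <- data.frame(val = c(10, 20, 30), a = c(1, 0, 5), b = c(0, 7, 0))

mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)                # column-major positions of non-zero cells

# row(mat)[pos] is the row each position belongs to, so this picks
# the first-column value of that row for every non-zero cell
mat[pos] <- dat$val[row(mat)[pos]]

result <- cbind(dat[1], mat)
result
```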
You want a benchmark?
set.seed(0)
n <- 2000 ## use a 2000 * 2000 matrix
dat <- "colnames<-"(cbind.data.frame(1:n, matrix(sample(0:1, n * n, TRUE), n)),
c("val", paste0("x",1:n)))
## have to test my solution first, as aichao's solution overwrites `dat`
## my solution
system.time({mat <- as.matrix(dat[, -1])
pos <- which(mat != 0)
mat[pos] <- rep(dat[[1]], times = ncol(mat))[pos]
"colnames<-"(cbind.data.frame(dat[1], mat), colnames(dat))})
# user system elapsed
# 0.352 0.056 0.410
## solution by aichao
system.time(dat[,-1] <- sapply(dat[,-1], function(x) {ind <- which(x!=0); x[ind] = dat[ind,1]; x}))
# user system elapsed
# 7.804 0.108 7.919
My solution is roughly 20 times faster!

R Undo Dummy Variables

I have a data set where a bunch of categorical variables were converted to dummy variables (all classes used, NOT n-1) and some were not. I'm trying to recode them in a single column.
For instance
Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1
Is there a simple way to convert this to:
Q1 Q2 Q3
1 3 2
2 4 1
3 2 2
Right now I'm just using strsplit() (as all the dummied variable names contain '.') with a couple loops but feel like there should be a better way. Any suggestions?
I wrote a function a while back that did this sort of thing.
MultChoiceCondense<-function(vars,indata){
tempvar<-matrix(NaN,ncol=1,nrow=length(indata[,1]))
dat<-indata[,vars]
for (i in 1:length(vars)){
for (j in 1:length(indata[,1])){
if (dat[j,i]==1) tempvar[j]=i
}
}
return(tempvar)
}
If your data is called Dat, then:
Dat$Q1<-MultChoiceCondense(c("Q1.1","Q1.2","Q1.3"),Dat)
Here's an approach that uses melt from "reshape2" and cSplit from my "splitstackshape" package along with some "data.table" fun. I've loaded dplyr so that we can pipe all the things.
library(splitstackshape)
library(reshape2)
library(dplyr)
mydf %>%
as.data.table(keep.rownames = TRUE) %>% # Convert to data.table. Keep rownames
melt(id.vars = "rn", variable.name = "V") %>% # Melt the dataset by rownames
.[value > 0] %>% # Subset for all non-zero values
cSplit("V", ".") %>% # Split the "V" column (names) by "."
.[is.na(V_2), V_2 := value] %>% # Replace NA values with actual values
dcast.data.table(rn ~ V_1, value.var = "V_2") # Go wide.
# rn Q1 Q2 Q3
# 1: 1 1 3 2
# 2: 2 2 4 1
# 3: 3 3 2 2
Here's a possible base R approach:
## Which columns are binary?
Bins <- sapply(mydf, function(x) {
all(x %in% c(0, 1))
})
## Two vectors -- part after the dot and before
X <- gsub(".*\\.(.*)$", "\\1", names(mydf)[Bins])
Y <- unique(gsub("(.*)\\..*$", "\\1", names(mydf)[Bins]))
## Use `apply` to subset the X value based on the
## logical version of the binary variable
cbind(mydf[!Bins],
`colnames<-`(t(apply(mydf[Bins], 1, function(z) {
X[as.logical(z)]
})), Y))
# Q2 Q1 Q3
# 1 3 1 2
# 2 4 2 1
# 3 2 3 2
At the end, you can just reorder the columns as required. You may also need to convert them to numeric since in this case, Q1 and Q3 will be factors.
another base R approach
dat <- read.table(header = TRUE, text = "Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1")
## this will take all the unique questions; Q1, Q2, Q3; test if
## they are dummies; and return the column if so or find which
## dummy column is a 1 otherwise
res <- lapply(unique(gsub('\\..*', '', names(dat))), function(x) {
tmp <- dat[, grep(x, names(dat)), drop = FALSE]
if (ncol(tmp) == 1) unlist(tmp, use.names = FALSE) else max.col(tmp)
})
# [[1]]
# [1] 1 2 3
#
# [[2]]
# [1] 3 4 2
#
# [[3]]
# [1] 2 1 2
do.call('cbind', res)
# [,1] [,2] [,3]
# [1,] 1 3 2
# [2,] 2 4 1
# [3,] 3 2 2
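If a named data frame is wanted instead of an unnamed matrix, one possible sketch is below. It also matches columns on the exact text before the first dot rather than with grep, which is an assumption made here to avoid a prefix such as Q1 also picking up a column like Q10.1. The names questions, prefix and out are made up for the example.

```r
dat <- read.table(header = TRUE, text = "Q1.1 Q1.2 Q1.3 Q1.NA Q2 Q3.1 Q3.2
1 0 0 0 3 0 1
0 1 0 0 4 1 0
0 0 1 0 2 0 1")

# Question name for every column: everything before the first dot
prefix <- sub("\\..*", "", names(dat))
questions <- unique(prefix)

res <- lapply(questions, function(q) {
  tmp <- dat[, prefix == q, drop = FALSE]
  # A single column is already a plain variable; otherwise take the
  # position of the dummy column holding the 1 in each row
  if (ncol(tmp) == 1) tmp[[1]] else max.col(tmp, ties.method = "first")
})

out <- as.data.frame(setNames(res, questions))
out
```

Note that ties.method = "first" makes a row of all zeros resolve to 1 instead of a random column, so such rows would still need separate handling.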
I'm assuming your data looks like this, where the categorical columns are encoded using a dot at the end. You may also have a case where all of the values in a row are zero, which indicates a base level (such as how dummyVars in caret works with fullRank=FALSE). If so, here is a vectorized solution.
library(dplyr)
dummyVars.undo = function(df, col_prefix) {
if (!endsWith(col_prefix, '.')) {
# If col_prefix doesn't end with a period, include one, but save the
# "pretty name" as the one without a period
pretty_col_prefix = col_prefix
col_prefix = paste0(col_prefix, '.')
} else {
# Otherwise, strip the period for the pretty column name
pretty_col_prefix = substr(col_prefix, 1, nchar(col_prefix)-1)
}
# Get all columns with that encoding prefix
cols = names(df)[names(df) %>% startsWith(col_prefix)]
# Find the rows where all values are zero. If this isn't the case
# with your data there's no worry, it won't hurt anything.
base_level.idx = rowSums(df[cols]) == 0
# Set the column value to a base value of zero
df[base_level.idx, pretty_col_prefix] = 0
# Go through the remaining columns and find where the maximum value (1) occurs
df[!base_level.idx, pretty_col_prefix] = cols[apply(df[!base_level.idx, cols], 1, which.max)] %>%
strsplit('\\.') %>%
sapply(tail, 1)
# Drop the encoded columns
df[cols] = NULL
return(df)
}
Usage:
# Collapse Q1
df = dummyVars.undo(df, 'Q1')
# Collapse Q3
df = dummyVars.undo(df, 'Q3')
This uses dplyr, but only for the pipe operator %>%. You could certainly remove that if you'd prefer to do base R instead.

Deleting inverses in a matrix in R

I have initially a matrix, p:
# p is a matrix
p
A B
[1,] 1 1
[2,] 2 3
[3,] 3 2
[4,] 1 1
[5,] 8 2
For a given matrix, I want to iterate through the rows and remove any inversions (rows that repeat an earlier row with the two values swapped), as well as exact duplicates. So that the new matrix is:
p
A B
[1,] 1 1
[2,] 2 3
[3,] 8 2
This is what I got:
p<-unique(p) # gets rid of duplicates
output<-lapply(p, function(x){
check<-which(p$A[x,] %in% p$B[x,])#is the value in row x of column A found in
#column B if so return the row number it was found in column B
if (length(check)!=0 ){
if(p$A[check,]== p$B[x]){ # now check if at the found row (check)of p$A is equal to p$B[x]
p<-p[-check,] #if so remove that inverse
}
}
}
)
I get this message Error in which(p$A[x] %in% p$B[x]) :
Why am I getting this Error?
Is there a better way to find inversions?
Try
p <- unique(p)
p[!duplicated(apply(p, 1, function(x) paste(sort(x), collapse='_'))),]
# A B
#[1,] 1 1
#[2,] 2 3
#[3,] 8 2
data
p <- matrix(c(1,2,3,1,8, 1,3,2,1,2),
dimnames=list(NULL, c("A", "B")), ncol=2)
It's not clear whether the order of values is important in your final output, but perhaps you can make use of pmin and pmax.
Here's an approach using those functions within "data.table":
library(data.table)
unique(as.data.table(p)[, list(A = pmin(A, B), B = pmax(A, B))])
# A B
# 1: 1 1
# 2: 2 3
# 3: 2 8
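The same pmin/pmax idea can be sketched in base R when the original order of values within each kept row matters. The key matrix is used only to detect duplicates and the rows themselves are left untouched. The names key and kept are made up for this example.

```r
p <- matrix(c(1, 2, 3, 1, 8,  1, 3, 2, 1, 2),
            dimnames = list(NULL, c("A", "B")), ncol = 2)

# Order-independent key: smaller value first, larger value second
key <- cbind(pmin(p[, "A"], p[, "B"]), pmax(p[, "A"], p[, "B"]))

# duplicated() on a matrix compares whole rows, so this flags both
# exact repeats and swapped pairs; keep the first occurrence of each
kept <- p[!duplicated(key), , drop = FALSE]
kept
```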
The question is a bit unclear. I am assuming, based on your example, that you want to remove the row containing "3 2" because its first value also occurs in the second column (in a different row). In that case
check <- which(p[,1] %in% p[,2])
should return the rows that you want to delete. Your second round of checking is not needed. You could just delete the rows returned.

R, incident matrix, remove named columns based on their column sums

In an incident matrix with named columns, I want to remove columns with only ones in them.
For instance in
a b c
1 0 1 1
1 1 0 1
column c should be removed. I am thinking of something like this:
colnames(featureMatrix)[]
# get column names of 1-cols
useless <- colnames(matrix)[?]
# remove columns
matrix <- matrix[,!colnames(matrix) %in% useless ]
What is missing is the condition based on the column sum.
m <- matrix(c(0,1,1,0,1,1),2)
rownames(m) <- c(1,1)
colnames(m) <- c("a","b","c")
m[,colMeans(m)!=1]
# a b
# 1 0 1
# 1 1 0
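A slightly more defensive variant is sketched below. It assumes the matrix may contain values other than 0 and 1, in which case a column mean of 1 would not guarantee all ones, and it adds drop = FALSE so the result stays a matrix even if only one column survives. The names all_ones and kept are made up for the example.

```r
m <- matrix(c(0, 1, 1, 0, 1, 1), 2,
            dimnames = list(c("1", "1"), c("a", "b", "c")))

# A column is "all ones" when the count of cells equal to 1 is the row count
all_ones <- colSums(m == 1) == nrow(m)

# Keep the remaining columns; drop = FALSE preserves the matrix shape
kept <- m[, !all_ones, drop = FALSE]
kept
```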
