Count Number of Pairwise Differences of a Matrix in R - r

I have the following matrix:
0 1 0 0 0 1 0 0 # Row A
0 1 0 0 0 0 1 0 # Row B
0 1 0 0 0 0 0 0 # Row C
0 0 1 0 0 0 0 0 # Row D
I want to make a new matrix that shows the pairwise difference between each pair of rows (e.g., between rows A and B, there are 2 columns that are different, so the entry in the matrix corresponding to A and B is 2). Like this:
A B C D
A - 2 1 3
B - - 1 3
C - - - 2
D - - - -
The matrix isn't absolutely necessary. It's just an intermediary step for what I really want to do: count the number of pairwise differences between each row in the original matrix like so...
(2+1+3+1+3+2) = 12

You could try combn
v1 <- combn(1:nrow(m1), 2, FUN=function(x) sum(m1[x[1],]!= m1[x[2],]))
v1
#[1] 2 1 3 1 3 2
sum(v1)
#[1] 12
If you need a matrix output
m2 <- outer(1:nrow(m1), 1:nrow(m1), FUN=Vectorize(function(x,y)
sum(m1[x,]!=m1[y,])))
dimnames(m2) <- rep(list(LETTERS[1:4]),2)
m2[lower.tri(m2)] <- 0
m2
# A B C D
#A 0 2 1 3
#B 0 0 1 3
#C 0 0 0 2
#D 0 0 0 0
data
m1 <- structure(c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L), .Dim = c(4L, 8L))
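For a 0/1 matrix like m1, the number of differing columns between two rows equals their Manhattan distance, so base R's dist() may be a shorter route. This is only a sketch and it assumes the matrix holds nothing but 0s and 1s; m1 is rebuilt here from the rows in the question so the block runs on its own.

```r
# Rows A-D from the question (assumption: values are 0/1 only)
m1 <- matrix(c(0, 1, 0, 0, 0, 1, 0, 0,
               0, 1, 0, 0, 0, 0, 1, 0,
               0, 1, 0, 0, 0, 0, 0, 0,
               0, 0, 1, 0, 0, 0, 0, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(LETTERS[1:4], NULL))

# For binary data, |a - b| summed over the columns counts the mismatches
d <- dist(m1, method = "manhattan")
pair_counts <- as.vector(d)  # pairs in the order A-B, A-C, A-D, B-C, B-D, C-D
total <- sum(d)
total
#[1] 12
```

If the full symmetric matrix is wanted, as.matrix(d) should return it with the row labels as dimnames.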

I think that this function could help you to count the differences
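count.diff <- function(mat) {
Nrow <- nrow(mat)
count <- 0
# Compare row i with every later row; drop = FALSE keeps a single remaining row as a matrix,
# and seq_len() avoids looping over 1:0 when there is only one row
for (i in seq_len(Nrow-1)) count <- count + sum(t(t(mat[-(1:i), , drop=FALSE])!=mat[i,]))
count
}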
mat <- matrix(rbinom(n=24,size=1,prob=0.7), ncol=4)
mat
count.diff(mat)
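If only the total is needed, it may also be possible to skip the row pairs entirely and count column by column: within one column, the pairs of rows that differ are all pairs minus the pairs that share a value. The sketch below assumes there are no NA values; count_diff_by_col and brute_force are names made up for this illustration, and the second function exists only to check the first.

```r
# Column-wise count: choose(n, 2) pairs in total, minus pairs with equal values
count_diff_by_col <- function(mat) {
  n <- nrow(mat)
  per_col <- apply(mat, 2, function(col) {
    same <- sum(choose(as.vector(table(col)), 2))  # pairs sharing a value
    choose(n, 2) - same                            # pairs that differ
  })
  sum(per_col)
}

# Brute-force check over every pair of rows
brute_force <- function(mat) {
  sum(combn(nrow(mat), 2, FUN = function(p) sum(mat[p[1], ] != mat[p[2], ])))
}

set.seed(42)
mat <- matrix(rbinom(n = 24, size = 1, prob = 0.7), ncol = 4)
count_diff_by_col(mat) == brute_force(mat)
```

The column-wise version does work proportional to the number of cells rather than the number of row pairs, which could matter for matrices with many rows.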

Related

Converting one-hot encoded data to aggregate in dplyr

I have age columns like the ones below that are dummy encoded.
How can I aggregate the information so I can get counts in dplyr?
Input:
age_010 age_11-20 age_2130 age_3140 age_41-50 age_5160
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 1 0 0 0 0
0 0 0 0 0 1
Expected Output:
age n
age_010 0
age_11-20 2
age_2130 1
age_3140 1
age_41-50 0
age_5160 1
We may do the column wise sum
v1 <- colSums(df1)
data.frame(age = names(v1), n = unname(v1))
-output
age n
1 age_010 0
2 age_11.20 2
3 age_2130 1
4 age_3140 1
5 age_41.50 0
6 age_5160 1
If we want the tidyverse, do the sum across all the columns and then reshape to 'long' with pivot_longer
library(dplyr)
library(tidyr)
df1 %>%
summarise(across(everything(), sum)) %>%
pivot_longer(cols = everything(), names_to = 'age', values_to = 'n')
# A tibble: 6 × 2
age n
<chr> <int>
1 age_010 0
2 age_11.20 2
3 age_2130 1
4 age_3140 1
5 age_41.50 0
6 age_5160 1
data
df1 <- structure(list(age_010 = c(0L, 0L, 0L, 0L, 0L), age_11.20 = c(1L,
0L, 0L, 1L, 0L), age_2130 = c(0L, 1L, 0L, 0L, 0L), age_3140 = c(0L,
0L, 1L, 0L, 0L), age_41.50 = c(0L, 0L, 0L, 0L, 0L), age_5160 = c(0L,
0L, 0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L
))

Subsampling without replacement in R [duplicate]

This question already has an answer here:
Sample without replacement, or duplicates, in R
(1 answer)
Closed 1 year ago.
There's a very similar question to mine on stack, but that doesn't directly answer my question.
I have abundance data for 250 species across 1000 sites. Species are columns, sites are rows. My abundance data look something like the data in the linked post above.
0 0 3 0 0 201 0 0 0 82
0 23 5 0 0 0 0 0 0 0
9 0 0 0 0 12 0 0 0 913
0 7 91 0 8 0 0 92 9 0
131 12 0 410 0 0 0 3 0 0
If I wanted to sample 50 individuals from each site, without replacement, how can I do this?
Focusing on code for single sites for now.
This code:
samples <- sample(1:ncol(abundances), 50, rep=FALSE, prob=abundances[1,]) doesn't work unless I change to rep=TRUE. However, I need sampling WITHOUT replacement.
I don't want to use sample(abundances[1,], 50, rep=FALSE) because then instead of sampling individuals, it samples species and will report the whole value in that row (i.e. species 6 may occur 201 times at site 1, it'll report 201, rather than 1 individual from that species, resulting in >50 individuals in final subsample).
I essentially want an output identical to what user Dinre answered in post above, but without it being for bootstrapping. I just want to sample without replacement. This process will ultimately be integrated into a for loop for a subsample from each site.
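Here is a way to sample vector elements from each row so that the sum of each sampled row is approximately equal to a chosen integer, the size. In the code below, n <- 5 is passed as function argument size. The call to runif adds an element of randomness to the sampling function. Because the scaled values are rounded, a row total can differ from size (row 4 of the output below sums to 4, not 5).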
fun <- function(x, size){
x <- x*runif(length(x))
y <- size*x/sum(x)
round(y)
}
set.seed(2021)
n <- 5
t(apply(X, 1, fun, size = n))
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
#[1,] 0 0 0 0 0 3 0 0 0 2
#[2,] 0 4 1 0 0 0 0 0 0 0
#[3,] 0 0 0 0 0 0 0 0 0 5
#[4,] 0 0 3 0 0 0 0 1 0 0
#[5,] 0 0 0 5 0 0 0 0 0 0
Data
Here is the question's data in dput format.
X <-
structure(c(0L, 0L, 9L, 0L, 131L, 0L, 23L, 0L, 7L, 12L, 3L, 5L,
0L, 91L, 0L, 0L, 0L, 0L, 0L, 410L, 0L, 0L, 0L, 8L, 0L, 201L,
0L, 12L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 92L, 3L, 0L,
0L, 0L, 9L, 0L, 82L, 0L, 913L, 0L, 0L), .Dim = c(5L, 10L), .Dimnames = list(
NULL, c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8", "V9", "V10")))
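If individuals must be drawn strictly without replacement, one possible approach is to expand each site into one entry per individual, draw from that pool, and tabulate the draws by species. This is a sketch, not the method from the answer above: subsample_site is a hypothetical helper name, and the size of 20 is an assumption chosen because site 2 in the example data has only 28 individuals, so 50 cannot be drawn from it.

```r
# Example abundances from the question (rows = sites, columns = species)
X <- matrix(c(  0,  0,  3,   0, 0, 201, 0,  0, 0,  82,
                0, 23,  5,   0, 0,   0, 0,  0, 0,   0,
                9,  0,  0,   0, 0,  12, 0,  0, 0, 913,
                0,  7, 91,   0, 8,   0, 0, 92, 9,   0,
              131, 12,  0, 410, 0,   0, 0,  3, 0,   0),
            nrow = 5, byrow = TRUE)

# Hypothetical helper: draw `size` individuals from one site without replacement
subsample_site <- function(counts, size) {
  stopifnot(sum(counts) >= size)                       # cannot draw more than exist
  individuals <- rep(seq_along(counts), times = counts) # one entry per individual
  picked <- individuals[sample.int(length(individuals), size)]
  tabulate(picked, nbins = length(counts))             # counts per species
}

set.seed(2021)
sub <- t(apply(X, 1, subsample_site, size = 20))
rowSums(sub)
```

Every row of sub sums to exactly the requested size, and no species count can exceed its original abundance. Sites with fewer individuals than size would need to be skipped or handled separately.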

Delete the columns that contain only NA, 0, or both [duplicate]

This question already has an answer here:
Remove the columns with the colsums=0
(1 answer)
Closed 7 years ago.
What should I do if I want to remove the columns that contain only 0, only NA, or a mix of both?
mat <- structure(c(0L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L,
0L, 2L, 0L, 0L, NA, 0L, 0L, 0L, 1L, 0L, 0L, NA, 0L, NA, 0L, 0L,
0L, 0L, NA, 0L, 2L, 0L, 0L), .Dim = c(6L, 6L), .Dimnames = list(
c("A05363", "A05370", "A05380", "A05397", "A05400", "A05426"), c("X1.110590170", "X1.110888172", "X1.110906406", "X1.110993854", "X1.110996710", "X1.111144756")))
My output should be like this:
X1.110590170 X1.110906406 X1.110993854 X1.111144756
A05363 0 0 0 0
A05370 0 0 0 NA
A05380 1 2 0 0
A05397 0 0 1 2
A05400 2 0 0 0
A05426 0 NA 0 0
You can filter out the columns using the apply function along the columns.
You will simply have to use the all function to make sure that all values in the column satisfy the logic: is.na(x) | x == 0.
filter_cols <- apply(mat, 2, function(x) !all(is.na(x) | x == 0))
mat[,filter_cols]
#' X1.110590170 X1.110906406 X1.110993854 X1.111144756
#' A05363 0 0 0 0
#' A05370 0 0 0 NA
#' A05380 1 2 0 0
#' A05397 0 0 1 2
#' A05400 2 0 0 0
#' A05426 0 NA 0 0
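The same filter can probably be written without apply() by counting the non-zero, non-NA entries per column with colSums(). The block below is a sketch that rebuilds mat from the question so it runs on its own; keep and res are names chosen for this example.

```r
# mat from the question, entered row by row
mat <- matrix(c(0,  0,  0, 0, NA,  0,
                0,  0,  0, 0,  0, NA,
                1, NA,  2, 0, NA,  0,
                0,  0,  0, 1,  0,  2,
                2,  0,  0, 0,  0,  0,
                0,  0, NA, 0,  0,  0),
              nrow = 6, byrow = TRUE,
              dimnames = list(
                c("A05363", "A05370", "A05380", "A05397", "A05400", "A05426"),
                c("X1.110590170", "X1.110888172", "X1.110906406",
                  "X1.110993854", "X1.110996710", "X1.111144756")))

# mat != 0 is NA where mat is NA; na.rm = TRUE ignores those cells,
# so a column is kept only if it has at least one real non-zero value
keep <- colSums(mat != 0, na.rm = TRUE) > 0
res <- mat[, keep, drop = FALSE]   # drop = FALSE keeps a matrix even if one column survives
colnames(res)
```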

R: use a row as a grouping vector for row sums

If I have a data set laid out like:
Cohort Food1 Food2 Food3 Food4
--------------------------------
Group 1 1 2 3
A 1 1 0 1
B 0 0 1 0
C 1 1 0 1
D 0 0 0 1
I want to sum each row, where I can define food groups into different categories. So I would like to use the Group row as the defining vector.
This would mean that Food1 and Food2 are in group 1, Food3 is in group 2, and Food4 is in group 3.
Ideal output something like:
Cohort Group1 Group2 Group3
A 2 0 1
B 0 1 0
C 2 0 1
D 0 0 1
I tried using rowsum()-based functions but had no luck. Do I need to use ddply() instead?
Example data from comment:
dat <-
structure(list(species = c("group", "princeps", "bougainvillei",
"hombroni", "lindsayi", "concretus", "galatea", "ellioti", "carolinae",
"hydrocharis"), locust = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L), grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
snake = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L), fish = c(2L,
1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), frog = c(2L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L), toad = c(2L, 0L, 0L, 0L, 0L,
1L, 0L, 0L, 0L, 0L), fruit = c(3L, 0L, 0L, 0L, 0L, 1L, 1L,
0L, 0L, 0L), seed = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L,
0L)), .Names = c("species", "locust", "grasshopper", "snake",
"fish", "frog", "toad", "fruit", "seed"), class = "data.frame", row.names = c(NA,
-10L))
There are most likely more direct approaches, but here is one you can try:
First, create a copy of your data minus the second header row.
dat2 <- dat[-1, ]
melt() and dcast() and so on from the "reshape2" package don't work nicely with duplicated column names, so let's make the column names more "reshape2 appropriate".
Seq <- ave(as.vector(unlist(dat[1, -1])),
as.vector(unlist(dat[1, -1])),
FUN = seq_along)
names(dat2)[-1] <- paste("group", dat[1, 2:ncol(dat)],
".", Seq, sep = "")
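melt() the dataset (melt(), colsplit(), and dcast() come from the "reshape2" package, so load it first)
library(reshape2)
m.dat2 <- melt(dat2, id.vars="species")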
Use the colsplit() function to split the columns correctly.
m.dat2 <- cbind(m.dat2[-2],
colsplit(m.dat2$variable, "\\.",
c("group", "time")))
head(m.dat2)
# species value group time
# 1 princeps 0 group1 1
# 2 bougainvillei 0 group1 1
# 3 hombroni 1 group1 1
# 4 lindsayi 0 group1 1
# 5 concretus 0 group1 1
# 6 galatea 0 group1 1
Proceed with dcast() as usual
dcast(m.dat2, species ~ group, sum)
# species group1 group2 group3
# 1 bougainvillei 0 0 0
# 2 carolinae 1 1 0
# 3 concretus 0 2 2
# 4 ellioti 0 1 0
# 5 galatea 1 1 1
# 6 hombroni 2 1 0
# 7 hydrocharis 0 0 0
# 8 lindsayi 0 1 0
# 9 princeps 0 1 0
Note: Edited because original answer was incorrect.
Update: An easier way in base R
This problem is much more easily solved if you start by transposing your data.
dat3 <- t(dat[-1, -1])
dat3 <- as.data.frame(dat3)
names(dat3) <- dat[[1]][-1]
t(do.call(rbind, lapply(split(dat3, as.numeric(dat[1, -1])), colSums)))
# 1 2 3
# princeps 0 1 0
# bougainvillei 0 0 0
# hombroni 2 1 0
# lindsayi 0 1 0
# concretus 0 2 2
# galatea 1 1 1
# ellioti 0 1 0
# carolinae 1 1 0
# hydrocharis 0 0 0
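Since the question mentions rowsum(), here is a sketch of how that function might be applied directly: rowsum() groups rows, so the counts are transposed first, summed by group, and transposed back. The data frame is retyped from the dput() above so the block is self-contained; groups, counts, and res are names introduced only for this example.

```r
dat <- data.frame(
  species = c("group", "princeps", "bougainvillei", "hombroni", "lindsayi",
              "concretus", "galatea", "ellioti", "carolinae", "hydrocharis"),
  locust      = c(1L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L),
  grasshopper = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L),
  snake       = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
  fish        = c(2L, 1L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 0L),
  frog        = c(2L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L),
  toad        = c(2L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L),
  fruit       = c(3L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L),
  seed        = c(3L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L)
)

groups <- unlist(dat[1, -1])              # group id of each food column
counts <- as.matrix(dat[-1, -1])          # species x food counts, without the group row
rownames(counts) <- as.character(dat$species[-1])

# rowsum() sums rows by group, so work on the transpose (food x species)
res <- t(rowsum(t(counts), group = groups))
colnames(res) <- paste0("group_", colnames(res))
res["hombroni", ]
```

The result keeps the species in their original order; the row for hombroni should agree with the tables shown above (2, 1, 0).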
You can do this using base R fairly easily. Here's an example.
First, figure out which animals belong in which group:
groupings <- as.data.frame(table(as.numeric(dat[1,2:9]),names(dat)[2:9]))
attach(groupings)
grp1 <- groupings[Freq==1 & Var1==1,2]
grp2 <- groupings[Freq==1 & Var1==2,2]
grp3 <- groupings[Freq==1 & Var1==3,2]
detach(groupings)
Then, use the groups to do a rowSums() on the correct columns.
dat <- cbind(dat,rowSums(dat[as.character(grp1)]))
dat <- cbind(dat,rowSums(dat[as.character(grp2)]))
dat <- cbind(dat,rowSums(dat[as.character(grp3)]))
Delete the initial row and the intermediate columns:
dat <- dat[-1,-c(2:9)]
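Then, just rename things correctly, sort by species to match the output below, and reset the row names:
names(dat) <- c("species","group_1","group_2","group_3")
dat <- dat[order(dat$species), ]
row.names(dat) <- NULL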
And you ultimately get:
species group_1 group_2 group_3
bougainvillei 0 0 0
carolinae 1 1 0
concretus 0 2 2
ellioti 0 1 0
galatea 1 1 1
hombroni 2 1 0
hydrocharis 0 0 0
lindsayi 0 1 0
princeps 0 1 0
EDITED: Changed sort order to alphabetical, like other answer.

How to select columns from a data frame around the one with maximum value?

I'm really a beginner in R so, sorry if my code shocks you guys.
My data resembles something like this:
a b c d e f g h i j
t1 0 0 0 0 3 0 0 0 0 0
t2 0 0 0 0 0 6 0 0 0 0
t3 0 0 0 0 0 0 0 0 0 8
t4 0 0 0 0 0 0 0 0 9 0
For each row, I'd like to find the column with the maximum value and then get the columns from 3 before it to 3 after it.
I wrote the following script to perform exactly that:
M<-c(1)
for (row in 1: length(D[,1])) {
max<-which.max(D[row,])
D<-D[,c(max-3,max-2,max-1,max,max+1,max+2,max+3)]
M<- cbind(M,D)
}
M<-M[,-1]
It would work, except for the case in which the maximum value is in a column near the beginning or end of a row (like rows t3 and t4 in the example above). In this case I'd like to have the 7 columns closest to the column with the maximum value, like this:
t1 0 0 0 3 0 0 0
t2 0 0 0 6 0 0 0
t3 0 0 0 0 0 0 8
t4 0 0 0 0 0 9 0
Help would be really appreciated!
dput() version of example data:
structure(list(a = c(0L, 0L, 0L, 0L), b = c(0L, 0L, 0L, 0L),
c = c(0L, 0L, 0L, 0L), d = c(0L, 0L, 0L, 0L), e = c(3L, 0L,
0L, 0L), f = c(0L, 6L, 0L, 0L), g = c(0L, 0L, 0L, 0L), h = c(0L,
0L, 0L, 0L), i = c(0L, 0L, 0L, 9L), j = c(0L, 0L, 8L, 0L)), .Names = c("a",
"b", "c", "d", "e", "f", "g", "h", "i", "j"), class = "data.frame",
row.names = c("t1", "t2", "t3", "t4"))
This should work nicely:
t(apply(D,
MARGIN = 1,
FUN = function(X) {
n <- which.max(X)
i <- seq(min(max(1, n-3), ncol(D)-6), len=7)
X[i]
}))
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# t1 0 0 0 3 0 0 0
# t2 0 0 0 6 0 0 0
# t3 0 0 0 0 0 0 8
# t4 0 0 0 0 0 9 0
To test that the key column-selecting bit works as you'd like it to, you can try the following:
n <- 2
seq(min(max(1, n-3), ncol(D)-6), len=7)
n <- 10
seq(min(max(1, n-3), ncol(D)-6), len=7)
n <- 6
seq(min(max(1, n-3), ncol(D)-6), len=7)
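The clamping logic in those checks can also be wrapped in a small function to make the edge cases easier to see. This is a sketch: window_cols, ncols, and half are names invented here, and it assumes the data has at least 2 * half + 1 columns.

```r
# Indices of a (2 * half + 1)-wide window centred on column n,
# shifted inward when it would run past either edge
window_cols <- function(n, ncols, half = 3) {
  width <- 2 * half + 1
  start <- min(max(1, n - half), ncols - width + 1)
  seq(start, length.out = width)
}

window_cols(2, ncols = 10)   # maximum near the left edge
#[1] 1 2 3 4 5 6 7
window_cols(10, ncols = 10)  # maximum in the last column
#[1]  4  5  6  7  8  9 10
window_cols(6, ncols = 10)   # maximum in the middle
#[1] 3 4 5 6 7 8 9
```

max(1, n - half) stops the window from starting before column 1, and the outer min() pulls the start back so the window never extends past the last column.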
