Counting NA characters of a matrix - r

I have the following matrix:
V1 V2 V3 V4
[1,] "d" "e" "i" "NA"
[2,] "j" "e" "i" "NA"
[3,] "j" "n" "k" "l"
[4,] "j" "k" "l" "m"
[5,] "j" "k" "i" "NA"
[6,] "o" "n" "NA" "NA"
I am trying to count the number of elements per row that are not NA, but the usual approaches, such as !is.na(MATRIX), are not working: I always get 4 as the answer. I presume this is because R is treating "NA" as a character string, but I do not know how to fix this.

'NA' is not NA_character_ so is.na does not work. Just use
rowSums(MATRIX != 'NA')

If the NAs are character strings, convert them to real NA with mat[mat=="NA"] <- NA and then use the solution in Sotos' comment
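The two suggestions above can be compared side by side. This is a minimal sketch that rebuilds the matrix from the question; MATRIX follows the question's naming, while non_na_a and non_na_b are illustrative names that do not appear in the original posts.

```r
# Rebuild the character matrix from the question.
# Note that "NA" here is an ordinary string, not a missing value.
MATRIX <- matrix(
  c("d", "e", "i",  "NA",
    "j", "e", "i",  "NA",
    "j", "n", "k",  "l",
    "j", "k", "l",  "m",
    "j", "k", "i",  "NA",
    "o", "n", "NA", "NA"),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, paste0("V", 1:4))
)

# Approach 1: compare against the string "NA" directly.
non_na_a <- rowSums(MATRIX != "NA")

# Approach 2: convert the strings to real NA values, then use is.na().
mat <- MATRIX
mat[mat == "NA"] <- NA
non_na_b <- rowSums(!is.na(mat))

non_na_a
# [1] 3 3 4 4 3 2
```

Both approaches give the same counts; the second is probably preferable if the matrix is used further, since other NA-aware functions will then behave as expected.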

Related

Why does gtools::combinations and permutations not work with a vector containing the same elements?

Say I have a vector vec <- c("H", "H", "H", "H", "M", "M", "A", "A")
How do I get all combinations / permutations if I, e.g., draw 5 out of 8, with the expected output below?
> head(t, 6)
[,1] [,2] [,3] [,4] [,5]
[1,] "H" "H" "H" "H" "M"
[2,] "H" "H" "H" "H" "M"
[3,] "H" "H" "H" "H" "A"
[4,] "H" "H" "H" "H" "A"
[5,] "H" "H" "H" "M" "M"
[6,] "H" "H" "H" "M" "A"
I tried gtools::combinations(), but I always get the error that there are too few different elements (the same is true for gtools::permutations(), regardless of whether repeats are allowed).
So I did it in a laborious way
t <- gtools::combinations(8, 5, vec, repeats.allowed = F)
Error in gtools::combinations(8, 5, vec, repeats.allowed = F) :
too few different elements
t <- gtools::combinations(8, 5, letters[1:8], repeats.allowed = F)
for ( i in 1:8) {
if ( i <=4 ) {
t[t == letters[i]] <- "H"
} else if (i <= 6) {
t[t == letters[i]] <- "M"
} else if (i <= 8) {
t[t == letters[i]] <- "A"
}
}
I am looking for an easier solution from any package or base R, and I would like to know why it doesn't work. Thanks in advance.
An alternative
combn(vec,5)
which results in 56 combinations (choose(8,5)).
When you need combinations/permutations of a vector that contains repeats (a multiset), many of the available functions in base R and other packages will produce unnecessary duplicate results that eventually need to be filtered out. For smaller problems this is not an issue; however, the approach quickly becomes impractical as the input grows.
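To make the filtering approach concrete, here is a minimal base R sketch. The names all_combos and unique_combos are illustrative. Sorting vec first is a choice made here so that every combination comes out in the same element order and duplicate rows can be detected with unique().

```r
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")

# combn() treats every position as distinct, so it returns choose(8, 5) = 56
# combinations, many of which are identical as multisets.
all_combos <- t(combn(sort(vec), 5))
nrow(all_combos)
# [1] 56

# Filter out the duplicate rows afterwards.
unique_combos <- unique(all_combos)
nrow(unique_combos)
# [1] 8
```

This works for 8 elements, but the intermediate matrix grows with choose(length(vec), k), which is the cost the filtering approach cannot avoid.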
Currently, there are a couple of packages capable of handling these types of problems. They are arrangements and RcppAlgos (I am the author).
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")
tbl_v <- table(vec)
tbl_v
vec
A H M
2 4 2
library(RcppAlgos)
comboGeneral(names(tbl_v), 5, freqs = tbl_v)
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "A" "H" "H" "H"
[2,] "A" "A" "H" "H" "M"
[3,] "A" "A" "H" "M" "M"
[4,] "A" "H" "H" "H" "H"
[5,] "A" "H" "H" "H" "M"
[6,] "A" "H" "H" "M" "M"
[7,] "H" "H" "H" "H" "M"
[8,] "H" "H" "H" "M" "M"
## For package arrangements we have:
## arrangements::combinations(names(tbl_v), 5, freq = tbl_v)
Similarly, for permutations, we have:
permuteGeneral(names(tbl_v), 5, freqs = tbl_v)
[,1] [,2] [,3] [,4] [,5]
[1,] "A" "A" "H" "H" "H"
[2,] "A" "A" "H" "H" "M"
[3,] "A" "A" "H" "M" "H"
[4,] "A" "A" "H" "M" "M"
. . . . . .
. . . . . .
. . . . . .
[137,] "M" "M" "H" "A" "A"
[138,] "M" "M" "H" "A" "H"
[139,] "M" "M" "H" "H" "A"
[140,] "M" "M" "H" "H" "H"
## For package arrangements we have:
## arrangements::permutations(names(tbl_v), 5, freq = tbl_v)
Both packages contain algorithms that generate each result without the need for filtering. This approach is much more efficient.
For example, what if we had big_vec <- rep(vec, 8) and wanted all combinations of length 16? Using the filtering approach, one would need to generate every 16-element combination of a vector of length 64 and then filter them. That is choose(64, 16) = 4.885269e+14 total combinations, which is not practical.
With these two packages, this problem is a breeze.
big_vec <- rep(vec, 8)
tbl_big_v <- table(big_vec)
tbl_big_v
big_vec
A H M
16 32 16
system.time(test_big <- comboGeneral(names(tbl_big_v), 16,
freqs = tbl_big_v))
user system elapsed
0 0 0
dim(test_big)
[1] 153 16
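The row counts reported above (8 for the small example, 153 for the large one) can be sanity-checked in base R without generating any combinations. The sketch below counts how many ways the per-element counts can add up to the requested length. count_multiset_combos is a hypothetical helper written for this illustration; it is not part of RcppAlgos or arrangements.

```r
# Count multiset combinations of size k, given how often each distinct
# element is available (freqs). Hypothetical helper for illustration only.
count_multiset_combos <- function(freqs, k) {
  # Each distinct element can be used 0..min(freq, k) times.
  grid <- expand.grid(lapply(freqs, function(f) 0:min(f, k)))
  # Keep only the count vectors that use exactly k elements in total.
  sum(rowSums(grid) == k)
}

count_multiset_combos(c(A = 2, H = 4, M = 2), 5)
# [1] 8
count_multiset_combos(c(A = 16, H = 32, M = 16), 16)
# [1] 153
```

The grid itself grows quickly with the number of distinct elements, so this is only meant as a check for small alphabets like the one here.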
apply(gtools::combinations(8,5,repeats.allowed = FALSE),2,\(x) vec[x])
does what you want.
I don't know why the package requires distinct values when it is applied to a vector, though. The documentation is unclear on this point.
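A possible explanation, offered with some caution: as far as I can tell from the gtools help page, combinations() and permutations() take a set argument that defaults to TRUE and removes duplicates from v before checking that at least n elements remain. The sketch below reproduces that check in base R and, if gtools happens to be installed, tries set = FALSE. The behaviour of set is stated from the documentation as I understand it and should be verified against the installed version.

```r
vec <- c("H", "H", "H", "H", "M", "M", "A", "A")

# With duplicates removed, only 3 distinct values remain, which is fewer
# than n = 8. This is consistent with the "too few different elements" error.
n_distinct <- length(unique(vec))
n_distinct < 8
# [1] TRUE

# Assumption: gtools is installed and set = FALSE keeps the duplicates in v.
if (requireNamespace("gtools", quietly = TRUE)) {
  res <- gtools::combinations(8, 5, vec, set = FALSE)
  print(dim(res))
}
```

If set = FALSE behaves as described, the result still contains duplicate rows, just like combn(vec, 5), so it does not replace the multiset-aware functions shown earlier.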

Splitting a dataframe into overlapping groups of equal sizes

I am looking for a way to split my data into groups where each group is made of the same window size I define.
Chrom Start End
chr1 1 10
chr1 11 20
chr1 21 30
chr1 31 40
For example, if I want a window size of 20, then the groups would be : 1-20 , 11-30 , 21 - 40.
As long as the size of the group did not exceed 20 it can keep adding to the same group.
I tried using the split function but couldn't implement this way using it.
Is there a way around this?
A vector (or column of a dataframe) can be split into overlapping windows like this:
# Size of overlap
o <- 10
# Size of sliding window
n <- 20
# Dummy data
x <- sample(LETTERS, size = 40, replace = T)
# Define start and end point (s and e)
s <- 1
e <- n
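# Loop to create fragments; the number of windows is chosen so that
# the last window ends at or before length(x) and contains no NAs
for(i in seq_len((length(x) - n) %/% o + 1)){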
assign(paste0("x", i), x[s:e])
s <- s + o
e <- (s + n) - 1
}
# Call fragments
x1
x2
x3
Result:
> x
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R" "W" "Z" "F" "T" "N" "U" "F" "R" "A" "V"
> x1
[1] "F" "E" "G" "X" "R" "S" "L" "F" "F" "C" "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L"
> x2
[1] "I" "X" "A" "C" "B" "Z" "Q" "T" "W" "L" "G" "I" "B" "I" "O" "V" "J" "Z" "C" "R"
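The same windows can be collected in a list instead of separate variables created with assign(). This is a sketch with deterministic dummy data (the integers 1 to 40) so the result is easy to check; starts and windows are illustrative names.

```r
o <- 10   # step between window starts (size of overlap)
n <- 20   # size of sliding window
x <- 1:40 # deterministic dummy data

# Start positions of all windows that fit completely inside x.
starts <- seq(1, length(x) - n + 1, by = o)

# One list element per window.
windows <- lapply(starts, function(s) x[s:(s + n - 1)])
names(windows) <- paste0("x", seq_along(windows))

range(windows$x2)
# [1] 11 30
```

Keeping the windows in a list makes it straightforward to process them afterwards with lapply() or sapply().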
library(IRanges)
library(GenomicRanges)
(gr1 <- GRanges("chr1",IRanges(c(1,11,21,31),width=10),strand="*"))
(gr2 <- GRanges("chr1",IRanges(c(1,11,21),width=20),strand="*"))
fo <- findOverlaps(gr1, gr2)
queryHits(fo)
subjectHits(fo)
Check http://genomicsclass.github.io/book/pages/bioc1_igranges.html#intrarange for more details.
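For readers without Bioconductor, the grouping from the question can also be sketched in base R. The assumptions here are that all rows are on a single chromosome, that a window starts at each row's Start, and that a row belongs to a window only if it lies entirely inside it. The names size, win_start and groups are illustrative.

```r
df <- data.frame(Chrom = "chr1",
                 Start = c(1, 11, 21, 31),
                 End   = c(10, 20, 30, 40))
size <- 20

# Keep only the window starts whose window fits inside the covered region.
win_start <- df$Start[df$Start + size - 1 <= max(df$End)]

# For each window, select the rows that lie entirely within it.
groups <- lapply(win_start, function(s) {
  df[df$Start >= s & df$End <= s + size - 1, ]
})
names(groups) <- paste(win_start, win_start + size - 1, sep = "-")

names(groups)
# [1] "1-20"  "11-30" "21-40"
```

For several chromosomes, one option would be to split df by Chrom first and apply the same steps to each piece.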

Dataframe manipulation in R - Shift cells to the left and remove NAs

I have a dataframe whose columns consists of randomly distributed values and NAs, as seen below:
a<-c("S","E","NA","S","NA")
b<-c("A","NA","M","G","K")
c<-c("I","NA","NA","NA","L")
meh<-data.frame(a,b,c)
# [,1] [,2] [,3] [,4] [,5]
#a "S" "E" "NA" "S" "NA"
#b "A" "NA" "M" "G" "K"
#c "I" "NA" "NA" "NA" "L"
I want to remove all NAs and shift the non-NAs to the left - it should look like this
# [,1] [,2] [,3] [,4]
#a "S" "E" "S"
#b "A" "M" "G" "K"
#c "I" "L"
Any ideas?
We can also use stri_list2matrix
library(stringi)
stri_list2matrix(lapply(meh, function(x) x[x!='NA']), fill='', byrow=TRUE)
# [,1] [,2] [,3] [,4]
#[1,] "S" "E" "S" ""
#[2,] "A" "M" "G" "K"
#[3,] "I" "L" "" ""
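If adding a dependency on stringi is not desirable, a similar result can be built in base R. This sketch assumes, as in the question, that the missing entries are the string "NA"; rows, kept, width and shifted are illustrative names.

```r
rows <- list(
  a = c("S", "E", "NA", "S", "NA"),
  b = c("A", "NA", "M", "G", "K"),
  c = c("I", "NA", "NA", "NA", "L")
)

# Drop the "NA" strings from each vector.
kept <- lapply(rows, function(x) x[x != "NA"])

# Pad every vector on the right with "" up to the longest length,
# then bind them as rows of a character matrix.
width <- max(lengths(kept))
shifted <- t(sapply(kept, function(x) c(x, rep("", width - length(x)))))
shifted
#   [,1] [,2] [,3] [,4]
# a "S"  "E"  "S"  ""
# b "A"  "M"  "G"  "K"
# c "I"  "L"  ""   ""
```

With real NA values instead of strings, the filter would be x[!is.na(x)].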
It might help if you specify what you want to do with the data after you finish this process, but here's a way to get rid of the NAs in each column and store the result in a variable. That is, if you actually have NAs. I changed your example dataset to reflect the comments above, so that it contains NA rather than "NA".
a<-c("S","E",NA,"S",NA)
b<-c("A",NA,"M","G","K")
c<-c("I",NA,NA,NA,"L")
meh<-data.frame(a,b,c)
newcol<-na.omit(meh$a) #Removes all NA's from your column
newcol<-newcol[1:length(newcol)] #Gives you an output without any NA's
The same can be done with each row like jeremycg suggests, using lapply.
lapply(1:nrow(meh), function(x) meh[x,][is.na(meh[x,])==F])
Once the vectors are all different sizes, it doesn't make sense to colbind them back into wide form. Instead,
library(dplyr)
library(tidyr)
meh %>%
gather(variable, value) %>%
filter(!is.na(value))

Extracting common character strings from multiple vectors of different lengths

Is there a function in R to find the common elements of multiple character vectors (of different lengths)? For example, if I have 3 vectors...
a1 <- LETTERS[1:7]
a2 <- LETTERS[4:8]
a3 <- LETTERS[2:10]
a1
# [1] "A" "B" "C" "D" "E" "F" "G"
a2
# [1] "D" "E" "F" "G" "H"
a3
# [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
I can think of a messy solution...
intersect(intersect(a1,a2),a3)
# [1] "D" "E" "F" "G"
The problem is that I have around 8 or 9 vectors. Is there a better way to do this?
Yes:
Reduce(intersect,list(a1,a2,a3))
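Reduce() applies intersect() to the first two vectors, then to that result and the third vector, and so on, so it scales to any number of vectors collected in a list. A short sketch with the vectors from the question (vecs and common are illustrative names):

```r
a1 <- LETTERS[1:7]
a2 <- LETTERS[4:8]
a3 <- LETTERS[2:10]

# Collect the vectors in a list; more vectors can simply be appended.
vecs <- list(a1, a2, a3)

# Fold intersect() over the list, left to right.
common <- Reduce(intersect, vecs)
common
# [1] "D" "E" "F" "G"
```

The result matches the nested intersect(intersect(a1, a2), a3) call from the question.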

Using a sample list as a template for sampling from a larger list with wraparound

Similar to my question at Using a sample list as a template for sampling from a larger list without wraparound, how can I now do this while allowing for wrap-around?
Thus, if I have a vector of letters:
> all <- letters
> all
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
and then I define a reference sample from letters as follows:
> refSample <- c("j","l","m","s")
in which the spacing between elements is 2 (1st to 2nd), 1 (2nd to 3rd) and 6 (3rd to 4th), how can I then select n samples from all whose elements have the same spacing as refSample, with wrap-around allowed? For example, "a","c","d","j", "q" "s" "t" "z" and "r" "t" "u" "a" would be valid samples, but "a","c","d","k" would not.
Again, parameterised for a function is best.
I would have left it as an exercise but here goes --
all <- letters
refSample <- c("j","l","m","s")
pick_matches <- function(n, ref, full, wrap = FALSE) {
iref <- match(ref,full)
spaces <- diff(iref)
tot_space <- sum(spaces)
N <- length( full ) - 1
max_start <- N - tot_space*(1-wrap)
starts <- sample(0:max_start, n, replace = TRUE)
return( sapply( starts, function(s) full[ 1 + cumsum(c(s, spaces)) %% (N+1) ] ) )
}
> set.seed(1)
> pick_matches(5, refSample, all, wrap = FALSE)
[,1] [,2] [,3] [,4] [,5]
[1,] "e" "g" "j" "p" "d"
[2,] "g" "i" "l" "r" "f"
[3,] "h" "j" "m" "s" "g"
[4,] "n" "p" "s" "y" "m"
> pick_matches(5, refSample, all, wrap = TRUE)
[,1] [,2] [,3] [,4] [,5]
[1,] "x" "y" "r" "q" "b"
[2,] "z" "a" "t" "s" "d"
[3,] "a" "b" "u" "t" "e"
[4,] "g" "h" "a" "z" "k"
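To check that a sample really has the same wrap-around spacing as the reference, the gaps can be compared modulo the length of the full vector. has_ref_spacing is a hypothetical helper written for this illustration, not part of the answer above, and it assumes the reference itself does not wrap around.

```r
# Hypothetical helper: TRUE if `candidate` has the same gaps between
# consecutive elements as `ref`, allowing wrap-around within `full`.
has_ref_spacing <- function(candidate, ref, full) {
  ref_gaps  <- diff(match(ref, full))
  # A negative gap means the candidate wrapped past the end of `full`;
  # taking the gap modulo length(full) maps it back to a forward step.
  cand_gaps <- diff(match(candidate, full)) %% length(full)
  identical(as.numeric(cand_gaps), as.numeric(ref_gaps))
}

refSample <- c("j", "l", "m", "s")

has_ref_spacing(c("a", "c", "d", "j"), refSample, letters)
# [1] TRUE
has_ref_spacing(c("r", "t", "u", "a"), refSample, letters)
# [1] TRUE
has_ref_spacing(c("a", "c", "d", "k"), refSample, letters)
# [1] FALSE
```

Applied to each column of the pick_matches() output, for example with apply(result, 2, has_ref_spacing, ref = refSample, full = all), this should return TRUE for every sample.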

Resources