Detect different levels in a column - r

I have a column with about 80k entries that has only 22 different levels (the chromosome number). Is there a quick trick in R to find the position at which one level changes into the next, i.e. the row at which chromosome 1 changes to chromosome 2? (All entries for a single chromosome are listed together.)
My data looks like this:
chr number marker name (SNP)
1 rs...
1 rs...
.
.
2
thanks

You can use unique and match from base R:
data <- c(rep("a",10),rep("b",5),rep("c",2),rep("d",10))
match( unique(data) , data )
#[1] 1 11 16 18
match returns a vector of the positions of the first matches of its first argument in its second argument. This works because all your entries for a chromosome are listed together.

Check for diff being nonzero. This returns a logical vector which is TRUE when consecutive values aren't the same. Wrap it with which to get numeric indices.
(x <- factor(sample(c("a", "b"), 15, replace = TRUE)))
# [1] a a b b a a b b b b b a b a a
# Levels: a b
diff(as.integer(x)) != 0
# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
which(diff(as.integer(x)) != 0)
# [1] 2 4 6 11 12 13
If all your chromosome values are grouped together, you can find the first instance of each level with duplicated.
(x2 <- factor(rep(c("a", "b", "c"), times = c(3, 4, 6))))
# [1] a a a b b b b c c c c c c
# Levels: a b c
!duplicated(x2)
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which(!duplicated(x2))
# [1] 1 4 8
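Note that which(diff(...) != 0) returns the last row before each change, whereas which(!duplicated(...)) returns the first row of each level. A minimal sketch relating the two, assuming the values are grouped as in x2 (the names change_after and first_rows are illustrative, not from the answer):

```r
x2 <- factor(rep(c("a", "b", "c"), times = c(3, 4, 6)))

# Last row of every run except the final one
change_after <- which(diff(as.integer(x2)) != 0)

# Shift by one to get the first row of each following run,
# and prepend 1 for the first run
first_rows <- c(1L, change_after + 1L)

# For grouped data this agrees with the duplicated() approach
agree <- identical(first_rows, which(!duplicated(x2)))
```

If the same level can reappear later in the column, the two approaches differ: diff reports every change, while duplicated reports only the first occurrence of each level.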

You could use rle for this (if I get your question right):
x <- rep(LETTERS[1:22], each = 3)
x
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C" "D" "D" "D" "E" "E" "E" "F" "F" "F" "G" "G" "G" "H" "H" "H"
# "I" "I" "I" "J" "J" "J" "K" "K" "K" "L" "L" "L" "M" "M" "M" "N" "N" "N" "O" "O" "O" "P" "P" "P"
# "Q" "Q" "Q" "R" "R" "R" "S" "S" "S" "T" "T" "T" "U" "U" "U" "V" "V" "V"
rles <- rle(x)
cumsum(rles$lengths)
# [1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66
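cumsum(rles$lengths) gives the last row of each run, so the next level starts one row later. A small sketch on made-up grouped data that tabulates the start and end row of each chromosome (chr, runs and bounds are hypothetical names, not from the question):

```r
# Hypothetical chromosome column, grouped as described in the question
chr <- rep(1:4, times = c(3, 5, 2, 4))

runs  <- rle(chr)
end   <- cumsum(runs$lengths)        # last row of each run
start <- end - runs$lengths + 1L     # first row of each run

bounds <- data.frame(chr = runs$values, start = start, end = end)
bounds
```

The start column is what the question asks for; the end column is what cumsum alone returns.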

Related

R - Expand Grid Without Duplicates

I need a function similar to expand.grid but without the combinations of duplicate elements.
Here is a simplified version of my problem.
X1 = c("x","y","z")
X2 = c("A","B","C")
X3 = c("y","C","G")
d <- expand.grid(X1,X2,X3)
d
Var1 Var2 Var3
1 x A y
2 y A y
3 z A y
4 x B y
. . . .
. . . .
. . . .
23 y B G
24 z B G
25 x C G
26 y C G
27 z C G
d has 27 rows, but 6 of them contain duplicate values, which I do not need: rows 2, 5, 8, 16, 17 and 18.
Is there a way to get the other 21 rows, which do not contain any duplicates?
Note that in the real case the vectors have more than 3 elements (c("x","y","z","k","m"...), up to 50) and there are more than 3 vectors (X4, X5, X6... up to 11). Because of this, the expanded object gets very large and RAM cannot handle it.
In RcppAlgos*, there is a function called comboGrid that does the trick:
library(RcppAlgos) ## as of v2.4.3
comboGrid(X1, X2, X3, repetition = FALSE)
# Var1 Var2 Var3
# [1,] "x" "A" "C"
# [2,] "x" "A" "G"
# [3,] "x" "A" "y"
# [4,] "x" "B" "C"
# [5,] "x" "B" "G"
# [6,] "x" "B" "y"
# [7,] "x" "C" "G"
# [8,] "x" "C" "y"
# [9,] "y" "A" "C"
# [10,] "y" "A" "G"
# [11,] "y" "B" "C"
# [12,] "y" "B" "G"
# [13,] "y" "C" "G"
# [14,] "z" "A" "C"
# [15,] "z" "A" "G"
# [16,] "z" "A" "y"
# [17,] "z" "B" "C"
# [18,] "z" "B" "G"
# [19,] "z" "B" "y"
# [20,] "z" "C" "G"
# [21,] "z" "C" "y"
Large Test
set.seed(42)
rnd_lst <- lapply(1:11, function(x) {
sort(sample(LETTERS, sample(26, 1)))
})
## Number of results that expand.grid would return if your machine
## had enough memory... over 300 billion!!!
prettyNum(prod(lengths(rnd_lst)), big.mark = ",")
# [1] "365,634,846,720"
exp_grd_test <- expand.grid(rnd_lst)
# Error: vector memory exhausted (limit reached?)
system.time(cmb_grd_test <- comboGrid(rnd_lst, repetition=FALSE))
# user system elapsed
# 9.866 0.330 10.196
dim(cmb_grd_test)
# [1] 3036012 11
head(cmb_grd_test)
# Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11
# [1,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "K"
# [2,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "L"
# [3,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "M"
# [4,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "N"
# [5,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "O"
# [6,] "A" "E" "C" "B" "D" "G" "F" "H" "J" "I" "P"
* I am the author of RcppAlgos
(Sorry, I just realized that your problem is as much a size problem, so removing them post-generation may not be feasible. For that, this may not be the best answer, but I'll keep it around for smaller-and-related questions.)
base R
I hard-code the column count (3, and 2 as "one fewer"), but you can use ncol(d) and/or ncol(d)-1 for programmatic use.
d[lengths(apply(d, 1, unique)) > 2, ]
# Var1 Var2 Var3
# 1 x A y
# 3 z A y
# 4 x B y
# 6 z B y
# 7 x C y
# 9 z C y
# 10 x A C
# 11 y A C
# 12 z A C
# 13 x B C
# 14 y B C
# 15 z B C
# 19 x A G
# 20 y A G
# 21 z A G
# 22 x B G
# 23 y B G
# 24 z B G
# 25 x C G
# 26 y C G
# 27 z C G
(The row names are not reset; the gaps confirm that fewer than 27 rows remain.)
And to verify, here are the rows with dupes:
d[lengths(apply(d, 1, unique)) < 3, ]
# Var1 Var2 Var3
# 2 y A y
# 5 y B y
# 8 y C y
# 16 x C C
# 17 y C C
# 18 z C C
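One caveat with lengths(apply(d, 1, unique)): apply simplifies its result to a matrix when every row happens to have the same number of unique values, and lengths then no longer counts per row. A sketch of a variant that avoids this by returning one logical per row (has_dupes and d_unique are illustrative names; this still builds the full grid, so it does not address the memory problem):

```r
X1 <- c("x", "y", "z")
X2 <- c("A", "B", "C")
X3 <- c("y", "C", "G")
d <- expand.grid(X1, X2, X3, stringsAsFactors = FALSE)

# anyDuplicated() returns 0 when a row has no repeated value,
# so the function always yields a single logical per row
has_dupes <- apply(d, 1, function(row) anyDuplicated(row) > 0)

d_unique <- d[!has_dupes, ]
nrow(d_unique)
```

Because the function passed to apply always returns a scalar, the result is a plain logical vector regardless of the data.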

Extract names of neighbours of a collection of nodes as a list

I want to extract the names of the neighbors of selected nodes in igraph as a list.
This is what I have so far:
library(igraph)
set.seed(100)
g<-erdos.renyi.game(26, 0.4)
V(g)$name<-letters
x<-neighborhood(g, order = 1, V(g)$name %in% c('a', 'd', 'z'))
In the above example, I want to extract the names of the neighbours of nodes a,d, and z. And this is the output I am getting:
[[1]]
+ 9/26 vertices, named, from 6ba7060:
[1] a c h i l q w x z
[[2]]
+ 11/26 vertices, named, from 6ba7060:
[1] d b c e g h i k l w y
[[3]]
+ 9/26 vertices, named, from 6ba7060:
[1] z a c g h l o u v
I want to make one long list of the names, with repetitions. The output should look like:
[1] "a" "c" "h" "i" "l" "q" "w" "x" "z" "d" "b" "c" "e" "g" "h" "i" "k" "l" "w" "y" "z" "a" "c"
[24] "g" "h" "l" "o" "u" "v"
Thus far I have tried unlist and various versions of x %>% map(2) %>% flatten() using the library purrr, but to no avail.
I am also not opposed to getting the output in the form of a data.frame or tibble with names in one column and count of occurrences in another.
Maybe you can try names(unlist(x)), e.g.,
> names(unlist(x))
[1] "a" "c" "h" "i" "l" "q" "w" "x" "z" "d" "b" "c" "e" "g" "h" "i" "k" "l" "w"
[20] "y" "z" "a" "c" "g" "h" "l" "o" "u" "v"
You can apply names() to each element in the list to get the vertex names as character values, and then unlist the result into a single character vector:
unlist(lapply(x, names))
If you would rather a data frame with counts, you can do:
stack(table(unlist(lapply(x, names))))[2:1]
#> ind values
#> 1 a 2
#> 2 b 1
#> 3 c 3
#> 4 d 1
#> 5 e 1
#> 6 g 2
#> 7 h 3
#> 8 i 2
#> 9 k 1
#> 10 l 3
#> 11 o 1
#> 12 q 1
#> 13 u 1
#> 14 v 1
#> 15 w 2
#> 16 x 1
#> 17 y 1
#> 18 z 2
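The same idea can be tried without igraph installed. The sketch below uses a hypothetical stand-in for x (a list of named integer vectors, which is an assumption about how the vertex sequences behave under names()) and builds a two-column count table with as.data.frame instead of stack:

```r
# Hypothetical stand-in for the neighborhood() result:
# each element is a named integer vector
x <- list(
  c(a = 1L, c = 3L, h = 8L),
  c(d = 4L, b = 2L, c = 3L)
)

# Flatten the names, keeping repetitions
nm <- unlist(lapply(x, names))

# One row per name, with its number of occurrences in column n
counts <- as.data.frame(table(name = nm), responseName = "n",
                        stringsAsFactors = FALSE)
counts[order(-counts$n), ]
```

Naming the table dimension (name = nm) and setting responseName gives readable column names directly.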

How to plot frequency of all elements in one list appearing in another list

I have a long list of sequences as follows
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC etc.
I also have a shorter list, and I would like to see how many times each element of the short list appears in the long list and plot the counts as a histogram. I suppose it's like a VLOOKUP function. How can I do this in R?
Try:
longlist = c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
"AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
"AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC"
)
shortlist = c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")
longlist
[1] "AAAAAACGTTATGATCGATC" "AAAATTCGCGCTTAGAGATC" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
[6] "AAGCTACGCATGCATCGACT" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
shortlist
[1] "AAAAAACGTTATGATCGATC" "AAGCTACGCATGCATCGACT"
outdf = data.frame(var=character(), freq=numeric(), stringsAsFactors=FALSE)
for(i in seq_along(shortlist)) {outdf[i,]=c(shortlist[i], sum(longlist==shortlist[i]))}
outdf
var freq
1 AAAAAACGTTATGATCGATC 5
2 AAGCTACGCATGCATCGACT 3
outdf$freq = as.numeric(outdf$freq)
barplot(outdf$freq, names.arg=outdf$var)
You can easily use the following to see the frequencies and a barplot of the full longlist:
table(longlist)
longlist
AAAAAACGTTATGATCGATC AAAATTCGCGCTTAGAGATC AAGCTACGCATGCATCGACT
5 1 3
barplot(table(longlist))
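The loop above can also be written without a loop. A minimal sketch, assuming exact string matches are wanted: converting longlist to a factor whose levels are the shortlist entries makes table count only those entries, and report 0 for entries that never occur (the third shortlist entry here is made up to show that case):

```r
longlist <- c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
              "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
              "AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC")
# The last entry is hypothetical and does not occur in longlist
shortlist <- c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT", "TTTTTTTTTTTTTTTTTTTT")

# Values of longlist that are not in shortlist become NA and are
# dropped by table(); unused levels are kept with a count of 0
counts <- table(factor(longlist, levels = shortlist))
counts
```

The result can be passed straight to barplot(counts), just like table(longlist).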
match and table should work for your character vectors. Here's an example with just random letters:
set.seed(1492)
dat <- sample(c(letters, LETTERS), 100, replace=TRUE)
dat
## [1] "o" "l" "j" "f" "c" "a" "S" "A" "u" "N" "H" "H" "k" "B" "B" "P" "g"
## [18] "r" "I" "V" "H" "t" "g" "F" "e" "W" "E" "D" "r" "Y" "h" "Z" "R" "l"
## [35] "Z" "K" "v" "f" "b" "q" "M" "P" "i" "u" "w" "m" "S" "g" "f" "g" "G"
## [52] "h" "q" "T" "J" "M" "K" "m" "X" "Q" "f" "x" "t" "B" "k" "z" "I" "Y"
## [69] "z" "g" "z" "u" "O" "k" "G" "L" "n" "B" "A" "A" "J" "p" "U" "F" "E"
## [86] "X" "R" "J" "G" "L" "H" "o" "z" "r" "d" "r" "V" "H" "S" "I"
matches <- match(dat, LETTERS)
match_counts <- table(matches[!is.na(matches)])
match_counts
##
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
names(match_counts) <- LETTERS[as.numeric(names(match_counts))]
match_counts
## A B D E F G H I J K L M N O P Q R S T U V W X Y Z
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
barplot(sort(match_counts), col="#649388")
Assuming that the sequences are strings.
lines <- readLines(n=6)
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
shortlist <- readLines(n=1)
AGTD
Since it is not clear from the question, I am assuming here that each element of the shortlist is an individual character.
pat1 <- gsub("(?<=[A-Za-z])(?=[A-Za-z])", "|", shortlist, perl=TRUE)
pat1
#[1] "A|G|T|D"
library(stringr)
lvls <- unique(str_extract_all(shortlist, "[A-Za-z]")[[1]])
t1 <- table(factor(unlist(regmatches(lines,gregexpr(pat1, lines))), levels=lvls))
t1
#
# A G T D
#47 21 29 0
barplot(t1, col="#649388")
Update
If your shortlist looks like the one below and you want the frequencies of the whole strings rather than of the characters within them:
shortlist1 <- readLines(n=4)
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATCT
AAAAAACG
pat2 <- paste0("^(", paste(shortlist1, collapse="|"), ")$") # group the alternatives so ^ and $ anchor every one of them
lvls1 <- unique(shortlist1)
t2 <- table(factor(unlist(regmatches(lines,gregexpr(pat2, lines))), levels=lvls1))
t2
#AAGCTACGCATGCATCGACT AAAAAACGTTATGATCGATC AAAAAACGTTATCT
# 1 3 0
# AAAAAACG
# 0
barplot(t2, col="#649388")

Matching values from two vectors in R

I have two vectors:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3) # 7 unique values
B <- c("a","b","c","d","e","f","g") # 7 different values
I would like to match the values of B to A such that the smallest value in A gets the first value from B and continued on to the largest.
The above example would be:
A: 1 3 5 6 4 3 2 3 3 3 3 3 4 6 7 7 5 4 4 3
assigned: a c e f d c b c c c c c d f g g e d d c
Try this:
A <- c(1,3,5,6,4,3,2,3,3,3,3,3,4,6,7,7,5,4,4,3)
B <- letters[1:7]
B[match(A, sort(unique(A)))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d" "f" "g"
# [16] "g" "e" "d" "d" "c"
Another option that handles the general case that @JoshO'Brien addresses would be
B[as.numeric(factor(A))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
A2<-ifelse(A > 4, A + 1, A)
# [1] 1 3 6 7 4 3 2 3 3 3 3 3 4 7 8 8 6 4 4 3
B[as.numeric(factor(A2))]
# [1] "a" "c" "e" "f" "d" "c" "b" "c" "c" "c" "c" "c" "d"
# [14] "f" "g" "g" "e" "d" "d" "c"
However, the following benchmark shows that this method is slower than @JoshOBrien's.
library(microbenchmark)
B <- make.unique(rep(letters, length.out=1000))
A <- sample(seq_along(B), replace=TRUE)
unique_sort_match <- function() B[match(A, sort(unique(A)))]
factor_as.numeric <- function() B[as.numeric(factor(A))]
bm<-microbenchmark(unique_sort_match(), factor_as.numeric(), times=1000L)
plot(bm)
To elaborate on the comments in @Josh's answer:
If A does in fact represent a permutation of the elements of B (i.e., where a 1 in A represents the first element of B, a 4 in A represents the 4th element in B, etc.), then as @Matthew Plourde points out, you would want to simply use A as your index to B:
B[A]
If A does not represent a permutation of B, then you should use the method suggested by @Josh.
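The difference between the two cases shows up as soon as the values of A have gaps. A small sketch with made-up vectors (A, B, direct and ranked are illustrative here, not the data from the question):

```r
A <- c(1, 3, 8, 3)      # three unique values, but not 1, 2, 3
B <- letters[1:3]

# Direct indexing treats A as positions in B: 8 is out of range
direct <- B[A]

# Rank-based lookup maps the smallest value to B[1], the next to B[2], ...
ranked <- B[match(A, sort(unique(A)))]
```

Here direct is "a" "c" NA "c", while ranked is "a" "b" "c" "b", which is the assignment by order that the question describes.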

match() to list of vectors - of possibly different lengths

The match(x, y) function is perfect for searching for the elements of the vector x within the elements of the vector y. But what is an efficient and easy way to do a similar job when y is a list of vectors, possibly of different lengths?
I mean the result should be a vector of the same length as x, and the i-th element should be the index of the first member of y that contains the i-th element of x, or NA.
To find the element of y in which each element of x (first) occurs, try this:
## First, a reproducible example
set.seed(44)
x <- letters[1:25]
y <- replicate(4, list(sample(letters, 8)))
y
# [[1]]
# [1] "t" "h" "m" "n" "a" "d" "i" "b"
#
# [[2]]
# [1] "c" "l" "z" "a" "s" "d" "i" "u"
#
# [[3]]
# [1] "b" "k" "e" "g" "o" "i" "h" "j"
#
# [[4]]
# [1] "g" "i" "f" "r" "h" "w" "l" "o"
## Find the element of y first containing each of the letters a-y
breaks <- c(0, cumsum(lengths(y))) + 1
findInterval(match(x, unlist(y)), breaks)
# [1] 1 1 2 1 3 4 3 1 1 3 3 2 1 1 3 NA NA 4 2 1 2 NA 4 NA NA
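An alternative to findInterval, sketched here on a small made-up list, is to build a vector recording which list element each entry of unlist(y) came from and then index it with the match result (owner and res are hypothetical names):

```r
# Small hypothetical example: vectors of different lengths
y <- list(c("t", "h", "a"), c("c", "a"), c("b", "k", "t"))
x <- c("a", "b", "c", "z")

# owner[i] is the index of the list element that unlist(y)[i] belongs to
owner <- rep(seq_along(y), lengths(y))

# match() finds the first position in the flattened vector;
# indexing owner with NA yields NA for values that occur nowhere
res <- owner[match(x, unlist(y))]
res
```

Since match returns the first position in the flattened vector, "a" is attributed to the first list element even though it also appears in the second.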

Resources