Retrieving a vector from one of its permutations in R-software

In R-software, suppose you have a vector N1 of length n.
n <- 10
N1 <- letters[rbinom(n, size = 20, prob = 0.5)]
names(N1) <- seq(n)
Suppose you have another vector N2 that is a permutation of the elements of N1
N2 <- sample(N1, size = n, replace = FALSE)
I was wondering if you could help me to find a function in R-software that receives N2 as input and obtains N1 as output, please. Thanks a lot for your help.

Just a guess:
set.seed(2)
n <- 10
N1 <- letters[rbinom(n, size = 20, prob = 0.5)]
names(N1) <- seq(n)
N1
# 1 2 3 4 5 6 7 8 9 10
# "h" "k" "j" "h" "n" "n" "g" "l" "j" "j"
Having repeats makes it difficult to find an inverse function from the values alone, since there is not a 1-to-1 mapping between values and positions. However, if ...
ind <- sample(n)
ind
# [1] 6 3 7 2 9 5 4 1 10 8
N2 <- N1[ind]
N2
# 6 3 7 2 9 5 4 1 10 8
# "n" "j" "g" "k" "j" "n" "h" "h" "j" "l"
We have the same effect that you were doing before, except ...
N2[order(ind)]
# 1 2 3 4 5 6 7 8 9 10
# "h" "k" "j" "h" "n" "n" "g" "l" "j" "j"
all(N1 == N2[order(ind)])
# [1] TRUE
This allows you to get a reverse mapping from some function on N2:
toupper(N2)[order(ind)]
# 1 2 3 4 5 6 7 8 9 10
# "H" "K" "J" "H" "N" "N" "G" "L" "J" "J"
regardless of whether you have an assured 1-to-1 mapping.
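When the original positions travel with the vector as names, as they do in the question (names(N1) <- seq(n)), the shuffle can also be undone without keeping ind around. The following is a minimal sketch under that assumption; the literal values are copied from the output above, and `recovered` is just an illustrative name.

```r
# Values copied from the example output above; names hold the original positions.
N1 <- c("h", "k", "j", "h", "n", "n", "g", "l", "j", "j")
names(N1) <- seq_along(N1)

# sample() on a named vector keeps each element's name attached to it.
N2 <- sample(N1, size = length(N1), replace = FALSE)

# Sorting by the (numeric) names restores the original order,
# even though the values themselves contain repeats.
recovered <- N2[order(as.integer(names(N2)))]
identical(recovered, N1)
```

This only works because the names encode the original order; if N2 arrives without names, or with names unrelated to position, the permutation index still has to be stored separately.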

Related

Extract names of neighbour of a collection of nodes as list

I want to extract the names of the neighbors of selected nodes in igraph as a list.
This is what I have so far:
library(igraph)
set.seed(100)
g<-erdos.renyi.game(26, 0.4)
V(g)$name<-letters
x<-neighborhood(g, order = 1, V(g)$name %in% c('a', 'd', 'z'))
In the above example, I want to extract the names of the neighbours of nodes a,d, and z. And this is the output I am getting:
[[1]]
+ 9/26 vertices, named, from 6ba7060:
[1] a c h i l q w x z
[[2]]
+ 11/26 vertices, named, from 6ba7060:
[1] d b c e g h i k l w y
[[3]]
+ 9/26 vertices, named, from 6ba7060:
[1] z a c g h l o u v
I want to make one long list of the names, with repetitions. The output should look like:
[1] "a" "c" "h" "i" "l" "q" "w" "x" "z" "d" "b" "c" "e" "g" "h" "i" "k" "l" "w" "y" "z" "a" "c"
[24] "g" "h" "l" "o" "u" "v"
Thus far I have tried unlist and various versions of x %>% map(2) %>% flatten() using the library purrr, but to no avail.
I am also not opposed to getting the output in the form of a data.frame or tibble with names in one column and count of occurrences in another.
Maybe you can try names(unlist(x)), e.g.,
> names(unlist(x))
[1] "a" "c" "h" "i" "l" "q" "w" "x" "z" "d" "b" "c" "e" "g" "h" "i" "k" "l" "w"
[20] "y" "z" "a" "c" "g" "h" "l" "o" "u" "v"
You can apply names() to each element of the list to get the vertex names as character values, and then unlist the result into a single character vector:
unlist(lapply(x, names))
If you would rather a data frame with counts, you can do:
stack(table(unlist(lapply(x, names))))[2:1]
#> ind values
#> 1 a 2
#> 2 b 1
#> 3 c 3
#> 4 d 1
#> 5 e 1
#> 6 g 2
#> 7 h 3
#> 8 i 2
#> 9 k 1
#> 10 l 3
#> 11 o 1
#> 12 q 1
#> 13 u 1
#> 14 v 1
#> 15 w 2
#> 16 x 1
#> 17 y 1
#> 18 z 2

Append vector not giving names

In RStudio, I am trying to create a vector of country names. The names are in column 1 of my data set. countryvec gives the factor names
"Australia Australia ..."
x just gives the name of Russia, country 36, and country ends up being
1,1,...,2,2,...,4,4.. etc.
They are also not in order: 3 ends up between 42 and 43. How do I turn the numbers back into the factor names?
gdppc=read.xlsx("H:/dissertation/ALL/YAS.xlsx",sheetIndex = 1,startRow = 1)
countryvec=gdppc[,1]
country=c()
for (j in 1:43){
x=rep(countryvec[j],25)
country=append(country,x)
}
You need to retrieve the levels attribute
set.seed(7)
v <- factor(letters[rbinom(20, 10, .5)])
> c(v)
[1] 6 4 2 2 3 5 3 6 2 4 2 3 5 2 4 2 4 1 6 3
> levels(v)[v]
[1] "h" "e" "c" "c" "d" "f" "d" "h" "c" "e" "c" "d" "f" "c" "e" "c" "e" "a" "h" "d"
You'll probably need to modify the code inside the loop:
x <- rep(levels(countryvec)[countryvec][j], 25)
Or convert the vector prior to the loop:
countryvec <- levels(countryvec)[countryvec]
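As a sketch of the same idea without the loop: as.character() on a factor returns the same labels as levels(f)[f], and rep(..., each = 25) repeats each label 25 times in order. The three-country factor below is a hypothetical stand-in for column 1 of the spreadsheet, not the real data.

```r
# Hypothetical stand-in for gdppc[, 1]; the real column has 43 countries.
countryvec <- factor(c("Australia", "Russia", "Austria"))

# Convert the factor to its labels once, before any repetition.
country_names <- as.character(countryvec)

# Same labels as the levels(...)[...] idiom shown above.
identical(country_names, levels(countryvec)[countryvec])

# Repeat each country 25 times, keeping the original row order.
country <- rep(country_names, each = 25)
```

Appending a factor with append() or c() is what produced the integer codes in the question, so converting to character first avoids the problem entirely.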

How to plot frequency of all elements in one list appearing in another list

I have a long list of sequences as follows
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC etc.
I also have a shorter list and I would like to see how many times each element in the short list appears in the long list and plot it as a histogram. I suppose it's like a Vlookup function. How can I do this in R?
Try:
longlist = c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
"AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
"AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC"
)
shortlist = c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")
longlist
[1] "AAAAAACGTTATGATCGATC" "AAAATTCGCGCTTAGAGATC" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
[6] "AAGCTACGCATGCATCGACT" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
shortlist
[1] "AAAAAACGTTATGATCGATC" "AAGCTACGCATGCATCGACT"
outdf = data.frame(var=character(), freq=numeric(), stringsAsFactors=F)
for(i in 1:length(shortlist)) {outdf[i,]=c(shortlist[i], sum(longlist==shortlist[i]))}
outdf
var freq
1 AAAAAACGTTATGATCGATC 5
2 AAGCTACGCATGCATCGACT 3
outdf$freq = as.numeric(outdf$freq)
barplot(outdf$freq, names.arg=outdf$var)
You can easily use the following to see the frequencies and a barplot of the full longlist:
table(longlist)
longlist
AAAAAACGTTATGATCGATC AAAATTCGCGCTTAGAGATC AAGCTACGCATGCATCGACT
5 1 3
barplot(table(longlist))
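The loop and the table() call can be combined: a possible sketch is to turn longlist into a factor whose levels are exactly the shortlist entries, so that table() counts only those (everything else becomes NA and is dropped by default). The data are the same vectors as above; `counts` is an illustrative name.

```r
longlist <- c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
              "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
              "AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC")
shortlist <- c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")

# Restricting the levels to shortlist keeps zero counts for entries that
# never occur and ignores longlist entries that are not in shortlist.
counts <- table(factor(longlist, levels = shortlist))
counts
```

Passing counts to barplot() then gives the same plot as the loop-based version, with the bars in shortlist order.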
match and table should work for your character vectors. Here's an example using just random letters:
set.seed(1492)
dat <- sample(c(letters, LETTERS), 100, replace=TRUE)
dat
## [1] "o" "l" "j" "f" "c" "a" "S" "A" "u" "N" "H" "H" "k" "B" "B" "P" "g"
## [18] "r" "I" "V" "H" "t" "g" "F" "e" "W" "E" "D" "r" "Y" "h" "Z" "R" "l"
## [35] "Z" "K" "v" "f" "b" "q" "M" "P" "i" "u" "w" "m" "S" "g" "f" "g" "G"
## [52] "h" "q" "T" "J" "M" "K" "m" "X" "Q" "f" "x" "t" "B" "k" "z" "I" "Y"
## [69] "z" "g" "z" "u" "O" "k" "G" "L" "n" "B" "A" "A" "J" "p" "U" "F" "E"
## [86] "X" "R" "J" "G" "L" "H" "o" "z" "r" "d" "r" "V" "H" "S" "I"
matches <- match(dat, LETTERS)
match_counts <- table(matches[!is.na(matches)])
match_counts
##
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
names(match_counts) <- LETTERS[as.numeric(names(match_counts))]
match_counts
## A B D E F G H I J K L M N O P Q R S T U V W X Y Z
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
barplot(sort(match_counts), col="#649388")
Assuming that the sequences are strings.
lines <- readLines(n=6)
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
shortlist <- readLines(n=1)
AGTD
Here, I am assuming that each element of the short list is an individual character, as the question does not make this clear.
pat1 <- gsub("(?<=[A-Za-z])(?=[A-Za-z])", "|", shortlist, perl=TRUE)
pat1
#[1] "A|G|T|D"
library(stringr)
lvls <- unique(str_extract_all(shortlist, "[A-Za-z]")[[1]])
t1 <- table(factor(unlist(regmatches(lines,gregexpr(pat1, lines))), levels=lvls))
t1
#
# A G T D
#47 21 29 0
barplot(t1, col="#649388")
Update
If your shortlist looks like the one below and you want the frequency of each whole string rather than of the characters in the string:
shortlist1 <- readLines(n=4)
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATCT
AAAAAACG
pat2 <- paste0("^(", paste(shortlist1, collapse="|"), ")$")  # group the alternatives so ^ and $ apply to each one
lvls1 <- unique(shortlist1)
t2 <- table(factor(unlist(regmatches(lines,gregexpr(pat2, lines))), levels=lvls1))
t2
#AAGCTACGCATGCATCGACT AAAAAACGTTATGATCGATC AAAAAACGTTATCT
# 1 3 0
# AAAAAACG
# 0
barplot(t2, col="#649388")
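A small sketch of why the alternation in a whole-string pattern like pat2 is worth wrapping in a group before anchoring: without parentheses, ^ attaches only to the first alternative and $ only to the last. The toy strings below are hypothetical and chosen only to show the difference.

```r
# Hypothetical toy strings, not sequences from the question.
candidates <- c("AB", "CD", "ABxx", "xxCD")

# Ungrouped: reads as "starts with AB" OR "ends with CD".
loose <- grepl("^AB|CD$", candidates)

# Grouped: the whole string must be exactly AB or exactly CD.
strict <- grepl("^(AB|CD)$", candidates)

data.frame(candidates, loose, strict)
```

With the example data in this answer both forms happen to give the same counts, but the grouped form is the one that enforces whole-string matches in general.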

Detect different levels in a column

I have a column with about 80k entries that has only 22 different levels (the chromosome number). Is there a quick trick in R to find the position at which one level changes into the next, i.e. the row at which chromosome 1 changes to chromosome 2? (All entries for a single chromosome are listed together.)
My data looks like this:
chr number marker name (SNP)
1 rs...
1 rs...
.
.
2
thanks
You can use unique and match from base R:
data <- c(rep("a",10),rep("b",5),rep("c",2),rep("d",10))
match( unique(data) , data )
#[1] 1 11 16 18
match returns a vector of the positions of the first match of each element of its first argument in its second argument. This works because all your entries for a chromosome are listed together.
Check for diff being nonzero. This returns a logical vector which is TRUE when consecutive values aren't the same. Wrap it with which to get numeric indices; each index marks the last element before a change.
(x <- factor(sample(c("a", "b"), 15, replace = TRUE)))
# [1] a a b b a a b b b b b a b a a
# Levels: a b
diff(as.integer(x)) != 0
# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
which(diff(as.integer(x)) != 0)
# [1] 2 4 6 11 12 13
If all your chromosome values are grouped together, you can find the first instance of each level with duplicated.
(x2 <- factor(rep(c("a", "b", "c"), times = c(3, 4, 6))))
# [1] a a a b b b b c c c c c c
# Levels: a b c
!duplicated(x2)
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which(!duplicated(x2))
# [1] 1 4 8
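If the chromosome column is stored as character rather than as a factor, as.integer() may not apply (for example with a value such as "X"). A possible sketch for that case compares each element with its predecessor directly; `chr` and `change_rows` are illustrative names, and the data are made up.

```r
# Made-up chromosome labels, grouped as in the question.
chr <- rep(c("1", "2", "X"), times = c(3, 4, 6))

# Compare element i + 1 with element i; TRUE marks a change between them.
changed <- chr[-1] != chr[-length(chr)]

# Adding 1 turns "last row before the change" into "first row of the new value".
change_rows <- which(changed) + 1
change_rows
```

Prepending 1 to change_rows gives the first row of every chromosome, which matches what which(!duplicated(...)) returns for grouped data.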
You could use rle for this (if I get your question right):
x <- rep(LETTERS[1:22], each = 3)
x
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C" "D" "D" "D" "E" "E" "E" "F" "F" "F" "G" "G" "G" "H" "H" "H"
# [25] "I" "I" "I" "J" "J" "J" "K" "K" "K" "L" "L" "L" "M" "M" "M" "N" "N" "N" "O" "O" "O" "P" "P" "P"
# [49] "Q" "Q" "Q" "R" "R" "R" "S" "S" "S" "T" "T" "T" "U" "U" "U" "V" "V" "V"
rles <- rle(x)
cumsum(rles$lengths)
# [1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66
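Note that cumsum(rles$lengths) gives the last row of each run. A minimal sketch of how the first row of each run could be derived from the same rle object follows; the short vector and the names `starts` and `ends` are illustrative.

```r
# Short made-up example with runs of different lengths.
x <- rep(c("A", "B", "C"), times = c(3, 4, 6))
r <- rle(x)

ends   <- cumsum(r$lengths)        # last row of each run
starts <- ends - r$lengths + 1     # first row of each run

data.frame(value = r$values, start = starts, end = ends)
```

The start column is the row at which each new level begins, which is what the question asks for.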

match() to list of vectors - of possibly different lengths

The match(x, y) function is perfect to search the elements of the vector x within the elements of vector y. But what is an efficient and easy way to do the similar job when y is a list of vectors - of possibly different lengths?
I mean the result should be a vector of the same length as x, and the i-th element should be the first member of y that contains the i-th element of x, or NA.
To find the element of y in which each element of x (first) occurs, try this:
## First, a reproducible example
set.seed(44)
x <- letters[1:25]
y <- replicate(4, list(sample(letters, 8)))
y
# [[1]]
# [1] "t" "h" "m" "n" "a" "d" "i" "b"
#
# [[2]]
# [1] "c" "l" "z" "a" "s" "d" "i" "u"
#
# [[3]]
# [1] "b" "k" "e" "g" "o" "i" "h" "j"
#
# [[4]]
# [1] "g" "i" "f" "r" "h" "w" "l" "o"
## Find the element of y first containing each of the letters a-y
breaks <- c(0, cumsum(sapply(y, length))) + 1
findInterval(match(x, unlist(y)), breaks)
# [1] 1 1 2 1 3 4 3 1 1 3 3 2 1 1 3 NA NA 4 2 1 2 NA 4 NA NA
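An alternative sketch of the same lookup, which some may find easier to read: build a vector that records, for every position in unlist(y), which list element it came from, and index it with the result of match(). The small x and y below are made up for illustration, and `owner` is a hypothetical name.

```r
# Made-up example; "b" occurs in two list elements, "z" in none.
y <- list(c("a", "b"), c("b", "c", "d"), "e")
x <- c("b", "d", "z", "e")

# owner[k] is the index of the list element that supplied unlist(y)[k].
owner <- rep(seq_along(y), lengths(y))

# match() finds the first position in the flattened vector; NA stays NA.
res <- owner[match(x, unlist(y))]
res

# Same answer as the findInterval() approach above.
breaks <- c(0, cumsum(lengths(y))) + 1
identical(res, findInterval(match(x, unlist(y)), breaks))
```

Because match() always reports the first occurrence in the flattened vector, ties are resolved in favour of the earliest list element in both versions.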
