Retrieving a vector from one of its permutations in R-software

In R-software, suppose you have a vector N1 of length n.
n <- 10
N1 <- letters[rbinom(n, size = 20, prob = 0.5)]
names(N1) <- seq(n)
Suppose you have another vector N2 that is a permutation of the elements of N1
N2 <- sample(N1, size = n, replace = FALSE)
I was wondering if you could help me to find a function in R-software that receives N2 as input and obtains N1 as output, please. Thanks a lot for your help.

Just a guess:
set.seed(2)
n <- 10
N1 <- letters[rbinom(n, size = 20, prob = 0.5)]
names(N1) <- seq(n)
N1
# 1 2 3 4 5 6 7 8 9 10
# "h" "k" "j" "h" "n" "n" "g" "l" "j" "j"
Having repeats makes it difficult to find a return function, since there is not a 1-to-1 mapping. However, if ...
ind <- sample(n)
ind
# [1] 6 3 7 2 9 5 4 1 10 8
N2 <- N1[ind]
N2
# 6 3 7 2 9 5 4 1 10 8
# "n" "j" "g" "k" "j" "n" "h" "h" "j" "l"
We have the same effect that you were doing before, except ...
N2[order(ind)]
# 1 2 3 4 5 6 7 8 9 10
# "h" "k" "j" "h" "n" "n" "g" "l" "j" "j"
all(N1 == N2[order(ind)])
# [1] TRUE
This allows you to get a reverse mapping from some function on N2:
toupper(N2)[order(ind)]
# 1 2 3 4 5 6 7 8 9 10
# "H" "K" "J" "H" "N" "N" "G" "L" "J" "J"
regardless of whether you have an assured 1-to-1 mapping.
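If the index used for the shuffle was not kept, the names set on N1 in the question can play the same role, because sample() and subsetting carry names along with the values. A minimal sketch, assuming the names are the original positions 1..n (as with names(N1) <- seq(n)); the variable name restored is only illustrative, and N1 is hard-coded here so the example does not depend on a random draw:

```r
# Fixed example data (same values as the printed N1 above)
N1 <- c("h", "k", "j", "h", "n", "n", "g", "l", "j", "j")
names(N1) <- seq_along(N1)

set.seed(1)
N2 <- sample(N1)   # a permutation; the names travel with the values

# Sort N2 by its names (converted back to numbers) to undo the permutation
restored <- N2[order(as.integer(names(N2)))]

identical(restored, N1)
```

This only works because the names are unique; the repeated values themselves are not enough to recover the original order.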

Related

Extract names of neighbour of a collection of nodes as list

I want to extract the names of the neighbors of selected nodes in igraph as a list.
This is what I have so far:
library(igraph)
set.seed(100)
g<-erdos.renyi.game(26, 0.4)
V(g)$name<-letters
x<-neighborhood(g, order = 1, V(g)$name %in% c('a', 'd', 'z'))
In the above example, I want to extract the names of the neighbours of nodes a, d, and z. This is the output I am getting:
[[1]]
+ 9/26 vertices, named, from 6ba7060:
[1] a c h i l q w x z
[[2]]
+ 11/26 vertices, named, from 6ba7060:
[1] d b c e g h i k l w y
[[3]]
+ 9/26 vertices, named, from 6ba7060:
[1] z a c g h l o u v
I want to make one long vector of the names, with repetitions. The output should look like:
[1] "a" "c" "h" "i" "l" "q" "w" "x" "z" "d" "b" "c" "e" "g" "h" "i" "k" "l" "w" "y" "z" "a" "c"
[24] "g" "h" "l" "o" "u" "v"
Thus far I have tried unlist and various versions of x %>% map(2) %>% flatten() using the library purrr, but to no avail.
I am also not opposed to getting the output in the form of a data.frame or tibble with names in one column and the count of occurrences in another.

Maybe you can try names(unlist(x)), e.g.,
> names(unlist(x))
[1] "a" "c" "h" "i" "l" "q" "w" "x" "z" "d" "b" "c" "e" "g" "h" "i" "k" "l" "w"
[20] "y" "z" "a" "c" "g" "h" "l" "o" "u" "v"

You can apply names() to each element in the list to get the vertex names as character values, and then unlist the result into a single character vector:
unlist(lapply(x, names))

If you would rather a data frame with counts, you can do:
stack(table(unlist(lapply(x, names))))[2:1]
#> ind values
#> 1 a 2
#> 2 b 1
#> 3 c 3
#> 4 d 1
#> 5 e 1
#> 6 g 2
#> 7 h 3
#> 8 i 2
#> 9 k 1
#> 10 l 3
#> 11 o 1
#> 12 q 1
#> 13 u 1
#> 14 v 1
#> 15 w 2
#> 16 x 1
#> 17 y 1
#> 18 z 2
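The same counting step can be tried without igraph. The sketch below uses a plain list of named vectors as a stand-in for the neighborhood() result (hypothetical data, not the graph from the question) and builds the count table with as.data.frame(), which gives explicit column names:

```r
# Stand-in for the igraph result: a list of named vectors
# (made-up values; only the names matter here)
x <- list(
  c(a = 1, c = 3, z = 26),
  c(d = 4, c = 3),
  c(z = 26, a = 1, c = 3)
)

# One long character vector of names, repetitions included
all_names <- unlist(lapply(x, names))

# Two-column data frame: name and number of occurrences
counts <- as.data.frame(table(name = all_names), stringsAsFactors = FALSE)
counts[order(-counts$Freq), ]
```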

Append vector not giving names

In RStudio, I want to create a vector of country names. The names are in column 1 of my data set. countryvec gives the factor names
"Australia Australia ..."
x only gives the name of Russia (country 36), and country ends up being
1,1,...,2,2,...,4,4.. etc.
The numbers are also not in order: 3 ends up between 42 and 43. How do I get the factor names instead of the numbers?
gdppc=read.xlsx("H:/dissertation/ALL/YAS.xlsx",sheetIndex = 1,startRow = 1)
countryvec=gdppc[,1]
country=c()
for (j in 1:43){
x=rep(countryvec[j],25)
country=append(country,x)
}

You need to retrieve the levels attribute
set.seed(7)
v <- factor(letters[rbinom(20, 10, .5)])
> as.integer(v)  # the integer codes (c(v) returns a factor since R 4.1.0)
[1] 6 4 2 2 3 5 3 6 2 4 2 3 5 2 4 2 4 1 6 3
> levels(v)[v]
[1] "h" "e" "c" "c" "d" "f" "d" "h" "c" "e" "c" "d" "f" "c" "e" "c" "e" "a" "h" "d"
You'll probably need to modify the code inside the loop to:
x <- rep(levels(countryvec)[countryvec][j], 25)
Or convert the vector prior to the loop:
countryvec <- levels(countryvec)[countryvec]
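As a possible alternative to the loop, as.character() returns the labels of a factor, and rep() with each = 25 repeats every label in place. A minimal sketch, assuming a three-country factor as a stand-in for the spreadsheet column (the data here is made up):

```r
# Hypothetical stand-in for the first column of the spreadsheet
countryvec <- factor(c("Australia", "Austria", "Belgium"))

# as.character() gives the labels rather than the integer codes;
# each = 25 repeats every country 25 times, keeping the original order
country <- rep(as.character(countryvec), each = 25)

head(country, 3)
length(country)
```

The result is a plain character vector, so append() never sees the integer codes.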

How to plot frequency of all elements in one list appearing in another list

I have a long list of sequences as follows
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC etc.
I also have a shorter list, and I would like to see how many times each element in the short list appears in the long list and plot the counts as a histogram. I suppose it's like a VLOOKUP function. How can I do this in R?

Try:
longlist = c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
"AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
"AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC"
)
shortlist = c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")
longlist
[1] "AAAAAACGTTATGATCGATC" "AAAATTCGCGCTTAGAGATC" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
[6] "AAGCTACGCATGCATCGACT" "AAGCTACGCATGCATCGACT" "AAAAAACGTTATGATCGATC" "AAAAAACGTTATGATCGATC"
shortlist
[1] "AAAAAACGTTATGATCGATC" "AAGCTACGCATGCATCGACT"
outdf = data.frame(var=character(), freq=numeric(), stringsAsFactors=FALSE)
for(i in 1:length(shortlist)) {outdf[i,]=c(shortlist[i], sum(longlist==shortlist[i]))}
outdf
var freq
1 AAAAAACGTTATGATCGATC 5
2 AAGCTACGCATGCATCGACT 3
outdf$freq = as.numeric(outdf$freq)
barplot(outdf$freq, names.arg=outdf$var)
You can easily use the following to see the frequencies and a barplot of the full longlist:
table(longlist)
longlist
AAAAAACGTTATGATCGATC AAAATTCGCGCTTAGAGATC AAGCTACGCATGCATCGACT
5 1 3
barplot(table(longlist))
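The loop can also be replaced by a single table() call. The sketch below restricts the counts to the short list by turning longlist into a factor whose levels are the shortlist entries; values outside the levels become NA and are dropped by table() by default. It reuses the longlist and shortlist values from above:

```r
longlist <- c("AAAAAACGTTATGATCGATC", "AAAATTCGCGCTTAGAGATC", "AAGCTACGCATGCATCGACT",
              "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT",
              "AAGCTACGCATGCATCGACT", "AAAAAACGTTATGATCGATC", "AAAAAACGTTATGATCGATC")
shortlist <- c("AAAAAACGTTATGATCGATC", "AAGCTACGCATGCATCGACT")

# Count only the shortlist entries; shortlist entries that never occur get 0
counts <- table(factor(longlist, levels = shortlist))
counts

# barplot(counts) would then plot these frequencies
```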

match and table should work for your character vectors. Here's an example using just random letters:
set.seed(1492)
dat <- sample(c(letters, LETTERS), 100, replace=TRUE)
dat
## [1] "o" "l" "j" "f" "c" "a" "S" "A" "u" "N" "H" "H" "k" "B" "B" "P" "g"
## [18] "r" "I" "V" "H" "t" "g" "F" "e" "W" "E" "D" "r" "Y" "h" "Z" "R" "l"
## [35] "Z" "K" "v" "f" "b" "q" "M" "P" "i" "u" "w" "m" "S" "g" "f" "g" "G"
## [52] "h" "q" "T" "J" "M" "K" "m" "X" "Q" "f" "x" "t" "B" "k" "z" "I" "Y"
## [69] "z" "g" "z" "u" "O" "k" "G" "L" "n" "B" "A" "A" "J" "p" "U" "F" "E"
## [86] "X" "R" "J" "G" "L" "H" "o" "z" "r" "d" "r" "V" "H" "S" "I"
matches <- match(dat, LETTERS)
match_counts <- table(matches[!is.na(matches)])
match_counts
##
## 1 2 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
names(match_counts) <- LETTERS[as.numeric(names(match_counts))]
match_counts
## A B D E F G H I J K L M N O P Q R S T U V W X Y Z
## 3 4 1 2 2 3 5 3 3 2 2 2 1 1 2 1 2 3 1 1 2 1 2 2 2
barplot(sort(match_counts), col="#649388")

Assuming that the sequences are strings.
lines <- readLines(n=6)
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATGATCGATC
AAAATTCGCGCTTAGAGATC
shortlist <- readLines(n=1)
AGTD
Here, I am assuming that each element of the shortlist is an individual character, as it is not clear from the question.
pat1 <- gsub("(?<=[A-Za-z])(?=[A-Za-z])", "|", shortlist, perl=TRUE)
pat1
#[1] "A|G|T|D"
library(stringr)
lvls <- unique(str_extract_all(shortlist, "[A-Za-z]")[[1]])
t1 <- table(factor(unlist(regmatches(lines,gregexpr(pat1, lines))), levels=lvls))
t1
#
# A G T D
#47 21 29 0
barplot(t1, col="#649388")
Update
If your shortlist is like the one below and you want the frequencies of each whole string instead of the characters in the string:
shortlist1 <- readLines(n=4)
AAGCTACGCATGCATCGACT
AAAAAACGTTATGATCGATC
AAAAAACGTTATCT
AAAAAACG
pat2 <- paste0("^(", paste(shortlist1, collapse="|"), ")$")  # group the alternatives so ^ and $ apply to all of them
lvls1 <- unique(shortlist1)
t2 <- table(factor(unlist(regmatches(lines,gregexpr(pat2, lines))), levels=lvls1))
t2
#AAGCTACGCATGCATCGACT AAAAAACGTTATGATCGATC AAAAAACGTTATCT
# 1 3 0
# AAAAAACG
# 0
barplot(t2, col="#649388")
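When a pattern for whole-string matching is built from several alternatives, note that | has the lowest precedence in a regular expression, so ^ and $ only apply to every alternative if the alternation is wrapped in parentheses. A small sketch with made-up strings (pats, loose and strict are illustrative names only):

```r
pats <- c("AB", "CD")

loose  <- paste0("^", paste(pats, collapse = "|"), "$")    # "^AB|CD$"
strict <- paste0("^(", paste(pats, collapse = "|"), ")$")  # "^(AB|CD)$"

candidates <- c("AB", "CD", "ABX", "XCD")

# loose: "starts with AB" OR "ends with CD", so partial hits slip through
grepl(loose, candidates)

# strict: the whole string must be exactly AB or exactly CD
grepl(strict, candidates)
```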

Detect different levels in a column

I have a column with about 80k entries that has only 22 different levels (the chromosome number). Is there a quick trick in R to find the position at which one level changes into the next, i.e. the row at which chromosome 1 changes to chromosome 2? (All entries for a single chromosome are listed together.)
My data looks like this:
chr number marker name (SNP)
1 rs...
1 rs...
.
.
2
thanks

You can use unique and match from base R:
data <- c(rep("a",10),rep("b",5),rep("c",2),rep("d",10))
match( unique(data) , data )
#[1] 1 11 16 18
match returns a vector of the positions of the first match of each element of its first argument in its second argument. This works because all your entries for a chromosome are listed together.

Check for diff being nonzero. This returns a logical vector which is TRUE when consecutive values aren't the same. Wrap it with which to get numeric indices.
(x <- factor(sample(c("a", "b"), 15, replace = TRUE)))
# [1] a a b b a a b b b b b a b a a
# Levels: a b
diff(as.integer(x)) != 0
# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
which(diff(as.integer(x)) != 0)
# [1] 2 4 6 11 12 13
If all your chromosome values are grouped together, you can find the first instance of each level with duplicated.
(x2 <- factor(rep(c("a", "b", "c"), times = c(3, 4, 6))))
# [1] a a a b b b b c c c c c c
# Levels: a b c
!duplicated(x2)
# [1] TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
which(!duplicated(x2))
# [1] 1 4 8
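Note that which(diff(...) != 0) reports the last row before each change, while !duplicated() reports the first row of each level. Adding 1 to the diff result lines the two up. A short sketch on made-up chromosome numbers (the vector chr is hypothetical):

```r
# Hypothetical chromosome column, grouped by chromosome
chr <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)

last_before_change <- which(diff(chr) != 0)    # last row of each block except the final one
first_of_new_level <- last_before_change + 1   # first row of each following block

# Row 1 always starts the first block
starts <- c(1, first_of_new_level)
starts

# Same positions via duplicated(), valid when each level forms one block
which(!duplicated(chr))
```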

You could use rle for this (if I get your question right):
x <- rep(LETTERS[1:22], each = 3)
x
# [1] "A" "A" "A" "B" "B" "B" "C" "C" "C" "D" "D" "D" "E" "E" "E" "F" "F" "F" "G" "G" "G" "H" "H" "H"
# [25] "I" "I" "I" "J" "J" "J" "K" "K" "K" "L" "L" "L" "M" "M" "M" "N" "N" "N" "O" "O" "O" "P" "P" "P"
# [49] "Q" "Q" "Q" "R" "R" "R" "S" "S" "S" "T" "T" "T" "U" "U" "U" "V" "V" "V"
rles <- rle(x)
cumsum(rles$lengths)
# [1] 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66
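The cumulative sums above are the last row of each run. The first row of each run follows by subtracting the run length and adding 1. A minimal sketch on made-up data (chr and blocks are illustrative names) that collects both into one data frame:

```r
# Hypothetical chromosome column with blocks of different sizes
chr <- rep(c("1", "2", "3"), times = c(4, 2, 3))

runs   <- rle(chr)
ends   <- cumsum(runs$lengths)        # last row of each block
starts <- ends - runs$lengths + 1     # first row of each block

blocks <- data.frame(chr = runs$values, start = starts, end = ends)
blocks
```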

match() to list of vectors - of possibly different lengths

The match(x, y) function is perfect for searching for the elements of the vector x within the elements of the vector y. But what is an efficient and easy way to do a similar job when y is a list of vectors, possibly of different lengths?
I mean the result should be a vector of the same length as x, and the i-th element should identify the first member of y that contains the i-th element of x, or be NA.

To find the element of y in which each element of x (first) occurs, try this:
## First, a reproducible example
set.seed(44)
x <- letters[1:25]
y <- replicate(4, list(sample(letters, 8)))
y
# [[1]]
# [1] "t" "h" "m" "n" "a" "d" "i" "b"
#
# [[2]]
# [1] "c" "l" "z" "a" "s" "d" "i" "u"
#
# [[3]]
# [1] "b" "k" "e" "g" "o" "i" "h" "j"
#
# [[4]]
# [1] "g" "i" "f" "r" "h" "w" "l" "o"
## Find the element of y first containing the letters a-j
breaks <- c(0, cumsum(sapply(y, length))) + 1
findInterval(match(x, unlist(y)), breaks)
# [1] 1 1 2 1 3 4 3 1 1 3 3 2 1 1 3 NA NA 4 2 1 2 NA 4 NA NA
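The same lookup can be wrapped in a small function. The sketch below is one possible variant, not the method above: instead of findInterval() it builds an "owner" vector that records, for every flattened element, which list member it came from. The name match_list is hypothetical, and the example data is made up:

```r
# match_list() is a hypothetical helper name, not a base R function
match_list <- function(x, y) {
  owner <- rep(seq_along(y), lengths(y))  # list index of every flattened element
  owner[match(x, unlist(y))]              # NA where x is not found anywhere
}

y <- list(c("a", "b"), c("b", "c", "d"), "e")
x <- c("b", "d", "z", "e", "a")

res <- match_list(x, y)
res

# Cross-check against the findInterval() approach
breaks <- c(0, cumsum(lengths(y))) + 1
findInterval(match(x, unlist(y)), breaks)
```

Because match() returns the first position in the flattened vector, an element that occurs in several list members is assigned to the earliest one.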
