My code snippet is as below:
df = as.data.frame(rbind(
c("a","b",2),
c("b","d",2),
c("d","g",2),
c("g","j",8),
c("j","i",2),
c("i","f",6),
c("f","c",2),
c("c","a",4),
c("c","e",4),
c("e","h",2),
c("h","j",4),
c("e","g",1),
c("e","i",3),
c("e","b",7)
))
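names(df) = c("start_node","end_node","dist")
# rbind() of mixed vectors yields text, so make the weights numeric
df$dist <- as.numeric(as.character(df$dist))
# Convert this to "igraph" class (requires the igraph package)
library(igraph)
gdf <- graph.data.frame(df, directed=FALSE)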
# Compute the min distances from 'a' to all other vertices
dst_a <- shortest.paths(gdf,v='a',weights=E(gdf)$dist)
# Compute the min distances from 'a' to 'j'
dst_a[1, which(V(gdf)$name == 'j')]
While it returns the result 12, I need to get the shortest path which in this case should be a - b - d - g - e - i - j. I have tried to use get.shortest.paths(), but in vain.
Try using get.all.shortest.paths(). Keep in mind that there may be more than one shortest path (e.g. try the same between 'a' and 'e').
sp=get.all.shortest.paths(gdf, "a", "j",weights=E(gdf)$dist)
sp
$res
$res[[1]]
[1] 1 2 3 4 9 6 5
$nrgeo
[1] 1 1 1 1 1 1 1 1 1 1
V(gdf)[sp$res[[1]]]$name
[1] "a" "b" "d" "g" "e" "i" "j"
What did you try with get.shortest.paths? Because this works:
> V(gdf)[get.shortest.paths(gdf,"a","j",weights=E(gdf)$dist)[[1]]]
Vertex sequence:
[1] "a" "b" "d" "g" "e" "i" "j"
get.shortest.paths returns a list of length 1 because I'm only asking it to calculate the shortest path from "a" to "j", so I take the first element of it.
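For readers on a newer igraph release, the same lookup can be written with the underscore-style names. This is only a sketch: it assumes an igraph version that provides graph_from_data_frame(), shortest_paths(), distances() and as_ids(), and it rebuilds the question's edge list with numeric weights (the names edges, g, path_names and path_length are made up for illustration).

```r
library(igraph)

# Same edge list as in the question, with numeric weights
edges <- data.frame(
  start_node = c("a","b","d","g","j","i","f","c","c","e","h","e","e","e"),
  end_node   = c("b","d","g","j","i","f","c","a","e","h","j","g","i","b"),
  dist       = c(2, 2, 2, 8, 2, 6, 2, 4, 4, 2, 4, 1, 3, 7)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# shortest_paths() returns a list; $vpath holds one vertex sequence per target
sp <- shortest_paths(g, from = "a", to = "j", weights = E(g)$dist)
path_names <- as_ids(sp$vpath[[1]])

# distances() gives the total weight of that path
path_length <- distances(g, v = "a", to = "j", weights = E(g)$dist)[1, 1]
```

Here path_names holds the vertex names along the route and path_length its total weight, so the two can be checked against each other.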
Please consider the following example:
[[1]]
[1] 11 12 13 14
[[2]]
[1] 1 2 3
[[3]]
[1] 4
[[4]]
[1] 5
[[5]]
[1] 6
[[6]]
[1] 7
[[7]]
[1] 8
[[8]]
[1] 9
[[9]]
[1] 10
[[10]]
[1] 15
[[11]]
[1] 16
[[12]]
[1] 17
In this example, I have 12 unique values in a vector that is 17 elements long. For simplicity, let's say that this vector is:
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
The first code block shows the index positions in foo_bar of each of the unique values (the letters a–l).
I am attempting to write an algorithm that reorders foo_bar so that, for every index i except the final one (index 17 in the foo_bar example), positions i and i+1 never contain the same value. Here's an example of what would be an appropriate outcome:
reordered_foo_bar <- c("b","c","b","d","b","e","f","g","h","a","i","a","j","a","k","a", "l")
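To make the requirement testable, here is a small sketch of a checker. The name has_adjacent_duplicates is hypothetical (it is not from any package); it simply compares each element with its successor.

```r
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
reordered_foo_bar <- c("b","c","b","d","b","e","f","g","h","a","i","a","j","a","k","a", "l")

# TRUE if any element equals the element directly after it
has_adjacent_duplicates <- function(x) {
  n <- length(x)
  if (n < 2) return(FALSE)
  any(x[-1] == x[-n])
}

original_has_dups  <- has_adjacent_duplicates(foo_bar)
reordered_has_dups <- has_adjacent_duplicates(reordered_foo_bar)
```

Any candidate reordering can be passed through the same function before it is accepted.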
Something like this?
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
test <- FALSE
while (test == FALSE) {
new_foo_bar <- sample(foo_bar, size = length(foo_bar), replace = FALSE)
test <- length(rle(new_foo_bar)$lengths) == length(foo_bar)
}
new_foo_bar
# [1] "f" "a" "g" "b" "h" "d" "j" "c" "e" "i" "a" "b" "k" "a" "l" "a" "b"
First we identify the indices of each unique value in the vector. The %>% pipe used below comes from the magrittr package.
library(magrittr)
indices <-
unique(foo_bar) %>%
sort() %>%
lapply(function(x) which(foo_bar == x))
Then we create a position score from two numbers: 1) the rank of the value (the code ranks by the sorted order of the unique values, which in this example coincides with decreasing frequency) and 2) how many occurrences of this value have appeared so far. We add the two together, but to make sure a different value gets inserted between repeats, we divide 1) by 2 first. Finally, we order the position scores and reorder foo_bar with this new order.
This solution also still returns a result when it is not possible to keep duplicate values apart (for example because the values are c("a","a","b","a")).
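Spelled out step by step in base R, the scoring just described looks roughly like this. It is a sketch rather than the answer's own code: occ, rk, score and out_base are names made up here, and ties in the score are broken by rank so that the result lines up with the pipeline that follows.

```r
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")

# 2) occurrence number of each element within its own value: 1, 2, 3, ...
occ <- ave(seq_along(foo_bar), foo_bar, FUN = seq_along)

# 1) rank of each value in the sorted list of unique values
rk <- match(foo_bar, sort(unique(foo_bar)))

# position score: occurrence number plus half the rank
score <- occ + rk / 2

# order by score, breaking ties by rank
out_base <- foo_bar[order(score, rk)]
```

The magrittr pipeline that follows computes the same ordering in a single chain.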
out <-
lengths(indices) %>%
lapply(., function(x) 1:x) %>%
{lapply(seq_along(.), function(x) (unlist(.[x]) + x / 2))} %>%
unlist() %>%
order() %>%
{unlist(indices)[.]} %>%
foo_bar[.]
The output is then:
> "a" "b" "a" "c" "b" "d" "a" "e" "b" "f" "a" "g" "h" "i" "j" "k" "l"
I am a beginner in R, and I have a simple question about the language.
Thanks to many experts in this site, I am improving a lot.
I am always grateful for that, and anyone who's giving hand with this question, thank you in advance.
This is the code.
Data=sample(1:5,size=25,replace=T)
names(Data)=c("a","b","c","d","e")
I want to map each of 1, 2, 3, 4, 5 to a, b, c, d, e,
so I thought I could accomplish this with the code above.
I know that the right code is
Data=c("a","b","c","d","e")[Data]
But I can't understand why this is the right code and why I need the last [Data].
Any help would be really appreciated!! Thank you so much in advance!!:)
The last Data provides an index to subset values from c("a","b","c","d","e").
Let's take a simple example :
Consider,
a <- 1:10
Now to get the first value in a you can do
a[1]
#[1] 1
To get 3rd value in a you can do
a[3]
#[1] 3
To get 6th and 8th value in a you can do
a[c(6, 8)]
#[1] 6 8
What will happen if you repeat a certain index? Say you select 1 twice and 3 once.
a[c(1, 1, 3)]
#[1] 1 1 3
As you can see the first value is selected two times and third one time.
Now, the Data that you have serves as that index, and a becomes c("a","b","c","d","e").
a <- c("a","b","c","d","e")
set.seed(123)
Data=sample(1:5,size=25,replace=T)
Data
#[1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
Now you use this Data values to subset from a giving
a[Data]
#[1] "c" "c" "b" "b" "c" "e" "d" "a" "b" "c" "e" "c" "c" "a" "d" "a" "a" "e" "c" "b" "b" "a" "c" "d" "a"
As a side note, R has the built-in constants letters and LETTERS, which hold the 26 lowercase and uppercase letters of the alphabet.
letters
#[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
LETTERS
# [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
Here is a way that takes advantage of how objects of class "factor" are coded internally in R. In R, factors are coded as consecutive integers starting at 1, and what the user sees is their labels and levels, not the integer values. But the integer values do not go away, they are still there.
First, create a vector of integers like in the question but setting the RNG seed in order to make the results reproducible. This vector is saved for later.
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)
Saved <- Data
Now create the factor. Note that the labels argument is set to the letters "a" to "e".
Data <- factor(Data, labels = c("a","b","c","d","e"))
Data
# [1] c c b b c e d a b c e c c a d a a e c b b a c d a
#Levels: a b c d e
See the internal representation.
as.integer(Data)
# [1] 3 3 2 2 3 5 4 1 2 3 5 3 3 1 4 1 1 5 3 2 2 1 3 4 1
And compare with the initial values.
identical(Saved, as.integer(Data))
# [1] TRUE
This is because Data contains the numbers you want to name, in the order you want to name them. By adding [Data] to the end you select the letters in the order given by Data. To see this, try c("a","b","c","d","e")[c(1, 2)]: it selects just the first two letters. If you instead type c("a","b","c","d","e")[c(5, 4)], it selects the last two letters, in reverse order. If you print Data on its own, you'll see that it contains only the numbers 1 to 5, which matches the number of unique letters, so every number picks out one letter. You can confirm this by printing the recoded Data next to the original numbers.
Using names(Data)=c("a","b","c","d","e") does not work because names() only attaches labels to the elements and leaves the values unchanged, and with five names for 25 numbers only the first five elements get a label.
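The difference between the two approaches can be seen side by side in a short sketch (named and recoded are variable names made up for this illustration):

```r
set.seed(123)
Data <- sample(1:5, size = 25, replace = TRUE)

# names<- only labels the elements; the values stay numeric
named <- Data
names(named) <- c("a","b","c","d","e")
first_six_names <- names(named)[1:6]   # five labels, then NA

# Indexing builds a new vector with one letter per number
recoded <- c("a","b","c","d","e")[Data]
```

After this, unname(named) is still the original integer vector, while recoded is a character vector of the same length as Data.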
I am creating a function to help me quickly recode variables into numerical values, as a form of practice. The idea behind creating the function is to quickly recode several values into numerical form, for any length. If a dataset is really long for instance, the function in theory should recode all of these values without having to manually type out each condition in which to recode it into a specific value.
For instance:
levels(d$letters)
[1] a b c d
The general form of the function is to:
d$letters.recode[d$letters == "a"] <- 1
d$letters.recode[d$letters == "b"] <- 2
d$letters.recode[d$letters == "c"] <- 3
And using this function:
rc.f <- function(a, b){
x <- levels(a)
y <- length(a)
b <- NA
for (i in 1:y){
z <- b[a==x[i]] <- i
}
}
In theory, the idea is that this function should create another variable, where a is recoded as 1, b is recoded as 2 and so on.
However when I run rc.f(d$letters, d$letters.recode), no new variables are created in the dataset, and the function does not return an error.
Any ideas?
Thanks.
Another example dataset d:
Say for a list of respondents they are assigned a category depending on their region:
Respondent Region
1 d
2 b
3 g
4 c
5 e
6 c
7 f
8 a
I am looking for a way to recode d$Region into a numerical value, to d$Region.R.
Using the same function as above, I am wondering whether I can use the function to create another variable in the dataframe, by inputting d$Region and d$Region.R into the function. So recoding a,b,c,[...],g into 1,2,3,[...],7.
If you want a, b, f, d coded as 1, 2, 4, 3, use the following.
I have updated your function rc.f a little:
Removed the second argument b: since b is set to NA inside the function, the argument is not needed.
We do not need another variable to store the value of b, so I removed z.
Not every input is a factor, so we coerce it to one.
We do not need y; the loop only has to run over the levels, so it uses seq_along(x).
Last but not least, a function returns the value of its last expression unless return() is used, so I put b on the last line. Your version only modified a local copy of b and returned nothing, which is why no variable appeared in the dataset.
The code is
rc.f <- function(a)
{
a<-as.factor(a)
x <- levels(a)
b <- rep(NA_integer_, length(a))
for (i in seq_along(x))
{
b[a==x[i]] <- i
}
b
}
Let us take an example:
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f"
[14] "d" "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 5 6 10 4 9 6 7 4 3 1 8 8 8
If you want a, b, f, d coded as 1, 2, 6, 4 instead, use the following:
rc.f <- function(a)
{
a<-as.factor(a)
b <- NA
for (i in 1:26)
{
b[a==letters[i]] <- i
}
b
}
Let's take an example:
> l<-c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")
> l
[1] "a" "b" "b" "a" "a" "g" "h" "y" "f" "v" "h" "j" "f" "d"
[15] "a" "s" "s" "s"
> rc.f(l)
[1] 1 2 2 1 1 7 8 25 6 22 8 10 6 4 1 19 19 19
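Both recodings can also be written without a loop. The sketch below assumes the same vector l as above and a hypothetical data frame d that mirrors the Respondent/Region table from the question; by_level and by_alphabet are names made up here.

```r
l <- c("a","b","b","a","a","g","h","y","f","v","h","j","f","d","a","s","s","s")

# Codes by rank among the values present (a, b, d, f -> 1, 2, 3, 4)
by_level <- as.integer(factor(l))

# Codes by position in the alphabet (a, b, d, f -> 1, 2, 4, 6)
by_alphabet <- match(l, letters)

# Hypothetical data frame mirroring the Respondent/Region example
d <- data.frame(Respondent = 1:8,
                Region = c("d","b","g","c","e","c","f","a"))

# Assign the result back to the data frame to create the new column
d$Region.R <- as.integer(factor(d$Region))
```

Note that the new column exists only because the result is assigned back with d$Region.R <- ...; calling a function on d$Region by itself never changes d.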
I would like to sort the data x (here: 1:12) according to the sectors sec and subsectors ssec. Below is an example showing how this can be done. The question is whether this can be done more elegantly (ideally with a base-R function, without additional packages).
## Data
set.seed(17)
(sec <- sample(rep(LETTERS[1:3], each = 4))) # 3 sectors
(ssec <- rep(sample(1:4, 12, replace = TRUE))) # 4 subsectors
x <- 1:12 # data to sort according to increasing sectors and subsectors
## Sort according to sectors
ord <- order(sec)
x. <- x[ord]
sec. <- sec[ord]
ssec. <- ssec[ord]
## Sort according to subsectors
usec. <- unique(sec.)
x.. <- x.
ssec.. <- ssec.
for(grp in usec.) {
ii <- sec. == grp # indices of components in that sector
ord. <- order(ssec.[ii])
x..[ii] <- x.[ii][ord.]
ssec..[ii] <- ssec.[ii][ord.]
}
## Result
x..
sec.
ssec..
The order function from base R also accepts multiple arguments. From ?order:
order returns a permutation which rearranges its first argument into
ascending or descending order, breaking ties by further arguments.
To demonstrate, we can check how order(sec, ssec) sorts the sectors and subsectors.
Here are the original sec and ssec:
sec
[1] "B" "C" "A" "B" "A" "B" "C" "C" "C" "A" "B" "A"
ssec
[1] 3 1 3 2 1 2 4 1 3 2 1 4
After applying the ordered index, sec is sorted alphabetically and ssec is sorted within each sec, which means the index order(sec, ssec) is the sorting index expected:
sec[order(sec, ssec)]
[1] "A" "A" "A" "A" "B" "B" "B" "B" "C" "C" "C" "C"
ssec[order(sec, ssec)]
[1] 1 2 3 4 1 2 2 3 1 1 3 4
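Applied to the data x itself, the whole task reduces to one index vector. The sketch below hard-codes the sec and ssec values printed above so that it runs on its own; sorted and ord_desc are names introduced only for this illustration.

```r
sec  <- c("B","C","A","B","A","B","C","C","C","A","B","A")
ssec <- c(3, 1, 3, 2, 1, 2, 4, 1, 3, 2, 1, 4)
x    <- 1:12

# One permutation: sectors ascending, ties broken by subsector
ord <- order(sec, ssec)
sorted <- data.frame(x = x[ord], sec = sec[ord], ssec = ssec[ord])

# Variation: subsectors descending within ascending sectors
ord_desc <- order(sec, -ssec)
```

Because order() is stable for ties, elements with the same sector and subsector keep their original relative order.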
I am quite new to R and I have a table of strings, I believe, that I extracted from a text file that contains a list of nucleotides (ex. "AGCTGTCATGCT.....").
Here are the first two rows of the text file to help as an example:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
I need to count every "A" in the sequence by incrementing its variable, a. The same applies to G, C, and T (the variables to increment are g, c, and t respectively).
At the end of the "for" loop I want the number of times the "A", "G", "C" and "T" nucleotides occurred, so I can calculate the dinucleotide frequencies and, hopefully, the transition matrix. My code is below. It doesn't work: every variable is still 0 at the end, which is wrong. Please help, thanks!
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t
The simplest way to get the counts of each nucleotide (or any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
Now, if you don't care about the difference between one line and the next (if this is just one long sequence in ecoli.txt) then you want to combine the file into one long string first:
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one line solution, but it might be clearer to see it in three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")
frequencies = table(combined.seq.vector)
If you're wondering what was wrong with your original code: first, I don't know where the inc function comes from (or why it wasn't throwing an error; are you sure dnaseq[[1]] has length greater than 0?). In any case, you weren't iterating over the sequence, you were iterating over the lines: i was never going to be a single character like A or T, it was always going to be a full line.
In any case, the solution with collapse, table and strsplit is both more concise and computationally efficient than a for loop (or a pair of nested for loops, which is what you would need).
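The lines-versus-characters point is easy to see on a toy input. This is a sketch with two made-up lines standing in for the file contents (seq_lines is hypothetical, not the real ecoli.txt data):

```r
# Two hypothetical lines, as they would come out of the file
seq_lines <- c("AGCT", "TTGA")

# Looping over the lines visits whole strings, never single letters
items_seen <- character(0)
for (i in seq_lines) items_seen <- c(items_seen, i)

# Collapsing and splitting gives one element per nucleotide
chars <- strsplit(paste(seq_lines, collapse = ""), "")[[1]]

# Fixing the levels keeps a zero count for any nucleotide that is absent
counts <- table(factor(chars, levels = c("A", "C", "G", "T")))
```

Here items_seen contains the two full lines, which is why a test such as i == "A" can never be TRUE in the original loop.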
You may use the following code which calls the str_count function (that counts the number of occurrences of a fixed text pattern) from the stringr package. It should work faster than the other solution which splits the character string into one-letter substrings.
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
Note that this solution can easily be extended to finding subsequences of length > 1 (just change the search pattern in sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides).
However, note that detecting AGA in AGAGA will report only 1 occurrence as str_count() does not take overlapping patterns into account.
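If overlapping matches matter, one option is a zero-width lookahead with base R's gregexpr(). This is only a sketch: count_overlapping is a hypothetical helper name, and it assumes the pattern contains no regex metacharacters (which holds for plain nucleotide strings).

```r
# Count matches of `pattern` in `text`, allowing overlaps, by matching
# the empty position in front of each occurrence (perl-style lookahead)
count_overlapping <- function(pattern, text) {
  m <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  if (m[1] == -1) 0L else length(m)   # gregexpr() returns -1 for "no match"
}

n_aga  <- count_overlapping("AGA", "AGAGA")
n_aa   <- count_overlapping("AA", "AAAA")
n_none <- count_overlapping("TT", "AGAGA")
```

The explicit check for -1 matters because gregexpr() reports "no match" as a vector of length one, which would otherwise be counted as a single hit.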
I am assuming that your nucleotide sequence is in a character vector of length one. If you are looking for the dinucleotide frequencies and a transition matrix, here is one solution:
dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
## Using `vapply` and `gregexpr` (fixed-string matching) to count dinucleotide sequences:
nuc_comb$freq <- vapply(nuc_comb$two,
function(x) length(gregexpr(x, dnaseq, fixed = TRUE)[[1]]),
integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get margin proportions for transition matrix
## (probability of moving from nucleotide in row to nucleotide in column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
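One thing to keep in mind is that gregexpr() finds non-overlapping matches, so a run such as "AAAA" contributes two "AA" hits rather than three. An alternative sketch, shown here on a short made-up sequence rather than the real data, tabulates every adjacent pair of characters directly (from, to, pair_counts and transition are names chosen for this example):

```r
dnaseq <- "AACGTA"   # hypothetical toy sequence
nuc <- c("A", "T", "G", "C")

chars <- strsplit(dnaseq, "")[[1]]

# Pair each nucleotide with its successor; fixed levels keep all 16 cells
from <- factor(head(chars, -1), levels = nuc)
to   <- factor(tail(chars, -1), levels = nuc)
pair_counts <- table(from, to)

# Row-wise proportions: P(next = column | current = row)
transition <- prop.table(pair_counts, 1)
```

A nucleotide that never has a successor in the sequence would give a row of NaN in transition, so real data may need a check for rows that sum to zero.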