How to break a vector into subvectors in R

I have a vector like:
A B C A B A B D D E
and I'd like to break it into as many vectors as the number of "A" I have, like:
A B C
A B
A B D D E
Is there a way to accomplish this task?

You can use split and cumsum:
split(x, cumsum(x == "A"))
What you get in return is a list of vectors. A list seems most useful to me here since it allows vectors of different sizes in each element (unlike a data.frame for instance).
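A minimal, self-contained sketch of the approach above, assuming the vector from the question is stored in x (the name used in the answer). The list names "1", "2", "3" come from the running count of "A"s.

```r
# Example vector from the question (the name x is an assumption)
x <- c("A", "B", "C", "A", "B", "A", "B", "D", "D", "E")

# cumsum(x == "A") increases by one at every "A", so it labels each group
groups <- cumsum(x == "A")
groups
# 1 1 1 2 2 3 3 3 3 3

# split() collects the elements that share a label into one list element
parts <- split(x, groups)
parts[["3"]]
# "A" "B" "D" "D" "E"
```

If the vector does not start with "A", the leading elements end up in a group named "0".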

Not as elegant as the split approach, but strsplit also works (vec is the input vector; this assumes every element is a single character):
strsplit(paste0("A", strsplit(paste0(vec, collapse = ""), "A")[[1]][-1]),"")
# [[1]]
# [1] "A" "B" "C"
# [[2]]
# [1] "A" "B"
# [[3]]
# [1] "A" "B" "D" "D" "E"

Related

Maintaining order of extracted patterns from strings in R

I'm trying to extract a pattern from a string, but am having difficulty maintaining the order. Consider:
library(stringr)
string <- "A A A A C A B A"
extract <- c("B","C")
str_extract_all(string,extract)
[[1]]
[1] "B"
[[2]]
[1] "C"
The output of this is a list; is it possible to return a vector that maintains the original ordering, i.e. that "C" precedes "B" in the string? I've tried many options with gsub with no luck. Thanks.
Try using the following regexp:
str_extract_all(string,"[BC]")
## [[1]]
## [1] "C" "B"
or more generally:
str_extract_all(string, paste(extract, collapse = "|"))
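For readers without stringr, the same idea can be sketched in base R. This is an illustrative equivalent rather than the answer's own code, and it assumes the entries of extract contain no regex metacharacters.

```r
string <- "A A A A C A B A"
extract <- c("B", "C")

# Build one alternation pattern, "B|C", so matches come back in string order
pattern <- paste(extract, collapse = "|")

# gregexpr() locates every match; regmatches() pulls out the matched text
found <- regmatches(string, gregexpr(pattern, string))[[1]]
found
# "C" "B"
```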
string <- "A A A A C A B A B"
extract <- c("B","C")
inds = unlist(sapply(extract, function(p){
  as.numeric(gregexpr(p, string)[[1]])
}))
sort(inds[inds > 0])
# C B1 B2
# 9 13 17

Comprehensive matching across a list of vectors

I am trying to find the common elements for every available combination of vectors.
For example, I have 4 vectors:
a<-c(1,2,3,4)
b<-c(1,2,6,7)
c<-c(1,2,8,9)
d<-c(3,6,8,2)
The intended output should be able to tell me:
similarity between a & b: 1, 2
similarity between a & c: 1, 2
similarity between a & d: 2, 3
similarity between b & c: 1, 2
similarity between b & d: 2, 6
similarity between c & d: 2, 8
similarity between a & b & c: 1, 2
similarity between b & c & d: 2
similarity between a & c & d: 2
similarity between a & b & d: 2
similarity between a & b & c & d: 2
Does R have a function that does such comparison/ matching?
For simplicity, the number of vectors is set at 4 for now. I am in fact dealing with hundreds of vectors and would like to match/intersect/compare between all possible combinations of vectors. For example, with 4 vectors there are 4C2 + 4C3 + 4C4 = 11 possible combinations. With 100 vectors there are 100C2 + 100C3 + ... + 100C100 possible combinations.
Thanks in advance.
intersect seems to do what you want. It only handles a pair of vectors at a time, though, e.g.
intersect(a, b) # 1 2
intersect(b, intersect(c, d)) # 2
If you want a shorthand to intersect more than 2, try Reduce (?Reduce)
# intersection of a/b/c/d
Reduce(intersect, list(a, b, c, d), a)
# intersection of b/c/d
Reduce(intersect, list(b, c, d), b)
Reduce successively applies intersect to the result of the previous intersect call and the next element of the list. In the last example it starts with intersect(b, b), because I set the init argument to one of the vectors being intersected (the intersection of a set with itself is the set).
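To see what Reduce does at each step, here is a small sketch using accumulate = TRUE. It assumes the four example vectors are kept in a list called vecs (a name made up for this illustration) and omits init, in which case Reduce starts from the first list element.

```r
# The four example vectors, stored in a list (the name vecs is an assumption)
vecs <- list(c(1, 2, 3, 4), c(1, 2, 6, 7), c(1, 2, 8, 9), c(3, 6, 8, 2))

# accumulate = TRUE keeps every intermediate result instead of only the last
steps <- Reduce(intersect, vecs, accumulate = TRUE)

steps[[2]]  # intersection of the first two vectors
# 1 2
steps[[4]]  # intersection of all four vectors
# 2
```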
If you wanted a way to go through all (pairs, triples, quadruples) of (a, b, c, d) and return the intersection, you could try the following.
Generate all combinations of (a, b, c, d) of lengths 2 (pairs), 3 (triples), 4 (quadruples):
combos = lapply(2:4, combn, x=c('a', 'b', 'c', 'd'), simplify=F)
# [[1]]
# [[1]][[1]]
# [1] "a" "b"
# [[1]][[2]]
# [1] "a" "c"
# ...
# [[2]]
# [[2]][[1]]
# [1] "a" "b" "c"
# [[2]][[2]]
# [1] "a" "b" "d"
# ...
# [[3]]
# [[3]][[1]]
# [1] "a" "b" "c" "d"
Flatten it out to just a list of character vectors
combos = unlist(combos, recursive=F)
# [[1]]
# [1] "a" "b"
# ...
# [[10]]
# [1] "b" "c" "d"
# [[11]]
# [1] "a" "b" "c" "d"
For each set, call Reduce as specified above. We can use (e.g.) get("a") to get the variable a, or mget(c("a", "b", "c")) to get the variables a, b, c in a list. If your variables are columns in a data frame, then you can modify appropriately.
intersects = lapply(combos, function (varnames) {
  Reduce(intersect, mget(varnames, inherits=TRUE), get(varnames[1]))
})
# add some labels for clarity.
# You will probably actually want to /do/ something with the
# resulting intersections rather than this.
names(intersects) <- sapply(combos, paste, collapse=", ")
intersects
# $`a, b`
# [1] 1 2
# $`a, c`
# [1] 1 2
# ...
# $`a, b, c, d`
# [1] 2
You will need to modify to suit how your data is in R; e.g. columns of a dataframe vs named vectors in the workspace and so on.
You might also just prefer a for loop from the Reduce step onwards rather than all the *apply calls, depending on what you want to do with the result. (Plus, if you have many vectors, holding all the intersections simultaneously in memory might not be a good idea anyway.)
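A loop-based variant might look like the sketch below. It assumes the vectors live in a named list vecs rather than as separate workspace variables, and common_to is a hypothetical helper introduced only for this example.

```r
vecs <- list(a = c(1, 2, 3, 4), b = c(1, 2, 6, 7),
             c = c(1, 2, 8, 9), d = c(3, 6, 8, 2))

# Hypothetical helper: elements shared by all of the named vectors
common_to <- function(vecs, nms) Reduce(intersect, vecs[nms])

for (k in 2:length(vecs)) {
  # combn() with simplify = FALSE returns a list of name combinations of size k
  for (nms in combn(names(vecs), k, simplify = FALSE)) {
    common <- common_to(vecs, nms)
    # handle each result immediately instead of storing all of them
    cat(paste(nms, collapse = " & "), ":", paste(common, collapse = ", "), "\n")
  }
}
```

The number of combinations grows roughly as 2^n, so with hundreds of vectors it is only practical to enumerate small values of k.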

Analyze table in R to count nucleotide frequencies

I am quite new to R and I have a table of strings, I believe, that I extracted from a text file that contains a list of nucleotides (ex. "AGCTGTCATGCT.....").
Here are the first two rows of the text file to help as an example:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
I need to count every "A" in the sequence by incrementing its variable, a. The same applies for G, C, and T (variables to increment are g, c, t respectively).
At the end of the "for" loop I want the number of times the "A", "G", "C" and "T" nucleotides occurred so I can calculate the dinucleotide frequencies, and hopefully the transition matrix. My code is below. It doesn't work: it just returns each variable as 0, which is wrong. Please help, thanks!
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t
The simplest way to get the counts of each nucleotide (or any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
Now, if you don't care about the difference between one line and the next (if this is just one long sequence in ecoli.txt) then you want to combine the file into one long string first:
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one-line solution, but it might be clearer to see it in three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")[[1]]
frequencies = table(combined.seq.vector)
If you're wondering what was wrong with your original code: first, I don't know where the inc function comes from (or why it wasn't throwing an error; are you sure dnaseq[[1]] has length greater than 0?). In any case, you weren't iterating over the characters of the sequence, you were iterating over the lines: i was never going to be a single character like A or T, it was always going to be a full line.
In any case, the solution with collapse, table and strsplit is both more concise and computationally efficient than a for loop (or a pair of nested for loops, which is what you would need).
You may use the following code which calls the str_count function (that counts the number of occurrences of a fixed text pattern) from the stringr package. It should work faster than the other solution which splits the character string into one-letter substrings.
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
Note that this solution can easily be extended to finding subsequences of length > 1: just change the search patterns passed to sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides.
However, note that detecting AGA in AGAGA will report only 1 occurrence, as str_count() does not take overlapping matches into account.
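If overlapping occurrences matter, one possible workaround in base R is a zero-width lookahead, which matches at every start position without consuming characters. The helper name count_overlapping is made up for illustration and is not part of stringr; the sketch assumes the pattern contains no regex metacharacters.

```r
# Hypothetical helper: count occurrences of a pattern, overlaps included
count_overlapping <- function(pattern, text) {
  # "(?=...)" is a lookahead: it matches at each start position with zero width
  starts <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  # gregexpr() returns -1 when nothing matches, so count positive positions only
  sum(starts > 0)
}

count_overlapping("AGA", "AGAGA")
# 2
count_overlapping("AGA", "TTTT")
# 0
```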
I am assuming that your nucleotide sequence is in a character vector of length one. If you are looking for the dinucleotide frequencies and a transition matrix, here is one solution:
## built with paste0 so that no line break ends up inside the sequence
dnaseq <- paste0("AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG",
                 "CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC")
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
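## Using `vapply` and `gregexpr` (fixed-string matching) to count dinucleotides.
## gregexpr finds non-overlapping matches only and returns -1 when there is no
## match, so count the positive match positions rather than taking the length.
nuc_comb$freq <- vapply(nuc_comb$two,
                        function(x) sum(gregexpr(x, dnaseq, fixed = TRUE)[[1]] > 0),
                        integer(1))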
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
                    idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get margin proportions for transition matrix
## (probability of moving from nucleotide in row to nucleotide in column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
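As an alternative sketch that also counts overlapping dinucleotides (so "AAA" contributes two "AA" transitions), each letter can be paired with its successor and tabulated directly. The short sequence below is made up for illustration and is not the E. coli data.

```r
# Short made-up sequence for illustration
dnaseq <- "AGAGATTT"
nuc <- c("A", "T", "G", "C")

# Split into single letters, then pair each letter with the one that follows it
chars <- strsplit(dnaseq, "")[[1]]
from <- factor(head(chars, -1), levels = nuc)
to   <- factor(tail(chars, -1), levels = nuc)

# Rows: current nucleotide; columns: next nucleotide
counts <- table(from, to)

# Row-wise proportions; a row with no observations (here "C") comes out as NaN
trans <- prop.table(counts, 1)
```

Using factors with fixed levels keeps all four nucleotides in the table even when one of them never occurs.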

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice - therefore, in the set of shuffled vectors returned, they all should each have "a" twice in the set.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates:
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat), ]
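If installing combinat is not an option, the distinct permutations can also be generated directly in base R. The recursive helper below, unique_perms, is a hypothetical name and only a sketch: it avoids duplicates by choosing each distinct value once as the first element.

```r
# Hypothetical helper: all distinct orderings of x, one per row
unique_perms <- function(x) {
  if (length(x) <= 1) return(matrix(x, nrow = 1))
  rows <- lapply(unique(x), function(v) {
    # remove one occurrence of v, permute the rest, then put v in front
    rest <- x[-match(v, x)]
    cbind(v, unique_perms(rest))
  })
  unname(do.call(rbind, rows))
}

unique_perms(c("a", "a", "b"))
# [1,] "a" "a" "b"
# [2,] "a" "b" "a"
# [3,] "b" "a" "a"
```

For my_vec from the question this yields 5!/2! = 60 rows instead of the 120 produced when the two "a"s are treated as distinct.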
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
  if(n==1){
    return(matrix(1))
  } else {
    sp <- permutations(n-1)
    p <- nrow(sp)
    A <- matrix(nrow=n*p,ncol=n)
    for(i in 1:n){
      A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
    }
    return(A)
  }
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
  for(i in seq_along(pattern))
    x <- gsub(pattern[i], replacement[i], x, ...)
  x
}
Plain gsub() won't work here because it takes a single pattern and a single replacement, and you have several of each.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
library(gtools)
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...

How to analyze the data whose different rows have different number of elements using R?

The data format is as following, the first column is the id:
1, b, c
2, a, d, e, f
3, u, i, c
4, k, m
5, o
However, I don't know how to analyze this data. Do you have a good idea of how to read it into R? More generally, my question is: how do I analyze data whose rows have different numbers of elements using R?
It seems you are trying to read a file with rows of unequal length. The R structure for that is a list.
It is possible to do this by combining read.table with sep="\n" and then to apply strsplit on each row of data.
Here is an example:
dat <- "
1 A B
2 C D E
3 F G H I J
4 K L
5 M"
The code to read and convert to a list:
x <- read.table(textConnection(dat), sep="\n")
apply(x, 1, function(i)strsplit(i, "\\s")[[1]])
The results:
[[1]]
[1] "1" "A" "B"
[[2]]
[1] "2" "C" "D" "E"
[[3]]
[1] "3" "F" "G" "H" "I" "J"
[[4]]
[1] "4" "K" "L"
[[5]]
[1] "5" "M"
You can now use any list manipulation technique to work with your data.
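You can also use readLines and strsplit to solve this problem:
text <- readLines("./xx.txt", encoding = 'UTF-8', n = -1L)
# strsplit's argument is named split (not sep); the result is a list with one character vector per line
txt <- strsplit(text, split = " ")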
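For the comma-separated layout shown in the question, a sketch along the same lines could split on commas and use the first field as the row id. The sample lines are inlined here so the example is self-contained; with a real file they would come from readLines().

```r
# Sample data from the question, inlined instead of read from a file
lines <- c("1, b, c", "2, a, d, e, f", "3, u, i, c", "4, k, m", "5, o")

# Split each line on a comma followed by optional whitespace
fields <- strsplit(lines, ",\\s*")

# Use the first field as the id and keep the remaining fields as the values
rows <- lapply(fields, function(f) f[-1])
names(rows) <- vapply(fields, function(f) f[1], character(1))

# Number of elements per id
lengths(rows)
# 1 2 3 4 5
# 2 4 3 2 1
```

From here, the usual list tools apply, for example table(unlist(rows)) to count how often each element occurs across all rows.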
