I have a fasta format file where in i have to only keep those nodes whose length is less than 100. however, the problem i am currently facing is that i am able to separate the nodes but am not able to put the characters of each node in separate variable whose length i can then check and subsequently separate the requisite nodes from longer ones.
So what i mean is i am able to read the headings and separate nodes but how do i put the characters within each node in a variable.
This is a sample of my data
>NODE_1
GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCAGGTCCGGGCTCCACTGCAC
GTAGTCCTCGTTGGACAGCAGCGGGGCGTACGAGGCCAGCTTGACCACGTCGGCGTTGCG
CTCGAGCCGGTCATGAACGCGGCCTCGGCGAGGGCGTTCTTCCAGGCGTTGCCCTGGGAA
>NODE_2
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTAC
>NODE_3
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTACATCATTCCTTAATCTTCC
my code:
x <- readLines("1.fa", n = -1L, ok = TRUE, warn = TRUE)
for (i in 1:length(x)) {
if (substr(x[i],1,1)=='>') {
head <- c(head,x[i])
q <- x[i+1]
if (q=!0) {
contig <- c(contig,q)
print(contig)
contig.length <- c(contig.length, nchar(q))
} else {
break
}
} else {
z <- paste(z,x[i], sep=" ")
}
}
You should use BioConductor for that. You're actually trying to parse a FASTA-file to some kind of a list. Bioconductor has a simple function read.fasta() that does just that, and returns an object where you can get the lengths and so on. Learning bioconductor is definitely worth the hassle if you work with sequences.
To do it in base R, you'll need to work with lists, something like :
Split.Fasta <- function(x){
out <- list()
for(i in x){
if(substr(i,1,1)==">") {
name <- gsub(">","",i)
out[[name]] <- character(0)
} else if (grepl("\\w",i)){
out[[name]] <- paste(out[[name]],gsub("\\W","",i),sep="")
}
}
out
}
Which works like :
zz <- textConnection(">NODE_1
GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCAGGTCCGGGCTCCACTGCAC
GTAGTCCTCGTTGGACAGCAGCGGGGCGTACGAGGCCAGCTTGACCACGTCGGCGTTGCG
CTCGAGCCGGTCATGAACGCGGCCTCGGCGAGGGCGTTCTTCCAGGCGTTGCCCTGGGAA
>NODE_2
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTAC
>NODE_3
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTACATCATTCCTTAATCTTCC")
X <- readLines(zz,n=-1L,ok=TRUE,warn=TRUE)
close(zz)
Y <- Split.Fasta(X)
$`NODE_1 `
[1] "GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCA...
$`NODE_2 `
[1] "CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGC...
$`NODE_3 `
[1] "CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCAC...
It returns a list which you can use later on to check lengths and so on :
sapply(Y,nchar)
NODE_1 NODE_2 NODE_3
180 162 180
Still, learn to use BioConductor, you'll thank yourself for that.
You could install the seqinr package, which has lots of methods for analysing sequence data.
install.packages("seqinr")
Next, read in your fasta file:
seqs <- read.fasta("myfile.fa")
And then, extract sequences from the list with length < 100:
seqs.small <- seqs[sapply(seqs, function(x) getLength(x) < 100)]
maybe assign would be helpful?
assign('NODE_1', 'GTTGG...')
Related
I am writing a function that will go through a list of files in a directory, count number of complete cases, and if the sum of complete cases is above a given threshhold, a correlation should be calculated. The output must be a numeric vector of correlations for all files that meet the threshhold requirement. This is what I have so far (and it gives me an Error: unexpected '}' in "}" Full disclosure - I am a complete newbie, as in wrote my first code 2 weeks ago. What am I doing wrong?
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = getwd())
correlations_list <- numeric()
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
c <- c(sum(complete.cases(dataFR2)))
if c >= threshhold {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
correlations_list <- c(correlations_list, d)
}
}
correlations_list
}
"Unexpected *" errors are a syntax error. Often a missing parenthesis, comma, or curly bracket. In this case, you need to change if c >= threshhold { to if (c >= threshhold) {. if() is a function and it requires parentheses.
I'd also strongly recommend that you not use c as a variable name. c() is the most commonly used R function, and giving an object the same name will make your code look very strange to anyone else reading it.
Lastly, I'd recommend that you make your output the same length as the the number of files. As you have it, there won't be any way to know which files met the threshold to have their correlations calculated. I'd make correlations_list have the same length as the number of files, and add names to it so you know which correlation belongs to which file. This has the side benefit of not "growing an object in a loop", which is an anti-pattern known for its inefficiency. A rewritten function would look something like this:
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = getwd())
correlations_list <- numeric(length(all_files)) ## initialize to full length
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
n_complete <- sum(complete.cases(dataFR2))
if(n_complete >= threshhold) {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
} else {
d <- NA
}
correlations_list[i] <- d
}
names(correlations_list) <- all_files
correlations_list
}
I have a dataframe with ~9000 rows of human coded data in it, two coders per item so about 4500 unique pairs. I want to break the dataset into each of these pairs, so ~4500 dataframes, run a kripp.alpha on the scores that were assigned, and then save those into a coder sheet I have made. I cannot get the loop to work to do this.
I can get it to work individually, using this:
example.m <- as.matrix(example.m)
s <- kripp.alpha(example.m)
example$alpha <- s$value
However, when trying a loop I am getting either "Error in get(v) : object 'NA' not found" when running this:
for (i in items) {
v <- i
v <- v[c("V1","V2")]
v <- assign(v, as.matrix(get(v)))
s <- kripp.alpha(v)
i$alpha <- s$value
}
Or am getting "In i$alpha <- s$value : Coercing LHS to a list" when running:
for (i in items) {
i.m <- i[c("V1","V2")]
i.m <- as.matrix(i.m)
s <- kripp.alpha(i.m)
i$alpha <- s$value
}
Here is an example set of data. Items is a list of individual dataframes.
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
items <- c("l","t")
I am sure this is a basic question, but what I want is for each file, i, to add a column with the alpha score at the end. Thanks!
Your problem is with scoping and extracting names from objects when referenced through strings. You'd need to eval() some of your object to make your current approach work.
Here's another solution
library("irr") # For kripp.alpha
# Produce the data
l <- as.data.frame(matrix(c(4,3,3,3,1,1,3,3,3,3,1,1),nrow=2))
t <- as.data.frame(matrix(c(4,3,4,3,1,1,3,3,1,3,1,1),nrow=2))
# Collect the data as a list right away
items <- list(l, t)
Now you can sapply() directly over the elements in the list.
sapply(items, function(v) {
kripp.alpha(as.matrix(v[c("V1","V2")]))$value
})
which produces
[1] 0.0 -0.5
I regularly come up against the issue of how to categorise dataframes from a list of dataframes according to certain values within them (E.g. numeric, factor strings, etc). I am using a simplified version using vectors here.
After writing messy for loops for this task a bunch of times, I am trying to write a function to repeatedly solve the problem. The code below returns a subscripting error (given at the bottom), however I don't think this is a subscripting problem, but to do with my use of return.
As well as fixing this, I would be very grateful for any pointers on whether there are any cleaner / better ways to code this function.
library(plyr)
library(dplyr)
#dummy data
segmentvalues <- c('1_P', '2_B', '3_R', '4_M', '5_D', '6_L')
trialvec <- vector()
for (i in 1:length(segmentvalues)){
for (j in 1:20) {
trialvec[i*j] <- segmentvalues[i]
}
}
#vector categorisation
vcategorise <- function(categories, data) {
#categorises a vector into a list of vectors
#requires plyr and dyplyr
assignment <- list()
catlength <- length(categories)
for (i in 1:length(catlength)){
for (j in 1:length(data)) {
if (any(contains(categories[i], ignore.case = TRUE,
as.vector(data[j])))) {
assignment[[i]][j] <- data[j]
}
}
}
return (assignment)
}
result <- vcategorise(categories = segmentvalues, data = trialvec)
Error in *tmp*[[i]] : subscript out of bounds
You are indexing assignments -- which is ok, even if at an index that doesn't have a value, that just gives you NULL -- and then indexing into what you get there -- which won't work if you get NULL. And NULL you will get, because you haven't allocated the list to be the right size.
In any case, I don't think it is necessary for you to allocate a table. You are already using a flat indexing structure in your test data generation, so why not do the same with assignment and then set its dimensions afterwards?
Something like this, perhaps?
vcategorise <- function(categories, data) {
assignment <- vector("list", length = length(data) * length(categories))
n <- length(data)
for (i in 1:length(categories)){
for (j in 1:length(data)) {
assignment[(i-1)*n + j] <-
if (any(contains(categories[i],
ignore.case = TRUE,
as.vector(data[j])))) {
data[j]
} else {
NA
}
}
}
dim(assignment) <- c(length(data), length(categories))
assignment
}
It is not the prettiest code, but without fully understanding what you want to achieve, I don't know how to go further.
UPDATE
Thanks to the help and suggestions of #CarlWitthoft my code was simplified to this:
model <- unlist(sapply(1:length(model.list),
function(i) ifelse(length(model.list[[i]][model.lookup[[i]]] == "") == 0,
NA, model.list[[i]][model.lookup[[i]]])))
ORIGINAL POST
Recently I read an article on how vectorizing operations in R instead of using for loops are a good practice, I have a piece of code where I used a big for loop and I'm trying to make it a vector operation but I cannot find the answer, could someone help me? Is it possible or do I need to change my approach? My code works fine with the for loop but I want to try the other way.
model <- c(0)
price <- c(0)
size <- c(0)
reviews <- c(0)
for(i in 1:length(model.list)) {
if(length(model.list[[i]][model.lookup[[i]]] == "") == 0) {
model[i] <- NA
} else {
model[i] <- model.list[[i]][model.lookup[[i]]]
}
if(length(model.list[[i]][price.lookup[[i]]] == "") == 0) {
price[i] <- NA
} else {
price[i] <- model.list[[i]][price.lookup[[i]]]
}
if(length(model.list[[i]][reviews.lookup[[i]]] == "") == 0) {
reviews[i] <- NA
} else {
reviews[i] <- model.list[[i]][reviews.lookup[[i]]]
}
size[i] <- product.link[[i]][size.lookup[[i]]]
}
Basically the model.list variable is a list from which I want to extract a particular vector, the location from that vector is given by the variables model.lookup, price.lookup and reviews.lookup which contain logical vectors with just one TRUE value which is used to return the desired vector from model.list. Then every cycle of the for loop the extracted vectors are stored on variables model, price, size and reviews.
Could this be changed to a vector operation?
In general, try to avoid if when not needed. I think your desired output can be built as follows.
model <- unlist(sapply(1:length(model.list), function(i) model.list[[i]][model.lookup[[i]]]))
model[model=='']<-NA
And the same for your other variables. This assumes that all model.lookup[[i]] are of length one. If they aren't, you won't be able to write the output to a single element of model in the first place.
I would also note that you are grossly overcoding, e.g. x<-0 is better than x<-c(0), and don't bother with length evaluation on a single item.
I have a data.frame object. For a simple example:
> data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
x y z
1 A Ab Abb
2 A Ac Acc
3 B Ba Bad
4 B Ba Bae
5 B Bd Bdd
there are a lot more rows and columns in the actual data. how could I create a nested tree structure object of dendrogram like this:
|---Ab---Abb
A---|
| |---Ac---Acc
--| /--Bad
| |---Ba-------|
B---| \--Bae
|---Bb---Bdd
data.frame to Newick
I did my PhD in computational phylogenetics and somewhere along the way I produced this code, that I used once or twice when I got some data in this nonstandard format (in phylogenetic sense). The script traverses the dataframe as if it were a tree ... and pastes stuff along the way into a Newick string, which is a standard format and can be then transformed in any kind of tree object.
I guess the script could be optimized (I used it so rarely that more work on it would reduce the overall efficiency), but at least it is better to share than to let it collect dust laying around on my harddrive.
## recursion function
traverse <- function(a,i,innerl){
if(i < (ncol(df))){
alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
desc <- NULL
if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
else {
for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
il <- NULL; if(innerl==TRUE) il <- a
(newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
}
}
else { (newickout <- a) }
}
## data.frame to newick function
df2newick <- function(df, innerlabel=FALSE){
alevel <- as.character(unique(df[,1]))
newick <- NULL
for(x in alevel) newick <- c(newick,traverse(x,1,innerlabel))
(newick <- paste("(",paste(newick,collapse=","),");",sep=""))
}
The main function df2newick() takes two arguments:
df which is the dataframe to be transformed (object of class data.frame)
innerlabel which tells the function to write labels for inner nodes (bulean)
To demonstrate it on your example:
df <- data.frame(x=c('A','A','B','B','B'), y=c('Ab','Ac','Ba', 'Ba','Bd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
myNewick <- df2newick(df)
#[1] "((Abb,Acc),((Bad,Bae),Bdd));"
Now you could read it into a object of class phylo with read.tree() from ape
library(ape)
mytree <- read.tree(text=myNewick)
plot(mytree)
If you want to add inner node labels to the Newick string, you can use this:
myNewick <- df2newick(df, TRUE)
#[1] "((Abb,Acc)A,((Bad,Bae)Ba,Bdd)B);"
Hope this is useful (and maybe my PhD wasn't a complete waist of time ;-)
Additional note for your dataframe format:
As you can observe the df2newick function ignores inner modes with one child (which is anyway best to be used with most phylogenetic methods ... was only relevant to me). The df objects that I originally got and used with this script were of this format:
df <- data.frame(x=c('A','A','B','B','B'), y=c('Abb','Acc','Ba', 'Ba','Bdd'), z=c('Abb','Acc','Bad', 'Bae','Bdd'))
Very similar to yours ... but the "inner singe child nodes" just had the same name as their children, but you have different inner names for this nodes too, and the names get ignored ... might not be relevant but you can just ignore a part of the recursion function, like this:
traverse <- function(a,i,innerl){
if(i < (ncol(df))){
alevelinner <- as.character(unique(df[which(as.character(df[,i])==a),i+1]))
desc <- NULL
##if(length(alevelinner) == 1) (newickout <- traverse(alevelinner,i+1,innerl))
##else {
for(b in alevelinner) desc <- c(desc,traverse(b,i+1,innerl))
il <- NULL; if(innerl==TRUE) il <- a
(newickout <- paste("(",paste(desc,collapse=","),")",il,sep=""))
##}
}
else { (newickout <- a) }
}
and you would get something like this:
[1] "(((Abb)Ab,(Acc)Ac)A,((Bad,Bae)Ba,(Bdd)Bd)B);"
This really looks odd to me, but I add it just in case, cause it really includes now all the information from your original dataframe.
I don't know much about the internal structure of dendrograms in R, but the following code will create a nested list structure that has the hierarchy that I think you look for:
stree = function(x,level=0) {
#x is a string vector
#resultis a hierarchical structure of lists (that contains lists, etc.)
#the names of the lists are the node values.
level = level+1
if (length(x)==1) {
result = list()
result[[substring(x[1],level)]]=list()
return(result)
}
result=list()
this.level = substring(x,level,level)
next.levels = unique(this.level)
for (p in next.levels) {
if (p=="") {
result$p = list()
} else {
ids = which(this.level==p)
result[[p]] = stree(x[ids],level)
}
}
result
}
it operates on a vector of strings. so in case of your dataframe you'd need to call
stree(as.character(df[,3]))
Hope this helps.