parsing GTF files - r

I wrote a script to parse the last column of a GTF file. The script works for the first 238460 rows, but when I apply it to all rows I get an error. I would be grateful for any help!
GTF file link:
The error is:
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'expr' in selecting a method for function 'eval': <text>:1:376: unexpected ';'
1: odel_evidence="Supporting evidence includes similarity to: 5 ESTs, 9 Proteins, and 100% coverage of the annotated genomic feature by RNAseq alignments, including 205 samples with support for a
   ^
require(data.table)
info <- fread("GCF_000005505.3_Brachypodium_distachyon_v3.0_genomic.gtf")
colnames(info)[9] <- "AdditionalInfo"
keys <- lapply(info$AdditionalInfo, function(x)
  unlist(lapply(unlist(strsplit(x, "; ")),
                function(y) unlist(strsplit(y, " "))[1])))
values <- lapply(info$AdditionalInfo, function(x)
  unlist(lapply(unlist(strsplit(x, "; ")),
                function(y) gsub("\"", "", unlist(strsplit(y, " "))[2]))))
#keys[[2]]
#values[[2]]
x2 <- rbindlist(info[, .(list(eval(parse(text =
        paste0('list(',
               sub(',$', '',
                   gsub('([^ ]+) ([^;]+); *', '\\1=\\2,', AdditionalInfo)),
               ')')), .SD))),
    by = AdditionalInfo]$V1, fill = T)
I want to parse the last column of the GTF file.
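Turning the attribute text into R code and evaluating it is fragile: any value containing a character such as ';' can yield code that no longer parses, which would be consistent with the "unexpected ';'" message. A minimal sketch of an alternative that extracts the key "value" pairs directly with a regular expression, without eval(parse()), is shown below. The helper name parse_gtf_attributes and the sample attribute string are hypothetical, and the sketch assumes every value is double-quoted (unquoted values would be skipped).

```r
# Hypothetical helper: split one GTF attribute string into a named character vector.
# Assumes each attribute has the form: key "value";
parse_gtf_attributes <- function(x) {
  # Match a key followed by a double-quoted value; the value may contain ';'
  # because matching only stops at the closing quote.
  pairs <- regmatches(x, gregexpr('([^ ;]+) "([^"]*)"', x, perl = TRUE))[[1]]
  keys <- sub(' ".*$', '', pairs)                  # text before the first ' "'
  values <- sub('^[^ ]+ "(.*)"$', '\\1', pairs)    # text between the quotes
  setNames(values, keys)
}

# Made-up example string with a semicolon inside a quoted value
attrs <- parse_gtf_attributes('gene_id "gene1"; note "alpha; beta"; ')
print(attrs)
```

Applied to the table above, lapply(info$AdditionalInfo, parse_gtf_attributes) would give one named vector per row. Keys that occur more than once in a row would need to be collapsed before the rows are bound into a table.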

Related

Error in is.single.string(object) : argument "object" is missing, with no default

I want to parse the AAChange.refGene column and then use biomaRt R package to extract information. My code is raising Error in is.single.string(object) : argument "object" is missing, with no default even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
temp <- sapply(temp$peptide, nchar)
temp <- sort(temp, decreasing = TRUE)
temp <- names(temp[1])
pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found (using ?getSequence) that it was caused by a name conflict between packages (classic R), specifically biomaRt and seqinr. seqinr is used to handle the FASTA format, so the two are probably loaded together often.
My solution was to call the function with an explicit namespace prefix:
biomaRt::getSequence()
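The masking mechanism can be reproduced without either package. In this minimal sketch a locally defined filter() stands in for the function from the second package and masks stats::filter(); the names and data are illustrative only.

```r
# Stand-in for a second package: this definition masks stats::filter().
filter <- function(x, ...) stop("the masking filter() was called")

x <- c(1, 2, 3, 4, 5)

# The unqualified call resolves to the masking definition and fails.
masked <- tryCatch(filter(x, rep(1 / 3, 3)),
                   error = function(e) conditionMessage(e))

# The namespace-qualified call reaches the intended function regardless of masking.
smoothed <- stats::filter(x, rep(1 / 3, 3))

print(masked)
print(smoothed)
```

With both biomaRt and seqinr attached, find("getSequence") should list every attached package that provides the name, which is a quick way to confirm this kind of conflict.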

arulesViz subscript out of bounds paracoord

I want to perform a basket analysis and draw a paracoord plot, but I receive an error.
The error is:
Error in m[j, i] : subscript out of bounds
In addition: Warning message:
In cbind(pl, pr) :
  number of rows of result is not a multiple of vector length (arg 2)
I am using data from: Link.
First I transform the data to fit a basket analysis. The data frame read from the original Excel file is named Online_Retail:
library(arules)
library(arulesViz)
library(plyr)
items <- ddply(Online_Retail, c("CustomerID", "InvoiceDate"), function(df1)paste(df1$Description, collapse = ","))
items1 <- items["V1"]
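# write.csv() always writes the header and ignores col.names (with a warning); the header is skipped below with skip=1
write.csv(items1, "groceries1.csv", quote=FALSE, row.names = FALSE)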
trans1 <- read.transactions("groceries1.csv", format = "basket", sep=",",skip=1)
To draw the paracoord plot I wrote this code:
rules.trans2<-apriori(data=trans1, parameter=list(supp=0.001,conf = 0.05),
appearance=list(default="rhs", lhs="ROSES REGENCY TEACUP AND SAUCER"), control=list(verbose=F))
sorted.plot <- sort(rules.trans2, by="support", decreasing = TRUE)
plot(sorted.plot, method="paracoord", control=list(reorder=TRUE, verbose = TRUE))
Why is my paracoord code not working? How can I fix it, and what should I change?
This is, unfortunately, a bug in arulesViz. This will be fixed in the next release (arulesViz 1.3-3). The fix is already available in the development version on GitHub: https://github.com/mhahsler/arulesViz

Basic ntile function to create decile portfolios

https://imgur.com/a/O1O9G
I am trying to create decile portfolios based on momentum, using the stocks of the German DAX index.
dax <- read.csv("DAXclean.csv", header = TRUE, sep = ";", dec = ",")
as.ts(dax)
mom.return <- matrix(NA,nrow(dax),ncol(dax))
mom.decile <- matrix(NA,nrow(dax),ncol(dax))
for (row in 13:nrow(dax)) {
for (column in 2:ncol(dax)) {
mom.return[row,column] <- (dax[row-1,column]-dax[row-12,column])/dax[row-12,column]
}
mom.decile[row,column] <- ntile(mom.return[row,2:ncol(dax)], 10)
}
When I run this code I get the following error message:
"Error in `[<-`(`*tmp*`, row, column, value = ntile(mom.return[row, 2:ncol(dax)], :
subscript out of bounds"
If I remove the following command, everything works fine.
mom.decile[row,column] <- ntile(mom.return[row,2:ncol(dax)], 10)
I can't see what the problem is.
Thank you in advance!
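In the loop as posted, the decile assignment sits after the inner loop has closed, so its left-hand side is the single cell mom.decile[row, column], while ntile() returns one value per stock. A minimal sketch of one way to restructure it, assigning the whole row at once, follows. It uses simulated prices instead of DAXclean.csv and a hypothetical base-R helper decile_rank as a stand-in for ntile(), so that it runs without extra packages. Whether this mismatch fully explains the reported error is an assumption.

```r
set.seed(1)

# Simulated stand-in for the DAX data: column 1 is a time index, columns 2..21 are prices.
n_months <- 24
n_stocks <- 20
prices <- cbind(seq_len(n_months),
                matrix(runif(n_months * n_stocks, 50, 150), n_months, n_stocks))

# Hypothetical helper: assign each value to one of n roughly equal-sized groups by rank.
decile_rank <- function(x, n = 10) {
  r <- rank(x, ties.method = "first", na.last = "keep")
  as.integer(floor(n * (r - 1) / sum(!is.na(x)) + 1))
}

mom.return <- matrix(NA_real_, nrow(prices), ncol(prices))
mom.decile <- matrix(NA_integer_, nrow(prices), ncol(prices))
cols <- 2:ncol(prices)

for (row in 13:nrow(prices)) {
  # Return from month t-12 to month t-1, computed for all stocks at once
  mom.return[row, cols] <- (prices[row - 1, cols] - prices[row - 12, cols]) /
    prices[row - 12, cols]
  # Left-hand side now has the same length as the vector of deciles
  mom.decile[row, cols] <- decile_rank(mom.return[row, cols])
}

print(mom.decile[13, cols])
```

With dplyr loaded, ntile(mom.return[row, cols], 10) could be used in place of the helper.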

extract ARIMA specification

Printing a fitted model object from auto.arima() includes a line such as
"ARIMA(2,1,0) with drift,"
which would be a nice item to include in sweave (or other) output illustrating the fitted model. Is it possible to extract that line as a block? At this point the best I am doing is to extract the appropriate order from the arma component (possibly coupled with verbiage from the names of the coefficients of the fitted model, e.g., "with drift" or "with non-zero mean.")
# R 3.0.2 x64 on Windows, forecast 5.3
library(forecast)
y <- ts(data = c(-4.389, -3.891, -4.435, -5.403, -2.501, -1.858, -4.735, -1.085, -2.701, -3.908, -2.520, -2.009, -6.961, -2.891, -0.6791, -1.459, -3.210, -2.178, -1.972, -1.207, -1.376, -1.355, -1.950, -2.862, -3.475, -1.027, -2.673, -3.116, -1.290, -1.510, -1.736, -2.565, -1.932, -0.8247, -2.067, -2.148, -1.236, -2.207, -1.120, -0.6152), start = 1971, end = 2010)
fm <- auto.arima(y)
fm
# what I want is the line: "ARIMA(2,1,0) with drift"
str(fm)
paste("ARIMA(", fm$arma[1], ",", fm$arma[length(fm$arma)-1], ",", fm$arma[2], ") with ",intersect("drift", names(fm$coef)), sep = "")
Checking the auto.arima function, I noticed it internally calls another function which is named arima.string.
Then I did:
getAnywhere(arima.string)
and the output was:
A single object matching ‘arima.string’ was found
It was found in the following places
namespace:forecast
with value
function (object)
{
    order <- object$arma[c(1, 6, 2, 3, 7, 4, 5)]
    result <- paste("ARIMA(", order[1], ",", order[2], ",", order[3],
        ")", sep = "")
    if (order[7] > 1 & sum(order[4:6]) > 0)
        result <- paste(result, "(", order[4], ",", order[5],
            ",", order[6], ")[", order[7], "]", sep = "")
    if (is.element("constant", names(object$coef)) | is.element("intercept",
        names(object$coef)))
        result <- paste(result, "with non-zero mean")
    else if (is.element("drift", names(object$coef)))
        result <- paste(result, "with drift ")
    else if (order[2] == 0 & order[5] == 0)
        result <- paste(result, "with zero mean ")
    else result <- paste(result, " ")
    return(result)
}
Then I copied the function code into a new function, which I named arima.string1:
arima.string1(fm)
# [1] "ARIMA(2,1,0) with drift "
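The index vector c(1, 6, 2, 3, 7, 4, 5) works because the arma component stores the specification as c(p, q, P, Q, period, d, D). A minimal sketch of the same idea using only stats::arima() and the built-in lh series is given below. The helper name arima_label is hypothetical, and it covers only the intercept case, not drift.

```r
# Hypothetical helper: build a label from the compact specification in $arma,
# which is laid out as c(p, q, P, Q, period, d, D).
arima_label <- function(object) {
  a <- object$arma
  label <- sprintf("ARIMA(%d,%d,%d)", a[1], a[6], a[2])
  # Append the seasonal part only when there is one
  if (a[5] > 1 && sum(a[c(3, 7, 4)]) > 0)
    label <- paste0(label, sprintf("(%d,%d,%d)[%d]", a[3], a[7], a[4], a[5]))
  if ("intercept" %in% names(object$coef))
    label <- paste(label, "with non-zero mean")
  label
}

fit <- arima(lh, order = c(1, 0, 0))  # lh ships with R's datasets package
label <- arima_label(fit)
print(label)
```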
Today I discovered the block of text I sought is also stored in the forecast from the fitted model object:
forecast(fm)$method
Thanks to Fernando for properly marking up my question and to Davide for two useful pointers: getAnywhere() and arima.string(), ready for modification.

Error in table(x, y) : attempt to make a table with >= 2^31 elements

I have a problem plotting my results. About two weeks ago I could use the code below to plot my data, but now I am getting an error:
data <- read.table("my_step.odt", header = FALSE, sep = "", quote="\"'", dec=".", as.is = FALSE, strip.white=FALSE, col.names=c(.......))
mgn_my <- data[1:49999,18]
sim <- data[1:49999, 21]
plot(sim , mgn_my , type="l",xlab="Time (ns)",ylab="mx")
The error is:
Error in table(x, y) : attempt to make a table with >= 2^31 elements
Any suggestions?
I have had a similar problem before. Based on my answer to another post, here is what I would try before running plot:
Option 1: Use droplevels
mgn_my <- droplevels(data[1:49999,18])
Option 2: Use apply. This approach seems "friendlier" if you are familiar with apply-family functions in R. For example:
mgn_my <- data[1:49999,18]
apply(mgn_my,1,plot)
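A call to table(x, y) from inside plot() suggests that both columns were read as factors, which as.is = FALSE makes possible whenever a column contains a non-numeric entry. Under that assumption, a minimal sketch of converting a factor back to the numbers it displays is shown here. The sample values are made up.

```r
# Made-up stand-in for a column that read.table() turned into a factor
f <- factor(c("0.5", "1.5", "10"))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(f)

# Going through character first recovers the values that were in the file
vals <- as.numeric(as.character(f))

print(codes)
print(vals)
```

If the conversion produces NA values with a warning, the affected rows are the ones holding the non-numeric text that caused the column to become a factor.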
