I'm trying to create an intensity GDS file from existing Illumina files using the createDataFile() function of GWASTools.
I tried this:
col.nums <- as.integer(c(1,11,12,13,14))
names(col.nums) <- c("snp", "BAlleleFreq", "LogRRatio", "a1", "a2")
variables <- c("genotype","BAlleleFreq","LogRRatio")
intens <- createDataFile(path="/pathexample/", "/pathexample/IntensityGDS",
                         file.type="gds", variables=variables,
                         snp.annotation=snpAnnot, scan.annotation=scanAnnot,
                         sep.type=",", skip.num=12, col.total=14,
                         col.nums=col.nums, scan.name.in.file=-1,
                         allele.coding="nucleotide", precision="single",
                         compress="LZMA_RA:1M", compress.geno="",
                         compress.annot="LZMA_RA", array.name=NULL,
                         genome.build=NULL,
                         diagnostics.filename="createDataFile.diagnostics.RData",
                         verbose=TRUE)
The error I'm getting is:
Error: all(c("snpID", "chromosome", "position", "snpName") %in% names(snp.annotation)) is not TRUE
However, I know those column names are in both the SnpAnnotationDataFrame passed to snp.annotation (aka snpAnnot) and the underlying data frame I used to create it. E.g.:
varLabels(snpAnnot)
yields
"snpName" "chromosome" "position" "rsID_real" "snpID"
Thanks!!
Apparently the problem was that createDataFile() takes regular R data frames in the snp.annotation and scan.annotation arguments, not an object of class SnpAnnotationDataFrame. I.e., there is no need to run SnpAnnotationDataFrame() on your data frame; just pass in the data frame itself.
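In other words (a minimal sketch, assuming snpAnnot and scanAnnot were built with the GWASTools annotation constructors): extract the underlying data frames with pData() and pass those to createDataFile().
snp.df  <- pData(snpAnnot)    # plain data.frame behind the SnpAnnotationDataFrame
scan.df <- pData(scanAnnot)   # likewise for the scan annotation
# then call createDataFile(..., snp.annotation = snp.df, scan.annotation = scan.df)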
I have a Seurat object "gunion.data".
The metadata in gunion.data@meta.data$orig.ident is either "control", "ischemia", "synIRI", or "alloIRI". I would like to change "synIRI" and "alloIRI" into "other".
I tried this:
gunion.data@meta.data$orig.ident["alloIRI"] <- "other"
but it gave me an error:
Error in `$<-.data.frame`(`*tmp*`, orig.ident, value = c("control", "control", : replacement has 26933 rows, data has 26932
How should I format the code to change all "alloIRI" and "synIRI" in the data into "other"?
What you want to do is rename an Ident.
The below should work once you've changed your idents to 'orig.ident'.
Idents(gunion.data) <- 'orig.ident'
gunion.data <- RenameIdents(object = gunion.data, `synIRI` = "other", `alloIRI` = "other")
Idents(gunion.data) # to confirm the change has happened
I would suggest you also have a look at Seurat's essential commands vignette.
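Alternatively, if you prefer to edit the metadata column directly (a sketch using base R subsetting rather than Seurat's API):
md <- gunion.data@meta.data
md$orig.ident[md$orig.ident %in% c("synIRI", "alloIRI")] <- "other"
gunion.data@meta.data <- md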
I want to parse the AAChange.refGene column and then use the biomaRt R package to extract information. My code raises Error in is.single.string(object) : argument "object" is missing, with no default even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
temp <- sapply(temp$peptide, nchar)
temp <- sort(temp, decreasing = TRUE)
temp <- names(temp[1])
pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was a conflict between packages (classic R), specifically biomaRt and seqinr, which also exports a getSequence function. seqinr handles the FASTA format, so the two packages are often loaded together.
My solution was to call the function with an explicit namespace:
biomaRt::getSequence()
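Applied to the loop above, the call keeps the same arguments and only gains the namespace prefix:
temp <- biomaRt::getSequence(id=df$`Refseq ID`[i],
                             type='refseq_mrna',
                             seqType='peptide',
                             mart=ensembl)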
I'm trying to train a list of text datasets at the character level (for example, "a cat" => "a", " ", "c", "a", "t") so that I can classify them with high accuracy. I'm using the mxnet package (a CNN network) in R with the Crepe model. To prepare for training, I need to create iterators for both the training and test datasets. The code is as follows:
train.iter <- CustomCSVIter$new(iter=NULL, data.csv=train.file.output,
batch.size=args$batch_size, alphabet=alphabet,
feature.len=feature.len)
test.iter <- CustomCSVIter$new(iter=NULL, data.csv=test.file.output,
batch.size=args$batch_size, alphabet=alphabet,
feature.len=feature.len)
data.csv is where I have these datasets, batch.size is just an integer, feature.len is also just an integer, and alphabet is a vector of alphanumeric characters ("abcd...?!""). When I run the above code, I get a message saying there was a fatal error, and RStudio crashes and reloads. I don't know what I'm doing wrong. To run the above code, you need the following function:
CustomCSVIter <- setRefClass("CustomCSVIter",
fields=c("iter", "data.csv", "batch.size",
"alphabet","feature.len"),
contains = "Rcpp_MXArrayDataIter",
methods=list(
initialize=function(iter, data.csv, batch.size,
alphabet, feature.len){
csv_iter <- mx.io.CSVIter(data.csv=data.csv,
data.shape=feature.len+1, #=features + label
batch.size=batch.size)
.self$iter <- csv_iter
.self$data.csv <- data.csv
.self$batch.size <- batch.size
.self$alphabet <- alphabet
.self$feature.len <- feature.len
.self
},
value=function(){
val <- as.array(.self$iter$value()$data)
val.y <- val[1,]
val.x <- val[-1,]
val.x <- dict.decoder(data=val.x,
alphabet=.self$alphabet,
feature.len=.self$feature.len,
batch.size=.self$batch.size)
val.x <- mx.nd.array(val.x)
val.y <- mx.nd.array(val.y)
list(data=val.x, label=val.y)
},
iter.next=function(){
.self$iter$iter.next()
},
reset=function(){
.self$iter$reset()
},
num.pad=function(){
.self$iter$num.pad()
},
finalize=function(){
.self$iter$finalize()
}
)
)
Usually a problem like this arises when there is a mismatch between the shape of the input file and the data.shape parameter of the iterator.
You can easily check whether this is the problem by running your code outside of RStudio. Run R from the terminal/command line and paste your code there. When the exception happens it will terminate the R session, and you will be able to read the exception message. In my case it was:
Check failed: row.length == shape.Size() (2 vs. 1) The data size in CSV do not match size of shape: specified shape=(1,), the csv row-length=2
In your case it is probably something similar.
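As a quick sanity check before constructing the iterator (a sketch, reusing train.file.output and feature.len from the question), you can compare the CSV row length to the shape the iterator declares:
# The iterator sets data.shape = feature.len + 1 (features plus label),
# so every CSV row must have exactly that many columns.
first_row <- read.csv(train.file.output, header = FALSE, nrows = 1)
stopifnot(ncol(first_row) == feature.len + 1)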
Btw, there is an implementation of a custom iterator for MNIST dataset, which you may find useful: https://github.com/apache/incubator-mxnet/issues/4105#issuecomment-266190690
I'm using the R package rnoaa (along with its required packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a second, then fails.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("ropensci/rnoaa")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
  for(bid in lid){
    bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
    bid_range.data <- data.frame(bid_range$data)
    write.csv(bid_range.data, file='cwind/bid_range.csv')
  }
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but it does not use the loop variables in the file name; it just names every file bid_range.csv.
This error message shows that there is no data for the given station id in 1990. Because you were using a for loop, it stops as soon as it hits an error.
Here I introduce the use of the tidyverse to download the NOAA buoy data. A lot of the following functions come from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "grid" containing all combinations of id and year
The expand function from tidyr can create the combinations of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
select(id) %>%
expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data, and make it suitable for map2
We will use map2 to loop through all combinations of id and year via its .x and .y arguments, so we reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify. Now when it hits an error, it stores the error message and moves on to the next item rather than breaking.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data have been downloaded into wlevel_data. Each element of wlevel_data has two parts: $result contains the data if the download was successful, and is NULL otherwise; $error is NULL if the download was successful, and contains the error message otherwise.
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements, result and error. We can store these separately and access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
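If you still want one CSV per station and year, as in the original loop, the names assigned in Step 3 give you distinct file names (a sketch; it assumes each result stores its measurements in $data, as in the question's code, and that the cwind directory exists):
for (nm in names(wlevel_result2)) {
  write.csv(wlevel_result2[[nm]]$data,
            file = file.path("cwind", paste0(nm, ".csv")),
            row.names = FALSE)
}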
In R, the limma package can give you a list of differentially expressed genes.
How can I simply get all the probe sets with the highest signal intensity with respect to a threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything is fine: you have many .CEL files and they all work.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "[gz]")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all the .CEL files except one and execute the script from scratch, so that there is one sample in the celData object:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even though the platform of the GDS is the one expected by the library.
If you instead run pa.calls() with gcrma.ExpressionSet as the parameter, everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, when they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
vignette("ExpressionSetIntroduction")
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
Use justGCRMA() rather than ReadAffy() as a faster and more memory-efficient way to get an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.