ERROR in Biostrings while trying to MSA using ggmsa - r

I want to run an MSA of the same peptide in 3 species (rat, zebrafish, and pupfish) and compare it (finding identities/disparities) with 2 synthetic peptides that I have (M35 and M871), but I'm getting the following error after building the vector:
library(ggmsa)
galanin_table <- c("MACSKHLVLFLTILLSLAETPDSAPAHRGRGGWTLNSAGYLLGPVLHLSSKANQGRKTDSALEILDLWKAIDGLPYSRSPRMTKRSMGETFVKPRTGDLRIVDKNVPDEEATLNL", "Rat", "MHRCVGGVCVSLIVCAFLTETLGMVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDTPSARGREDLLGQYAIDSHRSLSDKHGLAGKREMPLDEDFKTGALRIADEDVVHTIIDFLSYLKLKEIGALDSLPSSLTSEEISQP", "Zebrafish", "MQRSFAVFCVSLIFCATLSETIGLVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDSPSARGRDELVNQYGIDGHRTLGDKAGLAGKRDMAQEDDVRTGPLRIGDEDIIHTVIDFLSYLKLKEMGALDSLPSPLTSDELANP", "Pupfish", "GWTLNSAGYLLGPPPGFSPFR","M35", "WTLNSAGYLLGPEHPPPALALA","M871")
galanin_matrix <- matrix(galanin_table, byrow=T, nrow=5)
galanin_table <- as.data.frame(galanin_matrix, stringsAsFactors = F)
colnames(galanin_table) <- c("Sequences", "Species")
galanin_table <- as.data.frame(galanin_table)
galanin_list <- as.list(galanin_table)
galanin_asvector <- as.vector(galanin_list)
galanin_asvector_ss <- Biostrings::AAStringSet(x= galanin_asvector)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function 'seqtype' for signature '"character"'
I'm probably building the vector the wrong way.

You've certainly started out with an interesting approach for importing your sequences into R. ggmsa() expects either a path to a file holding sequences in a recognized format like FASTA, or an XStringSet object of your sequences. I don't know whether you've actually stored your sequences in a character vector or that was just an easy way to include them in this example, but assuming that's what you've got, this should get you started:
# load in decipher for the aligner
suppressMessages(library(DECIPHER))
# load in ggmsa
library(ggmsa)
# your sequences
galanin_table <- c("MACSKHLVLFLTILLSLAETPDSAPAHRGRGGWTLNSAGYLLGPVLHLSSKANQGRKTDSALEILDLWKAIDGLPYSRSPRMTKRSMGETFVKPRTGDLRIVDKNVPDEEATLNL", "Rat", "MHRCVGGVCVSLIVCAFLTETLGMVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDTPSARGREDLLGQYAIDSHRSLSDKHGLAGKREMPLDEDFKTGALRIADEDVVHTIIDFLSYLKLKEIGALDSLPSSLTSEEISQP", "Zebrafish", "MQRSFAVFCVSLIFCATLSETIGLVIAAKEKRGWTLNSAGYLLGPRRIDHLIQIKDSPSARGRDELVNQYGIDGHRTLGDKAGLAGKRDMAQEDDVRTGPLRIGDEDIIHTVIDFLSYLKLKEMGALDSLPSPLTSDELANP", "Pupfish", "GWTLNSAGYLLGPPPGFSPFR","M35", "WTLNSAGYLLGPEHPPPALALA","M871")
# grab your sequences: the logical index c(TRUE, FALSE) is recycled over the
# original vector and selects elements 1, 3, 5, 7, etc.
# conversely, c(FALSE, TRUE) grabs the names at positions 2, 4, 6, etc.
seqs <- AAStringSet(galanin_table[c(TRUE, FALSE)])
names(seqs) <- galanin_table[c(FALSE, TRUE)]
# align your sequences
ali <- AlignSeqs(seqs)
# call ggmsa
ggmsa(msa = ali,
      color = "Clustal",
      font = "DroidSansMono",
      char_width = 0.5,
      seq_name = TRUE)
Good luck!
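The recycling trick used above can be checked in base R on a toy vector before involving Biostrings. The sketch below uses made-up sequences and labels (not the galanin data) and shows how c(TRUE, FALSE) and c(FALSE, TRUE) split an interleaved vector into sequences and names:

```r
# Toy interleaved vector: sequence, label, sequence, label, ...
# (hypothetical values, for illustration only)
interleaved <- c("GWTLNS", "pepA", "WTLNSA", "pepB", "TLNSAG", "pepC")

# A logical index shorter than the vector is recycled along it
toy_seqs  <- interleaved[c(TRUE, FALSE)]   # positions 1, 3, 5
toy_names <- interleaved[c(FALSE, TRUE)]   # positions 2, 4, 6

# Attach the labels as names, giving one named entry per sequence
named_seqs <- setNames(toy_seqs, toy_names)
print(named_seqs)
```

A named character vector like named_seqs should be usable as input to AAStringSet(), which would make the separate names(seqs) assignment unnecessary.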

Related

Normalization of cel. files after filtering of non-expressed genes in R

Before normalizing my data, I want to filter out genes that are not expressed. For this purpose I have specified a threshold value. I would like to apply this filter first and then normalize the filtered data.
library(limma)
library(hgu133plus2cdf)
library(affy)
library(dplyr)
library(oligo)
setwd("C:/A549_ALI/4_tert-Butanol (22)/")
data=read.celfiles(list.celfiles())
eset <- rma(data, normalize=FALSE, background=FALSE)
not_expressed_threshold <- quantile(exprs(eset),0.1)
not_expressed <- exprs(eset) < not_expressed_threshold
not_expressed_2 <- rownames(exprs(eset))[not_expressed]
celfiles_filtered <- eset[!rownames(eset) %in% not_expressed_2, ]
cat("Dimensions of celfiles:", dim(eset), "\n")
cat("Dimensions of celfiles_filtered:", dim(celfiles_filtered), "\n")
eset_ <- rma(celfiles_filtered, normalize=FALSE, background=FALSE)
I get this error message:
Error in (function (classes, fdef, mtable) :
cannot find inherited method for function 'rma' for signature '"ExpressionSet"'.
I tried to filter data directly and then normalize it, but it didn't work either. Thanks for any reply.
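One possible direction, sketched under assumptions: the error says rma() has no method for an ExpressionSet, since it is meant for the raw objects returned by read.celfiles(). Filtering can instead be done row by row on the expression matrix, with normalization then applied to the filtered matrix. The sketch below uses a simulated matrix in place of exprs(eset), and the rule "below the threshold in every array" is an assumption about what "not expressed" should mean here:

```r
set.seed(1)
# Simulated log2 intensities standing in for exprs(eset):
# 100 low-signal probes followed by 900 expressed probes, 4 arrays each
expr <- rbind(matrix(rnorm(400, mean = 3, sd = 0.5), nrow = 100),
              matrix(rnorm(3600, mean = 9, sd = 1), nrow = 900))
dimnames(expr) <- list(paste0("probe", 1:1000), paste0("array", 1:4))

# Same threshold idea as in the question: the 10% quantile of all values
threshold <- quantile(expr, 0.1)

# Keep a probe if it reaches the threshold in at least one array
# (assumed rule; a stricter or looser rule may suit the data better)
keep <- rowSums(expr >= threshold) > 0
expr_filtered <- expr[keep, , drop = FALSE]

cat("Dimensions before:", dim(expr), "\n")
cat("Dimensions after: ", dim(expr_filtered), "\n")

# Normalization would then run on the filtered matrix, for example with
# limma::normalizeBetweenArrays(expr_filtered, method = "quantile")
```

Counting per row with rowSums() avoids indexing the row names with a whole logical matrix, which is what the original not_expressed_2 line does.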

R - extract part of .nc file and convert into raster (similar to WorldClim format)

I have a netcdf file I made which contains percentage values.
The file has 1 variable, 5 dimensions and 0 NetCDF attributes.
The dimensions are
"lon" "lat" "month" "CR" "yearSumm"
They were created using
lon <- ncdim_def("lon", "modis_degrees", -179.5:179.5, unlim=FALSE,
create_dimvar=TRUE, calendar=NA, longname="Longitude")
lat <- ncdim_def("lat", "modis_degrees", -89.5:89.5, unlim=FALSE,
create_dimvar=TRUE, calendar=NA, longname="Latitude")
month <- ncdim_def("month", "month_name", 1:13, unlim=FALSE,
create_dimvar=TRUE, calendar=NA, longname="Month.and.Annual.Data")
CR <- ncdim_def("CR", "CR_numeric", 1:12, unlim=FALSE,
create_dimvar=TRUE, calendar=NA, longname="Cloud.Regime")
yearSumm <- ncdim_def("yearSumm", "yearOrSummType", 1:21, unlim=FALSE,
create_dimvar=TRUE, calendar=NA, longname="Year.and.Summary.Data")
I want to extract 13 layers (each lat x lon, with each cell holding a percentage value) from this and make them into raster files like the bioclimatic data you can download from WorldClim.
I have tried extracting the data I want into an array, to then make a raster. I did that using
CR_RFO <- ncvar_get(CRnc, attributes(CRnc$var)$names[1])
CR_Ann <- as.array(CR_RFO[1:360, 1:180, 13, 1:12, 18])
This seems to have selected the data I want.
I then tried to make that into raster format.
raster(CR_Ann)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘raster’ for signature ‘"array"’
> CR_R <- as.raster(CR_Ann)
Error in array(if (d[3L] == 3L) rgb(t(x[, , 1L]), t(x[, , 2L]), t(x[, :
a raster array must have exactly 3 or 4 planes
> CR_R <- raster(CR_Ann)
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘raster’ for signature ‘"array"’
> CR_R <- stack(CR_Ann)
Error in data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE) :
arguments imply differing number of rows: 777600, 0
> CR_R <- brick(CR_Ann)
Eventually brick() worked, but I don't think that is actually what I want.
The WorldClim download I looked at is a zip file of .tif files.
I had also tried
# set path and filename
ncpath <- "data/"
ncname <- "CR_RFO"
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "Ann" # note: Ann means Annual
CR_raster <- brick(ncfname, varname="CR_RFO")
CR_raster; class(CR_raster)
which resulted in the error
CR_RFO has more than 4 dimensions, I do not know what to do with these data
I suspect I am going about it from the wrong angle, and maybe even have made my netcdf file incorrectly, as lat and long are not variables like in some of the examples I have read.
How can I extract these 13 lat x long layers and output them as .tif as per worldclim?
This is how I have ended up doing what I think I needed to. I haven't tested this in place of worldclim data yet, but I have successfully made the geotiff files.
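library(ncdf4)   # nc_open(), ncvar_get()
library(raster)  # raster(), writeRaster()
library(pracma)  # assumed source of flipud(); the original post does not say which package it came from
CRnc <- nc_open("data/CR_RFO.nc")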
CR_RFO <- ncvar_get(CRnc, attributes(CRnc$var)$names[1])
Repeat from here for each tif I want, selecting the correct number in the 4th place in the index, and changing the file names accordingly.
CR1_Ann <- as.matrix(CR_RFO[1:360, 1:180, 13, 1, 18])
CR1_Ann <- t(CR1_Ann)
CR1_Ann <- flipud(CR1_Ann)
CR1_Annr <- raster(CR1_Ann, ymn = -89.5, ymx = 89.5, xmn = -179.5, xmx = 179.5)
#plot(CR1_Annr)
writeRaster(CR1_Annr, "./data/CR_Ann/CR1_Ann", format = "GTiff")
This is not an elegant solution, so if anyone has a better way, please share.
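A loop can replace the manual repetition. The following is a minimal sketch, not a tested replacement. It assumes the raster package is installed, that CR_RFO has the five dimensions listed in the question, and that the lon/lat values are cell centres (in which case the raster extent would be -180 to 180 and -90 to 90 rather than the half-degree values used above). The names to_north_up and write_layers and the toy array are made up for illustration, and the transpose-and-flip step is done in base R instead of flipud():

```r
# Turn a lon x lat slice (lat ascending) into a matrix whose rows run
# north to south and whose columns run west to east.
to_north_up <- function(slice) {
  m <- t(slice)                    # now lat x lon
  m[nrow(m):1, , drop = FALSE]     # flip rows so the northernmost row is first
}

# Toy stand-in for CR_RFO: lon x lat x month x CR x yearSumm (hypothetical sizes)
toy <- array(seq_len(4 * 3 * 2 * 2 * 2), dim = c(4, 3, 2, 2, 2))

# One matrix per CR value, with month and yearSumm held fixed
layers <- lapply(seq_len(dim(toy)[4]), function(cr) {
  to_north_up(toy[, , 2, cr, 1])
})

# Write each layer as a GeoTIFF; only defined here, not called,
# because it needs the raster package and a writable directory.
write_layers <- function(layers, out_dir) {
  for (cr in seq_along(layers)) {
    r <- raster::raster(layers[[cr]],
                        xmn = -180, xmx = 180, ymn = -90, ymx = 90)
    out <- file.path(out_dir, paste0("CR", cr, "_Ann.tif"))
    raster::writeRaster(r, out, format = "GTiff", overwrite = TRUE)
  }
}
```

With the real data, the slice inside lapply() would be CR_RFO[, , 13, cr, 18] for cr in 1:12, followed by something like write_layers(layers, "data/CR_Ann").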

Error in is.single.string(object) : argument "object" is missing, with no default

I want to parse the AAChange.refGene column and then use biomaRt R package to extract information. My code is raising Error in is.single.string(object) : argument "object" is missing, with no default even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
  temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
  temp <- sapply(temp$peptide, nchar)
  temp <- sort(temp, decreasing = TRUE)
  temp <- names(temp[1])
  pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was a conflict between packages (classic R), specifically biomaRt and seqinr. seqinr is used to handle FASTA files, so the two are probably often loaded together. Both export a function named getSequence, and the one loaded last masks the other.
My solution was to call the function with an explicit namespace, like this:
biomaRt::getSequence()
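The masking behaviour can be reproduced without either package. In the sketch below, a locally defined median() stands in for a second package exporting a clashing name (it is made up for illustration), and the pkg::fun form shows why biomaRt::getSequence() sidesteps the problem:

```r
x <- c(1, 2, 3, 10)

# Stand-in for a later-loaded package that exports the same function name
median <- function(object, ...) "masking definition was called"

masked   <- median(x)          # picks up the most recent definition
explicit <- stats::median(x)   # the namespace prefix always picks stats

print(masked)
print(explicit)

rm(median)  # remove the stand-in so median() means stats::median again

# In the question's loop the equivalent change would be
#   biomaRt::getSequence(id = df$`Refseq ID`[i], type = "refseq_mrna",
#                        seqType = "peptide", mart = ensembl)
```

Running conflicts(detail = TRUE) in a session lists which names are masked and where, which can help spot this kind of clash before it causes an error.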

Cannot identify Chinese character in word2vec

I am a newcomer to word embeddings and wrote a simple program that reads my WhatsApp messages to try the word2vec function in R. Everything works well and I can successfully generate the embedding matrix, with the Chinese characters shown correctly. However, when I use predict() with type = "nearest", the program reports that the Chinese character is not in the dictionary (there is no such problem if the word is English). Is it a problem related to encoding?
My code is as follows:
library(tidyverse)
library(dplyr)
library(rwhatsapp)
library(word2vec)
chat<-rwa_read("C:/Users/peace/Desktop/_chat.txt")
temp<-post_seg$text
words<-word2vec(temp,dim=15,encoding ="UTF-8")
embedding <- as.matrix(words)
nn1 <- predict(words, c("cpc"), type = "nearest", top_n = 5,encoding ="UTF-8")
nn2 <- predict(words, c("夠"), type = "nearest", top_n = 5,encoding ="UTF-8")
Error message shown when nn2 is run:
Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) :
Could not find the word in the dictionary: 夠
But it works well when running the embedding matrix and nn1:
方猛 -0.1368161887 -1.1562500000 -1.461319923
夠 -0.8252676129 -1.5346769094 -1.077145815
cpc -0.1976414174 0.3481757045 0.275686920
[ reached getOption("max.print") -- omitted 2410 rows ]
> nn1
$cpc
term1 term2 similarity rank
1 cpc storeid 0.9780686 1
2 cpc ns 0.9569275 2
3 cpc term 0.8783157 3
Try it this way, setting the locale to "C" before calling predict():
library(tidyverse)
library(dplyr)
library(rwhatsapp)
library(word2vec)
chat<-rwa_read("C:/Users/peace/Desktop/_chat.txt")
temp<-post_seg$text
words<-word2vec(temp,dim=15,encoding ="UTF-8")
Sys.setlocale(category = 'LC_ALL', locale = 'C')
embedding <- as.matrix(words)
nn2 <- predict(words, c("夠"), type = "nearest", top_n = 5,encoding ="UTF-8")
Sys.setlocale(); Sys.getlocale()
nn2
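Before changing the locale, it may help to check whether the query string and the dictionary entries actually agree at the byte level. This is a diagnostic sketch in base R, not something taken from the word2vec documentation. The toy matrix stands in for the real embedding, its row names are assumptions, and "\u5920" is the Unicode escape for the character 夠 from the question:

```r
# The query word written with a Unicode escape, so the encoding of the
# source file itself does not matter
query <- "\u5920"

print(Encoding(query))              # how R has marked the string
print(charToRaw(enc2utf8(query)))   # its UTF-8 bytes

# Toy stand-in for as.matrix(words): terms are stored as row names
toy_embedding <- matrix(0, nrow = 2, ncol = 3,
                        dimnames = list(c("cpc", "\u5920"), NULL))

# If this is FALSE for the real matrix, the term was stored under different
# bytes than the ones typed at the console
in_dictionary <- enc2utf8(query) %in% enc2utf8(rownames(toy_embedding))
print(in_dictionary)
```

Running the same check against rownames(embedding) from the real model would show whether the lookup failure comes from an encoding mismatch or from the term genuinely being absent.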

Get the most expressed genes from one .CEL file in R

In R, the limma package can give you a list of differentially expressed genes.
How can I simply get all the probesets with the highest signal intensity with respect to a threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything works. You have many .CEL files and they are all processed.
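# biocLite() has been retired; Bioconductor packages are now installed via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install(c("GEOquery","affy","limma","gcrma"))
library(GEOquery)  # getGEOSuppFiles(), gunzip()
library(affy)      # ReadAffy()
library(gcrma)     # gcrma()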
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
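# "[gz]" is a character class matching any name that contains g or z; anchor on the .gz extension instead
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "\\.gz$")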
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all the .CEL files except one and run the script from scratch, so that the celData object holds 1 sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { BiocManager::install("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even though the platform of the GDS is the one the library expects.
If you run pa.calls() with gcrma.ExpressionSet as the parameter, then everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
vignette("ExpressionSetIntroduction")
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
Use justGCRMA() rather than ReadAffy() followed by gcrma() as a faster and more memory-efficient way to get to an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.
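Building on exprs(), ranking probesets within one sample or across a group needs only base R once the matrix is extracted. The sketch below uses a small simulated matrix in place of exprs(eset). The probe and sample names, the choice of "control" columns, and the cutoff of 10 are all made up for illustration:

```r
set.seed(42)
# Stand-in for exprs(eset): 50 probesets x 4 samples
expr <- matrix(rnorm(200, mean = 7, sd = 2), nrow = 50,
               dimnames = list(paste0("probe_", 1:50),
                               c("ctrl1", "ctrl2", "case1", "case2")))

# Top 5 probesets in a single sample, highest signal first
top_single <- head(sort(expr[, "ctrl1"], decreasing = TRUE), 5)

# Top 5 probesets by mean signal across one group of samples
ctrl_mean <- rowMeans(expr[, c("ctrl1", "ctrl2")])
top_group <- head(sort(ctrl_mean, decreasing = TRUE), 5)

# All probesets above a fixed threshold in one sample (assumed cutoff)
above_cutoff <- names(which(expr[, "ctrl1"] > 10))

print(top_single)
print(top_group)
```

The single-sample ranking never compares arrays with each other, so it works the same way when the matrix has only one column.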
