Given the following:
library(raster)
r <- raster(ncol=10, nrow=10)
s <- stack(lapply(1:3, function(x) setValues(r, runif(ncell(r)))))
s <- setZ(s, as.Date('2000-1-1') + 0:2,name="time")
s
getZ(s)
how can I remove "time" from s?
I want to remove "time" because I get errors while cropping a RasterStack P similar to s:
cr <- crop(P, extent(Germany),snap="out")
NOTE: rgdal::checkCRSArgs: no proj_defs.dat in PROJ.4 shared files
Error in R_nc4_def_var_float: NetCDF: String match to name in use
Name of variable that the error occurred on: "time"
I.e., you are trying to add a variable with that name to the file, but it ALREADY has a variable with that name!
[1] "----------------------"
[1] "Var: time"
[1] "Ndims: 3"
[1] "Dimids: "
[1] 2 1 0
Error in ncvar_add(nc, vars[[ivar]], verbose = verbose, indefine = TRUE) :
Error in ncvar_add, defining var time
If "time" dimension is not the problem, what can be the solution to this error?
Thanks for your thoughts on this.
You can try setting a different name in the setZ function. When you crop the raster, a new variable named "time" is created, and the new RasterBrick cannot be written with two variables that share the same name.
I had the same problem when I tried to export a cropped raster. I had to change the 'varname' from "time" to something else ("Date", for example).
I want to parse the AAChange.refGene column and then use biomaRt R package to extract information. My code is raising Error in is.single.string(object) : argument "object" is missing, with no default even though the getSequence function is meant to accept multiple arguments.
library(tidyr)
variant_calls = read.delim("variant_calls.txt")
info = tidyr::separate(variant_calls["AAChange.refGene"], AAChange.refGene, c("Refseq ID", "cDNA level change", "Protein level change"), ":")
df = cbind(variant_calls["Gene.refGene"],info)
library(biomaRt)
ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl", host="https://grch37.ensembl.org", path="/biomart/martservice")
pep <- vector()
for(i in 1:length(df$`Refseq ID`)){
temp <- getSequence(id=df$`Refseq ID`[i],type='refseq_mrna',seqType='peptide', mart=ensembl)
temp <- sapply(temp$peptide, nchar)
temp <- sort(temp, decreasing = TRUE)
temp <- names(temp[1])
pep[i] <- temp
}
df$Sequence <- pep
Traceback:
Error in is.single.string(object) :
argument "object" is missing, with no default
I got the same error and found out (using ?getSequence) that it was caused by a conflict between packages (classic R), specifically biomaRt and seqinr. seqinr is used to handle FASTA files, so the two are probably loaded together often.
My solution was to call the function with an explicit namespace:
biomaRt::getSequence()
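To illustrate why the explicit namespace helps, here is a minimal sketch using only base R. The local filter() function is a hypothetical stand-in for the masking package; the mechanism is assumed to be the same one at play between seqinr and biomaRt.

```r
# A function defined later on the search path masks an earlier one of the
# same name. Here a local filter() (hypothetical) masks stats::filter().
filter <- function(x) "masked version called"

# A plain call finds the local definition first
masked <- filter(1:5)

# The pkg::fun form always reaches the intended package
smoothed <- stats::filter(1:5, rep(1 / 3, 3))

print(masked)
print(as.numeric(smoothed))
```

Running conflicts(detail = TRUE) after loading your packages lists every name that is defined in more than one place on the search path.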
I am trying to extract a variable named Data Fields/OzoneTropColumn at a point location (lon=40, lat=34) at different pressure level (825.40198, 681.29102, 464.16000, 316.22699 hPa) from multiple hdf files
library(raster)
library(ncdf4)
library(RNetCDF)
# read file
nc <- nc_open("E:/Ozone/test1.nc")
list_col1 <- as.list(list.files("E:/Ozone/", pattern = "*.hdf",
full.names = TRUE))
> attributes(nc$var) #using a single hdf file to check its variables
$names
[1] "Data Fields/Latitude" "Data Fields/Longitude"
[3] "Data Fields/O3" "Data Fields/O3DataCount"
[5] "Data Fields/O3Maximum" "Data Fields/O3Minimum"
[7] "Data Fields/O3StdDeviation" "Data Fields/OzoneTropColumn"
[9] "Data Fields/Pressure" "Data Fields/TotColDensDataCount"
[11] "Data Fields/TotColDensMaximum" "Data Fields/TotColDensMinimum"
[13] "Data Fields/TotColDensStdDeviation" "Data Fields/TotalColumnDensity"
[15] "HDFEOS INFORMATION/StructMetadata.0" "HDFEOS INFORMATION/coremetadata"
> pres <- ncvar_get(nc, "Data Fields/Pressure") #looking at pressure level from single file of hdf
> pres
[1] 825.40198 681.29102 464.16000 316.22699 215.44400 146.77901 100.00000 68.12950 46.41580 31.62290
[11] 21.54430 14.67800 10.00000 6.81291 4.64160
ncin <- raster::stack(list_col1,
varname = "Data Fields/OzoneTropColumn",
ncdf=TRUE)
#cannot extract using the following code
o3 <- ncvar_get(list_col1,attributes(list_col1$var)$names[9])
"Error in ncvar_get(list_col1, attributes(list_col1$var)$names[9]) :
first argument (nc) is not of class ncdf4!"
#tried to extract pressure levels
> prsr <- raster::stack(list_col1,varname = "Data Fields/Pressure",ncdf=TRUE)
"Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'stack': varname: Data Fields/Pressure does not exist in the file. Select one from:
Data Fields/O3, Data Fields/O3DataCount, Data Fields/O3Maximum, Data Fields/O3Minimum, Data Fields/O3StdDeviation, Data Fields/OzoneTropColumn, Data Fields/TotColDensDataCount, Data Fields/TotColDensMaximum, Data Fields/TotColDensMinimum, Data Fields/TotColDensStdDeviation, Data Fields/TotalColumnDensity"
#tried using index
#Point location can also be written as below 1 deg by 1 deg resolution
lonIdx <- which(lon >32 & lon <36)
latIdx <- which(lat >38 & lat <42)
presIdx <- which(pres >= 400 & pres <= 900)
#also tried
# Option 2 -- subset using array indexing
o3 <- ncvar_get(list_col1,'Data Fields/OzoneTropColumn')
"Error in ncvar_get(list_col1, "Data Fields/OzoneTropColumn") :
first argument (nc) is not of class ncdf4!"
extract2 <- o3[lonIdx, latIdx, presIdx, ]
How do I extract these values vertically at each pressure level? (SM = some value)
I would like the output in following way at location (lon=40, lat=34):
Pressure 1 2 3 4 5 .... 10
825.40198 SM1 SM2 SM3 SM4 SM5... SM10
681.29102 SM11 SM12
464.16000
316.22699 SM.. SM.. SM.. SM.. SM.. SM..
Appreciate any help.
Thank you
This might be an issue with how ncdf4 and raster name each of the layers in the file, and perhaps some confusion from trying to create a multilayer object from multiple netCDF files at once.
I would do the following, using only raster: Load a single netCDF, using stack() or brick(). This will load the file as a multilayer object in R. Use names() to identify what is the name of the Ozone layer according to the raster package.
firstraster <- stack("E:/Ozone/test1.nc")
names(firstraster)
Once you know the name, you can read every file with stack() and extract the values at the points of interest, without assembling all layers into a single stack.
Ozonelayername <- "put name here"
files <- list.files("E:/Ozone/", pattern = "*.hdf", full.names = TRUE)
stacklist <- lapply(files, stack)
Ozonelayerlist <- lapply(stacklist, "[[", Ozonelayername) # pass the variable, not the literal string "Ozonelayername"
The line above will output a list of raster objects (not stacks or bricks, just plain rasters), containing only the layer you want.
Now we just need to execute an extract on each of these layers. sapply() will format that neatly in a matrix for us.
pointsofinterest <- expand.grid(32:36,38:42)
values <- sapply(Ozonelayerlist, extract, pointsofinterest)
I cannot test it, since I do not have the data, but I assume this would work.
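If you prefer to stay with arrays, the indexing idea from the question can be sketched as below. This is only an illustration: the array is simulated, and the dimension order (lon, lat, pressure, time) is an assumption that you would need to check against what ncvar_get() actually returns for your files.

```r
# Simulated stand-ins for the coordinate vectors and the data array.
# The layout (lon, lat, pressure, time) is assumed, not taken from the files.
lon   <- seq(30.5, 44.5, by = 1)
lat   <- seq(30.5, 44.5, by = 1)
pres  <- c(825.40198, 681.29102, 464.16000, 316.22699, 215.44400)
ntime <- 10
set.seed(1)
o3 <- array(runif(length(lon) * length(lat) * length(pres) * ntime),
            dim = c(length(lon), length(lat), length(pres), ntime))

# Index of the grid cell nearest to lon = 40, lat = 34
lonIdx <- which.min(abs(lon - 40))
latIdx <- which.min(abs(lat - 34))

# Indices of the pressure levels of interest
presIdx <- which(pres >= 300 & pres <= 900)

# Result: one row per pressure level, one column per time step
profile <- o3[lonIdx, latIdx, presIdx, ]
dimnames(profile) <- list(Pressure = pres[presIdx], Time = seq_len(ntime))
print(round(profile, 3))
```

Using which.min() on the absolute difference picks the nearest cell, which avoids the empty result that a strict range filter can give when no grid centre falls inside the range.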
PROBLEM
I have many .RData files in one folder and I want to extract the coordinates contained in each .RData file. I'd also like to link the corresponding file name (use_hab) and datetime (dt) to each row of their respective coordinates.
CODE
file.namez<-list.files("C:/fitting/fitdata/7 27 2015") #name of files
#file.namez.rev<-file.namez[grep(".RData",file.namez)]
datastor<-data.frame(matrix(NA,length(file.namez),4))
names(datastor)<-c("use_hab",paste("B",1:3,sep=""))
allresults<-NULL
for(i in 1:length(file.namez))
{
datastor<-NULL
print(file.namez[i])
load(paste("C:/fitting/fitdata/7 27 2015/",file.namez[i], sep=""))
use_hab <- as.character(as.data.frame(strsplit(file.namez[i],"_an"))[2,])# this line is used to remove unwanted parts of the file name
use_hab <- gsub(".RData","", use_hab)
datastor <- fitdata$coords
datastor$use_hab <- use_hab
datastor$dt <- fitdata$dt
allresults <- rbind(allresults, datastor[,c(3,4,1,2)])
}
This is the only output before the error message:
[1] "fitdata_anw514_yr2008.RData"
ERROR
Error in datastor[, c(3, 4, 1, 2)] : incorrect number of dimensions
In addition: Warning message:
In datastor$use_hab <- use_hab : Coercing LHS to a list
QUESTION
How am I getting the incorrect number of dimensions? Each file name should have 1098 coordinates and date time. In total, 63 files x 1098 rows with 4 columns(filename, datetime, x, y).
The desired result is to have the file name as the first column, the date time as the second column, and the x and y coordinates as the third and fourth columns.
Replace
datastor <- fitdata$coords
with
datastor$coords <- fitdata$coords
The warning "Coercing LHS to a list" is raised when you use $ on an object that does not support it. datastor <- fitdata$coords changes datastor to the data type of fitdata$coords.
Also, I'd change
allresults<-NULL
datastor<-NULL
to
allresults <- data.frame()
datastor <- data.frame()
but this may just be my personal preference.
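A minimal sketch of the same idea, assuming fitdata$coords is a two-column matrix (which would explain both the warning and the dimension error). The fitdata object below is a made-up stand-in for one loaded file, so adjust the names to your real data.

```r
# Hypothetical stand-in for the object that load() brings in
fitdata <- list(
  coords = matrix(c(10, 11, 12, 50, 51, 52), ncol = 2,
                  dimnames = list(NULL, c("x", "y"))),
  dt = as.POSIXct("2008-01-01 00:00:00", tz = "UTC") + 3600 * 0:2
)
use_hab <- "w514_yr2008"

# Build a data frame directly, already in the desired column order:
# file name, datetime, x, y
datastor <- data.frame(use_hab = use_hab,
                       dt = fitdata$dt,
                       as.data.frame(fitdata$coords),
                       stringsAsFactors = FALSE)

# Append to the accumulated results
allresults <- data.frame()
allresults <- rbind(allresults, datastor)
print(allresults)
```

Because the columns are created in the final order, the datastor[, c(3, 4, 1, 2)] reordering step is no longer needed.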
In R the Limma package can give you a list of differentially expressed genes.
How can I simply get all the probesets with the highest signal intensity with respect to a threshold?
Can I get only the most expressed genes in a healthy experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group)?
If you run the following script, everything works. There are many .CEL files and all of them are processed.
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "[gz]")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)
But if you manually delete all .CEL files except one and run the script from scratch, so that the celData object holds a single sample:
> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=
Then you'll get the error:
Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) :
variable lengths differ (found for 'x')
How can I get the most expressed genes from 1 .CEL sample file?
I've found a library that could be useful for my purpose: the panp package.
But, if you run the following script:
if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)
you'll get an error:
> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero
even if the platform of the GDS is that expected by the library.
If you run pa.calls() with gcrma.ExpressionSet as the argument, everything works:
my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.
In summary, if you run the script, you'll get an error while executing:
my_pa <- pa.calls(eset)
and not while executing
my_pa <- pa.calls(gcrma.ExpressionSet)
Why, if they are both ExpressionSet objects?
> is(gcrma.ExpressionSet)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
> is(eset)
[1] "ExpressionSet" "eSet" "VersionedBiobase" "Versioned"
Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette
vignette("ExpressionSetIntroduction")
also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So
> eset = gcrma.ExpressionSet ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
row col
213477_x_at 22779 24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"
Use justGCRMA() rather than ReadAffy as a faster and more memory efficient way to get to an ExpressionSet.
Consider asking questions about Bioconductor packages on the Bioconductor support site, where you'll get fast responses from knowledgeable members.
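For the original question about probesets above a threshold, one possible approach is to filter the expression matrix directly. The sketch below uses a small simulated matrix in place of exprs(gcrma.ExpressionSet); the probe names, sample names, and the cutoff of 9 are made up for illustration.

```r
# Simulated stand-in for exprs(gcrma.ExpressionSet):
# rows are probesets, columns are samples
set.seed(42)
expr <- matrix(rnorm(5 * 3, mean = 8, sd = 2), nrow = 5,
               dimnames = list(paste0("probe_", 1:5),
                               paste0("sample_", 1:3)))

threshold <- 9  # hypothetical cutoff on the log2 expression scale

# Group of samples: average each probeset, keep those above the
# cutoff, highest first
avg  <- rowMeans(expr)
high <- sort(avg[avg > threshold], decreasing = TRUE)

# Single sample: same filter applied to one column
one     <- expr[, "sample_1"]
top_one <- names(one)[one > threshold]

print(high)
print(top_one)
```

To restrict the group version to controls only, subset the columns first, for example expr[, control_samples], where control_samples is a character vector of sample names that you supply.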
I am trying to read a large (> 70 MB) fixed-format text file into R. For a smaller file (< 1 MB), I can use the read.fwf() function as shown below.
condodattest1a <- read.fwf(impfile1,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
When I try to run the line of code below,
condodattest1 <- read.fwf(impfile,widths=testcsv3$Varlen,col.names=testcsv3$Varname)
I get the following error message:
Error: cannot allocate vector of size 2 Kb
The only difference between the 2 lines is the size of the input file.
The formatting for the file I want to import is given in the dataframe called testcsv3. I show a small snippet of the dataframe below:
> head(testcsv3)
Varlen Varname Varclass Varsep Varforfmt
1 2 "V1" "character" 2 "A2.0"
2 15 "V2" "character" 17 "A15.0"
3 28 "V3" "character" 45 "A28.0"
4 3 "V4" "character" 48 "F3.0"
5 1 "V5" "character" 49 "A1.0"
6 3 "V6" "character" 52 "A3.0"
At least part of my problem is that I am reading in all the data as factors when I use read.fwf() and I end up exceeding the memory limit on my computer.
I tried to use read.table() as a way of formatting each variable but it seems I need a text delimiter with that function. There is a suggestion in section 3.3 in the link below that I could use sep to identify the column where every variable starts.
http://data.princeton.edu/R/readingData.html
However, when I use the command below:
condodattest1b <- read.table(impfile1,sep=testcsv3$Varsep,col.names=testcsv3$Varname, colClasses=testcsv3$Varclass)
I get the following error message:
Error in read.table(impfile1, sep = testcsv3$Varsep, col.names = testcsv3$Varname, : invalid 'sep' argument
Finally, I tried to use:
condodattest1c <- read.fortran(impfile1,lengths=testcsv3$Varlen, format=testcsv3$Varforfmt, col.names=testcsv3$Varname)
but I get the following message:
Error in processFormat(format) : missing lengths for some fields
In addition: Warning messages:
1: In processFormat(format) : NAs introduced by coercion
2: In processFormat(format) : NAs introduced by coercion
3: In processFormat(format) : NAs introduced by coercion
All I am trying to do at this point is format the data when they come into R as something other than factors. I am hoping this will limit the amount of memory I am using and allow me to actually read the file. I would appreciate any suggestions about how I can do this. I know the Fortran formats for all the variables and the column at which each variable begins.
Thank you,
Warren
Maybe this code works for you. You have to fill varlen with the field widths and put the corresponding type strings (e.g. numeric, character, integer) in colclasses.
my.readfwf <- function(filename, varlen, colclasses) {
  # start and end column of every field, derived from the field widths
  sidx <- head(cumsum(c(1, varlen)), -1)
  eidx <- sidx + varlen - 1
  # read the file as one string per line
  filecontent <- scan(filename, character(0), sep = "\n", quiet = TRUE)
  if (any(diff(nchar(filecontent)) != 0))
    stop("line lengths differ!")
  res <- list()
  for (i in seq_along(varlen)) {
    # substring() is vectorised over the lines, so sapply() is not needed
    res[[i]] <- substring(filecontent, sidx[i], eidx[i])
    # convert the field to the requested type
    mode(res[[i]]) <- colclasses[i]
  }
  # turn the list of columns into a data frame
  attributes(res) <- list(names = paste("V", seq_along(res), sep = ""),
                          row.names = seq_along(res[[1]]),
                          class = "data.frame")
  return(res)
}
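As an alternative to a hand-rolled reader, read.fwf() forwards extra arguments to read.table(), so colClasses can be supplied to stop the columns from becoming factors. The sketch below runs on a tiny temporary file; the widths, names, and classes are invented for the example and would come from testcsv3 in the real case. Whether this is enough to fit the 70 MB file in memory is not something the example can show.

```r
# Write a tiny fixed-width file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("AB123 x",
             "CD 45 y"), tmp)

# Invented layout: 2-char code, 3-char integer, 2-char flag
widths   <- c(2, 3, 2)
varnames <- c("V1", "V2", "V3")
classes  <- c("character", "integer", "character")

# colClasses and strip.white are passed through to read.table();
# buffersize controls how many lines are processed at a time
dat <- read.fwf(tmp, widths = widths, col.names = varnames,
                colClasses = classes, strip.white = TRUE,
                buffersize = 1000)
str(dat)
unlink(tmp)
```

With the real data the call would look like read.fwf(impfile, widths = testcsv3$Varlen, col.names = testcsv3$Varname, colClasses = testcsv3$Varclass), assuming Varclass holds valid class names.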