I am trying to create a NetCDF file from a .csv file. I have read several tutorials here and elsewhere and still have some doubts.
I have a table according to this:
lat,long,time,rh,temp
41,-109,6,1,1
40,-107,18,2,2
39,-105,6,3,3
41,-103,18,4,4
40,-109,6,5,2
39,-107,18,6,4
I create the NetCDF using the ncdf4 package in R.
xvals <- data$lon
yvals <- data$lat
nx <- length(xvals)
ny <- length(yvals)
lon1 <- ncdim_def("longitude", "degrees_east", xvals)
lat2 <- ncdim_def("latitude", "degrees_north", yvals)
time <- data$time
mv <- -999 #missing value to use
var_temp <- ncvar_def("temperatura", "celsius", list(lon1, lat2, time), longname="Temp. da superfície", mv)
var_rh <- ncvar_def("humidade", "%", list(lon1, lat2, time), longname = "humidade relativa", mv )
ncnew <- nc_create(filename, list(var_temp, var_rh))
ncvar_put(ncnew, var_temp, dadostemp, start=c(1,1,1), count=c(nx,ny,nt))
When I follow this procedure, it states that the nc file expects 3 times the number of values that I have.
I understand why: one value for every longitude/latitude/time combination, since I stated that the variables are indexed by longitude, latitude and time.
So, how would I import this kind of data, where I already have one lon, lat, time and the other variables for each data acquisition?
Could someone shed some light?
PS: The data used here is not my real data, just some example I was using for the tutorials.
I think there is more than one problem in your code. Step by step:
Create dimensions
In a nc file, dimensions don't work as key-value pairs; each dimension is just a vector of values defining what each position in a variable's array means.
This means you should create your dimensions like this:
xvals <- unique(data$lon)
xvals <- xvals[order(xvals)]
yvals <- unique(data$lat)
yvals <- yvals[order(yvals)]
lon1 <- ncdim_def("longitude", "degrees_east", xvals)
lat2 <- ncdim_def("latitude", "degrees_north", yvals)
time <- data$time
time_d <- ncdim_def("time","h",unique(time))
Where I work we use unlimited dimensions as mere indexes, while a 1d-variable with the same name as the dimension holds the values. I'm not sure how unlimited dimensions behave in R, and since you don't ask for it I leave the details out :-)
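For reference, ncdf4 does support unlimited dimensions via the unlim argument of ncdim_def; a minimal sketch, reusing the time values from above:
time_d <- ncdim_def("time", "h", unique(time), unlim = TRUE)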
Define variables
mv <- -999 #missing value to use
var_temp <- ncvar_def("temperatura", "celsius",
list(lon1, lat2, time_d),
longname="Temp. da superfície", mv)
var_rh <- ncvar_def("humidade", "%",
list(lon1, lat2, time_d),
longname = "humidade relativa", mv )
Add data
Create an nc file: ncnew <- nc_create(f, list(var_temp, var_rh))
When adding values, the object holding the data is flattened to a 1d-array, and a sequential write is started at the position specified by start. How many values are written along each dimension is controlled by count. If you have data like this:
long, lat, time, t
1, 1, 1, 1
2, 1, 1, 2
1, 2, 1, 3
2, 2, 1, 4
The command ncvar_put(ncnew, var_temp, data$t, count = c(2, 2, 1)) would give you what you (probably) expect.
For your data, the first step is to create the indexes for the dimensions:
data$idx_lon <- match(data$long,xvals)
data$idx_lat <- match(data$lat,yvals)
data$idx_time <- match(data$time,unique(time))
Then create an array whose dimensions match the order in the variable definition (lon, lat, time):
m <- array(mv, dim = c(length(xvals), length(yvals), length(unique(time))))
Then fill the array with your values:
for (i in 1:NROW(data)) {
  m[data$idx_lon[i], data$idx_lat[i], data$idx_time[i]] <- data$temp[i]
}
If speed is a concern, you could calculate the indices vectorised and use them for the value assignment, as sketched below.
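A minimal sketch of that idea, using base R matrix indexing on the m and idx_* objects defined above:
m[cbind(data$idx_lon, data$idx_lat, data$idx_time)] <- data$temp  # one vectorised assignment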
Write the data
ncvar_put(ncnew, var_temp,m)
Note that you don't need start and count.
Finally, close the nc file to write the data to disk: nc_close(ncnew)
Optionally, I recommend the ncdump console command to check your file.
Edit
Regarding your question whether to write a complete array or use start and count, I believe both methods work reliably. Which one to prefer depends on your data and your personal preferences.
I think the method of building an array, adding the values and then writing it as a whole is easier to understand. However, when asking which is more efficient, it depends on the data. If your data is big and has many NA values, I believe multiple writes with start and count could be faster. If NAs are rare, creating one matrix and doing a single write would be faster. If your data is so big that creating an extra array would exceed your available memory, you have to combine both methods.
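For completeness, a minimal sketch of the start/count variant, assuming the ncnew, var_temp and idx_* objects created above (one small write per record):
for (i in 1:NROW(data)) {
  ncvar_put(ncnew, var_temp, data$temp[i],
            start = c(data$idx_lon[i], data$idx_lat[i], data$idx_time[i]),
            count = c(1, 1, 1))
}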
When doing raster math, for example raster1 - raster2, the datatype of the output raster is 'FLT4S', even if the datatype of both raster1 and raster2 is 'INT2S'. How can I force the output to be 'INT2S' without writing to disk? Is there a global way of saying that all raster processing shall result in INT2S data?
The reason for wanting 'INT2S' instead of 'FLT4S' is to save memory and speed up processing when using for loops on larger raster datasets.
In rasterOptions() one can specify dataType, but as far as I understand that only applies when writing to disk, right?
#load package raster
require (raster)
#create sample rasters
r1<-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=1:100)
r2<-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=100:1)
#set dataType of sample rasters to 'INT2S'
dataType(r1)<-'INT2S'
dataType(r2)<-'INT2S'
#check dataType of sample rasters
dataType(r1)
dataType(r2)
#do some simple arithmetics
r3<-r2-r1
#check the dataType of the output raster
dataType(r3)
I would like dataType(r3) to be 'INT2S' as well
In your example, it actually does work. Notice the L after the numbers in the vals argument. That is equivalent to as.integer.
library(raster)
r1 <-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=1:100L)
r2 <-raster::raster(ext=extent(c(0,10,0,10)), res=1, vals=100:1L)
r3 <- r2 - r1
class(values(r3))
#[1] "integer"
But in other cases it does not work
class(values(r3 - 2L))
[1] "numeric"
And you cannot control that behavior.
Note that dataType provides information about the file that the RasterLayer refers to. If there is no file, like in the above example, the value is meaningless. You should also not set it, except for debugging when dealing with an existing file.
So the best you can do is set the datatype when writing to disk as in
writeRaster(r3, filename="test.tif", datatype="INT2S")
I have the same problem with 'INT2U' data and I don't believe it's possible. AFAIK, raster math in R returns 'numeric' (double precision) values. Floating point can be coerced to integer with as.integer(), but note that it just truncates.
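A quick illustration of that truncation:
as.integer(2.7)          # 2 -- truncated toward zero, not rounded
as.integer(-2.7)         # -2
as.integer(round(2.7))   # 3 -- round first if rounding is what you want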
I've had this same question too. Hopefully someone will chime in if I'm wrong, but I believe you really only have one option and that is to convert the output when you write to disk:
writeRaster(r3,filename="r3.tif", format="GTiff", datatype='INT2S')
then load back into R.
You can convert the output after the calculation, but it does not change the size of the object:
dataType(r3)<-'INT2S'
You can check the object size using:
object.size(r3)
If you write the data to disk and reload it, the object size will be smaller.
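A sketch of that round trip, with an assumed file name:
writeRaster(r3, filename = "r3_int2s.tif", datatype = "INT2S", overwrite = TRUE)
r3_disk <- raster("r3_int2s.tif")
object.size(r3_disk)  # smaller: the values stay on disk until needed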
Another option is to use the calc or overlay functions in raster to do the math, save to disk and return the resulting raster in one go. I wrote the example rasters to disk first, since Robert said you should not set dataType directly, and in real cases you will probably be reading from disk.
# load package raster
library(raster)
# create sample rasters on disk
r1 <- raster::raster(ext = extent(c(0, 10, 0, 10)), res = 1, vals = 1:100)
r2 <- raster::raster(ext = extent(c(0, 10, 0, 10)), res = 1, vals = 100:1)
r1_fl <- tempfile(fileext = ".tif")
r2_fl <- tempfile(fileext = ".tif")
r3_fl <- tempfile(fileext = ".tif")
writeRaster(r1, r1_fl, datatype = "INT2S")
writeRaster(r2, r2_fl, datatype = "INT2S")
# read rasters from disk
r1 <- raster(r1_fl)
r2 <- raster(r2_fl)
# check dataType of sample rasters
dataType(r1)
dataType(r2)
# do some simple (or complex) arithmetic
r3 <- overlay(r1, r2, fun = function(x, y){x - y}, filename = r3_fl,
datatype = "INT2S")
# check the dataType of the output raster
dataType(r3)
I'm having a memory problem, with R giving the "cannot allocate vector of size XX Gb" error message. I have a bunch of daily files (12784 days) in netCDF format giving sea surface temperature on a 1305x378 (longitude-latitude) grid. That gives 493290 points each day, decreasing to about 245000 when removing NAs (over land points).
My final objective is to build a time series for each of the 245000 points from the daily files and find the temporal trend for each point. My idea was to build a big data frame with a point per row and a day per column (245000 x 12784) so I could apply the trend calculation to every point. But then, building such a data frame, the memory problem appeared, as expected.
First I tried a script I had previously used to read the data and extract a three-column (lon-lat-sst) data frame by reading the nc file and then melting the data. This led to excessive computing time when tried on a small set of days, and to the memory problem. Then I tried to subset the daily files into longitudinal slices; this avoided the memory problem, but the csv output files were too big and the process was very time-consuming.
Another strategy I've tried, so far without success, has been to sequentially read all the nc files, then extract all the daily values for each point and find the trend. Then I would only need to save a single 245000-point data frame. But I think this would be time-consuming and not the proper R way.
I have been reading about the bigmemory and ff packages, trying to declare a big.matrix or a 3D array (1305 x 378 x 12784), but have had no success so far.
What would be the appropriate strategy to face the problem?
Extract single point time series to calculate individual trends and populate a smaller dataframe
Subset daily files in slices to avoid the memory problem but end with a lot of dataframes/files
Try to solve the memory problem with bigmemory or ff packages
Thanks in advance for your help
EDIT 1
Add code to fill the matrix
library(stringr)
library(ncdf4)
library(reshape2)
library(dplyr)
# paths
ruta_datos<-"/home/meteo/PROJECTES/VERSUS/CMEMS/DATA/SST/"
ruta_treball<-"/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL/"
setwd(ruta_treball)
sst_data_full <- function(inputfile) {
  sstFile <- nc_open(inputfile)
  sst_read <- list()
  sst_read$lon <- ncvar_get(sstFile, "lon")
  sst_read$lats <- ncvar_get(sstFile, "lat")
  sst_read$sst <- ncvar_get(sstFile, "analysed_sst")
  nc_close(sstFile)
  sst_read
}
melt_sst <- function(L) {
  dimnames(L$sst) <- list(lon = L$lon, lat = L$lats)
  sst_read <- melt(L$sst, value.name = "sst")
}
# One month list file: This ends with a df of 245855 rows x 33 columns
files <- list.files(path = ruta_datos, pattern = "SST-CMEMS-198201")
sst.out <- data.frame()
for (i in 1:length(files)) {
  sst <- sst_data_full(paste0(ruta_datos, files[i]))
  msst <- melt_sst(sst)
  msst <- subset(msst, !is.na(msst$sst))
  if (i == 1) {
    sst.out <- msst
  } else {
    sst.out <- cbind(sst.out, msst$sst)
  }
}
EDIT 2
Code used on a previous (smaller) data frame to calculate the temporal trend. The original data was a matrix of time series, each column being a series.
library(forecast)
data <- read.csv(....)
output <- numeric(length(data) - 1)  # pre-allocate the vector of slopes
for (i in 2:length(data)) {
  valor <- data[, i]
  datos2 <- as.data.frame(cbind(data$fecha, valor))
  datos.ts <- ts(datos2$valor, frequency = 365)
  datos.stl <- stl(datos.ts, s.window = 365)
  datos.tslm <- tslm(datos.ts ~ trend)
  output[i - 1] <- datos.tslm$coefficients[2]
}
fecha is date variable name
EDIT 3
Working code from F. Privé's answer
library(bigmemory)
tmp <- sst_data_full(paste0(ruta_datos, files[1]))
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files), backingfile = "/home/meteo/PROJECTES/VERSUS/CMEMS/TREBALL")
for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(paste0(ruta_datos, files[i]))$sst
}
With this code a big matrix was created
dim(mat)
[1] 493290 12783
mat[1,1]
[1] 293.05
mat[1,1:10]
[1] 293.05 293.06 292.98 292.96 292.96 293.00 292.97 292.99 292.89 292.97
ncol(mat)
[1] 12783
nrow(mat)
[1] 493290
So, to read your data into a Filebacked Big Matrix (FBM), you can do
files <- list.files(path = "SST-CMEMS", pattern = "SST-CMEMS-198201*",
full.names = TRUE)
tmp <- sst_data_full(files[1])
library(bigstatsr)
mat <- FBM(length(tmp$sst), length(files))
for (i in seq_along(files)) {
  mat[, i] <- sst_data_full(files[i])$sst
}
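From there, a hedged sketch of the trend step (my assumption of what you want, not tested on your data): big_apply walks the FBM block by block, and the slope is the closed-form OLS estimate for each pixel row, with the day index standing in for the time axis.
days <- seq_len(ncol(mat))
slopes <- big_apply(mat, a.FUN = function(X, ind, days) {
  apply(X[ind, , drop = FALSE], 1, function(ts) {
    ok <- !is.na(ts)
    if (sum(ok) < 2) return(NA_real_)
    cov(ts[ok], days[ok]) / var(days[ok])  # OLS slope of sst ~ day
  })
}, ind = rows_along(mat), a.combine = "c", days = days)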
I am organizing weather data into netCDF files in R. Everything goes fine until I try to populate the netcdf variables with data, because it is asking me to specify only one dimension for two-dimensional variables.
library(ncdf)
These are the dimension tags for the variables. Each variable uses the Threshold dimension and one of the other two dimensions.
th <- dim.def.ncdf("Threshold", "level", c(5,6,7,8,9,10,50,75,100))
rt <- dim.def.ncdf("RainMinimum", "cm", c(5, 10, 25))
wt <- dim.def.ncdf("WindMinimum", "m/s", c(18, 30, 50))
The variables are created in a loop, and there are a lot of them, so for the sake of easy understanding, in my example I'll only populate the list of variables with one variable.
vars <- list()
v1 <- var.def.ncdf("ARMM_rain", "percent", list(th, rt), -1, prec="double")
vars[[length(vars)+1]] <- v1
ncdata <- create.ncdf("composite.nc", vars)
I use another loop to extract data from different data files into a 9x3 data frame named subframe, while iterating through the variables of the netCDF file with varindex. For the sake of reproducibility, I'll give a quick initialization for these values.
varindex <- 1
subframe <- data.frame(matrix(nrow=9, ncol=3, rep(.01, 27)))
The desired outcome from there is to populate each ncdf variable with the contents of subframe. The code to do so is:
for (x in 1:9) {
  for (y in 1:3) {
    value <- ifelse(is.na(subframe[x, y]), -1, subframe[x, y])
    put.var.ncdf(ncdata, varindex, value, start = c(x, y), count = 1)
  }
}
The error message is:
Error in put.var.ncdf(ncdata, varindex, value, start = c(x, y), count = 1) :
'start' should specify 1 dims but actually specifies 2
tl;dr: I have defined two-dimensional variables using ncdf in R, I am trying to write data to them, but I am getting an error message because R believes they are single-dimensional variables instead.
Anyone know how to fix this error?
Following a previous question (Faster reading of time series from netCDF?) I have re-permuted my netCDF files to provide fast time-series reads (scripts on github to be cleaned up eventually ...).
In short, to make reads faster, I have rearranged the dimensions from lat, lon, time to time, lat, lon. Now, my existing scripts break because they assume that the dimensions will always be lat, lon, time, following the ncdf4 documentation of ncvar_get, for the 'start' argument:
Order is X-Y-Z-T (i.e., the time dimension is last)
However, this is not the case.
Furthermore, there is a related inconsistency in the order of variables listed via the commandline netCDF utility ncdump -h and the R function ncdf4::nc_open. The first says that the dimensions are in the expected (lat, lon, time) order while the latter sees dimensions with time first (time, lat, lon).
For a minimal example, download the file test.nc and run
bash-$ ncdump -h test.nc
bash-$ R
R> library(ncdf4)
R> print(nc_open("test.nc"))
What I want to do is get records 5-15 from the variable "lwdown", which following the documented X-Y-Z-T order would be:
my.nc <- nc_open("test.nc")
lwdown <- ncvar_get(my.nc, "lwdown", start = c(1, 1, 5), count = c(1, 1, 10))
But this doesn't work, since R sees the time dimension first, so I must change my scripts to
ncvar_get(my.nc, "lwdown", start = c(5, 1, 1), count = c(10, 1, 1))
It wouldn't be so bad to update my scripts and functions, except that I want to be able to read data from files regardless of the dimension order.
Other than that, is there a way to generalize this function so that it works independently of dimension order?
While asking the question, I figured out this solution, though there is still room for improvement:
The closest I can get is to open the file and find the order in this way:
my.nc$var$lwdown$dim[[1]]$name
[1] "time"
my.nc$var$lwdown$dim[[2]]$name
[1] "lon"
my.nc$var$lwdown$dim[[3]]$name
[1] "lat"
which is a bit unsatisfying, although it led me to this solution:
If I want to start at c(lat = 1, lon = 1, time = 5), but ncvar_get expects an arbitrary order, I can say:
start <- c(lat = 1, lon = 1, time = 5)
count <- c(lat = 1, lon = 1, time = 10)
dim.order <- sapply(my.nc$var$lwdown$dim, function(x) x$name)
ncvar_get(my.nc, "lwdown", start = start[dim.order], count = count[dim.order])
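The same idea can be wrapped in a small helper; ncvar_get_named is a hypothetical name of mine, not part of ncdf4:
ncvar_get_named <- function(nc, varname, start, count) {
  dim.order <- sapply(nc$var[[varname]]$dim, function(x) x$name)
  ncvar_get(nc, varname, start = start[dim.order], count = count[dim.order])
}
lwdown <- ncvar_get_named(my.nc, "lwdown",
                          start = c(lat = 1, lon = 1, time = 5),
                          count = c(lat = 1, lon = 1, time = 10))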
I ran into this recently as well. I have a netcdf with data in this format
nc_in <- nc_open("my.nc")
nc_in$dim[[1]]$name == "time"
nc_in$dim[[2]]$name == "latitude"
nc_in$dim[[3]]$name == "longitude"
nc_in$dim[[1]]$len == 3653 # this is the number of timesteps in my netcdf
nc_in$dim[[2]]$len == 180 # this is the number of latitude cells
nc_in$dim[[3]]$len == 360 # this is the number of longitude cells
The obnoxious part here is that the DIM component of the netCDF is in the order of T,Y,X
If I try to grab time series data for the pr var using the indices in the order they appear in nc_in$dim, I get an error
ncvar_get(nc_in,"pr")[3653,180,360] # 'subscript out of bounds'
If I instead grab data in X,Y,T order, it works:
ncvar_get(nc_in,"pr")[360,180,3653] # gives me a value
What I don't understand is how ncvar_get() knows which dimension represents X, Y and T, especially if you have generated your own netCDF.
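As far as I can tell, it goes by the variable's own dim list (the same thing the answer above inspects), which you can check directly instead of the file-level nc_in$dim:
sapply(nc_in$var[["pr"]]$dim, function(d) d$name)  # dimension order ncvar_get() uses for "pr"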
I've been trying to find a time-efficient way to merge multiple raster images in R. These are adjacent ASTER scenes from the southern Kilimanjaro region, and my target is to put them together to obtain one large image.
This is what I got so far (object 'ast14dmo' representing a list of RasterLayer objects):
# Loop through single ASTER scenes
for (i in seq(ast14dmo.sd)) {
  if (i == 1) {
    # Merge current with subsequent scene
    ast14dmo.sd.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
  } else if (i > 1 && i < length(ast14dmo.sd)) {
    tmp.mrg <- merge(ast14dmo.sd[[i]], ast14dmo.sd[[i+1]], tolerance = 1)
    ast14dmo.sd.mrg <- merge(ast14dmo.sd.mrg, tmp.mrg, tolerance = 1)
  } else {
    # Save merged image
    writeRaster(ast14dmo.sd.mrg, paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg", sep = ""),
                format = "GTiff", overwrite = TRUE)
  }
}
As you surely guess, the code works. However, merging takes quite long, considering that each single raster object is some 70 MB. I also tried Reduce and do.call, but both failed because I couldn't pass the argument 'tolerance', which works around the different origins of the raster files.
Anybody got an idea of how to speed things up?
You can use do.call
ast14dmo.sd$tolerance <- 1
ast14dmo.sd$filename <- paste(path.mrg, "/AST14DMO_sd_", z, "m_mrg.tif", sep = "")
ast14dmo.sd$overwrite <- TRUE
mm <- do.call(merge, ast14dmo.sd)
Here with some data, from the example in raster::merge
r1 <- raster(xmx=-150, ymn=60, ncols=30, nrows=30)
r1[] <- 1:ncell(r1)
r2 <- raster(xmn=-100, xmx=-50, ymx=50, ymn=30)
res(r2) <- c(xres(r1), yres(r1))
r2[] <- 1:ncell(r2)
x <- list(r1, r2)
names(x) <- c("x", "y")
x$filename <- 'test.tif'
x$overwrite <- TRUE
m <- do.call(merge, x)
The 'merge' function from the raster package is a little slow. For large projects a faster option is to work with gdal commands in R.
library(gdalUtils)
library(rgdal)
Build list of all raster files you want to join (in your current working directory).
all_my_rasts <- c('r1.tif', 'r2.tif', 'r3.tif')
Make a template raster file to build onto. Think of this as a big blank canvas to add tiles to.
e <- extent(-131, -124, 49, 53)
template <- raster(e)
projection(template) <- '+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs'
writeRaster(template, filename = "MyBigNastyRasty.tif", format = "GTiff")
Merge all raster tiles into one big raster.
mosaic_rasters(gdalfile = all_my_rasts, dst_dataset = "MyBigNastyRasty.tif", of = "GTiff")
gdalinfo("MyBigNastyRasty.tif")
This should work pretty well for speed (faster than merge in the raster package), but if you have thousands of tiles you might even want to look into building a vrt first; a sketch follows.
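A rough sketch of the vrt route, reusing the gdalUtils objects from above (the output file name is my own):
gdalbuildvrt(gdalfile = all_my_rasts, output.vrt = "mosaic.vrt")
gdal_translate(src_dataset = "mosaic.vrt", dst_dataset = "MyBigMosaic.tif", of = "GTiff")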
You can use Reduce, for example like this:
Reduce(function(...) merge(..., tolerance = 1), ast14dmo.sd)
The SAGA GIS mosaicking tool (http://www.saga-gis.org/saga_tool_doc/7.3.0/grid_tools_3.html) gives you maximum flexibility for merging numeric layers, and it runs in parallel by default. You only have to translate all rasters/images to the SAGA .sgrd format first, then run the command-line saga_cmd.
I have tested the solution using gdalUtils, as proposed by Matthew Bayly. It works quite well and fast (I have about 1000 images to merge). However, after checking the documentation of the mosaic_rasters function here, I found that it works without making a template raster before mosaicking the images. I pasted the example code from the documentation below:
outdir <- tempdir()
gdal_setInstallation()
valid_install <- !is.null(getOption("gdalUtils_gdalPath"))
if (require(raster) && require(rgdal) && valid_install) {
  layer1 <- system.file("external/tahoe_lidar_bareearth.tif", package = "gdalUtils")
  layer2 <- system.file("external/tahoe_lidar_highesthit.tif", package = "gdalUtils")
  mosaic_rasters(gdalfile = c(layer1, layer2),
                 dst_dataset = file.path(outdir, "test_mosaic.envi"),
                 separate = TRUE, of = "ENVI", verbose = TRUE)
  gdalinfo(file.path(outdir, "test_mosaic.envi"))
}
I was faced with this same problem, and I used:
# Read desired files into R
data_name1 <- 'file_name1.tif'
r1 <- raster(data_name1)
data_name2 <- 'file_name2.tif'
r2 <- raster(data_name2)
# Merge files
new_data <- raster::merge(r1, r2)
Although it did not produce a new merged raster file on disk, the merged raster was stored in the environment and produced a merged map when plotted.
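If a file on disk is wanted as well, one extra line does it (the file name here is just an example):
writeRaster(new_data, filename = "merged_raster.tif", format = "GTiff", overwrite = TRUE)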
I ran into the following problem when trying to mosaic several rasters on top of each other
In vv[is.na(vv)] <- getValues(x[[i]])[is.na(vv)] :
number of items to replace is not a multiple of replacement length
As #Robert Hijmans pointed out, it was likely because of misaligned rasters. To work around this, I had to resample the rasters first
library(raster)
x <- raster("Base_raster.tif")
r1 <- raster("Top1_raster.tif")
r2 <- raster("Top2_raster.tif")
# Resample
x1 <- resample(r1, crop(x, r1))
x2 <- resample(r2, crop(x, r2))
# Merge rasters. Make sure to use the right order
m <- merge(merge(x1, x2), x)
# Write output
writeRaster(m,
filename = file.path("Mosaic_raster.tif"),
format = "GTiff",
overwrite = TRUE)