Combining .nc files and extracting selected variables - r

I have a similar question to u/Ananas here: Sentinel3 OLCI (chl) Average of netcdf files on Python
I am running into similar problems, in that I cannot seem to extract the necessary information from the .nc files and then merge them to create a time series. In my case, I am trying to do this in R. My current code, which I have followed and customised from here: https://www.youtube.com/watch?v=jWRszWCVWLc&t=1504s , returns an error:
Error in `[<-.data.frame`(`*tmp*`, variable, value = c(0, 0, 0, 0, 0, :
replacement has 1927 rows, data has 2202561
Maybe I am going about it the wrong way from the start and R's capabilities with .nc files are not suited for this? Any suggestions are welcome.
Here is my code
extract_variable_from_netcdf <- function(nc, variable) {
  tryCatch(
    {
      result <- var.get.nc(nc, variable)
      return(result)
    },
    error = function(cond) {
      message(paste(variable, "attribute not found"))
      message("Here is the original error message")
      message(cond)
    }
  )
}

extract_global_attribute_from_netcdf <- function(nc, global_attribute) {
  tryCatch(
    {
      result <- att.get.nc(nc, "NC_GLOBAL", global_attribute)
      return(result)
    },
    error = function(cond) {
      message(paste(global_attribute, "attribute not found"))
      message("Here is the original error message")
      message(cond)
    }
  )
}

folder <- "path to folder"
files <- list.files(folder, pattern = ".nc", full.names = TRUE)
variables <- c("conc_chl", "iop_bpart", "lat", "lon")  # variables I need to extract
global_attrs <- c("start_date", "stop_date")
headers <- c(global_attrs, variables)

df <- data.frame(matrix(ncol = length(headers), nrow = 0))
colnames(df) <- headers

for (file in files) {
  nc <- open.nc(file)
  chl <- var.get.nc(nc, "conc_chl")
  num_chl <- length(chl)
  newdf <- data.frame(matrix(ncol = length(headers), nrow = num_chl))
  colnames(newdf) <- headers
  for (global_attribute in global_attrs) {
    newdf[global_attribute] <- extract_global_attribute_from_netcdf(nc, global_attribute)
  }
  for (variable in variables) {
    newdf[variable] <- extract_variable_from_netcdf(nc, variable)
  }
  df <- merge(df, newdf, all = TRUE)
}
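For what it's worth, the row counts in the error are telling: 1927 × 1143 = 2,202,561, which suggests that lat and lon are 1-D dimension vectors while conc_chl is a full 2-D grid, so assigning them into a data frame with one row per grid cell fails. A minimal sketch of one way around this, assuming 1-D lat/lon and a 2-D conc_chl (check the dimension order with print.nc(nc) and swap the expand.grid() arguments if needed):
library(RNetCDF)
nc  <- open.nc(files[1])
lon <- var.get.nc(nc, "lon")         # 1-D dimension vector (assumed)
lat <- var.get.nc(nc, "lat")         # 1-D dimension vector (assumed)
chl <- var.get.nc(nc, "conc_chl")    # 2-D grid, e.g. length(lon) x length(lat)
# one row per grid cell, in the same (column-major) order as as.vector(chl)
newdf <- data.frame(expand.grid(lon = lon, lat = lat),
                    conc_chl   = as.vector(chl),
                    start_date = att.get.nc(nc, "NC_GLOBAL", "start_date"),
                    stop_date  = att.get.nc(nc, "NC_GLOBAL", "stop_date"))
close.nc(nc)
Doing that per file and then binding the per-file frames (do.call(rbind, ...)) would give the long time-series table the question is after.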

The way I have used ".nc" files with satellite data in R is to read them in with the "raster" library as raster files.
library(raster)
r <- raster("yuor_file.nc")
plot(r) # quick plot to see if everything is as it should be
The way I read in my time series was with a loop, and in addition I used a function found somewhere on this site to convert the raster into a sensible R data frame.
Stack Overflow function to convert the loaded raster to a data frame:
gplot_data <- function(x, maxpixels = 50000) {
  x <- raster::sampleRegular(x, maxpixels, asRaster = TRUE)
  coords <- raster::xyFromCell(x, seq_len(raster::ncell(x)))
  ## Extract values
  dat <- utils::stack(as.data.frame(raster::getValues(x)))
  names(dat) <- c('value', 'variable')
  dat <- dplyr::as.tbl(data.frame(coords, dat))
  if (!is.null(levels(x))) {
    dat <- dplyr::left_join(dat, levels(x)[[1]],
                            by = c("value" = "ID"))
  }
  dat
}
Read in one file at a time, convert it with the function, and return a data.frame:
files <- list.files(folder, pattern = ".nc", full.names = TRUE)

fun <- function(i) {
  # read in one file at a time
  r <- raster(files[i])
  # convert to normal data frame
  temp <- gplot_data(r)
  temp  # output
}

dat <- plyr::rbind.fill(lapply(1:length(files), fun))  # bind each iteration
Here is a plot using ggplot2 and ggforce.
ggplot() +
  geom_tile(data = dat,
            aes(x = x, y = y, fill = value))
Alternatively, if you do not know the contents of your file, the following, using the "ncdf4" package, will help you inspect it. https://towardsdatascience.com/how-to-crack-open-netcdf-files-in-r-and-extract-data-as-time-series-24107b70dcd
library(ncdf4)
our_nc_data <- nc_open("/your_file.nc")
print(our_nc_data)
# look for the variable names and assign them to vectors that can be bound together in dataframes
lat <- ncvar_get(our_nc_data, "lat") #names of latitude column
lon <- ncvar_get(our_nc_data, "lon") #name of longitude column
time <- ncvar_get(our_nc_data, "time") #the time was called time
tunits <- ncatt_get(our_nc_data, "time", "units")# check units
lswt_array <- ncvar_get(our_nc_data, "analysed_sst") #select the relevant variable, this is temperature named "analysed_sst"
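Following on from that tutorial, the extracted pieces can be assembled into one long data frame; a rough sketch, assuming lswt_array has dimensions lon x lat x time (check print(our_nc_data)) and that the _FillValue attribute is present:
# replace fill values with NA before averaging/plotting
fill <- ncatt_get(our_nc_data, "analysed_sst", "_FillValue")
lswt_array[lswt_array == fill$value] <- NA
# one row per (lon, lat, time) cell, matching as.vector()'s column-major order
sst_df <- expand.grid(lon = lon, lat = lat, time = time)
sst_df$analysed_sst <- as.vector(lswt_array)
# tunits$value tells you how to convert the raw time axis; for example, for
# "seconds since 1981-01-01" (an assumption -- check your own file) it would be:
# sst_df$date <- as.POSIXct(sst_df$time, origin = "1981-01-01", tz = "UTC")
nc_close(our_nc_data)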

Related

Streamline code, many years of data to process - lapply, raster, stack, mean, crop to AOI

I have some code in R which does the following:
Uses lapply to bring in files in a set folder, e.g. 1997 data
Makes the file list into a brick - they are NetCDF files, so I've used the brick function
Stacks the bricks into one raster stack of months for each year
Calculates the mean from the stack
Crops the new mean raster to the Area of Interest (AOI)
I've got working code, see below, but it is clunky and I feel it could be better as one loop that I could then run through each year's folder (I have data from 1997 to 2018). Could anyone help streamline this into a simple looped version I could run by changing the file path? I've used loops a bit before, but not written one from scratch.
# Packages:
library(raster)
library(parallel)  # Check cores in PC
library(lubridate) # needed for lapply
library(dplyr)     # ""
library(sf)        # For clipping data
library(rgdal)

# ChlA
# Set file paths for input and outputs:
usingfp <- "/filepath/GIS/ChlA/1997/"
the_dir_ex <- "Data/CHL/1997"

# List all NETCDF files in folder:
CHL_1997 <- list.files(path = usingfp, pattern = "\\.nc$", full.names = TRUE,
                       recursive = FALSE)

# Make file list into brick
CHL_1997_brick <- lapply(CHL_1997,
                         FUN = brick,
                         the_dir = the_dir_ex)

# Stack bricks
s <- stack(CHL_1997_brick)

# Calculate mean from stack
mean <- calc(s, fun = mean, na.rm = T)
plot(mean)

# Load vector boundary to "crop" to
AOI <- readOGR("/filepath/AOI/AOI.shp")
plot(AOI,
     main = "Shapefile imported into R - crop extent",
     axes = TRUE,
     border = "blue",
     add = T)

# crop the raster using the vector extent
CHL_1997_mean <- crop(mean, AOI)
plot(CHL_1997_mean, main = "Cropped mean CHL - 1997")

# add shapefile on top of the existing raster
plot(AOI, add = TRUE)
Thanks very much.
Something like this should work
library(raster)
AOI <- shapefile("/filepath/AOI/AOI.shp")
path <- "/filepath/GIS/ChlA/"
years <- 1997:2018
for (yr in years) {
  fp <- file.path(path, yr)
  fout <- file.path(fp, paste0(yr, ".tif"))
  print(fout); flush.console()
  # if (file.exists(fout)) next
  files <- list.files(path = fp, pattern = "\\.nc$", full.names = TRUE)
  b <- lapply(files, brick)
  s <- stack(b)
  s <- mean(s)
  s <- crop(s, AOI, filename = fout)  #, overwrite=TRUE)
}
Notes:
mean(s) is more efficient than calc(s, mean)
If the AOI is relatively small, it can be more efficient to first use crop, then mean (and then use writeRaster)
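A rough sketch of that crop-first variant, under the same folder layout (the "_AOI.tif" output name is just an illustration):
for (yr in years) {
  fp    <- file.path(path, yr)
  fout  <- file.path(fp, paste0(yr, "_AOI.tif"))
  files <- list.files(path = fp, pattern = "\\.nc$", full.names = TRUE)
  # crop each monthly brick to the AOI first, then average the small stack
  b <- lapply(files, function(f) crop(brick(f), AOI))
  s <- stack(b)
  m <- mean(s)
  writeRaster(m, filename = fout, overwrite = TRUE)
}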
You can also use terra like this:
library(terra)
AOI <- vect("/filepath/AOI/AOI.shp")
path <- "/filepath/GIS/ChlA/"
years <- 1997:2018
for (yr in years) {
  fp <- file.path(path, yr)
  fout <- file.path(fp, paste0(yr, ".tif"))
  print(fout); flush.console()
  # if (file.exists(fout)) next
  files <- list.files(path = fp, pattern = "\\.nc$", full.names = TRUE)
  r <- rast(files)
  s <- mean(r)
  s <- crop(s, AOI, filename = fout)  #, overwrite=TRUE)
}
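If the end goal is a single time series over the AOI (as in the question at the top of this page), the yearly files written by either loop can be collected afterwards; a sketch with terra, assuming every year was cropped to the same AOI grid and written as <year>.tif:
tifs <- file.path(path, years, paste0(years, ".tif"))  # one mean raster per year
yearly <- rast(tifs)
names(yearly) <- years
# one AOI-wide summary value per year, e.g. the mean chlorophyll-a
yearly_means <- global(yearly, "mean", na.rm = TRUE)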

Not able to plot multiple layers in ggplot. Error: subscript out of bounds

I'm trying to plot multiple layers (around 100) on the same plot using ggplot2. To do that I create a dataframe which contains the data assigned according to 'i'.
rm(list = ls())

# file directory
setwd("E:/stage/Flight_Data/Flight_Data/0200")
library(ggplot2)

# to read txt files
liste_fichier <- list.files("./", pattern = ".txt")
n <- length(liste_fichier)

times <- list()
altitudes <- list()
dataframes <- list()

# loop to read the files
for (i in 1:n) {
  fichier <- liste_fichier[i]
  smp <- read.table(fichier, header = FALSE, sep = "", skip = 3, fill = TRUE)
  names(smp) <- c("Date", "Time", "Latitude", "Longitude", "Baro.Altitude",
                  "Radio.Altitude", "Pressure", "A340.static.T", "A340.air.speed",
                  "A340.ground.speed", "zonal.wind", "meridian.wind", "ozone.vmr",
                  "H2O.static.T", "relative.humidity", "RH.validity", "RH.accuracy",
                  "H2O.mmr", "CO.vmr", "NOy.vmr", "NO.vmr", "NOx.vmr",
                  "NOy.uncertainty", "NOy.validity")
  smp$ultratime <- strptime(paste(smp$Date, sprintf("%06d", strtoi(smp$Time)), sep = " "),
                            format = "%Y%m%d %H%M%S")
  dataframes[[i]] <- data.frame(smp$ultratime, smp$Baro.Altitude)
  # Here I want to plot all the files in the same graph; first I want to try only 2
  print(ggplot(NULL, aes(x = smp$ultratime, y = smp$Baro.Altitude)) +
          geom_line(data = dataframes[[1]]) + geom_line(data = dataframes[[2]]))
}
I'm able to plot a single layer for dataframes[[1]], but when I add more, I get the following error:
Error in dataframes[[2]] : subscript out of bounds
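The subscript error comes from the plot call inside the loop: on the first iteration dataframes has only one element, so dataframes[[2]] does not exist yet. A common pattern is to fill the list first and plot once after the loop; a rough sketch (the explicit column names and the flight id column are additions for illustration):
for (i in 1:n) {
  fichier <- liste_fichier[i]
  smp <- read.table(fichier, header = FALSE, sep = "", skip = 3, fill = TRUE)
  # ... same column renaming and strptime() call as above ...
  dataframes[[i]] <- data.frame(ultratime = smp$ultratime,
                                Baro.Altitude = smp$Baro.Altitude,
                                flight = fichier)
}
all_flights <- do.call(rbind, dataframes)
ggplot(all_flights, aes(x = ultratime, y = Baro.Altitude, group = flight)) +
  geom_line()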

Mean values from multiple csv to data frame

After having searched for help in different threads on this topic, I still have not become wiser. Therefore: Here comes another question on looping through multiple data files...
OK. I have multiple CSV files in one folder containing 5 columns of data. The filenames are as follows:
Moist yyyymmdd hh_mm_ss.csv
I would like to create a script that reads and processes the CSV files one by one, doing the following steps:
1) load file
2) check number of rows and exclude file if less than 3 registrations
3) calculate mean value of all measurements (=rows) for column 2
4) calculate mean value of all measurements (=rows) for column 4
5) output the filename timestamp, the mean of column 2, and the mean of column 4 to a data frame.
I have written the following function
moist.each.mean <- function() {
  library("tcltk")
  directory <- tk_choose.dir("", "Choose folder for Humidity data files")
  setwd(directory)
  filelist <- list.files(path = directory)
  filetitles <- regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
  mdf <- data.frame(timestamp = character(), humidity = numeric(), temp = numeric())
  for (i in 1:length(filelist)) {
    file.in[[i]] <- read.csv(filelist[i], header = F)
    if (nrow(file.in[[i]] < 3)) {
      print("discard")
    } else {
      newrow <- c(filetitles[[i]], round(mean(file.in[[i]]$V2), 1), round(mean(file.in[[i]]$V4), 1))
      mdf <- rbind(mdf, newrow)
    }
  }
  names(mdf) <- c("timestamp", "humidity", "temp")
}
but I keep getting an error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = list(V1 = c(10519949L, :
replacement has 18 rows, data has 17
Any ideas?
Thx, kruemelprinz
I'd also suggest using (l)apply... Here's my take:
getMeans <- function(fpath, runfct,
                     target_cols = c(2),
                     sep = ",",
                     dec = ".",
                     header = T,
                     min_obs_threshold = 3) {
  # (runfct is accepted here but not used below)
  f <- list.files(fpath)
  fcsv <- f[grepl("\\.csv", f)]
  fcsv <- paste0(fpath, fcsv)
  csv_list <- lapply(fcsv, read.table, sep = sep,
                     dec = dec, header = header)
  csv_rows <- sapply(csv_list, nrow)
  rel_csv_list <- csv_list[!(csv_rows < min_obs_threshold)]
  # drop = FALSE keeps a one-column selection as a data frame so colMeans() works
  lapply(rel_csv_list, function(x) colMeans(x[, target_cols, drop = FALSE]))
}
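A hypothetical call (the folder path is made up; target_cols = c(2, 4) and header = FALSE match the files described in the question):
# the trailing slash matters because the function builds paths with paste0(fpath, fcsv)
res <- getMeans("path/to/Moist_files/", target_cols = c(2, 4), header = FALSE)
str(res)  # a list with one named vector of column means per retained file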
Also with that kind of error message, the debugger might be very helpful.
Just run debug(moist.each.mean) and execute the function stepwise.
Here's a slightly different approach. Use lapply to read each csv file, exclude it if necessary, otherwise create a summary. This gives you a list where each element is a data frame summary. Then use rbind to create the final summary data frame.
Without a sample of your data, I can't be sure the code below exactly matches your problem, but hopefully it will be enough to get you where you want to go.
# Get vector of filenames to read
filelist = list.files(path = directory, pattern = "csv")

# Read all the csv files into a list and create summaries
df.list = lapply(filelist, function(f) {
  file.in = read.csv(f, header = TRUE, stringsAsFactors = FALSE)
  # Set to empty data frame if file has less than 3 rows of data
  if (nrow(file.in) < 3) {
    print(paste("Discard", f))
  # Otherwise, capture file timestamp and summarise data frame
  } else {
    data.frame(timestamp = substr(f, 7, 22),
               humidity = round(mean(file.in$V2), 1),
               temp = round(mean(file.in$V4), 1))
  }
})

# Bind list into final summary data frame (excluding the list elements
# that don't contain a data frame because they didn't have enough rows
# to be included in the summary)
result = do.call(rbind, df.list[sapply(df.list, is.data.frame)])
One issue with your original code is that you create a vector of summary results rather than a data frame of results:
c(filetitles[[i]], round(mean(file.in[[i]]$V2),1), round(mean(file.in[[i]]$V4),1)) is a vector with three elements. What you actually want is a data frame with three columns:
data.frame(timestamp=filetitles[[i]],
humidity=round(mean(file.in[[i]]$V2),1),
temp=round(mean(file.in[[i]]$V4),1))
Thanks for the suggestions using lapply. This is definitely of value as it saves a whole lot of code as well! Meanwhile, I managed to fix my original code as well:
library("tcltk")
# directory: path to csv files
directory <-
tk_choose.dir("","Choose folder for Humidity data files")
setwd(directory)
filelist <- list.files(path = directory)
filetitles <-
regmatches(filelist, regexpr("[0-9].*[0-9]", filelist))
mdf <- data.frame()
for (i in 1:length(filelist)) {
file.in <- read.csv(filelist[i], header = F, skipNul = T)
if (nrow(file.in) < 3) {
print("discard")
} else {
newrow <-
matrix(
c(filetitles[[i]], round(mean(file.in$V2, na.rm=T),1), round(mean(file.in$V4, na.rm=T),1)), nrow = 1, ncol =
3, byrow = T
)
mdf <- rbind(mdf, newrow)
}
}
names(mdf) <- c("timestamp", "humidity", "temp")
Only I did not get it to work as a function, because then I would only have one row in mdf containing the last file's data. Somehow it did not add rows but overwrote row 1 with each iteration. Using it without a function wrapper worked fine, though...
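Hard to say without seeing the exact function, but a common cause is relying on assignment side effects instead of returning the assembled object. A minimal sketch of the same loop wrapped in a function that builds mdf locally and returns it (the directory argument replaces the interactive tk_choose.dir() call):
moist.each.mean <- function(directory) {
  filelist   <- list.files(path = directory, full.names = TRUE)
  filetitles <- regmatches(basename(filelist), regexpr("[0-9].*[0-9]", basename(filelist)))
  mdf <- data.frame()
  for (i in seq_along(filelist)) {
    file.in <- read.csv(filelist[i], header = FALSE, skipNul = TRUE)
    if (nrow(file.in) < 3) next
    mdf <- rbind(mdf, data.frame(timestamp = filetitles[[i]],
                                 humidity  = round(mean(file.in$V2, na.rm = TRUE), 1),
                                 temp      = round(mean(file.in$V4, na.rm = TRUE), 1)))
  }
  mdf  # return the data frame explicitly
}
# result <- moist.each.mean("path/to/Moist_files")  # hypothetical path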

Saving multiple variables inside a loop R

I'm using this code to do the same operation over a series of data frames. Everything seems to work OK until the moment I try to save each data frame under a different variable name in two .Rda files.
The problem is that I end up with two .Rda files that contain the data frames' variable names correctly, but the content of those variables is exactly the same. It seems to keep only the last data frame's values. I think it's related to the for loop, but I don't know how to solve this.
# Previous steps, loading files...
setwd("path_to_files")
list.files()
files <- list.files(pattern = "toptable")
envar <- new.env()

# Load all files
for (i in files) { load(i, envir = envar) }
varName <- ls(envar)
rm(i)
attach(envar)
variables <- as.list(envar)

# From now on, the function
setwd("/path_to_save")
filtering_data <- function(x) {
  # In here I'm just filtering the data frames by certain values of their columns
  # (the P.Value column and the t column)
  x <- as.data.frame(x)
  pval <- which(x$P.Value < 0.05)
  pval <- x[pval, ]
  up.pval <- which(pval$t > 0)
  down.pval <- which(pval$t < 0)
  up.pval <- pval[up.pval, ]
  down.pval <- pval[down.pval, ]
  # Modify names to use them as variable names when saving
  up_varNames <- paste0(varName, ".", "up")
  down_varNames <- paste0(varName, ".", "down")
  for (i in seq_along(up_varNames)) {
    assign(up_varNames[i], up.pval)
    assign(down_varNames[i], down.pval)
  }
  save(list = up_varNames, file = "up.Rda")
  save(list = down_varNames, file = "down.Rda")
}

# Apply the function to each data frame
lapply(variables, filtering_data)
detach(envar)
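The loop inside filtering_data assigns the same up.pval/down.pval to every name in varName, and the two save() calls overwrite the same files on every lapply() iteration, which is why both .Rda files end up holding identical contents. A sketch of one way to keep each data frame's result separate, looping over the names at the top level (assumes every element of variables has P.Value and t columns):
up_names   <- character(0)
down_names <- character(0)
for (nm in names(variables)) {
  x    <- as.data.frame(variables[[nm]])
  pval <- x[x$P.Value < 0.05, ]
  assign(paste0(nm, ".up"),   pval[pval$t > 0, ])  # one filtered object per input
  assign(paste0(nm, ".down"), pval[pval$t < 0, ])
  up_names   <- c(up_names,   paste0(nm, ".up"))
  down_names <- c(down_names, paste0(nm, ".down"))
}
save(list = up_names,   file = "up.Rda")    # saved once, after the loop
save(list = down_names, file = "down.Rda")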

Creating a list of raster bricks from a multivariate netCDF file

I've been working with the RCP (Representative Concentration Pathway) spatial data. It's a nice gridded dataset in netCDF format. How can I get a list of bricks where each element represents one variable from a multivariate netCDF file (by variable I don't mean lat, lon, time, depth, etc.)? This is what I've tried to do. I can't post an example of the data, but I've set up the script below to be reproducible if you want to look into it. Obviously questions are welcome... I might not have expressed the language associated with the code smoothly. Cheers.
A: Package requirements
library(sp)
library(maptools)
library(raster)
library(ncdf)
library(rgdal)
library(rasterVis)
library(latticeExtra)
B: Gather data and look at the netCDF file structure
td <- tempdir()
tf <- tempfile(pattern = "fileZ")
download.file("http://tntcat.iiasa.ac.at:8787/RcpDb/download/R85_NOX.zip", tf , mode = 'wb' )
nc <- unzip( tf , exdir = td )
list.files(td)
## Take a look at the netCDF file structure, beyond this I don't use the ncdf package directly
ncFile <- open.ncdf(nc)
print(ncFile)
vars <- names(ncFile$var)[1:12] # I'll try to use these variable names later to make a list of bricks
C: Create a raster brick for one variable. Levels correspond to years
r85NOXene <- brick(nc, lvar = 3, varname = "emiss_ene")
NAvalue(r85NOXene) <- 0
dim(r85NOXene) # [1] 360 720 12
D: Names to faces
data(wrld_simpl)  # in maptools
worldPolys <- SpatialPolygons(wrld_simpl@polygons)
cTheme <- rasterTheme(region = rev(heat.colors(20)))
levelplot(r85NOXene, layers = 4, zscaleLog = 10, main = "2020 NOx Emissions From Power Plants",
          margin = FALSE, par.settings = cTheme) + layer(sp.polygons(worldPolys))
E: Summarize all grid cells for each year for one variable, "emiss_ene". I want to do this for each variable of the netCDF file I'm working with.
gVals <- getValues(r85NOXene)
dim(gVals)
r85NOXeneA <- sapply(1:12, function(x) {
  mat <- matrix(gVals[, x], nrow = 360)
  matfun <- sum(mat, na.rm = TRUE)  # Other conversions are needed, but not for the question
  return(matfun)
})
F: Another meet and greet. Check out how E looks
library(ggplot2) # loaded here because of masking issues with latticeExtra
years <- c(2000,2005,seq(2010,2100,by=10))
usNOxDat <- data.frame(years=years,NOx=r85NOXeneA)
ggplot(data=usNOxDat,aes(x=years,y=(NOx))) + geom_line() # names to faces again
detach(package:ggplot2, unload=TRUE)
G: Attempt to create a list of bricks. A list of objects created in part C
brickLst <- lapply(1:12, function(x) {
  tmpBrk <- brick(nc, lvar = 3, varname = vars[x])
  NAvalue(tmpBrk) <- 0
  return(tmpBrk)
  # I thought a list of bricks would be a good structure to do (E) for each netCDF variable.
  # This doesn't break, but it returns all variables in each element of the list.
  # I want one variable in each element of the list.
  # With brick() you can ask for one variable from a netCDF file, as I did in (C).
  # Why can't I loop through the variable names and return one variable for each list element?
})
H: Get rid of the junk you might have downloaded... Sorry
file.remove(dir(td, pattern = "^fileZ",full.names = TRUE))
file.remove(dir(td, pattern = "^R85",full.names = TRUE))
close(ncFile)
Your (E) step can be simplified using cellStats.
foo <- function(x) {
  b <- brick(nc, lvar = 3, varname = x)
  NAvalue(b) <- 0
  cellStats(b, 'sum')
}
sumLayers <- sapply(vars, foo)
sumLayers is the result you are looking for, if I understood your question correctly.
Moreover, you may use the zoo package because you are dealing with time series.
library(zoo)
tt <- getZ(r85NOXene)
z <- zoo(sumLayers, tt)
xyplot(z)
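If you prefer ggplot2 here (as in section F), zoo also ships ggplot2 methods; a short sketch:
library(ggplot2)
autoplot(z)                     # one panel per series by default
zdf <- fortify(z, melt = TRUE)  # or a plain long data frame: Index, Series, Value
head(zdf)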
