Manipulate variables in netcdf files and write them again - r

I have several netcdf files. each nc file has several variables. I am only interested in two variables "Soil_Moisture" and "Soil_Moisture_Dqx".
I would like to filter "Soil_Moisture" based on "Soil_Moisture_Dqx". I want to replace values in "Soil_Moisture" by NA whenever corresponding "Soil_Moisture_Dqx" pixels have values greater than 0.04.
:Here are the files to download:
1- I tried this loop but when I typed f[1] or f[2] I got something weird which means that my loop is incorrect.I am grateful to anyhelp to get my loop corrected.
a<-list.files("C:\\3 nc files", "*.DBL", full.names = TRUE)
for(i in 1:length(a)){
f=open.ncdf(a[i])
A1 = get.var.ncdf(nc=f,varid="Soil_Moisture",verbose=TRUE)
A1* -0.000030518509475997 ## scale factor
A2 = get.var.ncdf(nc=f,varid="Soil_Moisture_Dqx",verbose=TRUE)
A2*-0.0000152592547379985## scale factor
A1[A2>0.04]=NA ## here is main calculation I need
}
2- Can anybody tell me to write them again?

Missing values are special values in netCDF files whose value is to be taken as indicating the data
is "missing". So you need to use set.missval.ncdf to set this values.
a<-list.files("C:\\3 nc files", "*.DBL", full.names = TRUE)
SM_NAME <- "Soil_Moisture"
SM_SDX_NAME <- "Soil_Moisture_Dqx"
library(ncdf)
lapply(a, function(filename){
nc <- open.ncdf( filename,write=TRUE )
SM <- get.var.ncdf(nc=nc,varid=SM_NAME)
SM_dqx <- get.var.ncdf(nc=nc,varid=SM_SDX_NAME)
SM[SM_dqx > 0.4] <- NA
newMissVal <- 999.9
set.missval.ncdf( nc, SM_NAME, newMissVal )
put.var.ncdf( nc, SM_NAME, SM )
close.ncdf(nc)
})
EDIT add some check
It is intersting here to count how many points will tagged as missed.
Whithout applying the odd scale factor we have:
lapply(a, function(filename){
nc <- open.ncdf( filename,write=TRUE )
SM_dqx <- get.var.ncdf(nc=nc,varid=SM_SDX_NAME)
table(SM_dqx > 0.4)
})
[[1]]
[1] 810347 91
[[2]]
[1] 810286 152
[[3]]
[1] 810287 151
[[4]]
[1] 810355 83

This can also be accomplished from the command line using CDO.
As I understand it both variables are contained in your input file (which I will call "datafile.nc", you will want to presumably do the following in a loop over the file lists), so first of all we will extract those two variables into two separate files:
cdo selvar,Soil_Moisture datafile.nc soil_moisture.nc
cdo selvar,Soil_Moisture_Dqx datafile.nc dqx.nc
Now we will define a mask file that contains 1 when dqx<0.04 but contains NAN when dqx>=0.04
cdo setctomiss,0 -ltc,0.04 dqx.nc mask.nc
The ltc is "than than constant" (you may want instead lec for <= ), the setctomiss replaces all the zeros with NAN.
Now we multiply these together with CDO - NAN*C=NAN and 1*C=C, so this gives you a netcdf with your desired field:
cdo mul mask.c soil_moisture.nc masked_soil_moisture.nc
you can actually combine those last two lines together if you like, and avoid the I/O of writing the mask file:
cdo mul -setctomiss,0 -ltc,0.04 dqx.nc soil_moisture.nc masked_soil_moisture.nc
But it is easier to explain the steps separately :-)
You can put the whole thing in a loop over files easily in bash.

Related

Why is Mosaic rasters not working in some files?

I want to create a new raster from 2 tiles. For the date "247" everything works fine. But for the next date "248", the raster I'm getting in return is only the first tile h12v12, is not merging the consecutive one (h13v12).
This is just an example. I've worked with more files and had the same problem with some random files. Still, for some of them, I've got the desired results.
The files:
https://drive.google.com/drive/folders/1FSUAg-H8ePP9jeZjCTatlGFsDsKCq3cN?usp=sharing
#Open files for a day
files <- c("AOD2022248.h12v12.tif", "AOD2022248.h13v12.tif")
files_2 <- c("AOD2022247.h12v12.tif", "AOD2022247.h13v12.tif")
#create stack for each one
test_1 <- stack(files[1])
test_2 <- stack(files[2])
test_3 <- stack(files_2[1])
test_4 <- stack(files_2[2])
The first try was with raster::mosaic. For the 248 file, I've got an error. For the 247 file, it worked.
joint <- mosaic(test_1, test_2, fun=mean, filename = "joint.tif", overwrite=TRUE)
Error in v[cells, i] <- as.vector(getValues(x[[i]])) :
number of items to replace is not a multiple of replacement length
joint2 <- mosaic(test_3, test_4, fun=mean, filename = "joint2.tif", overwrite=TRUE)
I don't get this error, the rasters are not the same extent because they are supposed to be side by side, not on top of each other.
So... the second try was with gdalUtils::mosaic_rasters. Although I didn't get any errors here, when I open the new tif in QGIS the 248 only has the first tile, and the 247 has the two of them.
mosaic_rasters(files, dst_dataset= "files.tif")
[1] "AOD2022248.h12v12.tif" "AOD2022248.h13v12.tif"
NULL
mosaic_rasters(files_2, dst_dataset= "files2.tif")
[1] "AOD2022247.h12v12.tif" "AOD2022247.h13v12.tif"
NULL
When I use verbose = TRUE in mosaic_rasters...
I've got for 247
Input file size is 2400, 12000...10...20...30...40...50...60...70...80...90...100 - done.
and for 248
Input file size is 1200, 12000...10...20...30...40...50...60...70...80...90...100 - done.
I also compared the rasters to see if there is any difference in h12v12 or h13v12 between days but they are the same.
> compareRaster(test_1, test_3)
[1] TRUE
> compareRaster(test_2, test_4)
[1] TRUE
If the files are the same, why mosaic/merge is only working right with some of them?
The raster stacks need to have the same number of layers
nlayers(test_1)
# [1] 4
nlayers(test_2)
# [1] 3
If we add another dummy layer filled with NA, then the mosaic works. I assumed that the missing layer is the fourth, but you may need to figure out which one is missing in your specific case.
test_2a = stack(test_2, init(test_2[[1]], fun=function(x) rep(NA,x)))
joint <- mosaic(test_1, test_2a, fun=mean, filename = "joint.tif", overwrite=TRUE)
Or, we can use package terra instead of raster
library(terra)
test_1 <- rast(files[1])
test_2 <- rast(files[2])
joint <- mosaic(test_1, test_2)

Finding two directories (which are in ten min bins) based on a time. A diabolical directory disaster

I have looked all round and can't find a working solution. A bit of background:
I am using R to find raw images based on a validated image name (all this bit works). The issue is there are at least 30 date directories with each of these having a large number of time directories, these are divided up into 10 min bins. Looking in all the bins or just the parent directory is asking a bit too much computationally. An example format of the bin would be
R_Experiments\RawImageFinder\Raw\2016-10-08\1536
R_Experiments\RawImageFinder\Raw\2016-10-08\1546
It is important to note that the bins are not consistent with their starting minutes; it can vary and here in lies the problem.
I know what time the image was taken from the file name using the following bit of code
SingleImage <- Pia1.2016-10-08.1103+N2353_hc.tif
TimeDir <- sub('.*?\\.\\d{4}-\\d{2}-\\d{2}\\.(\\d{2})(\\d{2}).*', '\\1:\\2', SingleImage)
TimeDir <- sub(':','', TimeDir)
#
> print(TimeDir)
[1] "1103"
So the image could belong in any of the following bins:
\1053,\1054,\1055,..you get the idea...,\1112,\1113
it just depends when the bin was started. So I want the "finder" code to look in all possible bins that are within tin mins either side (as per the example above), obviously some of them will not exist.
I thought about doing:
TimeDir1 <- as.numeric(TimeDir)+1
TimeDir2 <- as.numeric(TimeDir)+2
but the issue arises if we get to 59 mins, because there is no such thing as 61 mins in the hour (haha).
I then use the following to tell which directories to search, although I am a bit stuck also on how to tell it to look in multiple directories.
Directorytosearch <- ParentDirectory
#this has the \ in it, same for time, it works
Directorytosearch <- sub('$',paste(DateDir), Directorytosearch)
Directorytoserach <- sub('$',paste(TimeDir), Directorytoserach)
IMAGEtocopy <- list.files(
path = c(Directorytosearch),
recursive = TRUE,
include.dirs = FALSE,
full.names = FALSE,
pattern = SingleImagePattern)
Any help really would be great!
Could be using the strptime function?
Many thanks
Jim
Update for #Nya
test <- strptime("1546", format = "%H%M")
dirs[select.image.dir(test, dirs.time)]
> dirs[select.image.dir(test, dirs.time)]
[1] "test/1546"
To list directories, you are looking for the list.dirs() function. Let's assume that the following example was obtained from such a search through all the directories.
# directories possibly obtained with list.dirs
dirs <- c("test/1536", "test/1546", "test/1556", "test/1606")
A good practice then would be to extract both the date and time components from the directories and image file names. Here, I will only use the time since that was the original request.
# convert times
dirs.time <- sub(".*/(\\d+)$", "\\1", dirs)
dirs.time <- strptime(dirs.time, format="%H%M")
# test data, in your case from image file names
test <- strptime(c("1538", "1559", "1502"), format="%H%M")
The function that will select the desired directories by comparing if the time from the image file is within the 10 minutes interval up and down the time of the directory. It will then provide the indices, where the image could be located.
select.image.dir <- function(i, dt){
res <- NULL
# adding and substracting 10 minutes converted to seconds
ik <- c(i - 600, i + 600)
condition <- c(ik[1] <= dt & ik[2] >= dt)
if(any(condition)){
res <- which(condition)
} else { res <- NA }
res
}
Note that the updated function accepts a single image file time to test in each round. The indices can then be used to extract the path to the image directory. The last time is outside the range of the directories and thus the function returns NA.
dirs[select.image.dir(test[1], dirs.time)]
# [1] "test/1536" "test/1546"
dirs[select.image.dir(test[2], dirs.time)]
# [1] "test/1556" "test/1606"
dirs[select.image.dir(test[3], dirs.time)]
# [1] NA NA NA NA

Populating a matrix (or a DF or a DT) with a loop from a folder containing txt files

I wrote my first code in R for treating some spectra [basically .txt files with a Xcol (wavelength) and Ycol (intensity)].
The code works for single files, provided I write the file name in the code. Here the code working for the first file HKU47_PSG_1_LW_0.txt.
setwd("C:/Users/dd16722/R/Raman/Data")
# import Spectra
PSG1_LW<-read.table("HKU47_PSG_1_LW_0.txt")
colnames(PSG1_LW)[colnames(PSG1_LW)=="V2"] <- "PSG1_LW"
PSG2_LW<-read.table("HKU47_PSG_2_LW_all_0.txt")
colnames(PSG2_LW)[colnames(PSG2_LW)=="V2"] <- "PSG2_LW"
#Plot 2 spectra and define the Y range
plot(PSG1_LW$V1, PSG1_LW$PSG1_LW, type="l",xaxs="i", yaxs="i", main="Raman spectra", xlab="Raman shift (cm-1)", ylab="Intensity", ylim=range(PSG1_LW,PSG2_LW))
lines(PSG2_LW$V1, PSG2_LW$PSG2_LW, col=("red"), yaxs="i")
# Temperature-excitation line correction
laser = 532
PSG1_LW_corr <- PSG1_LW$PSG1_LW*((10^7/laser)^3*(1-exp(-6.62607*10^(-34)*29979245800*PSG1_LW$V1/(1.3806488*10^(-23)*293.15)))*PSG1_LW$V1/((10^7/laser)-PSG1_LW$V1)^4)
PSG1_Raw_Corr <-cbind (PSG1_LW,PSG1_LW_corr)
lines(PSG1_LW$V1, PSG1_LW_corr, col="red")
plot(PSG1_LW$V1, PSG1_Raw_Corr$PSG1_LW_corr, type="l",xaxs="i", yaxs="i", xlab="Raman shift (cm-1)", ylab="Intensity")
Now, it's time for another little step forward. In the folder, there are many spectra (in the code above I reported the second one: HKU47_PSG_2_LW_all_0.txt) having again 2 columns, same length of the first file. I suppose I should merge all the files in a matrix (or DF or DT).
Probably I need a loop as I need a code able to check automatically the number of files contained in the folder and ultimately to create an object with several columns (i.e. the double of the number of the files).
So I started like this:
listLW <- list.files(path = ".", pattern = "LW")
numLW <- as.integer(length(listLW))
numLW represents the number of iterations I need to set. The question is: how can I populate a matrix (or DF or DT) in order to have in the first 2 columns the first txt file in my folder, then the second file in the 3rd and 4th columns etc? Considering that I need to perform some other operations as I showed above in the code.
I have been reading about loop in R since yestarday but actually could not find the best and easy solution.
Thanks!
You could do something like
# Load data.table library
require(data.table)
# Import the first file
DT_final <- fread(file = listLW[1])
# Loop over the rest of the files and use cbind to merge them into 1 DT
for(file in setdiff(listLW, listLW[1])) {
DT_temp <- fread(file)
DT_final <- cbind(DT_final, DT_temp)
}

Multple csv files reading in a loop and calculating column wise average in R

I have 3 csv files, I have three columns in all the three files( Maths, Physics and Chemistry) and marks of all the students. I created a loop to read all the files and saved in a dataframe as follows. In every file line numbers 1,2,4,5 need to be skipped.
files <- list.files(pattern = ".csv")
for(i in 1:length(files)){
data <- read.csv(files[i], header=F, skip=2) # by writing skip=2 I could only skip first two lines.
View(data)
mathavg[i] <- sum(as.numeric(data$math), na.rm=T)/nrow(data)
}
result <- cbind(files,mathavg)
write.csv(result,"result_mathavg.csv")
I could not able to calculate the average of math column in all the three files.
Like this I need to calculate for all the three subjects across three files.
any help????
This should work,
files <- c("testa.csv","testb.csv","testc.csv")
list_files <- lapply(files,read.csv,header=F,stringsAsFactors=F)
list_files <- lapply(list_files, function(x) x[-c(1,2,4,5),])
mathav <- sapply(list_files,function(x) mean(as.numeric(x[,2]),na.rm=T))
result <- cbind(files,mathav)
write.csv(result,"result_mathavg.csv",row.names=F)
I didn't have access to your files, so I made up three and called them 'files'. I used the lapply function to load the files, then to remove the lines that you didn't want. I got the average using the sapply function then I went back to your code to get result, etc.
mathavg needs to be initialized before it can be operated on with []. To remove lines 4 and 5 you just need to perform a subsetting operation after reading the data. lines 4 and 5 become 2 and 3 if you skip the first 2 lines when reading the data.
files <- list.files(pattern = ".csv")
mathavg<-''
for(i in 1:length(files)){
data <- read.csv(files[i], header=F, skip=2, stringsAsFactors=F) # by writing skip=2 I could only skip first two lines.
data<-data[-c(2,3),]
mathavg[i] <- mean(as.numeric(data$math), rm.NA=T) ##best to use R's builtin function to calculate the mean
}
result <- cbind(files,mathavg)
write.csv(result,"result_mathavg.csv")

Opening and reading multiple netcdf files with RnetCDF

Using R, I am trying to open all the netcdf files I have in a single folder (e.g 20 files) read a single variable, and create a single data.frame combining the values from all files. I have been using RnetCDF to read netcdf files. For a single file, I read the variable with the following commands:
library('RNetCDF')
nc = open.nc('file.nc')
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,240))
where 414 & 315 are the longitude and latitude of the value I would like to extract and 240 is the number of timesteps.
I have found this thread which explains how to open multiple files. Following it, I have managed to open the files using:
filenames= list.files('/MY_FOLDER/',pattern='*.nc',full.names=TRUE)
ldf = lapply(filenames,open.nc)
but now I'm stuck. I tried
var1= lapply(ldf, var.get.nc(ldf,'LWdown',start=c(414,315,1),count=c(1,1,240)))
but it doesn't work.
The added complication is that every nc file has a different number of timestep. So I have 2 questions:
1: How can I open all files, read the variable in each file and combine all values in a single data frame?
2: How can I set the last dimension in count to vary for all files?
Following #mdsummer's comment, I have tried a do loop instead and have managed to do everything I needed:
# Declare data frame
df=NULL
#Open all files
files= list.files('MY_FOLDER/',pattern='*.nc',full.names=TRUE)
# Loop over files
for(i in seq_along(files)) {
nc = open.nc(files[i])
# Read the whole nc file and read the length of the varying dimension (here, the 3rd dimension, specifically time)
lw = var.get.nc(nc,'LWdown')
x=dim(lw)
# Vary the time dimension for each file as required
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,x[3]))
# Add the values from each file to a single data.frame
rbind(df,data.frame(lw))->df
}
There may be a more elegant way but it works.
You're passing the additional function parameters wrong. You should use ... for that. Here's a simple example of how to pass na.rm to mean.
x.var <- 1:10
x.var[5] <- NA
x.var <- list(x.var)
x.var[[2]] <- 1:10
lapply(x.var, FUN = mean)
lapply(x.var, FUN = mean, na.rm = TRUE)
edit
For your specific example, this would be something along the lines of
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown', start = c(414, 315, 1), count = c(1, 1, 240))
though this is untested.
I think this is much easier to do with CDO as you can select the varying timestep easily using the date or time stamp, and pick out the desired nearest grid point. This would be an example bash script:
# I don't know how your time axis is
# you may need to use a date with a time stamp too if your data is not e.g. daily
# see the CDO manual for how to define dates.
date=20090101
lat=10
lon=50
files=`ls MY_FOLDER/*.nc`
for file in $files ; do
# select the nearest grid point and the date slice desired:
# %??? strips the .nc from the file name
cdo seldate,$date -remapnn,lon=$lon/lat=$lat $file ${file%???}_${lat}_${lon}_${date}.nc
done
Rscript here to read in the files
It is possible to merge all the new files with cdo, but you would need to be careful if the time stamp is the same. You could try cdo merge or cdo cat - that way you can read in a single file to R, rather than having to loop and open each file separately.

Resources