Writing a loop to store many data variables in R

I want to begin by saying that I am not a programmer; I'm just trying to store data so it's easily readable to me.
I have downloaded a large .nc file of weather data and am trying to extract data from it and store it in .csv format so I can easily view it in Excel. The catch is that the file contains 53 variables, each with three dimensions: latitude, longitude, and time. I have written some code that takes a single latitude and longitude and every timestamp, so I get one nice column per variable (one latitude/longitude point, all timestamps). My problem is that I want the loop to store each variable's column in a different (arbitrary) object in R, so that I only have to run it once and can then write all the data to one .csv file with the write.csv function.
Here's the code I've written so far, where janweather is the opened .nc file.
j <- 1
while (j <= 53) {
  v1 <- janweather$var[[j]]
  varsize <- v1$varsize
  ndims <- v1$ndims
  nt <- varsize[ndims]   # remember: the timelike dim is always the LAST dimension!
  j <- j + 1
  for (i in 1:nt) {
    # Initialize start and count to read one timestep of the variable.
    start <- rep(1, ndims)   # begin with start = (1,1,1,...,1)
    start[1] <- i
    start[2] <- i            # change to start = (i,i,1,...,1)
    count <- varsize         # begin with count = (nx,ny,nz,...), reads entire var
    count[1] <- 1
    count[2] <- 1
    data3 <- get.var.ncdf(janweather, v1, start = start, count = count)
  }
}
Here are the details of the nc file from print.ncdf(janweather):
[1] "file netcdf-atls04-20150304074032-33683-0853.nc has 3 dimensions:" [1] "longitude Size: 240" [1] "latitude Size: 121" [1] "time Size: 31" [1] "------------------------" [1] "file netcdf-atls04-20150304074032-33683-0853.nc has 53 variables:"
My main goal is to have each variable stored under a different name by the get.var.ncdf function. Right now the loop just keeps overwriting data3, so when it finishes all I have accomplished is getting data3 set to the last variable. I'd like to think there is an easy solution to this, but I'm not exactly sure how to generate the names to store the variables under.
Again, I'm not a programmer so I'm sorry if anything I've said doesn't make any sense, I'm not very well versed in the lingo or anything.
Thanks for any and all help you guys bring!

If you're not a programmer and only want to get variables into CSV format, you can use the NCO commands. These commands let you do many operations on NetCDF files.
With the ncks command you can output the data for a variable at a specific slice of its dimensions.
ncks -H -v latitude janweather.nc
This command lists the values of the latitude variable on the screen.
ncks -s '%f ,' -H -v temperature janweather.nc
This command lists the values of the variable temperature, formatted as specified with the -s argument (sprintf style).
So just redirect the output to a file, and there you have the contents of a variable in a text file:
ncks -s '%f ,' -H -v temperature janweather.nc > temperature.csv
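If you would rather stay in R, here is a minimal sketch along the lines of the original loop (untested; it assumes the ncdf package from the question, and the grid-point indices lon_idx/lat_idx are placeholders you would choose yourself). Rather than inventing 53 object names, it collects one column per variable in a named list and writes a single .csv:
library(ncdf)  # the (now superseded) package used in the question
janweather <- open.ncdf("netcdf-atls04-20150304074032-33683-0853.nc")
lon_idx <- 1                    # hypothetical grid-point indices
lat_idx <- 1
nt <- janweather$dim$time$len   # 31 timesteps in this file
# One column per variable, keyed by the variable's own name
cols <- list()
for (j in seq_along(janweather$var)) {
  v <- janweather$var[[j]]
  start <- rep(1, v$ndims)
  start[1] <- lon_idx
  start[2] <- lat_idx
  count <- rep(1, v$ndims)
  count[v$ndims] <- nt          # all timesteps at this one point
  cols[[v$name]] <- get.var.ncdf(janweather, v, start = start, count = count)
}
write.csv(as.data.frame(cols), "janweather_point.csv", row.names = FALSE)
A named list avoids generating strings for object names entirely; as.data.frame then lines the 53 time series up as columns.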

Related

How to merge netcdf files separately in R?

I am an R user and would like some help with the following:
I have two netcdf files, each with dimensions 30x30x365, and one more with dimensions 30x30x366. Each file contains a year's worth of daily data, where the last dimension is the time dimension. I want to concatenate them along that dimension, i.e. the output file should have dimensions 30x30x1096.
Note: I have seen a similar question, but its output is an average (i.e. 30x30x3), which is not what I want.
From your comment, you seem to want to merge the 3 files along the time dimension. As an alternative to R, you could do this quickly from the command line using cdo (climate data operators):
cdo mergetime file1.nc file2.nc file3.nc mergedfile.nc
or using wildcards:
cdo mergetime file?.nc mergedfile.nc
cdo is easy to install under ubuntu:
sudo apt install cdo
Without knowing exactly what dimensions and variables you have, this may be enough to get you started:
library(ncdf4)
output_data <- array(dim = c(30, 30, 1096))
files <- c('file1.nc', 'file2.nc', 'file3.nc')
days <- c(365, 365, 366)
# Open each file and add it to the final output array
for (i in seq_along(files)) {
  nc <- nc_open(files[i])
  input_arr <- ncvar_get(nc, varid = 'var_name')
  nc_close(nc)
  # Calculate the indices where each file's data should go
  if (i > 1) {
    day_idx <- (1:days[i]) + sum(days[1:(i - 1)])
  } else {
    day_idx <- 1:days[i]
  }
  output_data[, , day_idx] <- input_arr
}
# Write out output_data to a NetCDF. How exactly this should be done depends on
# what dimensions and variables you have. See here for more:
# https://publicwiki.deltares.nl/display/OET/Creating+a+netCDF+file+with+R
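As a rough sketch of that last step, assuming a single variable (the name var_name and the coordinate values below are placeholders, not taken from the question):
# Hypothetical coordinates; replace with the values from your input files
lon_dim  <- ncdim_def("longitude", "degrees_east", vals = 1:30)
lat_dim  <- ncdim_def("latitude", "degrees_north", vals = 1:30)
time_dim <- ncdim_def("time", "days since 2000-01-01", vals = 1:1096, unlim = TRUE)
var_def <- ncvar_def("var_name", units = "",
                     dim = list(lon_dim, lat_dim, time_dim), missval = -9999)
nc_out <- nc_create("mergedfile.nc", var_def)
ncvar_put(nc_out, var_def, output_data)
nc_close(nc_out)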

How to modify the R program to reduce the use of for loops?

I'm doing Assignment Part 2 at the following address:
https://www.coursera.org/learn/r-programming/supplement/amLgW/programming-assignment-1-instructions-air-pollution
Question:
The zip file contains 332 comma-separated-value (CSV) files containing pollution monitoring data for fine particulate matter (PM) air pollution at 332 locations in the United States. Each file contains data from a single monitor and the ID number for each monitor is contained in the file name. For example, data for monitor 200 is contained in the file "200.csv". Each file contains three variables:
Date: the date of the observation in YYYY-MM-DD format (year-month-day)
sulfate: the level of sulfate PM in the air on that date (measured in micrograms per cubic meter)
nitrate: the level of nitrate PM in the air on that date (measured in micrograms per cubic meter)
For this programming assignment you will need to unzip this file and create the directory 'specdata'. Once you have unzipped the zip file, do not make any modifications to the files in the 'specdata' directory. In each file you'll notice that there are many days where either sulfate or nitrate (or both) are missing (coded as NA). This is common with air pollution monitoring data in the United States.
Part 2
Write a function that reads a directory full of files and reports the number of completely observed cases in each data file. The function should return a data frame where the first column is the name of the file and the second column is the number of complete cases.
My code is as following:
complete <- function(directory = "d:/dev/r/documents/specdata", id) {
  df <- data.frame(no = integer(), nobs = integer())
  for (i in id) {
    sum <- 0
    myfilename <- paste(directory, "/", formatC(i, width = 3, flag = "0"), ".csv",
                        sep = "")
    masterfile <- read.table(myfilename, header = TRUE, sep = ",")
    for (j in 1:nrow(masterfile)) {
      if (!is.na(masterfile[j, 2]) && !is.na(masterfile[j, 3])) {
        sum <- sum + 1
      }
    }
    df[i, ] <- c(i, sum)
  }
  df
}
Note that I put all of the files 001.csv, 002.csv, ... in the directory d:/dev/r/documents/specdata, which is why I use that string as the default for the parameter. You can see that I use nested for loops to make this work, and I realize that I should be able to replace at least one of the for loops with lapply. But I'm struggling with this: I'm quite familiar with C++, so I really have no idea how to use lapply. I have read some code on Stack Overflow and I understand most of it, but when it came to writing my own code I could not make it work.
Thanks in advance! In the meantime I will keep trying.
You can start by replacing the inner loop with a vectorized version, something like this:
rows_to_sum <- !is.na(masterfile[, 2]) & !is.na(masterfile[, 3])
df[i, ] <- c(i, sum(rows_to_sum))
sum() on a logical vector counts the TRUE elements, i.e. the complete rows.
This assignment gives you a hint by using the phrase "complete cases" multiple times. You should check out the R function complete.cases(). It would replace the need for your inner for loop.
For each file, run complete.cases(file). Count the number of TRUE elements in the returned vector. Output the name of the file and the above count.
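A minimal sketch of that approach (untested; it assumes the same directory layout and 001.csv-style names as in the question):
complete <- function(directory = "d:/dev/r/documents/specdata", id = 1:332) {
  # complete.cases() is TRUE for rows with no NA in any column,
  # so summing the logical vector counts the complete observations
  nobs <- sapply(id, function(i) {
    path <- paste0(directory, "/", formatC(i, width = 3, flag = "0"), ".csv")
    sum(complete.cases(read.csv(path)))
  })
  data.frame(id = id, nobs = nobs)
}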

Read multiple NetCDF files and compute an average in R

Using R, I am trying to open my NetCDF data, which has a 5-dimensional space with 15 variables (the variable used for calculation is a 1000x920 matrix).
This problem actually looks similar to an earlier question; I got explanations from there and elsewhere.
At first I used the RNetCDF package, but after some trials I found inconsistencies in how the package read my data, and things went better after switching to the ncdf package.
There is no problem opening the data in a single file, but when I try to loop over more than a hundred files in a folder for a specific variable (for example, variable no. 15), the program fails:
days <- formatC(1:4, width = 3, flag = "0")
ncfiles <- lapply(days, function(d) {
  filename <- paste("data", d, ".nc", sep = "")
  open.ncdf(filename)
})
It also fails when I try a command like this for a specific variable:
sapply(ncfiles, function(file) { get.var.ncdf(file, "var15") })
So my question is: is there a way to read a specific variable from all the NetCDF files and then do the calculation in one frame? With the solutions above I failed to extract variable no. 15 across the whole set of NetCDF files.
Thanks for any solution to this problem.
UPDATE:
This is the last thing I have tried. When I write
library(ncdf)
files <- list.files("allnc/", pattern = "*.nc", full.names = TRUE)
df <- NULL
for (i in seq_along(files)) {
  nc <- open.ncdf(files[i])
  lw <- get.var.ncdf(nc, "var15")
  x <- dim(lw)
  df <- rbind(df, data.frame(lw))
}
I can get all the NetCDF data via nc. But how can I get the variable data under a new name automatically, like lw1, lw2, ... etc.? I cannot get
var1 <- lapply(files, FUN = get.var.ncdf, variable = "var15")
to work so that I can then do the calculation with all the data.
The other technique I tried used the RNetCDF package and a loop:
# Declare data frame
df = NULL
# Open all files
files = list.files("allnc/", pattern = '*.nc', full.names = TRUE)
# Loop over files
for (i in seq_along(files)) {
  nc = open.nc(files[i])
  # Read the whole nc file and read the length of the varying dimension (here, the 3rd dimension, specifically time)
  lw = var.get.nc(nc, 'DBZH')
  x = dim(lw)
  # Vary the time dimension for each file as required
  lw = var.get.nc(nc, 'var15')
  # Add the values from each file to a single data.frame
}
I can get the variable data, but I only end up with data from one of my nc files.
Note: a sample of my data file names (data20150102001.nc, data20150102002.nc, ... etc.)
This solution uses NCO, not R. You may use it to check your R solution:
ncra -v var15 data20150102*.nc out.nc
That is all: ncra averages the variable over the record (time) dimension across all the input files. Full documentation is in the NCO User Guide.
You can use the ensemble statistics capabilities of CDO, but note that on some systems the number of open files is limited to 256:
cdo ensmean data20150102*.nc ensmean.nc
You can replace "mean" with the statistic of your choice: max, std, var, min, etc.
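If you want to do the whole thing in R instead, here is a minimal sketch of the read-and-average step (untested; it assumes the ncdf package and that var15 has the same shape in every file):
library(ncdf)
files <- list.files("allnc/", pattern = "\\.nc$", full.names = TRUE)
# One array per file, keyed by file name -- no need for lw1, lw2, ... objects
lw_list <- lapply(files, function(f) {
  nc <- open.ncdf(f)
  on.exit(close.ncdf(nc))
  get.var.ncdf(nc, "var15")
})
names(lw_list) <- basename(files)
# Element-wise mean over all files
lw_mean <- Reduce(`+`, lw_list) / length(lw_list)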

Opening and reading multiple netcdf files with RnetCDF

Using R, I am trying to open all the netcdf files I have in a single folder (e.g. 20 files), read a single variable, and create a single data.frame combining the values from all files. I have been using RNetCDF to read netcdf files. For a single file, I read the variable with the following commands:
library('RNetCDF')
nc = open.nc('file.nc')
lw = var.get.nc(nc,'LWdown',start=c(414,315,1),count=c(1,1,240))
where 414 & 315 are the longitude and latitude of the value I would like to extract and 240 is the number of timesteps.
I have found this thread which explains how to open multiple files. Following it, I have managed to open the files using:
filenames= list.files('/MY_FOLDER/',pattern='*.nc',full.names=TRUE)
ldf = lapply(filenames,open.nc)
but now I'm stuck. I tried
var1= lapply(ldf, var.get.nc(ldf,'LWdown',start=c(414,315,1),count=c(1,1,240)))
but it doesn't work.
The added complication is that every nc file has a different number of timesteps. So I have 2 questions:
1: How can I open all files, read the variable in each file and combine all values in a single data frame?
2: How can I set the last dimension in count to vary for all files?
Following #mdsummer's comment, I have tried a do loop instead and have managed to do everything I needed:
# Declare data frame
df = NULL
# Open all files
files = list.files('MY_FOLDER/', pattern = '*.nc', full.names = TRUE)
# Loop over files
for (i in seq_along(files)) {
  nc = open.nc(files[i])
  # Read the whole nc file and read the length of the varying dimension (here, the 3rd dimension, specifically time)
  lw = var.get.nc(nc, 'LWdown')
  x = dim(lw)
  # Vary the time dimension for each file as required
  lw = var.get.nc(nc, 'LWdown', start = c(414, 315, 1), count = c(1, 1, x[3]))
  # Add the values from each file to a single data.frame
  df <- rbind(df, data.frame(lw))
}
There may be a more elegant way but it works.
You're passing the additional function parameters incorrectly; they should go through lapply's ... argument. Here's a simple example of how to pass na.rm to mean:
x.var <- 1:10
x.var[5] <- NA
x.var <- list(x.var)
x.var[[2]] <- 1:10
lapply(x.var, FUN = mean)
lapply(x.var, FUN = mean, na.rm = TRUE)
Edit: for your specific example, this would be something along the lines of
var1 <- lapply(ldf, FUN = var.get.nc, variable = 'LWdown', start = c(414, 315, 1), count = c(1, 1, 240))
though this is untested.
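To also cover question 2 (the varying number of timesteps): if I remember the RNetCDF convention correctly, an NA in count means "read to the end of that dimension", so a rough, untested sketch would be:
library(RNetCDF)
filenames <- list.files('MY_FOLDER/', pattern = '*.nc', full.names = TRUE)
# NA in count reads everything from start to the end of the time dimension,
# so files with different numbers of timesteps need no special handling
read_point <- function(f) {
  nc <- open.nc(f)
  on.exit(close.nc(nc))
  var.get.nc(nc, 'LWdown', start = c(414, 315, 1), count = c(1, 1, NA))
}
df <- data.frame(lw = unlist(lapply(filenames, read_point)))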
I think this is much easier to do with CDO, as you can select the varying timestep easily using the date or time stamp, and pick out the desired nearest grid point. This would be an example bash script:
# I don't know how your time axis is
# you may need to use a date with a time stamp too if your data is not e.g. daily
# see the CDO manual for how to define dates.
date=20090101
lat=10
lon=50
files=`ls MY_FOLDER/*.nc`
for file in $files ; do
  # select the nearest grid point and the date slice desired:
  # ${file%???} strips the .nc from the file name
  cdo seldate,$date -remapnn,lon=$lon/lat=$lat $file ${file%???}_${lat}_${lon}_${date}.nc
done
Then use an R script to read in the resulting files.
It is possible to merge all the new files with CDO, but you would need to be careful if the time stamps are the same. You could try cdo merge or cdo cat; that way you can read a single file into R, rather than having to loop over and open each file separately.
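For example (the merged file name is hypothetical):
library(RNetCDF)
nc <- open.nc('merged.nc')       # hypothetical output of `cdo cat`
lw <- var.get.nc(nc, 'LWdown')   # already a single grid point after remapnn
close.nc(nc)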

Manipulate variables in netcdf files and write them again

I have several NetCDF files. Each nc file has several variables. I am only interested in two variables, "Soil_Moisture" and "Soil_Moisture_Dqx".
I would like to filter "Soil_Moisture" based on "Soil_Moisture_Dqx": I want to replace values in "Soil_Moisture" by NA wherever the corresponding "Soil_Moisture_Dqx" pixels have values greater than 0.04.
Here are the files to download:
1. I tried this loop, but when I typed f[1] or f[2] I got something weird, which means my loop is incorrect. I'd be grateful for any help in getting it corrected.
a <- list.files("C:\\3 nc files", "*.DBL", full.names = TRUE)
for (i in 1:length(a)) {
  f <- open.ncdf(a[i])
  A1 <- get.var.ncdf(nc = f, varid = "Soil_Moisture", verbose = TRUE)
  A1 <- A1 * -0.000030518509475997    # scale factor
  A2 <- get.var.ncdf(nc = f, varid = "Soil_Moisture_Dqx", verbose = TRUE)
  A2 <- A2 * -0.0000152592547379985   # scale factor
  A1[A2 > 0.04] <- NA                 # here is the main calculation I need
}
2. Can anybody tell me how to write them out again?
Missing values are special values in NetCDF files that indicate the data is "missing", so you need to use set.missval.ncdf to set them:
a <- list.files("C:\\3 nc files", "*.DBL", full.names = TRUE)
SM_NAME <- "Soil_Moisture"
SM_SDX_NAME <- "Soil_Moisture_Dqx"
library(ncdf)
lapply(a, function(filename) {
  nc <- open.ncdf(filename, write = TRUE)
  SM <- get.var.ncdf(nc = nc, varid = SM_NAME)
  SM_dqx <- get.var.ncdf(nc = nc, varid = SM_SDX_NAME)
  SM[SM_dqx > 0.04] <- NA   # the 0.04 threshold from the question
  newMissVal <- 999.9
  set.missval.ncdf(nc, SM_NAME, newMissVal)
  put.var.ncdf(nc, SM_NAME, SM)
  close.ncdf(nc)
})
EDIT: add a check.
It is interesting here to count how many points will be tagged as missing. Without applying the odd scale factor, we have:
lapply(a, function(filename) {
  nc <- open.ncdf(filename, write = TRUE)
  SM_dqx <- get.var.ncdf(nc = nc, varid = SM_SDX_NAME)
  table(SM_dqx > 0.4)
})
[[1]]
[1] 810347 91
[[2]]
[1] 810286 152
[[3]]
[1] 810287 151
[[4]]
[1] 810355 83
This can also be accomplished from the command line using CDO.
As I understand it, both variables are contained in your input file (which I will call "datafile.nc"; you will presumably want to do the following in a loop over the file list), so first of all we extract those two variables into two separate files:
cdo selvar,Soil_Moisture datafile.nc soil_moisture.nc
cdo selvar,Soil_Moisture_Dqx datafile.nc dqx.nc
Now we define a mask file that contains 1 where dqx < 0.04 and NaN where dqx >= 0.04:
cdo setctomiss,0 -ltc,0.04 dqx.nc mask.nc
ltc is "less than constant" (you may want lec instead for <=); setctomiss replaces all the zeros with NaN.
Now we multiply these together with CDO. Since NaN*C=NaN and 1*C=C, this gives you a NetCDF file with your desired field:
cdo mul mask.nc soil_moisture.nc masked_soil_moisture.nc
You can actually combine those last two steps if you like, and avoid the I/O of writing the mask file:
cdo mul -setctomiss,0 -ltc,0.04 dqx.nc soil_moisture.nc masked_soil_moisture.nc
But it is easier to explain the steps separately :-)
You can put the whole thing in a loop over files easily in bash.
