Using loops to extract coordinates, match them and write to file - r

I am trying to use for loops (or the apply family as an alternative) to extract coordinates from a data.frame, find the closest grid point in the E-OBS gridded dataset, extract the temperature data for a time span x1-x2, and write it to a CSV file.
The code works for extracting single data points, but I have not managed to wrap it in a loop or to add the results next to the input coordinates.
library(sp)
library(raster)
library(ncdf4)
#Coordinates
df
ID site E N
1 1 site_place_date1 7.558758 47.81004
2 2 site_place_date2 7.582749 47.63411
3 3 site_place_date3 7.607968 48.01475
4 4 site_place_date4 7.644660 47.67139
... ... ... ...
Set coordinates of target point MANUALLY:
lon <- 7.558758 # longitude of location
lat <- 47.81004 # latitude of location
#Mean daily temperature
ncin <- nc_open("tg_0.25deg_reg_v17.0.nc")
print(ncin)
t <- ncvar_get(ncin,"time")
tunits <- ncatt_get(ncin, "time", "units")
nt <- dim(t)
nt
obsoutput <- ncvar_get(ncin,
start= c(which.min(abs(ncin$dim$longitude$vals - lon)), # look for closest long
which.min(abs(ncin$dim$latitude$vals - lat)), # look for closest lat
1),
count=c(1,1,-1))
DataMeanT <- data.frame(DateN= t, MeanDailyT = obsoutput)
nc_close(ncin)
head(DataMeanT)
#check if there are NAs =999
summary(DataMeanT)
Data = DataMeanT
Data$Date = as.Date(Data$DateN, origin = "2000-01-01")
Data$Year = format(Data$Date,"%Y")
Data$Month = format(Data$Date,"%m")
head(Data)
Data$YearMonth = format(Data$Date, format="%Y-%b")
Data_annual = aggregate(MeanDailyT ~ Year, data = Data, FUN = mean, na.action = na.pass)
names(Data_annual)[2] <- "AirT"
head(Data_annual)
#Export table
write.table(Data_annual, "Site_AirTemp.csv", row.names = FALSE, append = FALSE, col.names = TRUE, sep = ", ", quote = TRUE)
The aim is to run the script as part of a loop over all coordinates in df and to write the temperature data to a new table with the site ID attached, or alternatively into additional columns of df.

Simply wrap your entire process in a function and use an apply-family function to pass in the lon/lat coordinates. One good candidate is mapply, or its wrapper Map, to iterate elementwise over the two vectors df$E and df$N. A third argument, df$site, is also passed into the function to build unique CSV names, since right now the same file is overwritten on every run.
Below, some non-assignment lines such as head or summary are removed, since they do nothing inside a function. The helpers within() and with() are used to avoid repeating Data$ for more streamlined data manipulation. The Map call writes each site to file AND builds a list of aggregated data frames for later use.
Function
my_function <- function(lon, lat, site) {
  # Mean daily temperature
  ncin <- nc_open("tg_0.25deg_reg_v17.0.nc")
  t <- ncvar_get(ncin, "time")
  tunits <- ncatt_get(ncin, "time", "units")
  nt <- dim(t)
  # look for closest lon and lat
  obsoutput <- ncvar_get(ncin,
                         start = c(which.min(abs(ncin$dim$longitude$vals - lon)),
                                   which.min(abs(ncin$dim$latitude$vals - lat)),
                                   1),
                         count = c(1, 1, -1))
  DataMeanT <- data.frame(DateN = t, MeanDailyT = obsoutput)
  nc_close(ncin)
  Data <- within(DataMeanT, {
    Date <- as.Date(DateN, origin = "2000-01-01")  # origin must match the time units reported by tunits
    Year <- format(Date, "%Y")
    Month <- format(Date, "%m")
    YearMonth <- format(Date, format = "%Y-%b")
  })
  Data_annual <- with(Data, aggregate(list(AirT = MeanDailyT), list(Year = Year),
                                      FUN = mean, na.action = na.pass))
  # Export table
  write.table(Data_annual, paste0("Site_AirTemp_", site, ".csv"), row.names = FALSE,
              append = FALSE, col.names = TRUE, sep = ", ", quote = TRUE)
  # Return aggregated data frame
  return(Data_annual)
}
Call
# ITERATE THROUGH EACH LON/LAT PAIR ELEMENTWISE
df_list <- Map(my_function, df$E, df$N, df$site)
# df_list <- mapply(my_function, df$E, df$N, df$site, SIMPLIFY=FALSE) # EQUIVALENT CALL
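If a single table keyed by site is wanted instead of (or alongside) the per-site CSVs, the list that Map returns can be bound together. A minimal sketch in base R, reusing only the objects defined above:
# Tag each aggregated data frame with its site and stack them into one table
df_list2 <- Map(function(d, s) transform(d, site = s), df_list, df$site)
all_sites <- do.call(rbind, df_list2)
head(all_sites)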

You can probably do:
library(raster)
b <- brick("tg_0.25deg_reg_v17.0.nc")
e <- extract(b, df[, c('E', 'N')])
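extract() returns a matrix with one row per point in df and one column per time step, so the values can sit right next to the input coordinates. A sketch of the follow-up aggregation; the getZ() call assumes the brick picked up the NetCDF time axis as z-values, so check it on your file:
# Annual means per site from the extracted matrix
dates <- as.Date(getZ(b))  # assumes the time axis was stored as z-values
years <- format(dates, "%Y")
annual <- t(apply(e, 1, function(v) tapply(v, years, mean, na.rm = TRUE)))
df_annual <- cbind(df, annual)  # annual means next to the input coordinates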

Related

double loop does not match

I'm having trouble with a loop I'm trying to write.
I have data for monthly temperature (web-scraped from TerraClimate). I set the latitude and longitude, and my code gives me one observation for the average temperature of each month.
I have many pairs of coordinates that I need to scrape, so I'm using a loop. The problem is that I'm trying to paste the name of the location and it's not working.
Here is an example:
I have a list with the coordinates and an object with the names:
loc1 <- c(-70.6666667, -33.4500000)
loc2 <- c(-71.6163889,-33.0458333)
c=list(loc1,loc2)
x=c("Loc 1", "Loc 2")
And my code looks like this:
# web scraping temperature data
var <- "tmax"
terra <- paste0(paste0("http://thredds.northwestknowledge.net:8080/thredds/dodsC/agg_terraclimate_",var),"_1958_CurrentYear_GLOBE.nc")
nc <- nc_open(terra)
lon <- ncvar_get(nc, "lon")
lat <- ncvar_get(nc, "lat")
# creating empty object to paste each set of data
DAT <- NULL
# loop that brings up the temperatures for each set of coordinates in c
for(i in c){
  # names of locations
  for(n in x){
    flon = match(abs(lon - i[1]) < 1/48, 1)
    lonindex = which(flon %in% 1)
    flat = match(abs(lat - i[2]) < 1/48, 1)
    latindex = which(flat %in% 1)
    start <- c(lonindex, latindex, 1)
    count <- c(1, 1, -1)
    data <- as.numeric(ncvar_get(nc, varid = var, start = start, count))
    data <- as.data.table(data)
    data$date <- seq(as.Date("1958-02-01"), length.out = 756, by = "month")
    data$ym_chr <- format(data$date, format = "%Y-%m")
    data[, date := NULL]
    data$LOCATION <- as.character(c[n])
    DAT <- rbind(DAT, data)
  }
}
The problem is that the second part of the loop is not matching the way I would like. Each location has 756 observations, so in total I should have 1512 obs. The first time the loop runs, I would like "Loc 1" to be under LOCATION, and the second time, "Loc 2".
But instead I'm getting 3024 observations.
This is reproducible code.
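The count itself points at the bug: the nested loops run the body once for every combination of coordinate pair and name, i.e. 2 x 2 = 4 times, and 4 x 756 = 3024 rows. One minimal fix is to loop over a single index so each coordinate pair is matched with exactly one name; a sketch reusing the objects above (untested against the TerraClimate server):
for(i in seq_along(c)){
  flon = match(abs(lon - c[[i]][1]) < 1/48, 1)
  lonindex = which(flon %in% 1)
  flat = match(abs(lat - c[[i]][2]) < 1/48, 1)
  latindex = which(flat %in% 1)
  start <- c(lonindex, latindex, 1)
  count <- c(1, 1, -1)
  data <- as.data.table(as.numeric(ncvar_get(nc, varid = var, start = start, count = count)))
  data$date <- seq(as.Date("1958-02-01"), length.out = 756, by = "month")
  data$ym_chr <- format(data$date, format = "%Y-%m")
  data[, date := NULL]
  data$LOCATION <- x[i]  # the location name, not the coordinates
  DAT <- rbind(DAT, data)
}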

Creating a loop to create NDVI images from raster stacks and naming them from the file name

I have been trying to write a loop that goes through two folders of Sentinel-2 satellite images (Band 4 and Band 5) and computes an NDVI for each date.
A stack is created for each band, with some cropping and resampling, before finally proceeding to the NDVI calculation. I struggle with integrating the NDVI calculation into the loop and with building the file names.
I simply want my loop to generate x files for x dates and then name each NDVI image with the date "YYYY/MM/DD.tif" extracted from the file name. But I can't think of a way to do so, after a lot of unsuccessful trial and error.
#list files
files4 <- list.files(path4, pattern = "jp2$", full.names = TRUE)
files5 <- list.files(path5, pattern = "jp2$", full.names = TRUE)
ms5 <- stack()
ms4 <- stack()
for (f in files4){
  # loading a raster
  r4 <- raster(f)
  proj4string(r4)
  proj4string(emprise)
  emprise <- spTransform(emprise, proj4string(r4))
  r4b <- crop(r4, emprise)
  ms4 <- stack(ms4, r4b)
  # copy the date from the file to give a name to the final NDVI image
  # (I have to get rid of everything but the date)
  x <- gsub("[A-z //.//(//)]", "", r4)
  y <- substr(x, 4, 11)
}
for (f in files5){
  # load the raster
  r5 <- raster(f)
  proj4string(r5)
  proj4string(emprise)
  emprise <- spTransform(emprise, proj4string(r5))
  r5b <- crop(r5, emprise)
  ms5 <- stack(ms5, r5b)
}
#Resampling : setting the Band 5 to the same resolution as Band 4
b5_resamp <- resample(ms5, ms4)
Have you considered looping over dates rather than files? I can't give more specific advice without example data, but here is the general idea:
# List files
files4 <- list.files("./band4", pattern = ".tif", full.names = TRUE)
#> "band4/T31UDR_20170126T105321_B04.tif" "band4/T31UDR_20180126T105321_B04.tif"
files5 <- list.files("./band5", pattern = ".tif", full.names = TRUE)
#> "./band5/T31UDR_20170126T105321_B05.tif" "./band5/T31UDR_20180126T105321_B05.tif"
# Get dates
dates <- unique(gsub(pattern = ".*_(\\d{8}).*", replacement = "\\1", x = c(files4, files5)))
#> "20170126" "20180126"
# Define empty stacks
ms5 <- stack()
ms4 <- stack()
for(date in dates){
  ## Band 4
  f4 <- list.files("./band4", pattern = date, full.names = TRUE)
  # loading a raster
  r4 <- raster(f4)
  proj4string(r4)
  proj4string(emprise)
  emprise <- spTransform(emprise, proj4string(r4))
  r4b <- crop(r4, emprise)
  ms4 <- stack(ms4, r4b)
  ## Band 5
  f5 <- list.files("./band5", pattern = date, full.names = TRUE)
  # load the raster
  r5 <- raster(f5)
  proj4string(r5)
  proj4string(emprise)
  emprise <- spTransform(emprise, proj4string(r5))
  r5b <- crop(r5, emprise)
  ms5 <- stack(ms5, r5b)
  ## Resampling: setting Band 5 to the same resolution as Band 4
  b5_resamp <- resample(ms5, ms4)
  ## Write to file
  writeRaster(b5_resamp, filename = paste0(date, ".tif"))
}
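Note the loop above only writes the resampled Band 5 for each date; the NDVI step the question asks about would replace the writeRaster line. A sketch, assuming the index is to be computed from these two bands (with Band 5 as the NIR-side band):
## NDVI for the current date from the cropped Band 4 and resampled Band 5
b5r <- resample(r5b, r4b)
ndvi <- (b5r - r4b) / (b5r + r4b)
writeRaster(ndvi, filename = paste0(date, ".tif"), overwrite = TRUE)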

Latitude in raster calc operation

I'm trying to create a Koppen world map using data from http://worldclim.org. To find the right Koppen climate I need precipitation and temperature data (I have one raster map for each month for each of those two variables) and the latitude.
I tried doing the following :
prast <- list.files(path = "prec25/", pattern = glob2rx('*.tif'), full.names = T)
trast <- list.files(path = "temp25/", pattern = glob2rx('*.tif'), full.names = T)
lrast <- c(prast, trast)
climrast <- stack(lrast)
koppen_map <- calc(climrast, filename = "koppen.tif", fun = function(x) koppen(x[13:24], x[1:12], yFromCell(climrast, x[1])))
climrast is a RasterStack with the 24 different layers (12 layers with temperature data and 12 layers with precipitation data). The koppen function needs a vector with 12 values for temperature (that would be x[13:24]) and 12 values for precipitation (x[1:12]).
yFromCell(climrast, x[1]) should give me the latitude but the calc operation fails because yFromCell(climrast, x[1]) returns NA in some cases.
If I replace the yFromCell(climrast, x[1]) with an arbitrary number like 10, the calc operation works fine.
Any idea what I'm doing wrong?
The memory-safe (and simple) way to get a RasterLayer with latitude values is:
x <- init(climrast, 'y')
A working example with worldclim data:
library(raster)
prast <- getData('worldclim', var='prec', res=10)
tmin <- getData('worldclim', var='tmin', res=10)
tmax <- getData('worldclim', var='tmax', res=10)
trast <- (tmin + tmax) / 2
lat <- init(trast, 'y')
lrast <- stack(prast, trast, lat)
climrast <- crop(lrast, extent(25,30,-5,0))
# example function
koppen <- function(temp, prec, lat) {
(sum(temp * prec) + lat) / 1000
}
koppen_map <- calc(climrast, filename = "koppen.tif", fun = function(x) koppen(x[13:24], x[1:12], x[25]), overwrite=TRUE)
In your calc you are passing x[1] to yFromCell. But x[1] is the value of the raster cell, whereas you need to pass the cell number to yFromCell. I can illustrate with a minimal example:
First let's make a small dummy raster
library(raster)
set.seed(0)
clim = raster(matrix(sample(c(1:10,NA), 100, T), 10, 10))
Now let's try to get its latitudes using an analogy of what you had in the example
lat = calc(clim, function(x) yFromCell(clim, x))
plot(lat)
As you can see, that's not right at all - we got entirely the wrong latitude values because we passed the cell value rather than the cell number.
So let's make a raster layer that has the correct latitudes
lat = clim
lat[] = yFromCell(clim, 1:ncell(clim))
plot(lat)
That's much better. Now we can add this as a layer to our climate data, so that calc can access these values on a cell by cell basis.
climrast = stack(list(clim, lat))
koppen = calc(climrast, function(x) x[1]*x[2])
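On large rasters, filling lat[] with yFromCell over every cell holds everything in memory; the init() call from the first answer builds the same latitude layer in a memory-safe way:
lat <- init(clim, 'y')  # same layer as yFromCell over all cells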

Improving performance when working with geodata in R

I wrote the following script to produce the raw data for plotting the above map. The problem is that, for 550,000 data points, it takes about 2 hours to run on a relatively powerful machine. I'm new to R, however, and I'm wondering if there are optimized functions I could take advantage of.
The basic idea is that, given a set of geospatial data, you split the set into 200 rows and split each row into a bunch of squares. You then calculate the total of a value in each square of a row. The approach I've taken below is to take the "upper left" point of a square, calculate the latitude/longitude of the square's edges, exclude all points not in those bounds, and then sum what remains. Is there a better way, short of a solution like PostGIS?
all.data <- read.csv("FrederictonPropertyTaxDiffCleanedv3.csv", header=TRUE,
stringsAsFactors=FALSE)
all.data$X <- as.numeric(all.data$X)
all.data$Y <- as.numeric(all.data$Y)
startEnd <- function(lats, lngs) {
  # Find the "upper left" (NW) and "bottom right" (SE) coordinates of a set of data.
  #
  # Args:
  #   lats: A list of latitude coordinates
  #   lngs: A list of longitude coordinates
  #
  # Returns:
  #   A list of values corresponding to the northwest-most and southeast-most coordinates
  # Convert to real number and remove NA values
  lats <- na.omit(as.numeric(lats))
  lngs <- na.omit(as.numeric(lngs))
  topLat <- max(lats)
  topLng <- min(lngs)
  botLat <- min(lats)
  botLng <- max(lngs)
  return(c(topLat, topLng, botLat, botLng))
}
startEndVals <- startEnd(all.data$Y, all.data$X)
startLat <- startEndVals[1]
endLat <- startEndVals[3]
startLng <- startEndVals[2]
endLng <- startEndVals[4]
num_intervals = 200.0
interval <- (startEndVals[1] - startEndVals[3]) / num_intervals
# testLng <- -66.6462379307115
# testLat <- 45.9581234392
# Prepare the data to be sent in
data <- all.data[,c("Y", "X", "levy2014_ha")]
sumInsideSquare <- function(pointLat, pointLng, interval, data) {
  # Sum all the values that fall within a square on a map given a point,
  # an interval of the map, and data that contains lat, lng and the values
  # of interest
  colnames(data) <- c("lat", "lng", "value")
  # Data east of point
  data <- data[data$lng > pointLng, ]
  # Data west of point + interval
  data <- data[data$lng < pointLng + interval, ]
  # Data north of point - interval (down)
  data <- data[data$lat > pointLat - interval, ]
  # Data south of point
  data <- data[data$lat < pointLat, ]
  # Clean remaining data
  data <- na.omit(data)
  return(sum(data$value))
}
# Debugging
# squareSumTemp <- sumInsideSquare(testLat, testLng, interval, data)
# Given a start longitude and an end longitude, calculate an array of values
# corresponding to the sums for that latitude
calcSumLat <- function(startLng, endLng, lat, interval, data) {
  row <- c()
  lng <- startLng
  while (lng < endLng) {
    row <- c(row, sumInsideSquare(lat, lng, interval, data))
    lng <- lng + interval
  }
  return(row)
}
# Debugging
# rowTemp <- calcSumLat(startLng, endLng, testLat, interval, data)
# write.csv(rowTemp, file = "Temp.csv", row.names = FALSE)
# Get each line of data to plot
lat <- startLat
rowCount <- 1
all.sums <- list()
while (lat > endLat) {
  col <- calcSumLat(startLng, endLng, lat, interval, data)
  all.sums[[as.character(rowCount)]] <- col
  lat <- lat - interval
  rowCount <- rowCount + 1
}
# Convert to data frame
all.sums.frame <- data.frame(all.sums)
# Save to disk so I don't have to run it again
write.csv(all.sums.frame, file = "Levy2014Sums200.csv", row.names = FALSE)
Ended up finding a solution to this myself. The key was using the foreach package together with the doParallel package, so that the job could take advantage of all the cores on my computer. There is a great guide on it here: http://www.r-bloggers.com/a-brief-foray-into-parallel-processing-with-r/
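For reference, a minimal sketch of that pattern applied to the row loop above; .export ships the helper function to the workers, since only objects referenced directly in the loop body are exported automatically:
library(foreach)
library(doParallel)
cl <- makeCluster(detectCores() - 1)  # leave one core free
registerDoParallel(cl)
lats <- seq(startLat, endLat, by = -interval)  # the rows the while-loop walked
all.sums <- foreach(lat = lats, .export = "sumInsideSquare") %dopar% {
  calcSumLat(startLng, endLng, lat, interval, data)
}
stopCluster(cl)
all.sums.frame <- data.frame(all.sums)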

Reading series of values in R

I have read a series of 332 files like below by storing the data in each file as a data frame in List.
files <- list.files()
data <- list()
for (i in 1:332){
  data[[i]] = read.csv(files[[i]])
}
The data has 3 columns with names id, city, town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10 for which I wrote the below code
for(j in 1:10){
  req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it is giving me a wrong value, and when I call it in a function it returns null values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains a single city's worth of data at any one time, and you get the wrong answer, as you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
             data.frame(city = rnorm(100)),
             data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3)  ## allocate, adjust to length = 10
for (j in 1:3) {                        ## adjust to 1:10 for your data / Q
  req.data[[j]] <- data[[j]]$city       ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Or alternatively, compute the mean of the individual cities and then average those means. (Note the two approaches agree only when every city has the same number of non-NA observations; otherwise the mean of means weights cities equally rather than observations.)
vec <- numeric(length = 3)  ## allocate, adjust to length = 10
for (j in 1:3) {            ## adjust to 1:10 for your question
  vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
