How to compute the overall mean for several files in R?

I have 365 files for one year (each read as a matrix with nrow = 500 and ncol = 700), and I want to compute the overall mean for that year.
To read one file:
con <- file("C:\\Users\\data.img","rb")
dat<- readBin(con, numeric(), size=4, n=700*500, signed=TRUE)
str(dat)
num [1:810438] 0.5 0.2 0.1...
To read all files:
dir1 <- list.files("C:\\Users\\datsets", "*.img", full.names = TRUE)
To loop through the files:
for (.files in seq_along(dir1)){
file1 <- readBin(dir1[.files], numeric(), size = 4, n = 700*500, signed = T)}
Any idea how to compute the mean of all values (pixel by pixel), so that I end up with one file of mean values?
Edit: I forgot to mention, I only want to compute the mean among elements (pixels) that have a positive value.

Here are two methods I can think of:
1) Using a for loop (memory efficient):
sum.dat <- rep(0, 810438)
sum.pos <- rep(0, 810438)
for (.file in dir1) {
  dat <- readBin(.file, numeric(), size = 4, n = 700*500, signed = TRUE)
  pos <- dat >= 0                  # TRUE where the pixel value is non-negative
  sum.dat <- sum.dat + dat * pos   # negative pixels contribute 0 to the running sum
  sum.pos <- sum.pos + pos         # count, per pixel, how many files had a non-negative value
}
mean.dat <- sum.dat / sum.pos
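One small caveat (my note, not part of the original answer): a pixel that is negative in every file ends up dividing 0 by 0 and comes out as NaN. If NA is preferred for such pixels, add:
mean.dat[sum.pos == 0] <- NA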
2) Using vapply (concise, but not memory efficient, as it loads all the data into memory at once; this might be what you want, though, if you plan to do further processing on all the data):
dats <- vapply(dir1, readBin, FUN.VALUE = numeric(810438),
               what = numeric(), size = 4, n = 700*500, signed = TRUE)
mean.dat <- rowMeans(ifelse(dats >= 0, dats, NA), na.rm = TRUE)
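To actually end up with one file of mean values, as the question asks, you could write mean.dat back out in the same 4-byte binary layout. A minimal sketch (the output path is just an example):
con.out <- file("C:\\Users\\mean_year.img", "wb")  # hypothetical output file
writeBin(as.numeric(mean.dat), con.out, size = 4)
close(con.out)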

Related

Using "lapply" in R to create multiple raster files from folder with lidar data

How can I read all files in a folder, run a script on each, and create separate outputs that keep the original file name? I have a folder with .las files and I need to create corresponding .asc files from them. My script is below:
library(lidR)
# Path to data
LASfile <- ("path/1234.las")
# Sorting out points in point cloud data, keeping vegetation and ground point classes.
las <- readLAS(LASfile, filter="-keep_class 1 2") # Keep high vegetation and ground point classes
# Normalizing ground points to 0 elevation (IDW interpolation), instead of meters above sea level.
dtm <- grid_terrain(las, algorithm = knnidw(k = 8, p = 2))
las_normalized <- normalize_height(las, dtm)
# Create a filter to remove points above 95th percentile of height
lasfilternoise <- function(las, sensitivity) {
  p95 <- grid_metrics(las, ~quantile(Z, probs = 0.95), 10)
  las <- merge_spatial(las, p95, "p95")
  las <- filter_poi(las, Z < p95*sensitivity)
  las$p95 <- NULL
  return(las)
}
# Generating a pit-free canopy height model without null values (Khosravipour et al., 2014)
las_denoised <- lasfilternoise(las_normalized, sensitivity = 1.2)
chm <- grid_canopy(las_denoised, 0.32, pitfree(c(0,2,5,10,15), c(3,1.5), subcircle = 0.2))
# Applying a median filter with a 3x3 moving window to smooth the image and remove noise
ker <- matrix(1,3,3)
chms <- raster::focal(chm, w = ker, fun = median)
plot(chms)
library(raster)
# Writing output file
writeRaster(chms, filename="path/1234.asc", format="ascii", overwrite=TRUE) # Change to the relevant path for each run
citation("lidR")
I tried using lapply but I don't know how to use it the right way.
It must be something like this to read all files in the folder: list.files("path", pattern = "*.las", full.names = TRUE)
and something like this to write the output files: lapply(r, writeRaster, filename = paste0(f, ".asc"), format = "ascii")
But I cannot get it right.
An example of my LAZ to LAS+Index conversion:
convertLAZ <- function(lazfile, outdir = "") {
  if (!dir.exists({{outdir}})) { dir.create({{outdir}}, recursive = TRUE) }
  print(lazfile)
  las <- lidR::readLAS(files = {{lazfile}}, filter = "-keep_class 2 9")
  .file <- stringi::stri_replace_all_regex({{lazfile}}, "^.*/", "")
  lidR::writeLAS(las, file = paste0({{outdir}}, "/", stringi::stri_replace_all_fixed(.file, "laz", "las")), index = TRUE)
}
f <- list.files("data/laz", pattern = "*.laz", full.names = TRUE)
lapply(f, convertLAZ, outdir = "data/las22")
You can expand it to rasterization, normalization, etc. and save as .asc. But I would encourage you to have a look at https://r-lidar.github.io/lidRbook/engine.html. In short: process your LAZ/LAS files as a LAScatalog, then tile the resulting raster and save it to .asc.
And here is an example of how to use parallel processing (3+1 processes in the example below). Please note it can be memory hungry, so be careful with the number of workers and processing parameters like opt_chunk_buffer.
library(future)
options(parallelly.availableCores.methods = "mc.cores")
options(mc.cores = 3)
plan(multisession)
parallelly::availableWorkers()
library(lidR)
myPath <- "data/las"
ctg <- readLAScatalog(myPath)
crs(ctg) <- "EPSG:2180"
ctg@output_options$drivers$SpatRaster$param$overwrite <- TRUE
opt_output_files(ctg) <- "data/dtm2/barycz__{XLEFT}_{YBOTTOM}"
opt_chunk_size(ctg) <- 500
opt_chunk_buffer(ctg) <- 600
opt_filter(ctg) <- "-keep_class 2 9"
summary(ctg)
vr <- rasterize_terrain(ctg, 0.25, tin())
plot(vr)
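If the final product needs to be .asc, one option (my assumption, not spelled out in the original answer) is to write the returned raster out with terra, which is what rasterize_terrain() returns in recent lidR versions; the .asc extension selects the ASCII grid driver:
terra::writeRaster(vr, "data/dtm2/dtm.asc", overwrite = TRUE)  # hypothetical output path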
Solved it now
.libPaths( c( "C:/Users/Public/R/win-library/4.2" , .libPaths() ) )
library(lidR)
createASCI <- function(lasfile, outdir = "") {
  if (!dir.exists({{outdir}})) { dir.create({{outdir}}, recursive = TRUE) }
  print(lasfile)
  las <- lidR::readLAS(files = {{lasfile}}, filter = "-keep_class 1 2 3 4 5")
  .file <- stringi::stri_replace_all_regex({{lasfile}}, "^.*/", "")
  # Normalizing ground points to 0 elevation (IDW interpolation), instead of meters above sea level.
  dtm <- grid_terrain(las, algorithm = knnidw(k = 8, p = 2))
  las_normalized <- normalize_height(las, dtm)
  # Create a filter to remove points above the 95th percentile of height
  lasfilternoise <- function(las, sensitivity) {
    p95 <- grid_metrics(las, ~quantile(Z, probs = 0.95), 10)
    las <- merge_spatial(las, p95, "p95")
    las <- filter_poi(las, Z < p95*sensitivity)
    las$p95 <- NULL
    return(las)
  }
  # Generating a pit-free canopy height model without null values (Khosravipour et al., 2014)
  las_denoised <- lasfilternoise(las_normalized, sensitivity = 1.2)
  chm <- grid_canopy(las_denoised, 0.32, pitfree(c(0,2,5,10,15), c(3,1.5), subcircle = 0.2))
  # Applying a median filter with a 3x3 moving window to smooth the image and remove noise
  ker <- matrix(1, 3, 3)
  chms <- raster::focal(chm, w = ker, fun = median)
  raster::writeRaster(chms, filename = paste0({{outdir}}, "/", stringi::stri_replace_all_fixed(.file, "las", "asc")), format = "ascii", overwrite = TRUE)
}
f <- list.files("C:/Lasdata", pattern = "*.las", full.names = TRUE)
lapply(f, createASCI, outdir = "C:/Lasdata/nytt")

Repeat iteration in a for loop in r

I am trying to generate a for loop that will repeat a sequence of the following:
sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)
I want it to repeat 5000 times. So far, I include the above as the body of the loop and added
for (i in seq_along[1:5000]){
at the beginning but I am getting an error message saying
Error in seq_along[1:10000] : object of type 'builtin' is not subsettable
We need replicate
out <- replicate(5000, sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)), simplify = FALSE)
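If you would rather have a 5000 x 10 matrix than a list, a small follow-up sketch (my addition; it also assumes the prob/x length mismatch discussed in the next answer has been fixed first):
out_mat <- do.call(rbind, out)  # one realization per row, 5000 x 10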
There are a few issues here.
@MartinGal noted the syntax issues with seq_along and the missing ). Note that you can use seq(n) or 1:n in defining the number of loops.
You are not storing the sampled vectors anywhere, so the for loop will run the code but you won't capture the output.
You have x = 1:14 but you only have 4 prob values, which suggests you intended x = 1:4 (either that or you are 10 prob values short).
Here's one way to address these issues using a for loop.
n <- 5      # number of samples (5 keeps the demo small; use 5000 for the question)
s <- 10     # size of each sample
xmax <- 4   # sample from 1:4
p <- 1/4    # equal probability for each value
out <- matrix(nrow = n, ncol = s, byrow = TRUE)
set.seed(1L)
for (i in seq(n)) {
  out[i, ] <- sample(x = seq(xmax), size = s, replace = TRUE, prob = rep(p, xmax))
}
As andrew reece notes in his comment, it looks like you want x = 1:4. Depending on what you want to do with your result, you could generate all of the realizations at one time, since you are sampling with replacement, and then store the result in a matrix with 5000 rows of 10 realizations per row. So:
x <- sample(1:4, size = 5000 * 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4))
result <- matrix(x, nrow = 5000)

Sample draw in sapply without replacement

How does one draw a sample within a sapply function without replacement? Consider the following MWE below. What I am trying to achieve is for a number in idDRAW to receive a letter from chrSMPL (given the sample size of chrSMPL). Whether a number from idDRAW receives a letter is determined by the respective probabilities, risk factors and categories. This is calculated in the sapply function and stored in tmp.
The issue is sample replacement, leading to a number being named with a letter more than once. How can one avoid replacement whilst still using the sapply function? I have tried to adjust the code from this question (Alternative for sample) to suit my needs, but no luck. Thanks in advance.
set.seed(3)
chr<- LETTERS[1:8]
chrSMPL<- sample(chr, size = 30, replace = TRUE)
idDRAW<- sort(sample(1:100, size = 70, replace = FALSE))
p_mat<- matrix(runif(16, min = 0, max = 0.15), ncol = 2); rownames(p_mat) <- chr ## probability matrix
r_mat <- matrix(rep(c(0.8, 1.2), each = length(chr)), ncol = 2); rownames(r_mat) <- chr ## risk factor matrix
r_cat<- sample(1:2, 70, replace = TRUE) ## risk categories
# find number from `idDRAW` to be named a letter:
Out<- sapply(chrSMPL, function(x){
tmp<- p_mat[x, 1] * r_mat[x, r_cat]
sample(idDRAW, 1, prob = tmp)
})
> sort(Out)[1:3]
G B B
5 5 5
I managed an alternative solution using a for loop, as seen below. If anyone can offer suggestions on how the desired result can be achieved without using a for loop, it would be greatly appreciated.
set.seed(3)
Out <- c()
for (i in 1:length(chrSMPL)) {
  tmp <- p_mat[chrSMPL[i], 1] * r_mat[chrSMPL[i], r_cat]
  Out <- c(Out, sample(idDRAW, 1, prob = tmp))
  rm <- which(idDRAW == Out[i])  # drop the drawn id so it cannot be drawn again
  idDRAW <- idDRAW[-rm]
  r_cat <- r_cat[-rm]
}
names(Out) <- chrSMPL
sort(Out)[1:3]
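One way to drop the explicit for loop is to carry the shrinking idDRAW and r_cat through the draws with Reduce(). This is a sketch of my own, not from the original post, and it is arguably no clearer than the loop:
set.seed(3)
draw_one <- function(state, x) {
  tmp <- p_mat[x, 1] * r_mat[x, state$r_cat]          # weights for the ids still available
  pick <- sample(seq_along(state$id), 1, prob = tmp)  # index of the drawn id
  state$out <- c(state$out, state$id[pick])
  state$id <- state$id[-pick]                         # remove it, so no replacement
  state$r_cat <- state$r_cat[-pick]
  state
}
res <- Reduce(draw_one, chrSMPL, init = list(id = idDRAW, r_cat = r_cat, out = NULL))
Out <- setNames(res$out, chrSMPL)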

Adding a column to a data frame by calculating each value to be added

Good evening,
I asked a question earlier and found it hard to implement the solution, so I am going to re-ask it in a clearer way.
The problem is that I want to add a column to a data frame of daily returns of a stock. Let's say the returns are normally distributed, and I would like to add a column containing the historical value at risk, for which I wrote a function myself.
The restriction is that my function should be applied to each observation together with the last 249 observations.
So when the next observation is calculated, it should again take only the last 249 observations of the days before; the input window should move as time goes on. In other words, I want values from 251 days ago to be excluded. Hopefully I explained myself well enough. If not, maybe the code speaks for me:
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist<- function(x, n=250, hd=20, q=0.05){
width<-nrow(x)
NA.x<-na.omit(x)
quantil<-quantile(NA.x[(width-249):width],probs=q)
VaR<- quantil*sqrt(hd)%>%
return()
}
# Run the function on the dataframe
df$VaR<- df$Returns%>%VaR.hist()
Error in (width - 249):width : argument of length 0
This is the error that I get, and not my new variable...
Thanks!!
As wibom wrote in the comment, nrow(x) does not work for vectors. What you need is length() instead. Also, you do not need return() in the last line, as R automatically returns the value of the last expression in a function if there is no earlier return().
library(dplyr)
df<- data.frame(Date=seq(ISOdate(2000,1,1), by = "days", length.out = 500), Returns=rnorm(500))
#function
VaR.hist <- function(x, n=250, hd=20, q=0.05){
  width <- length(x) # here you need length(), as x is a vector; nrow() only works for data.frames/matrices
  NA.x <- na.omit(x)
  quantil <- quantile(NA.x[(width-249):width], probs = q)
  quantil*sqrt(hd)
}
# Run the function on the dataframe
df$VaR <- df$Returns %>% VaR.hist()
It's a bit hard to understand what you want to do exactly.
My understanding is that you wish to compute a new variable VaR, calculated based on the current and previous 249 observations of df$Returns, right?
Is this about what you wish to do?:
library(tidyverse)
set.seed(42)
df <- tibble(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
the_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata %>%
    filter(ridx <= i, ridx > i - 249) %>%
    pull(Returns)
  quantil <- quantile(r, probs = q)
  VaR <- quantil*sqrt(hd)
}
df <- df %>%
  mutate(ridx = row_number()) %>%
  mutate(VaR = map_dbl(ridx, the_function, mydata = .))
If you are looking for a base-R solution:
set.seed(42)
df <- data.frame(
  Date = seq(ISOdate(2000, 1, 1), by = "days", length.out = 500),
  Returns = rnorm(500)
)
a_function <- function(i, mydata, hd = 20, q = .05) {
  r <- mydata$Returns[mydata$ridx <= i & mydata$ridx > (i - 249)]
  quantil <- quantile(r, probs = q)
  VaR <- quantil*sqrt(hd)
}
df$ridx <- 1:nrow(df) # add index
df$VaR <- sapply(df$ridx, a_function, mydata = df)
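If the zoo package is available, a rolling window over the last 250 returns (the current one plus the previous 249, as described in the question) can also be computed in one call with rollapply(). This is my addition rather than part of either answer above, and the column name VaR_roll is just an example:
library(zoo)
df$VaR_roll <- rollapply(df$Returns, width = 250,
                         FUN = function(r) quantile(r, probs = 0.05, na.rm = TRUE) * sqrt(20),
                         fill = NA, align = "right", partial = TRUE)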

Reading series of values in R

I have read a series of 332 files, as below, storing the data from each file as a data frame in a list.
files <- list.files()
data <- list()
for (i in 1:332){
data[[i]] = read.csv(files[[i]])
}
The data have 3 columns named id, city, and town. Now I need to calculate the mean of all values under city corresponding to the id values 1:10, for which I wrote the code below:
for(j in 1:10){
req.data <- data[[j]]$city
}
mean(na.omit(req.data))
But it is giving me a wrong value, and when I call it in a function it's transferring null values. Any help is highly appreciated.
Each time you iterate through j = 1:10 you assign data[[j]]$city to the object req.data. In doing so, for steps j = 2:10 you overwrite the previous version of req.data with the contents of the jth data set. Hence req.data only ever contains a single city's worth of data at any one time, and you get the wrong answer because you are computing the mean for the last city only, not all 10.
Also note that you could do mean(req.data, na.rm = TRUE) to remove the NAs.
You can do this without an explicit loop at the user R level using lapply(), for example, with dummy data,
set.seed(42)
data <- list(data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)),
data.frame(city = rnorm(100)))
mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
which gives
> mean(unlist(lapply(data, `[`, "city")), na.rm = TRUE)
[1] -0.02177902
So in your case, you need:
mean(unlist(lapply(data[1:10], `[`, "city")), na.rm = TRUE)
If you want to write a loop, then perhaps
req.data <- vector("list", length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your data / Q
req.data[[j]] <- data[[j]]$city ## fill in
}
mean(unlist(req.data), na.rm = TRUE)
> mean(unlist(req.data), na.rm = TRUE)
[1] -0.02177902
is one way. Alternatively, compute the mean of the individual cities and then average those means (note that the mean of the per-city means only equals the overall mean when each city contributes the same number of non-missing values, as in this dummy data):
vec <- numeric(length = 3) ## allocate, adjust to length = 10
for (j in 1:3) { ## adjust to 1:10 for your question
vec[j] <- mean(data[[j]]$city, na.rm = TRUE)
}
mean(vec)
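A compact equivalent of that second loop, if you prefer to avoid the explicit for loop (just a sketch):
vec <- sapply(data[1:3], function(d) mean(d$city, na.rm = TRUE)) ## adjust to data[1:10] for your question
mean(vec)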
