Confusion using gdal_translate.exe when rasters overlap

I'm confused about what gdal_translate is doing here. My understanding is that the files are written on top of each other, so the last file, file_b, with 5 NAs in it, should be what ends up in the final tif. But the NAs aren't there; the values from file_a (96, 97, 98, 99, 100) are.
library("terra")
library("glue")
a <- rast(nrow = 10, ncol = 10, vals = 1:100)
writeRaster(a, "file_a.tif")
b <- rast(nrow = 10, ncol = 10, vals = c(1:95, NA, NA, NA, NA, NA))
writeRaster(b, "file_b.tif")
raster_list <- c("file_a.tif", "file_b.tif")
list_for_vrt <- "vrt_list.txt"
lapply(raster_list, write, file = list_for_vrt, append = TRUE)
vrt_file <- "temp.vrt"
system(glue('{double_quote("C://Program Files//QGIS 3.26.1//bin//gdalbuildvrt.exe")} -b 1 -q -input_file_list {double_quote(list_for_vrt)} {double_quote(vrt_file)}'))
system(glue('{double_quote("C://Program Files//QGIS 3.26.1//bin//gdal_translate.exe")} -q -co {double_quote("PREDICTOR=2")} -co {double_quote("COMPRESS=LZW")} -of GTiff -co BIGTIFF=YES -co TILED=YES -co NUM_THREADS=ALL_CPUS {double_quote(vrt_file)} {double_quote(gsub(".vrt", ".tif", vrt_file))}'))
rast("temp.tif") |> setMinMax() |> minmax()
temp
min 1
max 100
Where have the NAs gone?
Finally, my goal here is to take the maximum value across overlapping rasters, so I want to be able to do that reliably (in the way that seems to be happening here, although I'm not sure why).
If I flip things around, gdal_translate does seem to write the second file over the first one.
library("terra")
library("glue")
a <- rast(nrow = 10, ncol = 10, vals = 101:200)
writeRaster(a, "file_a.tif", overwrite = TRUE)
b <- rast(nrow = 10, ncol = 10, vals = c(1:100))
writeRaster(b, "file_b.tif", overwrite = TRUE)
raster_list <- c("file_a.tif", "file_b.tif")
list_for_vrt <- "vrt_list.txt"
unlink(list_for_vrt)
lapply(raster_list, write, file = list_for_vrt, append = TRUE)
vrt_file <- "temp.vrt"
unlink(vrt_file)
system(glue('{double_quote("C://Program Files//QGIS 3.26.1//bin//gdalbuildvrt.exe")} -b 1 -q -input_file_list {double_quote(list_for_vrt)} {double_quote(vrt_file)}'))
system(glue('{double_quote("C://Program Files//QGIS 3.26.1//bin//gdal_translate.exe")} -q -co {double_quote("PREDICTOR=2")} -co {double_quote("COMPRESS=LZW")} -of GTiff -co BIGTIFF=YES -co TILED=YES -co NUM_THREADS=ALL_CPUS {double_quote(vrt_file)} {double_quote(gsub(".vrt", ".tif", vrt_file))}'))
rast("temp.tif") |> setMinMax() |> minmax()
temp
min 1
max 100
Thanks.
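For the stated end goal (cell-wise maximum across overlapping rasters), here is a minimal sketch that stays entirely within terra, using the toy files from above; it is offered as a possible alternative, not as an explanation of the gdal_translate behaviour:
library(terra)
a <- rast("file_a.tif")
b <- rast("file_b.tif")
# rasters with identical geometry: take the cell-wise maximum, letting NA
# lose to any real value
m <- app(c(a, b), max, na.rm = TRUE)
# rasters that only partially overlap: mosaic() with a summary function
# (fun = "max") is the usual tool
# m <- mosaic(a, b, fun = "max")
writeRaster(m, "max_combined.tif", overwrite = TRUE)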

Related

Repeat iteration in a for loop in R

I am trying to generate a for loop that will repeat a sequence of the following:
sample(x = 1:14, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)
I want it to repeat 5000 times. So far, I include the above as the body of the loop and added
for (i in seq_along[1:5000]){
at the beginning but I am getting an error message saying
Error in seq_along[1:10000] : object of type 'builtin' is not subsettable
We need replicate. (Note: as discussed below, x = 1:14 with only four prob values errors, so x = 1:4 is used here.)
out <- replicate(5000, sample(x = 1:4, size = 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4)), simplify = FALSE)
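If a 5000-by-10 matrix is more convenient than a list, a quick follow-up sketch:
out_mat <- do.call(rbind, out)  # one row per replication
dim(out_mat)  # 5000 10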
There are a few issues here.
@MartinGal noted the syntax issues with seq_along and the missing ). Note that you can use seq(n) or 1:n to define the number of loop iterations.
You are not storing the sampled vectors anywhere, so the for loop will run the code but you won't capture the output.
You have x = 1:14 but you only have 4 prob values, which suggests you intended x = 1:4 (either that or you are 10 prob values short).
Here's one way to address these issues using a for loop.
n <- 5
s <- 10
xmax <- 4
p <- 1/4
out <- matrix(nrow = n, ncol = s, byrow = TRUE)
set.seed(1L)
for (i in seq(n)) {
out[i, ] <- sample(x = seq(xmax), size = s, replace = TRUE, prob = rep(p, xmax))
}
As andrew reece notes in his comment, it looks like you want x = 1:4. Depending on what you want to do with your result, you could generate all of the realizations in one call (since you are sampling with replacement) and then store the result in a matrix with 5000 rows of 10 realizations each. So:
x <- sample(1:4, size = 5000 * 10, replace = TRUE, prob = c(1/4,1/4,1/4,1/4))
result <- matrix(x, nrow = 5000)
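A quick sanity check of this single-call version (a sketch):
dim(result)           # 5000 rows, 10 columns
prop.table(table(x))  # empirical frequencies, each roughly 0.25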

How to convert character string to executable code in R?

I have a dataframe e.g.
library(dplyr)      # used below for %>%, select_if() and select()
library(lubridate)  # for today()
df_reprex <- data.frame(id = rep(paste0("S",round(runif(100, 1000000, 9999999),0)), each=10),
date = rep(seq.Date(today(), by=-7, length.out = 10), 100),
var1 = runif(1000, 10, 20),
var2 = runif(1000, 20, 50),
var3 = runif(1000, 2, 5),
var250 = runif(1000, 100, 200),
var1_baseline = rep(runif(100, 5, 10), each=10),
var2_baseline = rep(runif(100, 50, 80), each=10),
var3_baseline = rep(runif(100, 1, 3), each=10),
var250_baseline = rep(runif(100, 20, 70), each=10))
I want to write a function containing a for loop that for each row in the dataframe will subtract every "_baseline" column from the non-baseline column with the same name.
I have created a script that automatically creates a character string containing the code I would like to run:
df <- df_reprex
# get only numeric columns
df_num <- df %>% dplyr::select_if(., is.numeric)
# create a version with no baselines
df_nobaselines <- df_num %>% select(-contains("baseline"))
#extract names of non-baseline columns
numeric_cols <- names(df_nobaselines)
#initialise empty string
mutatestring <- ""
#write loop to fill in string:
for (colname in numeric_cols) {
mutatestring <- paste(mutatestring, ",", paste0(colname, "_change"), "=", colname, "-", paste0(colname, "_baseline"))
# df_num <- df_num %>%
# mutate(paste0(col, "_change") = col - paste0(col, "_baseline"))
}
mutatestring <- substr(mutatestring, 4, 9999999) # remove stuff at start (I know it's inefficient)
mutatestring2 <- paste("df %>% mutate(", mutatestring, ")") # add mutate call
but when I try to call "mutatestring2" it just prints the character string e.g.:
[1] "df %>% mutate( var1_change = var1 - var1_baseline , var2_change = var2 - var2_baseline , var3_change = var3 - var3_baseline , var250_change = var250 - var250_baseline )"
I thought that this part would be relatively easy and I'm sure I've missed something obvious, but I just can't get the text inside that string to run!
I've tried various slightly ridiculous methods but none of them return the desired output (i.e. the result returned by the character string if it was entered into the console as a command):
call(mutatestring2)
eval(mutatestring2)
parse(mutatestring2)
str2lang(mutatestring2)
mget(mutatestring2)
diff_func <- function() {mutatestring2}
diff_func1 <- function() {
a <-mutatestring2
return(a)}
diff_func2 <- function() {str2lang(mutatestring2)}
diff_func3 <- function() {eval(mutatestring2)}
diff_func4 <- function() {parse(mutatestring2)}
diff_func5 <- function() {call(mutatestring2)}
diff_func()
diff_func1()
diff_func2()
diff_func3()
diff_func4()
diff_func5()
I'm sure there must be a very straightforward way of doing this, but I just can't work it out!
How do I convert a character string to something that I can run or pass to a magrittr pipe?
You need to use the text parameter in parse, then eval the result. For example, you can do:
eval(parse(text = "print(5)"))
#> [1] 5
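Applied to the string built above, that looks like this (a sketch, assuming mutatestring2 is defined as in the question):
df_new <- eval(parse(text = mutatestring2))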
However, using eval(parse()) is normally a very bad idea, and there is usually a more sensible alternative.
In your case you can do this without resorting to eval(parse()). For example, in base R you could subtract each baseline column from its matching non-baseline column like this:
baseline <- grep("_baseline$", names(df_reprex), value = TRUE)
non_baseline <- gsub("_baseline", "", baseline)
df_new <- cbind(df_reprex, as.data.frame(setNames(mapply(
  function(i, j) df_reprex[[j]] - df_reprex[[i]],  # non-baseline minus baseline
  baseline, non_baseline, SIMPLIFY = FALSE),
  paste0(non_baseline, "_corrected"))))
Or if you want to keep the whole thing in a single pipe without storing intermediate variables, you could do:
mapply(function(i, j) df_reprex[[j]] - df_reprex[[i]],  # non-baseline minus baseline
       grep("_baseline$", names(df_reprex), value = TRUE),
       gsub("_baseline", "", grep("_baseline$", names(df_reprex), value = TRUE)),
       SIMPLIFY = FALSE) %>%
  setNames(gsub("_baseline", "_corrected",
                grep("_baseline$", names(df_reprex), value = TRUE))) %>%
  as.data.frame() %>%
  {cbind(df_reprex, .)}
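A third option that avoids eval(parse()) but keeps a single mutate() call is to build the expressions programmatically and splice them in. A sketch, assuming df_reprex as defined above, and producing the "_change" names asked for in the question:
library(dplyr)
baseline <- grep("_baseline$", names(df_reprex), value = TRUE)
plain <- sub("_baseline$", "", baseline)
# build one expression of the form var1 - var1_baseline per pair of columns
change_exprs <- setNames(
  Map(function(v, b) call("-", as.name(v), as.name(b)), plain, baseline),
  paste0(plain, "_change"))
df_new <- df_reprex %>% mutate(!!!change_exprs)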

How to pre-process .csv files on integer values with cmd and fread from data.table

I have simulated data stored locally which I read into a list with lapply and fread, and then work from that list thereafter. All files are equal in dimensions and format. I am wondering how I can reduce the reading time by using the cmd argument of fread.
For example, if I am only interested in certain variables, the select argument comes in handy. The nrows argument is helpful when data from a specific year only is needed, by first identifying the rows that include the years less than or equal to the year of interest and then filtering afterwards.
Having come across this article, I have realised that the reading of files can be further optimised. However, I don't understand how to use the sprintf function, since I want to pre-process on an integer value (the year in my case) rather than a string. Maybe using lapply creates a problem too?
If someone could please help on preprocessing multiple files (subsetting by an integer value and selecting desired variables) with fread and the cmd argument it would be greatly appreciated.
Here is a MWE of my approach:
library(data.table)
library(dplyr)
keepRows <- min(which(df$year > 3)) - 1
all.df <- lapply(seq_along(dataFiles), function(x){
  x <- fread(dataFiles[x], select = c("year", "a", "c", "e"),
             nrows = keepRows)
  x <- filter(x, year == 3)
})
Data:
df <- data.frame(year = rep(1:6, each = 3),
a = sample(20:25, 18, replace = TRUE),
b = rnorm(18, 1),
c = rbinom(18, 30, 0.25),
d = sample(letters, 18),
e = paste0(sample(letters, 18), sample(1:3, 18, replace = TRUE)))
dataFiles <- paste0("df", 1:5, ".csv")
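The example does not show how the csv files themselves are written; as a sketch, writing five identical copies of df (the question says all files are equal in dimensions and format) makes it reproducible:
library(data.table)
for (f in dataFiles) fwrite(df, f)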
We can use awk within cmd (not tested):
library(data.table)
myCols <- c("year", "a", "b", "c", "d", "e")
mySelect <- c("year", "a", "c", "e")
myYear <- 3
ixCol <- match(mySelect, myCols)
all.df <- lapply(list.files("path/to/files/", pattern = "*.csv"), function(x){
myCmd <- paste0("awk -F, '$1 == ", myYear, "' ", x)
fread(cmd = myCmd, select = ixCol, col.names = myCols[ ixCol ])
})
Here is my go at things using the findstr command in Windows 10.
What it does: it builds a command that looks for all lines that start with "3,", "4,", "5,", "6,", "7,", "8," or "9," and passes those results to fread.
It needs the full path to your csv files; relative paths won't work. You will also have to use gsub to get Windows-readable filenames.
#what files to read?
files.to.read <- list.files( path = "C:/Users/your/Documents/here/temp", pattern = "^df[1-5]\\.csv$", full.names = TRUE )
#read files, keeping only lines that start with regex "^[3-9],"
pattern = "^[3-9],"
data.table::rbindlist(
lapply( files.to.read, function(x) {
data.table::fread( cmd = paste0( 'findstr /R "', pattern, '" ', gsub( "\\/", "\\\\", x ) ),
sep = ",",
header = FALSE )
}),
use.names = TRUE, idcol = "from_df" )
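Because findstr drops the header line, fread returns default column names (V1, V2, ...). A sketch for restoring the original names afterwards, assuming the rbindlist() result above is assigned to result (a hypothetical name) and the file columns follow the year, a, b, c, d, e order of the question's data:
myCols <- c("year", "a", "b", "c", "d", "e")
data.table::setnames(result, old = paste0("V", seq_along(myCols)), new = myCols)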

Reading data from an Excel workbook with many thousands of tabs

I am reading in data from individual xlsx files, with the data stored in 10-20 thousand individual tabs in each workbook file. The first sheet contains a master data table, including links to the individual tabs with further data.
The column based 'tabbed' data is summarized and transposed before being appended to the master data.
The master data table is large in its own right (tens of thousands of rows by hundreds of columns); the additional data tabs are small in their own right (a few columns by tens to a few hundred rows).
Using the XLConnect package crashes out of memory already on calling loadWorkbook() (R 3.4.0, RStudio 1.1.383, 64-bit, 8 GB machine); otherwise I could work along the lines of this.
Because I need to load from individual tabs, I am currently using a nested for() loop to load each individual tab's data. However, with my number of tabs this takes nearly a minute per loop, putting the total execution time at nearly a week! Using a nested for() loop is also decidedly non-tidy, so I suspect there is a neater and (much) faster way to achieve this, but I can't see it.
I have read in the links into a dedicated df (linkReferences) in R.
The data source is not mine, so I am stuck with the input as provided.
The problem is purely related to the speed of reading the sheets, which appears to grow as the number of sheets in a file (and thus the file size) grows.
I am looking for any solution to speed this up. (Updated with a self-contained minimal example below.)
On my PC: n = 10 gives ~0.16 sec/sheet, n = 100 ~0.56 sec/sheet, and n = 1000 ~3 sec/sheet, which is similar to what I'm seeing in my real data (<10 sec/sheet for 16k sheets).
library(tidyverse)
number_of_sheets= 100
# =========================================================================
# CREATE SAMPLE FILE . Layout similar to actual data
library(openxlsx)
my.sheets.file <- "sampleXLSX.xlsx"
linkReferences <- data_frame( sheet = str_c("Data ",seq(1:number_of_sheets)) )
wb <- write.xlsx(linkReferences, file=my.sheets.file)
sample_header <-data.frame( head_name = c("head1", "head2","head3","head4","head5") ,
head_text = c("text1", "text2","text3","text4","text5") )
set.seed(31415)
for (i in 1:number_of_sheets) {
cat(i,"..")
sheet_name_i <- paste0("Data ",i)
addWorksheet(wb, sheetName = sheet_name_i)
writeData(wb, sheet=sheet_name_i, sample_header, startCol = "B", startRow=2)
n = ceiling( runif(1)*200 )
sample_data <- data_frame(A=seq(1:n),
B= runif(n),
C= sample(seq(1:5),n,replace=TRUE))
writeData(wb, sheet=sheet_name_i, sample_data, startCol = "B", startRow=10)
}
saveWorkbook(wb, file=my.sheets.file, overwrite=TRUE)
#===========================================================================
# THIS IS THE ACTUAL QUESTION
# Read from file with many tabs
library(readxl)
library(stringr)
linkReferences <- linkReferences %>%
mutate( Head1 = NA, Head2 = NA, Head3 = NA, Head4 = NA, Head5 = NA,
A.1 = NA, B.1 = NA, C.1 = NA,
A.2 = NA, B.2 = NA, C.2 = NA,
A.3 = NA, B.3 = NA, C.3 = NA,
A.4 = NA, B.4 = NA, C.4 = NA,
A.5 = NA, B.5 = NA, C.5 = NA
)
linkReferences.nrows = nrow(linkReferences)
lRnames <- names(linkReferences)
start.row=1
start_time <- Sys.time()
for (i in start.row:linkReferences.nrows){
cat("i=",i, " / ",linkReferences.nrows,"\n")
start_time_i=Sys.time()
linked_data <- read_xlsx(my.sheets.file,
sheet=as.character(linkReferences[i,"sheet"]),
skip=2,
col_types = c("text","text","text"),
col_names=FALSE)
print(Sys.time()-start_time_i) # This takes 99% of the loop time
linkReferences[i,2:6] <- unlist( linked_data[1:5,2])
data_head_row <- which( linked_data[,1]=="A")
names(linked_data) <- c("A","B","C")
linked_data <- linked_data[ (data_head_row+1):(nrow(linked_data)),]
# create a (rather random) sample summary
summary_linked_data <- linked_data%>%
group_by(C) %>%
summarise(B=last(B), A=last(A)) %>%
arrange(desc(C))
# not all data has the full range of options, so use actual number
summary_linked_data_nrows <- nrow(summary_linked_data)
#start_time_i2 <- Sys.time()
for( ii in 1:summary_linked_data_nrows) {
linkReferences[i, match(str_c("A.",ii),lRnames):match(str_c("C.",ii),lRnames)] <-
summary_linked_data[ii,]
}
#print(Sys.time()-start_time_i2)
print(linkReferences[i,2:20])
# ________________________________________________________
# BELOW IS ONLY FOR TEST LOOP TIMING STATS IN THIS EXAMPLE
delta_time <- Sys.time() - start_time
delta_time_attr <- attr(delta_time, "units")
row_time <- delta_time/(i-start.row+1)
if (delta_time_attr =="mins") {
row_time <- row_time*60
} else if( delta_time_attr == "hours") {
row_time <- row_time*3600
}
total_time <- row_time*(linkReferences.nrows-start.row-1)/3600
cat( "Passed time: ", delta_time, attr(delta_time, "units"),
" | time/row: ", round(row_time,2), "secs.",
" | Est total time:",
round(total_time*60,2), "mins = )",
round(total_time,2), "hours )",
"\n---------------\n")
}
# Conversion of data loaded as character to numeric can all happen outside loop once all data is loaded.
After some digging: XLConnect, with its vectorised sheet-reading capability (see here), is the clear winner, provided you can fit your workbook in memory.
I had to (a) reduce the size of my workbook, and (b) set the XLConnect memory to 4 GB as per @Joshua's link here.
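One detail worth noting: the java.parameters option only takes effect if it is set before XLConnect (and therefore the JVM) is loaded, which is why the code below sets it first:
options(java.parameters = "-Xmx4g")
library(XLConnect)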
For the 1000 sheets example as per the question above:
wb <- loadWorkbook() took 15 seconds,
linked_data_lst = readWorksheet() took 34 seconds,
and the data extraction for (i in 1:nr_linked_data){...} from the now in-memory list, took 86 seconds.
Giving a total time of 0.135 sec/sheet (22x faster than the code above)
#============================================================================
# now read it again
library(stringr)
options(java.parameters = "-Xmx4g" )
library(XLConnect)
linkReferences <- linkReferences %>%
mutate( Head1 = NA, Head2 = NA, Head3 = NA, Head4 = NA, Head5 = NA,
A.1 = NA, B.1 = NA, C.1 = NA,
A.2 = NA, B.2 = NA, C.2 = NA,
A.3 = NA, B.3 = NA, C.3 = NA,
A.4 = NA, B.4 = NA, C.4 = NA,
A.5 = NA, B.5 = NA, C.5 = NA
)
linkReferences.nrows = nrow(linkReferences)
lRnames <- names(linkReferences)
lRcols <- c(match(str_c("A.1"),lRnames):match(str_c("C.5"),lRnames))
lRheadCols <- c((lRcols[1]-5):(lRcols[1]-1))
start_time <- Sys.time()
wb <- loadWorkbook(my.sheets.file)
Sys.time() - start_time
start.row=1
end.row = linkReferences.nrows
start_time0 <- Sys.time()
linked_data_lst = readWorksheet(wb,
sheet=linkReferences[start.row:end.row,][["sheet"]],
startCol = 2,
endCol = 4,
startRow = 3,
header = FALSE)
delta_time <- (Sys.time() - start_time0) %>% print()
nr_linked_data <- length(linked_data_lst)
start_time <- Sys.time()
for (i in 1:nr_linked_data ) {
cat("i=",i, " / ",nr_linked_data,"\n")
linked_data <- as_tibble(linked_data_lst[[i]])
# EVERYTHING BELOW HERE IS EXACTLY SAME AS IN QUESTION CODE
# =========================================================
linkReferences[i,lRheadCols] <- unlist( linked_data[1:5,2])
data_head_row <- which( linked_data[,1]=="A")
names(linked_data) <- c("A","B","C")
linked_data <- linked_data[ (data_head_row+1):(nrow(linked_data)),]
linked_data <- linked_data %>% mutate_all( funs(as.numeric) )
# create a (rather random) sample summary
summary_linked_data <- linked_data%>%
group_by(C) %>%
summarise(B=last(B), A=last(A)) %>%
arrange(desc(C))
# not all data has the full range of options, so use actual number
summary_linked_data_nrows <- nrow(summary_linked_data)
#start_time_i2 <- Sys.time()
for( ii in 1:summary_linked_data_nrows) {
linkReferences[i, match(str_c("A.",ii),lRnames):match(str_c("C.",ii),lRnames)] <-
summary_linked_data[ii,]
}
#print(Sys.time()-start_time_i2)
print(linkReferences[i,lRheadCols[1]:max(lRcols)])
delta_time <- Sys.time() - start_time
delta_time_attr <- attr(delta_time, "units")
row_time <- delta_time/(i-start.row+1)
if (delta_time_attr =="mins") {
row_time <- row_time*60
} else if( delta_time_attr == "hours") {
row_time <- row_time*3600
}
total_time <- row_time*(linkReferences.nrows-start.row-1)/3600
cat( "Passed time: ", delta_time, attr(delta_time, "units"),
" | time/row: ", round(row_time,2), "secs.",
" | Est total time:",
round(total_time*60,2), "mins = )",
round(total_time,2), "hours )",
"\n---------------\n")
}

How to compute the overall mean for several files in R?

I have 365 files for one year (each can be read as a matrix with nrow = 500 and ncol = 700), and I want to compute the overall mean for that year.
To read one file:
con <- file("C:\\Users\\data.img","rb")
dat<- readBin(con, numeric(), size=4, n=700*500, signed=TRUE)
str(dat)
num [1:810438] 0.5 0.2 0.1...
To read all files:
dir1<- list.files("C:\\Users\\datsets", "*.img", full.names = TRUE)
To loop through the files:
for (.files in seq_along(dir1)){
file1 <- readBin(dir1[.files], numeric(), size = 4, n = 700*500, signed = T)}
Any idea how to compute the mean of all values (pixel by pixel), so that I end up with one file of mean values?
Edit: I forgot to mention that I only want to compute the mean among elements (pixels) that have a positive value.
Here are two methods I can think of:
1) Using a for loop (memory efficient):
sum.dat <- rep(0, 810438)
sum.pos <- rep(0, 810438)
for (.file in dir1) {
dat <- readBin(.file, numeric(), size = 4, n = 700*500, signed = TRUE)
pos <- dat >= 0
sum.dat <- sum.dat + dat * pos
sum.pos <- sum.pos + pos
}
mean.dat <- sum.dat / sum.pos
2) Using vapply (concise, but not memory-efficient, as it loads all the data into memory at once; this might be what you want, though, if you plan to do further processing on all the data):
dats <- vapply(dir1, readBin, FUN.VALUE = numeric(810438),
what = numeric(), size = 4, n = 700*500, signed = TRUE)
mean.dat <- rowMeans(ifelse(dats >= 0, dats, NA), na.rm = TRUE)
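To actually end up with one file of mean values, the result can be written back out in the same 4-byte float layout as the inputs; a sketch (the output path is hypothetical):
# cells that were never >= 0 in any file come out as NaN (0/0)
con.out <- file("C:\\Users\\mean_year.img", "wb")
writeBin(as.numeric(mean.dat), con.out, size = 4)
close(con.out)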
