I'm trying to make a function in R that performs some specific operations on a lot of different data sets, using the following code:
library(parallel)
cluster = makeCluster(2)
setwd("D:\\Speciale")
data_func <- function(kommune) {
rm(list=ls())
library(dplyr)
library(data.table)
library(tidyr)
#Load address and turbine datasets
distances <- fread(file="Adresser og distancer\\kommune.csv", header=TRUE, sep=",", colClasses = c("longitude" = "character", "latitude" = "character", "min_distance" = "character", "distance_turbine" = "character", "id_turbine" = "character"), encoding="Latin-1")
turbines <- fread(file="turbines_DK.csv", header=TRUE, sep=",", colClasses = c("lon" = "character", "lat" = "character", "id_turbine" = "character", "total_height" = "character", "location" = "character"), encoding="Latin-1")
#Some cleaning of the data and construction of new variables
#write out the dataset
setwd("D:\\Speciale\\Analysedata")
fwrite(mock_final, file = "final_kommune.csv", row.names = FALSE)
}
do.call(rbind, parLapply(cl = cluster, c("Albertslund", "Alleroed"), data_func))
When I do this, I get the following error message:
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: File 'Adresser og distancer\kommune.csv' does not exist or is non-readable. getwd()=='C:/Users/KSAlb/OneDrive/Dokumenter'
I need it to change the name of the files. Here it should insert Albertslund instead of kommune in the file names, perform the operations, write out a CSV file (changing "final_kommune.csv" to "final_Albertslund.csv"), clear the environment and then move on to the next data set, Alleroed.
Albertslund and Alleroed are just examples; there are 98 data sets in total that I need to process.
Maybe something like the code below can help. It is untested, since no data are available.
library(parallel)
library(dplyr)
library(data.table)
library(tidyr)
data_func <- function(kommune, inpath = "Adresser og distancer",
                      turbines, outpath = "D:/Speciale/Analysedata") {
  filename <- paste0(kommune, ".csv")
  filename <- file.path(inpath, filename)
  #Load address and turbine datasets
  distances <- fread(
    file = filename,
    header = TRUE,
    sep = ",",
    colClasses = c("longitude" = "character", "latitude" = "character", "min_distance" = "character", "distance_turbine" = "character", "id_turbine" = "character"),
    encoding = "Latin-1"
  )
  #Some cleaning of the data and construction of new variables
  #write out the dataset (mock_final stands in for the cleaned data set)
  outfile <- paste0("final_", kommune, ".csv")
  outfile <- file.path(outpath, outfile)
  fwrite(mock_final, file = outfile, row.names = FALSE)
}
cluster = makeCluster(2)
setwd("D:\\Speciale")
# Read turbines file just once
turbines <- fread(
  file = "turbines_DK.csv",
  header = TRUE,
  sep = ",",
  colClasses = c("lon" = "character", "lat" = "character", "id_turbine" = "character", "total_height" = "character", "location" = "character"),
  encoding = "Latin-1"
)
kommune_vec <- c("Albertslund", "Alleroed")
do.call(rbind, parLapply(cl = cluster, kommune_vec, data_func, turbines = turbines))
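One caveat worth adding (an assumption about the setup above, since makeCluster(2) defaults to a PSOCK cluster): the workers start as fresh R sessions that inherit neither the master's working directory nor its loaded packages, so both should be set up on every node before parLapply() is called, and the cluster stopped afterwards. A minimal sketch:
# PSOCK workers are fresh R sessions: give each one the working
# directory and the data.table package that data_func() relies on.
clusterEvalQ(cluster, {
  setwd("D:/Speciale")
  library(data.table)
})
# ... run the parLapply() call above, then release the workers:
stopCluster(cluster)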
I'm using the raster package on an R server to process a large set (30000 files) of data files (10MB each).
For now, processing consists of parsing the data and subsequently rasterizing it via the rasterize function.
The data is very sparse (only along roads) but has a high resolution and large extent. I've seen temporary files of 30GB for a raster created from one of the input files.
Because of the number of files, I'm using a foreach() %dopar% approach to processing the files, giving one file to each thread. I've set the raster options as follows:
rasterOptions(maxmemory = 15000000000)
rasterOptions(chunksize = 14000000000)
rasterOptions(todisk = TRUE)
This should come out to 15GB/thread * 32 threads = 480GB of RAM used at maximum for the rasters. Add some overhead, and I would expect somewhere between 10GB and 20GB of the 512GB RAM to remain. However, that is not the case and I can't seem to figure out why.
R gobbles up RAM until only 100MB to 2GB remain and only then seems to release previously allocated memory, only to be fed straight back into R for the next raster. I checked the RAM usage repeatedly over several hours to observe this.
I'm using SpatialPointsDataFrame objects as input for rasterize, and suspected they might take up a lot of RAM as well. But when I checked their size, they were rather small, at about 100MB. Playing around with maxmemory, chunksize and only 16 threads also didn't seem to have any effect.
I also looked at the rasterize source code to see if I find an explanation there, but that didn't get me far:
setMethod('rasterize', signature(x = 'SpatialPoints', y = 'Raster'),
  function(x, y, field, fun = 'last', background = NA, mask = FALSE, update = FALSE,
           updateValue = 'all', filename = "", na.rm = TRUE, ...) {
    .pointsToRaster(x, y, field = field, fun = fun, background = background,
                    mask = mask, update = update, updateValue = updateValue,
                    filename = filename, na.rm = na.rm, ...)
  }
)
I have no clue where to find .pointsToRaster.
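(For what it's worth, unexported functions like this can usually be inspected with getAnywhere() or the triple-colon operator; a quick check, assuming the raster package is installed:)
# Two ways to look at an unexported function from a package's namespace
getAnywhere(".pointsToRaster")
raster:::.pointsToRaster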
Does anyone have an explanation for this behaviour or ideas for things to check? Did I simply overlook something? I'd like to avoid using the entire RAM so that other users can still work on the server. From what I understand, my code should regulate how much RAM is used.
Here's the code I use:
library('iterators')
library('parallel')
library('foreach')
library('doParallel')
#init parallelisation
nCores = 32
cCluster = makeCluster(nCores, type = "FORK", outfile = "parseProcess")
registerDoParallel(cCluster)
foreach(j = 1:length(fileList)) %dopar% {
  #load all libraries for every thread
  library('sp')
  library('raster')
  library('spatial')
  library('gstat')
  library('rgdal')
  library('dismo')
  library('deldir')
  library('rgeos')
  library('sjmisc')

  #set raster options per thread
  rasterOptions(maxmemory = 15000000000)
  rasterOptions(chunksize = 14000000000)
  rasterOptions(todisk = TRUE)
  tmpFolder = paste0("[PATH TO STORAGE]/rtmp", j)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)

  #generate names for raster files
  fileName = basename(fileList[j])
  print(paste("Processing:", fileName))
  rNameMax0 = sub(pattern = ".bin", replacement = "_scan0_max.tif", fileName)
  #repeat this for all 11 scans
  rasterStorage = "[PATH TO OTHER STORAGE]" #path to raster folder
  scanList = parseFile(fileList[j]) #any memory allocated in this function should be released on function return

  #create template raster
  bounds = as.vector(t(bbox(scanList$scan0)))
  resolution = c(0.0000566, 0.0000359)
  tmp = raster(xmn = bounds[1], xmx = bounds[2], ymn = bounds[3], ymx = bounds[4], res = resolution)

  #create rasters from data
  coordinates(scanList$scan0) = ~Long+Lat
  proj4string(scanList$scan0) = WGS84CRS
  rScanMax0 = rasterize(scanList$scan0, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax0))
  rm('rScanMax0')
  #repeat for scans 1 to 4

  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)

  coordinates(scanList$scan5) = ~Long+Lat
  proj4string(scanList$scan5) = WGS84CRS
  rScanMax5 = rasterize(scanList$scan5, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax5))
  rm('rScanMax5')
  #repeat for scans 6 to 10

  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
}
stopCluster(cCluster)
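(Side note on debugging this: one cheap way to see where each worker's usage peaks is to log gc() output at checkpoints inside the foreach body; the sum of the "(Mb)" column approximates how much memory R currently holds. A sketch, with logMem() being a hypothetical helper:)
# Hypothetical helper: report how much memory R holds at a checkpoint.
# With a FORK cluster and outfile set, the output lands in that file.
logMem <- function(tag) {
  g <- gc()  # also triggers a collection; returns Ncells/Vcells usage
  cat(tag, "- Mb in use:", sum(g[, 2]), "\n")
}
# e.g. logMem("after parseFile"); logMem("after rasterize scan0")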
Here's the (gutted) code of the parseFile function:
parseFile = function(fileName){
  con = file(fileName, "rb")
  intSize = 4
  fileEndian = "little"

  #create data frames for each scan
  scan0 = data.frame(matrix(ncol = n1, nrow = 0))
  colnames(scan0) = c("Lat", "Long", ...)
  scan1 = data.frame(matrix(ncol = n2, nrow = 0))
  colnames(scan1) = c("Lat", "Long", ...)
  scan2 = data.frame(matrix(ncol = n3, nrow = 0))
  colnames(scan2) = c("Lat", "Long", ...)
  scan3 = data.frame(matrix(ncol = n4, nrow = 0))
  colnames(scan3) = c("Lat", "Long", ...)
  scan4 = data.frame(matrix(ncol = n5, nrow = 0))
  colnames(scan4) = c("Lat", "Long", ...)
  scan5 = data.frame(matrix(ncol = n6, nrow = 0))
  colnames(scan5) = c("Lat", "Long", ...)
  scan6 = data.frame(matrix(ncol = n7, nrow = 0))
  colnames(scan6) = c("Lat", "Long", ...)
  scan7 = data.frame(matrix(ncol = n8, nrow = 0))
  colnames(scan7) = c("Lat", "Long", ...)
  scan8 = data.frame(matrix(ncol = n9, nrow = 0))
  colnames(scan8) = c("Lat", "Long", ...)
  scan9 = data.frame(matrix(ncol = n10, nrow = 0))
  colnames(scan9) = c("Lat", "Long", ...)
  scan10 = data.frame(matrix(ncol = n11, nrow = 0))
  colnames(scan10) = c("Lat", "Long", ...)

  header = readBin(con, raw(), n = 36)

  i = 1
  while(i){
    blockHeader = readBin(con, integer(), n = 3, size = intSize, endian = fileEndian)
    if(...){ #check whether the file ended
      break
    }
    i = i + 1
    #sort data to the correct scan, assign GPS tag
    blockTrailer = readBin(con, raw(), n = 8)
  }

  #clean up
  close(con)

  #return parsed data
  returnList = list("scan0" = scan0, "scan1" = scan1, "scan2" = scan2, "scan3" = scan3, "scan4" = scan4,
                    "scan5" = scan5, "scan6" = scan6, "scan7" = scan7, "scan8" = scan8, "scan9" = scan9,
                    "scan10" = scan10)
  return(returnList)
}
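One hedged observation on the elided loop body above: if each block's rows are appended to these data frames one at a time, every append copies the whole frame, which is slow and churns memory. Accumulating blocks in a list and binding once at the end avoids that; a sketch, where readBlock() is a hypothetical stand-in for the elided block parsing:
# Sketch: collect parsed blocks in a list and bind once at the end,
# instead of growing a data.frame inside the while loop.
blocks <- list()
repeat {
  block <- readBlock(con)      # hypothetical: parse one block, NULL at EOF
  if (is.null(block)) break
  blocks[[length(blocks) + 1L]] <- block
}
scan0 <- do.call(rbind, blocks)  # one allocation instead of one per block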
I'm also looking at the solutions posted here as another approach, but I'd still like to know why my code doesn't work as I expect it to.
I built this R script that generates a map with background tiles. The problem is that I need to run it on the PowerBI service, which has very constrained resources (RAM and CPU). I've attached a reproducible example below.
This example works fine in the PowerBI service, but when I try it with my real data, only the raster or only the map works on its own; when I render both, I get an error saying I exceeded the available resources, and since this is not documented, I don't know whether the bottleneck is CPU or RAM.
What's the best way to profile this code and check which section to change?
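(One option for the profiling part, at least when running locally outside the PowerBI sandbox, is R's built-in sampling profiler with memory profiling switched on; a minimal sketch, where script.R is a hypothetical file holding the script below:)
# Profile time and memory, then summarise allocations per call
Rprof("profile.out", memory.profiling = TRUE)
source("script.R")
Rprof(NULL)
summaryRprof("profile.out", memory = "both")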
Please note that the dataset is a raster saved as ASCII using saveRDS; that step is done outside PowerBI and the result is loaded as a CSV file, since PowerBI does not read binary data.
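(For context, the encoding step done outside PowerBI presumably looks something like the sketch below: the raster is serialized with saveRDS(ascii = TRUE) and the resulting text lines are packed, joined with "#", into a one-column CSV that the script can split back apart. The object and file names are illustrative:)
# Hypothetical sketch of the step done outside PowerBI: serialize the
# raster as ASCII text and pack the lines into a CSV for PowerBI.
saveRDS(background_raster, "background.rds", ascii = TRUE)
txt <- readLines("background.rds")
grp <- ceiling(seq_along(txt) / 1000)  # chunk so each CSV cell stays small
packed <- vapply(split(txt, grp), paste, character(1), collapse = "#")
write.csv(data.frame(Index = seq_along(packed), Value = packed),
          "powerbidf.csv", row.names = FALSE)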
# Input load. Please do not change, the dataset is generated by PowerBI, I change it only to have a reproducible example #
`dataset` = read.csv('https://raw.githubusercontent.com/djouallah/loadRobjectPBI/master/powerbidf.csv', check.names = FALSE, encoding = "UTF-8", blank.lines.skip = FALSE);
# Original Script. Please update your script content here and once completed copy below section back to the original editing window #
library(sf)
library(dplyr)
library(tmap)
library(tidyr)
tempdf <- dataset %>%
  filter(!is.na(Value)) %>%
  dplyr::select(Index, Value) %>%
  arrange(Index) %>%
  mutate(Value = strsplit(as.character(Value), "#")) %>%
  unnest(Value) %>%
  dplyr::select(Value)
write.table(tempdf, file="test3.rds",row.names = FALSE,quote = FALSE, col.names=FALSE)
rm(tempdf)
background <- readRDS('test3.rds', refhook = NULL)
dataset <- dataset[c("x","y","color","status","labels")]
dataset$color <- as.character(dataset$color)
dataset$labels <- as.character(dataset$labels)
map <- st_as_sf(dataset,coords = c("x", "y"), crs = 4326)
chartlegend <- dataset %>%
  dplyr::select(status, color) %>%
  distinct(status, color) %>%
  arrange(status)
rm(dataset)
tm_shape(background) +
  tm_rgb() +
  rm(background) +
  tm_shape(map) +
  tm_symbols(col = "color", size = 0.04, shape = 19) +
  tm_shape(filter(map, !is.na(labels))) +
  tm_text(text = "labels", col = "white") +
  rm(map) +
  tm_add_legend(type = 'fill', labels = chartlegend$status, col = chartlegend$color) +
  tm_layout(frame = FALSE, bg.color = "transparent", legend.width = 2) +
  tm_legend(position = c("left", "top"), text.size = 1.3) +
  rm(chartlegend)
Changing the code to use only base R did help a bit:
# Input load. Please do not change #
`dataset` = read.csv('https://raw.githubusercontent.com/djouallah/loadRobjectPBI/master/powerbidf.csv', check.names = FALSE, encoding = "UTF-8", blank.lines.skip = FALSE);
# Original Script. Please update your script content here and once completed copy below section back to the original editing window #
library(sf)
library(tmap)
tempdf <- dataset[dataset$Value!="",]
tempdf <- tempdf[c("Index","Value")]
tempdf <- tempdf[order(tempdf$Index),]
tempdf <- stack(setNames(strsplit(as.character(tempdf$Value),'#'), tempdf$Index))
tempdf <- tempdf["values"]
write.table(tempdf, file="test3.rds",row.names = FALSE,quote = FALSE, col.names=FALSE)
rm(tempdf)
background <- readRDS('test3.rds', refhook = NULL)
dataset <- dataset[c("x","y","color","status","labels")]
dataset$color <- as.character(dataset$color)
map <- st_as_sf(dataset,coords = c("x", "y"), crs = 4326)
chartlegend <- unique(dataset[c("status","color")])
rm(dataset)
tm_shape(background) +
  tm_rgb() +
  rm(background) +
  tm_shape(map) +
  tm_symbols(col = "color", size = 0.04, shape = 19) +
  tm_text(text = "labels", col = "white") +
  rm(map) +
  tm_add_legend(type = 'fill', labels = chartlegend$status, col = chartlegend$color) +
  tm_layout(frame = FALSE, outer.margins = c(0.005, 0.6, 0.06, 0.005), bg.color = "transparent", legend.width = 2) +
  tm_legend(position = c("right", "top"), text.size = 1.3) +
  rm(chartlegend)
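A further hedged tweak for memory-constrained hosts: following each standalone rm() with an explicit gc() call may release the freed memory sooner, before tmap starts allocating, e.g.:
rm(dataset)
gc()  # prompt R to actually release what rm() just freed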
When creating a PNG file using writeGDAL, a georeferencing file (.aux.xml) is created alongside the PNG file. Is there a way to prevent this from happening?
The following code creates the files as explained above.
library(raster)
library(rgdal)
r <- raster(xmn=742273.5, xmx=742702.5, ymn=6812515.5, ymx=6812995.5, ncols=144, nrows=161)
r <- setValues(r, 1:ncell(r))
rSpdf <- as(r, 'SpatialPixelsDataFrame')
rSpdf$colors <- as.numeric(cut(rSpdf$layer, breaks = 10))
writeGDAL(rSpdf[, 'colors'], 'test.png', drivername = 'PNG', type = 'Byte', mvFlag = 0, colorTables = list(colorRampPalette(c('black', 'white'))(11)))
Setting rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE") prevents the .aux.xml file from being created.
Thank you Val for pointing me to the post.
library(raster)
library(rgdal)
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", "FALSE")
r <- raster(xmn=742273.5, xmx=742702.5, ymn=6812515.5, ymx=6812995.5, ncols=144, nrows=161)
r <- setValues(r, 1:ncell(r))
rSpdf <- as(r, 'SpatialPixelsDataFrame')
rSpdf$colors <- as.numeric(cut(rSpdf$layer, breaks = 10))
writeGDAL(rSpdf[, 'colors'], 'test.png', drivername = 'PNG', type = 'Byte', mvFlag = 0, colorTables = list(colorRampPalette(c('black', 'white'))(11)))
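If the option should only apply to this one export, it can be unset again afterwards by passing NULL (per the rgdal documentation for setCPLConfigOption):
# Restore default behaviour so later GDAL writes create .aux.xml again
rgdal::setCPLConfigOption("GDAL_PAM_ENABLED", NULL)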
I have a series of CSV files that I want to prepare to append together. My appended file will be large, so I'd like to convert some string variables to numeric and date formats in the individual files rather than the larger appended file.
With other software, I would have one for loop that opens the file and nested for loops that would iterate over certain groups of variables. For this project, I am attempting to use R and apply functions.
I have mapply and lapply functions that work independently. I'm now trying to figure out how to combine them. Can I nest them? (See below for the independent parts and the nesting.)
(This code references code in the answer to How do I update data frame variables with sapply results?)
(Is it customary to provide an example CSV to give a reproducible example? Does R have built-in example CSVs?)
These work separately:
insert.division <- function(fileroot, divisionname){
  ext <- ".csv"
  file <- paste(fileroot, ext, sep = "")
  data <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
  data$division <- divisionname
  write.csv(data, file = paste(fileroot, "_adj3", ext, sep = ""),
            row.names = FALSE)
}
files <- c(
  "file1",
  "file2",
  "file3",
  "file4",
  "file5"
)
divisions <- c(1:5)
#Open the files, insert division name, save new versions
mapply(insert.division, fileroot = files, divisionname = divisions)
#Change currency variables from string to numeric
currency.vars <- c(
  "Price",
  "RetailPrice"
)
df[currency.vars] <- lapply(
  df[currency.vars],
  function(x) as.numeric(sub("^\\(", "-", gsub("[$,]|\\)$", "", x)))
)
Combined version:
file.prep <- function(fileroot, divisionname, currency.vars){
  ext <- ".csv"
  file <- paste(fileroot, ext, sep = "")
  data <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
  data$division <- divisionname
  df[currency.vars] <- lapply(
    df[currency.vars],
    function(x) as.numeric(sub("^\\(", "-", gsub("[$,]|\\)$", "", x)))
  )
  write.csv(data, file = paste(fileroot, "_adj", ext, sep = ""),
            row.names = FALSE)
}
#Open the files, insert division name, change the currency variables,
#save new versions
mapply(file.prep, fileroot = files, divisionname = divisions,
       currency.vars = df[currency.vars])
I'm not really sure why you're writing it back to file after changing the data, but here's an example of how I might approach the problem.
## Set up three csv files
set.seed(1)
DF <- data.frame(
  w = paste0("($", sample(1500, 30) / 100, ")"),
  x = Sys.Date() + 0:29,
  y = sample(letters, 30, TRUE),
  z = paste0("($", sample(1500, 30) / 100, ")")
)
fnames <- paste0("file", 1:3, ".csv")
Map(write.csv, split(DF, c(1, 10, 20)), fnames, row.names = FALSE)
Using your file.prep() function, you could adjust it a little and do
file.prep <- function(fileroot, divname, vars) {
  ext <- ".csv"
  file <- paste0(fileroot, ext)
  data <- read.csv(file, stringsAsFactors = FALSE)
  data$division <- divname
  data[vars] <- lapply(data[vars], function(x) {
    type.convert(gsub("[()$]", "", x))
  })
  write.csv(data, row.names = FALSE, file = paste0(fileroot, "_adj", ext))
}
divname <- 1:3
fnames <- paste0("file", divname)
Map(file.prep, fnames, divname, MoreArgs = list(vars = c("w", "z")))
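To sanity-check the result, re-read one of the output files and confirm the currency columns came back numeric:
# w and z should now be numeric; division should be present
str(read.csv("file1_adj.csv"))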