I'm using the raster package on an R server to process a large set of data files (30,000 files of 10 MB each).
For now, processing consists of parsing the data and subsequently rasterizing it via the rasterize function.
The data is very sparse (only along roads) but has a high resolution and large extent. I've seen temporary files of 30GB for a raster created from one of the input files.
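To put that size in perspective, here is a rough, illustrative calculation at the resolution I use below, assuming a hypothetical 1 x 1 degree bounding box and 8 bytes per cell (the numbers are only meant to show the order of magnitude):
res_x <- 0.0000566
res_y <- 0.0000359
ncells <- (1 / res_x) * (1 / res_y)   # cells in a 1 x 1 degree box
ncells * 8 / 1e9                      # ~3.9 GB uncompressed, so larger extents easily reach tens of GB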
Because of the number of files, I'm using a foreach() %dopar% approach to process the files, giving one file to each thread. I've set the raster options as follows:
rasterOptions(maxmemory = 15000000000)
rasterOptions(chunksize = 14000000000)
rasterOptions(todisk = TRUE)
This should come out to 15GB/thread * 32 threads = 480GB of RAM used at maximum for the rasters. Adding some overhead, I would expect somewhere between 10GB and 20GB of the 512GB RAM to remain. However, that is not the case and I can't seem to figure out why.
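Spelled out, the arithmetic I'm assuming here (treating maxmemory as a per-worker limit in bytes):
maxmemory_per_worker <- 15e9             # bytes, as set via rasterOptions above
n_workers <- 32
maxmemory_per_worker * n_workers / 1e9   # 480 GB of the 512 GB total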
R gobbles up RAM until only 100MB to 2GB remain, and only then does it seem to release previously allocated memory, which is immediately consumed again for the next raster. I checked the RAM usage repeatedly over several hours to observe this.
I'm using SpatialPointsDataFrames as input for rasterize, and suspected they might take up a lot of RAM as well. But when I checked their size, they were rather small, at about 100MB. Playing around with maxmemory, chunksize and using only 16 threads also didn't seem to have any effect.
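For reference, the size check was nothing more elaborate than this (scanList$scan0 standing in for one of the objects passed to rasterize):
format(object.size(scanList$scan0), units = "MB")   # roughly 100 MB each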
I also looked at the rasterize source code to see if I could find an explanation there, but that didn't get me far:
setMethod('rasterize', signature(x='SpatialPoints', y='Raster'),
    function(x, y, field, fun='last', background=NA, mask=FALSE, update=FALSE, updateValue='all', filename="", na.rm=TRUE, ...){
        .pointsToRaster(x, y, field=field, fun=fun, background=background, mask=mask, update=update, updateValue=updateValue, filename=filename, na.rm=na.rm, ...)
    }
)
I have no clue where to find .pointsToRaster.
Does anyone have an explanation for this behaviour or ideas for things to check? Did I simply overlook something? I'd like not to use the entire RAM so that other users can still work on the server. From what I understand, my code should regulate how much RAM is used.
Here's the code I use:
library('iterators')
library('parallel')
library('foreach')
library('doParallel')
#init parallelisation
nCores = 32
cCluster = makeCluster(nCores, type = "FORK", outfile = "parseProcess")
registerDoParallel(cCluster)
foreach(j = 1:length(fileList)) %dopar% {
  #load all libraries for every thread
  library('sp')
  library('raster')
  library('spatial')
  library('gstat')
  library('rgdal')
  library('dismo')
  library('deldir')
  library('rgeos')
  library('sjmisc')
  #set rasteroptions per thread
  rasterOptions(maxmemory = 15000000000)
  rasterOptions(chunksize = 14000000000)
  rasterOptions(todisk = TRUE)
  tmpFolder = paste0("[PATH TO STORAGE]/rtmp", j)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)
  #generate names for raster files
  fileName = basename(fileList[j])
  print(paste("Processing:", fileName))
  rNameMax0 = sub(pattern = ".bin", replacement = "_scan0_max.tif", fileName)
  #repeat this for all 11 scans
  rasterStorage = "[PATH TO OTHER STORAGE]" #path to raster folder
  scanList = parseFile(fileList[j]) #any memory allocated in this function should be released on function return
  #create template raster
  bounds = as.vector(t(bbox(scanList$scan0)))
  resolution = c(0.0000566, 0.0000359)
  tmp = raster(xmn = bounds[1], xmx = bounds[2], ymn = bounds[3], ymx = bounds[4], res = resolution)
  #create rasters from data
  coordinates(scanList$scan0) = ~Long+Lat
  proj4string(scanList$scan0) = WGS84CRS
  rScanMax0 = rasterize(scanList$scan0, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax0))
  rm('rScanMax0')
  #repeat for scans 1 to 4
  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
  dir.create(tmpFolder)
  rasterOptions(tmpdir = tmpFolder)
  coordinates(scanList$scan5) = ~Long+Lat
  proj4string(scanList$scan5) = WGS84CRS
  rScanMax5 = rasterize(scanList$scan5, tmp, fun = 'max', filename = paste0(rasterStorage, rNameMax5))
  rm('rScanMax5')
  #repeat for scans 6 to 10
  removeTmpFiles(h = 0.2)
  unlink(tmpFolder, recursive = TRUE, force = TRUE)
}
stopCluster(cCluster)
Here's the (gutted) code of the parseFile function:
parseFile = function(fileName){
  con = file(fileName, "rb")
  intSize = 4
  fileEndian = "little"
  #create data frames for each scan
  scan0 = data.frame(matrix(ncol = n1, nrow = 0))
  colnames(scan0) = c("Lat", "Long", ...)
  scan1 = data.frame(matrix(ncol = n2, nrow = 0))
  colnames(scan1) = c("Lat", "Long", ...)
  scan2 = data.frame(matrix(ncol = n3, nrow = 0))
  colnames(scan2) = c("Lat", "Long", ...)
  scan3 = data.frame(matrix(ncol = n4, nrow = 0))
  colnames(scan3) = c("Lat", "Long", ...)
  scan4 = data.frame(matrix(ncol = n5, nrow = 0))
  colnames(scan4) = c("Lat", "Long", ...)
  scan5 = data.frame(matrix(ncol = n6, nrow = 0))
  colnames(scan5) = c("Lat", "Long", ...)
  scan6 = data.frame(matrix(ncol = n7, nrow = 0))
  colnames(scan6) = c("Lat", "Long", ...)
  scan7 = data.frame(matrix(ncol = n8, nrow = 0))
  colnames(scan7) = c("Lat", "Long", ...)
  scan8 = data.frame(matrix(ncol = n9, nrow = 0))
  colnames(scan8) = c("Lat", "Long", ...)
  scan9 = data.frame(matrix(ncol = n10, nrow = 0))
  colnames(scan9) = c("Lat", "Long", ...)
  scan10 = data.frame(matrix(ncol = n11, nrow = 0))
  colnames(scan10) = c("Lat", "Long", ...)
  header = readBin(con, raw(), n = 36)
  i = 1
  while(i){
    blockHeader = readBin(con, integer(), n = 3, size = intSize, endian = fileEndian)
    if(...){ #check whether file ended
      break
    }
    i = i + 1
    #sort data to correct scan, assign GPS tag
    blockTrailer = readBin(con, raw(), n = 8)
  }
  #clean up
  close(con)
  #return parsed data
  returnList = list("scan0" = scan0, "scan1" = scan1, "scan2" = scan2, "scan3" = scan3, "scan4" = scan4,
                    "scan5" = scan5, "scan6" = scan6, "scan7" = scan7, "scan8" = scan8, "scan9" = scan9, "scan10" = scan10)
  return(returnList)
}
I'm also looking at the solutions posted here as another approach, but I'd still like to know why my code doesn't work as I expect it to.
Related
I am trying to run the Monocle3 function find_gene_modules() on a cell_data_set (cds), but am getting a variety of errors. I have not had any other issues before this. I am working with an imported Seurat object. My first error stated that the number of rows was not the same between my cds and cds@preprocess_aux$gene_loadings values. I took a look and it seems my gene loadings were a list under cds@preprocess_aux@listData$gene_loadings. I then ran the following code to make a data frame version of the gene loadings:
test <- seurat@assays$RNA@counts@Dimnames[[1]]
test <- as.data.frame(test)
cds@preprocess_aux$gene_loadings <- test
rownames(cds@preprocess_aux$gene_loadings) <- cds@preprocess_aux$gene_loadings[,1]
This created a cds@preprocess_aux$gene_loadings data frame with the same number of rows and row names as my cds. It resolved my original error, but now a new error is thrown from uwot:
15:34:02 UMAP embedding parameters a = 1.577 b = 0.8951
Error in uwot(X = X, n_neighbors = n_neighbors, n_components = n_components, :
No numeric columns found
Running traceback() produces the following information.
> traceback()
4: stop("No numeric columns found")
3: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,
metric = metric, n_epochs = n_epochs, alpha = learning_rate,
scale = scale, init = init, init_sdev = init_sdev, spread = spread,
min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio,
local_connectivity = local_connectivity, bandwidth = bandwidth,
gamma = repulsion_strength, negative_sample_rate = negative_sample_rate,
a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k,
method = "umap", approx_pow = approx_pow, n_threads = n_threads,
n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y,
target_n_neighbors = target_n_neighbors, target_weight = target_weight,
target_metric = target_metric, pca = pca, pca_center = pca_center,
pca_method = pca_method, pcg_rand = pcg_rand, fast_sgd = fast_sgd,
ret_model = ret_model || "model" %in% ret_extra, ret_nn = ret_nn ||
"nn" %in% ret_extra, ret_fgraph = "fgraph" %in% ret_extra,
batch = batch, opt_args = opt_args, epoch_callback = epoch_callback,
tmpdir = tempdir(), verbose = verbose)
2: uwot::umap(as.matrix(preprocess_mat), n_components = max_components,
metric = umap.metric, min_dist = umap.min_dist, n_neighbors = umap.n_neighbors,
fast_sgd = umap.fast_sgd, n_threads = cores, verbose = verbose,
nn_method = umap.nn_method, ...)
1: find_gene_modules(cds[pr_deg_ids, ], reduction_method = "UMAP",
max_components = 2, umap.metric = "cosine", umap.min_dist = 0.1,
umap.n_neighbors = 15L, umap.fast_sgd = FALSE, umap.nn_method = "annoy",
k = 20, leiden_iter = 1, partition_qval = 0.05, weight = FALSE,
resolution = 0.001, random_seed = 0L, cores = 1, verbose = T)
I really have no idea what I am doing wrong or how to proceed from here. Does anyone with experience with uwot know where my error is coming from? Really appreciate the help!
I have generated several Utilisation Distributions (UD) with AdehabitatHR and stored them as GeoTiffs. I am now using the same UDs with the Lattice package to generate some maps and saving them to a high-res tiff image with LZW compression. The problem is that I have literally hundreds of maps to make, save and name. Is there a way to automatically do this once I have loaded all the necessary files from a directory? Each one of my UDs has a filename with the structure "UD_resolution_species_area_year_season.tif", and in the final name I give to my map I would like to keep the same structure (or the entire filename) but add the prefix "blablabla_", e.g. "blablabla_UD_resolution_species_area_year_season.tiff". Each image also includes a main title, a capital letter, which should also change.
At the moment I am using the following:
rlist = list.files(getwd(), pattern = "tif$", full.names = FALSE)
for (i in rlist) {
assign(unlist(strsplit(i, "[.]"))[1], raster(i))
}
shplist = list.files(getwd(), pattern = "shp$", full.names = FALSE)
for (i in shplist) {
assign(unlist(strsplit(i, "[.]"))[1], readOGR(i))
}
UD <- 'UD_resolution_species_area_year_season'
ext <- extent(UD) + 0.3 # set the extent for the plot
aa <-
quantile(UD,
probs = c(0.25, 0.75),
type = 8,
names = TRUE)
my.at <- c(aa[1], aa[2])
my.at <- round(my.at, 3)
maxval <- maxValue(UD)
tiff(
"C:/myworkingdirectory/maps/blablabla_UD_resolution_species_area_year_season.tiff",
res = 600,
compression = "lzw",
width = 15,
height = 15,
units = "cm"
)
levelplot(
UD,
xlab = "",
ylab = "",
xlim = c(ext[1], ext[2]),
ylim = c(ext[3], ext[4]),
margin = FALSE,
contour = FALSE,
col.regions = viridis(1000),
colorkey = list(at = seq(0, maxval)),
main = "A",
maxpixels = 2e5
) + latticeExtra::layer(sp.polygons(Land, fill = "grey50", col = NA)) + contourplot(
`UD`,
at = my.at[1],
labels = FALSE,
margin = FALSE,
lty = 2,
col = "orange",
pretty = TRUE
) + contourplot(
UD,
at = my.at[2],
labels = FALSE,
margin = FALSE,
lty = 2,
col = "red",
pretty = TRUE,
)
dev.off()
It is a common beginner's mistake to use assign. Do not use it; it creates the type of trouble you are now facing. Instead, you can make lists and/or use a loop.
Also, what you are asking is basic R stuff, but you are complicating the question by adding lots of irrelevant detail about setting the extent and about levelplot. It is better to learn these basic things by removing the clutter and focusing on a simple case first. That is also how you should write questions for this forum.
In essence, you have a bunch of files you want to process. Below I show how you can build a vector of the filenames and then loop over it, doing what you need to do inside the loop.
library(raster)
rastfiles <- list.files(pattern = "tif$", full.names=TRUE)
outputfiles <- file.path("output/path", paste0("prefix_", basename(rastfiles)))
for (i in 1:length(rastfiles)) {
  r <- raster(rastfiles[i])
  png(outputfiles[i])
  plot(r)
  dev.off()
}
You can also first read all the files into a list
rastfiles <- list.files(pattern = "tif$", full.names = TRUE)
rlist <- lapply(rastfiles, raster)
names(rlist) <- gsub("\\.tif$", "", basename(rastfiles))
shpfiles <- list.files(pattern = "shp$", full.names = TRUE)
slist <- lapply(shpfiles, readOGR)
names(slist) <- gsub("\\.shp$", "", basename(shpfiles))
And perhaps create a vector of output filenames
outputtif <- file.path("output/dir", basename(rastfiles))
And then loop over the items in the list, or over the output filenames.
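For instance, a minimal sketch of that last loop, reusing rlist and outputtif from above (plot() here is just a stand-in for the levelplot() call in the question):
for (i in seq_along(rlist)) {
  # one high-res LZW-compressed tiff per UD, named after the input file
  tiff(outputtif[i], res = 600, compression = "lzw",
       width = 15, height = 15, units = "cm")
  plot(rlist[[i]], main = names(rlist)[i])  # main title taken from the filename
  dev.off()
}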
Dear All R Developers,
I maintain the package GENEAread and have recently found a bug that comes from within the function header.info. This function is designed to read in the header information stored in a GENEActiv binary file, produced by the GENEActiv actigraphy watch. This information is stored in the first 100 lines of the binary file.
The part of this function that reads values in incorrectly uses the function scan(). Until recently this worked; however, the frequency read in by header.info now takes a different form because the output of scan() varies between calls.
Below is some sample code which demonstrates the issue:
install.packages("GENEAread")
library(GENEAread)
binfile = system.file("binfile/TESTfile.bin", package = "GENEAread")[1]
nobs = 300
info <- vector("list", 15)
# index <- c(2, 20:22, 26:29)
tmpd = readLines(binfile, 300)
#try to find index positions - so will accommodate multiple lines in the notes sections
#change when new version of binfile is produced.
ind.subinfo = min(which((tmpd == "Subject Info" )& (1:length(tmpd) >= 37)))
ind.memstatus = max(which(tmpd == "Memory Status"))
ind.recdata = (which(tmpd == "Recorded Data"))
ind.recdata = ind.recdata[ind.recdata > ind.memstatus][1:2]
ind.calibdata = max(which(tmpd == "Calibration Data"))
ind.devid = min(which(tmpd == "Device Identity"))
ind.config = min(which(tmpd == "Configuration Info"))
ind.trial = min(which(tmpd == "Trial Info"))
index = c(ind.devid + 1, ind.recdata[1] + 8, ind.config + 2:3, ind.trial +
1:4, ind.subinfo + 1:7, ind.memstatus + 1)
if (max(index) == Inf){
  stop("Corrupt headers or not GENEActiv file!", call. = FALSE)
}
# Read in header info
nm <- NULL
for (i in 1:length(index)) {
  line = strsplit(tmpd[index[i]], split = ":")[[1]]
  el = ""
  if (length(line) > 1){
    el <- paste(line[2:length(line)], collapse = ":")
  }
  info[[i]] <- el
  nm[i] <- paste(strsplit(line[1], split = " ")[[1]], collapse = "_")
}
info <- as.data.frame(matrix(info), row.names = nm)
colnames(info) <- "Value"
Decimal_Separator = "."
if (length(grep(",", paste(tmpd[ind.memstatus + 8:9], collapse = ""))) > 0){
  Decimal_Separator = ","
}
info = rbind(info,
Decimal_Separator = Decimal_Separator)
# more here
# if (more){
# grab calibration data etc as well
calibration = list()
fc = file(binfile, "rt")
index = sort(c(ind.config + 4,
ind.calibdata + 1:8,
ind.memstatus + 1,
ind.recdata + 3,
ind.recdata[1] + c(2,8))
)
#### First appearance in the function header.info of the function scan. ####
# tmp <- substring(scan(fc,
# skip = index[1] - 1,
# what = "",
# n = 3,
# sep = " ",
# quiet = TRUE)[3],
# c(1,2,5),
# c(1, 3, 6))
# Isolating scan and running multiple times #
scan(fc,
skip = index[1] - 1,
what = "",
n = 3,
sep = " ",
quiet = TRUE)[3]
scan(fc,
skip = index[1] - 1,
what = "",
n = 3,
sep = " ",
quiet = TRUE)[3]
scan(fc,
skip = (index[1] - 1),
what = "",
n = 3,
sep = " ",
quiet = TRUE)[3]
#### Checking the same thing happens with the substring ####
# Checking by using 3.4.3 possibly
substring(scan(fc,
skip = index[1] - 1,
what = "",
n = 3,
sep = " ",
quiet = TRUE)[3],
c(1,2,5),
c(1, 3, 6))
substring(scan(fc,
skip = index[1] - 1,
what = "",
n = 3,
sep = " ",
quiet = TRUE)[3],
c(1,2,5),
c(1, 3, 6))
substring(scan(fc,
skip = index[1] - 1,
what = "",
n = 3,
sep = " ",
quiet = TRUE)[3],
c(1,2,5),
c(1, 3, 6))
Why does the output of the scan function vary? I have run the examples given on the scan help page and the output is the same if the code is run more than once. What in the build-up to running this function can cause the output to vary?
Any help would be much appreciated.
You opened the fc connection using
fc = file(binfile, "rt")
This means scan() will read from it and leave it open, with the file pointer advanced to the end of the read. Each time you call scan(), you are reading a later part of the file. That's why the results vary.
If you want to always read the same part of the file, you would do it something like this:
seek(fc, 0)
scan(fc, ...)
seek(fc, 0)
scan(fc, ...)
Alternatively, don't open fc when you create it, and scan() will open and close it each time. You do this by writing
fc <- file(binfile) # No open specified
Or even more simply (but a tiny bit less efficiently)
fc <- binfile
which will create a new connection each time.
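To see the difference concretely, here is a small self-contained illustration with a throwaway text file (nothing GENEActiv-specific):
tf <- tempfile(fileext = ".txt")
writeLines(as.character(1:10), tf)
fc <- file(tf, "rt")                       # open connection, as in your code
scan(fc, what = "", n = 3, quiet = TRUE)   # "1" "2" "3"
scan(fc, what = "", n = 3, quiet = TRUE)   # "4" "5" "6" -- the pointer has advanced
seek(fc, 0)                                # rewind to the start
scan(fc, what = "", n = 3, quiet = TRUE)   # "1" "2" "3" again
close(fc)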
I used the bibliometrix package in R and want to plot some useful graphs.
library(bibliometrix)
??bibliometrix
D<-readFiles("E:\\RE\\savedrecs.txt")
M <- convert2df(D,dbsource = "isi", format= "plaintext")
results <- biblioAnalysis(M ,sep = ";" )
S<- summary(object=results,k=10, pause=FALSE)
plot(x=results,k=10,pause=FALSE)
options(width=100)
S <- summary(object = results, k = 10, pause = FALSE)
NetMatrix <- biblioNetwork(M1, analysis = "co-occurrences", network = "author_keywords", sep = ";")
S <- normalizeSimilarity(NetMatrix, type = "association")
net <- networkPlot(S, n = 200, Title = "co-occurrence network",type="fruchterman", labelsize = 0.7, halo = FALSE, cluster = "walktrap",remove.isolates=FALSE, remove.multiple=FALSE, noloops=TRUE, weighted=TRUE)
res <- thematicMap(net, NetMatrix, S)
plot(res$map)
But the call net <- networkPlot(S, n = 200, Title = "co-occurrence network", type = "fruchterman", labelsize = 0.7, halo = FALSE, cluster = "walktrap", remove.isolates = FALSE, remove.multiple = FALSE, noloops = TRUE, weighted = TRUE) shows the error
Error in V<-(*tmp*, value = *vtmp*) : invalid indexing
Also, I cannot do the CR; it always shows unlistCR. I cannot use the NetMatrix function either.
Can someone please help me?
The problem is in the data itself, not in the code you presented. When I downloaded the data from bibliometrix.com and changed M1 to M (a typo?) in the biblioNetwork() call, everything worked perfectly. Please see the code below:
library(bibliometrix)
# Plot bibliometric analysis results
D <- readFiles("http://www.bibliometrix.org/datasets/savedrecs.txt")
M <- convert2df(D, dbsource = "isi", format= "plaintext")
results <- biblioAnalysis(M, sep = ";")
S <- summary(results)
plot(x = results, k = 10, pause = FALSE)
# Plot Bibliographic Network
options(width = 100)
S <- summary(object = results, k = 10, pause = FALSE)
NetMatrix <- biblioNetwork(M, analysis = "co-occurrences", network = "author_keywords", sep = ";")
S <- normalizeSimilarity(NetMatrix, type = "association")
net <- networkPlot(S, n = 200, Title = "co-occurrence network", type = "fruchterman",
labelsize = 0.7, halo = FALSE, cluster = "walktrap",
remove.isolates = FALSE, remove.multiple = FALSE, noloops = TRUE, weighted = TRUE)
# Plot Thematic Map
res <- thematicMap(net, NetMatrix, S)
str(M)
plot(res$map)
I would like to render a dynamic network in R using the fast MDSJ library. Unfortunately, however, all the vertices' coordinates seem to be 0,0 when using this rendering engine, which is not the case when using one of the other layouts (kamadakawai or Graphviz). If you paste the code below, you should be able to reproduce the problem.
if (!require("pacman")) install.packages("pacman")
library("pacman")
pacman::p_load(network, networkDynamic, ndtv)
#animation.mode = "MDSJ"
#animation.mode = "Graphviz"
animation.mode = "kamadakawai"
people <- c("A","B","C","D","E")
documents <- paste0("a",1:10)
edges <- data.frame(from = c("A","A","A","B","B","C","D"),
to = c("a1","a2","a3","a4","a5","a1","a1"),
active = c(1,2,3,3,4,4,4))
net <- network.initialize(0, directed = TRUE, bipartite = length(people))
add.vertices.networkDynamic(net, 5, vertex.pid = people)
add.vertices.networkDynamic(net, 10, vertex.pid = documents)
net %v% "vertex.names" <- c(people, documents)
net %v% "vertex.col" <- c(rep("blue", length(people)), rep("gray", length(documents)))
set.network.attribute(net,'vertex.pid','vertex.names')
add.edges.networkDynamic(net,
tail = get.vertex.id(net, edges[[1]]),
head = get.vertex.id(net, edges[[2]]),
edge.pid = paste0(edges[[1]], "->", edges[[2]]))
activate.edges(net, e = 1:7, at = edges[[3]])
reconcile.vertex.activity(net = net, mode = "encompass.edges", edge.active.default = FALSE)
slice.par <- list(start = 1, end = 4, interval = 1, aggregate.dur = 2, rule = "earliest")
compute.animation(net,
animation.mode = animation.mode,
slice.par = slice.par)
render.d3movie(net,
slice.par = slice.par,
displaylabels = TRUE,
output.mode = "htmlWidget",
vertex.col = 'vertex.col')
Using kamadakawai, one gets a dynamic view like this:
Using MDSJ, all slides look like this:
This code works on my system with MDSJ. Does it install correctly on yours? When it's first used, it has to download and install a Java application mdsj.jar.
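If it does not, one generic thing worth checking (this is just a general sanity check, not part of ndtv) is whether a Java runtime is visible from R at all, since mdsj.jar needs one:
Sys.which("java")   # an empty string here means no java executable is on the PATH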