Spatial data and memory - r

I am trying to add up GeoTIFFs but am running into memory issues. R is using all 32 GB, according to the following R error:
In writeValues(y, x, start = 1) :
Reached total allocation of 32710Mb: see help(memory.size)
I also checked the properties of R: it is 64-bit, and the shortcut target is
"C:\Program Files\R\R-3.3.0\bin\x64\Rgui.exe"
The version is
R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "3.0"
$year
[1] "2016"
$month
[1] "05"
$day
[1] "03"
$`svn rev`
[1] "70573"
$language
[1] "R"
$version.string
[1] "R version 3.3.0 (2016-05-03)"
$nickname
[1] "Supposedly Educational"
So it looks like R is hitting my maximum memory. I tried the bigmemory package: in the code below I changed the matrix to a big.matrix, but that failed, and the error occurs when writing the output file. Any suggestions for altering the code so that less memory is used, or for making this work with the ff or bigmemory packages?
############ LOOP THROUGH AGE MAPS TO COMPILE THE NUMBER OF TIMES A CELL BURNS DURING A GIVEN SPAN OF TIME ####################
## Empirical Fires
library(raster)
print("1 of 3: 2010-2015")
burn.mat <- matrix(0, nrow, ncol) # create a matrix of zeros, the dimensions of your landscape (row, col)
# Read in historical fire maps
for (j in 2010:2015){ # year loop
  age.tmp <- as.matrix(raster(paste('fr', j, '.tif', sep=''))) # read in age map
  burn.mat <- burn.mat + (age.tmp == 1) # in the ALFRESCO empirical fire history files, AGE = 1 where something has burned; (age.tmp == 1) is a logical test returning a 0/1 map for FALSE/TRUE
  # Write the data to a GeoTIFF
  out <- raster(burn.mat, xmn=-1692148, xmx=1321752, ymn=490809.9, ymx=2245610, crs='+proj=aea +lat_1=55 +lat_2=65 +lat_0=50 +lon_0=-154 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs')
  writeRaster(out, filename=paste(outdir, '/burn.mat.hist.1950-2007.tif', sep=''), format='GTiff', options='COMPRESS=LZW', datatype='FLT4S', overwrite=TRUE)
}

The problem will probably go away if you use Raster* objects rather than matrices. Something like
library(raster)
r <- raster('fr2010.tif')
burn.mat <- setValues(r, 0)
for (j in 2010:2015) {
  age.tmp <- raster(paste0('fr', j, '.tif'))
  burn.mat <- burn.mat + (age.tmp == 1)
  # if age.tmp only has values of 0 and 1, use this instead:
  # burn.mat <- burn.mat + age.tmp
}
# write the results outside of the loop
writeRaster(burn.mat, filename=file.path(outdir, 'burn.mat.hist.1950-2007.tif'), options='COMPRESS=LZW', datatype='FLT4S', overwrite=TRUE)
A more direct approach without a loop:
files <- paste0('fr', 2010:2015, '.tif')
s <- stack(files)
burn <- sum(s)
Or
burn <- sum(s == 1)
Or to write to a file in one step
b <- calc(s, sum, filename=file.path(outdir, 'burn.mat.hist.1950-2007.tif'), options='COMPRESS=LZW', datatype='FLT4S', overwrite=TRUE)
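If memory is still tight even with Raster* objects, the raster package can also be told to process large files block-by-block from disk instead of holding everything in RAM. A minimal sketch (the option values below are illustrative assumptions, not tuned recommendations):
library(raster)
# make raster more conservative about keeping data in memory
rasterOptions(maxmemory = 1e8, chunksize = 1e6)
# or force processing to go through temporary files on disk
# rasterOptions(todisk = TRUE)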

Related

Printing side-by-side data.table with title to PDF

I created a table from a data.frame using:
t_eth1 <- table(df_all$lettergrade1,df_all$ethnicity).
My output is great and I can run chisq.test on it just fine; class(t_eth1) is "table".
It looks like the following (there are headers for columns tmp2 through tmp8 but no header for column tmp1, which gives me grief when trying to make a tibble because of the size mismatch, and I don't know what to do about that):
tmp1 <- c("A","B","C","D","F")
tmp2 <- c(0,1,1,1,0)
tmp3 <- c(4,1,4,1,0)
tmp4 <- c(0,0,2,0,0)
tmp5 <- c(3,5,12,2,0)
tmp6 <- c(1,5,6,2,0)
tmp7 <- c(8,10,16,3,2)
tmp8 <- c(0,2,2,1,0)
My table will look similar to:
tst <- table(mtcars$cyl,mtcars$mpg)
I have many of these tables. I'd like to print them in some sort of grid fashion with a title above each table and the chisq.stat I calculated below each table.
table1title1 tabletitle2
| val val val | | val val val|
| val val val | | val val val|
| val val val | | val val val|
p-value chi stuff p-value chi stuff
I cannot seem to get the tables to print to the PDF using either of the calls below (they print to the console pane in RStudio, but nothing reaches the PDF device the way I'd expect from plot or ggplot):
pdf(file = "testprint.pdf", onefile = TRUE)
t_eth1   # or: print(t_eth1)
dev.off()
I've tried gtable, but it wants a tibble input and I can't seem to convert data.table to tibble.
I've tried miktex, but it won't take the table format; in the error log it gives me: "! Sorry, but C:\Users\datad\AppData\Local\Programs\MiKTeX\miktex\bin\x64\pdflatex.exe did not succeed."
I've tried expss which allows me to set.caption above tables, but it won't print the table to the pdf file.
I'm stuck. A little hand holding would be great.
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "4"
$minor
[1] "0.2"
$year
[1] "2020"
$month
[1] "06"
$day
[1] "22"
$`svn rev`
[1] "78730"
$language
[1] "R"
$version.string
[1] "R version 4.0.2 (2020-06-22)"
$nickname
[1] "Taking Off Again"

R cbind is very slow

There is a question with a very similar title (cbind runs very slow), but it does not help me with my problem. I am retrieving 10+ JSON files with 100 variables each and am trying to create one big data.frame/data.table with 1000 columns. In practice I do not use the very same JSON file as in the example below but different ones. Ideally only the problematic line, cx <- cbind(cx, bx), would be sped up, since the other steps (unlist, as.data.table) work well for me and I would not know what else to use. I know that "cbind is slow", but do I have any alternatives? Ideally in base R.
library(jsonlite)
library(data.table)
starttime <- Sys.time()
for (i in 1:10) { # loop through all 10 JSON files
  zz <- Sys.time() # measure the time for each loop
  urlx <- "http://mysafeinfo.com/api/data?list=englishmonarchs&format=json"
  jsnx <- fromJSON(urlx)
  if (i == 1) {
    ax <- unlist(jsnx)
    bx <- as.data.table(ax)
    cx <- bx
  }
  for (j in 1:100) { # loop through all 100 variables in each file
    ax <- unlist(jsnx)
    bx <- as.data.table(ax)
    cx <- cbind(cx, bx) # <---- VERY SLOW ----
  }
  zz <- round(Sys.time() - zz, 1)
  print(sprintf("%1.1f", zz))
  flush.console()
}
endtime <- Sys.time()
endtime-starttime
This gets slower and slower with more files; here are my timings.
[1] "0.7"
[1] "1.3"
[1] "1.3"
[1] "1.6"
[1] "2.1"
[1] "2.2"
[1] "2.5"
[1] "3.2"
[1] "3.4"
[1] "3.5"

Count number of times a word-wildcard appears in text (in R)

I have a vector of either regular words ("activated") or wildcard words ("activat*"). I want to:
1) Count the number of times each word appears in a given text (i.e., if "activated" appears in text, "activated" frequency would be 1).
2) Count the number of times each word wildcard appears in a text (i.e., if "activated" and "activation" appear in text, "activat*" frequency would be 2).
I'm able to achieve (1), but not (2). Can anyone please help? Thanks.
library(tm)
library(qdap)
text <- "activation has begun. system activated"
text <- Corpus(VectorSource(text))
words <- c("activation", "activated", "activat*")
# Using termco to search for the words in the text
apply_as_df(text, termco, match.list=words)
# Result:
# docs word.count activation activated activat*
# 1 doc 1 5 1(20.00%) 1(20.00%) 0
Is it possible that this might have something to do with the versions? I ran the exact same code (see below) and got the result you expected:
> text <- "activation has begunm system activated"
> text <- Corpus(VectorSource(text))
> words <- c("activation", "activated", "activat")
> apply_as_df(text, termco, match.list=words)
docs word.count activation activated activat
1 doc 1 5 1(20.00%) 1(20.00%) 2(40.00%)
Below is the output when I run R.Version(). I am running this in RStudio version 0.99.491 on Windows 10.
> R.Version()
$platform
[1] "x86_64-w64-mingw32"
$arch
[1] "x86_64"
$os
[1] "mingw32"
$system
[1] "x86_64, mingw32"
$status
[1] ""
$major
[1] "3"
$minor
[1] "2.3"
$year
[1] "2015"
$month
[1] "12"
$day
[1] "10"
$`svn rev`
[1] "69752"
$language
[1] "R"
$version.string
[1] "R version 3.2.3 (2015-12-10)"
$nickname
[1] "Wooden Christmas-Tree"
Hope this helps
Maybe consider a different approach using the stringi library?
text <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
library(stringi)
counts <- unlist(lapply(words,function(word)
{
newWord <- stri_replace_all_fixed(word,"*", "\\p{L}")
stri_count_regex(text, newWord)
}))
ratios <- counts/stri_count_words(text)
names(ratios) <- words
ratios
Result is:
activation activated activat*
0.2 0.2 0.4
In the code I convert * into \p{L}, which means any single letter in a regex pattern; because the pattern is matched anywhere inside the text, "activat\p{L}" is found in both "activation" and "activated". After that I count the regex occurrences.
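The same idea also works in base R by turning the wildcard into a regular expression. A small sketch (the \b word-boundary anchors are my own assumption about what should count as an occurrence):
text  <- "activation has begun. system activated"
words <- c("activation", "activated", "activat*")
count_word <- function(word, text) {
  pat <- gsub("*", "[[:alpha:]]*", word, fixed = TRUE)  # wildcard -> zero or more letters
  m <- gregexpr(paste0("\\b", pat, "\\b"), text)[[1]]   # gregexpr returns -1 when nothing matches
  sum(m > 0)
}
sapply(words, count_word, text = text)
# activation  activated   activat*
#          1          1          2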

R point in polygon speed slow

The function gIntersects() in the rgeos package, used to test whether points are located in polygons, is pretty slow. Is there a good way to speed up the computation?
I tried a speed comparison between sp::point.in.polygon, sp::over and rgeos::gIntersects(), which were already mentioned in the comments. Note that there is also a point.in.poly function in {spatialEco}, but it seems to be just a wrapper around sp::over.
I realized that sp::point.in.polygon doesn't handle multi-part polygons well (as also pointed out here) and needs to be fed raw coordinates (so I presume that for multi-part polygons it needs to be used in a loop). Note that sp::point.in.polygon was faster than all the others only in the case of a square polygon, which makes me think it is faster only for simpler shapes. All in all, whenever you hit speed issues, just test for your specific case. For my choice of examples, sp::over seems the better option overall, but I would not generalize. I hope my examples are OK; otherwise feel free to correct me.
Since there is no data provided, I used some examples below.
Testing with world map data
Prepare data & functions
library(rgeos)
library(sp)
library(microbenchmark)
library(ggplot2)
library(maps)
library(maptools)
library(raster)
# Get world map data
# (conversion code from "Applied Spatial Data Analysis with R")
worldmap <- maps::map("world", fill=TRUE, plot=FALSE)
# transform to SpatialPolygons
worldmapPolys <- maptools::map2SpatialPolygons(worldmap,
                                               IDs = sapply(strsplit(worldmap$names, ":"), "[", 1L),
                                               proj4string = CRS("+proj=longlat +datum=WGS84"))
# Generate random points for entire world
set.seed(2017)
pts <- sp::spsample(worldmapPolys, n=10^5, type="random")
# Define functions to test for speed
gIntersects_tst <- function(my.pts, my.poly){
  rgeos::gIntersects(spgeom1 = my.pts,
                     spgeom2 = my.poly,
                     byid = TRUE)
}

over_tst <- function(my.pts, my.poly){
  sp::over(x = my.pts, y = my.poly)
}

point.in.polygon_tst <- function(my.pts, my.poly){
  # get coordinates from polygon
  XY <- raster::geom(my.poly)
  sp::point.in.polygon(point.x = my.pts@coords[,1],
                       point.y = my.pts@coords[,2],
                       pol.x = XY[,5],
                       pol.y = XY[,6],
                       mode.checked = TRUE)
}
Testing for single-part polygon
# Micro-benchmarking
# The idea is to test which points fall into a selected polygon (country)
res <- microbenchmark(TF1 <- gIntersects_tst(pts, worldmapPolys[183,]),
                      TF2 <- gIntersects_tst(worldmapPolys[183,], pts),
                      idx <- over_tst(pts, worldmapPolys[183,]),
                      codes <- point.in.polygon_tst(pts, worldmapPolys[183,]))
print(res)
## Unit: milliseconds
##                                                      expr       min        lq      mean    median        uq      max neval cld
##         TF1 <- gIntersects_tst(pts, worldmapPolys[183, ]) 142.61992 153.46915 174.42346 162.90885 177.69223 338.2691   100   b
##         TF2 <- gIntersects_tst(worldmapPolys[183, ], pts) 125.99551 136.13762 158.88218 144.89180 156.91664 352.3276   100   b
##                idx <- over_tst(pts, worldmapPolys[183, ])  50.72425  55.50899  69.67542  63.80366  78.12026 132.8704   100   a
##  codes <- point.in.polygon_tst(pts, worldmapPolys[183, ]) 224.57961 243.12288 276.71458 257.38068 275.46144 589.9082   100   c
ggplot2::autoplot(res) + ggtitle("single-polygon: 100 evaluations")
Note that for gIntersects(), order of arguments seems to matter. Differences are both in speed and structure of results.
identical(TF1,TF2)
## [1] FALSE
identical(TF1[,1:length(pts)], TF2[1:length(pts),])
## [1] TRUE
class(TF1); str(TF1)
## [1] "matrix"
## logi [1, 1:100000] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr "Romania"
## ..$ : chr [1:100000] "1" "2" "3" "4" ...
class(TF2); str(TF2)
## [1] "matrix"
## logi [1:100000, 1] FALSE FALSE FALSE FALSE FALSE FALSE ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:100000] "1" "2" "3" "4" ...
## ..$ : chr "Romania"
# Subset world points
pts.gI1 <- pts[TF1,]
pts.gI2 <- pts[TF2,]
pts.ovr <- pts[!is.na(idx),]
pts.PiP <- pts[as.logical(codes),]
# All subsets are identical
identical(pts.gI1, pts.gI2)
## [1] TRUE
identical(pts.gI2, pts.ovr)
## [1] TRUE
identical(pts.ovr, pts.PiP)
## [1] TRUE
Simpler shapes - test with two square polygons
# Generate two square polygons
grd <- sp::GridTopology(c(1,1), c(1,1), c(2,1))
polys <- sp::as.SpatialPolygons.GridTopology(grd)
# Generate some random points
set.seed(2017)
pts2 <- sp::spsample(polys, n=10^5, type="random")
# Micro-benchmarking
# Test only for those points falling in first square
res <- microbenchmark(TF1 <- gIntersects_tst(pts2, polys[1,]),
                      TF2 <- gIntersects_tst(polys[1,], pts2),
                      idx <- over_tst(pts2, polys[1,]),
                      codes <- point.in.polygon_tst(pts2, polys[1,]))
print(res)
## Unit: milliseconds
##                                             expr       min        lq      mean    median        uq       max neval cld
##         TF1 <- gIntersects_tst(pts2, polys[1, ]) 151.35336 165.23526 189.67848 177.62808 190.89566 365.92728   100   d
##         TF2 <- gIntersects_tst(polys[1, ], pts2) 123.26241 135.90883 157.47151 148.50073 160.37567 314.02700   100   c
##                idx <- over_tst(pts2, polys[1, ])  54.84891  63.89454  76.42608  70.66998  79.81225 240.55570   100   b
##  codes <- point.in.polygon_tst(pts2, polys[1, ])   9.39330  10.66513  14.09199  11.37738  16.88741  46.19245   100   a
ggplot2::autoplot(res) + ggtitle("square polygon: 100 evaluations")
pts2.gI1 <- pts2[TF1,]
pts2.gI2 <- pts2[TF2,]
pts2.ovr <- pts2[!is.na(idx),]
pts2.PiP <- pts2[as.logical(codes),]
# All subsets are identical
identical(pts2.gI1, pts2.gI2)
## [1] TRUE
identical(pts2.gI2, pts2.ovr)
## [1] TRUE
identical(pts2.ovr, pts2.PiP)
## [1] TRUE
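Side note: if switching packages is an option, the newer sf package provides st_intersects(), which would be worth adding to a benchmark like this. A minimal sketch reusing the objects from above (not timed here, so no speed claim is made):
library(sf)
polys_sf <- st_as_sf(worldmapPolys[183, ])   # convert the sp objects to sf
pts_sf   <- st_as_sf(pts)
hits   <- st_intersects(pts_sf, polys_sf)    # sparse list: polygons hit by each point
pts.sf <- pts_sf[lengths(hits) > 0, ]        # points falling inside the selected country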
Session Info
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] raster_2.5-8 mapview_1.2.0 leaflet_1.1.0 maptools_0.9-2 maps_3.1.1
[6] ggplot2_2.2.1 microbenchmark_1.4-2.1 sp_1.2-4 rgeos_0.3-23

Data loss during read.csv in R

I have a .csv file to import into R that has more than 1,000 observations. However, when I use read.csv as usual, the imported data frame has only 21 observations. This is strange; I've never seen this before.
t <- read.csv("E:\\AH1_09182014.CSV",header=T, colClasses=c(rep("character",3),rep("numeric",22)),na.string=c("null","NaN",""),stringsAsFactors=FALSE)
Can anyone help me figure out the problem? I am giving a link to my data file:
https://drive.google.com/file/d/0B86_a8ltyoL3TzBza0x1VTd2OTQ/edit?usp=sharing
You have some messy characters in your data--things like embedded control characters.
A workaround is to read the file in binary mode and then pass the lines that were read in to read.csv via its text argument.
This answer proposes a basic function to do those steps.
The function looks like this:
sReadLines <- function(fnam) {
  f <- file(fnam, "rb")
  res <- readLines(f)
  close(f)
  res
}
You can use it as follows:
temp <- read.csv(text = sReadLines("~/Downloads/AH1_09182014.CSV"),
                 stringsAsFactors = FALSE)
Have all lines been read in?
dim(temp)
# [1] 1449 25
Where is that problem line?
unlist(temp[21, ], use.names = FALSE)
# [1] "A-H Log 1" "09/18/2014" "0:19:00" "7.866" "255" "0.009"
# [7] "525" "7" "4468" "76" "4576.76" "20"
# [13] "71" "19" "77" "1222" "33857" "-3382"
# [19] "26\032)" "18.30" "84.80" "991.43" "23713.90" "0.85"
# [25] "10.54"
^^ see item [19] above.
Because of this, you won't be able to specify all of your column types up front--unless you clean the CSV first.
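If you do want the typed read from the original call, another option is to strip the stray control characters first and coerce afterwards. A sketch along those lines (the clean-up patterns are assumptions about what the junk looks like, so inspect a few rows before trusting it):
txt <- sReadLines("~/Downloads/AH1_09182014.CSV")
txt <- gsub("[[:cntrl:]]", "", txt)   # drop embedded control characters such as "\032"
temp <- read.csv(text = txt, stringsAsFactors = FALSE,
                 na.strings = c("null", "NaN", ""))
# columns 4:25 were meant to be numeric; remove leftover stray characters and coerce
temp[4:25] <- lapply(temp[4:25], function(x) as.numeric(gsub("[^0-9.eE+-]", "", x)))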
