I am trying to apply an IDW (inverse distance weighting) interpolation to different groups in a dataset. I am using dplyr to apply this function to each group, but I am making a mistake in the split-apply-combine step. The function returns 10 values for each group of 10 observations, but dplyr tries to insert all 10 return values into each mutated cell, rather than one new value per mutated cell.
The problem is likely function-agnostic, but unfortunately I could not find a simpler function that showcases the same error.
I get an error message that the data frame is corrupt, and the new column is filled with list values:
group N Lat Long Obs idw_val
1 A 1 49.43952 20.42646 11 <dbl[10]>
2 B 1 49.76982 19.70493 8 <dbl[10]>
The example hopefully clarifies this. The solution is probably very simple - some pointers would be much appreciated.
require(ggmap)
require(dplyr)
require(raster)
require(sp)
require(gstat)
require(lattice)
####create dataset
set.seed(123)
dh = expand.grid(group = c("A","B","C"),
                 N = 1:10)
dh$Lat=rnorm(nrow(dh),50,1)
dh$Long=rnorm(nrow(dh),20,1)
dh$Obs=rpois(nrow(dh),10)
dh
#####create grid
pixels <- 10
#####function definition
idw_w = function(x, y, z){
  geog2 <- data.frame(x, y, z)
  coordinates(geog2) = ~x+y
  geog.grd <- expand.grid(x = seq(floor(min(coordinates(geog2)[,1])),
                                  ceiling(max(coordinates(geog2)[,1])),
                                  length.out = pixels),
                          y = seq(floor(min(coordinates(geog2)[,2])),
                                  ceiling(max(coordinates(geog2)[,2])),
                                  length.out = pixels))
  # Assigning coordinates turns geog2 into a SpatialPointsDataFrame.
  grd.pts <- SpatialPixels(SpatialPoints(geog.grd))
  grd <- as(grd.pts, "SpatialGrid")
  ##### IDW interpolation.
  geog2.idw <- idw(z ~ 1, geog2, grd, idp = 4)
  #### overlay interpolated values back onto the original point locations
  pts <- SpatialPoints(cbind(x, y))
  over(pts, geog2.idw["var1.pred"])
}
#### test function
idw_w(dh$Lat,dh$Long,dh$Obs)
####groupwise dplyr
dh2 = dh %>%
# arrange(Block, Species, Date) %>%
group_by(group) %>%
mutate(idw_val=idw_w(x=Lat,y=Long,z=Obs))
dh2
str(dh2)
If I understand what you want correctly, it's just a matter of making sure your function returns a vector of values rather than a data.frame object. I think this function will do what you want when run through the mutate() step:
idw_w = function(x, y, z){
  geog2 <- data.frame(x, y, z)
  coordinates(geog2) = ~x+y
  geog.grd <- expand.grid(x = seq(floor(min(coordinates(geog2)[,1])),
                                  ceiling(max(coordinates(geog2)[,1])),
                                  length.out = pixels),
                          y = seq(floor(min(coordinates(geog2)[,2])),
                                  ceiling(max(coordinates(geog2)[,2])),
                                  length.out = pixels))
  # Assigning coordinates turns geog2 into a SpatialPointsDataFrame.
  grd.pts <- SpatialPixels(SpatialPoints(geog.grd))
  grd <- as(grd.pts, "SpatialGrid")
  ##### IDW interpolation.
  geog2.idw <- idw(z ~ 1, geog2, grd, idp = 4)
  #### overlay; return just the predicted values as a numeric vector
  pts <- SpatialPoints(cbind(x, y))
  (over(pts, geog2.idw["var1.pred"]))[,1]
}
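With the function now returning a plain numeric vector, the grouped mutate() from the question should fill one value per row; a minimal usage sketch reusing the dh object defined above:
dh2 <- dh %>%
  group_by(group) %>%
  mutate(idw_val = idw_w(x = Lat, y = Long, z = Obs)) %>%
  ungroup()
str(dh2) # idw_val is now an ordinary numeric column, one value per observation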
I would like to find the most memory- and time-efficient way to calculate Euclidean distances on a large matrix. I've run the small benchmark below comparing a few packages I know: parallelDist, geodist, fields and stats. I've also considered this customized function that combines Rcpp and bigmemory. Here are the results I've found (reprex below), but I'd like to know whether there are other efficient packages / solutions for this task:
Results
benchmrk
#>    package   time        alloc
#> 1: parDist  0.298 5.369186e-04
#> 2:  fields  1.079 9.486198e-03
#> 3:    rcpp 54.422 2.161113e+00
#> 4:   stats  0.770 5.788603e+01
#> 5: geodist  2.513 1.157635e+02
# plot
ggplot(benchmrk, aes(x=alloc , y=time, color= package, label=package)) +
geom_label(alpha=.5) +
coord_trans(x="log10", y="log10") +
theme(legend.position = "none")
Reprex
library(parallelDist)
library(geodist)
library(fields)
library(stats)
library(bigmemory)
library(Rcpp)
library(lineprof)
library(geobr)
library(sf)
library(ggplot2)
library(data.table)
# data input
df <- geobr::read_weighting_area()
gc(reset = T)
# convert projection to UTM
df <- st_transform(df, crs = 3857)
# get spatial coordinates
coords <- suppressWarnings(st_coordinates( st_centroid(df) ))
# prepare customized rcpp function
sourceCpp("euc_dist.cpp")
bigMatrixEuc <- function(bigMat){
  zeros <- big.matrix(nrow = nrow(bigMat) - 1,
                      ncol = nrow(bigMat) - 1,
                      init = 0,
                      type = typeof(bigMat))
  BigArmaEuc(bigMat@address, zeros@address)
  return(zeros)
}
### Start tests
perf_fields <- lineprof(dist_fields <- fields::rdist(coords) )
perf_geodist <- lineprof(dist_geodist <- geodist::geodist(coords, measure = "cheap") )
perf_stats <- lineprof(dist_stats <- stats::dist(coords) )
perf_parDist <- lineprof(dist_parDist <- parallelDist::parDist(coords, method = "euclidean") )
perf_rcpp <- lineprof(dist_rcpp <- bigMatrixEuc( as.big.matrix(coords) ) )
perf_fields$package <- 'fields'
perf_geodist$package <- 'geodist'
perf_stats$package <- 'stats'
perf_parDist$package <- 'parDist'
perf_rcpp$package <- 'rcpp'
# gather results
benchmrk <- rbind(perf_fields, perf_geodist, perf_stats , perf_parDist, perf_rcpp)
benchmrk <- setDT(benchmrk)[, .(time =sum(time), alloc = sum(alloc)), by=package][order(alloc)]
benchmrk
Here, I try to propose an answer 'theoretically'.
I think a combination of the Rcpp approach (here) and parDist (here) might allow working on very large data sets while keeping execution times at an acceptable level.
Unfortunately, I have not worked with Rcpp, RcppParallel or RcppArmadillo yet. But it seems parDist and the Rcpp/big.matrix approach build on the same 'infrastructure'.
Maybe some more experienced users will take up the challenge.
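As a very rough starting point (untested, and assuming I read the parallelDist documentation correctly regarding its user-defined distance interface via RcppXPtrUtils), the combination might look something like this, reusing the coords matrix from the question; euclidPtr and dist_custom are just placeholder names:
library(parallelDist)
library(RcppXPtrUtils)
# compile the distance kernel once as an external pointer...
euclidPtr <- cppXPtr(
  "double customDist(const arma::mat &A, const arma::mat &B) {
     return std::sqrt(arma::accu(arma::square(A - B)));
   }",
  depends = c("RcppArmadillo"))
# ...and let parDist() apply it in parallel to all row pairs of coords
dist_custom <- parDist(coords, method = "custom", func = euclidPtr)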
I'm currently having an issue where I'm trying to nest simulated data for an efficient frontier inside a tibble containing all 250 simulations. The tibble has one column named "sim", which indicates the number of the simulation, i.e. the rows in this column run from 1 to 250. The other column should contain the nested simulation data, which is a 3x123 tibble for each simulation. With help from a nice soul here, I've managed to create this tibble containing the efficient frontiers. Now I need to loop through this tibble and plot all 250 efficient frontiers in one plot.
I've tried to replicate the problem so that you don't need all of the previous code and data to see the issue. In this simple and reproducible example I have a table which is a 5x2 tibble, where the column 'sim' lists the simulations (1:5) and 'obs' holds an individual 5x3 tibble with some coordinates:
library(tidyverse)
library(ggplot2)
counter = 0
table <- tibble(sim = 1:5, obs = NA)
for(i in (1:5)){
  counter = counter + 1
  tibble <- tibble(a = NA, b = 1:5, x = c(counter + 1), y = c(counter*2 - 1))
  tibble$a <- counter
  nested_tibble <- tibble %>% nest(data = -a) %>% select(-a)
  table$obs[i] <- nested_tibble[[1]]
}
for (i in (1:5)){
  print(ggplot() +
          geom_point(data = (table %>% filter(sim == i) %>% .$obs)[[1]],
                     aes(x = x, y = y),
                     color = "red",
                     size = 4))
}
As mentioned, I want it to plot all 5 coordinates in one plot, so that I can replicate this to plot 250 efficient frontiers. However, when I run the code it only returns the last coordinate.
I hope my formulation makes sense. If you need any additional documentation please let me know.
I am not sure, but this should do the job. I think using lists is a much better way to store nested structures. So, the code below returns a list called table_out.
Please have a look and see if this is what you want.
library(tibble)
library(data.table)
library(ggplot2)
N_sim <- 5
table_out <- vector("list", 5)
for ( i in seq_len(N_sim) ) {
current_table <- tibble(a = i, b = 1L:N_sim, x = i + 1, y = i*2 - 1)
table_out[[ i ]] <- current_table
}
# this creates a data.table (like a data.frame) from a list
final <- rbindlist( table_out )
ggplot(final, aes(x, y)) +
geom_point(color = "red", size = 4)
Created on 2021-03-03 by the reprex package (v1.0.0)
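If you would rather keep the nested tibble built in the question, an alternative sketch (assuming the list-column is called obs, as in the question) is to unnest it and hand ggplot one long data frame:
library(tidyr)
table %>%
  unnest(obs) %>% # one row per coordinate, across all simulations
  ggplot(aes(x = x, y = y, color = factor(sim))) +
  geom_point(size = 4)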
I have a simple issue with a for loop in R - I am trying to make it run over the entire dataset but it only runs for the last row. This involves quite complex datasets, both shapefiles, where I am testing the intersection of the geometries, which is why I can't quite make a reproducible example here.
Nevertheless, this is my code:
for(i in 1:nrow(data1)){
  #get intersections between data2 and data1 for specific years
  output = st_join(
    x = data1[i, ],
    y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
    join = st_intersects
  )
  #Get area of intersections
  output = transform(output,
                     inter_area = mapply(function(x, y) {
                       as.numeric(sf::st_area(
                         sf::st_intersection(x, y)
                       ))}, x = geometry, y = geom_2))
  ## obtaining the proportion of area in data1 intersected by data2
  output = transform(output, prop_inter = inter_area/area)
  #get cycle-specific values
  output <- output %>%
    group_by(code, year.x) %>%
    dplyr::summarise(prop_inter = sum(prop_inter),
                     end_date = max(end_date),
                     start_date = max(start_date)) %>%
    ungroup()
  return(output)
}
As you can see, I am testing the intersections of data2 with data1 to see what percentage of data1 is intersected, depending on the values they take on year and lag.year. The issue is that when I run this it only returns the desired outcome for the last row, instead of for the entire data1 object. I've tested all the different bits of code inside the loop separately and they all do what I want, but once I try to scale it up to the entire dataframe it only does it for the last row.
So I assume this must be some simple mistake I am making in the loop.
Thanks!
You keep overwriting the output object; you may want to create a container (such as a list) of length nrow(data1) and assign each result to its i-th element. I don't think this relates to {sf} or GIS in general, it is more about how for loops and vectors work in R - consider this example:
for (i in 1:50) {
output <- i # rewriting output object 50 times
}
print(output) # this will be a single element for last row (50)
output <- numeric(50)
for (i in 1:50) {
output[i] <- i # storing result in a new element of output for each i
}
print(output) # this will be 1:50 as expected
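Since each iteration in your loop produces a whole (summarised) data frame rather than a single number, a list is the natural container; the same pattern then looks like this (a small illustrative sketch, not your data):
output <- vector("list", 50)
for (i in 1:50) {
  output[[i]] <- data.frame(iter = i, value = i^2) # one data frame per iteration
}
str(output[1:3]) # each element holds its own data frame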
You may want to consider something along these lines (hard to make certain without access to your data, but it should get you started).
result <- vector("list", nrow(data1)) # init a list, one slot per row of data1
for(i in 1:nrow(data1)){
  #get intersections between data2 and data1 for specific years
  output = st_join(
    x = data1[i, ],
    y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
    join = st_intersects
  )
  #Get area of intersections
  output = transform(output,
                     inter_area = mapply(function(x, y) {
                       as.numeric(sf::st_area(
                         sf::st_intersection(x, y)
                       ))}, x = geometry, y = geom_2))
  ## obtaining the proportion of area in data1 intersected by data2
  output = transform(output, prop_inter = inter_area/area)
  #get cycle-specific values
  result[[i]] <- output %>% # store in the i-th element of result instead
    group_by(code, year.x) %>%
    dplyr::summarise(prop_inter = sum(prop_inter),
                     end_date = max(end_date),
                     start_date = max(start_date)) %>%
    ungroup()
  # return(output) # no need for return unless you are in a function
}
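Once the loop has finished, the per-row summaries can be combined into a single object (a sketch, assuming each result[[i]] holds the summarised sf data frame produced above):
final <- do.call(rbind, result) # rbind() preserves the sf class of the pieces
final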
Currently I have two data.frames, one of polygons (poly.x, poly.y, enum) and one of points (pt.x, pt.y) where enum is the id of the polygon. I am trying to determine which points belong to which polygons so I get a data.frame of (pt.x, pt.y, enum).
My first attempt uses point.in.polygon from the sp package and lapply functions to find which polygon(s) the point belongs to. While my code works, it takes a long time on large data sets.
My second attempt uses over also from the sp package, cobbled together from questions on gis stackexchange. While it is much faster, I cannot seem to get the correct output from over as it is a dataframe of 1s and NAs.
Below I've included a simplified working example (npoly can be changed to test the speed of different methods) as well as my working attempt using sp::point.in.polygon and nonsensical output from my sp::over attempt. I'm not fussed which method I end up using as long as it's fast.
Any help would be much appreciated!
#-------------------------------------------
# Libraries
library(ggplot2) # sample plots
library(dplyr) # bind_rows(), etc
library(sp) # spatial data
# Sample data
npoly = 100
# polygons
localpolydf <- data.frame(
x = rep(c(0, 1, 1, 0), npoly) + rep(0:(npoly-1), each = 4),
y = rep(c(0, 0, 1, 1), npoly),
enum = rep(1:npoly, each = 4))
# points
offsetdf <- data.frame(
x = seq(min(localpolydf$x) - 0.5, max(localpolydf$x) + 0.5, by = 0.5),
y = runif(npoly*2 + 3, 0, 1))
# Sample plot
ggplot() +
geom_polygon(aes(x, y, group = enum),
localpolydf, fill = NA, colour = "black") +
geom_point(aes(x, y), offsetdf)
#-------------------------------------------
# Dplyr and lapply solution for point.in.polygon
ptm <- proc.time() # Start timer
# create lists
offsetlist <- split(offsetdf, rownames(offsetdf))
polygonlist <- split(localpolydf, localpolydf$enum)
# lapply over each pt in offsetlist
pts <- lapply(offsetlist, function(pt) {
  # lapply over each polygon in polygonlist
  ptpoly <- lapply(polygonlist, function(poly) {
    data.frame(
      enum = poly$enum[1],
      ptin = point.in.polygon(pt[1,1], pt[1,2], poly$x, poly$y))
  })
  ptpoly <- bind_rows(ptpoly) %>% filter(ptin != 0)
  if (nrow(ptpoly) == 0) return(data.frame(x = pt$x, y = pt$y, enum = NA, ptin = NA))
  ptpoly$x = pt$x
  ptpoly$y = pt$y
  return(ptpoly[c("x", "y", "enum", "ptin")])
})
pts_apply <- bind_rows(pts)
proc.time() - ptm # end timer
#-------------------------------------------
# Attempted sp solution for over
ptm <- proc.time() # Start timer
# Split the dataframe into a list based on enum and then remove enum from df in the list
polygonlist <- split(localpolydf, localpolydf$enum)
polygonlist <- lapply(polygonlist, function(x) x[,c("x", "y")])
# Convert the list to Polygon, then create a Polygons object
polygonsp <- sapply(polygonlist, Polygon)
polygonsp <- Polygons(polygonsp, ID = 1)
polygonsp <- SpatialPolygons(list(polygonsp))
plot(polygonsp)
# Convert points to coordinates
offsetps <- offsetdf
coordinates(offsetps) <- ~x+y
points(offsetps$x, offsetps$y)
# Determine polygons points are in
pts_sp <- over(offsetps, polygonsp)
proc.time() - ptm # end timer
#===========================================
# Output
# Apply: point.in.polygon
> head(pts_apply)
x y enum ptin
1 -0.5 0.2218138 NA NA
2 4.0 0.9785541 4 2
3 4.0 0.9785541 5 2
4 49.0 0.3971479 49 2
5 49.0 0.3971479 50 2
6 49.5 0.1177206 50 1
user system elapsed
4.434 0.002 4.435
# SP: over
> head(pts_sp)
1 2 3 4 5 6
NA 1 1 NA 1 NA
user system elapsed
0.048 0.000 0.047
An alternative to using over is sf::st_intersection, as the sf package is becoming more and more popular.
Getting the data into sf objects took me a little bit of work, but if you are working with external data you can just read it in with st_read and it will already be in the correct form.
Here is how to approach it:
library(tidyverse)
library(sf)
# convert into st_polygon friendly format (all polygons must be closed)
# must be a nicer way to do this!
localpoly <- localpolydf %>% split(localpolydf$enum) %>%
lapply(function(x) rbind(x,x[1,])) %>%
lapply(function(x) x[,1:2]) %>%
lapply(function(x) list(as.matrix(x))) %>%
lapply(function(x) st_polygon(x))
# convert points into sf object
points <- st_as_sf(offsetdf,coords=c('x','y'),remove = F)
#convert polygons to sf object and add id column
polys <- localpoly %>% st_sfc() %>% st_sf(geom=.) %>%
mutate(id=factor(1:100))
#find intersection
joined <- polys %>% st_intersection(points)
# Sample plot
ggplot() + geom_sf(data=polys) +
geom_sf(data=joined %>% filter(id %in% c(1:10)),aes(col=id)) +
lims(x=c(0,10))
Note that, at the time of writing, using geom_sf required installing the development version of ggplot2.
plot output:
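A related option within sf is st_join(), which keeps every point and attaches the id of the polygon it falls in (NA if it falls in none); a minimal sketch reusing the points and polys objects built above (joined_pts is just a placeholder name):
joined_pts <- st_join(points, polys, join = st_within)
head(joined_pts)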
over returns an index of points inside a geometry. Perhaps something like this:
xy <- offsetps[names(na.omit(pts_sp == 1)), ]
plot(polygonsp, axes = 1, xlim = c(0, 10))
points(offsetps)
points(xy, col = "red")
After having another look, I realised Roman did pts_sp == 1 because I only had one ID for all of my squares, i.e. when I did ID = 1.
Once I fixed that, I was able to add a column with ID = enum. To handle points in multiple polygons I can use returnList = TRUE and add additional lines to convert the list to a data.frame, but it isn't necessary here.
# Attempted sp solution
ptm <- proc.time() # Start timer
# Split the dataframe into a list based on enum and then remove enum from df in the list
polygonlist <- split(localpolydf, localpolydf$enum)
# Convert the list to Polygon, then create a Polygons object
polygonsp <- sapply(polygonlist, function(poly){
Polygons(list(Polygon(poly[, c("x", "y")])), ID = poly[1, "enum"])
})
# polygonsp <- Polygons(polygonsp, ID = 1)
polygonsp <- SpatialPolygons(polygonsp)
plot(polygonsp)
# Convert points to coordinates
offsetps <- offsetdf
coordinates(offsetps) <- ~x+y
points(offsetps$x, offsetps$y)
# Determine polygons points are in
pts_sp <- over(offsetps, polygonsp)
pts_sp <- data.frame(
x = offsetps$x, y = offsetps$y,
enum = unique(localpolydf$enum)[pts_sp])
proc.time() - ptm # end timer
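For the points-in-multiple-polygons case mentioned above, a minimal sketch of the returnList = TRUE variant (each list element holds the indices of all polygons containing that point; pts_list and pts_long are placeholder names):
pts_list <- over(offsetps, polygonsp, returnList = TRUE)
# flatten to a long data.frame of point / polygon pairs
# (points falling in no polygon simply drop out here)
pts_long <- data.frame(
  x = rep(coordinates(offsetps)[, 1], lengths(pts_list)),
  y = rep(coordinates(offsetps)[, 2], lengths(pts_list)),
  enum = unique(localpolydf$enum)[unlist(pts_list)])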
I have a gridded climate dataset, such as:
# generate time vector
time1 <- seq(14847.5,14974.5, by = 1)
time2 <- seq(14947.5,14974.5, by = 1)
time <- c(time1,time2)
time <- as.POSIXct(time*86400,origin='1970-01-01 00:00')
# generate lat and lon coordinates
lat <- seq(80,90, by = 1)
lon <- seq(20,30, by = 1)
# generate 3dimensional array
dat <- array(runif(length(lat)*length(lon)*length(time)),
dim = c(length(lon),length(lat),length(time)))
such that
> dim(dat)
[1] 11 11 156
the dimensions of the data are describing the variable at different longitude (dim = 1), latitude (dim = 2), and time (dim = 3).
The issue I have at the moment is that some of the times are repeated, something to do with overlapping sensors measuring the data. Therefore, I was wondering if it is possible to keep only the unique times in dat, but average the data within the grid for the duplicated times, i.e. if there are two repeated days we take the average value in each latitude and longitude grid cell for that time.
I can find the unique times as:
# only select unique times
new_time <- unique(time)
unique_time <- unique(time)
The following code then aims to loop through each grid (lat/lon) and average all of the duplicated days.
# loop through lat/lon coordinates to generate new data
new_dat <- array(dim = c(length(lon),length(lat),length(new_time)))
for(i in 1:length(lon)){
  for(ii in 1:length(lat)){
    dat2 <- dat[i,ii,]
    dat2b <- NA
    for(k in 1:length(unique_time)){
      idx <- time == unique_time[k]
      dat2b[k] <- mean(dat2[idx], na.rm = TRUE)
    }
    new_dat[i,ii,] <- dat2b
  }
}
I'm convinced that this provides the correct answer, but I'm certain there is a much cleaner method to achieve this.
I should also note that my data is quite large (i.e. k = 7000), so this last loop is not very efficient, to say the least.
My original answer:
This is a bit more concise and efficient by use of aggregate:
for(i in 1:length(lon)){
  for(ii in 1:length(lat)){
    new_dat[i,ii,] <- as.numeric(aggregate(dat[i,ii,], by = list(time), mean)$x)
  }
}
It still has 2 of the 3 loops, but it manages to bypass creating dat2, dat2b, and unique_time.
My improved answer:
f <- function(i, ii){ as.numeric(aggregate(dat[i,ii,], by = list(time), mean)$x) }
grid <- expand.grid(1:length(lon), 1:length(lat)) # build the index grid once
for(i in 1:nrow(grid)){
  new_dat[grid[i, 1], grid[i, 2], ] <- f(grid[i, 1], grid[i, 2])
}
Got it down to just 1 loop. We could probably bypass that loop too with an apply.
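For completeness, a possible loop-free sketch along those lines (untested at full size, and assuming dat, time, and new_time as defined above; new_dat3 is a placeholder name):
new_dat3 <- vapply(
  seq_along(new_time),
  # average all lon x lat slices that share the k-th unique time stamp
  function(k) apply(dat[, , time == new_time[k], drop = FALSE], c(1, 2), mean, na.rm = TRUE),
  FUN.VALUE = matrix(0, length(lon), length(lat)))
dim(new_dat3) # lon x lat x unique times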