For species niche modeling I am trying to fetch building heights from the brilliant 3D BAG data of TU Delft (https://3dbag.nl/nl/download). I want to do this for the city of Haarlem. It is possible to select and download tiles manually, but this is quite labor-intensive and prone to errors (a missing tile), and I want to repeat this action for more cities. So I am trying to use the WFS service to download the features. I created a bounding box of Haarlem, expanded by a factor of 1.2, for the WFS request. However, the maximum number of records the server delivers is 5000, and despite many alternative attempts I have failed so far to get past that limit. This is partly caused by my confusion about the WFS semantics: when I check with GetCapabilities it is hard to find out the namespace, feature types and individual attributes (or properties). What I have tried:
Add pagination. But all the tutorials I have read so far need the total number of available features in addition to the server maximum (resultType = "hits"), and I was not able to easily retrieve this total for the limits of the bounding box (see the sketch after this list).
Select tiles. I figured it should be possible to extract the tile ids that intersect the bounding box using tile_id, an attribute of the BAG3D_v2:bag_tiles_3k layer, and then build an apply or loop to extract the features per tile. But I already failed to create a cql_filter that selects an individual tile.
Create tiles. Since I am not entirely sure whether the individual tiles from the 3D BAG service already exceed the 5000-feature limit, an alternative approach could be to split the bounding box into many small tiles using the R package slippymath and then extract the features per tile. But the filtering challenge remains the same.
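For the pagination route, here is a sketch of the kind of request I have in mind for retrieving that total: a GetFeature request with resultType = "hits" should return an (almost empty) FeatureCollection whose numberMatched attribute holds the number of features within the bounding box. I have not verified that this server honours resultType = "hits".
library(httr)
library(xml2)

# Sketch: ask the server how many features fall inside the bounding box.
url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
                  version = "2.0.0",
                  request = "GetFeature",
                  typename = "BAG3D_v2:lod22",
                  bbox = "100768.4,482708.5,107923.1,494670.4",
                  resultType = "hits")
hits <- read_xml(build_url(url))
n_matched <- as.numeric(xml_attr(hits, "numberMatched"))
n_matched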
Any help with this would be appreciated. The basic code I used in many different ways:
library(httr)
library(sf)    # for st_read()
library(tmap)  # for qtm()

url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
url$query <- list(service = "WFS",
                  version = "2.0.0",
                  request = "GetFeature",
                  typename = "BAG3D_v2:lod22",
                  # cql_filter = "BAG3D_v2:tile_id = '4199'",
                  bbox = "100768.4,482708.5,107923.1,494670.4",
                  startindex = 10000,
                  sortBy = "gid")
request <- build_url(url)
test <- st_read(request)
qtm(test)
One solution is to loop over startindex in steps of 5000 and stop when the returned shape contains fewer than 5000 features, which means the download is complete (unless the total number of features is an exact multiple of 5000...).
Below is a piece of code adapted from the happign package.
library(httr)
library(sf)

# function for building the WFS request URL for a given start index
build_3DBAG_url <- function(startindex){
  url <- parse_url("https://data.3dbag.nl/api/BAG3D_v2/wfs")
  url$query <- list(service = "WFS",
                    version = "2.0.0",
                    request = "GetFeature",
                    typename = "BAG3D_v2:lod22",
                    # cql_filter = "BAG3D_v2:tile_id = '4199'",
                    bbox = "100768.4,482708.5,107923.1,494670.4",
                    startindex = startindex,
                    count = 5000,
                    sortBy = "gid")
  url <- build_url(url)
  return(url)
}

# initialize the first request
resp <- read_sf(build_3DBAG_url(startindex = 0))
message("Features downloaded: ", nrow(resp), appendLF = FALSE)

# loop until the returned shape has fewer than 5000 features
i <- 5000
temp <- resp
while (nrow(temp) == 5000) {
  message("...", appendLF = FALSE)
  temp <- read_sf(build_3DBAG_url(startindex = i))
  resp <- rbind(resp, temp)
  message(nrow(resp), appendLF = FALSE)
  i <- i + 5000
}
About my project: I am using the academic Twitter API and the package academictwitteR to first scrape all tweets of Amnesty International UK. This has worked fine.
The next step is to use the conversation ids of those ~30,000 tweets to get the entire threads behind them, which is where my problem lies.
This is the code I am running:
ai_t <- get_all_tweets(
  users = "AmnestyUK",
  start_tweets = "2008-01-01T00:00:00Z",
  end_tweets = "2022-11-14T00:00:00Z",
  bearer_token = BearerToken,
  n = Inf
)
conversations <- c()

for (i in list) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i))
  conversations <- c(conversations, x)
}
The problem I have is that this amounts to an abundance of individual queries, but the package only allows running one id at a time; passing in the list directly instead of using the for loop produces an error, which is why I am using a loop.
Apart from the rate-limit sleep timer, individual queries already take anywhere from ~3 seconds, when few tweets are retrieved, to considerably longer when there are, for example, 2000 tweets with that conversation_id. A rough calculation already puts this at multiple days of running this code, if I am not making a mistake.
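To make that rough calculation concrete (the per-query time is my own ballpark, not a measurement):
# Back-of-the-envelope runtime, assuming ~3 seconds per query on average
n_conversations <- 30000
secs_per_query  <- 3
n_conversations * secs_per_query / 3600   # ~25 hours of query time alone, before rate-limit sleeps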
The code itself seems to be working fine; I have tried it with a short sample of the conversation ids:
list2 <- list[c(1:3)]

for (i in list2) {
  x <- get_all_tweets(
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf,
    conversation_id = c(i))
  conversations <- c(conversations, x)
}
Does anybody have a solution for this, or is this already the most efficient way and it will just take forever?
I am unfortunately not experienced in Python at all, but if there is an easier way in that language I would also be interested.
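One possibility I am considering, but have not verified against the API's query-length limits: the v2 search query syntax allows several conversation_id: operators to be combined with OR, which would cut the number of requests considerably. A sketch, assuming the ids are in list as above and a guessed batch size of 20 per query:
library(academictwitteR)
library(dplyr)

# Sketch (unverified): OR several conversation_id: operators into one query
# string so each request covers many conversations at once.
ids <- unlist(list)                                  # 'list' holds the conversation ids
batches <- split(ids, ceiling(seq_along(ids) / 20))  # ~20 ids per query (assumption)

conversations <- list()
for (b in batches) {
  q <- paste0("conversation_id:", b, collapse = " OR ")
  x <- get_all_tweets(
    query = q,
    start_tweets = "2008-01-01T00:00:00Z",
    end_tweets   = "2022-11-14T00:00:00Z",
    bearer_token = BearerToken,
    n = Inf)
  conversations[[length(conversations) + 1]] <- x
}
conversations <- bind_rows(conversations)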
Cheers
I have several rasters, 343 to be more exact, from Cropscape. I need to get the locations (centroids) and area measurements of pixels that represent potatoes and tomatoes based on the associated values in the rasters. The pixel values are 43 and 54, respectively. Cropscape provides rasters separated by year and state, except for 2016, which has the lower 48 states combined. The rasters are saved as GeoTiffs on a Google Drive and I am using Google File Stream to connect to the rasters locally.
I want to create a SpatialPointsDataFrame from the centroids of each pixel or group of adjacent pixels for tomatoes and potatoes in all the rasters. Right now, my code will:
Subset the rasters to potatoes and tomatoes
Change the raster subsets to polygons, one for potatoes and one for tomatoes
Create centroids from each polygon
Create a SpatialPointsDataFrame based on the centroids
Extract the area measurement for each area of interest with SpatialPointsDataFrame
Write the raster subsets and each polygon to a file.
Code:
library(raster)
library(rgdal)
library(rgeos)

dat_dir2 = getwd()
mepg <- make_EPSG()
ae_pr <- mepg[mepg$code == "5070", "prj4"]

# Toy raster list for use with code
# I use `list.files()` with the directories that hold
# the rasters and use the list that is generated from
# that to read in the files to raster. My list is called
# "tiflist". Not used in the code, but mentioned later.
rmk1 <- function(x, ...) {
  r1 = raster(ncol = 1000, nrow = 1000)
  r1[] = sample(1:60, 1000000, replace = T)
  proj4string(r1) = CRS(ae_pr)
  return(r1)
}
rlis <- lapply(1:5, rmk1)

# Pixel values needed
ptto <- c(43, 54)

# My function to go through rasters for locations and area measurements.
# This code is somewhat edited to work with the demo raster list.
# It produces the same output as what I wanted, but with the demo list.
pottom <- function(x, ...) {
  # Next line is not necessary with the raster list created above.
  # temras = raster(x)
  now = format(Sys.time(), "%b%d%H%M%S")
  nwnm = paste0(names(x), now)
  rasmatx = match(x = x, table = ptto)
  writeRaster(rasmatx, file.path(dat_dir2, paste0(nwnm, "ras")), format = "GTiff")
  tempol = rasterToPolygons(rasmatx, fun = function(x) { x > 0 & x < 4 }, dissolve = T)
  tempol2 = disaggregate(tempol)
  # for potatoes
  tempol2p = tempol2[tempol2$layer == '1', ]
  if (nrow(tempol2p) > 0) {
    temcenp = gCentroid(tempol2p, byid = T)
    temcenpdf = SpatialPointsDataFrame(temcenp, data.frame(ID = 1:length(temcenp), temcenp))
    temcenpdf$pot_p = extract(rasmatx, temcenpdf)
    temcenpdf$areap_m = gArea(tempol2p, byid = T)
    # writeOGR(temcenpdf, dsn = file.path(dat_dir2), paste0(nwnm, "p"), driver = "ESRI Shapefile")
  }
  # for tomatoes
  tempol2t = tempol2[tempol2$layer == '2', ]
  if (nrow(tempol2t) > 0) {
    temcent = gCentroid(tempol2t, byid = T)
    temcentdf = SpatialPointsDataFrame(temcent, data.frame(ID = 1:length(temcent), temcent))
    temcentdf$tom_t = extract(rasmatx, temcentdf)
    temcentdf$areat_m = gArea(tempol2t, byid = T)
    writeOGR(temcentdf, dsn = file.path(dat_dir2), paste0(nwnm, "t"), driver = "ESRI Shapefile")
  }
}
lapply(rlis, pottom)
I know I should provide some toy data, and I created some, but I don't know whether they exactly recreate my problem, which follows.
Besides my wonky code, which seems to work, I have a bigger problem: a lot of memory is used when this code runs. The tiflist only gets through the first 4 files of the list, and by then the RAM, 16 GB on my laptop, is completely consumed. I'm pretty sure it's the connections to the Google Drive, since the cache for the drive stream is at least 8 GB. I guess each raster stays open after being connected to in the Google Drive? I don't know how to confirm that.
I think I need to get the function to clear out all of the objects that are created, e.g. temras, rasmatx, tempol, etc., after processing each raster, but I'm not sure how to do that. I did try adding rm(temras ...) to the end of the function, but when I did that there was no output at all from the function after 10 minutes, whereas by then I usually have the first 3 rasters processed.
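The kind of cleanup I have in mind is sketched below (not verified to solve the memory issue): a wrapper around each call that garbage-collects and purges the raster package's temporary files between rasters. process_one() is just an illustrative name; pottom() is the function defined above.
library(raster)

# Sketch of per-raster cleanup (untested against the memory problem).
process_one <- function(r) {
  out <- pottom(r)
  gc()                    # ask R to release memory after each raster
  removeTmpFiles(h = 0)   # delete the raster package's temp files immediately
  out
}

# lapply(rlis, process_one)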
27/Oct EDIT after comments from RobertHijmans: it seems that the states with large geographic extents are causing problems with rasterToPolygons(). I edited the code from the way it works for me locally so that it works with the demo data I included, since RobertHijmans pointed out it wasn't functional. So I hope this is now reproducible.
I feel silly answering my own question, but here it is: the rasterToPolygons() function is notoriously slow. I was unaware of this issue. In one of my attempts I waited 30 minutes before killing the process, with no result. It works under the conditions I require for the rasters for Alabama and Arkansas, for example, but not for California.
A submitted solution, which I am in the process of testing, comes from this GitHub repo. The test is ongoing at 12 minutes, so I don't know whether it works for an object as large as California. I don't want to copy and paste someone else's code in an answer to my own question.
One of the comments suggested using profvis, but I couldn't really figure out the output, and it hung with the process too.
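A different direction that might avoid polygonizing entirely (a sketch only, using the demo rasters above; it gives per-pixel centroids but not areas of groups of adjacent pixels): read the cell centres directly with xyFromCell().
library(raster)
library(sp)

# Sketch: per-pixel centroids without rasterToPolygons().
r <- rlis[[1]]
pot_cells <- which(values(r) == 43)   # potato pixels
tom_cells <- which(values(r) == 54)   # tomato pixels
pot_pts <- SpatialPoints(xyFromCell(r, pot_cells), proj4string = crs(r))
tom_pts <- SpatialPoints(xyFromCell(r, tom_cells), proj4string = crs(r))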
I would like to calculate the route distance from coordinate to coordinate with OSRM (although I'm open to other services).
For example, a row will have a "from" and a "to" coordinate, and instead of getting the straight-line point-to-point distance, I want to use routing to get a more accurate picture of the distance traveled.
I've tried every iteration of the script provided here, and have cut my data down to 25 rows.
https://www.rdocumentation.org/packages/osrm/versions/3.3.0/topics/osrmTable
# Set the working directory
setwd("C:/Users/...")

# Load libraries
library(dplyr)
library(osrm)
library(geosphere)

# Bring in the data
mydata <- read.csv("coordinates.csv", stringsAsFactors = FALSE)

# Check and eliminate the rows that don't have location information
mydata <- mydata[!is.na(mydata$fromlat), ]
mydata <- subset(mydata, fromlat != 0)
mydata <- mydata[!is.na(mydata$tolat), ]
mydata <- subset(mydata, tolat != 0)

# Create the data for the route
src <- mydata[c(7, 10, 9)]
dst <- mydata[c(7, 12, 11)]

# Travel time matrix with different sets of origins and destinations
route <- osrmTable(src = src, dst = dst, exclude = NULL,
                   gepaf = FALSE, measure = "distance")
Ideally, I would like a new column added to the data that contains the routed distance between the two coordinates.
I've figured it out for a point-to-point distance, but I am having difficulty doing it with routing.
I get this message after I run my script:
The OSRM server returned an error:
Error in function (type, msg, asError = TRUE) : Failed to connect to router.project-osrm.org port 80: Timed out
Update:
I've tried using gmapsdistance, and am also getting a connectivity issue. I suspect it's a workplace firewall issue. I will look into it and will post the results.
Indeed, I am behind a firewall that blocks access to OSRM. To work around this, I am running an instance of R through RStudio Cloud.
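For the new column itself, once the server is reachable, my plan is sketched below. It assumes that osrmTable() with measure = "distance" returns a $distances matrix (in meters) whose rows follow src and columns follow dst, so the diagonal pairs each origin with its own destination.
# Sketch: the diagonal of the src-by-dst matrix pairs row i of src with
# row i of dst, i.e. the routed distance for each row of mydata.
route <- osrmTable(src = src, dst = dst, measure = "distance")
mydata$route_dist_m <- diag(route$distances)   # distances assumed to be in meters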
I am trying to create a loop where I select one file name from a list of file names and use that one file to run read.capthist, then discretize, secr.fit, and derived, and save the outputs using save. The list contains 10 files with identical rows and columns; the only difference between them is the geographical coordinates in each row.
The issue I am running into is that capt needs to be a single file (in the secr package these are 'captfile' arguments), but I don't know how to select a single file from this list and get my loop to recognize it as a single entity.
This is the error I get when I try and select only one file:
Error in read.capthist(female[[i]], simtraps, fmt = "XY", detector = "polygon") :
requires single 'captfile'
I am not a programmer by training; I've learned R on my own and have used Stack Overflow a lot to solve my issues, but I haven't been able to figure this out. Here is the code I've come up with so far:
library(secr)

setwd("./")
files = list.files(pattern = "female*")
lst <- vector("list", length(files))
names(lst) <- files

for (i in 1:length(lst)) {
  capt <- lst[i]
  femsimCH <- read.capthist(capt, simtraps, fmt = 'XY', detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = 'proximity')
  fit <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = 'HEX', method = 'BFGS', trace = FALSE, CL = TRUE)
  save(fit, file = "C:/temp/fit.Rdata")
  D.fit <- derived(fit)
  save(D.fit, file = "C:/temp/D.fit.Rdata")
}
simtraps is a list of coordinates.
Ideally I would also like my outputs to have unique identifiers, since I am simulating data and will have to compare all the results; I don't want each iteration to overwrite the previous output.
I know I could bring in each file and run this code separately (it works for non-simulation runs on a couple of data sets), but as I'm hoping to run 100 simulations, this would be laborious and prone to mistakes.
Any tips would be greatly appreciated for an R novice!
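What I am aiming for is something like the sketch below (not working code as far as I know; it assumes read.capthist() accepts a plain file name, that simtraps is defined as above, and the output paths are illustrative). Each iteration writes its own files so nothing gets overwritten.
library(secr)
library(tools)   # for file_path_sans_ext()

# Sketch: pass the file name itself to read.capthist() and build unique
# output paths from it.
files <- list.files(pattern = "female*")

for (i in seq_along(files)) {
  capt <- files[i]                                  # a single file name, not a list element
  femsimCH     <- read.capthist(capt, simtraps, fmt = "XY", detector = "polygon")
  femsimdiscCH <- discretize(femsimCH, spacing = 2500, outputdetector = "proximity")
  fit   <- secr.fit(femsimdiscCH, buffer = 15000, detectfn = "HEX",
                    method = "BFGS", trace = FALSE, CL = TRUE)
  D.fit <- derived(fit)
  id <- file_path_sans_ext(files[i])
  save(fit,   file = file.path("C:/temp", paste0("fit_",  id, ".Rdata")))
  save(D.fit, file = file.path("C:/temp", paste0("Dfit_", id, ".Rdata")))
}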
The popular graph database Neo4j can be used within R thanks to the package/driver RNeo4j (https://github.com/nicolewhite/Rneo4j).
The package author, @NicoleWhite, provides several great examples of its usage on GitHub.
Unfortunately for me, the examples given by @NicoleWhite and the documentation are a bit oversimplified, in that they manually create each graph node with its associated labels and properties, such as:
mugshots = createNode(graph, "Bar", name = "Mugshots", location = "Downtown")
parlor = createNode(graph, "Bar", name = "The Parlor", location = "Hyde Park")
nicole = createNode(graph, name = "Nicole", status = "Student")
addLabel(nicole, "Person")
That's all well and good when you're dealing with a tiny example dataset, but this approach isn't feasible for something like a large social graph with thousands of users, where each user is a node (such graphs might not use every node in every query, but they still need to be loaded into Neo4j).
I'm trying to figure out how to do this using vectors or data frames. Is there a solution, perhaps involving an apply statement or a for loop?
This basic attempt:
for (i in 1:length(df$user_id)) {
  paste(df$user_id[i]) = createNode(graph, "user", name = df$name[i], email = df$email[i])
}
Leads to Error: 400 Bad Request
As a first attempt, you should look at the functionality I just added for the transactional endpoint:
http://nicolewhite.github.io/RNeo4j/docs/transactions.html
library(RNeo4j)

graph = startGraph("http://localhost:7474/db/data/")
clear(graph)

data = data.frame(Origin = c("SFO", "AUS", "MCI"),
                  FlightNum = c(1, 2, 3),
                  Destination = c("PDX", "MCI", "LGA"))

query = "
MERGE (origin:Airport {name:{origin_name}})
MERGE (destination:Airport {name:{dest_name}})
CREATE (origin)<-[:ORIGIN]-(:Flight {number:{flight_num}})-[:DESTINATION]->(destination)
"

t = newTransaction(graph)

for (i in 1:nrow(data)) {
  origin_name = data[i, ]$Origin
  dest_name = data[i, ]$Destination
  flight_num = data[i, ]$FlightNum

  appendCypher(t,
               query,
               origin_name = origin_name,
               dest_name = dest_name,
               flight_num = flight_num)
}

commit(t)

cypher(graph, "MATCH (o:Airport)<-[:ORIGIN]-(f:Flight)-[:DESTINATION]->(d:Airport)
               RETURN o.name, f.number, d.name")
Here, I form a Cypher query and then loop through the data frame, passing the values as parameters to the Cypher query. Your current attempt is slow because you're sending a separate HTTP request for each node created. By using the transactional endpoint, you create several things under a single transaction. If your data frame is very large, I would split it up into roughly 1000 rows per transaction (see the sketch below).
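To make that last point concrete, here is a rough sketch of splitting a larger data frame into blocks of about 1000 rows and opening one transaction per block; big_data is a placeholder for your full data frame with the same columns as data above.
# Sketch: one transaction per ~1000-row block.
chunk_size <- 1000
chunks <- split(big_data, ceiling(seq_len(nrow(big_data)) / chunk_size))

for (chunk in chunks) {
  t <- newTransaction(graph)
  for (i in 1:nrow(chunk)) {
    appendCypher(t, query,
                 origin_name = chunk[i, ]$Origin,
                 dest_name   = chunk[i, ]$Destination,
                 flight_num  = chunk[i, ]$FlightNum)
  }
  commit(t)
}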
As a second attempt, you should consider using LOAD CSV in the neo4j-shell.
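A rough illustration of that second idea, run through cypher() from R instead of the neo4j-shell. The file URL is only a placeholder: the CSV must be readable by the Neo4j server (often an absolute file:/// path on the server machine), and LOAD CSV requires Neo4j 2.1 or later.
# Sketch: bulk-load the same flight data with LOAD CSV.
write.csv(data, "flights.csv", row.names = FALSE)

load_query = "
LOAD CSV WITH HEADERS FROM 'file:///flights.csv' AS row
MERGE (origin:Airport {name: row.Origin})
MERGE (destination:Airport {name: row.Destination})
CREATE (origin)<-[:ORIGIN]-(:Flight {number: toInt(row.FlightNum)})-[:DESTINATION]->(destination)
RETURN count(*)
"

cypher(graph, load_query)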