I have a dataframe df with three variables: city, state and country. I want to calculate three things:
1. Calculate the latitude/longitude for each row.
2. Calculate the distance of each city from an arbitrary point, say the country capital.
3. Calculate the elevation of each point from the lat/long.
I can use the dismo package for 1, but I can't figure out a way to "bulk process" from df instead of copying and pasting the city, state and country names directly into the geocode(c()) call. As for 2 and 3, I am completely stumped. Any help would be appreciated.
EDIT NOTE: To other readers of this post... I appreciate the help from both Paul and Spacedman. The system won't allow me to mark more than one response as correct. I gave Paul the thumbs up because he responded before Spacedman. Please read what both of them have taken the time to write up. Thanks.
I'll share my thoughts on each of the points:
For 1. you can use paste to create an input vector for geocode:
library(dismo)   # provides geocode()
# build one "city, state, country" string per row and geocode them all in one call
df$geocode_string = with(df, paste(city, state, country, sep = ", "))
coords_latlong = geocode(df$geocode_string)
In regard to point number 2, after converting df to one of the classes provided by the sp package (SpatialPointsDataFrame; look at the coordinates function from sp), you can use spDistsN1 to find the distance from all the points to one other point.
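For example (an untested sketch; it assumes the geocoding above produced longitude/latitude columns and uses Washington, D.C. as the arbitrary point):

library(sp)

# attach the geocoded coordinates (check the actual column names geocode returns)
df$lon = coords_latlong$longitude
df$lat = coords_latlong$latitude

coordinates(df) = ~lon + lat                         # df is now a SpatialPointsDataFrame
proj4string(df) = CRS("+proj=longlat +datum=WGS84")

capital = c(-77.0369, 38.9072)                       # lon/lat of the arbitrary point
df$dist_km = spDistsN1(df, capital, longlat = TRUE)  # great-circle distances in km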
The final point is a bit trickier: to find the height you need a DEM (digital elevation model). Maybe there is an easier way along the lines of geocode that I am not aware of.
You can use the geonames.org service to query a location on either SRTM or ASTER elevation databases:
http://www.geonames.org/export/web-services.html
and you might even be able to use my geonames package:
https://r-forge.r-project.org/projects/geonames/
to make life easier.
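Something along these lines should work (an untested sketch; it assumes the lon/lat columns created earlier and a free geonames.org username):

library(geonames)
options(geonamesUsername = "your_geonames_user")   # placeholder account name

# query the SRTM3 elevation (in metres) for each point, one web request per row
df$elevation = mapply(function(lat, lon) GNsrtm3(lat, lon)$srtm3,
                      df$lat, df$lon)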
I have a dataset like the following picture and want to read each cell and assign it to a parameter in an optimization model. For example, considering a single row:
ID  Min   speed  Distance  Time  Latitude  Longitude
1   2506  23271  11.62968  17.7  -37.731   144.898
Every row depicts one person's information. So, is it better to define a dictionary of persons and put all of this into it? Or is it better to define a tuple? Or arrays (like below)?
for i in 1:n_people
    person_id = i
    push!(requests, Request(ID[i], Min[i], speed[i], Distance[i], Latitude[i], Longitude[i]))
end
In any case, how can I access (extract), let's say, distance for that person?
I mean, I need to have a set of people in my model, like people[i], and then connect each of them to their information (model parameters), such as person i's distance, speed, etc., and then compare them with person j.
What is the best way to do that?
Since JuMP is agnostic to the format of the input data, the answer is: it depends on what you want to do with it. Pick whatever makes the most sense to you.
There are a few data related tutorials that address how to read data into a DataFrame and use that to create JuMP variables and constraints. Reading those is a good next step:
https://jump.dev/JuMP.jl/stable/tutorials/getting_started/getting_started_with_data_and_plotting/
https://jump.dev/JuMP.jl/stable/tutorials/linear/diet/
I have two spatial datasets. One dataset contains lots of polygons (more than 150k in total) representing different features, like rivers and vegetation. The other dataset contains far fewer polygons (500) representing different areas.
I need to intersect those two datasets to get the features in the different areas.
I can subset the first dataset by the different features. If I use a subset of a small feature (2,500 polygons), the intersection with the areas is quite fast (5 min). But if I want to intersect a bigger feature subset (20,000 polygons), the computation runs for a very long time (I terminated it after two hours). And this is not even the biggest feature (50,000 polygons) I need to intersect.
This is the code snippet I run:
library(sf)
library(dplyr)

clean_intersect_save = function(geo_features, areas) {
  # make geometries valid
  data_valid_geoms = st_parallel(sf_df = st_geometry(geo_features),
                                 sf_func = st_make_valid,
                                 n_cores = 4)
  # remove unnecessary columns, then re-attach the repaired geometries
  data_valid = st_drop_geometry(geo_features) %>% select("feature")
  data_valid = st_sf(data_valid, geometry = data_valid_geoms)
  # intersect the geo-features and areas
  data_valid_split = st_parallel(sf_df = areas,
                                 sf_func = st_intersection,
                                 n_cores = 4,
                                 data_valid)
  # save a shapefile of the result
  st_write(data_valid_split, "data_valid_split.shp")
  return(data_valid_split)
}
Where both inputs are sf data frames.
st_parallel is a function I found here.
My question is: How would experienced spatial data people solve such a task usually? Do I just need more cores and/or more patience? Am I using sf wrong? Is R/sf the wrong tool?
Thanks for any help.
This is my very first spatial data analysis project, so sorry if I overlook some obvious things.
As there probably won't be a definitive answer to this rather vague question, I will answer it myself.
Thanks @Chris and @TimSalabim for the help. I ended up with a combination of both ideas.
I used PostGIS, which in my experience is a pretty intuitive way to work with spatial data.
The three things that sped up the intersection calculations for me are:
In my case the spatial data was stored as MULTIPOLYGONs when loaded from the shapefile. I expanded those into POLYGONs using ST_Dump:
https://postgis.net/docs/ST_Dump.html
I created a Spatial Index on the POLYGONS: https://postgis.net/workshops/postgis-intro/indexing.html
I used a combination of ST_Intersection and ST_Intersects to only call the costly ST_Intersection when really needed (as @TimSalabim suggested, this approach could also speed things up in R, but I currently have no time to test it): https://postgis.net/2014/03/14/tip_intersection_faster/
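For reference, the R-side version of that last point would look roughly like this (untested; geo_features and areas are the sf data frames from the function above):

library(sf)

# cheap step: which features could intersect which area at all?
candidates = st_intersects(areas, geo_features)

# costly step: run st_intersection only on the candidate pairs
pieces = lapply(seq_along(candidates), function(i) {
  hits = candidates[[i]]
  if (length(hits) == 0) return(NULL)
  st_intersection(areas[i, ], geo_features[hits, ])
})
result = do.call(rbind, pieces[!vapply(pieces, is.null, logical(1))])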
Hello stackoverflow community,
Given a data frame with two columns, points of latitude and longitude (dfLatLon), I'd like to add a third column that counts the number of crimes within a ~0.1 mile radius of each point, using a separate data frame with a list of crimes in the city of Denver (dfCrimes).
The solution below works (which I found searching stackoverflow, thank you!), but I have a hunch it’s inefficient, so I’d like to improve it if possible.
I’ve attempted to make this code reproducible (I’ve read enough posts to know that’s a must), but you will have to go to one of the sites below to download the crime data, which is ~74MB. Hope this isn’t an issue. Alternatively, you can use the other method below (uncomment it, comment the other), which doesn’t require you to download the file separately and is thus perhaps more reproducible, but I found it to be much slower (336 sec vs. 16 sec to load CSV).
Crime Web Page: http://data.denvergov.org/dataset/city-and-county-of-denver-crime
Direct Link to the Data: http://data.denvergov.org/download/gis/crime/csv/crime.csv
# Initialization Stuff
library(dplyr)
library(doParallel)
registerDoParallel(cores=4)
set.seed(77) #In honor of Red Grange … Go Illini!
#Set Working Directory
#setwd("INSERT YOUR WD HERE, LOCATION OF CRIME DATA")
#Load Crime Data
dfCrimes <- read.csv("crime.csv")
#Alternate method to obtain file, no separate download or setting working directory required, but MUCH slower.
#dfCrimes <- read.csv("http://data.denvergov.org/download/gis/crime/csv/crime.csv")
#Set Degrees per Mile Constants
cstDegPerMileLat <- 0.01450737744
cstDegPerMileLon <- 0.01882547865
#Create Lats & Lons Data Frame (a grid of ~0.1mi centers in a square around Denver)
vecLat <- seq(39.6098, 39.9146, cstDegPerMileLat * 0.1)
vecLon <- seq(-105.1100, -104.5987, cstDegPerMileLon * 0.1)
dfLatLon <- expand.grid(vecLat, vecLon)
colnames(dfLatLon) <- c("lat","lon")
#Add 3rd Column
#THIS IS THE PART THAT I THINK CAN BE MORE EFFICIENT … PLEASE HELP!
system.time(
  dfLatLon <- dfLatLon %>%
    rowwise %>%
    mutate(newcol = sum(dfCrimes$IS_CRIME[
      ((dfCrimes$GEO_LAT - lat) * (1 / cstDegPerMileLat))^2 +
      ((dfCrimes$GEO_LON - lon) * (1 / cstDegPerMileLon))^2 < 0.1^2
    ]))
)
#Wrapped formula above in system.time to measure efficiency.
#At its core, the formula above is just the basic formula for a circle, x^2 + y^2 = r^2, with adjustments to convert degrees of lat/lon to miles.
This took 747 seconds (~12 minutes) to run and used only 1 processor, which is way better than the ~30 min it took to run in Excel using all 4 processors. I realize 12 minutes isn’t prohibitively long, but if I scale this solution to larger problems, it’ll matter more.
Also, I’m running R Studio in a Windows 10 OS.
Here are my specific questions:
Are there ways of using dplyr to run this more efficiently? I've read it's super-efficient for this kind of problem. I suspect rowwise is a performance killer (perhaps it's not vectorized; could it be?), but I've not been able to get it to run without using rowwise (the closest I've come without rowwise falls back to base R; see the sketch after these sub-questions).
1a. Should I convert from data frames to data.tables?
1b. How can I use multiple processors to increase speed? Seems a waste to let 3 processor cores sit idle.
1c. Should I be using some kind of join or group_by? I don't think so, because there's no direct reference to exact lats and lons, but I'm open to this possibility if that's the right (faster) answer.
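To make question 1 concrete, the rowwise-free version I have in mind drops down to base R rather than staying in dplyr (a sketch only, reusing the constants and column names from above; I haven't timed it):

# scale everything to mile units once, outside the per-row work
r = 0.1
crime_x = dfCrimes$GEO_LON / cstDegPerMileLon
crime_y = dfCrimes$GEO_LAT / cstDegPerMileLat
is_crime = dfCrimes$IS_CRIME

count_crimes = function(lat, lon) {
  x = lon / cstDegPerMileLon
  y = lat / cstDegPerMileLat
  near = abs(crime_x - x) < r & abs(crime_y - y) < r   # cheap bounding-box pre-filter
  sum(is_crime[near][(crime_x[near] - x)^2 + (crime_y[near] - y)^2 < r^2])
}

dfLatLon$newcol = mapply(count_crimes, dfLatLon$lat, dfLatLon$lon)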
If there’s a better way to tackle this problem without dplyr, I’d like to see it, but I’d also like to see a solution in dplyr just so I can learn it better.
Finally, please note I’m doing this analysis for the sake of learning R and data science (I don’t work for the City of Denver), and am exploring larger data sets to improve my skills. I’m a beginner to R and data science, but I’ve been an analyst doing some fairly sophisticated analyses in Excel for years. I know Excel’s limitations, and am thus fascinated by the potential of R, data science, and machine learning.
This is my first post, so hopefully I’ve covered everything. Please let me know in the comments if I’m missing something, posted in the wrong spot, violated some rule, etc.
Thanks so much for your help!!!
p.s., I realize my lats and lons aren’t evenly spaced, and the degrees of lat/lon per mile aren’t uniform either, both due to convergence toward the north pole and Earth’s not-quite-spherical shape. I’m ignoring this for now. I just want to know how to efficiently reference an ‘outside’ data frame using dplyr.
p.p.s., Eventually I plan to make a predictive model from this data, likely experimenting using caret, but for now I’m trying to improve my pre-processing skills.
Is there a way to use WMS GetFeatureInfo with a TIME period (e.g. 2014-01-01/2014-03-01) to extract a series of values from a raster layer served by a GeoServer instance?
Thanks in advance
Not at the moment, no. It may be added in the future though; it's not the first time I've heard this request. I don't have an ETA, it depends on when funding to work on it shows up.
In the meantime, a somewhat complex workaround might be to configure the image mosaic index as a WFS feature type, query it by date, figure out the exact time values intersected by the interval, and then do N GetFeatureInfo requests, one for each of those values.
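To illustrate the last step, the N requests could be scripted; for example in R with httr, where the endpoint, layer name, bounding box and pixel coordinates are all placeholders and the time values are assumed to come from the WFS query of the mosaic index:

library(httr)

wms_url = "http://localhost:8080/geoserver/wms"                   # placeholder endpoint
times = c("2014-01-15T00:00:00.000Z", "2014-02-15T00:00:00.000Z") # from the mosaic index

values = lapply(times, function(t) {
  resp = GET(wms_url, query = list(
    service      = "WMS",
    version      = "1.1.1",
    request      = "GetFeatureInfo",
    layers       = "myws:mymosaic",        # placeholder layer name
    query_layers = "myws:mymosaic",
    srs          = "EPSG:4326",
    bbox         = "-180,-90,180,90",
    width        = 256,
    height       = 256,
    x            = 128,                    # pixel of the point of interest
    y            = 128,
    info_format  = "application/json",
    time         = t                       # one time value per request
  ))
  content(resp, as = "text", encoding = "UTF-8")
})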
I want to analyse angles in the movement of animals. I have tracking data with 10 recordings per second. Each recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording, and also includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle; however, since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem, but I do not know how to achieve either of them in R, and help would be greatly appreciated.
The first: reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample, for example, every 3rd or 10th recording of my data set? (My rough guess at this is sketched after the second option below.)
The second: converting straight movement into so-called 'flights'; rule-based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be set arbitrarily. Does anyone have an idea how to do that with the x,y coordinate positional data that I have?
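For the first option, my rough guess (untested; I've called the per-animal data frame track here, with one row per recording, ordered in time) is plain index subsetting:

track_sub = track[seq(1, nrow(track), by = 10), ]   # keep every 10th recording; use by = 3 for every 3rd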
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. In the worst case, all the data points are in a single flight segment, meaning you would check the first point the same number of times as you have data points, the second point one time fewer, etc. That means n + (n-1) + (n-2) + ... + 1 = n(n+1)/2 operations. That's O(n^2): the running time could grow quadratically with the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of width x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find a vector v orthogonal to the vector from A to C, then look at whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing; here is one example.
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then write another function that takes a set of points, calculates the vector v, and calls the first function for each point in the set. Then run some data through it and see how long it takes.
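A rough, untested sketch of those two functions (treat it as R-flavoured pseudocode; A, B and C are length-2 x,y vectors and column_width is the "x" threshold above):

point_in_column = function(A, B, C, column_width) {
  d = C - A                      # main direction of the candidate flight
  v = c(-d[2], d[1])             # a vector orthogonal to the A->C direction
  v = v / sqrt(sum(v^2))         # unit length, so the projection is a distance
  abs(sum((B - A) * v)) <= column_width   # scalar projection of A->B onto v
}

all_in_column = function(points, column_width) {
  # points: a two-column matrix of x,y positions; the first and last rows play A and C
  A = points[1, ]
  C = points[nrow(points), ]
  all(apply(points, 1, function(B) point_in_column(A, B, C, column_width)))
}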
Beyond that, I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, once you have something that works, it might also be easier to get help from people with more knowledge and experience to optimize it.
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about the growth of the data, I'm talking about the average sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll potentially need to analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of a flight segment is the largest set of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but where, if you looked at the column of width x around the vector to the second-to-last point, one point would no longer meet the criteria. Depending on how you define the flight segment (e.g. if you want it to be the largest possible set of points that meets your condition and don't care about what happens inside), you may need something different (e.g. working backwards instead of forwards).