Distance between points in GPX file becomes too large

I want to analyze distance traveled based on GPS tracks, but when I calculate the distance it always comes out too large.
I use Python to make a CSV file with the latitude and longitude of all points in a track, which I then analyze with R. The data frame looks like this:
| lat| lon| lat.p1| lon.p1| dist_to_prev|
|--------:|--------:|--------:|--------:|------------:|
| 60.62061| 15.66640| 60.62045| 15.66660| 28.103099|
| 60.62045| 15.66660| 60.62037| 15.66662| 8.859034|
| 60.62037| 15.66662| 60.62026| 15.66636| 31.252373|
| 60.62026| 15.66636| 60.62018| 15.66636| 8.574722|
| 60.62018| 15.66636| 60.62010| 15.66650| 17.787905|
| 60.62001| 15.66672| 60.61996| 15.66684| 14.393267|
| 60.61996| 15.66684| 60.61989| 15.66685| 7.584996|
...
I could post the whole data frame here for reproducibility, since it's only 59 rows, but I'm not sure of the etiquette for posting big chunks of data here. Let me know how I can best share it.
lat.p1 and lon.p1 are just the lat and lon from the row below. dist_to_prev is calculated with distm() from geosphere:
library(geosphere)
library(dplyr)
df$dist_to_prev <- apply(df, 1, FUN = function(row) {
  distm(c(as.numeric(row["lat"]), as.numeric(row["lon"])),
        c(as.numeric(row["lat.p1"]), as.numeric(row["lon.p1"])),
        fun = distHaversine)})
df %>% filter(dist_to_prev != "NA") %>% summarise(sum(dist_to_prev))
# A tibble: 1 x 1
`sum(dist_to_prev)`
<dbl>
1 1266.
I took this track as an example from Trailforks, and according to their track description it should be 787 m, not the 1266 m I got. This is not unique to this track; all the tracks I've looked at come out 30-50% too long.
One thing that might be the cause is that there are only five decimal places for the lats/lons. There are six decimal places in the CSV, but I can only see five when I open it in RStudio. I was thinking it was just formatting to make it easier to read and that the "whole" number was there, but maybe not? The lat/lons are of type double.
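To check whether the full precision is actually stored (a quick sketch, assuming your data frame is called df, as in the code above), you can print a column with more digits than the default console display:
print(df$lat[1:3], digits = 12)
# or raise the display precision globally; the console/viewer rounds for display,
# but the underlying doubles keep the full value read from the CSV
options(digits = 10)
head(df$lat)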
Why are my distances much larger than the ones displayed on the website I got the GPX file from?

There are a couple of problems in the code above. The function distHaversine is vectorized, so you can avoid the loop/apply statement; this will significantly improve performance.
Most importantly, with the geosphere package the first coordinate is longitude, not latitude.
df<- read.table(header =TRUE, text=" lat lon lat.p1 lon.p1
60.62061 15.66640 60.62045 15.66660
60.62045 15.66660 60.62037 15.66662
60.62037 15.66662 60.62026 15.66636
60.62026 15.66636 60.62018 15.66636
60.62018 15.66636 60.62010 15.66650
60.62001 15.66672 60.61996 15.66684
60.61996 15.66684 60.61989 15.66685")
library(geosphere)
#Lat is first column (incorrect)
distHaversine(df[,c("lat", "lon")], df[,c("lat.p1", "lon.p1")])
#incorrect
#[1] 28.103099 8.859034 31.252373 8.574722 17.787905 14.393267 7.584996
#Longitude is first (correct)
distHaversine(df[,c("lon", "lat")], df[,c("lon.p1", "lat.p1")])
#correct result.
#[1] 20.893456 8.972291 18.750046 8.905559 11.737448 8.598240 7.811479
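For the total track length, the corrected, longitude-first call can also replace the whole apply() loop, since distHaversine is vectorized (a sketch using the column names from the question; the printed total is for the 7-row sample above):
library(geosphere)
library(dplyr)
# one vectorized call over all rows, longitude first
df$dist_to_prev <- distHaversine(df[, c("lon", "lat")],
                                 df[, c("lon.p1", "lat.p1")])
df %>% summarise(total_m = sum(dist_to_prev, na.rm = TRUE))
# ~85.7 m for the 7-row sample above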

Related

Dividing Individual Spatial Polygons Equally in R

I have a shapefile of polygons that are the townships in the state of Iowa. I'd like to divide each element (i.e. each township) into 9 equal parts (a 3 x 3 grid for each township). I've figured out how to do this, but am having trouble forming a new data frame out of the new polygons. My code is below. The data can be downloaded here: https://ufile.io/wi6tt
library(sf)
library(tidyverse)
setwd("~/Desktop")
iowa <- st_read(dsn = "Townships/iowa", layer = "PLSS_Township_Boundaries", stringsAsFactors = F) # import data
## Make division
r <- NULL
for (row in 1:nrow(iowa)) {
  r[[row]] <- st_make_grid(iowa[row, ], n = c(3, 3))
}
# Combine together
region <- NULL
for (row in 1:nrow(iowa)) {
  region <- rbind(region, r[[row]])
}
region <- st_sfc(region, crs = 4326)            # convert to sfc
reg_id <- data.frame(reg_id = 1:length(region)) # make ID for data frame
# Make SF
region_df <- st_sf(reg_id, region)
The last line gives the following error:
Error in `[[<-.data.frame`(`*tmp*`, all_sfc_names[i], value = list(list( : replacement has 1644 rows, data has 14796
1644 is the number of rows in the initial Iowa data frame.
Clearly the number of rows does not match the number of elements.
This might be a general R thing, rather than a spatial one, but I figured I'd post the whole thing in case someone has an idea on how to do the entirety of this a little more cleanly.
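One way to avoid the row mismatch (a sketch, not tested against this shapefile) is to concatenate the sfc objects with c() instead of rbind(), so that region stays a set of simple feature geometries with one element per grid cell:
library(sf)
# build the 3 x 3 grids as before, one sfc per township
r <- lapply(seq_len(nrow(iowa)), function(row) st_make_grid(iowa[row, ], n = c(3, 3)))
# concatenate the sfc objects; the CRS is inherited from iowa
region <- do.call(c, r)
region_df <- st_sf(reg_id = seq_along(region), geometry = region)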

spatial join on two simple features {sf} with over 1 mil. entries as fast as possible

I hope this is not too trivial, but I really can't find an answer and I'm too new to the topic to come up with alternatives myself. So here is the problem:
I have two shapefiles, x and y, that represent different processing levels of a Sentinel-2 satellite image.
x contains about 1,300,000 polygons/segments completely covering the image extent, without any further vital information.
y contains about 500 polygons representing the cloud-free area of the image (also covering most of the image except for a few "cloud holes"), as well as information about the source image in 4 columns (Sensor, Time...).
I'm trying to add the image information to x in the places where x is covered by y. Pretty simple? I just can't find a way to make it happen without it taking days.
I read x in as a simple feature {sf}, as reading it with shapefile / readOGR takes ages.
I tried different things with y:
when I try merge(x, y) I can only pass one sf object, as merge doesn't support two sf's
merging x (as sf) and y (as shp) gives me the error "cannot allocate vector of size 13.0 Gb"
so I tried sf::st_join(x, y), which accepts both objects as sf, but it still hasn't finished after 28 hours
sf::st_intersection(x, y) took about 9 minutes for a 10,000-segment subset, so that might not be a lot faster for the whole piece.
Could subsetting x into a few smaller pieces solve the whole thing, or is there another simple solution? Could I do something with my workspace to make the merge work, or is there simply no shortcut to joining that many polygons?
Thanks a lot in advance and I hope my description isn't too fuzzy!
My tiny workstation:
Windows 7, 64-bit
8 GB RAM
Intel i7-4790 @ 3.6 GHz
I often face this kind of problem and, as @manotheshark2 affirms, I prefer to work in a loop, subsetting my vector layer. Here is my advice:
Load your data
library(raster)
library(rgdal)
x <- readOGR('C:/', 'sentinelCovers')
y <- readOGR('C:/', 'cloudHoles')
Assign a y ID to identify which x polygons intersect y polygons, and create the answer column in the x table:
x$xyID <- NA # Answer col
y$yID <- 1:nrow(y@data) # ID col
Run a loop subsetting x:
for (posX in 1:nrow(x@data)) {
  pol.x <- x[posX, ]
  intX <- raster::intersect(pol.x, y)
  # x$xyID[posX] <- intX@data$yID ## Run this if there is a single matching y polygon
  # x$xyID[posX] <- paste0(intX@data$yID, collapse = ',') ## Run this if there can be multiple y polygons
}
You can check whether it is better to run the loop over the x or the y layer:
x$xyID <- NA # Answer col
x$xID <- 1:nrow(x@data) # ID col
for (posY in 1:nrow(y@data)) {
  pol.y <- y[posY, ]
  intY <- tryCatch(raster::intersect(pol.y, x), error = function(e) NULL)
  if (is.null(intY)) next
  x$xyID[x@data$xID %in% intY@data$xID] <- pol.y$yID
}
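If you can keep everything in sf, another option worth trying (a sketch, not part of the answer above; it assumes x and y were read with st_read and that y's attribute columns are named as in the question) is to use the sparse index from st_intersects, which avoids building the intersection geometries altogether:
library(sf)
# for each x polygon, the indices of the y polygons it intersects (sparse list)
idx <- st_intersects(x, y)
# keep the first match; NA where the x polygon is not covered by y
first_hit <- vapply(idx, function(i) if (length(i)) i[1] else NA_integer_, integer(1))
# copy the image information from y onto x (adjust the column names as needed)
x$Sensor <- y$Sensor[first_hit]
x$Time   <- y$Time[first_hit]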

Can't get lat and longitude values of tweets

I collected some Twitter data doing this:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger = do.call(rbind, lapply(1:length(lats), function(i)
  searchTwitter('Roger+Federer', lang = "en", n = N, resultType = "recent",
                geocode = paste(lats[i], lons[i], paste0(S, "mi"), sep = ","))))
After this I've done:
rogerlat=sapply(roger, function(x) as.numeric(x$getLatitude()))
rogerlat=sapply(rogerlat, function(z) ifelse(length(z)==0,NA,z))
rogerlon=sapply(roger, function(x) as.numeric(x$getLongitude()))
rogerlon=sapply(rogerlon, function(z) ifelse(length(z)==0,NA,z))
data=as.data.frame(cbind(lat=rogerlat,lon=rogerlon))
And now I would like to get all the tweets that have long and lat values:
data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)
But now I only get NA values. Any thoughts on what goes wrong here?
As Chris mentioned, searchTwitter does not return the lat-long of a tweet. You can see this by going to the twitteR documentation, which tells us that it returns a status object.
Status Objects
Scrolling down to the status object, you can see that 11 pieces of information are included, but lat-long is not one of them. However, we are not completely lost, because the user's screen name is returned.
If we look at the user object, we see that it at least includes a location.
So I can think of at least two possible solutions, depending on what your use case is.
Solution 1: Extracting a User's Location
# Search for recent Trump tweets #
tweets <- searchTwitter('Trump', lang="en",n=N,resultType="recent",
geocode='38.9,-77,50mi')
# If you want, convert tweets to a data frame #
tweets.df <- twListToDF(tweets)
# Look up the users #
users <- lookupUsers(tweets.df$screenName)
# Convert users to a dataframe, look at their location#
users_df <- twListToDF(users)
table(users_df[1:10, 'location'])
❤ Texas ❤ ALT.SEATTLE.INTERNET.UR.FACE
2 1 1
Japan Land of the Free New Orleans
1 1 1
Springfield OR USA United States USA
1 1 1
# Note that these will be the users' self-reported locations,
# so potentially they are not that useful
Solution 2: Multiple searches with limited radius
The other solution would be to conduct a series of repeated searches, incrementing your latitude and longitude, with a small radius for each query. That way you can be relatively sure that the user is close to your specified location.
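A rough sketch of that approach (the grid spacing, per-query radius, and query term are placeholders; searchTwitter and twListToDF are from twitteR, as above):
lat_grid <- seq(38.9, 40.7, by = 0.5)   # placeholder spacing
lon_grid <- seq(-77, -74, by = 0.5)
results <- list()
for (la in lat_grid) {
  for (lo in lon_grid) {
    geo <- paste(la, lo, "25mi", sep = ",")   # small radius per query
    hits <- searchTwitter('Roger+Federer', lang = "en", n = 100,
                          resultType = "recent", geocode = geo)
    if (length(hits) > 0) results[[geo]] <- twListToDF(hits)
  }
}
tweets_df <- do.call(rbind, results)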
Not necessarily an answer, but more an observation too long for comment:
First, you should look at the documentation of how to input geocode data. Using twitteR:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
Geodata should be structured like this (lat, lon, radius):
geo <- '40,-75,200km'
And then called using:
roger <- searchTwitter('Roger+Federer',lang="en",n=N,resultType="recent",geocode=geo)
Then, I would instead use twListToDF to filter:
roger <- twListToDF(roger)
This now gives you a data.frame with 16 columns and 200 observations (the n set above).
You could then filter using:
setDT(roger) #from data.table
roger[latitude > 38.9 & latitude < 40.7 & longitude > -77 & longitude < -74]
That said (and this is why it is an observation rather than an answer), it looks as though twitteR does not return lat and lon (it is all NA in the data I returned); I think this is to protect individual users' locations.
Still, adjusting the radius does affect the number of results, so the code does have access to the geo data somehow.
Assuming that some tweets were downloaded, there are some geo-referenced tweets and some tweets without geographical coordinates:
prod(dim(data)) > 1 & prod(dim(data)) != sum(is.na(data)) & any(is.na(data))
# TRUE
Let's simulate data between your longitude/latitude points for simplicity.
set.seed(123)
data <- data.frame(lon=runif(200, -77, -74), lat=runif(200, 38.9, 40.7))
data[sample(1:200, 10),] <- NA
Rows with longitude/latitude data can be selected by removing the 10 rows with missing data.
data2 <- data[-which(is.na(data[, 1])), c("lon", "lat")]
nrow(data) - nrow(data2)
# 10
The last line replaces the last two lines of your code. However, note that this only works if the missing geographical coordinates are stored as NA.

Simple lookup to insert values in an R data frame

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case zip market
1 44485 NA
2 44488 NA
3 43210 NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market zip
1 44485
1 44486
1 44488
... ... (100 zips in market 1)
2 43210
2 43211
... ... (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the market column in alldata, you can first strip it off and then merge alldata and zipcodes on the zip column using merge:
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
Another option that worked for me and is very simple:
alldata$market<-with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
  select(-market) %>%
  left_join(zipcodes, by = "zip")
which, on my machine, is roughly the same performance as lookup.
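If you want to check that on your own data, here is a quick benchmark sketch (microbenchmark is not part of the original answers; alldata and zipcodes as defined in the question):
library(microbenchmark)
library(qdapTools)
library(dplyr)
microbenchmark(
  qdap  = lookup(alldata$zip, zipcodes[, 2:1]),
  dplyr = alldata %>% select(-market) %>% left_join(zipcodes, by = "zip"),
  times = 10
)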
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
alldata <- data.frame(Case=1:3, zip=c(44485,44488,43210), market=c(NA,NA,NA))
zipcodes <- data.frame(market=c(1,1,1,2,2), zip=c(44485,44486,44488,43210,43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2

R storing different columns in different vectors to compute conditional probabilities

I am completely new to R. I tried reading the reference and a couple of good introductions, but I am still quite confused.
I am hoping to do the following:
I have produced a .txt file that looks like the following:
area,energy
1.41155882174e-05,1.0914586287e-11
1.46893363946e-05,5.25011714434e-11
1.39244046855e-05,1.57904991488e-10
1.64155121046e-05,9.0815757601e-12
1.85202830392e-05,8.3207522281e-11
1.5256036289e-05,4.24756620609e-10
1.82107587343e-05,0.0
I have the following command to read the file in R:
tbl <- read.csv("foo.txt", header = TRUE)
producing:
> tbl
area energy
1 1.411559e-05 1.091459e-11
2 1.468934e-05 5.250117e-11
3 1.392440e-05 1.579050e-10
4 1.641551e-05 9.081576e-12
5 1.852028e-05 8.320752e-11
6 1.525604e-05 4.247566e-10
7 1.821076e-05 0.000000e+00
Now I want to store the two columns in two different vectors, respectively area and energy.
I tried:
area <- c(tbl$first)
energy <- c(tbl$second)
but it does not seem to work.
I need two different vectors (which must include only the numerical data of each column) in order to do the following:
> prob(energy, given = area), i.e. the conditional probability P(energy|area).
And then plot it. Can you help me please?
As @Ananda Mahto alluded to, the problem is in the way you are referring to the columns.
To 'get' a column of a data frame in R, you have several options:
DataFrameName$ColumnName
DataFrameName[,ColumnNumber]
DataFrameName[["ColumnName"]]
So to get area, you would do:
tbl$area #or
tbl[,1] #or
tbl[["area"]]
The first option is generally preferred (from what I've seen).
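So, for the two vectors you were after:
area   <- tbl$area
energy <- tbl$energy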
Incidentally, for your 'end goal', you don't need to do any of this:
with(tbl, prob(energy, given = area))
does the trick.
