I collected some Twitter data by doing this:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger <- do.call(rbind, lapply(1:length(lats), function(i)
  searchTwitter('Roger+Federer', lang = "en", n = N, resultType = "recent",
                geocode = paste(lats[i], lons[i], paste0(S, "mi"), sep = ","))))
After this I've done:
rogerlat=sapply(roger, function(x) as.numeric(x$getLatitude()))
rogerlat=sapply(rogerlat, function(z) ifelse(length(z)==0,NA,z))
rogerlon=sapply(roger, function(x) as.numeric(x$getLongitude()))
rogerlon=sapply(rogerlon, function(z) ifelse(length(z)==0,NA,z))
data=as.data.frame(cbind(lat=rogerlat,lon=rogerlon))
And now I would like to get all the tweets that have long and lat values:
data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)
But now I only get NA values... Any thoughts on what is going wrong here?
As Chris mentioned, searchTwitter does not return the lat-long of a tweet. You can see this by going to the twitteR documentation, which tells us that it returns a status object.
Status Objects
Scrolling down to the status object, you can see that 11 pieces of information are included, but lat-long is not one of them. However, we are not completely lost, because the user's screen name is returned.
If we look at the user object, we see that a user object at least includes a location field.
So I can think of at least two possible solutions, depending on what your use case is.
Solution 1: Extracting a User's Location
# Search for recent Trump tweets #
tweets <- searchTwitter('Trump', lang = "en", n = N, resultType = "recent",
                        geocode = '38.9,-77,50mi')
# If you want, convert tweets to a data frame #
tweets.df <- twListToDF(tweets)
# Look up the users #
users <- lookupUsers(tweets.df$screenName)
# Convert users to a dataframe, look at their location#
users_df <- twListToDF(users)
table(users_df[1:10, 'location'])
❤ Texas ❤ ALT.SEATTLE.INTERNET.UR.FACE
2 1 1
Japan Land of the Free New Orleans
1 1 1
Springfield OR USA United States USA
1 1 1
# Note that these will be the users' self-reported locations,
# so potentially they are not that useful
Solution 2: Multiple searches with limited radius
The other solution would be to conduct a series of repeated searches, incrementing your latitude and longitude, each with a small radius. That way you can be relatively sure that the user is close to your specified location. A rough sketch of what that could look like is below.
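This is only a sketch, reusing searchTwitter() and twListToDF() from above; the grid spacing and the 5-mile radius are arbitrary illustrative choices, and in practice you would need to stay within Twitter's rate limits:
library(twitteR)

# Illustrative grid over the original bounding box; spacing and radius are arbitrary
grid <- expand.grid(lat = seq(38.9, 40.7, by = 0.2),
                    lon = seq(-77, -74, by = 0.2))

tweets <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  res <- searchTwitter('Roger+Federer', lang = "en", n = 100, resultType = "recent",
                       geocode = paste(grid$lat[i], grid$lon[i], "5mi", sep = ","))
  if (length(res) == 0) NULL else twListToDF(res)
}))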
Not necessarily an answer, but more an observation too long for a comment:
First, you should look at the documentation of how to input geocode data. Using twitteR:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
Geodata should be structured like this (lat, lon, radius):
geo <- '40,-75,200km'
And then called using:
roger <- searchTwitter('Roger+Federer',lang="en",n=N,resultType="recent",geocode=geo)
Then, I would instead use twListToDF to filter:
roger <- twListToDF(roger)
Which now gives you a data.frame with 16 cols and 200 observations (set above).
You could then filter using:
setDT(roger) #from data.table
roger[latitude > 38.9 & latitude < 40.7 & longitude > -77 & longitude < -74]
That said (and why this is an observation rather than an answer), it looks as though twitteR does not return lat and lon (it is all NA in the data I returned); I think this is to protect individual users' locations.
However, adjusting the radius does affect the number of results, so the code does have access to the geo data somehow.
Assuming that some tweets were downloaded, there are some geo-referenced tweets and some tweets without geographical coordinates:
prod(dim(data)) > 1 & prod(dim(data)) != sum(is.na(data)) & any(is.na(data))
# TRUE
Let's simulate data between your longitude/latitude points for simplicity.
set.seed(123)
data <- data.frame(lon=runif(200, -77, -74), lat=runif(200, 38.9, 40.7))
data[sample(1:200, 10),] <- NA
Rows with longitude/latitude data can be selected by removing the 10 rows with missing data.
data2 <- data[-which(is.na(data[, 1])), c("lon", "lat")]
nrow(data) - nrow(data2)
# 10
The data2 line replaces the last two lines of your code (the filter() and select() calls). However, note that this only works if the missing geographical coordinates are stored as NA.
Related
I have a dataset of around 10000 rows. I have the Address, City, State and Zipcode values. I do not have lat/long coordinates. I would like to retrieve the county name without taking a large amount of time. I have tried library(tidygeocoder), but it takes around 14 seconds for 100 values and gives a 'time-out' error when I put in the entire dataset. Plus, it outputs a FIPS code, which I have to join to get the actual county name. Reproducible example:
library(tidygeocoder)
library(dplyr)
df <- tidygeocoder::louisville[,1:4]
county_fips <- data.frame(fips = c("111", "112"),
                          county = c("Jefferson", "Montgomery"))
geocoded <- df %>% geocode(street = street, city = city, state = state,
method = 'census', full_results = TRUE,
api_options = list(census_return_type = 'geographies'))
df$fips <- geocoded$county_fips
df_new <- merge(x=df, y=county_fips, by="fips", all.x = T)
You can use a public dataset that links city and/or zipcode to county. I found these websites with such data:
https://www.unitedstateszipcodes.org/zip-code-database
https://simplemaps.com/data/us-cities
You can then do a left join on the linking column (presumably city or zipcode but will depend on the dataset):
df = merge(x=df, y=public_dataset, by="City", all.x=T)
If performance is an issue, you can select just the county and linking columns from the public data set before you do the merge.
public_dataset = public_dataset %>% select(County, City)
The slow performance is due to tidygeocoder's use of the Census Bureau's API to match data. Asking the API to match thousands of addresses is the slowdown, and I'm not aware of a different way to do this.
However, we can at least pare down the number of addresses that you are putting into the API. Maybe if we get that number low enough the code will run.
The ZIP Code Tabulation Area (ZCTA) to county relationship file shows the relationships between ZIP codes and county names (as well as FIPS codes). A "|"-delimited file with a description of the data can be found on the Bureau's website.
Counting the number of times a ZIP code shows up tells us if a ZIP code spans multiple counties. If the frequency == 1, then you can freely translate the ZIP code to the county.
ZCTA <- read.delim("tab20_zcta520_county20_natl.txt", sep="|")
n_occur <- data.frame(table(ZCTA$GEOID_ZCTA5_20))
head(n_occur, 10)
   Var1 Freq
1   601    2
2   602    2
3   603    2
4   606    3
5   610    4
6   611    1
7   612    3
8   616    1
9   617    2
10  622    1
In these results, addresses with ZIP codes 00611 and 00622 can be mapped to the corresponding counties without sending the addresses through the API. If your addresses are very urban, you may be lucky in that the ZIP codes are small area-wise and may not typically span multiple counties.
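For illustration, here is a small sketch of how the single-county ZIP codes could be resolved directly, so only the remaining addresses go through the API. The county column name NAMELSAD_COUNTY_20 in the relationship file and the addresses data frame with a 5-digit character zip column are assumptions, not something from the original post:
library(dplyr)

# ZIP codes that map to exactly one county, taken from n_occur above
single_zips <- as.character(n_occur$Var1[n_occur$Freq == 1])

# Assumed: NAMELSAD_COUNTY_20 holds the county name; addresses has a character "zip" column
zip_to_county <- ZCTA %>%
  filter(as.character(GEOID_ZCTA5_20) %in% single_zips) %>%
  transmute(zip = sprintf("%05d", GEOID_ZCTA5_20), county = NAMELSAD_COUNTY_20)

resolved  <- addresses %>% inner_join(zip_to_county, by = "zip")  # no API call needed
remaining <- addresses %>% anti_join(zip_to_county, by = "zip")   # geocode only these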
I have a dataframe that contains information about 3000 hospitals, along with a dataframe containing instances of 1 million+ vaccinations. I'm trying to map each vaccination either to a hospital included in the hospitals data or simply to a vaccination site that's not a hospital.
All hospitals in the hospitals dataframe have a geolocation field of latitude and longitude coordinates. The vaccinations dataframe also has latitude and longitude columns. The issue is that these coordinates often aren't exact, so I can't simply map them directly to lat/long values in the hospitals dataset. Therefore I'm creating a small spatial buffer and lat/long range for each hospital's geolocation and checking whether the vaccinations fall within those ranges. The issue is that the code I'm running takes a while, and I'm sure there's a simpler way to do what I'm trying to do.
Thanks!
total_in_hospital_vaccinations <- 0
total_out_hospital_vaccinations <- 0
hospitals$vaccinations <- 0
for (i in 1:NROW(hospitals)) {
  hospital <- hospitals[i, ]
  name <- hospital$name
  in_range <- vaccinations[(vaccinations$long >= hospital$longitude_low & vaccinations$long <= hospital$longitude_high
                            & vaccinations$lat <= hospital$latitude_high & vaccinations$lat >= hospital$latitude_low), ]
  hospitals[(hospitals$name == name), ]$vaccinations <- NROW(in_range)
  total_in_hospital_vaccinations <- total_in_hospital_vaccinations + NROW(in_range)
  total_out_hospital_vaccinations <- NROW(vaccinations) - total_in_hospital_vaccinations
}
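Not part of the original post, but a minimal sketch of a vectorized alternative using the sf package, assuming the column names from the question (longitude/latitude in hospitals, long/lat in vaccinations) and an arbitrary 500 m buffer in place of the precomputed lat/long ranges:
library(sf)

# Assumed inputs: hospitals with longitude/latitude, vaccinations with long/lat (WGS84)
hosp_sf <- st_as_sf(hospitals, coords = c("longitude", "latitude"), crs = 4326)
vacc_sf <- st_as_sf(vaccinations, coords = c("long", "lat"), crs = 4326)

# Buffer each hospital by ~500 m (with s2 enabled, the distance is in metres)
hosp_buf <- st_buffer(hosp_sf, dist = 500)

# Count vaccination points falling inside each hospital buffer
hits <- st_intersects(hosp_buf, vacc_sf)
hospitals$vaccinations <- lengths(hits)

# A vaccination counts as "in hospital" if it falls inside at least one buffer
in_any <- lengths(st_intersects(vacc_sf, hosp_buf)) > 0
total_in_hospital_vaccinations  <- sum(in_any)
total_out_hospital_vaccinations <- sum(!in_any)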
I want to analyze distance traveled based on GPS tracks, but when I calculate the distance it always comes out too large.
I use Python to make a CSV file with the latitude and longitude for all points in a track, which I then analyze with R. The data frame looks like this:
| lat| lon| lat.p1| lon.p1| dist_to_prev|
|--------:|--------:|--------:|--------:|------------:|
| 60.62061| 15.66640| 60.62045| 15.66660| 28.103099|
| 60.62045| 15.66660| 60.62037| 15.66662| 8.859034|
| 60.62037| 15.66662| 60.62026| 15.66636| 31.252373|
| 60.62026| 15.66636| 60.62018| 15.66636| 8.574722|
| 60.62018| 15.66636| 60.62010| 15.66650| 17.787905|
| 60.62001| 15.66672| 60.61996| 15.66684| 14.393267|
| 60.61996| 15.66684| 60.61989| 15.66685| 7.584996|
...
I could post the whole data frame here for reproducibility (it's only 59 rows), but I'm not sure of the etiquette for posting big chunks of data here. Let me know how I can best share it.
lat.p1 and lon.p1 are just the lat and lon from the row below. dist_to_prev is calculated with distm() from geosphere:
library(geosphere)
library(dplyr)
df$dist_to_prev <- apply(df, 1 , FUN = function (row) {
distm(c(as.numeric(row["lat"]), as.numeric(row["lon"])),
c(as.numeric(row["lat.p1"]), as.numeric(row["lon.p1"])),
fun = distHaversine)})
df %>% filter(dist_to_prev != "NA") %>% summarise(sum(dist_to_prev))
# A tibble: 1 x 1
`sum(dist_to_prev)`
<dbl>
1 1266.
I took this track as an example from Trailforks, and if you look at their track description it should be 787 m, not the 1266 m I got. This is not unique to this track; all the tracks I've looked at come out 30-50% too long.
One thing that might be the cause is that there are only 5 decimal places for the lats/lons. There are 6 decimal places in the CSV, but I can only see 5 when I open it in RStudio. I was thinking it was just formatting to make it easier to read and that the "whole" number was there, but maybe not? The lat/lons are of type double.
Why are my distances much larger than the ones displayed on the website I got the GPX file from?
There are a couple of problems in the code above. The function distHaversine is vectorized, so you can avoid the loop / apply statement. This will significantly improve performance.
Most importantly, with the geosphere package the first coordinate is longitude, not latitude.
df<- read.table(header =TRUE, text=" lat lon lat.p1 lon.p1
60.62061 15.66640 60.62045 15.66660
60.62045 15.66660 60.62037 15.66662
60.62037 15.66662 60.62026 15.66636
60.62026 15.66636 60.62018 15.66636
60.62018 15.66636 60.62010 15.66650
60.62001 15.66672 60.61996 15.66684
60.61996 15.66684 60.61989 15.66685")
library(geosphere)
#Lat is first column (incorrect)
distHaversine(df[,c("lat", "lon")], df[,c("lat.p1", "lon.p1")])
#incorrect
#[1] 28.103099 8.859034 31.252373 8.574722 17.787905 14.393267 7.584996
#Longitude is first (correct)
distHaversine(df[,c("lon", "lat")], df[,c("lon.p1", "lat.p1")])
#correct result.
#[1] 20.893456 8.972291 18.750046 8.905559 11.737448 8.598240 7.811479
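Summing these corrected per-segment distances then gives the total for the example rows:
# total distance for the seven example segments, in metres
sum(distHaversine(df[, c("lon", "lat")], df[, c("lon.p1", "lat.p1")]))
# roughly 85.7 m for these seven rows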
I am trying to figure out how to download weather data (temperature, radiation, etc.) by ZIP code tabulation area (ZCTA). Although Census data are available by ZCTA, this is not the case for weather data.
I tried to find information at http://cdo.ncdc.noaa.gov/qclcd/QCLCD?prior=N but couldn't figure it out.
Has anyone ever downloaded weather data by ZCTA? If not, does anyone have experience converting weather observation station information to ZCTA?
The National Weather Service provides two web-based APIs to extract weather forecast information from the National Digital Forecast Database (NDFD): a SOAP interface and a REST interface. Both return data in Digital Weather Markup Language (DWML), which is an XML dialect. The data elements that can be returned are listed here.
IMO the REST interface is by far the easier to use. Below is an example where we extract the forecast temperature, relative humidity, and wind speed for Zip Code 10001 (Lower Manhattan) in 3-hour increments for the next 5 days.
# NOAA NWS REST API Example
# 3-hourly forecast for Lower Manhattan (Zip Code: 10001)
library(httr)
library(XML)
url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url,query=list(zipCodeList="10001",
product="time-series",
begin=format(Sys.Date(),"%Y-%m-%d"),
Unit="e",
temp="temp",rh="rh",wspd="wspd"))
doc <- content(response,type="text/xml") # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates,xmlValue),format="%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data,function(d)removeChildren(d,kids=list("name")))
result <- do.call(data.frame,lapply(data,function(d)xmlSApply(d,xmlValue)))
colnames(result) <- sapply(data,xmlName)
# combine into a data frame
result <- data.frame(dates,result)
head(result)
# dates temperature wind.speed humidity
# 1 2014-11-06 19:00:00 52 8 96
# 2 2014-11-06 22:00:00 50 7 86
# 3 2014-11-07 01:00:00 50 7 83
# 4 2014-11-07 04:00:00 47 11 83
# 5 2014-11-07 07:00:00 45 14 83
# 6 2014-11-07 10:00:00 50 16 61
It is possible to query multiple zip codes in a single request, but this complicates parsing the returned XML.
To get the NOAA QCLCD data into ZIP codes, you need to use the latitude/longitude values from the station.txt file and compare them with data from the Census. This can only be done with GIS-related tools. My solution is to use a PostGIS-enabled database so you can use the ST_MakePoint function:
ST_MakePoint(longitude, latitude)
You would then need to load the ZCTA from Census Bureau into the database as well to determine which zip codes contain which stations. The ST_Contains function will help with that.
ST_Contains(zip_way, ST_MakePoint(longitude, latitude))
A full query might look something like this:
SELECT s.wban, z.zip5, s.state, s.location
FROM public.station s
INNER JOIN public.zip z
  ON ST_Contains(z.way, ST_MakePoint(s.longitude, s.latitude));
I'm obviously making assumptions about the column names, but the above should be a great starting point.
You should be able to accomplish the same tasks with QGIS (Free) or ArcGIS (expensive) too. That gets rid of the overhead of installing a PostGIS enabled database, but I'm not as familiar with the required steps in those software packages.
Weather data is only available for weather stations, and there is not a weather station for each ZCTA (ZCTAs are much smaller than the regions covered by weather stations).
I have seen options on the NOAA website where you can enter a latitude and longitude and it will find the weather from the appropriate weather station. So if you can convert your ZCTAs of interest into lat/lon pairs (center, random corner, etc.), you could submit those to the website. But note that if you do this for a large number of ZCTAs that are close together, you will be downloading redundant information. It would be better to do a one-time matching of ZCTA to weather station, then download weather info from each station only once and merge it with the ZCTA data.
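Not from the original answer, but a small sketch of that one-time matching using geosphere (already used elsewhere on this page). The zcta_centroids, stations, and station_weather data frames are hypothetical placeholders with lon/lat columns:
library(geosphere)

# Hypothetical inputs: zcta_centroids(zcta, lon, lat) and stations(station_id, lon, lat)
# Note: geosphere expects longitude first
d <- distm(zcta_centroids[, c("lon", "lat")],
           stations[, c("lon", "lat")],
           fun = distHaversine)            # matrix: rows = ZCTAs, cols = stations

# Nearest station per ZCTA; download weather once per unique station, then merge back
zcta_centroids$station_id <- stations$station_id[apply(d, 1, which.min)]
weather_by_zcta <- merge(zcta_centroids, station_weather, by = "station_id")  # station_weather is hypothetical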
I have a long list of city names and countries and I would like to plot them on a map. In order to do this I need the longitude and latitude information of each of the cities.
My table is called test and has the following structure:
Cityname CountryCode
New York US
Hamburg DE
Amsterdam NL
With the following code I have successfully solved the problem.
library(RJSONIO)
nrow <- nrow(test)
counter <- 1
test$lon <- NA
test$lat <- NA
while (counter <= nrow) {
  CityName <- gsub(' ', '%20', test$Cityname[counter]) # URL-encode spaces
  CountryCode <- test$CountryCode[counter]
  url <- paste(
    "http://nominatim.openstreetmap.org/search?city="
    , CityName
    , "&countrycodes="
    , CountryCode
    , "&limit=9&format=json"
    , sep = "")
  x <- fromJSON(url)
  if (length(x) > 0) { # Nominatim returned at least one match
    test$lon[counter] <- x[[1]]$lon
    test$lat[counter] <- x[[1]]$lat
  }
  counter <- counter + 1
}
As this calls an external service (openstreetmap.org), it can take a while for larger datasets. However, you probably only need to do this once in a while, when new cities have been added to the list.
A few other options for you.
ggmap
ggmap has a geocode function which uses Google Maps to geocode. This limits you to 2,500 requests per day.
taRifx.geo
The latest version of taRifx.geo has a geocode function which uses either Google or Bing Maps to geocode. The Bing version requires you to use a (free) Bing account, but in return you can geocode far more entries. Features in this version:
Service choice (Bing and Google Maps both supported)
Log-in support (particularly for Bing, which requires an account key but in exchange allows for an order of magnitude more daily requests)
Geocode a whole data.frame at a time, including some time-savers like ignoring any rows which have already been geocoded
Robust batch geocoding (so that any error does not cause the whole data.frame's worth of geocoding to be lost, for bigger jobs)
Route finding (travel times from point A to point B)
Try this; I think it will be a better solution for this problem:
> library(ggmap)
Loading required package: ggplot2
Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation('ggmap') for details.
#Now you can give city name or country name individually
> geocode("hamburg")
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=hamburg&sensor=false
lon lat
1 9.993682 53.55108
geocode("amsterdam")
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=amsterdam&sensor=false
lon lat
1 4.895168 52.37022
> geocode("new york")
Information from URL : http://maps.googleapis.com/maps/api/geocode/json?address=new+york&sensor=false
lon lat
1 -74.00594 40.71278