Historical GTFS data in R

I'm trying to obtain a couple of weeks' worth of historical train performance (delay) data for trains arriving at Central Station (stop ids 600016–600024) in Brisbane, Australia, for research purposes. I found a package called gtfsway, developed by SymbolixAU, along with some brief code, but I don't know how to specify a date or the station id.
I'm new to GTFS and any help is appreciated.
library(gtfsway)
url <- "https://gtfsrt.api.translink.com.au/Feed/SEQ"
response <- httr::GET(url)
FeedMessage <- gtfs_realtime(response)
## the function gtfs_tripUpdates() extracts the 'trip_update' feed
lst <- gtfs_tripUpdates(FeedMessage)
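Note that a GTFS-realtime feed only exposes the current state of the network, so there is no date parameter to set: to build a two-week history you have to poll the feed yourself on a schedule and archive each snapshot, then parse the archived files and filter the trip updates to the Central Station stop ids afterwards. A minimal sketch of the archiving side (the directory layout and filename pattern are my own assumptions):

```r
# GTFS-realtime has no historical endpoint, so archive snapshots yourself.
# Helper that builds a timestamped file name for one snapshot (assumed layout).
snapshot_path <- function(dir, when = Sys.time()) {
  file.path(dir, format(when, "seq_%Y%m%d_%H%M%S.pb"))
}

# In a scheduled job (e.g. cron, every 60 seconds), save the raw protobuf:
# resp <- httr::GET("https://gtfsrt.api.translink.com.au/Feed/SEQ")
# writeBin(httr::content(resp, "raw"), snapshot_path("gtfsrt_archive"))
# Later, parse each file with gtfs_realtime()/gtfs_tripUpdates() and keep
# only the stop-time updates whose stop_id is one of 600016...600024.

snapshot_path("gtfsrt_archive", as.POSIXct("2023-05-01 10:30:00", tz = "UTC"))
```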

Related

Automate large data scraping

I am very new to using R and to coding in general, and would appreciate any help with my issue.
I am trying to download a large file from https://www.ncei.noaa.gov/access/search/data-search/global-summary-of-the-day.
I use the GSODR package to download the data; however, no matter what period I request (3 months, 1 year, 5 years), it only downloads around 5,000 rows (observations of around 49 variables).
Another problem is that I actually have to automate the download. A friend told me that I would need to set up a server and scrape the website directly (rather than use the GSODR package) to do that. I would appreciate any advice, since I'm very new to all of this.
Thank you very much for your time.
# install the development version of GSODR from GitHub
if (!require("remotes")) {
  install.packages("remotes", repos = "http://cran.rstudio.com/")
  library("remotes")
}
install_github("ropensci/GSODR")

library(GSODR)
library(dplyr)
library(reshape2)

# station metadata shipped with GSODR
load(system.file("extdata", "isd_history.rda", package = "GSODR"))

# create a data.frame of stations for Laos only
laos_stations <- subset(isd_history, COUNTRY_NAME == "LAOS")
laos_stations

# download climate data for Laos for the years 2017-2021
laos_2017_2021 <- get_GSOD(years = 2017:2021, country = "Laos")
# get_GSOD() returns the data directly; data() is only for datasets
# bundled inside a package, so inspect the result instead:
head(laos_2017_2021)
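If large multi-year requests are being truncated, one workaround (assuming the limit applies per request) is to fetch one year at a time and bind the results. The pattern below uses a dummy fetcher so it runs offline; with GSODR installed, swap `fetch_year` for `get_GSOD(years = y, country = "Laos")`:

```r
# Fetch per year and combine, so no single request is oversized.
# fetch_year() is a stand-in for get_GSOD(years = y, country = "Laos").
fetch_year <- function(y) data.frame(year = y, temp = rnorm(3))

years <- 2017:2021
by_year <- lapply(years, fetch_year)   # one data frame per year
laos_all <- do.call(rbind, by_year)    # combine into a single frame
nrow(laos_all)                         # 3 dummy rows per year -> 15
```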

Obtain State Name from Google Trends Interest by City

Suppose you run the following query:
gtrends("google", geo="US")$interest_by_city
This returns how many searches for the term "google" occurred across cities in the US. However, it does not provide any information regarding which state each city belongs to.
I have tried merging this data set with several others including city and state names. Given that the same city name can be present in many states, it is unclear to me how to identify which city was the one Google Trends provided data for.
I provide below a more detailed MWE.
library(gtrendsR)
library(USAboundariesData)
data1 <- gtrends("google", geo= "US")$interest_by_city
data1$city <- data1$location
data2 <- us_cities(map_date = NULL)
data3 <- merge(data1, data2, by="city")
And this yields the following problem:
city state
Alexandria Louisiana
Alexandria Indiana
Alexandria Kentucky
Alexandria Virginia
Alexandria Minnesota
making it difficult to know which "Alexandria" Google Trends provided the data for.
Any hints in how to identify the state of each city would be much appreciated.
One way around this is to collect the cities state by state and then rbind the resulting data frames. You could first make a vector of state codes like so:
states <- paste0("US-", state.abb)
I then used purrr for its map and reduce functionality to create a single frame:
data <- purrr::reduce(
  purrr::map(states, function(x) {
    gtrends("google", geo = x)$interest_by_city
  }),
  rbind
)
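The same map-then-bind idea in base R, with a stub replacing the network call so the pattern can be run offline (the stub assumes, as the real gtrendsR output does in my experience, that each per-state frame carries a `geo` column such as "US-AL" identifying the state):

```r
# Stub standing in for gtrends("google", geo = x)$interest_by_city.
fetch_state <- function(geo) {
  data.frame(location = paste0("Some city in ", geo), hits = 1, geo = geo)
}

states <- paste0("US-", state.abb)          # "US-AL", "US-AK", ...
data <- do.call(rbind, lapply(states, fetch_state))
# equivalently: data <- purrr::map_dfr(states, fetch_state)
nrow(data)  # 50, one stub row per state; the state survives in `geo`
```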

How to access Historical Data from The Weather Company (TWC) using IBM Data Science Experience (DSX)

I'm using IBM Data Science Experience (DSX), https://datascience.ibm.com/. I use R with RStudio.
What's the lowest level of data available (say, seconds or minutes or hourly, ...)?
Looking for example code to access the lowest level of data, say for the period 1 January 2016 to 30 November 2017, for a certain location.
As per the API documentation, the historical endpoint only returns the last 24 hours of data:
https://twcservice.mybluemix.net/rest-api/#!/Historical_Data/v1geotimeseriesobs
Anyhow, here is an R/RStudio implementation that fetches the last 23 hours of data into a data frame for latitude/longitude 33.40/-83.42:
library(jsonlite)
username <- "<PUT-YOUR-WEATHERDATA-USERNAME>"
password <- "<PUT-YOUR-WEATHERDATA-PASSWORD>"
base <- "https://twcservice.mybluemix.net/api/weather/v1/geocode/33.40/-83.42/observations/timeseries.json?hours=23"
library(httr)
get_data <- GET(base, authenticate(username,password, type = "basic"))
get_data
get_data_text <- content(get_data, "text")
get_data_text
get_data_json <- fromJSON(get_data_text, flatten = TRUE)
get_data_json
get_data_df <- as.data.frame(get_data_json)
View(get_data_df)
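The timeseries observations report their timestamps as Unix epoch seconds (in a field named `valid_time_gmt`; the exact flattened column name below is an assumption about the TWC response schema), so one extra conversion step makes them readable:

```r
# Convert epoch seconds to POSIXct; 0 corresponds to 1970-01-01 00:00:00 UTC.
epoch_to_utc <- function(secs) {
  as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")
}

# Assumed column name in the flattened frame:
# get_data_df$obs_time <- epoch_to_utc(get_data_df$observations.valid_time_gmt)

format(epoch_to_utc(1483228800), "%Y-%m-%d %H:%M")  # "2017-01-01 00:00"
```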

Can't get lat and longitude values of tweets

I collected some twitter data doing this:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger <- do.call(rbind, lapply(seq_along(lats), function(i)
  searchTwitter('Roger+Federer',
                lang = "en", n = N, resultType = "recent",
                geocode = paste(lats[i], lons[i], paste0(S, "mi"), sep = ","))))
After this I've done:
rogerlat=sapply(roger, function(x) as.numeric(x$getLatitude()))
rogerlat=sapply(rogerlat, function(z) ifelse(length(z)==0,NA,z))
rogerlon=sapply(roger, function(x) as.numeric(x$getLongitude()))
rogerlon=sapply(rogerlon, function(z) ifelse(length(z)==0,NA,z))
data=as.data.frame(cbind(lat=rogerlat,lon=rogerlon))
And now I would like to get all the tweets that have long and lat values:
data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)
But now I only get NA values.... Any thoughts on what goes wrong here?
As Chris mentioned, searchTwitter does not return the lat-long of a tweet. You can see this by going to the twitteR documentation, which tells us that it returns a status object.
Status Objects
Scrolling down to the status object, you can see that 11 pieces of information are included, but lat-long is not one of them. However, we are not completely lost, because the user's screen name is returned.
If we look at the user object, we see that a user's object at least includes a location.
So I can think of at least two possible solutions, depending on what your use case is.
Solution 1: Extracting a User's Location
# Search for recent Trump tweets #
tweets <- searchTwitter('Trump', lang = "en", n = N, resultType = "recent",
                        geocode = '38.9,-77,50mi')
# If you want, convert tweets to a data frame #
tweets.df <- twListToDF(tweets)
# Look up the users #
users <- lookupUsers(tweets.df$screenName)
# Convert users to a dataframe, look at their location#
users_df <- twListToDF(users)
table(users_df[1:10, 'location'])
❤ Texas ❤ ALT.SEATTLE.INTERNET.UR.FACE
2 1 1
Japan Land of the Free New Orleans
1 1 1
Springfield OR USA United States USA
1 1 1
# Note that these will be the users' self-reported locations,
# so potentially they are not that useful
Solution 2: Multiple searches with limited radius
The other solution would be to conduct a series of repeated searches, incrementing your latitude and longitude in small steps and using a small radius. That way you can be relatively sure that the users are close to your specified locations.
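A sketch of this second solution (the half-degree grid spacing and the 10-mile radius are arbitrary choices for illustration):

```r
# Build a lattice of search centres covering the bounding box from the question.
grid <- expand.grid(lat = seq(38.9, 40.7, by = 0.5),
                    lon = seq(-77, -74, by = 0.5))
nrow(grid)  # 28 search centres (4 latitudes x 7 longitudes)

# One small-radius search per centre (requires a live twitteR session):
# tweets <- lapply(seq_len(nrow(grid)), function(i)
#   searchTwitter('Roger+Federer', n = N, resultType = "recent",
#                 geocode = paste(grid$lat[i], grid$lon[i], "10mi", sep = ",")))
```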
Not necessarily an answer, but more an observation too long for comment:
First, you should look at the documentation of how to input geocode data. Using twitteR:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
Geodata should be structured like this (lat, lon, radius):
geo <- '40,-75,200km'
And then called using:
roger <- searchTwitter('Roger+Federer',lang="en",n=N,resultType="recent",geocode=geo)
Then, I would instead use twListToDF to filter:
roger <- twListToDF(roger)
Which now gives you a data.frame with 16 cols and 200 observations (set above).
You could then filter using:
setDT(roger) #from data.table
roger[latitude > 38.9 & latitude < 40.7 & longitude > -77 & longitude < -74]
That said (and why this is an observation vs. an answer) - it looks as though twitteR does not return lat and lon (it is all NA in the data I returned) - I think this is to protect individual users locations.
That said, adjusting the radius does affect the number of results, so the code does have access to the geo data somehow.
Assuming that some tweets were downloaded, there are some geo-referenced tweets and some tweets without geographical coordinates:
prod(dim(data)) > 1 & prod(dim(data)) != sum(is.na(data)) & any(is.na(data))
# TRUE
Let's simulate data between your longitude/latitude points for simplicity.
set.seed(123)
data <- data.frame(lon=runif(200, -77, -74), lat=runif(200, 38.9, 40.7))
data[sample(1:200, 10),] <- NA
Rows with complete longitude/latitude data can be selected by dropping the rows with missing values.
data2 <- data[complete.cases(data), c("lon", "lat")]
nrow(data) - nrow(data2)
# 10
The last line replaces the last two lines of your code. complete.cases() is also safer than -which(is.na(...)): if no rows were missing, which() would return an empty index and the negative subscript would drop every row. Note that this only works if the missing geographical coordinates are stored as NA.

Download weather data by zip code tabulate area in R

I am trying to figure out how to download weather data (temperature, radiation, etc.) by zip code tabulation area (ZCTA). Although Census data are available by ZCTA, this is not the case for weather data.
I tried to find information at http://cdo.ncdc.noaa.gov/qclcd/QCLCD?prior=N but couldn't figure it out.
Has anyone ever downloaded weather data by ZCTA? If not, has anyone had experience converting weather-station information to ZCTA?
The National Weather Service provides two web-based APIs to extract weather forecast information from the National Digital Forecast Database (NDFD): a SOAP interface and a REST interface. Both return data in Digital Weather Markup Language (DWML), an XML dialect. The data elements that can be returned are listed here.
IMO the REST interface is by far the easier to use. Below is an example where we extract the forecast temperature, relative humidity, and wind speed for Zip Code 10001 (Lower Manhattan) in 3-hour increments for the next 5 days.
# NOAA NWS REST API Example
# 3-hourly forecast for Lower Mannhattan (Zip Code: 10001)
library(httr)
library(XML)
url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url, query = list(zipCodeList = "10001",
                                  product = "time-series",
                                  begin = format(Sys.Date(), "%Y-%m-%d"),
                                  Unit = "e",
                                  temp = "temp", rh = "rh", wspd = "wspd"))
doc <- content(response,type="text/xml") # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates,xmlValue),format="%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data,function(d)removeChildren(d,kids=list("name")))
result <- do.call(data.frame,lapply(data,function(d)xmlSApply(d,xmlValue)))
colnames(result) <- sapply(data,xmlName)
# combine into a data frame
result <- data.frame(dates,result)
head(result)
# dates temperature wind.speed humidity
# 1 2014-11-06 19:00:00 52 8 96
# 2 2014-11-06 22:00:00 50 7 86
# 3 2014-11-07 01:00:00 50 7 83
# 4 2014-11-07 04:00:00 47 11 83
# 5 2014-11-07 07:00:00 45 14 83
# 6 2014-11-07 10:00:00 50 16 61
It is possible to query multiple zip codes in a single request, but this complicates parsing the returned XML.
To get the NOAA QCLCD data into zip codes you need to use the latitude/longitude values from the station.txt file and compare that with data from Census. This can only be done with GIS related tools. My solution is to use a PostGIS enabled database so you can use the ST_MakePoint function:
ST_MakePoint(longitude, latitude)
You would then need to load the ZCTA from Census Bureau into the database as well to determine which zip codes contain which stations. The ST_Contains function will help with that.
ST_Contains(zip_way, ST_MakePoint(longitude, latitude))
A full query might look something like this:
SELECT s.wban, z.zip5, s.state, s.location
FROM public.station s
INNER JOIN public.zip z
  ON ST_Contains(z.way, ST_MakePoint(s.longitude, s.latitude));
I'm obviously making assumptions about the column names, but the above should be a great starting point.
You should be able to accomplish the same tasks with QGIS (Free) or ArcGIS (expensive) too. That gets rid of the overhead of installing a PostGIS enabled database, but I'm not as familiar with the required steps in those software packages.
Weather data is only available for weather stations, and there is not a weather station for each ZCTA (ZCTAs are much smaller than the regions covered by weather stations).
I have seen options on the NOAA website where you can enter a latitude and longitude and it will find the weather from the appropriate weather station. So if you can convert your ZCTAs of interest into lat/lon pairs (centroid, random corner, etc.), you could submit those to the website. But note that if you do this for a large number of ZCTAs that are close together, you will be downloading redundant information. It would be better to do a one-time matching of ZCTA to weather station, then download weather info from each station only once, and then merge with the ZCTA data.
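For that one-time matching, a plain nearest-neighbour lookup from a ZCTA centroid to the station list is often enough. The column names below are assumptions about the station file, and the squared-coordinate-difference distance is only a rough proxy for great-circle distance (it ignores the cos(latitude) scaling of longitude), but at sub-degree scales it usually picks the same station:

```r
# Find the station closest to a ZCTA centroid (crude planar distance).
nearest_station <- function(lat, lon, stations) {
  d2 <- (stations$lat - lat)^2 + (stations$lon - lon)^2
  stations[which.min(d2), ]
}

# Toy station list; wban/lat/lon column names are assumptions.
stations <- data.frame(wban = c("14732", "94728", "14786"),
                       lat  = c(40.78, 40.77, 42.93),
                       lon  = c(-73.88, -73.97, -78.73))
nearest_station(40.75, -73.99, stations)$wban  # "94728", the closest centre
```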
