Download weather data by zip code tabulation area in R

I am trying to figure out how to download weather data (temperature, radiation, etc.) by ZIP Code Tabulation Area (ZCTA). Census data are available by ZCTA, but this is not the case for weather data.
I tried to find information at http://cdo.ncdc.noaa.gov/qclcd/QCLCD?prior=N
but couldn't figure it out.
Has anyone ever downloaded weather data by ZCTA? If not, does anyone have experience converting weather observation station information to ZCTAs?

The National Weather Service provides two web-based APIs for extracting weather forecast information from the National Digital Forecast Database (NDFD): a SOAP interface and a REST interface. Both return data in Digital Weather Markup Language (DWML), an XML dialect. The data elements that can be returned are listed here.
IMO the REST interface is by far the easier to use. Below is an example where we extract the forecast temperature, relative humidity, and wind speed for Zip Code 10001 (Lower Manhattan) in 3-hour increments for the next 5 days.
# NOAA NWS REST API Example
# 3-hourly forecast for Lower Manhattan (Zip Code: 10001)
library(httr)
library(XML)
url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url, query = list(zipCodeList = "10001",
                                  product = "time-series",
                                  begin = format(Sys.Date(), "%Y-%m-%d"),
                                  Unit = "e",
                                  temp = "temp", rh = "rh", wspd = "wspd"))
doc <- content(response, type = "text/xml")   # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates, xmlValue), format = "%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data, function(d) removeChildren(d, kids = list("name")))
result <- do.call(data.frame, lapply(data, function(d) xmlSApply(d, xmlValue)))
colnames(result) <- sapply(data, xmlName)
# combine into a data frame
result <- data.frame(dates,result)
head(result)
# dates temperature wind.speed humidity
# 1 2014-11-06 19:00:00 52 8 96
# 2 2014-11-06 22:00:00 50 7 86
# 3 2014-11-07 01:00:00 50 7 83
# 4 2014-11-07 04:00:00 47 11 83
# 5 2014-11-07 07:00:00 45 14 83
# 6 2014-11-07 10:00:00 50 16 61
It is possible to query multiple zip codes in a single request, but this complicates parsing the returned XML.
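As a minimal, untested sketch of a multi-zip request (it assumes zipCodeList accepts space-separated codes, as the client documentation describes, and that each <parameters> block in the DWML carries an applicable-location attribute tying it to a point):
# same style of request for two zip codes
response2 <- GET(url, query = list(zipCodeList = "10001 60601",
                                   product = "time-series",
                                   begin = format(Sys.Date(), "%Y-%m-%d"),
                                   Unit = "e",
                                   temp = "temp"))
doc2 <- content(response2, type = "text/xml")
# one <parameters> node per location; split the parsing by location
params <- doc2["//parameters"]
sapply(params, function(p) xmlGetAttr(p, "applicable-location"))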

To get the NOAA QCLCD data into ZIP codes you need to take the latitude/longitude values from the station.txt file and compare them with ZCTA boundary data from the Census Bureau. This requires GIS-related tools. My solution is to use a PostGIS-enabled database, so you can use the ST_MakePoint function:
ST_MakePoint(longitude, latitude)
You would then need to load the ZCTA boundaries from the Census Bureau into the database as well, to determine which ZIP codes contain which stations. The ST_Contains function will help with that.
ST_Contains(zip_way, ST_MakePoint(longitude, latitude))
A full query might look something like this:
SELECT s.wban, z.zip5, s.state, s.location
FROM public.station s
INNER JOIN public.zip z
ON ST_Contains(z.way, ST_MakePoint(s.longitude, s.latitude));
I'm obviously making assumptions about the column names, but the above should be a great starting point.
You should be able to accomplish the same tasks with QGIS (free) or ArcGIS (expensive) too. That avoids the overhead of installing a PostGIS-enabled database, but I'm not as familiar with the required steps in those packages.
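If you would rather stay in R, the sf package can do the same point-in-polygon join without a database. A rough sketch under assumptions: a data frame stations built from station.txt (with wban, longitude and latitude columns) and a ZCTA shapefile downloaded from the Census Bureau; the file and column names below are illustrative and vary by vintage:
library(sf)
# stations as spatial points (WGS84)
stations_sf <- st_as_sf(stations, coords = c("longitude", "latitude"), crs = 4326)
# ZCTA polygons from the Census Bureau TIGER/Line download
zcta <- st_read("tl_2020_us_zcta520.shp")
zcta <- st_transform(zcta, 4326)
# attach the containing ZCTA (if any) to each station
station_zcta <- st_join(stations_sf, zcta, join = st_within)
head(station_zcta[, c("wban", "ZCTA5CE20")])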

Weather data is only available for weather stations, and there is not a weather station for each ZCTA (ZCTAs are much smaller than the regions covered by weather stations).
There are options on the NOAA website where you can enter a latitude and longitude and it will find the weather from the appropriate weather station. So if you can convert each ZCTA of interest into a lat/lon pair (centroid, a corner, etc.) you could submit that to the website. But note that if you do this for a large number of ZCTAs that are close together, you will be downloading redundant information. It would be better to do a one-time matching of ZCTA to weather station (see the sketch below), then download the weather info from each station only once, and finally merge it with the ZCTA data.
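One way to do that one-time matching in R is to take each ZCTA's centroid and assign the nearest station. A rough sketch, assuming a data frame stations of weather stations with wban, longitude and latitude columns (e.g. built from NOAA's station list) and using the tigris and sf packages for ZCTA geometry; the year argument and the ZCTA5CE20 column name depend on the vintage you download:
library(sf)
library(tigris)
zcta <- zctas(cb = TRUE, year = 2020)                # ZCTA polygons
zcta_pts <- st_centroid(st_transform(zcta, 4326))    # one point per ZCTA
stations_sf <- st_as_sf(stations, coords = c("longitude", "latitude"), crs = 4326)
# index of the nearest station for each ZCTA centroid
nearest <- st_nearest_feature(zcta_pts, stations_sf)
zcta_station <- data.frame(zcta = zcta_pts$ZCTA5CE20,
                           wban = stations_sf$wban[nearest])
head(zcta_station)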

Related

How to get US county name from Address, city and state in R?

I have a dataset of around 10,000 rows. I have the Address, City, State and Zipcode values. I do not have lat/long coordinates. I would like to retrieve the county name without it taking a large amount of time. I have tried library(tidygeocoder), but it takes around 14 seconds for 100 values and gives a 'time-out' error when I put in the entire dataset. Plus, it outputs a FIPS code, which I have to join on to get the actual county name. Reproducible example:
library(tidygeocoder)
library(dplyr)
df <- tidygeocoder::louisville[,1:4]
county_fips <- data.frame(fips = c("111", "112"),
                          county = c("Jefferson", "Montgomery"))
geocoded <- df %>% geocode(street = street, city = city, state = state,
                           method = 'census', full_results = TRUE,
                           api_options = list(census_return_type = 'geographies'))
df$fips <- geocoded$county_fips
df_new <- merge(x=df, y=county_fips, by="fips", all.x = T)
You can use a public dataset that links city and/or zipcode to county. I found these websites with such data:
https://www.unitedstateszipcodes.org/zip-code-database
https://simplemaps.com/data/us-cities
You can then do a left join on the linking column (presumably city or zipcode, but this will depend on the dataset):
df = merge(x=df, y=public_dataset, by="City", all.x=T)
If performance is an issue, you can select just the county and linking columns from the public data set before you do the merge.
public_dataset = public_dataset %>% select(County, City)
The slow performance is due to tidygeocoder's use of the Census Bureau's API to match addresses. Asking the API to match thousands of addresses is the slowdown, and I'm not aware of a different way to do this.
However, we can at least pare down the number of addresses that you are putting into the API. Maybe if we get that number low enough the code will run.
The Census Bureau's ZCTA-to-county relationship file shows the relationships between ZIP Code Tabulation Areas (ZCTAs) and counties (names as well as FIPS codes). A "|"-delimited file, with a description of the data, can be found on the Bureau's website.
Counting the number of times a ZIP code shows up tells us whether it spans multiple counties. If the frequency == 1, then you can translate the ZIP code directly to its county.
ZCTA <- read.delim("tab20_zcta520_county20_natl.txt", sep="|")
n_occur <- data.frame(table(ZCTA$GEOID_ZCTA5_20))
head(n_occur, 10)
#    Var1 Freq
# 1   601    2
# 2   602    2
# 3   603    2
# 4   606    3
# 5   610    4
# 6   611    1
# 7   612    3
# 8   616    1
# 9   617    2
# 10  622    1
In these results, addresses with ZIP codes 00611 and 00622 can be mapped to their counties without sending the addresses through the API. If your addresses are mostly urban, you may be lucky: urban ZIP codes tend to be small in area and may not typically span multiple counties.
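As a rough sketch of that filtering (the GEOID_COUNTY_20 column name is assumed from the 2020 relationship file layout; check the file description for the exact names), build a lookup table from the single-county ZIP codes and send only the remaining addresses through the API:
# ZIP codes that map to exactly one county
single_county <- as.numeric(as.character(n_occur$Var1[n_occur$Freq == 1]))
lookup <- ZCTA[ZCTA$GEOID_ZCTA5_20 %in% single_county,
               c("GEOID_ZCTA5_20", "GEOID_COUNTY_20")]
# addresses whose ZIP is in the lookup get their county by a simple merge;
# only the rest need to go through tidygeocoder / the Census API
needs_api <- df[!(df$zip %in% lookup$GEOID_ZCTA5_20), ]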

Historical GTFS data

I'm trying to obtain a couple of weeks worth of historical train performance (delay) data on trains arriving at Central Station (id=600016...600024), Brisbane, Australia for research purposes. I found a package called gtfsway developed by SymbolixAU, and some brief code, but I don't know how to specify a date and the station id.
I'm new to GTFS and any help is appreciated.
library(gtfsway)
url <- "https://gtfsrt.api.translink.com.au/Feed/SEQ"
response <- httr::GET(url)
FeedMessage <- gtfs_realtime(response)
## the function gtfs_tripUpdates() extracts the 'trip_update' feed
lst <- gtfs_tripUpdates(FeedMessage)

How to access Historical Data from The Weather Company (TWC) using IBM Data Science Experience (DSX)

I'm using IBM Data Science Experience (DSX), https://datascience.ibm.com/. I use R with RStudio.
What's the lowest level of data available (say, seconds or minutes or hourly, ...)?
I am looking for example code to access the lowest level of data, say for the period 1 January 2016 to 31 November 2017, for a certain location.
As per the API documentation, the historical data endpoint only returns the last 24 hours of data:
https://twcservice.mybluemix.net/rest-api/#!/Historical_Data/v1geotimeseriesobs
Anyhow, here is an R/RStudio implementation that pulls the last 23 hours of observations into a data frame for latitude/longitude 33.40/-83.42:
library(jsonlite)
username <- "<PUT-YOUR-WEATHERDATA-USERNAME>"
password <- "<PUT-YOUR-WEATHERDATA-PASSWORD>"
base <- "https://twcservice.mybluemix.net/api/weather/v1/geocode/33.40/-83.42/observations/timeseries.json?hours=23"
library(httr)
get_data <- GET(base, authenticate(username,password, type = "basic"))
get_data
get_data_text <- content(get_data, "text")
get_data_text
get_data_json <- fromJSON(get_data_text,flatten = TRUE)
get_data_json
get_data_df <- as.data.frame(get_data_json)
View(get_data_df)

Can't get latitude and longitude values of tweets

I collected some twitter data doing this:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger = do.call(rbind, lapply(1:length(lats), function(i)
  searchTwitter('Roger+Federer', lang = "en", n = N, resultType = "recent",
                geocode = paste(lats[i], lons[i], paste0(S, "mi"), sep = ","))))
After this I've done:
rogerlat=sapply(roger, function(x) as.numeric(x$getLatitude()))
rogerlat=sapply(rogerlat, function(z) ifelse(length(z)==0,NA,z))
rogerlon=sapply(roger, function(x) as.numeric(x$getLongitude()))
rogerlon=sapply(rogerlon, function(z) ifelse(length(z)==0,NA,z))
data=as.data.frame(cbind(lat=rogerlat,lon=rogerlon))
And now I would like to get all the tweets that have long and lat values:
data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)
But now I only get NA values... Any thoughts on what is going wrong here?
As Chris mentioned, searchTwitter does not return the lat-long of a tweet. You can see this by going to the twitteR documentation, which tells us that it returns a status object.
Status Objects
Scrolling down to the status object, you can see that 11 pieces of information are included, but lat-long is not one of them. However, we are not completely lost, because the user's screen name is returned.
If we look at the user object, we see that a user's object at least includes a location.
So I can think of at least two possible solutions, depending on what your use case is.
Solution 1: Extracting a User's Location
# Search for recent Trump tweets #
tweets <- searchTwitter('Trump', lang = "en", n = N, resultType = "recent",
                        geocode = '38.9,-77,50mi')
# If you want, convert tweets to a data frame #
tweets.df <- twListToDF(tweets)
# Look up the users #
users <- lookupUsers(tweets.df$screenName)
# Convert users to a dataframe, look at their location#
users_df <- twListToDF(users)
table(users_df[1:10, 'location'])
❤ Texas ❤ ALT.SEATTLE.INTERNET.UR.FACE
2 1 1
Japan Land of the Free New Orleans
1 1 1
Springfield OR USA United States USA
1 1 1
# Note that these will be the users' self-reported locations,
# so potentially they are not that useful
Solution 2: Multiple searches with a limited radius
The other solution would be to conduct a series of repeated searches, incrementing your latitude and longitude in small steps and using a small radius for each query. That way you can be relatively sure that the user is close to your specified location; a sketch follows.
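A minimal sketch of that approach, reusing the twitteR setup from the question (the grid spacing, radius and search term are arbitrary choices here):
# small grid of search centres between the two corner points
lat_grid <- seq(38.9, 40.7, by = 0.5)
lon_grid <- seq(-77, -74, by = 0.5)
grid <- expand.grid(lat = lat_grid, lon = lon_grid)
tweets <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  res <- searchTwitter('Roger+Federer', lang = "en", n = 100, resultType = "recent",
                       geocode = paste(grid$lat[i], grid$lon[i], "25mi", sep = ","))
  if (length(res) == 0) NULL else twListToDF(res)
}))
# every tweet is now known to come from within 25 miles of its grid centre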
Not necessarily an answer, but more of an observation too long for a comment:
First, you should look at the documentation for how to pass geocode data. Using twitteR:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
Geodata should be structured like this (lat, lon, radius):
geo <- '40,-75,200km'
And then called using:
roger <- searchTwitter('Roger+Federer',lang="en",n=N,resultType="recent",geocode=geo)
Then, I would instead use twListToDF to filter:
roger <- twListToDF(roger)
Which now gives you a data.frame with 16 columns and 200 observations (the number set above).
You could then filter using:
setDT(roger) #from data.table
roger[latitude > 38.9 & latitude < 40.7 & longitude > -77 & longitude < -74]
That said (and this is why it's an observation rather than an answer), it looks as though twitteR does not return lat and lon; they are all NA in the data I got back. I think this is to protect individual users' locations.
That said, adjusting the radius does affect the number of results, so the search does have access to the geo data somehow.
Assuming that some tweets were downloaded, there are some geo-referenced tweets and some tweets without geographical coordinates:
prod(dim(data)) > 1 & prod(dim(data)) != sum(is.na(data)) & any(is.na(data))
# TRUE
Let's simulate data between your longitude/latitude points for simplicity.
set.seed(123)
data <- data.frame(lon=runif(200, -77, -74), lat=runif(200, 38.9, 40.7))
data[sample(1:200, 10),] <- NA
Rows with longitude/latitude data can be selected by removing the 10 rows with missing data.
data2 <- data[-which(is.na(data[, 1])), c("lon", "lat")]
nrow(data) - nrow(data2)
# 10
The last line replaces the last two lines of your code. However, note that this only works if the missing geographical coordinates are stored as NA.

Subsetting data by multiple date ranges - R

I'll get straight to the point: I have been given some data sets in .csv format containing regularly logged sensor data from a machine. However, these data sets also contain measurements taken while the machine was turned off, which I would like to separate from the data logged while it was turned on. To subset the relevant data I also have a file containing the start and end times of these shutdowns. This file is several hundred rows long.
Examples of the relevant files for this problem:
file: sensor_data.csv
sens_name,time,measurement
sens_A,17/12/11 06:45,32.3321
sens_A,17/12/11 08:01,36.1290
sens_B,17/12/11 05:32,17.1122
sens_B,18/12/11 03:43,12.3189
##################################################
file: shutdowns.csv
shutdown_start,shutdown_end
17/12/11 07:46,17/12/11 08:23
17/12/11 08:23,17/12/11 09:00
17/12/11 09:00,17/12/11 13:30
18/12/11 01:42,18/12/11 07:43
To subset data in R, I have previously used the subset() function with simple conditions which has worked fine, but I don't know how to go about subsetting sensor data which fall outside multiple shutdown date ranges. I've already formatted the date and time data using as.POSIXlt().
I'm suspecting some scripting may be involved to come up with a good solution, but I'm afraid I am not yet experienced enough to handle this type of data.
Any help, advice, or solutions will be greatly appreciated. Let me know if there's anything else needed for a solution.
I prefer the POSIXct format for ranges within data frames. We build a logical index that is TRUE for measurements taken outside every shutdown window, i.e. t < shutdown_start OR t > shutdown_end for all of the ranges. With this index we can then subset the data either way:
posixct <- function(x) as.POSIXct(x, format="%d/%m/%y %H:%M")
sensor_data$time <- posixct(sensor_data$time)
shutdowns[] <- lapply(shutdowns, posixct)
ind1 <- sapply(sensor_data$time, function(t) {
  sum(t < shutdowns[,1] | t > shutdowns[,2]) == nrow(shutdowns)})
# Measurements taken while the machine was running (outside all shutdowns)
sensor_data[ind1,]
#   sens_name                time measurement
# 1    sens_A 2011-12-17 06:45:00     32.3321
# 3    sens_B 2011-12-17 05:32:00     17.1122
# Measurements taken during a shutdown
sensor_data[!ind1,]
#   sens_name                time measurement
# 2    sens_A 2011-12-17 08:01:00     36.1290
# 4    sens_B 2011-12-18 03:43:00     12.3189
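If the shutdown file really has several hundred rows and the sensor log is large, an interval join scales better than the element-wise sapply. A sketch with data.table::foverlaps, assuming the same sensor_data and shutdowns objects already converted to POSIXct as above:
library(data.table)
sensors <- as.data.table(sensor_data)
shut    <- as.data.table(shutdowns)
# foverlaps needs an interval in both tables, so give each measurement
# a zero-width interval [time, time]
sensors[, `:=`(start = time, end = time)]
setkey(shut, shutdown_start, shutdown_end)
ov <- foverlaps(sensors, shut, by.x = c("start", "end"),
                type = "within", nomatch = NA)
running  <- ov[is.na(shutdown_start)]    # logged while the machine was on
shutdown <- ov[!is.na(shutdown_start)]   # logged during a shutdown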
