searchTwitter('patriots', geocode='42.375,-71.1061111,10mi')
This query returns a list of tweets. However, most of these tweets have no location:
retweeted longitude latitude
1 FALSE <NA> <NA>
2 FALSE <NA> <NA>
3 FALSE <NA> <NA>
...
Why is that? How did Twitter know that these tweets fall within the range of the search query? Is there a way to get an estimate of the coordinates of these tweets?
The documentation for search/tweets says:
The location is preferentially taking from the Geotagging API, but will fall back to their Twitter profile.
You should find that, if you check the user of each tweet, they have set their profile location and it falls within 10 miles of 42.375,-71.1061111.
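A sketch of how to check that with twitteR (assuming setup_twitter_oauth() has already been run; the search call is the one from the question):
library(twitteR)

# repeat the search, then look up each tweet's author
tweets <- searchTwitter('patriots', geocode = '42.375,-71.1061111,10mi')
tweets.df <- twListToDF(tweets)
users <- lookupUsers(tweets.df$screenName)
users.df <- twListToDF(users)

# the self-reported profile locations the search fell back on
head(users.df$location)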
I am trying to scrape data from this traffic camera map:
https://traffic.houstontranstar.org/layers/layers_ve.aspx?&inc=true&rc=true&lc=false&cam=true&dms=false&spd=true&tr=false&wthr=false&rfw=true&rm=false&nml=false&nmd=false
The map shows the positions of a bunch of traffic cameras, marked with camera icons.
I am currently able to manually inspect element at each camera location and get the source link for the image in the camera's video feed. For example, I can get this link here for an example camera: https://www.houstontranstar.org/snapshots/cctv/913.jpg?arg=1661886693594
However, I also need the locations of these cameras. Does anyone know of a way to get the geographic location of each camera on the map?
My second question: does anyone know of a good way to iterate over the cameras? I'm completely new to web scraping, so I'm currently just inspecting the elements of the website manually and going from there.
Any help would be great, thanks!
The following code is one way of obtaining those cameras' locations (lat/lon):
import requests
import pandas as pd

# JSON feed that backs the camera layer on the Transtar map
url = 'https://traffic.houstontranstar.org/data/layers/cctvSnapshots_json.js?arg=1661888238910'
r = requests.get(url)

# each entry carries the camera's location name, id, lat/lng, and more
df = pd.DataFrame(r.json()['cameras'])
print(df)
This returns a dataframe with 1126 rows × 8 columns:
                    location    id      lat      lng    dir  fc   fp                                                                    url
0      10 EAST # SAN JACINTO  1002  29.7682 -95.3557  South   6  300      showWin('/layers/gc.aspx?cam=1002&loc=IH-10_East_at_SAN_JACINTO')
1  10 EAST # SAN JACINTO (E)  1003  29.7698 -95.3505  South   1    0  showWin('/layers/gc.aspx?cam=1003&loc=IH-10_East_at_SAN_JACINTO_(E)')
2           10 EAST # JENSEN  1004  29.7703 -95.3437  North   1    0           showWin('/layers/gc.aspx?cam=1004&loc=IH-10_East_at_JENSEN')
3            10 East # Gregg  1005  29.7699 -95.3357  North   1    0            showWin('/layers/gc.aspx?cam=1005&loc=IH-10_East_at_Gregg')
4             10 East # Waco  1006  29.7729 -95.3246  North   1    0             showWin('/layers/gc.aspx?cam=1006&loc=IH-10_East_at_Waco')
[...]
The id value appears to be the same number used in each camera's snapshot URL (compare the example link https://www.houstontranstar.org/snapshots/cctv/913.jpg above), so you should be able to iterate over this data frame to reach every camera's feed.
I am trying to change a set of zip codes into states. However, the result comes back in a different order than what I input, except for null values. Below is a different set I created, which produces the same issue. (I'm importing my actual file from a CSV, if that is relevant.)
I'm using the zipcodeR package.
zipcodestest = as.data.frame(c('85364','91910','30004','filler','90210','help'))
colnames(zipcodestest) = "zip"
statetest =as.data.frame(reverse_zipcode(zipcodestest$zip)$state)
zipcodestest$statetest = statetest
View(zipcodestest)
The states are showing up in a different order than the zips. Is there a way I can make sure they pair up properly?
Thanks so much.
The rows come back from reverse_zipcode() in a different order than the zips you feed it, so instead of binding the result on as a new column, join it back to your data by the zip code itself:
library(dplyr)
library(zipcodeR)

zipcodestest %>%
  left_join(reverse_zipcode(.$zip),
            by = c(zip = 'zipcode')) %>%
  select(zip, state)
zip state
1 85364 AZ
2 91910 CA
3 30004 GA
4 filler <NA>
5 90210 CA
6 help <NA>
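For what it's worth, a base R equivalent of the join uses match() to line the results back up with the input zips:
rz <- reverse_zipcode(zipcodestest$zip)
zipcodestest$state <- rz$state[match(zipcodestest$zip, rz$zipcode)]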
My dataset looks something like this (note: the dataset below is hypothetical).
Objective: a sales employee has to go to a particular location to verify the houses/stores/buildings, and a device captures the information below.
Sr.No.  Store_Name    Phone-No.  Agent_id  Area            Lat-Long
1       ABC Stores    89099090   121       Bay Area        23.909090,89.878798
2       Wuhan Masks   45453434   122       Santa Fe        24.452134,78.123243
3       Twitter Cafe  67556090   123       Middle East     11.889766,23.334483
4       abc           33445569   121       Santa Cruz      23.345678,89.234213
5       Silver Gym    11004110   234       Worli Sea Link  56.564311, 78.909087
6       CK Clothings  00908876   223       90 th Street    34.445887, 12.887654
Facts:
#1 There is no clean unique identifier for finding duplicates – check Sr.No. 1 & 4: they are basically the same outlet.
In this dummy dataset, all of the columns can be manipulated for the same store/house/building-outlet:
a) Since the name is entered manually, the same house/store can be entered into the system under different names, so multiple visits can happen.
b) The mobile number can also be manipulated; a different number can be associated with the same outlet.
c) The device the agent uses to capture the lat-long can also be fudged, by moving closer to or just near the building.
Problem:
How can I make the lat-long data the unique identifier for finding duplicates in a huge dataset, keeping point c) above in mind?
Deploying QR codes is not very helpful either, as these can also be tweaked.
The aim is to stop fraudulent practice by employees (the same employee can visit the same store/outlet again, or a different employee can visit the same outlet, to increase the visit count).
Right now I can only think of the Lat-Long column for making a UID; please feel free to suggest anything else.
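One way to build the tolerance from point c) into a lat-long identifier is to snap each coordinate pair to a grid cell and treat the cell as the key. A minimal R sketch with made-up data (the 3-decimal rounding, roughly a 100 m cell, is an arbitrary choice; a distance-threshold comparison would be more robust for points that straddle a cell boundary):
# hypothetical visits: rows 1 and 2 are the same outlet captured a few
# metres apart because the agent moved closer to the building
visits <- data.frame(
  sr  = c(1, 4, 5),
  lat = c(23.909090, 23.909123, 56.564311),
  lon = c(89.878798, 89.878810, 78.909087)
)

# round to ~3 decimal places and use the rounded pair as the key
visits$loc_key <- paste(round(visits$lat, 3), round(visits$lon, 3), sep = ",")

# rows sharing a key are candidate duplicate visits
visits[duplicated(visits$loc_key) | duplicated(visits$loc_key, fromLast = TRUE), ]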
I collected some twitter data doing this:
#connect to twitter API
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
lats=c(38.9,40.7)
lons=c(-77,-74)
roger=do.call(rbind, lapply(1:length(lats), function(i)
  searchTwitter('Roger+Federer', lang="en", n=N, resultType="recent",
                geocode=paste(lats[i], lons[i], paste0(S,"mi"), sep=","))))
After this I've done:
rogerlat=sapply(roger, function(x) as.numeric(x$getLatitude()))
rogerlat=sapply(rogerlat, function(z) ifelse(length(z)==0,NA,z))
rogerlon=sapply(roger, function(x) as.numeric(x$getLongitude()))
rogerlon=sapply(rogerlon, function(z) ifelse(length(z)==0,NA,z))
data=as.data.frame(cbind(lat=rogerlat,lon=rogerlon))
And now I would like to get all the tweets that have long and lat values:
data=filter(data, !is.na(lat),!is.na(lon))
lonlat=select(data,lon,lat)
But now I only get NA values... Any thoughts on what is going wrong here?
As Chris mentioned, searchTwitter does not return the lat-long of a tweet. You can see this by going to the twitteR documentation, which tells us that it returns a status object.
Status Objects
Scrolling down to the status object, you can see that 11 pieces of information are included, but lat-long is not one of them. However, we are not completely lost, because the user's screen name is returned.
If we look at the user object, we see that a user object at least includes a location.
So I can think of at least two possible solutions, depending on what your use case is.
Solution 1: Extracting a User's Location
# Search for recent Trump tweets #
tweets <- searchTwitter('Trump', lang="en",n=N,resultType="recent",
geocode='38.9,-77,50mi')
# If you want, convert tweets to a data frame #
tweets.df <- twListToDF(tweets)
# Look up the users #
users <- lookupUsers(tweets.df$screenName)
# Convert users to a dataframe, look at their location#
users_df <- twListToDF(users)
table(users_df[1:10, 'location'])
❤ Texas ❤ ALT.SEATTLE.INTERNET.UR.FACE
2 1 1
Japan Land of the Free New Orleans
1 1 1
Springfield OR USA United States USA
1 1 1
# Note that these will be the users' self-reported locations,
# so potentially they are not that useful
Solution 2: Multiple searches with limited radius
The other solution would be to conduct a series of repeated searches, incrementing your latitude and longitude in small steps and using a small radius for each search. That way you can be relatively sure that the user is close to your specified location.
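A rough sketch of that approach (the grid spacing, radius, and query below are arbitrary choices; watch the API rate limits if the grid gets large):
# sweep a grid of centre points, each with a small radius, and
# record which centre produced each tweet
lats <- seq(38.9, 40.7, by = 0.5)
lons <- seq(-77, -74, by = 0.5)
grid <- expand.grid(lat = lats, lon = lons)

results <- lapply(seq_len(nrow(grid)), function(i) {
  geo <- paste(grid$lat[i], grid$lon[i], "5mi", sep = ",")
  tw  <- searchTwitter('Trump', lang = "en", n = 100,
                       resultType = "recent", geocode = geo)
  if (length(tw) == 0) return(NULL)
  df <- twListToDF(tw)
  df$centre_lat <- grid$lat[i]   # tweet is within 5mi of this point
  df$centre_lon <- grid$lon[i]
  df
})
results <- do.call(rbind, results)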
Not necessarily an answer, but more an observation too long for comment:
First, you should look at the documentation of how to input geocode data. Using twitteR:
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
#set radius and amount of requests
N=200 # tweets to request from each query
S=200 # radius in miles
Geodata should be structured like this (lat, lon, radius):
geo <- '40,-75,200km'
And then called using:
roger <- searchTwitter('Roger+Federer',lang="en",n=N,resultType="recent",geocode=geo)
Then, I would instead use twListToDF to filter:
roger <- twListToDF(roger)
Which now gives you a data.frame with 16 cols and 200 observations (set above).
You could then filter using:
setDT(roger) #from data.table
roger[latitude > 38.9 & latitude < 40.7 & longitude > -77 & longitude < -74]
That said (and why this is an observation vs. an answer) - it looks as though twitteR does not return lat and lon (it is all NA in the data I returned) - I think this is to protect individual users' locations.
That said, adjusting the radius does affect the number of results, so the code does have access to the geo data somehow.
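You can check that for yourself by shrinking the radius and comparing the counts (a quick sanity check, not a rigorous test; counts will vary from run to run):
# fewer results should come back as the radius shrinks, even though
# the individual tweets carry no coordinates
n_small <- length(searchTwitter('Roger+Federer', lang="en", n=N,
                                resultType="recent", geocode='40,-75,10km'))
n_large <- length(searchTwitter('Roger+Federer', lang="en", n=N,
                                resultType="recent", geocode='40,-75,200km'))
c(small = n_small, large = n_large)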
Assuming that some tweets were downloaded, there are some geo-referenced tweets and some tweets without geographical coordinates:
prod(dim(data)) > 1 & prod(dim(data)) != sum(is.na(data)) & any(is.na(data))
# TRUE
Let's simulate data between your longitude/latitude points for simplicity.
set.seed(123)
data <- data.frame(lon=runif(200, -77, -74), lat=runif(200, 38.9, 40.7))
data[sample(1:200, 10),] <- NA
Rows with longitude/latitude data can be selected by removing the 10 rows with missing data.
data2 <- data[-which(is.na(data[, 1])), c("lon", "lat")]
nrow(data) - nrow(data2)
# 10
The last line replaces the last two lines of your code. However, note that this only works if the missing geographical coordinates are stored as NA.
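Equivalently, since the missing coordinates are stored as NA, base R's na.omit() does the same row removal in one step:
data2 <- na.omit(data)[, c("lon", "lat")]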
I am trying to figure out how to download weather data (temperature, radiation, etc.) by zip code tabulation area (ZCTA). Although Census data are available by ZCTA, this is not the case for weather data.
I tried to find information at http://cdo.ncdc.noaa.gov/qclcd/QCLCD?prior=N but couldn't figure it out.
Has any of you ever downloaded weather data by ZCTA? If not, does anyone have experience converting weather observation station information to ZCTA?
The National Weather Service provides two web-based APIs to extract weather forecast information from the National Digital Forecast Database (NDFD): a SOAP interface and a REST interface. Both return data in Digital Weather Markup Language (DWML), which is an XML dialect. The data elements which can be returned are listed here.
IMO the REST interface is by far the easier to use. Below is an example where we extract the forecast temperature, relative humidity, and wind speed for Zip Code 10001 (Lower Manhattan) in 3-hour increments for the next 5 days.
# NOAA NWS REST API Example
# 3-hourly forecast for Lower Manhattan (Zip Code: 10001)
library(httr)
library(XML)
url <- "http://graphical.weather.gov/xml/sample_products/browser_interface/ndfdXMLclient.php"
response <- GET(url,query=list(zipCodeList="10001",
product="time-series",
begin=format(Sys.Date(),"%Y-%m-%d"),
Unit="e",
temp="temp",rh="rh",wspd="wspd"))
doc <- content(response,type="text/xml") # XML document with the data
# extract the date-times
dates <- doc["//time-layout/start-valid-time"]
dates <- as.POSIXct(xmlSApply(dates,xmlValue),format="%Y-%m-%dT%H:%M:%S")
# extract the actual data
data <- doc["//parameters/*"]
data <- sapply(data,function(d)removeChildren(d,kids=list("name")))
result <- do.call(data.frame,lapply(data,function(d)xmlSApply(d,xmlValue)))
colnames(result) <- sapply(data,xmlName)
# combine into a data frame
result <- data.frame(dates,result)
head(result)
# dates temperature wind.speed humidity
# 1 2014-11-06 19:00:00 52 8 96
# 2 2014-11-06 22:00:00 50 7 86
# 3 2014-11-07 01:00:00 50 7 83
# 4 2014-11-07 04:00:00 47 11 83
# 5 2014-11-07 07:00:00 45 14 83
# 6 2014-11-07 10:00:00 50 16 61
It is possible to query multiple zip codes in a single request, but this complicates parsing the returned XML.
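For reference, the request side is straightforward: if I remember the NDFD documentation correctly, zipCodeList accepts several zip codes separated by spaces (treat that as an assumption and check the docs). It is matching each returned <location> to its own time layouts that gets fiddly:
# assumption: zipCodeList takes a space-separated list of zip codes
response2 <- GET(url, query = list(zipCodeList = "10001 60601",
                                   product = "time-series",
                                   begin = format(Sys.Date(), "%Y-%m-%d"),
                                   Unit = "e",
                                   temp = "temp"))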
To get the NOAA QCLCD data into zip codes, you need to use the latitude/longitude values from the station.txt file and compare them with data from the Census. This can only be done with GIS-related tools. My solution is to use a PostGIS-enabled database so you can use the ST_MakePoint function:
ST_MakePoint(longitude, latitude)
You would then need to load the ZCTA boundaries from the Census Bureau into the database as well, to determine which zip codes contain which stations. The ST_Contains function will help with that:
ST_Contains(zip_way, ST_MakePoint(longitude, latitude))
A full query might look something like this:
SELECT s.wban, z.zip5, s.state, s.location
FROM public.station s
INNER JOIN public.zip z
  ON ST_Contains(z.way, ST_MakePoint(s.longitude, s.latitude));
I'm obviously making assumptions about the column names, but the above should be a great starting point.
You should be able to accomplish the same tasks with QGIS (Free) or ArcGIS (expensive) too. That gets rid of the overhead of installing a PostGIS enabled database, but I'm not as familiar with the required steps in those software packages.
Weather data is only available for weather stations, and there is not a weather station for each ZCTA (ZCTAs are much smaller than the regions covered by weather stations).
I have seen options on the NOAA website where you can enter a latitude and longitude and it will find the weather from the appropriate weather station. So if you can convert your ZCTAs of interest into lat/lon pairs (center, random corner, etc.) you could submit those to the website. But note that if you do this for a large number of ZCTAs that are close together, you will be downloading redundant information. It would be better to do a one-time matching of ZCTA to weather station, then download weather info from each station only once, and then merge with the ZCTA data.
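If you go the one-time-matching route, the core step is a nearest-neighbour match between ZCTA centroids and station coordinates (for example from the station.txt file mentioned in the other answer). A bare-bones R sketch, with made-up coordinates and column names, using the standard haversine formula:
# great-circle distance in km (haversine)
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

# hypothetical inputs: one row per ZCTA centroid / weather station
zcta     <- data.frame(zcta = c("10001", "60601"),
                       lat = c(40.75, 41.89), lon = c(-74.00, -87.62))
stations <- data.frame(id  = c("A", "B"),
                       lat = c(40.78, 41.79), lon = c(-73.97, -87.75))

# for each ZCTA, keep the nearest station
zcta$station <- sapply(seq_len(nrow(zcta)), function(i) {
  d <- haversine_km(zcta$lat[i], zcta$lon[i], stations$lat, stations$lon)
  stations$id[which.min(d)]
})
zcta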