R httr content(req) to data frame after receiving data from API

I am trying to batch geocode a group of addresses through the US Census Geocoder: http://geocoding.geo.census.gov/geocoder/
I have found this question:
Posting to and Receiving data from API using httr in R
and Hadley's suggestion works perfectly to send my data frame to the API and get the geocoded addresses back. The problem I am running into is how to get the returned data back into a data frame. I would've commented on his response there, but since this is a new account I am not yet able to comment.
So my code is as follows:
req <- POST("http://geocoding.geo.census.gov/geocoder/geographies/addressbatch",
            body = list(
              addressFile = upload_file("mydata.csv"),
              benchmark = "Public_AR_Census2010",
              vintage = "Census2010_Census2010"
            ),
            encode = "multipart",
            verbose())
stop_for_status(req)
content(req)
When I run content(req), I get data that looks like this:
"946\",\"123 MY STREET, ANYTOWN, TX,
99999\",\"Match\",\"Non_Exact\",\"123 MY STREET, ANYTOWN, TX,
99999\",\"-75.43486,80.423775\",\"95495654\",\"L\",\"99\",\"999\",\"021999\",\"3
005\"\n\"333\",\"456 MY STREET, ANYTOWN, TX,
99999\",\"Match\",\"Exact\",\"456 MY STREET, ANYTOWN, TX,
99999\",\"-75.38545,80.383747\",\"6546542\",\"R\",\"99\",\"999\",\"021999\",\"3002\"\n\
I've tried using the jsonlite approach mentioned here: Successfully coercing paginated JSON object to R dataframe
as well as googling httr/content to data frame, and haven't had any luck. The closest I have come to getting what I want is using
cat(content(req, "text"), "\n") which gets results that look like a CSV I could use as a data frame:
"476","123 MY STREET, ANYTOWN, TX, 99999","Match","Exact",
"123 MY STREET, ANYTOWN, TX,
99999","-75.438644,80.426025","654651321","L","99","999","0219999","3013"
But I was also unable to find any help on getting the results of a cat() into a data frame, as I believe cat() only prints its arguments rather than returning them.
When I use a browser and upload a csv I get a csv back that has the following columns:
RowID, Address, Match, MatchType, MatchedAddress, Lat, Long, StreetSide, State, County, Tract, Block
I would prefer to do this all through R, so my end result needs to be a data frame with those columns. The data is there in the content(req), I just haven't figured out how to get it in a data frame.
Thanks for the help!

Use textConnection to make it a one-liner:
df <- read.csv(textConnection(content(req, 'text')))
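Depending on whether the returned text includes a header row, read.csv() may treat the first record as column names. A variation that sets this explicitly (the column names here are my own guesses based on the output shown above):
# set header = FALSE if the response has no header row; col.names are guesses
df <- read.csv(textConnection(content(req, "text")),
               header = FALSE, stringsAsFactors = FALSE,
               col.names = c("RowID", "Address", "Match", "MatchType",
                             "MatchedAddress", "LongLat", "TigerLineID",
                             "Side", "State", "County", "Tract", "Block"))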

Perhaps now over 6 months later, this question has been resolved. But in case others have the same issue:
The problem is that your expected column list doesn't match what the service actually returns: you have two separate headers (Lat and Long) for what comes back as a single coordinates field, and you are missing a header for the extra ID field that follows it. You also can't use the headers provided by the Census Bureau, because they do not provide a complete header row for all variables. First send the output to a CSV file:
cat(content(req, "text"), file="reqoutput.csv")
Then read it back in as a dataframe, providing your own header row:
reqdata <- read.csv(file = "reqoutput.csv", skip = 1,
                    col.names = c('RowID', 'Address', 'Match', 'MatchType',
                                  'MatchedAddress', 'LongLat', 'thing',
                                  'Streetside', 'State', 'County', 'Tract',
                                  'Block'))
In your example output, note that the Census Bureau provides the coordinates as one field in double quotes, and it is longitude followed by latitude.
After the coordinates there is a nine-digit string of numbers; I don't know what that is, so I called it 'thing'.
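If you want separate longitude and latitude columns like the browser download gives you, you can split the combined field afterwards. A minimal sketch building on the reqdata frame above (the nine-digit 'thing' field may be a TIGER/Line identifier, but that is only a guess):
library(tidyr)
# split the "long,lat" field into two numeric columns
reqdata <- separate(reqdata, LongLat, into = c("Longitude", "Latitude"),
                    sep = ",", convert = TRUE)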

Related

Geocode Error: is.character(location) is not TRUE

I'm trying to use geocode to get the latitudes and longitudes of city names. I had been able to successfully use geocode on full addresses that combined street addresses and city names, but I wanted to try to recover a few more of the observations that geocode had missed the first time by passing in only the city name rather than the full address.
I currently have the code
addresses_pt_2 <- tibble("address_2" = lat_longs_na$city_country)
addresses_pt_2$address_2 <- as.character(addresses_pt_2$address_2)
where lat_longs_na is a dataframe containing just the city for each address that geocode had returned as NA the first time. I even then made the address feature a character just in case. However, when I run
lat_longs_pt_2 <- addresses_pt_2 %>%
  geocode(address_2, method = 'osm')
I get the following error: "Error in geocode(., address_2, method = "osm") : is.character(location) is not TRUE"
I've seen that there are posts about this but haven't been able to find anything that fixes it. I tried making it a data frame instead of a tibble and adding stringsAsFactors = FALSE, and I've updated the package, but nothing works, and I don't understand why, given that geocode worked the first time with the same formatting.
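The error message refers to a location argument, which matches ggmap::geocode() rather than tidygeocoder::geocode(), so one common cause is that ggmap is also loaded and masking the function you mean to call. A hedged sketch, assuming the tidygeocoder version is the one you want:
library(dplyr)
library(tidygeocoder)
# the explicit namespace avoids any masking by ggmap
lat_longs_pt_2 <- addresses_pt_2 %>%
  tidygeocoder::geocode(address = address_2, method = 'osm')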

How do you download data from API and export it into a nice CSV file to query?

I am trying to figure out how to 'download' data into a nice CSV file that I can then analyse.
I am currently looking at WHO data here:
I am doing so by following the documentation and getting output like so:
test_data <- jsonlite::parse_json(url("http://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple"))
head(test_data)
This gives me a rather messy list of lists of lists.
It is not very easy to analyse and rather messy. How could I clean this up so that I keep only a couple of the columns returned by parse_json, say the dim fields like REGION, YEAR and COUNTRY, together with the values from the Value column? I would like to make this into a nice data frame/CSV file so I can then more easily understand what is happening.
Can anyone give any advice?
jsonlite::fromJSON gives you the data in a better format, and the 3rd element in the list is where the main data is:
url <- 'https://apps.who.int/gho/athena/api/GHO/WHS6_102.json?profile=simple'
tmp <- jsonlite::fromJSON(url)
data <- tmp[[3]]
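From there you can keep just the dimensions you care about and write them to a CSV. This is only a sketch: with profile=simple the dimension labels usually come back as a nested dim data frame inside that element, so run str(data) first and adjust the column names if yours differ.
# assumes data$dim holds REGION, YEAR and COUNTRY; check str(data) to confirm
out <- data.frame(REGION  = data$dim$REGION,
                  YEAR    = data$dim$YEAR,
                  COUNTRY = data$dim$COUNTRY,
                  Value   = data$Value,
                  stringsAsFactors = FALSE)
write.csv(out, "who_whs6_102.csv", row.names = FALSE)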

Skipping Error in a Loop of Google Trends requests

For my Bachelor Thesis I need to pull Google Trends data for several brands in different countries.
As I am totally new to R, a friend of mine helped me create the code for a loop which does this automatically.
After a while the error
data must be a data frame, or other object coercible by fortify(), not a list
appears and the loop stops. When checking the Google Trends page itself, I found out that there is not enough data to support the request.
My question now would be whether it is possible to continue the loop regardless of the error and just "skip" the request responsible for it.
I already looked around in other threads, but try() appears not to work here, or I did it wrong.
I also changed low_search_volume = FALSE, which is the default, to TRUE, but that didn't change anything.
for (row in 1:nrow(my_data)) {
  country_code <- as.character(my_data[row, "Country_Code"])
  query <- as.character(my_data[row, "Brand"])
  trend <- gtrends(
    c(query),
    geo = country_code,
    category = 68,
    low_search_volume = TRUE,
    time = "all"
  )
  plot(trend)
  export <- trend[["interest_over_time"]]
  filepath <- paste(
    "C:\\Users\\konst\\Desktop\\Uni\\Bachelorarbeit\\R\\Ganzer Datensatz\\",
    query, "_", country_code,
    ".csv",
    sep = ""
  )
  write.csv(export, filepath)
}
To reproduce the error, use the following list:
Brand Country Code
Gucci MA
Gucci US
allsaints MA
allsaints US
The allsaints MA request should produce the error; therefore, the allsaints US request will not be processed.
Thank you all in advance for your assistance.
Best wishes from Hamburg, Germany
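A minimal sketch of one way to skip the failing requests: wrap the gtrends() call in tryCatch() and move on when it errors or returns no interest_over_time element (the fortify() error appears to come from plotting a result that has no interest-over-time data).
library(gtrendsR)
for (row in 1:nrow(my_data)) {
  country_code <- as.character(my_data[row, "Country_Code"])
  query <- as.character(my_data[row, "Brand"])
  # return NULL instead of stopping the loop when the request fails
  trend <- tryCatch(
    gtrends(query, geo = country_code, category = 68,
            low_search_volume = TRUE, time = "all"),
    error = function(e) NULL
  )
  # skip brand/country pairs with no usable data
  if (is.null(trend) || is.null(trend[["interest_over_time"]])) next
  plot(trend)
  export <- trend[["interest_over_time"]]
  filepath <- paste0("C:\\Users\\konst\\Desktop\\Uni\\Bachelorarbeit\\R\\Ganzer Datensatz\\",
                     query, "_", country_code, ".csv")
  write.csv(export, filepath)
}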

Extracting/Parsing from a PDF to CSV using R?

I am trying to extract data from a poorly formatted PDF into a .csv file for geocoding. The data I am concerned with are the locations of Farmers' Markets in Colorado for 2018 (https://www.colorado.gov/pacific/sites/default/files/Colorado%20Farmers%27%20Markets.pdf). The necessary fields I am looking to have are Business_Name, Address, City, State, Zip, Hours, Season, Email, and Website. The trouble is that the data are all in one column, and not all of the entries have 100% complete data. That is to say that one entry may have five attributes under it (name, address, hours, zip, website) and another may only have 2 lines of the attributes (name, address).
I found an embedded map of locations here (http://www.coloradofarmers.org/find-markets/) that references the PDF file above. I was able to save this map to MyMaps and copy/paste the table to a CSV, but there are missing entries.
Is there a way to cleanly parse this data from PDF to CSV? I imagine what I need to do is create a dictionary of Colorado towns with markets (e.g. 'Denver', 'Canon City', 'Telluride') and then have R look through the column, putting every line that falls between two look-up cities into the previous city's row as separate field columns, or as one comma-delimited field to parse out afterwards based on what the fields look like.
Here's what I have so far:
#Set the working directory
setwd("C:/Users/bwhite/Desktop")
#download the PDF of data
?download.file
download.file("https://www.colorado.gov/pacific/sites/default/files/Colorado%20Farmers%27%20Markets.pdf", destfile = "./ColoradoMarkets2018.pdf", method = "auto", quiet = FALSE, mode = "w", cacheOK=TRUE)
#import the pdf table library from CRAN
install.packages("pdftables")
library(pdftables)
#convert pdf to CSV
?convert_pdf
convert_pdf("Colorado Farmers' Markets.pdf",output_file = "FarmersMarkets.csv",
format = "csv", message = TRUE, api_key = "n7qgsnz2nkun")
# read in CSV
Markets18 <-read.csv("./FarmersMarkets.csv")
#create a look-up table list of Colorado cities
install.packages("htmltab")
library(htmltab)
CityList <-htmltab("https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Colorado",1)
names(CityList)
Any help is appreciated.
You can only attempt to extract information that is consistent. I'm not an expert, but I tried to build some logic for part of it. Pages 2-20 are mostly free of dirty data. Also, if you notice, each entry can (for the most part) be split at "p.m.". Since the number of columns differs between entries, it was difficult to build a single rule, and even the extracted data frame would require some further transformation.
library(pdftools)
library(plyr)
# one character string per PDF page
text <- pdf_text("Colorado Farmers' Markets.pdf")
text4 <- data.frame(Reduce(rbind, text), row.names = c(), stringsAsFactors = FALSE)
new <- data.frame()
for (i in 2:20) {                      # pages 2-20 are reasonably clean
  page <- text4[i, 1]
  entries <- strsplit(page, 'p.m.')    # split each page into market entries
  final <- data.frame(Reduce(rbind, entries), row.names = c(), stringsAsFactors = FALSE)
  for (j in 1:dim(final)[1]) {
    # split each entry into its lines and bind them as one row
    entry <- strsplit(final[j, ], '\n')
    new <- rbind.fill(new, data.frame(t(data.frame(entry, row.names = c()))))
  }
}
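If the rows in new look usable, writing them out for geocoding is then just:
write.csv(new, "FarmersMarkets_parsed.csv", row.names = FALSE)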

Retrieving subsets of data with EUtilsGet

I am new to R and RISmed, so please accept my apologies if this is a very simple question.
I have been following a tutorial on how to get data on a large number of references from PubMed. When I use:
pubmed_data <- data.frame('Title'=ArticleTitle(records),'Abstract'=AbstractText(records))
head(pubmed_data,1)
It returns the data for Title and Abstract as expected; however, when I add instructions to return Author, Journal, Year, Country and Keyword, it still only returns the Title and Abstract. What am I missing? I use the following code:
pubmed_data <- data.frame('Title' = ArticleTitle(records),
                          'Abstract' = AbstractText(records),
                          'Journal' = Journal(records),
                          'Year' = DateCreated(records),
                          'Author' = AuthorList(records),
                          'Country' = Country(records),
                          'Keyword' = KeywordList(records))
head(pubmed_data,1)
Besides the ArticleTitle and AbstractText functions, you're also using Journal, DateCreated, AuthorList, Country, and KeywordList. Some of these functions are not listed in the package reference manual (of those, only Country appears to be a valid function).
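Before building the data frame, it can help to check what the Medline object returned by EUtilsGet() actually exposes, so that you only call accessors that exist. A minimal sketch, assuming records is that object:
slotNames(records)    # fields stored in the Medline object
ls("package:RISmed")  # accessor functions the package exports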
