Zillow API with R - XML issue

I'm trying to read information from the Zillow API and am running into some data-structure issues in R. The output is supposed to be XML, and it appears to be, but it doesn't behave like XML.
Specifically, the object that GetSearchResults() returns is in a format similar to XML, but not quite right for R's XML-parsing functions.
Can you tell me how I should approach this?
#set directory
setwd('[YOUR DIRECTORY]')
# setup libraries
library(dplyr)
library(XML)
library(ZillowR)
library(RCurl)
# setup api key
set_zillow_web_service_id('[YOUR API KEY]')
xml = GetSearchResults(address = '120 East 7th Street', citystatezip = '10009')
data = xmlParse(xml)
This throws the following error:
Error: XML content does not seem to be XML
The Zillow API documentation clearly states that the output should be XML, and it certainly looks like it. I'd like to be able to easily access various components of the API output for larger-scale data manipulation / aggregation. Let me know if you have any ideas.

This was a fun opportunity for me to get acquainted with the Zillow API. My approach, following "How to parse XML to R data frame", was to convert the response to a list for ease of inspection. The onerous bit was figuring out the structure of the data by inspecting the list, particularly because each property might have some missing data. This is why I wrote the getValRange function to deal with parsing the Zestimate data.
# the $response element is already parsed XML, so convert the results node to a list
results <- xmlToList(xml$response[["results"]])

# return the low/high end of the valuation range if present, otherwise NA
getValRange <- function(x, hilo) {
  ifelse(hilo %in% unlist(dimnames(x)), x["text", hilo][[1]], NA)
}

out <- apply(results, MAR = 2, function(property) {
  zpid <- property$zpid
  links <- unlist(property$links)
  address <- unlist(property$address)
  z <- property$zestimate
  zestdf <- list(
    amount = ifelse("text" %in% names(z$amount), z$amount$text, NA),
    lastupdated = z$"last-updated",
    valueChange = ifelse(length(z$valueChange) == 0, NA, z$valueChange),
    valueLow = getValRange(z$valuationRange, "low"),
    valueHigh = getValRange(z$valuationRange, "high"),
    percentile = z$percentile)
  list(id = zpid, links, address, zestdf)
})

data <- as.data.frame(do.call(rbind, lapply(out, unlist)),
                      row.names = seq_len(length(out)))
Sample output:
> data[,c("id", "street", "zipcode", "amount")]
id street zipcode amount
1 2098001736 120 E 7th St APT 5A 10009 2321224
2 2101731413 120 E 7th St APT 1B 10009 2548390
3 2131798322 120 E 7th St APT 5B 10009 2408860
4 2126480070 120 E 7th St APT 1A 10009 2643454
5 2125360245 120 E 7th St APT 2A 10009 1257602
6 2118428451 120 E 7th St APT 4A 10009 <NA>
7 2125491284 120 E 7th St FRNT 1 10009 <NA>
8 2126626856 120 E 7th St APT 2B 10009 2520587
9 2131542942 120 E 7th St APT 4B 10009 1257676
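One caveat: because unlist() coerces every field to character, numeric columns such as amount need an explicit conversion before any larger-scale aggregation (a small sketch):
# amount comes back as character (or factor); convert before aggregating
data$amount <- as.numeric(as.character(data$amount))
mean(data$amount, na.rm = TRUE)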

# setup libraries
pacman::p_load(dplyr, XML, ZillowR, RCurl) # I use pacman, you don't have to
# setup api key
set_zillow_web_service_id('X1-mykey_31kck')
xml <- GetSearchResults(address = '120 East 7th Street', citystatezip = '10009')
# flatten the nested response into a named character vector
dat <- unlist(xml)
str(dat)
Named chr [1:653] "120 East 7th Street" "10009" "Request successfully processed" "0" "response" "results" "result" "zpid" "text" "2131798322" "links" ...
- attr(*, "names")= chr [1:653] "request.address" "request.citystatezip" "message.text" "message.code" ...
dat <- as.data.frame(dat)
# strip the "text" labels that unlist() leaves in the values
dat <- gsub("text", "", dat$dat)
I'm not exactly sure what you wanted to do with these results, but they're there and they look fine:
head(dat, 20)
[1] "120 East 7th Street"
[2] "10009"
[3] "Request successfully processed"
[4] "0"
[5] "response"
[6] "results"
[7] "result"
[8] "zpid"
[9] ""
[10] "2131798322"
[11] "links"
[12] "homedetails"
[13] ""
[14] "http://www.zillow.com/homedetails/120-E-7th-St-APT-5B-New-York-NY-10009/2131798322_zpid/"
[15] "mapthishome"
[16] ""
[17] "http://www.zillow.com/homes/2131798322_zpid/"
[18] "comparables"
[19] ""
[20] "http://www.zillow.com/homes/comps/2131798322_zpid/"

As stated previously, the trick is to get the API response into a list (as opposed to XML). Then it becomes quite simple to pull out whatever data you are interested in.
I wrote an R package that simplifies this. Take a look on github - https://github.com/billzichos/homer. It comes with a vignette.
Assuming the Zillow ID of the property you were interested in was 36086728, the code would look like this:
home_estimate("36086728")

Related

How do I access (new July 2022) targeting information from the Facebook Ad Library API (R solution preferred)?

As this announcement mentions (https://www.facebook.com/business/news/transparency-social-issue-electoral-political-ads), new targeting information (or a summary of it) has been made available in the Facebook Ad Library.
I am used to using the 'Radlibrary' package in R, but I can't seem to find any fields in 'Radlibrary' that allow me to get this information. Does anyone know how to access this information from the Radlibrary package in R (preferred, since this is what I know and usually work with), or how to access it from the API in another way?
I use it to look at how politicians choose to target their ads, which is why it would be too big a task to look it up manually at facebook.com/ads/library.
EDIT
The targeting I refer to is the kind found when browsing the Ad Library, as in the screenshots below.
Thanks for highlighting this data release, which I did not know had been announced. I just registered for an API token to play around with it.
It seems to me that looking for ads from a particular politician or organisation is a question of downloading large amounts of data and then manipulating it in R. For example, to recreate the curl query on the API docs page:
curl -G \
-d "search_terms='california'" \
-d "ad_type=POLITICAL_AND_ISSUE_ADS" \
-d "ad_reached_countries=['US']" \
-d "access_token=<ACCESS_TOKEN>" \
"https://graph.facebook.com/<API_VERSION>/ads_archive"
We can simply do:
# enter token interactively so it doesn't get added to R history
token <- readline()

query <- adlib_build_query(
  search_terms = "california",
  ad_reached_countries = 'US',
  ad_type = "POLITICAL_AND_ISSUE_ADS"
)

response <- adlib_get(params = query, token = token)
results_df <- Radlibrary::as_tibble(response, censor_access_token = TRUE)
This seems to return what one would expect:
names(results_df)
# [1] "id" "ad_creation_time" "ad_creative_bodies" "ad_creative_link_captions" "ad_creative_link_titles" "ad_delivery_start_time"
# [7] "ad_snapshot_url" "bylines" "currency" "languages" "page_id" "page_name"
# [13] "publisher_platforms" "estimated_audience_size_lower" "estimated_audience_size_upper" "impressions_lower" "impressions_upper" "spend_lower"
# [19] "spend_upper" "ad_creative_link_descriptions" "ad_delivery_stop_time"
library(dplyr)
results_df |>
  group_by(page_name) |>
  summarise(n = n()) |>
  arrange(desc(n))
# # A tibble: 237 x 2
# page_name n
# <chr> <int>
# 1 Senator Brian Dahle 169
# 2 Katie Porter 122
# 3 PragerU 63
# 4 Results for California 28
# 5 Big News Buzz 20
# 6 California Water Service 20
# 7 Cancer Care is Different 17
# 8 Robert Reich 14
# 9 Yes On 28 14
# 10 Protect Tribal Gaming 13
# # ... with 227 more rows
Now, assuming that you are interested specifically in the ads by Senator Brian Dahle: it does not appear that you can send a query for all ads he has placed (i.e. by using page_name as a query parameter). But you can request all political ads in the area (setting the limit parameter to a high number) with a particular search_terms or search_page_ids value, and then filter the data down to the relevant person.
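For example, with the results above in results_df, filtering down to a single advertiser is straightforward (a sketch reusing the page_name value from the sample output):
library(dplyr)

# keep only the ads placed by the page of interest
dahle_ads <- results_df |>
  filter(page_name == "Senator Brian Dahle")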

(v)matchPattern DNAStringSetList of Codons to Reference DNAString

I am assessing the impact of hotspot single nucleotide polymorphisms (SNPs) from a next generation sequencing (NGS) experiment on the protein sequence of a virus. I have the reference DNA sequence and a list of hotspots. I need to first figure out the reading frame of where these hotspots are seen. To do this, I generated a DNAStringSetList with all human codons and want to use vmatchPattern or matchPattern from the Biostrings package to figure out where the hotspots land in the codon reading frame.
I often struggle with lapply and other apply functions, so I tend to use for loops instead. I am trying to improve in this area, so I would welcome an apply solution should one be available.
Here is the code for the list of codons:
library(Biostrings)

alanine <- DNAStringSet("GCN")
arginine <- DNAStringSet(c("CGN", "AGR", "CGY", "MGR"))
asparagine <- DNAStringSet("AAY")
aspartic_acid <- DNAStringSet("GAY")
asparagine_or_aspartic_acid <- DNAStringSet("RAY")
cysteine <- DNAStringSet("TGY")
glutamine <- DNAStringSet("CAR")
glutamic_acid <- DNAStringSet("GAR")
glutamine_or_glutamic_acid <- DNAStringSet("SAR")
glycine <- DNAStringSet("GGN")
histidine <- DNAStringSet("CAY")
start <- DNAStringSet("ATG")
isoleucine <- DNAStringSet("ATH")
leucine <- DNAStringSet(c("CTN", "TTR", "CTY", "YTR"))
lysine <- DNAStringSet("AAR")
methionine <- DNAStringSet("ATG")
phenylalanine <- DNAStringSet("TTY")
proline <- DNAStringSet("CCN")
serine <- DNAStringSet(c("TCN", "AGY"))
threonine <- DNAStringSet("ACN")
tryptophan <- DNAStringSet("TGG")  # TGG codes for tryptophan
tyrosine <- DNAStringSet("TAY")    # TAY codes for tyrosine
valine <- DNAStringSet("GTN")
stop <- DNAStringSet(c("TRA", "TAR"))

codons <- DNAStringSetList(list(alanine, arginine, asparagine, aspartic_acid,
                                asparagine_or_aspartic_acid, cysteine, glutamine,
                                glutamic_acid, glutamine_or_glutamic_acid, glycine,
                                histidine, start, isoleucine, leucine, lysine,
                                methionine, phenylalanine, proline, serine,
                                threonine, tryptophan, tyrosine, valine, stop))
Current for loop code:
reference_stringset <- DNAStringSet(covid)
codon_locations <- list()
for (i in 1:length(codons)) {
  pattern <- codons[[i]]
  codon_locations[i] <- vmatchPattern(pattern, reference_stringset)
}
Current error (I am filtering the codon DNAStringSetList so that it is a DNAStringSet):
Error in normargPattern(pattern, subject) : 'pattern' must be a single string or an XString object
I can't give out the exact nucleotide sequence, but here is the SARS-CoV-2 genome (link: https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta) to use as a reprex:
# for those not used to .fasta files: first copy and paste the genome into Notepad and save it as a .fasta file
# then use readDNAStringSet from the Biostrings package to read in the .fasta file
filepath = # insert file path
covid <- readDNAStringSet(filepath)
For the current code, change the way the codons object is formed. Currently the output of codons looks like this:
DNAStringSetList of length 24
[[1]] GCN
[[2]] CGN AGR CGY MGR
[[3]] AAY
[[4]] GAY
[[5]] RAY
[[6]] TGY
[[7]] CAR
[[8]] GAR
[[9]] SAR
[[10]] GGN
...
<14 more elements>
Change it from a DNAStringSetList to one conglomerate DNAStringSet of the amino acids:
codons <- DNAStringSet(c(alanine, arginine, asparagine, aspartic_acid,
                         asparagine_or_aspartic_acid, cysteine, glutamine,
                         glutamic_acid, glutamine_or_glutamic_acid, glycine,
                         histidine, start, isoleucine, leucine, lysine,
                         methionine, phenylalanine, proline, serine,
                         threonine, tryptophan, tyrosine, valine, stop))
codons
DNAStringSet object of length 32:
width seq
[1] 3 GCN
[2] 3 CGN
[3] 3 AGR
[4] 3 CGY
[5] 3 MGR
... ... ...
[28] 3 TGG
[29] 3 TAY
[30] 3 GTN
[31] 3 TRA
[32] 3 TAR
When I run the script, I get the following output with the SARS-CoV-2 isolate used in the example (showing a small slice):
codon_locations[27:28]
[[1]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 0 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[[2]]
MIndex object of length 1
$`NC_045512.2 Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome`
IRanges object with 554 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 89 91 3
[2] 267 269 3
[3] 283 285 3
[4] 352 354 3
[5] 358 360 3
... ... ... ...
[550] 29261 29263 3
[551] 29289 29291 3
[552] 29472 29474 3
[553] 29559 29561 3
[554] 29793 29795 3
Looking at the ones that returned matches, only the patterns with standard nucleotides ("ATCG", no wobbles) found anything; the codons containing wobble codes will need to be changed as well before they can be searched.
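One possible fix for the wobble codons, as a sketch (the fixed = FALSE suggestion is mine, not part of the original post): vmatchPattern() has a fixed argument, and fixed = FALSE makes Biostrings interpret IUPAC ambiguity codes such as N, R, and Y as the sets of bases they stand for rather than as literal characters.
# loop over the combined DNAStringSet, letting wobble codes match
codon_locations <- lapply(seq_along(codons), function(i) {
  vmatchPattern(codons[[i]], reference_stringset, fixed = FALSE)
})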
If you're on Twitter, I suggest sharing the question with the #rstats, #bioconductor, and #bioinformatics hashtags to generate more traction; I've noticed that bioinformatics-specific questions on SO don't generate as much buzz.

R: Retrieving multiple variables from a nested list

I am looking at vote data stored as a nested list, and I am trying to get multiple variables from each element of my list (example below).
So for each element "vote" I am trying to get the uid and the list of individuals that voted for or against ("pours" and "contres") the law.
I have tried to simplify the original data (it can be found here).
This is the simplified list I came up with:
scrutin1_detail <- list(uid = "VTANR5L14V1", organref = "P0644420")
scrutin1_vote1_for <- list(acteurref = "PA1816", mandatRef = "PM645051")
scrutin1_vote2_for <- list(acteurref = "PA1817", mandatRef = "PM645052")
scrutin1_vote3_for <- list(acteurref = "PA1818", mandatRef = "PM645053")
scrutin1_vote_for <- list(scrutin1_vote1_for, scrutin1_vote2_for, scrutin1_vote3_for)
scrutin1_vote1_against <- list(acteurref = "PA1816", mandatRef = "PM645051")
scrutin1_vote2_against <- list(acteurref = "PA1817", mandatRef = "PM645052")
scrutin1_vote3_against <- list(acteurref = "PA1818", mandatRef = "PM645053")
scrutin1_vote_against <- list(scrutin1_vote1_against, scrutin1_vote2_against, scrutin1_vote3_against)
votant1 <- list(pours = scrutin1_vote_for, contres = scrutin1_vote_against)
vote1 <- list(decompte_nominatif = votant1)
ventilationVotes1 <- list(vote = vote1)
scrutin1 <- list(scrutin1_detail, list(ventilationVotes = ventilationVotes1))

# Scrutin 2
scrutin2_detail <- list(uid = "VTANR5L14V5", organref = "P0644423")
scrutin2_vote1_for <- list(acteurref = "PA1816", mandatRef = "PM645051")
scrutin2_vote2_for <- list(acteurref = "PA1817", mandatRef = "PM645052")
scrutin2_vote3_for <- list(acteurref = "PA1818", mandatRef = "PM645053")
scrutin2_vote_for <- list(scrutin2_vote1_for, scrutin2_vote2_for, scrutin2_vote3_for)
scrutin2_vote1_against <- list(acteurref = "PA1816", mandatRef = "PM645051")
scrutin2_vote2_against <- list(acteurref = "PA1817", mandatRef = "PM645052")
scrutin2_vote3_against <- list(acteurref = "PA1818", mandatRef = "PM645053")
scrutin2_vote_against <- list(scrutin2_vote1_against, scrutin2_vote2_against, scrutin2_vote3_against)
scrutin2_votant1 <- list(pours = scrutin2_vote_for, contres = scrutin2_vote_against)
scrutin2_vote1 <- list(decompte_nominatif = scrutin2_votant1)
scrutin2_ventilationVotes1 <- list(vote = scrutin2_vote1)
scrutin2 <- list(scrutin2_detail, list(ventilationVotes = scrutin2_ventilationVotes1))

scrutins <- list(scrutins = list(scrutin = list(scrutin1, scrutin2)))
So, in the end (but I am really interested in understanding how to do this, as I run into the problem quite often), I am looking to build a data frame with these columns:
- uid
- for/against (whether the entry was in the "pours" (for) or "contres" (against) list)
- acteurref
- mandatref
Sadly I don't speak (or read) French and so am not able to make many correct guesses as to the meaning of the names of items in the object constructed using alistaire's suggestion:
library(jsonlite)
scrutin1_detail <- fromJSON("~/Downloads/Scrutins_XIV.json")
> length(scrutin1_detail[[1]])
[1] 1
> length(scrutin1_detail[[1]][[1]])
[1] 18
> names(scrutin1_detail[[1]][[1]])
[1] "#xmlns:xsi" "uid"
[3] "numero" "organeRef"
[5] "legislature" "sessionRef"
[7] "seanceRef" "dateScrutin"
[9] "quantiemeJourSeance" "typeVote"
[11] "sort" "titre"
[13] "demandeur" "objet"
[15] "modePublicationDesVotes" "syntheseVote"
[17] "ventilationVotes" "miseAuPoint"
> str(scrutin1_detail[[1]][[1]]$uid)
chr [1:1219] "VTANR5L14V1" "VTANR5L14V2" "VTANR5L14V3" ...
> table( scrutin1_detail[[1]][[1]]$organeRef)
PO644420
1219
> table( scrutin1_detail[[1]][[1]]$sessionRef)
SCR5A2012E1 SCR5A2012E2 SCR5A2013E1 SCR5A2013E3 SCR5A2013O1 SCR5A2014E1
15 5 42 4 529 50
SCR5A2014E2 SCR5A2014O1 SCR5A2015E1 SCR5A2015E2 SCR5A2015O1 SCR5A2016O1
7 253 18 5 236 55
Maybe you should help us Anglophones make sense of this. It is very beneficial to provide context rather than just code.
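For the simplified list in the question, a base-R sketch (the helper name extract_scrutin is made up for illustration): walk each scrutin, pull out its uid, then flatten the "pours" and "contres" voter lists into data-frame rows.
extract_scrutin <- function(scrutin) {
  uid <- scrutin[[1]]$uid
  counts <- scrutin[[2]]$ventilationVotes$vote$decompte_nominatif
  do.call(rbind, lapply(c("pours", "contres"), function(side) {
    do.call(rbind, lapply(counts[[side]], function(v) {
      data.frame(uid = uid, side = side,
                 acteurref = v$acteurref, mandatref = v$mandatRef,
                 stringsAsFactors = FALSE)
    }))
  }))
}

result <- do.call(rbind, lapply(scrutins$scrutins$scrutin, extract_scrutin))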

Conversion of Elastic list data output to R data frame slow

I have an output from Elastic that takes a very long time to convert to an R data frame. I have tried multiple options and feel there may be some trick to speed up the process.
The structure of the list is as follows. The list has data aggregated over 29 days (say). If the Elastic query output is in the list v_day, then v_day[[5]]$articles_over_time$buckets[1:29] represents each of the 29 days:
length(v_day[[5]]$articles_over_time$buckets)
[1] 29
page(v_day[[5]]$articles_over_time$buckets[[1]],method="print")
$key
[1] 1446336000000
$doc_count
[1] 35332
$group_by_state
$group_by_state$doc_count_error_upper_bound
[1] 0
$group_by_state$sum_other_doc_count
[1] 0
$group_by_state$buckets
$group_by_state$buckets[[1]]
$group_by_state$buckets[[1]]$key
[1] "detail"
$group_by_state$buckets[[1]]$doc_count
[1] 876
There is a "key" value here right at the top here (1446336000000) that I am interested in (lets call it "time bucket key").
Within each day(lets take day i), "v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets" has more data I am interested in. This is an aggregation over each property (property is an entity in the scheme of things here).
page(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets,method="print")
[[1]]
[[1]]$key
[1] "detail"
[[1]]$doc_count
[1] 876
[[2]]
[[2]]$key
[1] "ff8081814fdf2a9f014fdf80b05302e0"
[[2]]$doc_count
[1] 157
[[3]]
[[3]]$key
[1] "ff80818150a7d5930150a82abbc50477"
[[3]]$doc_count
[1] 63
[[4]]
[[4]]$key
[1] "ff8081814ff5f428014ffb5de99f1da5"
[[4]]$doc_count
[1] 57
[[5]]
[[5]]$key
[1] "ff8081815038099101503823fe5d00d9"
[[5]]$doc_count
[1] 56
This shows data over 5 properties on day i; each property has a "key" (let's call it the "property bucket key") and a "doc_count" that I am interested in.
Eventually I want a data frame with "time bucket key", "property bucket key", and "doc_count".
Currently I am looping over the days with the code below:
v <- NULL
ndays <- length(v_day[[5]]$articles_over_time$buckets)
for (i in 1:ndays) {
  v1 <- do.call("rbind",
                lapply(v_day[[5]]$articles_over_time$buckets[[i]]$group_by_state$buckets,
                       data.frame))
  th_dt <- as.POSIXct(v_day[[5]]$articles_over_time$buckets[[i]]$key / 1000,
                      origin = "1970-01-01")
  v1$view_date <- th_dt
  v <- rbind(v, v1)
  msg <- sprintf("Read views for %s. Found %d \n", th_dt, sum(v1$doc_count))
  cat(msg)
}
v
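Growing v with rbind() inside the loop copies the accumulated data frame on every iteration, which is the usual culprit for this kind of slowdown. A sketch of a faster version, assuming the data.table package is acceptable: build all the per-day tables first, then bind once at the end.
library(data.table)

buckets <- v_day[[5]]$articles_over_time$buckets

day_tables <- lapply(buckets, function(day) {
  # each inner bucket is a list with $key and $doc_count
  dt <- rbindlist(day$group_by_state$buckets)
  dt[, view_date := as.POSIXct(day$key / 1000, origin = "1970-01-01")]
  dt
})

# a single bind instead of one rbind() per day
v <- rbindlist(day_tables)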

Importing wikipedia tables in R

I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, as it treats the whole page as a table. In a Google spreadsheet, I can enter this:
=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)
and this function will download the 3rd table, which lists all the counties of the UP of Michigan, from that page.
Is there something similar in R, or can it be created via a user-defined function?
Building on Andrie's answer and addressing the SSL issue, if you can take one additional library dependency:
library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"
r <- GET(url)
doc <- readHTMLTable(doc = content(r, "text"))
doc[6]
The function readHTMLTable in package XML is ideal for this.
Try the following:
library(XML)
doc <- readHTMLTable(
  doc = "http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
doc[[6]]
V1 V2 V3 V4
1 County Population Land Area (sq mi) Population Density (per sq mi)
2 Alger 9,862 918 10.7
3 Baraga 8,735 904 9.7
4 Chippewa 38,413 1561 24.7
5 Delta 38,520 1170 32.9
6 Dickinson 27,427 766 35.8
7 Gogebic 17,370 1102 15.8
8 Houghton 36,016 1012 35.6
9 Iron 13,138 1166 11.3
10 Keweenaw 2,301 541 4.3
11 Luce 7,024 903 7.8
12 Mackinac 11,943 1022 11.7
13 Marquette 64,634 1821 35.5
14 Menominee 25,109 1043 24.3
15 Ontonagon 7,818 1312 6.0
16 Schoolcraft 8,903 1178 7.6
17 TOTAL 317,258 16,420 19.3
readHTMLTable returns a list of data.frames, one for each table element on the HTML page. You can use names to get information about each element:
> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"
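Since the list is named, you can also pull a table by its name rather than its position (a small sketch using a name from the output above):
# select the land-area table by name instead of a numeric index
pop <- doc[["Upper Peninsula Land Area and Population Density by County"]]
head(pop)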
Here is a solution that works with the secure (https) link:
install.packages("htmltab")
library(htmltab)
htmltab("https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan", 3)
One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you:
http://www.omegahat.org/RGoogleDocs/run.html
You can then use the =ImportHtml Google Docs function with all its pre-built magic.
A tidyverse solution using rvest. It's very useful if you need to find a table based on some keywords, for example in the table headers. Here is an example where we want to get the table on vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse the available tables on the page.
library(magrittr)
library(rvest)
library(stringr)  # for str_which()

# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>%
  # list all tables on the page
  html_nodes(css = "table") %>%
  # select the one containing the needed key words
  extract2(., str_which(string = ., pattern = "Live births")) %>%
  # convert to a table
  html_table(fill = TRUE) %>%
  view
That table is the only table which is a child of the second td child of its parent, so you can specify that pattern with css. Rather than use a type selector of table to grab the child table, you can use its class, which is faster:
library(rvest)

t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>%
  html_node('td:nth-child(2) .wikitable') %>%
  html_table()

print(t)
