I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, as it treats the whole page as a table. In a Google spreadsheet, I can enter this:
=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)
and this function will download the 3rd table, which lists all the counties of the UP of Michigan, from that page.
Is there something similar in R, or can one be created via a user-defined function?
Building on Andrie's answer, and addressing SSL: if you can take one additional library dependency:
library(httr)
library(XML)
url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"
r <- GET(url)
doc <- readHTMLTable(
  doc = content(r, "text"))
doc[6]
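If you only need one of the tables, readHTMLTable's which argument parses just that one and returns a single data.frame rather than a list; a minimal sketch, assuming the county table is still the sixth table on the page:
library(httr)
library(XML)

url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"
r <- GET(url)

# which = 6 pulls out only the sixth table as a data.frame
counties <- readHTMLTable(doc = content(r, "text"), which = 6)
head(counties)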
The function readHTMLTable in package XML is ideal for this.
Try the following:
library(XML)
doc <- readHTMLTable(
  doc = "http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
doc[[6]]
V1 V2 V3 V4
1 County Population Land Area (sq mi) Population Density (per sq mi)
2 Alger 9,862 918 10.7
3 Baraga 8,735 904 9.7
4 Chippewa 38,413 1561 24.7
5 Delta 38,520 1170 32.9
6 Dickinson 27,427 766 35.8
7 Gogebic 17,370 1102 15.8
8 Houghton 36,016 1012 35.6
9 Iron 13,138 1166 11.3
10 Keweenaw 2,301 541 4.3
11 Luce 7,024 903 7.8
12 Mackinac 11,943 1022 11.7
13 Marquette 64,634 1821 35.5
14 Menominee 25,109 1043 24.3
15 Ontonagon 7,818 1312 6.0
16 Schoolcraft 8,903 1178 7.6
17 TOTAL 317,258 16,420 19.3
readHTMLTable returns a list of data.frames, one for each table element of the HTML page. You can use names() to get information about each element:
> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"
Here is a solution that works with the secure (https) link:
install.packages("htmltab")
library(htmltab)
htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)
One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you:
http://www.omegahat.org/RGoogleDocs/run.html
You can then use the =ImportHtml Google Docs function with all its pre-built magic.
A tidyverse solution using rvest. It's very useful if you need to find the table based on some keywords, for example in the table headers. Here is an example where we want to get the table on Vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse available tables on the page.
library(magrittr)
library(rvest)
library(stringr) # for str_which()
# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>%
  # list all tables on the page
  html_nodes(css = "table") %>%
  # select the one containing the needed key words
  extract2(., str_which(string = ., pattern = "Live births")) %>%
  # convert to a table
  html_table(fill = TRUE) %>%
  View()
That table is the only table which is a child of the second td child element, so you can specify that pattern with CSS. Rather than use a type selector of table to grab the child table, you can use its class, which is faster:
library(rvest)
t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>%
html_node('td:nth-child(2) .wikitable') %>%
html_table()
print(t)
As this announcement mentions (https://www.facebook.com/business/news/transparency-social-issue-electoral-political-ads), new targeting information (or a summary of it) has been made available in the Facebook Ad Library.
I usually use the 'Radlibrary' package in R, but I can't seem to find any fields in 'Radlibrary' that allow me to get this information. Does anyone know either how to access this information from the Radlibrary package in R (preferred, since this is what I know and usually work with) or how to access it from the API in another way?
I use it to look at how politicians choose to target their ads, so it would be too big a task to manually look everything up at facebook.com/ads/library.
EDIT
The targeting I refer to is what you find when browsing the Ad Library, as in the screenshots below.
Thanks for highlighting this data being published, which I did not know had been announced. I just registered for an API token to play around with it.
It seems to me that looking for ads from a particular politician or organisation is a question of downloading large amounts of data and then manipulating it in R. For example, to recreate the curl query on the API docs page:
curl -G \
-d "search_terms='california'" \
-d "ad_type=POLITICAL_AND_ISSUE_ADS" \
-d "ad_reached_countries=['US']" \
-d "access_token=<ACCESS_TOKEN>" \
"https://graph.facebook.com/<API_VERSION>/ads_archive"
We can simply do:
library(Radlibrary)

# enter token interactively so it doesn't get added to R history
token <- readline()

query <- adlib_build_query(
  search_terms = "california",
  ad_reached_countries = 'US',
  ad_type = "POLITICAL_AND_ISSUE_ADS"
)
response <- adlib_get(params = query, token = token)
results_df <- Radlibrary::as_tibble(response, censor_access_token = TRUE)
This seems to return what one would expect:
names(results_df)
# [1] "id" "ad_creation_time" "ad_creative_bodies" "ad_creative_link_captions" "ad_creative_link_titles" "ad_delivery_start_time"
# [7] "ad_snapshot_url" "bylines" "currency" "languages" "page_id" "page_name"
# [13] "publisher_platforms" "estimated_audience_size_lower" "estimated_audience_size_upper" "impressions_lower" "impressions_upper" "spend_lower"
# [19] "spend_upper" "ad_creative_link_descriptions" "ad_delivery_stop_time"
library(dplyr)
results_df |>
group_by(page_name) |>
summarise(n = n()) |>
arrange(desc(n))
# # A tibble: 237 x 2
# page_name n
# <chr> <int>
# 1 Senator Brian Dahle 169
# 2 Katie Porter 122
# 3 PragerU 63
# 4 Results for California 28
# 5 Big News Buzz 20
# 6 California Water Service 20
# 7 Cancer Care is Different 17
# 8 Robert Reich 14
# 9 Yes On 28 14
# 10 Protect Tribal Gaming 13
# # ... with 227 more rows
Now, assuming that you are interested specifically in the ads by Senator Brian Dahle, it does not appear that you can send a query for all ads he has placed (i.e. using the page_name parameter in the query). But you can request all political ads in the relevant area (setting the limit parameter to a high number) with a particular search_term or search_page_id, and then filter the data down to the relevant person.
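For example, a rough sketch of that filter step with dplyr, assuming the page_name values match the byline exactly:
library(dplyr)

dahle_ads <- results_df %>%
  filter(page_name == "Senator Brian Dahle") %>%
  arrange(desc(ad_creation_time))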
I have searched everywhere trying to find an answer to this question and I haven't quite found what I'm looking for yet so I'm hoping asking directly will help.
I am working with the USPS Tracking API, which provides output in XML format. The API is limited to 35 results per call (i.e. you can only provide 35 tracking numbers to get info on each time you call the API), and I need information on ~90,000 tracking numbers, so I am running my calls in a for loop. I was able to store the results of the calls in a list, but then I had trouble exporting the list as-is into anything usable. However, when I tried to convert the results from the list into JSON, it dropped the attribute tag, which contained the tracking number I had used to generate the results.
Here is what a sample result looks like:
<TrackResponse>
<TrackInfo ID="XXXXXXXXXXX1">
<TrackSummary> Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830.</TrackSummary>
<TrackDetail>February 6 6:49 am NOTICE LEFT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830</TrackDetail>
<TrackDetail>February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805</TrackDetail>
<TrackDetail>February 5 7:28 pm ENROUTE 33699</TrackDetail>
<TrackDetail>February 5 7:18 pm ACCEPT OR PICKUP 33699</TrackDetail>
Here is the script I ran to get the output I'm currently working with:
library(xml2)     # read_xml()
library(jsonlite) # toJSON()

final_tracking_info <- list()
for (i in 1:x) { # where x = the number of calls to the API the loop will need to make
usps = input_tracking_info[i] # input_tracking_info = GET commands
usps = read_xml(usps)
final_tracking_info[[i]] <- usps$TrackResponse
gc()
}
final_output <- toJSON(final_tracking_info)
write(final_output,"final_tracking_info.json") # tried converting to JSON, lost the ID attribute
cat(capture.output(print(final_tracking_info), file = "Final_Tracking_Info.txt")) # exported the list to a text file, was not an ideal format to work with
What I ultimately want to get from this data is a table containing the tracking number, the first track detail, and the last track detail. What I'm wondering is: is there a better way to compile this in XML/JSON that will make it easier to convert to a tibble/df down the line? Is there any easy way or preferred format to select from, given that I know most of the columns will have the same name ("TrackDetail") and the data frames will have different lengths (since each package will have a different number of track details), when I'm trying to compile thousands of results into one final output?
Using XML::xmlToList() will store the ID attribute in .attrs:
$TrackSummary
[1] " Your item was delivered at 6:50 am on February 6 in BARTOW FL 33830."
$TrackDetail
[1] "February 6 6:49 am NOTICE LEFT BARTOW FL 33830"
$TrackDetail
[1] "February 6 6:48 am ARRIVAL AT UNIT BARTOW FL 33830"
$TrackDetail
[1] "February 6 3:49 am ARRIVAL AT UNIT LAKELAND FL 33805"
$TrackDetail
[1] "February 5 7:28 pm ENROUTE 33699"
$TrackDetail
[1] "February 5 7:18 pm ACCEPT OR PICKUP 33699"
$.attrs
ID
"XXXXXXXXXXX1"
A way of using that output which assumes that the Summary and ID are always present as first and last elements, respectively, is:
library(magrittr) # provides %>%

xml_data <- XML::xmlToList("71563898.xml") %>%
  unlist() %>% # flattening
  unname()     # removing names

data.frame(
  ID = tail(xml_data, 1),      # getting last element
  Summary = head(xml_data, 1), # getting first element
  Info = xml_data %>% head(-1) %>% tail(-1) # remove first and last elements
)
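To compile many responses into one table (the ~90,000 tracking numbers in the question), the same idea can be wrapped in a function and applied over all saved responses. A rough sketch, where parse_tracking and files are hypothetical names and each response is assumed to have the shape shown above (summary first, ID last, TrackDetail elements in between):
library(magrittr) # %>%
library(dplyr)    # bind_rows()

parse_tracking <- function(path) {
  xml_data <- XML::xmlToList(path) %>% unlist() %>% unname()
  data.frame(
    ID          = tail(xml_data, 1),              # the ID attribute comes out last
    Summary     = head(xml_data, 1),              # the TrackSummary comes out first
    FirstDetail = xml_data[2],                    # first TrackDetail
    LastDetail  = xml_data[length(xml_data) - 1], # last TrackDetail
    stringsAsFactors = FALSE
  )
}

# files is a hypothetical character vector of paths to the saved XML responses
all_tracking <- bind_rows(lapply(files, parse_tracking))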
I am trying to scrape a website link. So far I have downloaded the text and set it as a data frame. I have the following:
keywords <- c(credit | model)
text_df <- as.data.frame.table(text_df)
text_df %>%
filter(str_detect(text, keywords))
where credit and model are two values I want to search the website for, i.e. return rows containing the word credit or model.
I get the following error
Error in filter_impl(.data, dots) : object 'credit' not found
The code only returns the results with the word "model" in them and ignores the word "credit".
How can I go about returning all results with either the word "credit" or "model" in them?
My plan is to have keywords <- c(credit | model | more_key_words | something_else | many values)
Thanks in advance.
EDIT:
text_df:
Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use
So I am trying to extract just rows 1 and 3.
I think the issue is you need to pass a string as an argument to str_detect. To check for "credit" or "model" you can paste them into a single string separated by |.
library(tidyverse)
library(stringr)
text_df <- read_table("Var 1 text
1 Here is some credit information
2 Some text which does not expalin any keywords but messy <li> text9182edj </i>
3 This line may contain the keyword model
4 another line which contains nothing of use")
keywords <- c("credit", "model")
any_word <- paste(keywords, collapse = "|")
text_df %>% filter(str_detect(text, any_word))
#> # A tibble: 2 x 3
#> Var `1` text
#> <int> <chr> <chr>
#> 1 1 Here is some credit information
#> 2 3 This line may contain the keyword model
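If you also want to avoid partial matches (for example "model" matching inside "modelling"), you could wrap the keywords in word boundaries before passing them to str_detect; a small sketch:
any_word_exact <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
text_df %>% filter(str_detect(text, any_word_exact))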
OK, I have checked it, and I think it will not work your way, as you must use the or (|) operator inside filter(), not inside str_detect().
So it would work this way:
keywords <- c("virg", "tos")
library(dplyr)
library(stringr)
iris %>%
filter(str_detect(Species, keywords[1]) | str_detect(Species, keywords[2]))
As keywords[1], keywords[2], etc., you have to specify each keyword from this variable individually.
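If the keyword vector grows, writing out one str_detect() per keyword gets tedious; one way around that (just a sketch that builds the same logic programmatically) is to combine the logical vectors with Reduce():
library(dplyr)
library(stringr)

keywords <- c("virg", "tos")

iris %>%
  filter(Reduce(`|`, lapply(keywords, function(k) str_detect(Species, k))))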
I would recommend staying away from regex when you're dealing with words. There are packages tailored for your particular task that you can use. Try, for example, the following
library(corpus)
text <- readLines("http://norvig.com/big.txt") # sherlock holmes
terms <- c("watson", "sherlock holmes", "elementary")
text_locate(text, terms)
## text before instance after
## 1 1 …Book of The Adventures of Sherlock Holmes
## 2 27 Title: The Adventures of Sherlock Holmes
## 3 40 … EBOOK, THE ADVENTURES OF SHERLOCK HOLMES ***
## 4 50 SHERLOCK HOLMES
## 5 77 To Sherlock Holmes she is always the woman. I…
## 6 85 …," he remarked. "I think, Watson , that you have put on seve…
## 7 89 …t a trifle more, I fancy, Watson . And in practice again, I …
## 8 145 …ere's money in this case, Watson , if there is nothing else.…
## 9 163 …friend and colleague, Dr. Watson , who is occasionally good …
## 10 315 … for you. And good-night, Watson ," he added, as the wheels …
## 11 352 …s quite too good to lose, Watson . I was just balancing whet…
## 12 422 …as I had pictured it from Sherlock Holmes ' succinct description, but…
## 13 504 "Good-night, Mister Sherlock Holmes ."
## 14 515 …t it!" he cried, grasping Sherlock Holmes by either shoulder and loo…
## 15 553 "Mr. Sherlock Holmes , I believe?" said she.
## 16 559 "What!" Sherlock Holmes staggered back, white with…
## 17 565 …tter was superscribed to " Sherlock Holmes , Esq. To be left till call…
## 18 567 "MY DEAR MR. SHERLOCK HOLMES ,--You really did it very w…
## 19 569 …est to the celebrated Mr. Sherlock Holmes . Then I, rather imprudentl…
## 20 571 …s; and I remain, dear Mr. Sherlock Holmes ,
## ⋮ (189 rows total)
Note that this matches the term regardless of the case.
For your specific use case, do
ix <- text_detect(text, terms)
or
matches <- text_subset(text, terms)
I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and same code structure as the Lego example, I get NAs (warnings along the lines of "...using first" and "NAs introduced by coercion", followed by [1] NA). My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_node(".stats") %>%
html_text() %>%
as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only 1 number is returned.
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
  html_nodes(".stats") %>%
  html_text()

blogname <- d1 %>%
  html_nodes(".tlink") %>%
  html_text()
Note that it is html_nodes (plural)
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. On the other hand, the stats info comes out mixed. This is because the stats class is the same for the Facebook, Twitter, and Alexa figures, so they are indistinguishable from one another. As a result the output vector repeats the information every three numbers, that is stats = c(fb, tw, alx, fb, tw, alx, ...). You should separate each vector from this one, for example:
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
html_node('table') %>% # select enclosing table node
html_table() %>% # turn table into data.frame
set_names(make.names) %>% # make names syntactic
mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info
mutate_at(3:5, parse_number) %>% # make numbers actually numbers
tbl_df() # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
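For example, a quick check along those lines (just a sketch) is to count how many values failed to parse in the numeric columns:
# any NAs left after parse_number?
game_blogs %>% summarise_at(3:5, ~ sum(is.na(.)))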
This uses html_nodes (plural) and str_replace_all to remove the commas in the numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_nodes(".stats") %>%
html_text() %>%
str_replace_all(',', '') %>%
as.numeric()
I'm trying to read information from the Zillow API and am running into some data structure issues in R. My outputs are supposed to be XML and appear to be, but they aren't behaving like XML.
Specifically, the object that GetSearchResults() returns to me is in a format similar to XML, but not quite right to read in R's XML reading functions.
Can you tell me how I should approach this?
#set directory
setwd('[YOUR DIRECTORY]')
# setup libraries
library(dplyr)
library(XML)
library(ZillowR)
library(RCurl)
# setup api key
set_zillow_web_service_id('[YOUR API KEY]')
xml = GetSearchResults(address = '120 East 7th Street', citystatezip = '10009')
data = xmlParse(xml)
This throws the following error:
Error: XML content does not seem to be XML
The Zillow API documentation clearly states that the output should be XML, and it certainly looks like it. I'd like to be able to easily access various components of the API output for larger-scale data manipulation / aggregation. Let me know if you have any ideas.
This was a fun opportunity for me to get acquainted with the Zillow API. My approach, following How to parse XML to R data frame, was to convert the response to a list, for ease of inspection. The onerous bit was figuring out the structure of the data through inspecting the list, particularly because each property might have some missing data. This was why I wrote the getValRange function to deal with parsing the Zestimate data.
results <- xmlToList(xml$response[["results"]])
getValRange <- function(x, hilo) {
  ifelse(hilo %in% unlist(dimnames(x)), x["text", hilo][[1]], NA)
}

out <- apply(results, MAR = 2, function(property) {
  zpid <- property$zpid
  links <- unlist(property$links)
  address <- unlist(property$address)
  z <- property$zestimate
  zestdf <- list(
    amount = ifelse("text" %in% names(z$amount), z$amount$text, NA),
    lastupdated = z$"last-updated",
    valueChange = ifelse(length(z$valueChange) == 0, NA, z$valueChange),
    valueLow = getValRange(z$valuationRange, "low"),
    valueHigh = getValRange(z$valuationRange, "high"),
    percentile = z$percentile)
  list(id = zpid, links, address, zestdf)
})

data <- as.data.frame(do.call(rbind, lapply(out, unlist)),
                      row.names = seq_len(length(out)))
Sample output:
> data[,c("id", "street", "zipcode", "amount")]
id street zipcode amount
1 2098001736 120 E 7th St APT 5A 10009 2321224
2 2101731413 120 E 7th St APT 1B 10009 2548390
3 2131798322 120 E 7th St APT 5B 10009 2408860
4 2126480070 120 E 7th St APT 1A 10009 2643454
5 2125360245 120 E 7th St APT 2A 10009 1257602
6 2118428451 120 E 7th St APT 4A 10009 <NA>
7 2125491284 120 E 7th St FRNT 1 10009 <NA>
8 2126626856 120 E 7th St APT 2B 10009 2520587
9 2131542942 120 E 7th St APT 4B 10009 1257676
# setup libraries
pacman::p_load(dplyr,XML,ZillowR,RCurl) # I use pacman, you don't have to
# setup api key
set_zillow_web_service_id('X1-mykey_31kck')
xml <- GetSearchResults(address = '120 East 7th Street', citystatezip = '10009')
dat <- unlist(xml)
str(dat)
Named chr [1:653] "120 East 7th Street" "10009" "Request successfully
processed" "0" "response" "results" "result" "zpid" "text"
"2131798322" "links" ...
- attr(*, "names")= chr [1:653] "request.address" "request.citystatezip" "message.text" "message.code" ...
dat <- as.data.frame(dat)
dat <- gsub("text","", dat$dat)
I'm not exactly sure what you wanted to do with these results but they're there and they look fine:
head(dat, 20)
[1] "120 East 7th Street"
[2] "10009"
[3] "Request successfully processed"
[4] "0"
[5] "response"
[6] "results"
[7] "result"
[8] "zpid"
[9] ""
[10] "2131798322"
[11] "links"
[12] "homedetails"
[13] ""
[14] "http://www.zillow.com/homedetails/120-E-7th-St-APT-5B-New-York-NY-10009/2131798322_zpid/"
[15] "mapthishome"
[16] ""
[17] "http://www.zillow.com/homes/2131798322_zpid/"
[18] "comparables"
[19] ""
[20] "http://www.zillow.com/homes/comps/2131798322_zpid/"
As stated previously, the trick is to get the API response into a list (as opposed to XML). Then it becomes quite simple to pull out whatever data you are interested in.
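As a rough illustration of that, reusing the results object built with xmlToList() in the answer above (and assuming the same response shape, where each column of the converted result corresponds to one property):
results <- XML::xmlToList(xml$response[["results"]])
first_property <- results[, 1]           # one column per property
first_property$address$street            # e.g. the street address
first_property$zestimate$amount$text     # e.g. the Zestimate amount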
I wrote an R package that simplifies this. Take a look on GitHub: https://github.com/billzichos/homer. It comes with a vignette.
Assuming the Zillow ID of the property you were interested in was 36086728, the code would look like.
home_estimate("36086728")