I want to download the data from this website:
http://asphaltoilmarket.com/index.php/state-index-tracker/
But the request keeps timing out. I have already tried the following methods, and each one times out as well.
library(rvest)
IndexData <- read_html("http://asphaltoilmarket.com/index.php/state-index-tracker/")

library(RCurl)
IndexData <- getURL("http://asphaltoilmarket.com/index.php/state-index-tracker/")

library(httr)
library(XML)
url <- "http://asphaltoilmarket.com/index.php/state-index-tracker/"
IndexData <- htmlParse(GET(url))
This website opens in the browser without any problem, and I am able to download this data using Excel and Alteryx.
If by "get the data", you mean "scrape the table on that page", then you just need to go a little further.
First thing, you'll want to check the sites robots.txt to see if scraping is allowed. In this case, there is no mention against scraping.
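A quick way to run that check from R is with the robotstxt package (a minimal sketch; it assumes the package is installed, and paths_allowed() simply fetches the site's robots.txt and reports whether a path may be crawled):
library(robotstxt)
# TRUE means nothing in robots.txt disallows crawling this path
paths_allowed("http://asphaltoilmarket.com/index.php/state-index-tracker/")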
You've already got the HTML for the site; you just need to find the CSS selector for what you want. You can use your browser's developer tools or something like SelectorGadget to find the table and get its CSS selector.
After that, take the HTML, extract the node you're interested in with html_node(), and then extract the table with html_table().
library(magrittr)
library(rvest)
html <- read_html("http://asphaltoilmarket.com/index.php/state-index-tracker/")
html %>%
  html_node("#tablepress-5") %>%
  html_table()
#> State Jan Feb Mar Apr May Jun Jul
#> 1 Alabama $496.27 $486.86 $482.16 $498.62 $517.44 $529.20 $536.26
#> 2 Alaska $513.33 $513.33 $513.33 $513.33 $513.33 $525.84 $535.00
#> 3 Arizona $476.00 $469.00 $466.00 $463.00 $470.00 $478.00 $480.00
#> 4 Arkansas $503.50 $500.50 $494.00 $503.00 $516.50 $521.20 $525.00
#> 5 California $305.80 $321.00 $346.20 $365.50 $390.10 $380.50 $345.50
#> 6 Colorado $228.10 $301.45 $320.58 $354.12 $348.70 $277.55 $297.23
#> 7 Connecticut $495.00 $495.00 $495.00 $495.00 $502.50 $502.50 $500.56
#> 8 Delaware $493.33 $458.33 $481.67 $496.67 $513.33 $510.00 $498.33
#> 9 Florida $507.30 $484.32 $487.12 $503.38 $518.52 $517.68 $514.03
#> 10 Georgia $515.00 $503.00 $503.00 $517.00 $534.00 $545.00 $550.00
I'm attempting to build a simple scraper, iterating through a website to pull two pieces of information and build myself a little reference list.
This is what the url looks like: "https://www.mtgstocks.com/prints/[[n]]"
The two pieces of information are the card name (Forbidden Alchemy) and card set (Innistrad).
Pretty straightforward, yeah? I thought so.
I attempted to pass every relevant anchor (CSS or XPath) to try to isolate the two variables, but was met with "{xml_nodeset (0)}".
Here's the code that I ran:
library(rvest)

# return page info
page_html <- read_html(httr::GET("https://www.mtgstocks.com/prints/1"))
# extract item name
page_html %>%
  html_nodes("h3") %>%
  html_nodes("a") %>%
  html_text()
# character(0)
I've scraped enough webpages to know that this information is being hidden, but I'm not exactly sure how. Would love help!
They are gathering the information from their API, which you can see in the Network section of the browser's developer tools. There is a ton of card info you can gather from that API - take a look.
library(tidyverse)
library(httr2)
get_card <- function(card_num) {
  cat("Scraping card num:", card_num, "\n")
  data <- str_c("https://api.mtgstocks.com/prints/", card_num) %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)
  tibble(
    name = data %>%
      pluck("name"),
    set = data %>%
      pluck("card_set") %>%
      pluck("name")
  )
}
get_card(1)
# A tibble: 1 × 2
name set
<chr> <chr>
1 Forbidden Alchemy Innistrad
Or fetch a whole range of numbers. If a card does not exist, the code returns NA.
map_dfr(1:20, possibly(get_card, otherwise = tibble(
  name = NA_character_,
  set = NA_character_
)))
# A tibble: 20 × 2
name set
<chr> <chr>
1 Forbidden Alchemy Innistrad
2 NA NA
3 Fortress Crab Innistrad
4 Frightful Delusion Innistrad
5 Full Moon's Rise Innistrad
6 Furor of the Bitten Innistrad
7 Gallows Warden Innistrad
8 Galvanic Juggernaut Innistrad
9 Garruk Relentless Innistrad
10 Gatstaf Shepherd Innistrad
11 Gavony Township Innistrad
12 Geist of Saint Traft Innistrad
13 Geist-Honored Monk Innistrad
14 Geistcatcher's Rig Innistrad
15 Geistflame Innistrad
16 Ghost Quarter Innistrad
17 Ghostly Possession Innistrad
18 Ghoulcaller's Bell Innistrad
19 Ghoulcaller's Chant Innistrad
20 Ghoulraiser Innistrad
I'm working on scraping the Lord of the Rings movie scripts from ageofthering.com. Each script is broken up across multiple pages.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(success)
Length Class Mode
[1,] 2 tbl_df list
This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for the Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result:
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements("#AutoNumber1") %>%
  html_table()
summary(fail)
Length Class Mode
0 list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
It works with XPath. Perhaps the HTML is ill-formed (the page doesn't seem very spec-compliant):
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
  html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria. Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows
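If you need every script page rather than a single one, a minimal sketch (assuming the other pages share the same structure, and that you supply the vector of page URLs yourself) is to wrap the XPath lookup in a helper and map it over the URLs:
library(rvest)
library(purrr)

get_script_table <- function(url) {
  read_html(url) %>%
    html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
    html_table() %>%
    pluck(1)  # each page yields a list containing a single tibble
}

# the two pages mentioned above; extend this vector with the remaining scene ranges
urls <- c(
  "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php",
  "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
)
scripts <- map(urls, get_script_table)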
I'm trying to do a simple scrape of the table at the following URL:
https://www.bcb.gov.br/controleinflacao/historicometas
What I notice is that when using rvest::read_html or httr::GET, and when viewing the page source, I can't see anything related to the table; but when opening the Google Chrome developer tools, I can spot the table references in the Elements tab.
Below is a simple example where I try to access the content of the URL and search for nodes that contain tables:
library( tidyverse )
library( rvest )
url <- "https://www.bcb.gov.br/controleinflacao/historicometas"
res <- url %>%
  read_html() %>%
  html_node("table")
This gives me:
{xml_nodeset (0)}
Opening the page source (view-source:https://www.bcb.gov.br/controleinflacao/historicometas) confirms that the table markup is not there, while the developer tools do show the table in the rendered DOM.
From what I have found, the scripts available in the page source load the table dynamically. I have seen some solutions that use RSelenium, but I would like to know whether there is a way to scrape this table without using RSelenium.
Some other related StackOverflow questions:
Scraping webpage (with R) where all elements are placed inside an <app-root> tag
scraping table from a website result as empty
(The last one is a Python example.)
When dealing with dynamic sites, the Network tab tends to be more useful than the Inspector. And you often don't have to scroll through hundreds of requests or pages of minified JavaScript; you can instead pick a search term from the rendered page to identify the API endpoint that delivered that piece of information.
In this case, searching for "Resolução CMN nº 2.615" pointed to the correct call: most of the site's content (as plain HTML) is delivered as JSON.
library(tibble)
library(rvest)
historicometas <- jsonlite::read_json("https://www.bcb.gov.br/api/paginasite/sitebcb/controleinflacao/historicometas")
historicometas$conteudo %>%
  read_html() %>%
  html_element("table") %>%
  html_table()
#> # A tibble: 27 × 7
#> Ano Norma Data Meta …¹ Taman…² Inter…³ Infla…⁴
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1999 Resolução CMN nº 2.615 30/6… 8 2 6-10 8,94
#> 2 2000 Resolução CMN nº 2.615 30/6… 6 2 4-8 5,97
#> 3 2001 Resolução CMN nº 2.615 30/6… 4 2 2-6 7,67
#> 4 2002 Resolução CMN nº 2.744 28/6… 3,5 2 1,5-5,5 12,53
#> 5 2003* Resolução CMN nº 2.842Resolução … 28/6… 3,254 22,5 1,25-5… 9,309,…
#> 6 2004* Resolução CMN nº 2.972Resolução … 27/6… 3,755,5 2,52,5 1,25-6… 7,60
#> 7 2005 Resolução CMN nº 3.108 25/6… 4,5 2,5 2-7 5,69
#> 8 2006 Resolução CMN nº 3.210 30/6… 4,5 2,0 2,5-6,5 3,14
#> 9 2007 Resolução CMN nº 3.291 23/6… 4,5 2,0 2,5-6,5 4,46
#> 10 2008 Resolução CMN nº 3.378 29/6… 4,5 2,0 2,5-6,5 5,90
#> # … with 17 more rows, and abbreviated variable names ¹`Meta (%)`,
#> # ²`Tamanhodo intervalo +/- (p.p.)`, ³`Intervalode tolerância (%)`,
#> # ⁴`Inflação efetiva(Variação do IPCA, %)`
Created on 2022-10-17 with reprex v2.0.2
I am new to web scraping, and after a couple of Wikipedia pages I found this page from which I want to extract the tables for all the portfolio managers. I have not been able to make the approaches I found online work here. I thought it would be easy since it's just a table, but I am not able to extract even a single table after filling out the form. Can someone please tell me how I could get this done in R?
https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes
library(tidyverse)
library(rvest)
library(httr)
library(RCurl)
url <- "https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes"
result <- postForm(url,
                   pmrId = "RIGHT HORIZONS PORTFOLIO MANAGEMENT PRIVATE LIMITED",
                   year = "2022",
                   month = "August")
attr(result,"Content-Type")
result
If you change those passed values to the corresponding value attributes of the form's options (i.e. "8" instead of "August", as in <option value="8">August</option>), you should be all set.
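For illustration, here is a sketch of the question's postForm() call with those value attributes substituted (the pmrId value is the one used for AEQUITAS in the httr example below; the value for RIGHT HORIZONS would have to be copied from the form's own <option> tags):
library(RCurl)
url <- "https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes"
# form fields are passed as named arguments; use the <option value="..."> values,
# not the visible labels
result <- postForm(url,
                   pmrId = "INP000004417##INP000004417##AEQUITAS INVESTMENT CONSULTANCY PRIVATE LIMITED",
                   year = "2022",
                   month = "8")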
You can also check the actual payload of the POST request in the browser's developer tools.
A lazy approach is to use Copy as cURL in DevTools and head to https://curlconverter.com/r/ to convert it into an httr request.
library(rvest)
resp <- httr::POST("https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes",
                   body = list(
                     pmrId = "INP000004417##INP000004417##AEQUITAS INVESTMENT CONSULTANCY PRIVATE LIMITED",
                     year = "2022",
                     month = "8"))
tables <- resp %>%
  read_html() %>%
  html_elements("table") %>%
  html_table()
# first table:
tables[[1]]
#> # A tibble: 11 × 2
#> X1 X2
#> <chr> <chr>
#> 1 Name of the Portfolio Manager "Aeq…
#> 2 Registration Number "INP…
#> 3 Date of Registration "201…
#> 4 Registered Address of the Portfolio Manager ",,,…
#> 5 Name of Principal Officer ""
#> 6 Email ID of the Principal Officer ""
#> 7 Contact Number (Direct) of the Principal Officer ""
#> 8 Name of Compliance Officer ""
#> 9 Email ID of the Compliance Officer ""
#> 10 No. of clients as on last day of the month "124…
#> 11 Total Assets under Management (AUM) as on last day of the month (Amoun… "143…
Created on 2022-10-11 with reprex v2.0.2
I regularly extract tables from Wikipedia. Excel's web import does not work properly for Wikipedia, as it treats the whole page as a table. In a Google spreadsheet, I can enter this:
=ImportHtml("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan","table",3)
and this function will download the 3rd table, which lists all the counties of the UP of Michigan, from that page.
Is there something similar in R, or something that could be created via a user-defined function?
Building on Andrie's answer and addressing SSL, if you can take one additional library dependency:
library(httr)
library(XML)
url <- "https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan"
r <- GET(url)
doc <- readHTMLTable(doc = content(r, "text"))
doc[6]
The function readHTMLTable in package XML is ideal for this.
Try the following:
library(XML)
doc <- readHTMLTable(doc = "http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan")
doc[[6]]
V1 V2 V3 V4
1 County Population Land Area (sq mi) Population Density (per sq mi)
2 Alger 9,862 918 10.7
3 Baraga 8,735 904 9.7
4 Chippewa 38,413 1561 24.7
5 Delta 38,520 1170 32.9
6 Dickinson 27,427 766 35.8
7 Gogebic 17,370 1102 15.8
8 Houghton 36,016 1012 35.6
9 Iron 13,138 1166 11.3
10 Keweenaw 2,301 541 4.3
11 Luce 7,024 903 7.8
12 Mackinac 11,943 1022 11.7
13 Marquette 64,634 1821 35.5
14 Menominee 25,109 1043 24.3
15 Ontonagon 7,818 1312 6.0
16 Schoolcraft 8,903 1178 7.6
17 TOTAL 317,258 16,420 19.3
readHTMLTable returns a list of data.frames, one for each table-like element on the HTML page. You can use names to get information about each element:
> names(doc)
[1] "NULL"
[2] "toc"
[3] "Election results of the 2008 Presidential Election by County in the Upper Peninsula"
[4] "NULL"
[5] "Cities and Villages of the Upper Peninsula"
[6] "Upper Peninsula Land Area and Population Density by County"
[7] "19th Century Population by Census Year of the Upper Peninsula by County"
[8] "20th & 21st Centuries Population by Census Year of the Upper Peninsula by County"
[9] "NULL"
[10] "NULL"
[11] "NULL"
[12] "NULL"
[13] "NULL"
[14] "NULL"
[15] "NULL"
[16] "NULL"
Here is a solution that works with the secure (https) link:
install.packages("htmltab")
library(htmltab)
htmltab("http://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan",3)
One simple way to do it is to use the RGoogleDocs interface to have Google Docs do the conversion for you:
http://www.omegahat.org/RGoogleDocs/run.html
You can then use the =ImportHtml Google Docs function with all its pre-built magic.
A tidyverse solution using rvest. It's very useful if you need to find the table based on some keywords, for example in the table headers. Here is an example where we want to get the table on Vital statistics of Egypt. Note: html_nodes(x = page, css = "table") is a useful way to browse available tables on the page.
library(magrittr)
library(rvest)
library(stringr)

# define the page to load
read_html("https://en.wikipedia.org/wiki/Demographics_of_Egypt") %>%
  # list all tables on the page
  html_nodes(css = "table") %>%
  # select the first one whose text contains the needed key words
  extract2(str_which(html_text(.), "Live births")[1]) %>%
  # convert to a table
  html_table(fill = TRUE)
That table is the only one that sits inside the second td child of the page layout, so you can specify that pattern with CSS. Rather than using a type selector of table to grab the child table, you can use the class, which is faster:
library(rvest)
t <- read_html('https://en.wikipedia.org/wiki/Upper_Peninsula_of_Michigan') %>%
  html_node('td:nth-child(2) .wikitable') %>%
  html_table()
print(t)