Web scrape with rvest from a table that is not defined - r

I am trying to get a table from this website:
http://www.oddsportal.com/american-football/usa/nfl-2012-2013/results/
I actually want to get the table in the middle of the page.
I tried different ways but in vain.
library("rvest")
library(dplyr)
url1 <- "http://www.oddsportal.com/american-football/usa/nfl-2012-2013/results/"
table <- url1 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="tournamentTable"]') %>%
  html_table(fill = T)
This does not work because I believe that the table is not defined as a table.
I also tried to grab the rows separately by using:
df <- mps1 %>%
html_nodes(css = "tr.odd.deactivate,tr.center.nob-border")
but it returns nothing.
Any idea how I can do it?
Thanks

Based on previous questions by people trying to scrape from this site, this table is probably dynamically generated. As far as I know, the only way to deal with pages like this is to use RSelenium - which basically automates a browser.
After a lot of trial and error, the following code seems to work (using Chrome on Windows 10)...
library(RSelenium)
library(rvest)
library(dplyr)
url <- "http://www.oddsportal.com/american-football/usa/nfl-2012-2013/results/"
rD <- rsDriver(port=4444L,browser="chrome")
remDr <- rD$client
remDr$navigate(url)
page <- remDr$getPageSource()
remDr$close() #you can leave open if you are doing several of these: close at the end
table <- page[[1]] %>%
  read_html() %>%
  html_nodes(xpath='//table[@id="tournamentTable"]') %>% #specify table as there is a div with same id
  html_table(fill = T)
table <- table[[1]]
head(table)
American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013 American Football» USA»NFL 2012/2013
1 03 Feb 2013 - Play Offs 03 Feb 2013 - Play Offs 03 Feb 2013 - Play Offs 03 Feb 2013 - Play Offs 1.00 2.00
2 NA NA
3 23:30 San Francisco 49ers - Baltimore Ravens San Francisco 49ers - Baltimore Ravens 31:34 1.49 2.71
4 28 Jan 2013 - All Stars 28 Jan 2013 - All Stars 28 Jan 2013 - All Stars 28 Jan 2013 - All Stars 1.00 2.00
5 NA NA
6 00:00 NFC - AFC NFC - AFC 62:35 2.03 1.83
American Football» USA»NFL 2012/2013
1 B's
2
3 9
4 B's
5
6 9
The odds are coming out as decimal numbers, unfortunately, but hopefully you can work with that.
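If you need moneyline (American) odds instead, the decimal values can be converted afterwards. This is only a sketch: it assumes American odds are the format you were seeing in the browser, and decimal_to_american is a helper name made up for this example, not part of any package.

```r
# Convert decimal odds to American (moneyline) odds.
# Assumption: moneyline is the format wanted; the helper name is hypothetical.
decimal_to_american <- function(dec) {
  ifelse(dec >= 2,
         round((dec - 1) * 100),   # underdog: profit on a stake of 100
         round(-100 / (dec - 1)))  # favourite: stake needed to win 100
}

odds <- c(1.49, 2.71, 2.03, 1.83)  # decimal odds taken from the output above
american <- decimal_to_american(odds)
print(american)
```

Decimal odds of exactly 1 have no finite moneyline equivalent, so filter out the header rows (which show 1.00 and 2.00) before converting.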

Related

R: move everything after a word to a new column and then only keep the last four digits in the new column

My data frame has a column called "State" and contains the state name, HB/HF number, and the date the law went into effect. I want the state column to only contain the state name and the second column to contain just the year. How would I do this?
Mintz = read.csv('https://github.com/bandcar/mintz/raw/main/State%20Legislation%20on%20Biosimilars2.csv')
mintz = Mintz
# delete rows if col 2 has a blank value.
mintz = mintz[mintz$Substitution.Requirements != "", ]
# removes entire row if column 1 has the word State
mintz=mintz[mintz$State != "State", ]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
# delete PR
mintz = mintz[-34,]
#reset row numbers
mintz= mintz %>% data.frame(row.names = 1:nrow(.))
I'm almost certain I'll need to use strsplit(gsub()) but I'm not sure how to this since there's no specific pattern
EDIT
I still need help keeping only the state name in column 1.
As for moving the year to a new column, I found the below. It works, but I don't know why it works. From my understanding \d means that \d is the actual character it's searching for. the "." means to search for one character, and I have no idea what the \1 means. Another strange thing is that Minnesota (row 20) did not have a year, so it instead used characters. Isn't \d only supposed to be for digits? Someone care to explain?
mintz2 = mintz
mintz2$Year = sub('.*(\\d{4}).*', '\\1', mintz2$State)
One way could be:
For demonstration purposes, select only the State column.
Then use str_extract to pull out the four digits at the end of the string (\\d{4}$); this gives the Year column.
Finally, build a pattern from the built-in state.name vector, use it with str_extract to keep only the state name, and remove the NA rows.
library(dplyr)
library(stringr)
mintz %>%
  select(State) %>%
  mutate(Year = str_extract(State, '\\d{4}$'), .after = State,
         State = str_extract(State, paste(state.name, collapse = '|'))
  ) %>%
  na.omit()
State Year
2 Arizona 2016
3 California 2016
7 Connecticut 2018
12 Florida 2013
13 Georgia 2015
16 Hawaii 2016
21 Illinois 2016
24 Indiana 2014
28 Iowa 2017
32 Kansas 2017
33 Kentucky 2016
34 Louisiana 2015
39 Maryland 2017
42 Michigan 2018
46 Missouri 2016
47 Montana 2017
50 Nebraska 2018
51 Nevada 2018
54 New Hampshire 2018
55 New Jersey 2016
59 New York 2017
62 North Carolina 2015
63 North Dakota 2013
66 Ohio 2017
67 Oregon 2016
70 Pennsylvania 2016
74 Rhode Island 2016
75 South Carolina 2017
78 South Dakota 2019
79 Tennessee 2015
82 Texas 2015
85 Utah 2015
88 Vermont 2018
89 Virginia 2013
92 Washington 2015
93 West Virginia 2018
96 Wisconsin 2019
97 Wyoming 2018
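As for why the sub() call from the question's edit works: in an R string, "\\d" is the regex \d, which matches any digit, and "\\1" in the replacement refers to whatever the first pair of parentheses captured. Here is a small sketch; the two input strings are made up for illustration and are not taken from the data.

```r
# Made-up inputs: one with a year at the end, one without any year.
x <- c("Arizona HB 2310 2016", "Minnesota no year given")

# '.*' is greedy, so it consumes as much as it can while still leaving
# four digits for the capture group: the LAST four-digit run is captured.
# The whole string is matched, so it is replaced by the captured group only.
year <- sub(".*(\\d{4}).*", "\\1", x)
print(year)

# When the pattern does not match, sub() returns the input unchanged,
# which is why a row without a year keeps its text.
# Guard against that with grepl().
year_safe <- ifelse(grepl("\\d{4}", x), year, NA_character_)
print(year_safe)
```

That unchanged-on-no-match behaviour explains the Minnesota row: \d never matched letters, the pattern simply did not match at all.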

Problem scraping a webpage with R and Rvest

I used the code below to extract a table from a webpage:
library(rvest)
library(dplyr)
#Link to site and then getting html code.
link <- "https://www.stats.gov.sa/en/915"
page <- read_html(link)
#extract table from html
files <- page %>%
html_nodes("table") %>%
.[[1]] %>%
html_table()
However, I get results that differ from those on the webpage. The results are shown below:
A tibble: 1 × 4
Name Report Period Periodicity Download
1 please wait ... please wait ... please wait ... please wait ...
I was wondering whether there is a way to get the table in the form I see in the web browser without using RSelenium, because RSelenium doesn't seem to work with RStudio online.
One solution is RSelenium.
Below is a simple example:
library(RSelenium)
library(rvest)
library(dplyr)
#Your URL
URL <- "https://www.stats.gov.sa/en/915"
#Open the browser by RSelenium
rD <- RSelenium::rsDriver(browser = "firefox", port = 4544L, verbose = F)
remDr <- rD[["client"]]
#Open the page into browser
remDr$navigate(URL)
#Get the table that you see
remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_table()
[[1]]
# A tibble: 13 x 4
Name `Report Period` Periodicity Download
<chr> <int> <chr> <lgl>
1 Ar-Riyad Region 2017 Annual NA
2 Makkah Al-Mokarramah Region 2017 Annual NA
3 Al-Madinah Al-Monawarah Region 2017 Annual NA
4 Al-Qaseem Region 2017 Annual NA
5 Eastern Region 2017 Annual NA
6 Aseer Region 2017 Annual NA
7 Tabouk Region 2017 Annual NA
8 Hail Region 2017 Annual NA
9 Northern Borders Region 2017 Annual NA
10 Jazan Region 2017 Annual NA
11 Najran Region 2017 Annual NA
12 Al-Baha Region 2017 Annual NA
13 Al-Jouf Region 2017 Annual NA
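Because the page shows "please wait ..." placeholders first, the page source can be read before the table has been filled in. One way around that is to poll until the placeholder is gone. The sketch below is only that: wait_for, fetch and is_ready are names made up for this example, and fake_source stands in for the real browser call so the code runs without a browser.

```r
# Poll a zero-argument function until its result passes a check.
# `fetch` and `is_ready` are hypothetical callbacks supplied by the caller.
wait_for <- function(fetch, is_ready, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- fetch()
    if (is_ready(result)) return(result)
    if (Sys.time() >= deadline) stop("timed out waiting for content")
    Sys.sleep(interval)
  }
}

# Stand-in for the page source: returns the placeholder twice,
# then the rendered text.
calls <- 0
fake_source <- function() {
  calls <<- calls + 1
  if (calls < 3) "please wait ..." else "Ar-Riyad Region 2017 Annual"
}

html <- wait_for(fake_source,
                 is_ready = function(s) !grepl("please wait", s, fixed = TRUE),
                 interval = 0.1)
print(html)
```

With a live session you would pass function() remDr$getPageSource()[[1]] as fetch and hand the returned string to read_html().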

Making a table for streamgraph

Hi guys I am trying to plot a streamgraph using data at the following link: https://www.kaggle.com/START-UMD/gtd.
My aim is to streamgraph the frequency of terrorist attacks for each terrorist group in the variable gname, but my problem is that I don't know how to filter the data frame in order to have all the parameters necessary to plot a streamgraph, which are data, key, value, date.
I tried to get to that subset of the original dataframe by using the following code
str <- terror %>%
  filter(gname != "Unknown") %>%
  group_by(gname) %>%
  summarise(total=n()) %>%
  arrange(desc(total)) %>%
  head(20)
But all I managed to get is the frequency of attacks for each terrorist group, without getting the number of attacks for each year.
Could you suggest any way to do it? That would be amazing!
Thanks for reading guys and for the help.
Dario and Kent are correct. You need to add the iyear variable in the group_by function:
terror %>%
  filter(gname != "Unknown") %>%
  group_by(gname, iyear) %>%
  summarise(total=n()) %>%
  arrange(desc(total)) %>%
  head(20) -> str
str
# A tibble: 20 x 3
# Groups: gname [7]
gname iyear total
<chr> <int> <int>
1 Islamic State of Iraq and the Levant (ISIL) 2016 1454
2 Islamic State of Iraq and the Levant (ISIL) 2017 1315
3 Islamic State of Iraq and the Levant (ISIL) 2014 1249
4 Taliban 2015 1249
5 Islamic State of Iraq and the Levant (ISIL) 2015 1221
6 Taliban 2016 1065
7 Taliban 2014 1035
8 Taliban 2017 894
9 Al-Shabaab 2014 871
10 Taliban 2012 800
11 Taliban 2013 775
12 Al-Shabaab 2017 570
13 Al-Shabaab 2016 564
14 Boko Haram 2015 540
15 Shining Path (SL) 1989 509
16 Communist Party of India - Maoist (CPI-Maoist) 2010 505
17 Shining Path (SL) 1984 502
18 Boko Haram 2014 495
19 Shining Path (SL) 1983 493
20 Farabundo Marti National Liberation Front (FML~ 1991 492
Then send that to the streamgraph:
str %>% streamgraph("gname", "total", "iyear")
I've always had difficulty annotating these graphs. As far as I know, it has to be done manually:
str %>% streamgraph("gname", "total", "iyear") %>%
  sg_annotate(label="ISIL", x=as.Date("2016-01-01"), y=1454, size=14)
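One thing to keep in mind: head(20) on the grouped counts keeps the 20 largest group-year rows, so most groups end up with only a few years each. If you would rather have every year for the busiest groups, pick the groups first and count afterwards. A minimal sketch in base R, with a made-up toy data frame standing in for the GTD data (only the column names gname and iyear come from the question):

```r
# Toy stand-in for the GTD data: one row per attack (values are made up).
terror <- data.frame(
  gname = c("A", "A", "A", "B", "B", "C", "Unknown", "A", "B"),
  iyear = c(2014, 2014, 2015, 2014, 2015, 2015, 2015, 2015, 2015)
)

known <- terror[terror$gname != "Unknown", ]

# Pick the groups with the most attacks overall (top 2 here, 20 in the question).
totals <- sort(table(known$gname), decreasing = TRUE)
top_groups <- names(totals)[seq_len(min(2, length(totals)))]

# Count attacks per group and year, keeping every year for those groups.
top <- known[known$gname %in% top_groups, ]
counts <- aggregate(total ~ gname + iyear,
                    data = transform(top, total = 1),
                    FUN = sum)
print(counts)
```

The resulting data frame has the same three columns as the tibble above, so it can be passed to streamgraph("gname", "total", "iyear") in the same way.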

Using dplyr to collect data and bind rows of the data collected

I am trying to use rvest to extract some information. What I have is a list of links and I would like to bind the rows of the data collected together.
What I currently have is the following:
EDIT: here are the links without the weekend data
links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")
Code:
library(rvest)
library(dplyr)
library(magrittr)
x <- links %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
This gives the following error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=68].
I am able to get the code working for 1 link however when I try to get it working for all the links I am running into errors. For example this code works:
x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
EDIT:
from = "2016-03-04"
to = "2018-05-10"
s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)
IPOs <- links[1:400] %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) %>%
naniar::replace_with_na_all(condition = ~.x == "-") %>%
type.convert(as.is = TRUE) )
It looks like you want to loop through the URLs. For each one you want to read it, parse its tables into a list of data frames, and extract the first data frame in the list. So the read_html() through extract2() steps should be done within the loop.
One option is to use a purrr::map_dfr() loop, since it looks like you want to bind things into a single tibble in the end.
Nominally that could look like:
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
links %>%
  map_dfr(~read_html(.x) %>%
            html_table() %>%
            extract2(1) )
However, it turns out that you have missing values that are represented by hyphens (-). Some of the tables have these and some don't. When hyphens are present, R reads the affected numeric columns as character; when they are absent, the same columns are read as integer or double. This causes problems when binding everything together.
I did not see an argument in html_table() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using function replace_with_na_all() from package naniar (see the package vignette). Then I converted all columns to the appropriate type with type.convert().
All of this was done within the map_dfr() loop.
Here is an example with just the first two URLs in links.
links[1:2] %>%
  map_dfr(~read_html(.x) %>%
            html_table() %>%
            extract2(1) %>%
            naniar::replace_with_na_all(condition = ~.x == "-") %>%
            type.convert(as.is = TRUE) )
# A tibble: 15 x 9
Symbol Company Exchange Date `Price Range` Price Currency Shares Actions
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr>
1 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 49969000 Priced
2 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 11745600 Priced
3 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 6857200 Priced
4 0000 Vcredit Hldg Ltd HKSE Jun 12, 2018 NA NA HKD NA Expected
5 6571.JP QB Net Holdings Co Ltd Japan OTC Mar 14, 2018 21.11 - 21.11 NA Y 9785900 Expected
6 1621.HK Vico Intl Hldg Ltd HKSE Mar 05, 2018 NA 0.35 HKD 175000000 Priced
7 PZM.AX Piston Mach Ltd ASX Mar 05, 2018 0.32 - 0.32 NA AU 50000000 Expected
8 "" Agp Ltd Karachi Mar 05, 2018 0.76 - 0.76 80 PKR 8750000 Priced
9 GRC.L GRC International Group PLC LSE Mar 05, 2018 0.98 - 0.98 0.7 GBP 8414286 Priced
10 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 3175413 Priced
11 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 7935698 Priced
12 GCI.AX Gryphon Capital Income Tr ASX May 23, 2018 1.57 - 1.57 2 AUD 87650000 Priced
13 GCI.AX Gryphon Capital Income Tr ASX May 04, 2018 1.57 - 1.57 NA AUD 50000000 Expected
14 STRL.L Stirling Inds Plc LSE Mar 06, 2018 1.40 - 1.40 1 GBP 8881002 Priced
15 541006.BO Angel Fibers Ltd BSE Mar 06, 2018 NA 27 INR 6408000 Priced
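To see the type problem in isolation, here is a small base R sketch of the same clean-then-bind idea. The two data frames are made-up stand-ins for html_table() results, clean is a helper name invented for this example, and rbind is used instead of bind_rows only to keep the example free of extra packages.

```r
# Two toy tables standing in for html_table() results (values are made up).
# In the first one, "-" forces Price to be read as character.
t1 <- data.frame(Symbol = c("AAA", "BBB"), Price = c("20", "-"),
                 stringsAsFactors = FALSE)
t2 <- data.frame(Symbol = "CCC", Price = 3.6, stringsAsFactors = FALSE)

# Replace "-" with NA in every column, then re-infer the column types.
clean <- function(df) {
  df[] <- lapply(df, function(col) {
    col[col %in% "-"] <- NA
    col
  })
  type.convert(df, as.is = TRUE)
}

combined <- do.call(rbind, lapply(list(t1, t2), clean))
print(sapply(combined, class))
print(combined)
```

After cleaning, Price is numeric in both tables, so the combined column is numeric with an NA where the hyphen was.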

Find html table name and scrape in R

I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html . I think the XML::readHTMLTable() is the right way to go, but when I try the following I get an error:
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = T, stringsAsFactors = F)
named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'
This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in "Inspect" for quite a while but I'm not connecting dots on how to be more precise. There doesn't seem to be a name or class of the table that is analogous to other examples I've found in documentation or on SO. Thoughts?
Consider using readLines() to download the HTML page content and pass the result to readHTMLTable():
url = "https://www.census.gov/geo/reference/ansi_statetables.html"
webpage <- readLines(url)
readHTMLTable(webpage, header = T, stringsAsFactors = F) # LIST OF 3 TABLES
# $`NULL`
# Name FIPS State Numeric Code Official USPS Code
# 1 Alabama 01 AL
# 2 Alaska 02 AK
# 3 Arizona 04 AZ
# 4 Arkansas 05 AR
# 5 California 06 CA
# 6 Colorado 08 CO
# 7 Connecticut 09 CT
# 8 Delaware 10 DE
# 9 District of Columbia 11 DC
# 10 Florida 12 FL
# 11 Georgia 13 GA
# 12 Hawaii 15 HI
# 13 Idaho 16 ID
# 14 Illinois 17 IL
# ...
To return a specific data frame:
fipsdf <- readHTMLTable(webpage, header = T, stringsAsFactors = F)[[1]]
Another solution using rvest instead of XML is:
library(rvest)
read_html("https://www.census.gov/geo/reference/ansi_statetables.html") %>%
  html_table() %>%
  .[[1]]
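Since the tables have no usable name or class, indexing by position works but breaks if the page layout changes. An alternative is to pick the table by its column headers. A rough sketch, assuming you already have the list of data frames; the toy list below is made up (only the header "FIPS State Numeric Code" comes from the output above) and pick_tables is a helper name invented for this example:

```r
# Toy stand-in for the list returned by readHTMLTable() or html_table()
# (values are made up; the real list is usually unnamed).
tables <- list(
  data.frame(Name = c("Alabama", "Alaska"),
             `FIPS State Numeric Code` = c("01", "02"),
             check.names = FALSE, stringsAsFactors = FALSE),
  data.frame(`Area Name` = "Guam", `FIPS State Numeric Code` = "66",
             Status = "1", check.names = FALSE, stringsAsFactors = FALSE)
)

# Keep the tables whose header contains every wanted column name.
pick_tables <- function(tables, wanted) {
  Filter(function(df) all(wanted %in% names(df)), tables)
}

states <- pick_tables(tables, c("Name", "FIPS State Numeric Code"))[[1]]
print(states)
```

Matching on more than one header makes it less likely that a different table with a shared column is picked by mistake.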
