Web Scraping a table into R - r

I'm new to trying to web scrape, and am sure there's a very obvious answer I'm missing here, but have exhausted every post I can find on using rvest, XML, xml2, etc on reading a table from the web into R, and I've had no success.
An example of the table I'm looking to scrape can be found here:
https://www.eliteprospects.com/iframe_player_stats.php?player=364033
I've tried
EXAMPLE <- read_html("http://www.eliteprospects.com/iframe_player_stats.php?player=364033")
EXAMPLE
URL <- 'http://www.eliteprospects.com/iframe_player_stats.php?player=364033'
table <- URL %>%
read_html %>%
html_nodes("table")
But am unsure what to do with these results to get them into a dataframe, or anything usable.

You need to extract the correct html_nodes, and then convert them into a data.frame. The code below is an example of how to go about doing something like this. I find Selector Gadget very useful for finding the right CSS selectors.
library(tidyverse)
library(rvest)
# read the html
html <- read_html('http://www.eliteprospects.com/iframe_player_stats.php?player=364033')
# function to read columns
read_col <- function(x){
col <- html %>%
# CSS nodes to select by using selector gadget
html_nodes(paste0("td:nth-child(", x, ")")) %>%
html_text()
return(col)
}
# apply the function
col_list <- lapply(c(1:8, 10:15), read_col)
# collapse into matrix
mat <- do.call(cbind, col_list)
# put data into dataframe
df <- data.frame(mat[2:nrow(mat), , drop = FALSE])
# assign names
names(df) <- mat[1, ]
df
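If the page exposes the data as a regular <table> element, html_table() can often do the conversion in one step instead of rebuilding the columns by hand. The following is a minimal sketch, assuming a small made-up HTML snippet in place of the live page, whose actual markup may differ:

```r
library(rvest)

# Hypothetical stand-in for the downloaded page; the real markup may differ
page <- read_html('<html><body><table>
  <tr><th>Season</th><th>Team</th><th>GP</th></tr>
  <tr><td>2016-17</td><td>Team A</td><td>12</td></tr>
  <tr><td>2017-18</td><td>Team B</td><td>30</td></tr>
</table></body></html>')

# Take the first <table> node and parse it into a data frame
stats <- page %>%
  html_node("table") %>%
  html_table()

stats
```

If the page holds several tables, html_nodes("table") followed by html_table() should return a list with one data frame per table.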

Related

Web Scraping with rvest package doesn't work

I'm trying to get a table with rvest, but it doesn't recognize the numbers and creates two extra columns with NAs.
A few months ago it worked, but apparently they made changes to the website and now it doesn't work. I do not know what the problem may be.
url <- paste0("https://climatologia.meteochile.gob.cl/application/mensual/temperaturaMediaMensual/170007/2021/08")
tmp <- read_html(url)
tmp <- html_nodes(tmp,"table")
sapply(tmp, function(x) dim(html_table(x, fill = TRUE))) ## view the table that holds the data
tabla <- html_table(tmp[1],fill = T,header=NA, dec = ".")
I don't see a problem with recognising numbers. There are two empty columns in the html, hence the NAs, and most of the table is blank.
As there are repeat headers, I use janitor to clean the headers, then dplyr to remove the end columns which are automatically labelled x and x_2. You could also slice end columns off instead.
I would probably consider removing/putting into separate table the Resumen Mensual part of the current table.
library(rvest)
library(janitor)
library(dplyr)
url <- paste0("https://climatologia.meteochile.gob.cl/application/mensual/temperaturaMediaMensual/170007/2021/08")
t <- read_html(url) |>
html_element('#excel > table') |>
html_table() |>
clean_names() |>
select(!starts_with('x'))
t
The new base pipe |> requires R 4.1.0. On older versions you can replace it with the %>% pipe from magrittr.
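As an alternative to selecting by name, the empty columns can be dropped by checking for columns that contain only NA. This is a minimal base R sketch on a made-up data frame (the column names and values are assumptions, not the real table):

```r
# Hypothetical scraped table with two empty trailing columns
tbl <- data.frame(
  dia  = c(1, 2, 3),
  temp = c(10.2, 11.5, 9.8),
  x    = c(NA, NA, NA),
  x_2  = c(NA, NA, NA)
)

# Keep only columns that contain at least one non-missing value
cleaned <- tbl[, colSums(!is.na(tbl)) > 0, drop = FALSE]

names(cleaned)
```

This avoids relying on the automatically generated names, at the cost of also dropping any real column that happens to be entirely missing.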

Scraping data from finviz with R - Structure for

I am new using R and this is my first question. I apologize if it has been solved before but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library (rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url,"table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
html_table(fill=TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question refers to lists with more than 20 rows, like the one I am using in the example. They use &r=1, &r=21, &r=41, &r=61 at the end of each URL.
How could I create in this case the structure?
i=0
for(z in ...){
Many thanks in advance for your help.
Update script based on new table number and link:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList<-c("1","21","41","61") # table list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(URL,"&r=",tableNo),'\n')
tables<-read_html(stringr::str_c(URL,"&r=",tableNo)) #get data from webpage based on table numbers; uses the URL argument rather than the global url
screener <- tables %>%
html_nodes("table") %>%
.[17] %>%
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # getting all data in form of list
Here is one approach using stringr and lapply:
library (rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList<-c("1","21","41","61") # table number list
GetData<-function(URL,tableNo){
cat('\n',"Running For table",tableNo,'\n', 'Weblink Used:',stringr::str_c(url,"&",tableNo),'\n')
tables<-read_html(stringr::str_c(url,"&",tableNo)) #get data from webpage based on table numbers
screener <- tables %>%
html_nodes("table") %>%
.[11] %>% # check
html_table(fill=TRUE) %>%
data.frame()
return(screener)
}
AllData<- lapply(TableList, function(x) GetData(URL=url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it will differ for these URLs (the ones ending in &1, &21, etc.). It works fine for the base URL, but for the URLs ending in &1, &21, etc. the data is not at the 11th index. Please change it accordingly.
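The paging structure asked about can be sketched by generating the &r= offsets with seq() and stacking the per-page results. This is only an outline: fetch_page is a hypothetical placeholder that returns a dummy row so the sketch runs offline, and the number of pages (4) is an assumption taken from the offsets in the question.

```r
base_url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"

# Offsets 1, 21, 41, 61: one per page of 20 rows (4 pages is an assumption)
offsets <- seq(from = 1, by = 20, length.out = 4)
page_urls <- paste0(base_url, "&r=", offsets)

# Hypothetical stand-in for a function that downloads and parses one page;
# here it only returns a dummy one-row data frame so the sketch runs offline
fetch_page <- function(page_url) {
  data.frame(source = page_url, stringsAsFactors = FALSE)
}

# One data frame per page, then stack them into a single data frame
pages <- lapply(page_urls, fetch_page)
all_rows <- do.call(rbind, pages)
```

In real use, the body of fetch_page would hold the read_html() and html_table() calls, and do.call(rbind, ...) only works if every page yields the same columns.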

save web scraped tables in csv using rvest in R

I used the rvest package in R to get the tables from a web page, but the details do not come back in a usable format, and I also want to save them to a CSV file. Below is my code chunk. How can I view and save the results in Excel or CSV format?
url <- "https://www.moneycontrol.com/india/stockpricequote/metals-non-ferrous/hindustancopper/HC07"
url %>%
read_html() %>%
html_nodes('#mktdet_1') %>%
html_text()
Here is a generalized solution for you to work with. There are multiple different ways you can parse this information and store it into a data frame or write it to a text file. It really depends on your use case. The first goal, however, is to extract each of the elements into its own element in a vector. Your code is a good start. We can build on this by adding an additional CSS selector, which makes things a lot easier.
library(rvest)
library(dplyr)
library(xml2)
library(stringr)
#Define list of URL's to scrape
url_vec <- list(hindustal_copper = "https://www.moneycontrol.com/india/stockpricequote/metals-non-ferrous/hindustancopper/HC07",
reliance = "https://www.moneycontrol.com/india/stockpricequote/refineries/relianceindustries/RI",
dhcf = "https://www.moneycontrol.com/india/stockpricequote/finance-housing/dewanhousingfinancecorporation/DHF")
#Define empty dataframe
result_df = data.frame(name = character(),property = character(),value = numeric())
#For each url
for(name in names(url_vec)){
table = url_vec %>%
.[[name]] %>% #Extract the URL
read_html() %>% # Read the HTML
html_nodes('#mktdet_1')%>% # Extract the table ID
html_nodes(".PA7.brdb")%>% # Extract each of the elements in the tables
html_text() %>% # Convert to text
str_replace_all("[\\\t|\\\r|\\\n]"," ") %>% #Remove tab, carriage return and new line
str_squish() # Remove White space
text = gsub("^([a-zA-Z\\(\\)%/. ]+)[0-9,\\.%]+$","\\1",table) #Extract the property elements
value = gsub("^[a-zA-Z\\(\\)%/. ]+([0-9,\\.%]+)$","\\1",table) #Extract the numbers
value_num = as.numeric(gsub("[%, ]","",value)) # Convert numbers in character format to numeric
tbl = data.frame(name = rep(name,length(text)),property = text,value = value_num) #Create a temp dataframe
result_df = rbind(result_df,tbl) #Row bind with the original dataframe
#Deliverables are NA because they need to be extracted from the name. Use the appropriate regex to do this
}
write.csv(result_df,file = "stock_stats.csv",row.names = F)
The result in table is just a vector with every element in its own index. text and value simply separate the column labels from the values. You can then store this however you like, depending on the use.
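The label/value split and the CSV export can be tried offline on a few strings. The sketch below uses made-up input strings and a slightly different pattern from the one above (a lazy match up to the trailing number, which is a swapped-in approach, not the answer's own regex):

```r
# Made-up examples of the cleaned strings; real page text may differ
raw <- c("Market Cap (Rs Cr.) 4,598.41", "P/E 31.63", "Div (%) 10.40%")

# Label: everything before the trailing number; value: the trailing number
pattern  <- "^(.*?)\\s+([0-9][0-9,.]*%?)$"
property <- sub(pattern, "\\1", raw, perl = TRUE)
value    <- as.numeric(gsub("[%,]", "", sub(pattern, "\\2", raw, perl = TRUE)))

stats <- data.frame(property = property, value = value, stringsAsFactors = FALSE)

# Write to a temporary CSV and read it back to confirm the round trip
csv_path <- tempfile(fileext = ".csv")
write.csv(stats, csv_path, row.names = FALSE)
check <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Strings that do not end in a number are left unchanged by sub(), so their value becomes NA; those rows would need their own handling.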
For a full, complete answer, read my article published on:
Scraper API - Extract Data from HTML Tables with Rvest [Export Table Data to a CSV in R]
install.packages("rvest")
install.packages("dplyr")
library("rvest")
library("dplyr")
response = read_html("http://api.scraperapi.com?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://datatables.net/examples/basic_init/multiple_tables.html")
tables = response %>% html_table()
table_one = tables[[1]]
install.packages("writexl")
library("writexl")
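write_xlsx(table_one,"./html_table.xlsx") # write_xlsx() produces an Excel file, so the extension should be .xlsx; for a real CSV use write.csv() instead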

R: Converting characters to numbers in an R data.frame

A question regarding this data extraction I did. I would like to create a bar chart with the data, but unfortunately I am unable to convert the extracted characters to numbers inside R. If I edit the file in a text editor, there's no problem at all, but I'd like to do the whole process in R. Here is the code:
install.packages("rvest")
library(rvest)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
read_html() %>%
html_nodes(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
html_table()
str(corporatetax)
As a result, corporatetax contains a data.frame with 3 variables, all of them characters. My question, which I've not been able to resolve, is how I should proceed to convert the second and the third columns to numbers in order to create a bar chart. I've tried with sapply() and dplyr() but did not find a correct way to do that.
Thanks!
You might try to clean up the table like this
library(rvest)
library(stringr)
library(dplyr)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
read_html() %>%
# your xpath defines the single table, so you can use html_node() instead of html_nodes()
html_node(xpath='//*[@id="mw-content-text"]/div/table[5]') %>%
html_table() %>% as_tibble() %>%
setNames(c("country", "corporate_tax", "combined_tax"))
corporatetax %>%
mutate(corporate_tax=as.numeric(str_replace(corporate_tax, "%", ""))/100,
combined_tax=as.numeric(str_replace(combined_tax, "%", ""))/100
)
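The same conversion can be done without extra packages. Here is a minimal base R sketch on made-up rows shaped like the scraped table; pct_to_num is a hypothetical helper name, and the country names and rates are placeholders, not real figures:

```r
# Hypothetical rows shaped like the scraped table (all columns are character)
tax <- data.frame(
  country       = c("A", "B", "C"),
  corporate_tax = c("20%", "12.5%", "30%"),
  combined_tax  = c("25%", "12.5%", "33%"),
  stringsAsFactors = FALSE
)

# Strip the percent sign and convert to a proportion
pct_to_num <- function(x) as.numeric(sub("%", "", x, fixed = TRUE)) / 100

tax$corporate_tax <- pct_to_num(tax$corporate_tax)
tax$combined_tax  <- pct_to_num(tax$combined_tax)

str(tax)
```

Once the columns are numeric, a call such as barplot(tax$corporate_tax, names.arg = tax$country) should draw the chart. Cells holding anything other than a plain percentage (footnote markers, ranges) would become NA and need cleaning first.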

Scraping a table made of tables in R

I have a table from a website I am trying to download, and it appears to be made of a bunch of tables. Right now I am using rvest to bring the table in as text, but it is bringing in a bunch of other tables I am not interested in and then I am coercing the data into a better format, but it's not a repeatable process. Here is my code:
library(rvest)
library(tidyr)
library(data.table) #needed for data.table() below
#Auto Download Data
#reads the url of the race
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W")
#reads in the tables, in this code there are too many
race_results <- race_url %>%
html_nodes(".col") %>%
html_text()
race_results <- data.table(race_results) #turns the character vector into a data.table
f <- nrow(race_results) #counts the number of rows in the data
#eliminates all rows after 496 (11*45 + 1) since there are never more than 43 racers
race_results <- race_results[-c(496:f)]
#puts the data into a format with 1:11 numbering for unstacking
testDT <- data.frame(X = race_results$race_results, ind = rep(1:11, nrow(race_results)/11))
testDT <- unstack(testDT, X~ind) #unstacking data into 11 columns
colnames(testDT) <- testDT[1, ] #changing the top row into the header
I commented everything so you would know what I am trying to do. If you go to the URL, there is a top table with driver results, which is what I am trying to scrape, but it is pulling the bottom ones too, as I can't seem to get a different html_nodes to work other than with ".col". I also tried html_table() in place of the html_text() but it didn't work. I suppose this can be done either by identifying the table in the css (I can't figure this out) or by using a different type of call or the XML library (which I also can't figure out). Any help or direction is appreciated.
UPDATE:
From the comments below, the correct code to pull this data is as follows:
library(rvest)
library(tidyr)
#Auto Download Data
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W") #reads the url of the race
race_results <- race_url %>% html_nodes("table") #returns a node set with all of the tables on the page
race_results <- race_results[7] %>% html_table() #parses the 7th table; the result is a list of length one
race_results <- data.frame(race_results) #converts that list into a data frame
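A hard-coded index such as 7 breaks as soon as the page layout changes. One possible alternative is to parse every table and pick the one with an expected column name. The sketch below assumes a made-up two-table page and assumes the wanted table has a "Driver" column; both are illustrations, not the real site's markup:

```r
library(rvest)

# Hypothetical page with two tables; the real page has many more
page <- read_html('<html><body>
  <table><tr><th>Note</th></tr><tr><td>not wanted</td></tr></table>
  <table><tr><th>Pos</th><th>Driver</th></tr>
         <tr><td>1</td><td>Driver A</td></tr>
         <tr><td>2</td><td>Driver B</td></tr></table>
</body></html>')

# Parse every table, then keep the first one that has a "Driver" column
tables <- page %>% html_nodes("table") %>% html_table()
has_driver <- vapply(tables, function(t) "Driver" %in% names(t), logical(1))
results <- tables[[which(has_driver)[1]]]
```

If no table has the expected column, which(has_driver)[1] is NA and the subsetting fails, which is arguably better than silently reading the wrong table.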
