Save web scraped tables in CSV using rvest in R

I used the rvest package in R to get the tables from a web page, but the details come back unformatted, and I also want to save them to a CSV file. Below is my code chunk. How do I view the results and save them in Excel or CSV format?
url <- "https://www.moneycontrol.com/india/stockpricequote/metals-non-ferrous/hindustancopper/HC07"
url %>%
  read_html() %>%
  html_nodes('#mktdet_1') %>%
  html_text()

Here is a generalized solution for you to work with. There are multiple ways you can parse this information and store it in a data frame or write it to a text file; it really depends on your use case. The first goal, however, is to extract each of the elements into its own entry in a vector. Your code is a good start. We can build on it by adding an additional CSS selector, which makes things a lot easier.
library(rvest)
library(dplyr)
library(xml2)
library(stringr)
#Define a named list of URLs to scrape
url_vec <- list(hindustan_copper = "https://www.moneycontrol.com/india/stockpricequote/metals-non-ferrous/hindustancopper/HC07",
                reliance = "https://www.moneycontrol.com/india/stockpricequote/refineries/relianceindustries/RI",
                dhcf = "https://www.moneycontrol.com/india/stockpricequote/finance-housing/dewanhousingfinancecorporation/DHF")
#Define empty dataframe
result_df = data.frame(name = character(),property = character(),value = numeric())
#For each url
for(name in names(url_vec)){
  table = url_vec %>%
    .[[name]] %>%                        #Extract the URL
    read_html() %>%                      #Read the HTML
    html_nodes('#mktdet_1') %>%          #Extract the table by its ID
    html_nodes(".PA7.brdb") %>%          #Extract each of the elements in the table
    html_text() %>%                      #Convert to text
    str_replace_all("[\t\r\n]", " ") %>% #Remove tabs, carriage returns and newlines
    str_squish()                         #Remove extra white space
  text = gsub("^([a-zA-Z()%/. ]+)[0-9,.%]+$", "\\1", table)  #Extract the property labels
  value = gsub("^[a-zA-Z()%/. ]+([0-9,.%]+)$", "\\1", table) #Extract the numbers
  value_num = as.numeric(gsub("[%, ]", "", value))           #Convert the numbers from character to numeric
  tbl = data.frame(name = rep(name, length(text)), property = text, value = value_num) #Create a temp dataframe
  result_df = rbind(result_df, tbl) #Row bind with the running dataframe
  #Deliverables are NA because they need to be extracted from the name. Use the appropriate regex to do this.
}
write.csv(result_df,file = "stock_stats.csv",row.names = F)
The result of the scrape is just a vector with every element in its own index. text and value simply separate the column labels from the values. You can then store this however you like, depending on the use.
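If a spreadsheet-friendly wide layout is preferred over the long format above, here is a minimal sketch with tidyr (wide_df and the output file name are illustrative):
library(tidyr)
#pivot to one row per stock, one column per scraped property label
wide_df <- pivot_wider(result_df, names_from = property, values_from = value)
write.csv(wide_df, file = "stock_stats_wide.csv", row.names = FALSE)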

For a complete answer, read my article published on:
Scraper API - Extract Data from HTML Tables with Rvest [Export Table Data to a CSV in R]
install.packages("rvest")
install.packages("dplyr")
library("rvest")
library("dplyr")
response = read_html("http://api.scraperapi.com?api_key=51e43be283e4db2a5afb62660xxxxxxx&url=https://datatables.net/examples/basic_init/multiple_tables.html")
tables = response %>% html_table()
table_one = tables[[1]]
install.packages("writexl")
library("writexl")
write_xlsx(table_one, "./html_table.xlsx") # write_xlsx always writes an Excel workbook, so use an .xlsx extension
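Since the question also asks for CSV, note that write_xlsx only produces Excel files; a one-line base R alternative writes a true CSV:
write.csv(table_one, "./html_table.csv", row.names = FALSE)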

Related

How to check in R if the name of the list element contains "this text" in it and pass to the next element in a for loop?

I'm new to R and have a large list of 30 elements, each of which is a data frame with a few hundred rows and around 20 columns (this varies depending on the data frame). Each data frame is named after the original .csv filename (for example "experiment data XYZ QWERTY 01"). How can I check through the whole list and filter only those data frames that don't have specific text in their filename, AND also add a unique id column to those filtered data frames (the id value would be the first three characters of that filename)? For example, all the elements/data frames/files in the list whose name includes "XYZ QWERTY" won't be filtered and don't need a unique id. I had this pseudo-style code:
for (i in seq_along(list_of_dataframes)) {
  if (grepl("this text", names(list_of_dataframes)[i])) {
    next # name contains "this text", so don't filter
  }
  list_of_dataframes[[i]] <- filter(list_of_dataframes[[i]], rule) # rule is a placeholder condition
  list_of_dataframes[[i]]$id <- substr(names(list_of_dataframes)[i], 1, 3) # unique id from first three characters
}
Sorry if the terminology used here is a bit awkward; I'm just starting out with programming and this is my first time posting here, so there's still a lot to learn. (As a bonus, if you have any good resources/websites for learning to automate and do similar things with R, I would be more than glad to get some recommendations! :-))
EDIT:
The code I tried for the filtering part was:
for(i in 1:length(tbl)){
  if (!(str_detect(tbl[[i]], "OLD"))){
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}
However, there were error messages stating "argument is not an atomic vector; coercing" and "the condition has length > 1 and only the first element will be used". Is there any way to get this code working?
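Those messages most likely come from passing a whole data frame (tbl[[i]]) to str_detect, which expects a character vector: the coercion warning is the data frame being flattened, and the length warning is if receiving one logical per column. A sketch of the minimal fix is to test the element's name instead:
for (i in seq_along(tbl)) {
  # test the name of the list element, not the data frame itself
  if (!str_detect(names(tbl)[i], "OLD")) {
    tbl[[i]] <- filter(tbl[[i]], age < 50)
  }
}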
Let there be a directory called files containing these csv files:
'experiment 1.csv' 'experiment 2.csv' 'experiment 3.csv'
'OLDexperiment 1.csv' 'OLDexperiment 2.csv'
This will give you a list of data frames with a filter condition (here: do not contain the substring OLD in the filename). Just remove the ! to only include old experiments instead. A new column id is added containing the file path:
library(tidyverse)
list.files("files") # peek at the file names
paths <- list.files("files", full.names = TRUE)
names(paths) <- paths # name each path with itself so the list keeps the file names
list_of_dataframes <- paths %>% map(read_csv)
list_of_dataframes %>%
  enframe() %>%                            # two-column tibble: name (path) and value (data frame)
  filter(! name %>% str_detect("OLD")) %>% # drop the old experiments
  mutate(value = name %>% map2(value, ~ {
    .y %>% mutate(id = .x)                 # add the file path as an id column
  })) %>%
  pull(value)
A good resource to start with is the free book R for Data Science.
This is a much simpler approach without a list, producing one big combined table of the files that match the same condition:
list.files("files", full.names = TRUE) %>%
tibble(id = .) %>%
# discard old experiments
filter(! id %>% str_detect("OLD")) %>%
# read the csv table for every matching file
mutate(data = id %>% map(read_csv)) %>%
# combine the tables into one big one
unnest(data)
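The id column above holds the full file path; to match the question's three-character id, one small follow-up step (assuming the result of the pipeline above is stored in a variable, here called combined):
# keep only the first three characters of the file name as the id
combined <- combined %>%
  mutate(id = str_sub(basename(id), 1, 3))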

Scraping with Rvest in R Studio: Returns df 0 rows by 32 columns

I am trying to scrape some sports data from this website (https://en.khl.ru/stat/players/1097/skaters/) using rvest. There are no pages to filter through, but there is a 'Show All' icon to show all the data on the page.
I have been trying to use a css selector to extract the table. Unfortunately, no rows are produced but the column names of the table are present.
I suspect the problem lies in the website's interactive features with the table.
Yes, this page is dynamically generated and therefore troublesome for rvest to handle. But the key to scraping this page is realizing that the data is stored as JSON in a script element on the page.
The code below reads the page and extracts the script nodes. I reviewed the script nodes to find the correct one, then some trial and error extracted the JSON data. Finally, I cleaned up the player and team name columns for the final answer.
library(rvest)
library(dplyr)
library(stringr)
url <- "https://en.khl.ru/stat/players/1097/skaters/"
page <- read_html(url)
#the data for the page is stored in a script element
scripts <- page %>% html_elements("script")
#get column names
headers <- page %>% html_elements("thead th") %>% html_text()
#examined the nodes and manually determined the 31st node was it
tail(scripts, 18)
data <- scripts[31] %>% html_text()
#examined the data string and notice the start of the JSON was '[ ['
#end of the JSON was ']]'
jsonstring <- str_extract(data, "\\[ \\[.+\\]\\]")
#convert the JSON into data frame
answer <- jsonlite::fromJSON(jsonstring) %>% as.data.frame
#rename column titles
names(answer) <- headers
#function to strip the html markup left inside a column
cleanhtml <- function(text) {
  text %>% read_html() %>% html_text()
}
#drop the 32nd column, then remove the html markup in the Player and Team columns
answer <- answer[ , -32] %>% rowwise() %>%
  mutate(Player = cleanhtml(Player), Team = cleanhtml(Team))
answer
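If you also want the cleaned table on disk, the same write.csv pattern from the first question applies (the file name is illustrative):
write.csv(answer, "khl_skaters.csv", row.names = FALSE)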

Scraping data from finviz with R - Structure of the for loop

I am new to using R and this is my first question. I apologize if it has been solved before, but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library(rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url, "table")
screener <- tables %>% html_nodes("table") %>% .[11] %>%
  html_table(fill = TRUE) %>% data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question is about screens with more than 20 results, like the one in the example: the site pages them by appending &r=1, &r=21, &r=41, &r=61 to the end of each URL.
How could I create the loop structure in this case?
i=0
for(z in ...){
Many thanks in advance for your help.
Updated script based on the new table number and link:
library(rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList <- c("1","21","41","61") # list of row offsets used by the site's paging
GetData <- function(URL, tableNo){
  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:', stringr::str_c(URL, "&r=", tableNo), '\n')
  tables <- read_html(stringr::str_c(URL, "&r=", tableNo)) #get data from the webpage for this row offset
  screener <- tables %>%
    html_nodes("table") %>%
    .[17] %>%                 #table index for the paged URLs
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}
AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # getting all data in the form of a list
Here is one approach using stringr and lapply:
library(rvest)
library(stringr)
url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList <- c("1","21","41","61") # table number list
GetData <- function(URL, tableNo){
  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:', stringr::str_c(URL, "&", tableNo), '\n')
  tables <- read_html(stringr::str_c(URL, "&", tableNo)) #get data from the webpage based on the table number
  screener <- tables %>%
    html_nodes("table") %>%
    .[11] %>%                 #check this index
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}
AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it changes for these URLs (the ones with &1, &21, etc.). It works fine for the base URL, but the data is not at the 11th index for the URLs with &1, &21, etc. Please change it accordingly.
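Once the right index is in place, the list returned by lapply can be stacked into one data frame and saved; a minimal sketch, assuming all four pages share the same columns (the output file name is illustrative):
FullData <- do.call(rbind, AllData) # stack the four 20-row pages into one table
write.csv(FullData, "finviz_screener.csv", row.names = FALSE)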

Need a little help converting an htmltab to a tibble

Trying to help out a friend with data munging a Miami Dolphins football schedule into a tibble
library(htmltab)
library(tidyr)
library(tibble)
url <- "http://www.espn.com/nfl/team/schedule/_/name/mia"
data <- htmltab(doc = url, which = 1, header = 2)
unique(data)
as_tibble(data)
The table it extracts repeats the same headers (the variable names) as data rows. I'm missing something. I need a little help converting the htmltab output to a tibble. Thanks.
(Screenshot of what the table should look like.)
So I am using the "rvest" package to get data from websites. I think the main problem is that this website doesn't provide a nice, clean table format that you can use directly; you have to clean it up to get the desired output.
rm(list=ls())
library(tidyverse)
library(rvest)
##### get data from web #####
url = "http://www.espn.com/nfl/team/schedule/_/name/mia"
tb <- url %>%
  read_html() %>%
  html_table() # this function reads all the tables at this url
rawdata = tb[[1]] # tb is a list and here we only want the first table
#### clean up the data #####
names(rawdata) = rawdata[2,] # use the second row as column names
tmp = rawdata[grepl("from", rawdata$TICKETS),] # select rows that contain "from"
tmp2 = tmp[,!duplicated(names(tmp))] # delete columns that have duplicated column names
res = as_tibble(tmp2) # convert to a tibble
For the cleaning section, I did it step by step by observing the data. Of course, there are plenty of ways of performing the same task.
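For instance, the same cleanup can be written as one pipe; a hedged sketch, assuming html_table returned character columns (make.unique renames the duplicated headers instead of dropping them):
res <- rawdata %>%
  setNames(make.unique(unlist(rawdata[2, ]))) %>% # second row as unique column names
  filter(grepl("from", TICKETS)) %>%              # keep rows that contain "from"
  as_tibble()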

Scraping a table made of tables in R

I have a table from a website I am trying to download, and it appears to be made of a bunch of tables. Right now I am using rvest to bring the table in as text, but it brings in a bunch of other tables I am not interested in, and then I have to coerce the data into a better format, which is not a repeatable process. Here is my code:
library(rvest)
library(tidyr)
library(data.table) # needed for data.table() below
#Auto Download Data
#reads the url of the race
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W")
#reads in the tables, in this code there are too many
race_results <- race_url %>%
  html_nodes(".col") %>%
  html_text()
race_results <- data.table(race_results) # wrap the scraped text vector in a data.table
f <- nrow(race_results) #counts the number of rows in the data
#eliminates all rows after 496 (11*45 + 1) since there are never more than 43 racers
race_results <- race_results[-c(496:f)]
#puts the data into a format with 1:11 numbering for unstacking
testDT <- data.frame(X = race_results$race_results, ind = rep(1:11, nrow(race_results)/11))
testDT <- unstack(testDT, X~ind) #unstacking data into 11 columns
colnames(testDT) <- testDT[1, ] #changing the top column into the header
I commented everything so you would know what I am trying to do. If you go to the URL, there is a top table with driver results, which is what I am trying to scrape, but my code pulls the bottom tables too; I can't seem to get any html_nodes selector to work other than ".col". I also tried html_table() in place of html_text(), but it didn't work. I suppose this can be done either by identifying the table in the CSS (I can't figure this out) or by using a different type of call or the XML library (which I also can't figure out). Any help or direction is appreciated.
UPDATE:
From the comments below, the correct code to pull this data is as follows:
library(rvest)
library(tidyr)
#Auto Download Data
race_url <- read_html("http://racing-reference.info/race/2016_Folds_of_Honor_QuikTrip_500/W") #reads the url of the race
race_results <- race_url %>% html_nodes("table") # returns a node set with all of the tables on the page
race_results <- race_results[7] %>% html_table() # the 7th table is the driver results
race_results <- data.frame(race_results) # convert the one-table list to a data frame
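The hard-coded [7] is fragile if the page layout changes; a quick, hedged way to relocate the results table is to peek at each table's leading column names:
# list the first few column names of every table to spot the driver results
all_tables <- race_url %>% html_nodes("table") %>% html_table(fill = TRUE)
sapply(all_tables, function(t) paste(head(names(t), 4), collapse = " | "))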
