How to make scraped data tidy in R - r

I am trying to scrape data from the WHO Coronavirus (COVID-19) Dashboard.
The R code I am using is:
library(tidyverse)
library(rvest)  # read_html() and html_nodes() come from rvest, not the core tidyverse

whole_word <- 'https://covid19.who.int/table'
page_title <- read_html(whole_word)

page_title %>%
  html_nodes(".td") %>%
  html_text()

# build the Country variable using the code we used above
SLS_df <- tibble(Country = page_title %>%
                   html_nodes('.td') %>%
                   html_text())
SLS_df

tibble <- SLS_df[c(17:160), 1]  # note: this name masks the tibble() function
tibble
but what I get is really strange: my tibble is 144 x 1, while what I want is a data set shaped like the table on the webpage. What can I do to fix this?
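The selector returns every cell as one flat character vector, so the table structure is lost and has to be rebuilt by reshaping. Below is a minimal sketch, assuming the on-page table has 8 columns; the column count and the .td selector are assumptions you should confirm by inspecting the page, and since the WHO dashboard is rendered by JavaScript, what read_html() sees can differ from the browser view. If the page served a plain HTML <table>, html_table() would return a data frame directly.
library(rvest)
library(tibble)

cells <- read_html('https://covid19.who.int/table') %>%
  html_nodes('.td') %>%
  html_text()

n_cols <- 8  # assumed column count; check it against the page
wide <- matrix(cells, ncol = n_cols, byrow = TRUE)  # one row per table row
SLS_df <- as_tibble(wide, .name_repair = "unique")
SLS_df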

Related

Efficiency in extracting data from webscraping in R

This is no doubt very simple, so apologies, but I am new to web scraping and am trying to extract multiple data points in one call using rvest. Let's take for example the following code (NB I have not used the actual website, which I have replaced in this code snippet with xxxxxx.com):
univsalaries <- lapply(
  paste0('https://xxxxxx.com/job/p', 1:20,
         '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.salary') %>%
      html_text()
  })
Let's say there is another HTML node I want to scrape (.company). Obviously I can make a separate call and fetch that data, but I want to understand the syntax for extracting both pieces of information in the same call.
I tried to put it in the following structure, but the code sent me to the debugger:
.... function(url_base){
  url_base %>% read_html() %>%
  Salary <- univsalaries %>%
    html_nodes('.salary') %>% html_text()
  Company <- univsalaries %>%
    html_nodes('.company') %>% html_text()
  dt <- tibble(Salary, Company)
})
Read the webpage once and then you can extract multiple values from the same page.
library(purrr)
library(rvest)

univsalaries <- map(
  paste0('https://xxxxxx.com/job/p', 1:20,
         '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2'),
  function(url_base){
    webpage <- url_base %>% read_html()
    data.frame(Salary  = webpage %>% html_nodes('.salary') %>% html_text(),
               Company = webpage %>% html_nodes('.company') %>% html_text())
  })
This gives you a list of data frames (one for every link); if you need one combined data frame, use map_df instead of map.
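For reference, a minimal sketch of the map_df() variant, reusing the placeholder URLs and the .salary/.company selectors from above:
library(purrr)
library(rvest)

urls <- paste0('https://xxxxxx.com/job/p', 1:20,
               '/key=%F9%80%76&final=1&jump=1&PGTID=0d3408-0000-24gf-ac2b-810&ClickID=2')

# map_df() row-binds the per-page data frames into one result as it goes
univsalaries <- map_df(urls, function(url_base){
  webpage <- read_html(url_base)
  data.frame(Salary  = webpage %>% html_nodes('.salary') %>% html_text(),
             Company = webpage %>% html_nodes('.company') %>% html_text())
})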

Trying to scrape a table from multiple webpages with a for loop in R

I am trying to scrape information from multiple webpages for different MLB teams. These are the websites I am trying to scrape: https://www.covers.com/sport/baseball/mlb/teams/main/miami-marlins/2019 and https://www.covers.com/sport/baseball/mlb/teams/main/cleveland-indians/2019. For both teams I am trying to scrape info from the 12th table on the page and then join the results together as a data frame. So far my code looks like this:
library(rvest)
#> Loading required package: xml2
library(magrittr)

teams <- c("miami-marlins", "cleveland-indians")
tables <- list()
index <- 1
for (i in teams) {
  url <- paste0("https://www.covers.com/sport/baseball/mlb/teams/main/", i, "/2019")
  table <- url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[12]] %>%
    html_table()
  tables[index] <- table
  index <- index + 1
}
#> Warning in tables[index] <- table: number of items to replace is not a multiple
#> of replacement length
#> Warning in tables[index] <- table: number of items to replace is not a multiple
#> of replacement length
df <- do.call("rbind", tables)
Created on 2020-10-15 by the reprex package (v0.3.0)
When I run the code I get the warning messages above, and the code only grabs the dates the teams played their games on. I borrowed the code mostly from the post Trying to use rvest to loop a command to scrape tables from multiple pages and then tried to tweak it a little to fit what I needed, but obviously something about my alterations has broken it. Below I have posted the code I wrote to scrape the tables from the individual websites, which works well.
url15 <- paste0("https://www.covers.com/sport/baseball/mlb/teams/main/miami-marlins/2019")
table <- url15 %>%
  read_html() %>%
  html_nodes("table") %>%
  .[[12]] %>%
  html_table()
#> Error in url15 %>% read_html() %>% html_nodes("table") %>% .[[12]] %>% : could not find function "%>%"
Created on 2020-10-15 by the reprex package (v0.3.0)
I would appreciate it if someone could point out what I am doing wrong here and, if possible, explain it in layman's terms, as I am very new to this.
Try this:
library(rvest)
library(dplyr)

teams <- c("miami-marlins", "cleveland-indians")
dplyr::bind_rows(lapply(
  paste0("https://www.covers.com/sport/baseball/mlb/teams/main/", teams, "/2019"),
  . %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[12]] %>%
    html_table() %>%
    {`names<-`(.[-1L, ], .[1L, , drop = TRUE])}  # promote the first row to the header
))
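As an aside, the warnings in your original loop come from the single-bracket assignment: tables[index] <- table treats the data frame as a list of columns and tries to splice those columns into one list slot, which is also why only the first column (the dates) survived. tables[[index]] <- table stores the whole data frame as a single element. A sketch of your original loop with just that fix:
library(rvest)
library(magrittr)

teams <- c("miami-marlins", "cleveland-indians")
tables <- list()
index <- 1
for (i in teams) {
  url <- paste0("https://www.covers.com/sport/baseball/mlb/teams/main/", i, "/2019")
  tables[[index]] <- url %>%   # double brackets store the whole data frame in one slot
    read_html() %>%
    html_nodes("table") %>%
    .[[12]] %>%
    html_table()
  index <- index + 1
}
df <- do.call("rbind", tables)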

Scraping data from finviz with R - structure of the for loop

I am new to using R and this is my first question. I apologize if it has been solved before, but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library(rvest)

url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url, "table")
screener <- tables %>%
  html_nodes("table") %>%
  .[11] %>%
  html_table(fill = TRUE) %>%
  data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question concerns screeners with more than 20 results, like the one in my example: the site appends &r=1, &r=21, &r=41, &r=61 to the end of each URL for the successive pages.
How could I build the loop structure for this case?
i = 0
for (z in ...) {
Many thanks in advance for your help.
Updated script based on the new table number and link:
library(rvest)
library(stringr)

url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList <- c("1", "21", "41", "61")  # table list

GetData <- function(URL, tableNo){
  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:',
      stringr::str_c(URL, "&r=", tableNo), '\n')
  # get data from the webpage for this page offset
  tables <- read_html(stringr::str_c(URL, "&r=", tableNo))
  screener <- tables %>%
    html_nodes("table") %>%
    .[17] %>%
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}

AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x))  # getting all data in the form of a list
Here is one approach using stringr and lapply:
library(rvest)
library(stringr)

url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList <- c("1", "21", "41", "61")  # table number list

GetData <- function(URL, tableNo){
  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:',
      stringr::str_c(URL, "&", tableNo), '\n')
  # get data from the webpage for this table number
  tables <- read_html(stringr::str_c(URL, "&", tableNo))
  screener <- tables %>%
    html_nodes("table") %>%
    .[11] %>%  # check this index; see the note below
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}

AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x))  # list of dataframes
However, please check the .[11] index, as it changes for these URLs (the ones ending in &1, &21, etc.). It works fine for the base URL, but for the paginated URLs the data is not at the 11th index, so adjust accordingly.
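If you then want the four pages stacked into a single data frame, a short follow-up (assuming the pages end up sharing the same columns):
library(dplyr)

AllData_df <- bind_rows(AllData)  # stack the per-page data frames into one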

How to extract tabular data from a website using R

I am trying to extract the data from the webpage
https://www.geojit.com/other-market/world-indices
and many others similar to it.
I need the tabular data from the website (INDEX, NAME, COUNTRY, CLOSE, PREV. CLOSE, NET CHANGE, CHANGE (%), LAST UPDATED DATE & TIME). It would be great if you could share the R code for this; any help would be welcome. This is what I have tried so far:
library(rvest)
library(dplyr)
google <- html("https://www.geojit.com/other-market/world-indices")
google %>%
html_nodes()
library(rvest)

my_tbl <- read_html("https://www.geojit.com/other-market/world-indices") %>%
  html_nodes(xpath = '//*[@id="aboutContent"]/div[2]/table') %>%
  html_table(header = TRUE) %>%
  `[[`(1)

using rvest to scrape data from a long table with pagination

I am new to data scraping and trying to use rvest to scrape all the salary data from the long table on this website:
https://www.fedsdatacenter.com/federal-pay-rates/
As expected, the following code gives me the variable names of the data:
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
names <- url %>%
read_html() %>%
html_node('thead') %>%
html_text()
However, why does this code give me no data?
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
dat <- url %>%
  read_html() %>%
  html_node('tbody') %>%
  html_text()
I followed an example in this article: http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
url <- "https://www.fedsdatacenter.com/federal-pay-rates/"
sal <- url %>%
  read_html() %>%
  html_node('#table-example') %>%
  html_table(fill = TRUE)
Again, it produces only column names with no data.
Also, how should I read through all the tens of thousands of pages to get all the data from the table? I suspect I need to use the information in "#table-example_wrapper > div:nth-child(2) > div", but I don't know how. Could anyone help?
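The empty tbody is the giveaway: the table is a DataTables widget whose rows are filled in by JavaScript after the page loads, so read_html() only ever sees the static shell (header present, body empty). A sketch of the usual workaround, under the assumption that the rows arrive as JSON from a request you can find in your browser's developer tools (Network tab); the endpoint and its parameters below are hypothetical placeholders:
library(jsonlite)

# Hypothetical endpoint: substitute the real XHR request you see in the
# Network tab when the table loads or pages.
endpoint <- "https://www.fedsdatacenter.com/hypothetical-endpoint?start=0&length=100"

resp <- fromJSON(endpoint)
# DataTables-style responses usually hold the rows in a `data` field and
# a `recordsTotal` count you can use to iterate over pages; inspect with:
str(resp, max.level = 1)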
