A question regarding this data extraction I did. I would like to create a bar chart with the data, but unfortunately I am unable to convert the extracted characters to numbers inside R. If I edit the file in a text editor there's no problem at all, but I'd like to do the whole process in R. Here is the code:
install.packages("rvest")
library(rvest)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table()
str(corporatetax)
As a result, corporatetax contains a data.frame with 3 variables, all of them character. My question, which I've not been able to resolve, is how I should proceed to convert the second and third columns to numbers so I can create a bar chart. I've tried with sapply() and dplyr but did not find a correct way to do that.
Thanks!
You might try to clean up the table like this:
library(rvest)
library(stringr)
library(dplyr)
url <- "https://en.wikipedia.org/wiki/Corporate_tax"
corporatetax <- url %>%
  read_html() %>%
  # your xpath identifies a single table, so you can use html_node() instead of html_nodes()
  html_node(xpath = '//*[@id="mw-content-text"]/div/table[5]') %>%
  html_table() %>%
  as_tibble() %>%
  setNames(c("country", "corporate_tax", "combined_tax"))

corporatetax %>%
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", "")) / 100,
         combined_tax  = as.numeric(str_replace(combined_tax, "%", "")) / 100)
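Since the end goal was a bar chart, here is a minimal ggplot2 sketch building on the cleaned columns above (the country and corporate_tax names come from the setNames() call; values that don't parse as numbers will simply become NA):

library(ggplot2)
corporatetax %>%
  mutate(corporate_tax = as.numeric(str_replace(corporate_tax, "%", "")) / 100) %>%
  ggplot(aes(x = reorder(country, corporate_tax), y = corporate_tax)) +
  geom_col() +
  coord_flip() + # horizontal bars keep the country names readable
  labs(x = NULL, y = "Corporate tax rate")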
I have this docx file (https://github.com/rhozon/datasets/blob/master/idades.docx?raw=true) and I want to read it directly from the link and convert the numbers to numeric.
I'm using textreadr::read_docx(..., but I believe there is a better way to do this.
A possible solution:
library(tidyverse)
library(textreadr)
read_docx("idades.docx") %>%
str_split(",") %>%
unlist() %>%
str_trim() %>%
as.numeric()
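If you want to read the file straight from the link rather than from a local copy, one option (just a sketch, using the raw GitHub URL from the question) is to download it to a temporary file first and read that:

library(tidyverse)
library(textreadr)
url <- "https://github.com/rhozon/datasets/blob/master/idades.docx?raw=true"
tmp <- tempfile(fileext = ".docx")
download.file(url, tmp, mode = "wb") # binary mode so the docx is not corrupted
read_docx(tmp) %>%
  str_split(",") %>%
  unlist() %>%
  str_trim() %>%
  as.numeric()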
I'm trying to scrape a table from Sports Reference:
cu_url <- "https://www.sports-reference.com/cbb/schools/creighton/"
I was able to get the table into a data frame as intended like this:
cu_html <- read_html(cu_url)
cu_table <- html_nodes(cu_html, "table")
cu_info <- data.frame(html_table(cu_table))
colnames(cu_info) <- cu_info[1,]
cu_info <- cu_info[-1,]
However, I noticed after the fact that the header row repeats throughout the data. For example, row 22 shows the headers again as a row. Is there an efficient way to remove these? In the HTML, the repeated header rows all have a table row (tr) class of "thead", so I'm wondering if I can ask rvest to ignore these, but I've failed when trying this using !=.
Appreciate any thoughts. If I need to remove the actual header in order for this to work I will, but I would prefer to keep that one and just remove the repeats.
You can keep only the rows that have only numbers in the Rk column.
library(rvest)
library(dplyr)
cu_url %>%
  read_html() %>%
  html_nodes('table') %>%
  html_table() %>%
  .[[1]] %>%
  setNames(make.unique(unlist(.[1, ]))) %>%
  slice(-1L) %>%
  filter(grepl('^\\d+$', Rk)) -> result
result
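Alternatively, since the repeated header rows carry the "thead" class in the HTML, you could drop them at the node level before parsing, which is closer to "asking rvest to ignore them". A sketch with xml2 (assuming every repeated header row really has that class):

library(rvest)
library(xml2)
page <- read_html(cu_url)
tbl <- html_node(page, "table")
# remove the repeated header rows (tr elements with class "thead") before parsing
xml_remove(xml_find_all(tbl, ".//tr[contains(@class, 'thead')]"))
cu_info <- html_table(tbl, fill = TRUE)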
I am new to R and this is my first question. I apologize if it has been solved before, but I haven't found a solution.
Using the code below, which I found here, I can get data for a specific subsector from the Finviz screener:
library(rvest)
url <- read_html("https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry")
tables <- html_nodes(url, "table")
screener <- tables %>%
  html_nodes("table") %>%
  .[11] %>%
  html_table(fill = TRUE) %>%
  data.frame()
head(screener)
It was a bit difficult to find the table number, but I did. My question is about screeners with more than 20 results, like the one in the example: the site pages through them by appending &r=1, &r=21, &r=41, &r=61 to the end of each URL.
How could I build the loop for this case, something like:
i=0
for(z in ...){
Many thanks in advance for your help.
Updated script based on the new table number and link:
library(rvest)
library(stringr)

url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry"
TableList <- c("1", "21", "41", "61") # table list (page offsets)

GetData <- function(URL, tableNo){
  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:', stringr::str_c(URL, "&r=", tableNo), '\n')
  tables <- read_html(stringr::str_c(URL, "&r=", tableNo)) # get data from the webpage for this page offset
  screener <- tables %>%
    html_nodes("table") %>%
    .[17] %>%
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}

AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # getting all data in the form of a list
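If you want a single screener table rather than a list, the pages can then be stacked, for example with dplyr (assuming the columns parse identically on every page):

library(dplyr)
FullData <- bind_rows(AllData) # one data frame with all pages stacked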
Here is one approach using stringr and lapply:
library(rvest)
library(stringr)

url <- "https://finviz.com/screener.ashx?v=111&f=geo_usa,ind_specialtyindustrialmachinery&o=industry" # base url
TableList <- c("1", "21", "41", "61") # table number list

GetData <- function(URL, tableNo){
  cat('\n', "Running for table", tableNo, '\n', 'Weblink used:', stringr::str_c(url, "&", tableNo), '\n')
  tables <- read_html(stringr::str_c(url, "&", tableNo)) # get data from the webpage based on the table number
  screener <- tables %>%
    html_nodes("table") %>%
    .[11] %>% # check
    html_table(fill = TRUE) %>%
    data.frame()
  return(screener)
}

AllData <- lapply(TableList, function(x) GetData(URL = url, tableNo = x)) # list of dataframes
However, please check the .[11] index, as it changes for the paged URLs (the ones ending in &1, &21, etc.). It works fine for the base URL, but for the paged URLs the data is not at the 11th table, so please adjust accordingly.
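One way to avoid hard-coding the table index at all (only a sketch, and it assumes the screener table is the one containing a cell with the literal text "Ticker") is to pick the table by its content instead:

library(rvest)
pick_screener <- function(page_url) {
  tables <- read_html(page_url) %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)
  # keep the first parsed table that contains a "Ticker" cell anywhere
  idx <- which(vapply(tables, function(t) any(t == "Ticker", na.rm = TRUE), logical(1)))[1]
  data.frame(tables[[idx]])
}

You could then call pick_screener() on each paged URL instead of indexing with .[11] or .[17].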
Sorry if this question has already been solved; I have searched without success.
I scraped 10 seasons of the NBA and stored the datasets inside a list, but the main problem is that I don't have a column with the year of the season inside the datasets, making it difficult to identify which season each dataset comes from.
So what I'm looking to do is mutate a new column based on a vector of seasons so that each dataset is tagged with its year.
This is what I have tried:
library(tidyverse)
library(rvest)
library(xml2)
season_scrape <- c(2010:2019)
url <- paste0("https://www.basketball-reference.com/leagues/NBA_", season_scrape, "_totals.html")
scrape_function <- function(url){
  season_stats <- url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table() %>%
    mutate(season_year = season_scrape)
}
season_data <- lapply(url, scrape_function)
What would you recommend? mutate inside the scrape_function or after getting the dataset inside the list.
Thanks in advance.
You can handle this in multiple ways. One way is to pass an additional year parameter in the function and apply the function using Map instead of lapply.
library(dplyr)
library(rvest)
scrape_function <- function(url, year){
  url %>%
    read_html() %>%
    html_nodes("table") %>%
    .[[1]] %>%
    html_table() %>%
    mutate(season_year = year)
}
season_data <- Map(scrape_function, url, season_scrape)
If you need to bind the data together into one dataframe, you can also use map2_df from purrr.
season_data <- purrr::map2_df(url, season_scrape, scrape_function)
The first part of this code (up to "pages") successfully retrieves the pages from which I want to scrape. I'm then struggling to find a way to extract pieces of article text, with the associated dates, as a data frame.
I get:
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "c('xml_document', 'xml_node')"
Any guidance on elegance, clarity and efficiency also welcome as this is personal learning.
library(rvest)
library(tidyverse)
library(plyr)
library(stringr)
llply(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
    html_nodes(".Headline--regular a") %>%
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
}) -> links

pages <- links %>% unlist() %>% map(read_html)

map_df(pages, function(x) {
  text = read_html(x) %>%
    html_nodes(".Article-content p") %>%
    html_text() %>%
    str_extract(".+skills.+")
  date = read_html(x) %>%
    html_nodes(".Dateline") %>%
    html_text()
}) -> article_df
Nice, you were nearly there! There are two mistakes here:
The variable pages already contains the parsed HTML code. Therefore, applying read_html again to a single page (i.e. inside map_df) doesn't work; that is the error message you get.
The function inside map_df isn't correct. As there is no explicit return, the last computed value is returned, which is date. The variable text is completely forgotten. You have to pack both variables into a data frame.
The following contains the fixed code.
article_df <- map_df(pages, function(x) {
  data_frame(
    text = x %>%
      html_nodes(".Article-content p") %>%
      html_text() %>%
      str_extract(".+skills.+"),
    date = x %>%
      html_nodes(".Dateline") %>%
      html_text()
  )
})
Also a few comments on the code itself:
I think it is better to use <- instead of ->. This way one can more easily see where a variable is assigned, and with descriptive variable names the code is much easier to understand.
I'd prefer using the package purrr instead of plyr. purrr is part of the tidyverse package. So, instead of the function llply you could simply use map. There is a nice article on purrr vs plyr.
links <- map(1:2, function(i) {
  read_html(str_c("http://www.thetimes.co.uk/search?p=", i, "&q=tech")) %>%
    html_nodes(".Headline--regular a") %>%
    html_attr("href") %>%
    url_absolute("http://www.thetimes.co.uk")
})