I'd like to get the information from three tables on a website. I tried the code below, but the table comes back in a confusing format.
url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>% read_html() %>% html_table(fill = TRUE)
Note: tidyverse and rvest have been used.
You need to do some cleaning of the table.
library(rvest)
library(dplyr)
url <- 'http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9'
url %>%
read_html() %>%
html_table(fill = TRUE) %>%
.[[1]] %>%
.[complete.cases(.),] %>%
mutate_all(~gsub('\n|\\s{2,}', '', .))
# W/L Fighter Str Td Sub Pass
#1 loss Tom AaronMatt Ricehouse 00 00 00 00
#2 win Tom AaronEric Steenberg 00 00 00 00
# Event Method Round Time
#1 Strikeforce - Henderson vs. BabaluDec. 04, 2010 U-DEC 3 5:00
#2 Strikeforce - Heavy ArtilleryMay. 15, 2010 SUBGuillotine Choke 1 0:56
The table you're working with is tricky because it has table cells (<td> elements in HTML) that span two rows in order to repeat information. When html_table() pulls the text out, the contents of those spanning cells get concatenated, and you end up with long strings of blank spaces and newlines.
library(dplyr)
library(rvest)
ufc <- read_html("http://www.ufcstats.com/fighter-details/93fe7332d16c6ad9") %>%
html_table(fill = TRUE) %>%
.[[1]] %>%
filter(!is.na(Fighter)) # could instead use janitor::remove_empty or rowSums for number of NAs
ufc$Fighter[1]
#> [1] "Tom Aaron\n \n \n\n \n \n Matt Ricehouse"
With some regex, you can turn those blanks into delimiters and split the cells. Information that applies to two rows (such as Time) gets repeated. Originally I did this with mutate_all, but realized that Event shouldn't be split; for that column, just remove the extra spaces instead. Adjust as needed for other columns.
ufc %>%
mutate_at(vars(Fighter:Pass), stringr::str_replace_all, "\\s{2,}", "|") %>%
mutate_all(stringr::str_replace_all, "\\s{2,}", " ") %>%
tidyr::separate_rows(everything(), sep = "\\|")
#> W/L Fighter Str Td Sub Pass
#> 1 loss Tom Aaron 0 0 0 0
#> 2 loss Matt Ricehouse 0 0 0 0
#> 3 win Tom Aaron 0 0 0 0
#> 4 win Eric Steenberg 0 0 0 0
#> Event Method Round
#> 1 Strikeforce - Henderson vs. Babalu Dec. 04, 2010 U-DEC 3
#> 2 Strikeforce - Henderson vs. Babalu Dec. 04, 2010 U-DEC 3
#> 3 Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke 1
#> 4 Strikeforce - Heavy Artillery May. 15, 2010 SUB Guillotine Choke 1
#> Time
#> 1 5:00
#> 2 5:00
#> 3 0:56
#> 4 0:56
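As an optional follow-up (a sketch, assuming the column names shown above), you can convert the count-like columns from character to integer once the rows have been separated:
ufc %>%
  mutate_at(vars(Fighter:Pass), stringr::str_replace_all, "\\s{2,}", "|") %>%
  mutate_all(stringr::str_replace_all, "\\s{2,}", " ") %>%
  tidyr::separate_rows(everything(), sep = "\\|") %>%
  # the stat columns are still character after the string handling above
  mutate_at(vars(Str, Td, Sub, Pass, Round), as.integer)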
Related
I'm new to R and am having some trouble creating a good web scraper with R... I started studying this language only 5 days ago, so I'll appreciate any help!
Idea
I'm trying to scrape the classification tables of the "Campeonato Brasileiro" from 2003 to 2021 on Wikipedia, so I can group the teams later and analyze some things.
Explanation and problem
I'm scraping the page of the 2002 championship. I read the HTML page to extract the HTML nodes that I selected with the "SelectorGadget" extension in Google Chrome. There are some considerations:
The page I'm accessing is the 2002 championship page. I did that because it was easier to extract the links to the season tables, which sit in a board at the bottom of that page, using a single selector for all of them (tr:nth-child(9) div a) and reading their links from the HTML attribute "href";
The selected CSS is from the 2003 championship page.
So, in my twisted mind I thought: "Hey! I'm going to create a function to extract the tables from those pages and save them in a data frame!". However, it went wrong and I'm not understanding why... When I tried to run the "tabela_geral" line, the following error was returned: "Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "character"". I think it is reading a string instead of XML. What am I misunderstanding here? Where is my error? The "sapply" method? Thanks in advance!
The code
library("dplyr")
library("rvest")
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
tabela_wiki = link %>%
html_nodes("table.wikitable") %>%
html_table() %>%
paste(collapse = "|")
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
tabela_final <- data.frame(tabela_geral)
You can use :contains to target the appropriate table by class plus a substring that the table contains. You can then use html_table() to extract the matched node in tabular format and subset on a vector of desired columns. I don't know the correct football terms, so I have guessed the columns to subset on; you can adjust the columns vector.
If you wrap the years and the constructed urls you request inside a purrr::map2_dfr() call, you can return a single data frame for all the desired years.
library(tidyverse)
library(rvest)
years <- 2003:2021
urls <- paste("https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_", years, sep = "")
columns <- c("Pos.", "Equipes", "GP", "GC", "SG")
df <- purrr::map2_dfr(urls, years, ~
read_html(.x, encoding = "utf-8") %>%
html_element('.wikitable:contains("ou rebaixamento")') %>%
html_table() %>%
.[columns] %>%
mutate(year = .y, SG = as.character(SG)))
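As a quick sanity check (hypothetical usage, assuming the df built above), you can confirm that each season made it into the combined data frame:
# one row count per scraped season
dplyr::count(df, year)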
You can get all the tables from those links by doing this:
tabela <- function(link){
read_html(link) %>% html_nodes("table.wikitable") %>% html_table()
}
all_tables = lapply(links_temporadas, tabela)
names(all_tables) <- 2003:2022
This gives you a list of length 20, named 2003 to 2022 (i.e. one element for each of those years). Each element is itself a list of tables (i.e. the tables that are available at that link in links_temporadas). Note that the number of tables available at each link varies.
lengths(all_tables)
2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
6 5 10 9 10 12 11 10 12 11 13 14 17 16 16 16 16 15 17 7
You will need to determine which table(s) you are interested in from each of these years.
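For instance, if the table you care about is the classification table (the one with a "Pos." column), one way to pull it out of each year's list might be the following sketch (assuming the all_tables object from above):
# keep, for each year, only the table(s) whose header includes "Pos."
classificacao <- lapply(all_tables, function(tbls) {
  Filter(function(x) "Pos." %in% names(x), tbls)
})
lengths(classificacao)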
Here is a way. It's more complicated than your function because those pages have more than one table, so the function returns only the first table whose column names include "Pos.".
Then, before rbinding the tables, keep only the common columns, since the older tables have one column fewer (they lack column "M").
suppressPackageStartupMessages({
library("dplyr")
library("rvest")
})
link_wikipedia <- "https://pt.wikipedia.org/wiki/Campeonato_Brasileiro_de_Futebol_de_2002"
pagina_wikipedia <- read_html(link_wikipedia)
links_temporadas <- pagina_wikipedia %>%
html_nodes("tr:nth-child(9) div a") %>%
html_attr("href") %>%
paste("https://pt.wikipedia.org", ., sep = "")
tabela <- function(link){
pagina_tabela <- read_html(link)
lista_wiki <- pagina_tabela %>%
html_elements("table.wikitable") %>%
html_table()
i <- sapply(lista_wiki, \(x) "Pos." %in% names(x))
i <- which(i)[1]
lista_wiki[[i]]
}
tabela_geral <- sapply(links_temporadas, FUN = tabela, USE.NAMES = FALSE)
sapply(tabela_geral, ncol)
#> [1] 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13
#sapply(tabela_geral, names)
common_names <- Reduce(intersect, lapply(tabela_geral, names))
tabela_reduzida <- lapply(tabela_geral, `[`, common_names)
tabela_final <- do.call(rbind, tabela_reduzida)
head(tabela_final)
#> # A tibble: 6 x 12
#> Pos. Equipes P J V E D GP GC SG `%`
#> <int> <chr> <chr> <int> <int> <int> <int> <int> <int> <chr> <int>
#> 1 1 Cruzeiro 100 46 31 7 8 102 47 +55 72
#> 2 2 Santos 87 46 25 12 9 93 60 +33 63
#> 3 3 São Paulo 78 46 22 12 12 81 67 +14 56
#> 4 4 São Caetano 742 46 19 14 13 53 37 +16 53
#> 5 5 Coritiba 73 46 21 10 15 67 58 +9 52
#> 6 6 Internacional 721 46 20 10 16 59 57 +2 52
#> # ... with 1 more variable: `Classificação ou rebaixamento` <chr>
Created on 2022-04-03 by the reprex package (v2.0.1)
To keep all columns, including the "M" column:
data.table::rbindlist(tabela_geral, fill = TRUE)
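A dplyr equivalent (a minimal sketch; bind_rows() also pads the missing "M" column with NA):
dplyr::bind_rows(tabela_geral)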
I am looking to make a new variable to mark which of my data points are duplicated, selecting the oldest data point to be the "original". My dataframe is ordered by ID, not by date.
ID Name Number Datetime (dd/mm/yyyy/hh/MM)
1 ace 114 15.03.2019 15:26
2 bert 197 18.03.2019 07:28
3 vance 245 16.03.2019 14:03
4 chad 116 17.03.2019 02:02
5 chad 116 18.03.2019 18:23
6 ace 114 12.03.2019 23:15
Ordering the dataframe works and selecting the duplicated lines also works, but not in combination, which leads to the originals not being the first occurrence. Even if I order the dataframe before marking the representations, the dataframe seems to be unordered for the next command, and linking the two commands with %>% is not working.
df %>% arrange(Datetime)
df$representations <- if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
df$represntations <- df %>%
arrange(Datetime) %>%
if_else(duplicated(df$number, .keep_all =TRUE), 1, 0)
How can I be sure that the originals will be the first data point for each number (like this)?
ID Name Number Datetime (dd/mm/yyyy/hh/MM) representation
1 ace 114 15.03.2019 15:26 1
2 bert 197 18.03.2019 07:28 0
3 vance 245 16.03.2019 14:03 0
4 chad 116 17.03.2019 02:02 0
5 chad 116 18.03.2019 18:23 1
6 ace 114 12.03.2019 23:15 0
Try the code below:
df <- df %>%
arrange(Datetime) %>%
mutate(representations = if_else(duplicated(Number), 1, 0)) %>%
arrange(ID)
library(dplyr)
df %>%
arrange(`Datetime(dd/mm/yyyy/hh/MM)`) %>%
mutate(flag = duplicated(Number)*1) %>%
arrange(ID)
1 1 ace 114 15.03.2019 1
2 2 bert 197 18.03.2019 0
3 3 vance 245 16.03.2019 0
4 4 chad 116 17.03.2019 0
5 5 chad 116 18.03.2019 1
6 6 ace 114 12.03.2019 0
I ended up using this code and the sample I checked seemed to be correct, thank you! (even though as.Date changed the year from 2019 to 2020, the order is still correct)
# split time and date, so as.Date can be used
# (format = "%d.%m.%y" only reads a two-digit year, which is why 2019 came out as 2020; "%d.%m.%Y" would keep the full year)
emerge$date <- as.Date(sapply(strsplit(as.character(emerge$Falleinzeitdatum.Notfall), " "), "[", 1), format = "%d.%m.%y")
# arrange as proposed
emerge <- emerge %>%
arrange(date) %>%
mutate(re = if_else(duplicated(Patientennummer, .keep_all = TRUE), 1, 0))
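An alternative sketch (using the same column names as above) is to parse the full date-time instead of splitting it, so the time of day also counts for the ordering:
# parse the complete "dd.mm.yyyy hh:mm" string; %Y keeps the four-digit year
emerge$datetime <- as.POSIXct(as.character(emerge$Falleinzeitdatum.Notfall),
                              format = "%d.%m.%Y %H:%M", tz = "UTC")

emerge <- emerge %>%
  arrange(datetime) %>%
  mutate(re = if_else(duplicated(Patientennummer), 1, 0))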
I'm scraping a television script and then trying to clean it up. This is what I have so far:
library(tidyverse)
library(rvest)
s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm')
s1_e1 <- s1_e1 %>%
html_nodes("p") %>%
html_text()
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\([^\\)]+\\)", replacement = "")
s1_e1 <- str_replace_all(string = s1_e1, pattern = "\\s*\\[[^\\]]+\\]", replacement = "")
s1_e1 <- str_squish(s1_e1)
s1_e1 <- s1_e1 %>%
as_tibble() %>%
filter(value!="") %>%
mutate(season = "27",
episode_num = "1",
airdate_orig = str_sub(.$value[1], -12),
episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>%
slice(-1)
Which gives me the following:
# A tibble: 38 x 5
value season episode_num airdate_orig episode_name
<chr> <chr> <chr> <chr> <chr>
1 ROSE: Bye! JACKIE: See you later! 27 1 26 Mar, 2005 Rose
2 TANNOY: This is a customer announcement… 27 1 26 Mar, 2005 Rose
3 ROSE: You pulled his arm off. DOCTOR: Y… 27 1 26 Mar, 2005 Rose
4 ROSE: That's just not funny. That's sic… 27 1 26 Mar, 2005 Rose
5 TAXI DRIVER: Watch it! 27 1 26 Mar, 2005 Rose
6 TELEVISION: The whole of Central London… 27 1 26 Mar, 2005 Rose
7 JACKIE: There's no point in getting up,… 27 1 26 Mar, 2005 Rose
8 JACKIE: There's Finch's. You could try … 27 1 26 Mar, 2005 Rose
9 ROSE: It's about last night. He's part … 27 1 26 Mar, 2005 Rose
10 ROSE: Don't mind the mess. Do you want … 27 1 26 Mar, 2005 Rose
# … with 28 more rows
I would like each row to be a new character's speech. As you can see, thankfully the script capitalizes who is speaking and then has a colon and a space before new speech, i.e. ROSE: or TANNOY: . Is there a way to indicate to R that I want each row of the tibble to begin with this capitalized text followed by a colon and to continue in that row until there is another capitalized word followed by a colon?
For example, the first row would start with ROSE: Bye! and the second row would start with JACKIE: See you later!, the third TANNOY: This is a customer announcement… until it reached another capitalized word followed by a colon, and so on.
Additionally, if anyone has any suggestions for how I can integrate the stringr functions into the dplyr chunk let me know. I can make a separate post about this if that's best, but I kept getting errors when attempting to do that (the above is functional though).
Many thanks in advance!
You could use a look-ahead pattern:
library(tidyverse)
s1_e1 %>%
mutate(value=str_split(value, "\\s(?=[A-Z]+:)")) %>%
unnest(value)
returns
# A tibble: 322 x 5
value season episode_num airdate_orig episode_name
<chr> <chr> <chr> <chr> <chr>
1 ROSE: Bye! 27 1 26 Mar, 2005 Rose
2 JACKIE: See you later! 27 1 26 Mar, 2005 Rose
3 TANNOY: This is a customer announcement. The store will be closi~ 27 1 26 Mar, 2005 Rose
4 GUARD: Oi! 27 1 26 Mar, 2005 Rose
5 ROSE: Wilson? Wilson, I've got the lottery money. Wilson, are yo~ 27 1 26 Mar, 2005 Rose
6 ROSE: I can't hang about 'cos they're closing the shop. Wilson! ~ 27 1 26 Mar, 2005 Rose
7 ROSE: Hello? Hello, Wilson, it's Rose. Hello? Wilson? 27 1 26 Mar, 2005 Rose
8 ROSE: Wilson? Wilson! 27 1 26 Mar, 2005 Rose
9 ROSE: You're kidding me. 27 1 26 Mar, 2005 Rose
10 ROSE: Is that someone mucking about? Who is it? 27 1 26 Mar, 2005 Rose
Simplified workflow
You can indeed put all your operations into one pipe:
s1_e1 <- read_html('http://www.chakoteya.net/DoctorWho/27-1.htm') %>%
html_nodes("p") %>%
html_text() %>%
tibble(value = .) %>%
mutate(value = str_squish(str_replace_all(value, "(\\s*\\([^\\)]+\\)|\\s*\\[[^\\]]+\\])", ""))) %>%
filter(value!="") %>%
mutate(season = "27",
episode_num = "1",
airdate_orig = str_sub(.$value[1], -12),
episode_name = str_sub(.$value[1], 1, regexpr(" O", .$value[1])-1)) %>%
slice(-1) %>%
mutate(value=str_split(value, "\\s(?=[A-Z]+:)")) %>%
unnest(value)
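If you also want the speaker in its own column (a sketch, assuming the SPEAKER: prefix holds throughout), tidyr::separate() can split it off:
s1_e1 %>%
  tidyr::separate(value, into = c("speaker", "line"),
                  sep = ":\\s", extra = "merge", fill = "right")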
I am trying to use rvest to extract some information. What I have is a list of links and I would like to bind the rows of the data collected together.
What I currently have is the following;
EDIT: here are the links without the weekend data
links <- c("https://finance.yahoo.com/calendar/ipo?day=2018-03-05", "https://finance.yahoo.com/calendar/ipo?day=2018-03-06",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-07", "https://finance.yahoo.com/calendar/ipo?day=2018-03-08",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-09", "https://finance.yahoo.com/calendar/ipo?day=2018-03-12",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-13", "https://finance.yahoo.com/calendar/ipo?day=2018-03-14",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-15", "https://finance.yahoo.com/calendar/ipo?day=2018-03-16",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-19", "https://finance.yahoo.com/calendar/ipo?day=2018-03-20",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-21", "https://finance.yahoo.com/calendar/ipo?day=2018-03-22",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-23", "https://finance.yahoo.com/calendar/ipo?day=2018-03-26",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-27", "https://finance.yahoo.com/calendar/ipo?day=2018-03-28",
"https://finance.yahoo.com/calendar/ipo?day=2018-03-29", "https://finance.yahoo.com/calendar/ipo?day=2018-03-30",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-02", "https://finance.yahoo.com/calendar/ipo?day=2018-04-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-04", "https://finance.yahoo.com/calendar/ipo?day=2018-04-05",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-06", "https://finance.yahoo.com/calendar/ipo?day=2018-04-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-10", "https://finance.yahoo.com/calendar/ipo?day=2018-04-11",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-12", "https://finance.yahoo.com/calendar/ipo?day=2018-04-13",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-16", "https://finance.yahoo.com/calendar/ipo?day=2018-04-17",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-18", "https://finance.yahoo.com/calendar/ipo?day=2018-04-19",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-20", "https://finance.yahoo.com/calendar/ipo?day=2018-04-23",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-24", "https://finance.yahoo.com/calendar/ipo?day=2018-04-25",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-26", "https://finance.yahoo.com/calendar/ipo?day=2018-04-27",
"https://finance.yahoo.com/calendar/ipo?day=2018-04-30", "https://finance.yahoo.com/calendar/ipo?day=2018-05-01",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-02", "https://finance.yahoo.com/calendar/ipo?day=2018-05-03",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-04", "https://finance.yahoo.com/calendar/ipo?day=2018-05-07",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-08", "https://finance.yahoo.com/calendar/ipo?day=2018-05-09",
"https://finance.yahoo.com/calendar/ipo?day=2018-05-10")
Code:
library(rvest)
library(dplyr)
library(magrittr)
x <- links %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
This gives the following error:
Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) :
Expecting a single string value: [type=character; extent=68].
I am able to get the code working for one link; however, when I try to get it working for all the links I run into errors. For example, this code works:
x <- "https://finance.yahoo.com/calendar/ipo?day=2018-05-08" %>%
read_html() %>%
html_table() %>%
extract2(1) %>%
bind_rows() %>%
as_tibble
EDIT:
from = "2016-03-04"
to = "2018-05-10"
s <- seq(as.Date(from), as.Date(to), "days")
library(chron)
s <- s[!is.weekend(s)]
links <- paste0("https://finance.yahoo.com/calendar/ipo?day=", s)
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
library(naniar)
IPOs <- links[1:400] %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) %>%
naniar::replace_with_na_all(condition = ~.x == "-") %>%
type.convert(as.is = TRUE) )
It looks like you want to loop through the URLs. For each one you want to read it, parse it into a data frame, and extract the first data frame in the list. So the read_html() through extract2() steps should be done within the loop.
One option is to use a purrr::map_dfr() loop, since it looks like you want to bind things into a single tibble in the end.
Nominally that could look like:
library(rvest)
library(dplyr)
library(magrittr)
library(purrr)
links %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) )
However, it turns out that you have missing values that are represented by hyphens (-). Some of the tables have these and some don't. When hyphens are present, R reads the affected integer columns as character; when they are not, those columns are read as integers. The mismatch causes problems when binding everything together.
I did not see an argument in read_html() to deal with these directly (I was looking for the equivalent of na.strings in read.table() or na in readr::read_csv()). My work-around was to convert the hyphens to NA using function replace_with_na_all() from package naniar (see the vignette here). Then I converted all columns to the appropriate type with type.convert().
All of this was done within the map_dfr() loop.
Here is an example with just the first two URLs in links.
links[1:2] %>%
map_dfr(~read_html(.x) %>%
html_table() %>%
extract2(1) %>%
naniar::replace_with_na_all(condition = ~.x == "-") %>%
type.convert(as.is = TRUE) )
# A tibble: 15 x 9
Symbol Company Exchange Date `Price Range` Price Currency Shares Actions
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <int> <chr>
1 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 49969000 Priced
2 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 11745600 Priced
3 2003.HK Vcredit Hldg Ltd HKSE Jun 21, 2018 NA 20 HKD 6857200 Priced
4 0000 Vcredit Hldg Ltd HKSE Jun 12, 2018 NA NA HKD NA Expected
5 6571.JP QB Net Holdings Co Ltd Japan OTC Mar 14, 2018 21.11 - 21.11 NA Y 9785900 Expected
6 1621.HK Vico Intl Hldg Ltd HKSE Mar 05, 2018 NA 0.35 HKD 175000000 Priced
7 PZM.AX Piston Mach Ltd ASX Mar 05, 2018 0.32 - 0.32 NA AU 50000000 Expected
8 "" Agp Ltd Karachi Mar 05, 2018 0.76 - 0.76 80 PKR 8750000 Priced
9 GRC.L GRC International Group PLC LSE Mar 05, 2018 0.98 - 0.98 0.7 GBP 8414286 Priced
10 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 3175413 Priced
11 ACPH.BR Acacia Pharma Group PLC Brussels Mar 05, 2018 3.24 - 4.16 3.6 EUR 7935698 Priced
12 GCI.AX Gryphon Capital Income Tr ASX May 23, 2018 1.57 - 1.57 2 AUD 87650000 Priced
13 GCI.AX Gryphon Capital Income Tr ASX May 04, 2018 1.57 - 1.57 NA AUD 50000000 Expected
14 STRL.L Stirling Inds Plc LSE Mar 06, 2018 1.40 - 1.40 1 GBP 8881002 Priced
15 541006.BO Angel Fibers Ltd BSE Mar 06, 2018 NA 27 INR 6408000 Priced
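If you'd rather not add naniar as a dependency, the same idea can be sketched with dplyr alone: coerce every column to character (so the pages bind cleanly), turn the hyphens into NA with na_if(), and run type.convert() once after binding.
# uses the rvest, dplyr, magrittr and purrr packages loaded above
links[1:2] %>%
  map_dfr(~read_html(.x) %>%
            html_table() %>%
            extract2(1) %>%
            mutate(across(everything(), as.character)) %>%
            mutate(across(everything(), ~na_if(.x, "-")))) %>%
  type.convert(as.is = TRUE)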
I'm trying to scrape the page https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads and can pull the text data off fine using rvest
library(plyr)
library(XML)
library(rvest)
library(dplyr)
library(magrittr)
library(data.table)
for(i in 1:16)
{
float <- paste("squad", i, sep ="")
print(float)
html = read_html("https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads")
assign(float, html_table(html_nodes(html, "table")[[i]]))
}
but would also like to add an extra column to this with the URL in each table for the club, e.g. for squad 1 (the Polish squad on the page, truncated to show only the first six players)
0#0 Pos. Player Date of birth (age) Caps Goals Club
1 1 1GK Wojciech Szczęsny (1990-04-18)18 April 1990 (aged 22) 11 0 Arsenal
2 2 2DF Sebastian Boenisch (1987-02-01)1 February 1987 (aged 25) 9 0 Werder Bremen
3 3 2DF Grzegorz Wojtkowiak (1984-01-26)26 January 1984 (aged 28) 19 0 Lech Poznań
4 4 2DF Marcin Kamiński (1992-01-15)15 January 1992 (aged 20) 3 0 Lech Poznań
5 5 3MF Dariusz Dudka (1983-12-09)9 December 1983 (aged 28) 65 2 Auxerre
6 6 3MF Adam Matuszczyk (1989-02-14)14 February 1989 (aged 23) 20 1 Fortuna Düsseldorf
I would like a column after "Club" called "clubURL" that would show the Wikipedia URL for that club. For instance, the first player plays for Arsenal, so I'd take the link in the table for Arsenal and create:
0#0 Pos. Player Date of birth (age) Caps Goals Club
1 1 1GK Wojciech Szczęsny (1990-04-18)18 April 1990 (aged 22) 11 0 Arsenal
clubURL
1 https://en.wikipedia.org/wiki/Arsenal_F.C.
and so on and so forth. I found rvest table scraping including links but couldn't get that example to work, nor for what I want to do. Sorry if it's been asked elsewhere,
thanks,
I made an example using the first table on the page. You can extend this as needed.
First, grab the first table node and convert it to a data frame with html_table(). Then I created a helper function to extract the link from the table, given the link text. Then I used sapply to populate a new column in the data frame.
library("rvest")
url <- "https://en.wikipedia.org/wiki/UEFA_Euro_2012_squads"
mytable <- read_html(url) %>% html_nodes("table") %>% .[[1]]
df <- mytable %>% html_table()
get_link <- function(html_table, team){
html_table %>%
html_nodes(xpath=paste0("//a[text()='", team, "']")) %>%
.[[1]] %>%
html_attr("href")
}
df$club_link <- sapply(df$Club, function(x)get_link(mytable, x))
> head(df)
0#0 Pos. Player
1 1 1GK Wojciech Szczęsny
2 2 2DF Sebastian Boenisch
3 3 2DF Grzegorz Wojtkowiak
4 4 2DF Marcin Kamiński
5 5 3MF Dariusz Dudka
6 6 3MF Adam Matuszczyk
Date of birth (age) Caps Goals
1 (1990-04-18)18 April 1990 (aged 22) 11 0
2 (1987-02-01)1 February 1987 (aged 25) 9 0
3 (1984-01-26)26 January 1984 (aged 28) 19 0
4 (1992-01-15)15 January 1992 (aged 20) 3 0
5 (1983-12-09)9 December 1983 (aged 28) 65 2
6 (1989-02-14)14 February 1989 (aged 23) 20 1
Club club_link
1 Arsenal /wiki/Arsenal_F.C.
2 Werder Bremen /wiki/SV_Werder_Bremen
3 Lech Poznań /wiki/Lech_Pozna%C5%84
4 Lech Poznań /wiki/Lech_Pozna%C5%84
5 Auxerre /wiki/AJ_Auxerre
6 Fortuna Düsseldorf /wiki/Fortuna_D%C3%BCsseldorf
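To extend this beyond the first table, a sketch along the same lines could loop over the table nodes (reusing url and get_link() from above; it assumes every club name in a squad table matches a link text exactly, as in the first table):
# grab every squad table node, convert each one to a data frame,
# and attach the club links looked up with the helper above
tables <- read_html(url) %>% html_nodes("table")
squads <- lapply(tables[1:16], function(tbl) {
  df <- html_table(tbl)
  df$club_link <- sapply(df$Club, function(x) get_link(tbl, x))
  df
})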