R web scraping issue on extracting content through rvest

I am trying to extract content from https://careers.microsoft.com/us/en/search-results and get the title, info, etc. from the page:
library(rvest)  # for html_text() and the %>% pipe
urlString <- "https://careers.microsoft.com/us/en/search-results?"
getHTML <- xml2::read_html(urlString)
t2 <- getHTML %>% html_text() %>% stringr::str_sub(start = 8027, end = 65679)
jsonWeb <- jsonlite::fromJSON(t2)
df <- jsonWeb$data$jobs
Is there a more elegant way to do this, like extracting the JSON of the phApp.ddo object?
Thank you so much

It's not possible to get reliable results from web scraping a site like this, because you have no control over the content you are scraping. However, doing it by substring index is a disaster, because almost any change in the dynamic content will break your code (in fact, your code didn't work for me because the json string I was served was slightly shorter, so I got trailing garbage that wouldn't parse).
A more robust solution (though see caveat below) is to find useful delimiters at the start and end of the json string which you can use to cut away the parts you don't want.
library(rvest)

urlString <- "https://careers.microsoft.com/us/en/search-results?"
getHTML <- xml2::read_html(urlString)
json <- jsonlite::fromJSON(
  strsplit(strsplit(html_text(getHTML), "phApp\\.ddo = ")[[1]][2], "; phApp")[[1]][1]
)
json$eagerLoadRefineSearch$data$jobs
#> # A tibble: 50 x 27
#> country subCategory industry title multi_location type orgFunction
#> <chr> <chr> <lgl> <chr> <list> <lgl> <lgl>
#> 1 United~ Software E~ NA Prin~ <chr [1]> NA NA
#> 2 United~ Art NA Lead~ <chr [1]> NA NA
#> 3 India Support En~ NA Supp~ <chr [1]> NA NA
#> 4 Romania Support En~ NA Micr~ <chr [2]> NA NA
#> 5 China Solution S~ NA Seni~ <chr [1]> NA NA
#> 6 United~ Software E~ NA Soft~ <chr [1]> NA NA
#> 7 India Support En~ NA Supp~ <chr [1]> NA NA
#> 8 United~ Software E~ NA Seni~ <chr [1]> NA NA
#> 9 Japan Marketing ~ NA Full~ <chr [1]> NA NA
#> 10 United~ Software E~ NA Seni~ <chr [1]> NA NA
#> # ... with 40 more rows, and 20 more variables: experience <chr>,
#> # locale <chr>, multi_location_array <list>, jobSeqNo <chr>,
#> # postedDate <chr>, searchresults_display <lgl>,
#> # descriptionTeaser <chr>, dateCreated <chr>, state <chr>,
#> # targetLevel <chr>, jd_display <lgl>, reqId <lgl>, badge <chr>,
#> # jobId <chr>, isMultiLocation <lgl>, jobVisibility <list>,
#> # mostpopular <dbl>, location <chr>, category <chr>,
#> # locationLatlong <lgl>
I agree it would be better if you could request just the json, but in this case the page is constructed server-side, so there is no standalone XHR request to an API that delivers json; you need to carve the json out of the served HTML. Regex isn't ideal for this, but it's far better than snipping fixed-length strings.
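For instance, a single regex capture can do the same carve-out as the nested strsplit calls above; a minimal sketch, under the same assumption that the "phApp.ddo = ...; phApp" delimiters stay stable:
library(rvest)
library(stringr)

page_text <- read_html("https://careers.microsoft.com/us/en/search-results") %>%
  html_text()
# capture everything between the two delimiters (dotall so "." also matches newlines)
ddo_json <- str_match(page_text, regex("phApp\\.ddo = (.*?); phApp", dotall = TRUE))[, 2]
jobs <- jsonlite::fromJSON(ddo_json)$eagerLoadRefineSearch$data$jobs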

Use the V8 package to run JS scripts on the page to get the phApp object:
library(rvest)
library(V8)

pg <- read_html("https://careers.microsoft.com/us/en/search-results")

# Grab every <script> block that mentions phApp
scripts <- pg %>%
  html_nodes(xpath = "//script[contains(.,'phApp')]") %>%
  html_text()

# Evaluate those scripts in a V8 context so they populate the phApp object
ct <- v8()
ct$eval("var phApp = {}")
for (js in scripts) ct$eval(js)

# Pull the phApp object back into R as a list
data <- ct$get("phApp")
jobs <- data$ddo$eagerLoadRefineSearch$data$jobs

Webscrape Table from FantasyLabs

I am trying to webscrape historical DFS NFL ownership from fantasylabs.com using RSelenium. I am able to navigate to the page and even able to highlight the element I am trying to scrape, but am coming up with an error when I put it into a table.
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"webElement"’
I have looked up the error but cannot seem to find a reason why. I am essentially trying to follow this stack overflow example for this web scraping problem. Would someone be able to help me understand why I am not able to scrape this table and what I could do differently in order to do so?
Here is my full code:
library(RSelenium)
library(XML)
library(RCurl)
# start the Selenium server
rdriver <- rsDriver(browser = "chrome",
                    chromever = "106.0.5249.61")
# creating a client object and opening the browser
obj <- rdriver$client
# navigate to the url
appURL <- 'https://www.fantasylabs.com/nfl/contest-ownership/?date=10112022'
obj$navigate(appURL)
obj$findElement(using = 'xpath', '//*[@id="ownershipGrid"]')$highlightElement()
tableElem <- obj$findElement(using = 'xpath', '//*[@id="ownershipGrid"]')
projTable <- readHTMLTable(tableElem, header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
dvpCTable <- projTable[[1]]
dvpCTable
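For what it's worth, the error itself happens because readHTMLTable dispatches on its first argument and has no method for RSelenium webElement objects; passing the element's outerHTML string as the first argument should dispatch correctly, e.g. (an untested sketch):
# give readHTMLTable the HTML source, not the webElement itself
projTable <- readHTMLTable(tableElem$getElementAttribute("outerHTML")[[1]], header = TRUE)
But you can skip the browser entirely: the table is filled from a JSON endpoint, which you can query directly with httr2: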
library(tidyverse)
library(httr2)
"https://www.fantasylabs.com/api/contest-ownership/1/10_12_2022/4/75377/0/" %>%
request() %>%
req_perform() %>%
resp_body_json(simplifyVector = TRUE) %>%
as_tibble
#> # A tibble: 43 x 4
#> Prope~1 $Fant~2 $Posi~3 $Play~4 $Team $Salary $Actu~5 Playe~6 SortV~7 Fanta~8
#> <int> <int> <chr> <chr> <chr> <int> <dbl> <int> <lgl> <int>
#> 1 50882 1376298 TE Albert~ "DEN" 2800 NA 50882 NA 1376298
#> 2 51124 1376299 TE Andrew~ "DEN" 2500 1.7 51124 NA 1376299
#> 3 33781 1385590 RB Austin~ "LAC" 7500 24.3 33781 NA 1385590
#> 4 55217 1376255 QB Brett ~ "DEN" 5000 NA 55217 NA 1376255
#> 5 2409 1376309 QB Chase ~ "LAC" 4800 NA 2409 NA 1376309
#> 6 40663 1385288 WR Courtl~ "DEN" 6100 3.4 40663 NA 1385288
#> 7 50854 1376263 RB Damare~ "DEN" 4000 NA 50854 NA 1376263
#> 8 8580 1376342 WR DeAndr~ "LAC" 3600 4.7 8580 NA 1376342
#> 9 8472 1376304 D Denver~ "DEN" 2500 7 8472 NA 1376304
#> 10 62112 1376262 RB Devine~ "" 4000 NA 62112 NA 1376262
#> # ... with 33 more rows, 34 more variables:
#> # Properties$`$5 NFL $70K Flea Flicker [$20K to 1st] (Mon-Thu)` <dbl>,
#> # $Average <dbl>, $Volatility <lgl>, $GppGrade <chr>, $MyExposure <lgl>,
#> # $MyLeverage <lgl>, $MyLeverage_rnk <lgl>, $MediumOwnership_pct <lgl>,
#> # $PlayerId_rnk <int>, $PlayerId_pct <dbl>, $FantasyResultId_rnk <int>,
#> # $FantasyResultId_pct <dbl>, $Position_rnk <lgl>, $Position_pct <lgl>,
#> # $Player_Name_rnk <lgl>, $Player_Name_pct <lgl>, $Team_rnk <lgl>, ...
Created on 2022-11-03 with reprex v2.0.2

Scrape webpage that requires button click

I am trying to scrape data from the link below. I need to click the csv button on the webpage and download the csv file it serves.
library(netstat)
library(RSelenium)
url <- "https://gtr.ukri.org/search/project?term=%22climate+change%22+OR+%22climate+crisis%22&fetchSize=25&selectedSortableField=&selectedSortOrder=&fields=pro.gr%2Cpro.t%2Cpro.a%2Cpro.orcidId%2Cper.fn%2Cper.on%2Cper.sn%2Cper.fnsn%2Cper.orcidId%2Cper.org.n%2Cper.pro.t%2Cper.pro.abs%2Cpub.t%2Cpub.a%2Cpub.orcidId%2Corg.n%2Corg.orcidId%2Cacp.t%2Cacp.d%2Cacp.i%2Cacp.oid%2Ckf.d%2Ckf.oid%2Cis.t%2Cis.d%2Cis.oid%2Ccol.i%2Ccol.d%2Ccol.c%2Ccol.dept%2Ccol.org%2Ccol.pc%2Ccol.pic%2Ccol.oid%2Cip.t%2Cip.d%2Cip.i%2Cip.oid%2Cpol.i%2Cpol.gt%2Cpol.in%2Cpol.oid%2Cprod.t%2Cprod.d%2Cprod.i%2Cprod.oid%2Crtp.t%2Crtp.d%2Crtp.i%2Crtp.oid%2Crdm.t%2Crdm.d%2Crdm.i%2Crdm.oid%2Cstp.t%2Cstp.d%2Cstp.i%2Cstp.oid%2Cso.t%2Cso.d%2Cso.cn%2Cso.i%2Cso.oid%2Cff.t%2Cff.d%2Cff.c%2Cff.org%2Cff.dept%2Cff.oid%2Cdis.t%2Cdis.d%2Cdis.i%2Cdis.oid%2Ccpro.rtpc%2Ccpro.rcpgm%2Ccpro.hlt&type=#/csvConfirm"
I am struggling to implement that using Selenium. Here is the code I have so far.
rD <- rsDriver(port= free_port(), browser = "chrome", chromever = "106.0.5249.21", check = TRUE, verbose = TRUE)
remote_driver <- rD[["client"]]
remDr <- rD$client
remDr$navigate(url)
webElem <- remDr$findElement(using = "css", "content gtr-body d-flex flex-column ng-scope")
webElem$clickElement()
You can often just record the network log and see what request is sent when hitting the download button. In Chrome, right-click > Inspect, then look for the Network tab. In this case there is only one request sent. Right-click it and "Copy as cURL" to see the whole request, or just copy the URL, since the cookies and headers are not necessary here. I wrote a quick function around the task of querying the site:
dl_ukri <- function(query,
                    destfile = paste0(query, ".csv"),
                    size = 25L,
                    quiet_download = FALSE) {
  url <- paste0(
    "https://gtr.ukri.org/search/project/csv?term=",
    urltools::url_encode(query),
    "&selectedFacets=&fields=acp.d,is.t,prod.t,pol.oid,acp.oid,rtp.t,pol.in,prod.i,per.pro.abs,acp.i,col.org,acp.t,is.d,is.oid,cpro.rtpc,prod.d,stp.oid,rtp.i,rdm.oid,rtp.d,col.dept,ff.d,ff.c,col.pc,pub.t,kf.d,dis.t,col.oid,pro.t,per.sn,org.orcidId,per.on,ff.dept,rdm.t,org.n,dis.d,prod.oid,so.cn,dis.i,pro.a,pub.orcidId,pol.gt,rdm.i,rdm.d,so.oid,per.fnsn,per.org.n,per.pro.t,pro.orcidId,pub.a,col.d,per.orcidId,col.c,ip.i,pro.gr,pol.i,so.t,per.fn,col.i,ip.t,ff.oid,stp.i,so.i,cpro.rcpgm,cpro.hlt,col.pic,so.d,ff.t,ip.d,dis.oid,ip.oid,stp.d,rtp.oid,ff.org,kf.oid,stp.t&type=&selectedSortableField=score&selectedSortOrder=DESC"
  )
  curl::curl_download(url, destfile, quiet = quiet_download)
}
Testing this with your original search:
dl_ukri('"climate change" OR "climate crisis"', destfile = "test.csv")
readr::read_csv("test.csv")
#> Rows: 5894 Columns: 25
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (23): FundingOrgName, ProjectReference, LeadROName, Department, ProjectC...
#> dbl (2): AwardPounds, ExpenditurePounds
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 5,894 × 25
#> FundingOrgN…¹ Proje…² LeadR…³ Depar…⁴ Proje…⁵ PISur…⁶ PIFir…⁷ PIOth…⁸ PI OR…⁹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 ESRC ES/W00… Univer… School… Fellow… Thew Harriet Christ… http:/…
#> 2 AHRC AH/W00… Univer… Arts L… Resear… Scott Peter Manley <NA>
#> 3 AHRC 2609218 Queen … Drama Studen… <NA> <NA> <NA> <NA>
#> 4 UKRI MR/V02… Univer… Politi… Fellow… Spaiser Viktor… <NA> http:/…
#> 5 MRC MC_PC_… Univer… <NA> Intram… Alessi Dario Renato <NA>
#> 6 AHRC 1948811 Royal … School… Studen… <NA> <NA> <NA> <NA>
#> 7 EPSRC 2688399 Brunel… Chemic… Studen… <NA> <NA> <NA> <NA>
#> 8 ESRC ES/T01… Univer… Social… Resear… Walker Cather… Louise http:/…
#> 9 AHRC AH/X00… Queen … Drama Resear… Herita… Paul <NA> http:/…
#> 10 ESRC 2272756 Univer… Sch of… Studen… <NA> <NA> <NA> <NA>
#> # … with 5,884 more rows, 16 more variables: StudentSurname <chr>,
#> # StudentFirstName <chr>, StudentOtherNames <chr>, `Student ORCID iD` <chr>,
#> # Title <chr>, StartDate <chr>, EndDate <chr>, AwardPounds <dbl>,
#> # ExpenditurePounds <dbl>, Region <chr>, Status <chr>, GTRProjectUrl <chr>,
#> # ProjectId <chr>, FundingOrgId <chr>, LeadROId <chr>, PIId <chr>, and
#> # abbreviated variable names ¹​FundingOrgName, ²​ProjectReference, ³​LeadROName,
#> # ⁴​Department, ⁵​ProjectCategory, ⁶​PISurname, ⁷​PIFirstName, ⁸​PIOtherNames, …
Created on 2022-10-17 with reprex v2.0.2
Voilà. I also played around with the fetchSize=25 that appears in the original URL, but it does not seem to do anything, so I just omitted it.

R task, web scraping

I am sharing my solution for the task below; however, I get an error and cannot find the reason. Can anyone help with it?
Data download
1.1 Collect links
Data on the Stack Overflow user survey is available on the Stack Overflow website. Create a web scraper that collects the links to the survey files. Select only the links to the surveys from 2017 to 2021.
library(rvest)

lst_nodes <- "https://insights.stackoverflow.com/survey/" %>%
  read_html() %>%
  html_nodes(".js-download-link")
lst_url <- lst_nodes[1:5] %>%
  html_attr("href")
print(lst_url)
Complete the function to download the data files from the URLs that you extracted.
fun_download <- function(url) {
  year <- # extract year from url
  zip_file <- paste0("file_", year, ".zip")
  zip_dir <- paste0("dir_", year)
  download.file(url, zip_file)
  unzip(zip_file, exdir = zip_dir, files = "survey_results_public.csv")
  out <- read_csv(file.path(zip_dir, "survey_results_public.csv"),
                  col_types = cols(.default = "c")) %>%
    mutate(Year = year, ResponseId = row_number())
  return(out)
  year <- sub(".*[^0-9]([0-9]+)\\.zip$", "\\1", lst_url)
}
Apply the function to the URLs that you extracted and generate a data frame that contains the data from all surveys.
Save the data frame. Note: The read_csv command in the function seems to keep the downloaded csv files locked after reading, so once you have opened them you cannot delete them; to release the lock, restart the R session.
It is best to save the data so that you have to run the download and import only once.
alldf <- lapply(lst_url, fun_download)
That is all I have done so far, but it seems something is wrong.
My suggestion to use year <- sub(...) needs to be put in the context of the function itself, using its url argument only. This works:
library(dplyr) # for mutate(), row_number() and %>%

fun_download <- function(url) {
  stopifnot(length(url) == 1L) # just a safeguard
  year <- sub(".*[^0-9]([0-9]+)\\.zip$", "\\1", url)
  zip_file <- paste0("file_", year, ".zip")
  zip_dir <- paste0("dir_", year)
  download.file(url, zip_file)
  unzip(zip_file, exdir = zip_dir, files = "survey_results_public.csv")
  out <- readr::read_csv(
    file.path(zip_dir, "survey_results_public.csv"),
    col_types = readr::cols(.default = "c")
  ) %>%
    mutate(
      Year = year,
      ResponseId = row_number()
    )
  return(out)
}
fun_download(lst_url[[1]])
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip'
# Content type 'application/zip' length 8825103 bytes (8.4 MB)
# downloaded 8.4 MB
# # A tibble: 83,439 x 49
# ResponseId MainBranch Employment Country US_State UK_Country EdLevel Age1stCode LearnCode YearsCode YearsCodePro DevType
# <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 I am a deve~ Independen~ Slovakia NA NA Seconda~ 18 - 24 y~ Coding Bo~ NA NA Develop~
# 2 2 I am a stud~ Student, f~ Netherl~ NA NA Bachelo~ 11 - 17 y~ Other onl~ 7 NA NA
# 3 3 I am not pr~ Student, f~ Russian~ NA NA Bachelo~ 11 - 17 y~ Other onl~ NA NA NA
# 4 4 I am a deve~ Employed f~ Austria NA NA Master?~ 11 - 17 y~ NA NA NA Develop~
# 5 5 I am a deve~ Independen~ United ~ NA England Master?~ 5 - 10 ye~ Friend or~ 17 10 Develop~
# 6 6 I am a stud~ Student, p~ United ~ Georgia NA Bachelo~ 11 - 17 y~ Other onl~ NA NA NA
# 7 7 I code prim~ I prefer n~ United ~ New Ham~ NA Seconda~ 11 - 17 y~ Other onl~ 3 NA NA
# 8 8 I am a stud~ Student, f~ Malaysia NA NA Bachelo~ 11 - 17 y~ School;On~ 4 NA NA
# 9 9 I am a deve~ Employed p~ India NA NA Bachelo~ 18 - 24 y~ Coding Bo~ 6 4 Develop~
# 10 10 I am a deve~ Employed f~ Sweden NA NA Master?~ 11 - 17 y~ School 7 4 Data sc~
# # ... with 83,429 more rows, and 37 more variables: OrgSize <chr>, Currency <chr>, CompTotal <chr>, CompFreq <chr>,
# # LanguageHaveWorkedWith <chr>, LanguageWantToWorkWith <chr>, DatabaseHaveWorkedWith <chr>, DatabaseWantToWorkWith <chr>,
# # PlatformHaveWorkedWith <chr>, PlatformWantToWorkWith <chr>, WebframeHaveWorkedWith <chr>, WebframeWantToWorkWith <chr>,
# # MiscTechHaveWorkedWith <chr>, MiscTechWantToWorkWith <chr>, ToolsTechHaveWorkedWith <chr>, ToolsTechWantToWorkWith <chr>,
# # NEWCollabToolsHaveWorkedWith <chr>, NEWCollabToolsWantToWorkWith <chr>, OpSys <chr>, NEWStuck <chr>, NEWSOSites <chr>,
# # SOVisitFreq <chr>, SOAccount <chr>, SOPartFreq <chr>, SOComm <chr>, NEWOtherComms <chr>, Age <chr>, Gender <chr>,
# # Trans <chr>, Sexuality <chr>, Ethnicity <chr>, Accessibility <chr>, MentalHealth <chr>, SurveyLength <chr>, ...
From here, use lapply(., fun_download) to produce a list of frames.
list_of_frames <- lapply(lst_url, fun_download)
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip'
# Content type 'application/zip' length 8825103 bytes (8.4 MB)
# downloaded 8.4 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2020.zip'
# Content type 'application/zip' length 9908290 bytes (9.4 MB)
# downloaded 9.4 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2019.zip'
# Content type 'application/zip' length 18681322 bytes (17.8 MB)
# downloaded 17.8 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2018.zip'
# Content type 'application/zip' length 20022841 bytes (19.1 MB)
# downloaded 19.1 MB
# trying URL 'https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2017.zip'
# Content type 'application/zip' length 9576818 bytes (9.1 MB)
# downloaded 9.1 MB
And a terse summary to show what they hold:
lapply(list_of_frames, function(z) z[1:2, 1:4])
# [[1]]
# # A tibble: 2 x 4
# ResponseId MainBranch Employment Country
# <int> <chr> <chr> <chr>
# 1 1 I am a developer by profession Independent contractor, freelancer, or self-employed Slovakia
# 2 2 I am a student who is learning to code Student, full-time Netherlands
# [[2]]
# # A tibble: 2 x 4
# Respondent MainBranch Hobbyist Age
# <chr> <chr> <chr> <chr>
# 1 1 I am a developer by profession Yes NA
# 2 2 I am a developer by profession No NA
# [[3]]
# # A tibble: 2 x 4
# Respondent MainBranch Hobbyist OpenSourcer
# <chr> <chr> <chr> <chr>
# 1 1 I am a student who is learning to code Yes Never
# 2 2 I am a student who is learning to code No Less than once per year
# [[4]]
# # A tibble: 2 x 4
# Respondent Hobby OpenSource Country
# <chr> <chr> <chr> <chr>
# 1 1 Yes No Kenya
# 2 3 Yes Yes United Kingdom
# [[5]]
# # A tibble: 2 x 4
# Respondent Professional ProgramHobby Country
# <chr> <chr> <chr> <chr>
# 1 1 Student Yes, both United States
# 2 2 Student Yes, both United Kingdom
If you need to assign names (such as the URL used to derive each dataset), then perhaps this, which adds a $url field to each frame.
list_of_frames <- Map(function(x, u) transform(x, url = u), list_of_frames, lst_url)
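Since the task also asks for one combined data frame that is saved to disk, here is a minimal sketch (the file name all_surveys.rds is arbitrary; every column was read as character, so the yearly frames stack cleanly, with non-shared columns padded with NA):
library(dplyr)
alldf <- bind_rows(list_of_frames) # stack the five yearly frames into one data frame
saveRDS(alldf, "all_surveys.rds")  # reload later with readRDS() instead of re-downloading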
Data
library(rvest)
lst_nodes <- read_html("https://insights.stackoverflow.com/survey/") %>%
html_nodes(".js-download-link")
lst_url <- html_attr(lst_nodes[1:5], "href")
lst_url
# [1] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2021.zip"
# [2] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2020.zip"
# [3] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2019.zip"
# [4] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2018.zip"
# [5] "https://info.stackoverflowsolutions.com/rs/719-EMH-566/images/stack-overflow-developer-survey-2017.zip"

Webscraping Tables From Wikipedia in R

I was wondering if anyone had useful ideas or code for web scraping tables from Wikipedia.
Specifically, I'm interested in the Presidential election results table in the "Results by county" section on Wikipedia.
An example table can be found using the following link and scrolling down to the "Results by county" section: https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas
I've tried some solutions from the following StackOverflow post: Importing wikipedia tables in R.
However, they don't appear to be applicable to the type of table I want to scrape on Wikipedia.
Any advice, solutions, or code would be greatly appreciated. Thank you!
Making use of the rvest package you could get the table by first selecting the element containing the desired table via html_element("table.wikitable.sortable") and then extracting the table via html_table() like so:
library(rvest)
url <- "https://en.wikipedia.org/wiki/1948_United_States_presidential_election_in_Texas"
html <- read_html(url)
county_table <- html %>%
html_element("table.wikitable.sortable") %>%
html_table()
head(county_table)
#> # A tibble: 6 x 14
#> County `Harry S. Truman… `Harry S. Truman… `Thomas E. Dewey… `Thomas E. Dewe…
#> <chr> <chr> <chr> <chr> <chr>
#> 1 County # % # %
#> 2 Anders… 3,242 62.37% 1,199 23.07%
#> 3 Andrews 816 85.27% 101 10.55%
#> 4 Angeli… 4,377 69.05% 1,000 15.78%
#> 5 Aransas 418 61.02% 235 34.31%
#> 6 Archer 1,599 86.20% 191 10.30%
#> # … with 9 more variables: Strom ThurmondStates’ Rights Democratic <chr>,
#> # Strom ThurmondStates’ Rights Democratic.1 <chr>,
#> # Henry A. WallaceProgressive <chr>, Henry A. WallaceProgressive.1 <chr>,
#> # Various candidatesOther parties <chr>,
#> # Various candidatesOther parties.1 <chr>, Margin <chr>, Margin.1 <chr>,
#> # Total votes cast[11] <chr>
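Note that the first body row just repeats the "#" / "%" sub-headers and every value comes through as character, so a little post-processing is still needed; a sketch, assuming readr is installed:
county_table <- county_table[-1, ] # drop the repeated sub-header row
# parse_number() copes with both thousands separators ("3,242") and trailing "%" ("62.37%")
county_table[-1] <- lapply(county_table[-1], readr::parse_number)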

Webscraping Table

I am trying to webscrape the table from the following page (https://www.coya.com/bike/fahrrad-index-2019), namely the values of the bike index for 50 German cities (if you click "Alle Ergebnisse +", you'll see all 50 cities).
I especially need some columns ("Bewertung spezielle Radwege & Qualität der Radwege", "Investitionen & Qualität der Infrastruktur", "Bewertung der Infrastruktur", "Fahrradsharing-Score", "Autofreier Tag", "Critical-Mass-Fahrradaktionen", "Event-Score").
This is what I tried:
library(rvest)
num_link="https://www.coya.com/bike/fahrrad-index-2019"
num_page= read_html(num_link)
xyc= num_page %>% html_nodes("._1200:nth-child(2)") %>% html_text()
I tried SelectorGadget; unfortunately I get all the values of the table in one long string (str_split is challenging, because the decimal commas inside the numbers get mixed up with the boundaries between numbers):
"[1] "Ergebnisse für DeutschlandKriminalitätInfrastrukturFahrrad-SharingEvents#StadtLandSizeTotal Score1OldenburgDeutschlandK57,90,4271,94588,3594,4684,5227,153,0590,3454,1836,4515,0525,75N31,5216,2669,122MünsterDeutschlandK58,740,3910,53445,5883,0488,4328,1551,2388,0453,0535,522630,76N23,8412,4265,933Freiburg i. Breisg.DeutschlandK59,350,"
Could someone help me scrape the table, if possible extracting only the values of the specific columns listed above? I would be very thankful for any help or tips.
Thank you in advance.
(I am a newbie, please be gentle.)
Here is one way to solve the puzzle. The column headers use a lot of icons, so I just leave the column names empty. You can create a vector of names and assign them manually using
names(table_content) <- names_vector
Here is the code:
library(rvest)
#> Loading required package: xml2
library(dplyr, warn.conflicts = FALSE)
library(purrr)
# Reuse the URL from the question
num_link <- "https://www.coya.com/bike/fahrrad-index-2019"
num_page <- read_html(num_link)
table_content <- num_page %>%
  html_nodes("._1200:nth-child(2)") %>%
  # Extract the node that contains the table
  html_nodes(css = ".w-dyn-list") %>%
  # Extract the nodes corresponding to each row
  html_nodes(css = ".bike-collection-item") %>%
  # Then map a function that takes each row, converts it to a table,
  # and binds all rows together into one table
  map_dfr(function(x) {
    # suppress the message caused by feeding no column names to map_dfc
    suppressMessages(
      x %>%
        html_nodes(".td") %>%
        map_dfc(function(x) x %>% html_text())
    )
  })
Here is the extracted content
#> # A tibble: 70 x 21
#> ...1 ...2 ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10 ...11 ...12 ...13
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 Olde… Deut… K 57,9 0,427 1,94 588,… 94,46 84,52 27,1 53,05 90,34
#> 2 2 Müns… Deut… K 58,74 0,391 0,53 445,… 83,04 88,43 28,15 51,23 88,04
#> 3 3 Frei… Deut… K 59,35 0,34 2,27 962,… 88,87 77,52 32,57 48,11 93,49
#> 4 4 Bamb… Deut… K 55,59 0,302 0 456,… 89,04 92,66 30,29 47,74 93,75
#> 5 5 Gött… Deut… K 62,66 0,28 3,07 379,… 92,8 80,99 23,03 48,07 89,18
#> 6 6 Heid… Deut… K 63,14 0,22 1,21 394,… 90,39 88,33 29,02 47,88 94,21
#> 7 7 Karl… Deut… K 57,39 0,25 4,23 725,… 90,35 71,62 18,75 46,33 93,93
#> 8 8 Brau… Deut… K 67,36 0,21 0 522,… 85,89 90,97 20,55 49,2 89,78
#> 9 9 Kons… Deut… K 62,77 0,22 4,6 121,… 93,62 76,98 23 48,49 94,09
#> 10 10 Brem… Deut… M 58,86 0,21 1,38 334,… 87,34 87,15 18,64 59,78 94,64
#> # … with 60 more rows, and 8 more variables: ...14 <chr>, ...15 <chr>,
#> # ...16 <chr>, ...17 <chr>, ...18 <chr>, ...19 <chr>, ...20 <chr>,
#> # ...21 <chr>
Created on 2021-04-08 by the reprex package (v1.0.0)
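If you only want the columns listed in the question, one option after the extraction above is to assign names by position and then subset by name; the names below are illustrative placeholders that you would match to the site's icon headers by hand:
names(table_content) <- c("rang", "stadt", "land", "groesse",
                          paste0("metrik_", seq_len(ncol(table_content) - 4)))
# e.g. keep the city plus two of the metric columns (the positions are guesses)
auswahl <- table_content[, c("stadt", "metrik_5", "metrik_6")]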
You probably want to do one table at a time and then the columns one at a time, so you can then create a data frame. Try this for example:
col1 <- num_page %>%
  html_nodes(".w-dyn-item :nth-child(2) div") %>%
  html_text()
SelectorGadget is nifty, but I usually need to experiment a lot to get the right selectors.
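Building on that, you could loop over the child index and collect several columns into one data frame; a sketch where the indices 2:4 and the column names are guesses to verify with SelectorGadget, and all columns are assumed to come back with equal length:
cols <- lapply(2:4, function(i) {
  num_page %>%
    html_nodes(paste0(".w-dyn-item :nth-child(", i, ") div")) %>%
    html_text()
})
names(cols) <- c("stadt", "land", "score") # placeholder names
df <- as.data.frame(cols, stringsAsFactors = FALSE)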
