I would like to extract the job descriptions, aka the "p" tag HTML elements, from all 16 pages generated by the last line of code.
"ret" is a list of 16 HTML pages generated by that last line. I'm not used to working with lists of lists, so I'm confused about how to extract the data from them.
Normally I would use
res %>%
html_elements("body p")
But I'm getting the error message:
Error in UseMethod("xml_find_all") :
  no applicable method for 'xml_find_all' applied to an object of class "list"
library(tidyverse)
library(rvest)
library(xml2)
url<-"https://www.indeed.com/jobs?q=data%20analyst&l=San%20Francisco%2C%20CA&vjk=0c2a6008b4969776"
page <- xml2::read_html(url) # read in the code from the webpage and break it down into different elements (<div>, <span>, <p>, etc.)
#get job title
title<-page %>%
html_nodes(".jobTitle") %>%
html_text()
#get company Location
loc<-page %>%
html_nodes(".companyLocation") %>%
html_text()
#job snippet
page %>%
html_nodes(".job-snippet") %>%
html_text()
#Get link
desc<- page %>%
html_nodes("a[data-jk]") %>%
html_attr("href")
# Create combine link
combined_link <- paste("https://www.indeed.com", desc, sep="")
#Turn combined link into a session follow link
page1 <- html_session(combined_link[[1]])
page1 %>%
html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
html_text()
#one<- page %>% html_elements("a[id*='job']")
#create function return a list of page-returns
ret <- lapply(paste0("https://www.indeed.com", desc), read_html)
We could either use lapply from base R
out <- lapply(ret, function(x) x %>%
html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
html_text())
or loop with map from purrr
library(purrr)
out <- map(ret, ~ .x %>%
html_nodes(".iCIMS_JobContent, #jobDescriptionText") %>%
html_text())
NOTE: Both loop over the elements of the list; .x or x stands for the individual element inside the anonymous function, i.e. a function created on the fly (function(x) in base R, ~ in the tidyverse).
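Either way, out is a list with one character vector per page. As a minimal follow-up sketch (assuming every page yielded at least one match and that the number of pages lines up with the titles scraped earlier), the list can be collapsed into a data frame:
# collapse each page's scraped text into a single string per job
descriptions <- vapply(out, function(x) paste(x, collapse = " "), character(1))
# hypothetical combined result, pairing each title with its description
jobs <- data.frame(title = title, description = descriptions)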
How are you? I am trying to extract some info about this sports-betting webpage using rvest. I asked a related question a few days ago and achieved almost 100% of my goals. So far, and thanks to you, I have successfully extracted the title, the score, and the time of the matches being played using the following code:
library(rvest)
library(tidyverse)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
read_html()
data <- data.frame(
Titulo = page %>%
html_elements(".titulo") %>%
html_text(),
Marcador = page %>%
html_elements(".marcador") %>%
html_text(),
Tiempo = page %>%
html_elements(".marcador+ span") %>%
html_text() %>%
str_squish()
)
Now I want to get repeated values: for example, if the country of a match is "Brasil", I want the data frame to say the country is Brasil for every match in that category. So far I have only managed to extract all the countries, but individually. The same applies to the sport name and the tournament.
Can you help me with that? Thanks in advance.
You could re-write your code to use separate functions that work with different levels of information. These can be called in a nested fashion, making the code easier to read.
Essentially, this uses nested map_dfr() calls to produce a single dataframe from functions working with lists at different levels within the DOM.
Below, you could think of it as an outer loop over sports, then an intermediate loop over countries, and an innermost loop over events within a sport and country.
library(rvest)
library(tidyverse)
get_sport_info <- function(sport) {
df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
df$sport <- sport %>%
html_element(".sport-name") %>%
html_text()
return(df)
}
get_play_info <- function(play) {
df <- map_dfr(play %>% html_elements(".event"), ~
data.frame(
titulo = .x %>% html_element(".titulo") %>% html_text(),
marcador = .x %>% html_element(".marcador") %>% html_text(),
tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
))
df$country <- play %>%
html_element(".category-name") %>%
html_text()
return(df)
}
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)
I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage, where I am applying the get_injury_data function to the links, but I have been having issues executing this successfully. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)
# create a function to grab the team links
get_team_links <- function(url){
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>% # remove rows with # string
paste0('https://www.transfermarkt.com', .) %>% # paste the website prefix onto the url strings
unique() %>% # keep only unique links
as_tibble() %>% # turn strings into a tibble dataset
rename("links" = "value") %>% # rename the value column
filter(!grepl('profil', links)) %>% # remove player profile links
filter(!grepl('spielplan', links)) %>% # remove links to additional team pages
mutate(links = gsub("startseite", "kader", links)) # change link to go to the detailed squad page
}
# create a function to grab the player links
get_player_links <- function(url){
url %>%
read_html %>%
html_nodes('td.hauptlink a') %>%
html_attr('href') %>%
.[. != '#'] %>% # remove rows with # string
paste0('https://www.transfermarkt.com', .) %>% # paste the website prefix onto the url strings
unique() %>% # keep only unique links
as_tibble() %>% # turn strings into a tibble dataset
rename("links" = "value") %>% # rename the value column
filter(grepl('profil', links)) %>% # keep only player profile links
mutate(links = gsub("profil", "verletzungen", links)) # change link to go to the injury page
}
# create a function to get the injury dataset
get_injury_data <- function(url){
url %>%
read_html() %>%
html_nodes('#yw1') %>%
html_table()
}
# get team links and save it as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get player links by mapping the get_player_links function over the team links
# and then unnest the resulting list-column into one long tibble
player_injury_links <- team_links %>%
mutate(links = map(team_links$links, get_player_links)) %>%
unnest(links)
# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
Solution
So the issue that I was having was that some of the links that I was scraping did not have any data.
To overcome this issue, I used the possibly() function from the purrr package. This helped me create a new, error-free function.
The corrected version of the line that was giving me trouble is as follows:
player_injury_data <- player_injury_links$links %>%
purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
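Because possibly() fills the failing links with NULL, those empty entries can be dropped afterwards. A minimal sketch (the new object name is illustrative) using purrr::compact():
# drop the NULL placeholders left by possibly(), keeping only links that returned a table
player_injury_data_clean <- purrr::compact(player_injury_data)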
I'm trying to find a way to copy-paste the title and the abstract from a PubMed page.
I started using
browseURL("https://pubmed.ncbi.nlm.nih.gov/19592249") ## final numbers are the PMID
Now I can't find a way to obtain the title and the abstract as plain text. I have to do this for multiple PMIDs, so I need to automate it. It would also be fine to copy everything on that page and then keep only what I need afterwards.
Is it possible to do that? Thanks!
I suppose what you're trying to do is scrape PubMed for articles of interest?
Here's one way to do this using the rvest package:
#Required libraries.
library(magrittr)
library(rvest)
#Function.
getpubmed <- function(url){
dat <- rvest::read_html(url)
pid <- dat %>% html_elements(xpath = '//*[@title="PubMed ID"]') %>% html_text2() %>% unique()
ptitle <- dat %>% html_elements(xpath = '//*[@class="heading-title"]') %>% html_text2() %>% unique()
pabs <- dat %>% html_elements(xpath = '//*[@id="enc-abstract"]') %>% html_text2()
return(data.frame(pubmed_id = pid, title = ptitle, abs = pabs, stringsAsFactors = FALSE))
}
#Test run.
urls <- c("https://pubmed.ncbi.nlm.nih.gov/19592249", "https://pubmed.ncbi.nlm.nih.gov/22281223/")
df <- do.call("rbind", lapply(urls, getpubmed))
The code should be fairly self-explanatory. (I've not added the contents of df here for brevity.) The function getpubmed does no error-handling or anything of that sort, but it is a start. By supplying a vector of URLs to the do.call("rbind", lapply(urls, getpubmed)) construct, you can get back a data.frame consisting of the PubMed ID, title, and abstract as columns.
Another option would be to explore the easyPubMed package.
I would also use a function and rvest. However, I would pass the pid in as the function argument, use html_node() since only a single node needs to be matched, and use faster CSS selectors. String cleaning is done via the stringr package:
library(rvest)
library(stringr)
library(dplyr)
get_abstract <- function(pid){
page <- read_html(paste0('https://pubmed.ncbi.nlm.nih.gov/', pid))
df <- tibble(
title = page %>% html_node('.heading-title') %>% html_text() %>% str_squish(),
abstract = page %>% html_node('#enc-abstract') %>% html_text() %>% str_squish()
)
return(df)
}
get_abstract('19592249')
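To run this over several articles, the same function can be mapped over a vector of PMIDs and the resulting one-row tibbles bound together; a small sketch, assuming purrr is installed:
library(purrr)
# map get_abstract() over several PMIDs and row-bind the results
pids <- c("19592249", "22281223")
abstracts <- map_dfr(pids, get_abstract)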
I'm writing a data scraper in R using rvest which looks like this:
library(tidyverse)
library(rvest)
library(magrittr)
library(dplyr)
library(tidyr)
library(data.table)
library(zoo)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")
targets <- map_df(targets_url, ~.x %>% read_html %>%
html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame)
df_structure <- apply(targets,2,as.character)
df_targets <- as.data.frame(df_structure)
You'll notice that it creates a dataframe with four variables and 53 rows.
But now go to the URL itself. You'll notice that the 53 rows correspond to certain subcategorizations: Top Target, High Choice, and Interested. Here's a picture showing an example:
What I'm trying to do is create a fifth column which contains the subcategory. So, for example, the three individuals who fall under "Top Target" will be assigned another column which lists them as "Top Target"; the next 20 rows will have that fifth column read "High Choice", and so on. The reason I'm here is that I have no clue how to do that. What makes it even harder is that not every page has the same numbers: while the picture above only lists Top Target (3), another team's page has Top Target (24). It varies for each page.
Would it be possible to alter my original script so that it:
A) Creates that fifth column with the subcategory mentioned above,
B) Knows when it's supposed to switch to the next subcategory, and
C) Is agnostic to the total number of people in each subcategory?
EDITED script, partially based on @Dave2e's answer:
library(rvest)
library(dplyr)
library(purrr)
library(stringr)
teams <- c("ohio-state","penn-state","michigan","michigan-state")
targets_url <- paste0("https://247sports.com/college/", teams, "/Season/2021-Football/Targets/")
# read the web page once! then extract the information requested
targets <- map_df(targets_url, ~.x %>% read_html %>%
html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame)
#find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
#find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()
#extract repeats from header
nrepeats<-as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]
#create combined dataframe
answer <- cbind(categories, targets)
The headings are located at "li" nodes with the class "ri-page__list-item list-header". It is also convenient that each heading contains the number of players listed underneath it.
This script finds the heading nodes, extracts the number of players, and then creates a vector of repeated headings to merge with the targets dataframe.
library(rvest)
library(dplyr)
library(stringr)
targets_url <- paste0("https://247sports.com/college/ohio-state/Season/2021-Football/Targets/")
# read the web page once! then extract the information requested
page <- read_html(targets_url)
targets <- page %>%
html_nodes(".ri-page__star-and-score .score , .position , .meta , .ri-page__name-link") %>%
html_text() %>%
str_trim %>%
str_split(" ") %>%
matrix(ncol = 4, byrow = T) %>%
as.data.frame
#find the headings and the players
list <- page %>% html_nodes("li.ri-page__list-item")
headers <- which(html_attr(list, "class") == "ri-page__list-item list-header")
#find the category
category <- list[headers] %>% html_node("b.name") %>% html_text()
#extract repeats from header
nrepeats<-as.integer(str_extract(category, "[0-9]+"))
categories <- rep(category, nrepeats)[1:nrow(targets)]
#create combined dataframe
answer <- cbind(categories, targets)
Update - finding the hidden data
The webpage dynamically hides some information if the list is too long. The code can now handle that information. The code below finds the hidden JSON data (contained in a 'script' node) and parses it. It does return a list of players, though not with all of the same information.
#another option
#find the hidden JSON data
jsons <- page %>% html_nodes(xpath = '//*[@type="application/ld+json"]')
allplayers <- jsonlite::fromJSON( html_text(jsons[2]))
#Similar list, provide URL to each players webpage
answer2 <- cbind(rep(category, nrepeats), allplayers$athlete)
The code below works if I remove the Sys.sleep() from within the map() function. I tried to research the error ('Don't know how to pluck from a closure') but I haven't found much on that topic.
Does anyone know where I can find documentation on this error, and can anyone explain why it is happening and how to prevent it?
library(rvest)
library(tidyverse)
library(stringr)
# pages of the player listing to scrape (0 through 18)
page <- 0:18
# no need to create a list. Just a vector
urls = paste0("https://www.mlssoccer.com/players?page=", page)
# define this function that collects the player's name from a url
get_the_names = function( url){
url %>%
read_html() %>%
html_nodes("a.name_link") %>%
html_text()
}
# map the urls to the function that gets the names
players = map(urls, get_the_names) %>%
# turn into a single character vector
unlist() %>%
# make lower case
tolower() %>%
# replace the `space` to underscore
str_replace_all(" ", "-")
# Now create a vector of player urls
player_urls = paste0("https://www.mlssoccer.com/players/", players )
# define a function that reads the 3rd table of the url
get_the_summary_stats <- function(url){
url %>%
read_html() %>%
html_nodes("table") %>%
html_table() %>% .[[3]]
}
# let's read just a few players to speed things up [otherwise it takes a significant amount of time to run...]
a_few_players <- player_urls[1:5]
# get the stats
tables = a_few_players %>%
# important step so I can name the rows I get in the table
set_names() %>%
#map the player urls to the function that reads the 3rd table
# note the `safely` wrap around the get_the_summary_stats' function
# since there are players with no stats and causes an error (eg.brenden-aaronson )
# the output will be a list of lists [result and error]
map(., ~{ Sys.sleep(5)
safely(get_the_summary_stats) }) %>%
# collect only the `result` output (the table) INTO A DATA FRAME
# There is also an `error` output
# also, name each row with the players name
map_df("result", .id = "player") %>%
#keep only the player name (remove the www.mls.... part)
mutate(player = str_replace(player, "https://www.mlssoccer.com/players/", "")) %>%
as_tibble()
tables <- tables %>% separate(Match,c("awayTeam","homeTeam"), extra= "drop", fill = "right")
purrr::safely(...) returns a function, so your map(., ~ { Sys.sleep(5); safely(get_the_summary_stats) }) is returning functions, not any data. In R, a "closure" is a function and its enclosing environment.
Tilde notation is a tidyverse-specific method of more-terse anonymous functions. Typically (e.g., with lapply) one would use lapply(mydata, function(x) get_the_summary_stats(x)). In tilde notation, the same thing is written as map(mydata, ~ get_the_summary_stats(.))
So, re-write to:
... %>% map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.); })
From comments by @r2evans.
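Putting that fix back into the original pipeline, the corrected step would look roughly like this (a sketch using the objects defined above; only the map() call changes):
tables <- a_few_players %>%
set_names() %>%
# call the safely-wrapped function on each url (note the extra (.x))
map(~ { Sys.sleep(5); safely(get_the_summary_stats)(.x) }) %>%
# keep only the `result` element of each safely() output and name rows by player url
map_df("result", .id = "player") %>%
# keep only the player name (remove the https://www.mlssoccer.com/players/ part)
mutate(player = str_replace(player, "https://www.mlssoccer.com/players/", "")) %>%
as_tibble()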