I am trying to extract the titles of YouTube videos from a specific channel. My current scraper is the following:
library(rvest)

url <- 'https://www.youtube.com/channel/UCmgE6sLiR_cC_0T4fRHpZ0A/videos'
webpage <- read_html(url)
titles_html <- html_nodes(webpage, '#contents')
titles <- html_text(titles_html)
I'm pretty sure the node is not the correct one, but I can't seem to find anything through Google Chrome SelectorGadget.
Does anyone know how to get the data in this case?
Thank you very much!
#contents is correct, but it needs more processing: use h3 a to get the title of every video.
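A minimal sketch of that suggestion, assuming the titles are present in the static HTML that read_html() receives (YouTube builds much of the page with JavaScript, so an empty result would mean a rendering tool is needed):

library(rvest)

webpage <- read_html("https://www.youtube.com/channel/UCmgE6sLiR_cC_0T4fRHpZ0A/videos")

# Titles sit in anchors inside h3 headings under #contents; trim whitespace.
titles <- webpage %>%
  html_nodes("#contents h3 a") %>%
  html_text(trim = TRUE)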
I am pretty new to web scraping, and I need to scrape the content of newspaper articles from a list of URLs pointing to articles on different websites. I would like to obtain the actual textual content of each document; however, I cannot find a way to automate the scraping procedure across links to different websites.
In my case, data are stored in "dublin", a dataframe looking like this.
[screenshot of the dublin data frame]
So far, I have managed to scrape articles from the same website in batches, so that I can rely on the same CSS paths, found with SelectorGadget, to retrieve the texts. Here is the code I'm using to scrape content from documents on the same webpage, in this case those posted by The Irish Times:
library(xml2)
library(rvest)
library(dplyr)

# Keep only the articles published by The Irish Times
dublin <- dublin %>%
  filter(page == "The Irish Times")

# Column 2 holds the article URLs
link <- pull(dublin, 2)

# Collect the body paragraphs of each article
articles <- list()
for (i in link) {
  page <- read_html(i)
  text <- page %>%
    html_elements(".body-paragraph") %>%
    html_text()
  articles[[i]] <- text
}
articles
It actually works. However, since the CSS paths vary from website to website, I was wondering whether there is any way to automate this procedure across all the elements of the "url" variable; see the sketch after the links below.
Here are examples of the links I scraped:
https://www.thesun.ie/news/10035498/dublin-docklands-history-augmented-reality-app/
https://lovindublin.com/lifestyle/dublins-history-comes-to-life-with-new-ar-app-that-lets-you-experience-it-first-hand
https://www.irishtimes.com/ireland/dublin/2023/01/11/phone-app-offering-augmented-reality-walking-tour-of-dublins-docklands-launched/
https://www.dublinlive.ie/whats-on/family-kids-news/new-augmented-reality-app-bring-25949045
https://lovindublin.com/news/campaigners-say-we-need-to-be-ambitious-about-potential-lido-for-georges-dock
Thank you in advance! I hope the material I provided is enough.
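One way to generalize this is a lookup table of CSS selectors keyed by each site's domain. A sketch follows; only .body-paragraph (The Irish Times) comes from the question, the other selectors are placeholders to fill in with SelectorGadget, and the "url" column name is assumed from the question:

library(rvest)

# Hypothetical lookup: domain -> CSS selector for the article body.
selectors <- c(
  "www.irishtimes.com" = ".body-paragraph",
  "www.thesun.ie"      = "p",  # placeholder
  "lovindublin.com"    = "p",  # placeholder
  "www.dublinlive.ie"  = "p"   # placeholder
)

scrape_article <- function(url) {
  # Pull the domain out of the URL and look up its selector.
  domain <- sub("^https?://([^/]+)/.*", "\\1", url)
  read_html(url) %>%
    html_elements(selectors[[domain]]) %>%
    html_text()
}

# Apply over the whole (unfiltered) url column instead of one site at a time.
articles <- lapply(dublin$url, scrape_article)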
Okay, to start, I'm very new to web scraping. I'm trying to learn, and I thought I'd begin with something simple: scraping a paragraph of text from a webpage. The webpage I'm trying to scrape is https://www.cato.org/blog.
I'm just trying to scrape the first paragraph that begins with "Border patrol arrests..."
I added the SelectorGadget extension to chrome to get the CSS selector.
The code I have written is as follows:
library(rvest)

url <- "https://www.cato.org/blog"
webpage <- read_html(url)
text <- html_nodes(webpage, "p")
text <- html_text2(text)
However, after running text <- html_nodes(webpage, "p"), I just get an empty list. No errors or anything, just... nothing. Am I doing something wrong? When I look up similar issues, I find answers recommending the RSelenium package, but when I look up that package and how to use it for my task, a lot of it goes over my head.
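An empty result usually means the paragraphs are built by JavaScript after the initial HTML loads, so rvest alone never sees them. A minimal RSelenium sketch (assuming a local Firefox install; the port and wait time are arbitrary) renders the page in a real browser before handing the HTML to rvest:

library(RSelenium)
library(rvest)

# Launch a browser session (assumes Firefox is installed locally).
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client

remDr$navigate("https://www.cato.org/blog")
Sys.sleep(5)  # crude wait for the JavaScript to finish rendering

# Parse the fully rendered page source with rvest as usual.
page <- read_html(remDr$getPageSource()[[1]])
text <- html_text2(html_nodes(page, "p"))

remDr$close()
rD$server$stop()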
I am trying to use web scraping to generate a play-by-play dataset from ESPN. I have figured out most of it, but I have been unable to tell which team each event is for, as this is only encoded on ESPN in the form of an image. The best way I have come up with to solve this is to get the URL of the logo for each entry and compare it to the URL of each team's logo at the top of the page. However, I have been unable to figure out how to get an attribute such as the URL from the image.
I am running this in R using the rvest package. The URL I am scraping is https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906, and I am using the SelectorGadget Chrome extension to find selectors. I have also tried comparing player names to the box score, which lists all the players, but each team has a player with the last name Jones, so I would prefer to identify the team from the image, as that will always be right.
library(rvest)
url <- "https://www.espn.com/mens-college-basketball/playbyplay?gameId=400587906"
webpage <- read_html(url)
# have been able to successfully scrape game_details and score
game_details_html <- html_nodes(webpage,'.game-details')
game_details <- html_text(game_details_html) %>% as.character()
score_html <- html_nodes(webpage,'.combined-score')
score <- html_text(score_html)
# have not been able to scrape image
ImgNode <- html_nodes(webpage, css = "#gp-quarter-1 .team-logo")
link <- html_attr(ImgNode, "src")
For each event, I want it to be labeled "Duke" or "Wake Forest".
Is there a way to generate the URL for each image? Any help would be greatly appreciated.
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100"
"https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"
Your code returns these.
500/150 is Duke and 500/154 is Wake Forest. You can create a simple dataframe with these and then join the tables.
# Reference table mapping each logo URL to its team name
link_df <- as.data.frame(link)
link_ref_df <- data.frame(
  link = c("https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/150.png&h=100&w=100",
           "https://a.espncdn.com/combiner/i?img=/i/teamlogos/ncaa/500/154.png&h=100&w=100"),
  team_name = c("Duke", "Wake Forest")
)

# Left join: keep every event row, attach the team name where the logo matches
link_merged <- merge(link_df, link_ref_df, by = "link", all.x = TRUE)
This is not scalable if you're doing hundreds of these with other teams, but it works for this specific case.
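To harden it slightly (a sketch), you could join on the numeric team id embedded in the logo URL (150 and 154 above) instead of the full string, so changes to the &h=/&w= size parameters don't break the match:

# Extract the numeric team id (e.g. "150", "154") from each logo URL.
link_df$team_id <- sub(".*/([0-9]+)\\.png.*", "\\1", link_df$link)

team_ref <- data.frame(team_id   = c("150", "154"),
                       team_name = c("Duke", "Wake Forest"))

link_merged <- merge(link_df, team_ref, by = "team_id", all.x = TRUE)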
I am looking to scrape the main table from this website. I have managed to get the table into R and working, but the one problem is that the website defaults to PS4, while I want the data for Xbox (this is changed in the top-right dropdown menu).
Ideally there would be a way to pass a parameter in the URL that defines the platform, but I haven't been able to find anything about that.
Looking around, it seems that PhantomJS would be the best way to go, but I have no experience with JavaScript, and I'm not sure how you would perform an action on the page and then scrape the resulting table.
This is currently all I have in terms of my main code scraping the data:
library(rvest)

url1 <- "https://www.futbin.com/19/players?page="
pge <- 1
tbl <- paste0(url1, pge) %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()
Thanks in advance for any help.
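One thing to try before reaching for PhantomJS (a sketch, not verified against the site): if the dropdown stores the platform choice in a cookie, you could send that cookie with the request and parse the response with rvest. The cookie name and value below are placeholders; the real ones would be visible in the browser's DevTools after switching the dropdown:

library(httr)
library(rvest)

# Hypothetical cookie: inspect DevTools > Application > Cookies on futbin.com
# after switching the dropdown; "platform = xbox" is a placeholder.
resp <- GET("https://www.futbin.com/19/players?page=1",
            set_cookies(platform = "xbox"))

tbl <- read_html(content(resp, as = "text")) %>%
  html_nodes(xpath = '//*[@id="repTb"]') %>%
  html_table()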
In the last couple of months I have used a script to get historical data from Wunderground. The code worked, but since today they have somehow changed the website. Unfortunately I am not very familiar with HTML. This was the code, which used to work:
library(rvest)

webpage <- read_html("https://www.wunderground.com/history/monthly/LOWW/date/2018-06?dayend=18&monthend=6&yearend=2018&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=")
tbls <- html_nodes(webpage, "table")
weather.data <- html_table(tbls)[[2]]
This code selected the second table on the website. Does anyone have an idea why it no longer works? Did they somehow change the node structure?
Cheers
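A quick diagnostic sketch: count how many table nodes the static HTML actually contains. If it is zero, the tables are now built by JavaScript after the page loads, which read_html() never executes, and a rendering approach such as RSelenium would be needed.

library(rvest)

webpage <- read_html("https://www.wunderground.com/history/monthly/LOWW/date/2018-06?dayend=18&monthend=6&yearend=2018&req_city=&req_state=&req_statename=&reqdb.zip=&reqdb.magic=&reqdb.wmo=")

# 0 here means the tables are injected by JavaScript and are absent from
# the raw HTML that read_html() downloads.
length(html_nodes(webpage, "table"))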