Web scraping using a url-list in R

I am trying to scrape some URLs from multiple websites I collected. I saved the already collected websites in a dataframe called meetings2017_2018. The problem is that the URLs don't look very similar to one another except for the first part: https://amsterdam.raadsinformatie.nl. The second parts of the URLs are saved in the dataframe. Here are some examples:
/vergadering/458873/raadscommissie%20Algemene%20Zaken
/vergadering/458888/raadscommissie%20Wonen
/vergadering/458866/raadscommissie%20Jeugd%20en%20Cultuur
/vergadering/346691/raadscommissie%20Algemene%20Zaken
So the whole URL would be https://amsterdam.raadsinformatie.nl/vergadering/458873/raadscommissie%20Algemene%20Zaken
I managed to create a very simple function from which I can pull out the URLs from a single website (see below).
library(glue)
library(rvest)

web_scrape <- function(meeting) {
  url <- glue("https://amsterdam.raadsinformatie.nl{meeting}")
  read_html(url) %>%
    html_nodes("a") %>%
    html_attr("href")
}
With this function I still need to insert every single URL from the dataframe I want to scrape. Since there are more than 140 URLs in the dataframe, this would take a while. As you can guess, I want to scrape all the URLs at once using the url-list in the dataframe. Does anybody know how I can do that?

You could map/iterate over the URLs saved in your meetings2017_2018 data frame.
Assuming your URLs are saved in a url column of meetings2017_2018, a starting point would be:
# create a vector of the URLs
urls <- pull(meetings2017_2018, url)
# map over the URLs and execute whatever code you want for every URL
map(urls, function(url) {
  your_code
})
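As a concrete sketch, reusing the web_scrape() function from the question (assuming the suffixes live in a column called url and that purrr and dplyr are loaded):
library(purrr)
library(dplyr)

# vector of URL suffixes such as "/vergadering/458873/raadscommissie%20Algemene%20Zaken"
urls <- pull(meetings2017_2018, url)

# run web_scrape() once per meeting; the result is a list with one
# character vector of scraped hrefs per meeting page
all_links <- map(urls, web_scrape)
names(all_links) <- urls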

Related

How to table data scraped from the web and read all the data from a table

I am trying to scrape data from the web, specifically from a table that has different filters and pages and I have the following code:
library(rvest)
url.colombia.compra <- "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to=&date_from="
tmp <- read_html(url.colombia.compra)
tmp_2 <- html_nodes(tmp, ".active")
The problem is that the code generates a list, but I need to format it as a table and I have not succeeded. Besides that, it only shows me data from the first page of the table. How could I extend the code so that it reads the data from all the pages of the table and formats it as a table?
[Screenshot of the table showing the data]
I would split this problem into two parts. The first is how to programmatically access each of the 11 pages of this online table.
Since this is a simple html table, using the "Next" button (siguiente) will take us to a new page. If we look at the URL on the Next page, we can see the page number in the query parameters.
...tienda-virtual-del-estado-colombiano/ordenes-compra?page=1&number_order=&state...
We know that the pages are numbered starting with 0 (because "Next" takes us to page=1), and using the navigation bar we can see that there are 11 pages.
We can use the query parameters to construct a series of rvest::read_html() calls, one per page number, simply by using lapply and paste0 to fill in the page= parameter. This lets us access all the pages of the table.
The second part is making use of rvest::html_table(), which parses a tibble from the result of read_html().
pages <-
  lapply(0:11, function(x) {
    data.frame(
      html_table(
        read_html(x = paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
                             x,
                             "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="))
      )
    )
  })
The result is a list of data frames, which we can combine with do.call.
do.call(rbind, pages)
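Equivalently (a small variation, not part of the original answer), dplyr::bind_rows() combines the list and copes better when a page's table comes back with slightly different column names:
library(dplyr)

# stack the per-page data frames into one table
all_orders <- bind_rows(pages)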

R: Automatically 'translate' shortened URLs from Twitter data

I am looking for a way to automatically 'translate' the shortened URLs from Twitter to the original URL.
I scraped a couple of twitter timelines using following code:
tweets <- userTimeline("exampleuser", n = 3200, includeRts=TRUE)
tweets_df <- tbl_df(map_df(tweets, as.data.frame))
Then I separated the shortened URLs from the rest of the tweet text, so that I have a separate column in my dataframe, which contains only the shortened URL.
Now I am looking for a way to automatically scrape all these URLs, which redirect to various websites, and get a new column with the original (i.e. unshortened) URL.
Anyone an idea how I can do this in R?
Thanks,
Manuel
You can use the httr package.
httr::HEAD("URL") follows the redirects and returns a response; the resolved URL appears in the first line of the printed response, and from there you can do the usual cleaning to get just the URLs.
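As a minimal sketch (the response's url field holds the final location after redirects; the column names here are assumptions, not from the question):
library(httr)
library(purrr)

# hypothetical helper: resolve one shortened URL to its final destination
unshorten <- function(short_url) {
  resp <- HEAD(short_url)
  resp$url  # the URL httr ended up at after following redirects
}

# apply it to the column of shortened URLs (column name assumed)
tweets_df$expanded_url <- map_chr(tweets_df$short_url, unshorten)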

Getting right part of web-scraped xml file into data frame in R

I'm trying to scrape the URL of the first website returned from Google searches using the Rvest package in R.
I seem to be able to get the URL into an XML file, but I can't transfer the right part of the XML file into a data frame.
I've used the code below.
url <- 'https://www.google.co.nz/search?rlz=1C1GCEB_enNZ790NZ790&ei=P4jsW6fbL4_RrQHd_K3wBw&q=auckland+university+of+technology+lifespan+development+and+communication+heal504&oq=auckland+university+of+technology+lifespan+development+and+communication+heal504&gs_l=psy-ab.3...20931.45570..45696...3.0..2.284.15672.0j63j18......0....1..gws-wiz.......0j0i71j35i39j0i67j0i131j0i131i67j0i20i263j0i13j0i22i10i30j0i22i30j33i21j33i160j33i22i29i30j33i10.xTnG49NmCBs'
googleurl <- read_html(url)
address <- html_nodes(googleurl,'.r')
address <- html_text(address)
urlname <- data.frame(address)
I can see the URL when I open the XML file in R as pictured in the attached image. However, when I transfer this to a data frame using html_text the relevant URL seems to be lost.
html_text() returns the text of an element; to get the URL you need to select the <a> tag and use html_attr():
address <- html_nodes(googleurl,'.r>a')
address <- html_attr(address, "href")
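Putting it together with the question's code (a sketch; the selector and the data frame column follow the question):
library(rvest)

googleurl <- read_html(url)

# take the <a> tag inside each .r result and keep its href attribute
address <- googleurl %>%
  html_nodes(".r>a") %>%
  html_attr("href")

# one row per result URL, as in the question's data frame
urlname <- data.frame(address = address, stringsAsFactors = FALSE)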

Scrape data from flash page using rvest

I am trying to scrape data from this page:
http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?
If I try to scrape the name of the players using the css selector and the usual rvest syntax:
names <- read_html("http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?") %>%
html_nodes(".scoring-player-name") %>% sapply(html_text)
everything goes well.
Unfortunately, if I try to scrape the statistics below (first serve pts won, ..) using the selector .stat-breakdown span, I am not able to retrieve any data.
I know rvest is generally not recommended for scraping dynamically created pages; however, I don't understand why some data are scraped and some are not.
I don't use rvest. If you follow the code below, you should end up with the format shown in the screenshot: basically a string which you can transform into a data frame based on the separators : and ,.
This tag also contains more information than is displayed in the UI of the webpage.
I could also try RSelenium, but I need to get my other PC, so I will let you know if RSelenium works for me.
library(XML)
library(RCurl)
library(stringr)
url<-"http://www.atpworldtour.com/en/tournaments/brisbane-international-presented-by-suncorp/339/2016/match-stats/r975/f324/match-stats?"
url2<-getURL(url)
parsed<-htmlParse(url2)
# pull the match stats from the <script id="matchStatsData"> tag
step1<-xpathSApply(parsed,"//script[@id='matchStatsData']",xmlValue)
# removing some unwanted characters
step2<-str_replace_all(step1,"\r\n","")
step3<-str_replace_all(step2,"\t","")
# drop brackets, braces and quotes
step4<-str_replace_all(step3,"[][{}\"]","")
The output is then a string like the one shown in the screenshot.
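A rough sketch of the final step, assuming (as implied above) the cleaned string is a series of name:value pairs separated by commas; the exact layout depends on the matchStatsData payload:
# split into pairs, then into field/value columns
pairs <- unlist(strsplit(step4, ","))
kv <- str_split_fixed(pairs, ":", 2)
stats <- data.frame(field = str_trim(kv[, 1]),
                    value = str_trim(kv[, 2]),
                    stringsAsFactors = FALSE)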

How can I scrape data from this website (multiple webpages) using R?

I am a beginner at scraping data from websites. I find it difficult to interpret the structure of the HTML using XML or other packages.
Can anyone help me to download the data from this website?
http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp
It is about the investment from China. The character set is in Chinese.
What I've tried so far:
library("rvest")
url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"
firm <- url %>%
  html() %>%
  html_nodes(xpath = '//*[@id="Grid1MainLayer"]/table[1]') %>%
  html_table()
firm <- firm[[1]]
head(firm)
You can try the readHTMLTable function from the XML package, which downloads all the tables on the page and already formats them as data.frames.
library(XML)
all_tables = readHTMLTable("http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp")
Then, since there is only one table on the page you linked, it should be enough to get the first element:
target_table = all_tables[[1]]
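Since the page is in Chinese, it may also be worth parsing with an explicit encoding before extracting the table (a sketch, assuming the page is served as UTF-8):
library(XML)
library(RCurl)

url <- "http://wszw.hzs.mofcom.gov.cn/fecp/fem/corp/fem_cert_stat_view_list.jsp"

# fetch the raw HTML, then parse with an explicit encoding so the
# Chinese characters are preserved
raw_html <- getURL(url)
parsed <- htmlParse(raw_html, asText = TRUE, encoding = "UTF-8")
all_tables <- readHTMLTable(parsed)

target_table <- all_tables[[1]]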
