Web Scraping using rvest in R - r

I have been trying to scrap information from a url in R using the rvest package:
url <-'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'
but am not able to correctly identity the xpath even after using selector plugin.
The code i am using for fetching the first table is as follows:
detail_data <- read_html(url)
detail_data_raw <- html_nodes(detail_data, xpath='//*[#id="edit-t-
fullview"]/table[2]/tbody/tr[2]/td/table')
detail_data_fine <- html_table(detail_data_raw)
When i try the above code, the detail_data_raw results in {xml_nodeset (0)} and consequently detail_data_fine is an empty list()
The information i am interested in scrapping is under the headers:
Organisation Details
Tender Details
Critical Dates
Work Details
Tender Inviting Authority Details
Any help or ideas in what is going wrong and how to rectify it is welcome.

Your example URL isn't working for anyone, but if you're looking to get the data for a particular tender, then:
library(rvest)
library(stringi)
library(tidyverse)
pg <- read_html("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")
html_nodes(pg, xpath=".//table[#class='viewtablebg']/tr/td[1]") %>%
html_text(trim=TRUE) %>%
stri_replace_last_regex("\ +:$", "") %>%
stri_replace_all_fixed(" ", "_") %>%
stri_trans_tolower() -> tenders_cols
html_nodes(pg, xpath=".//table[#class='viewtablebg']/tr/td[2]") %>%
html_text(trim=TRUE) %>%
as.list() %>%
set_names(tenders_cols) %>%
flatten_df() %>%
glimpse()
## Observations: 1
## Variables: 15
## $ organisation_name <chr> "Delhi Jal Board"
## $ organisation_type <chr> "State Govt. and UT"
## $ tender_reference_number <chr> "Short NIT. No.20 (Item no.1) EE ...
## $ tender_title <chr> "Short NIT. No.20 (Item no.1)"
## $ product_category <chr> "Civil Works"
## $ tender_fee <chr> "Rs.500"
## $ tender_type <chr> "Open/Advertised"
## $ epublished_date <chr> "18-Aug-2017 05:15 PM"
## $ document_download_start_date <chr> "18-Aug-2017 05:15 PM"
## $ bid_submission_start_date <chr> "18-Aug-2017 05:15 PM"
## $ work_description <chr> "Replacement of settled deep sewe...
## $ pre_qualification <chr> "Please refer Tender documents."
## $ tender_document <chr> "https://govtprocurement.delhi.go...
## $ name <chr> "EXECUTIVE ENGINEER (NORTH)-II"
## $ address <chr> "EXECUTIVE ENGINEER (NORTH)-II\r\...
seems to work just fine w/o installing Python and using Selenium.

Have a look at 'dynamic webscraping'. Typically what happens is when you enter the url in your browser, it sends a get request to the host server. The host server builds an HTML page with all data in it and posts it back to you. In dynamic pages, the server just sends you a HTML template, which once you open, runs javascript in your browser, which then retrieves the data that populates the template.
I would recommend scraping this page using python and the Selenium library. Selenium library gives your program the ability to wait until the javascript has run in your browser and retrieved the data. See below a query I had on the same concept, and a very helpful reply
BeautifulSoup parser can't access html elements

Related

Scraping movie scripts failing on small subset

I'm working on scraping the lord of the rings movie scripts from this website here. Each script is broken up across multiple pages that look like this
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
html_elements("#AutoNumber1") %>%
html_table()
summary(success)
Length Class Mode
[1,] 2 tbl_df list
This works for all Fellowship of the Ring pages, and all Return of the King pages. It also works for Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
html_elements("#AutoNumber1") %>%
html_table()
summary(fail)
Length Class Mode
0 list list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
Works with xpath. Perhaps ill-formed html (page doesn't seem too spec compliant)
library(rvest)
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
html_elements( xpath = '//*[#id="AutoNumber1"]') %>%
html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria.  Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows

Scraping a web page in R without using RSelenium

I’m trying to do a simple scrap in the table in the following url:
https://www.bcb.gov.br/controleinflacao/historicometas
Page Print
By what i notice is that, When using rvest::read_html or httr::GET and when acessing the page source code i can't see anything related to the table, but when acessing google chrome developer tools, i can spot the table references in the elements tab.
Examble above is a simple code where i try to acess the content of the url and search of nodes that contain tables:
library( tidyverse )
library( rvest )
url <- “https://www.bcb.gov.br/controleinflacao/historicometas”
res <- url %>%
read_html( ) %>%
html_node( “table” )
this give me:
{xml_nodeset (0)}
opening the source code of the url mentioned we can see:
view-source:https://www.bcb.gov.br/controleinflacao/historicometas
Page Source Code print
Page Developer Tool table print
By what i have searched the question is that the scripts avaible in source code load the table. I have seen some solutions that use RSelenium, but i would like to know if there is some solution where i can scrap this table without using Rselenium.
Some other related StackOverflow questions:
Scraping webpage (with R) where all elements are placed inside an <app-root> tag
scraping table from a website result as empty
(Last one is a python example)
When dealing with dynamic sites, Network tab tends to be more useful than Inspector. And often you don't have to scroll through hundreds of requests or pages of minified javascript, you rather pick a search term from rendered page to identify the api endpoint that sent that piece of information.
In this case searching for "Resolução CMN nº 2.615" pointed to the correct call, most of the site content (in pure html) was delivered as json.
library(tibble)
library(rvest)
historicometas <- jsonlite::read_json("https://www.bcb.gov.br/api/paginasite/sitebcb/controleinflacao/historicometas")
historicometas$conteudo %>%
read_html() %>%
html_element("table") %>%
html_table()
#> # A tibble: 27 × 7
#> Ano Norma Data Meta …¹ Taman…² Inter…³ Infla…⁴
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1999 Resolução CMN nº 2.615 ​ 30/6… 8 2 6-10 8,94
#> 2 2000 Resolução CMN nº 2.615 ​ 30/6… ​6 ​2 4-8 5,97
#> 3 2001 Resolução CMN nº 2.615 ​ 30/6… ​4 ​2 2-6 7,67
#> 4 2002 Resolução CMN nº 2.744 28/6… 3,5 2 1,5-5,5 12,53
#> 5 2003* Resolução CMN nº 2.842Resolução … 28/6… 3,254 22,5 1,25-5… 9,309,…
#> 6 2004* Resolução CMN nº 2.972Resolução … 27/6… 3,755,5 2,52,5 1,25-6… 7,60
#> 7 2005 Resolução CMN nº 3.108 25/6… 4,5 2,5 2-7 5,69
#> 8 2006 Resolução CMN nº 3.210 30/6… 4,5 ​2,0 2,5-6,5 3,14
#> 9 2007 Resolução CMN nº 3.291 23/6… 4,5 ​2,0 2,5-6,5 4,46
#> 10 2008 Resolução CMN nº 3.378 29/6… 4,5 ​2,0 2,5-6,5 5,90
#> # … with 17 more rows, and abbreviated variable names ¹​`Meta (%)`,
#> # ²​`Tamanhodo intervalo +/- (p.p.)`, ³​`Intervalode tolerância (%)`,
#> # ⁴​`Inflação efetiva(Variação do IPCA, %)`
Created on 2022-10-17 with reprex v2.0.2

scraping with select/ option dropdown

List item
I am new to web scrapping and after a couple of Wikipedia pages I found this page where I wanted to extract the tables for all the portfolio managers. I am not able to use the things I found on the internet. I thought it would be easy since it's just a table but I am not able to extract even a single table after filling out the form. Can someone please tell me how I could get this done in R? I have added an image in this post but it seems to look like a link that says to enter image description here.
https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes
library(tidyverse)
library(rvest)
library(httr)
library(RCurl)
url <- "https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes"
result <- postForm(url,
pmrId="RIGHT HORIZONS PORTFOLIO MANAGEMENT PRIVATE LIMITED",
year="2022",
month="August")
attr(result,"Content-Type")
result
enter image description here
Sebi Website
If you change those passed values to corresponding value attribute values of options (i.e. "8" instead of "August" in case of <option value="8">August</option>), you should be all set.
And you can also check the actual payload of POST requests:
Lazy approach would be just using Copy as cURL in DevTools and heading to https://curlconverter.com/r/ to convert it to httr request.
library(rvest)
resp <- httr::POST("https://www.sebi.gov.in/sebiweb/other/OtherAction.do?doPmr=yes",
body = list(
pmrId="INP000004417##INP000004417##AEQUITAS INVESTMENT CONSULTANCY PRIVATE LIMITED",
year="2022",
month="8"))
tables <- resp %>%
read_html() %>%
html_elements("table") %>%
html_table()
# first table:
tables[[1]]
#> # A tibble: 11 × 2
#> X1 X2
#> <chr> <chr>
#> 1 Name of the Portfolio Manager "Aeq…
#> 2 Registration Number "INP…
#> 3 Date of Registration "201…
#> 4 Registered Address of the Portfolio Manager ",,,…
#> 5 Name of Principal Officer ""
#> 6 Email ID of the Principal Officer ""
#> 7 Contact Number (Direct) of the Principal Officer ""
#> 8 Name of Compliance Officer ""
#> 9 Email ID of the Compliance Officer ""
#> 10 No. of clients as on last day of the month "124…
#> 11 Total Assets under Management (AUM) as on last day of the month (Amoun… "143…
Created on 2022-10-11 with reprex v2.0.2

How do I access (new july 2022) targeting information from the Facebook Ad Library API (R solution preferred)?

As this announcement mentions (https://www.facebook.com/business/news/transparency-social-issue-electoral-political-ads) new targeting information (or a summary) has been made available in the Facebook Ad Library.
I am used to use the 'Radlibrary' package in R, but I can't seem to find any fields in 'Radlibrary' which allows me to get this information? Does anyone know either how to access this information from the Radlibrary package in R (preferred, since this is what I know and usually works with) or how to access this from the API in another way?
I use it to look at how politicians choose to target their ads, why it would be a too big of a task to manually look it up at the facebook.com/ads/library
EDIT
The targeting I refer to is found browsering the ad library like the screenshots below
Thanks for highlighting this data being published which I did not know had been announced. I just registered for an API token to play around with it.
It seems to me that looking for ads from a particular politician or organisation is a question of downloading large amounts of data and then manipulating it in R. For example, to recreate the curl query on the API docs page:
curl -G \
-d "search_terms='california'" \
-d "ad_type=POLITICAL_AND_ISSUE_ADS" \
-d "ad_reached_countries=['US']" \
-d "access_token=<ACCESS_TOKEN>" \
"https://graph.facebook.com/<API_VERSION>/ads_archive"
We can simply do:
# enter token interactively so it doesn't get added to R history
token <- readline()
query <- adlib_build_query(
search_terms = "california",
ad_reached_countries = 'US',
ad_type = "POLITICAL_AND_ISSUE_ADS"
)
response <- adlib_get(params = query, token = token)
results_df <- Radlibrary::as_tibble(response, censor_access_token = TRUE)
This seems to return what one would expect:
names(results_df)
# [1] "id" "ad_creation_time" "ad_creative_bodies" "ad_creative_link_captions" "ad_creative_link_titles" "ad_delivery_start_time"
# [7] "ad_snapshot_url" "bylines" "currency" "languages" "page_id" "page_name"
# [13] "publisher_platforms" "estimated_audience_size_lower" "estimated_audience_size_upper" "impressions_lower" "impressions_upper" "spend_lower"
# [19] "spend_upper" "ad_creative_link_descriptions" "ad_delivery_stop_time"
library(dplyr)
results_df |>
group_by(page_name) |>
summarise(n = n()) |>
arrange(desc(n))
# # A tibble: 237 x 2
# page_name n
# <chr> <int>
# 1 Senator Brian Dahle 169
# 2 Katie Porter 122
# 3 PragerU 63
# 4 Results for California 28
# 5 Big News Buzz 20
# 6 California Water Service 20
# 7 Cancer Care is Different 17
# 8 Robert Reich 14
# 9 Yes On 28 14
# 10 Protect Tribal Gaming 13
# # ... with 227 more rows
Now - assuming that you are interested specifically in the ads by Senator Brian Dahle, it does not appear that you can send a query for all ads he has placed (i.e. using the page_name parameter in the query). But you can request for all political ads in their area (setting the limit parameter to a high number) with a particular search_term or search_page_id, and then filter the data to the relevant person.

Extract meta description from web pages using R

Hi i am trying to retrive these wepages meta descriptions
from the pages sources "
Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))
Desired output
Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook",
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))
I was trying to achieved this task with httr but I am not ablet to get it in the desired ouput format or extract content from what it is retrieved using the GET command
library (httr)
resp<-GET ("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
List of 10
$ url : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
$ status_code: int 200
$ headers :List of 22
..$ server : chr "Apache/2.2"
The field that i need to extract from the source code is after this string
<meta itemprop="description" content="
Like so
<meta itemprop="description" content="'Spam King'
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
You really only need rvest. Since they're all <h1> headings, you can just iterate over the list of URLs, picking out the headings:
library(rvest)
sapply(Data$Pages,
function(url){
url %>%
as.character() %>% # in case strings are stored as factors
read_html() %>%
html_nodes('h1') %>%
html_text()
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
Or if you really want to scrape the <meta> tags, you can do it in the same way, though the selectors are more of a pain:
sapply(Data$Pages, function(url){
url %>%
as.character() %>%
read_html() %>%
html_nodes(xpath = '//meta[#itemprop="description"]') %>%
html_attr('content')
})
You get the same results either way.

Resources