PhantomJS with R

I am trying to scrape data from a web page. Since the page has dynamic content, I used PhantomJS to handle it. With the code I have, though, I can only download the data that is initially shown on the page; what I need is to enter a date range and submit the form to get all the data I want.
Here is the code I used:
library(xml2)
library(rvest)

# 'url' is the page to render and 'path' is the directory that contains the
# phantomjs binary (both are defined earlier in my script)
connection <- "pr.js"
writeLines(sprintf("var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function () {
  console.log(page.content); // page source
  fs.write('pr.html', page.content, 'w');
  phantom.exit();
});", url), con = connection)

system_input <- paste(path, "phantomjs", " ", connection, sep = "")
system(system_input)
With this code I get the HTML output of the dynamically generated page. But, as I said, I also need to fill in the date range and submit the form, and I could not manage that.
The URL is: https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml
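One possible way to handle the date form with the same PhantomJS pattern is to fill the inputs and click the submit button inside page.evaluate(), then give the AJAX refresh a moment before saving the content. A rough, untested sketch along those lines; the element IDs and date format below are placeholders that have to be replaced with the real ones from the page source, and the fixed timeout is only a crude wait:
library(xml2)
library(rvest)

url <- "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml"
connection <- "pr_dates.js"

writeLines(sprintf("var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function () {
  page.evaluate(function () {
    // placeholder IDs and date format -- inspect the page to find the real ones
    document.getElementById('startDate').value = '01.01.2021';
    document.getElementById('endDate').value   = '31.01.2021';
    document.getElementById('submitButton').click();
  });
  // crude wait for the AJAX refresh to finish before saving the new content
  window.setTimeout(function () {
    fs.write('pr_dates.html', page.content, 'w');
    phantom.exit();
  }, 5000);
});", url), con = connection)

system(paste0(path, "phantomjs ", connection))  # 'path' as in the code above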

Related

IMPORTXML comes back as blank

I'm trying to get a restaurant's address from a website using IMPORTXML.
Copying the XPath of the selected line (for the street) gives me:
//*[@id="marmita-panel0-2"]/div/div[2]/p[2]
The formula I enter in the Google Sheets cell is =IMPORTXML("URL Address","//*[@id='marmita-panel0-2']/div/div[2]/p[2]"), and it returns blank.
The site is rendered by JavaScript on the client side, not the server side, so you cannot retrieve the information with IMPORTXML. Fortunately the data is contained in a JSON block inside the source, so try:
function info(){
  var url = 'https://www.ifood.com.br/delivery/santo-andre-sp/hamburgueria-sabor-amigo-parque-das-nacoes/1d270c55-1158-49a7-8df4-f369402a07e0';
  var source = UrlFetchApp.fetch(url).getContentText();
  // the data sits in the page's <script type="application/ld+json"> blocks;
  // take the content after the second opening tag and cut it at the closing tag
  var jsonString = source.split('<script type="application/ld+json">')[2].split('</script>')[0];
  var data = JSON.parse(jsonString);
  Logger.log(data.address.streetAddress);
  Logger.log(data.address.addressLocality);
}
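Since this thread is R-focused, the same idea can also be sketched in R with rvest and jsonlite: read the static source, pull out the application/ld+json blocks, and parse the one that carries the address. The URL is the one from the answer above; which block actually holds the address may change, so the index is a guess to verify:
library(rvest)
library(jsonlite)

url <- "https://www.ifood.com.br/delivery/santo-andre-sp/hamburgueria-sabor-amigo-parque-das-nacoes/1d270c55-1158-49a7-8df4-f369402a07e0"
page <- read_html(url)

# collect every <script type="application/ld+json"> block in the source
ld_blocks <- page %>%
  html_nodes(xpath = '//script[@type="application/ld+json"]') %>%
  html_text()

# the Apps Script answer uses the second block; adjust the index if needed
data <- fromJSON(ld_blocks[[2]])
data$address$streetAddress
data$address$addressLocality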

How to scrape Local Storage KEY/VALUES with R or Python (RVEST, HTTR, XHR, or something like that)

I've been trying to scrape the data on this page: https://data.anbima.com.br/debentures?page=1&size=2000&.
I can easily get that table using rvest, bs4, etc.
However, I found that the JSON file that feeds the table's data contains other useful complementary information.
I then found that the XHR link in the browser's inspection panel gives access to that JSON file.
I had been using that link (https://data.anbima.com.br/debentures-bff/debentures?page=0&size=2000&field=&order=&) for several months, but in the last few weeks it started to require an authorization code (TOKEN). The issue is that this token changes after some period of time or some other criterion.
I explored a little more and figured out that the TOKEN is generated by JavaScript and stored somewhere in the page's Local Storage. I need this token to include it as a header in my code...
My simple question is: how can I scrape that value with R or Python?
library(httr)
library(rlist)
library(jsonlite)
library(dplyr)
library(tidyverse)
library(V8)

resp <- GET("https://data.anbima.com.br/debentures?page=1&size=1499")
http_type(resp)
http_error(resp)

query <- list(
  page = "0",
  size = "1470",
  field = "",
  order = ""
)

URL <- "https://data.anbima.com.br/debentures-bff/debentures"
resp <- GET(URL,
            c(
              # add_headers(Referer = "https://data.anbima.com.br/debentures?page=1&size=1470&"),
              add_headers(Authorization = "03AGdBq25HDdu4v2AzEjXJ_twI97EMrFlaNIcs3IuDHWzTFIp2mCXBqPaQPikuK7VRS3D7IC2v5briUdxPK3LpMPqrb1NoBqcXuI8gUkFdgVyNlObIdNzwQpVjcYASaW9N_gDx-M0SclFK54dDXHyRI7UVPAEQryV-1YSF6ebdJbY4BDr_eXRgMYe6UcK_Uh0YdfU1pMlcuU8O5dXKoRA-9GcX_AeaUxAUo5Mo_hQEGb0IPkPxojvEfgHvFdK0SQ4wgnmnJ0pcieO3h2exnJY1QxQd9sqqkfzdbGLaaCC7eNeWzXRAO3Yd9HtUciMclK612LfEm_ut89rtw8hSzlX3ZY6Vmo6zTvPT0WlMUrGLZ7syDEoDJKCi5xv6CSNgdAxqqqudEltDPUB7")
            ),
            query = query)
js <- fromJSON(content(resp, as = "text"))[[1]]
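One way to get at the Local Storage value is to let a real browser build it and then read it back with a small piece of JavaScript. A sketch with RSelenium, assuming a local Selenium driver is available and that the token sits under a key such as 'token' (the actual key name has to be checked in the browser's Application > Local Storage panel):
library(RSelenium)
library(httr)
library(jsonlite)

rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client
remDr$navigate("https://data.anbima.com.br/debentures?page=1&size=2000&")
Sys.sleep(5)  # give the page's JavaScript time to generate and store the token

# executeScript() runs JavaScript in the page and returns the value to R;
# 'token' is a guessed key name -- replace it with the real one from the panel
token <- remDr$executeScript("return window.localStorage.getItem('token');")[[1]]

resp <- GET("https://data.anbima.com.br/debentures-bff/debentures",
            add_headers(Authorization = token),
            query = list(page = "0", size = "2000", field = "", order = ""))
js <- fromJSON(content(resp, as = "text"))

remDr$close()
rD$server$stop()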

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, but I am open to any other approach.
Thank you in advance.
This is just pseudocode to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    if pep_id == '0':
        # for the initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # Enter some parsing logic
    else:
        final_url = base_url + str(pep_id)
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # Enter some parsing logic

Scraping each of the linked pages and storing the results as an XML table

Hello, I'm new to using R to scrape data from the Internet and, sadly, know little about HTML and XML. I am trying to scrape each story link from the following parent page: https://news.google.com/search?q=NREGA&hl=en-IN&gl=IN&ceid=IN%3Aen I don't care about any of the other links on the parent page, but I need to create a table with columns for the URL, the title of the story, and the complete text of the page (which can be several paragraphs of text).
I tried with the rvest package and got the URLs, but the real issue is looping over all the articles, extracting the text, and storing everything in a table.
For Google News app:
library(rvest)
url <- 'https://news.google.com/search?q=NREGA&hl=en-IN&gl=IN&ceid=IN%3Aen'
webpage <- read_html(url)
data_html <- html_nodes(webpage, '.VDXfz') %>% html_attr('href')
I will provide JavaScript examples since I am not aware of the library you are using.
1. Getting the links of all the URLs:
var anchors = document.querySelectorAll("article > a");
for (var i in anchors) {
  console.log(anchors[i].getAttribute("href"));
}
2. Getting the header of each URL link:
var headers = document.querySelectorAll("article > div:nth-of-type(1)");
for (var i in headers) {
  console.log(headers[i].innerText);
}
3. Getting the story once you have navigated to that link:
var story = document.querySelector("div.full-details").innerText;
console.log(story);
This will also fetch some extra details, such as the number of shares on social media shown at the top, the written-by line, etc. If you want just the body without these details, you can get all paragraph elements using document.querySelectorAll("div.full-details p") and read the innerText property of each, which you can combine later.
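Since the question uses rvest, the same selectors translate fairly directly to R. A sketch (the CSS classes come from the JavaScript above and from Google News' current markup, so they may stop working when the page changes; the hrefs are relative and have to be resolved against https://news.google.com):
library(rvest)
library(xml2)

url <- "https://news.google.com/search?q=NREGA&hl=en-IN&gl=IN&ceid=IN%3Aen"
page <- read_html(url)

# story links and headlines, using the selectors from the answer above
links  <- page %>% html_nodes("article > a") %>% html_attr("href")
links  <- url_absolute(links, "https://news.google.com/")  # resolve "./articles/..."
titles <- page %>% html_nodes("article > div:nth-of-type(1)") %>% html_text()

# follow a link and collapse the article body into one string;
# "div.full-details p" is the selector from the answer and is site-dependent
get_story <- function(link) {
  paste(read_html(link) %>% html_nodes("div.full-details p") %>% html_text(),
        collapse = "\n")
}

# assumes links and titles line up one-to-one
stories <- data.frame(url = links,
                      title = titles,
                      text = vapply(links, get_story, character(1)),
                      stringsAsFactors = FALSE)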

Scraping data using Python3 from JS generated content

I need to scrape a website (say "www.example.com") from a Python 3 program. The site has a form with two elements:
1: Textbox
2: Dropdown
I need to run queries with several options (e.g. 'abc' and '1') filled in or selected in the form above, and scrape the pages generated that way. The pages generated after filling in the form and submitting have a URL, as seen in the browser, of "www.example.com/abc/1". The results on this page are fetched through JavaScript, as can be verified in the page source. A synopsis of the relevant JavaScript is below:
<script type="text/rfetchscript">
  $(document).ready(function(){
    $.ajax({
      url: "http://clients.example.com/api/search",
      data: JSON.parse('{"textname":"abc", "dropval":"1"}'),
      method: 'POST',
      dataType: 'json',
      // ... logic to fetch the data ...
</script>
I have tried to get the results of the page using methods from requests and urllib:
1:
resp = requests.get('http://www.example.com/abc/1')
2:
req = urllib.request.Request('http://www.example.com/abc/1')
x = urllib.request.urlopen(req)
SourceCode = x.read()
3: I also tried Scrapy.
But all of the above return only the static data seen in "view page source", not the actual results visible in the browser.
Looking for help on the right approach here.
Scraping pages with urllib or requests will only return the page source, since they cannot execute the JavaScript code that the server returns. If you want to load the content just like your browser does, you have to use Selenium with a Chrome or Firefox driver. If you want to keep using urllib or requests, you have to find out which content the site loads in the background, for example via the Network tab in your Chrome browser. The data you are interested in is probably loaded from a JSON file.
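Since this thread is R-centric, here is roughly what the second suggestion (calling the underlying endpoint directly) might look like with httr, using the endpoint and field names that appear in the question's script. Whether the endpoint expects a JSON body or form fields, and which extra headers it needs, has to be checked in the Network tab:
library(httr)
library(jsonlite)

# endpoint and field names taken from the question's $.ajax call (placeholders)
resp <- POST("http://clients.example.com/api/search",
             body = list(textname = "abc", dropval = "1"),
             encode = "json")  # switch to encode = "form" if the API wants form data

data <- fromJSON(content(resp, as = "text"))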
