Scraping data using Python3 from JS generated content - web-scraping

I need to scrape a website (say "www.example.com") from a Python 3 program. The site has a form with two elements:
1: Textbox
2: Dropdown
I need to run queries with several options (e.g. 'abc' and '1') filled in / selected in the above form and scrape the pages thus generated. After filling and submitting the form, the resulting page has a URL that appears in the browser as "www.example.com/abc/1". The results on this page are fetched through JavaScript, as can be verified in the page source. A synopsis of the relevant script is below:
<script type="text/rfetchscript">
    $(document).ready(function(){
        $.ajax({
            url: "http://clients.example.com/api/search",
            data: JSON.parse('{"textname":"abc", "dropval":"1"}'),
            method: 'POST',
            dataType: 'json',
            // ... logic to fetch the data ...
        });
    });
</script>
I have tried to get the results of the page using requests and urllib:
1:
resp = requests.get('http://www.example.com/abc/1')
2:
req = urllib.request.Request('http://www.example.com/abc/1')
x = urllib.request.urlopen(req)
SourceCode = x.read()
3: Also tried scrapy.
But all of the above return only the static data as seen in "view page source", and not the actual results as can be seen in the browser.
Looking for help on the right approach here.

Scraping pages with urllib or requests will only return the static page source, since neither can execute the JavaScript that the server returns. If you want to load the content just as your browser does, you have to use Selenium with a Chrome or Firefox driver. If you want to keep using urllib or requests, you have to find out which resources the site actually loads, for example via the Network tab in Chrome's developer tools. The data you are interested in is probably loaded from a JSON endpoint.
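For example, here is a minimal sketch along those lines, reusing the endpoint and payload visible in the page's script above; confirm the exact URL, headers and body encoding in the Network tab before relying on it.

import requests

# Endpoint and payload taken from the page's own $.ajax call (see above);
# whether the API expects a form-encoded or JSON body must be confirmed
# in the browser's Network tab.
payload = {"textname": "abc", "dropval": "1"}
resp = requests.post("http://clients.example.com/api/search", data=payload)
# If the API expects a JSON body instead, use: requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json())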

Related

How to figure out where the raw data in a table comes from?

https://www.nyse.com/quote/XNYS:A
After I access the above URL, I open Developer Tools in Firefox, change the date in HISTORIC PRICES, and click 'GO'. The table is updated, but I don't see any relevant HTTP requests in the Network tab.
So this means that the data has already been downloaded in the first request, but I cannot figure out how to extract the raw data of the table. Could anybody take a look at how to extract the raw data from the table? (Note that I don't want to use methods like Selenium; I want to stay with raw HTTP requests to get the raw data.)
EDIT: A websocket is mentioned in the comments, but I can't see it in Developer Tools. I add the websocket tag anyway in case somebody who knows more about websockets can chime in.
I am afraid you cannot extract JavaScript-rendered content without Selenium. You can always use a headless browser (no window appears on your screen; the only pitfall is that you have to wait until the page fully loads) and it won't get in your way.
In other words, the other scraping libraries work with URLs and forms. Scrapy can post forms but cannot run JavaScript.
Selenium will save the day, and all you lose is a couple of seconds per request. You can get the page source with driver.page_source and parse it directly (as HTML text) with BeautifulSoup or whatever you prefer.
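A minimal sketch of that approach, assuming headless Chrome and the .flex_tr row class used in the requests-html answer below:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("--headless")          # no browser window
driver = webdriver.Chrome(options=options)  # chromedriver must be on PATH

driver.get("https://www.nyse.com/quote/XNYS:A")
# wait until the JS-rendered rows exist before grabbing the source
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".flex_tr"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
for row in soup.select(".flex_tr")[:1]:
    print(row.get_text(" ", strip=True))

driver.quit()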
You can do it with requests-html; for example, let's grab the first row of the table:
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.nyse.com/quote/XNYS:A'
r = session.get(url)
r.html.render(sleep=7)
first_row = r.html.find('.flex_tr', first=True)
print(first_row.text)
Output:
06/18/2021
146.31
146.83
144.94
145.01
3,220,680
As @Nikita said, you will have to wait for the page to load (here 7 seconds, but maybe less), but if you want to make multiple requests you can do it asynchronously!
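For instance, here is a sketch of the asynchronous variant with AsyncHTMLSession; the second URL is only a hypothetical example of another quote page.

from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()
urls = [
    'https://www.nyse.com/quote/XNYS:A',
    'https://www.nyse.com/quote/XNYS:IBM',  # hypothetical second quote page
]

async def fetch(url):
    r = await asession.get(url)
    await r.html.arender(sleep=7)  # async counterpart of render()
    return r.html.find('.flex_tr', first=True).text

# run() wraps each coroutine in a task and waits for all of them
results = asession.run(*[lambda url=url: fetch(url) for url in urls])
for text in results:
    print(text)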

Scraping from a slippery .aspx page

I need some values that I can see in a pop-up on a web page, but their source is unknown, at least to me.
The page is: https://www.afpmodelo.cl/AFP/Indicadores/Valor-Cuota.aspx
and the data appears in a modal (or something like that) after clicking the "DESCARGAR EXCEL" button.
I have searched the page source and the XHR requests in Chrome dev tools, but the data is nowhere to be found.
I use ruby with Mechanize for scraping, but suspect that's not the way to go here.
The data shows up in the dev tools for me (right click > Inspect).
The following code fetches that (slippery) table:
require 'mechanize'
require 'nokogiri'

url = 'https://www.afpmodelo.cl/AFP/Indicadores/Valor-Cuota.aspx'

mechanize = Mechanize.new { |agent|
  agent.user_agent_alias = 'Mac Safari'
}

mechanize.get(url).form_with(:id => 'form1') do |form|
  # submit the form using the DESCARGAR EXCEL button
  data_page = form.submit(form.button_with(:id => 'ContentPlaceHolder1_btn_GRILLA'))
  doc = Nokogiri::HTML(data_page.body)
  results_table = doc.css('div.modal-dialog table')
  # do something with the results_table
  puts results_table
end

Phantomjs with R

I am trying to scrape data from a web page. Since the page has dynamic content, I used PhantomJS to handle it. But with the code I am using, I can only download the data initially shown on the page. However, I need to input the date range and then submit it to get all the data I want.
Here is the code I used:
library(xml2)
library(rvest)

url <- "https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml"
path <- ""  # directory containing the phantomjs binary
connection <- "pr.js"

writeLines(sprintf("var page = require('webpage').create();
var fs = require('fs');
page.open('%s', function(){
    console.log(page.content); // page source
    fs.write('pr.html', page.content, 'w');
    phantom.exit();
});", url), con = connection)

system_input <- paste(path, "phantomjs", " ", connection, sep = "")
system(system_input)
Thanks to this code, I have the HTML output of the dynamically generated web page.
As I stated, I also need to submit a date input, but I couldn't achieve that.
The URL is: https://seffaflik.epias.com.tr/transparency/piyasalar/gop/ptf.xhtml

Scrapy dowload zip file from ASP.NET site

I need some help getting Scrapy to download a file from an ASP.NET site. Normally from a browser one would click the link and the file would begin downloading, but that is not possible with Scrapy, so what I am trying to do is the following:
def retrieve(self, response):
    print('Response URL: {}'.format(response.url))
    pattern = re.compile('(dg[^\']*)')
    for file in response.xpath('//table[@id="dgFile"]/tbody/tr/td[2]/a'):
        file_url = file.xpath('@href').extract_first()
        target = re.search(pattern, file_url).group(1)
        viewstate = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        viewstategenerator = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]').extract_first()
        eventvalidation = response.xpath('//*[@id="__EVENTVALIDATION"]').extract_first()
        data = {
            '_EVENTTARGET': target,
            '_VIEWSTATE': viewstate,
            '_VIEWSTATEGEERATOR': viewstategenerator,
            '_EVENTVALIDATION': eventvalidation
        }
        yield FormRequest.from_response(
            response,
            formdata=data,
            callback=self.end(response)
        )
I am trying to submit the information to the page in order to receive the zip file back as a response; however, this is not working as I hoped. Instead I am simply getting the same page back as the response.
In a situation like this, is it even possible to use Scrapy to download this file? Does anyone have any pointers?
I have also tried to use Selenium + PhantomJS, but I ran into a dead end trying to transfer the session from Scrapy to Selenium. I would be willing to use Selenium for this one function, but I need to use Scrapy for the rest of the project.
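For reference, the conventional shape of an ASP.NET postback in Scrapy looks roughly like the sketch below: the hidden field names use double underscores, and the callback is passed by reference rather than called. This is only a sketch under those assumptions (the spider name and start URL are placeholders), not something verified against this particular site.

import re

import scrapy
from scrapy import FormRequest


class FileSpider(scrapy.Spider):
    # Placeholder spider; name and start URL are hypothetical.
    name = 'aspnet_files'
    start_urls = ['http://www.example.com/files.aspx']

    def parse(self, response):
        pattern = re.compile(r"(dg[^']*)")
        for link in response.xpath('//table[@id="dgFile"]//tr/td[2]/a'):
            href = link.xpath('@href').get() or ''
            match = pattern.search(href)
            if not match:
                continue
            # from_response copies __VIEWSTATE, __VIEWSTATEGENERATOR and
            # __EVENTVALIDATION from the form automatically; only the
            # postback target needs to be supplied explicitly.
            yield FormRequest.from_response(
                response,
                formdata={
                    '__EVENTTARGET': match.group(1),
                    '__EVENTARGUMENT': '',
                },
                callback=self.save_file,  # pass the method, do not call it
            )

    def save_file(self, response):
        # If the postback really returns the zip, dump the raw bytes to disk.
        with open('download.zip', 'wb') as f:
            f.write(response.body)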

Retrieve comments from website using disqus

I would like to write a scraping script to retrieve comments from CNN articles. For example, this article: http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1
I realize that CNN uses Disqus for their comment discussion. As the comment loading is not page-based (i.e., prev page / next page) but dynamic (i.e., you need to click "load next 25"), I have no idea how to retrieve all the 5000+ comments for this article.
Any idea or suggestion?
Thanks so much!
I needed to get comments by scraping a page that loaded Disqus comments via AJAX. Because they were not rendered on the server, I had to call the Disqus API. From the page source, you will need the identifier:
var identifier = "456643" // take note of this from the page source
// this is the ident url query param in the following js request
Also, look in the JS source code to get the page's public key and forum name. Place these in the URL where appropriate.
I used JavaScript (Node.js) to test this, i.e.:
var request = require("request");

var publicKey = "pILMw27bsbJsdfsdQDh9Eh0MzAgFL6xx0hYdsdsdfaIfBHRvLGqFFQ09st";
var disqusUri = "https://disqus.com/api/3.0/threads/listPosts.json?api_key=" + publicKey + "&thread:ident=456643&forum=nameOfForumFromSource";

// request's callback signature is (error, response, body)
request(disqusUri, function(err, res, body){
    if (err) {
        console.log("ERR: " + err);
        return;
    }
    console.log(body);
});
The other option for scraping (besides fetching the page directly), which might be less robust (depending on your needs) but will solve your problem, is to use some kind of wrapper around a full-fledged web browser and literally script the usage pattern to extract the relevant data. Since you didn't mention which programming language you know, I'll give three examples: 1) Watir - Ruby, 2) WatiN - IE & Firefox via .NET, 3) Selenium - browsers via C#/Java/Perl/PHP/Ruby/Python
I'll provide a little example using WatiN & C#:
IE browser = new IE();
browser.GoTo(YOUR_CNN_URL);

List visibleComments = browser.List(Find.ById("dsq-comments"));
// do your scraping thing

Link moreComments = browser.Link(Find.ByClass("dsq-paginate-append-text"));
moreComments.Click();

// wait until the ajax call has finished by searching for some indicator
browser.WaitUntilContainsText(SOME_TEXT);
// do your scraping thing
Notice:
I'm not familiar with Disqus, but it might be better to force all the comments to show by looping the Link & Click parts of the code I posted until all the comments are visible, and then scrape the List element dsq-comments. A rough Python/Selenium version of that loop is sketched below.
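This is only a sketch: the class and id names (dsq-paginate-append-text, dsq-comments) are taken from the WatiN snippet above and would need verifying against the live page, and the fixed sleeps are the crudest possible waits.

import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.cnn.com/2012/01/19/politics/gop-debate/index.html?hpt=hp_t1")
time.sleep(5)  # give the Disqus widget time to load

while True:
    more = driver.find_elements(By.CLASS_NAME, "dsq-paginate-append-text")
    if not more:
        break              # no "load more" link left, all comments are visible
    more[0].click()
    time.sleep(2)          # crude wait for the AJAX call to finish

comments = driver.find_element(By.ID, "dsq-comments").text
print(comments)
driver.quit()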
