https://www.genecards.org/cgi-bin/carddisp.pl?gene=ZSCAN22
On the above webpage, if I click "See all 33", I can see in Chrome DevTools that the following GET request is sent:
https://www.genecards.org/gene/api/data/Enhancers?geneSymbol=ZSCAN22
Accessing it directly is blocked.
I have tried using Puppeteer. I can click "See all 33" with Puppeteer, but then I need to parse the resulting HTML. It would be best to get the results directly from https://www.genecards.org/gene/api/data/Enhancers?geneSymbol=ZSCAN22, but I am not sure how to capture that response after clicking "See all 33" with Puppeteer.
I am not sure whether Apify can help.
Can anybody let me know how to scrape it?
I used Selenium and it is working fine:
from selenium import webdriver

browser = webdriver.Chrome(executable_path="C:/src/webdriver/chromedriver.exe")
genesLocations = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene={}'

# Extract the genomic location for a gene
gene = 'ZSCAN22'
browser.get(genesLocations.format(gene))
# find_element_by_xpath is the pre-Selenium-4 locator API
location = browser.find_element_by_xpath('//*[@id="genomic_location"]/div/div[3]/div/div')
print(location.text)
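To get at the original enhancers question rather than the genomic location, one option is to capture the background API call instead of parsing the expanded HTML. Below is a rough sketch of that idea using the third-party selenium-wire package; the package choice and the "See all" link selector are assumptions on my part and may need adjusting, only the API URL comes from the question.
# Sketch: capture the Enhancers API response after clicking "See all 33".
# selenium-wire (pip install selenium-wire) records the browser's network traffic;
# the link selector below is a guess and may match other "See all" controls on the page.
from seleniumwire import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://www.genecards.org/cgi-bin/carddisp.pl?gene=ZSCAN22')

# Trigger the XHR by clicking the "See all 33" control.
driver.find_element(By.PARTIAL_LINK_TEXT, 'See all').click()

# Wait for the Enhancers request to show up in the captured traffic and read its body.
request = driver.wait_for_request('/gene/api/data/Enhancers', timeout=30)
print(request.response.body.decode('utf-8'))  # may need decompressing if Content-Encoding is set

driver.quit()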
I just started learning web scraping and decided to scrape the daily value from this site:
https://www.tradingview.com/symbols/INDEX-MMTW/
I am using BeautifulSoup, inspecting the element in the browser, and then using Copy -> CSS Selector.
However, the returned items always have length 0. I tried the select() method (from ATBS) and the find() method.
Not sure what I am doing wrong. Here is the code...
import requests, bs4
res = requests.get('https://www.tradingview.com/symbols/INDEX-MMTW/')
res.raise_for_status()
nmmtw_data = bs4.BeautifulSoup(res.text, 'lxml')
# (Instead of writing the selector yourself, you can also right-click on the element in your browser
# and select Inspect Element. When the browser’s developer console opens, right-click on the element’s
# HTML and select Copy ▸ CSS Selector to copy the selector string to the clipboard and paste it into your
# source code.)
elems = nmmtw_data.select("div.js-symbol-last > span:nth-child(1)")
new_try = nmmtw_data.find(class_="tv-symbol-price-quote__value js-symbol-last")
print(type(new_try))
print(len(new_try))
print(elems)
print(type(elems))
print(len(elems))
Thanks in advance!
Since the price table is generated with JavaScript, we unfortunately cannot scrape it with BeautifulSoup alone. Instead, you should use a web browser automation framework.
I'm sure you've found a solution by now, but if not, I believe the answer to your problem is the Selenium module. Additionally, you need to install the WebDriver specific to the browser you're using. I think BeautifulSoup is quite limited these days because most sites are generated using JavaScript.
All the info that you need for selenium you can find here:
https://www.selenium.dev/documentation/webdriver/
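As a minimal sketch of that Selenium approach for the TradingView question (the js-symbol-last class comes from the question's own selector and may have changed on the live site):
# Sketch: read the JavaScript-rendered price with Selenium instead of requests + BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://www.tradingview.com/symbols/INDEX-MMTW/')

# Wait until the price element exists in the rendered DOM, then read its text.
price = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.js-symbol-last'))
)
print(price.text)

driver.quit()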
I'm trying to scrape the information about "5 Postos de abastecimento" from the "Mapa" section of https://www.imovirtual.com/anuncio/moradia-no-alto-da-ajuda-para-reconstrucao-ID10VFW.html#46747dcb9d using Scrapy.
When I look at the website in Chrome, the map section appears, and I can inspect the HTML in the developer tools and find the information in the div with class style__place___1StFN.
But when I try to find this div class in the Scrapy shell, it won't find anything:
response.css('div.style__place___1StFN')
I looked at the Network tab in the developer tools and tried to find any other GET/POST response that contains this information, but wasn't able to find it.
Any suggestions?
Thank you
On the following page, the numbers 2, 3, ... at the bottom all point to the same URL, yet different tables are shown. Does anybody know what specific technique is used here? How can I extract the information in these tables using raw HTTP requests (I prefer not to use a headless browser)? Thanks.
https://services27.ieee.org/fellowsdirectory/home.html#results_table
It is using JavaScript (AJAX) to make HTTP calls to the server.
If you inspect the Network activity in the Developer tools you will see calls to the following URL: https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html.
They send data from JavaScript:
selectedJSON: {"alpha":"ALL","menu":"ALPHABETICAL","gender":"All","currPageNum":1,"breadCrumbs":[{"breadCrumb":"Alphabetical Listing "}],"helpText":"Click on any of the alphabet letters to view a list of Fellows."}
inputFilterJSON: {"sortOnList":[{"sortByField":"fellow.lastName","sortType":"ASC"}],"typeAhead":false}
pageNum: 2
You can see the pageNum property. This is how they request a specific page of results.
When you click the number buttons, some Javascript code makes an AJAX POST request to https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html;jsessionid=yoursessionid with formData including pageNum: 3 and some other formatting parameters. The server responds with the HTML block of table rows that get loaded into the page. You can look at the requests on that webpage in your browser's network inspector (in the developer tools) to see exactly what HTTP requests are happening.
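A rough sketch of reproducing that POST from Python with plain HTTP requests (the form fields are copied from the captured request above; whether the server also demands particular cookies or headers, such as the jsessionid, is an assumption, so a Session is used to establish one first):
# Sketch: ask for page 2 of the results table with a raw HTTP POST.
import requests

session = requests.Session()
# First hit the home page so the session picks up a jsessionid cookie (assumption).
session.get('https://services27.ieee.org/fellowsdirectory/home.html')

form_data = {
    'selectedJSON': '{"alpha":"ALL","menu":"ALPHABETICAL","gender":"All","currPageNum":1,'
                    '"breadCrumbs":[{"breadCrumb":"Alphabetical Listing "}],'
                    '"helpText":"Click on any of the alphabet letters to view a list of Fellows."}',
    'inputFilterJSON': '{"sortOnList":[{"sortByField":"fellow.lastName","sortType":"ASC"}],"typeAhead":false}',
    'pageNum': '2',
}

resp = session.post(
    'https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html',
    data=form_data,
)
print(resp.text)  # HTML fragment with the table rows for page 2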
The link has an onclick handler that changes the href when it is clicked. Go to
https://services27.ieee.org/fellowsdirectory/home.html#results_table
In the console, enter:
window.location=getDetailProfileUrl('lOH1bDxMyI1CCIxo5ODlGg==');
This redirects to Aarons, Jules.
Now go back and enter window.location=getDetailProfileUrl('JJuL3J00kHdIUozoVAgKdg==');
This opens Aarts, Ronald.
Basically, when the link is clicked, the JavaScript changes the URL of the link.
To fetch the page with PHP, use the file_get_contents() function.
echo file_get_contents('https://services27.ieee.org/fellowsdirectory/home.html#results_table');
That will print out the page. Now scrape it with JavaScript.
echo "<script>console.log(document.querySelectorAll('.name'));</script>";
Hope this helps.
I am new to Rust and am trying out different projects to become more familiar with the language. As the title says, I would like to retrieve the HTML source of a webpage. I am aware of rust-http, but I am not sure how to use that library for this purpose.
For a more detailed description of what I am trying to do:
Given some URL: www.google.com
I would like the underlying HTML source.
I have looked at the GitHub documentation of rust-http, but the lack of docs has been confusing.
(answer moved from an edit to the question)
After playing around and searching more, I was finally able to get what I wanted. Here is some code using Hyper that retrieves the source.
// Note: the URL needs a scheme (e.g. "http://"), or Url::parse will fail.
let req = request::Request::get(hyper::Url::parse("http://www.google.com").unwrap()).unwrap();
let res = req.start().unwrap()
    .send().unwrap()
    .read_to_string().unwrap();
println!("Response: {}", res);
What is wrong with using the options built into browsers to display source code? E.g., in Chrome and Firefox: right-click and select "View Page Source".
I have a framework with WebDriver + TestNG.
I want the results of all the methods I have run, in HTML format, with their status and a screen-capture link.
Please let me know the way of doing this.
Thanks in advance.
You can make use of this, but it is only for BDD-styled stories. If you are looking at plain WebDriver scripts, try these Selenium loggers.
TestNG creates a report in HTML format in the 'test-output' folder (or, if you're using the TestNG plugin in Eclipse, you can specify the path in Window > Preferences > TestNG). The folder is created and rewritten after each run. Look for index.html there.
You can add custom information there using Reporter.log (from org.testng.Reporter); the information can be found behind the 'Reporter output' link in the report. So basically all you need is an @AfterMethod which takes the screenshot and embeds it into the log. This discussion may help.