Scrapy - get map information - web-scraping

I'm trying to scrape the information about "5 Postos de abastecimento" from the "Mapa" section on https://www.imovirtual.com/anuncio/moradia-no-alto-da-ajuda-para-reconstrucao-ID10VFW.html#46747dcb9d using Scrapy.
When I look at the website in Chrome, the map section appears and I can inspect the HTML in Developer Tools and find the information in the div with class style__place___1StFN.
But when I try to find this div class in scrapy shell, it doesn't find anything:
response.css('div.style__place___1StFN')
I looked at the Network tab in Developer Tools and tried to find any other GET / POST response that contains this information, but I wasn't able to find one.
Any suggestions?
Thank you
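For reference, this is roughly the check I ran in scrapy shell; the response.text test at the end is an extra step (not in my original attempt) to confirm whether the class name appears anywhere in the raw HTML that Scrapy receives:

# Inside `scrapy shell "https://www.imovirtual.com/anuncio/moradia-no-alto-da-ajuda-para-reconstrucao-ID10VFW.html"`
response.css('div.style__place___1StFN')   # returns an empty list for me
'style__place___1StFN' in response.text    # False would mean the map is rendered client-side by JavaScript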

Related

Error during web scraping in R using Selector Gadget

I hope you are all doing well.
I am facing an error during web scraping in R using the SelectorGadget tool: when I select the data on the Coursera website with the tool, the number of values it shows is correct (10), but when I copy that CSS selector into R and run it, it returns 18 names in the list. Please help me with this if anyone can. Here is a screenshot of the SelectorGadget output:
And here is what gets returned in R when I scrape that CSS selector:
The rendered content seen via a browser is not exactly the same as what is returned by the XHR request that rvest makes. This is because a browser can run JavaScript to update the content.
Inspect the page source by pressing Ctrl+U in the browser on that webpage.
You can re-write your CSS selector to match the HTML that is actually returned. One example is as follows, which also removes the reliance on dynamic classes that change more frequently and would break your program sooner:
library(rvest)

read_html("https://in.coursera.org/degrees/bachelors") |>
  html_elements('[data-e2e="degree-list"] div[class] > p:first-child') |>
  html_text2()
Learn about CSS selectors and operators here: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors

How to scrape an Incapsula-protected website?

https://www.genecards.org/cgi-bin/carddisp.pl?gene=ZSCAN22
On the above webpage, if I click "See all 33", I can see the following GET request being sent in Chrome DevTools.
https://www.genecards.org/gene/api/data/Enhancers?geneSymbol=ZSCAN22
Accessing it directly is blocked.
I have tried to use Puppeteer. I can click "See all 33" with Puppeteer, but then I need to parse the resulting HTML. It would be best to get the results directly from https://www.genecards.org/gene/api/data/Enhancers?geneSymbol=ZSCAN22, but I am not sure how to get them after clicking "See all 33" with Puppeteer.
I am not sure if Apify can help.
Can anybody let me know how to scrape it?
I used Selenium and it is working fine:
from selenium import webdriver

browser = webdriver.Chrome(executable_path="C:/src/webdriver/chromedriver.exe")
genesLocations = 'https://www.genecards.org/cgi-bin/carddisp.pl?gene={}'

# Extract genomic locations
gene = 'ZSCAN22'
browser.get(genesLocations.format(gene))
location = browser.find_element_by_xpath('//*[@id="genomic_location"]/div/div[3]/div/div')
print(location.text)
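To get at the enhancers data the question actually asks about, one rough option is to extend the same Selenium session: click the "See all 33" toggle and read the expanded section, rather than calling the blocked API endpoint directly. The locators below (the partial link text "See all" and the "enhancers" element id) are assumptions about the page markup and may need adjusting in DevTools; this is only a sketch.

import time

# Reuse the browser object from the snippet above (already on the ZSCAN22 page)
see_all = browser.find_element_by_partial_link_text('See all')   # the "See all 33" toggle (assumed locator)
see_all.click()
time.sleep(2)   # crude wait for the extra rows to render; a WebDriverWait would be cleaner

enhancers = browser.find_element_by_xpath('//*[@id="enhancers"]')   # assumed container id for the enhancers section
print(enhancers.text)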

Need help showing the test output in HTML format containing a screen capture link

I have a framework with WebDriver + TestNG.
I want the results of all the methods I have run in HTML format, with their status and a screen capture link.
Please let me know how to do this.
Thanks in advance.
You can make use of this, but it is only for BDD-styled stories. If you are looking for plain WebDriver scripts, try these Selenium loggers.
TestNG creates a report in HTML format in the 'test-output' folder (or, if you're using the TestNG plugin in Eclipse, you can specify the path in Window > Preferences > TestNG). The folder is created and rewritten after each run. Look for index.html there.
You can add custom information there using Reporter.log (from org.testng.Reporter); the information can be found behind the 'Reporter output' link in the report. So basically all you need is an @AfterMethod which takes the screenshot and embeds it into the log. This discussion may help.

Is there a tool to check that CSS url() files exist?

I've just been tasked with migrating a website from a Windows server to a Linux server.
One of the issues I've noticed straight away is that a number of CSS url() declarations don't work because the case used in the CSS doesn't match the actual filename.
e.g.:
background: url(myFile.jpg);
while on the server the file is actually MyFile.jpg.
Does anyone know of a simple tool or browser plugin I can use just to scan the CSS file and verify that the url() declarations exist so that I can easily find and fix them?
The site is quite large, so I don't want to have to navigate through the pages to find 404 errors if I can avoid it.
Use Developer Tools in Google Chrome or Firebug in Firefox.
When you load an HTML page that uses that CSS, any missing resources will show up in the Network tab.
EDIT
I guess there isn't any single tool that will
Scan through the CSS file for all the URLs
Check whether each URL exists or not.
But you can try the following two links for these two tasks (a rough sketch combining both steps is shown after this answer).
RegEx to get the URLs from CSS: with this you will have the full list of URLs used in the CSS.
Check if a URL exists or not with cURL: an example in PHP is given there.
You can still search for these two items separately and try fixing the issues.
Let me know if this helps.
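As a rough illustration of those two steps combined, here is a small Python sketch that pulls url(...) references out of a stylesheet and checks each one with an HTTP request. The stylesheet URL is a placeholder and the regex is deliberately simple, so treat this as a starting point rather than a finished checker.

import re
import requests
from urllib.parse import urljoin

CSS_URL = "https://example.com/styles/site.css"   # placeholder: your stylesheet's URL

css = requests.get(CSS_URL).text
for ref in re.findall(r'url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)', css):
    absolute = urljoin(CSS_URL, ref)   # resolve relative paths against the CSS file's URL
    status = requests.head(absolute, allow_redirects=True).status_code
    if status == 404:
        print("MISSING", ref, "->", absolute)

Because Linux filesystems are case-sensitive, this should flag exactly the myFile.jpg vs MyFile.jpg mismatches described in the question.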
What if you simply type the URL of the image and/or CSS file directly into the browser's address bar?
How about Firebug in Firefox? It would show you all 404s in its console.
You can install Firebug if you're using Firefox, or you can press F12 if you're using Chrome; I think the same goes for IE. From there you will be able to check the URL and even view it in a new tab.
Turns out that the W3C Link Checker also scans CSS files, which is very handy.
Had this not worked, I would have had to put together something like Vanga's solution.
Here's how I would approach this.
Make sure all image requests are handled by a (PHP) script, by adding the following to my .htaccess:
RewriteRule \.(?:jpe?g|gif|png|bmp)$ /images.php [NC,L]
Use file_exists() to check whether the file exists; maybe even try whether a lowercase version of the file exists.
Log missing files into a database table or text file.
Use a script to loop through the website's sitemap with curl to get a complete list of requested filenames that resulted in a 404.
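For that last step, a minimal Python sketch (instead of curl) might walk the sitemap, pull the image references out of each page, and report any that come back 404. It assumes the sitemap lives at /sitemap.xml with standard <loc> entries; the site URL is a placeholder and the src regex is deliberately naive.

import re
import requests
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

SITE = "https://example.com"   # placeholder for the migrated site

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
pages = [loc.text.strip() for loc in
         ET.fromstring(requests.get(SITE + "/sitemap.xml").text).findall(".//sm:loc", ns)]

for page in pages:
    html = requests.get(page).text
    # naive extraction of image references from src attributes
    for src in re.findall(r'src=["\']([^"\']+\.(?:jpe?g|gif|png|bmp))["\']', html, re.I):
        url = urljoin(page, src)
        if requests.head(url, allow_redirects=True).status_code == 404:
            print("404:", page, "->", url)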

Error when using LinkedIn's Share button

I'm attempting to add a LinkedIn Share button to our content-driven website. I've generated the embed code using the button builder, but whenever I try to actually use the button, I get a generic error:
There was a problem performing this action, please try again later.
It's been doing this for several days (since I first added the code), so I don't know if the error is on the LinkedIn side or mine. Is there any way to get a more specific error message? The code they provide is just a script tag that you paste in:
<script src="http://platform.linkedin.com/in.js" type="text/javascript"></script>
<script type="IN/Share"></script>
Unfortunately LinkedIn's "support" forums are limited to the various APIs; there's nowhere to submit a question about the build-a-button functionality. I'm hoping someone else has used this feature and can point me in the right direction.
Most likely the page you are trying to share is not web accessible (local, behind an htaccess password, or similar). It looks to me like LinkedIn tries to actually fetch the page you are sharing, and if it can't reach it, it gives you this message.
Most likely the URL you are sharing is not encoded; try encoding it, and also follow this article for more.
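As a quick illustration of what "encoding the URL" means here, percent-encoding the page address before putting it into a share link or data-url value can be done like this (the page URL is just a placeholder):

from urllib.parse import quote

page = "https://www.example.com/articles/my article?id=42"
print(quote(page, safe=""))
# https%3A%2F%2Fwww.example.com%2Farticles%2Fmy%20article%3Fid%3D42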
The easiest way to ensure the LinkedIn share button works properly is to use
<!DOCTYPE html>
instead of other alternatives.
Look at the data-url attribute. Remove the "http://" and only use "www." for your website URL. That fixed my issue, at least.
I found this way to validate in XHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<script src="http://platform.linkedin.com/in.js" type="text/javascript"></script>
<div id="linkedin"></div>
<script type="text/javascript">
var po2 = document.createElement('script');
po2.type = 'IN/MemberProfile';
po2.setAttribute("data-id","http://www.linkedin.com/pub/luca-di-lenardo/11/4b7/3b8");
po2.setAttribute("data-format","hover");
po2.setAttribute("data-text","Luca Di Lenardo");
document.getElementById("linkedin").appendChild(po2);
</script>
With that in place, it validates and works!
If anyone is getting this error and cannot figure out why, I recommend checking the URL of the page you're sharing with the LinkedIn Post Inspector.
So, for instance, if I were to check how wikipedia.org looks when shared, I would visit the inspector and enter that URL, like so...
https://www.linkedin.com/post-inspector/inspect/https:%2F%2Fwww.wikipedia.org%2F
And I see a ton of information showing how everything is parsed, from the description to the title to the image selected for the thumbnail:
Warning: Add an og:image tag to the page to have control over the content's image on LinkedIn.
Title: Wikipedia
Type: Article
Image: No image found
Description: Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation.
Author: No author found
Publish date: 6/1/2020, 6:39:59 AM
They even give you instructions on how to fix your page! Hey, got some advice for Wikipedia.org here!
Provide a metadata tag for the og:image in the page's head section. For example:
<meta name="image" property="og:image" content="[Image URL here]">
