Scrapy empty xpath response - web-scraping

I'm trying to get the URLs of images from this page: https://www.iproperty.com.my/sale/all-residential/.
Using the Chrome extension XPath Helper, I identified the XPath and used the Scrapy shell to get a response:
fetch("https://www.iproperty.com.my/sale/all-residential/")
response.xpath("//div[@class='cFwUMy']/div[@class='fUtkLG']/div[@class='slick-initialized slick-slider']/div[@class='slick-list']/div[@class='slick-track']/div[@class='slick-slide slick-active'][1]/div[@class='img-wrapper']/a/div[@class='cHKlDH']/img[@class='lazyautosizes lazyloaded']/@src")
However, it doesn't return anything.
I've also tried:
response.xpath("//div[@class='img-wrapper']/a/div[@class='cHKlDH']")
Still not working.
How do I get the url of the image from the page? I've been successful with getting the title, location, and price, but am stuck with getting the images.
EDIT1:
So weird, I tried
response.xpath("div[@class='img-wrapper']/a")
It returns the links as expected, but
response.xpath("div[@class='img-wrapper']/a/div[@class='cHKlDH']")
and
response.xpath("//div[@class='cHKlDH']")
simply refuses to return anything.

Scrapy only downloads the initial page response.
It does not execute any JavaScript the way a normal browser does.
The trick is to disable JavaScript in your browser and then check whether your desired element still exists.
On the website mentioned above, the image links are embedded as JSON in the initial page response, and the page is rendered from that data afterwards.
In Scrapy, you can do
re.findall(r"window\.__INITIAL_STATE__ =(.*)window\.__RENDER_APP_ERROR__", response.text, flags=re.DOTALL)
It will return this JSON: https://jsoneditoronline.org/?id=bbef330441b24957aeaceedcea621ba7
Under the listings > items key it has all the data (prices, images) you need.
Here is the complete working Python code:
https://repl.it/@UmairAyub/AdmirableHilariousSpellchecker
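In case the link is not reachable, here is a minimal sketch of the same idea. It assumes the page embeds its data as `window.__INITIAL_STATE__ = {...};` directly before `window.__RENDER_APP_ERROR__`, as the regex above implies. The sample HTML, the helper names, and the `cover`/`url` keys are made up for illustration; only `listings` > `items` comes from the answer, so check the real JSON for the actual image keys.

```python
import json
import re

# Hypothetical stand-in for response.text; the real page is much larger.
SAMPLE_HTML = """
<script>
window.__INITIAL_STATE__ = {"listings": {"items": [{"title": "Condo A", "cover": {"url": "https://example.com/a.jpg"}}]}};
window.__RENDER_APP_ERROR__ = null;
</script>
"""


def extract_initial_state(html):
    """Return the embedded JSON state as a dict, or None if it is not found."""
    match = re.search(
        r"window\.__INITIAL_STATE__\s*=\s*(.*?);?\s*window\.__RENDER_APP_ERROR__",
        html,
        flags=re.DOTALL,
    )
    if match is None:
        return None
    return json.loads(match.group(1))


def image_urls(state):
    """Collect image URLs from listings > items (the 'cover'/'url' keys are assumed)."""
    return [item["cover"]["url"] for item in state["listings"]["items"]]


state = extract_initial_state(SAMPLE_HTML)
print(image_urls(state))
```

Inside a spider you would pass `response.text` instead of `SAMPLE_HTML`.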

Related

Why python requests.get method returns corrupted "href" from webpage?

I am trying to scrape the "href" from a URL. The webpage is this: "https://standards.cencenelec.eu/dyn/www/f?p=305:111:0::::FSP_PROJECT_CROSSREF,FSP_PROJECT:EN-SPACE-60238,44005&cs=181563CC6721C4C894F6FF8D18062BB51".
Using the Python requests.get method and BeautifulSoup I am getting a very strange result: every time I re-run requests.get(), the value I get in the one and only "a" tag present in the page changes, so I cannot get the correct href that I see in the inspector.
Here is what I am trying:
import requests
from bs4 import BeautifulSoup as soup

link = "https://standards.cencenelec.eu/dyn/www/f?p=305:111:0::::FSP_PROJECT_CROSSREF,FSP_PROJECT:EN-SPACE-60238,44005&cs=181563CC6721C4C894F6FF8D18062BB51"
req = requests.get(link)
zoup = soup(req.text, "html.parser")
# Print the href of every <a> tag on the page
for x in zoup.find_all("a"):
    print(x.get("href"))
As an example the result I am getting now is: "f?p=CENELEC:110:0::::FSP_PROJECT:54685&cs=3112E12038AAA8791B214FDBC01ABB64F".
But this link does not lead anywhere. Analyzing the page with the inspector, the correct "href" is actually this one:
"f?p=CENELEC:110:0::::FSP_PROJECT:54685&cs=3A3706F21F2613DED09897BE4C7491039".
The thing that really "shocks" me is that this result changes every time I redo the request (while all the other elements of the page are the same as in the inspector). At the beginning I thought it was an issue with the HTML parser and I tried different ones, but the result did not change.
Inspecting "req.text" I discovered that the wrong link is already there, so the problem is at the requests.get level, but I cannot understand what it is.
Thank you very much for help!!!

How can I get custom querystring included in returned firebase dynamic shortlink

I'm using the Firebase dynamic link post API to return a shortlink. When I post this:
https://CENSORED.page.link/?link=https://www.CENSORED.co.uk/offers/friends/?utm_source=referafriend&utm_medium=ecrm&utm_campaign=cbk25&utm_term=988776
clicking the returned shortlink redirects to:
https://www.CENSORED.co.uk/offers/friends/?utm_source=referafriend
The post is made from clientside js. Firebase is returning a working shortlink, but with some parameters missing.
Expected url from clicked shortlink:
https://www.CENSORED.co.uk/offers/friends/?utm_source=referafriend&utm_medium=ecrm&utm_campaign=cbk25&utm_term=988776
Looks like it's chopping off most of my query string. How do I get the full query string returned correctly, please?
Solved: Escaping the url worked for me:
params lost:
"https://www.test.co.uk/testing/?utm_source=jam&utm_medium=spoon&utm_campaign=jar&utm_term=lid"
params returned correctly:
"https%3A%2F%2Fwww.test.co.uk%2Ftesting%2F%3Futm_source%3Djam%26utm_medium%3Dspoon%26utm_campaign%3Djar%26utm_term%3Dlid"

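To illustrate why escaping helps, here is a small sketch using Python's standard library as a stand-in for whatever parses the request (an assumption; Firebase's own parsing is not shown in the post). The `example.page.link` domain and the `link_param` helper are hypothetical. In client-side JS, `encodeURIComponent` plays the role of `quote(..., safe="")` here.

```python
from urllib.parse import parse_qs, quote, urlsplit

DEEP_LINK = "https://www.test.co.uk/testing/?utm_source=jam&utm_medium=spoon&utm_campaign=jar&utm_term=lid"
PREFIX = "https://example.page.link/?link="  # hypothetical Dynamic Links domain


def link_param(url):
    """Return the value a query-string parser reads for the `link` parameter."""
    return parse_qs(urlsplit(url).query)["link"][0]


raw_url = PREFIX + DEEP_LINK                      # inner '&' splits the outer query
escaped_url = PREFIX + quote(DEEP_LINK, safe="")  # inner URL is percent-encoded

print(link_param(raw_url))      # only the part up to the first '&' survives
print(link_param(escaped_url))  # the full deep link survives
```

Unescaped, every `&` after `utm_source` starts a new parameter of the outer URL, so the `link` value is cut short, which matches the behaviour described above.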
Paw app query request

Hi I am attempting to initiate a query to my backend on Kinvey which is backed by a MongoDB. They require passing URL parameters as such:
?query={"firstName":"James"}
I have tried every imaginable way of setting up these parameters in PAW but either get a success response with no filtering of the data or an error message of URL not supported when I try using a Raw Query String.
I have run the query using their (Kinvey) backend API interface and it works fine in filtering the results, so the problem definitely lies within PAW. I am currently using version 3.0.9. Any suggestions, or is this just a bug that needs to be fixed?
Thanks!
I've just tried this setup in Paw and I have a few recommendations:
Paw will URL-encode the chars { and " as you can see if you open the HTTP preview in the bottom panel
Trying to send a similar query via Chrome (to test with another app and make sure Paw behaves correctly), I see that the query is URL-encoded there too: try this query https://echo.paw.cloud/?query={"firstName":"James"} and you'll see that the browser actually URL-encodes the characters { and " when sending. So the behavior is the same as with Paw.
I don't think these two chars ({ and ") are valid HTTP if they are not URL-encoded, so I'm sure your server is expecting them encoded anyway
Testing this exact query in Paw, works for me, so please try these exact steps: go to URL Params, in the first column enter query and {"firstName":"James"} in the second column. Then using the HTTP preview mentioned above, make sure Paw is sending the request you're expecting.
Lastly, more of a tip: as your value is JSON, I recommend that you use the JSON dynamic value to generate it. It will be visually clearer for you, and will make sure you send valid JSON. For that, right click on the value field, and select Values > JSON.
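For comparison outside of Paw, here is a short sketch of what the URL-encoded form of that parameter looks like, using Python's standard library. It only demonstrates the encoding; it does not call Kinvey, and the variable names are illustrative.

```python
import json
from urllib.parse import parse_qs, urlencode

query_filter = {"firstName": "James"}

# Compact JSON (no spaces), matching ?query={"firstName":"James"}
compact_json = json.dumps(query_filter, separators=(",", ":"))

# urlencode percent-encodes {, ", : and } in the value
query_string = urlencode({"query": compact_json})
print(query_string)

# A server decoding the query string gets the original JSON text back
decoded = parse_qs(query_string)["query"][0]
print(decoded)
```

If the HTTP preview in Paw shows the same encoded characters, the request is being sent as expected.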

Is it possible to return HTTP code 200, but give a "better" url without using 3xx?

Consider StackOverflow, where each question has a unique ID, but URLs are often overridden to include a stub in the URL. For readability and other reasons the stub helps users know they are at the right place.
I have a site that returns 200 when calling a URL like:
http://stackoverflow.com/questions/28057406/
But want the URL to update to:
http://stackoverflow.com/questions/28057406/is-it-possible-to-return-http-code-200-but-give-a-better-url-without-using-3x
The first call is technically valid and the code can retrieve the object and render it perfectly fine, but I'd like to update the URL to use the stubified one.
I'd prefer to do this without a redirect, because just resolving the ID already causes a database call to get the object. With a redirect, the process would be:
Call http://stackoverflow.com/questions/28057406/
Retrieve item 25257999 from the database to get the name to make the stub
Redirect to http://stackoverflow.com/questions/28057406/is-it-possible-to-return-http-code-200-but-give-a-better-url-without-using-3x
New HTTP Call, so retrieve item 25257999 from the database to render the final page.
If possible I'd like to not use Javascript either.
So, is it possible to return Location as part of an HTTP header with a status code of 200 and the actual page, or am I stuck using 3xx calls or Javascript?
If you are just doing HTTP, you can either choose to redirect, or choose not to redirect... You can also (with Content-Location) tell the client that the canonical address is actually somewhere else... but no browser will respond to that.
To avoid the second database call, you could of course just cache the result.
If you are in a browser however, you can dynamically update the current address without forcing a refresh, with window.history.pushState.
For more information about that call, see this other SO answer:
Modify the URL without reloading the page
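As a rough sketch of the caching suggestion, assuming a single server process and an in-memory cache: the lookup is memoized so the redirect and the follow-up request share one database call. `FAKE_DB`, `load_title`, `slugify`, and `handle` are hypothetical names, not part of any framework.

```python
import re
from functools import lru_cache

STATS = {"db_calls": 0}
FAKE_DB = {28057406: "Example question title"}  # stand-in for the real database


@lru_cache(maxsize=1024)
def load_title(question_id):
    """Pretend database lookup; cached so repeated calls do not hit the DB."""
    STATS["db_calls"] += 1
    return FAKE_DB[question_id]


def slugify(title):
    """Lowercase the title and join alphanumeric runs with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def handle(path):
    """Return (status, headers, body); redirect if the path lacks the stub."""
    parts = [p for p in path.split("/") if p]
    question_id = int(parts[1])
    title = load_title(question_id)
    canonical = f"/questions/{question_id}/{slugify(title)}"
    if path.rstrip("/") != canonical:
        return 301, {"Location": canonical}, ""
    return 200, {}, title


status, headers, _ = handle("/questions/28057406/")
final_status, _, body = handle(headers["Location"])
print(status, final_status, STATS["db_calls"])
```

With several worker processes, a shared cache would be needed instead of `lru_cache`, which only lives inside one process.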

Why does a percent symbol in a get request break my site?

I feel pretty stupid for asking this, but I'm doing a form where the user enters some input and sometimes the input is a percent symbol, say 5%. When this gets passed along as part of a GET request, like this:
http://kburke.org/project/company_x/?id=4&var1=1&ops=23255&cashflow=25000&growth=5%25&pv=100000&roe=20&profitmargin=30&roe=80&turnover=2
I get a 404 Page Not Found error. When I remove the query string pair
&growth=5%25
the page loads fine. Can someone help explain what the problem is?
Edit: I tried removing all of the Javascript from the page and the server still craps out. I also just tried running it in MAMP as
http://localhost:8888/project/company_x/?id=4&var1=1&ops=23255&cashflow=25000&growth=5%25&pv=100000&roe=20&profitmargin=30&roe=80&turnover=2
and it worked fine. I'm wondering if it's a problem with my own server. When I open Firebug to the console and run the page, I see an error very briefly and then the 404 page loads - is there a way I can pause the redirect so I can read the error message?
Check out URL ENCODING. The "%" character in a url means something special.
You encode the space character ' ' as %20 in a url.
You encode the percent character '%' as %25 in a url.
So after your url gets to the script, your argument 'growth' will equal "5%".
I tried messing around with your URL and it appears that your script is crashing when it tries to parse the growth argument, and your web site is hiding that crash from you by sending you to the 404 page. I'd suggest posting your script code if you need more help.
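The encoding rules above can be checked with Python's standard library. The script from the question is not shown, so the second half is only a guess at one way parsing could fail: treating the decoded value "5%" as a plain number. `parse_percent` is a hypothetical helper.

```python
from urllib.parse import parse_qs, quote, unquote

print(quote("5%"))      # '%' is percent-encoded as %25
print(unquote("5%25"))  # and decoded back to '%'

# What the script sees after the server decodes the query string
query = "growth=5%25&pv=100000"
growth = parse_qs(query)["growth"][0]


def parse_percent(text):
    """Accept '5' or '5%' and return the number as a float."""
    return float(text.strip().rstrip("%"))


try:
    float(growth)  # naive conversion of '5%' raises ValueError
except ValueError as exc:
    print("naive parse failed:", exc)

print(parse_percent(growth))
```

If the real script does something similar, stripping the percent sign before converting (or validating the input on the form) would avoid the crash.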
