Why does the Python requests.get method return a corrupted "href" from a webpage? - web-scraping

I am trying to scrape the "href" from a URL. The webpage is this: "https://standards.cencenelec.eu/dyn/www/f?p=305:111:0::::FSP_PROJECT_CROSSREF,FSP_PROJECT:EN-SPACE-60238,44005&cs=181563CC6721C4C894F6FF8D18062BB51".
Using the Python requests.get method and BeautifulSoup I am getting a very strange result: every time I re-run requests.get(), the href in the one and only "a" tag present in the page changes, so I cannot get the correct href I see in the inspector.
Here is what I am trying:
import requests
from bs4 import BeautifulSoup as soup

link = "https://standards.cencenelec.eu/dyn/www/f?p=305:111:0::::FSP_PROJECT_CROSSREF,FSP_PROJECT:EN-SPACE-60238,44005&cs=181563CC6721C4C894F6FF8D18062BB51"
req = requests.get(link)
zoup = soup(req.text, "html.parser")
for x in zoup.find_all("a"):
    print(x.get("href"))
As an example the result I am getting now is: "f?p=CENELEC:110:0::::FSP_PROJECT:54685&cs=3112E12038AAA8791B214FDBC01ABB64F".
But this link does not lead anywhere. Analyzing the page with the inspector, the correct "href" is indeed this one:
"f?p=CENELEC:110:0::::FSP_PROJECT:54685&cs=3A3706F21F2613DED09897BE4C7491039".
The thing that really "shocks" me is that this result changes every time I redo the request! (while all the other elements of the page are as shown in the inspector). At the beginning I thought it was an issue with the HTML parser and I tried different ones, but the result did not change.
Inspecting the "req.text" element I discovered that the wrong link is already there, so the problem is at the requests.get level, but I cannot understand what it is.
Thank you very much for your help!

Related

Scrapy response returns empty list

I'm new to Scrapy and I'm trying to extract data from sport bets on sportsbooks.
I am currently trying to extract data from the upcoming matches in the Premier League: https://sport.mrgreen.com/da-DK/filter/football/england/premier_league
(The site is in Danish)
First I have used the command "fetch" on the website, and I am able to return something back using the "response" command with both CSS and xpath from the body of the HTML code. However, when I want to extract data beyond a certain point in the HTML code ("div data-ui-view"), response just returns an empty list. (See picture)
Example
I have encircled the xpath in red. I return something when I run the following:
response.xpath('/html/body/div[1]/div')
I have tried to use both CSS on the innermost class that I could find on the data I want to extract and the direct xpath as well. Still only an empty list.
response.xpath('/html/body/div[1]/div/div')
(The above code returns "[]")
response.xpath('response.xpath('/html/body/div[1]/div/div/div[2]/div/div/div[1]/div/div[3]/div[2]/div/div/div/div/div/div[4]/div/div[2]/div/div/ul/li[1]/a/div/div[2]/div/div/div/div/button[1]/div/div[1]/div'))
(The above xpath is to a football club name)
Does anybody know what the problem might be? Thanks
You can't do response.xpath(response.xpath()), one response is enough. Also, I always use "" instead of '', and I avoid using the full XPath, which rarely works; instead try something like .//div and see what it returns. For better results, use the search options XPath has, like response.xpath(".//div[contains(text(), 'Chelsea Wolves')]//text()"). Make sure your response.url matches the URL you want to scrape.
Remember, a short and specific XPath is better than a long and ambiguous one.
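A minimal sketch of that workflow in scrapy shell (the team name is just an illustrative example; .getall() is .extract() on older Scrapy versions):

fetch("https://sport.mrgreen.com/da-DK/filter/football/england/premier_league")
print(response.url)  # confirm it matches the page you want to scrape
# short, relative XPath with a text() search instead of a long absolute path
names = response.xpath(".//div[contains(text(), 'Chelsea')]//text()").getall()
print(names)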

Scrapy empty xpath response

I'm trying to get the URLs of the images from this URL: https://www.iproperty.com.my/sale/all-residential/.
Using the Chrome extension XPath Helper, I've identified the XPath and used Scrapy shell to get a response:
fetch("https://www.iproperty.com.my/sale/all-residential/")
response.xpath("//div[#class='cFwUMy']/div[#class='fUtkLG']/div[#class='slick-initialized slick-slider']/div[#class='slick-list']/div[#class='slick-track']/div[#class='slick-slide slick-active'][1]/div[#class='img-wrapper']/a/div[#class='cHKlDH']/img[#class='lazyautosizes lazyloaded']/#src")
However, it doesn't return anything.
I've also tried:
response.xpath("//div[#class='img-wrapper']/a/div[#class='cHKlDH']")
Still not working.
How do I get the url of the image from the page? I've been successful with getting the title, location, and price, but am stuck with getting the images.
EDIT1:
So weird, I tried
response.xpath("div[#class='img-wrapper']/a")
It returns the links as expected, but
response.xpath("div[#class='img-wrapper']/a/div[#class='cHKlDH']")
and
response.xpath("//div[#class='cHKlDH']")
simply refuses to return anything.
Scrapy only downloads the initial page response.
It does not execute JavaScript the way a normal browser does.
The trick is: disable JavaScript in your browser and then check whether your desired element still exists.
On the website mentioned above, they have the image links in JSON format in the initial page response.
In Scrapy, you can do
re.findall(r"window.__INITIAL_STATE__ =(.*)window.__RENDER_APP_ERROR__", response.body, flags=re.DOTALL)
It will return you this JSON code, https://jsoneditoronline.org/?id=bbef330441b24957aeaceedcea621ba7
Under the listings > items key it has all the data (prices/images) you need.
Here is the complete working Python code:
https://repl.it/#UmairAyub/AdmirableHilariousSpellchecker
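For reference, a rough sketch of that idea with plain requests (untested; the key names inside each item, such as prices and images, are assumptions and should be checked against the JSON linked above):

import json
import re
import requests

resp = requests.get("https://www.iproperty.com.my/sale/all-residential/")
# pull the embedded JSON blob out of the initial page response
blobs = re.findall(r"window.__INITIAL_STATE__ =(.*)window.__RENDER_APP_ERROR__", resp.text, flags=re.DOTALL)
data = json.loads(blobs[0].strip().rstrip(";"))
# the listings sit under the listings > items keys (see above)
for item in data["listings"]["items"]:
    print(item.get("prices"), item.get("images"))  # key names are assumptions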

Python 3.5 requests for crawling

I have a coding problem regarding Python 3.5 web crawling.
I am trying to use 'requests.get' to extract the real link from 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'. An example of the code is below:
import requests
response = requests.get('http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3')
c = response.url
I expected that 'c' should be 'caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'. (I remove http:// from the link as I can't post two links in one question.)
However, it doesn't work, and it keeps returning the same link I put in.
Can anyone help with this? Many thanks in advance.
Thanks a lot to Charlie.
I have found the solution. I first use .content.decode to read the response content, but that is mixed up with a lot of irrelevant info. I then use re.findall to extract the redirect URL, which should be the first URL appearing in the response content. Then I use requests.get to retrieve the info. Below is the code:
import re
import requests

url = 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'
rep1 = requests.get(url)
cont = rep1.content.decode('utf-8')
extract_cont = re.findall('"([^"]*)"', cont)  # every quoted string in the page
redir_url = extract_cont[0]                   # the redirect target is the first one
rep = requests.get(redir_url)
You may consider looking into the response headers for a 'location' header.
response.headers['location']
You may also consider looking at the response history, which contains a response instance for each redirect in the chain:
response.history
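For illustration, a quick sketch of what that inspection looks like against a URL that does redirect at the HTTP level (httpbin.org is just an arbitrary test service):

import requests

r = requests.get("http://httpbin.org/redirect/1")
print(r.history)                              # intermediate Response objects
print(r.history[0].headers.get("location"))   # Location header of the first hop
print(r.url)                                  # final URL after following redirects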
Your sample URL doesn't redirect at the HTTP level; the response is a 200, and then it uses a JavaScript window.location change. The requests library won't support this type of redirect.
<script>window.location.replace("http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm")</script>
<noscript><META http-equiv="refresh" content="0;URL='http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'"></noscript>
If you know you will always be using this one service, you could parse the response, maybe using regex.
If you don't know what service will always be used and also want to handle every possible situation, you might need to instantiate a WebKit instance or something and somehow try to determine when it finally finishes. I'm sure there's a page load complete event which you could use, but you still might have pages that do a window.location change after the page is loaded using a timer. This will be very heavyweight and still not cover every conceivable type of redirect.
I recommend starting with writing a special handler for each type of edge case and fallback on a default handler that just looks at the response.url. As new edge cases come up, write new handlers. It's kind of the 'trial and error' approach.
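For example, a hedged sketch of one such handler for the window.location pattern shown above, falling back to response.url when no JavaScript redirect is found:

import re
import requests

resp = requests.get("http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3")
m = re.search(r'window\.location\.replace\("([^"]+)"\)', resp.text)
real_url = m.group(1) if m else resp.url  # default handler: just use the final URL
print(real_url)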

web scraping in txt mode

I am currently using Watir to scrape a website that hides all its data from the usual HTML source. If I am not wrong, they are using XML and AJAX technology to hide it. Firefox can see the data, but only via "DOM Source of Selection".
Everything works fine, but now I am looking for a tool equivalent to Watir that works without a browser, purely with text.
Right now Watir uses my browser to render the page and returns the whole HTML code I am looking for. I would like to do the same, but without the browser.
Is it possible?
Thanks
Regards
Tak
Your best bet would be to use something like WebScarab and capture the URLs of the AJAX requests your browser makes.
That way, you can grab the "important" data yourself by simulating those calls with any HTTP library.
It is possible with a little Python coding.
I wrote a simple script to fetch locations of cargo offices.
First steps
Open the AJAX page with Google Chrome, for example. It is in Turkish, but you can follow along.
http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
Press F12 to open the developer tools at the bottom and navigate to the Network tab.
Select the XHR tab at the bottom.
Make an AJAX request by selecting an item in the first combobox, then go to the Headers tab.
You will see GetTownByCity in the left pane; click it and inspect it.
Request URL: (...)/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetTownByCity
Request Method: POST
Status Code: 200 OK
In the Request Payload tree item you will see the payload:
Request Payload: {cityId:34}
This is enough to implement the Python code.
Let's do it.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json
# import simplejson as json

baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/'
getTown = 'ajaxproxy-sswservices.aspx/GetTownByCity'
urlGetTown = baseUrl + ajaxRoot + getTown

headers = {'content-type': 'application/json', 'encoding': 'utf-8'}  # we send a JSON body, built from a Python dictionary

for plaka in range(1, 82):  # Turkiye has number plates from 1 to 81
    payload = {'cityId': plaka}
    r = requests.post(urlGetTown, data=json.dumps(payload), headers=headers)
    data = r.json()  # the returned data is JSON; if you need the raw body use r.content
    # ... process the fetched data with a JSON parser,
    # or, if it were HTML, with Beautiful Soup, lxml, etc.
Note that this code is part of my working code and was written on the fly; most importantly, I did not test it. It may require small modifications to run.
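As a usage note, reusing urlGetTown and headers from the snippet above, a single request can be inspected like this (a sketch; the structure of the returned JSON depends on the service):

r = requests.post(urlGetTown, data=json.dumps({'cityId': 34}), headers=headers)  # 34 = Istanbul
print(json.dumps(r.json(), indent=2, ensure_ascii=False))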

Retrieving information with Python's urllib from a page that is done via __doPostBack()?

I'm trying to parse a page that has different sections that are loaded with a Javascript __doPostBack() function.
An example of a link is: javascript:__doPostBack('ctl00$cphMain$ucOemSchPicker$dlSch$ctl03$btnSch','')
As soon as this is clicked, the browser doesn't fetch a new URL, but a section of the webpage is updated to reflect new information.
What would I pass into a urllib function to complete the operation?
javascript:__doPostBack('...
(Urgh. That's a sad and nasty approach.)
A simple general-purpose approach for finding URLs whose logic is buried in JavaScript is to run the page normally with a network debugger on (e.g. Firebug's ‘Net’ tab, or Fiddler). By monitoring the request made when you click, you can see what URL and what POST request body parameters are to be passed.
You'll need to use the data argument of urlopen to send POST request bodies.
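A minimal sketch with Python 3's urllib (the page URL is hypothetical, and the exact field names, including any __VIEWSTATE / __EVENTVALIDATION values, must be copied from what the network debugger shows):

from urllib.parse import urlencode
from urllib.request import urlopen

params = urlencode({
    "__EVENTTARGET": "ctl00$cphMain$ucOemSchPicker$dlSch$ctl03$btnSch",
    "__EVENTARGUMENT": "",
    # "__VIEWSTATE": "...",        # copy from the page's hidden form fields, if present
    # "__EVENTVALIDATION": "...",  # same
}).encode("utf-8")

resp = urlopen("http://example.com/the-page.aspx", data=params)  # hypothetical URL
print(resp.read()[:500])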
