web scraping in txt mode - web-scraping

I am currently using Watir to scrape a website that hides all of its data from the usual HTML source. If I am not wrong, they are using XML and AJAX technology to hide it. Firefox can see the data, but only via "DOM Source of selection".
Everything works fine, but now I am looking for a tool equivalent to Watir where everything is done without a browser, working only with text files.
Right now, Watir uses my browser to render the page and returns the whole HTML code I am looking for. I would like to do the same, but without the browser.
Is it possible?
Thanks
Regards
Tak

Your best bet would be to use something like WebScarab and capture the URLs of the AJAX requests your browser is making.
That way, you can just grab the "important" data yourself by simulating those calls with any HTTP library.
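For example, once you have the captured URL and parameters, replaying the call takes only a few lines with an HTTP library. Here is a minimal sketch in Python with the requests library; the URL and parameters are placeholders standing in for whatever you actually observe in the captured traffic:
import requests

# Hypothetical values: substitute the URL and parameters you captured from the
# browser's AJAX call in WebScarab (or your browser's network tools).
url = "http://example.com/ajax/data"
params = {"id": 42}

response = requests.get(url, params=params)
print(response.text)  # the raw payload (often JSON or an HTML fragment), no browser needed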

It is possible with a little Python coding.
I wrote a simple script to fetch locations of cargo offices.
First steps:
Open the AJAX page with Google Chrome, for example (the page is in Turkish, but you can follow along):
http://www.yurticikargo.com/bilgi-servisleri/Sayfalar/en-yakin-sube.aspx
Press F12 to open the developer tools at the bottom and navigate to the Network tab.
Select the XHR filter at the bottom.
Trigger an AJAX request by selecting an item in the first combobox, then go to the Headers tab.
You will see GetTownByCity in the left pane; click it and inspect it.
Request URL: (...)/_layouts/ArikanliHolding.YurticiKargo.WebSite/ajaxproxy-sswservices.aspx/GetTownByCity
Request Method: POST
Status Code: 200 OK
In the Request Payload tree item you will see:
Request Payload: {cityId:34}
This gives us what we need to implement it in Python code. Let's do it.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests
import json
# import simplejson as json

baseUrl = 'http://www.yurticikargo.com/'
ajaxRoot = '_layouts/ArikanliHolding.YurticiKargo.WebSite/'
getTown = 'ajaxproxy-sswservices.aspx/GetTownByCity'
urlGetTown = baseUrl + ajaxRoot + getTown

headers = {'content-type': 'application/json', 'encoding': 'utf-8'}  # we are sending JSON, so set the appropriate headers

for plaka in range(1, 82):  # Turkiye has licence plate codes from 1 to 81
    payload = {'cityId': plaka}
    r = requests.post(urlGetTown, data=json.dumps(payload), headers=headers)
    data = r.json()  # the returned data is JSON; if you need raw HTML use r.content
    # ... process the fetched data with a JSON parser,
    # or, if it were HTML, with Beautiful Soup, lxml, etc.
Note that this code is part of my working code and was written on the fly; most importantly, I did not test it. It may require small modifications to run.

Related

Python 3.5 requests for crawling

I have a coding problem regarding Python 3.5 web crawling.
I am trying to use 'requests.get' to extract the real link from 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'. An example of the code is below:
import requests
response = requests.get('http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3')
c = response.url
I expected 'c' to be 'caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'. (I removed http:// from the link as I can't post two links in one question.)
However, it doesn't work; it keeps returning the same link that I put in.
Can anyone help with this? Many thanks in advance.
Thanks a lot to Charlie.
I have found a solution. I first use .content.decode to read the response content, but that is mixed up with a lot of irrelevant information. I then use re.findall to extract the redirect URL, which should be the first quoted URL in the response. Then, I use requests.get to retrieve the info. Below is the code:
import re
import requests

url = 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'
rep1 = requests.get(url)
cont = rep1.content.decode('utf-8')
extract_cont = re.findall('"([^"]*)"', cont)  # grab every quoted string in the page
redir_url = extract_cont[0]                   # the first one is the redirect target
rep = requests.get(redir_url)
You may consider looking into the response headers for a 'location' header.
response.headers['location']
You may also consider looking at the response history, which contains a response object for each hop in a chain of redirects.
response.history
Your sample URL doesn't redirect at the HTTP level; the response is a 200, and the page then uses a JavaScript window.location change. The requests library won't follow this type of redirect.
<script>window.location.replace("http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm")</script>
<noscript><META http-equiv="refresh" content="0;URL='http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'"></noscript>
If you know you will always be using this one service, you could parse the response, maybe using regex.
If you don't know what service will always be used and also want to handle every possible situation, you might need to instantiate a WebKit instance or something and somehow try to determine when it finally finishes. I'm sure there's a page load complete event which you could use, but you still might have pages that do a window.location change after the page is loaded using a timer. This will be very heavyweight and still not cover every conceivable type of redirect.
I recommend starting with writing a special handler for each type of edge case and fallback on a default handler that just looks at the response.url. As new edge cases come up, write new handlers. It's kind of the 'trial and error' approach.
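As a concrete sketch of that approach (mine, not the original poster's), a handler can first check the normal redirect machinery and then fall back to a regex for the window.location pattern shown above; anything it can't recognize falls through to response.url:
import re
import requests

def resolve_final_url(url):
    response = requests.get(url)
    # 1. ordinary HTTP redirects: requests follows them and records the chain
    if response.history:
        return response.url
    # 2. the JavaScript window.location style redirect shown above
    match = re.search(r'window\.location\.replace\("([^"]+)"\)', response.text)
    if match:
        return match.group(1)
    # 3. default handler: no redirect detected
    return response.url

# usage: resolve_final_url('http://www.baidu.com/link?url=...')  # the URL from the question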

PDF creation tool that can handle Client-side view changes

I am trying to find a tool that can be used to create PDFs from websites. These websites all have Bootstrap-based client-side view settings, such as tabs, toggles, and paging. As such, there is no post-back to the server. I need to be able to create a PDF that is in the same state as the user sees it.
In my research, I have only been able to find tools that can create PDFs if given a URL or an HTML string. Examples of these tools include ActivePDF's Webgrabber and EVO PDF. However, they are not able to generate the PDFs with the client-specific settings, but instead only see the default selections of a given page. It is not possible for me to do a post-back to the server, so I am looking for a tool that can create PDFs on the fly, with the dynamic settings intact. I am working in ASP.NET, so I would like a tool that is .NET friendly as well. Lastly, I would prefer the tool not to be open source.
It sounds like a proper solution here would be to take data from your client side, post it via AJAX (not an ASP.NET postback!) and then process it on the server side to generate the PDF. Since you haven't given much detail, I'm going to assume we don't need to send over all the HTML, but rather values that were entered via a form.
<script>
function postDataToServer() {
    var itemId = $("#ItemIdTB").val();
    var quantityOrdered = $("#QuantityTB").val();
    $.ajax({
        url: "GeneratePdfFromForm?itemId=" + itemId + "&quantityOrdered=" + quantityOrdered
    })
    .done(function () { alert("Success!"); })
    .fail(function () { alert("Failure!"); });
}
</script>
My answer is using jQuery, which is a JavaScript library that simplifies AJAX and DOM manipulation. However, this technique is doable with other JavaScript libraries (or even without a library).
On the server side, let's have an ASP.NET Web API function that can handle that AJAX Post.
[Route("GeneratePdfFromForm")]
public static void GeneratePdfFromForm(string itemId, int quantityOrdered)
{
Debug.WriteLine("Received itemId {0} and quantity {1}", itemId, quantityOrdered);
byte[] pdf = GeneratePDF(itemId, quantityOdered); //you'll need a function called GeneratePDF that can generate your PDF based on the parameters, probably using a library like iTextSharp.
//now do you want with the PDF byte array
}
So that's how you'd generate the PDF. You could return it from the GeneratePdfFromForm() function back to the client. But since that's potentially a long running task, you should probably implement it in the background using something like Hangfire, then when the PDF is ready you'd present it to the client for download (perhaps using SignalR or jQuery AJAX polling to alert the client of when the PDF is ready).
I was passing data from the client side to the server side via query string. You could instead create a class to represent the parameters for your PDF generation, then pass that from the client side to the server side via jQuery AJAX's data parameter.
If you really want to post the entire HTML from the client side to the server, you could do something like this:
var html = $("html").html();
$.ajax({
    url: <your url here>,
    data: html,
    contentType: "text/html"
});
However, I'm willing to bet the actual HTML isn't what you care about for generating the PDF, but rather the selected values from some client-side form. The actual HTML would include everything on the client side, including navigation menus, scripts, etc. You could post a subset of the HTML using a different selector (ex: $("#OrderDiv").html()), which is a technique some people use to generate PDFs. But I think it's much cleaner to decouple your HTML intended for the browser from the way the PDF is generated, so that changes to your site don't mess up the PDF. You can then use the PDF library's capabilities to build the PDF rather than using HTML.

having trouble uploading images from Flex to CQ5 DAM

I have written a Flex component to allow the user to select an image from the local filesystem and then POST it to a CQ5 DAM.
There are two CQ5 instances with which I'm working. The image posts fine to one instance, but not the other. Specifically, on the second instance, the renditions are not getting created when using the component.
One difference I've noted is that the working images, when I look at them in CRXDE, have a jcr:primaryType of dam:Asset; the non-working ones are nt:File.
From Flex, I use URLLoader to POST with a multipart form. The request (in part) looks like this:
POST /content/dam/test/foo.createasset.html HTTP/1.1
Host: xxxxxxxx:4502
Content-type: multipart/form-data; boundary=doudrbitutcfasnbhlpogirdctuxem
--doudrbitutcfasnbhlpogirdctuxem
Content-Disposition: form-data; name="file"
home.png
--doudrbitutcfasnbhlpogirdctuxem
Content-Disposition: form-data; name="Filename"
home.png
--doudrbitutcfasnbhlpogirdctuxem
Content-Disposition: form-data; name="home.png"; filename="home.png"
Content-Type: application/octet-stream
*** image data ***
--doudrbitutcfasnbhlpogirdctuxem
Content-Disposition: form-data; name="Upload"
Submit Query
--doudrbitutcfasnbhlpogirdctuxem--
That does save the image at: /content/dam/test/foo/home.png
I've tried adding a variable to the form:
./jcr:contentType dam:Asset
but that didn't cause the content type to change. Instead, the file didn't show up in CQ5 at all.
I know next to nothing about CQ5. I've seen some (old) examples of code POSTing right to where they want the asset to go, instead of hitting foo.createAsset.html as I've done. I could not get the more-straightforward POST working, and instead used the CQ5 DAM to upload an image, captured the traffic with Charles, and then tried to replicate that.
The CQ5 version that works is 5.5.0.
The version that does not is 5.4.0.
I'm sure that there are other configuration differences as well. In addition, the client is unwilling to upgrade from 5.4.0.
Am I on the right track? Close?
Edit to clarify server setup:
CQ 5.5.0 --> installed locally; this one is an author server. My component works when POSTing to this server, meaning the uploaded image is marked as dam:Asset and the renditions are generated.
CQ 5.4.0 --> a dev instance used by many; this is an author and publish server. My component does not 100% work when POSTing to this server. However, if I use the DAM admin interface to upload an image, it does properly mark the image as dam:Asset and generate the renditions.
Edit #2: WORKING
It turns out that the dev/5.4 instance handles file uploads differently. My multipart POST code mostly worked, but instead of using createAsset.html, I'm uploading to /tmp/fileupload.
Then I issue a second POST, using application/x-www-form-urlencoded, to issue a move command.
For those wishing to do the same, the move code looks like this:
var service:HTTPService = new HTTPService();
var url:String = instanceUrl + "/tmp/fileupload";
service.url = url;
var headerData : Object = new Object();
headerData['Cache-Control'] = 'no-store';
headerData['Authorization'] = getAuthString();
service.headers = headerData;
service.contentType = "application/x-www-form-urlencoded; charset=UTF-8";
service.method = URLRequestMethod.POST;
var urlVar:URLVariables = new URLVariables();
var command:String = "/var/dam/" + destPath + "/" + filename + "#MoveFrom";
var arg:String = "/tmp/fileupload/" + filename;
urlVar[command] = arg;
urlVar["_charset_"] = "utf-8";
var token:AsyncToken = service.send(urlVar);
Not knowing CQ5, I can only assume the dev server is set up to run some workflow steps when it receives the #MoveFrom; those are the steps that ensure the uploaded file is of type dam:Asset and that the desired renditions are created.
If uploading from the DAM admin page via a browser works on the 5.4.0 instance, I would suggest analysing the HTTP request that this makes, to reproduce the same request from your Flex client. There's probably a subtle difference between the 5.4.0 and 5.5.0 HTTP APIs that explains this.
As a follow-up, below are the broad steps I took to get this working.
My overall goal was to write a Flex component that, for a specified VO, allowed the user to upload an image from their local filesystem (I used FileReference for this) into the component, then upload that image to CQ5 and publish it. After it was published, I then read it back into the component to display it.
I won't put the full code solution here, as it's involved and belongs to my client. In addition to my component, I wrote a utility for CQ5 DAM operations, and an HTTP service with built-in retries (which ended up being necessary because even though CQ will give me a 200 when I request a resource, subsequent operations on that resource may fail, because CQ doesn't seem to think it's there yet). Note that in all retry cases I have a max retry count; the default value is 10, and the default retry interval is 250ms.
Please understand I know very little of CQ; most of what I learned was reverse engineered by trying things in the tool and watching Charles. Also understand that the steps below may be very specific to the install of CQ5 I'm working with.
So here are my overall steps. Unless indicated otherwise, all requests are on port 4502:
1. A destination directory is determined from data in the VO and a POST is issued to create it. This is done with Content-type=application/x-www-form-urlencoded. The URL is the full path of the folder I want to create, with no trailing slash.
2. Repeat a GET on the created directory until we get a 200. The URL here does have a trailing slash.
3. The image is POSTed to a temp area, [instance]/tmp/fileupload, as multipart form data. To help with this, I used an MIT-licensed AS class called MultipartURLLoader (https://code.google.com/p/in-spirit/). I used Content-type=multipart/form-data; boundary=[boundary]. CQ seemed very picky about the contents of the form data; mine is set up like this:
file: [name of file]
Filename: [name of file]
[name of file]: [file data]
Upload: Submit Query
4. Another POST is issued, with a move command, to move the image from the temp area to the directory created in step 1. The URL is [instance]/tmp/fileupload, and Content-type=application/x-www-form-urlencoded; charset=UTF-8. The form data is set up like this:
/var/dam/[destination_path]/[filename]#MoveFrom: /tmp/fileupload/[filename]
_charset_: utf-8
5. Repeat step 4 until we get a 200. When new destination folders are created, the first POST to #MoveFrom usually results in a 500 saying the destination folder is not there. Perhaps there's another way to ask CQ if the destination is ready? I don't know.
6. We now need to publish the file, but first we issue a series of GETs on it to ensure it's there, with this URL: [instance]/content/dam/[destination]/[filename].assets.json. Once it's there, CQ will respond with some JSON that we use next.
7. Check whether the file has already been published. It may be that the user has already uploaded an image with the same name to the same location. The JSON response has a results node, which I check to see if it's 1. If it is, I look at "pages[0].replication" to see if it has a node called "action". If it does, I check whether the value is "ACTIVATE"; if so, it's already published. In every other case, I try to publish it.
8. POST a command to activate (publish) it. The URL is [instance]/bin/replicate.json, Content-type=application/x-www-form-urlencoded; charset=UTF-8. The form looks like this:
path: /content/dam/[destination]/[filename]
cmd: Activate
_charset_: utf-8
9. For my purposes, I wanted to then retrieve the published image to re-display it in my component. I waited for the 200 from the publish, then tried my GET. The URL I used here had no port number and no trailing slash: [instance:80]/content/dam/[destination]/[filename]. The first call almost always gave me a 404, so I kept trying until I got the 200.
That's it. I hope this is helpful to someone.
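For anyone who would rather prototype the same sequence outside of Flex, here is a minimal sketch using Python's requests library. The instance URL, credentials, destination path, and filename are placeholder assumptions; the endpoints and form fields are simply the ones listed in the steps above, so verify them against your own instance:
import requests

instance = "http://localhost:4502"   # hypothetical author instance
auth = ("admin", "admin")            # hypothetical credentials
dest = "test/foo"                    # destination folder under /content/dam
filename = "home.png"

# 1. POST the binary to the temporary upload area as multipart form data
with open(filename, "rb") as fh:
    files = {
        "file": (None, filename),
        "Filename": (None, filename),
        filename: (filename, fh, "application/octet-stream"),
        "Upload": (None, "Submit Query"),
    }
    requests.post(instance + "/tmp/fileupload", files=files, auth=auth)

# 2. POST the move command that pulls the file into the DAM (this is what
#    triggers the dam:Asset conversion and rendition workflows)
move = {
    "/var/dam/" + dest + "/" + filename + "#MoveFrom": "/tmp/fileupload/" + filename,
    "_charset_": "utf-8",
}
requests.post(instance + "/tmp/fileupload", data=move, auth=auth)

# 3. POST the activate (publish) command
activate = {
    "path": "/content/dam/" + dest + "/" + filename,
    "cmd": "Activate",
    "_charset_": "utf-8",
}
requests.post(instance + "/bin/replicate.json", data=activate, auth=auth)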

Retrieving information with Python's urllib from a page that is done via __doPostBack()?

I'm trying to parse a page that has different sections that are loaded with a Javascript __doPostBack() function.
An example of a link is: javascript:__doPostBack('ctl00$cphMain$ucOemSchPicker$dlSch$ctl03$btnSch','')
As soon as this is clicked, the browser doesn't fetch a new URL; instead, a section of the webpage is updated to reflect new information.
What would I pass into a urllib function to complete the operation?
javascript:__doPostBack('...
(Urgh. That's a sad and nasty approach.)
A simple general-purpose approach for finding URLs whose logic is buried in JavaScript is to run the page normally, with a network debugger on (e.g. Firebug's ‘Net’ tab, or Fiddler). By monitoring the request made when you click, you can see which URL and which POST request body parameters are passed.
You'll need to use the data argument of urlopen to send POST request bodies.
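A rough sketch of what that looks like with urllib/urllib2 (Python 2, matching the question), once a network debugger has shown you which fields the page posts. The field names below are the standard ASP.NET postback fields, but the exact set and the hidden values must be copied from your own page and captured request, so treat this as an assumption to verify:
import urllib
import urllib2

url = 'http://example.com/SomePage.aspx'  # hypothetical: the page containing the __doPostBack link
form = {
    # __doPostBack(target, argument) normally fills in these two fields:
    '__EVENTTARGET': 'ctl00$cphMain$ucOemSchPicker$dlSch$ctl03$btnSch',
    '__EVENTARGUMENT': '',
    # hidden state fields: copy the real values from the initial GET of the page
    '__VIEWSTATE': '...',
    '__EVENTVALIDATION': '...',
}
response = urllib2.urlopen(url, urllib.urlencode(form))  # a data argument makes it a POST
html = response.read()  # the updated section comes back in this response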

how to submit query to .aspx page in python

I need to scrape query results from an .aspx web page.
http://legistar.council.nyc.gov/Legislation.aspx
The URL is static, so how do I submit a query to this page and get the results? Assume we need to select "all years" and "all types" from the respective dropdown menus.
Somebody out there must know how to do this.
As an overview, you will need to perform four main tasks:
to submit request(s) to the web site,
to retrieve the response(s) from the site
to parse these responses
to have some logic to iterate in the tasks above, with parameters associated with the navigation (to "next" pages in the results list)
The HTTP request and response handling is done with methods and classes from Python's standard library modules urllib and urllib2. The parsing of the HTML pages can be done with the standard library's HTMLParser or with other modules such as Beautiful Soup.
The following snippet demonstrates the requesting and receiving of a search at the site indicated in the question. The site is ASP-driven, and as a result we need to ensure that we send several form fields, some of them with 'horrible' values, as these are used by the ASP logic to maintain state and to authenticate the request to some extent. The requests have to be sent with the HTTP POST method, as this is what is expected from this ASP application. The main difficulty is identifying the form fields and associated values which ASP expects (getting pages with Python is the easy part).
This code is functional, or more precisely, was functional, until I removed most of the VSTATE value, and possibly introduced a typo or two by adding comments.
import urllib
import urllib2
uri = 'http://legistar.council.nyc.gov/Legislation.aspx'
# the http headers are useful to simulate a particular browser (some sites deny
# access to non-browsers, bots, etc.)
# we also need to pass the content type.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.13) Gecko/2009073022 Firefox/3.0.13',
    'Accept': 'text/html,application/xhtml+xml,application/xml; q=0.9,*/*; q=0.8',
    'Content-Type': 'application/x-www-form-urlencoded'
}
# we group the form fields and their values in a list (any
# iterable, actually) of name-value tuples. This helps
# with clarity and also makes it easy to encode them later.
formFields = (
# the viewstate is actually 800+ characters in length! I truncated it
# for this sample code. It can be lifted from the first page
# obtained from the site. It may be ok to hardcode this value, or
# it may have to be refreshed each time / each day, by essentially
# running an extra page request and parse, for this specific value.
(r'__VSTATE', r'7TzretNIlrZiKb7EOB3AQE ... ...2qd6g5xD8CGXm5EftXtNPt+H8B'),
# following are more of these ASP form fields
(r'__VIEWSTATE', r''),
(r'__EVENTVALIDATION', r'/wEWDwL+raDpAgKnpt8nAs3q+pQOAs3q/pQOAs3qgpUOAs3qhpUOAoPE36ANAve684YCAoOs79EIAoOs89EIAoOs99EIAoOs39EIAoOs49EIAoOs09EIAoSs99EI6IQ74SEV9n4XbtWm1rEbB6Ic3/M='),
(r'ctl00_RadScriptManager1_HiddenField', ''),
(r'ctl00_tabTop_ClientState', ''),
(r'ctl00_ContentPlaceHolder1_menuMain_ClientState', ''),
(r'ctl00_ContentPlaceHolder1_gridMain_ClientState', ''),
#but then we come to fields of interest: the search
#criteria the collections to search from etc.
# Check boxes
(r'ctl00$ContentPlaceHolder1$chkOptions$0', 'on'), # file number
(r'ctl00$ContentPlaceHolder1$chkOptions$1', 'on'), # Legislative text
(r'ctl00$ContentPlaceHolder1$chkOptions$2', 'on'), # attachment
# etc. (not all listed)
(r'ctl00$ContentPlaceHolder1$txtSearch', 'york'), # Search text
(r'ctl00$ContentPlaceHolder1$lstYears', 'All Years'), # Years to include
(r'ctl00$ContentPlaceHolder1$lstTypeBasic', 'All Types'), #types to include
(r'ctl00$ContentPlaceHolder1$btnSearch', 'Search Legislation') # Search button itself
)
# these have to be encoded
encodedFields = urllib.urlencode(formFields)
req = urllib2.Request(uri, encodedFields, headers)
f= urllib2.urlopen(req) #that's the actual call to the http site.
# *** here would normally be the in-memory parsing of f
# contents, but instead I store this to file
# this is useful during design, allowing to have a
# sample of what is to be parsed in a text editor, for analysis.
try:
    fout = open('tmp.htm', 'w')
except:
    print('Could not open output file\n')
fout.writelines(f.readlines())
fout.close()
That's about it for getting the initial page. As said above, one would then need to parse the page, i.e. find the parts of interest, gather them as appropriate, and store them to file/database/wherever. This job can be done in very many ways: using HTML parsers, XSLT-type technologies (after parsing the HTML to XML), or even, for crude jobs, simple regular expressions. Also, one of the items one typically extracts is the "next" info, i.e. a link of sorts that can be used in a new request to the server to get subsequent pages.
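For instance, a minimal parsing sketch with Beautiful Soup (assuming the bs4 package is installed) applied to the tmp.htm file saved above; the tag and link text used to locate the result rows and the "next" link are placeholders, since the real ones have to be read off the actual page:
from bs4 import BeautifulSoup

with open('tmp.htm') as fh:
    soup = BeautifulSoup(fh.read(), 'html.parser')

rows = soup.find_all('tr')                 # e.g. the rows of the results grid
next_link = soup.find('a', string='Next')  # hypothetical "next page" anchor text
print(len(rows))
if next_link is not None:
    print(next_link['href'])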
This should give you a rough flavor of what "long hand" HTML scraping is about. There are many other approaches, such as dedicated utilities, scripts in Mozilla's (Firefox) GreaseMonkey plug-in, XSLT...
Most ASP.NET sites (the one you referenced included) will actually post their queries back to themselves using the HTTP POST verb, not the GET verb. That is why the URL is not changing as you noted.
What you will need to do is look at the generated HTML and capture all their form values. Be sure to capture all the form values, as some of them are used for page validation and without them your POST request will be denied.
Other than the validation, with regard to scraping and posting, an ASPX page is no different from other web technologies.
Selenium is a great tool to use for this kind of task. You can specify the form values that you want to enter and retrieve the HTML of the response page as a string in a couple of lines of Python code.
Using Selenium you might not have to do the manual work of simulating a valid post request and all of its hidden variables, as I found out after much trial and error.
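Here is a minimal Selenium sketch of that idea: let a real browser fill in the hidden ASP.NET fields and read back the rendered HTML. The element names for the year/type dropdowns and the search button are borrowed from the form fields listed in the earlier answer, so treat them as assumptions to verify against the live page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Firefox()
driver.get('http://legistar.council.nyc.gov/Legislation.aspx')

# pick "All Years" and "All Types", then click the search button
Select(driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$lstYears')).select_by_visible_text('All Years')
Select(driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$lstTypeBasic')).select_by_visible_text('All Types')
driver.find_element(By.NAME, 'ctl00$ContentPlaceHolder1$btnSearch').click()

html = driver.page_source  # the results page as a string, ready for parsing
driver.quit()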
The code in the other answers was useful; I never would have been able to write my crawler without it.
One problem I did come across was cookies. The site I was crawling was using cookies to log session id/security stuff, so I had to add code to get my crawler to work:
Add this import:
import cookielib
Init the cookie stuff:
COOKIEFILE = 'cookies.lwp' # the path and filename that you want to use to save your cookies in
cj = cookielib.LWPCookieJar() # This is a subclass of FileCookieJar that has useful load and save methods
Install CookieJar so that it is used as the default CookieProcessor in the default opener handler:
cj.load(COOKIEFILE)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
To see what cookies the site is using:
print 'These are the cookies we have received so far :'
for index, cookie in enumerate(cj):
    print index, ' : ', cookie
This saves the cookies:
cj.save(COOKIEFILE) # save the cookies
"Assume we need to select "all years" and "all types" from the respective dropdown menus."
What do these options do to the URL that is ultimately submitted?
After all, it amounts to an HTTP request sent via urllib2.
To find out how to do '"all years" and "all types" from the respective dropdown menus', do the following.
Select '"all years" and "all types" from the respective dropdown menus'.
Note the URL which is actually submitted.
Use this URL in urllib2.

Resources