It is trivial to extract the PDF URL from the following webpage.
https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745
But when I wget it, the output contains something like the following instead of a downloaded PDF file.
<p>OSA has implemented a process that requires you to enter the letters and/or numbers below before you can download this article.</p>
Since the website uses the cookie CFID, it is presumably protected by ColdFusion. Does anybody know how to scrape such a webpage? Thanks.
https://cookiepedia.co.uk/cookies/CFID
EDIT: The wget solution offered by Sev Roberts does not work. I checked the Chrome DevTools (in a new incognito window), and many requests are sent after the first request to https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745. I guess that because wget does not send those extra requests, the subsequent wget (with cookies) of https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0 does not work. Can anybody tell which of those extra requests are essential? Thanks.
There are several methods that sites use against this sort of scraping and direct linking or embedding. The basic old methods included:
(1) Checking the user's cookies: to at least check the user already had a session from a previous page on this site; some sites might go further and look for the presence of specific cookie or session variables that verify a genuine path through the site.
(2) Checking the cgi.http_referer variable to see whether the user arrived from the expected source.
(3) Checking whether the cgi.http_user_agent looks like a known human browser, or checking that the user agent does not look like a known bot browser.
Other more intelligent methods of course exist, but in my experience if you're requiring more than the above then you're reaching the territory of requiring a captcha and/or requiring a user to register and log in.
Obviously (2) and (3) are easily spoofed by setting the headers manually. For (1), if you're using cfhttp or its equivalent in another language, you need to ensure that the cookies returned in the Set-Cookie header of the site's response are sent back in the headers of your subsequent request by using cfhttpparam. Various cfhttp wrappers and alternative libraries, such as Java wrappers bypassing the cfhttp layer, are available to do this. But if you want to understand a simple example of how this works, Ben Nadel has an old but good one here: https://www.bennadel.com/blog/725-maintaining-sessions-across-multiple-coldfusion-cfhttp-requests.htm
With the pdf url from the link in your question, a couple of minutes tinkering in Chrome shows that if I lose the cookies from the previous page and keep the http_referer then I see the captcha challenge, but if I keep the cookies and lose the http_referer then I get directly through to the pdf. This confirms that they care about the cookies but not the referer.
Copy of Ben's example for SO completeness:
<cffunction
name="GetResponseCookies"
access="public"
returntype="struct"
output="false"
hint="This parses the response of a CFHttp call and puts the cookies into a struct.">
<!--- Define arguments. --->
<cfargument
name="Response"
type="struct"
required="true"
hint="The response of a CFHttp call."
/>
<!---
Create the default struct in which we will hold
the response cookies. This struct will contain structs
and will be keyed on the name of the cookie to be set.
--->
<cfset LOCAL.Cookies = StructNew() />
<!---
Get a reference to the cookies that were returned
from the page request. This will give us a numerically
indexed struct of cookie strings (which we will have
to parse out for values). BUT, check to make sure
that cookies were even sent in the response. If they
were not, then there is no work to be done.
--->
<cfif NOT StructKeyExists(
ARGUMENTS.Response.ResponseHeader,
"Set-Cookie"
)>
<!---
No cookies were sent back in the response. Just
return the empty cookies structure.
--->
<cfreturn LOCAL.Cookies />
</cfif>
<!---
ASSERT: We know that cookies were returned in the page
response and that they are available at the key,
"Set-Cookie" of the response header.
--->
<!---
Now that we know that the cookies were returned, get
a reference to the struct as described above.
--->
<!---
The cookies might be coming back as a struct or they
might be coming back as a string. If there is only
ONE cookie being returned, then it comes back as a
string. If that is the case, then re-store it as a
struct.
--->
<cfif IsSimpleValue(ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ])>
<cfset LOCAL.ReturnedCookies = {} />
<cfset LOCAL.ReturnedCookies[1] = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
<cfelse>
<cfset LOCAL.ReturnedCookies = ARGUMENTS.Response.ResponseHeader[ "Set-Cookie" ] />
</cfif>
<!--- Loop over the returned cookies struct. --->
<cfloop
item="LOCAL.CookieIndex"
collection="#LOCAL.ReturnedCookies#">
<!---
As we loop through the cookie struct, get
the cookie string we want to parse.
--->
<cfset LOCAL.CookieString = LOCAL.ReturnedCookies[ LOCAL.CookieIndex ] />
<!---
For each of these cookie strings, we are going to
need to parse out the values. We can treat the
cookie string as a semi-colon delimited list.
--->
<cfloop
index="LOCAL.Index"
from="1"
to="#ListLen( LOCAL.CookieString, ';' )#"
step="1">
<!--- Get the name-value pair. --->
<cfset LOCAL.Pair = ListGetAt(
LOCAL.CookieString,
LOCAL.Index,
";"
) />
<!---
Get the name as the first part of the pair
separated by the equals sign.
--->
<cfset LOCAL.Name = ListFirst( LOCAL.Pair, "=" ) />
<!---
Check to see if we have a value part. Not all
cookies are going to send values of length,
which can throw off ColdFusion.
--->
<cfif (ListLen( LOCAL.Pair, "=" ) GT 1)>
<!--- Grab the rest of the list. --->
<cfset LOCAL.Value = ListRest( LOCAL.Pair, "=" ) />
<cfelse>
<!---
Since ColdFusion did not find more than one
value in the list, just get the empty string
as the value.
--->
<cfset LOCAL.Value = "" />
</cfif>
<!---
Now that we have the name-value data values,
we have to store them in the struct. If we are
looking at the first part of the cookie string,
this is going to be the name of the cookie and
its struct index.
--->
<cfif (LOCAL.Index EQ 1)>
<!---
Create a new struct with this cookie's name
as the key in the return cookie struct.
--->
<cfset LOCAL.Cookies[ LOCAL.Name ] = StructNew() />
<!---
Now that we have the struct in place, lets
get a reference to it so that we can refer
to it in subsequent loops.
--->
<cfset LOCAL.Cookie = LOCAL.Cookies[ LOCAL.Name ] />
<!--- Store the value of this cookie. --->
<cfset LOCAL.Cookie.Value = LOCAL.Value />
<!---
Now, this cookie might have more than just
the first name-value pair. Let's create an
additional attributes struct to hold those
values.
--->
<cfset LOCAL.Cookie.Attributes = StructNew() />
<cfelse>
<!---
For all subsequent calls, just store the
name-value pair into the established
cookie's attributes struct.
--->
<cfset LOCAL.Cookie.Attributes[ LOCAL.Name ] = LOCAL.Value />
</cfif>
</cfloop>
</cfloop>
<!--- Return the cookies. --->
<cfreturn LOCAL.Cookies />
</cffunction>
Assuming you have a cfhttp response from the first page https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745 and pass that response into the above function and hold its result in a variable named cookieStruct, then you can use this inside subsequent cfhttp requests:
<cfloop item="strCookie" collection="#cookieStruct#">
<cfhttpparam type="COOKIE" name="#strCookie#" value="#cookieStruct[strCookie].Value#" />
</cfloop>
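For readers who are not on ColdFusion, here is a minimal Python sketch of the same idea: split each Set-Cookie string on semicolons, treat the first pair as the cookie's name and value, and keep the remaining pairs as attributes. The function name and the sample cookie values are made up for illustration; this is not Ben's code, and real Set-Cookie headers can be messier than this handles.

```python
def parse_set_cookie_headers(set_cookie_values):
    """Parse raw Set-Cookie strings into {name: {"value": ..., "attributes": {...}}}."""
    cookies = {}
    for cookie_string in set_cookie_values:
        # Treat the header as a semicolon-delimited list, skipping empty pieces.
        parts = [p.strip() for p in cookie_string.split(";") if p.strip()]
        if not parts:
            continue
        # The first pair is the cookie's own name and value.
        name, _, value = parts[0].partition("=")
        entry = {"value": value, "attributes": {}}
        # Later pairs are attributes; some (like HttpOnly) have no value at all.
        for attr in parts[1:]:
            attr_name, _, attr_value = attr.partition("=")
            entry["attributes"][attr_name.strip()] = attr_value.strip()
        cookies[name.strip()] = entry
    return cookies


# Hypothetical header values, only to show the resulting shape.
cookie_struct = parse_set_cookie_headers([
    "CFID=12345; Path=/; HttpOnly",
    "CFTOKEN=abcdef; Path=/",
])

# The (name, value) pairs you would send back on the next request.
to_send = [(name, c["value"]) for name, c in cookie_struct.items()]
```

Only the name and value go back to the server; the attributes are instructions to the client and are not echoed in the Cookie header.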
Edit: if you are using wget instead of cfhttp, you could try the approach from the answer to this question, but without posting a username and password, since you don't actually need a login form:
How to get past the login page with Wget?
For example:
# Get a session.
wget --save-cookies cookies.txt \
     --keep-session-cookies \
     --delete-after \
     "https://www.osapublishing.org/boe/abstract.cfm?uri=boe-11-5-2745"
# Now grab the page or pages we care about.
# You may also need to add valid http_referer or http_user_agent headers.
# The URL is quoted so the shell does not treat "&" as a background operator.
wget --load-cookies cookies.txt \
     "https://www.osapublishing.org/boe/viewmedia.cfm?uri=boe-11-5-2745&seq=0"
...although as others have pointed out, you may be violating the terms of service of the source, so I couldn't recommend actually doing this.
I've created a bucket whose policy makes it mandatory to specify the content-type of the object being uploaded. If I specify the content-type after the file element (example below)
<form action="..">
...
<input type='file' name='file' />
<input name='content-type' value='image/jpeg' />
<input type='submit' />
</form>
it returns the following error:
<Error>
<Code>AccessDenied</Code>
<Message>Invalid according to Policy: Policy Condition failed: ["starts-with", "$Content-Type", ""]</Message>
<RequestId>15063EB427B4A469</RequestId>
<HostId>yEzAPF4Z2inaafhcqyQ4ooLVKdwnsrwqQhnYg6jm5hPQWSOLtPTuk0t9hn+zkBEbk+rP4S5Nfvs=</HostId>
</Error>
If I specify content-type before file element, upload works as expected. I encountered this behaviour for the first time. I have a few questions regarding it.
Is it part of some specification where clients and all intermediate proxies are supposed to maintain order of http post params? Please point me to it.
Why would you make your API be aware of such ordering? In this particular case I can guess that the file can be huge, and if you have not seen all the expected params before the file arrives, you can return failure immediately. Please correct me if my understanding is not correct.
It is part of the spec that the parts are sent as ordered in the form. There is no reason to believe that reordering by an intermediate proxy would be allowed.
The form data and boundaries (excluding the contents of the file) cannot exceed 20K.
...
The file or content must be the last field in the form.
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTForms.html#sigv4-HTTPPOSTFormFields
The logical assumption is that this design allows S3 to reject invalid uploads early.
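If you build the POST body yourself rather than relying on an HTML form, the same ordering rule applies. Here is a rough Python sketch, assuming you encode the multipart body by hand: ordinary fields are written in the order given and the file part is always appended last. The function name and the sample field values are hypothetical, and a real S3 POST also needs the policy and signature fields, which are left out here.

```python
import uuid


def build_multipart_body(fields, filename, file_bytes, file_content_type):
    """Encode ordinary form fields first and the file part last."""
    boundary = uuid.uuid4().hex
    parts = []
    # Ordinary fields (such as Content-Type) keep the order they were given in.
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode("utf-8")
        )
    # The file part goes last, so every policy field is seen before the payload.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: {file_content_type}\r\n\r\n'.encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return boundary, b"".join(parts)


# Hypothetical values, only to show the ordering.
boundary, body = build_multipart_body(
    fields=[("key", "uploads/photo.jpg"), ("Content-Type", "image/jpeg")],
    filename="photo.jpg",
    file_bytes=b"\xff\xd8\xff",
    file_content_type="image/jpeg",
)
```

The request itself would then be sent with a Content-Type header of multipart/form-data carrying the same boundary.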
I’m trying to use Helicon Ape’s mod_xsendfile with a Railo server (Windows 2012 R2). mod_xsendfile functions correctly and works fine with PHP: it delivers the file and also passes the content-length value to the browser. With PHP I found no file size limit and no significant use of server memory, regardless of the file size.
With Railo, the obvious first attempt:
<cfcontent type="text/plain">
<cfheader name="content-disposition" value="attachment; filename=test.txt"/>
<cfheader name="X-Sendfile" value="D:\iis\hello.txt"/>
This does not work. It returns a blank file, and no error log is generated by Helicon Ape, so it is safe to assume the X-Sendfile header is not passed to IIS correctly.
Second Attempt
<cfheader name="content-disposition" value="attachment; filename=test.txt"/>
<cfset Response = GetPageContext().GetResponse() />
<cfset Response.setHeader('X-Sendfile','D:\iis\hello.txt')>
<cfset Response.setContentType('text/plain')>
<cfset Response.GetOutputStream().Flush() />
<cfset Response.Reset() />
<cfset Response.Finish() />
This works, with two limitations:
Limitation 1: When the file size is more than 2GB, the browser returns the error “ERR_INVALID_CHUNKED_ENCODING”. It works fine with smaller files, with no memory issues. (Again, PHP does not seem to have this issue, and IIS does not have a size limit either.)
Limitation 2: This does not pass the content-length to the browser, hence the browser doesn’t know the size of the file.
Third Attempt: Add content-length manually. (This is not necessary with PHP.)
<cfset filePath = "D:\iis\246.zip">
<cfheader name="content-disposition" value="attachment; filename=246.zip"/>
<cfset Response = GetPageContext().GetResponse() />
<cfset Response.setContentLength( createObject("java","java.io.File").init( filePath ).length() )>
<cfset Response.setHeader('X-Sendfile', filePath )>
<cfset Response.setContentType('application/octet-stream')>
<cfset Response.GetOutputStream().Flush() />
<cfset Response.Reset() />
<cfset Response.Finish() />
The content-length is passed to the browser, but unlike with PHP, IIS tries to allocate memory for the file and soon ends up with an “Overflow or underflow in the arithmetic operation” error.
I’m sure I’m not handling GetPageContext().GetResponse() correctly. If anyone can help me out here, I would highly appreciate it.
If you use BonCode to connect to IIS, it has facilities for spooling large files without overloading the server memory limit, thus allowing efficient streaming.
You will need to add the FlushThresholdBytes setting to your BonCode setting (check c:\windows), e.g.:
<FlushThresholdBytes>10240</FlushThresholdBytes>
However, from my limited understanding of Railo, it seems to load the whole file into memory, which would create a limit on the file size you can stream.
-John
I have created some headers to log in to a server.
After logging in to the server, I move from one page to another using the geturl operation with the headers below, but the problem is that I get logged out of the server and cannot move any further.
I think the cookie information is missing.
set headers(Accept) "text/html\;q=0.9,text/plain\;q=0.8,image/png,*/*"
set headers(Accept-Language) "en-us,en\;q=0.5"
set headers(Accept-Charset) "ISO-8859-1,utf-8\;q=0.7,*\;q=0.7"
set headers(Proxy-Authorization) "[concat \"Basic\" [base64::encode $username:$password]]"
I don't know how to set cookie information in the headers. Could someone explain?
Thanks
Malli
Cookie support in Tcl is currently exceptionally primitive; I've got 95–99% of the fix in our fossil repository, but that's not much help to you. But for straight handling a session cookie for login purposes, you can “guerilla hack” it.
Sending the cookie
To send a cookie to the server, you need to send a header Cookie: thecookiestring. That's done by passing the -headers option to http::geturl which has a dictionary describing what to pass. We can get that from the array simply enough:
set headers(Cookie) $thecookiestring
set token [http::geturl $theurl -headers [array get headers]]
# ...
Receiving the cookie
That's definitely the easy bit. The rather-harder part is that you also need to check for a Set-Cookie header in the response when you do a login action. You get that with http::meta and then iterate through the list with foreach:
set thecookiestring ""
set token [http::geturl $theloginurl ...]
if {[http::ncode $token] >= 400} {error ...}
foreach {name value} [http::meta $token] {
if {$name ne "Set-Cookie"} continue
# Strip the stuff you probably don't care about
if {$thecookiestring ne ""} {append thecookiestring "; "}
append thecookiestring [regsub {;.*} $value ""]
}
Formally, there can be many cookies and they have all sorts of complicated features. Handling them is what I was working on in that fossil branch…
I'm assuming that you don't need to be able to forget cookies, manage persistent storage, or other such complexities. (After all, they're things you probably won't need for normal login sessions.)
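For comparison, the same "guerilla" approach can be sketched in Python, assuming the response headers are available as a list of (name, value) pairs much like the http::meta list. The function name and the sample headers are made up. Unlike the Tcl snippet, this sketch compares the header name case-insensitively, which is a small deviation I chose because HTTP header names are not case-sensitive.

```python
import re


def cookie_header_from_meta(meta):
    """Build a Cookie request header value from response header pairs."""
    pairs = []
    for name, value in meta:
        # Only Set-Cookie headers matter here.
        if name.lower() != "set-cookie":
            continue
        # Strip everything after the first ';' (Path, Expires, and so on).
        pairs.append(re.sub(r";.*", "", value).strip())
    return "; ".join(pairs)


# Hypothetical response headers from a login request.
meta = [
    ("Content-Type", "text/html"),
    ("Set-Cookie", "session=abc123; Path=/; HttpOnly"),
    ("set-cookie", "lang=en; Expires=Wed, 09 Jun 2021 10:18:14 GMT"),
]
thecookiestring = cookie_header_from_meta(meta)
```

As with the Tcl version, this ignores expiry, domains and paths, so it is only suitable for carrying a simple login session forward.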
I solved it using the Fiddler tool.
Thanks All
I recently did a website for my company using ColdFusion 9. The issue I am having is with the ColdFusion encryption/decryption function. On certain strings that I decrypt I get these weird special characters that show up.
Example:
MK/_0 <---Encrypted String Outputted
�#5&z <---Decrypted String Outputted
I'm not sure why this is happening (and only on certain strings that get decrypted).
Here is the code:
<cfset ccNum = decrypt(getCCInfo.CUST_CARDNUMBER,myKey)>
Ok, well first, I have to point out that by not specifying an encryption algorithm you are using very POOR encryption. So you'll need to fix that. Second, you should probably be using some encoding to make your crypto storage more reliable.
So try this code.
<cfset key = generateSecretKey("AES") />
<!--- Set the ciphertext to a variable. This is the string you will store for later deciphering --->
<cfset cipherText = encrypt(plaintext, key, "AES/CBC/PKCS5Padding", "HEX") />
<cfoutput>#cipherText#</cfoutput>
<!--- Then when you decrypt --->
<cfset decipherText = decrypt(cipherText, key, "AES/CBC/PKCS5Padding", "HEX") />
<cfoutput>#decipherText#</cfoutput>
The above code will use a strong crypto algorithm and will put the ciphertext into a much easier to store format than the gibberish you showed as an example above. That way when you store it, it will be more reliable when you retrieve it again.
Here is an example of what the string will look like:
A51BBB284D6DCCDC17D26FB481584236087C3AB272918E17963BAF749438C06A484922820EDCCD25150732CC5CF8A096
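To illustrate why the HEX encoding matters for storage, here is a small Python sketch. It does not perform AES at all; random bytes stand in for the ciphertext, and the helper names are made up. The point is only that arbitrary binary output turns into plain hex characters that round-trip safely through a text column.

```python
import binascii
import os


def to_hex(raw_bytes):
    """Encode arbitrary bytes as an upper-case hex string that is safe to store as text."""
    return binascii.hexlify(raw_bytes).decode("ascii").upper()


def from_hex(hex_string):
    """Recover the original bytes from the stored hex string."""
    return binascii.unhexlify(hex_string)


# Stand-in for real ciphertext: 48 random bytes (an assumption, not AES output).
ciphertext = os.urandom(48)

stored = to_hex(ciphertext)    # what you would write to the database
restored = from_hex(stored)    # what you would feed back into decryption
```

Each byte becomes two hex characters, so the stored string is twice as long as the binary data but contains nothing a text field could mangle.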
An HTTP cookie consists of a name-value pair and can be set by the server using this response:
HTTP/1.0 200 OK
Content-type: text/html
Set-Cookie: name=value
Set-Cookie: name2=value2; Expires=Wed, 09 Jun 2021 10:18:14 GMT
Future requests from the client will then look like this:
GET /spec.html HTTP/1.1
Host: www.example.org
Cookie: name=value; name2=value2
Is the name of the cookie case sensitive?
For example, if my server sends a response as such:
HTTP/1.0 200 OK
Content-type: text/html
Set-Cookie: Aaaa=Bbbb
Set-Cookie: aAaa=bBbb
Set-Cookie: aaAa=bbBb
Set-Cookie: aaaA=bbbB
Is it reasonable to expect a client (Chrome, FireFox, Safari, IExplorer, Opera, etc) to send future requests with the header Cookie: Aaaa=Bbbb; aAaa=bBbb; aaAa=bbBb; aaaA=bbbB;?
Note: Question is neither JSP-specific, PHP-specific, nor ASP-specific.
Cookie names are case-sensitive. The RFC does not state that explicitly, but each case-insensitive comparison is stated so explicitly, and there is no such explicit statement regarding the name of the cookie. Chrome and Firefox both treat cookies as case-sensitive and preserve all case variants as distinct cookies.
Test case (PHP):
print_r($_COOKIE);
setcookie('foo', '123');
setcookie('Foo', '456');
Load script twice, observe $_COOKIE dump on second run.
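A similar quick check can be done without a browser. This is a sketch using Python's standard http.cookies module to parse the Cookie header from the question; it only shows how that one library treats the names, not how any particular browser or server framework behaves.

```python
from http.cookies import SimpleCookie

# Parse the Cookie request header from the question.
jar = SimpleCookie()
jar.load("Aaaa=Bbbb; aAaa=bBbb; aaAa=bbBb; aaaA=bbbB")

# Each case variant is kept as a separate cookie with its own value.
names = sorted(jar.keys())
values = {name: morsel.value for name, morsel in jar.items()}
```

Here the four names stay distinct, and looking up a name in a different case (for example "AAAA") finds nothing.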
At the bottom is a script that demonstrates Cookie case sensitivity on browsers and .Net framework. Every time it is run, it will insert a cookie named xxxxxxxxxx, with random upper/lower cases. Press F5 to refresh a few times to insert a few cookies.
I have tested it on Chrome and Firefox, and both demonstrate similar behavior, something like below:
Request.Cookies["xxxxxxxxxx"].Name returns: xxxxXxXXXX
All XXXXXXXXXX Cookies:
xxxxXxXXXX
xXxxXxXXXx
XxxxxXxXXx
XXXxXxXXxX
It shows:
Cookies are case sensitive on Chrome and Firefox
.Net Framework can handle case sensitive cookies (that's why it could loop through all those cookies)
Request.Cookies["xxxxxxxxxx"] is case insensitive (that's why it returns the first cookie that case-insensitively matches the name)
As mentioned in other answers, the new RFC indicates that cookies are case sensitive, and both Chrome and Firefox seem to handle it that way. .Net Framework can handle case-sensitive cookies, but it really wants to treat cookies case-insensitively, and many of its functions do treat cookies that way (Cookies[], Cookies.Set() etc.). This inconsistency can cause many hard to track bugs.
TestCookie.aspx:
<%@ Page language="c#" AutoEventWireup="false" validateRequest=false %>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title id="title">Test Cookie Sensitivity</title>
</head>
<body>
<p>Request.Cookies["xxxxxxxxxx"].Name returns:
<%
HttpCookie cookie2 = Request.Cookies["xxxxxxxxxx"];
if (cookie2 == null) Response.Write("No cookie found");
else Response.Write(cookie2.Name);
%>
</p>
<h3>All XXXXXXXXXX Cookies:</h3>
<ul>
<%
foreach (string key in Request.Cookies.Keys)
if (key.ToLower() == "xxxxxxxxxx") Response.Write("<li>" + key + "</li>");
Random rand = new Random();
StringBuilder name = new StringBuilder();
for (int i = 0; i < 10; i++) {
if (rand.Next(2) == 0) name.Append('x');
else name.Append('X');
}
HttpCookie cookie = new HttpCookie(name.ToString());
cookie.HttpOnly = true;
cookie.Expires = DateTime.Now.AddMonths(1);
Response.Cookies.Add(cookie);
%>
</ul>
</body>
</html>
It seems cookies are actually case sensitive. There's some confusion about this. It is interesting that MSDN says otherwise:
Cookie names are NOT case-sensitive
Source: http://msdn.microsoft.com/en-us/library/ms970178.aspx (the bottom of the article says it's ©2002, so it might be outdated).
Also, the question has been asked in the asp.net forums, too: http://forums.asp.net/t/1170326.aspx?Are+cookie+names+case+sensitive+ and it seems the answer is case-sensitive.
What's going on? MSDN says no, other technologies say yes. To be sure, I tested this using ASP classic.
Code
hashUCASE = Request.Cookies("data")("Hash")
hashLCASE = Request.Cookies("data")("hash")
Response.Write "<p> hashUCASE = " & hashUCASE
Response.Write "<br> hashLCASE = " & hashLCASE
cookieNameUCASE = Request.Cookies("Data")
cookieNameLCASE = Request.Cookies("data")
Response.Write "<p> cookieNameUCASE = " & cookieNameUCASE
Response.Write "<br> cookieNameLCASE = " & cookieNameLCASE
Response.End
Results
hashUCASE: EE3305C0DAADAAAA221BD5ACF6996AAA
hashLCASE: EE3305C0DAADAAAA221BD5ACF6996AAA
cookieNameUCASE: name=1&Hash=EE3305C0DAADAAAA221BD5ACF6996AAA
cookieNameLCASE: name=1&Hash=EE3305C0DAADAAAA221BD5ACF6996AAA
As you can see in the results, the value "Hash" was created with an uppercase letter, and even when you make the request in lowercase it returns the same value, which makes the lookup not case-sensitive. Under this MS technology, it is not.
Conclusion
So, using Request.Cookies() in ASP classic, it's not case-sensitive, like Microsoft says. But wait, wasn't it supposed to be case-sensitive? This may mean that whether it is sensitive or not depends on the server-side technology handling the request, which could be normalizing the cookie name on lookup and thus making it not case-sensitive. But that's something else we'll have to test to verify.
My advice is to make tests with whatever technology you are using and establish a standard in your code base, make an agreement with your team. i.e. if you're going to use a cookie, decide if it will always be written in lowercase or uppercase anytime you are going to use it in your code. That way there won't be any case sensitivity problems because in your code it will be always declared with the same case.
tl;dr
As long as you keep a convention with the cookie names you won't have problems with case sensitivity.
According to RFC 2109 - HTTP State Management Mechanism cookie names aka attribute names are case insensitive:
4.1 Syntax: General
The two state management headers, Set-Cookie and Cookie, have common
syntactic properties involving attribute-value pairs. The following
grammar uses the notation, and tokens DIGIT (decimal digits) and
token (informally, a sequence of non-special, non-white space
characters) from the HTTP/1.1 specification [RFC 2068] to describe
their syntax.
av-pairs = av-pair *(";" av-pair)
av-pair = attr ["=" value] ; optional value
attr = token
value = word
word = token | quoted-string
Attributes (names) (attr) are case-insensitive. White space is
permitted between tokens. Note that while the above syntax
description shows value as optional, most attrs require them.
According to MSDN, cookie names are NOT case sensitive. However, I'm not sure if that's just an ASPX/IIS-specific implementation. I believe it depends on the web server and the language as well.
If you send a cookie named "UserID", the browser will make sure they send it back as "UserID", not "userid".