Why does the same URL give different results? - web-scraping

On the following page, the page links (2, 3, ...) at the bottom all point to the same URL, yet each one shows a different table. Does anybody know what technique is used here? How can I extract the information in these tables using raw HTTP requests (I would prefer not to use a headless browser)? Thanks.
https://services27.ieee.org/fellowsdirectory/home.html#results_table

It is using Javascript (AJAX) to make HTTP calls to the server.
If you inspect the Network activity in the Developer tools you will see calls to the following URL: https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html.
They send data from Javascript:
selectedJSON: {"alpha":"ALL","menu":"ALPHABETICAL","gender":"All","currPageNum":1,"breadCrumbs":[{"breadCrumb":"Alphabetical Listing "}],"helpText":"Click on any of the alphabet letters to view a list of Fellows."}
inputFilterJSON: {"sortOnList":[{"sortByField":"fellow.lastName","sortType":"ASC"}],"typeAhead":false}
pageNum: 2
You can see the pageNum property. This is how they request a specific page of results.

When you click the number buttons, some Javascript code makes an AJAX POST request to https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html;jsessionid=yoursessionid with formData including pageNum: 3 and some other formatting parameters. The server responds with the HTML block of table rows that get loaded into the page. You can look at the requests on that webpage in your browser's network inspector (in the developer tools) to see exactly what HTTP requests are happening.
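For anyone who wants to try this without a browser, here is a minimal Python sketch of the request described above. The endpoint and the three form fields are the ones quoted in the answers; the function name build_page_request is made up for illustration, and the sketch only builds the request rather than sending it. The server may also expect a session cookie or the ;jsessionid=... suffix, which is not handled here.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint reported in the answers above.
RESULTS_URL = "https://services27.ieee.org/fellowsdirectory/getpageresultsdesk.html"


def build_page_request(page_num: int) -> Request:
    """Build (but do not send) the form POST for one page of results."""
    selected = {
        "alpha": "ALL",
        "menu": "ALPHABETICAL",
        "gender": "All",
        "currPageNum": 1,
        "breadCrumbs": [{"breadCrumb": "Alphabetical Listing "}],
        "helpText": "Click on any of the alphabet letters to view a list of Fellows.",
    }
    input_filter = {
        "sortOnList": [{"sortByField": "fellow.lastName", "sortType": "ASC"}],
        "typeAhead": False,
    }
    form = {
        "selectedJSON": json.dumps(selected),
        "inputFilterJSON": json.dumps(input_filter),
        "pageNum": str(page_num),  # the field that selects the page
    }
    body = urlencode(form).encode("utf-8")
    return Request(
        RESULTS_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the result to urllib.request.urlopen() would send it; the response should then be the HTML block of table rows, which you can parse with any HTML parser.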

The link has an onclick handler that changes its href when it is clicked. Go to
https://services27.ieee.org/fellowsdirectory/home.html#results_table
In the console, enter:
window.location=getDetailProfileUrl('lOH1bDxMyI1CCIxo5ODlGg==');
This redirects to Aarons, Jules.
Now go back and enter window.location=getDetailProfileUrl('JJuL3J00kHdIUozoVAgKdg==');
This opens Aarts, Ronald.
Basically, when the link is clicked, the JavaScript changes the url of the link.
To extract them using php, use the file_get_contents() function.
echo file_get_contents('https://services27.ieee.org/fellowsdirectory/home.html#results_table');
That will print out the page. Now scrape it with JavaScript.
echo "<script>console.log(document.querySelectorAll('.name'));</script>";
Hope this helps.

Related

What does ?t=some-number mean when used at the end of an image url

Here is the example image url I found on Steam.
https://steamcommunity-a.akamaihd.net/public/shared/images/header/globalheader_logo.png?t=962016
The image url gives the same result with or without the ?t=962016. What is it called? And what does it do?
?t=962016
This is a cache-busting technique: the browser treats the URL as a new one and fetches the resource again from the web server instead of using its cached copy. The resource can be an image, a CSS file, a JS file, etc. This is the most common use case, but the web server can also use the parameter for other purposes.
There is another use case. I have done this in one of my projects.
I made all requests for *.jpg be handled by a PHP script.
Eg: mysite.com/user/avatar.jpg?id=100
avatar.jpg is actually a PHP script that takes the query parameter (in this case the id 100) and returns the corresponding user's avatar (the user with id 100). The browser sees this as an image. Another advantage is that we can prevent hotlinking to this image, as the script can check whether the request originated from the same domain.
IMO there are two possibilities:
- They added the parameter to keep the image from being served from the cache; in this case the value of t is random.
- The image is generated by a script; in this case the value of t is the id of the image.
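To make the first possibility concrete, here is a small Python sketch of how a site might append such a parameter when it generates its URLs. The helper name add_cache_buster and the example.com URLs are assumptions for illustration; nothing here is taken from Steam's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_cache_buster(url: str, version: str, param: str = "t") -> str:
    """Return url with ?<param>=<version> set, replacing any existing value."""
    parts = urlsplit(url)
    # Keep every other query parameter, drop an old cache-busting value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != param
    ]
    query.append((param, version))
    return urlunsplit(parts._replace(query=urlencode(query)))


print(add_cache_buster("https://example.com/logo.png", "962016"))
# https://example.com/logo.png?t=962016
```

Changing the version value whenever the file changes gives the browser a URL it has not cached yet, while the server typically ignores the parameter and returns the same file.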

Change HTTP status code for page in Adobe CQ5 (AEM)

I'm trying to support a CQ5 (5.5) installation developed by an outside firm for my company.
It appears that my company wanted a pretty 404 page that looked like the rest of the site, and using the custom Sling 404.jsp error handler to redirect to a regular page that merely says "Page Not Found" was the easiest way to do it. The problem is that the 404 page actually returns a 200 status code since it really is just a regular content page that bears a "Not Found" message on it.
This is causing us problems with Google and the GoogleBot, since Google believes all the old search links to now non-existent pages are still valid (200 status code).
Is there any way to configure CQ to return the appropriate 404 status code for the "not found" HTML page that we display? When I am in the CQ Author mode editing the page, I find nothing in page properties or in components that could be added to the page.
Any help would be appreciated, as CQ is not exactly my area of expertise.
You'll have to overlay /libs/sling/servlet/errorhandler/404.jsp file in order to do so - copy it to /apps/sling/servlet/errorhandler/404.jsp and change according to your specification.
And if you are looking specifically into setting appropriate response status code - you can do it by setting respective response property:
response.setStatus(404);
UPDATE: instead of redirecting to page_not_found.html, you might want to include it in 404.jsp after setting the response status:
<sling:include path="path/page_not_found.html" />
You can set the response code fairly easily with this sort of code: response.setStatus(SlingHttpServletResponse.SC_NOT_FOUND);
So for example, a quick-and-dirty implementation on your page_not_found.jsp would be as follows:
<%
response.setStatus(SlingHttpServletResponse.SC_NOT_FOUND);
%>
(or a longer-term/better implementation would be to set it via a tag and a tag library to avoid scriptlets)
If your page_not_found.html page is a static HTML page and not rendered via a jsp, you may need to change your 404.jsp so it redirects to a page that is rendered via a jsp for this approach to work. The status code is set by the server rendering the response. It is not something intrinsic in the HTML itself, so you won't be able to set this in a regular, static HTML page. Something must be done on the server to set this status code. Also see How to Return Specific HTTP Status Code in a Plain HTML Page
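The point that the status code belongs to the server's response and not to the HTML is not specific to CQ or JSP. As a language-neutral illustration, here is a minimal Python WSGI sketch (not CQ code; the name not_found_app is made up) that serves a friendly "Page Not Found" body while still reporting 404:

```python
def not_found_app(environ, start_response):
    """Serve a styled 'not found' page, but with a real 404 status line."""
    body = b"<html><body><h1>Page Not Found</h1></body></html>"
    # The status travels in the response status line, separately from the body.
    start_response(
        "404 Not Found",
        [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ],
    )
    return [body]
```

The same body sent with "200 OK" would look identical in a browser, which is exactly why crawlers treat the redirect-to-a-normal-page approach as a valid page.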

Why it is necessary to mention http while redirecting

I wrote some sample code that redirects to Google when a hyperlink is clicked.
This one worked fine:
<asp:HyperLink ID="MyHyperLinkControl" NavigateUrl="http://google.com"
runat="server">link</asp:HyperLink>
But this does not work
<asp:HyperLink ID="MyHyperLinkControl" NavigateUrl="www.google.com"
runat="server">link</asp:HyperLink>
Can anyone give me a detailed explanation, please?
NavigateUrl="relative-path" goes to ./relative-path, meaning that if your browser is currently at http://www.example.com/test/, that link will tell it to go to http://www.example.com/test/relative-path.
Therefore, NavigateUrl="www.google.com" will go to http://www.example.com/test/www.google.com.
If, however, you specify a full URL, like NavigateUrl="http://www.google.com/", then you will get to http://www.google.com/.
To illustrate:
If you have a hyperlink on a page, and that page is located at http://www.example.com/test/ (note the trailing slash; without it, relative links resolve against http://www.example.com/ instead):
example.html goes to http://www.example.com/test/example.html
example goes to http://www.example.com/test/example
google.html goes to http://www.example.com/test/google.html
www.google.html goes to http://www.example.com/test/www.google.html
www.google.com goes to http://www.example.com/test/www.google.com
The fact that it's called www.google.com makes it no different from any other link.
If you use root-relative links (links that start with a /), you will get the following behavior:
/example.html goes to http://www.example.com/example.html
/example goes to http://www.example.com/example
/google.html goes to http://www.example.com/google.html
/www.google.html goes to http://www.example.com/www.google.html
/www.google.com goes to http://www.example.com/www.google.com
If, however, you specify the full URL, including the scheme, then the whole URL from the address bar is replaced with what your full URL points to. For example:
mailto:test@example.com uses the email client
http://www.google.com uses the browser, pointing it to that URL
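If you want to check these rules yourself, Python's urllib.parse.urljoin applies the same standard resolution that browsers use. This is only a sketch to illustrate the three cases; it assumes the current page is http://www.example.com/test/ (with a trailing slash), and the helper name resolve is made up.

```python
from urllib.parse import urljoin

# Assumed address of the page that contains the link.
BASE = "http://www.example.com/test/"


def resolve(href: str) -> str:
    """Resolve a link's href against the page it appears on."""
    return urljoin(BASE, href)


# No scheme and no leading slash: relative to the current folder.
print(resolve("www.google.com"))          # http://www.example.com/test/www.google.com
# Leading slash: relative to the site root.
print(resolve("/www.google.com"))         # http://www.example.com/www.google.com
# Full URL with scheme: replaces everything.
print(resolve("http://www.google.com/"))  # http://www.google.com/
```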
I'm guessing it's because it treats the URL as a relative one, so if you're currently on the page www.yourdomain.com/, the second link will try to navigate to www.yourdomain.com/www.google.com. However, if you put http:// in front, you are telling the link that it should be an absolute link.
You may have a link in your app which goes to test.pdf.old (as an example). You would then not expect it to go to http://test.pdf.old.
WWW is not a keyword meaning "this is an external web page"; it's essentially a convention. So what if the PDF example above was called www? You would still want to download it and not go to http://www.pdf.old.

How to include HTML contents from another site? I have access to both sites

I have a site which is using DNN (DotNetNuke) as a content management system. I am using another site for my event registrations. I have sent them my template, which displays the basics, including a hover menu with many different items in it.
The issue is that as I update the menu on my site using DNN, I need the change to be reflected on the site using my template, without me having to send them a new file. Does anyone have suggestions on how to approach this?
I don't want to send the events provider all of the DNN DLLs as well as my database login information in order to render the menu.
I created a page on my site that is something like 'menu.aspx' - this produces the menu in HTML format, however it has tags like in it that I'd like to remove before serving it to them.
What is the best approach for this? Do I need to write a custom server control using XMLHttp? Can I accomplish this in Javascript?
Any advice much appreciated.
Thank you!
If both sites are served from the same origin, you can use JavaScript and XMLHttpRequest to insert code from one site into the other. Otherwise, the Same Origin Policy prevents you from using AJAX. (Note that two subdomains such as site1.domain.com and site2.domain.com count as different origins by default.)
If they're not on the same domain but you have access to the page on their website, you can simply include a script from your site there:
<script type="text/javascript" src="http://yoursite.com/code.js"></script>
In the JS, simply document.write() what you want on the page. This way, you can easily change the content of the page on their site without having to send them a new file.
Finally, you can also use an iframe on their site, pointing to a page on yours.
EDIT: As Vincent E. pointed out, this will only work if they're on the same domain - my bad.
If you are unwilling or unable to use frames, then I would set up an ashx on your DNN server which renders the menu (if you've got it in a user control all the better, as you can just instantiate it and Render it directly to the output stream) and then just make an Ajax call to that from your events page and insert it directly into the DOM.
Here's a quick and hacky jquery-based example of the events page end of things:
<script type="text/javascript">
function RenderMenu(data)
{
    // Insert the returned menu HTML into the placeholder div.
    $('#Menu').html(data);
}
$(document).ready(function() {
    $.ajax({
        type : 'GET',
        url : 'http://localhost/AjaxHandlers/Menu.ashx',
        data : '',
        success : RenderMenu // no trailing comma here: older IE versions reject it
    });
});
</script>
You'll want an empty div with the ID 'Menu' on the page where you want your menu to sit, but apart from that you're good to go.
If for whatever reason you can't get the menu HTML in an isolated way, then you'll need to do some text processing in RenderMenu, but it's still do-able.
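As a rough idea of what that text processing could look like, here is a Python sketch that keeps only what sits between the body tags of a full page. It assumes the unwanted wrapper is the html/head/body markup around the menu (the question does not say exactly which tags need to go), and the function name extract_body is made up. The same regular-expression idea carries over to JavaScript inside RenderMenu, or to server-side code in the handler.

```python
import re

# Matches everything between <body ...> and </body>, across multiple lines.
_BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def extract_body(html: str) -> str:
    """Return the inner HTML of <body>, or the input unchanged if there is none."""
    match = _BODY_RE.search(html)
    return match.group(1).strip() if match else html


page = "<html><head><title>Menu</title></head><body><ul><li>Home</li></ul></body></html>"
print(extract_body(page))  # <ul><li>Home</li></ul>
```

A regular expression is fine for stripping a predictable wrapper like this; for anything more irregular, a real HTML parser is the safer choice.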
I am not a web expert, so don't shoot me.
Can't you just put their registration form into an iFrame in DNN ?

Problem passing parameters via Iframe in IE

I'm trying to execute an HTTP GET from my website to another website that is brought in via iframe.
On Firefox, you can see in the source that the correct URL is in the iframe src along with its correct parameters, and it works.
On IE, you can see in the source that the correct URL is in the iframe src along with its correct parameters, and it doesn't work...
Is there something about IE that doesn't let you pass parameters through an iframe in the querystring?
I've tried refreshing the iframe in IE, I've tried refreshing my page & the iframe in IE, and I've tried copying the url and re-pasting it into the iframe src (forcing it to refresh as if I just entered it into the address bar for that iframe window). Still no luck!
Anyone know why this is happening, or have any suggestions to try to get around this?
Edit: I cannot give a link to this because the site requires a password and login credentials to both our site and our vendor's site. Even though I could make a test account on our site, it would not do any good for the testing process because I cannot do the same for the vendor site. As for the code, all it's doing is creating the src from the backend code on page load and setting the src attribute from the back end...
//Backend code to set src
mainIframe.Attributes["src"] = srcWeJustCreated;
//Front end iframe code
<iframe id="mainIframe" runat="server" />
Edit: Problem was never solved. Answer auto accepted because the bounty expired. I will re-ask this question with more info and a link to the page when our site is closer to going live.
Thanks,
Matt
With the default security settings in IE, query parameters are blocked in iframes. On the Security tab under Internet Options, set your security level to Low. If this fixes your problem, then you know that this is your issue. If the site is for external customers, then expecting them to lower their security settings is probably unreasonable, so you may have to find a workaround.
Let's say your site is www.acme.com and the iframe source is at www.myvendor.com.
IIRC, most domain-level security settings don't care about the hostname, so add a DNS CNAME to your zone file for myvendor.acme.com, pointed back to www.myvendor.com. Then, in your IFRAME, set the source using your hostname alias.
Another solution might be to have your Javascript set the src to a redirector script on your own server (and, thus, within your domain). Your script would then simply redirect the IFRAME to the "correct" URL with the same parameters.
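A redirector like that can be very small. Here is a minimal sketch as a Python WSGI app, just to show the idea (the asker is on ASP.NET, so treat it as pseudocode for an equivalent handler). The vendor path /app and the name redirector_app are assumptions for illustration.

```python
# Hypothetical vendor page that the iframe should end up on.
VENDOR_URL = "http://www.myvendor.com/app"


def redirector_app(environ, start_response):
    """Redirect to the vendor URL, passing the query string through unchanged."""
    query = environ.get("QUERY_STRING", "")
    location = VENDOR_URL + ("?" + query if query else "")
    start_response("302 Found", [("Location", location), ("Content-Length", "0")])
    return [b""]
```

The iframe src then points at this script on your own domain with the same parameters, and the browser follows the redirect to the vendor.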
If it suits you, you can communicate between sites with fragment identifiers. You can find an article here: http://tagneto.blogspot.com/2006/06/cross-domain-frame-communication-with.html
What BYK said. I think what's happening is that you are GETting a URL that is too long for IE to handle. I notice you are trying to send a variable named src, which is probably very long, over 4k. I ran into this problem before, and this was my code. Notice the comment about IE. Also notice that it then causes a problem with Firefox, which is addressed in another comment.
var autoSaveFrame = window.frames['autosave'];
// try to create a temp form object to submit via post, as sending the browser to a very very long URL causes problems for the server and in IE with GET requests.
var host = document.location.host;
var protocol = document.location.protocol;
// Create a form
var f = autoSaveFrame.document.createElement("form");
// Add it to the document body
autoSaveFrame.document.body.appendChild(f);
// Add action and method attributes
f.action = protocol + '//' + host + "/autosave.php"; // Firefox requires a COMPLETE URL for some reason! Otherwise a cryptic error results!
f.method = "POST";
var postInput = autoSaveFrame.document.createElement('input');
postInput.type = 'text';
postInput.name = 'post';
postInput.value = post;
f.appendChild(postInput);
//alert(f.elements['post'].value.length);
// Call the form's submit method
f.submit();
Based on Mike's answer, the easiest solution in your case would be to use "parameter hiding" to convert all GET parameters into a single URL.
The most scalable way would be for each 'folder' in the URL to consist of the parameter, then a comma, then the value. For example you would use these URLs in your app:
http://example.com/app/param,value/otherparam,othervalue
http://example.com/app/param,value/thirdparam,value3
Which would be the equivalent of these:
http://example.com/app?param=value&otherparam=othervalue
http://example.com/app?param=value&thirdparam=value3
This is pretty easy on Apache with .htaccess, but it looks like you're using IIS so I'll leave it up to you to research the exact implementation.
EDIT: just came back to this and realised it wouldn't be possible for you to implement the above on a different domain if you don't own it :p However, you can do it server-side like this:
Set up the above parameter-hiding on your own server as a special script (might not be necessary if IE doesn't mind GET from the same server).
In Javascript, build the static-looking URL from the various parameters.
Have the script on your server use the parameters and read the external URL and output it, i.e. get the content server-side. This question may help you with that.
So your iframe URL would be:
http://yoursite.com/app/param,value/otherparam,othervalue
And that page would read and display the URL:
http://externalsite.com/app?param=value&otherparam=othervalue
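The mapping between the two URL styles is mechanical. Here is a Python sketch of the conversion the server-side script would do before fetching the external page; the function name path_to_query and the /app prefix handling are assumptions for illustration, and the fetching itself is left out.

```python
from urllib.parse import urlencode


def path_to_query(path: str, app_prefix: str = "/app") -> str:
    """Turn /app/param,value/other,value2 into /app?param=value&other=value2."""
    rest = path[len(app_prefix):] if path.startswith(app_prefix) else path
    pairs = []
    for segment in rest.strip("/").split("/"):
        if "," in segment:
            # Split on the first comma only, so values may contain commas.
            name, value = segment.split(",", 1)
            pairs.append((name, value))
    query = urlencode(pairs)
    return app_prefix + ("?" + query if query else "")


print(path_to_query("/app/param,value/otherparam,othervalue"))
# /app?param=value&otherparam=othervalue
```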
Try using an indirect method. Create a FORM. Set its action parameter to the base url you want to navigate. Set its method to POST. Set its target to your iframe and then create the necessary parameters as hidden inputs. Finally, submit the form. It should work since it works with POST.
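If the page is generated on the server, the form can be emitted there. Here is a small Python sketch of building that markup; the function name build_post_form, the vendor URL, and the parameter names are made up for illustration. Note that the form's target has to match the iframe's name attribute, so the iframe needs a name as well as an id.

```python
from html import escape


def build_post_form(action: str, target: str, params: dict) -> str:
    """Build a POST form whose result loads in the frame named `target`."""
    inputs = "".join(
        '<input type="hidden" name="{}" value="{}">'.format(
            escape(str(name), quote=True), escape(str(value), quote=True)
        )
        for name, value in params.items()
    )
    return '<form method="POST" action="{}" target="{}">{}</form>'.format(
        escape(action, quote=True), escape(target, quote=True), inputs
    )


print(build_post_form("http://www.myvendor.com/app", "mainIframe", {"user": "42"}))
```

Calling submit() on the form from JavaScript once the page has loaded then sends the parameters in the request body instead of the query string.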

Resources