How to make sure that content is only rendered within an iframe

I am wondering how to make sure that I only ever show/render the content (send the code to the client) if the content is loaded in an iframe in a real browser, similar to the way Facebook checks when to display their like buttons and other social utilities.
There, when trying to simply load the content using curl, even when sending cookies, session details and user agent details, it still returns nothing. When trying to load the content outside an iframe, one receives nothing. How can that be achieved? I guess it is all but a simple process that involves multiple steps. I am especially interested in the first one, namely how to detect that it is really sent from a browser and not simply curled.
Thanks.

There is no way for your server to detect if it was sent using browser or curl, as the headers are easily forged.
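An added example (not part of the original answer) of what that means in practice: a header such as `Sec-Fetch-Dest` can serve as a hint for ordinary browsers, while the only thing the server can have the browser enforce is which origins may frame the response, via `Content-Security-Policy: frame-ancestors`. The function names, the partner origin and the sample requests are hypothetical and constructed for illustration.

```javascript
// Added example; ALLOWED_PARENTS and all function names are hypothetical.
const ALLOWED_PARENTS = ["https://partner.example"]; // constructed origin

// Hint only. Modern browsers set Sec-Fetch-Dest to "iframe" for a frame
// navigation, but it is an ordinary request field: the server cannot tell
// which kind of client produced it. Never gate anything sensitive on it.
// Header names are lowercase, as Node's http module presents them.
function looksLikeIframeLoad(headers) {
  const dest = (headers["sec-fetch-dest"] || "").toLowerCase();
  return dest === "iframe";
}

// Enforced by the browser: restrict which origins may embed this response.
// An empty allow list means nobody may frame it.
function framingHeaders(allowedParents) {
  const sources = allowedParents.length ? allowedParents.join(" ") : "'none'";
  return { "Content-Security-Policy": `frame-ancestors ${sources}` };
}

// Serve the widget markup only when the request looks like a frame load.
// This keeps casual top-level loads empty; it is not an access control.
function respond(headers) {
  const body = looksLikeIframeLoad(headers) ? "<div>widget</div>" : "";
  return { status: 200, headers: framingHeaders(ALLOWED_PARENTS), body };
}
```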

R selenium webdriver not loading element even after wait and scroll down

I'm trying to design a scraper for a page in R using the selenium webdriver package and the part of the page I want to scrape is not loading, no matter how long I wait for it to. It may be to do with javascript which I admittedly know nothing about.
I've tried forcing it to scroll down to load the element (in this case a table) but to no avail.
It loads fine in normal browsers.
It's like the severalth site for which this has happened so I thought I'd pop my stackoverflow cherry and ask the experts.
Sorry I have no reprex as I just don't know where the issue is coming from!
The link to the page is
https://jdih.kemenkeu.go.id/#/home
an image showing what selenium says it sees - yellow highlighted area is where the table should load.
how it is supposed to display shown in firefox
Thanks for reading!
(18 months later and I can answer my own question!)
The issue was that the page is loading content dynamically using an API request.
When scraping using direct GET requests of a URL to extract the page contents, this initial request alone may not load the desired content.
In this case, I found the exact issue by reloading the page with the developer interface open (F12) with the 'Network' (or similar) tab open.
This then shows all the requests made when the browser loads the page.
One of these will be a request for the desired data - in this case by filtering on XHR requests only, I was able to identify one which loaded content through an internal API.
Right-click the request, open in new tab, and voilà, you have a URL which you can use in the same way you would normally with this scraping method that will provide the page content required.
Sometimes the URL alone may not be enough. You will need the correct request headers sent with the request. These can be seen in the request data as mentioned above in the Developer interface in one's browser. Right-click and select 'Copy headers' or similar to get them.
In this case, i.e. when using R, the httr package can be used to send get requests with specific headers thus:
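    library(magrittr)  # provides the %>% pipe used below

    headers = c(
      "Host" = "website.com"
      # [other headers here], each as "Name" = "value", separated by commas
    )
    # Send the GET request with the copied headers and parse the response body
    page <- httr::GET(url = "www.website.com/data",
                      httr::add_headers(.headers = headers)) %>%
      httr::content()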
When you have the page content, it is possible to parse the HTML or whatever else is required as usual.

Can ASP.NET get the parent window URL by referrer?

I heard that if your ASP.NET page is inside an iframe and you want to get the parent URL, you can achieve this by using the referrer?
I tested it and it is okay, and found that the parent window URL is included in the referrer when the iframe content is called:
Request.UrlReferrer.ToString();
Assume that I can only use the server side to achieve this.
I just want to ask: is that way safe?
Is there any chance of losing the referrer URL in this case?
The browser is not guaranteed to send the referer. It's all up to the browser/configuration/extensions/proxies and whatnot between the request and your server.
If the user navigates to a different page within the iframe, the referer will point to whatever the user came from.
All in all, never use the referer for any logic that may fail if it's not there or if it has an unexpected value.
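An added example (not part of the original answer) of treating the Referer as an optional hint: the header is classified into absent, malformed, same-site (navigation inside the iframe), known parent or unknown, and the page logic falls back to a safe default for everything except a known parent. `SELF_ORIGIN`, `KNOWN_PARENTS`, the function names and the theme use case are hypothetical.

```javascript
// Added example; SELF_ORIGIN and KNOWN_PARENTS are constructed values.
const SELF_ORIGIN = "https://widget.example";
const KNOWN_PARENTS = new Set(["https://shop.example", "https://blog.example"]);

// Classify a Referer header without depending on it being present or well
// formed. The result is a hint for presentation or logging only.
function classifyReferer(value) {
  if (typeof value !== "string" || value.trim() === "") {
    return { kind: "absent", origin: null }; // dropped by browser, policy or proxy
  }
  let url;
  try {
    url = new URL(value.trim());
  } catch {
    return { kind: "malformed", origin: null };
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") {
    return { kind: "malformed", origin: null };
  }
  if (url.origin === SELF_ORIGIN) {
    return { kind: "self", origin: url.origin }; // user navigated inside the iframe
  }
  // Compare whole origins; a prefix or substring match would also accept
  // hosts such as shop.example.other.test.
  if (KNOWN_PARENTS.has(url.origin)) {
    return { kind: "known-parent", origin: url.origin };
  }
  return { kind: "unknown", origin: url.origin };
}

// Page logic: every outcome except a known parent gets the default theme,
// so a missing or unexpected Referer never breaks the response.
function themeFor(refererHeader) {
  const themes = { "https://shop.example": "shop", "https://blog.example": "blog" };
  const hint = classifyReferer(refererHeader);
  return hint.kind === "known-parent" ? themes[hint.origin] : "default";
}
```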
You can do this but it is not entirely in ASP.Net.
You would have to get the referrer from Javascript and pass that to the iFrame.
One of the 2 following calls would be what you are looking for.
top.document.referrer
or
parent.document.referrer

Knowing the status of a form POST that is protected by X-Frame-Options

I am going to describe my specific case below, but this might be useful to a number of web-mashups.
My web application POSTs to Twitter by filling a form and then submitting it (via javascript). The target of the form is set to an iframe which has an onload trigger. When the onload trigger is called the application knows that the POST was completed.
This used to work fine until Chrome version 11, which now respects the X-Frame-Options=SAMEORIGIN sent by Twitter in the POST response. The POST goes through, but the iframe's onload is not called anymore.
It still works in Firefox 4, but I suppose that's a bug that will eventually get fixed.
Is there any other way to know the status of the POST? I understand that knowing the contents of the POST response would violate the security policy, but I am not interested in the contents. I would just like the app to be notified when the POST is completed.
If you just need to know when the POST was submitted, and not necessarily whether it succeeded or not, you could poll the iframe's contentWindow and contentWindow.document on an interval. When you can no longer access one of those objects, or when the document has an empty body, that means that the iframe has loaded a page with X-Frame-Options restrictions, which likely means that the submission went through. It's hacky, but it looks like it will work for this purpose. (You'll probably have to go through a few combinations to figure out what the contents of restricted iframes look like in your target browsers.)
You can do it by getting the headers of the page. In PHP it will look like this:
$url = 'http://www.google.com';
print_r(get_headers($url));

How to determine if a mobile browser is meta-refresh capable?

Many browsers in Japan (EZWeb, i-mode, etc) don't allow meta refresh, and in fact, they may display warning messages such as "This page uses newer technology and cannot be displayed" in place of your webpage.
How can I tell if a mobile browser does not support meta-refreshing so that I can take different action in those cases?
Thanks
The best option for something like this is to display a link on the page with the meta-refresh. The traditional "click here if the page doesn't redirect you in 5 seconds" kind of thing. That's what has been done for years in the PC realm.
You should also consider an HTTP 304 with the Location: header if you are just redirecting.
If instead you want a page to reload after a specific amount of time, then you are stuck. Without JavaScript, there is no other method you can use to automatically do this.
Without JavaScript you're really limited to User Agent sniffing. To provide the best experience I would recommend using known UA strings to only send the meta-refresh to browsers you know can handle it, and for those that you don't know, send a plain HTML response that has a link for users to click on to do the refresh.

Cross-Platform Browser Communication Between Page and IFRAME (Same Domain)

For a specialized purpose with Aweber regarding a newsletter subscription, I have a page loading a nested IFRAME inside, and both reside on the same domain. (Many other stackoverflow posts talk about different domains, but this question deals only with the same domain.) I need a cross-platform way (including browsers as old as the dawn of IE6) for the two to communicate.
For example, someone fills out name and email and clicks a checkbox, and the hidden IFRAME next to the checkbox sits in a setInterval() loop watching for that. When it receives notification, it grabs the name and email and does a form post.
I thought at first that I could just drop a cookie in the parent page, and then the IFRAME child could then sit in an interval watching for that cookie. But my tests show that this won't work. The cookie gets created -- but the IFRAME can't see it. So, I tried the meta-refresh technique in the IFRAME, and again it couldn't see that cookie for some reason.
The only solution I can come up with is that the parent page will take the checkbox click (we use jQuery) and do an AJAX data push to the server into a database. The IFRAME can then check on an interval back to the server via AJAX to see if the database value has changed, and react to it if so. But this seems like an over-engineered solution and I'm looking for an easier alternative that works cross-platform, even in earlier browsers from the timeframe of IE6 and forward.
It's much more simple: In the iframe, you can access the parent variable, which contains the parent window. So you can use parent.document to find the form, read the values, etc.
