Simple HTML DOM Issue - simple-html-dom

When using the Simple HTML DOM library, I have run into a problem with some websites. I tried to load the following URL: http://www.t-mobile.com/shop/phones/cell-phone-detail.aspx?cell-phone=HTC-One-S-Gradient-Blue&tab=reviews#BVRRWidgetID
My PHP code is:
<?php
include "simple_html_dom.php";
$html=new simple_html_dom();
$url="http://www.t-mobile.com/shop/phones/cell-phone-detail.aspx?cell-phone=HTC-One-S-Gradient-Blue&tab=reviews#BVRRWidgetID";
$html->load_file($url);
echo $html;
?>
The PHP script gives no error, but it shows the following content every time:
Unsupported Browser
It appears that you are viewing this page with an unsupported Web browser. This Web site works best with one of these supported browsers:
Microsoft Internet Explorer 5.5 or higher
Netscape Navigator 7.0 or higher
Mozilla Firefox 1.0 or higher
If you continue to view our site with your current browser, certain pages may not display correctly and certain features may not work properly for you.
What is the problem? Does Simple HTML DOM have a limitation here? Is there any other way to solve this?

Some websites do not allow their content to be scraped directly.
You can use cURL to fetch the HTML content (sending a browser-like User-Agent) and then pass the string to the load() method of the DOM object.
I hope this works for you.
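A minimal sketch of that approach, assuming simple_html_dom.php is in the include path and using the URL from the question (the particular Firefox User-Agent string here is just an example, not a requirement):

```php
<?php
// Sketch: fetch the page with cURL while presenting a browser User-Agent,
// then hand the returned HTML string to Simple HTML DOM's load().
include "simple_html_dom.php";

$url = "http://www.t-mobile.com/shop/phones/cell-phone-detail.aspx"
     . "?cell-phone=HTC-One-S-Gradient-Blue&tab=reviews#BVRRWidgetID";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body as a string
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow any redirects
curl_setopt($ch, CURLOPT_USERAGENT,
    "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0");
$raw = curl_exec($ch);
curl_close($ch);

$html = new simple_html_dom();
$html->load($raw);   // parse the string instead of using load_file()
echo $html;
```

The key difference from the original script is that the HTTP request is made by cURL (with a User-Agent the site accepts), and Simple HTML DOM only parses the result.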

Just set up your User-Agent in the Simple HTML DOM request:
# Create the context options with a browser User-Agent
$options = array("http" => array("user_agent" => "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6"));
# Create a stream context from the options
$context = stream_context_create($options);
# Start Simple HTML DOM with our context (it is the third parameter)
$html = file_get_html($urlCategory, false, $context);
So the request will appear to come from a supported browser rather than PHP's default client.

Set the User-Agent via a stream context:
$context = stream_context_create(array(
    'http' => array('user_agent' => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6')
));
$html = file_get_html('http://www.t-mobile.com/shop/phones/cell-phone-detail.aspx?cell-phone=HTC-One-S-Gradient-Blue&tab=reviews#BVRRWidgetID', false, $context);

Related

What does " Mozilla/5.0" in user agent string signify? [duplicate]

This question already has answers here:
Why do all browsers' user agents start with "Mozilla/"?
While sending requests to the server myself, I noticed something odd: in IE, if I choose the Opera user string, its value is
User-Agent: Opera/9.80 (Windows NT 6.1; U; en) Presto/2.2.15 Version/10.00
But if I choose any other browser identity in Internet Explorer, it puts Mozilla/5.0 at the start of the user string.
When I send an AJAX request from Chrome, I see the same thing; the user string is
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20
Mozilla is an organization that has nothing to do with Google or Microsoft; if anything, it competes with both. So why do Microsoft and Google both put Mozilla in their user agent strings? Is there a specific reason Chrome and IE both send it in the request?
See: user-agent-string-history
It all goes back to browser sniffing and making sure that the browsers are not blocked from getting content they can support. From the above article:
And Internet Explorer supported frames, and yet was not Mozilla, and so was not given frames. And Microsoft grew impatient, and did not wish to wait for webmasters to learn of IE and begin to send it frames, and so Internet Explorer declared that it was “Mozilla compatible” and began to impersonate Netscape, and called itself Mozilla/1.22 (compatible; MSIE 2.0; Windows 95), and Internet Explorer received frames, and all of Microsoft was happy, but webmasters were confused.

Change in User-Agent header triggering forms authentication

I've got an app built using ASP.NET MVC 3.0. It uses asp.net's built in forms authentication, without session state, and cookies on the browser to identify the user making requests.
Now, when I'm testing the app using IE9, the typical HTML request sends this user-agent in the header, and everything works fine.
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
However, we have one page in the app that has an ActiveX container that hosts Microsoft Word in the browser. The purpose of this ActiveX container is to allow you to make modifications to the word document, click on a button to POST that word document with your changes to our server so it can be saved.
There is a method in the ActiveX control--Office Viewer Component from www.ocxt.com--called HttpPost() that POSTs the contents of the viewed document to the server.
When you call HttpPost(), it sends all the same cookies properly, but uses a different User-Agent string.
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 4.0)
A User-Agent string containing MSIE 5.5 appears to cause ASP.NET or MVC not to send the request to the appropriate controller, but instead to respond with a redirect to the Login page, even though the cookie is correct for the session. I did a test with Fiddler using MSIE 6.0, 7.0, and 8.0, and those work fine, so 5.5 specifically causes part of the server stack to redirect to the login page.
This page used to work fine, so I'm not sure whether something has changed in recent versions of ASP.NET/MVC or whether it's because I've moved up to IE 9.0. Basically, I'd like to know if it is possible to tell ASP.NET not to take the User-Agent into account when determining whether a session has already been authenticated.
Thanks.
IIRC there was a change in ASP.NET 4.0 where Forms Authentication uses the user agent to detect whether the browser supports cookies; if the user agent is unrecognized or marked as unsupported, it simply ignores the authentication cookie. You will need to change the User-Agent of the HTTP request.
To disable this default behavior, where the web server checks cookie support based on the user agent, and force cookies for all browsers, set this in web.config:
<system.web>
<authentication mode="Forms">
<forms cookieless="UseCookies" />
</authentication>
</system.web>
What's annoying about this default setting is that some valid User-Agent headers on new browsers will cause cookies to be ignored.
this User-Agent's form auth cookie is NOT ignored...
Mozilla/5.0 (iPad; CPU OS 5_1_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9B206 Safari/7534.48.3
this User-Agent's form auth cookie IS ignored...
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0_1 like Mac OS X; en-us) AppleWebKit/536.26 (KHTML, like Gecko) CriOS/23.0.1271.91 Mobile/10A523 Safari/8536.25
But adding the cookieless="UseCookies" attribute tells ASP.NET to use the cookies from any user agent.

Second request from Media Player in a web page

I have an ASP.NET web application where a Microsoft Media Player object on the page (in IE) is issuing a request for an .aspx web page. In the page I use TransmitFile to send back the audio file. This works fine most of the time.
But, in some cases (a combination of IE version and a specific client, at least from what I can see) there is a second request issued, with the exact same URL. The only difference I can see between the first and second request is the user-agent value. The first request will have User-Agent: Windows-Media-Player/9.00.00.4508 and the second one will have User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)
This second request is causing the audio file to be sent a second time over the net, which is wasteful. This is what I'm trying to avoid.
I had a related question here, but in this case there is no Range request. It is just the same exact request again (same headers, except for the user-agent).
I tried suppressing the second response (based on the User-Agent header) with all kinds of HTTP status responses (304, 404, 500, etc.). This works for some clients some of the time, but occasionally breaks (the Media Player just won't play the audio, even though Fiddler shows it was transferred on the first request).
I would like to "convince" the browser to avoid the second request, if possible. As a second option I would like to find a response to the second request that will not break the playback, but avoid sending the whole audio buffer.
The only thing I can think of so far is that maybe they have a plugin/toolbar installed that is trying to do something when it detects media.
I think VLC does something like that for Firefox, and I know there are many 'video downloader' kind of addons for Firefox, maybe there are equivalents made for IE.
You could try asking the client that is having the problems to give you their list of addons (Should be Tools -> Manage Add-ons in IE8).
Hope this helps : )
EDIT: One more thing you could check is to ask them to try enabling compatibility mode on your site and see if it changes anything.
Please provide a full dump of the headers; are you sure both requests are GET requests?
The first request might just be checking whether the cached version is good enough.
Second thought: try returning an Expires header to suppress the cache check.

Using a query string in an excel hyperlink to an ASP.Net Web Application

I want to pass some data between an existing excel application and an existing ASP.Net VB Webforms application.
I thought a hyperlink with some query string variables would be the most straightforward means of doing this. However, it seems that the hyperlink does not retain the session of the logged in user.
Testing this with the same URL on a webpage does work. So it seems Excel is starting a new session. Any ideas on how to make Excel hyperlinks behave the same way a browser hyperlink does?
I am having this same problem, and using Fiddler I can see that when following the link in Excel, cookies are not being sent to the server - causing session problems.
My workaround is as follows: create a redirect page that does not require a valid session, which just redirects to the page that does. Since the redirect page runs in the browser, the page it redirects to gets the session cookies as expected.
Code (redirect.htm);
<html>
<body>
Please wait, loading your page...
<script type="text/javascript">
<!--
function getQuerystring(key) {
key = key.replace(/[\[]/,"\\\[").replace(/[\]]/,"\\\]");
var regex = new RegExp("[\\?&]"+key+"=([^&#]*)");
var query = regex.exec(window.location.href);
return query ? query[1] : "";
}
window.location = "http://site-page/" + getQuerystring('page'); //-->
</script>
</body>
</html>
Accessing the page from Excel using http://site-page/redirect.htm?page=this-sub-page works for me now.
I've just stumbled across this problem while using Firefox as my default browser. If I set IE as my default the issue goes away. This may not help in your case but it is a workaround.
I have also found out what causes the issue. Excel is requesting the page itself using IE7 before it passes the url to the default browser.
This is a snippet from our server log:
"GET /ar/vehicle.php?rv_id=9046 HTTP/1.1" 302 - "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 ...
"GET / HTTP/1.1" 302 - "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 ...
"GET /?q=node/57 HTTP/1.1" 200 6231 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1 ...
"GET /?q=node/57 HTTP/1.1" 200 8318 "-" "Mozilla/5.0 (Windows NT 5.1; rv:2.0) Gecko/20100101 Firefox/4.0 ...
The first three lines are Excel sending the request and processing the redirect. The final line is what gets passed to the default browser.
Gabriel
Clicking a link in Excel typically opens a new browser, and thus, a new session. There's nothing you can really do within Excel or the hyperlink to mitigate this - it's the way browser sessions work.
If you can't just re-initialize the user's session state when they access this url (I assume they may be asked to log in, etc.) then maybe you could consider using cookies to retain the user's identity?
Old post, but I had the same problem; here's how I fixed it.
I made the hyperlink point to a PHP script that checks the browser's user agent: if it contains the term 'ms-office' it does nothing, and otherwise it redirects to the real page.
here is what I've got:
if (strpos($_SERVER['HTTP_USER_AGENT'],'ms-office') === false) {
header("Location: ".$_GET['url']);
}
Simply point Excel at e.g. redirect.php?url=http://google.com
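For reference, the user-agent check can be isolated in a small helper, which makes the behaviour easy to verify. This is a sketch under the same assumption as above: Office's internal request identifies itself with an 'ms-office' substring in the User-Agent (the function name is my own):

```php
<?php
// Office apps probe a hyperlink themselves (User-Agent containing
// "ms-office") before handing the URL to the default browser.
// A redirect script can ignore that probe and redirect only real browsers.

function is_office_probe(string $userAgent): bool {
    // True when the request comes from Office's internal link checker.
    return strpos($userAgent, 'ms-office') !== false;
}

// In redirect.php you would then do (illustrative):
//   if (!is_office_probe($_SERVER['HTTP_USER_AGENT'] ?? '')) {
//       header("Location: " . $_GET['url']);
//   }
```

Office's probe gets an empty 200 response and is satisfied; the real browser that follows gets the redirect.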

Web server changed name, url return wrong host name

In our web application (asp.net), the tabs are dynamic links. The links were built like this:
finalUrl = "https://" + Request.Url.Host + "/home.aspx";
The link ends up like:
https://server0/home.aspx
The problem is that the web server's name used to be server0 but has been changed to server1, yet the old server name keeps showing up. Can anyone help point out what we missed?
(etc/hosts has correct setting)
Thanks!
Can I assume that there's a good reason you're not using relative URLs?
If using asp.net as tagged, would you not look at using HttpRequest.ApplicationPath ? (Using System.Web)
Is the site still accessible through the old URL? That property is based off the URL entered, not the name of the server it's on.
dove, I simplified the example a bit. The tab points to another virtual directory on the same machine.
John, no, it is not accessible through the old host name.
Use a program like Firebug for Firefox to see the request headers. Open your web browser, go to your application, and turn on Firebug's Net tab to view the request values. For example, on this web site I can see:
Host stackoverflow.com
User-Agent Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.5) Gecko/2008120122 Firefox/3.0.5 (.NET CLR 3.5.30729)
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
...
Thus I would expect that Request.Url.Host will be "stackoverflow.com".
As a side note, you should definitely look into using the ~/ absolute path option, a brief example:
string finalUrl = "~/home.aspx";
Response.Redirect(finalUrl);
There is probably a good reason for not using relative URLs; I am not the original developer. The code works everywhere, including the production servers, except on this one machine. I just want to get to the bottom of it.
By the time Request.Url.Host is referenced, the host name has already changed back to the old server.
I found that the problem lies in Metabase.xml, which is where the host is extracted from when a relative URL is used. Subsequent references to Request.Url.Host then reflect that value.
