How to read an external page's title? - asp.net

I think it's possible with jQuery, but any ASP.NET server-side code would work for my situation too.
With jQuery I can load a page into, say, a div and then filter the div for the <title> tag, but for heavy pages I don't think it's good to read all of the content just to get the title tag...
Or maybe there's a very simple solution? Either way, I couldn't find anything about this on the internet.
Thanks

Okay, thanks to cjjer and Boo, I've read more about regexes and the code below now works for me.
' Download the page and pull the contents of the <title> tag out with a regex
Dim qq As New System.Net.WebClient()
Dim theuri As New Uri(TextBox1.Text)
Dim res As String = qq.DownloadString(theuri)
Dim re As New Regex("<title\b[^>]*>(.*?)</title>", RegexOptions.Singleline)
Dim ma As Match = re.Match(res)
If ma.Success Then ' Regex.Match never returns Nothing, so Success is the only check needed
    Response.Write(ma.Groups(1).Value)
Else
    Response.Write("error")
End If
But the problem remains: this code downloads the whole page and searches through it, which on heavy websites took more than 2 or 3 seconds to complete. It seems to be the only way, though, as far as I know :|
Are there any suggestions to refine this code?

cjjer almost got it right.
First, change the regex to: <title>(?<Content>.*?)?</title>
Second, you need to create a match object first (just in case your URI does not have a title).
Match tMatch = new Regex(@"<title>(?<Content>.*?)?</title>").Match(new System.Net.WebClient().DownloadString(url));
if ((null != tMatch) && (tMatch.Success)) {
    // yay.
    title = tMatch.Groups["Content"].Value;
}

Titles usually appear within the first few hundred bytes, so you could try a range request for the first 1 KiB or so, try parsing that (with an error-correcting parser, since some closing tags will be missing), and if that fails, fall back to loading the whole page.
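A minimal C# sketch of that idea (the class and method names are just illustrative; it assumes the server honors Range requests and falls back to a full download when the title isn't in the first chunk):
using System;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

static class TitleFetcher
{
    // Try the first 1 KiB via a Range request, fall back to the full page.
    public static string GetTitle(string url)
    {
        var re = new Regex(@"<title\b[^>]*>(.*?)</title>",
                           RegexOptions.Singleline | RegexOptions.IgnoreCase);
        var req = (HttpWebRequest)WebRequest.Create(url);
        req.AddRange(0, 1023); // ask for bytes 0-1023 only
        string head;
        using (var resp = req.GetResponse())
        using (var sr = new StreamReader(resp.GetResponseStream()))
            head = sr.ReadToEnd();
        var m = re.Match(head);
        if (m.Success)
            return m.Groups[1].Value;
        // Title not in the first chunk (or the server ignored the Range header).
        m = re.Match(new WebClient().DownloadString(url));
        return m.Success ? m.Groups[1].Value : null;
    }
}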

It would be a security risk to load another web page into yours just for reading the title... You should do this with server-side scripting (ASP.NET, PHP, ...) and output just the title to your web page. Also think about some kind of caching, because it is wasteful to fetch titles on every request.
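A rough sketch of that caching idea in ASP.NET, using the built-in HttpRuntime.Cache (the one-hour lifetime, the "(no title)" placeholder, and the helper name are assumptions, not anything standard):
using System;
using System.Net;
using System.Text.RegularExpressions;
using System.Web;
using System.Web.Caching;

static class TitleCache
{
    public static string GetCachedTitle(string url)
    {
        string key = "title:" + url;
        var cached = HttpRuntime.Cache[key] as string;
        if (cached != null)
            return cached; // served from cache, no request to the external site

        string html = new WebClient().DownloadString(url);
        var m = Regex.Match(html, @"<title\b[^>]*>(.*?)</title>",
                            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        string title = m.Success ? m.Groups[1].Value : "(no title)";

        // Keep the title for an hour so repeated views don't re-fetch the page.
        HttpRuntime.Cache.Insert(key, title, null,
            DateTime.UtcNow.AddHours(1), Cache.NoSlidingExpiration);
        return title;
    }
}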

There is no simple clean way to retrieve an external page's title. You could do it server side using a WebClient and parsing the response.
However, it may be worth reviewing the requirement: is it really necessary, and how much extra traffic and latency will it generate? Consider also that you could be generating load on the external site, which is unaware that all you want is the title; the page creation may be quite expensive.

string title = Regex.Match(new System.Net.WebClient().DownloadString(url), @"<title>(.*?)</title>").Groups[1].Value;
Try it; I am not sure.

I am not sure whether all servers support this. See if this helps:
// Request only the first 300 bytes (0-299) of the page
char[] data = new char[300];
System.Net.HttpWebRequest wr = (HttpWebRequest)WebRequest.Create("http://www.yahoo.com");
wr.AddRange("bytes", 0, 299);
HttpWebResponse wre = (HttpWebResponse)wr.GetResponse();
StreamReader sr = new StreamReader(wre.GetResponseStream());
sr.Read(data, 0, data.Length);
Console.WriteLine(new string(data));
sr.Close();
EDIT: Try checking with a network monitoring tool to find out what text the servers send out. I used Fiddler to see the output and wrote it to the console.
EDIT2: I am assuming the title is at the beginning of the page.

Related

Python 3.5 requests for crawling

I have a coding problem regarding Python 3.5 web crawling.
I am trying to use requests.get to extract the real link from 'http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3'. An example of the code is below:
import requests
response = requests.get('http://www.baidu.com/link?url=ePp1pCIHlDpkuhgOrvIrT3XeWQ5IRp3k0P8knV3tH0QNyeA042ZtaW6DHomhrl_aUXOaQvMBu8UmDjySGFD2qCsHHtf1pBbAq-e2jpWuUd3')
c = response.url
I expected 'c' to be 'caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'. (I removed http:// from the link as I can't post two links in one question.)
However, it doesn't work; it keeps returning the same link I put in.
Can anyone help with this? Many thanks in advance.
Thanks a lot to Charlie.
I have found the solution. I first use .content.decode to read the response body, but that is mixed up with a lot of irrelevant info. I then use .findall to extract the redirect URL, which should be the first quoted URL in the response body. Then I use requests.get to retrieve the info. Below is the code:
import re
import requests

rep1 = requests.get(url)  # url is the original baidu.com/link?... address
cont = rep1.content.decode('utf-8')
extract_cont = re.findall('"([^"]*)"', cont)  # all quoted strings in the body
redir_url = extract_cont[0]  # the redirect URL is the first one
rep = requests.get(redir_url)
You may consider looking into the response headers for a 'location' header.
response.headers['location']
You may also consider looking at the response history, which contains a response object for each hop in a chain of redirects
response.history
Your sample URL doesn't redirect; the response is a 200, and then it uses a JavaScript window.location change. The requests library won't follow this type of redirect.
<script>window.location.replace("http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm")</script>
<noscript><META http-equiv="refresh" content="0;URL='http://caifu.cnstock.com/fortune/sft_jj/tjj_yndt/201605/3787477.htm'"></noscript>
If you know you will always be using this one service, you could parse the response, maybe using regex.
If you don't know what service will always be used and also want to handle every possible situation, you might need to instantiate a WebKit instance or something and somehow try to determine when it finally finishes. I'm sure there's a page load complete event which you could use, but you still might have pages that do a window.location change after the page is loaded using a timer. This will be very heavyweight and still not cover every conceivable type of redirect.
I recommend starting with writing a special handler for each type of edge case and fallback on a default handler that just looks at the response.url. As new edge cases come up, write new handlers. It's kind of the 'trial and error' approach.

How secure is the simple use of addslashes() and stripslashes() on code contents?

I'm making an ad manager plugin for WordPress, so the advertisement code can be almost anything: from good code to dirty, even evil.
I'm using simple sanitization like:
$get_content = '<script>/*code to destroy the site*/</script>';
//insert into db
$sanitized_code = addslashes( $get_content );
When viewing:
$fetched_data = /*slashed code*/;
//show as it's inserted
echo stripslashes( $fetched_data );
I'm avoiding base64_encode() and base64_decode() as I learned their performance is a bit slow.
Is that enough?
If not, what else should I ensure to protect the site and/or DB from evil attacks using bad ad code?
I'd love to get an explanation of why you suggest something; it'll help me decide the right thing in the future too. Any help would be greatly appreciated.
addslashes then stripslashes is a round trip. You are echoing the original string exactly as it was submitted to you, so you are not protected at all from anything. '<script>/*code to destroy the site*/</script>' will be output exactly as-is to your web page, allowing your advertisers to do whatever they like in your web page's security context.
Normally, when including submitted content in a web page, you should be using htmlspecialchars so that everything comes out as plain text and < just means a less-than sign.
If you want an advertiser to be able to include markup, but not dangerous constructs like <script> then you need to parse the HTML, only allowing tags and attributes you know to be safe. This is complicated and difficult. Use an existing library such as HTMLPurifier to do it.
If you want an advertiser to be able to include markup with scripts, then you should put them in an iframe served from a different domain name, so they can't touch what's in your own page. Ads are usually done this way.
I don't know what you're hoping to do with addslashes. It is not the correct form of escaping for any particular injection context and it doesn't even remove difficult characters. There is almost never any reason to use it.
If you are using it on string content to build a SQL query containing that content then STOP, this isn't the proper way to do that and you will also be mangling your strings. Use parameterised queries to put data in the database. (And if you really can't, the correct string literal escape function would be mysql_real_escape_string or other similarly-named functions for different databases.)

asp.net dynamic HTML form

I want to create an HTML page inside an ASP.NET page using C# and then request that HTML page. The flow is: I'll make a request that gives me a response with some values; those values will be stored in hidden fields in the HTML page I'm creating on the fly, which is then requested. I figure it would be something like the code below, but I'm not sure it would work, and I've also received some "Thread Aborting" errors. Does anyone know the proper way to do this, or can at least direct me to a nice article or something?
StringBuilder builder = new StringBuilder();
builder.Append("<html><head></head>");
builder.Append("<body onload=\"document.aButton.submit();\">");
builder.Append("<form name=\"aButton\" action=\"target.aspx\" method=\"post\">"); // the onload needs a form named aButton; the target page is a placeholder
builder.Append("<input type=\"hidden\" name=\"something\" value=\"" + aValue + "\">");
builder.Append("</form></body></html>");
HttpContext.Current.Response.Write(builder.ToString());
HttpContext.Current.Response.End(); // ends the response; throws ThreadAbortException by design
This is a very common request and is almost never a good idea. What are you trying to do?
That said: you write out a file with a temporary name and redirect to that file. Later you have to figure out when it's safe to delete the file.
Edit That method points out one of the problems: you have to do your own garbage collection, deciding how long files must be kept around and deleting them appropriately.
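A minimal sketch of that temp-file approach (the folder, file naming, and target page are hypothetical; the cleanup problem mentioned above is left out):
using System;
using System.IO;
using System.Text;
using System.Web;

static class TempPageWriter
{
    public static void ServeGeneratedPage(HttpContext ctx, string aValue)
    {
        var html = new StringBuilder();
        html.Append("<html><head></head><body onload=\"document.forms[0].submit();\">");
        html.Append("<form action=\"handler.aspx\" method=\"post\">"); // hypothetical target page
        html.Append("<input type=\"hidden\" name=\"something\" value=\""
                    + HttpUtility.HtmlAttributeEncode(aValue) + "\">");
        html.Append("</form></body></html>");

        // Unique temporary name; deleting stale files later is your own garbage collection.
        string name = "page_" + Guid.NewGuid().ToString("N") + ".html";
        File.WriteAllText(ctx.Server.MapPath("~/temp/" + name), html.ToString());
        ctx.Response.Redirect("~/temp/" + name);
    }
}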

Checking The Date A Webpage Has Been Updated?

I want to be able to run a little script that I can populate with a list of URLs, and have it pull each page in and check when it was last updated. Has anyone done this?
I can only find a manual way of doing this, using JavaScript by pasting this into the browser URL field:
javascript:alert(document.lastModified)
Any ideas gratefully received :)
The following will step through an array of URLs and display the last modified date or, if it's not present, the date of the server request.
string[] urls = { "http://boflynn.net", "http://slashdot.org" };
foreach (string url in urls)
{
    System.Net.HttpWebRequest req =
        (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
    System.Net.HttpWebResponse resp =
        (System.Net.HttpWebResponse)req.GetResponse();
    Console.WriteLine("{0} - {1}", url, resp.LastModified);
}
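A variation on the above, assuming the servers accept HEAD requests (not all do): issue a HEAD instead of a GET so no body is transferred at all.
string[] urls = { "http://boflynn.net", "http://slashdot.org" };
foreach (string url in urls)
{
    System.Net.HttpWebRequest req =
        (System.Net.HttpWebRequest)System.Net.WebRequest.Create(url);
    req.Method = "HEAD"; // headers only, no body transferred
    using (System.Net.HttpWebResponse resp =
        (System.Net.HttpWebResponse)req.GetResponse())
    {
        // resp.LastModified silently falls back to the current time when the
        // header is absent, so check the raw header to tell the cases apart.
        if (resp.Headers["Last-Modified"] != null)
            Console.WriteLine("{0} - {1}", url, resp.LastModified);
        else
            Console.WriteLine("{0} - no Last-Modified header", url);
    }
}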
If you use urllib2 (or perhaps httplib, which might be better still) in a Python script, you can inspect the headers returned for the Last-Modified field.
It depends on what you mean by "last updated". Sure, there is the Last-Modified HTTP header, but it can be very misleading. For example, if the page is served up dynamically, there is a good chance that this field will be the current time, even if the content of the page itself (the part useful to humans) has not been updated in a rather long time. This page itself is a good example of this phenomenon.
If you are truly interested in the last time the content was updated, then I don't have an immediate answer.

Is it possible to issue a dynamic include in asp-classic?

I mean, like PHP's include...
something like
my_file_to_be_included = "include_me.asp"
<!-- #include file="<%= my_file_to_be_included %>" -- >
From what I've seen so far, there are a couple of alternatives, but every one of them has some sort of shortcoming...
What I'm trying to figure out is how to make a flexible template system... without having to statically include the whole thing in a single file with a loooooong case statement...
Here are a couple of links:
a solution using FileSystemObject, which just lets you include static pages
idem
yet another one
the same thing from Adobe
this approach uses Server.Execute
but it has some shortcomings I'd like to avoid... it seems (I haven't tried yet) that Server.Execute code runs in another context, so you can't use it to load functions you are planning to use in the caller code... nasty...
the same thing
I think this one is the same
this looks promising!!!
I'm not sure about it (I couldn't test it yet), but it seems like this one dynamically hands the page to an SSDI component...
Any ideas???
No, you can't do a dynamic include, period.
Your best shot at this is Server.Execute, passing whatever state it needs via a Session variable:-
Session("callParams") = BuildMyParams() 'Creates some sort of string
Server.Execute(my_file_to_be_included)
Session.Contents.Remove("callParams")
Improved version (v2.0):
<%
' **** Dynamic ASP include v.2.0
' fixInclude rewrites raw ASP source into response.write statements
' so that the whole file can be run through execute() below.
function fixInclude(content)
    out = ""
    if instr(content, "#include ") > 0 then
        response.write "Error: include directive not permitted!"
        response.end
    end if
    content = replace(content, "<" & "%=", "<" & "%response.write ")
    pos1 = instr(content, "<" & "%")
    pos2 = instr(content, "%" & ">")
    if pos1 > 0 then
        before = mid(content, 1, pos1 - 1)
        before = replace(before, """", """""")
        before = replace(before, vbcrlf, """" & vbcrlf & "response.write vbcrlf&""")
        before = vbcrlf & "response.write """ & before & """" & vbcrlf
        middle = mid(content, pos1 + 2, (pos2 - pos1 - 2))
        after = mid(content, pos2 + 2, len(content))
        out = before & middle & fixInclude(after)
    else
        content = replace(content, """", """""")
        content = replace(content, vbcrlf, """" & vbcrlf & "response.write vbcrlf&""")
        out = vbcrlf & "response.write """ & content & """"
    end if
    fixInclude = out
end function

' Read a file from disk and return its contents as a string.
Function getMappedFileAsString(byVal strFilename)
    Dim fso, ts
    Set fso = Server.CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(Server.MapPath(strFilename), 1)
    getMappedFileAsString = ts.ReadAll
    ts.Close
    Set ts = Nothing
    Set fso = Nothing
End Function

execute (fixInclude(getMappedFileAsString("included.asp")))
%>
Sure you can do REAL classic ASP dynamic includes. I wrote this a while back and it has opened up classic ASP for me in a whole new way. It will do exactly what you are after, even though people seem to think it isn't possible!
Any problems just let me know.
I'm a bit rusty on classic ASP, but I'm pretty sure you can use the Server.Execute method to read in another ASP page and then carry on executing the calling page. 15Seconds had some basic stuff about it... it takes me back...
I am building a web site where it would have been convenient to be able to use dynamic includes. The site is all ajax (no page reloads at all) and while the pure-data JSON-returning calls didn't need it, all the different html content for each different application sub-part (window/pane/area/form etc) seems best to me to be in different files.
My initial idea was to have the ajax call be back to the "central hub" main file (that kicks the application off in the first place), which would then know which sub-file to include. Simply including all the files was not workable after I realized that each call for some possibly tiny piece would have to parse all the ASP code for the entire site! And using the Execute method was not good, both in terms of speed and maintenance.
I solved the problem by making the supposed "child" pages the main pages, and including the "central hub" file in each one. Basically, it's a javascript round-trip include.
This is less costly than it seems since the whole idea of a web page is that the server responds to client requests for "the next page" all the time. The content that is being requested is defined in scope by the page being called.
The only drawback to this is that the "web pieces" of the application have to live partly split apart: most of their content in a real named .asp file, but enough of their structure and relationship defined in the main .asp file (so that, for example, a menu item in one web piece knows the name to use to call or load another web piece and how that loading should be done). In a way, though, this is still an advantage over a traditional site where each page has to know how to load every other page. Now, I can do stuff like "load this part (whether it's a whole page or otherwise) the way it wants to be loaded".
I also set it up so each part can have its own javascript and css (if only that part needs those things). Then, those files are included dynamically through javascript only the first time that part is loaded. Then if the part is loaded repeatedly it won't incur an extra overhead.
Just as an additional note: I was getting weird ASCII characters at the top of the pages that were using dynamic includes (typically the UTF-8 byte-order mark), and I found that using an ADODB.Stream object to read the include file eliminated this issue.
So my updated code for the getMappedFileAsString function is as follows:
Function getMappedFileAsString(byVal strFilename)
    Dim stm
    Set stm = CreateObject("ADODB.Stream")
    ' Reading as UTF-8 decodes the byte-order mark instead of passing it through
    stm.CharSet = "utf-8"
    stm.Open
    stm.LoadFromFile(Server.MapPath(strFilename))
    getMappedFileAsString = stm.ReadText()
    stm.Close
    Set stm = Nothing
End Function
