I am new to the HtmlAgilityPack and it's a bit unclear to me exactly how it works. Let's say a piece of code like this is written:
Dim url1 As String = "http://www.bing.com/search?q=Verizon"
Dim hw As New HtmlWeb()
Dim doc As HtmlDocument = hw.Load(url1)
For Each link As HtmlNode In doc.DocumentNode.SelectNodes("//a[@href]")
    Dim att As HtmlAttribute = link.Attributes("href")
    Response.Write(att.Value)
Next
So when the SelectNodes argument is //a[@href], does that mean it will only look at <a> tags that have an href attribute?
If so, how can I make it consider other tags within the loop, like <li>, <h3>, <div>?
Is something like //li[@class='wrap']|//div[@class='last'] correct?
How can the data between those tags be fetched and presented?
One other issue: let's say I need to scrape a telephone number from that URL. The number might be unavailable, or might not be in any of the tags defined. Is there any reliable method I can use to obtain a telephone number related to a search term? Any suggestions or thoughts?
Indeed, the current XPath looks at anchor tags that have an href attribute. I suggest you read up on XPath syntax (for instance at http://www.w3schools.com/xpath/xpath_syntax.asp).
To select other nodes you need to change the xpath to select those tags, for instance:
doc.DocumentNode.SelectNodes("//li")
to get all li nodes etc.
The data in the tags can be reached using the InnerHtml property of the selected nodes (link.InnerHtml in your example), or InnerText if you only want the text without the markup.
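To see how these XPath expressions behave, here is a minimal sketch in Python using the standard library's ElementTree rather than HtmlAgilityPack. The sample markup and the helper name inner_text are made up for illustration. ElementTree supports only a subset of XPath and has no | union operator, so the two class-based queries are run separately here; the union form in the question is valid XPath 1.0 syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (real pages usually need an HTML parser).
SAMPLE = """
<html><body>
  <ul>
    <li class="wrap"><a href="http://example.com/a">First <b>result</b></a></li>
    <li><a name="anchor-only">No href here</a></li>
  </ul>
  <div class="last"><h3>Heading</h3></div>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Equivalent of //a[@href]: only anchors that carry an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

# Equivalent of //li[@class='wrap'] and //div[@class='last'], queried one by one.
wrapped_items = root.findall(".//li[@class='wrap']")
last_divs = root.findall(".//div[@class='last']")


def inner_text(node):
    """Concatenate all text inside a node, ignoring the nested tags."""
    return "".join(node.itertext()).strip()


print(hrefs)
print([inner_text(n) for n in wrapped_items + last_divs])
```

The anchor without an href is skipped by the first query, which is the behaviour the [@href] predicate asks for.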
Automatically scraping telephone numbers is a real pain: every country uses different lengths, and there are many different formats to write down a number. +12(0)3456, +123456, 00123456 and +12(0)34-56 are all the same valid phone number... See "Check if there is phone number in string C#" for a simple solution.
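As a rough illustration of why this is hard, here is a heuristic sketch in Python. The pattern, the 6 to 15 digit window and the function name find_phone_candidates are all assumptions chosen for this example, not a reliable validator; expect false positives (order numbers, dates) and misses on real pages.

```python
import re

# Heuristic: optional "+" or "00" prefix, then digits mixed with spaces,
# brackets, dots, slashes or dashes, starting and ending on a digit.
PHONE_CANDIDATE = re.compile(r"(?<!\w)(?:\+|00)?\d[\d\s()./-]{4,18}\d(?!\w)")


def find_phone_candidates(text, min_digits=6, max_digits=15):
    """Return substrings that look like phone numbers (candidates only)."""
    candidates = []
    for match in PHONE_CANDIDATE.finditer(text):
        digit_count = len(re.sub(r"\D", "", match.group()))
        if min_digits <= digit_count <= max_digits:
            candidates.append(match.group())
    return candidates


sample = "Call +12(0)3456 or 00123456, fax +12(0)34-56. Order 42 shipped."
print(find_phone_candidates(sample))
```

Anything found this way still has to be checked by hand or against a per-country rule set before it is trusted.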
GL&HF!
I would like to have a query that looks at the name of its parent, navigates to the folder with the same site name underneath the content folder, and shows the items underneath it in the multilist.
Basically, the structure in the content tree looks as follows:
sitecore
    content
        Sitename1
            (items that needs to be showed in the multilist)
        Sitename2
            (items that needs to be showed in the multilist)
    medialibrary
        Sitename1
            PDF1 (with multiList with search)
        Sitename2
            PDF2 (with multiList with search)
Trouble starts when I would like to start comparing the "name" of the relative parent to the "name" of the child of the absolute path. In Xpath it would probably go something like it is described within: Compare attribute values using Xpath
In this case I had the following query up until now:
/sitecore/content/*[ancestor-or-self::*[@@templateid='{Template Id of Sitename}']=@@name and @@templateid='{Template Id of Sitename}']
This query returns Sitename1 and Sitename2.
Funny thing is that if I replace the "ancestor-or-se.." or "@@name" part by "Sitename1", like so:
/sitecore/content/*['Sitename1'=@@name and @@templateid='{Template Id of Sitename}']
..and..
/sitecore/content/*[./ancestor-or-self::*[@@templateid='{Template Id of Sitename}']='DSW' and @@templateid='{Template Id of Sitename}']
I get the wanted result: Sitename1.
Btw, I'm using the built-in XPath builder for now, before I paste the query into the "multilist with search".
Any help would be appreciated.
Edit:
I think I found out that when I start the relative query (the "./ancestor::..." part), it is actually relative to the item where I ended up with the absolute query. So I should have the following query:
./ancestor-or-self::*[@@templateid='{Template Id of Sitename}' and @@name=ancestor::*[@@templateid='{Template Id of root item aka "sitecore"}']//*[@@templateid={Template Id of Sitename}]]
Here I get the error "Object must be of type String", which is probably because of the following part of the previous query:
@@name=ancestor::*[@@templateid='{Template Id of root item aka "sitecore"}']//*[@@templateid={Template Id of Sitename}]
The right-hand side of this doesn't resolve to a string. So the question remains: how do I extract the plain string from a Sitecore item using Sitecore XPath, in order to make a comparison?
I figured out that Sitecore doesn't support subqueries, at least for fast queries, and I think the same applies to normal ones (see also: "Subqueries are not supported" in here). This led me to use simple code in which I perform two queries. A very simple way to do it is to inherit from IDatasource (in the sitecore.buckets.dll); you will need to write "code:{fullpath to class}, assemblyname.dll" (see also: here).
I have found a similar question earlier here:
Google Analytics Visitors Flow: grouping URLs?
However, I'm confused because people suggest different ways to write the Replace String, and whichever way I try, I am not able to make it work.
So I have an ecommerce site with hundreds of different pages. The different parts of the website are:
http://example.com/sv/ (Root)
http://example.com/sv/category/1-name/
http://example.com/sv/product/1-name/
http://example.com/sv/designer-tool/1-name/
http://example.com/sv/checkout/
When I go to the Visitors Flow, I want to see the number of people that go, for example, from Root to Category, from Category to Product, from Product to Designer Tool, and from Designer Tool to Checkout. However, with so many different pages it becomes very difficult to follow the visitors flow, because the product pages, for example, are not grouped together.
So instead of the above, I would like to remove the 1-name/ part at the end and only see /sv/category/, /sv/product/, /sv/designer-tool/.
From the earlier post I understand you can use an advanced filter to do this. I have set the following settings:
Type: Search & Replace
Field: Request URI
Search String: ^/(category|product|designer-tool)(/\d*)(.*)
Replace String: /$A1$A3
I guess that my search string and my replace string are wrong. Any ideas?
EDIT: I updated my filter to the following:
Search String: ^/sv/(category|product|designer-tool)(/\d*)(.*)$
Replace String: /sv/\1/
Still testing and unsure if it's the correct way to set it up.
I was able to solve this with the Search String and the Replace String in my edit above.
So basically what I did was:
Create a secondary view/profile for your site. If you apply the filter to your one and only view/profile, you won't be able to see any detailed data about specific pages, because the filter removes it.
Add an Advanced Filter with the following settings:
Type: Search & Replace
Field: Request URI
Search String: ^/sv/(category|product|designer-tool)(/\d*)(.*)$
Replace String: /sv/\1/
You need to wait 24h after creating your new profile/view before you can see any data in it.
So my confusion was regarding the Search String and the Replace String. The Search String is a regular expression that is matched against the Request URI, i.e. everything after the domain. So for example, for http://www.example.com/sv/mypage/1-post/, the Search String will only search within /sv/mypage/1-post/.
The Replace String is what the whole match of the Search String is replaced with. In my case, I matched all URLs of the form /sv/category/1-string/. I wanted to keep only the "category" part, so I replaced the whole string with /sv/category/ by entering the Replace String /sv/\1/
/sv/ means just what it says. \1 means that it should take the value of the first () group of my Search String (in this case "category"). The ending / is just a trailing slash.
All in all, it means that any URL that looked like http://example.com/sv/category/1-string/ was changed to http://example.com/sv/category/, meaning that I can now see data for all my categories as a group, instead of individual pages.
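The effect of this Search String and Replace String can be checked outside Google Analytics with any regex engine. Here is a minimal sketch in Python, not the GA filter itself; the function name group_uri is made up for illustration, and it assumes the \1 backreference behaves the same way in both tools.

```python
import re

# Same pattern as the Search String; group 1 is the section name.
SEARCH = re.compile(r"^/sv/(category|product|designer-tool)(/\d*)(.*)$")
REPLACE = r"/sv/\1/"


def group_uri(request_uri):
    """Collapse individual category/product/designer-tool pages into one URI."""
    return SEARCH.sub(REPLACE, request_uri)


for uri in ["/sv/", "/sv/category/1-name/", "/sv/product/1-name/",
            "/sv/designer-tool/1-name/", "/sv/checkout/"]:
    print(uri, "->", group_uri(uri))
```

URIs that do not match the pattern, such as /sv/ and /sv/checkout/, pass through unchanged.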
Basically I have a news page which stores headlines, stories, and a unique story identifier in a SQL database. I want to be able to create a hyperlink on a webpage to the pictures.
So when someone selects a news story from a drop-down menu (which uses the headline) and presses submit, I want to pass the StoryID, which is a unique identifier, to a spot in a hyperlink. So if it was story 134, then the link would look like:
I know the SQL statement would look like:
SELECT StoryID from db.News
Where Headline = {The headline selected in the dropdown menu}
The drop-down menu is called NewsDrop.
This would be an ASPX page written with a VB code base.
So I guess I need help passing the variables along to the search string and the hyperlink.
I'm not even sure if this is possible.
There are a number of options available to achieve this. The most common would be to use a query string in the hyperlinks in your drop-down menu to send a parameter to a SQL stored procedure, which would use it as a variable in your SELECT statement. So basically the hyperlinks in the drop-down menu would be appended with ?storyID=<uniquestoryid>, and on the far end you would run SELECT StoryID from db.News Where StoryID = @StoryID. It would be less efficient to use the headline from the link as the query string and as the variable in the WHERE clause, as you have shown, but if that is your only option it could be done.
However, you should proceed carefully when using query strings. Here is a link to a good basic article about query strings, and another link about best practices.
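The flow described above (read the ID from the query string, validate it, pass it to SQL as a parameter) can be sketched as follows. This is Python with an in-memory SQLite database rather than ASP.NET and a stored procedure; the News columns, the sample row, the page URL and the function names are assumptions made up for the example.

```python
import sqlite3
from urllib.parse import urlsplit, parse_qs


def story_id_from_url(url):
    """Return the storyID query-string value as an int, or None if missing/invalid."""
    values = parse_qs(urlsplit(url).query).get("storyID", [])
    if len(values) != 1:
        return None
    try:
        return int(values[0])
    except ValueError:
        return None  # never pass unvalidated text on to the database


def find_headline(conn, story_id):
    """Look the story up with a named parameter instead of string concatenation."""
    row = conn.execute(
        "SELECT Headline FROM News WHERE StoryID = :story_id",
        {"story_id": story_id},
    ).fetchone()
    return row[0] if row else None


# Hypothetical sample data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE News (StoryID INTEGER PRIMARY KEY, Headline TEXT, Story TEXT)")
conn.execute("INSERT INTO News VALUES (134, 'Local team wins', 'Story text')")

story_id = story_id_from_url("http://example.com/pictures.aspx?storyID=134")
print(story_id, find_headline(conn, story_id))
```

Because the value travels as a parameter, a tampered query string can at worst select a different row; it cannot change the SQL statement itself.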
On the page I have created there is a search facility: if a doctor's number is searched, it brings up the doctor's details. Once the search button is clicked, the results are displayed in textboxes (I cannot use GridViews because that is not wanted).
Sample of the code placed on the search button:
Query statement = "SELECT DocNumber FROM tblDoctor WHERE DNum LIKE '%"
execute the query and get the result
The result is converted to string and Execute Scalar is used
DocNum.Text = Result1
Query statement = "SELECT DocName FROM tblDoctor WHERE DNum LIKE '%"
execute the query and get the result
The result is converted to string and Execute Scalar is used
DocName.Text = Result2
etc. There are 14 other textboxes that I want to display data in, so there is a large amount of repeated code following the structure above. Can anyone suggest a better way of doing this?
Another problem of code repetition comes from the previous page that links to this one. That page has a summary of the doctors' details; once a row is clicked, it takes you to this page, which displays a more detailed view of their personal details. The selected doctor number is passed on to the detailed view using a query string, so I have the code:
Automatic population of the selected doctors will fill the labels
on page load
Request the query string and store into variable dNum
Query statement = "SELECT DocNumber FROM tblDoctor WHERE DNum = " & dNum
Get result from query convert to string and use execute scalar
lblDocNum.Text = Res1
Query statement = "SELECT DocNumber FROM tblDoctor WHERE DNum = " & dNum
Get result from query convert to string and use execute scalar
lblDocNum.Text = Res1
etc...
What I am doing works correctly but the coding style looks poor. Any help would be much appreciated.
Thank you
Why not use a DataReader or DataSet, or whatever you prefer, to return the whole record, and then simply move from column to column and populate the textboxes that way, instead of returning one value at a time?
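The idea of fetching the whole record once can be sketched like this. It is Python with SQLite rather than ADO.NET; the table contents, the dictionary standing in for the textboxes, and the function name load_doctor are assumptions for illustration only.

```python
import sqlite3


def load_doctor(conn, doc_number):
    """Return the whole tblDoctor row for one doctor as a dict, or None."""
    conn.row_factory = sqlite3.Row  # lets columns be read by name
    row = conn.execute(
        "SELECT * FROM tblDoctor WHERE DNum = ?",  # placeholder, no string concatenation
        (doc_number,),
    ).fetchone()
    return dict(row) if row is not None else None


# Hypothetical sample data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblDoctor (DNum INTEGER PRIMARY KEY, DocNumber TEXT, DocName TEXT)")
conn.execute("INSERT INTO tblDoctor VALUES (7, 'D-007', 'Dr. Example')")

record = load_doctor(conn, 7)

# One query and one loop replace the repeated "query, then assign" blocks.
textboxes = {column: str(value) for column, value in record.items()}
print(textboxes)
```

The same function can serve both the search button and the page-load case, since both only differ in where the doctor number comes from.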
If the goal is less code,
SELECT * FROM tblDoctor WHERE xxx
into a DataTable or DataReader, as Thyamine suggested, above.
From there, you could also put the textboxes in an HTML table in a Repeater which you will bind to that datatable. You won't have to individually assign any of the values, the databinding will do it for you!
I know that people think HTML tables are evil but it's the easiest way to line up the labels I assume you will also want.
I also know that the control I suggested is called a Repeater but you only have one record. If you don't tell the compiler, I won't. :)
From parts of your question, it sounds like you're wondering whether to send all of the bits of information along in the querystring but that doesn't sound like a good idea to me because it invites users to mess with the data in the querystring.
You didn't mention - are these textboxes meant for editing? A Save button inside the Repeater would have easy access to all of the controls to build your update statement. You could also put the Save button outside the repeater and refer to the repeater's first Item to find the controls.
I am trying to use query strings in ASP.NET. I have a requirement for the following format:
http://localhost/website/1/?callback=?
Here 1 denotes the ID of the profile. This means some info for id=1 will be fetched through the string.
If this had been website/2/?callback=?, then the id would be 2. My question is how to use this /id/ as a query string so it can be used to fetch the profile ID. Using the /id/ format is my first preference; otherwise I could look into fetching it using two ?'s.
If the id is 1, I want to fetch the particulars for ID=1 from the DB on this page: http://localhost/website/1/?callback=?
In your case the ID is in the PATH, not the query string. You can access the path via Request.Path in an ASPX page. From there you would need to do some string parsing to get at the portion of the path where you expect the ID to be.
In your case I would probably use something like int.Parse(Request.Path.Split(new char[] {'/'}, StringSplitOptions.RemoveEmptyEntries)[1]), but please note that I've made that line pretty dense for brevity's sake. For starters, you should use int.TryParse() instead of int.Parse(). This code assumes that the ID will always be in the same place in the url. For example, it will work for "/website/2/" and "/user/2/", but not for "/website/somethingelse/2/".
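The same parsing logic, written out less densely, might look like the sketch below. It is in Python rather than C#, the function name profile_id_from_path and the position parameter are made up for illustration, and a failed parse returns None in the spirit of int.TryParse().

```python
from urllib.parse import urlsplit


def profile_id_from_path(path, position=1):
    """Return the integer found at the given path segment, or None."""
    segments = [s for s in path.split("/") if s]  # drop empty entries
    if position >= len(segments):
        return None
    try:
        return int(segments[position])
    except ValueError:
        return None  # segment is present but not a number


url = "http://localhost/website/1/?callback=?"
path = urlsplit(url).path  # the ID lives in the path, not the query string
print(path, profile_id_from_path(path))
```

As with the one-liner, this only works while the ID stays at the same position in the path.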
UriTemplate might be a good choice for that sort of parsing. Not to mention, probably a bit more clear and explicit about what's happening.
Check out: http://msdn.microsoft.com/en-us/library/system.uritemplate(v=VS.90).aspx