how to handle xpath with html-agility-pack


I'm trying to scrape some data from AliExpress using C# and HtmlAgilityPack.
Usually, the XPath of an element looks like this:
/html/body/div[7]/div[2]/div[4]/div/div/div[2]/div[1]/div[2]/div/div[1]/a
But when I copy the XPath of an element on AliExpress, it looks like this:
//*[@id="node-gallery"]/div[4]/div/div/ul/li[1]/div[1]/div[1]/a
With an XPath like that, the list of nodes comes back null and the program can't make any progress.
var html = @"https://best.aliexpress.com/?lan=en";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(html);
var nodes = htmlDoc.DocumentNode.SelectNodes("//*[@id]/div/div[2]/div/div[2]/dl//dd/div/div[2]/ul/li//a");
if (nodes.Count <= 0)
{
Console.WriteLine("nothing found");
}
else
{
foreach (HtmlNode n in nodes)
{
Console.WriteLine(n.Attributes);
}
}
Console.ReadKey();

Those items are not in the initial HTML: when you hover over them, the page makes an API request. You can probably find the details in one of the source files. Looking at the first two requests in the network tab, they have the following pattern (URL-decoded):
https://best.aliexpress.com/api/load_ams_path.do?path=aliexpress.com/common/@langField/ru/c-women-content.htm
https://best.aliexpress.com/api/load_ams_path.do?path=aliexpress.com/common/@langField/ru/c-men-content.htm
I suspect the others follow suit.
You can request these endpoints to get the HTML and then retrieve your desired content from it. To get the href of the element your XPath matches in the browser, you could do the following:
using System;
using HtmlAgilityPack;
public class Program
{
public static void Main()
{
string url = "https://best.aliexpress.com/api/load_ams_path.do?path=aliexpress.com/common/@langField/ru/c-women-content.htm";
HtmlWeb web = new HtmlWeb();
var htmlDoc = web.Load(url);
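var nodetest1 = htmlDoc.DocumentNode.SelectSingleNode("*//li[@class='sup-brand-item'][1]/a");
// SelectSingleNode returns null when nothing matches, so guard before dereferencing.
Console.WriteLine(nodetest1?.Attributes["href"]?.Value ?? "nothing found");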
}
}

Related

ASP.NET MVC: How to send the ids of the DOM Body elements on the client's browser TO the controller when navigating from the view

I am working on an ASP.NET MVC app (ASP.NET NOT ASP.NET Core).
When a View is rendered, the user can click on some buttons on the page to collapse or show divs associated with each button. The div changes its class depending on whether it is collapsed or shown. I am using bootstrap attributes for this, and it works fine.
Now I have a "Save" button on the page. When the user clicks on this button, I need to retrieve the ids and classes of the divs, and pass them TO the Controller (in an array/collection/dictionary whatever).
Is there a way/method in ASP.NET to send to the Controller the attributes (ids, classes, etc) of the DOM elements on the client's browser ?
Thanks

If you want to send some attributes of DOM elements to the controller, here is one way to do it.
HTML:
<div id="demo-1" class="chosendiv other-className" data-code ="abc">Lorem Ipsum</div>
<div id="demo-2" class="chosendiv other-className" data-code ="xyz">Lorem Ipsum</div>
<div id="demo-3" class="other-className" data-code ="mnt">Lorem Ipsum</div>
<button id="btn-save" onclick="Save()">SAVE</button>
Javascript
<script>
function Save(){
var cds = document.getElementsByClassName('chosendiv');
var finder = [];
if(cds != null){
for(var i = 0; i< cds.length; i++){
finder.push({
ID: cds[i].getAttribute('id'),
ClassName: cds[i].getAttribute('class'),
Code: cds[i].getAttribute('data-code')
})
}
}
//
// Send finder to Controller. You can use Ajax...
// A simple ajax call:
//
$.ajax({
url: '/Home/YourAction',
type: 'GET', //<---- you can use POST method.
data:{
myDiv: JSON.stringify(finder)
},
success: function(response){
// Your code
}
})
}
</script>
Your Controller
public class HomeController: Controller
{
public HomeController(){}
[HttpGet]
public void YourAction(string myDiv)
{
//A lot of ways for converting string to Object, such as: creating new class for model, ...
// I use Dictionary Class
List<Dictionary<string, string>> temp = new List<Dictionary<string, string>>();
if(!string.IsNullOrEmpty(myDiv))
{
try
{
temp = Newtonsoft.Json.JsonConvert.DeserializeObject<List<Dictionary<string, string>>>(myDiv);
}
catch
{
// Do something if deserialization fails.
}
}
// Get a element (at index) from temp if temp.Count()>0
// var id = temp.ElementAt(index)["ID"];
// var className = temp.ElementAt(index)["ClassName"];
// var code = temp.ElementAt(index)["Code"];
//
//Your code
//
}
//......
}
It would be great if my answer could solve your problem.

Based on the answer provided by @Gia Khang,
I made a few changes to keep the URL from exceeding the maximum length.
Instead of adding the elements' classes to an array in JS, I add them to a string and send it with POST:
function Save() {
var cds = document.getElementsByClassName('chosendiv');
// Use a string instead of an array
var finder = "";
if(cds != null){
for(var i = 0; i< cds.length; i++){
finder = finder + "id=" + cds[i].getAttribute('id') + "class=" + cds[i].getAttribute('class') + "data-code=" +cds[i].getAttribute('data-code')
}
}
// Send finder to Controller. You can use Ajax...
// A simple ajax call:
var myURL = "/{Controller}/{Action}"
$.ajax({
url: myURL,
type: "POST",
data: { ids:finder },
success: function (response) {
}
})
}
In the controller action I add a parameter named "ids" (it must have the same name as the key of the data object in the POST request). I extract the id, class, and data-code values from the ids string with a method in one of my model classes. Sorry, I work with VB.NET rather than C#, and converting the code to C# would take me a lot of time. I use the Split method in VB to split the ids string several times: first with "id=" as the delimiter, then splitting each element of the resulting array by the second delimiter "class=", and so on. I add the resulting elements to a collection.
The Controller Action looks like this:
public class HomeController: Controller
{
public HomeController(){}
[HttpPost]
public ActionResult YourAction(string ids)
{
Models.myClass.splitStringMethod(ids);
return View();
}
}
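The answer does not show its VB.NET split method, so here is a rough JavaScript sketch of the repeated splitting it describes. The function name parseFinder and the returned property names are hypothetical and only for illustration. It assumes that no id, class, or data-code value contains one of the delimiters "id=", "class=", or "data-code=". If a value does, the records will be cut in the wrong place.

```javascript
// Hypothetical helper: mirrors the Split-based parsing described above.
// Input looks like: "id=demo-1class=chosendivdata-code=abcid=demo-2..."
function parseFinder(finder) {
  return finder
    .split("id=")                        // one chunk per element
    .filter(part => part.length > 0)     // drop the empty chunk before the first "id="
    .map(part => {
      // Split each chunk on the second delimiter, then the third.
      const [id, rest = ""] = part.split("class=");
      const [className, code = ""] = rest.split("data-code=");
      return { id, className, code };
    });
}

const sample =
  "id=demo-1class=chosendiv other-classNamedata-code=abc" +
  "id=demo-2class=chosendiv other-classNamedata-code=xyz";
console.log(parseFinder(sample));
```

Because the format has no separator between records and no escaping, sending JSON in the POST body (as in the earlier answer, but with type POST) is likely the more robust choice.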

How to post Special character tweet using asp.net API?

I'm using the code below to post a tweet on Twitter. It works fine on my local system, but once we deploy it to the server, tweets containing special characters (!, :, $, etc.) are not published.
string key = "";
string secret = "";
string token="";
string tokenSecret="";
try
{
string localFilename = HttpContext.Current.Server.MapPath("../images/").ToString();
using (WebClient client = new WebClient())
{
client.DownloadFile(imagePath, localFilename);
}
var service = new TweetSharp.TwitterService(key, secret);
service.AuthenticateWith(token, tokenSecret);
// Tweet with image
if (imagePath.Length > 0)
{
using (var stream = new FileStream(localFilename, FileMode.Open))
{
var result = service.SendTweetWithMedia(new SendTweetWithMediaOptions
{
Status = message,
Images = new Dictionary<string, Stream> { { "name", stream } }
});
}
}
else // just message
{
var result = service.SendTweet(new SendTweetOptions
{
Status = HttpUtility.UrlEncode(message)
});
}
}
catch (Exception ex)
{
throw ex;
}

The statuses/update_with_media API endpoint is actually deprecated by Twitter and shouldn't be used (https://dev.twitter.com/rest/reference/post/statuses/update_with_media).
TweetSharp also has some issues with this method when the tweet contains both a 'special character' AND an image (it works fine with either, but not both). I don't know why and I haven't been able to fix it, but I'm pretty sure it has something to do with the OAuth signature.
As a solution I suggest you use TweetMoaSharp (a fork of TweetSharp). It has been updated to support the new Twitter APIs for handling media in tweets, and it will work in this situation if you use the new methods.
Basically you upload each media item using a new UploadMedia method, and that will return you a 'media id'. You then use the normal 'SendTweet' method and provide a list of the media ids to it along with the other status details. Twitter will attach the media to the tweet when it is posted, and it will work when there are both special characters and images.

In addition to TweetMoaSharp you can use Tweetinvi with the following code:
var binary = File.ReadAllBytes(@"C:\videos\image.jpg");
var media = Upload.UploadMedia(binary);
var tweet = Tweet.PublishTweet("hello", new PublishTweetOptionalParameters
{
Medias = {media}
});

Find the TITLE of current webpage open in WebEngine [[JAVAFX]]

I am writing a web browser based on JavaFX. I want to fetch the title of the web page currently open in the WebEngine.
Thank you :)

A much better and nicer way is to just use WebEngine.getTitle().
Here is an example how to use it:
stage.titleProperty().bind(webView.getEngine().titleProperty());

Once the document is loaded you can use the DOM API to find the title. (I generally dislike the DOM API, but here's how you'd do this.)
private String getTitle(WebEngine webEngine) {
Document doc = webEngine.getDocument();
NodeList heads = doc.getElementsByTagName("head");
String titleText = webEngine.getLocation() ; // use location if page does not define a title
if (heads.getLength() > 0) {
Element head = (Element)heads.item(0);
NodeList titles = head.getElementsByTagName("title");
if (titles.getLength() > 0) {
Node title = titles.item(0);
titleText = title.getTextContent();
}
}
return titleText ;
}

Just a different implementation of @James_D's excellent answer (a little less verbose, a little more Java 8 style):
private String getTitle(WebEngine webEngine) {
Document doc = webEngine.getDocument();
NodeList heads = doc.getElementsByTagName("head");
String titleText = webEngine.getLocation(); // use location if page does not define a title
return getFirstElement(heads)
.map(h -> h.getElementsByTagName("title"))
.flatMap(this::getFirstElement)
.map(Node::getTextContent).orElse(titleText);
}
private Optional<Element> getFirstElement(NodeList nodeList) {
if (nodeList.getLength() > 0 && nodeList.item(0) instanceof Element) {
return Optional.of((Element) nodeList.item(0));
}
return Optional.empty();
}

Web Api Help Page XML comments from more than 1 files

I have several plugins in my Web API project, each with its own XML docs, and one centralized Help page. The problem is that Web API's default Help Page only supports a single documentation file:
new XmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/Documentation.xml"))
How can I load documentation from several files? I want to do something like this:
new XmlDocumentationProvider("PluginsFolder/*.xml")

You can modify the installed XmlDocumentationProvider at Areas\HelpPage to merge multiple XML documentation files into a single document.
Example code (it is missing some error checks and validation):
using System.Xml.Linq;
using System.Xml.XPath;
XDocument finalDoc = null;
foreach (string file in Directory.GetFiles(@"PluginsFolder", "*.xml"))
{
if(finalDoc == null)
{
finalDoc = XDocument.Load(File.OpenRead(file));
}
else
{
XDocument xdocAdditional = XDocument.Load(File.OpenRead(file));
finalDoc.Root.XPathSelectElement("/doc/members")
.Add(xdocAdditional.Root.XPathSelectElement("/doc/members").Elements());
}
}
// Supply the navigator that rest of the XmlDocumentationProvider code looks for
_documentNavigator = finalDoc.CreateNavigator();

Kiran's solution works very well. I ended up using his approach, but in a copy of XmlDocumentationProvider called MultiXmlDocumentationProvider, with an altered constructor:
public MultiXmlDocumentationProvider(string xmlDocFilesPath)
{
XDocument finalDoc = null;
foreach (string file in Directory.GetFiles(xmlDocFilesPath, "*.xml"))
{
using (var fileStream = File.OpenRead(file))
{
if (finalDoc == null)
{
finalDoc = XDocument.Load(fileStream);
}
else
{
XDocument xdocAdditional = XDocument.Load(fileStream);
finalDoc.Root.XPathSelectElement("/doc/members")
.Add(xdocAdditional.Root.XPathSelectElement("/doc/members").Elements());
}
}
}
// Supply the navigator that rest of the XmlDocumentationProvider code looks for
_documentNavigator = finalDoc.CreateNavigator();
}
I register the new provider from HelpPageConfig.cs:
config.SetDocumentationProvider(new MultiXmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/")));
Creating a new class and leaving the original one unchanged may be more convenient when upgrading etc...

Rather than create a separate class along the lines of XmlMultiDocumentationProvider, I just added a constructor to the existing XmlDocumentationProvider. Instead of taking a folder name, this takes a list of strings so you can still specify exactly which files you want to include (if there are other xml files in the directory that the Documentation XML are in, it might get hairy). Here's my new constructor:
public XmlDocumentationProvider(IEnumerable<string> documentPaths)
{
if (documentPaths.IsNullOrEmpty())
{
throw new ArgumentNullException(nameof(documentPaths));
}
XDocument fullDocument = null;
foreach (var documentPath in documentPaths)
{
if (documentPath == null)
{
throw new ArgumentNullException(nameof(documentPath));
}
if (fullDocument == null)
{
using (var stream = File.OpenRead(documentPath))
{
fullDocument = XDocument.Load(stream);
}
}
else
{
using (var stream = File.OpenRead(documentPath))
{
var additionalDocument = XDocument.Load(stream);
fullDocument?.Root?.XPathSelectElement("/doc/members").Add(additionalDocument?.Root?.XPathSelectElement("/doc/members").Elements());
}
}
}
_documentNavigator = fullDocument?.CreateNavigator();
}
The HelpPageConfig.cs looks like this. (Yes, it can be fewer lines, but I don't have a line limit so I like splitting it up.)
var xmlPaths = new[]
{
HttpContext.Current.Server.MapPath("~/bin/Path.To.FirstNamespace.XML"),
HttpContext.Current.Server.MapPath("~/bin/Path.To.OtherNamespace.XML")
};
var documentationProvider = new XmlDocumentationProvider(xmlPaths);
config.SetDocumentationProvider(documentationProvider);

I agree with gurra777 that creating a new class is a safer upgrade path. I started with that solution but it involves a fair amount of copy/pasta, which could easily get out of date after a few package updates.
Instead, I am keeping a collection of XmlDocumentationProvider children. For each of the implementation methods, I'm calling into the children to grab the first non-empty result.
public class MultiXmlDocumentationProvider : IDocumentationProvider, IModelDocumentationProvider
{
private IList<XmlDocumentationProvider> _documentationProviders;
public MultiXmlDocumentationProvider(string xmlDocFilesPath)
{
_documentationProviders = new List<XmlDocumentationProvider>();
foreach (string file in Directory.GetFiles(xmlDocFilesPath, "*.xml"))
{
_documentationProviders.Add(new XmlDocumentationProvider(file));
}
}
public string GetDocumentation(System.Reflection.MemberInfo member)
{
return _documentationProviders
.Select(x => x.GetDocumentation(member))
.FirstOrDefault(x => !string.IsNullOrWhiteSpace(x));
}
//and so on...
The HelpPageConfig registration is the same as in gurra777's answer:
config.SetDocumentationProvider(new MultiXmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/")));

Flex prevent URL encoding of params with HTTPRequest

I'm trying to port an existing AJAX app to Flex, and having trouble with the encoding of parameters sent to the backend service.
When deleting a contact, the existing app performs a POST, sending the following (captured with Firebug):
contactRequest.contacts[0].contactId=2c33ddc6012a100096326b40a501ec72
So, I create the following code:
var service:HTTPService;
function initalizeService():void
{
service = new HTTPService();
service.url = "http://someservice";
service.method = 'POST';
}
public function sendReq():void
{
var params:Object = new Object();
params['contactRequest.contacts[0].contactId'] = '2c33ddc6012a100097876b40a501ec72';
service.send(params);
}
In firebug, I see this sent out as follows:
Content-type: application/x-www-form-urlencoded
Content-length: 77
contactRequest%2Econtacts%5B0%5D%2EcontactId=2c33ddc6012a100097876b40a501ec72
Flex is URL encoding the params before sending them, and we're getting an error returned from the server.
How do I disable this encoding, and get the params sent as-is, without the URL encoding?
I feel like the contentType property should be the key - but neither of the defined values work.
Also, I've considered writing a SerializationFilter, but this seems like overkill - is there a simpler way?

Writing a SerializationFilter seemed to do the trick:
public class MyFilter extends SerializationFilter
{
public function MyFilter()
{
super();
}
override public function serializeBody(operation:AbstractOperation, obj:Object):Object
{
var s:String = "";
var classinfo:Object = ObjectUtil.getClassInfo(obj);
for each (var p:* in classinfo.properties)
{
var val:* = obj[p];
if (val != null)
{
if (s.length > 0)
s += "&";
s += StringUtil.substitute("{0}={1}",p,val);
}
}
return s;
}
}
I'd love to know any alternative solutions that don't involve doing this though!
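For readers more at home with the original AJAX app, the serializeBody override above can be sketched in JavaScript as follows. This is an illustrative port, not Flex code, and buildRawBody is a hypothetical name. Like the filter, it joins key=value pairs with "&" and applies no URL encoding. It therefore assumes that no key or value contains "&" or "=", since either would corrupt the body.

```javascript
// Hypothetical JavaScript port of the serializeBody logic shown above.
function buildRawBody(params) {
  const pairs = [];
  for (const key of Object.keys(params)) {
    const value = params[key];
    // Mirror the `val != null` check: skip null and undefined values.
    if (value != null) {
      // Deliberately no encodeURIComponent: names and values are sent as-is.
      pairs.push(key + "=" + value);
    }
  }
  return pairs.join("&");
}

const body = buildRawBody({
  "contactRequest.contacts[0].contactId": "2c33ddc6012a100097876b40a501ec72",
  ignored: null
});
console.log(body);
```

The parameter name reaches the server with its dots and brackets intact, which is what the backend in the question expects.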
