I need to crawl a site periodically to check whether its URLs are available or not. For this, I am using crawler4j.
My problem comes with web pages that disable robots with <meta name="robots" content="noindex,nofollow" />, which makes sense given their content: those pages should not be indexed by a search engine.
crawler4j is not following these links either, even though I have disabled the robots handling in its configuration. This should be as simple as robotstxtConfig.setEnabled(false):
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setUserAgentName(USER_AGENT_NAME);
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
WebCrawlerController controller = new WebCrawlerController(config, pageFetcher, robotstxtServer);
...
But the described web pages are still not explored. I have read the code, and this should be enough to disable the robots directives, but it is not working as expected. Am I missing something? I have tested versions 3.5 and 3.6-SNAPSHOT with identical results.
I am using a newer version:
<dependency>
<groupId>edu.uci.ics</groupId>
<artifactId>crawler4j</artifactId>
<version>4.1</version>
</dependency>
After setting RobotstxtConfig like this, it is working:
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
My test results and the crawler4j source code prove it:
public boolean allows(WebURL webURL) {
    if (config.isEnabled()) {
        try {
            URL url = new URL(webURL.getURL());
            String host = getHost(url);
            String path = url.getPath();

            HostDirectives directives = host2directivesCache.get(host);
            if ((directives != null) && directives.needsRefetch()) {
                synchronized (host2directivesCache) {
                    host2directivesCache.remove(host);
                    directives = null;
                }
            }

            if (directives == null) {
                directives = fetchDirectives(url);
            }
            return directives.allows(path);
        } catch (MalformedURLException e) {
            logger.error("Bad URL in Robots.txt: " + webURL.getURL(), e);
        }
    }
    return true;
}
When enabled is set to false, it no longer performs the check at all.
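For reference, a minimal end-to-end setup with the robots.txt check disabled might look like this (a sketch against the crawler4j 4.x API; the storage folder and the MyCrawler class are placeholders):
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class NoRobotsCrawl {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl"); // placeholder path

        PageFetcher pageFetcher = new PageFetcher(config);

        // Disabling this makes RobotstxtServer.allows() return true unconditionally.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);

        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        controller.addSeed("http://www.example.com/");
        controller.start(MyCrawler.class, 1); // MyCrawler extends WebCrawler
    }
}
Note that this only disables the robots.txt check shown in allows() above; how the crawler treats in-page <meta name="robots"> tags is a separate mechanism.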
Why not just strip everything related to Robotstxt out of crawler4j? I needed to crawl a site and ignore robots, and this worked for me.
I changed CrawlController and WebCrawler in the .crawler package as follows:
WebCrawler.java:
delete
private RobotstxtServer robotstxtServer;
delete
this.robotstxtServer = crawlController.getRobotstxtServer();
edit
if ((shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if ((shouldVisit(webURL)))
edit
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) &&
(shouldVisit(webURL)) && (this.robotstxtServer.allows(webURL)))
-->
if (((maxCrawlDepth == -1) || (curURL.getDepth() < maxCrawlDepth)) &&
(shouldVisit(webURL)))
CrawlController.java:
delete
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
delete
protected RobotstxtServer robotstxtServer;
edit
public CrawlController(CrawlConfig config, PageFetcher pageFetcher, RobotstxtServer robotstxtServer) throws Exception
-->
public CrawlController(CrawlConfig config, PageFetcher pageFetcher) throws Exception
delete
this.robotstxtServer = robotstxtServer;
edit
if (!this.robotstxtServer.allows(webUrl))
{
logger.info("Robots.txt does not allow this seed: " + pageUrl);
}
else
{
this.frontier.schedule(webUrl);
}
-->
this.frontier.schedule(webUrl);
delete
public RobotstxtServer getRobotstxtServer()
{
return this.robotstxtServer;
}
public void setRobotstxtServer(RobotstxtServer robotstxtServer)
{
this.robotstxtServer = robotstxtServer;
}
Hope that's what you're looking for.
I'm trying to write a simple servlet that generates a PDF file based on an HTML template, using Thymeleaf and Flying Saucer, as in an example I followed.
In my template.html I have a style declaration as follows:
<link rel="stylesheet" type="text/css" media="all" href="style.css"/>
It never gets loaded: no error, nothing, the resulting .pdf is just missing the styles. If I put the content of the style file directly into template.html, it works like a charm.
If I put something like this:
<link rel="stylesheet" type="text/css" media="all" href="http://localhost:8080/MY_APP/resources/style.css"/>
it works.
All my resources are under src/main/webapp/resources.
After a few more hours of researching the problem, this is what I've ended up with.
For CSS, the only solution I found was to put the CSS directly into the HTML templates. Not elegant, but it does the job (at least for now), and it's not a big issue since I only use those templates to generate PDF files.
The same issue occurred with image files, but that one I was able to solve in an elegant way. Here it is!
The problem was that, inside the web container, Java couldn't find the files referenced in the .html templates.
So I had to write a custom element factory implementing ReplacedElementFactory. Here is the code (with the imports it needs):
import java.io.IOException;
import java.net.URL;

import org.w3c.dom.Element;
import org.xhtmlrenderer.extend.FSImage;
import org.xhtmlrenderer.extend.ReplacedElement;
import org.xhtmlrenderer.extend.ReplacedElementFactory;
import org.xhtmlrenderer.extend.UserAgentCallback;
import org.xhtmlrenderer.layout.LayoutContext;
import org.xhtmlrenderer.pdf.ITextFSImage;
import org.xhtmlrenderer.pdf.ITextImageElement;
import org.xhtmlrenderer.render.BlockBox;
import org.xhtmlrenderer.simple.extend.FormSubmissionListener;

import com.lowagie.text.BadElementException;
import com.lowagie.text.Image;

public class B64ImgReplacedElementFactory implements ReplacedElementFactory {

    @Override
    public ReplacedElement createReplacedElement(LayoutContext c, BlockBox box,
            UserAgentCallback uac, int cssWidth, int cssHeight) {
        Element e = box.getElement();
        if (e == null) {
            return null;
        }
        String nodeName = e.getNodeName();
        if (nodeName.equals("img")) {
            String attribute = e.getAttribute("src");
            FSImage fsImage;
            try {
                fsImage = buildImage(attribute);
            } catch (BadElementException | IOException e1) {
                fsImage = null;
            }
            if (fsImage != null) {
                if (cssWidth != -1 || cssHeight != -1) {
                    fsImage.scale(cssWidth, cssHeight);
                }
                return new ITextImageElement(fsImage);
            }
        }
        return null;
    }

    // Resolves the src attribute on the classpath and wraps it in an iText image.
    protected FSImage buildImage(String srcAttr) throws IOException, BadElementException {
        URL res = getClass().getClassLoader().getResource(srcAttr);
        if (res != null) {
            return new ITextFSImage(Image.getInstance(res));
        }
        return null;
    }

    @Override
    public void remove(Element e) {
    }

    @Override
    public void reset() {
    }

    @Override
    public void setFormSubmissionListener(FormSubmissionListener listener) {
    }
}
And here is how it's used in the PDF-generating code:
ITextRenderer renderer = new ITextRenderer();
SharedContext sharedContext = renderer.getSharedContext();
sharedContext.setReplacedElementFactory(new B64ImgReplacedElementFactory());
The custom factory catches every occurrence of an 'img' node, uses the ClassLoader to resolve the resource path, and returns an FSImage built from it, which is exactly what we need.
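To round out the picture, the rest of the PDF generation might look something like this (a sketch, exception handling omitted; htmlContent stands for the HTML string produced by Thymeleaf):
// Hypothetical continuation: render the Thymeleaf output to a PDF file.
ITextRenderer renderer = new ITextRenderer();
renderer.getSharedContext().setReplacedElementFactory(new B64ImgReplacedElementFactory());
renderer.setDocumentFromString(htmlContent);
renderer.layout();
try (OutputStream os = new FileOutputStream("out.pdf")) {
    renderer.createPDF(os);
}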
Hope that helps!
With OpenCSV, how do I append to an existing CSV using a MappingStrategy? There are lots of examples I could find that do NOT use a bean mapping strategy, but I like the dynamic nature of the column mapping with the bean strategy and would like to get it working this way. Here is my code, which just rewrites a single line to the CSV file instead of appending.
How can I fix this? I'm using OpenCSV 4.5. Note that I set my FileWriter with append=true, yet it is not working as I expected: re-running this method simply overwrites the entire file with a header and a single row.
public void addRowToCSV(PerfMetric rowData) {
    File file = new File(PerfTestMetric.CSV_FILE_PATH);
    try {
        CSVWriter writer = new CSVWriter(new FileWriter(file, true));
        CustomCSVMappingStrategy<PerfMetric> mappingStrategy
                = new CustomCSVMappingStrategy<>();
        mappingStrategy.setType(PerfMetric.class);
        StatefulBeanToCsv<PerfMetric> beanToCsv
                = new StatefulBeanToCsvBuilder<PerfMetric>(writer)
                        .withMappingStrategy(mappingStrategy)
                        .withSeparator(',')
                        .withApplyQuotesToAll(false)
                        .build();
        try {
            beanToCsv.write(rowData);
        } catch (CsvDataTypeMismatchException e) {
            e.printStackTrace();
        } catch (CsvRequiredFieldEmptyException e) {
            e.printStackTrace();
        }
        writer.flush();
        writer.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Or is the usual pattern to load all rows into a List and then rewrite the entire file? I was able to get it to work by writing two MappingStrategy implementations and then choosing between them with an if-file-exists check, but doing it that way leaves me with an "Unchecked assignment" warning in my code. Not ideal; I'm hoping for a more elegant solution.
I've updated OpenCSV to version 5.1 and got it working. In my case I needed the CSV headers to have a specific name and position, so I'm using both @CsvBindByName and @CsvBindByPosition, and needed to create a custom MappingStrategy to get it working.
Later, I needed to edit the MappingStrategy to support appending: in append mode it must not generate a CSV header.
import org.apache.commons.lang3.ArrayUtils;
import org.apache.commons.lang3.StringUtils;
import org.apache.commons.lang3.reflect.FieldUtils;

import com.opencsv.bean.BeanField;
import com.opencsv.bean.ColumnPositionMappingStrategy;
import com.opencsv.bean.CsvBindByName;
import com.opencsv.exceptions.CsvRequiredFieldEmptyException;

public class CustomMappingStrategy<T> extends ColumnPositionMappingStrategy<T> {

    private boolean useHeader = true;

    public CustomMappingStrategy() {
    }

    public CustomMappingStrategy(boolean useHeader) {
        this.useHeader = useHeader;
    }

    @Override
    public String[] generateHeader(T bean) throws CsvRequiredFieldEmptyException {
        final int numColumns = FieldUtils.getAllFields(bean.getClass()).length;
        super.setColumnMapping(new String[numColumns]);
        if (!useHeader) {
            // Append mode: suppress the header line entirely.
            return ArrayUtils.EMPTY_STRING_ARRAY;
        }
        String[] header = new String[numColumns];
        BeanField<T, Integer> beanField;
        for (int i = 0; i < numColumns; i++) {
            beanField = findField(i);
            header[i] = extractHeaderName(beanField);
        }
        return header;
    }

    private String extractHeaderName(final BeanField<T, Integer> beanField) {
        if (beanField == null || beanField.getField() == null
                || beanField.getField().getDeclaredAnnotationsByType(CsvBindByName.class).length == 0) {
            return StringUtils.EMPTY;
        }
        // Return the value of the @CsvBindByName annotation.
        final CsvBindByName bindByNameAnnotation =
                beanField.getField().getDeclaredAnnotationsByType(CsvBindByName.class)[0];
        return bindByNameAnnotation.column();
    }
}
Now, if you use the default constructor, it adds the header to the generated CSV; with the boolean constructor you can tell it to write the header or skip it.
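For example, a usage sketch that appends and only writes the header when the file is new (assuming the PerfMetric bean and CSV_FILE_PATH constant from the question; imports as in the question's code):
public void addRowToCSV(PerfMetric rowData) {
    File file = new File(PerfTestMetric.CSV_FILE_PATH);
    boolean writeHeader = !file.exists(); // write the header only for a brand-new file

    try (Writer writer = new FileWriter(file, true)) { // append mode
        CustomMappingStrategy<PerfMetric> strategy = new CustomMappingStrategy<>(writeHeader);
        strategy.setType(PerfMetric.class);

        StatefulBeanToCsv<PerfMetric> beanToCsv = new StatefulBeanToCsvBuilder<PerfMetric>(writer)
                .withMappingStrategy(strategy)
                .withApplyQuotesToAll(false)
                .build();
        beanToCsv.write(rowData);
    } catch (IOException | CsvDataTypeMismatchException | CsvRequiredFieldEmptyException e) {
        e.printStackTrace();
    }
}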
I never found an answer to this question, so what I ended up doing was branching on whether the .csv file exists. If the file existed I used a MappingStrategyWithoutHeader strategy, and if it didn't yet exist, I used a MappingStrategyWithHeader strategy. Not ideal, but I got it working.
I have different plugins in my Web API project, each with its own XML docs, and one centralized Help page. The problem is that Web API's default Help Page only supports a single documentation file:
new XmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/Documentation.xml"))
How can I load documentation from several files? I want to do something like this:
new XmlDocumentationProvider("PluginsFolder/*.xml")
You can modify the installed XmlDocumentationProvider at Areas\HelpPage to do something like the following: merge the multiple XML documentation files into a single one.
Example code (missing some error checks and validation):
using System.IO;
using System.Xml.Linq;
using System.Xml.XPath;

XDocument finalDoc = null;
foreach (string file in Directory.GetFiles(@"PluginsFolder", "*.xml"))
{
    if (finalDoc == null)
    {
        finalDoc = XDocument.Load(File.OpenRead(file));
    }
    else
    {
        XDocument xdocAdditional = XDocument.Load(File.OpenRead(file));
        finalDoc.Root.XPathSelectElement("/doc/members")
                .Add(xdocAdditional.Root.XPathSelectElement("/doc/members").Elements());
    }
}

// Supply the navigator that the rest of the XmlDocumentationProvider code looks for.
_documentNavigator = finalDoc.CreateNavigator();
Kiran's solution works very well. I ended up using his approach, but created a copy of XmlDocumentationProvider called MultiXmlDocumentationProvider with an altered constructor:
public MultiXmlDocumentationProvider(string xmlDocFilesPath)
{
    XDocument finalDoc = null;
    foreach (string file in Directory.GetFiles(xmlDocFilesPath, "*.xml"))
    {
        using (var fileStream = File.OpenRead(file))
        {
            if (finalDoc == null)
            {
                finalDoc = XDocument.Load(fileStream);
            }
            else
            {
                XDocument xdocAdditional = XDocument.Load(fileStream);
                finalDoc.Root.XPathSelectElement("/doc/members")
                        .Add(xdocAdditional.Root.XPathSelectElement("/doc/members").Elements());
            }
        }
    }

    // Supply the navigator that the rest of the XmlDocumentationProvider code looks for.
    _documentNavigator = finalDoc.CreateNavigator();
}
I register the new provider from HelpPageConfig.cs:
config.SetDocumentationProvider(new MultiXmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/")));
Creating a new class and leaving the original one unchanged may be more convenient when upgrading, etc.
Rather than create a separate class along the lines of MultiXmlDocumentationProvider, I just added a constructor to the existing XmlDocumentationProvider. Instead of taking a folder name, it takes a list of strings, so you can still specify exactly which files you want to include (if there are other XML files in the same directory as the documentation XML, things might get hairy). Here's my new constructor:
public XmlDocumentationProvider(IEnumerable<string> documentPaths)
{
    if (documentPaths.IsNullOrEmpty())
    {
        throw new ArgumentNullException(nameof(documentPaths));
    }
    XDocument fullDocument = null;
    foreach (var documentPath in documentPaths)
    {
        if (documentPath == null)
        {
            throw new ArgumentNullException(nameof(documentPath));
        }
        if (fullDocument == null)
        {
            using (var stream = File.OpenRead(documentPath))
            {
                fullDocument = XDocument.Load(stream);
            }
        }
        else
        {
            using (var stream = File.OpenRead(documentPath))
            {
                var additionalDocument = XDocument.Load(stream);
                fullDocument?.Root?.XPathSelectElement("/doc/members")
                        .Add(additionalDocument?.Root?.XPathSelectElement("/doc/members").Elements());
            }
        }
    }
    _documentNavigator = fullDocument?.CreateNavigator();
}
The HelpPageConfig.cs looks like this. (Yes, it can be fewer lines, but I don't have a line limit so I like splitting it up.)
var xmlPaths = new[]
{
HttpContext.Current.Server.MapPath("~/bin/Path.To.FirstNamespace.XML"),
HttpContext.Current.Server.MapPath("~/bin/Path.To.OtherNamespace.XML")
};
var documentationProvider = new XmlDocumentationProvider(xmlPaths);
config.SetDocumentationProvider(documentationProvider);
I agree with gurra777 that creating a new class is a safer upgrade path. I started with that solution but it involves a fair amount of copy/pasta, which could easily get out of date after a few package updates.
Instead, I am keeping a collection of XmlDocumentationProvider children. For each of the implementation methods, I'm calling into the children to grab the first non-empty result.
public class MultiXmlDocumentationProvider : IDocumentationProvider, IModelDocumentationProvider
{
    private IList<XmlDocumentationProvider> _documentationProviders;

    public MultiXmlDocumentationProvider(string xmlDocFilesPath)
    {
        _documentationProviders = new List<XmlDocumentationProvider>();
        foreach (string file in Directory.GetFiles(xmlDocFilesPath, "*.xml"))
        {
            _documentationProviders.Add(new XmlDocumentationProvider(file));
        }
    }

    public string GetDocumentation(System.Reflection.MemberInfo member)
    {
        return _documentationProviders
            .Select(x => x.GetDocumentation(member))
            .FirstOrDefault(x => !string.IsNullOrWhiteSpace(x));
    }

    //and so on...
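    // Sketch of one of the remaining members, assuming the same first-non-empty
    // delegation for each interface method (HttpActionDescriptor is from
    // System.Web.Http.Controllers):
    public string GetDocumentation(HttpActionDescriptor actionDescriptor)
    {
        return _documentationProviders
            .Select(x => x.GetDocumentation(actionDescriptor))
            .FirstOrDefault(x => !string.IsNullOrWhiteSpace(x));
    }
}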
The HelpPageConfig registration is the same as in gurra777's answer,
config.SetDocumentationProvider(new MultiXmlDocumentationProvider(HttpContext.Current.Server.MapPath("~/App_Data/")));
I have a multi-site ASP.NET application in which different domains use the same pages.
All pages inherit from a base class named PageBase, which inherits from System.Web.UI.Page.
By using HttpContext.Current.Request.ServerVariables["HTTP_HOST"] I can determine the domain, then get all the info I need for that domain, and everything works fine.
My problem begins when I want a separate session-based visitor counter for each site.
Because Global.asax has the global events Session_Start and Session_End, a simple counter will count visitors to all sites together.
I tried moving the code-behind for Global.asax into a different class, but from that class I could not access the PageBase (System.Web.UI.Page) site info.
I will be very thankful for any ideas on how to solve this problem.
cheinan
I am assuming that you are able to browse from one "site" to the other within the same session and that there is only one session created.
In this case, you need to add the following to your Session:
Session["CountedHosts"] = new List<string>();
Then, on your base page, add the following:
var host = Request.ServerVariables["HTTP_HOST"];
var countedHosts = Session["CountedHosts"] as List<string>;
if (countedHosts != null && !countedHosts.Contains(host))
{
    countedHosts.Add(host);
}
Then on session end, record each host that was visited.
var countedHosts = Session["CountedHosts"] as List<string>;
if (countedHosts != null)
{
    foreach (var host in countedHosts)
    {
        //Log it
    }
}
I am not able to browse from one "site" to the other within the same session; each site has its own session.
But I am very thankful to you, because you gave me an idea of how to solve this problem. Here is what I did:
I created a counter class with a Dictionary, onlineList, in which I automatically create a key for each site:
public abstract class counter
{
    public static Dictionary<string, int> onlineList = new Dictionary<string, int>();

    // Add one to the count.
    public static void doSiteCountOn(string siteID)
    {
        if (onlineList.ContainsKey(siteID))
        {
            onlineList[siteID] += 1;
        }
        else
        {
            onlineList.Add(siteID, 1);
        }
    }

    // Subtract one from the count.
    public static void doSiteCountOff(string siteID)
    {
        if (onlineList.ContainsKey(siteID))
        {
            onlineList[siteID] -= 1;
        }
        else
        {
            onlineList.Add(siteID, 0);
        }
    }

    // Get the count.
    public static string onlineCount(string siteID)
    {
        if (onlineList.ContainsKey(siteID))
        {
            return onlineList[siteID].ToString();
        }
        else
        {
            return "0";
        }
    }

    // Reset the count if needed.
    public static void resetCount(string siteID)
    {
        if (onlineList.ContainsKey(siteID))
        {
            onlineList[siteID] = 0;
        }
    }
}
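One caveat: Session_Start and Session_End can fire concurrently, so a plain static Dictionary can be corrupted under load. If you're on .NET 4 or later, a thread-safe variant might look like this (just a sketch, using ConcurrentDictionary; the class and member names mirror the code above):
using System.Collections.Concurrent;

public abstract class safeCounter
{
    private static readonly ConcurrentDictionary<string, int> onlineList =
        new ConcurrentDictionary<string, int>();

    public static void doSiteCountOn(string siteID)
    {
        // Atomically insert 1 or increment the existing count.
        onlineList.AddOrUpdate(siteID, 1, (key, count) => count + 1);
    }

    public static void doSiteCountOff(string siteID)
    {
        // Atomically insert 0 or decrement the existing count.
        onlineList.AddOrUpdate(siteID, 0, (key, count) => count - 1);
    }

    public static string onlineCount(string siteID)
    {
        int count;
        return onlineList.TryGetValue(siteID, out count) ? count.ToString() : "0";
    }
}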
On my base page I check whether Session["siteID"] exists; if not, I set it and have my counter class add 1 to that site's counter:
if (Session["siteID"] == null)
{
    Session["siteID"] = _siteID;
    counter.doSiteCountOn(_siteID);
}
And finally, in Session_End, I subtract one from the count:
void Session_End(object sender, EventArgs e)
{
    if (Session["siteID"] != null)
    {
        counter.doSiteCountOff(Session["siteID"].ToString());
    }
}
Thanks for your help, and sorry for my late response.
cheinan
We have an ASP.NET site that redirects you to a URL containing a session id, like this:
http://localhost/(S(f3rjcw45q4cqarboeme53lbx))/main.aspx
This id is unique for every request.
Is it possible to test this site using a standard Visual Studio 2008/2010 web test? How can I provide the test with this data?
I have to call a couple of different pages using that same id.
Yes, it is relatively easy to do. You will need to create a coded web test, however.
In my example we have a login post that returns the URL, including the session string.
Just after we yield the login post request (request3) to the enumerator, I call the following:
WebTestRequest request3 = new WebTestRequest((this.Context["WebServer1"].ToString() + "/ICS/Login/English/Login.aspx"));
//more request setup code removed for clarity
yield return request3;
string responseUrl = Context.LastResponse.ResponseUri.AbsoluteUri;
string cookieUrl = GetUrlCookie(responseUrl, this.Context["WebServer1"].ToString(),"/main.aspx");
request3 = null;
Where GetUrlCookie is something like this:
public static string GetUrlCookie(string fullUrl, string webServerUrl, string afterUrlPart)
{
    string result = fullUrl.Substring(webServerUrl.Length);
    result = result.Substring(0, result.Length - afterUrlPart.Length);
    return result;
}
Once you have the session cookie string, you can easily substitute it into any subsequent request/post URLs, e.g.:
WebTestRequest request4 = new WebTestRequest((this.Context["WebServer1"].ToString() + cookieUrl + "/mySecureForm.aspx"));
I apologise for my code being so rough, but it was deprecated in my project and is pulled from the first version of the codebase - and for saying it was easy :)
For any load testing, depending on your application, you may have to come up with a stored procedure to call to provide distinct login information each time the test is run.
Note: because the response URL cannot be determined ahead of time, you will have to temporarily turn off the URL validation rule event handler for the login post. To do this, I store the validation rule event handler in a local variable:
ValidateResponseUrl validationRule1 = new ValidateResponseUrl();
urlValidationRuleEventHandler = new EventHandler<ValidationEventArgs>(validationRule1.Validate);
I can then turn it off and back on as required:
this.ValidateResponse -= urlValidationRuleEventHandler;
this.ValidateResponse += urlValidationRuleEventHandler;
The alternative is to code your own validation rule, such as this one (reflectored from the Visual Studio code and changed to be case-insensitive):
class QueryLessCaseInsensitiveValidateResponseUrl : ValidateResponseUrl
{
    public override void Validate(object sender, ValidationEventArgs e)
    {
        Uri uri;
        string uriString = string.IsNullOrEmpty(e.Request.ExpectedResponseUrl)
            ? e.Request.Url
            : e.Request.ExpectedResponseUrl;
        if (!Uri.TryCreate(e.Request.Url, UriKind.Absolute, out uri))
        {
            e.Message = "The request URL could not be parsed";
            e.IsValid = false;
        }
        else
        {
            Uri uri2;
            string leftPart = uri.GetLeftPart(UriPartial.Path);
            if (!Uri.TryCreate(uriString, UriKind.Absolute, out uri2))
            {
                e.Message = "The request URL could not be parsed";
                e.IsValid = false;
            }
            else
            {
                // This strips the query string from the expected URL.
                uriString = uri2.GetLeftPart(UriPartial.Path);
                Uri uritemp = new Uri(uriString);
                if (uritemp.Query.Length > 0)
                {
                    string fred = "There is a problem";
                }
                // Changed to ignore case.
                if (string.Equals(leftPart, uriString, StringComparison.OrdinalIgnoreCase))
                {
                    e.IsValid = true;
                }
                else
                {
                    e.Message = string.Format(
                        "The value of the ExpectedResponseUrl property '{0}' does not equal the actual response URL '{1}'. QueryString parameters were ignored.",
                        new object[] { uriString, leftPart });
                    e.IsValid = false;
                }
            }
        }
    }
}