Can Crawler4j interpret wildcards using asterisks (*) in robots.txt? - wildcard

I want to be able to block web-crawlers from accessing pages other than page1.
The following should be able to block all directories/file names containing the word page. So something like /localhost/myApp/page2.xhtml should be blocked.
#Disallow: /*page
The following should enable all directories/file names containing page1 to be accessible. So something like /localhost/myApp/page1.xhtml should not be blocked.
#Allow: /*page1
The problem is that crawler4j seems to be ignoring the asterisk, which is used for wildcards. Is something wrong with my robots.txt, or is the asterisk something crawler4j does not interpret by default?

I looked through the crawler4j source code, and it looks like crawler4j does not support wildcards in Allow or Disallow, except in the special case where the asterisk is the last character in the directive (and then the asterisk is ignored anyway).
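
For comparison, here is a minimal sketch of how a wildcard-aware matcher could evaluate the two rules from the question. It is not crawler4j code. It assumes the longest-match precedence described in RFC 9309, where the longest matching pattern wins and Allow wins a tie, and the names pattern_to_regex, is_allowed, and RULES are made up for illustration.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where the asterisks were.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """Decide whether path may be crawled.

    rules is a list of (directive, pattern) pairs, e.g. ("Disallow", "/*page").
    The longest matching pattern wins, Allow wins a tie, and a path that
    matches no rule is allowed.
    """
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# The two rules from the question (assumed to apply to the crawler's user agent).
RULES = [("Disallow", "/*page"), ("Allow", "/*page1")]
```

Under these assumptions, is_allowed("/myApp/page1.xhtml", RULES) is true because /*page1 is the longer match, while /myApp/page2.xhtml matches only the Disallow rule and is blocked.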


Ratpack: Prefix binding with multiple "components" in past binding

I've got a Ratpack application and I'm trying to configure an endpoint for GET requests which delegates to a handler based on a prefix. However, my URL path may contain multiple slashes, which I'd want to capture and use in my handler.
Example:
Given the requests http://localhost:8080/myEndpoint/foo/bar/ and http://localhost:8080/myEndpoint/baz/qux/fred/, I'd like to delegate to my handler MyEndpointHandler in both cases since the request paths are prefixed by myEndpoint. Inside MyEndpointHandler, I'd need to retrieve /foo/bar and /baz/qux/fred/ respectively, since these are the remaining parts of the path after the /myEndpoint prefix.
My original thinking was to do something like:
chain.path("myEndpoint/:restOfPath?", new MyEndpointHandler());
However this only seems to work for one slash (i.e. a request to http://localhost:8080/myEndpoint/foo is fine and I have access to /foo, but a request to http://localhost:8080/myEndpoint/foo/bar returns a 404).
I have also tried:
chain.prefix("myEndpoint", c1 -> c1.path(new MyEndpointHandler()));
However this returned a 404 for all requests regardless of the path after /myEndpoint.
Looking at the docs, https://ratpack.io/manual/current/api/ratpack/core/path/PathBinding.html#getPastBinding() seems like exactly what I need to be able to get the /foo/bar part of these requests, but I'm struggling to bind my handler to the chain in the right way to be able to access this "past binding".
Thanks in advance for the help!

Should a WebDAV server support query strings?

Should a WebDAV server support query strings?
I have not found a clear statement about this in RFC 4918.
Background is as follows:
I have a WebDAV server where the path in the URL is mapped quasi 1:1 to the path to the resource in the file system. I.e. to get to the resource I need to know the path, something like this:
Variant 1:
http://<webdavserver>:<port>/folder1/subfolder1/anotherfolder/resource.txt
Now I have another client that doesn't know the path, but only two IDs (RepositoryId and DocumentId), which also uniquely identify the resource. By searching for the two IDs, the WebDAV server can also find the resource and return it.
Until now, this was solved in such a way that instead of the path in the URL, the two IDs were specified as a query string, i.e. something like this:
Variant 2:
http://<webdavserver>:<port>/?repoId=123&docId=456
Somehow this feels wrong ...
Well, actually the identification via the two IDs is just an alternative representation of the path, isn't it? So something like this should work too:
Variant 3:
http://<webdavserver>:<port>/<repoId>/<docId>
http://<webdavserver>:<port>/123/456
This feels more "WebDAV-like" ...
I would only need to be able to distinguish on the server side which of the two URL representations is arriving there, path or ID.
Possibly via a header, something like
X-ResourcePath: Path | Id (Default would be Path)
What do you think?
Should I stay with variant 2, or rather switch to variant 3, or ...? (I have to reimplement it anyway, so "Do not change a running system" would not be a valid argument :-))

IMHO: it really doesn't matter. Nothing prohibits mixing query parameters into WebDAV URI trees. You just need to make sure that the clients you support will work with this.
(I would advise against moving identifying data into custom request header fields; this is what the URI is for.)

When I implemented a WebDAV server, I was primarily interested in supporting existing clients, and I found that most of them (notably Microsoft Office) did not support query strings in any way.
I ended up using the following format which seems to work for all clients:
protocol://server/id/title.extension

Regex for fixing URL patterns

I have the following url structure:
http://www.xyxyxyxyx.com/ShowProduct.aspx?ID=334
http://www.xyxyxyxyx.com/ShowProduct.aspx?ID=1094
and so on..
Recently I used IIS rewrite to rewrite this structure as
http://www.xyxyxyxyx.com/productcategory/334/my-product-url
http://www.xyxyxyxyx.com/productcategory/1094/some-other-product-url
and so on..
This works fine.
I want to create another rule so that if an invalid URL request comes in with the following structure:
http://www.xyxyxyxyx.com/productcategory/ShowProduct.aspx?ID=334
the 'productcategory' part should be removed from the url and the url should look like
http://www.xyxyxyxyx.com/ShowProduct.aspx?ID=334
How do I write this rule?

It may vary depending on what you are using to apply the regex, but here's a basic one:
's|productcategory/||'
If you want to make sure it only does this when the xyxyxyxyx URL is present, this should work:
's|^http://www\.xyxyxyxyx\.com/productcategory/|http://www.xyxyxyxyx.com/|'
Edit: Ah, so if productcategory could be any category, then you'll need to match around it, like so:
's|^http://www\.xyxyxyxyx\.com/.*/ShowProduct|http://www.xyxyxyxyx.com/ShowProduct|'
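
To check the last pattern outside of any particular rewrite engine, here is a small sketch using Python's re module. The function name strip_category is made up for illustration. An IIS rewrite rule normally matches against the path rather than the full URL, so the pattern would need adapting there.

```python
import re

# Same idea as the last sed-style expression: drop whatever sits between
# the host and ShowProduct.
PATTERN = re.compile(r"^http://www\.xyxyxyxyx\.com/.*/ShowProduct")

def strip_category(url):
    """Remove any category segment(s) in front of ShowProduct.aspx."""
    return PATTERN.sub("http://www.xyxyxyxyx.com/ShowProduct", url, count=1)
```

A URL that already has the correct form, or one of the rewritten /productcategory/334/... URLs, does not match the pattern and is returned unchanged.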

Get the host name from url without www or extension in asp.net

Hello, I need a way to find the host part of a URL. I've tried
Request.Url.Host.Split('.')
but it doesn't work with URLs like these:
sub.sub.domain.com
or
www.domain.co.uk
since you can have a variable number of dots before and after the domain.
I need to get only "domain".

Check out the second answer at Get just the domain name from a URL?
I checked the pastebin link; it's active. I didn't test the code myself, but if it outputs as he describes, you can .split() from there.

If you need to be totally flexible, you need to make a list of all possible top-level domains and try to remove each one, with its leading dot, from the end of your string, resulting in
www.domain
or
sub.sub.domain
Then take the characters after the last dot.
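
The suffix-stripping approach can be sketched roughly as follows. This is only an illustration in Python rather than ASP.NET code, and KNOWN_SUFFIXES is a deliberately tiny, hypothetical list; a real one would need to be far longer.

```python
# Hypothetical, deliberately tiny suffix list for illustration only.
KNOWN_SUFFIXES = ["co.uk", "com", "org", "net"]

def registrable_label(host):
    """Return the label just before a known suffix, or None if no suffix matches."""
    host = host.lower().rstrip(".")
    # Try longer suffixes first so "co.uk" is tried before any shorter suffix.
    for suffix in sorted(KNOWN_SUFFIXES, key=len, reverse=True):
        if host.endswith("." + suffix):
            # Cut off the suffix together with its leading dot.
            remainder = host[: -(len(suffix) + 1)]
            # Keep only what follows the last remaining dot.
            return remainder.rsplit(".", 1)[-1]
    return None
```

With this list, both sub.sub.domain.com and www.domain.co.uk yield "domain", and a host with no listed suffix yields None.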

ASP.NET routing: Literal sub-segment between tokens, and route values with a character from the literal sub-segment

The reason I'm asking is because IIS protects certain ASP.NET folders, like Bin, App_Data, App_Code, etc. Even if the URL does not map to an actual file system folder IIS rejects a URL with a path segment equal to one of the mentioned names.
This means I cannot have a route like this:
{controller}/{action}/{id}
... where id can be any string e.g.
Catalog/Product/Bin
So, instead of disabling this security measure, I'm willing to change the route, using a delimiter before the id, like these:
{controller}/{action}_{id} // e.g. Catalog/Product_Bin
{controller}/{action}/_{id} // e.g. Catalog/Product/_Bin
But these routes won't work if the id contains the new delimiter, _ in this case, e.g.
// These URLs won't work (I get a 404 response)
Catalog/Product_Bin_
Catalog/Product/_Bin_
Catalog/Product/__Bin
Why? I don't know; it looks like a bug to me. How can I make these routes work, where id can be any string?

Ok, I have a definitive answer. Yes, this is a bug. However, at this point I regret to say we have no plans to fix it, for a couple of reasons:
It's a breaking change, and could be a very hard-to-notice one at that.
There's an easy workaround.
What you can do is change the URL to not have the underscore:
{controller}/{action}/_{id}
Then add a route constraint that requires that the ID parameter starts with an underscore character.
Then within your action method you trim off the underscore prefix from the id parameter. You could even write an action filter to do this for you if you liked. Sorry for the inconvenience.
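
The constraint-and-trim logic can be sketched in a language-neutral way as follows. This is Python, not ASP.NET routing code, and it assumes the route hands over the whole last segment including its leading underscore. ID_CONSTRAINT and extract_id are hypothetical names.

```python
import re

# Assumed constraint: the captured id must start with an underscore.
ID_CONSTRAINT = re.compile(r"^_")

def extract_id(raw_segment):
    """Return the real id, or None when the constraint rejects the segment."""
    if not ID_CONSTRAINT.match(raw_segment):
        return None
    # Drop exactly one leading underscore; any others belong to the id itself.
    return raw_segment[1:]
```

Because only the first underscore is removed, segments such as _Bin_ and __Bin map to the ids Bin_ and _Bin.
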
You can use characters that are not allowed for a directory or file name like: *,?,:,",<,>,|.
With ASP.NET MVC, if you look at the source, there is a hard-coded value for the path separator (/), which to my knowledge cannot be changed.
