How to have multiple Context selectors? - web-scraping

I'm currently working on a scraper,
and a lot of my current code looks like this:
contextSelector = 'a[href^="/clubs-and-societies/academic/';
(this works)
However, there are multiple pages to select, and assigning several contextSelectors one after another does not work:
contextSelector = 'a[href^="/clubs-and-societies/academic/';
contextSelector = 'a[href^="/clubs-and-societies/culture/';
contextSelector = 'a[href^="/clubs-and-societies/dance/';
(not working)
What is the solution?

Two Points:
1) Your syntax is not well-formed. It should be:
contextSelector = 'a[href^="/clubs-and-societies/academic/"]';
(Note the closing double quote " and the closing square bracket ].)
2) The logical OR in CSS is a comma ,:
contextSelector = 'a[href^="/clubs-and-societies/academic/"], a[href^="/clubs-and-societies/culture/"], a[href^="/clubs-and-societies/dance/"]';
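If the list of sections grows, the comma-separated selector can be generated instead of typed by hand. The following is only a sketch in Python (the question itself uses JavaScript); the helper name build_context_selector and the CATEGORIES list are hypothetical, with the three values taken from the question.

```python
# Hypothetical category list; the three values come from the question.
CATEGORIES = ["academic", "culture", "dance"]


def build_context_selector(categories, base="/clubs-and-societies/"):
    """Return one CSS selector list matching links under any category."""
    parts = [f'a[href^="{base}{name}/"]' for name in categories]
    # A comma-separated selector list matches an element if ANY part matches.
    return ", ".join(parts)


context_selector = build_context_selector(CATEGORIES)
print(context_selector)
```

Each part keeps its own closing double quote and closing square bracket, so the result has the same shape as the hand-written selector above.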

Related

shQuote() breaks system() command rather than escaping spaces

I'm running the GCMS software AMDIS inside an R script to analyze GCMS data. The basic syntax is:
system("AMDIS_PATH GCMS_FILE_PATH /S /GC /GE /E")
With "/S /GC /GE /E" referring to functionalities within the AMDIS program.
I want to make the code 'fool-proof', so I'm allowing paths with "/" instead of "\\". I'm also allowing spaces in the paths, escaping them with shQuote(..., type = "cmd").
This means the full code looks like:
system(paste(shQuote(gsub("/", "\\", AMDIS_PATH, fixed = T), type = 'cmd'), shQuote(gsub("/", "\\", GCMS_FILE_PATH, fixed = T), type = 'cmd'), "/S /GD /GS /E"))
This code does not work (output = 65535), but there's a catch:
If I remove the shQuote() call from my GCMS_FILE_PATH, everything works as it should (so long as the FILE_PATH does not contain any spaces).
So executing the code like this does work:
system(paste(shQuote(gsub("/", "\\", AMDIS_PATH, fixed = T), type = 'cmd'), gsub("/", "\\", GCMS_FILE_PATH, fixed = T), "/S /GD /GS /E"))
Any idea why shQuote works fine for the AMDIS_PATH, but not for the GCMS_FILE_PATH? And how I can run my code and still escape spaces in the GCMS_FILE_PATH?

How do I use regex to find FS.File on FS.Collection in meteor

How do I use a regex to find an FS.File in an FS.Collection in Meteor? My code is as follows, and it is not working:
partOfFileName = "*User_" + clickedResellerId + "_*";
var imgs = Images.find({fileName:{$regex:partOfFileName}});
//var imgs = Images.find();
return imgs // Where Images is an FS.Collection instance
In place of fileName I've also tried name, and that does not work either. Please help.
I don't think your regex is valid. Did you perhaps mean the following?
partOfFileName = ".*User_" + clickedResellerId + "_.*";
Please note that POSIX wildcard notation is different from regular expressions. In regular expressions, the * operator indicates repetition of the preceding element (in my case a ., i.e., any character). A * by itself has no meaning, and it doesn't mean "anything" as it does in POSIX wildcards.
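The difference can be checked quickly. This sketch uses Python's re module as a stand-in for the regex engine behind $regex, so details may differ; the helper names and the sample file names are made up for illustration.

```python
import re


def is_valid_regex(pattern):
    """Return True if the pattern compiles in Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def file_name_pattern(reseller_id):
    # re.escape guards against ids that contain regex metacharacters.
    return ".*User_" + re.escape(str(reseller_id)) + "_.*"


# Hypothetical file names, for illustration only.
names = ["logo_User_42_small.png", "logo_User_421_small.png", "banner.png"]

pattern = file_name_pattern(42)
matches = [n for n in names if re.search(pattern, n)]

print(is_valid_regex("*User_42_*"))  # wildcard-style pattern: leading * has nothing to repeat
print(is_valid_regex(pattern))
print(matches)
```

Because an unanchored search matches anywhere in the string, the leading and trailing .* are optional here; "User_42_" alone would select the same names.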

R: XPath expression returns links outside of selected element

I am using R to scrape the links from the main table on that page, using XPath syntax. The main table is the third on the page, and I want only the links to magazine articles.
My code follows:
require(XML)
(x = htmlParse("http://www.numerama.com/magazine/recherche/125/hadopi/date"))
(y = xpathApply(x, "//table")[[3]])
(z = xpathApply(y, "//table//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href"))
(links = unique(z))
If you look at the output, the final links do not come from the main table but from the sidebar, even though I selected the main table in my third line by asking object y to include only the third table.
What am I doing wrong? What is the correct/more efficient way to code this with XPath?
Note: XPath novice writing.
Answered (really quickly), thanks very much! My solution is below.
extract <- function(x) {
message(x)
html = htmlParse(paste0("http://www.numerama.com/magazine/recherche/", x, "/hadopi/date"))
html = xpathApply(html, "//table")[[3]]
html = xpathApply(html, ".//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")
html = gsub("#ac_newscomment", "", html)
html = unique(html)
}
d = lapply(1:125, extract)
d = unlist(d)
write.table(d, "numerama.hadopi.news.txt", row.names = FALSE)
This saves all links to news items with keyword 'Hadopi' on this website.
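You need to start the pattern with . if you want to restrict the search to the current node.
A leading / (or //) goes back to the start of the document, even when the expression is applied to a node such as y.
xpathSApply(y, ".//a/@href")
Alternatively, you can select the third table directly with XPath. Note the parentheses: (//table)[3] is the third table in the document, whereas //table[3] matches any table that is the third table child of its parent.
xpathApply(x, "(//table)[3]//a[contains(@href,'/magazine/') and not(contains(@href, '/recherche/'))]/@href")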
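The effect of searching relative to a selected node can be reproduced on a toy document. This is a sketch in Python using the standard library's xml.etree.ElementTree, not the R XML package from the question; ElementTree supports only a subset of XPath (no contains()), so the href filter is written in Python. The markup and the helper wanted() are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the page: three tables, the third is the "main" one.
DOC = """
<html><body>
  <table><tr><td><a href="/recherche/1">search</a></td></tr></table>
  <table><tr><td><a href="/magazine/sidebar-item">sidebar</a></td></tr></table>
  <table><tr><td>
    <a href="/magazine/article-1">one</a>
    <a href="/magazine/recherche/2">next page</a>
  </td></tr></table>
</body></html>
"""

root = ET.fromstring(DOC)


def wanted(href):
    """Keep magazine links that are not search-result links."""
    return "/magazine/" in href and "/recherche/" not in href


# Searching the whole document picks up the sidebar link as well.
all_links = [a.get("href") for a in root.findall(".//a") if wanted(a.get("href"))]

# Select the third table in document order, then search relative to it.
third_table = root.findall(".//table")[2]
table_links = [a.get("href") for a in third_table.findall(".//a") if wanted(a.get("href"))]

print(all_links)
print(table_links)
```

The second list contains only links from inside the third table, which is what the leading . achieves in the R code.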

Find word (not containing substrings) in comma separated string

I'm using a LINQ query where I do something like this:
viewModel.REGISTRATIONGRPS = (From a In db.TABLEA
Select New SubViewModel With {
.SOMEVALUE1 = a.SOMEVALUE1,
...
...
.SOMEVALUE2 = If(commaseparatedstring.Contains(a.SOMEVALUE1), True, False)
}).ToList()
Now my problem is that this doesn't search for words but for substrings. For example:
commaseparatedstring = "EWM,KI,KP"
SOMEVALUE1 = "EW"
It returns True because "EW" is contained in "EWM".
What I need is to find whole words (not substrings) in the comma-separated string.
Option 1: Regular Expressions
Regex.IsMatch(commaseparatedstring, @"\b" + Regex.Escape(a.SOMEVALUE1) + @"\b")
The \b parts are called "word boundaries" and tell the regex engine that you are looking for a "full word". The Regex.Escape(...) ensures that the regex engine will not try to interpret "special characters" in the text you are trying to match. For example, if you are trying to match "one+two", the Regex.Escape method will return "one\+two".
Also, be sure to import the System.Text.RegularExpressions namespace at the top of your code file.
See Regex.IsMatch Method (String, String) on MSDN for more information.
Option 2: Split the String
You could also try splitting the string which would be a bit simpler, though probably less efficient.
commaseparatedstring.Split(new Char[] { ',' }).Contains( a.SOMEVALUE1 )
What about:
- separating the commaseparatedstring by comma
- calling equals() on each substring instead of contains() on whole thing?
.SOMEVALUE2 = If(commaseparatedstring.Split(',').Contains(a.SOMEVALUE1), True, False)
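To see the three approaches side by side, here is a small sketch in Python rather than VB.NET or C#; the function names are hypothetical, and the sample data is the one from the question.

```python
import re


def contains_substring(csv, value):
    # The original approach: plain substring test.
    return value in csv


def contains_word_split(csv, value):
    # Split on commas and compare whole items.
    return value in csv.split(",")


def contains_word_regex(csv, value):
    # Word boundaries around the escaped value.
    return re.search(r"\b" + re.escape(value) + r"\b", csv) is not None


csv = "EWM,KI,KP"
for value in ("EW", "KI"):
    print(value,
          contains_substring(csv, value),
          contains_word_split(csv, value),
          contains_word_regex(csv, value))
```

The substring test reports "EW" as present, while the split and word-boundary versions do not. Splitting is probably the safer choice if items can contain characters that \b does not treat as part of a word.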

QRegExp: individual quantifiers can't be non-greedy, but what good alternatives then?

I'm trying to write code that appends ending _my_ending to the filename, and does not change file extension.
Examples of what I need to get:
"test.bmp" -> "test_my_ending.bmp"
"test.foo.bar.bmp" -> "test.foo.bar_my_ending.bmp"
"test" -> "test_my_ending"
I have some experience with PCRE, and this is a trivial task using it. Because of my lack of experience with Qt, I initially wrote the following code:
QString new_string = old_string.replace(
QRegExp("^(.+?)(\\.[^.]+)?$"),
"\\1_my_ending\\2"
);
This code does not work (no match at all), and then I found in the docs that
Non-greedy matching cannot be applied to individual quantifiers, but can be applied to all the quantifiers in the pattern
As you see, in my regexp I tried to reduce greediness of the first quantifier + by adding ? after it. This isn't supported in QRegExp.
This is really disappointing for me, and so, I have to write the following ugly but working code:
//-- write regexp that matches only filenames with extension
QRegExp r = QRegExp("^(.+)(\\.[^.]+)$");
r.setMinimal(true);
QString new_string;
if (old_string.contains(r)){
//-- filename contains extension, so, insert ending just before it
new_string = old_string.replace(r, "\\1_my_ending\\2");
} else {
//-- filename does not contain extension, so, just append ending
new_string = old_string + "_my_ending";
}
But is there some better solution? I like Qt, but some things that I see in it seem to be discouraging.
How about using QFileInfo? This is shorter than your 'ugly' code:
QFileInfo fi(old_string);
QString new_string = fi.completeBaseName() + "_my_ending"
+ (fi.suffix().isEmpty() ? "" : ".") + fi.suffix();
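The same idea (split off the last extension with a path utility instead of a regex) can be sketched outside Qt. This Python version uses os.path.splitext as a rough analogue of QFileInfo; the function name add_ending is hypothetical, and edge cases such as names starting with a dot may be handled differently from Qt.

```python
import os.path


def add_ending(filename, ending="_my_ending"):
    """Insert `ending` before the last extension, if there is one."""
    stem, ext = os.path.splitext(filename)  # ext is "" when there is no extension
    return stem + ending + ext


for name in ("test.bmp", "test.foo.bar.bmp", "test"):
    print(name, "->", add_ending(name))
```

Unlike completeBaseName() in the snippet above, this keeps any directory part of the input unchanged.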
