Insert unicode strings into CleverCSS - css

How can one insert a Unicode string CSS into CleverCSS?
In particular, how could one produce the following CSS using CleverCSS:
li:after {
content: "\00BB \0020";
}
I've figured out CleverCSS's parsing rules, but suffice that the permutations I've thought sensible have failed, for example:
li:
content: "\\00BB \\0020" // becomes content: 'BB 0'
EDIT: My other examples and the rest of my post weren't saved. Suffice to say that I had a longer list of examples that's missing.
I'd be grateful for any thoughts and input.
Brian
EDIT: I noted that inserting the unicode was one of the problems (once you start uploading CSS with utf-8 encoding it's fine). The wrapping of quote characters is another, which I solved that with something crazy likeso:
content: "'".string() + " ".string() ».string() + "'".string()
Hope that helps someone else.

This may be silly, but why still bother with escape sequences when you can just type/paste the actual characters? "A CSS style sheet is a sequence of characters from the Universal Character Set".
That is a lot easier on the eye, and is especially useful when maintaining existing code.
Or is CleverCSS not Unicode-enabled?

In looking at the code (CleverCSS 0.1) it would appear that the partial regular expression _r_string (defined on line 414) is where you would need to start. This is used to define several other REs, including _string_re which is used in the parsing rules (line 1374). This leads us to process_string() (line 1359) which looks like it was meant to accept Unicode.
Unfortunately, hand-built parsers tend to get a bit strange and the code is not exactly swimming in comments. If you really need to do this, I would focus on process_string() and put a bunch of before/after print statements in there and see if you can understand the goes-intos and goes-outofs.
You might also try bribing the original author with beer or ??? Good luck.

Related

How to preserve white space at the start of a line in .Rd documentation?

I need to indent some math stuff in the \details section of my .Rd documentation to enhance its readability. I am using mathjaxr. Is there any way to indent without installing roxygen2 or similar?
The math stuff is inline, so simply setting to display using \mjdeqn won't solve this.
I seem to have a reasonable "cheating" work around for indenting the first line using mathjaxr, at least for the PDF and HTML output.
We need to do two things:
Use the mathjax/LaTeX phantom command. phantom works by making a box of the size necessary to type-set whatever its argument is, but without actually type-setting anything in the box. For my purposes, if I want to indent, say, about 2 characters wide, I would start the line with a \mjeqn{\phantom{22}}{ } and following with my actual text, possibly including actual mathy bits. If I want an indent of, say, roughly 4 characters wide, I might use \mjeqn{\phantom{2222}}{ }.
Because mathjaxr has a problem with tacking on unsolicited new lines when starting a line with mjeqn, we need to prefix the use of phantom in 1 above with an empty bit of something non-mathjaxr-ish like \emph{}.
Putting it all together, I can indent by about 2 characters using something like this:
\emph{}\mjeqn{\phantom{22}}Here beginneth mine indented line…
I need to explore whether the { } business actually indents for ASCII output, or whether I might accomplish that using or some such.

Cleaning forum post with multiple quotations in rvest + stringr

I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text
The clean text should be each user's post, without the quotations if they are replying to anyone. if the post is not a reply, I would leave it as NA. The following is an invented post, with invented user, to illustrate what I have managed to do so far:
post<-"Meow1 wrote: »\noday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"
Then I try to pull out the quoted user (Meow1) and it works:
QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA)
QuotedUser_1
[1] "Meow1"
Then I created this codes for pulling out the quoted text, and the clean text:
Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)
It works when there is only one quoted text, but otherwise, it only gives the last quoted bit (in the example, 'And after all, you´re my wonderwall')
And same for the clean text, it only returns the last reply:
Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)
If anyone has a suggestion to improve the code, so that I can have a vector with all the quotations, and a vector with all the replies, I would be very grateful...
Cheers
Are you sure you cannot scrape the author and text information separately? Without a source it's difficult to know, but I guess they can be obtained by different css-selectors making it much easier to split the data.
If not, it might be helpful to look into str_locate_all which allows you to locate all occurences of e.g. "wrote:" and split the string accordingly.

I am receiving \u200b in certain translations from the google-translate api

I am new to using the Google translate API and during testing we noticed that for some translations (I have not been able to find a pattern yet) we get \u200b characters in the response. That results in a lot of issues and above all it does not seem to server any purpose or make any sense. As simple example:
https://www.googleapis.com/language/translate/v2?key=YOURKEY&source=NL&target=EN&q=Hergeneer%20verkopen
returns:
{
"data": {
"translations": [
{
"translatedText": "Sell \u200b\u200bHerge Down"
}
]
}
}
Our software stumbles over these \u200b strings/characters and I have not found a way to prevent them or get rid of them.
Please read the documentation of the JSON format: https://json.org/
A string is a sequence of zero or more Unicode characters.
A char is either any Unicode character except " or \ or control-character,
[...]
or it is \u followed by four hex-digits.
We are in this last case, \u followed by four hex-digits, and it represents a Unicode character: Unicode Character 'ZERO WIDTH SPACE' (U+200B). It even has its own Wikipedia page: Zero-width space. And its Stack Overflow question: What's HTML character code 8203?.
Now, there are plenty Unicode characters with special behaviors, and this is one of those, an invisible one among others. So you need to be aware of how Unicode works, and you should sanitize input/output from third-parties API (and from user inputs as well).
Just define the list of characters that you actually want to support, and be sure to strip or filter out all the other ones. For instance, if you desire to support NL and EN, then you could strip what is outside the Latin script in Unicode.
Stripping the U+200B that you're encountering and other undesirable characters may save you from potential surprises like with:
big characters ⎲⎳
zalgo characters C̨̦̺̩̲̥͉̭͚̜̻̝̣̼͙̮̯̪o̴̡͇̘͎̞̲͇̦̲͞͡m̸̩̺̝̣̹̱͚̬̥̫̳̼̞̘̯͘ͅẹ͇̺̜́̕͢
invisible characters
emojis 👨‍👩‍👧‍👦#️⃣🏳️‍🌈

How to represent markdown properly with escaping and line breaks?

I'm currently trying to build a chat app, using the official markdown package as well as underscore's escape function, and my template contains something like this:
<span class="message-content">
{{#markdown}}{{text}}{{/markdown}}
</span>
When I grab the text from the chat input box, I try to escape any HTML and then add in line breaks. safeText is then inserted into the database and displayed in the above template.
rawText = $("#chat-input-textbox").val();
safeText = _.escape(rawText).replace(/(?:\r\n|\r|\n)/g, '\n');
The normal stuff like headings, italics, and bold looks okay. However, there are two major problems:
Code escape issue - With the following input:
<script>alert("test")</script>
```
alert('hello');
```
This is _italics_!
Everything looks fine, except the alert('hello'); has become alert('hello'); instead. The <pre> blocks aren't rendering the escaped characters, which makes sense. But the problem is, the underscore JS escape function escapes everything.
SOLVED: Line break Issue - With the following input:
first
second
third
I get first second third being displayed with no line breaks. I understand this could be a markdown thing. Since I believe you need an empty line between paragraphs to get linebreaks in markdown. But having the above behaviour would be the most ideal, anyone know how to do this?
UPDATE Line break issue has been solved by adding an extra \n to my regex. So now I'm making sure that any line break will be represented with at least two \n characters (i.e. \n\n).
You should check the showdown docs and the wiki article they have on the topic.
The marked npm package, which is used by Telescope removes disallowed-tags. These include <script> of course. As the article I linked to above explains, there's still another problem with this:
<a href='javascript:alert("kidding! Im more the world domination kinda guy. Muhahahah")'>
click me for world peace!
</a>
Which isn't prevented by marked. I'd follow the advice of the author and use a HTML sanitation library. Like OWASP's ESAPI or Caja's html-sanitizer. Both of these project's seem outdated dough. I also found a showdown extension for it called showdown-xss-filter. So my advice is to write your own helper, and use showdown-xss-filter.

Word count of a string

How to count the words in a document, get the result same as the result of MS OFFICE?
In theory you'd first have to define what you see as a word (see also Jason Williams' post). Then you open the document with whatever language you're planning to use for this. You translate the document from Microsoft's proprietary format to something nice and clean.
Then its simply a matter of counting the occurrences of the afore mentioned word definition.
The hard part here will be the parsing of the office document. Luckily for you, Microsoft has relceased their proprietary format specification!
Its a bit long winded, but perhaps you can find somebody who has done the hard work for you, or you can try doing it from scratch.
Alternatively, if you're willing to reveal what language you're planning on using and what operating system, things can be a lot easier (if you're on Windows and have Office installed, for example, you can use OLE plug-ins.)
Also, have a look at this blog post about that format of Office documents featuring some helpful information (courtesy of google)
Without knowing your environment all I can tell you is that you would need to implement something like this:
Take the entire document as a string.
Split the string on whitespace.
The number of items in the resulting sequence will be the number of words in the document.
Basic word splitting uses whitespace and punctuation (.,?!"'- etc - indeed any non-alphanumeric or character usually) characters to split the words.
Make sure you skip sequences of punctuation/whitespace instead of counting extra "words" between them.
You will have to decide whether numbers are "words" or not. And whether "$123,456.78" is one word or three.
You may also want to apply other rules - for example, if you are looking for words in source code, you may wish to treat +-=*/()&^%$ characters as "whitespace". If you have identifiers in camelCase or PascalCase styles, you may want to take the "words" you have found and check if they have uppercase characters in the middles or the words.
Fundamentally, it's an easy problem - you just have to decide what a "word" is. You can be as simple or as complicated as you like about it.
The best way to get the same word count as Office would be to use macros or automation to use MS Word to load the text and calculate the word count.
If you take the whole document as a String, this code (in java) may work for you:
private int wordCount(String str){
String[] words = str.trim().split("\\s+");
for (int i = 0; i < words.length; i++) {
words[i] = words[i].replaceAll("[^\\w]", "");
}
return words.length;
}

Resources