r-markdown: German quotation marks break bold text in HTML document

When German quotation marks („ and “, or the HTML entities &bdquo; and &ldquo;; see https://unicode-table.com/de/201E/ and https://unicode-table.com/de/201C/) appear inside bold text markers **...**, pandoc does not render the text bold when I knit in RStudio. Even worse, the ** markers are printed verbatim in the HTML document.
Example:
---
output: html_document
lang: de
---
This is a **„Test“**.
Another **„Test“**.
This **"just works"**.
Are there any pandoc options or workarounds for solving this problem?
Note that a similar question was answered for PDF output in r-markdown: German quotation marks. But I need HTML output.

The issue tracking input of localized quotes is https://github.com/jgm/pandoc/issues/661.
Meanwhile, I recommend using non-typographic quotes (") and, for HTML output, the --html-q-tags option plus some CSS, like:
q {
quotes: '„' '“';
}

My workaround: I used the command line tool sed and regular expressions.
First, modify the .Rmd (or .md) file and replace all the German typographic quotation marks with standard quotation marks (WARNING: these commands change the file in place!):
sed -i 's/„/"/g' mydocument.Rmd
sed -i 's/“/"/g' mydocument.Rmd
Knit the document (or convert it to HTML with pandoc).
Then, replace all the English typographic quotation marks with German ones:
sed -i "s/“/„/g" mydocument.html
sed -i "s/”/“/g" mydocument.html
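The two sed passes above can also be written as stream filters, so that nothing is edited in place. This is only a sketch: the function names to_ascii_quotes and to_german_quotes are made up for illustration, and it assumes UTF-8 files in which every typographic quote in the HTML should become a German one.

```shell
# Hypothetical helpers; the names are illustrative and not part of pandoc or knitr.

# Before knitting: turn German typographic quotes into plain ASCII quotes.
to_ascii_quotes() {
  sed -e 's/„/"/g' -e 's/“/"/g'
}

# After knitting: turn the English typographic quotes in the HTML into
# German ones. Order matters: “ must become „ before ” becomes “.
to_german_quotes() {
  sed -e 's/“/„/g' -e 's/”/“/g'
}
```

Used as filters, this would look like to_ascii_quotes < mydocument.Rmd > plain.Rmd before knitting and to_german_quotes < plain.html > mydocument.html afterwards (the file names are placeholders).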

Related

How to replace the html tags with SED?

I need some help with using sed in Unix.
I need to use the standard Unix command sed to process the input stream and remove all HTML tags, so that, for example:
This is my link.
will be replaced by
This is my link.
I tried
sed -r 's/<[^>]*>//g'
but it didn't work.
This is extremely bare-bones and unlikely to catch all of the scenarios that HTML will throw at you, but if you just want to remove everything from a leading < to its trailing >, then something like this might work:
sed 's/<[^>]*>//g'
But seriously, I'd use a parser.
In the general case you cannot parse HTML with regular expressions.
But, for simple cases and assuming that no tag spans more than two lines, you can use:
sed -e 's/<[^<>]*>//g' -e 's/<[^<>]*$//' -e 's/^[^<>]*>//'
The first regex finds and deletes tags contained on one line. The second takes care of tags which begin on a line but end on the next. The third deletes the tails of tags which began on the previous line. If a tag can span more than two lines, then something more complicated (or a better tool) is needed.
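To see the three expressions at work, here is a small sketch that wraps them in a shell function. The name strip_tags is an assumption made for this example, and the same caveat applies: it is no substitute for a real HTML parser.

```shell
# Hypothetical wrapper around the three-expression sed command.
# Reads HTML on stdin and writes the text with tags removed to stdout.
strip_tags() {
  sed -e 's/<[^<>]*>//g' \
      -e 's/<[^<>]*$//' \
      -e 's/^[^<>]*>//'
}

# A tag on one line, and a tag split across two lines:
printf 'This <b>is</b> my <a href="x">link</a>.\n' | strip_tags
printf 'foo <a\nhref="x">bar</a>\n' | strip_tags
```

Note that the third expression would also eat ordinary text that happens to contain a bare > before any tag, which is one more reason to prefer a parser for anything non-trivial.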

Find and replace: \'

I'm trying to replace every reference of \' with &apos; in a file.
I've used variations of: sed -e s/\'/"\&apos;"/g file.txt
But they always replace every.single.(single).quote
Any help would be greatly appreciated.
Not sure it's the best solution, but I could do it like this:
sed "s/[\]'/\&apos;/g" file.txt
(putting the backslash character in a bracket expression so it doesn't interfere with the following quote, and protecting the whole script with double quotes)
Or just extending your syntax, without quotes but using almost the same trick:
sed -e s/[\\]\'/"\&apos;"/g file.txt
An approach trying to conserve as much of the "single-quotedness" of the sed command as possible:
sed 's/\\'"'"'/\&apos;/g'
Just escaping \ with \\ and getting a single quote into the command with '"'"': the first single quote ends the command so far, then we have a double-quoted single quote ("'"), and finally an opening single quote for the rest of the command.
Alternatively, double quoting the whole command and escaping both the backslash and single quote:
sed "s/\\\'/\&apos;/g"
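A quick way to check the '"'"' variant is to wrap it in a function and feed it a line containing both an escaped and a plain single quote. The function name escape_backslash_quote is invented for this sketch.

```shell
# Hypothetical helper: replaces every \' (backslash + single quote) on stdin
# with &apos; and leaves bare single quotes untouched.
escape_backslash_quote() {
  # The script sed receives is:  s/\\'/\&apos;/g
  # '"'"' closes the single-quoted string, adds a double-quoted ', and reopens it.
  sed 's/\\'"'"'/\&apos;/g'
}

# The input line is  a\'b'c  (one escaped quote, one plain quote).
printf '%s\n' "a\\'b'c" | escape_backslash_quote
```

Only the backslash-quote pair is rewritten, so the plain quote before c survives.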
The correct syntax is:
$ echo "foo'bar" | sed 's/'\''/\&apos;/'
foo&apos;bar
Every script (sed, awk, whatever) should always be enclosed in single quotes; you then use additional single quotes to stop and restart the script delimiters, breaking out to the shell for the minimal portion of the script that's absolutely necessary, in this case just long enough to use \'. You need to break out to the shell to specify that ' because, per shell rules, no script enclosed in single quotes can contain a ', not even if you try to escape it.
echo "foo'bar" | gawk '{gsub(/\47/,"\\&apos;")}1'
foo&apos;bar
The tricky part here is replacing a single quote with text that contains an ampersand.
First, to make the single quote manageable, use its octal code, here \47; then escape the ampersand with two backslashes. And all of a sudden it becomes feasible :)

only rendering last X lines of chunk output in R Markdown

I am calling a shell program from R Markdown like this
```{sh}
SomeShellProgram -options
```
and render the file as HTML. The calculation the program does takes some time, which is why the author included a self-updating progress "bar" which looks something like this:
45Mb 12.4% 935 OTUs, 3485 chimeras (6.7%)
However, especially if the progress is slow, it will update this line every 0.1% or so. Each update is rendered as a separate line in the HTML, which can add up to as many as 1000 lines of progress bars.
I don't want to suppress the output completely, e.g. with results="hide" in the chunk options. I am producing a report and the information that is printed is important.
I am looking for a hack that would somehow only capture the last X lines and render these, or maybe using grep or something similar to only capture the lines that have 100% or so.
I tried redirecting the output with > output.txt but the progress wasn't printed to the file (although other information was).
I can't think of a way to provide a reproducible example without giving the full example, sorry for that.
For those that are interested: I am trying to produce a report on the analysis of 16S Illumina sequencing data and I'm using Usearch and the command that gives me the most headaches is the usearch -cluster_otus command.
UPDATE
There is an additional problem with rendering the last X lines: the progress bar in the output is delimited by ^M (carriage return) characters and not by line breaks, so tail only recognises it as a single line. Therefore my final solution includes
1. redirecting the output from the progress bar with 2> into a file
2. replacing the ^M characters with line breaks using sed
3. rendering the last X lines with tail
My (pseudo)code to do this on Mac OS X is the following (where X = number of lines):
FunctionWithProgressBar -option 2> tempfile.tmp
sed -ibak $'s/\x0D/\\\n/g' tempfile.tmp
tail -nX tempfile.tmp
and in R Markdown:
```{sh, results="hide"}
FunctionWithProgressBar -option 2> tempfile.tmp
```
```{sh, echo=FALSE}
sed -ibak $'s/\x0D/\\\n/g' tempfile.tmp
tail -nX tempfile.tmp
```
Note that matching the carriage return is a pain in the butt (especially on OS X) and differs between platforms.
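One way to sidestep the platform differences is to convert the carriage returns with tr instead of sed. The sketch below is an alternative to the sed line above, not the method used in the answer, and the function name last_progress_lines is made up for illustration.

```shell
# Hypothetical helper: reads a carriage-return-delimited progress log on stdin
# and prints only its last N lines (N defaults to 1).
last_progress_lines() {
  # tr turns every carriage return into a newline, so tail can count lines.
  tr '\r' '\n' | tail -n "${1:-1}"
}

# Simulated progress output: three updates separated by carriage returns.
printf '10%%\r50%%\r100%%\n' | last_progress_lines 2
```

In the report, the second chunk could then read the captured file with something like last_progress_lines 5 < tempfile.tmp.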
The progress bar is probably in the stderr stream, so you capture it with "2>" and not ">". You could capture stderr and stdout separately, e.g.:
usearch blablabla 2> only_err > only_stdout
Or, if you want all of the output together, redirect stderr to stdout and append to a single file, like so:
usearch blablabla >> total_output 2>&1
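The difference between the two redirections can be tried without usearch. In this sketch fake_tool is a stand-in invented for the example: it writes one result line to stdout and a progress bar to stderr, which is only an assumption about how the real program behaves.

```shell
# Hypothetical stand-in for a program that prints results on stdout
# and a carriage-return progress bar on stderr.
fake_tool() {
  echo "result line"
  printf '50%%\r100%%\n' >&2
}

# Capture the two streams in separate files.
# $1 = file for stdout, $2 = file for stderr
capture_separately() {
  fake_tool > "$1" 2> "$2"
}

# Append both streams to one file.
# $1 = file receiving stdout and stderr together
capture_together() {
  fake_tool >> "$1" 2>&1
}
```

With capture_separately, a plain > would have left the progress bar out of the file, which matches what the question describes.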
As for the R-markdown part, I cannot really help, never used, sorry.
regards,
Moritz

Using sed to replace text with curly braces

I am trying to find the following text
get_pins {
and replace it with
get_pins -hierarchical {proc_top_*/
I've tried using sed but I'm not sure what I'm doing wrong. I know that you need # in front of curly braces but I still can't get the command to work properly.
The closest I've come is to this:
sed 's/get_pins #{#/get_pins -hierarchical #{#proc_top_*\//g' filename.txt > output
but it doesn't do the replacement I wanted above.
@merlin2011's answer shows you how to do it with alternative delimiters, but as for why your command didn't work:
It's actually perfectly fine if you just remove all # chars from your statement:
sed 's/get_pins {/get_pins -hierarchical {proc_top_*\//g' filename.txt > output
There are two distinct escaping requirements involved here:
Escaping literal use of the regex delimiter: this is what you did correctly, by escaping the / as \/.
Escaping characters with special meaning inside a regex in general: this escaping is always done with \-prefixing, but in your case there is NO need for such escaping: since you're NOT using -E or -r to indicate use of extended regexes - and are therefore using a basic regex - { is actually NOT a special character, so you need NOT escape it. If, by contrast, you had used -E (-r), then you should have escaped { as \{.
The problem is not in the curly braces, it's in the /.
This is exactly why sed lets you do alternate delimiters.
The line below uses ! as a delimiter instead, and works correctly for a simple file with get_pins { in it.
sed 's!get_pins {!get_pins -hierarchical {proc_top_*/!g' Input.txt
Output:
get_pins -hierarchical {proc_top_*/
Update: Based on mklement0's comment, and testing with the csh shell, the following should work in csh.
sed 's#get_pins {#get_pins -hierarchical {proc_top_*/#g' Input.txt
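The same substitution can be checked on a single line of input. This sketch uses | as the delimiter, which is just one more choice alongside ! and #, and the function name add_hierarchical is invented for the example.

```shell
# Hypothetical helper: rewrites "get_pins {" on stdin using | as the sed
# delimiter, so the / in the replacement needs no escaping.
add_hierarchical() {
  # In a basic regex the { is literal; in the replacement * and / are literal too.
  sed 's|get_pins {|get_pins -hierarchical {proc_top_*/|g'
}

printf '%s\n' 'get_pins {' | add_hierarchical
```

Any character that does not occur in the pattern or the replacement can serve as the delimiter.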
This awk should do the replace:
awk '{sub(/get_pins {/,"get_pins -hierarchical {proc_top_*/")}1'

Unix sort text file with user-defined newline character

I have a plain text file where the newline character is not "\n" but a special character.
Now I want to sort this file.
Is there a direct way to specify a custom newline character when using the Unix sort command?
I'd like to avoid writing a script for this as far as possible.
Please note that the data in the text file contains \n, \r\n, and \t characters (the reason for such data is application specific, so please don't comment on that).
The sample data is as below:
1111\n1111<Ctrl+A>
2222\t2222<Ctrl+A>
3333333<Ctrl+A>
Here Ctrl+A is the newline character.
Use perl -001e 'print sort <>' to do this:
prompt$ cat -tv /tmp/a
2222^I2222^A3333333^A1111
1111^A
prompt$ perl -001e 'print sort <>' /tmp/a | cat -tv
1111
1111^A2222^I2222^A3333333^Aprompt$
That works because character 001 (octal 1) is control-A ("\cA"), which is your record terminator in this dataset.
You can also give the code point in hex using -0xHHHHH. Note that with this shortcut it must be a single code point, not a string. There are ways of doing it for strings and even regexes that involve infinitesimally more code.
