Having R print a system call that contains "", '', and the escape character \

I need to run a perl command from within an R script. I would normally do this via:
system(paste0('my command'))
However, the command I want to paste contains both single and double quotes and an escape character. Specifically, I would like to paste this command:
perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta
I have tried escaping the double quotes with more escape characters, which allows me to pass the command, but it then prints all 3 escape characters, which causes the command to fail. Is there a good way around this, such that I can save the above perl line as a string in R that I can then pass to the system() function?

Hey, I haven't tested your particular perl call (since it involves your particular files/directories etc.), but I tried something trivial by escaping the quotes and it seems to work. You might want to refer to this question for more as well.
My approach,
# shouldn't have any text except for an empty string
my_text <- try(system(" perl -e 'print \"\n\"' ", intern = TRUE))
my_text
[1] ""
# should contain the string - Hello perl from R!
my_text2 <- try(system(" perl -e 'print \"Hello perl from R!\"' ", intern = TRUE))
my_text2
[1] "Hello perl from R!"
So based on the above trials I think this should work for you -
try(system(command = "perl -pe '/^>/ ? print \"\n\" : chomp' in.fasta | tail -n +2 > out.fasta", intern = TRUE))
Note - intern = TRUE just captures the output as a character vector in R.
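One way to double-check the escaping before touching any files is to build the command as a string first and print it with cat(), which shows it exactly as the shell will receive it. A minimal sketch (in.fasta and out.fasta are the placeholder names from the question):
# \" is a literal double quote; \\n passes the two characters \n through to perl
cmd <- "perl -pe '/^>/ ? print \"\\n\" : chomp' in.fasta | tail -n +2 > out.fasta"
cat(cmd, "\n")
# perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta
system(cmd)
# alternatively, let shQuote() apply the shell's single quotes for you
perl_prog <- '/^>/ ? print "\\n" : chomp'  # R single quotes, so no \" needed
system(paste("perl -pe", shQuote(perl_prog), "in.fasta | tail -n +2 > out.fasta"))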

How do I replace "/" with "\" in R?

I am trying to make "C:/Users/Vitor/Documents" become "C:\Users\Vitor\Documents".
I tried :
gsub("//", "\", file)
paste(dirname(file),basename(file),sep="\")
normalizePath(file,"\",mustWork=FALSE)
But it didn't work!
We can escape the \ with another backslash and use that in gsub; the replacement \\\\ in the R source encodes the two characters \\, which gsub in turn reads as the single character \
gsub("/", "\\\\", "C:/Users/Vitor/Documents")
which would print correctly with cat
cat(gsub("/", "\\\\", "C:/Users/Vitor/Documents"))
#C:\Users\Vitor\Documents
and can check the number of characters
nchar("\\")
#[1] 1
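If counting backslashes gets tedious, one alternative is fixed = TRUE, which makes gsub treat both the pattern and the replacement literally, so the single backslash character "\\" can be used as-is:
# with fixed = TRUE nothing is interpreted as a regex, so the
# replacement "\\" (one backslash character) is inserted literally
cat(gsub("/", "\\", "C:/Users/Vitor/Documents", fixed = TRUE))
# C:\Users\Vitor\Documents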

Remove consecutive duplicate words from a file using awk or sed

My input file looks like this:
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
Output should look like:
"true, rohith Rohith;
cold burn, and fact and fact good?"
I am trying the same with awk, but am not able to get the desired result.
awk '{for (i=1;i<=NF;i++) if (!a[$i]++) printf("%s ",$i,FS)}{printf("\n")}' input.txt
Could someone please help me here?
Regards,
Rohith
With GNU awk for the 4th arg to split():
$ cat tst.awk
{
    n = split($0,words,/[^[:alpha:]]+/,seps)
    prev = ""
    for (i=1; i<=n; i++) {
        word = words[i]
        if (word != prev) {
            printf "%s%s", seps[i-1], word
        }
        prev = word
    }
    print ""
}
$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”
Just match the repeated word with a backreference in sed:
sed ':l; s/\(^\|[^[:alpha:]]\)\([[:alpha:]]\{1,\}\)[^[:alpha:]]\{1,\}\2\($\|[^[:alpha:]]\)/\1\2\3/g; tl'
How it works:
:l - create a label l to jump to. See tl below.
s - substitute
/
\(^\|[^[:alpha:]]\) - match the beginning of the line or a non-alphabetic character. This is so that the next part matches a whole word, not only a suffix of one.
\([[:alpha:]]\{1,\}\) - match a word - one or more alphabetic characters.
[^[:alpha:]]\{1,\} - match a non-word - one or more non-alphabetic characters.
\2 - match the same thing as in the second \(...\) - ie. match the word.
\($\|[^[:alpha:]]\) - match the end of the line or a non-alphabetic character. That is so we match the whole second word, not only its prefix.
/
\1\2\3 - substitute it for <beginning of the line or non-alphabetic prefix character><the word><end of the line or non-alphabetic suffix character found>
/
g - substitute globally. But because the regex engine never goes back over replaced text, it substitutes 2 words at a time.
tl - jump to label l if the last s command was successful. This is here so that when the same word occurs 3 times, like true true true, it is properly reduced to a single true.
Without \(^\|[^[:alpha:]]\) and \($\|[^[:alpha:]]\), for example true rue would be substituted by true, because the trailing rue rue would match.
Below are my other solutions, which also remove repeated words across lines.
My first solution was with uniq. First I transform the input into pairs of the form <non-alphabetic separator between words, encoded in hex> <a word>. Then I run it through uniq -f1, ignoring the first field, and convert back. This will be very slow:
# recreate input
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
# wrap each word in zero bytes, so word and non-word tokens alternate
# the -z option is from GNU sed
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
# for each pair (non-word, word)
xargs -0 -n2 sh -c '
# output the hexadecimal representation of the non-word
printf "%s" "$1" | xxd -p | tr -d "\n"
# and output space with the word
printf " %s\n" "$2"
' -- |
# uniq -f1 skips the first field - make sure field 1 always has something
sed 's/^/-/' |
# uniq while ignoring first field
uniq -f1 |
# for each pair (non-word in hex, word)
xargs -n2 bash -c '
# POSIX shell equivalent: printf "%s" "$1" | sed "s/^-//" | xxd -r -p
# change non-word from hex to characters
printf "%s" "${1:1}" | xxd -r -p
# output word
printf "%s" "$2"
' --
But then I noticed that sed is doing a good job of tokenizing the input - it places zero bytes between word and non-word tokens, so I can easily read the stream. I can ignore repeated words by reading the zero-separated stream in GNU awk and comparing each word with the last word read:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r -z 's/[[:alpha:]]+/\x00&\x00/g' |
gawk -vRS='\0' '
NR%2==1{
    nonword=$0
}
NR%2==0{
    if (length(lastword) && lastword != $0) {
        printf "%s%s", lastword, nonword
    }
    lastword=$0
}
END{
    printf "%s%s", lastword, nonword
}'
In place of the zero byte, anything unique could be used as the record separator, for example the ^ character; that way it also works with non-GNU awk versions (tested with mawk, available on repl). I shortened the script by using shorter variable names here:
cat <<EOF |
true true, rohith Rohith;
cold burn, and fact and fact good good?
EOF
sed -r 's/[[:alpha:]]+/^&^/g' |
awk -vRS='^' '
NR%2{ n=$0 }
NR%2-1 && length(l) && l != $0 { printf "%s%s", l, n }
NR%2-1 { l=$0 }
END { printf "%s%s", l, n }
'
Tested on repl. The snippets output:
true, rohith Rohith;
cold burn, and fact and fact good?
Simple sed:
echo "true true, rohith Rohith;
cold burn, and fact and fact good good?" | sed -r 's/(\w+) (\1)/\1/g'
This is not exactly what you have shown as output, but it is close, using gnu-awk:
awk -v RS='[^-_[:alnum:]]+' '$1 == p{printf "%s", RT; next} {p=$1; ORS=RT} 1' file
“true , rohith Rohith;
cold burn, and fact and fact good ?”
sed -E 's/(\w+) *\1/\1/g' sample.txt
sample.txt
“true true, rohith Rohith;
cold burn, and fact and fact good good?”
output:
:~$ sed -E 's/(\w+) *\1/\1/g' sample.txt
“true, rohith Rohith;
cold burn, and fact and fact good?”
Explanation
(\w+) *\1 - matches a word, followed by optional spaces and the same word again; the first occurrence is captured
Depending on your expected input, this might work:
sed -r 's/([a-zA-Z0-9_-]+)( *)\1/\1\2/g ; s/ ([.,;:])/\1/g ; s/  / /g' myfile
([a-zA-Z0-9_-]+) = words that might be repeated.
( *)\1 = check if the previous word is repeated after a space.
s/ ([.,;:])/\1/g = removes extra spaces before punctuation (you might want to add characters to this group).
s/  / /g = squeezes double spaces into single spaces.
This works with GNU sed.

Better string interpolation in R

I need to build up long command lines in R and pass them to system(). I find it very inconvenient to use the paste0/paste functions, or even sprintf, to build each command line. Is there a simpler way?
Instead of this hard-to-read version with too many quotes:
cmd <- paste("command", "-a", line$elem1, "-b", line$elem3, "-f", df$Colum5[4])
or:
cmd <- sprintf("command -a %s -b %s -f %s", line$elem1, line$elem3, df$Colum5[4])
Can I have this:
cmd <- buildcommand("command -a %line$elem1 -b %line$elem3 -f %df$Colum5[4]")
For a tidyverse solution see https://github.com/tidyverse/glue. Example:
name="Foo Bar"
glue::glue("How do you do, {name}?")
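Applied to the command from the question, a sketch (with made-up sample data, since glue evaluates arbitrary R expressions inside the braces):
# sample data assumed for illustration
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
glue::glue("command -a {line$elem1} -b {line$elem3} -f {df$Colum5[4]}")
# command -a 10 -b 30 -f 4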
With version 1.1.0 (CRAN release on 2016-08-19), the stringr package has gained a string interpolation function str_interp() which is an alternative to the gsubfn package.
# sample data
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
# do the string interpolation
stringr::str_interp("command -a ${line$elem1} -b ${line$elem3} -f ${df$Colum5[4]}")
#[1] "command -a 10 -b 30 -f 4"
This comes pretty close to what you are asking for. When any function f is prefaced with fn$, i.e. fn$f, character interpolation will be performed on its character arguments, replacing each backquoted `expression` with the result of running it as an R expression.
library(gsubfn)
cmd <- fn$identity("command -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
Here is a self-contained reproducible example:
library(gsubfn)
# test inputs
line <- list(elem1 = 10, elem3 = 30)
df <- data.frame(Colum5 = 1:4)
fn$identity("command -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
## [1] "command -a 10 -b 30 -f 4"
system
Since any function can be used, we can operate directly on the system call like this. We have used echo here to make it runnable, but any command could be used.
exitcode <- fn$system("echo -a `line$elem1` -b `line$elem3` -f `df$Colum5[4]`")
## -a 10 -b 30 -f 4
Variation
This variation would also work. fn$f also performs substitution of $whatever with the value of variable whatever. See ?fn for details.
with(line, fn$identity("command -a $elem1 -b $elem3 -f `df$Colum5[4]`"))
## [1] "command -a 10 -b 30 -f 4"
Another option would be to use whisker.render from https://github.com/edwindj/whisker which is a {{Mustache}} implementation in R. Usage example:
require(dplyr); require(whisker)
bedFile="test.bed"
whisker.render("processing {{bedFile}}") %>% print
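If you prefer not to rely on variables being picked up from the calling environment, whisker.render also takes the data explicitly; a small sketch:
library(whisker)
# pass the values in a list instead of using the enclosing environment
whisker.render("processing {{bedFile}}", data = list(bedFile = "test.bed"))
# [1] "processing test.bed"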
Not really a string interpolation solution, but still a very good option for the problem is to use the processx package instead of system() and then you don't need to quote anything.
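A minimal sketch of that approach, reusing the perl one-liner from the first question for illustration: processx::run() takes the program name and a character vector of arguments, so nothing passes through a shell and no escaping is needed beyond R's own:
library(processx)
# each element of args reaches perl exactly as written; no shell quoting
res <- run("perl", c("-e", 'print "Hello perl from R!"'))
res$stdout
# [1] "Hello perl from R!"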
library(GetoptLong)
# sample values assumed for illustration
region <- c(1, 1000)
value <- 3.14
name <- "foo"
str = qq("region = (#{region[1]}, #{region[2]}), value = #{value}, name = '#{name}'")
cat(str)
qqcat("region = (#{region[1]}, #{region[2]}), value = #{value}, name = '#{name}'")
https://cran.r-project.org/web/packages/GetoptLong/vignettes/variable_interpolation.html

How to separate unique characters from several words in an "indic" text file?

I've a plain text file.
> Input: इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री
All words are separated by one or more spaces. I want to collect all unique chars from the text file. I'm looking for a unix command; the order of the result chars is not important.
> Expected result: इं जे क्श न ट र नॅ श ल ग्रे टे ड टि रिअ र ड स्ट्री
With the command Klaus has provided
cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
Result comes as:
ं अ इ क ग ज ट ड न र ल श सिीॅे्
I don't want to separate horizontal or vertical conjuncts or dependent vowels from their base character.
I just want to separate complete characters in a word from each other.
Can we achieve this with UNIX commands?
"base character" + "dependent vowel" = "complete character"
- क ा का
- क ि कि
Klaus's command works for English text only; it doesn't work with Indic languages such as Hindi.
Input: hi1 hello-2 how!3 "are4 ?you5
result: h i e l o w a r y u 1 2 3 4 5 - ! "
Note: you have to install Indic support in your OS.
Also, download the Mangal font from http://hindi-fonts.com/fonts/Mangal
Try this:
cat <file>|sed -e 's/\(.\)/\1\n/g'|sort -u|tr -d '\n'
or simplified (stolen from fedorqui's comment, thanks! I had never seen & in the replacement part before - it stands for the whole match. Good to learn something new!)
sed 's/./&\n/g' <file> | sort -u | tr -d '\n'
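The splitting above is per codepoint, which is exactly why the Indic input breaks: dependent vowels and marks are separate codepoints. As a hedged aside in keeping with the rest of this page, ICU grapheme segmentation, e.g. via R's stringi package, splits on user-perceived character boundaries instead (how much of a conjunct stays joined depends on the ICU/Unicode version):
library(stringi)
x <- "इंजेक्शन इंटरनॅशनल इंटिग्रेटेड इंटिरिअर इंडस्ट्री"
# type = "character" splits on grapheme-cluster boundaries, keeping a
# base consonant together with its dependent vowels and marks
chars <- unlist(stri_split_boundaries(x, type = "character"))
unique(chars[chars != " "])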

Combining R + awk + bash commands

I want to combine awk and the R language. The thing is that I have a set of *.txt files in a specified directory and I don't know the length of the header in each file. In some cases I have to skip 25 lines, while in others I have to skip 27, and so on. So I want to type some awk commands to get the number of lines to skip. Once I have this value, I can begin processing the data with R.
Furthermore, in the R file I combine R and bash, so my code looks like this:
#!/usr/bin/env Rscript
...
argv <- commandArgs(T)
# error checking...
import_file <- argv[1]
export_file <- argv[2]
# your function call
format_windpro(import_file, export_file)
Where and how can I type my awk command? Thanks!
I tried to do what you told me about awk commands and I still get an error. The program doesn't recognize my command, and so I cannot pass the number of lines to skip to my function. Here is my code:
nline <- paste('$(grep -n 'm/s' import_file |awk -F":" '{print $1}')')
nline <- scan(pipe(nline),quiet=T)
I look for the pattern m/s in the first column in order to know where my header text is. I use R under Windows 7.
Besides Vincent's hint of using system("awk ...", intern=TRUE), you can also use the pipe() function that is part of the usual text connections:
R> sizes <- read.table(pipe("ls -l /tmp | awk '!/^total/ {print $5}'"))
R> summary(sizes)
       V1
 Min.   :       0
 1st Qu.:     482
 Median :    4096
 Mean   :   98746
 3rd Qu.:   13952
 Max.   :27662342
R>
Here I am piping a command into awk and then reading all the output from awk; it could also be a single line:
R> cmd <- "ls -l /tmp | awk '!/^total/ {sum = sum + $5} END {print sum}'"
R> totalsize <- scan(pipe(cmd), quiet=TRUE)
R> totalsize
[1] 116027050
R>
You can use system to run an external program from R.
system("gawk --version", intern=TRUE)
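Applied to the original header problem, a sketch along the lines of the asker's attempt (it assumes the header ends at the first line containing m/s, and that import_file, taken from argv[1], contains no spaces or shell metacharacters):
# line number of the first match of m/s, found with grep/awk
cmd <- paste0("grep -n 'm/s' ", import_file, " | awk -F: '{print $1}' | head -1")
nline <- scan(pipe(cmd), quiet = TRUE)
# skip the header lines when reading the data into R
dat <- read.table(import_file, skip = nline)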
