How to separate hour and date when extracting a timestamp using a hadoop command - r

I need to extract files' timestamps using a hadoop command:
hadoop fs -ls /hdfs/data/adhoc//InterfacePublique-Controle-PUB_1EPSE-201808-PR-20190110-183844-indicateurs-PUB_1EPSE/* | awk '{timestamp= $6 " " $7;print timestamp}'
This works and gives:
"2019-01-10 18:55"
But when I call it through R's system() function like this, with the quotes between $6 and $7 removed:
x <- "/hdfs/data/adhoc//InterfacePublique-Controle-PUB_1EPSE-201808-PR-20190110-183844-indicateurs-PUB_1EPSE/*"
system(paste0("hadoop fs -ls ",x," | awk '{timestamp= $6 $7;print timestamp}' "),intern =TRUE)
which returns:
2019-01-1018:55
The day (10) and the hour (18) are run together.
If I then put the quotes back into the awk expression:
system(paste0("hadoop fs -ls ",x," | awk '{timestamp= $6 " " $7;print timestamp}' "),intern =TRUE)
It gives an error saying
unexpected token $7;print timestamp
How can I resolve this, please?

You can extract the timestamp that is embedded in the path itself using stringr and lubridate:
x <- "/hdfs/data/adhoc//InterfacePublique-Controle-PUB_1EPSE-201808-PR-20190110-183844-indicateurs-PUB_1EPSE/*"
library(lubridate)
library(stringr)
ymd_hms(
str_extract(x, "(\\d{8}-\\d{6})")
)
[1] "2019-01-10 18:38:44 UTC"
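If the modification time reported by hadoop fs -ls is what is needed, rather than the timestamp embedded in the path, one way to sidestep the quoting problem is to avoid double quotes in the awk program altogether. The sketch below assumes the usual eight-column hadoop fs -ls layout (date in column 6, time in column 7) and uses a made-up listing line in place of a real cluster; fake_hadoop_ls and extract_timestamp are hypothetical helper names.

```shell
#!/bin/sh
# Stand-in for `hadoop fs -ls <path>`: one hypothetical listing line in the
# usual layout (permissions, replication, owner, group, size, date, time, path).
fake_hadoop_ls() {
  printf '%s\n' '-rw-r--r--   3 someuser somegroup   1024 2019-01-10 18:55 /hdfs/data/adhoc/example/part-00000'
}

# `print $6, $7` joins the two fields with awk's output field separator
# (a single space by default), so the program needs no double quotes.
extract_timestamp() {
  awk '{ print $6, $7 }'
}

fake_hadoop_ls | extract_timestamp
```

Because this awk program contains only single quotes, it can be pasted into a double-quoted R string for system() without any escaping. Keeping the original $6 " " $7 form would instead require writing each inner double quote as \" in the R string.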

Related

having R print a system call that contains "", '', and escape character \

I need to run a perl command from within an R script. I would normally do this via:
system(paste0('my command'))
However, the command I want to paste contains both single and double quotes and an escape character. Specifically, I would like to paste this command:
perl -pe '/^>/ ? print "\n" : chomp' in.fasta | tail -n +2 > out.fasta
I have tried escaping the double quotes with more escape characters, which allows me to pass the command, but it then prints all 3 escape characters, which causes the command to fail. Is there a good way around this, such that I can save the above perl line as a string in R, that I can then pass to the system() function?
I haven't tested your particular perl call (since it depends on specific files and directories), but I tried something trivial with the quotes escaped and it seems to work. You might also want to refer to this question for more.
My approach,
# shouldn't contain any text except for an empty string
my_text <- try(system(" perl -e 'print \"\n\"' ", intern = TRUE))
my_text
[1] ""
# should contain the string - Hello perl from R!
my_text2 <- try(system(" perl -e 'print \"Hello perl from R!\"' ", intern = TRUE))
my_text2
[1] "Hello perl from R!"
So based on the above trials I think this should work for you -
try(system(command = "perl -pe '/^>/ ? print \"\n\" : chomp' in.fasta | tail -n +2 > out.fasta", intern = TRUE))
Note - intern = TRUE just captures the output as a character vector in R.

Replace last 9 delimiters "," with "|" in Unix

I want to replace the last 9 "," delimiters with "|" in a file.
For example, from:
abcd,3,5,5,7,7,1,2,3,4
"ashu,pant,something",3,5,5,7,7,8,7,8,8,8
to:
abcd|3|5|5|7|7|1|2|3|4
"ashu,pant,something"|3|5|5|7|7|8|7|8|8|8
Help would be really appreciated.
Not exactly the same, but this replaces every comma from the second occurrence onward with GNU sed:
$ echo \"ashu,pant\",3,5,5,7,7,87,8,8,8 |
sed 's/,/|/2g'
"ashu,pant"|3|5|5|7|7|87|8|8|8
Edit to match your changed requirements:
Hackish, but first reverse the lines and replace all commas with pipes, then turn the pipes back into commas starting from the 10th occurrence, and finally reverse the lines again:
$ echo -e \"ashu,pant\",3,5,5,7,7,87,8,8,8\\nabcd,3,5,5,7,7,1,2,3,4 |
rev |
sed 's/,/|/g; s/|/,/10g' |
rev
"ashu,pant"|3|5|5|7|7|87|8|8|8
abcd|3|5|5|7|7|1|2|3|4
You could also use GNU awk and FPAT to replace all commas outside of quotes:
$ echo -e \"ashu,pant\",3,5,5,7,7,87,8,8,8\\nabcd,3,5,5,7,7,1,2,3,4 |
awk 'BEGIN{FPAT="([^,]+)|(\"[^\"]+\")";OFS="|"}{$1=$1}1'
"ashu,pant"|3|5|5|7|7|87|8|8|8
abcd|3|5|5|7|7|1|2|3|4
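Another option, which only works when every numeric field is a single digit and the quoted text contains no digits:
awk '{gsub(/[[:digit:]]/," |&"); gsub(/, /,"")}1' file
Output: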
abcd|3|5|5|7|7|1|2|3|4
"ashu,pant,something"|3|5|5|7|7|8|7|8|8|8
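For a version that does not depend on GNU extensions, the "last N commas" rule can also be applied directly by walking each line from its end. This is only a sketch: replace_last_commas is a hypothetical helper name, and it counts commas purely by position, without parsing the quotes.

```shell
#!/bin/sh
# replace_last_commas [N]
# Reads lines on stdin and replaces the last N commas (default 9) of each
# line with "|". Commas are counted from the end of the line, so anything
# before the last N commas, including quoted text, is left untouched.
replace_last_commas() {
  awk -v n="${1:-9}" '{
    out = ""
    count = 0
    for (i = length($0); i >= 1; i--) {
      c = substr($0, i, 1)
      if (c == "," && count < n) {
        c = "|"
        count++
      }
      out = c out        # rebuild the line from right to left
    }
    print out
  }'
}

printf '%s\n' '"ashu,pant",3,5,5,7,7,87,8,8,8' 'abcd,3,5,5,7,7,1,2,3,4' |
  replace_last_commas 9
```

Since it works character by character, multi-digit fields such as 87 are handled the same way as single digits.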

I need list of files in long format that contain string

I want to have the list of files, in long format (ls -l) including date and time, that contain a specific string and, if possible, the number of occurrences.
The most I have achieved is a list of files (just the name) with number of occurrences:
grep -c 'string' * | grep -v :0
That shows something like:
filename:number of occurrences
But I cannot improve it to also show the file date and time. It has to be something simple, but I am a bit of a newbie.
I have used -s to ignore the folder warnings. ':0$' is the regex for lines ending in :0. awk then calls ls -l on just the filenames found, and | tr '\n' ' ' replaces the newlines of the ls output with spaces. We output the number of occurrences at the end of each line so we don't lose that information along the way. The last awk just prints the columns needed.
grep -c 'form-group' * -s | grep -v ':0$' | awk -F ':' '{ printf system( "ls -l \"" $2 "\" | tr \"\n\" \" \"" ); print " " $3 }' | awk -F ' ' '{ print $6 " " $7 " " $8 " " $9 " : " $11 }'
Here is some sample output:
Sep 1 13:47 xxx.blade.php : 12
Sep 1 13:47 xxx.blade.php : 5
Sep 1 13:47 xxx.blade.php : 6
Sep 11 17:25 xxx.blade.php : 4
Sep 4 15:03 xxx.blade.php : 6
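The same idea can be written as a plain loop, which avoids building an ls command inside awk and copes with file names containing spaces. This is a sketch rather than a drop-in replacement: list_matches is a hypothetical helper name, it prints the whole ls -l line instead of selected columns, and the demo files are created only for illustration.

```shell
#!/bin/sh
# list_matches PATTERN FILE...
# Prints the `ls -l` line of each regular file containing PATTERN, followed
# by the number of matching lines (grep -c counts lines, not occurrences).
list_matches() {
  pattern=$1
  shift
  for f in "$@"; do
    [ -f "$f" ] || continue          # skip directories and missing paths
    # grep -c still prints 0 when nothing matches; `|| true` hides its
    # non-zero exit status in that case.
    n=$(grep -c -s -e "$pattern" -- "$f" || true)
    if [ "${n:-0}" -gt 0 ]; then
      printf '%s : %s\n' "$(ls -l -- "$f")" "$n"
    fi
  done
}

# Demo on two throwaway files: only the first one contains the pattern.
demo_dir=$(mktemp -d)
printf 'form-group\nother\nform-group again\n' > "$demo_dir/a.blade.php"
printf 'nothing here\n' > "$demo_dir/b.blade.php"
list_matches 'form-group' "$demo_dir"/*
rm -rf "$demo_dir"
```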

find the number of lines in a file

Under Linux, I can find the number of lines in a file by doing a system call to wc:
CountLines <- function(file) {
count.file <- system(sprintf("wc -l %s", file), intern = TRUE)
count <- as.integer(strsplit(count.file, " ")[[1]][1])
return(count)
}
How can I do this efficiently under Windows? By "efficient" I mean fast and light on resources, as I may be using it on large files.
As much as possible, I'd prefer a solution that does not require installing extra packages or tools.
Take a look at this link:
https://isc.sans.edu/diary/Finding+Files+and+Counting+Lines+at+the+Windows+Command+Prompt/2244
The last line works for me:
c:\> type c:\windows\win.ini | find /c /v "~~~"
# 32
Update:
If you want to use it as an R function, try this:
CountLines <- function(file) {
stopifnot(file.exists(file))
unlikely.pattern <- paste(sample(LETTERS), collapse = "")
cmd <- sprintf('type %s | find /c /v "%s"', file, unlikely.pattern)
res <- shell(cmd, intern = TRUE)
return(as.integer(res))
}
CountLines("c:\\windows\\win.ini")
[1] 32
I found another, more efficient way to do this, but I leave perfecting it to you:
> system("POWERSHELL Get-Content c:\\windows\\win.ini | Measure-Object -word -line -character", intern=TRUE)
[1] ""
[2] " Lines Words Characters Property "
[3] " ----- ----- ---------- -------- "
[4] " 32 38 414 "
[5] ""
[6] ""
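For comparison, on the Linux side the string splitting in the question's wc-based function can be avoided by feeding the file to wc on standard input, in which case wc prints only the count. A minimal sketch, with count_lines as a hypothetical helper name:

```shell
#!/bin/sh
# count_lines FILE
# Prints the number of newline-terminated lines in FILE.
count_lines() {
  # Reading from stdin makes wc print only the number, with no file name.
  n=$(wc -l < "$1") || return 1
  # Arithmetic expansion drops the leading padding that some wc builds add.
  echo "$((n))"
}

# Demo on a throwaway three-line file.
demo_file=$(mktemp)
printf 'a\nb\nc\n' > "$demo_file"
count_lines "$demo_file"
rm -f "$demo_file"
```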

Combining R + awk + bash commands

I want to combine awk and R. I have a set of *.txt files in a specified directory, and I don't know the length of the header in each file. In some cases I have to skip 25 lines, in others 27, and so on. So I want to use some awk commands to get the number of lines to skip. Once I have this value, I can begin processing the data with R.
Furthermore, in the R file I combine R and bash, so my code looks like this:
#!/usr/bin/env Rscript
...
argv <- commandArgs(TRUE)
# error checking...
import_file <- argv[1]
export_file <- argv[2]
# your function call
format_windpro(import_file, export_file)
Where and how can I type my awk command? Thanks!
I tried to do what you told me about awk commands and I still get an error. The program doesn't recognize my command, so I cannot pass the number of lines to skip to my function. Here is my code:
nline <- paste('$(grep -n 'm/s' import_file |awk -F":" '{print $1}')')
nline <- scan(pipe(nline),quiet=T)
I look for the pattern m/s in the first column in order to know where my header text is. I use R under Windows 7.
Besides Vincent's hint of using system("awk ...", intern=TRUE), you can also use the pipe() function that is part of the usual text connections:
R> sizes <- read.table(pipe("ls -l /tmp | awk '!/^total/ {print $5}'"))
R> summary(sizes)
V1
Min. : 0
1st Qu.: 482
Median : 4096
Mean : 98746
3rd Qu.: 13952
Max. :27662342
R>
Here I pipe a command into awk and then read all of awk's output; that output could also be a single line:
R> cmd <- "ls -l /tmp | awk '!/^total/ {sum = sum + $5} END {print sum}'"
R> totalsize <- scan(pipe(cmd), quiet=TRUE)
R> totalsize
[1] 116027050
R>
You can use system to run an external program from R.
system("gawk --version", intern=TRUE)
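For the header-length problem in the question, awk alone can report the number of the first line that contains m/s, without a separate grep. The sketch below assumes that this line marks the end of the header; first_match_line is a hypothetical helper name and the sample file is made up.

```shell
#!/bin/sh
# first_match_line PATTERN FILE
# Prints the 1-based number of the first line matching PATTERN (an awk
# regular expression), or nothing if no line matches.
first_match_line() {
  awk -v pat="$1" '$0 ~ pat { print NR; exit }' "$2"
}

# Demo: two free-text header lines, then the column line mentioning m/s.
sample=$(mktemp)
printf '%s\n' 'Site: example' 'Height: 80 m' 'time speed[m/s] dir' '00:00 5.2 180' > "$sample"
first_match_line 'm/s' "$sample"    # prints 3
rm -f "$sample"
```

From R, the same awk program could be wrapped in system(..., intern = TRUE) or pipe() as in the answers above, and the resulting number passed as the skip argument when reading the file. This assumes awk is available on the PATH, which is not the case by default on Windows.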
