find the number of lines in a file - r

Under Linux, I can find the number of lines in a file by doing a system call to wc:
CountLines <- function(file) {
count.file <- system(sprintf("wc -l %s", file), intern = TRUE)
count <- as.integer(strsplit(count.file, " ")[[1]][1])
return(count)
}
How can I do this efficiently under Windows? By "efficient" I mean fast and light on resources, as I may be using it on large files.
As much as possible, I'd prefer a solution that does not require installing extra packages or tools.

Take a look at this link:
https://isc.sans.edu/diary/Finding+Files+and+Counting+Lines+at+the+Windows+Command+Prompt/2244
The last line works for me:
c:\> type c:\windows\win.ini | find /c /v "~~~"
# 32
Update:
if you want to use it as R function, try this:
CountLines <- function(file) {
stopifnot(file.exists(file))
unlikely.pattern <- paste(sample(LETTERS), collapse = "")
cmd <- sprintf('type %s | find /c /v "%s"', file, unlikely.pattern)
res <- shell(cmd, intern = TRUE)
return(as.integer(res))
}
CountLines("c:\\windows\\win.ini")
[1] 32
.
.
.
I found another way to do this more efficient, but I leave this to you for perfection:
> system("POWERSHELL Get-Content c:\\windows\\win.ini | Measure-Object -word -line -character", intern=TRUE)
[1] ""
[2] " Lines Words Characters Property "
[3] " ----- ----- ---------- -------- "
[4] " 32 38 414 "
[5] ""
[6] ""

Related

How seperate hour and datewhen extracting timestimp using hadoop command

I need to extract files's timestimps using hadoop command :
hadoop fs -ls /hdfs/data/adhoc//InterfacePublique-Controle-PUB_1EPSE-201808-PR-20190110-183844-indicateurs-PUB_1EPSE/* | awk '{timestamp= $6 " " $7;print timestamp}'
And it works giving
"2019-01-10 18:55"
But when I used system function like this with removing quotes between $6 $7
x <- "/hdfs/data/adhoc//InterfacePublique-Controle-PUB_1EPSE-201808-PR-20190110-183844-indicateurs-PUB_1EPSE/*"
system(paste0("hadoop fs -ls ",x," | awk '{timestamp= $6 $7;print timestamp}' "),intern =TRUE)
which returns :
2019-01-1018:55 . the hour 18 and the day 10 are colled.
Then if I add quotes , in the hadoop expression .
system(paste0("hadoop fs -ls ",x," | awk '{timestamp= $6 " " $7;print timestamp}' "),intern =TRUE)
It gives an error saying
unexpected token $7;print timestamp
How can I resolve this please ?
You can extract the timestamp using stringr and lubridate:
x <- "/hdfs/data/adhoc//InterfacePublique-Controle-PUB_1EPSE-201808-PR-20190110-183844-indicateurs-PUB_1EPSE/*"
library(lubridate)
library(stringr)
ymd_hms(
str_extract(x, "(\\d{8}-\\d{6})")
)
[1] "2019-01-10 18:38:44 UTC"

How to call exe program and input parameters using R?

I want to call .exe program (spi_sl_6.exe) using a command of R (system), however I can't input parameters to the program using "system". The followwing is my command and parameters:system("D:\\working\spi_sl_6.exe")
I am searching for a long time on net. But no use. Please help or try to give some ideas how to achieve this. Thanks in advance.
This is using the Standardized Precipitation Index software from
http://drought.unl.edu/MonitoringTools/DownloadableSPIProgram.aspx.
This seems to give a working solution using Windows (but not without warnings!)
Fisrt download the software and example files
# Create directory to download software
mydir <- "C:\\Users\\david\\spi"
dir.create(mydir)
url <- "http://drought.unl.edu/archive/Programs/SPI"
download.file(file.path(url, "spi_sl_6.exe"), file.path(mydir, "spi_sl_6.exe"), mode="wb")
# Download example files
download.file(file.path(url, "SPI_samplefiles.zip"), file.path(mydir, "SPI_samplefiles.zip"))
# extract one example file, and write out
temp <- unzip(file.path(mydir, "SPI_samplefiles.zip"), "wymo.cor")
dat <- read.table(temp)
# Use this file as an example input
write.table(dat, file.path(mydir,"wymo.cor"), col.names = FALSE, row.names = FALSE)
From page 3 of the help file basic-spi-program-information.pdf at the above link the command line code should be of the form spi 3 6 12 <infile.dat >outfile.dat, however,
neither of the following worked (just from command line not in R), and various iterations of how to pass parameters.
C:\Users\david\spi\spi_sl_6 3 <C:\Users\david\spi\wymo.cor >C:\Users\david\spi\out.dat
cd C:\Users\david\spi && spi_sl_6 3 <wymo.cor >out.dat
However, using the accepted answer from Running .exe file with multiple parameters in c#
seems to work. That is again from the command line
cd C:\Users\david\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out1.dat) | spi_sl_6
So to run this in R you can wrap this in a shell (you will need to change the path to where you have saved the exe)
shell("cd C:\\Users\\david\\spi && (echo 2 && echo 3 && echo 6 && echo wymo.cor && echo out2.dat) | spi_sl_6", intern=TRUE)
out1.dat and out2.dat should be the same.
This throws warning messages, I think from the echo (in R but not from command line) but the output file is produced.
Suppose you can automate all the echo calls sligtly, so all you need to do is input the time parameters.
timez <- c(2, 3, 6)
stime <- paste("echo", timez, collapse =" && ")
infile <- "wymo.cor"
outfile <- "out3.dat"
spiCall <- paste("cd", mydir, "&& (" , stime, "&& echo", infile, "&& echo", outfile, " ) | spi_sl_6")
shell(spiCall)
You can construct the command using sprintf :
cmd_name <- "D:\\working\spi_sl_6.exe"
param1 <- "a"
param2 <- "b"
system2(sprintf("%s %s %s",cmd_name,param1,param2))
Or using system2( I prefer this option):
system2(cmd_name, args = c(param1,param2))

Using `diff` from R via `system(..)`

I was willing to compare two paths (named a and b) in R using the diff command from Bash.
In bash I would do
$ a=Path/to/foo/directory/
$ b=Path/to/bar/directory/
$ diff <(printf ${a} | tr / '\n') <(printf ${b} | tr / '\n')
3c3
< foo
---
> bar
So from R I am trying
a="Path/to/foo/directory/"
b="Path/to/bar/directory/"
system(
paste0(
"a=",a,
";b=",b,
";diff <(printf ${a} | tr / '\n') <(printf ${b} | tr / '\n')"
)
)
OR
system(
paste0(
"diff <(printf ",a," | tr / '\n') <(printf ",b," | tr / '\n')"
)
)
but both return an error.
sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `a=Path/to/foo/directory/;b=Path/to/bar/directory/;diff <(printf ${a} | tr / ''
even though copy-pasting the output of the paste0 function into bash works fine.
There might be better ways to compare strings in R and I would welcome alternative solutions. However, I am particularly interested in understanding what is going wrong with my usage of the system() function and how to solve it.
As explained here, system(..) is not running /usr/bin/bash but /usr/bin/sh. Here are two possible solutions to the problem.
Solution in "usr/bin/sh"
So in order to make a script that run through /usr/bin/sh I had to print strings on files.
DiffPath = function(a,b,ManipulationFolder="~")
{
if (file.exists(ManipulationFolder))
{
system(
paste0(
"cd ",ManipulationFolder,
";a=",a,
";b=",b,
";printf ${a} | tr / '\n' > a.txt",
";printf ${b} | tr / '\n' > b.txt",
";diff a.txt b.txt",
";rm a.txt;rm b.txt"
)
)
} else
{
warning(paste0("Cannot find the ManipulationFolder ( ",ManipulationFolder," )"))
}
}
Solution in "usr/bin/bash"
An alternative and nicer solution is to explicitly give the command to bash.
DiffPath = function(a,b)
{
system(
paste0(
'bash -c \'diff <(printf ',a,' | tr / "\n") <(printf ',b,' | tr / "\n")\''
)
)
}
Function call
a="Path/to/foo/directory/"
b="Path/to/bar/directory/"
DiffPath(a,b)
3c3
< foo
---
> bar

unix ftp script to get latest file from server

I have a unix script to get files via ftp looks something like this:
#!/bin/sh
HOST='1.1.1.1'
USER='user'
PASSWD='pass'
FILE='1234'
ftp -n $HOST <<END_SCRIPT
quote USER $USER
quote PASS $PASSWD
cd .LogbookPlus
get $FILE
quit
END_SCRIPT
exit 0
Instead of getting a specific file, I want to get the last modified file in a folder, or all files created in the last 24 hours. Is this possible via ftp?
This is really pushing the FTP client further than it should be pushed, but it is possible.
Note that the LS_FILE_OFFSET might be different on your system and this won't work at all if the offset is wrong.
#!/bin/sh
HOST='1.1.1.1'
USER='user'
PASSWD='pass'
DIRECTORY='.LogbookPlus'
FILES_TO_GET=1
LS_FILE_OFFSET=57 # Check directory_listing to see where filename begins
rm -f directory_listing
# get listing from directory sorted by modification date
ftp -n $HOST > directory_listing <<fin
quote USER $USER
quote PASS $PASSWD
cd $DIRECTORY
ls -t
quit
fin
# parse the filenames from the directory listing
files_to_get=`cut -c $LS_FILE_OFFSET- < directory_listing | head -$FILES_TO_GET`
# make a set of get commands from the filename(s)
cmd=""
for f in $files_to_get; do
cmd="${cmd}get $f
"
done
# go back and get the file(s)
ftp -n $HOST <<fin
quote USER $USER
quote PASS $PASSWD
cd $DIRECTORY
$cmd
quit
fin
exit 0
You should have definitely given some more information about the systems you are using, e.g. not every ftp server supports ls -t that #JesseParker uses. I used the opportunity and put some ideas that I have used myself for some time into a script that uses awk to to the dirty deeds. As you can see, knowing what flavor of unix your client uses would be beneficial. I have tested this script to run under Debian Wheezy GNU/Linux and FreeBSD 9.2.
#!/bin/sh
# usage: <this_script> <num_files> <date...> [ <...of...> <...max....> <...age...> ... ]
#
# Fetches files from preconfigured ftp server to current directory.
# Maximum number of files is <num_files>
# Only files that have a newer modification time than given date are considered.
# This date is given according to the local 'date' command, which is very different
# on BSD and GNU systems, e.g.:
#
# GNU:
# yesterday
# last year
# Jan 01 1970
#
# BSD:
# -v-1d # yesterday (now minus 1 day)
# -v-1y # last year (now minus 1 year)
# -f %b %e %C%y Jan 01 1970 # format: month day century year
#
# Script tries to autodetect date system, YMMV.
#
# BUGS:
# Does not like quotation marks (") in file names, maybe much more.
#
# Should not have credentials inside this file, but maybe have them
# in '.netrc' and not use 'ftp -n'.
#
# Plenty more.
#
HOST='1.1.1.1'
USER='user'
PASSWD='pass'
DIR='.LogbookPlus'
# Date format for numerical comparison. Can be simply +%s if supported.
DATE_FMT=+%C%y%m%d%H%M%S
# The server's locale for date strings.
LC_SRV_DATE=C
# The 'date' command from BSD systems and that from the GNU coreutils
# are completely different. Test for the appropriate system here:
if LC_ALL=C date -j -f "%b %e %C%y" "Jan 01 1970" $DATE_FMT > /dev/null 2>&1 ; then
SYS_TYPE=BSDish
elif LC_ALL=C date -d "Jan 01 1970" $DATE_FMT > /dev/null 2>&1 ; then
SYS_TYPE=GNUish
else
echo "sh: don't know how to date ;-) sorry!"
exit 1;
fi
# Max. number of files to get (newest files first)
MAX_NUM=$(( ${1:-1} + 0 )) # ensure argv[1] is treated as a number
shift
# Max. age of files. Only files newer that this will be considered.
if [ GNUish = "$SYS_TYPE" ] ; then
MAX_AGE=$( date "$DATE_FMT" -d "${*:-yesterday}" )
elif [ BSDish = "$SYS_TYPE" ] ; then
MAX_AGE=$( date -j "${*:--v-1d}" "$DATE_FMT" )
fi
# create temporary file
TMP_FILE=$(mktemp)
trap 'rm -f "$TMP_FILE"' EXIT INT TERM HUP
ftp -i -n $HOST <<END_FTP_SCRIPT | \
awk -v max_age="$MAX_AGE" \
-v max_num="$MAX_NUM" \
-v date_fmt="$DATE_FMT" \
-v date_loc="$LC_SRV_DATE" \
-v sys_type="$SYS_TYPE" \
-v tmp_file="$TMP_FILE" '
BEGIN {
# columns in the 'dir' output from the ftp server:
# drwx------ 1 user group 4096 Apr 8 2009 Mail
# -rw------- 1 user group 13052 Nov 20 02:07 .bash_history
perm=1; links=2; user=3; group=4; size=5; month=6; day=7; yeartime=8; # name=$9..$NF
if ( "BSDish" == sys_type ) {
date_cmd="LC_ALL=" date_loc " date -j -f"
} else if ( "GNUish" == sys_type ) {
date_cmd="LC_ALL=" date_loc " date -d"
} else {
print "awk: don'\''t know how to date ;-) sorry!" > "/dev/stderr"
exit 1;
}
files[""] = ""
file_cnt = 0
out_cmd = "sort -rn | head -n " max_num " > " tmp_file
}
$perm ~ /^[^-]/ { # skip non-regular files
next
}
{
if ( "BSDish" == sys_type ) {
if ( $yeartime ~ /[0-9][0-9][0-9][0-9]/ ) {
ts_fmt = "\"%b %e %C%y\""
} else if ( $yeartime ~ /[0-9][0-9:[0-9][0-9]/ ) {
ts_fmt = "\"%b %e %H:%M\""
} else {
print "has neither year nor time: " $8
exit 1
}
} else { # tested in BEGIN: must be "GNUish"
ts_fmt = ""
}
cmd = date_cmd " " ts_fmt " \"" $month " " $day " " $yeartime "\" " date_fmt
cmd | getline timestamp
close( cmd )
if ( timestamp > max_age ) {
# clear everything but the file name
$perm=$links=$user=$group=$size=$month=$day=$yeartime=""
files[ file_cnt,"name" ] = $0
files[ file_cnt,"time" ] = timestamp
++file_cnt
}
}
END {
for( i=0; i<file_cnt; ++i ) {
print files[ i,"time" ] "\t" files[ i,"name" ] \
| out_cmd
}
close( out_cmd )
print "quote USER '$USER'\nquote PASS '$PASSWD'\ncd \"'$DIR'\""
i = 0
while( (getline < tmp_file) > 0 ) {
$1 = "" # drop timestamp
gsub( /^ /,"" ) # strip leading space
print "get \"" $0 "\""
}
print "quit"
}
' \
| ftp -v -i -n $HOST
quote USER $USER
quote PASS $PASSWD
cd "$DIR"
dir .
quit
END_FTP_SCRIPT

How to match a list of strings in two different files using a loop structure?

I have a file processing task that I need a hand in. I have two files (matched_sequences.list and multiple_hits.list).
INPUT FILE 1 (matched_sequences.list):
>P001 ID
ABCD .... (very long string of characters)
>P002 ID
ABCD .... (very long string of characters)
>P003 ID
ABCD ... ( " " " " )
INPUT FILE 2 (multiple_hits.list):
ID1
ID2
ID3
....
What I want to do is match the second column (ID2, ID4, etc.) with a list of IDs stored in multiple_hits.list. Then create a new matched_sequences file similar to the original but which excludes all IDs found in multiple_hits.list (about 60 out of 1000). So far I have:
#!/bin/bash
X=$(cat matched_sequences.list | awk '{print $2}')
Y=$(cat multiple_hits.list | awk '{print $1}')
while read matched_sequenes.list
do
[ $X -ne $Y ] && (cat matched_sequences.list | awk '{print $1" "$2}') > new_matched_sequences.list
done
I get the following error raised:
-bash: read: `matched_sequences.list': not a valid identifier
Many thanks in advance!
EXPECTED OUTPUT (new_matched_sequences.list):
Same as INPUT FILE 1 with all IDs in multiple_hits.list excluded
#!/usr/bin/awk -f
function chomp(s) {
sub(/^[ \t]*/, "", s)
sub(/[ \t\r]*$/, "", s)
return s
}
BEGIN {
file = ARGV[--ARGC]
while ((getline line < file) > 0) {
a[chomp(line)]++
}
RS = ""
FS = "\n"
ORS = "\n\n"
}
{
id = chomp($1)
sub(/^.* /, "", id)
}
!(id in a)
Usage:
awk -f script.awk matched_sequences.list multiple_hits.list > new_matched_sequences.list
A shorter awk answer is possible, with a tiny script reading first the file with the IDs to exclude, and then the file containing the sequences. The script would be as follows (comments make it long, it's just three useful lines in fact:
BEGIN { grab_flag = 0 }
# grab_flag will be used when we are reading the sequences file
# (not absolutely necessary to set here, though, because we expect the file will start with '>')
FNR == NR { hits[$1] = 1 ; next } # command executed for all lines of the first file: record IDs stored in multiple_hits.list
# otherwise we are reading the second file, containing the sequences:
/^>/ { if (hits[$2] == 1) grab_flag = 0 ; else grab_flag = 1 } # sets the flag indicating whether we have to output the sequence or not
grab_flag == 1 { print }
And if you call this script exclude.awk, you will invoke it this way:
awk -f exclude.awk multiple_hits.list matched_sequences.list

Resources