How to perform a static count of loaded packages in R?

I'd like to search a directory structure to count the number of times I've loaded various R packages. The source is contained in .org and .R files. I'm willing to assume that "library(" is the first non-blank entry on any line I care about, and that there is at most one such call per line.
find . -regex ".*/.*\.org" -print
gets me a list of .org files, and
find . -regex ".*\.\(org\|R\)$" -print
gets me a list of .org and .R files (thanks to https://unix.stackexchange.com/questions/15308/how-to-use-find-command-to-search-for-multiple-extensions).
Given a particular file,
grep -h "library(" file | sed 's/library(//' | sed 's/)//'
gets me the package name. I'd like to hook them together and then possibly redirect the output to a file, from which I can use R to calculate frequencies.
The seemingly straightforward
find . -regex ".*/.*\.org" -print | xargs -0 grep -h "library(" | sed 's/library(//' | sed 's/)//'
doesn't work; I get
find . -regex ".*/.*\.org" -print | xargs -0 grep -h "library(" | sed 's/library(//' | sed 's/)//'
Usage: /usr/bin/grep [OPTION]... PATTERN [FILE]...
Try '/usr/bin/grep --help' for more information.
and I'm not sure what to do next.
I also tried
find . -regex ".*/.*\.org" -exec grep -h "library(" "{}" "\;"
and got
find . -regex ".*/.*\.org" -exec grep -h "library(" "{}" "\;"
find: missing argument to `-exec'
It seems simple. What am I missing?
UPDATE: Adding -t to the above xargs shows me the first command:
grep -h library ./dirname/filename.org
followed by, presumably, a list of all the matching files with paths relative to the PWD. Actually, that works if I only search for .org files; if I add .R files, too, I get "xargs: argument line too long". I think that means xargs is passing the entire list of files as the argument to one invocation of grep.

The find output mode and the xargs input mode have to match: -print emits newline-separated names (what plain xargs expects), while -print0 emits NUL-separated names (what xargs -0 expects):
find ... -print | xargs OK
find ... -print0 | xargs -0 OK
find ... -print0 | xargs broken
find ... -print | xargs -0 broken (what you used)
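With the modes matched, the original command works; a sketch using the NUL-safe pair:
find . -regex ".*/.*\.org" -print0 | xargs -0 grep -h "library("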
Also, please don't:
grep -h "library(" | sed 's/library(//' | sed 's/)//'
when this is faster:
grep -h "library(" | sed -e 's/library(//' -e 's/)//'
and this is even faster, and more interesting:
grep -h "library(" | grep -o '(.*)' | tr -d ' ()'

Related

Why does sed only show one line through a pipe

I have several txt files under a directory, and I want to see the first line of every file
So I use
ls *txt | xargs sed -n '1p'
however, it only returns the first line of the first file
What is wrong?
P.S.: I know I can use head, but what I ask is why sed is not working
Use the argument -t to xargs to see what is going on:
ls *txt | xargs -t sed -n '1p'
You will see that sed is run as:
sed -n '1p' foo.txt bar.txt gar.txt
and as sed treats all of its input files as one continuous stream, line 1 is the first line of foo.txt, so that is the only line printed.
xargs is assuming you want to pass the list of input files all together.
To tell it to pass one at a time, use the -L 1 option, which tells xargs to run a separate command for each line of input from your ls command.
[Note there are other issues you will run into if any file name has a blank in it; a safer variant is sketched below]
ls *txt | xargs -L 1 -t sed -n '1p'
You will see:
sed -n '1p' foo.txt
sed -n '1p' bar.txt
sed -n '1p' gar.txt
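If any file name contains a blank (the caveat noted above), line-based splitting breaks; NUL-separated output is safe. A sketch, assuming your find supports -print0 (GNU and BSD versions do):
find . -maxdepth 1 -name '*txt' -print0 | xargs -0 -n 1 sed -n '1p'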
In unix there are many ways to do any task; other ways include:
(if you use /bin/csh or /bin/tcsh):
foreach f (*txt)
    echo -n "${f}:"
    sed -n '1p' $f
end
If you use /bin/sh or /bin/ksh, then:
for files in *txt; do
    echo -n "$files :"
    sed -n '1p' "$files"
done
Also consider using the program find; it lets you qualify the types of files you want to look at and can recursively examine subdirectories:
find . -type f -a -name "*txt" -print -a -exec sed -n '1p' {} \;
First, ls and xargs are not useful here. Please read: "Don't Parse ls". For a more reliable form of the command that works with all kinds of file names, use:
sed -n '1p' *.txt
Second, sed treats its input files all as one stream. So, the above does not do what you want. Use head instead (as you said):
head -n1 *.txt
To suppress the verbose headers that head prints and make the output more like sed 1p, use the -q option:
head -qn1 *.txt
Handling very many files
If you have very many .txt files (depending on system configuration, "many" likely means several tens of thousands of files), the glob will exceed the shell's argument-length limit and another approach is needed. find is useful:
find . -maxdepth 1 -name '*.txt' -exec head -n1 {} +
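Note that head prints its per-file headers again here, because each invocation receives a batch of files; the -q option from above suppresses them:
find . -maxdepth 1 -name '*.txt' -exec head -qn1 {} +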
This might work for you (GNU sed):
sed -s '1!d' file1 file2 file...
This will print the first line of each file, i.e. it deletes all lines but the first of each file.
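Because -s treats each input file separately, it also composes with find's batched -exec for recursive searches; a sketch, again assuming GNU sed:
find . -name '*.txt' -exec sed -s '1!d' {} +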

find + sed, filename output

I have a directory, D:/Temp, with a lot of subfolders containing text files. Each folder has a "file.txt". Some file.txt files contain the word "pattern". I would like to check how many matches there are, and also get the file path of each matching file.txt:
find D:/Temp -type f -name "file.txt" -exec basename {} cat {} \; | sed -n '/pattern/p' | wc -l
Output should be:
4
D:/Temp/abc1/file.txt
D:/Temp/abc2/file.txt
D:/Temp/abc3/file.txt
D:/Temp/abc4/file.txt
Or similar.
You could use GNU grep:
grep -lr --include file.txt "pattern" "D:/Temp/"
This will return the file paths.
grep -cr --include file.txt "pattern" "D:/Temp/"
This will return a per-file count (counting matching lines in each file rather than the number of matching files)
Explanation of the flags:
-r makes grep browse its target recursively, so the target can be a directory
--include <glob> restricts the recursive browsing to files matching the <glob>
-l makes grep print only the file paths. Additionally, it stops parsing a file as soon as it encounters the pattern.
-c makes grep print only the number of matching lines per file
The two calls can be combined into the asked-for output, as sketched below.
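A minimal sketch of that combination (assuming at least one matching file and no newlines in file names), printing the count first and then the paths:
files=$(grep -lr --include file.txt "pattern" "D:/Temp/")
printf '%s\n' "$files" | wc -l    # number of matching files
printf '%s\n' "$files"            # their paths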
If your file names don't contain spaces then all you need is:
awk '/pattern/{print FILENAME; cnt++; nextfile} END{print cnt+0}' $(find D:/Temp -type f -name "file.txt")
The above uses GNU awk for nextfile.
I'd propose using two commands: one to find all the files:
find ./ -name "file.txt" -exec fgrep -l "pattern" {} \;
and another to count them:
find ./ -name "file.txt" -exec fgrep -l "pattern" {} \; | wc -l
Previously I've used:
grep -Hc "pattern" $(find D:/temp -type f -name "file.txt")
This will only work if at least one file.txt is found. Otherwise you could use the following, which accounts for both cases, whether files are found or not:
searchFiles=$(find D:/temp -type f -name "file.txt"); [[ ! -z "$searchFiles" ]] && grep -Hc "pattern" $searchFiles
The output for this would look more like:
D:/Temp/abc1/file.txt 2
D:/Temp/abc2/file.txt 1
D:/Temp/abc3/file.txt 1
D:/Temp/abc4/file.txt 1
I would use
find D:/Temp -type f -name "file.txt" -exec dirname {} \; > tmpfile
wc -l tmpfile
cat tmpfile
rm tmpfile
Give a try to this safe and standard version:
find D:/Temp -type f -name file.txt -printf "%p\0" | xargs -0 -n 1 bash -c 'printf "%s:" "$0"; grep -c "pattern" "$0"' | grep ":[1-9][0-9]*$"
For each file.txt found in the D:/Temp directory and its sub-directories, xargs runs one bash per file, printing the filename and the number of lines which contain pattern (grep -c).
A final grep ":[1-9][0-9]*$" selects only filenames with a count greater than 0.
The way I'm reading your question, I'm going to answer as if:
some but not all file.txt files contain pattern,
you want a list of the paths leading to file.txt with pattern, and
you want a count of pattern in each of those files.
There are a few options. (Always multiple ways to do anything.)
If your bash is version 4 or higher, you can use globstar to recurse through directories:
shopt -s globstar
for file in **/file.txt; do
    if count=$(grep -c 'pattern' "$file"); then
        printf "%d %s\n" "$count" "${file%/*}"
    fi
done
This works because the if evaluation considers a failed grep (i.e. zero occurrences) to be FALSE, and thus does not print results.
Note that this may be high impact because it launches a separate grep on each file that is found. A lighter weight alternative might be to run a single grep on the fileglob, and parse the results:
shopt -s globstar
grep -c 'pattern' **/file.txt | grep -v ':0$'
This also depends on bash 4, and of course if you have millions of files you may overwhelm bash's command line maximum length. The output of this will be obvious, but you'll need to parse it with care if your filenames contain colons. I.e. cut -d: -f2 may not cut it.
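Since the count is always the last colon-delimited field, stripping from the final colon is more robust than cutting at the first; a sketch continuing the globstar example:
grep -c 'pattern' **/file.txt | grep -v ':0$' |
    awk '{ count = $0; sub(/.*:/, "", count)    # keep text after the last colon
           file = $0; sub(/:[0-9]+$/, "", file) # drop the trailing :count
           print count, file }'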
One more option that leverages grep instead of bash might be:
grep -r --include 'file.txt' -c 'pattern' ./ | grep -v ':0$'
This uses GNU grep's --include option, which modifies the behaviour of -r (recursive). It should work on Linux, FreeBSD, NetBSD, and OS X, but not with the default grep on OpenBSD or most SVR4 systems (Solaris, HP-UX, etc.).
Note that I have tested none of these. No liability assumed. May contain nuts.
This should do it:
find . -name "file.txt" -type f -printf '%p\n' | awk '{print} END { print NR }'

Is there any way to grep through bz2 files in parallel

I recently found this solution for grepping through compressed .gz files in parallel, based on the number of cores available.
find . -name "*.gz" | xargs -n 1 -P 3 zgrep -H '{pattern to search}'
P.S. 3 is the number of cores
I was wondering if there was a way to do it for bz2 files as well.
Currently I am using this command:
find -type f -name '*.bz2' -execdir bzgrep "{text to find}" {} /dev/null \;
Change *.gz to *.bz2; change zgrep to bzgrep, and there you are.
For a bit of extra safety around unusual filenames, use -print0 on the find end and -0 on the xargs:
find . -name "*.bz2" -print0 | xargs -0 -n 1 -P 3 bzgrep -H '{pattern to search}'

Removing files with a specific ending. Need something more specific

I'm trying to purge all thumbnails created by Wordpress because of a CMS switchover that I'm planning.
find -name \*-*x*.* | xargs rm -f
But I don't know bash or regex well enough to figure out how to add a bit more specificity, so that only files matching the following pattern are removed.
All generated files have the form
<img-name>-<width:integer>x<height:integer>.<file-ext>
You didn't quote or escape all your wildcards, so the shell will try to expand them before find executes.
Quoting it should work
find -name '*-*x*.*' | xargs echo rm -f
Remove the echo when you're satisfied it works. You could also check that two of the fields are numbers by switching to -regex, but not sure if you need/want that here.
Regex solution:
find -regex '^.*/[A-Za-z]+-[0-9]+x[0-9]+\.[A-Za-z]+$' | xargs echo rm -f
Note: I'm assuming img-name and file-ext can only contain letters
You can try this:
find -type f | grep -P '\w+-\d+x\d+\.\w+$' | xargs rm
If you have spaces in the path:
find -type f | grep -P '\w+-\d+x\d+\.\w+$' | sed -re 's/(\s)/\\\1/g' | xargs rm
Example:
find -type f | grep -P '\w+-\d+x\d+\.\w+$' | sed -re 's/(\s)/\\\1/g' | xargs ls -l
-rw-rw-r-- 1 tiago tiago 0 Jun 22 15:14 ./image-800x600.png
-rw-rw-r-- 1 tiago tiago 0 Jun 22 15:17 ./test 2/test 3/image-800x600.png
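Escaping whitespace with sed is fragile; NUL-delimited output avoids the problem entirely. A sketch using find's own regex (same letters-only assumption as the regex answer above; keep the echo until the listed files look right):
find . -type f -regex '.*/[A-Za-z]+-[0-9]+x[0-9]+\.[A-Za-z]+' -print0 | xargs -0 echo rm -f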
The GNU find command below will remove all files whose contents contain this <img-name>-<width:integer>x<height:integer>.<file-ext> syntax string. I also assumed that the corresponding files have a . in their file names.
find . -name "*.*" -type f -exec grep -l '<img-name>-<width:integer>x<height:integer>.<file-ext> syntax' {} \; | xargs rm -f
Explanation:
. The directory in which the find operation takes place (. represents your current directory).
-name "*.*" The file name must contain a dot.
-type f Only regular files.
-exec grep -l '<img-name>-<width:integer>x<height:integer>.<file-ext> syntax' {} \; prints the names of the files which contain the above-mentioned string.
xargs rm -f Each file name found is fed to xargs and removed.

Filenames and line numbers for the matches of cat and grep

My code
$ *.php | grep google
How can I print the filenames and linenumbers next to each match?
grep google *.php
if you want to span many directories:
find . -name \*.php -print0 | xargs -0 grep -n -H google
(as explained in comments, -H is useful if xargs comes up with only one remaining file)
You shouldn't be doing
$ *.php | grep
That means "run the first PHP program, with the name of the rest wildcarded as parameters, and then run grep on the output".
It should be:
$ grep -n -H "google" *.php
The -n flag tells it to do line numbers, and the -H flag tells it to display the filename even if there's only one file. Grep will default to showing the filenames if there are multiple files being searched, but you probably want to make sure the output is consistent regardless of how many matching files there are.
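Each match then comes out in the form filename:linenumber:matching line; hypothetical sample output:
$ grep -n -H "google" *.php
search.php:12:    $url = "http://www.google.com/search?q=" . $query;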
grep -RH "google" *.php
Please take a look at ack at http://betterthangrep.com. The equivalent in ack of what you're trying is:
ack google --php
find ./*.php -exec grep -l 'google' {} \;
Use "man grep" to see other features.
for i in $(ls *.php); do grep -n --with-filename "google" $i; done;
find . -name "*.php" -print | xargs grep -n "searchstring"
