dynamically pass string to Rscript argument with sed - r

I wrote a script in R that has several arguments. I want to iterate over 20 directories and execute my script on each while passing in a substring from the file path as my -n argument using sed. I ran the following:
find . -name 'xray_data' -exec sh -c 'Rscript /Users/Caitlin/Desktop/DeMMO_Pubs/DeMMO_NativeRock/DeMMO_NativeRock/R/scipts/dataStitchR.R -f {} -b "{}/SEM_images" -c "{}/../coordinates.txt" -z ".tif" -m ".tif" -a "Unknown|SEM|Os" -d "overview" -y "overview" --overview "overview.*tif" -p FALSE -n "`sed -e 's/.*DeMMO.*[/]\(.*\)_.*[/]xray_data/\1/' "{}"`"' sh {} \;
which results in this error:
ubs/DeMMO_NativeRock/DeMMO_NativeRock/R/scipts/dataStitchR.R -f {} -b "{}/SEM_images" -c "{}/../coordinates.txt" -z ".tif" -m ".tif" -a "Unknown|SEM|Os" -d "overview" -y "overview" --overview "overview.*tif" -p FALSE -n "`sed -e 's/.*DeMMO.*[/]\(.*\)_.*[/]xray_data/\1/' "{}"`"' sh {} \;
sh: command substitution: line 0: syntax error near unexpected token `('
sh: command substitution: line 0: `sed -e s/.*DeMMO.*[/](.*)_.*[/]xray_data/1/ "./DeMMO1/D1T3rep_Dec2019_Ellison/xray_data"'
When I try to use sed with my pattern on an example file path, it works:
echo "./DeMMO1/D1T1exp_Dec2019_Poorman/xray_data" | sed -e 's/.*DeMMO.*[/]\(.*\)_.*[/]xray_data/\1/'
which produces the correct substring:
D1T1exp_Dec2019
I think there's an issue with trying to use single quotes inside the interpreted string, but I don't know how to deal with this. I have tried replacing the single quotes around the sed pattern with double quotes, as well as removing the single quotes; both result in this error:
sed: RE error: illegal byte sequence
How should I extract the substring from the file path dynamically in this case?

To loop through the output of find.
while IFS= read -ru "$fd" -d '' files; do
    echo "$files" ##: do whatever you want to do with the files here.
done {fd}< <(find . -type d -name 'xray_data' -print0)
No embedded commands in quotes.
It uses a shell-allocated file descriptor ({fd}) in case something inside the loop reads from stdin.
Also, -print0 delimits the paths with null bytes, so paths containing spaces, tabs, or newlines are handled safely.
A good first step is to put an echo in front of every command you want to run on the files, so you can see what would be executed before anything actually happens.
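If process substitution or the {fd} redirection is not available (for example in plain sh), a similar effect can be had by letting find hand the paths to a small inline script as positional parameters, so that {} never appears inside the quoted script. A minimal dry-run sketch, assuming a throwaway directory tree with made-up sample names and a hypothetical helper called list_runs:

```shell
# Throwaway tree mimicking the layout in the question (names are made up).
tmp=$(mktemp -d)
mkdir -p "$tmp/DeMMO1/D1T1exp_Dec2019_Poorman/xray_data" \
         "$tmp/DeMMO1/D1T3rep_Dec2019_Ellison/xray_data"

# list_runs is a hypothetical helper: it only echoes what would be run.
list_runs() {
    # The paths arrive in the inner script as "$@"; the script text itself
    # contains no {} and no nested quoting.
    find "$1" -type d -name 'xray_data' -exec sh -c '
        for dir in "$@"; do
            echo "would run on: $dir"
        done' sh {} +
}

list_runs "$tmp"
```

Once the echoed lines look right, the echo can be swapped for the real command.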

This is the solution that ultimately worked for me due to issues with quotes in sed:
for dir in `find . -name 'xray_data'`;
do sampleID="`basename $(dirname $dir) | cut -f1 -d'_'`";
Rscript /Users/Caitlin/Desktop/DeMMO_Pubs/DeMMO_NativeRock/DeMMO_NativeRock/R/scipts/dataStitchR.R -f "$dir" -b "$dir/SEM_images" -c "$dir/../coordinates.txt" -z ".tif" -m ".tif" -a "Unknown|SEM|Os" -d "overview" -y "overview" --overview "overview.*tif" -p FALSE -n "$sampleID";
done
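For comparison, the same substring can be extracted with shell parameter expansion alone, which sidesteps the sed quoting problem entirely. This is only a sketch; sample_id is a hypothetical helper name. Note that ${parent%_*} removes just the last underscore-separated field, like the sed pattern in the question, whereas cut -f1 -d'_' keeps only the first field.

```shell
# sample_id is a hypothetical helper: derive the -n value from a path
# that ends in /xray_data.
sample_id() {
    parent=${1%/*}                 # drop the trailing /xray_data component
    parent=${parent##*/}           # keep only the parent directory name
    printf '%s\n' "${parent%_*}"   # drop the last _field, as the sed pattern does
}

sample_id "./DeMMO1/D1T1exp_Dec2019_Poorman/xray_data"
# prints: D1T1exp_Dec2019
```

Using "${parent%%_*}" instead would give D1T1exp, which is what the cut version above produces.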

Related

Issues with iconv command in script

I am trying to create a script which detects whether files in a directory contain non-UTF-8 characters and, if they do, determines the encoding of that particular file and performs the iconv operation on it.
The code is as follows:
find <directory> |sed '1d'><directory>/filelist.txt
while read filename
do
file_nm=${filename%%.*}
ext=${filename#*.}
echo $filename
q=`grep -axv '.*' $filename|wc -l`
echo $q
r=`file -i $filename|cut -d '=' -f 2`
echo $r
#file_repair=$file_nm
if [ $q -gt 0 ]; then
iconv -f $r -t utf-8 -c ${file_nm}.${ext} >${file_nm}_repaired.${ext}
mv ${file_nm}_repaired.${ext} ${file_nm}.${ext}
fi
done< <directory>/filelist.txt
While running the code, there are several files that turn into 0 byte files and .bak gets appended to the file name.
ls| grep 'bak' | wc -l
36
Where am I making a mistake?
Thanks for the help.
It's really not clear what some parts of your script are supposed to do.
Probably the error is that you are assuming file -i will output a string which always contains =; but it often doesn't.
find <directory> |
# avoid temporary file
sed '1d' |
# use IFS='' read -r
while IFS='' read -r filename
do
    # indent loop body
    file_nm=${filename%%.*}
    ext=${filename#*.}
    # quote variables, print diagnostics to stderr
    echo "$filename" >&2
    # use grep -q instead of useless wc -l; don't enter condition needlessly; quote variable
    if grep -qaxv '.*' "$filename"; then
        # indent condition body
        # use modern command substitution syntax, quote variable
        # check if result contains =
        r=$(file -i "$filename")
        case $r in
            *=*)
                # only perform decoding if we can establish encoding
                echo "$r" >&2
                iconv -f "${r#*=}" -t utf-8 -c "${file_nm}.${ext}" >"${file_nm}_repaired.${ext}"
                mv "${file_nm}_repaired.${ext}" "${file_nm}.${ext}" ;;
            *)
                echo "$r: could not establish encoding" >&2 ;;
        esac
    fi
done
See also Why is testing “$?” to see if a command succeeded or not, an anti-pattern? (tangential, but probably worth reading) and useless use of wc
The grep regex is kind of mysterious. I'm guessing you want to check if the file contains non-empty lines? grep -qa . "$filename" would do that.
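The "does the output contain an encoding" check can be isolated and tried on sample strings without touching any files. A small sketch, assuming the usual "name: type; charset=..." shape of file -i output; charset_of is a hypothetical helper and the sample lines are made up:

```shell
# charset_of is a hypothetical helper: print the charset from one line of
# `file -i` output, or fail if the line does not report one.
charset_of() {
    case $1 in
        *charset=*) printf '%s\n' "${1##*charset=}" ;;
        *)          return 1 ;;
    esac
}

charset_of "notes.txt: text/plain; charset=iso-8859-1"
charset_of "broken.txt: cannot open" || echo "no encoding reported" >&2
```

Only when charset_of succeeds is there something sensible to hand to iconv -f.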

Passing zsh command line arguments into xargs quotations

I have a zsh function, fvi (find vi), which recursively greps a directory searching for files with a pattern, collects them and opens them in vim (on the Mac):
function fvi { grep -rl $1 . | xargs sh -c '/Applications/MacVim.app/Contents/MacOS/Vim -g -- "$@" <$0' /dev/tty }
This looks bad but works fine (on the Mac). But I'd like to set the search pattern for vi to $1 with:
function fvi { grep -rl $1 . | xargs zsh -c '/Applications/MacVim.app/Contents/MacOS/Vim -c +/"$1" -g -- "$@" <$0' /dev/tty }
This of course does not work since xargs/zsh sees the $1 and translates it into a file name. I can manually say -c +/xyz and it will set the pattern to xyz. So I know the vim command syntax is working. I just can't get the shell command argument $1 to be substituted into the xargs string.
Any ideas?
I might just use find:
fvi () {
  v=/Applications/MacVim.app/Contents/MacOS/Vim
  find . -type f -exec grep -q -e "$1" -- {} \; -exec "$v" -g +/"$1" -- {} \;
}
The fact that you are opening each file in vim for interactive editing suggests there are not so many possible matches (or candidates) that running grep multiple times is really an issue. (At worst, you are just replacing each extra shell process started by xargs with an instance of grep.)
This also precludes any possible issue regarding file names that contain a newline.
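If you would rather keep xargs, one option (an assumption on my part, not tested with MacVim) is to pass the pattern to the inner shell as an extra argument ahead of the file names and shift it off before using "$@". The sketch below uses echo in place of Vim and made-up file names; show_args is a hypothetical function name:

```shell
# show_args is a hypothetical stand-in for fvi: echo replaces Vim so the
# argument handling can be seen without opening an editor.
show_args() {
    pattern=$1
    # xargs appends the file names after "sh" and the pattern, so inside the
    # inner script $1 is the pattern and the remaining arguments are the files.
    printf '%s\n' a.txt b.txt |
        xargs sh -c 'pat=$1; shift; echo "pattern=$pat files=$*"' sh "$pattern"
}

show_args xyz
# prints: pattern=xyz files=a.txt b.txt
```

In fvi the inner script would then use +/"$pat" for the search and "$@" for the files, with /dev/tty kept as $0 as in the original.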

Unix case insensitive command line search containing wildcards and spaces

I am attempting to come up with a method to remotely find a list of files on our AIX UNIX machine that meet, what seems in windows, like simple criteria. It needs to be case insensitive (sigh), use wildcards (*) and possibly contain spaces in the path.
For my tests below I was using the ksh shell. However it will need to work in an ssh shell as well.
I am attempting to implement secure FTP in Visual Basic 6 (I know) using plink, command line and a batch file.
Basically find a file like the one below but with case insensitivity:
ls -1 -d -p "/test/rick/01012017fosterYYY - Copy.txt" | grep -v '.*/$'
Thanks for any help.
ls -1 -d -p /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] - [Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] does not exist.
ls: 0653-341 The file - does not exist.
ls: 0653-341 The file [Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls -1 -d -p /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
success - as long as there are no spaces.
ls -1 -d -p "/test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy].[Tt][Xx][Tt]" | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy].[Tt][Xx][Tt] does not exist.
-- Assumption: We cannot use quotes with wildcard characters
ls -1 -d -p "/test/rick/01012017fosterYYY - Copy.txt" | grep -v '.*\/$'
success. not case insensitive.
ls -1 -d -p /test/rick/[0][1][0][1][2][0][1][7][Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] - [Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/[0][1][0][1][2][0][1][7][Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy][ does not exist.
ls: 0653-341 The file ][-][ does not exist.
ls: 0653-341 The file ][Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls -1 -d -p /test/rick/[0][1][0][1][2][0][1][7][Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy][ ][-][ ][Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/[0][1][0][1][2][0][1][7][Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy][ does not exist.
ls: 0653-341 The file ][-][ does not exist.
ls: 0653-341 The file ][Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls -1 -d -p /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy]?-?[Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
success. not very helpful though.
ls -1 -d -p /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy][ ]-[ ][Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy][ does not exist.
ls: 0653-341 The file ]-[ does not exist.
ls: 0653-341 The file ][Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls -1 -d -p /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy]{ }-{ }[Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy]{ does not exist.
ls: 0653-341 The file }-{ does not exist.
ls: 0653-341 The file }[Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls -1 -d -p /test/rick/*01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] - [Cc][Oo][Pp][Yy].[Tt][Xx][Tt]* | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] does not exist.
ls: 0653-341 The file - does not exist.
ls: 0653-341 The file [Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls -1 -d -p "/test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] - [Cc][Oo][Pp][Yy].[Tt][Xx][Tt]" | grep -v '.*\/$'
fails with:
ls: 0653-341 The file /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy] - [Cc][Oo][Pp][Yy].[Tt][Xx][Tt] does not exist.
ls doesn't do pattern matching, any wildcard expansion (globbing) is done by the shell. The glob pattern language is different from regular expressions. Read the ksh documentation for information about globbing ("File Name Generation" in the manpage).
So when you do:
$ touch foo flo fum
$ ls -1 f[ol]o
flo
foo
... the shell notices the globbing characters [], reads the directory contents, replaces it with the matching files, and passes those as parameters to ls. You can show this by using echo instead:
$ echo f[ol]o
flo foo
ksh93 has globbing options available with the ~() construct; option i means "Treat the match as case insensitive":
ksh$ touch foo FoO FOO
ksh$ echo ~(i)foo
foo FoO FOO
bash has a nocaseglob shopt option:
bash$ shopt -s nocaseglob
bash$ touch fOo
bash$ echo FO*
foo
Although note that some globbing character needs to be present to make the magic happen:
bash$ echo FOO
FOO
bash$ echo [F]OO
foo
(to keep this option change local, see https://unix.stackexchange.com/questions/310957/how-to-undo-a-set-x/310963)
It looks as if you're using grep -v '.*/$' to remove lines that are directories. The .* is superfluous here -- grep -v '/$' is equivalent.
But find is a better tool for this kind of searching and filtering, implementing -type f (match regular files) by actually looking at the file attributes, rather than by parsing a bit of ASCII in a listing.
$ touch foo FOO FoO
$ mkdir fOo
$ find . -maxdepth 1 -type f -iname "foo"
./FOO
./foo
./FoO
You could use find's -iname option to allow for case-insensitive searching, so for the example you've provided any of the following should find your file:
find /test/rick -maxdepth 1 -iname '01012017fosterYYY - copy.txt'
# or
find /test/rick -maxdepth 1 -iname '01012017fosteryyy - copy.txt'
# or
find /test/rick -maxdepth 1 -iname '01012017FOSTERyyy - cOpY.txt'
-maxdepth 1 : don't search in sub-directories
-iname : allow for case-insensitive searching
For case-insensitive wildcard searches when the -maxdepth and -iname flags are not available in AIX find, you can pipe the find results to grep:
find /test/rick/. \( ! -name . -prune \) -type f -print | grep -i ".*foster.*\.txt"
find [InThisFolder] [ExcludeSubfolders] [FileTypes] | grep [InsensitiveWildcardName]
Though, this can still be problematic if you have a folder structure like "/test/rick/rick/".
The following code gives results with the current directory signifier ".":
find /test/rick/. \( ! -name . -prune \) -type f -print | grep -i ".*foster.*\.txt"
But you can pass the results to sed and find "/./" and replace with "/".
find /test/rick/. \( ! -name . -prune \) -type f -print | grep -i ".*foster.*\.txt" | sed 's/\/\.\//\//g'
* UPDATE *
Based on this page: http://mywiki.wooledge.org/ParsingLs
I’ve come up with the following command (for loop on file expansion or globbing) which avoids the problematic "/test/rick/rick/" folder structure from the find | grep solution above. It searches a folder from any folder, handles spaces, and handles case insensitivity without having to specify escape characters or upper/lower matching ([Aa]).
Just modify the searchfolder and searchpattern:
searchfolder="/test/rick"; searchpattern="*foster*.txt"; for file in "$searchfolder"/*.*; do [[ -e "$file" ]] || continue; if [[ "$(basename "$file" | tr '[:upper:]' '[:lower:]')" = $searchpattern ]]; then echo "$file"; fi; done
It does this:
Set the folder path to search (searchfolder="/test/rick";)
Set the search pattern (searchpattern="*foster*.txt")
Loop for every file on the search folder (for file in "$searchfolder"/*.*;)
Make sure the file exists ( [[ -e "$file" ]] || continue;)
Transform any base file name uppercase characters to lowercase (basename "$file" | tr '[:upper:]' '[:lower:]')
Test if the lowercased base file name matches the search pattern and, if so, print the full path and filename (if [[ "$(basename "$file" | tr '[:upper:]' '[:lower:]')" = $searchpattern ]]; then echo "$file"; fi;)
Tested on AIX (Version 6.1.0.0) in ksh (Version M-11/16/88f) and ksh93 (Version M-12/28/93e).
What I finally used (because I don't have access to -maxdepth or -iname) was just to use case insensitive wildcards together with quotes around spaces.
ls -1 -d -p /test/rick/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy]' '-' '[Cc][Oo][Pp][Yy].[Tt][Xx][Tt] | grep -v '.*\/$'
That way I don't have to install or upgrade anything and probably cause more problems just so I can get a simple list of files.
NOTE: AIX UNIX will still throw in some garbage errors if you have any sub directories under the path. I tapped out on this and just parsed these useless messages out on the client side.
Thanks everyone who responded.
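The reason the final command works is that quoting only the spaces keeps the bracket expressions unquoted, so the shell still expands them. A small sketch that can be tried in any POSIX shell, assuming a temporary directory and a made-up file name; find_copy is a hypothetical helper:

```shell
# Made-up test file in a throwaway directory.
tmp=$(mktemp -d)
touch "$tmp/01012017FosterYYY - Copy.TXT"

# find_copy is a hypothetical helper: the brackets stay unquoted (so they
# glob), while the ' - ' part is quoted (so it is not split into words).
find_copy() {
    for f in "$1"/01012017[Ff][Oo][Ss][Tt][Ee][Rr][Yy][Yy][Yy]' - '[Cc][Oo][Pp][Yy].[Tt][Xx][Tt]; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

find_copy "$tmp"
```

Quoting the whole pattern, as in the earlier attempts, turns the brackets into literal characters, which is why those attempts failed.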

find + sed, filename output

I have a directory, D:/Temp, which contains many subfolders with text files. Each folder has a "file.txt". Some of the file.txt files contain the word "pattern". I would like to count the occurrences of "pattern" and also get the path of each file.txt that contains it:
find D:/Temp -type f -name "file.txt" -exec basename {} cat {} \; | sed -n '/pattern/p' | wc -l
Output should be:
4
D:/Temp/abc1/file.txt
D:/Temp/abc2/file.txt
D:/Temp/abc3/file.txt
D:/Temp/abc4/file.txt
Or similar.
You could use GNU grep :
grep -lr --include file.txt "pattern" "D:/Temp/"
This will return the file paths.
grep -cr --include file.txt "pattern" "D:/Temp/"
This will return the count for each file (the number of matching lines, rather than the number of files).
Explanation of the flags :
-r makes grep recursively browse its target, that can then be a directory
--include <glob> makes grep restrict its recursive browsing to files matching the <glob>.
-l makes grep only return the paths of the matching files. Additionally, it stops reading a file as soon as it has encountered the pattern.
-c makes grep only return the number of matching lines
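To get output shaped like the one in the question (a count followed by the paths), the list from -l can be counted and then printed. The sketch below uses find with -exec instead of --include so that it does not depend on GNU grep; it assumes file names without newlines, a throwaway directory with made-up contents, and a hypothetical helper called report:

```shell
# Made-up sample tree: two of the three file.txt files contain "pattern".
tmp=$(mktemp -d)
mkdir "$tmp/abc1" "$tmp/abc2" "$tmp/abc3"
echo "a pattern here"             >"$tmp/abc1/file.txt"
echo "nothing to see"             >"$tmp/abc2/file.txt"
printf 'pattern\npattern again\n' >"$tmp/abc3/file.txt"

# report is a hypothetical helper: print the number of matching files,
# then their paths, one per line.
report() {
    files=$(find "$1" -type f -name 'file.txt' -exec grep -l -- "$2" {} + | sort)
    if [ -n "$files" ]; then
        printf '%s\n' "$files" | grep -c ''   # count the lines of the list
        printf '%s\n' "$files"
    else
        echo 0
    fi
}

report "$tmp" pattern
```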
If your file names don't contain spaces then all you need is:
awk '/pattern/{print FILENAME; cnt++; nextfile} END{print cnt+0}' $(find D:/Temp -type f -name "file.txt")
The above used GNU awk for nextfile.
I'd suggest using two commands: one for finding all the files:
find ./ -name "file.txt" -exec fgrep -l "pattern" {} \;
Another for counting them:
find ./ -name "file.txt" -exec fgrep -l "pattern" {} \; | wc -l
Previously I've used:
grep -Hc "pattern" $(find D:/temp -type f -name "file.txt")
This will only work if at least one file.txt is found. Otherwise you could use the following, which handles both the case where files are found and the case where none are:
searchFiles=$(find D:/temp -type f -name "file.txt"); [[ ! -z "$searchFiles" ]] && grep -Hc "pattern" $searchFiles
The output for this would look more like:
D:/Temp/abc1/file.txt 2
D:/Temp/abc2/file.txt 1
D:/Temp/abc3/file.txt 1
D:/Temp/abc4/file.txt 1
I would use
find D:/Temp -type f -name "file.txt" -exec dirname {} \; > tmpfile
wc -l tmpfile
cat tmpfile
rm tmpfile
Give a try to this safe and standard version:
find D:/Temp -type f -name file.txt -printf "%p\0" | xargs -0 -n 1 bash -c 'printf "%s:" "$1"; grep -c "pattern" "$1"' _ | grep ":[1-9][0-9]*$"
For each file.txt file found in D:/Temp directory and sub-directories, the xargs command prints the filename and the number of lines which contain pattern (grep -c).
A final grep ":[1-9][0-9]*$" selects only filenames with a count greater than 0.
The way I'm reading your question, I'm going to answer as if:
some but not all file.txt files contain pattern,
you want a list of the paths leading to file.txt with pattern, and
you want a count of pattern in each of those files.
There are a few options. (Always multiple ways to do anything.)
If your bash is version 4 or higher, you can use globstar to recurse through directories:
shopt -s globstar
for file in **/file.txt; do
    if count=$(grep -c 'pattern' "$file"); then
        printf "%d %s\n" "$count" "${file%/*}"
    fi
done
This works because the if evaluation considers a failed grep (i.e. zero occurrences) to be FALSE, and thus does not print results.
Note that this may be high impact because it launches a separate grep on each file that is found. A lighter weight alternative might be to run a single grep on the fileglob, and parse the results:
shopt -s globstar
grep -c 'pattern' **/file.txt | grep -v ':0$'
This also depends on bash 4, and of course if you have millions of files you may overwhelm bash's command line maximum length. The output of this will be obvious, but you'll need to parse it with care if your filenames contain colons. I.e. cut -d: -f2 may not cut it.
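One way to split such a line safely is to cut at the last colon rather than the first, using parameter expansion. A short sketch; split_count is a hypothetical helper and the sample line is made up to resemble the D:/Temp paths in the question:

```shell
# split_count is a hypothetical helper: take one "path:count" line from
# grep -c and print "count path", splitting at the LAST colon only.
split_count() {
    printf '%s %s\n' "${1##*:}" "${1%:*}"
}

split_count "D:/Temp/abc1/file.txt:2"
# prints: 2 D:/Temp/abc1/file.txt
```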
One more option that leverages grep instead of bash might be:
grep -r --include 'file.txt' -c 'pattern' ./ | grep -v ':0$'
This uses GNU grep's --include option, which modifies the behaviour of -r (recursive). It should work in Linux, FreeBSD, NetBSD, OSX, but not with the default grep on OpenBSD or most SVR4 (Solaris, HP/UX, etc).
Note that I have tested none of these. No liability assumed. May contain nuts.
This should do it:
find . -name "file.txt" -type f -printf '%p\n' | awk '{print} END { print NR }'

Shell script question

I want to execute following command in shell script
cp /somedire/*.(txt|xml|xsd) /destination/dir/
But this does not run inside shell script. Any quick help?
createjob.sh: line 11: syntax error near unexpected token `('
My shell is zsh.
Thanks
Nayn
Your use of parentheses and alternation is a zsh-specific construct. It doesn't work in other shells, including zsh in sh compatibility mode.
If you want to keep using this construct, you'll have to invoke zsh as zsh (presumably by replacing #!/bin/sh by #!/bin/zsh or something like that).
If you need your script to run on ksh, use #!/bin/ksh or #!/usr/bin/env ksh and
cp /somedire/*.@(txt|xml|xsd) /destination/dir/
If you also need to support bash, that same command with the @ will work provided you run the following commands first:
shopt -s extglob 2>/dev/null ## tell bash to parse ksh globbing extensions
setopt ksh_glob 2>/dev/null ## tell zsh to parse ksh globbing extensions
If you need POSIX sh compatibility, you'll have to use three separate commands, and prepare for an error message if any of the three extensions has no match. A more robust solution would use find:
find /somedire/. ! -name . -type d -prune -o \
\( -name '*.txt' -o -name '*.xml' -o -name '*.xsd' \) \
-exec sh -c 'cp "$@" "$0"' /destination/dir {} +
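For the plain POSIX sh route mentioned above, the three globs can also be walked in a single loop that skips any pattern with no match, which avoids the error messages. A sketch under assumed names: the directories are temporary stand-ins for /somedire and /destination/dir, and copy_by_ext is a hypothetical helper:

```shell
# Temporary stand-ins for /somedire and /destination/dir.
src=$(mktemp -d)
dest=$(mktemp -d)
touch "$src/a.txt" "$src/b.xml" "$src/c.log"   # note: no .xsd file at all

# copy_by_ext is a hypothetical helper: copy *.txt, *.xml and *.xsd
# from the first directory into the second.
copy_by_ext() {
    for f in "$1"/*.txt "$1"/*.xml "$1"/*.xsd; do
        # An unmatched glob is left as literal text; skip it.
        [ -e "$f" ] || continue
        cp "$f" "$2"/
    done
}

copy_by_ext "$src" "$dest"
ls "$dest"
```

This runs one cp per file, which is fine for a handful of files but slower than the find version for large directories.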
No idea about zsh but Bash doesn’t know about regular expressions in paths, only wildcards.
You can try using find:
find -E . -regex '.*\.(txt|xml|xsd)' -exec cp {} /destination/dir \;
Have a look at the manpage for an explanation of the syntax of find.
This would work in bash, and probably zsh as well: cp /somedire/*.{txt,xml,xsd} /destination/dir/
It's not in POSIX, though, so it won't work with most /bin/sh's.
