Rsync: Escape include/exclude wildcards

I am running rsync --include-from=my_includes_file --exclude="*" source dest, and building the include file from a big find command (similar to find documents/* -newer ~/.lasttime). However, many of my filenames include the "wildcard" characters used by rsync (*, ?, [, ]), and occasionally I have a problem with the include comment character too (#).
Is there a universal way to escape these? Using \[ seems to fix this one, but I'm not sure whether it works for all of them or whether I'm just getting lucky.
Is there an include prefix to force a "simple string match"?

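You are using:
--include-from=FILE read include patterns from FILE
whereas you should use:
--files-from=FILE read list of source-file names from FILE
Names read through --files-from are taken literally rather than as patterns, so the wildcard characters need no escaping.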
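A minimal sketch of that approach, assuming a scratch directory layout invented for the demo (everything under $work is hypothetical). find writes NUL-terminated names and rsync reads them with --from0, so names containing *, ? or [ ] are passed through as-is:

```shell
#!/bin/sh
# Hypothetical scratch layout, created only for this demo.
work=$(mktemp -d)
mkdir -p "$work/src/documents" "$work/dest"
touch "$work/src/documents/report[1].txt" \
      "$work/src/documents/what?.txt" \
      "$work/src/documents/star*.txt"

# Build the list relative to the source directory; -print0 keeps odd names intact.
# In the question, the find command would also carry: -newer ~/.lasttime
(cd "$work/src" && find documents -type f -print0) > "$work/list0"

# --files-from reads literal names; --from0 says they are NUL-terminated.
# No --include/--exclude rules are used here.
if command -v rsync >/dev/null 2>&1; then
    rsync -a --from0 --files-from="$work/list0" "$work/src/" "$work/dest/"
fi
```

The names in the list are interpreted relative to the source directory given on the command line, which is why find is run from inside it.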

Related

Zsh glob: Get everything except stuff in a certain folder

Trying to find all files except those inside vendor/ folders, but why is this failing?
setopt extendedglob
for file in **/*~vendor/; do
done
See if this does what you're looking for:
setopt extendedglob
print -l ^vendor/**/*(.)
The *~ negation syntax usually needs parentheses in order to determine where the expression after the tilde ends. Your pattern is requesting all files and folders except those where the glob result name ends with vendor/. The glob result never includes the trailing slash, so you end up with all of the files and folders.
Adding parens will change the behavior of that pattern, but probably not in a useful way. This will result in a list of all of the directories where the last component is not vendor:
print -l **/(*~vendor)/
so x/y, x/y/vendor/z, and vendor/a will be included, but x/y/vendor will not.
The parentheses limit the 'not' pattern to just one piece of the path. In order to exclude matches at the top-level, the tested component needs to be at the front of the pattern:
print -l (*~vendor)/**/*
The very first pattern above uses the ^ syntax to produce the same results. The (.) glob qualifier in that pattern limits the globbing to plain files, so directories are not included.
Another variation that may be useful - this will exclude directories that have any component named vendor. It is similar to find -prune:
print -l (^vendor/)#*(.)
This will produce a list of all files except those in subdirectories with names like vendor/x, a/vendor and a/vendor/b.
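For comparison, the find -prune behaviour mentioned above can be sketched in plain shell. The directory and file names are made up for the demo, and this is the find counterpart rather than a zsh glob:

```shell
#!/bin/sh
# Hypothetical tree, created only for this demo.
demo=$(mktemp -d)
mkdir -p "$demo/x/y" "$demo/vendor/a" "$demo/x/vendor/b"
touch "$demo/top.txt" "$demo/x/y/keep.txt" \
      "$demo/vendor/a/skip1.txt" "$demo/x/vendor/b/skip2.txt"

# Do not descend into any directory named vendor; print the remaining plain files.
kept=$(cd "$demo" && find . -name vendor -prune -o -type f -print | sort)
printf '%s\n' "$kept"
# ./top.txt
# ./x/y/keep.txt
```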

How to use AWS CLI to only copy files in S3 bucket that match a given string pattern

I'm using the AWS CLI to copy files from an S3 bucket to my R machine using a command like below:
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
)
This works as expected, i.e. it copies all files in my_bucket_location that have "trans" in the filename at that location.
The problem that I am facing is that I have other files with similar naming conventions that I don't want to import in this step. As an example, in the list below I only want to copy the first two files, not the last two:
File list
trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv
If I were using regex I could make it more specific, like "^trans_\\d+", to bring in just the first two files, but this doesn't seem possible using the AWS CLI. So my question is: is there a way to have more complex pattern matching using the AWS CLI, like below?
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '^trans_\\d+' --region us-east-1"
)
Please note that I can only use information about the file in question, i.e. that I want to import a file with pattern "^trans_\\d+". I can't use the fact that the other unwanted files contain sum_ at the start, because this is only an example; there could be other files with similar names like "check_trans_120215.csv".
I have considered other alternatives like those below, but I'm hoping there is a way to adjust the copy command to avoid going down either of these routes:
Listing all items in the bucket > using regex in R to specify the files that I want > Only importing those files
Keeping the copy command as it is > delete unwanted files on the R machine after the copy
The alternatives that you have listed are the best options because S3 CLI doesn't support regex.
Use of Exclude and Include Filters:
Currently, there is no support for the use of UNIX style wildcards in
a command's path arguments. However, most commands have --exclude
"<value>" and --include "<value>" parameters that can achieve the
desired result. These parameters perform pattern matching to either
exclude or include a particular file or object. The following pattern
symbols are supported.
*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence
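The first alternative (list, filter with a regex, then copy only the matches) can be sketched in shell. The filter below runs on the file names from the question; the commented-out pipeline reuses the question's placeholder bucket and destination, and its awk step is an assumption that only holds for keys without spaces:

```shell
#!/bin/sh
# Keep only names that start with trans_ followed by digits.
filter_keys() {
    grep -E '^trans_[0-9]+'
}

# Stand-in for the key column of a bucket listing (names from the question).
listing='trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv'

wanted=$(printf '%s\n' "$listing" | filter_keys)
printf '%s\n' "$wanted"
# trans_120215.csv
# trans_130215.csv

# Against a real bucket the same filter could be wired up roughly like this
# (assumption: "aws s3 ls" prints date, time, size and key, in that order):
#   aws s3 ls s3://my_bucket_location/ | awk '{print $4}' | filter_keys |
#   while read -r key; do
#       aws s3 cp "s3://my_bucket_location/$key" ~/my_r_location/
#   done
```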
Putting this here for others to find, since I just had to figure this out. Here's what I came up with:
s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '.*s3://[BUCKET]/[FILENAME]' | cut -c 41-)
You can put the regex in the grep search string. For instance, I was searching for specific files to delete (hence the s3cmd del). My regex looked like: '2016-11-04.*s3.*[DN][RS].*'. You may have to adjust the cut for your use. Should also work with s3cmd get.

Handling "?" character passed to ZSH function

I'm having a problem setting up a simple function in zsh.
I want to make a function that downloads only the mp3 audio from YouTube.
I used youtube-dl and I want a simple function to make that easy for me:
ytmp3(){
youtube-dl -x --audio-format mp3 "$@"
}
So when I try
ytmp3 https://www.youtube.com/watch?v=_DiEbmg3lU8
I get
zsh: no matches found: https://www.youtube.com/watch?v=_DiEbmg3lU8
but if I try
ytmp3 "https://www.youtube.com/watch?v=_DiEbmg3lU8"
it works.
I figured out that the program runs (but won't download anything) if I remove the ? and all characters after it. So I guess that this is some sort of special character for zsh.
By default, zsh will try to "glob" patterns that you use on command lines (it will try to match the pattern to file names). If it can't make a match, you get the error you're getting ("no matches found").
You can disable this behaviour by disabling the nomatch option:
unsetopt nomatch
The manual page describes this option as follows (it describes what happens when the option is enabled):
If a pattern for filename generation has no matches, print an error, instead of leaving it unchanged in the argument list.
Try again with the option disabled:
$ unsetopt nomatch
$ ytmp3 https://www.youtube.com/watch?v=_DiEbmg3lU8
If you want to permanently disable the option, you can add the disable command to your ~/.zshrc file.
The question mark is part of zsh's pattern matching, similar to *. It matches any single character.
For instance, ls c?nfig will list both "config" and "cinfig", provided they exist.
So, yes, your problem is simply that zsh is trying to interpret the ? in the URL as a pattern to match to files, failing to find any, and crapping out. Escape the ? with a \ or put quotes around it, like you did, to fix it.
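A small sketch of that behaviour in a scratch directory, using the file names from the example above. It is written for a POSIX sh, which passes an unmatched pattern through unchanged instead of raising zsh's "no matches found" error, so only the matching case is shown:

```shell
#!/bin/sh
# Hypothetical scratch directory holding the two example files.
demo=$(mktemp -d)
touch "$demo/config" "$demo/cinfig"

# Unquoted: the shell expands ? against file names before the command runs.
unquoted=$(cd "$demo" && echo c?nfig)
# Quoted or backslash-escaped: the ? reaches the command literally.
quoted=$(cd "$demo" && echo "c?nfig")
escaped=$(cd "$demo" && echo c\?nfig)

printf '%s\n' "$unquoted" "$quoted" "$escaped"
# cinfig config
# c?nfig
# c?nfig
```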

unix command line ...how to grep and show only file names that contain a string?

I know I can search for a string with:
grep -n -d recurse 'snoopy' *
and then it shows every file name and instance that contains that string, like:
file/name.txt:23 some snoopy here
file/name2.txt:59 another snoopy there
file/name2.txt:343 some more snoopy
etc...
The problem is that with many occurrences, the list is huge. How do I make it show only the actual file names that contain the string, without duplicates and without the occurrence?
Only like:
file/name1.txt
file/name52.txt
file/name28293.txt
Thanks a lot for any help :)
The -l flag (or, in both BSD and GNU grep, --files-with-matches) does what you want.
From the POSIX spec:
Write only the names of files containing selected lines to standard output. Pathnames shall be written once per file searched. If the standard input is searched, a pathname of "(standard input)" shall be written, in the POSIX locale. In other locales, "standard input" may be replaced by something more appropriate in those locales.
Both BSD and GNU also explicitly guarantee that this will be more efficient. (Older BSD versions say "… grep will only search a file until a match has been found, making searches potentially less expensive", newer BSD and GNU say "The scanning will stop on the first match".) If you don't know which grep you have and which options it has, just type man grep at the shell and you should get the manpage.
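A quick sketch of -l on a throwaway directory. The file names and contents are invented to mirror the question, and -r is used for recursion, which both GNU and BSD grep accept:

```shell
#!/bin/sh
# Hypothetical sample files, created only for this demo.
demo=$(mktemp -d)
mkdir -p "$demo/file"
printf 'some snoopy here\n' > "$demo/file/name1.txt"
printf 'another snoopy there\nsome more snoopy\n' > "$demo/file/name2.txt"
printf 'nothing to see\n' > "$demo/file/name3.txt"

# -r recurses into directories, -l prints each matching file name once.
matches=$(cd "$demo" && grep -rl 'snoopy' file | sort)
printf '%s\n' "$matches"
# file/name1.txt
# file/name2.txt
```

name2.txt contains two matching lines but is listed only once.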

find specific file extension with find command

I would like to find any files within a given root that contain arbitrary extensions in the filename.
I saw this post:
How to delete all files with certain suffix before file extension
Based on that information, I tried this:
find . -iregex ".*\.\(wav\|aif\|wave\|aiff\)"
This seems like it should work, but I don't get any results printed to the terminal window.
Can anyone offer advice? I'm on OSX 10.7
Thanks,
jml
You are looking for:
find . -regex ".*\.\(wav\|aif\|wave\|aiff\)"
You were missing the escape characters (\) on the alternation operators (|).
Is that an emacs style regex?
If not, try using -regextype. From the find man page on Linux (archaic):
-regextype type
Changes the regular expression syntax understood by -regex and -iregex tests which occur later on the command line. Currently-implemented types are emacs (this is the default), posix-awk, posix-basic, posix-egrep and posix-extended.
On MacOS X, the manual page for find says:
-iregex pattern
Like -regex, but the match is case insensitive.
-regex pattern
True if the whole path of the file matches pattern using regular expression. To match a file named './foo/xyzzy', you can use the regular expression '.*/[xyz]*' or '.*/foo/.*', but not 'xyzzy' or '/foo/'.
Some experimentation shows that:
find pdf -iregex ".*/.*.pdf"
finds a whole lot of PDF files in my folder full of them, but none of these variants find anything:
find pdf -iregex ".*/.*\.(pdf|doc|docx)"
find pdf -iregex ".*/.*\.\(pdf|doc|docx\)"
find pdf -iregex ".*/.*.(pdf|doc|docx)"
find pdf -iregex ".*/.*.\(pdf|doc|docx\)"
Consequently, one is forced to assume that the regexes supported by MacOS X (BSD) find do not include alternation (parentheses and pipes) amongst the recognized characters. 'Tis a pity: man 7 re_format implies it might, but it doesn't. The -regextype option is not supported on MacOS X (BSD), it seems.
So, it may be simplest to install GNU find, or to do N separate searches for the N different file extensions, or do one search for files in general and use egrep '\.(aif|wave?|aiff)$' to catch the files you're interested in. That rather assumes you don't use newlines in file names (spaces etc. are OK, but newlines are not).
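The "one general search, then filter" option can be sketched as follows. The sample file names are invented for the demo, grep -E is used as the current spelling of egrep, and -i is added as an assumption to mirror the case-insensitive -iregex from the question:

```shell
#!/bin/sh
# Hypothetical sample files, created only for this demo.
demo=$(mktemp -d)
touch "$demo/a.wav" "$demo/b.AIFF" "$demo/c.wave" "$demo/d.aif" \
      "$demo/e.txt" "$demo/wav.pdf"

# List every plain file, then keep the ones ending in the wanted extensions.
audio=$(cd "$demo" && find . -type f | grep -Ei '\.(wave?|aiff?)$' | sort)
printf '%s\n' "$audio"
# ./a.wav
# ./b.AIFF
# ./c.wave
# ./d.aif
```

As noted above, this line-based filter assumes no newlines in file names. A regex-free variant would be a group of -iname tests joined with -o, which both GNU and BSD find accept.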
