Delete part of file name (up to but NOT including string)

Delete part of file name (up to but NOT including string) - zsh

I have a script that was very kindly provided for me a while ago which allowed me to generate input files by inserting coordinates from a series of .xyz files into a template file (Create new files by copying contents of coordinate files into template file).
I'm trying to adapt that script to do something very similar, but different in a very slight, but annoying way. In the script, the new directories created to house these new files are named like this:
# File name is in the form '....Hnnn.xyz';
# this will parse nnn from that name.
local inputNumber=$coordFile
# Remove '.xyz'.
inputNumber=${inputNumber%.xyz}
# Remove everything up to and including the 'H'.
inputNumber=${inputNumber##*H}
# Subdirectory name is based on the input number.
local outDir=$baseDir/D$inputNumber
# Create the directory if it doesn't exist.
if [[ ! -d $outDir ]]; then
mkdir $outDir
fi
This worked for my last problem, because the files were all named in the form xxxx_DH000.xyz. However, now the files I have are named using the form xxxx.000.xyz. While everything else in the script works, I cannot figure out how to name the new directories in the form 000.
The line in the script which I think needs to be edited slightly is where it says inputNumber=${inputNumber##*H}. What I cannot figure out is how to get the script to delete everything up to but not including a 0. I've searched online, but the only questions/answers I've found relating to the renaming of files by stripping part of the original names speaks about deleting everything 'up to and including' a string.
I was able to generate directories named 1, 2, 3, etc. with inputNumber=${inputNumber##*0}, however I want all three digits present (i.e. I would like create directories 001, 002, 003, etc.).
As an aside, I cannot use the . as the cutoff point, as there are multiple .s in each file name. An example of one of the file names is tma.h2s-2-pes-b97m-d4-tz.011.xyz.
Is there some way to get the script to simply name the files based on the full three digit number?

Although it's not needed in this case, zsh does support deleting text just before a matched pattern in a string. These parameter expansions will remove everything prior to the first 0 in the string, but keep the 0:
inputNumber='tma.h2s-2-pes-b97m-d4-tz.011.xyz'
inputNumber=${inputNumber:r} # remove '.xyz'
inputNumber=${(SM)inputNumber##0*}
print ${inputNumber}
# ==> 011
This includes a few zsh-isms:
${...:r} returns the 'root' of a filename, removing the extension.
(S) - parameter expansion flag to change the behavior of the ## expansion. It will now search for patterns in the middle of a string, not just at the beginning.
(M) - flag to include the pattern match (the 0*) in the result.
This depends on the number always starting with 0, which may not be a good choice - what file comes after 099?
This next version uses a zsh extended glob pattern to find a number between two periods, and returns that number - i.e. it will find the number in .11., .011., or .2345., but not in .x11.:
coordFile='tma.h2s-2-pes-b97m-d4-tz.022.xyz'
inputNumber=${(*)coordFile//(#b)*.(<->).*/${match}}
print ${inputNumber}
# ==> 022
Some of the pieces:
${...//.../...} - substitution expansion.
(*) - enables extendedglob for this expansion.
(#b) - globbing flag to enable 'backreferences', so that $match will work.
<-> - matches a number. This can be restricted to a range if needed, like <100-199>.
(<->) - puts the number into a match group.
*. and .* - everything before and after the number; these are not in the match group.
${match} - the matched string from the parenthesized part of the pattern. This is used as the replacement for the entire string, so we get just the number. If more than one part of the input string matches the pattern, this will be the last one. match is actually an array, but since there's only one match group in the pattern, it does not need to be indexed with ${match[1]}.
This variant uses a standard regular expression to find the number:
coordFile='tma.h2s-2-pes-b97m-d4-tz.033.xyz'
match=
[[ $coordFile =~ .*\\.([[:digit:]]+)\\..* ]]
inputNumber=${match[1]}
print ${inputNumber}
# ==> 033
After the [[ ]] test, the match array will contain matches from any parenthesized groups in the regular expression - here, that will be a set of one or more digits in between two periods / full stops.
But, as #choroba and Fravadona have noted, since the number will be always be at the end of the string, you can use the standard #/##/%/%% expansions to remove parts of the string based only on the .s. This is a common idiom that will be familiar to many shell programmers, and will also work in bash (note that other parts of your original script depend on zsh).
inputNumber='tma.h2s-2-pes-b97m-d4-tz.044.xyz'
inputNumber=${inputNumber%.xyz}
inputNumber=${inputNumber##*.}
print ${inputNumber}
# ==> 044
In zsh everything can be consolidated into a single nested substitution:
baseDir='files/are/here'
coordFile='tma.h2s-2-pes-b97m-d4-tz.055.xyz'
local outDir=$baseDir/D${${coordFile:r}##*.}
print $outDir
# ==> files/are/here/D055

Related

R - Split character vector using regex

I've got some kind of logfile I'd like to read and analyse. Unfortunately the files are saved in a pretty "ugly" way (with lots of special characters in between), so I'm not able to read in just the lines with each one being an entry. The only way to separate the different entries is using regular expressions, since the beginning of each entry follows a specified pattern.
My first approach was to identify the pattern in the character vector (I use read_file from the readr-package) and use the corresponding positions to split the vector with strsplit. Unfortunately the positions seem not always to match, since the result doesn't always correspond to the entries (I'd guess that there's a problem with the special characters).
A typical line of the file looks as follows:
16/10/2017, 21:51 - George: This is a typical entry here
The corresponding regular expressions looks as follows:
([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):
The first thing I want is a data.frame with each line corresponding to a specific entry (in a next step I'd split the pattern into its different parts).
What I tried so far was the following:
regex.log = "([[:digit:]]{2})/([[:digit:]]{2})/([[:digit:]]{4}), ([[:digit:]]{2}):([[:digit:]]{2}) - ([[:alpha:]]+):"
log.regex = gregexpr(regex.log, file.log)[[1]]
log.splitted = substring(file.log, log.regex, log.regex[2:355]-1)
As can be seen this logfile has 355 entries. The first ones are separated correctly. How can I separate the character vector using a regular expression without loosing the information of the regular expression/pattern?

Use capturing and non-capturing groups to identify the parts you want to keep, and be sure to use anchors:
file.log = "16/10/2017, 21:51 - George: This is a typical entry here"
regex.log = "^((?:[[:digit:]]{2})\\/(?:[[:digit:]]{2})\\/(?:[[:digit:]]{4}), (?:[[:digit:]]{2}):(?:[[:digit:]]{2}) - (?:[[:alpha:]]+)): (.*)$"
gsub(regex.log,"\\1",file.log)
>> "16/10/2017, 21:51 - George"
gsub(regex.log,"\\2",file.log)
>> "This is a typical entry here"

Korn shell metacharacters: What does !(some text) mean?

Trying to figure out how ksh is processing the construct !(text). For example,
$ echo !(hello)
produces a list of files in the current directory (similar to the output of an ls command, except it's sorted into columns rather than rows). It doesn't matter what text is in the parens, the output is the same.
Can anyone enlighten me as to what the command is actually doing? Thanks!

It echoes all files except hello. You can also use wildcards like echo !(*.java)

Here's some more detailed information. For more info, look in the "file name generation" section of the ksh man page (bash works the same way). See here for more patterns: https://www.mkssoftware.com/docs/man1/sh.1.asp
A sub-pattern begins with a ?, *, +, #, or ! character followed by a pattern-list
enclosed in parentheses. Pattern-lists themselves can contain sub-patterns.
The following list describes valid sub-patterns.
?(pattern-list)
Matches exactly zero or exactly one occurrence of the specified pattern-list.
*(pattern-list)
Matches zero or more occurrences of the specified pattern-list.
+(pattern-list)
Matches one or more occurrences of the specified pattern-list.
#(pattern-list)
Matches exactly one occurrence of the specified pattern-list.
!(pattern-list)
Matches any string that does not match the specified pattern-list.
So for your example, when the shell sees the unquoted exclamation point, followed by parenthesis it goes into file name matching mode, then it displays files in the current directory that do not match "hello".

zsh match files not containing dash

I have file listing as the following one:
001file.jpg
003file.jpg
001-800x600-sq.jpg
001-800x600.jpg
002-800x600-sq.jpg
002-800x600.jpg
003-800x600-sq.jpg
003-800x600.jpg
004-800x531-sq.jpg
004-800x531.jpg
005-800x531-sq.jpg
005-800x531.jpg
006-800x531-sq.jpg
006-800x531.jpg
007-800x531-sq.jpg
007-800x531.jpg
008-800x1067-sq.jpg
008-800x1067.jpg
009-800x1067-sq.jpg
009-800x1067.jpg
010-800x533-sq.jpg
010-800x533.jpg
011-800x1200-sq.jpg
011-800x1200.jpg
012-800x533-sq.jpg
012-800x533.jpg
013-800x600-sq.jpg
013-800x600.jpg
014-800x1067-sq.jpg
014-800x1067.jpg
015-800x533-sq.jpg
015-800x533.jpg
016-800x533-sq.jpg
016-800x533.jpg
In ZSH, I want to list all files beginning with any number, not containing dash in filename, so I tried:
print -l <->[^-]*.jpg
with no success. What is wrong with this pattern!?

This is, I think, similar to the case that the documentation for <-> warns about:
Be careful when using other wildcards adjacent to patterns of this form; for example, <0-9>* will actually match any number whatsoever at the start of the string, since the `<0-9>' will match the first
digit, and the `*' will match any others. This is a trap for the unwary, but is in fact an inevitable
consequence of the rule that the longest possible match always succeeds. Expressions such as
`<0-9>[^[:digit:]]*' can be used instead.
In print -l <->[^-]*.jpg, the <-> matches the first digit, then [^-] matches the 2nd digit, and * matches everything thing else.
Use instead
print -l <->[^[:digit:]-]*.jpg

list files that do not contain pattern anywhere in filename (zsh only)

Say I have a list of files in dir like so:
blackneasy-sq.png
dima-sq.png
envoy-sq.png
fox-sq.png
freeze-sq.png
ministarstvo kulture-sq.png
naxi-sq.png
pick-sq.png
pink-sq.png
wink-sq.png
brick-sq.png
rider-sq.png
sixt-rent-a-car.png
slavija hotel-sq.png
temet-sq.png
tepe-sq.png
How would I specify exclude pattern to show all files except those that
contain ck chars glued together anywhere in filename?
So in a resulting list, files:
blackneasy-sq.png
pick-sq.png
brick-sq.png
should not be shown.
I tried this without success.
print -l [[:alpha:]]*^ck*.*

Use the ~ operator: THIS~THAT matches files that match the pattern THIS but not THAT. This requires the extended_glob option (which is automatically turned on in some contexts such as completion function, but is not the default state).
print -l -- *.png~*ck*
You can abbreviate *~THAT to ^THAT (also requiring extended_glob), so if you don't want to limit to .png files, you can use
print -l -- ^*ck*

Difference between dir/**/* and dir// in Unix glob pattern?

It seems that the output are the same when I echoed it.
I also tested other commands such as open, but the results from both are the same.

In traditional sh-style pattern matching, * matches zero or more characters in a component of the file name, so there is no difference between *, **, and ***, either on its own or as part of a larger pattern.
However, there are globbing syntaxes that assign a distinct meaning to **. Pattern matching implemented by the Z shell, for example, expands x/**/y to all file names beginning with x/ and ending in /y regardless of how many directories are in between, thus matching all of x/y, x/subdir/y, x/subdir1/subdir2/y, etc. This syntax was later implemented by bash, although only enabled when the globstar configuration option is set by the user.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex