Ignore a directory or filename when using YARA -r (recursive scan)

Does anyone know how to instruct YARA to skip a certain directory or filename during a recursive scan? The issue is that when my scan completes, there are hits from the rule file itself, so I want to instruct YARA to ignore the directory the rules file sits in.
For example, I want to scan c:\ recursively but ignore the rule file c:\users\xyz\documents\rules.yar.
Any ideas?
Thanks

Unfortunately YARA doesn't have that feature, but as a workaround you can define your strings as hexadecimal instead of text. For example, instead of:
$ = "hello"
Use:
$ = { 68 65 6C 6C 6F }
Alternatively you can escape one character of the string:
$ = "\x68ello"
This is far from ideal and only makes sense if you have a simple rule.
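Another workaround, outside YARA itself, is to build the file list yourself, prune the directory holding the rules, and hand each remaining file to yara. A minimal Python sketch, where files_to_scan is a hypothetical helper and the rule path is the one from the question:

```python
import os

def files_to_scan(root, excluded_dir):
    """Yield every file under root, skipping excluded_dir entirely."""
    excluded = os.path.abspath(excluded_dir)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into the excluded directory.
        dirnames[:] = [d for d in dirnames
                       if os.path.abspath(os.path.join(dirpath, d)) != excluded]
        for name in filenames:
            yield os.path.join(dirpath, name)

# Each path could then be handed to yara, e.g.:
#   subprocess.run(["yara", r"c:\users\xyz\documents\rules.yar", path])
```

Since the rule file is never in the list, it can no longer match itself.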


PowerShell: replacing NUL in large file not working

I have a huge (39 GB) text file that I eventually need to read into R as a pipe-delimited file. However, there are lots of NUL characters (\0) that prevent it from being read into R. I'm trying to replace them in PowerShell beforehand.
PowerShell code:
Get-Content file.txt | foreach { $_ -replace '\\0' } | Out-File -Encoding UTF8 file_NEW.txt
I thought this worked but when I try to read the new file in R, \0 characters appear in the string and I get this error:
Error in vroom_(file, delim = delim %||% col_types$delim, col_names = col_names, :
embedded nul in string: '||MORALES BELINDA F TRUST||\0||\0|0||PT||||33.824049|-118.192053||3655||N|WESTON|PL|||LONG BEACH|CA|908073855|C033||3655||N|WESTON|PL|||LONG BEACH|CA|908073855|C033|20111117|988||||||||20111027|20111110|TR|2|1527575||KINECTA FCU|KINECTA FCU|||MANHATTAN BEACH|CA|90266|047978|LAWYERS TITLE|03003|232000.00|20111027||CNV|TR|Y|10|20211201|||D|BGJT|V||115|21||0|0||0|||||Y|Y||\r\n06037|5054001029|5054-001-029|1|\0||BONILLA ...
Why are there still NULs in the file? ANY help appreciated! Especially because these functions take so long to run. Please, I'm just trying to read this huge file.
Just in case there is an error in the R code, note it is taken directly from this post using the vroom and arrow packages to read then create parquet files.
The reason is that doubling the backslash, \\, escapes the backslash itself. Instead of matching NUL (0x00), you are literally matching \0: two characters, a backslash followed by a zero.
The correct syntax would be like so,
-replace '\0'
That being said, processing a large file can be done in a smarter way. A fast approach would be to process, say, 10,000 lines at a time. See an earlier answer about how to process a file in blocks.
vonPryz' helpful answer shows the immediate problem with your -replace operation: in a single-quoted PowerShell string ('...'), \ characters do not need to be escaped as \\ in order to be passed verbatim to the .NET regex engine that the -replace operator uses behind the scenes.
Thus, '\0', when passed to the .NET regex engine from PowerShell, is sufficient to represent a NUL character (a Unicode character whose code point is 0); alternatively, you could use "`0", a double-quoted, expandable PowerShell string, in which ` serves as the escape character.
r2evans' helpful answer shows an alternative solution via the Windows ports of standard Unix utilities that come with the optional Rtools download, where piping the input file to
tr -d '\0' may offer the fastest solution, if both the input and the output file use the same character encoding.
In the realm of PowerShell, using Get-Content with its default line-by-line processing with such a large input file would take too long in practice.
While direct use of .NET APIs may offer the ultimately fastest solution, using Get-Content's -ReadCount parameter offers a simpler, more PowerShell-idiomatic solution:
Get-Content -ReadCount 1e6 file.txt | foreach { $_ -replace '\0' } |
Out-File -Encoding UTF8 file_NEW.txt
-ReadCount 1e6 reads 1 million lines (1e6 uses exponential notation, i.e. 10 to the power of 6) at once and passes them as an array to the ForEach-Object cmdlet (one of whose built-in aliases is foreach); since the -replace operator is capable of operating on an array of values as its LHS, the NUL substitution can be performed on all elements of the array at once.
Depending on how many bytes make up the average line in your input file, you can adjust this number upward, if you have more memory available, or downward, if you have less. The higher the number you can use, the faster the command will complete.
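If staying inside PowerShell isn't a requirement, the same block-wise idea can be sketched in Python, operating on raw bytes so no encoding re-interpretation happens along the way. The helper name and file names below are placeholders, and byte-level deletion like this assumes a single-byte or UTF-8 encoded input, not UTF-16:

```python
CHUNK = 16 * 1024 * 1024  # 16 MiB per read; tune to the memory you have

def strip_nuls(src_path, dst_path, chunk=CHUNK):
    """Copy src_path to dst_path, dropping every NUL (0x00) byte."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            # replace(b"\x00", b"|") would substitute instead of delete.
            dst.write(block.replace(b"\x00", b""))
```

Because the file is never split into lines or decoded, memory use stays bounded by the chunk size regardless of file size.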
I don't know PowerShell well enough to fix that, but you can use sed or tr to replace the NULs in the files. The tr and sed utilities are available by default on most (all?) Unix-like OSes including macOS. For Windows, they are included in Rtools35 and Rtools40.
If you do not find it with Sys.which("tr"), then you may need to include the full path to the respective utility. Assuming Rtools is installed on the root c:/, then something like
Rtools35: c:/Rtools/bin/tr.exe
Rtools40: c:/Rtools40/usr/bin/tr.exe
They are also included in Git-for-Windows as /usr/bin/tr.exe and /usr/bin/sed.exe within git-bash. (On the file-system, they are likely under c:/Program Files/Git/usr/bin/.)
(Same locations for sed.)
I should note that I'm doing this through R's system2 as a convenience only. If you're comfortable enough on the bash command line, then this is just as easy to perform there instead.
data generation
I don't know where the nuls are in your file, so I'll assume that they are record (line) terminators. That is, in most files you'll see each line ending with \n or \r\n, but for this example I'll replace the \n with \0 (nul).
charToRaw("a|b\nhello|world")
# [1] 61 7c 62 0a 68 65 6c 6c 6f 7c 77 6f 72 6c 64
ch <- charToRaw("a|b\nhello|world")
ch[ch == charToRaw("\n")] <- as.raw(0)
ch
# [1] 61 7c 62 00 68 65 6c 6c 6f 7c 77 6f 72 6c 64
writeBin(ch, "raw.txt")
readLines("raw.txt")
# Warning in readLines("raw.txt") :
# line 1 appears to contain an embedded nul
# Warning in readLines("raw.txt") :
# incomplete final line found on 'raw.txt'
# [1] "a|b"
The nul is a problem (as intended), so we don't see anything after the embedded nul.
tr
tr doesn't like doing things in place, so this takes the original file as input and generates a new file. If file size and disk space are a concern, then perhaps sed would be preferred.
system2("tr", c("\\0", "\\n"), stdin = "raw.txt", stdout = "raw2.txt")
readLines("raw2.txt")
# Warning in readLines("raw2.txt") :
# incomplete final line found on 'raw2.txt'
# [1] "a|b" "hello|world"
(That warning is safe to ignore here.)
sed
sed can optionally work in-place with the -i argument. (Without it, it can operate the same as tr: generate a new file based on the original.)
system2("sed", c("-i", "s/\\x0/\\n/g", "raw.txt"))
readLines("raw.txt")
# Warning in readLines("raw.txt") :
# incomplete final line found on 'raw.txt'
# [1] "a|b" "hello|world"
(That warning is safe to ignore here.)
other than record-terminator
If the nul is not the record-terminator (\n-like) character, then you have some options:
Replace the \0 character with something meaningful, such as Z (stupid, but you get the point). This should use the above commands as-is, replacing the \\n with your character of choice. (tr requires a single character; sed can replace it with multiple characters if you like.)
Delete the \0 completely, in which case you can use tr -d '\0' or sed -i -e 's/\x0//g' (translated into R's system2 calls above).

Create temporary aliases in zshrc

Is there a way to create a temporary alias, mainly to cut down typing directory paths, especially those not used often enough to warrant adding to .bash_aliases?
Thanks
Just type one in.
$ alias hello='echo hello'
$ hello
hello
You can type
long=~/long/path/to/a/directory
and use it as
cd ~long
(just add tilde to the beginning).
You will also see "~long" in your prompt instead of the full path.

How to make the glob() function also match hidden dot files in Vim?

In a Linux or Mac environment, Vim’s glob() function doesn’t match dot files such as .vimrc or .hiddenfile. Is there a way to get it to match all files including hidden ones?
The command I’m using:
let s:BackupFiles = glob("~/.vimbackup/*")
I’ve even tried setting the mysterious {flag} parameter to 1, and yet it still doesn’t return the hidden files.
Update: Thanks ib! Here’s the result of what I’ve been working on: delete-old-backups.vim.
That is due to how the glob() function works: A single-star pattern
does not match hidden files by design. In most shells, the default
globbing style can be changed to do so (e.g., via shopt -s dotglob
in Bash), but it is not possible in Vim, unfortunately.
However, one has several possibilities to solve the problem still.
First and most obvious is to glob hidden and non-hidden files
separately and then concatenate the results:
:let backupfiles = glob(&backupdir..'/*').."\n"..glob(&backupdir..'/.[^.]*')
(Be careful not to fetch the . and .. entries along with hidden files.)
Another, perhaps more convenient but less portable way is to use
the backtick expansion within the glob() call:
:let backupfiles = glob('`find '..&backupdir..' -maxdepth 1 -type f`')
This forces Vim to execute the command inside backticks to obtain
the list of files. The find shell command lists all files (-type f)
including the hidden ones, in the specified directory (-maxdepth 1
forbids recursion).

Unix wildcard selectors? (Asterisks)

In Ryan Bates' Railscast about git, his .gitignore file contains the following line:
tmp/**/*
What is the purpose of using the double asterisks followed by an asterisk as such: **/*?
Would using simply tmp/* instead of tmp/**/* not achieve the exact same result?
Googling the issue, I found an unclear IBM article about it, and I was wondering if someone could clarify the issue.
It means: match everything in all the subdirectories below tmp, as well as the direct contents of tmp itself.
e.g. I have the following:
$ find tmp
tmp
tmp/a
tmp/a/b
tmp/a/b/file1
tmp/b
tmp/b/c
tmp/b/c/file2
matched output:
$ echo tmp/*
tmp/a tmp/b
matched output:
$ echo tmp/**/*
tmp/a tmp/a/b tmp/a/b/file1 tmp/b tmp/b/c tmp/b/c/file2
It is a default feature of zsh; to get it to work in Bash 4, run:
shopt -s globstar
From http://blog.privateergroup.com/2010/03/gitignore-file-for-android-development/:
(kwoods)
"The double asterisk (**) is not a git thing per se, it's really a Linux / Mac shell thing.
It would match on everything including any sub folders that had been created.
You can see the effect in the shell like so:
# ls ./tmp/* = should show you the contents of ./tmp (files and folders)
# ls ./tmp/** = same as above, but it would also go into each sub-folder and show the contents there as well."
According to the documentation of gitignore, this syntax is supported since git version 1.8.2.
Here is the relevant section:
Two consecutive asterisks (**) in patterns matched against full pathname may have special meaning:
A leading ** followed by a slash means match in all directories. For example, **/foo matches file or directory foo anywhere, the
same as pattern foo. **/foo/bar matches file or directory bar
anywhere that is directly under directory foo.
A trailing /** matches everything inside. For example, abc/** matches all files inside directory abc, relative to the location of
the .gitignore file, with infinite depth.
A slash followed by two consecutive asterisks then a slash matches zero or more directories. For example, a/**/b matches a/b,
a/x/b, a/x/y/b and so on.
Other consecutive asterisks are considered invalid.
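The zero-or-more-directories behavior of ** can also be observed with Python's pathlib, whose glob patterns follow similar rules. The temporary tree below recreates the find example from earlier:

```python
import tempfile
from pathlib import Path

# Recreate the example tree: tmp/a/b/file1 and tmp/b/c/file2.
root = Path(tempfile.mkdtemp())
for d in ("tmp/a/b", "tmp/b/c"):
    (root / d).mkdir(parents=True)
(root / "tmp/a/b/file1").touch()
(root / "tmp/b/c/file2").touch()

shallow = sorted(p.relative_to(root).as_posix() for p in root.glob("tmp/*"))
deep = sorted(p.relative_to(root).as_posix() for p in root.glob("tmp/**/*"))
print(shallow)  # only the direct children of tmp
print(deep)     # every file and directory below tmp, at any depth
```

As with the shell demonstration, tmp/* stops at the first level while tmp/**/* descends into every subdirectory.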

How do I distinguish between 'binary' and 'text' files?

Informally, most of us understand that there are 'binary' files (object files, images, movies, executables, proprietary document formats, etc) and 'text' files (source code, XML files, HTML files, email, etc).
In general, you need to know the contents of a file to be able to do anything useful with it, and from that point of view, whether the encoding is 'binary' or 'text' doesn't really matter. And of course files just store bytes of data, so they are all 'binary', and 'text' doesn't mean anything without knowing the encoding. And yet, it is still useful to talk about 'binary' and 'text' files, but to avoid offending anyone with this imprecise definition, I will continue to use 'scare' quotes.
However, there are various tools that work on a wide range of files, and in practical terms, you want to do something different based on whether the file is 'text' or 'binary'. An example of this is any tool that outputs data on the console. Plain 'text' will look fine, and is useful. 'binary' data messes up your terminal, and is generally not useful to look at. GNU grep at least uses this distinction when determining if it should output matches to the console.
So, the question is, how do you tell if a file is 'text' or 'binary'? And to restrict it further, how do you tell on a Linux-like file system? I am not aware of any filesystem metadata that indicates the 'type' of a file, so the question becomes: by inspecting the content of a file, how do I tell if it is 'text' or 'binary'? For simplicity, let's restrict 'text' to mean characters which are printable on the user's console. In particular, how would you implement this? (I thought this was implied on this site, but it is helpful, in general, to be pointed at existing code that does this; I should have specified that I'm not really asking which existing programs I can use for this.)
You can use the file command. It does a bunch of tests on the file (man file) to decide if it's binary or text. You can look at/borrow its source code if you need to do that from C.
file README
README: ASCII English text, with very long lines
file /bin/bash
/bin/bash: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, dynamically linked (uses shared libs), stripped
The spreadsheet software my company makes reads a number of binary file formats as well as text files.
We first look at the first few bytes for a magic number which we recognize. If we do not recognize the magic number of any of the binary types we read, then we look at up to the first 2K bytes of the file to see whether it appears to be a UTF-8, UTF-16 or a text file encoded in the current code page of the host operating system. If it passes none of these tests, we assume that it is not a file we can deal with and throw an appropriate exception.
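A rough sketch of that layered check, assuming a hypothetical helper name and an illustrative (not exhaustive) magic-number table; the 2 KB window comes from the answer above:

```python
MAGIC_NUMBERS = {          # illustrative signatures, not a product's real list
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
    b"\x7fELF": "elf",
}

def sniff(path):
    """Return a recognized binary type, a plausible text encoding, or None."""
    with open(path, "rb") as f:
        head = f.read(2048)
    for magic, kind in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return kind
    for encoding in ("utf-8", "utf-16"):
        try:
            head.decode(encoding)   # may fail spuriously if a multi-byte
            return encoding         # character is cut at the 2 KB boundary
        except UnicodeDecodeError:
            continue
    return None
```

The UTF-16 check here is permissive (almost any even-length byte string decodes); stricter code would require a BOM or validate character ranges, as the original answer's code-page fallback implies.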
You can determine the MIME type of the file with
file --mime FILENAME
The shorthand is file -i on Linux and file -I (capital i) on macOS (see comments).
If it starts with text/, it's text, otherwise binary. The only exceptions are XML applications. You can match those by looking for +xml at the end of the file type.
To list text file names in current dir/subdirs:
grep -rIl ''
Binaries:
grep -rIL ''
To check for a particular file:
grep -qI '' FILE
then, an exit status of 0 means the file is text; 1 means binary.
To check:
echo $?
Key option is this:
-I Process a binary file as if it did not contain matching data;
Other options:
-r, --recursive
Read all files under each directory, recursively;
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed.
-L, --files-without-match
Suppress normal output; instead print the name of each input file from which no output would normally have been printed.
-q, --quiet, --silent
Quiet; do not write anything to standard output. Exit immediately with zero status if any match is found, even if an error was detected.
Perl has a decent heuristic. Use the -B operator to test for binary (and its opposite, -T to test for text). Here's a shell one-liner to list text files:
$ find . -type f -print0 | perl -0nE 'say if -f and -s _ and -T _'
(Note that those underscores without a preceding dollar are correct (RTFM).)
Well, if you are just inspecting the entire file, see if every character is printable with isprint(c). It gets a little more complicated for Unicode.
To distinguish a Unicode text file, MSDN offers some great advice as to what to do.
The gist of it is to first inspect up to the first four bytes:
EF BB BF UTF-8
FF FE UTF-16, little endian
FE FF UTF-16, big endian
FF FE 00 00 UTF-32, little endian
00 00 FE FF UTF-32, big-endian
That will tell you the encoding. Then, you'd want to use iswprint(c) for the rest of the characters in the text file. For UTF-8 and UTF-16, you need to parse the data manually since a single character can be represented by a variable number of bytes. Also, if you're really thorough, you'll want to use the locale variant of iswprint if that's available on your platform.
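A direct, sketched translation of that table into code (detect_bom is a hypothetical helper). Note that the longer UTF-32 signatures must be tested before the UTF-16 ones, because FF FE is a prefix of FF FE 00 00:

```python
BOMS = [  # longest signatures first: FF FE 00 00 must win over FF FE
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def detect_bom(first_bytes):
    """Return the encoding named by a leading byte-order mark, or None."""
    for bom, encoding in BOMS:
        if first_bytes.startswith(bom):
            return encoding
    return None
```

A None result doesn't mean the file is binary; plain ASCII and BOM-less UTF-8 files carry no signature at all, which is why the printability check still matters.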
It's an old topic, but maybe someone will find this useful.
If you have to decide in a script whether something is a text file, you can simply do this:
if file -i "$1" | grep -q text;
then
.
.
fi
This gets the file type, and with a silent grep you can decide whether it's text.
You can use libmagic which is a library version of the Unix file command line (source).
There are wrappers for many languages:
Python
.NET
Nodejs
Ruby
Go
Rust
Most programs that try to tell the difference use a heuristic, such as examining the first n bytes of the file and seeing if those bytes all qualify as 'text' or not (i.e., do they all fall within the range of printable ASCII characters). For finer distinction there's always the file command on Unix-like systems.
One simple check is if it has \0 characters. Text files don't have them.
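A minimal sketch of that check (looks_binary is a hypothetical name; git applies essentially the same test to roughly the first 8000 bytes, and the 8 KB window here is an arbitrary choice):

```python
def looks_binary(path, window=8192):
    """Heuristic: call a file binary if its first bytes contain a NUL."""
    with open(path, "rb") as f:
        return b"\x00" in f.read(window)
```

Like every heuristic in this thread it has false positives (UTF-16 text is full of NUL bytes), but for single-byte and UTF-8 encodings it is cheap and surprisingly reliable.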
As previously stated *nix operating systems have this ability within the file command. This command uses a configuration file that defines magic numbers contained within many popular file structures.
This file, called magic was historically stored in /etc, although this may be in /usr/share on some distributions. The magic file defines offsets of values known to exist within the file and can then examine these locations to determine the type of the file.
The structure and description of the magic file can be found by consulting the relevant manual page (man magic)
As for an implementation, well that can be found within file.c itself, however the relevant portion of the file command that determines whether it is readable text or not is the following
/* Make sure we are dealing with ascii text before looking for tokens */
for (i = 0; i < nbytes - 1; i++) {
    if (!isascii(buf[i]) ||
        (iscntrl(buf[i]) && !isspace(buf[i]) &&
         buf[i] != '\b' && buf[i] != '\032' && buf[i] != '\033'))
        return 0; /* not all ASCII */
}