Zstd compression: without dictionary vs. with dictionary

I have been testing various compression algorithms with Parquet files and have settled on Zstd.
As far as I understand, Zstd uses an adaptive dictionary unless one is explicitly specified, so it starts from an empty one. However, with the dictionary enabled, both the compressed size and the execution time are quite unsatisfactory.
The file size without a dictionary is considerably smaller than with the adaptive one (the number at the end of each name is the compression level):
Name: C:\ParquetFiles\Zstd1 Execution time: 279 ms Size: 13738134
Name: C:\ParquetFiles\Zstd2 Execution time: 140 ms Size: 13207017
Name: C:\ParquetFiles\Zstd9 Execution time: 511 ms Size: 12701030
And for comparison, the log from using the adaptive dictionary:
Name: C:\ParquetFiles\ZstdDictZstd1 Execution time: 487 ms Size: 19462825
Name: C:\ParquetFiles\ZstdDictZstd2 Execution time: 402 ms Size: 19292513
Name: C:\ParquetFiles\ZstdDictZstd9 Execution time: 614 ms Size: 19072779
Can you help me understand the significance of this? Shouldn't compression starting from an empty dictionary perform at least as well as Zstd compression with the dictionary disabled?
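For reference, my understanding of how an explicit Zstd dictionary is normally trained and applied with the zstd command-line tool (file names are hypothetical; trained dictionaries are aimed at many small, similar inputs, so they may simply not fit this workload):

# Train a dictionary from a corpus of representative sample files
zstd --train samples/*.parquet -o parquet.dict
# Compress with the trained dictionary at level 9
zstd -9 -D parquet.dict input.parquet -o input.parquet.zst
# Compress without a dictionary for comparison
zstd -9 input.parquet -o input.parquet.nodict.zst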

Related

Stress-ng stress memory with specific percentage

I am trying to stress an Ubuntu container's memory. Typing free in my terminal gives the following result:
free -m
              total        used        free      shared  buff/cache   available
Mem:           7958         585        6246         401        1126        6743
Swap:          2048           0        2048
I want to stress exactly 10% of the total available memory. Per the stress-ng manual:
-m N, --vm N
        start N workers continuously calling mmap(2)/munmap(2) and writing to the
        allocated memory. Note that this can cause systems to trip the kernel OOM
        killer on Linux systems if not enough physical memory and swap is not
        available.
--vm-bytes N
        mmap N bytes per vm worker, the default is 256MB. One can specify the size
        as % of total available memory or in units of Bytes, KBytes, MBytes and
        GBytes using the suffix b, k, m or g.
Now, on my target container I run two memory stressors to occupy 10% of my memory:
stress-ng -vm 2 --vm-bytes 10% -t 10
However, the memory usage of the container never reaches 10%, no matter how many times I run it. I tried different timeout values, to no avail. The closest it gets is 8.9%; it never reaches 10%. I inspect memory usage on my container this way:
docker stats --no-stream kind_sinoussi
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c3fc7a103929 kind_sinoussi 199.01% 638.4MiB / 7.772GiB 8.02% 1.45kB / 0B 0B / 0B 7
In an attempt to understand this behaviour, I tried running the same command with an exact number of bytes. In my case, I opted for 800 megabytes, since 7958m * 0.1 = 795.8 ≈ 800m.
stress-ng -vm 2 --vm-bytes 800m -t 15
And, I get 10%!
docker stats --no-stream kind_sinoussi
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
c3fc7a103929 kind_sinoussi 198.51% 815.2MiB / 7.772GiB 10.24% 1.45kB / 0B 0B / 0B 7
Can someone explain why this is happening?
Another question: is it possible for stress-ng to stress memory usage to 100%?
stress-ng --vm-bytes 10% will use sysconf(_SC_AVPHYS_PAGES) to determine the available memory. This sysconf() call returns the number of pages that the application can use without hindering any other process, which is approximately what the free command reports as free memory.
Note that stress-ng allocates the memory with mmap, so mmap'd pages may not yet be physically backed at the moment you check how much real memory is being used.
It may be worth also trying the --vm-populate option; it tries to ensure the pages of the mmap'd memory that stress-ng is exercising are physically populated. Also try --vm-madvise willneed, which uses the madvise() system call to hint that the pages will be required fairly soon.
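Putting these suggestions together, something like the following should land closer to the target (all flags are documented stress-ng options; exact behaviour can vary by version):

# Two vm workers at 10% of available memory, with pages prefaulted
stress-ng --vm 2 --vm-bytes 10% --vm-populate --vm-madvise willneed -t 10
# For the second question: 100% can be requested, but expect the kernel
# OOM killer to step in before the target is fully reached
stress-ng --vm 1 --vm-bytes 100% --vm-populate -t 60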

rsync - why is it transferring the whole file?

I believe my rsync is transferring the entire file each time instead of using its delta algorithm and transferring only the changes. Why is this?
I have a text file called rsynctest. Even if I delete only a single character from the text file on the server end, it appears to transfer the entire file. The rsync stats show "Total transferred file size: 2.55G" while the file size is 2.4G, so I believe it transferred the entire file.
Here is the output:
/usr/bin/rsync -avrh --progress --compress --stats rsynctest x.x.x.x:/rsynctest
sending incremental file list
test
2.55G 100% 27.56MB/s 0:01:28 (xfer#1, to-check=0/1)
Number of files: 1
Number of files transferred: 1
Total file size: 2.55G bytes
Total transferred file size: 2.55G bytes
Literal data: 50.53K bytes
Matched data: 2.55G bytes
File list size: 39
File list generation time: 0.001 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 693
Total bytes received: 404.38K
sent 693 bytes received 404.38K bytes 2.54K bytes/sec
total size is 2.55G speedup is 6305.64
It's not actually transferring the whole file. Look at the "Total bytes sent" and "Total bytes received" lines: only 693 bytes went out and about 404K came back. rsync transfers checksums of each block of N bytes (where N can vary), and when a given block's checksums turn out identical, that block isn't transferred; compare "Literal data: 50.53K bytes" (actually sent) with "Matched data: 2.55G bytes" (reused from the existing copy). But since the file size has changed, rsync has to check the whole file for differences, and that's what you're seeing in the progress bar: rsync checking the whole file for differences, not transferring it. (Note also the speed of 27.56MB/s -- I doubt your Internet connection could actually sustain that transfer rate. Though if it can, more power to you.)
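A quick way to watch the delta algorithm at work is to change a single byte in place and re-run with --stats, then compare "Literal data" (bytes actually sent) with "Matched data" (bytes reconstructed from the receiver's existing copy). A sketch, using the file name from the question:

# Overwrite the first byte without truncating the file
printf 'x' | dd of=rsynctest bs=1 count=1 conv=notrunc
# Re-sync and inspect the stats
rsync -avh --stats rsynctest x.x.x.x:/rsynctest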

BaseX: Slow XQuery

I've got a BaseX XML database with ~20 XML files. These files differ in size and structure. The biggest file is 524 MB; it consists of a parent ARTICLE tag with 267685 ART subtags.
This is my XQuery: "/ARTICLE/ART[PRDNO=12345]" (pretty straightforward; proper namespaces omitted for clarity). PRDNO is a foreign key into the PRODUCT/PRD XML structure; there are multiple (on average ~10) products per article.
Everything works as it is supposed to, but the query is quite slow - it takes approximately 1s to execute. Similar queries against other objects in the database (where the amount of data is smaller) are much faster.
What can I do to optimize this query?
I ran "optimize" (which took some minutes) and ensured the TEXT index is in place.
This is the output of "info database":
> info database
Database Properties
Name: hospindex
Size: 1740 MB
Nodes: 69360063
Documents: 22
Binaries: 0
Timestamp: 2014-09-03-09-34-07
Resource Properties
Timestamp: 2014-09-03-09-21-14
Encoding: UTF-8
CHOP: true
Indexes
Up-to-date: true
TEXTINDEX: true
ATTRINDEX: true
FTINDEX: false
LANGUAGE: English
STEMMING: false
CASESENS: false
DIACRITICS: false
STOPWORDS:
UPDINDEX: false
MAXCATS: 100
MAXLEN: 96
EDIT: This is the query execution plan:
Compiling:
- adding text() step
Query:
/*:ARTICLE/*:ART[*:PRDNO=1005935]
Optimized Query:
(db:open-pre("hospindex",0), db:open-pre("hospindex",32884731), ...)/*:ARTICLE/*:ART[(*:PRDNO/text() = 1005935)]
Result:
- Hit(s): 1 Item
- Updated: 0 Items
- Printed: 2078 Bytes
- Read Locking: local [hospindex]
- Write Locking: none
Timing:
- Parsing: 1.12 ms
- Compiling: 0.46 ms
- Evaluating: 1684.35 ms
- Printing: 0.35 ms
- Total Time: 1686.3 ms
Query plan:
<QueryPlan>
  <IterPath>
    <DBNodeSeq size="22">
      <DBNode name="hospindex" pre="0"/>
      <DBNode name="hospindex" pre="32884731"/>
      <DBNode name="hospindex" pre="33685448"/>
      <DBNode name="hospindex" pre="38260847"/>
      <DBNode name="hospindex" pre="38358876"/>
    </DBNodeSeq>
    <IterStep axis="child" test="*:ARTICLE"/>
    <IterStep axis="child" test="*:ART">
      <CmpG op="=">
        <CachedPath>
          <IterStep axis="child" test="*:PRDNO"/>
          <IterStep axis="child" test="text()"/>
        </CachedPath>
        <Int value="1005935" type="xs:integer"/>
      </CmpG>
    </IterStep>
  </IterPath>
</QueryPlan>
Your query will be evaluated much faster if you use quotes around your search value:
/ARTICLE/ART[PRDNO = "12345"]
The reason is that the current version of BaseX does not provide an index for numeric values (one may be included in BaseX 8.0), so the unquoted integer comparison cannot use the text index.
You can get more insight into the query compilation steps by turning on the QUERYINFO option.
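A minimal sketch of trying this from the BaseX command-line client (the database name is taken from the question; console output will differ by version):

basex
> SET QUERYINFO true
> OPEN hospindex
> XQUERY /ARTICLE/ART[PRDNO = "12345"]

With the string comparison, the optimized query reported by QUERYINFO should access the text index (a ValueAccess step) instead of iterating over all ART elements, which is what makes the quoted form fast.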

What do the numbers in rsync's output mean?

When I run rsync with the --progress flag, I get information about the transfers as follows.
path/to/file
16 100% 0.01kB/s 0:00:01 (xfer#10857, to-check=427700/441502)
What do the numbers in the second row mean? I know what some of them are, but what do the others mean (marked with ??? below)?
16         ???
100%       amount of this file's transfer completed
0.01kB/s   speed of the current file transfer
0:00:01    time elapsed in the current file transfer
10857      count of files transferred
427700     ???
441502     ???
When the file transfer finishes, rsync replaces the progress line with a summary line that looks like this:
1238099 100% 146.38kB/s 0:00:08 (xfer#5, to-check=169/396)
In this example, the file was 1238099 bytes long in total, the average rate of transfer for the whole file was 146.38 kilobytes per second over the 8 seconds that it took to complete, it was the 5th transfer of a regular file during the current rsync session, and there are 169 more files for the receiver to check (to see if they are up-to-date or not) remaining out of the 396 total files in the file-list.
From http://samba.anu.edu.au/ftp/rsync/rsync.html, under the --progress switch.
path/to/file
16 100% 0.01kB/s 0:00:01 (xfer#10857, to-check=427700/441502)
The 16 is the number of bytes of this file transferred so far. The 100% is the percentage of the file transferred: 100% in this case. For very short files the kB/s figure often comes out a bit weird: small measuring errors cause big differences in the calculated overall speed. Then there is the total time. Then comes the transfer number: in the example given, of the 427700 files checked so far, only 10857 needed to be transferred; based on the modification times, rsync decided that no transfer was needed for the rest. Finally there is the number of files left to check and the total. Modern rsync implementations build the list that counts towards the "total" on the fly, adding to it only when the number of unchecked files drops below 1000.
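If you want to see why rsync decides to transfer or skip each file, the documented -i/--itemize-changes flag prints a per-file change summary. A sketch with hypothetical paths:

# One line per file explaining the decision
rsync -avi /path/to/src/ host:/path/to/dest/
# A line such as ">f.st......" means: file (f) sent (>) because
# its size (s) and modification time (t) differ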

Equivalent of time for memory checking

We can use time in a Unix environment to see how long something took:
shell> time some_random_command
real 0m0.709s
user 0m0.008s
sys 0m0.012s
Is there an equivalent for recording the memory usage of the process(es)? In particular, I'm interested in peak allocation.
Check the man page for time. You can specify a format string that outputs memory information. Note that you need the external binary (usually /usr/bin/time), not the shell built-in, for the -f option. For example:
> /usr/bin/time -f "mem: %M" some_random_command
mem: NNNN
will output the maximum resident set size of the process during its lifetime, in kilobytes.
Can you not use ps? e.g. ps v <pid> will return memory information.
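For a process that is still running, a sketch for sampling current and peak memory on Linux (the background job stands in for your own command):

some_random_command &
ps v $!                      # one-shot snapshot including RSS
grep VmHWM /proc/$!/status   # peak resident set size so far, in kB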
