Testing bitmask when stored as integer and available as string

I have a bitmask (really a 'flagmask') of integer values (1, 2, 4, 8, 16 etc.) which apply to a field and I need to store this in a (text) log file. What I effectively store is something like "x=296" which indicates that for field "x", flags 256, 32 and 8 were set.
When searching the logs, how can I easily search this text string ("x=nnn") and determine from the value of "nnn" whether a specific flag was set? For instance, how could I look at the number and know that flag 8 was set?
I know this is a somewhat trivial question if we're doing 'true' bitmask processing, but I've not seen it asked this way before - the log searching will just be doing string matching, so it just sees a value of "296" and there is no way to convert it to its constituent flags - we're just using basic string searching with maybe some easy SQL in there.
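For reference, once the integer is in hand the decomposition itself is trivial; the difficulty here is only that the log search sees text. A minimal Python sketch of the decomposition (the function name is just for illustration):
def flags_set(value):
    """Return the power-of-two flags present in value, e.g. 296 -> [256, 32, 8]."""
    flags = []
    bit = 1
    while bit <= value:
        if value & bit:
            flags.append(bit)
        bit <<= 1
    return sorted(flags, reverse=True)

print(flags_set(296))  # [256, 32, 8]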

Your best bet is to do full string regex matching.
For the 8th bit (value 128) set, do this:
SELECT log_line
FROM log_table
WHERE log_line
RLIKE 'x=(128|129|130|131|132|133|134|135|136|137|138|139|140|141|142|143|144|145|146|147|148|149|150|151|152|153|154|155|156|157|158|159|160|161|162|163|164|165|166|167|168|169|170|171|172|173|174|175|176|177|178|179|180|181|182|183|184|185|186|187|188|189|190|191|192|193|194|195|196|197|198|199|200|201|202|203|204|205|206|207|208|209|210|211|212|213|214|215|216|217|218|219|220|221|222|223|224|225|226|227|228|229|230|231|232|233|234|235|236|237|238|239|240|241|242|243|244|245|246|247|248|249|250|251|252|253|254|255)[^0-9]'
Or simplified to:
RLIKE 'x=(12[89]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])[^0-9]'
The string "x=" must exist in the log, followed by the decimal number and a non-digit character after the number.
For the 2nd bit (value 2) set, do this:
SELECT log_line FROM log_table WHERE log_line RLIKE 'x=(2|3|6|7|10|11|14|15|18|19|22|23|26|27|30|31|34|35|38|39|42|43|46|47|50|51|54|55|58|59|62|63|66|67|70|71|74|75|78|79|82|83|86|87|90|91|94|95|98|99|102|103|106|107|110|111|114|115|118|119|122|123|126|127|130|131|134|135|138|139|142|143|146|147|150|151|154|155|158|159|162|163|166|167|170|171|174|175|178|179|182|183|186|187|190|191|194|195|198|199|202|203|206|207|210|211|214|215|218|219|222|223|226|227|230|231|234|235|238|239|242|243|246|247|250|251|254|255)[^0-9]'
I tested the bit-2 query on some live data:
SELECT count(log_line)
...
count(log_line)
128
1 row in set (0.002 sec)
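The same kind of alternation can be generated for any flag rather than written by hand. A minimal Python sketch that builds the pattern, assuming (as in the patterns above) that logged values never exceed 255:
# Build an RLIKE alternation matching every value 0-255 that has the given flag set.
# The field name "x" and the 255 ceiling are assumptions taken from the patterns above.
def rlike_for_flag(flag, field="x", max_value=255):
    values = [str(v) for v in range(max_value + 1) if v & flag]
    return "%s=(%s)[^0-9]" % (field, "|".join(values))

print(rlike_for_flag(2))    # x=(2|3|6|7|10|11|...)[^0-9]
print(rlike_for_flag(128))  # x=(128|129|130|...|255)[^0-9]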

Much depends on how "easy" the "easy SQL" must be.
If you can use string splitting to separate "X" from "296", then testing the flag with a bitwise AND against the value 296 is easy.
If you cannot, then, as you observed, 296 carries no visible trace of whether the 8 flag is set. You'd need to check for all the values that have that flag set, and of course that means exactly half of the available values. You can compress the regexp somewhat:
248 249 250 251 252 253 254 255 => 24[89]|25[0-5]
264 265 266 267 268 269 270 271 => 26[4-9]|27[01]
280 281 282 283 284 285 286 287 => 28[0-7]
296 297 298 299 300 301 302 303 => 29[6-9]|30[0-3]
...24[89]|25[0-5]|26[4-9]|27[01]|28[0-7]|29[6-9]|30[0-3]...
but I think the evaluation tree won't change very much; this kind of optimization is already present in most regex engines. Performance is going to be awful.
If at all possible I would try to rework the logs (maybe with a filter) to rewrite them as "X=000100101000b" or at least "X=0128h". The latter, hexadecimal form would allow you to search for flag 8 by looking only for "X=...[89A-F]h".
I have had to make changes like these several times. It involves preparing a pipe filter to create the new logs (it is more complicated if the logging process writes straight to a file; at worst, you might be forced to never run searches on the current log file, or to rotate the logs when such a search is needed), then slowly, during low-load periods, the old logs are retrieved, decompressed, reparsed, recompressed, and stored back. It's long and boring, but doable.
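As a rough illustration, here is a Python sketch of such a pipe filter, rewriting decimal "x=NNN" fields into the zero-padded hexadecimal form suggested above (the field name "x" and the 4-digit padding are assumptions):
import re
import sys

# Rewrite "x=296" as "x=0128h" so later searches can key on hex digits.
def to_hex_field(match):
    return "x=%04Xh" % int(match.group(1))

for line in sys.stdin:
    sys.stdout.write(re.sub(r"x=(\d+)", to_hex_field, line))
With the logs in that form, checking flag 8 reduces to matching a final hex digit in [89A-F], as noted above.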
For database-stored logs, it's much easier and you can even avoid any alterations to the process that does the logs - you add a flag to the rows to be converted, or persist the timestamps of the catch-up conversion:
SELECT ... FROM log WHERE creation > lowWaterMark ORDER BY creation DESC;
SELECT ... FROM log WHERE creation < highWaterMark ORDER BY creation ASC;
The retrieved rows are updated, and the appropriate watermark is updated with the last creation value that has been retrieved. The lowWaterMark chases the logging process. This way the only range you cannot search is from the current instant down to lowWaterMark, and that lag ultimately amounts to just a few seconds, except under heavy load.

One way to determine whether a specific flag is set in the value "nnn" is to use bitwise operators to check if the flag is present. For example, to check whether flag 8 is set in the value 296, you can use the bitwise AND operator:
nnn = 296
if (nnn & 8) == 8:
    print("flag 8 is set")
Another method I can think of would be to check whether the digit "8" appears in the log string "x=nnn" using regular expressions. Note that this only tells you whether the character "8" occurs in the text; for a summed value such as "x=296" the digit 8 never appears even though flag 8 is set, so this approach only works when flags are logged individually.
import re
log_string = "x=296"
if re.search("8", log_string):
    print("flag 8 is set")
You could also use a function that takes the log string and flag as input and returns a boolean indicating whether the flag is set in the log string.
def check_flag(log_string, flag):
    nnn = int(log_string.split('=')[1])
    return (nnn & flag) == flag

if check_flag("x=296", 8):
    print("flag 8 is set")

Use the bitwise AND operator (&) to check whether a specific flag is set in the value:
function hasFlag(value, flag) {
  return (value & flag) === flag;
}
const value = 296;
console.log(hasFlag(value, 256)); // true
console.log(hasFlag(value, 32)); // true
console.log(hasFlag(value, 8)); // true
console.log(hasFlag(value, 4)); // false

Related

Optimizing `data.table::fread` speed advice

I am experiencing read speeds that I believe are much slower than should be expected when trying to read a fairly large file in R with fread.
The file is ~60m rows x 147 columns, out of which I am only selecting 27 columns, directly in the fread call using select; only 23 of the 27 are found in the actual file. (Probably I inputted some of the strings incorrectly but I guess that matters less.)
data.table::fread("..\\TOI\\TOI_RAW_APextracted.csv",
verbose = TRUE,
select = cols2Select)
The system being used is an Azure VM with a 16-core Intel Xeon and 114 GB of RAM, running Windows 10.
I'm also using R 3.5.2, RStudio 1.2.1335 and data.table 1.12.0
I should also add that the file is a csv file that I have transferred onto the local drive of the VM, so there is no network / ethernet involved. I am not sure how Azure VMs work and what drives they use, but I would assume it's something equivalent to an SSD. Nothing else is running / being processed on the VM at the same time.
Please find below the verbose output of fread:
omp_get_max_threads() = 16
omp_get_thread_limit() = 2147483647
DTthreads = 0
RestoreAfterFork = true
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
  Using 16 threads (omp_get_max_threads()=16, nth=16)
  NAstrings = [<<NA>>]
  None of the NAstrings look like numbers.
  show progress = 1
  0/1 column will be read as integer
[02] Opening the file
  Opening file ..\TOI\TOI_RAW_APextracted.csv
  File opened, size = 49.00GB (52608776250 bytes).
  Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
  \n has been found in the input and different lines can end with different line endings (e.g. mixed \n and \r\n in one file). This is common and ideal.
[05] Skipping initial rows if needed
  Positioned on line 1 starting: <<"POLNO","ProdType","ProductCod>>
[06] Detect separator, quoting rule, and ncolumns
  Detecting sep automatically ...
  sep=',' with 100 lines of 147 fields using quote rule 0
  Detected 147 columns on line 1. This line is either column names or first data row. Line starts as: <<"POLNO","ProdType","ProductCod>>
  Quote rule picked = 0
  fill=false and the most number of columns found is 147
[07] Detect column types, good nrow estimate and whether first row is column names
  Number of sampling jump points = 100 because (52608776248 bytes from row 1 to eof) / (2 * 85068 jump0size) == 309216
  Type codes (jump 000) : A5AA5555A5AA5AAAA57777777555555552222AAAAAA25755555577555757AA5AA5AAAAA5555AAA2A...2222277555  Quote rule 0
  Type codes (jump 001) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777555577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 002) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277555  Quote rule 0
  Type codes (jump 003) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 010) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA5AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 031) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 098) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  Type codes (jump 100) : A5AA5555A5AA5AAAA5777777757777775A5A5AAAAAAA7777775577555777AA7AA5AAAAA7555AAAAA...2222277775  Quote rule 0
  'header' determined to be true due to column 2 containing a string on row 1 and a lower type (int32) in the rest of the 10045 sample rows
  =====
  Sampled 10045 rows (handled \n inside quoted fields) at 101 jump points
  Bytes from first data row on line 2 to the end of last row: 52608774311
  Line length: mean=956.51 sd=35.58 min=823 max=1063
  Estimated number of rows: 52608774311 / 956.51 = 55000757
  Initial alloc = 60500832 rows (55000757 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  =====
[08] Assign column names
[09] Apply user overrides on column types
  After 0 type and 124 drop user overrides : 05000005A0005AA0A0000770000077000A000A00000000770700000000000000A00A000000000000...0000000000
[10] Allocate memory for the datatable
  Allocating 23 column slots (147 - 124 dropped) with 60500832 rows
[11] Read the data
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  jumps=[0..50176), chunk_size=1048484, total_size=52608774311
|--------------------------------------------------|
|==================================================|
  Read 54964696 rows x 23 columns from 49.00GB (52608776250 bytes) file in 30:26.810 wall clock time
[12] Finalizing the datatable
  Type counts:
124 : drop '0'
3 : int32 '5'
7 : float64 '7'
13 : string 'A'
=============================
0.000s ( 0%) Memory map 48.996GB file
0.035s ( 0%) sep=',' ncol=147 and header detection
0.001s ( 0%) Column type detection using 10045 sample rows
6.000s ( 0%) Allocation of 60500832 rows x 147 cols (9.466GB) of which 54964696 ( 91%) rows used
1820.775s (100%) Reading 50176 chunks (0 swept) of 1.000MB (each chunk 1095 rows) using 16 threads
   + 1653.728s ( 91%) Parse to row-major thread buffers (grown 32 times)
   +   22.774s ( 1%) Transpose
   +  144.273s ( 8%) Waiting
  24.545s ( 1%) Rereading 1 columns due to out-of-sample type exceptions
1826.810s Total
Column 2 ("ProdType") bumped from 'int32' to 'string' due to <<"B810">> on row 14
Basically, I would like to find out whether this is just normal or if there is anything I can do to improve these reading speeds. Based on various benchmarks I've seen around, and my own experience and intuition with fread on smaller files, I would have expected this to be read much more quickly.
Also I was wondering if the multi-core capabilities are fully being used, as I have heard that under Windows this might not always be straightforward. My knowledge around this topic is pretty limited unfortunately, but it does appear from the verbose output that fread is detecting 16 cores.
Thoughts:
(1) If you are using Windows, use Microsoft Open R; even more so if the cloud is Azure. Actually, there may be coordination between Open R and the Azure client. Because of Intel's MKL and Microsoft's built-in enhancements, I find Microsoft Open R faster on Windows.
(2) I suspect 'select' and 'drop' work after a full file read. Maybe read the whole file, then subset or filter afterward.
(3) I think a restart is overkill. I run gc three times every so often, like this: gc();gc();gc(). I have heard others say this does nothing. But at least it makes me feel better. Actually, I notice it helps me on Windows.
(4) The latest versions of data.table fread are implementing 'YAML'. This looks promising.
(5) setDTthreads(0) uses all the cores. Too much parallelization can work against you. Try halving your cores.

What does the number next to the extended attributes in ls -l@ output mean?

What does the number next to the extended attributes in ls -l@ output mean, and how can I get it?
drwxr-xr-x@ 41 root wheel 1394 Nov 7 14:50 bin
com.apple.FinderInfo 32 //this number
com.apple.rootless 0 //and this
This is macOS-specific, I think. You may want to take a look at the xattr command here. The number displayed by ls is the size in bytes of the attribute. The meaning of the value of a particular attribute is arbitrary (as is the set of extended attributes a file may have) and really depends on the attribute itself.
To be consistent with your question tags, you can also access extended attributes programmatically from C by including sys/xattr.h.
This number is the size of the extended attribute in bytes.
You can get it with getxattr() from sys/xattr.h.

Fill-In validation with N format

I have a fill-in with the following code, made using the AppBuilder
DEFINE VARIABLE fichNoBuktiTransfer AS CHARACTER FORMAT "N(18)":U
LABEL "No.Bukti Transfer"
VIEW-AS FILL-IN NATIVE
SIZE 37.2 BY 1 NO-UNDO.
Since the format is N, it blocks the user from entering non-alphanumeric characters. However, it does not prevent the user from copy-pasting such entries into the fill-in. To catch those, I have error checking like this in the ON LEAVE trigger:
IF LENGTH(SELF:SCREEN-VALUE) > 18 THEN DO:
    SELF:SCREEN-VALUE = ''.
    RETURN NO-APPLY.
END.
vch-list = "!,*, ,#,#,$,%,^,&,*,(,),-,+,_,=".
REPEAT vinl-entry = 1 TO NUM-ENTRIES(vch-list):
    IF INDEX(SELF:SCREEN-VALUE, ENTRY(vinl-entry,vch-list)) > 0 THEN DO:
        SELF:SCREEN-VALUE = ''.
        RETURN NO-APPLY.
    END.
END.
However, after the error handling kicked in, when the user inputs any string and leaves the field, error 632 occurs.
Is there any way to disable the error message? Or should I approach the error handling in a different way?
EDIT: Forgot to mention, I am running on Openedge version 10.2B
You didn't mention the version, but I'll assume you have a version in which the CLIPBOARD system handle already exists.
I've simulated your program and I believe it shouldn't behave that way. It seems to me the error flag is raised anyway. My guess is even though those symbols can't be displayed, they are assigned to the screen value somehow.
Conjectures put aside, I've managed to suppress it by adding the following code:
ON CTRL-V OF FILL-IN-1 IN FRAME DEFAULT-FRAME
DO:
    if index(clipboard:value, vch-list) > 0 then
        return no-apply.
END.
Of course this means vch-list can't be scoped to your trigger anymore (in case it currently is), because you'll need the value before the leave. So I assigned the special-character list as an INIT value to the variable.
After doing this, I didn't get the error anymore. Hope it helps.
To track changes in a fill-in I always use at first this code:
ON VALUE-CHANGED OF FILL-IN-1 IN FRAME DEFAULT-FRAME
DO:
    /* proofing part */
    if ( index( clipboard:value, vch-list ) > 0 ) then do:
        return no-apply.
    end.
END.
You could add some mouse or developer events via AppBuilder to track changes in a fill-in.

How to Tune my Group By Xquery Map?

I am busy comparing SQL Server 2008 R2 and MarkLogic 8 with a simple Person entity.
My dataset is 1 million records/documents for both. Note: both databases are on the same machine (localhost).
The following SQL Server query is ready in a flash:
set statistics time on
select top 10 FirstName + ' ' + LastName, count(FirstName + ' ' + LastName)
from [Person]
group by FirstName + ' ' + LastName
order by count(FirstName + ' ' + LastName) desc
set statistics time off
Result is:
Richard Petter 421
Mark Petter 404
Erik Petter 400
Arjan Petter 239
Erik Wind 237
Jordi Petter 235
Richard Hilbrink 234
Mark Dominee 234
Richard De Boer 233
Erik Bakker 233
SQL Server Execution Times:
CPU time = 717 ms, elapsed time = 198 ms.
The XQuery on MarkLogic 8 however is much slower:
(
  let $m := map:map()
  let $build :=
    for $person in collection('/collections/Persons')/Person
    let $pname := $person/concat(FirstName/text(), ' ', LastName/text())
    return map:put(
      $m, $pname, sum((
        map:get($m, $pname), 1)))
  for $pname in map:keys($m)
  order by map:get($m, $pname) descending
  return
    concat($pname, ' => ', map:get($m, $pname))
)[1 to 10]
,
xdmp:query-meters()/qm:elapsed-time
Result is:
Richard Petter => 421
Mark Petter => 404
Erik Petter => 400
Arjan Petter => 239
Erik Wind => 237
Jordi Petter => 235
Mark Dominee => 234
Richard Hilbrink => 234
Erik Bakker => 233
Richard De Boer => 233
elapsed-time:PT42.797S
198 msec vs. 42 sec is, in my opinion, too much of a difference.
The XQuery is using a map to do the group by, according to this guide: https://blakeley.com/blogofile/archives/560/
I have 2 questions:
Is the XQuery tuneable in any way for better performance?
Is XQuery 3.0 with group by already usable on MarkLogic 8?
Thanks for the help!
As @wst said, the challenge with your current implementation is that it's loading up all the documents in order to pull out the first and last names, adding them up one by one, and then reporting on the top ten. Instead of doing that, you'll want to make use of indexes.
Let's assume that you have string range indexes set up on FirstName and LastName. In that case, you could run this:
xquery version "1.0-ml";
for $co in
  cts:element-value-co-occurrences(
    xs:QName("FirstName"),
    xs:QName("LastName"),
    ("frequency-order", "limit=10"))
return
  $co/cts:value[1] || ' ' || $co/cts:value[2] || ' => ' || cts:frequency($co)
This uses indexes to find First and Last names in the same document. cts:frequency indicates how often that co-occurrence happens. This is all index driven, so it will be very fast.
First, yes, there are many ways to tune queries in MarkLogic. One obvious way is using range indexes; however, I would highly recommend reading their documentation on the subject first:
https://docs.marklogic.com/guide/performance
For a higher-level look at the database architecture, there is a whitepaper called Inside MarkLogic Server that explains the design extensively:
https://developer.marklogic.com/inside-marklogic
Regarding group by, maybe someone from MarkLogic would like to comment officially, but as I understand it their position is that it's not possible to build a universally high-performing group by, so they choose not to implement it. This puts the responsibility on the developer to understand best practices for writing fast queries.
In your example specifically, it's highly unlikely that Mike Blakeley's map-based group by pattern is the issue. There are several different ways to profile queries in ML, and any of those should lead you to any hot spots. My guess would be that IO overhead to get the Person data is the problem. One common way to solve this is to configure range indexes for FirstName and LastName, and use cts:value-tuples to query them simultaneously from range indexes, which will avoid going to disk for every document not in the cache.
Both answers are relevant. But from what you are asking for, David C's answer is closest to what you appear to want. However, this answer assumes that you need to find the combination of first-name and last-name.
If your document had an identifying, unique field (think primary key), then:
- put a range index on that field
- use cts:element-values with the options (fragment-frequency, frequency-order, eager, concurrent, limit=10)
- the count comes from cts:frequency
- then use the resulting 10 IDs to grab the name information
If the unique ID field can be an integer, then the speeds are many times faster for the initial search as well.
And it's also possible to use my steps 1-2 to get the list of IDs essentially instantly and use these as a limiter on David C's answer - using an element-value-range-query on the IDs in the 'query' option. This stops you from having to build the name yourself, as in my option, and may speed up David C's approach.
Moral of the story - without first tuning your database for performance (indexes) and using specific queries (range queries, co-occurrence, etc.), the results have no chance of making any sense.
Moral of the story - part 2: lots of approaches - all relevant and viable. Subtle differences - all depending on your specific data.
The doc is listed below
https://docs.marklogic.com/cts:element-values

Null true expression not returning

I'm working in Access 2010. I want NULL values to return as 0 but I cannot get them to show up. The false values work fine. There should be 29 total rows, 20 returning 0 and 9 returning their value. Here is the code.
SELECT [QAT Export].Title, IIF(ISNULL(Count([QAT Export].[TaskID])),0,Count([QAT Export].[Task ID])) AS [Update Good]
FROM [QAT Export]
WHERE ((([QAT Export].[Task Status])<>"Closed" And ([QAT Export].[Task Status])<>"Cancelled") AND (([QAT Export].[Updated By]) <>"linker")AND((DateDiff("d",[Update Time],(Date())))<10))
GROUP BY [QAT Export].Title
ORDER BY [QAT Export].Title;
Your IIF() and IsNull() functions look fine.
Chances are: your WHERE clause is excluding the nulls through one of the tests in there.
What are the chances that, since [TaskID] is Null, [Task Status] is Null as well?
If that's the case, you'd need to test for IsNull([Task Status]) in your WHERE clause also.
Post your 29 records if you need more help. Most likely the data is causing your WHERE clause to drop those records.
