How do I use ZXing to get the raw data from the barcode?

I have a situation where I want to get the raw data contents of a QR code. The QR code contains data stored in binary form (it's actually a zip archive in deflate format, so it's non-text, i.e. binary).
I am using the latest version of the ZXing library on Java 11.
Result.getRawBytes was the obvious first attempt, but it doesn't seem to deliver what I want. Actually, I am not sure what it is supposed to deliver (maybe I just misunderstand the function?). I am kind of at a loss here.
EDIT: the QR code contains a small preamble (14 bytes) in UTF-8 that acts as an identifying header; these 14 bytes are immediately followed by the payload, which consists of compressed data (produced with the deflate function in Java), and those raw bytes are what is stored in the code. The whole specification of the contents can be found here (it is a UIC standard): https://www.era.europa.eu/sites/default/files/library/docs/recommendation/era_rec122_tap_tsi_revision_recommendation_technical_document_b12_en.pdf, page 36 and onward for those so inclined.
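A note on what getRawBytes delivers: as far as I can tell, it returns the symbol's raw codewords (mode indicators, length fields and padding included), not just the encoded payload, which would explain the mismatch. Below is a minimal, illustrative Java sketch of pulling the payload bytes via ZXing's BYTE_SEGMENTS result metadata instead, then inflating past the 14-byte header; the class name, input path and output buffer size are assumptions for the example:

import com.google.zxing.BinaryBitmap;
import com.google.zxing.DecodeHintType;
import com.google.zxing.MultiFormatReader;
import com.google.zxing.Result;
import com.google.zxing.ResultMetadataType;
import com.google.zxing.client.j2se.BufferedImageLuminanceSource;
import com.google.zxing.common.HybridBinarizer;

import javax.imageio.ImageIO;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.zip.Inflater;

public class RawQrPayload {
    public static void main(String[] args) throws Exception {
        // Load the image (path passed on the command line) and binarize it.
        BinaryBitmap bitmap = new BinaryBitmap(new HybridBinarizer(
                new BufferedImageLuminanceSource(ImageIO.read(new File(args[0])))));

        Map<DecodeHintType, Object> hints = new EnumMap<>(DecodeHintType.class);
        hints.put(DecodeHintType.PURE_BARCODE, Boolean.TRUE); // optional: image is nothing but the QR

        Result result = new MultiFormatReader().decode(bitmap, hints);

        // BYTE_SEGMENTS holds the undecoded byte-mode segments of the symbol.
        @SuppressWarnings("unchecked")
        List<byte[]> segments = (List<byte[]>)
                result.getResultMetadata().get(ResultMetadataType.BYTE_SEGMENTS);

        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (byte[] segment : segments) {
            buffer.write(segment);
        }
        byte[] raw = buffer.toByteArray();

        // Per the UIC layout described above: 14-byte UTF-8 header, then deflate data.
        Inflater inflater = new Inflater();
        inflater.setInput(raw, 14, raw.length - 14);
        byte[] out = new byte[64 * 1024];             // assumed upper bound for one ticket
        int n = inflater.inflate(out);
        inflater.end();
        System.out.println("inflated " + n + " bytes");
    }
}

BYTE_SEGMENTS is only populated for byte-mode segments, so this assumes the generator wrote the payload in byte mode (which the UIC layout suggests).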
EDIT 2:
When using NeoReader I get the following set of raw bytes from the QR code:

Related

Attempts to parse bencode / torrent file in R

I wish I could parse torrent files automatically via R. I tried to use the R bencode package:
library('bencode')
test_torrent <- readLines('/home/user/Downloads/some_file.torrent', encoding = "UTF-8")
decoded_torrent <- bencode::bdecode(test_torrent)
but ran into this error:
Error in bencode::bdecode(test_torrent) :
input string terminated unexpectedly
In addition if I try to parse just part of this file bdecode('\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf'), I get
Error in bdecode("\xe7\xc9\xe0\b\xfbD-\xd8\xd6(\xe2\004>\x9c\xda\005Zar\x8c\xdfV\x88\022t\xe4գi]\xcf") :
Wrong encoding '�'. Allowed values are i, l, d or a digit.
Maybe there is another way to do this in R? Or could I perhaps embed code in another language in the Rscript?
Thanks in advance!
It might be that the torrent file is somehow corrupted.
A bencode value must begin with the character i (for integers), l (for lists), d (for dictionaries) or a number (for the length of a string).
The example string ('\xe7\xc9...') doesn't start with any of those characters, and hence it can't be decoded.
See this for more info on the bencode format.
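To make that rule concrete, here is a small, purely illustrative Java sketch (hypothetical code, not from the R package) of the dispatch a bencode decoder performs on the first character. Note that strings are kept as raw bytes, which is exactly why treating a torrent as UTF-8 text fails:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class Bencode {

    // Decode a single bencode value. Strings are kept as byte[] because values
    // such as a torrent's "pieces" entry are raw SHA1 digests, not text.
    static Object decode(PushbackInputStream in) throws IOException {
        int c = in.read();
        if (c == 'i') {                               // integer: i<digits>e
            return Long.parseLong(readUntil(in, 'e'));
        } else if (c == 'l') {                        // list:    l<values>e
            List<Object> list = new ArrayList<>();
            while ((c = in.read()) != 'e') {
                in.unread(c);
                list.add(decode(in));
            }
            return list;
        } else if (c == 'd') {                        // dict:    d<key value ...>e
            Map<String, Object> map = new LinkedHashMap<>();
            while ((c = in.read()) != 'e') {
                in.unread(c);
                String key = new String((byte[]) decode(in), StandardCharsets.ISO_8859_1);
                map.put(key, decode(in));
            }
            return map;
        } else if (c >= '0' && c <= '9') {            // string:  <length>:<bytes>
            in.unread(c);
            int length = Integer.parseInt(readUntil(in, ':'));
            byte[] bytes = new byte[length];
            for (int i = 0; i < length; i++) {
                bytes[i] = (byte) in.read();
            }
            return bytes;
        }
        // The same condition the R package reports as "Wrong encoding".
        throw new IOException("Wrong encoding '" + (char) c + "'");
    }

    private static String readUntil(PushbackInputStream in, char stop) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int c = in.read(); c != stop; c = in.read()) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "d3:foo3:bare".getBytes(StandardCharsets.ISO_8859_1);
        Object value = decode(new PushbackInputStream(new ByteArrayInputStream(sample)));
        System.out.println(value);                    // prints {foo=[B@...}
    }
}

The practical upshot for the question above: a decoder must read its input as bytes, not as decoded text, or the length-prefixed binary strings get mangled before the parser ever sees them.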
There seem to be several issues here.
Firstly, your code should not treat torrent files as text files in UTF-8 encoding. The content a torrent describes is split into equally-sized pieces (except, possibly, the last one), and the torrent file contains a concatenation of the SHA1 hashes of those pieces. SHA1 hashes are unlikely to be valid UTF-8 strings.
So, you should not read the file into memory using readLines, because that is for text files. Instead, you should use a connection:
test_torrent <- file("/home/user/Downloads/some_file.torrent")
open(test_torrent, "rb")
bencode::bdecode(test_torrent)
Secondly, it seems that this library suffers from a similar issue: the readChar function it uses also assumes that it's dealing with text.
This might be due to recent R version changes, though, seeing as the library is over six years old. I was able to apply a quick hack and get it working by passing useBytes=TRUE to readChar.
https://github.com/UkuLoskit/R-bencode/commit/b97091638ee6839befc5d188d47c02567499ce96
You can install my version as follows:
install.packages("devtools")
library(devtools)
devtools::install_github("UkuLoskit/R-bencode")
Caveat lector! I'm not an R programmer :).

How to read a BLOB with qt-type compression?

I have files (about 100k of them, to be specific) containing data from weather radars; one file is one radar image. Each is a mosaic of data from several radars, forming a map of reflectivity over the whole country.
The files have the extension .cmax and I need to convert them to something more useful (e.g. an array of reflectivities) for further use.
I have asked the data provider how to read those files. They responded:
The standard product format in our system (.cmax) is the internal format of the company that provides us with the software. It consists of an XML part and a binary part. It can be read as a stream of bytes: first parse the initial bytes as XML, then treat the rest (the BLOBs) as binary data compressed with the "qt" method. You need to unpack them using a library that supports this compression mode. In general, you have to work a little, but it can be done in virtually any programming language.
The main issue is the binary part of the data. I have tried to decompress it with zlib (which is what googling "qt compression" turns up) and to read it as binary data in C++. Neither worked. It also doesn't seem reasonable to me to pull in Qt just to read that data.
The file begins with those lines:
<product version="5.44.5" datetime="2017-01-01T18:00:00" datatype="dBZ" type="cmax" name="CMAX" owner="">
<data time="18:00:00" date="2017-01-01">
Then there are the radar specifications and image details (active radars, min and max reflectivity, etc.). The XML part ends with:
</product>
<!-- END XML -->
<BLOB blobid="0" size="79617" compression="qt">(here are lots of binary data)</BLOB>
I'm looking for a way (a tool?) to convert that binary data; for example, it could be that mentioned library.
Looking at the details, this is most likely the Leonardo (Selex/Gematronik) Rainbow5 format. zlib is the right library for decompression, but there are some tricks to it. A Python reader is implemented in the wradlib library (https://github.com/wradlib); maybe you can adapt that code. Disclaimer: I'm one of the wradlib devs.
Did you try simply using the qUncompress() function? https://doc.qt.io/qt-5/qbytearray.html#qUncompress
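For anyone unpacking these blobs outside Qt: as far as I can tell, qCompress-style data is just a 4-byte big-endian expected-size prefix followed by an ordinary zlib stream, so any zlib binding should work. A hedged Java sketch of that layout (the prefix handling is an assumption; verify against your files):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public class QtBlob {

    // Assumed layout (matching what qCompress produces): a 4-byte big-endian
    // expected-size prefix, followed by a standard zlib stream.
    static byte[] unqcompress(byte[] blob) throws DataFormatException {
        int expected = ((blob[0] & 0xFF) << 24) | ((blob[1] & 0xFF) << 16)
                     | ((blob[2] & 0xFF) << 8)  |  (blob[3] & 0xFF);
        Inflater inflater = new Inflater();
        inflater.setInput(blob, 4, blob.length - 4);
        byte[] out = new byte[expected];
        int produced = inflater.inflate(out);
        inflater.end();
        if (produced != expected) {
            throw new DataFormatException("expected " + expected + " bytes, got " + produced);
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        // args[0]: a file holding just the bytes between <BLOB ...> and </BLOB>
        // (hypothetical input; the XML part must already have been stripped off).
        byte[] data = unqcompress(Files.readAllBytes(Paths.get(args[0])));
        System.out.println("decompressed " + data.length + " bytes");
    }
}

If that layout assumption holds, qUncompress in C++ and the sketch above should produce identical output; if the blob turns out to be a bare zlib stream instead, drop the 4-byte offset.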

Is there any way we can find out whether a PDF file is compressed or not?

We are using iText to compress PDFs, but the issue is that we want to compress the files that were compressed before being uploaded to our site; if the files are uploaded without compression, we would like to leave them as they are.
To do that, we need to identify whether a PDF is compressed or not. I am wondering whether there is any way to identify this using iText or some other tool.
I have tried to Google it but couldn't find an appropriate answer.
Kindly let me know if you have any idea.
Thanks
There are several types of compression you can get in a PDF. Data for objects can be compressed and objects can be compressed into object streams.
I voted Mark's answer up because he's right: you won't get an answer if you're not more specific. I'll add my own answer with some extra information.
In PDF 1.0, a PDF file consisted of a mix of ASCII characters for the PDF syntax and binary code for objects such as images. A page stream would contain visible PDF operators and operands, for instance:
56.7 748.5 m
136.2 748.5 l
S
This code tells you that a line has to be drawn (S) between the coordinate (x = 56.7; y = 748.5) (because that's where the cursor is moved to with the m operator) and the coordinate (x = 136.2; y = 748.5) (because a path was constructed using the l operator that adds a line).
Starting with PDF 1.2, one could start using filters for such content streams (page content streams, form XObjects). In most cases, you'll discover a /Filter entry with value /FlateDecode in the stream dictionary. You'll hardly find any "modern" PDFs of which the contents aren't compressed.
Up until PDF 1.5, all indirect objects in a PDF document, as well as the cross-reference table, were stored in ASCII in the PDF file. Starting with PDF 1.5, specific types of objects can be stored in an object stream, and the cross-reference table can be compressed into a cross-reference stream. iText's PdfReader has an isNewXrefType() method to check whether this is the case. Maybe that's what you're looking for. Maybe you have PDFs that need to be read by software that isn't able to read PDFs of this type, but... you're not telling us.
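As an illustration, a minimal sketch of that check with iText 5 (assuming the com.itextpdf 5.x package layout; adjust for other iText versions):

import com.itextpdf.text.pdf.PdfReader;

public class XrefCheck {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader(args[0]);
        // true when the cross-reference data is stored in a compressed
        // xref stream (PDF 1.5+) instead of a classic ASCII xref table
        System.out.println("compressed xref: " + reader.isNewXrefType());
        reader.close();
    }
}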
Maybe we're completely misinterpreting the question. Maybe you want to know if you're receiving an actual PDF or a zip file with a PDF. Or maybe you want to really data-mine the different filters used inside the PDF. In short: your question isn't very clear, and I hope this answer explains why you should clarify.

Figuring out encodings of PDFs in R

I am currently scraping text data from several PDFs using the readPDF() function in the tm package. This all works very well, and in most cases the encoding seems to be "latin1"; in some, however, it is not. Is there a good way in R to check character encodings? I found the functions is.utf8() and is.locale() in the tau package, but that obviously only gets me so far.
Thanks.
The PDF specification defines the following encodings for simple fonts (each of which can address a maximum of 256 character shapes) for Latin text; these should be predefined in any conforming reader:
/StandardEncoding
(for Type 1 Latin text fonts, but not for TrueType fonts)
/MacRomanEncoding
(the Mac OS standard encoding, for both TrueType and Type1 fonts)
/PDFDocEncoding
(only used for text strings outside of the document's content streams; normally not used to show text from fonts)
/WinAnsiEncoding
(Windows code page 1252 encoding, for both TrueType and Type1 fonts)
/MacExpertEncoding
(the name is misleading: the encoding is not platform-specific; however, only a few fonts have a character set appropriate for this encoding)
Then there are 2 specific encodings for symbol fonts:
Symbol Encoding
ZapfDingBats Encoding
Also, fonts can have built-in encodings, which may deviate from a standard encoding in any way their creator wanted (e.g. this is also used for a differences encoding when embedded standard fonts are subsetted).
So in order to correctly interpret a PDF file, you'll have to look up each of the font encodings of the fonts used, and you must take into account any /Encoding using a /Differences array too.
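For illustration only, here is a hedged Java sketch (using iText 5, which is unrelated to the tm toolchain discussed below) of how one might enumerate the per-font /Encoding entries page by page:

import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.PdfReader;

public class FontEncodings {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader(args[0]);
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            PdfDictionary resources = reader.getPageN(i).getAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.getAsDict(PdfName.FONT);
            if (fonts == null) continue;
            for (PdfName key : fonts.getKeys()) {
                PdfDictionary font = fonts.getAsDict(key);
                // /Encoding may be a name (e.g. /WinAnsiEncoding) or a dictionary
                // carrying a /Differences array; both cases matter for decoding.
                System.out.println("page " + i + " " + key
                        + " base font " + font.getAsName(PdfName.BASEFONT)
                        + " encoding " + font.get(PdfName.ENCODING));
            }
        }
        reader.close();
    }
}

A name like /WinAnsiEncoding is the simple case; a dictionary with a /Differences array is the one that needs the extra handling described above.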
However, the overall task is still quite simple for simple fonts. The PDF viewer program just needs to map, 1:1, "each one of a sequence of bytes I see that's meant to represent a text string" to "exactly one glyph for me to draw, which I can look up in the encoding table".
For composite, CID-keyed fonts (which may contain many thousands of character shapes), the viewer program's lookup/mapping from "this is the sequence of bytes I see that I'm supposed to draw as text" to "this is the sequence of glyph shapes to draw" is no longer 1:1. Here, a sequence of one or more bytes needs to be decoded to select each glyph from the CIDFont.
And to help this CIDFont decoding, there need to be CMap structures around. CMaps define mappings from Unicode encodings to character collections. The PDF specification defines at least 5 dozen CMaps -- and their standard names -- for Chinese, Japanese and Korean language fonts. These pre-defined CMaps need not be embedded in the PDF (but the conforming PDF reader needs to know how to handle them correctly). But there are (of course) also custom CMaps which may have been generated 'on the fly' when the PDF-creating application wrote out the PDF. In that case the CMap needs to be embedded in the PDF file.
All the details of these complexities are laid down in the official PDF 1.7 specification.
I don't know much about R. But I have now poked a bit at CRAN, to see what the mentioned tm and tau packages are.
So tm is for text mining, and for PDF reading it requires and relies on the pdftotext utility from Poppler. I at first had the (obviously wrong) impression that the readPDF() function you mentioned did some low-level, library-based access to PDF objects directly in the PDF file... How wrong I was! It turns out it 'only' looks at the text output of the pdftotext command-line tool.
Now this explains why you'll probably not succeed in reading any of the PDFs that use more complex font encodings than the 'simple' Latin-1.
I'm afraid the reason for your problem is that Poppler and pdftotext are currently simply not able to handle these.
Maybe you're better off asking the tm maintainers for a feature :-): that they try to add support to their tm package for a more capable third-party PDF text extraction tool, such as PDFlib.com's TET (English version), which is surely the best text extraction utility on the planet (better than Adobe's own tools, BTW).

Decompress header of swf file (possible with qUncompress?)

I have some swf files generated with Adobe Flash.
Does anybody know how I can decompress their headers in Qt?
I need their size (width and height), frame rate and frame count.
Thanks
It's not documented whether qUncompress requires all the compressed data to be in the QByteArray in order to decompress it; the wording seems to imply that. I would imagine loading some large SWF into memory just to get a few bytes from the header is not practical.
If you can live with loading the whole file into memory, just load the file starting at offset 4 into a QByteArray and flip the byte order of the first 4 bytes (SWF stores its length little-endian, while qUncompress requires the length prefix in big-endian). Subtract 8 from the flipped 32-bit integer, since the stored length counts the whole uncompressed file, including the 8-byte header that is never part of the compressed stream. Then call qUncompress.
If loading the whole file is not ideal, you may be better off using the stream functions in zlib directly. That allows you to decompress the data piece by piece.
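To sketch that piece-by-piece idea in Java (java.util.zip wraps zlib): the sketch below assumes the classic 'FWS'/'CWS' layout, ignores the newer LZMA-compressed 'ZWS' variant, and leaves the RECT bit-unpacking for width/height as a comment:

import java.io.FileInputStream;
import java.util.Arrays;
import java.util.zip.Inflater;

public class SwfHeader {

    public static void main(String[] args) throws Exception {
        // Read only the first 512 bytes; that is plenty for the header fields.
        byte[] head = new byte[512];
        int read;
        try (FileInputStream in = new FileInputStream(args[0])) {
            read = in.read(head);
        }

        byte[] body;
        if (head[0] == 'C') {                          // "CWS": zlib-compressed body
            Inflater inflater = new Inflater();
            inflater.setInput(head, 8, read - 8);      // compressed data starts at offset 8
            body = new byte[64];
            inflater.inflate(body);                    // inflate only what we need
            inflater.end();
        } else {                                       // "FWS": uncompressed body
            body = Arrays.copyOfRange(head, 8, read);
        }

        // The frame size is a bit-packed RECT: 5 bits give the width of each of
        // the 4 fields (xmin, xmax, ymin, ymax), all measured in twips (1/20 px).
        int nbits = (body[0] & 0xFF) >>> 3;
        int rectBytes = (5 + 4 * nbits + 7) / 8;       // RECT length, rounded up to bytes
        // width/height would come from bit-unpacking xmax and ymax out of the RECT.

        // Frame rate (8.8 fixed point) and frame count follow immediately after.
        int fps = body[rectBytes + 1] & 0xFF;          // integer part of the rate
        int frames = (body[rectBytes + 2] & 0xFF) | ((body[rectBytes + 3] & 0xFF) << 8);
        System.out.println("fps=" + fps + " frames=" + frames);
    }
}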
