I am using the Newtonsoft.Json package.
I need to parse a JSON file which contains large base64-encoded data chunks.
One base64 element can be around 33 MB, and the file will contain 30 of these.
The data could look like this (simplified):
{
  "header": { "some_id": "foo", "other-data": 123 },
  "measurements": [
    {
      "info": { "unit": "m/s2", "channel": "ch1" },
      "data": "coUhRR.. 33 mb base64 data here ...=="
    },
    {
      "info": { "unit": "m/s", "channel": "ch2" },
      "data": "MxaDF.. 33 mb base64 data here ...=="
    }, ...
  ]
}
I want to output the base64 bytes to separate files, preferably using a CryptoStream with a FromBase64Transform to decode the JSON string value to bytes.
My problem is that the JsonTextReader keeps an internal char array, _chars.
This grows very large. Whenever a string value needs to be read, _chars doubles in size, starting at 1024 chars. Thus, if I am reading a 33 MB base64 element, _chars will grow to 64 million chars, taking up 128 MB of memory. And it doesn't shrink again for the lifetime of the JsonTextReader.
And in later versions, we will most likely need to handle even larger JSON files.
How can I solve this issue? I assume I have to extend JsonTextReader in some way, but it is not obvious what to override.
An alternative would be to write a TextReader wrapper that transforms the data above into a much shorter stream, while routing the base64 data to files.
So it would do a simple on-the-fly parsing of the stream and transform
"data": "coUhRR.. 33 mb base64 data here ...=="
to
"data": "file:tempfile_xxx.bin"
while writing the decoded data to tempfile_xxx.bin. But then I would need some clever black magic to detect the right places to modify the stream?
Have I overlooked some obvious solution?
Is there an alternative parser, which would work better in this case?
Related
I have been working with DICOM files that are about 4 MB each, but I recently received some which are 280 KB each. I am not sure whether this is because they are from different CT scanners or if the new DICOMs were compressed before being given to me.
Is there a way to find out, and if they are compressed, is there a way to uncompress them to the original size?
This is in continuation of the other answer from @kritzel_sw.
If you see any of the following UIDs in the (0002,0010) Transfer Syntax UID element:
1.2.840.10008.1.2 Implicit VR Endian: Default Transfer Syntax for DICOM
1.2.840.10008.1.2.1 Explicit VR Little Endian
1.2.840.10008.1.2.2 Explicit VR Big Endian
then the Pixel Data (7FE0,0010) is uncompressed. You will generally observe a bigger file size here.
Not part of your question, but objects other than images (a PDF, for example, in the case of a Structured Report) can be encapsulated with the following Transfer Syntax:
1.2.840.10008.1.2.1.99 Deflated Explicit VR Little Endian
Other well-known values for Transfer Syntax mean that the Pixel Data is compressed.
Note that private Transfer Syntax values are also possible for a data set. Implementation of those values is generally private to the respective manufacturer.
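If you want to check this programmatically rather than with a toolkit's command-line tools, here is a minimal sketch using pydicom (assuming pydicom is available; the file path is a placeholder):

import pydicom

ds = pydicom.dcmread("ct_image.dcm")  # placeholder path
ts = ds.file_meta.TransferSyntaxUID
print(ts, "-", ts.name)

# The three uncompressed transfer syntaxes listed above
uncompressed = {
    "1.2.840.10008.1.2",    # Implicit VR Little Endian
    "1.2.840.10008.1.2.1",  # Explicit VR Little Endian
    "1.2.840.10008.1.2.2",  # Explicit VR Big Endian
}
print("Pixel Data compressed:", str(ts) not in uncompressed)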
Yes and yes.
I recommend the binary tools from the OFFIS DICOM toolkit, but you will be able to achieve the same results with different toolkits. You can find the dcmtk here.
How to find out if your files are compressed:
dcmdump <filename>
Have a look at the metaheader, the attribute Transfer Syntax UID (0002,0010) in particular. Dcmdump "translates" the unique identifier to the human readable transfer syntax, e.g.
(0002,0010) UI =LittleEndianExplicit # 20, 1 TransferSyntaxUID
The Transfer Syntax tells you whether or not the pixel data in this DICOM file is compressed.
How to decompress compressed images:
dcmdjpeg <compressed DICOM file in> <uncompressed DICOM file out>
I am trying to download files to disk from Squeak.
My method worked fine for small text/html files, but due to lack of buffering, it was very slow for the large binary file https://mirror.racket-lang.org/installers/6.12/racket-6.12-x86_64-win32.exe.
Also, after it finished, the file was much larger (113 MB) than shown on the download page (75 MB).
My code looks like this:
download: anURL
    "download a file over HTTP and save it to disk under a name extracted from url."
    | ios name |
    name := ((anURL findTokens: '/') removeLast findTokens: '?') removeFirst.
    ios := FileStream oldFileNamed: name.
    ios nextPutAll: ((HTTPClient httpGetDocument: anURL) content).
    ios close.
    Transcript show: 'done'; cr.
I have tried [bytes := stream next: bufSize. bytes printTo: ios] for fixed-size blocks from the HTTP response's contentStream, in a [stream atEnd] whileFalse: loop, but that garbled the output file with single quotes around each block, and also added extra content after the blocks, which looked like all the characters of the stream, each single-quoted.
How can I implement buffered writing of an HTTP response to a disk file?
Also, is there a way to do this in squeak while showing download progress?
As Leandro already wrote, the issue is with #binary.
Your code is nearly correct; I have taken the liberty of running it, and now it downloads the whole file correctly:
| ios name anURL |
anURL := 'https://mirror.racket-lang.org/installers/6.12/racket-6.12-x86_64-win32.exe'.
name := ((anURL findTokens: '/') removeLast findTokens: '?') removeFirst.
ios := FileStream newFileNamed: 'C:\Users\user\Downloads\_squeak\', name.
ios binary.
ios nextPutAll: ((HTTPClient httpGetDocument: anURL) content).
ios close.
Transcript show: 'done'; cr.
As for the freezing, I think the issue is that there is one thread for the whole environment while you are downloading. That means that until you download the whole file, you won't be able to use Squeak.
I just tested in Pharo (easier to install) and the following code works as you want:
ZnClient new
    url: 'https://mirror.racket-lang.org/installers/6.12/racket-6.12-x86_64-win32.exe';
    downloadTo: 'C:\Users\user\Downloads\_squeak'.
The WebResponse class, when building the response content, creates a buffer large enough to hold the entire response, even for huge responses! I think this happens due to code in WebMessage>>#getContentWithProgress:.
I tried to copy data from the input SocketStream of WebResponse directly to an output FileStream.
I had to subclass WebClient and WebResponse, and write two methods.
Now the following code works as required.
| client link |
client := PkWebClient new.
link := 'http://localhost:8000/racket-6.12-x86_64-linux.sh'.
client download: link toFile: '/home/yo/test'.
I have verified block by block update and integrity of the downloaded file.
I include source below. The method streamContentDirectToFile: aFilePathString is the one that does things differently and solves the problem.
WebClient subclass: #PkWebClient
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'PK'!
!PkWebClient commentStamp: 'pk 3/28/2018 20:16' prior: 0!
Trying to download http directly to file.!
!PkWebClient methodsFor: 'as yet unclassified' stamp: 'pk 3/29/2018 13:29'!
download: urlString toFile: aFilePathString
    "Try to download large files sensibly"
    | res |
    res := self httpGet: urlString.
    res := PkWebResponse new copySameFrom: res.
    res streamContentDirectToFile: aFilePathString! !
WebResponse subclass: #PkWebResponse
    instanceVariableNames: ''
    classVariableNames: ''
    poolDictionaries: ''
    category: 'PK'!
!PkWebResponse commentStamp: 'pk 3/28/2018 20:49' prior: 0!
To make getContentWithProgress: better.!
!PkWebResponse methodsFor: 'as yet unclassified' stamp: 'pk 3/29/2018 13:20'!
streamContentDirectToFile: aFilePathString
    "stream response's content directly to file."
    | buffer ostream |
    stream binary.
    buffer := ByteArray new: 4096.
    ostream := FileStream oldFileNamed: aFilePathString.
    ostream binary.
    [stream atEnd]
        whileFalse: [buffer := stream nextInBuffer: 4096.
            stream receiveAvailableData.
            ostream nextPutAll: buffer].
    stream close.
    ostream close! !
Basically I am working on a Python project where I download and index files from the SEC EDGAR database. The problem, however, is that when using the requests module, it takes a very long time to save the text in a variable (between ~130 and 170 seconds for one file).
The file has roughly 16 million characters, and I wanted to see if there was any way to easily lower the time it takes to retrieve the text. Example:
import requests
url ="https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is in the code for r.text, specifically when no encoding was given (r.encoding is None). The time spent detecting the encoding was 20 seconds; I was able to skip it by defining the encoding.
...
r.encoding = 'utf-8'
...
Additional details
In my case, my request was not returning an encoding type. The response was 256k in size, and r.apparent_encoding was taking 20 seconds.
Looking into the text property: it tests whether there is an encoding. If there is none, it calls apparent_encoding, which scans the text to autodetect the encoding scheme.
On a long string this takes time. By defining the encoding of the response (as described above), you will skip the detection.
Validate that this is your issue
In your example above:
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course the output may get lost in the printing, so I recommend running the above in an interactive shell; it may become more apparent where you are losing the time, even without printing datetime.now().
From @martijn-pieters:
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
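For example, a minimal sketch with requests that streams the body straight to disk instead of building and printing a 15 MB string (the output file name is just an example):

import requests

url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"

with requests.get(url, stream=True) as r:
    r.raise_for_status()
    with open("goog10-kq42016.htm", "wb") as f:
        # iterate over raw bytes; no decoding, no printing
        for chunk in r.iter_content(chunk_size=64 * 1024):
            f.write(chunk)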
Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this on S3. So how do I do a partial read on S3?
S3 files can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 APIs support the HTTP Range: header (see RFC 2616), which takes a byte-range argument.
Just add a Range: bytes=0-NN header to your S3 request, where NN is the requested number of bytes to read, and you'll fetch only those bytes rather than read the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. Read the full GET Object docs on Amazon's developer docs.
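For example, assuming you have a plain HTTP(S) URL for the object (e.g. a presigned URL, a placeholder here), the range request is just an extra header; a minimal sketch in Python:

import requests

# Placeholder: any HTTP(S) URL for the object, e.g. a presigned URL
url = "https://my-bucket.s3.amazonaws.com/huge.csv?X-Amz-Signature=placeholder"

resp = requests.get(url, headers={"Range": "bytes=0-999"})
print(resp.status_code)    # 206 Partial Content when the range is honored
print(resp.content[:200])  # only the requested bytes were downloaded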
The AWS .NET SDK only shows that fixed-ended ranges are possible (re: public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at 1000 and read to the end", but I do not believe they have allowed for this in the .NET library.
The get_object API has an argument for partial reads:
s3 = boto3.client('s3')
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte-1))
res = resp['Body'].read()
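S3 also honors open-ended ranges (the bytes=1000- form mentioned in the comment above), and boto3 passes the Range string through untouched. A small sketch with placeholder bucket and key names:

import boto3

s3 = boto3.client('s3')

# read from byte 1000 to the end of the object
tail = s3.get_object(Bucket='my-bucket', Key='my-key', Range='bytes=1000-')['Body'].read()

# or just the first kilobyte
head = s3.get_object(Bucket='my-bucket', Key='my-key', Range='bytes=0-1023')['Body'].read()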
Using Python you can preview the first records of a compressed file.
Connect using boto.
#Connect:
import io, csv, boto
from boto.s3.key import Key
from gzip import GzipFile
s3 = boto.connect_s3()
bname = 'my_bucket'
bucket = s3.get_bucket(bname, validate=False)
Read the first 20 lines from the gzip-compressed file:
#Read first 20 records
limit = 20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
gzipped = GzipFile(None, 'rb', fileobj=k)
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for id, line in enumerate(reader):
    if id >= int(limit): break
    print(id, line)
So it's the equivalent of the following Unix command:
zcat my_file.gz | head -20
There are encrypted data bags in JSON files with some values I need to change. I need to run something like...
$ knife data bag from file show --secret-file path/to/secret DATABAGNAME --config path/to/knife.rb
But this command gives the error: Could not find or open file 'DATABAGNAME' in current directory or in 'data_bags/show/ewe-jenkins'. So obviously the command is not quite right. I need help figuring out the syntax...
I need a command that can be run from the chef-repo, or the data_bags directory, that will allow me to see the unencrypted values of the JSON data bag files. Ultimately I want to change some values, but getting the unencrypted values would be a good place to start :) Thanks!
Since you're talking about local JSON files, I'll assume you are using chef-zero / local mode. The JSON file can indeed be encrypted, and the content can be decrypted with knife.
Complete example:
Create key and databag item:
$ openssl rand -base64 512 | tr -d '\r\n' > /tmp/encrypted_data_bag_secret
$ knife data bag create mydatabag secretstuff --secret-file /tmp/encrypted_data_bag_secret -z
Enter this:
{
  "id": "secretstuff",
  "firstsecret": "must remain secret",
  "secondsecret": "also very secret"
}
The json file is indeed encrypted:
# cat data_bags/mydatabag/secretstuff.json
{
  "id": "secretstuff",
  "firstsecret": {
    "encrypted_data": "VafoT8Jc0lp7o4erCxz0WBrJYXjK6j+sJ+WGKJftX4BVF391rA1zWyHpToF0\nqvhn\n",
    "iv": "MhG09xFcwFAqX/IA3BusMg==\n",
    "version": 1,
    "cipher": "aes-256-cbc"
  },
  "secondsecret": {
    "encrypted_data": "Epj+2DuMOsf5MbDCOHEep7S12F6Z0kZ5yMuPv4a3Cr8dcQWCk/pd58OPGQgI\nUJ2J\n",
    "iv": "66AcYpoF4xw/rnYfPegPLw==\n",
    "version": 1,
    "cipher": "aes-256-cbc"
  }
}
Show decrypted content with knife:
# knife data bag show mydatabag secretstuff -z --secret-file /tmp/encrypted_data_bag_secret
Encrypted data bag detected, decrypting with provided secret.
firstsecret: must remain secret
id: secretstuff
secondsecret: also very secret
I think you are confusing the knife data bag show and knife data bag from file commands. The former is for displaying data from the server, the latter is for uploading it. You have both on the command line.
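So, to display the decrypted values of an existing item, the corrected command would look roughly like this (ITEMNAME is a placeholder for the data bag item's id; add -z if you are working against local JSON files in chef-zero / local mode as in the example above):
knife data bag show DATABAGNAME ITEMNAME --secret-file path/to/secret --config path/to/knife.rb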