IIS 7 dynamic compression truncates downloaded files

I have an issue with an ASP.NET application that has recently been migrated to a new server running IIS 7 instead of IIS 6. The application has been upgraded to .NET 4, though most of the code remains as it was originally written in .NET 1.1. The application manages the company's static tables, and users can download any static table's data in CSV or XML from a web page.
The issue: when the IIS 7 feature 'Enable dynamic content compression' is turned on, the downloaded file is corrupted (the end of the file is trimmed). The corruption does NOT appear to be random, i.e. I always get the same result for the same input data. For different input data sets, the size of the effect (the number of characters trimmed) also differs. When I switch dynamic compression off, the problem disappears completely.
Examples of downloads (I am always posting only the last line of the whole CSV file):
static table A
dynamic compression off (no corruption, complete file length = 720 bytes):
2010-10-21 00:00:00;;"P ";"registrované partnerství";"REGISTERED PARNTERSHIP";"registrované partnerstvo";0
dynamic compression on (last 1 byte trimmed, i.e. the last line break lost its LF character; file length = 719 bytes):
static table B
dynamic compression off (no corruption, file length = 593349 bytes):
2012-04-16 00:00:00;;"800650";"Finanční centrum Třebíč";"PSB FC Třebíč";"PSB FC Třebíč";"PSB FC Třebíč";9;2;61;"67401";"Třebíč";"Karlovo náměstí 32/26";5057;2518;852;;;2012-05-02 00:00:00;0
dynamic compression on (last 2 bytes trimmed, i.e. the last CRLF is lost; file length = 593347 bytes):
static table C
dynamic compression off (no corruption, file length = 282 bytes):
2012-04-16 00:00:00;;"3";"Zaslat do vlastních rukou";0
dynamic compression on (last 5 characters trimmed, file length = 275 bytes):
2012-04-16 00:00:00;;"3";"Zaslat do vlastních ruk
static table D
dynamic compression off (no corruption, file length = 6506 bytes):
2010-02-20 00:00:00;2010-03-20 00:00:00;390;"879001";2;0
dynamic compression on (last 13 characters trimmed, file length = 6491 bytes):
2010-02-20 00:00:00;2010-03-20 00:00:00;390
The same happens for the XML output format, but in that case the effect is always the same: the XML is trimmed by exactly 1 character. As a consequence, the end tag of the root element is not closed:
</Export>
versus
</Export
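The truncation can be confirmed at the HTTP level by requesting the same export with and without an Accept-Encoding: gzip header and comparing the decoded bodies. A minimal Python sketch, where the URL is a placeholder for the real download page:

import gzip
import urllib.request

URL = "http://server/app/StaticTableExport.aspx?table=A&format=csv"  # placeholder

def fetch(accept_encoding):
    # urllib does not decompress automatically, so we see exactly what IIS sent.
    req = urllib.request.Request(URL, headers={"Accept-Encoding": accept_encoding})
    with urllib.request.urlopen(req) as resp:
        body = resp.read()
        if resp.headers.get("Content-Encoding") == "gzip":
            body = gzip.decompress(body)
        return body

plain = fetch("identity")
compressed = fetch("gzip")
print(len(plain), len(compressed))    # lengths differ when the bug is triggered
print(plain[-20:], compressed[-20:])  # shows the trimmed tail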
As a workaround I recommended turning dynamic compression off, but I would like to understand what causes this behaviour. I have searched the web for solutions to the same issue, but I have found only this one, which is probably not relevant, since the input data is plain, uncompressed CSV or XML:

Related

Read pdf page size on .NET Core

I have only one requirement: I need to read the PDF page size and determine whether a page is bigger than 17x17 inches, so that such pages are not sent to an external service which rejects such PDFs.
Is there any free library that works on .NET Core? I wasn't able to find one. Or has anyone implemented this by reading the binary file?
A PDF does not HAVE TO declare a single page size, since every page can be a different size; thus 100 pages may have 100 different page sizes.
However, many PDFs contain a plain-text entry for one or more pages, so you can (depending on how the file was constructed) parse it as text for the /MediaBox and/or /CropBox dimensions.
The first example PDF I pick and search for /MediaBox in WordPad tells me it is 210 mm x 297 mm (i.e. my local A4): /MediaBox [0 0 594.95996 841.91998], and for a 3-page file all 3 entries are the same.
You can try that from the command line with
type "filename.pdf" | find /i "/media"
but that may not work in all cases, so a better chance of a result (with more chaff) is
type "filename.pdf" | findstr /i "^/media ^/crop"
The values are in points, the default unit of 72 per inch (so they can be divided by 72 as a rough guide); however, that is not your aim, since you know you don't want more than 17x72 = 1224.
So in simple terms, if either value were over 1224 I could reject the file as "TOO BIG".
HOWEVER, I also need to consider the first two values (the 0 0 offsets): if one were +100 the limit becomes 100 more, and more importantly, if one were -100 then the desired 17" restriction would already fail at 1124.
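As a rough illustration of that arithmetic, a Python sketch that scans the raw bytes for plain-text /MediaBox entries and applies the 1224-point test (it will miss boxes stored inside compressed object streams, as cautioned above):

import re

LIMIT_PTS = 17 * 72  # 17 inches at 72 points per inch = 1224

with open("filename.pdf", "rb") as f:
    data = f.read()

# Find every /MediaBox [x0 y0 x1 y1] entry stored as plain text.
for box in re.findall(rb"/MediaBox\s*\[([^\]]*)\]", data):
    x0, y0, x1, y1 = (float(v) for v in box.decode("ascii", "ignore").split())
    width, height = x1 - x0, y1 - y0  # subtract the offsets instead of assuming 0 0
    if width > LIMIT_PTS or height > LIMIT_PTS:
        print("TOO BIG: %.2f x %.2f inches" % (width / 72, height / 72))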
So you can write a simple test in any method or language (even CMD); however, that would require too much expanding to cover all cases, so:
Seriously, I would use or shell out to a one-line command tool like xpdf/poppler pdfinfo, which parses all the different types of PDF, and then grep its output.
The output is similar for both, with many lines, but for your need
xpdf\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4) (rotated 0 degrees)
and
poppler\pdfinfo -box filename
gives Page size: 594.96 x 841.92 pts (A4)
Thus, to check that the file does not exceed 17" in either direction, it should be easy to set up a comparison testing that both values are under 1224.01.
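A minimal sketch of that shell-out in Python (pdfinfo must be on the PATH; by default it reports the first page, and -f/-l options can be added to cover every page):

import re
import subprocess

LIMIT_PTS = 1224.01  # 17 inches * 72 points, with the small margin suggested above

out = subprocess.run(["pdfinfo", "-box", "filename.pdf"],
                     capture_output=True, text=True, check=True).stdout

sizes = re.findall(r"Page\s+size:\s+([\d.]+) x ([\d.]+) pts", out)
ok = all(float(w) < LIMIT_PTS and float(h) < LIMIT_PTS for w, h in sizes)
print("OK" if ok else "TOO BIG")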

Using Cat to merge Files, ignoring the last Byte of each file

I have written a rudimentary HTTP downloader that downloads parts of a file and stores them in temporary, partial files in a directory.
I can use
cat * > outputfilename
To concatenate the partial files together as their order is given by the individual filenames.
However.
The range of each file is something like:
File 1: 0 - 1000
File 2: 1000 - 2000
File 3: 2000 - 3000
For a file that is 3000 bytes in size, i.e. the last byte of each part overlaps the first byte of the next.
The cat command therefore duplicates the overlapping bytes in the new file.
Specifically, we can see this because images come out wrong. For example (just using an image from imgur):
https://i.imgur.com/XEvBCtp.jpg
renders correctly for only 1/(the number of partial files) of the image.
To note: The original image is 250 KB.
The concatenated image is 287 KB.
I will be implementing this in C99 on Unix, as a method that calls exec.
I'm not sure where to upload the partial files to assist with Stack Overflow.
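Since adjacent parts overlap by exactly one byte, the merge only needs to drop the duplicated byte at each join. A minimal Python sketch of the idea (the part-file naming is hypothetical; the asker's C99 version would do the same byte arithmetic):

import glob

# Partial files sort correctly by name, as stated in the question.
parts = sorted(glob.glob("download.part*"))

with open("outputfilename", "wb") as out:
    for i, name in enumerate(parts):
        with open(name, "rb") as f:
            data = f.read()
        # Every part after the first starts with the last byte of the previous
        # part, so skip that duplicated byte at each join.
        out.write(data if i == 0 else data[1:])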

S3: How to do a partial read / seek without downloading the complete file?

Although they resemble files, objects in Amazon S3 aren't really "files", just like S3 buckets aren't really directories. On a Unix system I can use head to preview the first few lines of a file, no matter how large it is, but I can't do this on S3. So how do I do a partial read on S3?
S3 objects can be huge, but you don't have to fetch the entire thing just to read the first few bytes. The S3 APIs support the HTTP Range: header (see RFC 2616), which takes a byte-range argument.
Just add a Range: bytes=0-NN header to your S3 request, where NN is the zero-based offset of the last byte to read, and you'll fetch only those bytes rather than the whole file. Now you can preview that 900 GB CSV file you left in an S3 bucket without waiting for the entire thing to download. Read the full GET Object docs on Amazon's developer site.
The AWS .NET SDK appears to allow only fixed-ended ranges (see public ByteRange(long start, long end)). What if I want to start in the middle and read to the end? An HTTP range of Range: bytes=1000- is perfectly acceptable for "start at 1000 and read to the end", but I do not believe the .NET library allows for this.
The boto3 get_object API has an argument for partial reads:
s3 = boto3.client('s3')
resp = s3.get_object(Bucket=bucket, Key=key, Range='bytes={}-{}'.format(start_byte, stop_byte-1))
res = resp['Body'].read()
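For the open-ended case raised above, the same call also accepts the plain HTTP form of the header, so reading from an offset to the end needs no end byte (bucket and key names are placeholders):

import boto3

s3 = boto3.client('s3')
# 'bytes=1000-' means "start at offset 1000 and read to the end of the object".
resp = s3.get_object(Bucket='my_bucket', Key='my_key', Range='bytes=1000-')
tail = resp['Body'].read()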
Using Python you can preview the first records of a compressed file.
Connect using boto.
#Connect:
import boto

s3 = boto.connect_s3()
bname = 'my_bucket'
bucket = s3.get_bucket(bname, validate=False)
Read first 20 lines from gzip compressed file
#Read first 20 records
import csv
import io
from gzip import GzipFile
from boto.s3.key import Key

limit = 20
k = Key(bucket)
k.key = 'my_file.gz'
k.open()
gzipped = GzipFile(None, 'rb', fileobj=k)  # decompress the stream on the fly
reader = csv.reader(io.TextIOWrapper(gzipped, newline="", encoding="utf-8"), delimiter='^')
for i, line in enumerate(reader):
    if i >= limit:
        break
    print(i, line)
So it's the equivalent of the following Unix command:
zcat my_file.gz|head -20

Wordpress/Apache - 404 error with unicode characters in image filenames

We've recently moved a website to a new server, and are running into an odd issue where some uploaded images with unicode characters in the filename are giving us a 404 error.
Via ssh/FTP, we can see that the files are definitely there.
For example:
http://sjofasting.no/project/adnoy
none of the images are working:
Code:
<img class='image-display' title='' src='http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg' width='685' height='484'/>
SSH:
-rw-r--r-- 1 xxxxxxxx xxxxxxxx 836813 Aug 3 16:12 ådnøy_1_2.jpg
What is also strange is that if you navigate to the directory you can even click on the image and it works:
http://sjofasting.no/wp/wp-content/uploads/2012/03/
click on 'ådnøy_1_2.jpg' and it works.
Somehow wordpress is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/ådnøy_1_2.jpg
and copying from the direct folder browse is generating
http://sjofasting.no/wp/wp-content/uploads/2012/03/a%CC%8Adn%C3%B8y_1_2.jpg
What is going on??
edit:
If I copy the image url from the wordpress source I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellg%C3%A5rd-12.jpg
When copied from the apache browser I get:
http://sjofasting.no/wp/wp-content/uploads/2011/11/Bore-Strand-Hotellga%cc%8ard-12.jpg
What could account for this discrepancy between:
%C3%A5 and %cc%8a
??
Unicode normalisation.
0xC3 0xA5 is the UTF-8 encoding for U+00E5 a-with-ring.
0xCC 0x8A is the UTF-8 encoding for U+030A combining ring.
U+00E5 is the composed (Normal Form C) way of writing an a-ring; the letter a followed by U+030A is the decomposed (Normal Form D) way of writing it. å vs å - they should look the same, though they may differ slightly depending on font rendering.
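The two percent-encodings in the question are exactly these two normalization forms after UTF-8 encoding; a short Python illustration:

import unicodedata
from urllib.parse import quote

nfc = unicodedata.normalize("NFC", "a\u030A")  # composed form: single U+00E5
nfd = unicodedata.normalize("NFD", "\u00E5")   # decomposed form: 'a' + U+030A

print(quote(nfc))  # %C3%A5
print(quote(nfd))  # a%CC%8A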
Now normally it doesn't really matter which one you've got because sensible filesystems leave them untouched. If you save a file called [char U+00E5].txt (å.txt), it stays called that under Windows and Linux.
Macs, on the other hand, are insane. The filesystem prefers Normal Form D, to the extent that any composed characters you pass into it get converted into decomposed ones. If you put a file in called [char U+00E5].txt and immediately list the directory, you'll find you've actually got a file called a[char U+030A].txt. You can still access the file as [char U+00E5].txt on a Mac because it'll convert that input into Normal Form D too before looking it up, but you cannot recover the same filename in character sequence terms as you put in: it's a lossy conversion.
So if you save your files on a Mac and then transfer to a filesystem where [char U+00E5].txt and a[char U+030A].txt refer to different files, you will get broken links.
Update the pages to point to the Normal Form D versions of the URLs, or re-upload the files from a filesystem that doesn't egregiously mangle Unicode characters.
Think Different, Cause Bizarre Interoperability Problems.

jpg file difference: from Wireshark TCP stream and from a C++ socket

I'm trying to record a jpeg image sent by an Ethernet camera in a mjpg stream.
The image I obtain with my Borland C++ application (VSPCIP) looks identical in Notepad++ to the TCP stream saved from Wireshark, except for the number of characters: 15540 in my file versus 15342 in the Wireshark file, whereas the JPEG Content-Length is announced to be 15342.
That is to say, I have 198 more non-displayable characters than expected, but both files have 247 lines.
Here are the two files :
http://demo.ovh.com/fr/a61295d39f963998ba1244da2f55a27d/
Which tool could I use, in Notepad++ (I tried displaying as UTF-8 or ANSI: the files still look identical even though they don't have the same number of characters) or another editor, to view the non-displayable characters?
std::ofstream by default opens the file in text mode, which means it might translate newline characters ('\n' binary 0x0a) into a carriage-return/newline sequence ("\r\n", binary 0x0d and 0x0a).
Open the output file in binary mode and it will most likely solve your problem:
std::ofstream os("filename", std::ios_base::out | std::ios_base::binary);
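A quick way to confirm that explanation against the two files from the question (file names are placeholders): the size difference should equal the number of 0x0A bytes in the original stream, since each one gains a 0x0D in text mode.

# Compare the Wireshark capture with the file written by the application.
with open("wireshark_stream.jpg", "rb") as f:
    original = f.read()
with open("application_output.jpg", "rb") as f:
    written = f.read()

print(len(written) - len(original))  # 198 in the question
print(original.count(b"\n"))         # should match if '\n' -> '\r\n' is the cause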
