Does anyone know how to interpret the result of the "Chi Square" operation in CyberChef? I thought I was supposed to get a p-value between 0 and 1, but I'm getting larger values. For example, I created a compressed (zip) and encrypted (AES) version of the same text file, and I got 331.6060 for the zip file, and 219.8651 for the encrypted file.
I was mostly hoping to get some clue as to whether a file is encrypted or compressed using Chi Square. For example, do encrypted files usually give a lower chi-square value compared to compressed files?
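For reference, a byte-level chi-square statistic is typically computed by comparing the observed count of each of the 256 possible byte values against a uniform expectation; the result is the raw statistic, not a p-value, and since a chi-square distribution's mean equals its degrees of freedom, random-looking data should hover around 255. A minimal sketch in R of what such a test computes (the file name is illustrative, and the uniform-expectation assumption is mine, not confirmed CyberChef behaviour):
## chi-square of byte frequencies against a uniform expectation
bytes    <- readBin("test.bin", "raw", n = file.size("test.bin"))
observed <- tabulate(as.integer(bytes) + 1L, nbins = 256)  # counts for byte values 0..255
expected <- length(bytes) / 256                            # uniform expectation
chisq    <- sum((observed - expected)^2 / expected)        # raw statistic, not a p-value
pval     <- pchisq(chisq, df = 255, lower.tail = FALSE)    # convert to a p-value if needed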
I have a very large species dataset from GBIF (178 GB zipped; approximately 800 GB when unzipped, as TSV). My Mac only has 512 GB of storage and 8 GB of RAM; however, I am not in need of all of this data.
Are there any approaches I can take to unzip the file without eating all of my disk space, extracting only a portion of the dataset by filtering out rows relative to a column? For example, it has occurrence values going back to 1600, and I only need data for the last 2 years, which I believe my machine can more than handle. Perhaps there is a library with a function that can filter rows while loading the data?
I am unsure of how to unzip properly, and from looking at unzipping libraries, unzip() according to this article truncates data over 4 GB. My worry is also where I could store 800 GB of data when unzipped.
Update:
It seems that all the packages I have come across stop at 4 GB after decompression. I am wondering if it is possible to create a function that decompresses up to the 4 GB mark, records the point it has reached, then resumes decompression from that point, and continues until the whole .zip file has been decompressed. It could store the decompressed files in a folder, so that you can access them with something like list.files(). Any ideas if this can be done?
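A minimal sketch of one streaming approach in R, assuming the archive contains a single TSV (occurrence.txt) with a year column, and that the system unzip (which supports Zip64) is available, as it is on macOS; all file and column names are illustrative. unzip -p streams the decompressed bytes to stdout, so the full 800 GB is never written to disk or held in memory:
## stream the zipped TSV through the system unzip and keep only recent rows
con <- pipe("unzip -p gbif_download.zip occurrence.txt", "r")
header <- readLines(con, n = 1)
year_i <- match("year", strsplit(header, "\t")[[1]])   # locate the year column
out <- file("recent_records.tsv", "w")
writeLines(header, out)
repeat {
  chunk <- readLines(con, n = 100000)                  # read in manageable chunks
  if (length(chunk) == 0) break
  years <- suppressWarnings(as.integer(
    vapply(strsplit(chunk, "\t"), `[`, character(1), year_i)))
  keep <- !is.na(years) & years >= 2021                # adjust the cutoff as needed
  if (any(keep)) writeLines(chunk[keep], out)
}
close(con); close(out)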
I want to download raw fastq files from an RNA-seq experiment to get gene expression values, but GEO only provides .bed.gz and .wig.gz formats. What can I do to get the RPKM values? Thank you very much!
In order to calculate RPKM, you need (mapped) raw reads as contained in BAM/SAM or even CRAM files. Wiggle, BED and their derivatives such as bigWig are condensed versions of those that contain only the coverage (mainly used for plotting); that is, they have lost the per-read information needed for counting and therefore for calculating RPKM (or FPKM/TPM, for that matter).
The standard approach is to start from a BAM file, extract the read counts for regions of interest, and calculate RPKM etc. There are many pipelines out there, such as this one.
If BAM files are not available, GEO usually has at least the raw fastq files (or SRA files that can be converted to fastq) as a basis for mapping to obtain a BAM file. Also have a look at ArrayExpress; they may have the raw files for that project, since it mirrors GEO.
As a word of warning: if you intend to do differential expression analysis, you need to start from the raw counts, not the RPKM values.
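For the calculation itself, once you have raw counts per gene, a minimal sketch of the RPKM formula in R (gene names and numbers are illustrative, and the sum of counts here stands in for the total mapped library size):
## RPKM = counts / (gene length in kb) / (total mapped reads in millions)
rpkm <- function(counts, lengths_bp) {
  counts / (lengths_bp / 1e3) / (sum(counts) / 1e6)
}
counts     <- c(geneA = 500, geneB = 1200, geneC = 30)
lengths_bp <- c(geneA = 2000, geneB = 6000, geneC = 800)
rpkm(counts, lengths_bp)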
I have a problem with unformatted data, and I don't know where it arises, so I will post my entire workflow.
I'm integrating my own code into an existing climate model, written in Fortran, to generate a custom variable from the model output. I have been successful in getting sensible, readable formatted output (values up to the thousands), but when I try to write unformatted output, the values I get are absurd (on the scale of 1E10).
Would anyone be able to take a look at my process and see where I might be going wrong?
I'm unable to provide a functional replication of the entire code used to output the data; however, the relevant snippet is:
c write customvar to file [UNFORMATTED]
open (unit=10,file="~/output_test_u",form="unformatted")
write (10)customvar
close(10)
c write customvar to file [FORMATTED]
c open (unit=10,file="~/output_test_f")
c write (10,*)customvar
c close(10)
The model was run twice, once with the FORMATTED code commented out and once with the UNFORMATTED code commented out (I now realise I could have done it in a single run by using different unit numbers). Either way, different runs should not produce different values.
The files produced are available here: unformatted (9 KB) and formatted (31 KB).
In order to interpret these files, I am using R. The following code is what I used to read each file, and shape them into comparable matrices.
##Read in FORMATTED data
formatted <- scan(file="output_test_f",what="numeric")
formatted <- (matrix(formatted,ncol=64,byrow=T))
formatted <- apply(formatted,1:2,as.numeric)
##Read in UNFORMATTED data
to.read <- file("output_test_u","rb")
unformatted <- readBin(to.read,integer(),n=10000)
close(to.read)
unformatted <- unformatted[c(-1,-2050)] # remove the leading and trailing record markers ("padding")
unformatted <- matrix(unformatted,ncol=64,byrow=T)
unformatted <- apply(unformatted,1:2,as.numeric)
In order to check that the general structure of the data is the same between the two files, I checked that zero and non-zero values were in the same positions in each matrix (each value represents a grid square; zeros represent where there was sea), using:
as.logical(unformatted)-as.logical(formatted)
and an array of zeros was returned, indicating that it is just the values that differ between the two, and not the way I've shaped them.
To see how the values relate to each other, I plotted the formatted values against the unformatted values (with all zero values removed). As the plot shows, they have some sort of relationship, so the inflation of the values is not random.
I am completely stumped as to why the unformatted data values are so inflated. Is there an error in the way I'm reading and interpreting the file? Is there something about the way Fortran writes unformatted data that alters the values?
The usual method Fortran uses to write an unformatted (sequential-access) file is:
A leading record marker, usually four bytes, with the length of the following record
The actual data
A trailing record marker, the same number of bytes as the leading record marker, with the same information (used for BACKSPACE)
Record markers are usually four bytes, but eight-byte markers have also been sighted (e.g. in very old versions of gfortran on 64-bit systems).
If you don't want to deal with these complications, just use stream access. On the Fortran side, open the file with
OPEN(unit=10,file="foo.dat",form="unformatted",access="stream")
This will give you a stream-oriented I/O model like C's binary streams.
Otherwise, you will have to look at your compiler's documentation to see how exactly unformatted I/O is implemented, and take care of the record markers on the R side. A word of caution here: different compilers have different methods of dealing with very long records of more than 2^31 bytes, even if they use four-byte record markers.
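For illustration, a minimal sketch of handling the markers on the R side, using the file name from the question and assuming four-byte markers and four-byte reals (both assumptions should be verified against your compiler):
## read one sequential-access record: leading marker, payload, trailing marker
to.read <- file("output_test_u", "rb")
reclen  <- readBin(to.read, integer(), size = 4)                 # leading marker: record length in bytes
payload <- readBin(to.read, numeric(), size = 4, n = reclen / 4) # assuming 4-byte reals
trailer <- readBin(to.read, integer(), size = 4)                 # trailing marker repeats the length
stopifnot(trailer == reclen)
close(to.read)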
Following on from the comments of @Stibu and @IanH, I experimented with the R code and found that the source of the error was incorrect handling of the byte size in R. Explicitly specifying a size of 4 bytes, i.e.
unformatted <- readBin(to.read, integer(), size = 4, n = 10000)
allows the data to be read in perfectly.
I want to know the structure of an unknown binary file generated by a Fortran routine. To that end, I downloaded a hex editor. I am fairly new to the whole concept. I can see some character strings in the editor, but the rest is just dots and junk characters.
I tried some online converters, but they only convert to the decimal system. Is there any way to figure out whether a given stretch of hex represents an integer or a real?
I also referred to following thread, but I couldn't get much out of it.
Hex editor for viewing combined string and float data
Any help would be much appreciated. Thanks
The short answer is no. If you really know nothing of the format, then you are stuck. You might see some "obvious" text, in some language, but beyond that it's pretty much impossible. Your hex editor reads the file as a sequence of bytes and usually displays the ASCII equivalent beside each hex value; if a byte is not a printable ASCII character, it usually displays a '.' instead.
So, text aside, if you see the value $31 in the file, you have no way of knowing whether it represents a single character ('1'), or is part of a 2-byte word, a 4-byte long, or indeed an 8-byte floating-point number.
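A quick R illustration of that ambiguity: the same four bytes read three different ways (the integer value shown assumes a little-endian machine):
## one byte sequence, three interpretations
bytes <- as.raw(c(0x31, 0x32, 0x33, 0x34))
rawToChar(bytes)                     # the text "1234"
readBin(bytes, integer(), size = 4)  # one 32-bit integer: 875770417
readBin(bytes, numeric(), size = 4)  # one 32-bit float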
In general, that is going to be pretty hard! Some of the following ideas may help.
Can you give the Fortran program inputs that change the length/amount of its output data? E.g. can you make it produce 1 unit of output, then 8 units, then 64 units (by "unit" I mean however many values it emits at a time, if that is how it works)? If so, you can plot output file length against number of units of output, and the intercept will tell you how many bytes of header there are, if any (see the sketch after this list). For example, if making it produce one number yields an output file 24 bytes long, and making it produce 2 numbers yields 28 bytes, you might deduce a 20-byte header at the start followed by 4 bytes per number.
Can you generate some known output? E.g. Can you make it produce zero, or 256 as an output, if so, you can search for zeroes and FF in your output file and see if you can locate them.
Make it generate a zero (or some other number) and save the output file; then make it generate a one (or some other, different number) and save that output file; then difference the two files to deduce where that number is located. Or just make it produce any two different numbers and see how much of the output file changes.
Do you have another program that can understand the file? Or some way of knowing any numbers in the file? If so, get those numbers and convert them into hex and look for their positions in the hex dump.
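Here is the sketch promised in the first idea above: regress observed file size on the number of values produced; the intercept estimates the header size and the slope the bytes per value. The sizes below are illustrative, matching the 20-byte-header example:
## file-length regression to estimate header size and bytes per value
n_values  <- c(1, 8, 64)          # how many values you asked the program for
file_size <- c(24, 52, 276)       # observed file sizes in bytes (illustrative)
fit <- lm(file_size ~ n_values)
coef(fit)                         # intercept ~ header bytes (20), slope ~ bytes per value (4)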
I am currently doing a large amount of data analysis in Fortran. I have been using R to plot most of my results, as Fortran is ill-suited to visualization. Until now, the data sets have been two-dimensional and rather small, so I've gotten away with routines that write the data to be plotted, along with various plot parameters, to a .CSV file, then use a system call to run an R script that reads the file and generates the required plot.
However, I now find myself dealing with somewhat larger 3D data sets, and I do not know whether I can feasibly continue in this manner (notably, sending and properly reading in a 3D array via .CSV is rather more difficult, and takes up a lot of excess memory, which is a problem given the size of the data sets).
Does anyone know an efficient way of sending data from Fortran to R? The only utility I have found for this (RFortran) is Windows-only, and my work computer is a Mac. I know that R possesses a rudimentary Fortran interface, but I am calling R from Fortran, not vice versa; moreover, given the number of plot parameters I am sending (axis labels, plot titles, axis units and limits, etc., many of which are optional and have default values in my current routines), I am not sure it has the features I require.
I would go for writing NetCDF files from Fortran. These files can contain large amounts of multi-dimensional data, and there are good bindings for creating NetCDF files from within Fortran (it is used a lot in climate models). In addition, R has excellent support for working with NetCDF files in the form of the ncdf package. It is, for example, very easy to read only a small portion of the data cube into memory (only some timesteps, or some geographic region). Finally, NetCDF works across all platforms.
In terms of workflow, I would let the Fortran program generate NetCDF files plus the graphics parameters in a separate file (data.nc and data.plt, for example), and then call R as a post-processing step. In this way you do not need to interface R and Fortran directly. The entire workflow could be managed by a separate script (e.g. in Python) which calls the Fortran model, makes a list of the NetCDF/.plt files, and creates the plots.
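For the R side, a minimal sketch of reading just a slice of the cube, here with the newer ncdf4 package (the older ncdf package has an analogous interface); the variable name and dimension layout are assumptions:
## read only the first 24 timesteps of a lon x lat x time variable
library(ncdf4)
nc <- nc_open("data.nc")
slice <- ncvar_get(nc, "customvar",
                   start = c(1, 1, 1),
                   count = c(-1, -1, 24))   # -1 means "the whole dimension"
nc_close(nc)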
So, it turns out that sending arrays between Fortran and R via unformatted files is trivially easy. Both are column-major, so one needs to do no more than write an unformatted file containing the array and another containing the array's shape and size information, and then read the data directly into an array of the proper size and shape in R.
Sample code for an n-dimensional array of integers, a, with dimension i having size s(i).
Fortran-side (access must be set to "stream", else record markers will be inserted around every write):
open(unit = 1, file="testheader.dat", form="unformatted", access="stream", status="unknown")
open(unit = 2, file="testdata.dat", form="unformatted", access="stream", status="unknown")
write(1) n
do i=1,n
write(1) s(i)
enddo
write(2) a
R-side (be sure that you have endianness correct, or this will fail miserably):
testheader = file("testheader.dat", "rb")
testdata = file("testdata.dat", "rb")
dims <- readBin(testheader, integer(), endian="big")
sizes <- readBin(testheader, integer(), n=dims, endian="big")
dim(sizes) <- c(dims)
a <- readBin(testdata, integer(), n=prod(sizes), endian="big")
dim(a) <- sizes
close(testheader); close(testdata) # close the connections when done
You can put the header and data in the same file if you want.
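A minimal sketch of that single-file variant: sequential readBin calls simply continue from the current position in the connection, so the header reads naturally precede the data read (the file name is illustrative):
## header and data in one stream-access file
con   <- file("test.dat", "rb")
dims  <- readBin(con, integer(), endian = "big")
sizes <- readBin(con, integer(), n = dims, endian = "big")
a     <- readBin(con, integer(), n = prod(sizes), endian = "big")
close(con)
dim(a) <- sizes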