Storing many text files with large similarities [closed] - web-scraping

I need to store millions of HTML files, each around 100kB (30kB gzipped). These files belong to a handful of groups. Files in each group have large similar chunks.
I would like to store these files compactly (much better than individual gzip) and retrieve them by key. I would insert new files over time, including ones with new structure. I'm not interested in searching the files.
Is there an existing solution for storing these files? For example, a specialized service on top of an object store.
What are possible approaches for a custom solution? I'm considering storing files in gzipped groups of 1000 and maintaining an index in a database.
Edit: I would be continuously adding files. I would also like to stream out all the files in insertion order every few weeks.

Slightly outside-the-box answer: put the files in a git repository. Apparently, its packfiles compress large chunks of similar bytes together.

You would want to concatenate your groups of a thousand files into a single file for gzipping, which should take advantage of the common blocks if they are within 32K bytes of each other in the concatenation. You could also try zstd, which has much larger dictionary sizes and would surely be able to take advantage of the common blocks.
You can look at gzlog for rapid appending of new data to a gzip stream.
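For illustration, here is a minimal Python sketch of the grouped-compression idea, assuming the third-party zstandard package; the group layout, index format and function names are made up for the example, not an established tool:

import json
import zstandard as zstd  # third-party package, assumed available

def build_group(paths, group_path, index_path, level=19):
    # Concatenate the members so zstd can exploit their shared chunks,
    # remembering where each file starts inside the uncompressed blob.
    index = {}
    blob = bytearray()
    for p in paths:
        data = open(p, "rb").read()
        index[p] = (len(blob), len(data))
        blob.extend(data)
    with open(group_path, "wb") as out:
        out.write(zstd.ZstdCompressor(level=level).compress(bytes(blob)))
    with open(index_path, "w") as idx:
        json.dump(index, idx)

def read_member(group_path, index_path, name):
    # Decompress the whole group, then slice out the requested member.
    with open(index_path) as idx:
        offset, length = json.load(idx)[name]
    with open(group_path, "rb") as grp:
        blob = zstd.ZstdDecompressor().decompress(grp.read())
    return blob[offset:offset + length]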

If you don't need to access individual files on a regular basis here's what you can do:
Create an "offset lookup" file that lists your file names and sizes. Concatenate all your files into a humongous huge.txt file. Zip huge.txt and store it alongside with lookup.txt
In the rare even of needing one of the files, unzip huge.txt, use lookup.txt to find where inside your huge.txt your file starts and how many bytes it has, and extract it from there.
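As a rough illustration, a Python sketch of this concatenate-and-index scheme might look as follows (the huge.txt.gz and lookup.txt names follow the answer; everything else is illustrative):

import gzip

def build_archive(paths, archive="huge.txt.gz", lookup="lookup.txt"):
    # Append every file to one compressed archive and record
    # name, offset and size (of the uncompressed data) in lookup.txt.
    offset = 0
    with gzip.open(archive, "wb") as out, open(lookup, "w") as idx:
        for p in paths:
            data = open(p, "rb").read()
            out.write(data)
            idx.write(f"{p}\t{offset}\t{len(data)}\n")
            offset += len(data)

def extract(name, archive="huge.txt.gz", lookup="lookup.txt"):
    # Find the file's offset and size, then decompress up to it and read it out.
    for line in open(lookup):
        p, offset, size = line.rstrip("\n").split("\t")
        if p == name:
            with gzip.open(archive, "rb") as f:
                f.seek(int(offset))  # gzip seek decompresses everything before the offset
                return f.read(int(size))
    raise KeyError(name)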

Related

How to choose a magic number for SQLite's PRAGMA application_id [closed]

So there's PRAGMA application_id, which is meant to identify an SQLite database that is used as a program's file format as being that specific format. The docs say one should choose a unique signed 32-bit integer for this and link https://www.sqlite.org/src/artifact?ci=trunk&filename=magic.txt as a list of registered types. But in this file, there are only a few entries.
So I have two questions:
Is it meaningful and common to actually run this pragma when using SQLite as the file format for a program?
If so, how should that number be chosen? Simply a random number? Or somehow derived from the program's name, homepage or whatever?
Edit:
In addition to MikeT's answer, I want to add that with this feature a file can be identified by file(1) using a magic definition, which can also include the user_version. For an application_id of e.g. 123:
0 string SQLite\ format\ 3
>68 belong =int(0x0000007b) Program 123 file (using SQLite)
>>60 belong x \b, Version %d
which actually might be a nice use case, as one can easily distinguish the file from a "plain" SQLite database in this way.
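For completeness, the same two header fields that the magic rules test can also be read directly; a small Python sketch (offsets per the SQLite file-format documentation, function name illustrative):

import struct

def sqlite_ids(path):
    # The first 100 bytes of an SQLite file are its header; user_version
    # lives at offset 60 and application_id at offset 68, both big-endian.
    with open(path, "rb") as f:
        header = f.read(100)
    if not header.startswith(b"SQLite format 3\x00"):
        raise ValueError("not an SQLite database")
    user_version, = struct.unpack(">i", header[60:64])
    application_id, = struct.unpack(">i", header[68:72])
    return application_id, user_version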
Is it meaningful and common to actually run this pragma when using SQLite as the file format for a program?
It is a rarely used feature and would only be meaningful to a specific application, which would handle it in its own way. It's basically a variable that can be set or changed to signal something to that specific application.
Say, for example, the application wanted to handle both encrypted and non-encrypted data: the value could be set to indicate an encrypted database (or one of a set of different encryption methods) or a non-encrypted one, so you would have a single, easily and efficiently obtained value to determine which type you are dealing with.
Of course other methods could be used; e.g. if user_version isn't otherwise utilised, that field could serve the same purpose.
If so, how should that number be chosen? Simply a random number? Or somehow derived from the program's name, homepage or whatever?
The number, if used, should probably be chosen deliberately, since it needs to be interpreted as a meaningful value; a random value would probably not be of any use.
In short, it's a 4-byte field in the header that can be used as desired, similar to the user_version field. However, it has the advantage that it is less likely to already be in use; for example, if you are using the Android SDK/APIs, user_version is utilised for schema versioning and should not be used (or only with great care).
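A quick sketch of setting and reading the pragma from Python's built-in sqlite3 module, using the example value 123 from the question (the file name is illustrative):

import sqlite3

con = sqlite3.connect("myformat.db")        # illustrative file name
con.execute("PRAGMA application_id = 123")  # mark the file as "our" format
con.execute("PRAGMA user_version = 1")      # e.g. a schema version for the format
con.commit()

app_id = con.execute("PRAGMA application_id").fetchone()[0]
version = con.execute("PRAGMA user_version").fetchone()[0]
print(app_id, version)                      # 123 1
con.close()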

Why not organise all functions in a package in one file? [closed]

I found the following line in Hadley Wickham's book about R packages:
While you’re free to arrange functions into files as you wish, the two extremes are bad: don’t put all functions into one file and don’t put each function into its own separate file (See here).
But why? These would seem to be the two options that make the most sense to me. Especially keeping just one file seems appealing to me.
From a user's perspective:
When I want to know how something works in a package I often go to the respective GitHub page and look at the code. Having functions organised in different files makes this a lot harder and I regularly end up cloning a repository just so I can search the content of all files (e.g. via grep -rnw '/path/to/somewhere/' -e 'function <-').
From a developer's perspective:
I also don't really see the upside for developing a package. Browsing through a big file doesn't seem much harder than browsing through a small one if you use the outline window in RStudio. I know about the Ctrl + . shortcut, but it still means I have to open a new file when working on a different function, while Ctrl + . could basically do the same job if I keep just one file.
Wouldn't it make more sense to keep all functions in one single file? I know different people like to organise their projects in different ways and that is fine. I'm not asking for opinions here. Rather, I would like to know if there are any real disadvantages to keeping everything in one file.
There is obviously not one single answer to that. One obvious question is the size of your package: if it contains only those two neat functions and a plot command, why bother organizing it in any elaborate manner? Just hack it into one file and you are good to go.
If your project is large, say you are trying to overthrow the R graphics system and write lots and lots of functions (geoms and stats and ...), a lot of small files might be a better idea. But then, having more files than there is room for tabs in RStudio might not be such a good idea either.
Another important question is whether you intend to develop alone or with hundreds of people on GitHub. You might prefer to review a change to a small file as opposed to "the one big file" that was so easy to search back when you were alone.
The people who invented Java originally had a "one file per class" convention going, and C# seems to have something similar. That does not mean that those people are less clever than Hadley; it just means that your mileage may vary and you have the right to disagree with Hadley's opinions.
Why not put all files on your computer in the root directory?
Ultimately, if you push the file-tree idea to its limit, you are back to treating everything as single entities.
Putting things that conceptually belong together into the same file is the logical continuation of putting things into directories/libraries.
If you write a library and define a function as well as some convenience wrappers around it, it makes sense to put them in one file.
Navigating the file tree is easier because you have fewer files, and navigating within each file is easier because you don't have all functions in the same file.

What is a functional structure for multiple files using the same R scripts? [closed]

I am trying to get some good practices happening and have recently moved to using git for version control.
One set of scripts I use is for producing measurement uncertainty estimates from laboratory data. The same scripts are used on different data files and produce a set of files and graphs based on that data. The core scripts change infrequently.
Should I create a branch for each new data set? Is this efficient?
Should I use one set of scripts and just manually relocate output files to a separate location after each use?
There are a few different aspects here that should be touched on. I will try provide my opinions/recommendations for each.
The core scripts change infrequently.
This sounds to me like you should make an R package of your own. If you have some core functions that aren't supposed to change, it would probably be best to package them together. Ideally, you design the functions so that the code behind each doesn't need to be modified and you just change an argument (or even begin exploring R S3 or S4 classes).
For the custom scripting, you could provide a vignette for yourself demonstrating how you approach a data set. If you want to save each final script, I would store them in the inst/examples directory so you can call them again if you need to re-run an analysis and don't want to keep them elsewhere.
Should I create a branch for each new data set? Is this efficient?
No, I generally would not recommend that anyone put their data on GitHub. It is also not 'efficient' to create a new branch for a new data set. The idea behind creating another branch is to add a new aspect or component to an existing project; simply adding a dataset and modifying some scripts is, IMHO, a poor use of a branch.
What you should do with your data depends on its characteristics. Is the data large? Would it benefit from an RDBMS? You at least want to have it backed up on a local laboratory hard drive. Secondly, if you are academically minded, once you finish analyzing the data you should look into an online repository so that others can also analyze it. If the datasets are small and not sensitive, you could also put them in your package in the data directory.
Should I use one set of scripts and just manually relocate output files to a separate location after each use?
No. For your core functions/scripts, I would recommend creating a wrapper for this part and providing an argument to specify the output path.
I hope these comments help you.

Should I store website images in SQL Server or on the C: drive [closed]

I am programming a website in ASP.NET using Visual Web Developer, and I am going to have a lot of product pictures to display on the webpage. Should I store all my images in SQL Server and pull each picture from there, or should I store all of the images in a "Picture" folder created inside my website root folder? Is there a big difference? The images would be linked to other tables in the database using the Order_Number; this is not a problem.
Too long for a comment.
Images in the database -- I know too many people who regret that decision. Just don't do it, except perhaps for light-duty usage.
Don't store the path of the image in the database. If you ever have to split images into multiple locations you will have a big mess. Ideally you store a unique (string) identifier hash; then you compute, via a shared function, the correct location to pull the image from based on the hashed name.
For version 1.0 you could just dump everything into a single directory (so your hash-to-directory function is very simple). Ideally you want the generated name to be "randomly distributed", i.e., as likely to match 'zq%' as 'an%'. You also ideally want it to be short. Unique is a requirement. For example, you could use an identity field -- guaranteed unique but not randomly distributed. If you have large numbers of images, you will want to store them in multiple directories, so you don't essentially lock up your machine if you ever look at the directory with Windows Explorer.
A good practice is to combine methods, e.g. make a hashing function that yields 4 characters (perhaps by keeping only 4 characters of output from TSQL HASHBYTES or CHECKSUM hashing the identity value) and make the short hash the directory name. Now use the identity value as the filename and you have a simple and scalable design, since you can tweak the algorithm down the road if needed.
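A rough Python sketch of that combined scheme (short hash as directory, identity as filename); the MD5 prefix here just stands in for the HASHBYTES/CHECKSUM output, and the root path and extension are illustrative:

import hashlib
import os

def image_path(root, identity):
    # 4-character bucket derived from the identity; evenly distributed,
    # so no single directory grows without bound.
    digest = hashlib.md5(str(identity).encode()).hexdigest()
    directory = os.path.join(root, digest[:4])
    os.makedirs(directory, exist_ok=True)
    return os.path.join(directory, f"{identity}.jpg")

# image_path("/images", 42) -> "/images/<4 hex chars>/42.jpg"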
Store them on the hard drive; this will allow IIS to cache them and serve them much more efficiently. If requesting an image requires invoking a controller, IIS cannot cache the image as a static file.

How to Track Software Run/Output [closed]

Background:
I'm beginning to forge my way into more of a software engineering role at work. I have no computer science background and have taught myself some high-level languages (mostly R, Python, and Ruby). Before I start inventing my own solutions to problems, I want to know the best practices for keeping track of the last runs of a program.
Specifically, I'm writing a program that will clean data in a database (find missing data, imputation, etc.). It needs to know when it was last run so it does not retrieve too much data.
Question:
How do I best keep track of previous code runs?
I'm writing production-level code. These scripts and functions will be run automatically (maybe on a nightly or weekly basis), and the results will be output to a file. Each of these programs will depend on when it was last run. I can see this being dealt with in a few ways:
The output file name (or a diagnostic file name) contains the last date/time it was run, e.g. output_file_2014_07_11_01_00_04.txt. From this name, the program can determine when it was last run.
Keep a separate info file to which the program appends each run time, building up a list of run times.
These solutions seem prone to problems. Is there a more secure and efficient method for recording/reading the last run date?
I like the idea of putting it in the filename. That binds the run time to the actual data. If you keep the run time in a separate file, the data can become separated from the meta-data (i.e. the run time).
This works in a trusted environment. If accidental or malicious vandalism is a concern, e.g. someone changing a filename, then a lot of other things become problematic too.
A third alternative is to create a "header" or comment section in the data file itself and put the run time in the header. When you read the data, your reader can either skip the header and go straight to the data, or examine the header and extract the meta-data (i.e. the run time or other attributes).
This approach has the advantage that (a) the meta-data and the data are kept together and (b) you can include more meta-data than just the run time. It has the disadvantage that any program reading the data must first skip the header. For an example of this approach, see the Attribute-Relation File Format (ARFF) at http://www.cs.waikato.ac.nz/ml/weka/arff.html
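A small Python sketch of the header idea, assuming a simple "# run_time:" comment convention (the convention and function names are illustrative):

import csv
import datetime

def write_output(path, rows):
    # First line is a commented header carrying the run time; data follows.
    with open(path, "w", newline="") as f:
        f.write(f"# run_time: {datetime.datetime.now().isoformat()}\n")
        csv.writer(f).writerows(rows)

def last_run_time(path):
    # Readers can skip the header, or parse it to recover the meta-data.
    with open(path) as f:
        first = f.readline()
    if first.startswith("# run_time:"):
        return datetime.datetime.fromisoformat(first.split(":", 1)[1].strip())
    return None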
