I have file1.txt with 100 entries. I need to search for the contents of file1.txt in file2.bz2, which is a large bzip2 file.
bzgrep -f file1.txt file2.bz2 takes a long time.
There is not much you can do: the file is compressed, and the only way to search it is to decompress it.
One possible workaround is to keep an uncompressed version of the file.
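If disk space allows, a minimal sketch of that workaround, assuming the 100 entries in file1.txt are fixed strings rather than regexes:

bunzip2 -k file2.bz2           # -k keeps file2.bz2 and writes the uncompressed file2
grep -F -f file1.txt file2     # -F treats the 100 entries as fixed strings, which is usually much faster than regex matching

The decompression cost is paid once, and later searches only pay for grep.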
You can do a lot, but it's a truly excessive amount of work.
bzip2 files are composed of independently compressed blocks. You can cut the file up by block, full-text index each block, and save the indexes. If you have some idea of the keywords you can filter the indexes; otherwise you get a full index of all the text, which tends to be something like 10-100 times the size of the original uncompressed document.
If the words to be indexed occur only in certain places, or you can limit the number of words to be indexed, AND searches are much more frequent than new documents, you can make this work.
Idea blatantly stolen from here: https://www.thanassis.space/buildWikipediaOffline.html
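A rough sketch of the chunking part alone, skipping the full-text index and simply searching the blocks in parallel. This assumes GNU xargs and bzip2recover are available, and a match that spans a block boundary can be missed:

mkdir blocks && cp file2.bz2 blocks/ && cd blocks
bzip2recover file2.bz2                                  # writes one rec*.bz2 file per compressed block
ls rec*.bz2 | xargs -P "$(nproc)" -n 1 bzgrep -F -f ../file1.txt

This does not avoid decompressing the data, but it spreads the work across all cores.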
I believe similar questions have been answered on SO before. I can't find any that seem to match my particular situation, though I am sure many others have faced this scenario.
In an FTP session on Red Hat I have produced a list of file names that reside on the server currently. The list contains the file names and only the file names. Call this file1. Perhaps it contains something like:
513569430_EDIP000754535900_MFC_20190618032554.txt
blah.txt
duh.txt
Then I have downloaded the files and produced a list of successfully downloaded files. As well, this list contains the file names and only the file names. Call this file2. Perhaps it contains something like:
loadFile.dat
513569430_EDIP000754535900_MFC_20190618032554.txt
localoutfile.log
Now I want to loop through the names in file1 and check if they exist in file2. If a name exists in both, I will go back to the FTP server and delete that file from the server.
I have looked at while loops, comm, and the test command, but I just can't seem to crack it. I expect there are many ways to achieve this task. Any suggestions or working references?
My trouble is really not the looping itself but rather comparing the contents of the two files.
comm -1 -2 file1 file2 returns just the lines that are identical in both files. This can be used as the basis of a batch command file for sftp.
From the comments to the question, it seems that line-endings differ for the two files. This can be fixed in various ways, simplest probably being with tr. comm understands - as a filename to mean "read from stdin".
For example:
tr -d '\r' < file1 | comm -1 -2 - file2
If file1 or file2 are not sorted, this must be corrected for comm to operate properly. With bash, this could be:
comm -1 -2 <( sort file1 | tr -d '\r' ) <( sort file2 )
With shells that don't understand the <( ... ) syntax, temporary files may be used explicitly.
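For the delete-from-server step in the question, the matching names can be turned straight into an sftp batch file. A sketch, assuming bash, where the user, host, and any remote path prefix are placeholders:

comm -12 <(tr -d '\r' < file1 | sort) <(sort file2) | sed 's/^/rm /' > delete.batch
sftp -b delete.batch user@ftp.example.com    # each "rm <filename>" runs on the server

If the files live in a remote subdirectory, prepend it in the sed expression; names containing spaces would also need quoting in the batch file.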
Thank you for the advice, @jhnc.
After giving this some deeper consideration and conversation, I realized that I don't even need to do this comparison. After I download the files I just need to produce the list of successful downloads. Then I can go and delete from server based on list of successful downloads.
However, I am still interested to know how to compare files when one uses '\r\n' line endings and the other uses '\n'.
I'm using SQLite, and I need to load hundreds of CSV files into one table. I didn't manage to find a way to do this on the web. Is it possible?
Please note that in the beginning I used Oracle, but since Oracle has a limit of 1000 columns per table, and my CSV files have more than 1500 columns each, I had to find another solution. I want to try SQLite, since I can install it quickly and easily.
These CSV files were supplied with that many columns and I can't change or split them (never mind why).
Please advise.
I ran into a similar problem, and the comments to your question actually gave me the answer that finally worked for me.
Step 1: merge the multiple CSVs into a single file. Exclude the header from all but one of them, and keep that one header at the beginning.
Step 2: Load the single merged csv into SQLite.
For step 1 I used:
$ head -1 one.csv > all_combined.csv
$ tail -n +2 -q *.csv >> all_combined.csv
The first command writes only the first line of one CSV file (you can pick whichever file you like); the second writes every file starting from line 2, therefore excluding the headers. The -q option makes sure that tail never prints the file name as a header.
Make sure to put all_combined.csv in a separate folder; otherwise the *.csv glob will match it and its own contents will be appended back into itself!
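If you would rather not move the sources, one simple variant (just a sketch, reusing the one.csv name from above) is to write the merged file one level up so the glob cannot match it:

head -n 1 one.csv > ../all_combined.csv
tail -n +2 -q *.csv >> ../all_combined.csv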
To load into SQLite (Step 2) the answer given by Hot Licks worked for me:
sqlite> .mode csv
sqlite> .import all_combined.csv my_new_table
This assumes that my_new_table hasn't been created yet. Alternatively, you can create it beforehand and then load, but in that case exclude the header in Step 1.
http://www.sqlite.org/cli.html --
Use the ".import" command to import CSV (comma separated value) data into an SQLite table. The ".import" command takes two arguments which are the name of the disk file from which CSV data is to be read and the name of the SQLite table into which the CSV data is to be inserted.
Note that it is important to set the "mode" to "csv" before running the ".import" command. This is necessary to prevent the command-line shell from trying to interpret the input file text as some other format.
sqlite> .mode csv
sqlite> .import C:/work/somedata.csv tab1
There are two cases to consider: (1) Table "tab1" does not previously exist and (2) table "tab1" does already exist.
In the first case, when the table does not previously exist, the table is automatically created and the content of the first row of the input CSV file is used to determine the name of all the columns in the table. In other words, if the table does not previously exist, the first row of the CSV file is interpreted to be column names and the actual data starts on the second row of the CSV file.
For the second case, when the table already exists, every row of the CSV file, including the first row, is assumed to be actual content. If the CSV file contains an initial row of column labels, that row will be read as data and inserted into the table. To avoid this, make sure that table does not previously exist.
Note that you need to make sure that the files DO NOT have an initial line defining the field names. And, for "hundreds" of files you will probably want to prepare a script rather than typing in each file individually.
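Here is one way such a script might look. It is only a sketch: "mydata.db" and "tab1" are placeholder names, and it assumes every CSV starts with the same header line of column names (which it skips during import).

#!/bin/sh
set -e
set -- *.csv
head -n 1 "$1" > header.tmp
# Importing a header-only file creates tab1 with those column names and no data rows.
sqlite3 mydata.db ".mode csv" ".import header.tmp tab1"
# Append each file's data rows, skipping its header line.
for f in *.csv; do
    tail -n +2 "$f" > body.tmp
    sqlite3 mydata.db ".mode csv" ".import body.tmp tab1"
done
rm -f header.tmp body.tmp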
I didn't find a nicer way to solve this so I used find along with xargs to avoid creating a huge intermediate .csv file:
find . -type f -name '*.csv' | xargs -I% sqlite3 database.db ".mode csv" ".import % new_table" ".exit"
find prints out the file names, and the -I% parameter to xargs causes the command after it to be run once for each line, with % replaced by the name of a CSV file.
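An equivalent explicit loop (just a variant, easier to extend with things like per-file logging; it assumes file names without spaces):

find . -type f -name '*.csv' | while read -r f; do
    echo "importing $f"
    sqlite3 database.db ".mode csv" ".import $f new_table"
done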
You can use DB Browser for SQLite to do this pretty easily.
File > Import > Table from CSV file... and then select all the files to open them together into a single table.
I just tested this out with a dozen CSV files and got a single 1 GB table from them without any work. As long as they have the same schema, DB Browser is able to put them together. You'll want to keep the 'Column Names in first line' option checked.
I'd like to split a file and grep each piece without writing them to individual files.
I've attempted a couple variations of split and grep and no such luck; any suggestions?
Something along the lines of:
split -b SIZE filename | grep "string"
I've attempted grep/fgrep to find the string but my shell complains that the files are too large. See: use fgrep instead
There is no point in splitting the file if you plan to [linearly] search each of the pieces anyway (assuming that's the only thing you are doing with it). Consider running grep on the entire file.
If, however, you plan to make use of the split pieces later on, then the typical way would be:
Create a temporary directory and step into it
Run split/csplit on the original file
Use a for loop over the written fragments to do your processing, for example as sketched below.
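A rough sketch of those three steps, where the chunk size, file name, and search string are placeholders:

mkdir tmpdir && cd tmpdir
split -b 100M ../filename piece_       # writes piece_aa, piece_ab, ...
for f in piece_*; do
    grep "string" "$f"                 # or whatever per-fragment processing you need
done
cd .. && rm -r tmpdir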
I have a lot of files, say 1000 files, each about 4 MB, so 4 GB in total. I would like to sort them using Unix sort; here is my command:
sort -t ',' -k 1,1 -k 5,7 -k 22,22 -k 2,2r INPUT_UNSORTED_${current_time}.DAT -o INPUT_SORTED_${current_time}.DAT
where INPUT_UNSORTED is a big file created by concatenating the 1000 files, so that is another 4 GB, and INPUT_SORTED is another 4 GB too.
I also discovered that Unix sort uses a temp folder while sorting, and the temp files may reach 4 GB as well.
How can I reduce disk usage without losing performance?
Is your goal to get a single big sorted output file? Take a look at sort's --merge option. You can sort the small input files individually, and then merge them all into the large sorted output. If you delete each unsorted input file immediately after producing its sorted counterpart, you won't use more than 4MB of space on intermediate results.
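A sketch of that approach, with the key options copied from the question; part_*.DAT is a placeholder for the 1000 small input files and INPUT_SORTED.DAT for the final output:

keys='-t, -k1,1 -k5,7 -k22,22 -k2,2r'
for f in part_*.DAT; do
    sort $keys "$f" -o "$f.sorted" && rm "$f"   # sort each small file, then drop the unsorted copy
done
sort $keys --merge part_*.DAT.sorted -o INPUT_SORTED.DAT
rm part_*.DAT.sorted

Because each unsorted file is deleted as soon as its sorted counterpart exists, the extra space used at any moment stays around one small file plus the growing output.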
I have two files ...
file1:
002009092312291100098420090922111
010555101070002956200453T+00001190.81+00001295.920010.87P
010555101070002956200449J+00003128.85+00003693.90+00003128
010555101070002956200176H+00000281.14+00000300.32+00000281
file2:
002009092410521000098420090709111
010560458520002547500432M+00001822.88+00001592.96+00001822
010560458520002547500432D+00000106.68+00000114.77+00000106
In both files, in every record starting with 01, the string from the 3rd to the 25th character (i.e., up to the alphabetic character) is the key.
Based on this key, I have to compare the two files: if a record in file2 matches one in file1, I have to replace that record in file1; otherwise I have to append it.
Well, this is a fairly unspecific (and basic) programming question. We'll be better able to help you if you explain exactly what you did and where you got stuck.
Also, it looks a bit like homework, and people are wary of giving too much help on homework problems, as it might look like cheating.
To get you started:
I'd recommend Perl to solve this, but awk or another scripting language will also do. I'd recommend against sh/bash, as they are weak on text manipulation; also combining grep et al will become rather cumbersome.
First write a Perl program that filters records starting with 01. Then extract the key and put it into a hash (a Perl structure). Then output a new, combined file as required.
Using awk, extract characters 3-25 of the records starting with 01, with something like:
awk '/^01/' file_name | cut -c 3-25
Do this for both files to collect the matching lines into two buffers, then compare the buffers with a for loop in a shell script. Whenever a line in the second buffer matches one in the first, grep for that line in the first file and replace it with the line from the second. You will still need to work out the details of the logic.
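To make the idea concrete, here is a hedged awk sketch of the whole replace-or-append step in one pass per file. The output name merged.txt is a placeholder; the key is taken as characters 3-25 of records starting with 01, as described in the question:

awk '
    FNR == NR {                              # first file read is file2
        if (/^01/) repl[substr($0, 3, 23)] = $0
        next
    }
    /^01/ {                                  # now reading file1
        k = substr($0, 3, 23)
        if (k in repl) { print repl[k]; seen[k] = 1; next }
    }
    { print }                                # headers and unmatched records pass through
    END {                                    # append file2 records whose key never matched
        for (k in repl) if (!(k in seen)) print repl[k]
    }
' file2 file1 > merged.txt

It stores the 01-records of file2 in a hash keyed on the 23-character key, prints file1 with matching records replaced, and appends the leftover file2 records at the end.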