I have a huge log file. I know I can tar it at the end, but I want the file to get zipped after every 10K lines, while also ensuring that no data is lost.
The final goal is to stop the file from growing indefinitely and keep it under a specific size limit.
Just some sample code:
sh script.sh > log1.log &
Now, I want to keep zipping log1.log so that it never crosses a specific size limit.
Regards,
Abhay
Let the file be file.txt; then you can do:
x=$(wc -l < file.txt)
if [[ $x -gt 10000 ]]
then
    sed '1,10000d' file.txt > file2.txt
fi
After that, just zip file2.txt and remove it.
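If the intent is to archive the oldest 10,000 lines and keep the rest in the live file, a minimal sketch of the whole cycle could look like this (file names are placeholders; the copy-then-rewrite step has the same small race window as any approach that trims a file another process is still writing to):
#!/bin/sh
# Sketch: archive the oldest 10,000 lines of file.txt and keep the rest in place.
file=file.txt
if [ "$(wc -l < "$file")" -gt 10000 ]; then
    head -n 10000 "$file" > chunk.txt        # oldest 10,000 lines
    tail -n +10001 "$file" > rest.txt        # whatever came after them
    cat rest.txt > "$file" && rm rest.txt    # rewrite the live file in place
    gzip -f chunk.txt                        # -> chunk.txt.gz
    mv chunk.txt.gz "archive-$(date +%Y%m%d-%H%M%S).gz"
fi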
Consider using the split command. It can split by lines, bytes, pattern, etc.
split -l 10000 log1.log `date "+%Y%m%d-%H%M%S-"`
This will split the file named "log1.log" into one or more files. Each file will contain no more than 10,000 lines. These files will be named something like 20180327-085711-aa, 20180327-085711-ab, etc. You can use split's -a argument for really large log files so that it will use more than two characters in the file suffix.
The tricky part is that your shell script is still writing to the file. After the contents are split, the log must be truncated. Note that there is a small time slice between splitting the file and truncating it, so some logging data might be lost.
This example splits into 50,000 line files:
$ wc log.text
528193 1237600 10371201 log.text
$ split -l 50000 log.text `date "+%Y%m%d-%H%M%S-"` && cat /dev/null > log.text
$ ls
20180327-090530-aa 20180327-090530-ae 20180327-090530-ai
20180327-090530-ab 20180327-090530-af 20180327-090530-aj
20180327-090530-ac 20180327-090530-ag 20180327-090530-ak
20180327-090530-ad 20180327-090530-ah log.text
$ wc 20180327-090530-aa
50000 117220 982777 20180327-090530-aa
If you only want to truncate the file if it reaches a certain size (number of lines), wrap this split command in a shell script that gets run periodically (such as through cron). Here's an example of checking file size:
if (( `wc -l < log.text` > 1000000 ))
then
echo time to truncate
fi
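Putting the check and the split together, a cron-driven wrapper might look roughly like this (a sketch only; the log name, threshold, and chunk size are placeholders):
#!/bin/bash
# Sketch: rotate log.text once it exceeds 1,000,000 lines.
# Run it periodically from cron, e.g.:  */10 * * * * /path/to/rotate.sh
log=log.text
if (( $(wc -l < "$log") > 1000000 )); then
    stamp=$(date "+%Y%m%d-%H%M%S-")
    split -l 50000 "$log" "$stamp"    # chunks named <stamp>aa, <stamp>ab, ...
    cat /dev/null > "$log"            # truncate in place (tiny window where lines can be lost)
    gzip -f "$stamp"*                 # compress the chunks
fi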
I'm trying to extract some files from a huge tar file using a list of patterns with wildcards. I'm using a loop to read the list, but moving from one element of the list to the next takes too long, I'm guessing because tar tries to match each element against the whole archive. I want the loop to continue with the next element after two matches for the current one.
while read line;do
tar --wildcards -xzvf file.tar.gz "$line"
done <$file
And one line looks like this
dataset/0113947.*
I went for an aggressive approach and kill the tar process as soon as it finds two files. Here is my solution:
file=list.txt
while read line;do
tar --wildcards --checkpoint=10000 --checkpoint-action=exec='sh stop.sh dummy.txt 1' -xzvf ny_file.tar.gz "$line" > dummy.txt
done <$file
where stop.sh checks whether dummy.txt has more than one line (i.e., at least two matches) and kills the process:
# stop.sh: kill tar once the file passed as $1 has more than one line
n=$(wc -l < "$1")
if [ "$n" -gt 1 ]; then
    kill $(ps aux | grep "[t]ar --wildcards" | cut -d " " -f 4)
fi
I had to use cut to recover the process ID because the single quotes needed for awk were giving me trouble.
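If pkill is available on your system, a less fragile stop.sh might be (a sketch; -f matches against the full command line rather than just the process name):
# Sketch: kill the running tar once the file passed as $1 has more than one line.
if [ "$(wc -l < "$1")" -gt 1 ]; then
    pkill -f 'tar --wildcards'
fi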
I'm using terminal on OS 10.X. I have some data files of the format:
mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
mbh5.0_mrg4.54545454545_period0.00077271543854.params.dat
mbh5.0_mrg4.59090909091_period-0.000355232058085.params.dat
mbh5.0_mrg4.59090909091_period-0.000402015664015.params.dat
I know that there will be some files with similar numbers after mbh and mrg, but I won't know ahead of time what the numbers will be or how many similarly numbered ones there will be. My goal is to cat all the data from all the files with similar numbers after mbh and mrg into one data file. So from the above I would want to do something like...
cat mbh5.0_mrg4.54545454545*dat > mbh5.0_mrg4.54545454545.dat
cat mbh5.0_mrg4.5909090909*dat > mbh5.0_mrg4.5909090909.dat
I want to automate this process because there will be many such files.
What would be the best way to do this? I've been looking into sed, but I don't have a solution yet.
for file in *.params.dat; do
prefix=${file%_*}
cat "$file" >> "$prefix.dat"
done
This part ${file%_*} removes the last underscore and the text following it from the end of $file and saves the result in the prefix variable. (Ref: http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion)
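For one of the file names from the question, the expansion works out like this:
file=mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
prefix=${file%_*}
echo "$prefix"        # -> mbh5.0_mrg4.54545454545
echo "$prefix.dat"    # -> mbh5.0_mrg4.54545454545.dat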
It's not 100% clear to me what you're trying to achieve here, but if you want to aggregate files into a file named after the number that follows "mbh5.0_mrg4.", you can do the following.
ls -l mbh5.0_mrg4* | awk '{print "cat " $9 " >> mbh5.0_mrg4." substr($9,12,11) ".dat" }' | /bin/bash
The ls -l lists the files, and the awk takes the 9th column (the file name) from the result of the ls; note the >> so that several files with the same prefix accumulate into one output file. With some string concatenation, the result is passed to /bin/bash to be executed.
This is a Linux bash approach, so it assumes you have /bin/bash; I'm not 100% familiar with OS X. It also assumes that the number you're grouping on is always in the same position in the filename. I think you can change /bin/bash to almost any shell you have installed.
Given two folders, /folder1 and /folder2, each containing some files and subfolders.
I used the following command to compare the files, including subfolders:
diff -buf /folder1 /folder2
which found no difference in terms of folder and file structure.
However, I found that there are some permission differences between the files in these two folders. Is there a simple way/command to compare the permissions of each file under these two folders (including subfolders) on Unix?
thanks,
If you have the tree command installed, it can do the job very simply using a similar procedure to the one that John C suggested:
cd a
tree -dfpiug > ../a.list
cd ../b
tree -dfpiug > ../b.list
cd ..
diff a.list b.list
Or, you can just do this on one line:
diff <(cd a; tree -dfpiug) <(cd b; tree -dfpiug)
The options given to tree are as follows:
-d only scans directories (omit to compare files as well)
-f displays the full path
-p displays permissions (e.g., [drwxrwsr-x])
-i removes tree's normal hierarchical indent
-u displays the owner's username
-g displays the group name
One way to compare permissions on your two directories is to capture the output of ls -al to a file for each directory and diff those.
Say you have two directories called a and b.
cd a
ls -alrt > ../a.list
cd ../b
ls -alrt > ../b.list
cd ..
diff a.list b.list
If you find that this gives you too much noise due to file sizes and datestamps you can use awk to filter out some of the columns returned by ls e.g.:
ls -al | awk '{printf "%s %s %s %s %s %s\n", $1,$2,$3,$4,$5,$9}'
Or if you are lucky you might be able to remove the timestamp using:
ls -lh --time-style=+
Either way, just capture the results to two files as described above and use diff or sdiff to compare the results.
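For example, with a shell that supports process substitution (bash, zsh) you can skip the temporary files and compare just the permission, owner, group, and name columns (a sketch assuming the usual ls -al column layout):
diff <(cd a && ls -al | awk '{printf "%s %s %s %s\n", $1,$3,$4,$9}') \
     <(cd b && ls -al | awk '{printf "%s %s %s %s\n", $1,$3,$4,$9}')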
find /dirx/ -ls | awk '{ print $5" "$5" "$11 }'    # owner twice, then the path
find /dirx/ -ls | awk '{ print $6" "$6" "$11 }'    # group twice, then the path
find /dirx/ -ls | awk '{ print $5" "$6" "$11 }'    # owner and group, then the path
Then you can redirect the output to a file for diff, or just investigate it by piping to less (or more).
You can also pipe through grep, or "ungrep" (grep -v), to narrow down the results.
diff is not very useful if the directory contents are not the same.
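To compare two whole trees this way, one option is to run find from inside each folder so the relative paths line up, and diff the sorted output (a sketch; $3 is the permission column in find -ls output):
diff <(cd /folder1 && find . -ls | awk '{ print $3, $5, $6, $11 }' | sort -k4) \
     <(cd /folder2 && find . -ls | awk '{ print $3, $5, $6, $11 }' | sort -k4)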
Hi, I am new to UNIX and I have to get the line count of incoming CSV files. I have used the following command to get the count:
wc -l filename.csv
Consider files coming in with one record: I am getting some files with * at the start, and for those files the same command gives a count of 0. Does * mean anything here? Also, if I get a file with Ctrl-M (CR) line endings instead of NL, how do I get the line count of that file? Please give me a command that solves the issue.
The following command gives you the count:
cat FILE_NAME | wc -l
All of the answers are wrong. CSV files can contain line breaks inside quoted fields, which should still be counted as part of the same record. If you have either Python or PHP on your machine, you could do something like this:
Python
# From stdin
cat *.csv | python -c "import csv; import sys; print(sum(1 for i in csv.reader(sys.stdin)))"
# From a file name
python -c "import csv; print(sum(1 for i in csv.reader(open('csv.csv'))))"
PHP
# From stdin
cat *.csv | php -r 'for($i=0; fgetcsv(STDIN); $i++);echo "$i\n";'
# From a file name
php -r 'for($i=0, $fp=fopen("csv.csv", "r"); fgetcsv($fp); $i++);echo "$i\n";'
I have also created a script to simulate the output of wc -l: https://github.com/dhulke/CSVCount
In case you have multiple .csv files in the same folder, use
cat *.csv | wc -l
to get the total number of lines in all CSV files in the current directory.
Note that -c counts bytes and -m counts characters (identical as long as the input is plain ASCII). You can also use wc to count the number of files in a directory, e.g. ls | wc -l (with ls -l, the leading "total" line would be counted as well).
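For instance, in a UTF-8 locale the two options diverge as soon as a multi-byte character shows up:
$ printf 'héllo\n' | wc -c    # bytes: é takes 2 bytes in UTF-8
7
$ printf 'héllo\n' | wc -m    # characters
6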
wc -l mytextfile
Or to only output the number of lines:
wc -l < mytextfile
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. With no FILE, or when FILE is -,
read standard input.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the length of the longest line
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
You can also use xsv for that. It also supports many other subcommands that are useful for csv files.
xsv count file.csv
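A couple of the other subcommands, for illustration (assuming a reasonably recent xsv build, and a hypothetical file.csv with name and email columns):
xsv headers file.csv              # list the column names
xsv select name,email file.csv    # pull out just those columns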
wc -l file_name.csv | awk '{print $1}'
I need to extract a set number of lines from a file given the start line number and end line number.
How can I quickly do this under Unix? (It's actually Solaris, so the GNU flavour isn't available.)
Thx
To print lines 6-10:
sed -n '6,10p' file
If the file is huge, and the end line number is small compared to the total number of lines, you can make it more efficient by telling sed to quit after the last wanted line:
sed -n '6,10p;10q' file
(The p must come before the q: with -n, quitting on line 10 before printing it would drop that line.)
From testing a file with a fairly large number of lines:
$ wc -l test.txt
368048 test.txt
$ du -k test.txt
24640 test.txt
$ time sed -n '6,10p;10q' test.txt >/dev/null
real 0m0.005s
user 0m0.001s
sys 0m0.003s
$ time sed -n '6,10p' test.txt >/dev/null
real 0m0.123s
user 0m0.092s
sys 0m0.030s
Or
head -n "$last" file | tail -n +"$first"
I wrote a Haskell program called splitter that does exactly this: have a read through my release blog post.
You can use the program as follows:
$ cat somefile | splitter 4,6-10,50-
That will get line four, lines six to ten, and lines fifty onwards. And that is all there is to it. You will need Haskell to install it. Just:
$ cabal install splitter
And you are done. I hope that you find this program useful.
You can do it with nawk as well:
#!/bin/sh
start=10
end=20
# print lines $start through $end, exiting once past the end line
nawk -v s="$start" -v e="$end" 'NR > e { exit } NR >= s' file