Finding the total number of lines in an HDFS distributed file using the command line - unix

I am working on a cluster where a dataset is kept in HDFS in a distributed manner. Here is what I have:
[hmi#bdadev-5 ~]$ hadoop fs -ls /bdatest/clm/data/
Found 1840 items
-rw-r--r-- 3 bda supergroup 0 2015-08-11 00:32 /bdatest/clm/data/_SUCCESS
-rw-r--r-- 3 bda supergroup 34404390 2015-08-11 00:32 /bdatest/clm/data/part-00000
-rw-r--r-- 3 bda supergroup 34404062 2015-08-11 00:32 /bdatest/clm/data/part-00001
-rw-r--r-- 3 bda supergroup 34404259 2015-08-11 00:32 /bdatest/clm/data/part-00002
....
....
The data is of the form:
[hmi#bdadev-5 ~]$ hadoop fs -cat /bdatest/clm/data/part-00000|head
V|485715986|1|8ca217a3d75d8236|Y|Y|Y|Y/1X||Trimode|SAMSUNG|1x/Trimode|High|Phone|N|Y|Y|Y|N|Basic|Basic|Basic|Basic|N|N|N|N|Y|N|Basic-Communicator|Y|Basic|N|Y|1X|Basic|1X|||SAM|Other|SCH-A870|SCH-A870|N|N|M2MC|
What I want to do is count the total number of lines in the original data file data. My understanding is that the distributed chunks like part-00000, part-00001, etc. have overlaps, so just counting the number of lines in the part-xxxx files and summing them won't work. Also, the original dataset data is ~70 GB in size. How can I efficiently find out the total number of lines?

More efficiently, you can use Spark to count the number of lines. The following code snippet counts the lines:
text_file = sc.textFile("hdfs://...")  # sc is the SparkContext, e.g. in the pyspark shell
count = text_file.count()
print(count)
This prints the number of lines.
Note: the data in different part files does not overlap.
Using hdfs dfs -cat /bdatest/clm/data/part-* | wc -l will also give you the answer, but it dumps all the data to the local machine and takes longer.
The best solution is to use MapReduce or Spark. MapReduce will take longer to develop and execute. If Spark is installed, it is the best choice.

If you just need to find the number of lines in the data, you can use the following command:
hdfs dfs -cat /bdatest/clm/data/part-* | wc -l
You can also write a simple MapReduce program with an identity mapper that emits its input as output. Then check the job counters and look at the mapper's input records; that will be the number of lines in your data.
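For example, a rough sketch of that idea with Hadoop Streaming, using /bin/cat as the identity mapper and no reduce phase (the streaming jar path and the output directory here are placeholders, and the exact counter label can vary between Hadoop versions):
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
    -Dmapred.reduce.tasks=0 \
    -input /bdatest/clm/data \
    -output /tmp/linecount_out \
    -mapper /bin/cat
When the job finishes, look for the "Map input records" counter in the console output (or in the job history UI); that value is the total number of lines.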

Hadoop one liner:
hadoop fs -cat /bdatest/clm/data/part-* | wc -l
Source: http://www.sasanalysis.com/2014/04/10-popular-linux-commands-for-hadoop.html
Another approach would be to create a MapReduce job where the mapper emits a 1 for each line and the reducer sums the values. See the accepted answer of Writing MapReduce code for counting number of records for the solution.

This is such a common task that I wish there were a subcommand in fs to do it (e.g. hadoop fs -wc -l inputdir), to avoid streaming all the content to one machine that performs the "wc -l" command.
To count lines efficiently, I often use hadoop streaming and unix commands as follows:
hadoop jar ${HADOOP_HOME}/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-input inputdir \
-output outputdir \
-mapper "bash -c 'paste <(echo "count") <(wc -l)'" \
-reducer "bash -c 'cut -f2 | paste -sd+ | bc'"
Every mapper will run "wc -l" on the parts it has and then a single reducer will sum up the counts from all the mappers.

If you have a very big file whose lines all have about the same content (I imagine JSON records or log entries), and you don't care about precision, you can estimate it.
For example, I store raw JSON in a file:
Size of the file: 750 MB
Size of the first line: 752 chars (=> 752 bytes)
Estimated lines => about 1,020,091
Running cat | wc -l gives 1,018,932
Not so bad ^^
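A rough sketch of that estimate run directly against HDFS, reusing the path from the question (this assumes hadoop fs -du -s reports the total size in bytes in its first field; the integer division makes the result approximate):
# Total size of the dataset in bytes.
total_bytes=$(hadoop fs -du -s /bdatest/clm/data | awk '{print $1}')
# Length in bytes of the first line of the first part file.
line_bytes=$(hadoop fs -cat /bdatest/clm/data/part-00000 | head -n 1 | wc -c)
# Estimated number of lines.
echo $(( total_bytes / line_bytes ))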

You can use Hadoop Streaming for this problem.
This is how you run it:
hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.11.0.jar -input <dir> -output <dir> -mapper counter_mapper.py -reducer counter_reducer.py -file counter_mapper.py -file counter_reducer.py
counter_mapper.py
#!/usr/bin/env python
import sys

count = 0
for line in sys.stdin:
    count = count + 1
print(count)
counter_reducer.py
#!/usr/bin/env python
import sys

count = 0
for line in sys.stdin:
    count = count + int(line)
print(count)

Related

Using GNU Parallel etc. with the PBS queue system to run 2 or more MPI codes across multiple nodes as a single job

I am trying to run more than one MPI code (e.g. 2) in a PBS queue system across multiple nodes as a single job.
E.g. for my cluster, 1 node = 12 procs.
I need to run 2 codes (abc1.out & abc2.out) as a single job, each code using 24 procs. Hence, I need 4x12 cores for this job, and I need software that can assign 2x12 procs to each of the codes.
Someone suggested:
How to run several commands in one PBS job submission
which is:
(cd jobdir1; myexecutable argument1 argument2) &
(cd jobdir2; myexecutable argument1 argument2) &
wait
but it doesn't work. The codes are not distributed among all the processors.
Can GNU Parallel be used? Because I read somewhere that it can't work across multiple nodes.
If so, what's the command line for the PBS queue system?
If not, is there any software which can do this?
This is close to my final objective, which is similar but much more complicated.
Thanks for the help.
Looking at https://hpcc.umd.edu/hpcc/help/running.html#mpi it seems you need to use $PBS_NODEFILE.
Let us assume you have $PBS_NODEFILE containing the 4 reserved nodes. You then need a way to split these into 2x2. This will probably do:
run_one_set() {
    cat > nodefile.$$
    mpdboot -n 2 -f nodefile.$$
    mpiexec -n 1 YOUR_PROGRAM
    mpdallexit
    rm nodefile.$$
}
export -f run_one_set
cat $PBS_NODEFILE | parallel --pipe -N2 run_one_set
(Completely untested).
Thanks for the suggestions.
By the way, I tried using GNU Parallel and so far it only works for jobs within a single node. After some trial and error, I finally found the solution.
Suppose each node has 12 procs and you need to run 2 jobs, each requiring 24 procs.
So you can request:
#PBS -l select=4:ncpus=12:mpiprocs=12:mem=32gb:ompthreads=1
Then
sort -u $PBS_NODEFILE > unique-nodelist.txt
sed -n '1,2p' unique-nodelist.txt > host.txt
sed 's/.*/& slots=12/' host.txt > host1.txt
sed -n '3,4p' unique-nodelist.txt > host.txt
sed 's/.*/& slots=12/' host.txt > host2.txt
mv host1.txt 1/
mv host2.txt 2/
(cd 1; ./run_solver.sh) &
(cd 2; ./run_solver.sh) &
wait
What the above does is get the nodes used and remove repetition,
separate them into 2 nodes for each job,
and go to dir 1 and 2 to run each job using run_solver.sh.
Inside run_solver.sh for job 1 in dir 1:
...
mpirun -n 24 --hostfile host1.txt abc
Inside run_solver.sh for job 2 in dir 2:
...
mpirun -n 24 --hostfile host2.txt def
Note the different host file names.

Tar running log file unix

I have a huge log file. I know I can tar it at the end, but I want the file to get zipped after every 10K lines and also to ensure that no data is lost.
The final goal is to stop the file from growing and keep it at a specific size limit.
Just a sample:
sh script.sh > log1.log &
Now, I want to keep zipping log1.log so that it never crosses a specific size limit.
Regards,
Abhay
Let the file be file.txt; then you can do:
x=$(wc -l file.txt | cut -f 1 -d " ")
if [[ $x -gt 10000 ]]
then
    sed '1,10000d' file.txt > file2.txt
fi
After that, just zip file2.txt and remove file2.txt.
Consider using the split command. It can split by lines, bytes, pattern, etc.
split -l 10000 log1.log `date "+%Y%m%d-%H%M%S-"`
This will split the file named "log1.log" into one or more files. Each file will contain no more than 10,000 lines. These files will be named something like 20180327-085711-aa, 20180327-085711-ab, etc. You can use split's -a argument for really large log files so that it will use more than two characters in the file suffix.
The tricky part is that your shell script is still writing to the file. After the contents are split, the log must be truncated. Note that there is a small time slice between splitting the file and truncating it, so some logging data might be lost.
This example splits into 50,000 line files:
$ wc log.text
528193 1237600 10371201 log.text
$ split -l 50000 log.text `date "+%Y%m%d-%H%M%S-"` && cat /dev/null > log.text
$ ls
20180327-090530-aa 20180327-090530-ae 20180327-090530-ai
20180327-090530-ab 20180327-090530-af 20180327-090530-aj
20180327-090530-ac 20180327-090530-ag 20180327-090530-ak
20180327-090530-ad 20180327-090530-ah log.text
$ wc 20180327-090530-aa
50000 117220 982777 20180327-090530-aa
If you only want to truncate the file if it reaches a certain size (number of lines), wrap this split command in a shell script that gets run periodically (such as through cron). Here's an example of checking file size:
if (( `wc -l < log.text` > 1000000 ))
then
echo time to truncate
fi
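Putting those pieces together, here is a rough sketch of a script you could run periodically from cron; the file name and the 10,000-line threshold are placeholders, and as noted above, lines written between the split and the truncation can be lost:
#!/bin/bash
LOG=log1.log
LIMIT=10000

if (( $(wc -l < "$LOG") > LIMIT ))
then
    prefix=$(date "+%Y%m%d-%H%M%S-")
    # Split the current contents into LIMIT-line chunks...
    split -l "$LIMIT" "$LOG" "$prefix"
    # ...truncate the live log (writes during this window may be lost)...
    cat /dev/null > "$LOG"
    # ...and compress the chunks to reclaim space.
    gzip "${prefix}"*
fi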

Unix Command to get the count of lines in a csv file

Hi, I am new to UNIX and I have to get the count of lines from incoming CSV files. I have used the following command to get the count:
wc -l filename.csv
Some files come with just one record and a * at the start, and for those files the same command gives a count of 0. Does the * mean anything here? Also, if I get a file with Ctrl-M (CR) instead of NL as the line terminator, how do I get the count of lines in that file? Please give me a command that solves the issue.
The following command helps you get the count:
cat FILE_NAME | wc -l
All of the answers are wrong: CSV files can contain line breaks between quotes, which should still be considered part of the same record. If you have either Python or PHP on your machine, you can do something like this:
Python
# From stdin
cat *.csv | python -c "import csv; import sys; print(sum(1 for i in csv.reader(sys.stdin)))"
# From a file name
python -c "import csv; print(sum(1 for i in csv.reader(open('csv.csv'))))"
PHP
# From stdin
cat *.csv | php -r 'for($i=0; fgetcsv(STDIN); $i++);echo "$i\n";'
# From a file name
php -r 'for($i=0, $fp=fopen("csv.csv", "r"); fgetcsv($fp); $i++);echo "$i\n";'
I have also created a script to simulate the output of wc -l: https://github.com/dhulke/CSVCount
In case you have multiple .csv files in the same folder, use
cat *.csv | wc -l
to get the total number of lines in all the CSV files in the current directory.
As for wc's other options: -c counts bytes and -m counts characters (identical as long as you only use ASCII). You can also use wc to count the number of files, e.g. with: ls | wc -l
wc -l mytextfile
Or to only output the number of lines:
wc -l < mytextfile
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. With no FILE, or when FILE is -,
read standard input.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the length of the longest line
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
You can also use xsv for that. It also supports many other subcommands that are useful for csv files.
xsv count file.csv
echo $(wc -l file_name.csv|awk '{print $1}')

How can I remove common occurrences between 2 text files using the unix environment?

Ok so I'm still learning the command line stuff like grep and diff and their uses within the scope of my project, but I can't seem to wrap my head around how to approach this problem.
So I have 2 files, each containing hundreds of 20-character-long strings. Let's call the files A and B. I want to search through A and, using the values in B as keys, locate UNIQUE string entries that occur in A but not in B (there are duplicates, so unique is the key here).
Any Ideas?
Also I'm not opposed to finding the answer myself, but I don't have a good enough understanding of the different command line scripts and their functions to really start thinking of how to use them together.
There are two ways to do this: with comm, or with grep, sort, and uniq.
comm
comm afile bfile
comm compares the files (both must be sorted) and outputs 3 columns: lines only in afile, lines only in bfile, and lines in common. The -1, -2, and -3 switches tell comm not to print the corresponding columns, so comm -23 afile bfile shows only the lines unique to afile.
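For this question's case, a minimal sketch (the file names A.txt and B.txt are placeholders) that sorts both files and keeps only the entries unique to A:
# Lines that appear in A but not in B; sort -u also drops A's own duplicates.
comm -23 <(sort -u A.txt) <(sort -u B.txt)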
grep sort uniq
grep -F -v -f bfile afile | sort | uniq
or just
grep -F -v -f bfile afile | sort -u
if your sort handles the -u option.
(Note: the command fgrep, if your system has it, is equivalent to grep -F.)
Look up the comm command (POSIX comm) to do this. See also Unix command to find lines common in two files.

How do I diff two files from the web

I want to see the differences between 2 files that are not in the local filesystem but on the web. So I think I have to use diff, curl, and some kind of piping.
Something like
curl http://to.my/file/one.js http://to.my/file.two.js | diff
but it doesn't work.
The UNIX tool diff can compare two files. If you use process substitution, the <() expression, you can compare the output of the commands inside it:
diff <(curl file1) <(curl file2)
So in your case, you can say:
diff <(curl -s http://to.my/file/one.js) <(curl -s http://to.my/file.two.js)
Some people arriving at this page might be looking for a line-by-line diff rather than a code-diff. If so, and with coreutils, you could use:
comm -23 <(curl http://to.my/file/one.js | sort) \
<(curl http://to.my/file.two.js | sort)
This gives you the lines in the first file that are not in the second file. You could use comm -13 to get lines in the second file that are not in the first file.
If you're not restricted to coreutils, you could also use sd (stream diff), which doesn't require sorting or process substitution and supports infinite streams, like so:
curl http://to.my/file/one.js | sd 'curl http://to.my/file.two.js'
The fact that it supports infinite streams allows for some interesting use cases: you could use it with a curl inside a while(true) loop (assuming the page gives you only "new" results), and sd will timeout the stream after some specified time with no new streamed lines.
Here's a blogpost I wrote about diffing streams on the terminal, which introduces sd.
