I'm dealing with very big files (~10 GB) containing words with ASCII representations of Unicode characters:
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
I want to transform them into Unicode before inserting them into a database, like this:
Nuray Özdemir
Erol Čolaković Šehić
I've seen how to do it with vim, but it's very slow for very large files. I thought copying and pasting the regex into sed would be OK, but it's not.
I actually get things like this:
$ echo "Nuray \u00d6zdemir" | sed -E 's/\\\u(.)(.)(.)(.)/\x\1\x\2\x\3\x\4/g'
Nuray x0x0xdx6zdemir
How can I concatenate the \x and the value of \1 \2...?
I don't want to use echo or an external program due to the size of the file; I want something efficient.
Assuming the Unicode code points in your file are within the BMP (16-bit), how about:
perl -pe 'BEGIN {binmode(STDOUT, ":utf8")} s/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' input_file > output_file
Output:
Nuray Özdemir
Erol Čolaković Šehić
I generated a 6 GB file to test the speed.
It took approximately 10 minutes to process the entire file on my 6-year-old laptop.
I hope it will be acceptable to you.
I am not a mongoDB expert at all but what I can tell you is the following:
If there is a way to do the conversion directly within the DB engine at import time, that solution should be used. If this feature is not available,
you can either use a naive approach to solve it:
while read -r line; do echo -e "$line"; done < input_file
INPUT:
cat input_file
Nuray \u00d6zdemir
Erol \u010colakovi\u0107 \u0160ehi\u0107
OUTPUT:
Nuray Özdemir
Erol Čolaković Šehić
But as you have spotted yourself, this is not efficient for 10 GB files: the shell reads and processes the input one line at a time, and where echo is an external command rather than a shell builtin, each call also spawns a sub-process (memory allocation, new entry in the process table, priority management, switching back to the parent process).
Or go for a smarter approach using tools that should be available in your distro, for example:
whatis ascii2uni
ascii2uni (1) - convert 7-bit ASCII representations to UTF-8 Unicode
Command:
ascii2uni -a U -q input_file
Nuray Özdemir
Erol Čolaković ᘎhić
You can also split the input file into pieces (e.g. with the split command), run the conversion step on each piece in parallel, and import each converted piece as soon as it is available to shorten the total execution time.
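A minimal sketch of the split-and-convert idea, assuming the conversion is a line-oriented filter that reads stdin and writes stdout (such as the perl one-liner from the other answer). The helper name parallel_filter and the chunk size are made up for illustration. split -l cuts only at line boundaries, so no \uXXXX sequence is cut in half.

```shell
# Hypothetical helper: split file $1 into chunks of $2 lines, run the filter
# command given by the remaining arguments on every chunk in parallel, then
# print the converted chunks in their original order.
parallel_filter() {
    input=$1 lines=$2
    shift 2
    workdir=$(mktemp -d) || return 1
    split -l "$lines" "$input" "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do
        "$@" < "$chunk" > "$chunk.out" &    # one background job per chunk
    done
    wait                                    # block until every job has finished
    cat "$workdir"/chunk.*.out              # glob order matches split order
    rm -rf "$workdir"
}

# Example call (assumed file names and chunk size):
# parallel_filter input_file 1000000 \
#     perl -CO -pe 's/\\u([0-9a-fA-F]{4})/chr(hex($1))/ge' > output_file
```

This version concatenates everything at the end; to import each piece as soon as it is ready, the import command would go inside the loop instead. It starts one job per chunk and does not handle an empty input file, so pick a chunk size that gives roughly as many chunks as you have cores.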
I use UNIX fairly infrequently so I apologize if this seems like an easy question. I am trying to loop through subdirectories and files, generate an output from the specific files that the loop grabs, then redirect that output to a file in another directory whose name will be identifiable from the input file. So far I have:
for file in /home/sub_directory1/samples/SSTC*/
do
samtools depth -r chr9:218026635-21994999 < $file > /home/sub_directory_2/level_2/${file}_out
done
I was hoping to generate an output from file_1_novoalign.bam in sub_directory1/samples/SSTC*/ and to send that output to /home/sub_directory_2/level_2/ as an output file called file_1_novoalign_out.bam however it doesn't work - it says 'bash: /home/sub_directory_2/level_2/file_1_novoalign.bam.out: No such file or directory'.
I would ideally like to be able to strip off the '_novoalign.bam' part of the outfile and replace with '_out.txt'. I'm sure this will be easy for a regular unix user but I have searched and can't find a quick answer and don't really have time to spend ages searching. Thanks in advance for any suggestions building on the code I have so far or any alternate suggestions are welcome.
p.s. I don't have permission to write files to the directory containing the input folders
Below is an explanation for filenames without spaces, keeping it simple.
When you want files, not directories, you should end the glob in your for-loop with * and not */.
When you only want to process files ending with _novoalign.bam, you should say so in the glob.
The easiest way to replace a part of the string is to use sed.
A dollar sign matches the end of the string. The total script will be
OUTDIR=/home/sub_directory_2/level_2
for file in /home/sub_directory1/samples/SSTC/*_novoalign.bam; do
echo Debug: Inputfile including path: ${file}
OUTPUTFILE=$(basename $file | sed -e 's/_novoalign.bam$/_out.txt/')
echo Debug: Outputfile without path: ${OUTPUTFILE}
samtools depth -r chr9:218026635-21994999 < ${file} > ${OUTDIR}/${OUTPUTFILE}
done
Note 1:
You can use parameter expansion like file=${fullfile##*/} to get the filename without path, but you will forget the syntax in one hour.
Easier to remember are basename and dirname, but you still have to do some processing.
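For reference, the parameter-expansion variant from Note 1 could look like the sketch below. The function name out_name is made up for illustration; the suffixes are the ones from the question. It needs no basename or sed call, so it spawns no extra processes per file.

```shell
# Hypothetical helper: turn an input path into the wanted output file name.
out_name() {
    name=${1##*/}                                   # drop everything up to the last slash
    printf '%s\n' "${name%_novoalign.bam}_out.txt"  # drop the old suffix, add the new one
}

out_name /home/sub_directory1/samples/SSTC/file_1_novoalign.bam
# prints: file_1_out.txt
```

Inside the loop this would be used as OUTPUTFILE=$(out_name "$file").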
Note 2:
When your script first changes the directory to /home/sub_directory_2/level_2 you can skip the basename call.
When all the files in the dir are to be processed, you can use the asterisk.
When all files have at most one underscore, you can use cut.
You might want to add some error handling. When you want the STDERR from samtools in your outputfile, add 2>&1.
These will turn your script into
OUTDIR=/home/sub_directory_2/level_2
cd /home/sub_directory1/samples/SSTC
for file in *; do
echo Debug: Inputfile: ${file}
OUTPUTFILE="$(basename $file | cut -d_ -f1)_out.txt"
echo Debug: Outputfile: ${OUTPUTFILE}
samtools depth -r chr9:218026635-21994999 < ${file} > ${OUTDIR}/${OUTPUTFILE} 2>&1
done
I'm using terminal on OS 10.X. I have some data files of the format:
mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
mbh5.0_mrg4.54545454545_period0.00077271543854.params.dat
mbh5.0_mrg4.59090909091_period-0.000355232058085.params.dat
mbh5.0_mrg4.59090909091_period-0.000402015664015.params.dat
I know that there will be some files with similar numbers after mbh and mrg, but I won't know ahead of time what the numbers will be or how many similarly numbered ones there will be. My goal is to cat all the data from all the files with similar numbers after mbh and mrg into one data file. So from the above I would want to do something like...
cat mbh5.0_mrg4.54545454545*dat > mbh5.0_mrg4.54545454545.dat
cat mbh5.0_mrg4.5909090909*dat > mbh5.0_mrg4.5909090909.dat
I want to automate this process because there will be many such files.
What would be the best way to do this? I've been looking into sed, but I don't have a solution yet.
for file in *.params.dat; do
prefix=${file%_*}
cat "$file" >> "$prefix.dat"
done
The ${file%_*} part removes the last underscore and the text following it from the end of $file and saves the result in the prefix variable. (Ref: http://www.gnu.org/software/bash/manual/bashref.html#Shell-Parameter-Expansion)
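To check what ${file%_*} produces before running the loop on real data, a small sketch like this can help. The function name group_prefix is made up for illustration; the file names are the ones from the question.

```shell
# Hypothetical helper: print the grouping prefix of a file name.
# A single % removes the shortest match of _* from the end, i.e. only the
# last underscore and what follows it.
group_prefix() {
    printf '%s\n' "${1%_*}"
}

group_prefix mbh5.0_mrg4.54545454545_period0.000722172513951.params.dat
# prints: mbh5.0_mrg4.54545454545
group_prefix mbh5.0_mrg4.59090909091_period-0.000355232058085.params.dat
# prints: mbh5.0_mrg4.59090909091
```

Using %% instead would remove the longest match and leave only mbh5.0, which would merge all groups into one file.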
It's not 100% clear to me what you're trying to achieve here but if you want to aggregate files into a file with the same number after "mbh5.0_mrg4." then you can do the following.
ls -l mbh5.0_mrg4* | awk '{print "cat " $9 " >> mbh5.0_mrg4." substr($9,13,11) ".dat" }' | /bin/bash
The "ls -l" lists the files and awk takes the 9th column (the file name) from the output of ls. With some string concatenation, the resulting cat commands are passed to /bin/bash to be executed. The >> appends, so all files of a group end up in the same output file instead of overwriting each other.
This is a Linux bash script, so it assumes you have /bin/bash; I'm not 100% familiar with OS X. This script also assumes that the number you're grouping on is always in the same place in the filename. I think you can change /bin/bash to almost any shell you have installed.
I want to find a string pattern in a file in Unix. I use the command below:
$ grep 2005057488 filename
But the file contains millions of lines and I have many such files. What is the fastest way to find the pattern, other than grep?
grep is generally as fast as it gets. It's designed to do one thing and one thing only, and it does it very well. You can read why here.
However, to speed things up there are a couple of things you could try. Firstly, it looks like the pattern you're looking for is a fixed string. Fortunately, grep has a 'fixed-strings' option:
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched. (-F is specified by POSIX.)
Secondly, because grep is generally pretty slow on UTF-8, you could try disabling national language support (NLS) by setting the environment variable LANG=C. Therefore, you could try this concoction:
LANG=C grep -F "2005057488" file
Thirdly, it wasn't clear in your question, but if you're only trying to find out whether something exists at least once in your file, you could also try adding a maximum number of matches. With -m 1, grep will quit immediately after the first occurrence is found. Your command could now look like this:
LANG=C grep -m 1 -F "2005057488" file
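To see what -F and -m 1 actually change, here is a small self-contained sketch. The sample lines and variable names are made up for illustration; -c just counts matching lines.

```shell
tmp=$(mktemp)
printf '%s\n' 'id 2005057488' 'id 2005.57488' 'id 2005057488 again' > "$tmp"

# As a regex, the dot matches any character, so all three lines match.
regex_hits=$(LANG=C grep -c '2005.57488' "$tmp")

# With -F the dot is a literal dot, so only the second line matches.
fixed_hits=$(LANG=C grep -c -F '2005.57488' "$tmp")

# With -m 1 grep stops reading after the first matching line.
first_only=$(LANG=C grep -m 1 -F '2005057488' "$tmp")

rm -f "$tmp"
printf 'regex: %s, fixed: %s, first: %s\n' "$regex_hits" "$fixed_hits" "$first_only"
# prints: regex: 3, fixed: 1, first: id 2005057488
```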
Finally, if you have a multicore CPU, you could give GNU parallel a go. It even comes with an explanation of how to use it with grep. To run 1.5 jobs per core and give 1000 arguments to grep:
find . -type f | parallel -k -j150% -n 1000 -m grep -H -n STRING {}
To grep a big file in parallel use --pipe:
< bigfile parallel --pipe grep STRING
Depending on your disks and CPUs it may be faster to read larger blocks:
< bigfile parallel --pipe --block 10M grep STRING
grep works faster than sed.
$ grep 2005057488 filename
$ sed -n '/2005057488/p' filename
Still, both work to find that particular string in a file.
sed -n '/2005057488/p' filename
Not sure if this is faster than grep though.
For grep there's a fixed string option, -F (fgrep) to turn off regex interpretation of the search string.
Is there a similar facility for sed? I couldn't find anything in the man page. A recommendation of another GNU/Linux tool would also be fine.
I'm using sed for the find and replace functionality: sed -i "s/abc/def/g"
Do you have to use sed? If you're writing a bash script, you can do
#!/bin/bash
pattern='abc'
replace='def'
file=/path/to/file
tmpfile="${TMPDIR:-/tmp}/$( basename "$file" ).$$"
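# IFS= and -r keep leading whitespace and backslashes intact
while IFS= read -r line
do
# quoting "$pattern" makes it a literal string instead of a glob pattern
printf '%s\n' "${line//"$pattern"/$replace}"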
done < "$file" > "$tmpfile" && mv "$tmpfile" "$file"
With an older Bourne shell (such as ksh88 or POSIX sh), you may not have that cool ${var/pattern/replace} structure, but you do have ${var#pattern} and ${var%pattern}, which can be used to split the string up and then reassemble it. If you need to do that, you're in for a lot more code - but it's really not too bad.
If you're not in a shell script already, you could pretty easily make the pattern, replace, and filename parameters and just call this. :)
PS: The ${TMPDIR:-/tmp} structure uses $TMPDIR if that's set in your environment, or uses /tmp if the variable isn't set. I like to stick the PID of the current process on the end of the filename in the hopes that it'll be slightly more unique. You should probably use mktemp or similar in the "real world", but this is ok for a quick example, and the mktemp binary isn't always available.
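For shells without ${var//pattern/replace}, the split-and-reassemble idea mentioned above could be sketched like this. The function name replace_all is made up for illustration; it uses only POSIX parameter expansion and treats the search string literally because it is quoted inside the patterns.

```shell
# Hypothetical helper: replace every literal occurrence of $2 in $1 with $3.
replace_all() {
    rest=$1 search=$2 rep=$3 result=
    # An empty search string would loop forever, so return the input unchanged.
    [ -n "$search" ] || { printf '%s\n' "$rest"; return; }
    while :; do
        case $rest in
            *"$search"*)
                head=${rest%%"$search"*}    # text before the first match
                result=$result$head$rep
                rest=${rest#*"$search"}     # text after the first match
                ;;
            *) break ;;
        esac
    done
    printf '%s\n' "$result$rest"
}

replace_all 'a.b.c' '.' ','
# prints: a,b,c
```

In the loop above, the printf line would then become replace_all "$line" "$pattern" "$replace".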
Option 1) Escape regexp characters. E.g. sed 's/\$0\.0/0/g' will replace all occurrences of $0.0 with 0.
Option 2) Use perl -p -e in conjunction with quotemeta (\Q...\E). E.g. perl -p -e 's/\Q.\E/,/g' will replace all occurrences of . with ,.
You can use option 2 in scripts like this:
SEARCH="C++"
REPLACE="C#"
cat $FILELIST | perl -p -e "s/\\Q$SEARCH\\E/$REPLACE/g" > $NEWLIST
If you're not opposed to Ruby or long lines, you could use this:
alias replace='ruby -e "File.write(ARGV[0], File.read(ARGV[0]).gsub(ARGV[1]) { ARGV[2] })"'
replace test3.txt abc def
This loads the whole file into memory, performs the replacements and saves it back to disk. Should probably not be used for massive files.
If you don't want to escape your string, you can reach your goal in 2 steps:
fgrep the line (getting the line number) you want to replace, and
afterwards use sed for replacing this line.
E.g.
#!/bin/sh
PATTERN='foo*[)*abc' # we need it literal
LINENUMBER="$( fgrep -n "$PATTERN" "$FILE" | cut -d':' -f1 )"
NEWSTRING='my new string'
sed -i "${LINENUMBER}s/.*/$NEWSTRING/" "$FILE"
You can do this in two lines of bash code if you're OK with reading the whole file into memory. This is quite flexible -- the pattern and replacement can contain newlines to match across lines if needed. It also preserves any trailing newline or lack thereof, which a simple loop with read does not.
mapfile -d '' < file
printf '%s' "${MAPFILE//"$pat"/"$rep"}" > file
For completeness, if the file can contain null bytes (\0), we need to extend the above, and it becomes
mapfile -d '' < <(cat file; printf '\0')
last=${MAPFILE[-1]}; unset "MAPFILE[-1]"
printf '%s\0' "${MAPFILE[@]//"$pat"/"$rep"}" > file
printf '%s' "${last//"$pat"/"$rep"}" >> file
perl -i.orig -pse 'while (($i = index($_,$s)) >= 0) { substr($_,$i,length($s), $r)}' -- \
-s='$_REQUEST['\'old\'']' -r='$_REQUEST['\'new\'']' sample.txt
-i.orig in-place modification with backup.
-p print lines from the input file by default
-s enable rudimentary parsing of command line arguments
-e run this script
index($_,$s) search for the $s string
substr($_,$i,length($s), $r) replace the string
while (($i = index($_,$s)) >= 0) repeat until $s is no longer found in the line
-- end of perl parameters
-s='$_REQUEST['\'old\'']', -r='$_REQUEST['\'new\'']' - set $s,$r
You still need to "escape" ' chars but the rest should be straight forward.
Note: this started as an answer to How to pass special character string to sed hence the $_REQUEST['old'] strings, however this question is a bit more appropriately formulated.
You should be using replace instead of sed.
From the man page:
The replace utility program changes strings in place in files or on the
standard input.
Invoke replace in one of the following ways:
shell> replace from to [from to] ... -- file_name [file_name] ...
shell> replace from to [from to] ... < file_name
from represents a string to look for and to represents its replacement.
There can be one or more pairs of strings.
I have a binary program* which takes the contents of a supplied file, processes it, and prints the result on the screen through stdout. For an automation script, I would like to use a named pipe to send data to this program and process the output myself. After trying to get the script to work I realized that there is an issue with the binary program accepting data from the named pipe. To illustrate the problem I have outlined several tests using the unix shell.
It is easy to show that the program works by processing an actual data file.
$ binprog file.txt > output.txt
This will result in output.txt containing the processed information from file.txt.
The named pipe (pipe.txt) works as seen by this demonstration.
$ cat pipe.txt > output.txt
$ cat file.txt > pipe.txt
This will result in output.txt containing the data from file.txt after it has been sent through the pipe.
When the binary program is reading from the named pipe instead of the file, things do not work correctly.
$ binprog pipe.txt > output.txt
$ cat file.txt > pipe.txt
In this case output.txt contains no data even after cat and binprog terminate. Using top and ps, I can see binprog "running" and seemingly doing work. Everything executes with no errors.
Why is there no output produced by binprog in this third example?
What are some things I could try to get this working?
[*] The program in question is svm-scale from libsvm. I chose to generalize the examples to keep them clean and simple.
Are you sure the program will work with a pipe? If it needs random access to the input file it won't work. The program will get an error whenever it tries to seek in the input file.
If you know the program is designed to work with pipes, and you're using bash, you can use process substitution to avoid having to explicitly create the named pipe.
binprog <(cat file.txt) > output.txt
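If the program turns out to need a seekable file, one possible workaround is to spool the pipe's contents into a temporary file first and hand that file to the program. This is only a sketch: the helper name run_on_spooled_copy is made up, and it assumes the program takes the input file name as its last argument.

```shell
# Hypothetical helper: copy everything readable from $1 (e.g. a named pipe)
# into a regular temporary file, then run the remaining arguments as a command
# with that file appended as the last argument.
run_on_spooled_copy() {
    src=$1
    shift
    tmp=$(mktemp) || return 1
    cat "$src" > "$tmp"     # blocks until the writer closes the pipe
    "$@" "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Example call (names from the question):
# run_on_spooled_copy pipe.txt binprog > output.txt
```

The cost is that the program only starts once all data has arrived, so this gives up the streaming behaviour of a pipe.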
Does binprog also accept input on stdin? If so, this might work for you.
cat pipe.txt | binprog > output.txt
cat file.txt > pipe.txt
Edit: Briefly scanned the manpage for svm-scale. Give this a whirl instead:
cat pipe.txt | svm-scale - > output.txt
If binprog is not working well with anything other than a terminal as an input, maybe you need to give it a (pseudo-)terminal (pty) for its input. That is harder to organize, but the expect program is one way of doing that relatively easily. There are discussions of programming with ptys in Advanced Programming in the Unix Environment, 3rd Edn by W Richard Stevens and Stephen A Rago, and in Advanced Unix Programming, 2nd Edn by Marc J Rochkind.
Something else to look at is the output of truss or strace or the local equivalent. These programs log all the system calls made by a process. On Solaris, I'd run:
truss -o binprog.truss binprog
interactively, and see what it does. Then I'd try it with i/o redirection, and then with i/o redirection from the named pipe; there may be some significant differences between what it does, or you may see the system call that is hanging. If you see forks in the truss log file, you would need to add a '-f' flag to follow children.