Change multiple filenames in Unix

I had to download 15GB of data, and for some reason the filenames were mangled during the download, so that instead of
test_file.txt
the filenames are doubled:
test_file.txttest_file.txt
My only idea was to count the letters and then rename each file by deleting the first or second half of the filename. Is there a way to do that? The filenames are not consistent; for example, the same folder might also contain files named
files_are_great.txtfiles_are_great.txt
so I'm struggling to find a way to loop over them.
Thanks a lot!

The command sed 's/^\(.*\)\1$/\1/' replaces a name that consists entirely of a duplicated string with a single copy of that string, without relying on a particular part of the file name such as .txt. It also works when the string contains spaces.
Example:
echo 'abc defabc def' | sed 's/^\(.*\)\1$/\1/'
prints
abc def
Explanation of the sed command:
^ anchors the pattern to the beginning of the line
.* is 0 or more occurrences of any character
\(...\) captures what matches the pattern in between
\1 is a reference to the first capture group, i.e. the text that was found before
$ anchors the search pattern to the end of the line
This results in a search pattern that matches a whole line that consists of any text followed by the same text.
\1 in the replacement is the same reference to the matched text, i.e. a single occurrence of the duplicated text.
Any input that does not match the pattern will remain unchanged.
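The anchors matter: without ^ and $, the pattern can also match a repeated prefix inside an otherwise normal name. A minimal sketch of the difference, assuming a sed that supports back-references in basic regular expressions (GNU and BSD sed do); the function names dedupe_anchored and dedupe_unanchored are made up for this illustration:

```shell
# Hypothetical helpers for illustration only.
dedupe_anchored() {
    printf '%s\n' "$1" | sed 's/^\(.*\)\1$/\1/'
}
dedupe_unanchored() {
    printf '%s\n' "$1" | sed 's/\(.*\)\1/\1/'
}

dedupe_anchored 'abc defabc def'   # whole name is doubled, so it is halved
dedupe_anchored 'mama.txt'         # not a doubled name, left unchanged
dedupe_unanchored 'mama.txt'       # the leading "mama" is collapsed to "ma"
```

A name such as mama.txt is therefore only safe with the anchored pattern.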
Assuming you want to rename all files in the current directory you can use it like this
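for file in *
do
# Quote "$file" so names with spaces survive; anchor the pattern to the whole name
new=$(printf '%s\n' "$file" | sed 's/^\(.*\)\1$/\1/')
[ "$file" = "$new" ] || mv -- "$file" "$new"
done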
As the sed command does not change non-matching input, $new will be the same as $file for file names that don't consist of a duplicated string. This would result in an error message from mv. That's why the renaming will be skipped in this case.
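The idea from the question (count the characters and keep one half) can also be sketched directly in the shell. This is only an illustration under assumptions: dedupe_name is a made-up helper, names are assumed to be plain ASCII (cut -c and ${#name} may disagree on multi-byte characters), and the loop prints the mv commands instead of running them.

```shell
# dedupe_name is a hypothetical helper, not part of any tool.
# Prints the first half of NAME if both halves are identical,
# otherwise prints NAME unchanged. Assumes single-byte (ASCII) names.
dedupe_name() {
    name=$1
    len=${#name}
    # Empty names and names of odd length cannot be doubled.
    if [ "$len" -eq 0 ] || [ $((len % 2)) -ne 0 ]; then
        printf '%s\n' "$name"
        return
    fi
    half=$((len / 2))
    first=$(printf '%s' "$name" | cut -c "1-$half")
    second=$(printf '%s' "$name" | cut -c "$((half + 1))-")
    if [ "$first" = "$second" ]; then
        printf '%s\n' "$first"
    else
        printf '%s\n' "$name"
    fi
}

# Dry run: print the mv commands instead of executing them.
for file in *
do
    new=$(dedupe_name "$file")
    [ "$file" = "$new" ] || echo mv -- "$file" "$new"
done
```

Once the printed commands look right, removing the echo performs the actual renames.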

Using sed
sed 's#\(\.txt\)#& #g'
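Explanation: & in the replacement stands for the whole matched text (here .txt), so each .txt is replaced by itself followed by a space. The \( \) grouping is not required for & to work.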
Demo:
echo "files_are_great.txtfiles_are_great.txt" | sed 's#\(\.txt\)#& #g'
files_are_great.txt files_are_great.txt
For renaming:
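# Use a glob instead of parsing ls; note that cut -d' ' still breaks on names containing spaces
for file_name in *.txt*.txt
do
new_file_name=$(printf '%s\n' "$file_name" | sed 's#\(\.txt\)#& #g' | cut -d' ' -f1)
mv -- "$file_name" "$new_file_name"
done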
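Because this approach splits on a space, it cannot handle names that already contain spaces. A possible alternative, sketched here with POSIX parameter expansion and assuming every affected name contains .txt, removes everything from the first .txt onwards and then appends .txt again; strip_after_first_txt is a made-up name:

```shell
# strip_after_first_txt is a hypothetical helper for illustration.
# ${1%%.txt*} removes the longest suffix that starts with ".txt",
# i.e. everything from the first ".txt" onwards; ".txt" is then re-added.
strip_after_first_txt() {
    printf '%s\n' "${1%%.txt*}.txt"
}

strip_after_first_txt 'files_are_great.txtfiles_are_great.txt'
strip_after_first_txt 'my file.txtmy file.txt'
```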

Related

Replace characters in a delimited part of a file

I have the file teste.txt with the following content:
02183101399205000 GBTD9VBYMBQ 04455927964
02183101409310000 XBQMPL1C93B 27699484827
54183101003651000 1WFG3SNVDG9 71530894204
I execute the command
sed -e 's/^\(.\{18\}\)[0-9]/\1#/g' teste.txt
The result is:
02183101399205000 GBTD9VBYMBQ 04455927964
02183101409310000 XBQMPL1C93B 27699484827
54183101003651000 #WFG3SNVDG9 71530894204
Only the 19th position in line 3 is changed from 1 to #.
I would like to know how can I change all numeric characters from the 19th to the 30th position.
The expected result is:
02183101399205000 GBTD#VBYMBQ 04455927964
02183101409310000 XBQMPL#C##B 27699484827
54183101003651000 #WFG#SNVDG# 71530894204
An awk command to accomplish your goal:
awk '{ gsub(/[0-9]/,"#",$2); print }' teste.txt
This might work for you (GNU sed):
sed -r 's/./&\n/30;s//\n&/19;h;s/[0-9]/#/g;H;x;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Surround the string, which is from the 19th to the 30th character, by newlines and make a copy. Replace all digits by #'s. Append this string to the original and use pattern matching to rearrange the strings to make a new string with the unchanged parts either side of the changed part, at the same time discarding the introduced newlines.
An alternative method, utilising the fact that the fields are space separated:
sed -r ':a;s/( \S*)[0-9](\S* )/\1#\2/;ta' file
In fact the two methods can be combined:
sed -r 's/./&\n/30;s//\n&/19;:a;s/(\n.*)[0-9](.*\n)/\1#\2/;ta;s/\n//g' file
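The purely positional idea (touch only characters 19 to 30, whatever the fields are) can also be written with awk's substr. This is a sketch, not one of the answers above; mask_cols is a made-up name and the function reads its input from standard input:

```shell
# mask_cols is a hypothetical helper; it reads lines on standard input.
# Columns 1-18 and 31 onwards pass through; digits in columns 19-30 become "#".
mask_cols() {
    awk '{
        mid = substr($0, 19, 12)      # characters 19 through 30
        gsub(/[0-9]/, "#", mid)
        print substr($0, 1, 18) mid substr($0, 31)
    }'
}

printf '%s\n' '54183101003651000 1WFG3SNVDG9 71530894204' | mask_cols
```

For the real data, mask_cols < teste.txt would process the whole file.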

How to read nth line and mth field of text file in unix

Suppose I have a |-delimited file:
Line1: 1|2|3|4
Line2: 5|6|7|8
Line3: 9|9|1|0
Now I need to read the third field of the second line, which is 7 in the example above. How can I do that using the cut or sed command? I'm new to Unix, please help.
A job for awk:
awk -F '|' 'NR==2{print $3}' file
or
awk -F '|' -v row=2 -v col=3 'NR==row{print $col}' file
Output:
7
This should work:
sed -n '2p' file |awk -F '|' '{print $3}'
This might work for you (GNU sed):
sed -rn '2s/^(([^|]*)\|?){3}.*/\2/p' file
Turn off automatic printing by setting the -n option, turn on easier regexp declaration by -r option. Use pattern matching and back references to replace the whole of the second line by the third field of the same line and print the result.
The address of the substitution command is limited to only the second line.
The regexp matches a run of non-delimiter characters followed by an optional delimiter, repeated a specific number of times (here three). On each repetition the inner group is overwritten, so after the last repetition it holds only the third field. The .* consumes the remainder of the line, so the whole line is replaced by the third field (the contents of the second group).
N.B. the final column is not followed by a delimiter, which is why the delimiter is made optional with \|?
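Since the question asked specifically about cut and sed, the two can also be combined: sed selects the line and cut selects the field. A minimal sketch, where field_at is a made-up helper that reads the file on standard input:

```shell
# field_at is a hypothetical helper: field_at ROW COL < file
# sed -n "ROWp" prints only the requested line; cut picks the field.
field_at() {
    sed -n "${1}p" | cut -d '|' -f "$2"
}

printf '%s\n' '1|2|3|4' '5|6|7|8' '9|9|1|0' | field_at 2 3
```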

unix: find strings that match those in a list and then modify last character of the string

Let's say I have the following in a file:
AAA|population:A1
BBB|population:A1
CCC|population:A2
DDD|population:A1
And I want to change the lines beginning with BBB and DDD so that the file becomes:
AAA|population:A1
BBB|population:A2
CCC|population:A2
DDD|population:A2
I would like to create a variable with all the strings to find (in this case BBB and DDD) and then change only the last character (in this case "1") of all lines that match that string.
The variable would look like:
myList='BBB|DDD'
And then I should be able to use awk for the replacement but I don't know how to replace only the last character of the matches.
$ myList='BBB|DDD'; i=2; sed -ri "s/(($myList).*).$/\1$i/" file
Brief explanation:
-r: enables extended regular expressions, so some escape characters can be omitted
Use double quotes so that the shell variables $myList and $i are expanded
You can add further strings to $myList, separated by |
Using sed:
R=2; myList="BBB DDD"; matches=$(echo $myList | sed 's/\ /\\|/g'); sed -i "s/\($matches\)\(.*\).$/\1\2$R/g" file1
This assumes the file is named file1. R holds the replacement character, so you can set it on the same command line.
myList should be a single-space-delimited list of the strings you want to match.
This modifies file1 in place.
Diff of the file1 before and after looks like
$ diff file1 file1_old
2c2
< BBB|population:A2
---
> BBB|population:A1
4c4
< DDD|population:A2
---
> DDD|population:A1
My final answer is as follows. Assuming file name is "test"
myList='BBB|DDD'
R='2'
sed -E "s/(($myList).*).$/\1$R/" test
Gives the desired result.
AAA|population:A1
BBB|population:A2
CCC|population:A2
DDD|population:A2
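The question mentions awk; the same replacement might look like this there. It is only a sketch: replace_last_char is a made-up name, the list is passed in as an awk variable and used as an extended regular expression anchored to the start of the line, and the replacement is assumed not to contain & or backslashes.

```shell
# replace_last_char is a hypothetical helper; it reads lines on standard input.
# Usage: replace_last_char 'BBB|DDD' 2 < file
replace_last_char() {
    awk -v list="$1" -v r="$2" '
        $0 ~ ("^(" list ")") { sub(/.$/, r) }   # line starts with a listed string
        { print }
    '
}

myList='BBB|DDD'
printf '%s\n' 'AAA|population:A1' 'BBB|population:A1' 'CCC|population:A2' 'DDD|population:A1' |
    replace_last_char "$myList" 2
```

Unlike the sed commands above, the pattern here is anchored with ^, so only lines that begin with one of the listed strings are changed.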

script to replace all dots in a file with a space but dots used in numbers should not be replaced

How to replace all dots in a file with a space but dots in numbers such as 1.23232 or 4.23232 should not be replaced.
for example
Input:
abc.hello is with cdf.why with 1.9343 and 3.3232 points. What will
Output:
abc_hello is with cdf_why with 1.9343 and 3.3232 points_ What will
$ cat file
abc.hello is with cdf.why with 1.9343 and 3.3232 points. What will
this is 1.234.
here it is ...1.234... a number
.that was a number.
$ sed -e 's/a/aA/g' -e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' -e 's/\./_/g' -e 's/aB/./g' -e 's/aA/a/g' file
abc_hello is with cdf_why with 1.9343 and 3.3232 points_ What will
this is 1.234_
here it is ___1.234___ a number
_that was a number_
Try any solution you're considering with that input file as it includes some edge cases (there may be more I haven't included in that file too).
The solution is basically to temporarily convert periods within numbers to some string that cannot exist anywhere else in the file so we can then convert any other periods to underscores and then undo that first temporary conversion.
So first we create a string that can't exist in the file by converting every a to the string aA, which means that the string aB cannot exist in the file. Then we convert every . within a number to aB, then every remaining . to _, and finally unwind the temporary conversions so that each aB returns to . and each aA returns to a:
sed -e 's/a/aA/g' # a -> aA encode #1
-e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' # 2.4 -> 2aB4 encode #2
-e 's/\./_/g' # . -> _ convert
-e 's/aB/./g' # 2aB4 -> 2.4 decode #2
-e 's/aA/a/g' # aA -> a decode #1
file
That approach of creating a temporary string that you KNOW can't exist in the file is a common alternative to picking a control character or trying to come up with some string you THINK is highly unlikely to exist in the file when you temporarily need a string that doesn't exist in the file.
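To try the encode/convert/decode steps on individual edge cases, they can be wrapped in a small function. This is just a convenience sketch around the same five substitutions; dots_to_underscores is a made-up name and the function reads standard input:

```shell
# dots_to_underscores is a hypothetical wrapper; it reads standard input.
dots_to_underscores() {
    # 1. a -> aA, so the string aB can no longer occur in the text
    # 2. digit.digit -> digit aB digit, protecting dots inside numbers
    # 3. every remaining . -> _
    # 4. aB -> . restores the protected dots
    # 5. aA -> a restores the original a characters
    sed -e 's/a/aA/g' \
        -e 's/\([[:digit:]]\)\.\([[:digit:]]\)/\1aB\2/g' \
        -e 's/\./_/g' \
        -e 's/aB/./g' \
        -e 's/aA/a/g'
}

printf '%s\n' 'this is 1.234.' '.that was a number.' | dots_to_underscores
```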
I think, that will do what you want:
sed 's/\([^0-9]\)\.\([^0-9]\)/\1_\2/g' filename
This replaces every dot that has a non-digit character on both sides with an underscore (_); you can exchange the underscore for a space character in the command above to get spaces in the output.
If you want to write the changes back into the file, use sed -i.
Edit:
To cover dots at the beginning or end of the line, or directly before or after a number, the expression becomes a bit uglier:
sed -r 's/(^|[^0-9])\.([^0-9]|$)/\1_\2/g;s/(^|[^0-9])\.([0-9])/\1_\2/g;s/([0-9])\.([^0-9]|$)/\1_\2/g'
or, without -r:
sed 's/\(^\|[^0-9]\)\.\([^0-9]\|$\)/\1_\2/g;s/\(^\|[^0-9]\)\.\([0-9]\)/\1_\2/g;s/\([0-9]\)\.\([^0-9]\|$\)/\1_\2/g'
Using gawk (RT is a gawk extension):
awk -v RS='[[:space:]]+' '!/^[[:digit:]]+\.[[:digit:]]+$/{gsub("\\.", "_")}; {printf "%s", $0RT}' file.txt
Since you tagged the question with vi, I guess you may have vim too? It would be a very easy task for vim:
:%s/\D\zs\.\ze\D/_/g

Using grep to search DNA sequence files

I am trying to use Unix's grep to search for specific sequences within files. The files are usually very large (~1GB) and consist of 'A's, 'T's, 'C's, and 'G's. These files also span many, many lines, with each line being a word of 60ish characters. The problem I am having is that when I search for a specific sequence within these files, grep returns results for occurrences of the pattern on a single line, but not when the pattern spans a line (has a line break somewhere in the middle). For example:
Using
$ grep -i -n "GACGGCT" grep3.txt
To search the file grep3.txt (I put the target 'GACGGCT's in double stars)
GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCT
CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTG**GA
CGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
CACCAGGCCAGCTCAGGCCACCCCTTCCCCAGTCA
CCCCCCAAGAGGTGCCCCAGACAGAGCAGGGGCCA
GGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
Returns
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
So, my problem here is that grep does not find the GACGGCT that spans the end of line 2 and the beginning of line 3.
How can I use grep to find target sequences that may or may not include a linebreak at any point in the string? Or how can I tell grep to ignore linebreaks in the target string? Is there a simple way to do this?
pcregrep -nM "G[\n]?A[\n]?C[\n]?G[\n]?G[\n]?C[\n]?T" grep3.txt
1:GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT
2:CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA
CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC
6:GGCGCCCTGAGGCGACGGCTCTCAGCCTCCGCCCC
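I assume that each of your lines is 60 characters long; then the command below should work. It joins all lines, re-wraps the text at 60 characters and greps the result, so a match that happens to straddle one of the new 60-character boundaries would still be missed.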
tr '\n' ' ' < grep3.txt | sed -e 's/ //g' -e 's/.\{60\}/&^/g' | tr '^' '\n' | grep -i -n "GACGGCT"
output :
1:GGGCTTCGA**GACGGCT**GACGGCTGCCGTGGAGTCTCCAGACCTGGCCCTCCCTGGC
2:AGGAGGAGCCTG**GACGGCT**AGGTGAGAGCCAGCTCCAAGGCCTCTGGGCCACCAGG
4:CCAGGCGCCCTGAGGC**GACGGCT**CTCAGCCTCCGCCCC
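If the original line numbers are not needed, another option is to delete the newlines and search the joined text without re-wrapping it, so no boundary can hide a match. The sketch below relies on the -o (print only the match) and -b (print the byte offset) options found in GNU and BSD grep; count_hits and hit_offsets are made-up names, and for files of around 1GB the single joined line may need a lot of memory.

```shell
# count_hits and hit_offsets are hypothetical helpers.
# Usage: count_hits PATTERN FILE
count_hits() {
    tr -d '\n' < "$2" | grep -o -i "$1" | wc -l | tr -d ' '
}
# Prints "offset:match" for each hit; offsets are 0-based bytes in the joined text.
hit_offsets() {
    tr -d '\n' < "$2" | grep -o -b -i "$1"
}

# Demo on the first three lines of the example file.
sample=$(mktemp)
printf '%s\n' 'GGGCTTCGAGACGGCTGACGGCTGCCGTGGAGTCT' \
              'CCAGACCTGGCCCTCCCTGGCAGGAGGAGCCTGGA' \
              'CGGCTAGGTGAGAGCCAGCTCCAAGGCCTCTGGGC' > "$sample"
count_hits GACGGCT "$sample"
hit_offsets GACGGCT "$sample"
rm -f "$sample"
```

Note that grep -o reports non-overlapping matches only, which is fine for this pattern but may undercount patterns that can overlap themselves.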
