Reading in a text file with unmatched quotes in R

I have a large (>1GB) CSV file I'm trying to read into a data frame in R.
The non-numeric fields are enclosed in double-quotes so that internal commas are not interpreted as delimiters. That's well and good. However, there are also sometimes unmatched double-quotes in an entry, like "2" Nails".
What is the best way to work around this? My current plan is to use a text processor like awk to change the quoting character from the double quote (") to a non-conflicting character like the pipe (|). My heuristic for finding quoting characters would be double quotes next to a comma:
gawk '{gsub(/(^\")|(\"$)/,"|");gsub(/,\"/,",|");gsub(/\",/,"|,");print;}' myfile.txt > newfile.txt
This question is related, but the suggested solution there (passing quote="" to read.csv) is not viable for me because my file has non-delimiting commas enclosed in the quotation marks.

Your idea of looking for quotes next to a comma is probably the best thing you can do; however, you could try to turn it around and have the regex escape all the quotes that are not next to a comma (or the start/end of a line):
Search for
(?<!^|,)"(?!,|$)
and replace all the matches with "".
R might not be the best tool for this because its default regex engine lacks a multiline mode, but in Perl it would be a one-liner:
$subject =~ s/(?<!^|,)"(?!,|$)/""/mg;
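If you'd rather do that cleanup outside R before calling read.csv, here is a minimal Python sketch of the same idea, using the file names from the question. Python lookbehinds must be fixed-width, so the ^|, alternation is split into two separate lookbehinds:
import re

# Double any quote that is not at a field boundary, i.e. not at the
# start or end of the line and not adjacent to a comma.
stray_quote = re.compile(r'(?<!^)(?<!,)"(?!,|$)')

with open("myfile.txt") as src, open("newfile.txt", "w") as dst:
    for line in src:
        dst.write(stray_quote.sub('""', line.rstrip("\n")) + "\n")
Because the file is processed line by line, the ^ and $ anchors apply to each record without needing a multiline flag.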

This would be a more foolproof variant of Tim's solution, in case non-boundary commas exist inside the cell:
(?<!,\s+)"(?!\s+,$)
I'm not sure whether it has any bugs, though.

Related

How can I delete the last comma in each record of a comma-delimited csv?

Example Input : A,B,"C,D",E,F,G,
Example Output : A,B,"C,D",E,F,G
The issue I face with using the cut command for this is that my data contains commas as well.
I want to do this as part of an automated process, so Linux commands would be helpful.
This should work:
sed 's/,$//g' < input_file.csv > output_file.csv
,$ is a regular expression that matches a comma at the end of each line; the s command replaces it with nothing. (The g flag is redundant here, since ,$ can match at most once per line.)
Proof:
$ echo 'A,B,"C,D",E,F,G,' | sed 's/,$//g'
A,B,"C,D",E,F,G
Note that some CSV dialects can also have line endings inside double quotes. If there happens to be a comma right before such a quoted line ending, that comma will also be stripped. If you want to handle this case correctly, you'll need a proper CSV parser.
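For completeness, here is what that would look like with Python's standard csv module, a minimal sketch reusing the file names from the sed example:
import csv

with open("input_file.csv", newline="") as src, open("output_file.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        # A trailing comma parses as a final empty field; drop it.
        if row and row[-1] == "":
            row = row[:-1]
        writer.writerow(row)
The parser keeps commas and line endings inside quoted fields intact, so only a genuine trailing delimiter is removed.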

Special character removal with 'sed'

I'm facing an issue where I'm getting some special characters at the beginning of lines in my file; a snippet is below:
^#<9b>200931350515,test1,910,420032400825443
^#<9a>200931350515,test1,910,420032400825443
^#<9d>200931746996,test2,910,420031390086807
I'm using the following command to remove anything other than numbers in the first column:
sed 's/^[^0-9]*//g' file.dat
No success with that. The file is created during a FastExport from Teradata, by the way; the process adds some special characters by itself during the extract.
Any idea on the command?
If you want to remove any non-ASCII characters anywhere in a line, you can use tr.
tr -d '\000\200-\377' <file >file.new
Using perl
perl -lne 'print /\d+,.*/g' file.dat
200931350515,test1,910,420032400825443
200931350515,test1,910,420032400825443
200931746996,test2,910,420031390086807
The regex starts matching at the first digit, so the leading junk is skipped: \d+, matches the digits up to and including the first comma, and .* takes everything after it.
sed is too big a gun for such a small problem;
use cut to remove the beginning of each line:
cut -b 2- file.dat
Here 2- is the range of bytes you want to retain. I'm not sure how many of those strange characters you have, so I would experiment with 1-, 2-, 3-, 4-, 5-, etc.
It looks like the number of characters that should be removed is constant across all lines. To remove a fixed number of characters from the beginning of each line, you could simply do
$ sed 's/^.....//' input >output
Adjust the number of dots to fit your need.
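If locale settings are what is tripping up sed, one workaround is to do the stripping in Python on raw bytes, which sidesteps character encoding entirely; a minimal sketch (the output file name is an assumption):
import re

leading_junk = re.compile(rb"^[^0-9]*")  # any non-digit bytes at the start of a line

with open("file.dat", "rb") as src, open("file.clean", "wb") as dst:  # output name is a placeholder
    for line in src:
        dst.write(leading_junk.sub(b"", line))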

How to get pure character strings out of a file in UNIX

I have a file on Unix that has lines with special characters as well as pure character strings. The special characters could be any of .,$%&*()-#. A sample is below.
Sample input:
\302\275B\303\236Gcl\302\275t eRkAtMbn;
Jacob
Chinese
39:00
Language
53.00
output:
Jacob
Chinese
Language
I want to get only the pure character string lines out of this file. I could read each line and compare each character against the alphabet, but if the file is big that will consume a lot of time.
Any better approach or suggestions?
Your best bet is the grep utility.
grep -i '^[a-z]\+$' file.txt
Specifically, we're doing a case-insensitive search (-i) for lines that contain only the characters [a-z], and only those characters from start (^) to finish ($).
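The same filter is easy to express in Python if you need it inside a larger program; a minimal sketch using the file name from the grep example:
import re

letters_only = re.compile(r"[a-z]+", re.IGNORECASE)

with open("file.txt") as src:
    for line in src:
        # fullmatch mirrors grep's ^...$ anchoring.
        if letters_only.fullmatch(line.rstrip("\n")):
            print(line, end="")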

How to replace string by an escape character plus string in unix

How can I convert a single line like the one below:
794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|...and so on
into linefeed-delimited records?
Expected Output:
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
and so on.
By the way, I'm not a Unix expert and just want steps or simple commands to resolve this. I appreciate your help.
I assume you have your string in a file named "x"; then you can do this.
I use the character ":" to mark where sed inserts a record break in your string. Choose something else if ":" occurs in your string. Then tr changes each ":" to a newline. The output is as you desire except that there is an extra newline at the beginning.
cat x | sed 's/794170/:794170/g' | tr ':' "\n"
You can use the fold command, which works here because each record is exactly 32 characters wide:
$ fold -w32 file
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|
I don't think you can do it with a simple command. There are several options for creating scripts that can split lines more or less arbitrarily. Any Unix will have the awk utility available. On most systems you will also find Python and Perl. My guess is that a Perl or Python script is the easiest way to split lines like the one you gave.
This would be one way to do it in Python:
inline = "794170|VWSD|AAA|e|h|i|j|STRING1|794170|VWSD|BBB|q|w|e|r|STRING2|794170|VWSD|CCC|z|x|c|v|STRING3|"
# Split on the record prefix, then glue the prefix back onto each piece.
splits = ['794170' + s for s in inline.split('794170')]
# splits[0] is just the bare prefix (from the empty string before the
# first record), so skip it.
for s in splits[1:]:
    print(s)
794170|VWSD|AAA|e|h|i|j|STRING1|
794170|VWSD|BBB|q|w|e|r|STRING2|
794170|VWSD|CCC|z|x|c|v|STRING3|

Unix sort text file with user-defined newline character

I have a plain text file where the newline character is not "\n" but a special character.
Now I want to sort this file.
Is there a direct way to specify a custom newline character when using the Unix sort command?
I'd prefer not to use a script for this, if possible.
Please note the data in the text file contains \n, \r\n, and \t characters (the reason for such data is application-specific, so please don't comment on that).
The sample data is as below:
1111\n1111<Ctrl+A>
2222\t2222<Ctrl+A>
3333333<Ctrl+A>
Here Ctrl+A is the newline character.
Use perl -001e 'print sort <>' to do this:
prompt$ cat -tv /tmp/a
2222^I2222^A3333333^A1111
1111^A
prompt$ perl -001e 'print sort <>' /tmp/a | cat -tv
1111
1111^A2222^I2222^A3333333^Aprompt$
That works because character 001 (octal 1) is control-A ("\cA"), which is your record terminator in this dataset.
You can also give the code point in hex using -0xHHHHH. Note that it must be a single code point, not a string, when using this shortcut. There are ways of doing it for strings and even regexes that involve infinitesimally more code.
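If Perl isn't available, the same record-terminator trick can be mimicked in Python; a minimal sketch that assumes the whole file fits in memory (the output file name is an assumption):
with open("/tmp/a", "rb") as src:
    records = src.read().split(b"\x01")  # 0x01 is Ctrl-A, the record terminator

# A trailing terminator leaves an empty last element; drop it so it
# doesn't sort to the front.
if records and records[-1] == b"":
    records.pop()

with open("/tmp/a.sorted", "wb") as dst:  # output name is a placeholder
    dst.write(b"\x01".join(sorted(records)) + b"\x01")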
